fasttosmile t1_iqrel80 wrote
Reply to comment by ClearlyCylindrical in [Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187
This is wrong; see https://www.youtube.com/watch?v=kcVWAKf7UAg
The real reason is it's just faster to train on smaller batches (because the steps are quicker).
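To make that speed point concrete, here's a minimal NumPy sketch (illustrative only; the toy linear-regression data, batch size of 100, and learning rate of 0.1 are my own assumptions, not anything from the thread). It shows that one pass over the data gives you a single, expensive update with full-batch gradient descent, but many cheap updates with mini-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

def grad(w, Xb, yb):
    # Gradient of mean squared error for linear regression on a batch.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def loss(w):
    # Mean squared error on the full dataset.
    return np.mean((X @ w - y) ** 2)

lr = 0.1

# Full-batch gradient descent: one pass over the data = one (expensive) update.
w_full = np.zeros(20)
w_full -= lr * grad(w_full, X, y)

# Mini-batch SGD: one pass over the data = 100 (cheap) updates.
w_mini = np.zeros(20)
batch_size = 100
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    w_mini -= lr * grad(w_mini, X[batch], y[batch])

print(f"loss after one epoch, full batch:  {loss(w_full):.4f}")
print(f"loss after one epoch, mini-batch:  {loss(w_mini):.4f}")
```

After a single pass over the data, the mini-batch run typically ends up at a much lower loss simply because it took 100 steps instead of 1, each computed on a slice of the data.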
ClearlyCylindrical t1_iqrmrxz wrote
Yes, that too, although my explanation wasn't incorrect; there was just more needed to complete the explanation, right?
fasttosmile t1_iqrolwa wrote
For a while there was a belief that the stochasticity was key for good performance (one paper supporting the hypothesis dates from 2016). Your framing makes it sound like that is still the case - you suggest no other reason for not doing full-batch descent - and I think it's important to point out that it's not.