ClearlyCylindrical t1_iqna0cr wrote
If it were possible to do full-batch gradient descent all the time, minibatches would likely still be used. The stochasticity introduced by minibatch gradient descent generally improves a model's generalisation performance.
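To make the two update rules being compared concrete, here is a minimal sketch on a toy least-squares problem; the data, learning rate, and batch size are all illustrative, not anything from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # toy inputs
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
w = np.zeros(20)
lr, batch_size = 0.01, 32

def grad(w, Xb, yb):
    # gradient of mean squared error on the given (mini)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

for step in range(1000):
    # full-batch GD would simply do: w -= lr * grad(w, X, y)
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w -= lr * grad(w, X[idx], y[idx])                # minibatch SGD: noisy gradient estimate
```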
Ephemeral_Epoch t1_iqnscns wrote
Seems like you could approximate a minibatch with a full batch + noise? Maybe there's a better noising procedure when using full batch gradients.
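A hedged sketch of what "full batch + noise" might look like, in the same toy least-squares setting: take the exact gradient and perturb it with isotropic Gaussian noise before the update. The scale `sigma` is a made-up knob; choosing it (and the noise's covariance structure) well is exactly the hard part the next reply raises.

```python
import numpy as np

def noisy_full_batch_step(w, X, y, lr=0.01, sigma=0.1, rng=np.random.default_rng()):
    g = 2.0 * X.T @ (X @ w - y) / len(y)             # exact full-batch gradient
    g_noisy = g + sigma * rng.normal(size=g.shape)   # hand-added isotropic Gaussian noise
    return w - lr * g_noisy
```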
SNAPscientist t1_iqr3sej wrote
Capturing the distributional characteristics of high-dimensional data is very hard. In fact, if we could do that well, we might be able to use classic Bayesian techniques for many NN problems, which would be more principled and interpretable. Any noise added by hand is unlikely to introduce the kind of stochasticity that sampling real data (via minibatches or similar procedures) does. Getting the distribution wrong would likely mean poor generalization.
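One way to see the point: the noise induced by minibatch sampling has a covariance that depends on the data and on the current parameters, and it is generally anisotropic, so a fixed isotropic Gaussian is a poor stand-in. A rough sketch (toy least-squares again, sampling with replacement assumed; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
w = rng.normal(size=20)

# per-example gradients of the squared error, shape (n, d)
per_example_grads = 2.0 * X * (X @ w - y)[:, None]

# covariance of a size-B minibatch gradient around the full-batch gradient
B = 32
minibatch_cov = np.cov(per_example_grads, rowvar=False) / B

# eigenvalue spread shows how far this is from sigma^2 * I
eigvals = np.linalg.eigvalsh(minibatch_cov)
print(eigvals.max() / eigvals.min())
```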
fasttosmile t1_iqrel80 wrote
This is wrong; see: https://www.youtube.com/watch?v=kcVWAKf7UAg
The real reason is it's just faster to train on smaller batches (because the steps are quicker).
ClearlyCylindrical t1_iqrmrxz wrote
Yes, that too. My explanation wasn't incorrect, though; there was just more to add to it, right?
fasttosmile t1_iqrolwa wrote
For a while there was a belief that the stochasticity was key to good performance (one paper from 2016 supported that hypothesis). Your framing makes it sound like that is still the case - you suggest no other reason for not doing full-batch descent - and I think it's important to point out that it's not.