
ClearlyCylindrical t1_iqna0cr wrote

Even if it were possible to do full-batch training all the time, minibatches would likely still be used. The stochasticity introduced by minibatch gradient descent generally improves a model's generalisation performance.
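
A minimal sketch (mine, not from the comment) of where that stochasticity comes from, contrasting full-batch and minibatch updates on a toy least-squares problem; the names (X, y, w_full, w_mini, batch_size) are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the (mini)batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_full = np.zeros(10)
w_mini = np.zeros(10)
lr, batch_size = 0.05, 32

for step in range(200):
    # Full batch: the same deterministic gradient at every step.
    w_full -= lr * grad(w_full, X, y)

    # Minibatch: each step sees a random subsample, so the update
    # direction is a noisy estimate of the full-batch gradient.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    w_mini -= lr * grad(w_mini, X[idx], y[idx])
```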

26

Ephemeral_Epoch t1_iqnscns wrote

Seems like you could approximate a minibatch with a full batch + noise? Maybe there's a better noising procedure when using full batch gradients.
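
A rough sketch of the "full batch + noise" idea: take the exact full-batch gradient and perturb it with Gaussian noise whose scale shrinks with an assumed effective batch size. `sigma` and `eff_batch` are made-up knobs here, not a known-good recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_full_batch_grad(w, X, y, eff_batch=32, sigma=1.0):
    # Exact full-batch gradient of the mean squared error.
    g = 2.0 * X.T @ (X @ w - y) / len(y)
    # Crude stand-in for minibatch noise: isotropic Gaussian, scaled so the
    # noise shrinks as the assumed effective batch size grows.
    noise = rng.normal(size=g.shape) * sigma / np.sqrt(eff_batch)
    return g + noise
```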

5

SNAPscientist t1_iqr3sej wrote

Capturing the distribution characteristics of high-dimensional data is very hard. In fact, if we could do that well, we might be able to use classic Bayesian techniques for many NN problems, which would be more principled and interpretable. Any noise added by hand is unlikely to introduce the kind of stochasticity that sampling real data (via minibatches or similar procedures) does. Getting the distribution wrong would likely mean poor generalization.
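
A quick toy illustration of that point (my own setup, not from the comment): the covariance of per-example gradients is typically anisotropic, so hand-added isotropic noise would not match it:

```python
import numpy as np

rng = np.random.default_rng(0)
# Features with very different scales, so the gradient noise is anisotropic.
X = rng.normal(size=(2000, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
y = X @ np.ones(5) + 0.1 * rng.normal(size=2000)
w = np.zeros(5)

# Per-example gradients of squared error: shape (n_examples, n_params).
per_example = 2.0 * (X @ w - y)[:, None] * X

cov = np.cov(per_example, rowvar=False)
print(np.linalg.eigvalsh(cov))  # eigenvalues span orders of magnitude
```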

2

fasttosmile t1_iqrel80 wrote

This is wrong; see: https://www.youtube.com/watch?v=kcVWAKf7UAg

The real reason is that it's simply faster to train on smaller batches (because each step is quicker).

2

ClearlyCylindrical t1_iqrmrxz wrote

Yes, that too, although my explanation wasn't incorrect; it just needed more detail, right?

1

fasttosmile t1_iqrolwa wrote

For a while there was a belief that the stochasticity was key to good performance (one paper supporting the hypothesis is from 2016). Your framing makes it sound like that is still the case, since you suggest no other reason for not doing full-batch descent, and I think it's important to point out that it isn't.

1