magpiesonskates t1_j1u1oag wrote

This is only true if you use a batch size of 1. Randomly sampled batches should average out the effect you're describing.
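A quick toy sketch of what that averaging buys you (plain numpy, made-up linear-regression example, nothing from a real training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for y = 3*x + noise; we look at gradients of 0.5*(w*x - y)^2 at w = 0
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.5, size=1000)
w = 0.0

def per_sample_grad(w, x, t):
    # d/dw of 0.5 * (w*x - t)^2
    return (w * x - t) * x

# Spread of single-sample gradients (batch size 1)
g_single = per_sample_grad(w, X, y)

# Spread of gradients averaged over random batches of 64
batch_grads = [per_sample_grad(w, X[idx], y[idx]).mean()
               for idx in (rng.choice(1000, size=64, replace=False) for _ in range(1000))]

print("std of batch-size-1 gradients: ", g_single.std())
print("std of batch-size-64 gradients:", np.std(batch_grads))
# The per-sample "pulls" differ wildly, but their batch average is far less
# noisy (roughly std / sqrt(64)), which is the averaging-out effect.
```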

10

eigenham t1_j1uhtn7 wrote

A similar phenomenon happens because of batching in general, though. More generally, the distribution of the samples in each batch determines what the cost function "looks like" (as a function approximation) to the gradient calculation. That sampling (and thus the function approximation) can be biased towards a single example or a subset of examples. I think OP's question is still an interesting one for the general case.
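As a rough illustration of "what the cost function looks like to the gradient calculation" (a toy two-class logistic regression I made up, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy classes: class 0 clustered near (2, 0), class 1 near (0, 2)
X = np.vstack([rng.normal(loc=(2, 0), scale=1.0, size=(500, 2)),
               rng.normal(loc=(0, 2), scale=1.0, size=(500, 2))])
y = np.concatenate([np.zeros(500), np.ones(500)])
w = np.zeros(2)

def grad(w, X, y):
    # Average logistic-loss gradient over the given samples
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

g_full = grad(w, X, y)                                   # full-data gradient
balanced = rng.choice(1000, size=64, replace=False)      # mixed batch
skewed = rng.choice(500, size=64, replace=False)         # class-0 only

print("balanced batch vs full data:", cosine(grad(w, X[balanced], y[balanced]), g_full))
print("skewed batch   vs full data:", cosine(grad(w, X[skewed], y[skewed]), g_full))
# The class-skewed batch hands the optimizer a noticeably different descent
# direction: it is effectively descending a different approximation of the loss.
```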

1

derpderp3200 OP t1_j1vmr20 wrote

Similar but not identical? What effect do you mean?

But yeah, the way I see it, the network isn't descending a single gradient towards a "good classifier" optimum, but rather down whatever gradient is left after the otherwise-destructive interference of the individual training examples' gradients, as opposed to a more "purposeful" extraction of features.

Which happens to result in a gradual movement towards being a decent classifier, but it strictly relies on large, balanced, well-crafted datasets to cancel the "pull vectors" out to roughly zero so the convergence effect dominates, and it comes at an incredibly high training cost.

I don't know how it would look, but surely a more "cooperative" learning process would learn faster if not better.
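Something like this toy numpy sketch is how I'd picture measuring that interference (made-up logistic-regression example, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up logistic-regression problem, just to have per-example gradients
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) + rng.normal(scale=2.0, size=200) > 0).astype(float)
w = 0.1 * rng.normal(size=5)

def per_example_grads(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X * (p - y)[:, None]          # one gradient row per training example

G = per_example_grads(w, X, y)
G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
cos = G @ G.T                            # pairwise cosine similarities

off_diag = cos[~np.eye(len(cos), dtype=bool)]
print("fraction of example pairs whose gradients conflict:", (off_diag < 0).mean())
# When many pairs point in opposing directions, much of each example's "pull"
# cancels in the summed update and only the residual actually moves the weights.
```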

1

Zondartul t1_j2cthgl wrote

Would using a batch size of "all your data at once" (so basically no batching) be ideal, even if infeasible?

1

derpderp3200 OP t1_j1ufkob wrote

Are there any articles or papers benchmarking this, or exploring more elaborate solutions than just batching?

0

HateRedditCantQuitit t1_j1v0fto wrote

The whole SGD & optimizer field is kinda about this. For a small example, think about how momentum interacts with the problem you're talking about.
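A minimal sketch of that interaction (toy quadratic, with made-up noise standing in for the per-sample pulls the thread is about):

```python
import numpy as np

rng = np.random.default_rng(3)

dim = 50
w = np.ones(dim)            # some fixed point in weight space, optimum is at 0

def true_grad(w):
    return w                # gradient of 0.5 * ||w||^2

def batch_grad(w):
    # Each "batch" sees the true gradient plus a large example-dependent pull
    return true_grad(w) + rng.normal(scale=3.0, size=dim)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Momentum buffer: exponential moving average of recent batch gradients
# (weights held fixed here just to isolate the averaging effect)
beta, v = 0.9, np.zeros(dim)
for _ in range(100):
    v = beta * v + (1 - beta) * batch_grad(w)

print("single batch gradient vs true gradient:", cosine(batch_grad(w), true_grad(w)))
print("momentum buffer       vs true gradient:", cosine(v, true_grad(w)))
# The buffer points far more reliably at the true descent direction than any
# one batch does, because the conflicting per-batch pulls average out over time.
```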

3