Submitted by derpderp3200 t3_zwd49c in MachineLearning

I don't remember where I read about this, but it left a lasting impression on me because it feels intuitively true and impactful: in a sense, learning on each datapoint pulls the network towards encoding that individual example, relying on the stochastic emergence of shared features, which in turn relies on a dataset-to-model-size ratio that prevents overfitting and on a balanced dataset.

Has there been any research into counteracting this phenomenon, such as more purposeful extraction of features, clever batching schemes, synthetic datapoints, or anything else along those lines?

4

Comments

magpiesonskates t1_j1u1oag wrote

This is only true if you use a batch size of 1. Randomly sampled batches should average out the effect you're describing.
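
A minimal numpy sketch of that averaging effect, using a toy one-parameter model (all names and values illustrative):

```python
import numpy as np

# Toy one-parameter model y = w * x with per-example squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=256)
Y = 3.0 * X + rng.normal(scale=0.5, size=256)
w = 0.0

# dL_i/dw for each example: each one "pulls" w toward fitting that example.
per_example = 2.0 * (w * X - Y) * X

# A randomly sampled batch applies the mean pull, not any single one.
batch = rng.choice(per_example, size=32, replace=False)
print("spread of individual pulls:", per_example.std())
print("averaged batch gradient:   ", batch.mean())
```

With a batch size of 1 the update is one of the noisy per-example pulls; with a larger random batch it approaches the full-data gradient.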

10

eigenham t1_j1uhtn7 wrote

A similar phenomenon happens because of batching in general though. More generally, the distribution of the samples in each batch determines what the cost function "looks like" (as a function approximation) to the gradient calculation. That sampling (and thus the function approximation) can be biased towards a single sample or a subset of samples. I think OP's question is still an interesting one for the general case.
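
A toy numpy illustration of that bias (a hypothetical two-subpopulation setup): the gradient the optimizer sees depends entirely on which samples land in the batch:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two subpopulations whose individually optimal slopes differ (w=1 vs w=5).
x_a = rng.normal(size=100); y_a = 1.0 * x_a
x_b = rng.normal(size=100); y_b = 5.0 * x_b
w = 3.0  # current parameter, sitting between the two optima

def grad_w(x, y, w):
    # Mean d/dw of (w*x - y)^2 over the batch.
    return np.mean(2.0 * (w * x - y) * x)

# A balanced batch roughly cancels the opposing pulls; a biased one doesn't.
print(grad_w(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]), w))
print(grad_w(x_a, y_a, w))  # batch drawn only from subpopulation A
```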

1

derpderp3200 OP t1_j1vmr20 wrote

Similar but not identical? What effect do you mean?

But yeah, the way I see it, the network isn't descending a single gradient towards a "good classifier" optimum, but rather down whatever gradient is left after the otherwise-destructive interference of the gradients of individual training examples, as opposed to a more "purposeful" extraction of features.

This happens to result in gradual movement towards being a decent classifier, but it strictly relies on balanced, large, well-crafted datasets to cancel the "pull vectors" out to "zero" so that the convergence effect dominates, and it comes at incredibly high training cost.

I don't know how it would look, but surely a more "cooperative" learning process would learn faster if not better.

1

Zondartul t1_j2cthgl wrote

Would using a batch size of "all your data at once" (so basically full-batch gradient descent, with no batching) be ideal, even if unfeasible?

1

derpderp3200 OP t1_j1ufkob wrote

Are there any articles or papers benchmarking this, or exploring more elaborate solutions than just batching?

0

HateRedditCantQuitit t1_j1v0fto wrote

The whole SGD & optimizer field is kinda this. Think about how momentum and the problem you’re talking about interact, for a small example.
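
For concreteness, a minimal sketch of heavy-ball momentum (generic, not any particular library's optimizer): the velocity is a decaying average of past batch gradients, so any one batch's idiosyncratic pull only partially steers the step:

```python
import numpy as np

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    # Heavy-ball update: v accumulates a decaying sum of past gradients,
    # damping any single batch's idiosyncratic pull on the weights.
    v = beta * v + g
    return w - lr * v, v

# Usage: feed it noisy per-batch gradients; conflicting pulls mostly cancel in v.
rng = np.random.default_rng(2)
w, v = np.zeros(4), np.zeros(4)
for _ in range(100):
    g = rng.normal(size=4)  # stand-in for a noisy batch gradient
    w, v = momentum_step(w, g, v)
```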

3

ResponsibilityNo7189 t1_j1ulsd1 wrote

That is why you have hundreds of millions of parameters in a network. There are so many directions the weights can move in that it's not a zero-sum game: some directions will barely be detrimental to other examples. It's precisely for this reason that self-supervised methods tend to work best on very deep networks; see "Scaling Vision Transformers".

7

derpderp3200 OP t1_j1vgi23 wrote

I assume this is the case early in training, but eventually the training process starts needing to "compress" information so that a given parameter handles more than one very specific case, at which point it'll be subject to this phenomenon again: any dog example will want the "not dog" neurons inactive, and will want the neurons that contribute to classifying other classes inactive too.

Sure, statistically you're still descending towards a network that's good at each class, but that only holds when your classes (and thus the "pull effects") are balanced; it isn't an intrinsic ability of the network to extract differentiating features.
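
One way to put a number on that "pull" conflict (a hypothetical sketch; g_dog and g_cat stand in for per-class mean gradients of some shared layer):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for the mean gradient over all "dog" vs. all "cat" examples.
g_dog = rng.normal(size=10)
g_cat = rng.normal(size=10)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# cosine < 0: the classes pull shared weights in opposing directions;
# cosine near 0: nearly orthogonal updates that barely interfere.
print(cosine(g_dog, g_cat))
```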

1

velcher t1_j217txd wrote

https://arxiv.org/abs/2001.06782 Gradient Surgery for Multi-Task Learning

Some related work in multi-task RL, though my impression was that it only moderately helps multi-task RL.
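
A rough sketch of the paper's core projection step (simplified to a single pair of task gradients; the paper applies it over random task orderings):

```python
import numpy as np

def project_conflicting(g_i, g_j):
    """If the two task gradients conflict (negative inner product),
    strip from g_i its component along g_j so it stops fighting task j."""
    dot = g_i @ g_j
    if dot < 0:
        g_i = g_i - (dot / (g_j @ g_j)) * g_j
    return g_i

g1 = np.array([1.0, -2.0])
g2 = np.array([1.0, 1.0])   # conflicts with g1 (dot = -1)
g1p = project_conflicting(g1, g2)
print(g1p, g1p @ g2)        # projected gradient no longer opposes g2
```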

3

Red-Portal t1_j1vu94s wrote

I think what you're describing is similar to curriculum learning and importance sampling SGD. The former claims that there is a better order in which to feed data during SGD that results in better training, though I'm not sure how scientifically grounded that line of research has become; it used to be closer to art. The latter is simple: since some samples are more "destructive" (higher variance), sample them less often while numerically compensating for that.
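
A hedged sketch of that importance-sampling idea (all scores and probabilities hypothetical): sample "destructive" examples less often, then reweight so the gradient estimate stays unbiased:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
grads = rng.normal(size=(n, 8))                  # stand-in per-example gradients
variance_score = rng.uniform(1.0, 5.0, size=n)   # stand-in "destructiveness"

# Sample high-variance examples less often...
p = 1.0 / variance_score
p /= p.sum()
idx = rng.choice(n, size=32, p=p)

# ...then reweight by 1/(n * p_i) so the estimate stays unbiased:
# E[mean_i w_i * g_i] equals the plain average gradient over all n examples.
weights = 1.0 / (n * p[idx])
batch_grad = np.mean(weights[:, None] * grads[idx], axis=0)
print(batch_grad[:3], grads.mean(axis=0)[:3])    # close in expectation
```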

1

Nameless1995 t1_j1xeo4l wrote

There is a literature on taking gradient agreement/conflict into account, with varying motivations (usually different from OP's exact motivation).

This is one place to start looking: https://arxiv.org/abs/2009.00329 (you can find related work from the citations in Google Scholar/Semantic Scholar)
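
One recurring idea in that literature is to keep only gradient components whose sign agrees across examples (a hedged sketch of the general idea, not necessarily the linked paper's exact recipe):

```python
import numpy as np

def agreement_masked_grad(per_example_grads, tau=1.0):
    """per_example_grads: (n_examples, n_params). Zero out components
    whose sign is not (near-)unanimous across examples, so only
    'agreed-upon' directions survive into the update."""
    signs = np.sign(per_example_grads)
    agreement = np.abs(signs.mean(axis=0))  # 1.0 = every example agrees
    mask = agreement >= tau
    return per_example_grads.mean(axis=0) * mask

g = np.array([[ 0.5,  1.0, -0.2],
              [ 0.3, -1.0, -0.4]])
print(agreement_masked_grad(g))  # middle component zeroed: signs disagree
```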

1

derpderp3200 OP t1_j1ygqtj wrote

What a fascinating paper. It reminds me of an idea I had: storing some sort of secondary value in weights that contribute to correct outputs, to prevent their features from being unlearned, though I had no specific idea of how to execute it. I can't believe I didn't think of what this paper's authors did. Thank you.

1

IndecisivePhysicist t1_j1yd893 wrote

This is good, actually. The "noise" of minibatch SGD acts as a regularizer. You don't actually want to find the global minimum of the training set; you want a generalizable minimum, which usually means a flat minimum, because there will be some distribution shift at test time. The minimum being slightly different for each minibatch helps draw you toward a flat one.
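
A crude way to see what "flat" means here (a hypothetical probe, not a standard metric): perturb the weights slightly and watch how much the loss rises; flat minima barely move, sharp ones spike:

```python
import numpy as np

def flatness_probe(loss_fn, w, scale=1e-2, trials=20, seed=0):
    """Average loss increase under small random weight perturbations."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    return np.mean([loss_fn(w + scale * rng.normal(size=w.shape)) - base
                    for _ in range(trials)])

# Toy 1-D example: a sharp quadratic vs. a flat one around their minima.
print(flatness_probe(lambda w: 100 * w @ w, np.zeros(1)))  # sharp: large rise
print(flatness_probe(lambda w: 0.1 * w @ w, np.zeros(1)))  # flat: tiny rise
```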

1

derpderp3200 OP t1_j1yh8qe wrote

But is it the most efficient and effective method?

I'd imagine it's likely possible to converge much faster, and that at some point in training you run into a "limit" where the "signal" (learnable features) can no longer overcome the "noise" (the "pull effect").

1

nonotan t1_j1yovo3 wrote

It's probably not the most efficient method. However, in general, methods that converge faster tend to lead to slightly worse minima (think momentum-based methods vs. "plain" SGD), which "intuitively" makes some degree of sense: the additional time spent training isn't completely wasted, with some of it effectively helping explore the possibility space, optimizing the model in ways that simple gradient-following might miss entirely.

I would be shocked if there doesn't exist a method that does even better than SGD while also being significantly more efficient. But it's probably not going to be easy to find, and I expect most simple heuristics ("this seems to be helping, do it more" or "this doesn't seem to be helping, do it less") will lead to training time vs accuracy tradeoffs, rather than universal improvements.

3

IndecisivePhysicist t1_j20lxts wrote

Converge to what, though? My whole point was that you don't want to converge to the actual global minimum on the training set; you want one of the many local minima, and you want one that is flat.

1

realjunkman t1_j2a54lt wrote

There was a paper recently about finding parameter regions that are unused and only updating those during fine-tuning. Can't remember the name, but it was an interesting approach.

1

derpderp3200 OP t1_j2avw24 wrote

Interesting! I thought about something similar, a "no parameter left unused" constraint during training, but using unused regions for fine-tuning sounds like a much cleverer application of the principle.

1

realjunkman t1_j2cc4nm wrote

It was a presentation I saw at EMNLP this past year. I’ll try and look for it, but if I don’t report back… it was a presentation during day 3!

1

big_haptun777 t1_j1ug2fb wrote

I believe this has already been largely addressed via shuffling and batching: with randomly shuffled batches, you will likely avoid getting stuck in local minima.
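
In practice that's just the standard shuffled-minibatch setup, e.g. in PyTorch (illustrative tensors):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Reshuffling every epoch gives each batch a fresh random mix of examples,
# so no single example's "pull" dominates the updates.
data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 10, (1000,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

for x, y in loader:
    pass  # forward/backward/step would go here
```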

−3