derpderp3200 OP t1_j1yh8qe wrote

But is it the most efficient and effective method?

I'd imagine it's likely possible to converge much faster, and that at some point during training you run into a "limit" where the "signal" (learnable features) can no longer overcome the "noise" (the "pull effect").

nonotan t1_j1yovo3 wrote

It's probably not the most efficient method. However, in general, methods that converge faster tend to lead to slightly worse minima (think momentum-based methods vs "plain" SGD), which intuitively makes some degree of sense: the additional time spent training isn't completely wasted, since some of it effectively helps explore the possibility space, optimizing the model in ways that simple gradient-following might miss entirely.
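As a minimal sketch of that comparison (assuming PyTorch, and a toy regression task invented here just for illustration): the only difference between the two runs below is the `momentum` argument, and the momentum run will typically drive the training loss down faster on the same budget, which is the speed-vs-minimum-quality tradeoff being discussed.

```python
import torch
import torch.nn as nn

X = torch.randn(512, 20)   # toy inputs (hypothetical data, for illustration only)
y = torch.randn(512, 1)    # toy targets

def train(momentum):
    torch.manual_seed(0)   # same init for both runs, so only the optimizer differs
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=momentum)
    loss_fn = nn.MSELoss()
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

print("plain SGD loss:   ", train(momentum=0.0))
print("momentum SGD loss:", train(momentum=0.9))
```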

I would be shocked if there doesn't exist a method that does even better than SGD while also being significantly more efficient. But it's probably not going to be easy to find, and I expect most simple heuristics ("this seems to be helping, do it more" or "this doesn't seem to be helping, do it less") will lead to training time vs accuracy tradeoffs, rather than universal improvements.

IndecisivePhysicist t1_j20lxts wrote

Converge to what, though? My whole point was that you don't want to converge to the actual global minimum of the training loss; you want one of the many local minima, and specifically one that is flat.
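A crude way to probe that "flatness" (a sketch assuming PyTorch and a trained `model`, `loss_fn`, and batch `X, y` like the ones in the example above): perturb every weight with small Gaussian noise a few times and see how much the loss rises on average. A flat minimum should show only a small increase, a sharp one a large increase.

```python
import torch

@torch.no_grad()
def sharpness(model, loss_fn, X, y, sigma=0.01, trials=20):
    base = loss_fn(model(X), y).item()
    originals = [p.clone() for p in model.parameters()]
    increases = []
    for _ in range(trials):
        # jitter every parameter around the trained solution
        for p, orig in zip(model.parameters(), originals):
            p.copy_(orig + sigma * torch.randn_like(orig))
        increases.append(loss_fn(model(X), y).item() - base)
    # restore the trained weights
    for p, orig in zip(model.parameters(), originals):
        p.copy_(orig)
    return sum(increases) / len(increases)
```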
