currentscurrents t1_jbbmmqs wrote on March 7, 2023 at 9:41 PM

Reply to comment by _Arsenie_Boca_ in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__

Eventually you can reach a point where any possible change to the model decreases performance. Then you've fully converged.

Nobody ever does this though because of diminishing returns.
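To make "fully converged" and "diminishing returns" concrete, here is a minimal toy sketch. It is not an LLM: the one-weight "model", the quadratic loss, the `tol` threshold, and the function name `train_until_converged` are all made up for illustration. Training stops once a step no longer improves the loss by more than `tol`.

```python
def train_until_converged(lr=0.1, tol=1e-12, max_steps=10_000):
    """Gradient descent on a toy loss (w - 3)^2 until the loss stops improving."""
    w = 0.0                      # hypothetical single-weight "model"
    loss = (w - 3.0) ** 2
    history = [loss]             # loss after each step, starting with step 0
    for step in range(1, max_steps + 1):
        grad = 2.0 * (w - 3.0)   # derivative of (w - 3)^2 with respect to w
        w -= lr * grad
        new_loss = (w - 3.0) ** 2
        history.append(new_loss)
        # Converged: this step bought (almost) nothing.
        if loss - new_loss < tol:
            return w, step, history
        loss = new_loss
    return w, max_steps, history


if __name__ == "__main__":
    w, steps, history = train_until_converged()
    print(f"stopped after {steps} steps at w = {w:.6f}")
    print(f"gain from step 1:  {history[0] - history[1]:.3e}")
    print(f"gain from step 11: {history[10] - history[11]:.3e}")
```

Each step improves the loss by less than the one before, which is the diminishing-returns part: most of the benefit arrives early, and the last stretch to true convergence costs many steps for almost no gain.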

farmingvillein t1_jbk2uyw wrote on March 9, 2023 at 4:45 PM

> Nobody ever does this though because of diminishing returns.

Extending the LLaMA concept, I would love to see someone like Meta run the experiment where they take their 1.4T (or whatever) tokens and train to convergence...on the largest model that will converge (subject to reasonable LR decay policies) in a "reasonable" time frame.

Meaning, if they trained, say, a 1M param LLM...presumably it would hit convergence (get saturated) pretty quickly. And what about 10M, 100M, etc.?

I.e., how much more can we squeeze out of a relatively tiny model? It probably doesn't end up super interesting from a purely generative POV, but it might look like, e.g., RoBERTa+.

With a model that is so small, the cost to run this test probably(?) wouldn't be that high.
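For a sense of scale, here is a back-of-the-envelope sketch of how many tokens per parameter those model sizes would see on 1.4T tokens. The comparison point of roughly 20 tokens per parameter is the commonly quoted Chinchilla rule of thumb, used here as an assumption rather than an exact figure; the function names are made up for illustration.

```python
TOTAL_TOKENS = 1_400_000_000_000        # the 1.4T tokens mentioned above
RULE_OF_THUMB_TOKENS_PER_PARAM = 20     # assumed rough Chinchilla heuristic


def tokens_per_param(n_params, total_tokens=TOTAL_TOKENS):
    """Training tokens seen per model parameter in a single pass over the data."""
    return total_tokens / n_params


def overtraining_factor(n_params):
    """How many times past the assumed rule of thumb this run would go."""
    return tokens_per_param(n_params) / RULE_OF_THUMB_TOKENS_PER_PARAM


if __name__ == "__main__":
    for n_params in (1_000_000, 10_000_000, 100_000_000):
        print(
            f"{n_params:>11,} params: "
            f"{tokens_per_param(n_params):>11,.0f} tokens/param, "
            f"{overtraining_factor(n_params):>8,.0f}x the rule of thumb"
        )
```

Under these assumptions even the 100M model would be trained hundreds of times past the compute-optimal point, which is exactly the regime the experiment is meant to probe.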

cztomsik t1_jbgdoar wrote on March 8, 2023 at 9:17 PM

But this is likely going to take forever because of LR decay, right?
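A rough sketch of why decay slows things down, assuming a cosine schedule that decays to zero (the thread doesn't say which decay policy is meant; `lr_max`, `lr_min`, and the function names are illustrative choices):

```python
import math


def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=0.0):
    """Cosine-decayed learning rate at a given step (assumed schedule)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def tail_share(total_steps, tail_fraction=0.1):
    """Share of the summed learning rate that falls in the final part of training."""
    rates = [cosine_lr(s, total_steps) for s in range(total_steps)]
    tail_start = int(total_steps * (1.0 - tail_fraction))
    return sum(rates[tail_start:]) / sum(rates)


if __name__ == "__main__":
    total = 1000
    for step in (0, 500, 900, 990):
        print(f"step {step:>4}: lr = {cosine_lr(step, total):.2e}")
    print(f"last 10% of steps carry {tail_share(total):.2%} of the total LR")
```

With this schedule the last 10% of steps contribute well under 1% of the summed learning rate, so the weights barely move near the end. Getting closer to convergence would mean stretching the schedule or raising the floor `lr_min`, both of which cost more steps.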