Submitted by __Maximum__ t3_11l3as6 in MachineLearning
__Maximum__ OP t1_jbb5bzm wrote
Reply to comment by CKtalon in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Right, I just noticed that the LLaMA paper says they didn't fix their compute budget. Thanks. I wonder whether there is a small architecture that has been trained to convergence.
_Arsenie_Boca_ t1_jbbh5ng wrote
"Until convergence" is something we say and hear all the time, but by definition it makes no sense: convergence never ends.
__Maximum__ OP t1_jbbi89l wrote
Until looking at the loss no longer gets you excited?
currentscurrents t1_jbbmmqs wrote
Eventually you can reach a point where any possible change to the model decreases performance. Then you've fully converged.
Nobody ever does this though because of diminishing returns.
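A minimal sketch of what "fully converged" could mean in practice (my own illustration, not anything from the papers discussed here): keep training until the validation loss stops improving by more than some tolerance for a long stretch. The `train_step`/`eval_loss` hooks are hypothetical placeholders supplied by the caller.

```python
# Minimal sketch (my own illustration, not from LLaMA/Chinchilla):
# "train until convergence" as a patience-based stopping rule on validation loss.
def train_to_convergence(model, train_step, eval_loss, tol=1e-4, patience=20):
    """Run train_step until eval_loss stops improving by more than tol
    for `patience` consecutive evaluations, then report the best loss."""
    best, stale, steps = float("inf"), 0, 0
    while stale < patience:              # in practice nobody waits this out
        train_step(model)
        steps += 1
        loss = eval_loss(model)
        if best - loss > tol:            # a meaningful improvement resets the counter
            best, stale = loss, 0
        else:
            stale += 1                    # diminishing returns: improvements shrink
    return best, steps
```

With any realistic tolerance and patience, a big model would sit in the "stale" branch for an enormous number of steps before the rule ever fires, which is exactly the diminishing-returns problem.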
farmingvillein t1_jbk2uyw wrote
> Nobody ever does this though because of diminishing returns.
Extending the LLaMA concept, I would love to see someone like Meta run the experiment where they do take their 1.4T (or whatever) tokens and run training to convergence... on the largest model that will converge (subject to reasonable LR decay policies) in a "reasonable" time frame.
Meaning, if they trained, say, a 1M param LLM...presumably it would hit convergence (get saturated) pretty quickly. And what about 10M, 100M, etc.?
I.e., how much more can we squeeze out of a relatively tiny model? It probably doesn't end up super interesting from a purely generative POV, but it might look like, e.g., RoBERTa+.
With a model that is so small, the cost to run this test probably(?) wouldn't be that high.
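Something like this toy sweep is the shape of the experiment I mean (everything below is a stand-in: the loss function is fake and the numbers are made up, it only illustrates the sweep-until-saturation structure):

```python
import random

def fake_loss(n_params: float, tokens_seen: float) -> float:
    # Toy stand-in for validation loss: more tokens always help a little,
    # but each model size has its own floor it eventually saturates at.
    floor = 4.0 / (n_params ** 0.05)
    return floor + 2.0 / (1.0 + tokens_seen / (20 * n_params)) + random.gauss(0, 1e-3)

def train_to_saturation(n_params, tokens_per_step=1e9, tol=1e-3, patience=10):
    best, stale, tokens = float("inf"), 0, 0.0
    while stale < patience:                      # stop once improvements dry up
        tokens += tokens_per_step
        loss = fake_loss(n_params, tokens)
        if best - loss > tol:
            best, stale = loss, 0
        else:
            stale += 1
    return best, tokens

for n_params in (1e6, 10e6, 100e6, 1e9):         # 1M ... 1B parameters
    best, tokens = train_to_saturation(n_params)
    print(f"{n_params:.0e} params: saturates at loss ~{best:.3f} after ~{tokens:.1e} tokens")
```

The interesting output is how the saturated loss, and the tokens needed to get there, scale with model size.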
cztomsik t1_jbgdoar wrote
but this is likely going to take forever because of LR decay, right?
adt t1_jbbzba8 wrote
There are a few that 'feel' that way. Try Megatron-11B (~200:1) based on RoBERTa (6,198:1). Wayyyyy ahead of its time, and I've matched it with much larger models in some testing.
Here's the full table of Chinchilla-aligned comparisons:
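Not a substitute for the linked table, but here's a rough sketch of how I read those ratios (training tokens per parameter) against Chinchilla's ~20:1 rule of thumb; the token and parameter counts below are my own approximations:

```python
# Rough sketch: tokens-per-parameter ratios vs. Chinchilla's ~20:1 rule of thumb.
# Token/parameter counts are my own approximations, not figures from the table.
models = {
    # name: (approx. training tokens, approx. parameters)
    "Chinchilla-70B": (1.4e12, 70e9),    # ~20:1 by construction
    "Megatron-11B":   (2.2e12, 11e9),    # ~200:1, as cited above
    "RoBERTa-355M":   (2.2e12, 355e6),   # ~6,200:1, as cited above
}

for name, (tokens, params) in models.items():
    ratio = tokens / params
    print(f"{name}: ~{ratio:,.0f} tokens per parameter "
          f"(vs. ~20 for a Chinchilla-optimal run)")
```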
whata_wonderful_day t1_jbcxdwf wrote
Nice! How did you get access to Megatron-11B? I can't find it online anywhere
Jepacor t1_jbdrovb wrote
The link to the model is in the Google Sheet they linked: https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md
whata_wonderful_day t1_jbhp4gb wrote
Thanks. Alas, I had thought it was an encoder model. I've been on the lookout for a big one; the largest I've seen is DeBERTa V2 with 1.5B params.
__Maximum__ OP t1_jbdqy5c wrote
Thanks for the links. Looks like RoBERTa did not gain a lot from the additional training, only minor improvements, but yeah, it was a tiny model. How was this not a good lesson? Why did people need Chinchilla? Maybe it's just that gathering a lot of data comes easy, so people collect as much as possible even though they know they will go at most one epoch over it.