FallUpJV t1_jcje89y wrote
I don't get this anymore. If it's not the model size nor the transformer architecture, then what is it?
Were models just not trained enough / not trained on the right data?
xEdwin23x t1_jcjfnlj wrote
First, this is not a "small" model, so size DOES matter. It may not be hundreds of billions of parameters, but it's definitely not small imo.
Second, it always has been about data (astronaut pointing gun meme).
FallUpJV t1_jclpydo wrote
Yes, it's definitely not small. I meant compared to the models people have been paying the most attention to over the last few years, I guess.
The astronaut pointing gun meme is a good analogy, almost a scary one. I wonder how much we could improve existing models with simply better data.
MysteryInc152 t1_jclpjzi wrote
It's predicting language. As long as the architecture allows the model to properly learn to predict language, you're good to go.
turnip_burrito t1_jcoul9i wrote
Yes, exactly. Everyone keeps leaving the architecture's inductive structural priors out of the discussion.
It's not all about data! The model matters too!
satireplusplus t1_jcp6bu4 wrote
This model uses a "trick" to efficiently train RNNs at scale, and I still have to take a look to understand how it works. Hopefully the paper is out soon!
Otherwise, size is what matters! Getting there is a combination of factors: the transformer architecture scales well and was the first architecture that allowed these LLMs to be cranked up to enormous sizes, plus enterprise GPU hardware with lots of memory (40GB, 80GB) and frameworks like PyTorch that make parallelizing training across multiple GPUs easy.
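For the curious, a minimal sketch of the kind of multi-GPU data parallelism PyTorch makes easy. This is not the actual training code for this model; the tiny model, dummy loss, and hyperparameters are placeholders, and it assumes a launch via `torchrun --nproc_per_node=<num_gpus> train.py`:

```python
# Minimal PyTorch DistributedDataParallel sketch (placeholder model and loss).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process (one per GPU)
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this would be the language model
    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=device)   # placeholder batch
        loss = model(x).pow(2).mean()              # dummy loss
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process sees a different shard of the data, and gradient averaging during `backward()` keeps the replicas in sync.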
And OP's 14B model might be "small" by today's standards, but it's still gigantic compared to a few years ago. It's ~27GB of FP16 weights (14B parameters at 2 bytes each).
Having access to 1TB of preprocessed text data that you can download right away without doing your own crawling is also neat (The Pile).