Submitted by besabestin t3_10lp3g4 in MachineLearning
andreichiffa t1_j60625r wrote
So. First of all, it's not the size, or at least not only the size.
Before ChatGPT, OpenAI experimented with InstructGPT, which at 6B parameters completely destroyed the 175B GPT-3 when it came to satisfying the users interacting with it and not being completely psycho.
Code-generation abilities start to appear around 12B parameters (OpenAI Codex), so most of what you are interacting with and are impressed by could be done with a 12B-parameter model. What is really doing the heavy lifting for ChatGPT is the fine-tuning and guided generation that make it conform to users' expectations.
Now, model size does allow for nice emergent properties, but there is a relationship between dataset size and model size: without increasing the dataset, a bigger model does nothing better. At 175B parameters, GPT-3 was already past that point relative to the curated dataset OpenAI used for it. And given that their dataset already contained Common Crawl, it was pretty much all public writing on the internet.
And they weren't short by a little, but by more than a factor of 10. Finding enough data just to finish training GPT-3 is already a challenge; larger models would need even more. That's why they could dump code and more text into GPT-3 to create GPT-3.5 without creating bottlenecks.
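As a rough back-of-envelope check on that factor (a sketch, not from the comment itself: it assumes the roughly 300B training tokens reported in the GPT-3 paper and the roughly 20-tokens-per-parameter rule of thumb implied by the Chinchilla results linked further down this thread):

```python
# Back-of-envelope estimate of how data-starved a 175B-parameter model is,
# assuming ~300B training tokens (GPT-3 paper) and a ~20 tokens/parameter
# compute-optimal ratio (Chinchilla-style rule of thumb).
params = 175e9              # GPT-3 parameter count
tokens_used = 300e9         # tokens GPT-3 was reportedly trained on
tokens_per_param = 20       # approximate compute-optimal ratio

tokens_needed = params * tokens_per_param
shortfall = tokens_needed / tokens_used
print(f"tokens needed:    {tokens_needed:.2e}")   # ~3.5e12
print(f"shortfall factor: {shortfall:.1f}x")      # ~11.7x
```

Same ballpark as the "over a factor of 10" figure above.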
Now, alternative models to GPT-3 have been trained (OPT-175B, BLOOM), but at least OPT-175B underperforms it. OpenAI actually did a lot of data preparation, meaning that anyone who wants to replicate it would need to figure out the “secret sauce”.
visarga t1_j6bzixy wrote
> without increasing the dataset, bigger model do nothing better
Wrong, bigger models are better than small models even when both are trained on exactly the same data. Bigger models reach the same accuracy using fewer examples. Sometimes using a bigger model is the solution to having too little data.
andreichiffa t1_j6c9xf1 wrote
That’s a very bold claim that flies in the face of pretty much all the research on the subject to date.
Surely you have extraordinary evidence to support such extraordinary claims?
visarga t1_j6n5mgc wrote
Oh, yes, gladly. This "open"AI paper says it:
> Larger models are significantly more sample efficient, such that optimally compute efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
https://arxiv.org/abs/2001.08361
You can improve outcomes from small datasets by making the model larger.
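For reference, the joint fit from that paper takes the form below (a transcription sketch; the constants are the rounded values reported by Kaplan et al., with N the non-embedding parameter count and D the dataset size in tokens):

```latex
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D},
\quad \alpha_N \approx 0.076,\; \alpha_D \approx 0.095,\; N_c \approx 8.8 \times 10^{13},\; D_c \approx 5.4 \times 10^{13}
```

At fixed D, growing N shrinks the first term, which is the sense in which a bigger model can get more out of the same data, although the D_c/D term still sets a floor on how far that goes.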
andreichiffa t1_j6n9lg6 wrote
A lot of the conclusions from that paper were called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805
About a year after that, Anthropic came out with a paper suggesting scaling laws under which undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf
Finally, more recent results from DeepMind did an additional pass on the topic and suggest that the relationship between data and model size is much tighter than anticipated, and that a roughly 4x smaller model trained on roughly 4x the data (i.e. for 4x as long) would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
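To put numbers on that (a sketch using the published Gopher and Chinchilla configurations and the common C ≈ 6ND approximation for training compute; the parameter and token counts are the ones reported by DeepMind):

```python
# Gopher vs Chinchilla under the rough C ~ 6*N*D training-compute estimate:
# a ~4x smaller model trained on ~4x more tokens lands at roughly the same
# compute budget.
def train_compute(params, tokens):
    return 6 * params * tokens   # rough training FLOPs estimate

gopher = train_compute(280e9, 300e9)       # 280B params, ~300B tokens
chinchilla = train_compute(70e9, 1.4e12)   # 70B params, ~1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e23
print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e23, same ballpark
```

Roughly the same compute budget, a quarter of the parameters, and the smaller model came out ahead on downstream evaluations, which is the Chinchilla result in a nutshell.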
Basically, the original OpenAI paper contradicted a lot of prior research on overfitting and generalization, and its conclusion seems to come down to an instance of Simpson's paradox in some of the batching they were doing.