kermunnist

kermunnist t1_j9kqsaw wrote

That's because the smaller models are less useful. With neural networks (likely including biological ones) there's a hard trade off between specialized performance and general performance. If these 100+x smaller models were trained on the same data as GPT-3 they would perform 100+x worse on these metrics (maybe not exactly because in this case the model was multimodal which definitely gave a performance advantage). The big reason this model performed so much is because it was fine tuned on problems similar to the ones on this exam where as GPT-3 was fine turned on anything and everything. This means that this model would likely not be a great conversationalist and would probably flounder at most other tasks GPT-3.5 does well on.

5