Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
drinkingsomuchcoffee t1_jdpg1cb wrote
Reply to comment by YoloSwaggedBased in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
The problem is that learned features aren't factored neatly into a minimal set of parameters. For example, identifying whether an image contains a cat may be spread across thousands of parameters over n layers when it could actually be expressed in 10 parameters over fewer layers. A small model does this compression automatically because it's physically constrained; a large model has no such constraint, so it can afford to be wasteful.

There are probably many ways to get the best of both worlds at training time, but it's by no means an easy problem, and the current distillation and retraining methods feel clunky. What we actually want is for the big model to use all of its parameters efficiently rather than waste them, and if much more compact models can get similar results, it's likely not doing that. Needing an order of magnitude more parameters for a few percentage points of improvement is probably extremely wasteful. Compare that to biological systems, where an order-of-magnitude increase in size yields huge cognitive improvements.
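For reference, the "clunky" distillation approach mentioned above is usually the standard soft-target setup (Hinton et al.): train a small student to match a frozen teacher's softened output distribution. A minimal PyTorch sketch, where `teacher`, `student`, and the hyperparameters `T` and `alpha` are placeholders, not tuned values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the student's and the teacher's
    # temperature-softened output distributions, scaled by T^2 as in the
    # original soft-target formulation.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative training step: the teacher is frozen, only the student updates.
# teacher.eval()
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```

Note this happens entirely after the fact: the big model has already spent its capacity however it liked, and the student just approximates the result, which is exactly why it feels clunky compared to making the large model parameter-efficient during training.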