
cdsmith t1_j2uzks4 wrote

The idea is that there's an inflection point: at first you are mainly removing (masking with zeros) dimensions whose values are extremely small anyway and don't make much difference to the output, so you don't lose much accuracy. But after you've removed those dimensions, the remaining ones are specifically the ones that do matter, so you can't just go find more non-impactful dimensions again. They are already gone.
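
For concreteness, here's a minimal PyTorch sketch of the kind of magnitude pruning being described (the `magnitude_prune` helper and the sparsity levels are just illustrative, not taken from any particular paper): weights with the smallest absolute values get masked to zero, and past some sparsity you start zeroing weights that actually matter.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of entries with the smallest |value|."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight.clone()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    # Keep only entries whose magnitude is strictly above the threshold.
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

# Sweep sparsity levels on a random weight matrix to see how much survives.
w = torch.randn(256, 256)
for s in (0.5, 0.9, 0.99):
    pruned = magnitude_prune(w, s)
    kept = (pruned != 0).float().mean().item()
    print(f"sparsity={s:.2f}, fraction of weights kept={kept:.3f}")
```

(In a real model you'd measure accuracy at each sparsity level rather than just the kept fraction; that's where the inflection point shows up.)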

As for what would happen if you over-pruned a model trained with a large number of parameters, I'd naively expect it to do much worse. If you train with more parameters and then zero out significant weights, then not only do you have a lower-dimensional space to model in (which is unavoidable), but you also lose some of the information correlated with the dimensions you've kept, because at training time your model relied on the parameters you have now zeroed out to capture that information.


visarga t1_j2yvpjs wrote

Recent papers have shown that even small models, under 10B parameters, can benefit from training on multi-task data. Learning to solve a large number of tasks works even when the model is under 60B parameters.

But none of them even reach 50% of GPT-3's scores, closed models excluded.
