EmmyNoetherRing t1_j5253a8 wrote
Reply to comment by mycall in [R] Massive Language Models Can Be Accurately Pruned in One-Shot by starstruckmon
>Softmax activation function
Ok, got it (after a quick review of Wikipedia). So to rephrase the quoted paragraph: they find that the divergence between the training and testing distributions (between the compressed versions of the training and testing data sets, in my analogy) starts decreasing smoothly as the model scales up, long before actual final task performance locks into place.
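Not from the paper or the blog post, just a toy sketch of what I mean, with made-up numbers and names (`base`, `targets`, `scale` are all illustrative): a smooth, softmax-based divergence measure can keep improving with "scale" while a discrete exact-match metric sits at zero until the correct answer finally overtakes the wrong one.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_true, p_pred):
    """Smooth metric: average cross-entropy (in nats) against the targets."""
    return float(-np.mean(np.sum(p_true * np.log(p_pred + 1e-12), axis=-1)))

def exact_match(p_true, p_pred):
    """Discrete task metric: credit only when the argmax answer is right."""
    return float(np.mean(p_true.argmax(-1) == p_pred.argmax(-1)))

# Toy setup: 4 "questions", each with one correct option out of 5.
targets = np.eye(5)[[2, 0, 4, 1]]

# Made-up base logits where a wrong option starts out on top for every question.
base = np.array([[0.2, 0.1, 0.0, 1.5, 0.3],
                 [0.0, 0.3, 1.4, 0.2, 0.1],
                 [1.6, 0.2, 0.1, 0.3, 0.0],
                 [0.2, 0.0, 0.3, 0.1, 1.3]])

# "scale" crudely stands in for model size: bigger means more mass on the
# correct option. Cross-entropy falls smoothly the whole way, but exact-match
# stays at 0 until the correct option finally overtakes the wrong one.
for scale in [0.0, 0.5, 1.0, 2.0]:
    probs = softmax(base + scale * targets)
    print(f"scale={scale:>3}: cross-entropy={cross_entropy(targets, probs):.3f} "
          f"exact-match={exact_match(targets, probs):.2f}")
```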
Hm. I think that says more about task complexity (maybe some fundamental, computability-style notion of task complexity that we don't have well defined for these kinds of tasks yet) than about imagination. But I'm still with you on imagination being a factor, and of course the paper and the blog post both leave the cliff problem unsolved. Possibly there's a definition of imagination such that we can say degree X of it is needed to successfully complete those tasks.