Submitted by starstruckmon t3_1027geh in MachineLearning
currentscurrents t1_j2srptn wrote
Reply to comment by EmmyNoetherRing in [R] Massive Language Models Can Be Accurately Pruned in One-Shot by starstruckmon
I've seen other research suggesting that pruning as a continual process during training can actually improve performance, which is interesting since that is what the brain does.
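The usual trick is something like iterative magnitude pruning: every so often during training, zero out the smallest-magnitude weights and keep training the rest. A toy numpy sketch of the idea (the 10% rate and per-epoch schedule are invented for illustration, not from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256))
mask = np.ones_like(weights)  # 1 = weight alive, 0 = pruned

def train_step(w):
    # Stand-in for a real gradient update.
    return w - 0.01 * rng.normal(size=w.shape)

for epoch in range(10):
    weights = train_step(weights) * mask  # pruned weights stay frozen at zero
    # Continual pruning: each epoch, drop the smallest 10% of surviving
    # weights, so sparsity grows gradually as training proceeds.
    alive = np.abs(weights[mask == 1])
    mask[np.abs(weights) < np.quantile(alive, 0.10)] = 0
    weights *= mask

print(f"final sparsity: {(mask == 0).mean():.0%}")  # ~65% after 10 rounds
```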
EmmyNoetherRing t1_j2ss8qe wrote
Learning is compression, sorta.
mycall t1_j50ibgp wrote
Not always. Imagination can be learning, and that is an expansion from a steady state.
EmmyNoetherRing t1_j50q53i wrote
Huh, fair. Got a concrete example?
mycall t1_j51wahq wrote
I'm not exactly sure what it is or how it would manifest, but perhaps it is related to "Emergent Abilities of Large Language Models".
EmmyNoetherRing t1_j51wpgz wrote
Thank you, I've been looking for something along these lines.
mycall t1_j51zmqh wrote
Another one worth looking at: https://ai.googleblog.com/2022/11/characterizing-emergent-phenomena-in.html
EmmyNoetherRing t1_j51x98z wrote
> As an alternative evaluation, we measure cross-entropy loss, which is used in scaling laws for pre-training, for the six emergent BIG-Bench tasks, as detailed in Appendix A. This analysis follows the same experimental setup from BIG-Bench (2022) and affirms their conclusions for the six emergent tasks we consider. Namely, cross-entropy loss improves even for small model scales where the downstream metrics (exact match, BLEU, and accuracy) are close to random and do not improve, which shows that improvements in the log-likelihood of the target sequence can be masked by such downstream metrics. However, this analysis does not explain why downstream metrics are emergent or enable us to predict the scale at which emergence occurs. Overall, more work is needed to tease apart what enables scale to unlock emergent abilities.
Don't suppose you know what cross-entropy is?
mycall t1_j51xq1r wrote
Loss/cost functions are used to optimize a model during training; the objective is almost always to minimize the loss, and the lower the loss, the better the model. Cross-entropy is one of the most important loss functions: it's used to optimize classification models, and understanding it hinges on understanding the softmax activation function.
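To make that concrete, here's a minimal numpy sketch (function names and numbers are mine, just for illustration):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability, then exponentiate and normalize.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def cross_entropy(logits, target_index):
    # Cross-entropy between the model's predicted distribution softmax(logits)
    # and the one-hot "true" distribution with all mass on target_index.
    return -np.log(softmax(logits)[target_index])

logits = np.array([2.0, 0.5, -1.0])  # raw outputs for a 3-class model
print(cross_entropy(logits, 0))  # ~0.24: low loss, the model favors class 0
print(cross_entropy(logits, 2))  # ~3.24: high loss, class 2 gets little mass
```

Softmax turns the raw logits into a probability distribution; the loss is then just the negative log of the probability the model assigned to the correct class.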
EmmyNoetherRing t1_j522inn wrote
So I'm in a different flavor of data science, which means I've got the basic terminology, but not the specifics. I know what a loss function is and what entropy is. What role does "cross" play here? A cross between what?
EmmyNoetherRing t1_j5253a8 wrote
>Softmax activation function
Ok, got it. Huh (on reviewing Wikipedia). So, to rephrase the quoted paragraph: they find that the divergence between the training and testing distributions (between the compressed versions of the training and testing data sets, in my analogy) starts decreasing smoothly as the scale of the model increases, long before the actual final task performance locks into place successfully.
Hm. That says something more about task complexity (maybe, in some computability sense, a fundamental task complexity that we don't have well defined for those types of tasks yet?) rather than imagination, I think. But I'm still with you on imagination being a factor, and of course the paper and the blog post both leave the cliff problem unsolved. Possibly there's a definition of imagination such that we can say degree X of it is needed to successfully complete those tasks.
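The masking effect they describe is easy to see in a toy example: cross-entropy can keep improving while an exact-match metric sits flat at zero (all numbers below are invented):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

true_class = 1

# "Small" model: nearly uniform logits, target class not on top.
# "Large" model: much more mass on the target, but still not the argmax.
models = {
    "small": np.array([0.10, 0.05, 0.00]),
    "large": np.array([1.00, 0.90, -2.00]),
}

for name, logits in models.items():
    p = softmax(logits)
    loss = -np.log(p[true_class])            # cross-entropy keeps improving
    exact = int(np.argmax(p) == true_class)  # exact match: stuck at 0 for both
    print(f"{name}: cross-entropy={loss:.3f}, exact-match={exact}")
```

The log-likelihood of the right answer improves a lot between the two models, but the downstream metric can't see it until the target finally becomes the argmax.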
gordonisadog t1_j2w9xpf wrote
Didn’t we already learn that with dropout, 10 years ago?