Submitted by starstruckmon t3_1027geh in MachineLearning
Paper : https://arxiv.org/abs/2301.00774
Abstract:
>We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
Taenk t1_j2sc1a2 wrote
So you need 5 RTX 3090s to run BLOOM-176B at home instead of 8.
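One plausible reading of that GPU count, as a back-of-the-envelope sketch (assumptions: int8 weights, 2:4 compressed storage with 2 bits of position metadata per kept weight, 24 GB per RTX 3090, ignoring activations and KV cache):

```python
import math

PARAMS = 176e9      # BLOOM-176B parameter count
GPU_MEM_GB = 24     # RTX 3090 memory

dense_gb = PARAMS * 1 / 1e9               # 1 byte/weight (int8) -> ~176 GB
kept_gb = PARAMS * 0.5 / 1e9              # half the weights survive 2:4 pruning
meta_gb = PARAMS * 0.5 * (2 / 8) / 1e9    # 2 bits of metadata per kept weight
sparse_gb = kept_gb + meta_gb             # ~110 GB

print(math.ceil(dense_gb / GPU_MEM_GB))   # 8 GPUs dense
print(math.ceil(sparse_gb / GPU_MEM_GB))  # 5 GPUs with 2:4 sparsity
```

Real deployments also need room for activations and the KV cache, so treat these numbers as rough weight-storage estimates only.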