Submitted by Secure-Technology-78 t3_10mdhxb in MachineLearning

>Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, leading it to require at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression.
>
>To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approach for model compression is pruning, which removes network elements, from individual weights (unstructured pruning) to higher-granularity components such as entire rows/columns of the weight matrices (structured pruning). This approach has a long history, and has been applied successfully in the case of vision and smaller-scale language models and tasks. Yet, the best-performing pruning methods require extensive retraining of the model to recover from the accuracy loss due to removed elements, which is extremely expensive in the case of GPT-scale models. While some one-shot pruning methods also exist, which compress the model without retraining, they are unfortunately too computationally expensive to be applied to models with billions of parameters. Thus, to date, there is virtually no work on accurate pruning of GPT3-scale models.
>
>**Overview.** In this paper, we propose SparseGPT, the first accurate one-shot pruning method which works efficiently at the scale of models with 10-100 billion parameters. SparseGPT works by reducing the pruning problem to an extremely large-scale instance of sparse regression. It is based on a new approximate sparse regression solver, used to solve a layer-wise compression problem, which is efficient enough to execute in a few hours on the largest openly-available GPT models (175B parameters), using a single GPU. At the same time, SparseGPT is accurate enough to drop negligible accuracy post-pruning, without any fine-tuning. For example, when executed on the largest publicly-available generative language models (OPT-175B and BLOOM-176B), SparseGPT induces 50-60% sparsity in one-shot, with minor accuracy loss, measured either in terms of perplexity or zero-shot accuracy.

Full paper: https://arxiv.org/abs/2301.00774
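For readers wondering what "reducing pruning to layer-wise sparse regression" means concretely, here is a minimal toy sketch of the layer-wise objective: given a layer's weights W and calibration inputs X, find a sparse W_hat that keeps W_hat·X close to W·X. The support selection and the `prune_layer_toy` helper below are illustrative choices, not the paper's actual solver (SparseGPT uses a much faster approximate method based on second-order information).

```python
# Toy illustration of the layer-wise objective: find sparse W_hat such that
# W_hat @ X stays close to W @ X on calibration inputs X. This picks the
# support by weight magnitude and re-fits the kept weights by least squares;
# the actual SparseGPT solver is a much faster approximate method (see paper).
import numpy as np

def prune_layer_toy(W, X, sparsity=0.5):
    """W: (d_out, d_in) layer weights; X: (d_in, n_samples) calibration inputs."""
    d_out, d_in = W.shape
    target = W @ X                          # dense outputs we want to preserve
    W_hat = np.zeros_like(W)
    k = d_in - int(d_in * sparsity)         # weights kept per output row
    for i in range(d_out):
        keep = np.argsort(-np.abs(W[i]))[:k]            # support by magnitude
        # sparse regression with fixed support: re-fit the kept weights so
        # this row still reproduces its dense output on the calibration data
        sol, *_ = np.linalg.lstsq(X[keep].T, target[i], rcond=None)
        W_hat[i, keep] = sol
    return W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
X = rng.normal(size=(64, 256))
W_hat = prune_layer_toy(W, X, sparsity=0.5)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"relative layer-output error at 50% sparsity: {err:.3f}")
```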

210

Comments


CKtalon t1_j62hmsr wrote

Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.

97

data-drone t1_j62n3b9 wrote

How much more training do they need?

14

CKtalon t1_j62n9yw wrote

About 10-12 times more than the tokens seen.

26

maizeq t1_j66b3l5 wrote

Chinchilla (70B) is trained on 1.4 trillion tokens, so 140B would presumably need at least 2.8 trillion (it scales linearly afaik).

I’m not sure a 2.8 trillion token dataset actually exists
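As a quick sanity check on those numbers, the Chinchilla figures imply roughly 20 tokens per parameter. The snippet below is just back-of-the-envelope arithmetic from that rule of thumb, not a reproduction of the paper's scaling-law fits.

```python
# Back-of-the-envelope token budgets using the ~20 tokens/parameter rule of
# thumb implied by the Chinchilla numbers (70B params, 1.4T tokens).
TOKENS_PER_PARAM = 1.4e12 / 70e9        # ~20

for params in (70e9, 140e9, 175e9):
    tokens = params * TOKENS_PER_PARAM
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e12:.1f}T tokens")
```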

3

rainy_moon_bear t1_j676oo9 wrote

This is something people don't seem to understand. Pretty much all models 100B+ are undertrained.

3

Taenk t1_j688cev wrote

> I’m not sure a 2.8 trillion token dataset actually exists

DeepMind's MassiveText is assumed to be about 10TB; the largest publicly available dataset is The Pile, which weighs in at about 820GB.

A 2.8 trillion token dataset would need to be more than 20TB large, which could be possible by including more of Common Crawl - weighing in at 380TiB - or non-English resources. I have a suspicion that training LLMs on more languages, especially outside of the Indo-European family, will improve performance within the Indo-European family.
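For anyone who wants to redo that conversion themselves, here is a rough sketch under assumed bytes-per-token ratios. The ratio depends heavily on the tokenizer and language mix, and deduplication/quality filtering discard a large fraction of a raw crawl, so the raw data required is several times larger than the clean-text figure this prints.

```python
# Rough token-count-to-raw-text conversion. bytes_per_token is an assumption
# (it varies with tokenizer and language mix), and the clean-text figure
# ignores how much of a raw crawl is discarded by deduplication/filtering.
def clean_text_tb(n_tokens, bytes_per_token):
    return n_tokens * bytes_per_token / 1e12

for bpt in (2.0, 3.0, 4.0):             # plausible range for English web text
    tb = clean_text_tb(2.8e12, bpt)
    print(f"2.8T tokens at {bpt:.0f} bytes/token -> ~{tb:.1f} TB of clean text")
```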

2

maizeq t1_j69vuec wrote

Nice. How are you converting between dataset size and number of tokens?

Doesn't Common Crawl get deduplicated, and that's why the number of usable tokens decreases - or is it also curation? How much of that 380TiB is actually utilisable?

Given the ostensibly impressive performance of the bilingual GLM-130B (Chinese+English) model that came out of Tsinghua University, that might very well be the case.

1

lookatmetype t1_j64nstm wrote

To be fair, most of the weights in every "Foundation" model are useless.

3

flashdude64 t1_j65z2q4 wrote

Do you have a citation for this that I could read?

1

nmfisher t1_j62y29r wrote

Slight tangent - has anyone ever tried "fine-tuning" a large speech recognition model (e.g. Whisper) by feeding it a training set and pruning activations? The idea being that only a subset of weights/activations are necessary for a given speaker/dataset, so you can compress a larger model into a smaller one that performs equally well for that subset of data (and then continue training it conventionally). Presumably this would require some degree of sparsity to begin with?
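For concreteness, a minimal sketch of that prune-then-finetune idea using PyTorch's built-in unstructured magnitude pruning; the tiny model and random tensors are stand-ins, nothing Whisper-specific, and whether the pruned subnetwork actually specializes to the speaker would still need to be checked empirically.

```python
# Sketch: one-shot magnitude pruning, then continue training on target data.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# 1) One-shot magnitude pruning of each Linear layer's weights.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)  # 70% sparsity

# 2) Continue training on the target subset; the mask is re-applied each step
#    because "weight" is now a masked reparametrization of "weight_orig".
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x, y = torch.randn(32, 512), torch.randn(32, 128)   # stand-in speaker data
for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

# 3) Bake the zeros into the weight tensors before exporting.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```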

29

_Ruffy_ t1_j635zdc wrote

Good idea in principle, anyone know more about this or any references?

5

anony_sci_guy t1_j63nj0u wrote

This was exactly my first thought too - free up all those extra parameters & re-randomize them. The problem could be that there will be a big gap in distribution between the pre-tuned and re-randomized weights, so you'd want different step sizes for them. I've played with it before & ran into this problem, but got too lazy to actually implement a solution. (I'm actually a biologist, so I don't really have the bandwidth to dig into the ML side as much.)
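A minimal sketch of one way to shrink that distribution gap: re-initialize the pruned entries at a scale matched to the surviving weights. The function name and the matching heuristic are illustrative choices, not something taken from the literature, and in practice the fresh entries might still want their own learning-rate schedule as described above.

```python
# Re-initialize pruned entries with a scale matched to the surviving weights.
import torch

def rerandomize_pruned(weight: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """mask == 1 marks kept (trained) weights, 0 marks pruned ones."""
    kept = weight[mask.bool()]
    # draw fresh values at the empirical scale of the surviving weights;
    # a smaller factor (or the original init scheme) could be used instead
    fresh = torch.randn_like(weight) * kept.std()
    return torch.where(mask.bool(), weight, fresh)
```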

3

starfries t1_j64qhqa wrote

Can you elaborate on this? I'm trying something similar, so I'm curious what your results were and if you ran across any literature about this idea.

2

anony_sci_guy t1_j681trq wrote

Yeah, there is some stuff published out there. It's related to pruning (a link to a ton of papers on it); the lottery ticket method handles this one well, because you're re-training from scratch, just with a "lucky" selection of the initialized weights. Results-wise, I never got anything to improve because of the distributional changes caused by trying to re-randomize a subset in the middle of training. Still saw the same level of performance as without re-randomizing, but that basically just showed that the way I was re-randomizing wasn't helping or hurting b/c those neurons weren't important...

2

starfries t1_j6l0aeq wrote

Thanks for that resource, I've been experimenting with the lottery ticket method but that's a lot of papers I haven't seen! Did you initialize the weights as if training from scratch, or did you do something like trying to match the variance of the old and new weights? I'm intrigued that your method didn't hurt performance - most of the things I've tested were detrimental to the network. I have seen some performance improvements under different conditions but I'm still trying to rule out any confounding factors.

1

anony_sci_guy t1_j6mr4k6 wrote

Glad it helped! The first thing I tried was to re-initialize just like at the beginning of training, but I don't remember how much I dug into modifying it before moving on. That's great you're seeing some improvements though! Would love to hear how the rest of your experiment goes!! =)

2

ApprehensiveNature69 t1_j651pux wrote

Yep! This is a known technique - if you search for it, lots of papers on sparse fine-tuning show up; it's a very valid approach.

2

mycall t1_j643o1d wrote

It's unknown whether this affects emergent abilities as the model scales up. Correct?

6

element8 t1_j64uglo wrote

Is network pruning in this case analogous to discarding specific evidence in favor of more general intuitions, or is that over-anthropomorphizing? How does it affect future training once pruned? Can the pruning mask be applied during training, since the method is operating within a local subset?

3

muchcharles t1_j65b3a6 wrote

DeepMind put out a paper on adjusting the pruning mask during training (by reviving pruned weights if a transiently stored gradient exceeds some threshold).

The paper is called Rigging the Lottery (referencing the lottery ticket hypothesis) and the method is RigL, I think.
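Roughly, the update looks like the sketch below: periodically drop the smallest-magnitude active weights and grow the inactive weights with the largest gradient magnitude, keeping the number of active weights fixed. This is a simplified paraphrase, not the official implementation, which also anneals the drop fraction and initializes newly grown weights to zero.

```python
# Simplified paraphrase of a RigL-style mask update.
import torch

def rigl_mask_update(weight, grad, mask, drop_fraction=0.3):
    n_swap = int(drop_fraction * mask.sum().item())

    # drop: smallest-magnitude weights that are currently active
    drop_scores = torch.where(mask.bool(), weight.abs(),
                              torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_scores.flatten(), n_swap, largest=False).indices

    # grow: largest-gradient weights that are currently inactive
    grow_scores = torch.where(mask.bool(), torch.full_like(grad, -float("inf")),
                              grad.abs())
    grow_idx = torch.topk(grow_scores.flatten(), n_swap, largest=True).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)
```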

4

r2m2 t1_j64uah5 wrote

Isn’t this a (somewhat) well-known “free lunch” effect w/ naive one-shot magnitude pruning? I feel like this is a folklore fact for many models like ResNet/VGG (& a paper from a few years back validated the same for BERT)

2

Sylv__ t1_j65ib3y wrote

Already posted a few weeks ago, thank you for your low-effort post that just links to arXiv.

−12

Secure-Technology-78 OP t1_j65ifpn wrote

Awwww i’m sorry baby, i promise i’ll work very very hard on my next post for you!

7