Submitted by starstruckmon t3_1027geh in MachineLearning
Paper : https://arxiv.org/abs/2301.00774
Abstract:
>We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
Taenk t1_j2sc1a2 wrote
So you need 5 RTX 3090s to run BLOOM-176B at home instead of 8.
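One plausible reading of that GPU count, as a back-of-the-envelope sketch (assumptions: int8 weights, 2:4 compressed storage with 2 bits of position metadata per kept weight, 24 GB per RTX 3090, ignoring activations and KV cache):

```python
import math

PARAMS = 176e9      # BLOOM-176B parameter count
GPU_MEM_GB = 24     # RTX 3090 memory

dense_gb = PARAMS * 1 / 1e9               # 1 byte/weight (int8) -> ~176 GB
kept_gb = PARAMS * 0.5 / 1e9              # half the weights survive 2:4 pruning
meta_gb = PARAMS * 0.5 * (2 / 8) / 1e9    # 2 bits of metadata per kept weight
sparse_gb = kept_gb + meta_gb             # ~110 GB

print(math.ceil(dense_gb / GPU_MEM_GB))   # 8 GPUs dense
print(math.ceil(sparse_gb / GPU_MEM_GB))  # 5 GPUs with 2:4 sparsity
```

Real deployments also need room for activations and the KV cache, so treat these numbers as rough weight-storage estimates only.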