Submitted by Secure-Technology-78 t3_10mdhxb in MachineLearning
>Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, requiring at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression.
>
>To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approach is pruning, which removes network elements, ranging from individual weights (unstructured pruning) to higher-granularity components such as entire rows/columns of the weight matrices (structured pruning). Pruning has a long history and has been applied successfully to vision models and smaller-scale language models and tasks. Yet the best-performing pruning methods require extensive retraining to recover from the accuracy loss caused by the removed elements, which is extremely expensive for GPT-scale models. One-shot pruning methods, which compress the model without retraining, do exist, but they are too computationally expensive to be applied to models with billions of parameters. Thus, to date, there is virtually no work on accurate pruning of GPT3-scale models.
>
>**Overview.** In this paper, we propose SparseGPT, the first accurate one-shot pruning method that works efficiently at the scale of models with 10-100 billion parameters. SparseGPT works by reducing the pruning problem to an extremely large-scale instance of sparse regression. It is based on a new approximate sparse regression solver, used to solve a layer-wise compression problem, which is efficient enough to execute in a few hours on the largest openly available GPT models (175B parameters) using a single GPU. At the same time, SparseGPT is accurate enough that it loses only negligible accuracy post-pruning, without any fine-tuning. For example, when executed on the largest publicly available generative language models (OPT-175B and BLOOM-176B), SparseGPT induces 50-60% sparsity in one shot, with minor accuracy loss, measured in terms of either perplexity or zero-shot accuracy.
Full paper: https://arxiv.org/abs/2301.00774
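As a quick sanity check on the abstract's memory figures, here is the arithmetic behind "at least 320GB" and "at least five A100 GPUs" (a minimal sketch; assumes FP16 at 2 bytes per parameter and counts GiB as the abstract does):

```python
import math

# Back-of-the-envelope check of the abstract's numbers (FP16 = 2 bytes/param).
params = 175e9
gib = params * 2 / 1024**3                 # storage "counting multiples of 1024"
print(f"weights alone: {gib:.0f} GiB")     # ~326 GiB, i.e. "at least 320GB"
print(f"A100-80GB GPUs: {math.ceil(gib / 80)}")  # 5, before activations/overhead
```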
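To make the "layer-wise compression problem" concrete: for each layer with weights W and calibration inputs X, the goal is to find a sparse W_hat that keeps the reconstruction error ||WX - W_hat X||_F^2 small. The toy sketch below uses plain one-shot magnitude pruning (which ignores X entirely) just to illustrate the setting and the objective; it is not the paper's solver, which is a far more sophisticated approximate sparse-regression algorithm, and all dimensions here are made up:

```python
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """One-shot unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(W.size * sparsity)
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    return W * (np.abs(W) >= threshold)

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))    # toy layer weights
X = rng.normal(size=(512, 128))    # toy calibration inputs
W_hat = magnitude_prune(W, sparsity=0.5)

# The layer-wise objective SparseGPT approximately minimizes:
err = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"sparsity: {(W_hat == 0).mean():.0%}, reconstruction error: {err:.1f}")
```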
CKtalon t1_j62hmsr wrote
Before people get their hopes up, BLOOM and OPT are known to be seriously undertrained (not Chinchilla-optimal, BLOOM more so than OPT), so it’s possible that most of the weights were useless to begin with. The results of this paper seem to imply that.
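For context on the undertraining claim, here is a rough check using the ~20-tokens-per-parameter rule of thumb from the Chinchilla paper; the token counts are approximate figures from the respective OPT and BLOOM papers, so treat the exact percentages loosely:

```python
# Rough Chinchilla-style check (~20 training tokens per parameter is the
# commonly cited compute-optimal heuristic; token counts are approximate).
TOKENS_PER_PARAM = 20
models = {"OPT-175B": (175e9, 180e9), "BLOOM-176B": (176e9, 366e9)}

for name, (params, tokens_seen) in models.items():
    optimal = params * TOKENS_PER_PARAM   # ~3.5T tokens for a 175B model
    print(f"{name}: ~{tokens_seen / optimal:.0%} of a compute-optimal token budget")
```

Both models land at a small fraction of a compute-optimal token budget, which is consistent with the "seriously undertrained" characterization.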