Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
shanereid1 t1_jdlt38a wrote
Have you read about the lottery ticket hypothesis? It was a paper from a few years ago showing that within a fully connected neural network there exists a smaller subnetwork that can perform just as well, even when that subnetwork is as little as a few % of the size of the original network. AFAIK they only showed this for MLPs and CNNs. It's almost certain that the power of these LLMs can be distilled in some fashion without significantly degrading performance.
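A rough sketch of what a "ticket" looks like in PyTorch, assuming simple one-shot magnitude pruning; the helper name and the 5% figure here are just illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

# A "ticket" is just a binary mask over the dense weights that keeps only the
# largest-magnitude entries; everything else is zeroed out.
def magnitude_mask(weight: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    k = max(1, int(weight.numel() * keep_fraction))
    # k-th largest |w| is the (numel - k + 1)-th smallest |w|
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

layer = nn.Linear(512, 512)
mask = magnitude_mask(layer.weight.data, keep_fraction=0.05)  # keep ~5% of weights
layer.weight.data.mul_(mask)  # the surviving 5% is the candidate subnetwork
```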
tdgros t1_jdlxy8a wrote
There are versions for NLP (and a special one for vision transformers); here is the BERT one, from some of the same authors (Frankle & Carbin): https://proceedings.neurips.cc/paper/2020/file/b6af2c9703f203a2794be03d443af2e3-Paper.pdf
It is still costly, since it involves rewinding and finding masks, and we probably need to switch to dedicated sparse computations to fully benefit from it.
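Roughly, the rewind-and-mask loop (iterative magnitude pruning) looks like the sketch below; `train`, the round count, and the per-round keep rate are placeholders rather than the paper's exact recipe:

```python
import copy
import torch

def find_ticket(model, train, rounds=5, keep_per_round=0.5):
    rewind_state = copy.deepcopy(model.state_dict())   # weights to rewind to
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, mask)                              # placeholder: train with the mask applied
        for name, param in model.named_parameters():
            surviving = param.abs()[mask[name].bool()]
            if surviving.numel() < 2:
                continue
            k = max(1, int(surviving.numel() * keep_per_round))
            # drop the smallest surviving weights in each tensor
            thresh = surviving.kthvalue(surviving.numel() - k + 1).values
            mask[name] *= (param.abs() >= thresh).float()
        model.load_state_dict(rewind_state)             # rewind weights, keep only the mask
    return mask
```

Each round is a full training run, which is where the cost comes from, and the resulting mask is unstructured, so dense kernels don't get any faster without sparse compute support.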
wrossmorrow t1_jdmsbvf wrote
Probably related https://arxiv.org/abs/2106.09685
fiftyfourseventeen t1_jdngwum wrote
Eh... not really. That's training a low-rank update to the model's weights, not actually making the model smaller.
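For context, the linked paper is LoRA. A minimal sketch of the idea (the rank and scaling values are just illustrative): the pretrained weight stays frozen at full size and only a small low-rank update is trained on top of it, so nothing about the model shrinks.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight, full size
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero: no update initially
        self.scale = alpha / rank

    def forward(self, x):
        # full-size base output plus the trainable low-rank correction B @ A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```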
Wilfred86 t1_jdot7gb wrote
Is this like pruning in the brain?
andreichiffa t1_jdvojfg wrote
It's a common result from the flat-minima literature: to train well, the model needs to be overparametrized, which smooths the loss landscape and helps it avoid getting stuck in bad local minima.
However, the overparameterization needed at the training stage can be trimmed away at the inference stage.
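As a small illustration of trimming after training, assuming plain unstructured magnitude pruning via torch.nn.utils.prune (the 80% rate is arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)          # stands in for a trained, overparameterized layer
prune.l1_unstructured(layer, name="weight", amount=0.8)  # zero the smallest 80% of weights
prune.remove(layer, "weight")          # bake the mask into the weight tensor for inference
```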