Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
shanereid1 t1_jdlt38a wrote
Have you read about the lottery ticket hypothesis? It was a paper from a few years ago showing that within a fully connected neural network there exists a smaller subnetwork that can perform just as well, even when that subnetwork is as little as a few % of the size of the original network. AFAIK they only showed this for MLPs and CNNs. It's almost certain that the power of these LLMs can be distilled in some fashion without significantly degrading performance.
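A rough sketch of what a "ticket" looks like in PyTorch, assuming simple one-shot magnitude pruning; the helper name and the 5% figure here are just illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

# A "ticket" is just a binary mask over the dense weights that keeps only the
# largest-magnitude entries; everything else is zeroed out.
def magnitude_mask(weight: torch.Tensor, keep_fraction: float) -> torch.Tensor:
    k = max(1, int(weight.numel() * keep_fraction))
    # k-th largest |w| is the (numel - k + 1)-th smallest |w|
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

layer = nn.Linear(512, 512)
mask = magnitude_mask(layer.weight.data, keep_fraction=0.05)  # keep ~5% of weights
layer.weight.data.mul_(mask)  # the surviving 5% is the candidate subnetwork
```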
tdgros t1_jdlxy8a wrote
There are versions for NLP (and a special one for vision transformers); here is the BERT one, from some of the same authors (Frankle & Carbin): https://proceedings.neurips.cc/paper/2020/file/b6af2c9703f203a2794be03d443af2e3-Paper.pdf
It is still costly, since it involves rewinding and finding masks, and we probably need to switch to dedicated sparse computations to fully benefit from it.
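Roughly, the rewind-and-mask loop (iterative magnitude pruning) looks like the sketch below; `train`, the round count, and the per-round keep rate are placeholders rather than the paper's exact recipe:

```python
import copy
import torch

def find_ticket(model, train, rounds=5, keep_per_round=0.5):
    rewind_state = copy.deepcopy(model.state_dict())   # weights to rewind to
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, mask)                              # placeholder: train with the mask applied
        for name, param in model.named_parameters():
            surviving = param.abs()[mask[name].bool()]
            if surviving.numel() < 2:
                continue
            k = max(1, int(surviving.numel() * keep_per_round))
            # drop the smallest surviving weights in each tensor
            thresh = surviving.kthvalue(surviving.numel() - k + 1).values
            mask[name] *= (param.abs() >= thresh).float()
        model.load_state_dict(rewind_state)             # rewind weights, keep only the mask
    return mask
```

Each round is a full training run, which is where the cost comes from, and the resulting mask is unstructured, so dense kernels don't get any faster without sparse compute support.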
wrossmorrow t1_jdmsbvf wrote
Probably related https://arxiv.org/abs/2106.09685
fiftyfourseventeen t1_jdngwum wrote
Eh... not really. That's training a low-rank update to the model's weights, not actually making the model smaller.
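For context, the linked paper is LoRA. A minimal sketch of the idea (the rank and scaling values are just illustrative): the pretrained weight stays frozen at full size and only a small low-rank update is trained on top of it, so nothing about the model shrinks.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight, full size
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero: no update initially
        self.scale = alpha / rank

    def forward(self, x):
        # full-size base output plus the trainable low-rank correction B @ A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```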
Wilfred86 t1_jdot7gb wrote
Is this like pruning in the brain?
andreichiffa t1_jdvojfg wrote
It's a common result from the flat-minima literature: to train well, the model needs to be overparametrized, which smooths the loss landscape and helps it avoid getting stuck in bad local minima.
However, the overparameterization needed at the training stage can be trimmed away at the inference stage.
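As a small illustration of trimming after training, assuming plain unstructured magnitude pruning via torch.nn.utils.prune (the 80% rate is arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)          # stands in for a trained, overparameterized layer
prune.l1_unstructured(layer, name="weight", amount=0.8)  # zero the smallest 80% of weights
prune.remove(layer, "weight")          # bake the mask into the weight tensor for inference
```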