
adfoucart t1_jdq3jy5 wrote

The parameters don't store the training data. They store a mapping between inputs (for LLMs: sequences of words) and predicted outputs (the next word in the sequence). If there is not a lot of training data, this mapping may let the model recall specific data points from the training set (e.g. if you start a sentence from the dataset, it will predict the rest). But that's not the desired behaviour; such a model is said to "overfit" the data.
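
To illustrate what that memorization looks like, here's a deliberately tiny toy sketch (a lookup-table "model" with a made-up two-sentence corpus, nothing like a real transformer): with so little data, prompting it with the start of a training sentence just reproduces the rest verbatim.

```python
from collections import defaultdict

# Toy "next-word model": maps a two-word context to the words that followed it
# in the (tiny, purely illustrative) training corpus.
corpus = [
    "the cat sat on the mat",
    "the dog slept on the rug",
]

next_word = defaultdict(list)
for sentence in corpus:
    w = sentence.split()
    for a, b, c in zip(w, w[1:], w[2:]):
        next_word[(a, b)].append(c)

context = ("the", "cat")          # prompt taken straight from the training set
output = list(context)
while context in next_word:
    nxt = next_word[context][0]   # always the first continuation seen
    output.append(nxt)
    context = (context[1], nxt)

print(" ".join(output))           # -> "the cat sat on the mat" (memorized)
```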

If there is enough data, then the mapping no longer "recalls" any particular data point. It instead encodes relationships between patterns in the inputs and in the outputs. But those relationships "summarize" many data points.

So, for instance, when an LLM completes "Napoléon was born on" with "August 15, 1769", it's not recalling one specific piece of information, but using a pattern detected across the many training inputs that put those sequences of words (or similar ones) together.

So it's not really accurate to talk about "compression" here. Or, rather, LLMs compress text in the same sense that linear regression "compresses" the information of a point cloud...
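
To make that analogy concrete, here's a minimal sketch (plain NumPy, with made-up numbers): fitting a line reduces a thousand noisy points to two parameters that capture the relationship, but you can't recover the individual points from them.

```python
import numpy as np

# A "point cloud": many (x, y) samples from a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=1000)

# Fitting a line "summarizes" all 1000 points with just two parameters.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# Those two numbers let you predict y for any x, but the original points
# are gone -- the fit keeps the relationship, not the data.
```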
