
adfoucart t1_jdq3jy5 wrote

The parameters don't store the training data. They store a mapping between inputs (for LLMs: sequences of words) and predicted outputs (the next word in the sequence). If there isn't a lot of training data, this mapping may allow you to recall specific data points from the training set (e.g. if you start a sentence from the data set, it will predict the rest). But that's not the desired behaviour (such a model is said to "overfit" the data).
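To make the overfitting point concrete, here's a toy sketch in Python (my illustration, standard library only): a bigram "model" trained on a single sentence. With so little data, starting a sentence from the training set makes it predict the rest verbatim.

```python
from collections import defaultdict

# Tiny "training set": one sentence.
corpus = "napoleon was born on august 15 1769 in corsica".split()

# "Training": record which word follows each word.
next_word = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev].append(nxt)

# "Inference": greedily append the most common successor.
def complete(prompt, steps=10):
    words = prompt.split()
    for _ in range(steps):
        candidates = next_word.get(words[-1])
        if not candidates:
            break
        words.append(max(set(candidates), key=candidates.count))
    return " ".join(words)

print(complete("napoleon was born"))
# -> "napoleon was born on august 15 1769 in corsica"
# The mapping has memorized ("overfit") its training sentence.
```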

If there is enough data, then the mapping no longer "recalls" any particular data point. It instead encodes relationships between patterns in the inputs and in the outputs. But those relationships "summarize" many data points.

So, for instance, when an LLM completes "Napoléon was born on" with "August 15, 1769", it's not recalling one specific piece of information; it's using a pattern detected from the many training inputs that put those sequences of words (or similar sequences) together.

So it's not really accurate to talk about "compression" here. Or, rather, LLMs compress text in the same sense that linear regression "compresses" the information in a point cloud...
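To illustrate that analogy (my sketch, assuming numpy): a thousand noisy points get "compressed" into just two fitted parameters. You can regenerate the overall trend from them, but not the individual points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=1000)  # 1000 noisy (x, y) points

slope, intercept = np.polyfit(x, y, deg=1)            # fit y ≈ a*x + b

# Raw data: 2000 float64 values. "Model": 2 float64 values.
print(f"data: {x.nbytes + y.nbytes} bytes, model: 16 bytes")
print(f"fit: y ≈ {slope:.2f}x + {intercept:.2f}")     # close to 3.00x + 2.00
```

The fit summarizes the cloud very compactly, but it's lossy: the residuals (the individual points) are gone.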


samyall OP t1_jdqacnn wrote

I really like your last point there. That is a good analogy.

I guess my question boils down to "how to think about information in a trained model". What I am wondering is whether a model can carry more information than its raw size. I think it may be able to, conceptually, since the relationships between neurons carry information but aren't reflected in the file size of a model.

So, just as a regression represents a point cloud, could we vectorise a book or a movie (if that was what we wanted)?
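Something like that already exists in a crude form. Here's a toy sketch (my example, standard library only) of the hashing trick, which maps any text, however long, to a fixed-size vector. Like the regression, it's lossy: you can compare books with it, but not reconstruct them.

```python
import hashlib

def vectorise(text, dim=256):
    """Map arbitrary-length text to a fixed-size bag-of-words vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0  # each word increments one hashed bucket
    return vec

a = vectorise("call me ishmael some years ago never mind how long precisely")
b = vectorise("it was the best of times it was the worst of times")

# Cosine similarity between the two fixed-size representations.
dot = sum(x * y for x, y in zip(a, b))
norm = lambda v: sum(x * x for x in v) ** 0.5
print(dot / (norm(a) * norm(b)))
```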
