
norbertus t1_j9me8dv wrote

A lot of these models are under-trained

https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training

and seem to be performing a kind of "lossy" text compression, where their ability to memorize training data is poorly understood and appears to use only a fraction of the information-theoretic capacity of the model architecture

https://arxiv.org/pdf/1802.08232.pdf
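
That second paper measures memorization by planting random "canary" strings in the training data and computing an "exposure" score: how highly the trained model ranks the true canary against every other candidate from the same randomness space. Here's a minimal sketch of that calculation (the variable names and toy losses are placeholders of mine, not the paper's code):

```python
import math

def exposure(canary_loss: float, candidate_losses: list[float]) -> float:
    """Exposure metric (Carlini et al., 2019): log2 of the candidate-space
    size minus log2 of the canary's rank by model loss. Higher exposure
    means the planted canary was memorized more strongly."""
    # Rank 1 = the model gives the canary a lower loss than every other candidate
    rank = 1 + sum(1 for loss in candidate_losses if loss < canary_loss)
    return math.log2(len(candidate_losses)) - math.log2(rank)

# Toy example: 2**9 = 512 candidate canaries scored by a hypothetical model;
# the true canary gets the 3rd-lowest loss.
losses = [float(i) for i in range(512)]   # stand-in for per-candidate model losses
print(exposure(canary_loss=2.0, candidate_losses=losses))  # 9 - log2(3) ≈ 7.4
```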

Also, as the first citation above indicates, the quality of a large language model turns out to be determined more by the size and quality of its training set than by the size of the model itself.
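
To make "under-trained" concrete: the DeepMind (Chinchilla) paper's result works out to roughly 20 training tokens per parameter for compute-optimal training, far more data than most earlier large models saw. A rough back-of-the-envelope check, using the commonly cited GPT-3 figures (~175B parameters, ~300B tokens) and the ~20x ratio as an approximation:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training-set size for a given model size."""
    return TOKENS_PER_PARAM * n_params

# GPT-3-scale example: ~175B parameters trained on ~300B tokens.
n_params = 175e9
tokens_seen = 300e9
tokens_wanted = optimal_tokens(n_params)  # ~3.5 trillion tokens
print(f"trained on {tokens_seen / tokens_wanted:.0%} of the compute-optimal token count")
# -> roughly 9%, i.e. heavily under-trained by the Chinchilla criterion
```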
