Taenk t1_j688cev wrote on January 28, 2023 at 1:00 PM

Reply to comment by maizeq in [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78

> I’m not sure a 2.8 trillion token dataset actually exists

DeepMind's MassiveText is assumed to be about 10TB; the largest publicly available dataset is The Pile, which weighs in at about 820GB.

A 2.8 trillion token dataset would need to be more than 20TB, which could be possible by including more of Common Crawl - weighing in at 380TiB - or non-English resources. I have a suspicion that training LLMs on more languages, especially ones outside of the Indo-European family, will improve performance within the Indo-European family.
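
For anyone who wants to redo the arithmetic, here is a rough back-of-the-envelope sketch. The only inputs are the figures quoted above. The bytes-per-token ratio is not a measured value: it is simply what "20TB for 2.8 trillion tokens" implies, and the helper names are made up for illustration. Real ratios depend on the tokenizer, the language mix, and how much gets filtered out.

```python
# Rough size <-> token-count conversion using only the figures from this comment.
# Assumption: raw text size scales linearly with token count.

TB = 10**12   # decimal terabyte
GB = 10**9    # decimal gigabyte
TIB = 2**40   # binary tebibyte


def implied_bytes_per_token(size_bytes: float, n_tokens: float) -> float:
    """Average bytes of raw text per token implied by a size/token pair."""
    return size_bytes / n_tokens


def estimated_tokens(size_bytes: float, bytes_per_token: float) -> float:
    """Token count for a dataset of the given size at a fixed ratio."""
    return size_bytes / bytes_per_token


# Ratio implied by "2.8 trillion tokens needs about 20TB" (an assumption).
ratio = implied_bytes_per_token(20 * TB, 2.8e12)

# Apply the same ratio to the other sizes mentioned above.
pile_tokens = estimated_tokens(820 * GB, ratio)
common_crawl_tokens = estimated_tokens(380 * TIB, ratio)

print(f"implied ratio: {ratio:.2f} bytes/token")
print(f"The Pile at that ratio: {pile_tokens:.3g} tokens")
print(f"Raw Common Crawl at that ratio: {common_crawl_tokens:.3g} tokens")
```

The Common Crawl number is for the raw 380TiB with no deduplication or filtering, so treat it as an upper bound at best.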

maizeq t1_j69vuec wrote on January 28, 2023 at 8:18 PM

Nice. How are you converting between dataset size and number of tokens?

Doesn’t Common Crawl get deduplicated, and isn’t that why the number of usable tokens decreases, or is it also curation? How much of that 380TiB is actually usable?

Given the ostensibly impressive performance of the bilingual GLM-130B (Chinese+English) model that came out of Tsinghua University, your suspicion about multilingual training might very well be right.