
SlowThePath t1_je2buak wrote

No models are trained on internet-sized corpora. That would take an infinite amount of time, I would think.


antonivs t1_je7ws1v wrote

I was referring to what the OpenAI GPT models are trained on. For GPT-3, that involved about 45 TB of text data, part of which was Common Crawl, a multi-petabyte corpus obtained from 8 years of web crawling.

On top of that, 16% of its corpus was books, totaling about 67 billion tokens.
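
For a rough sense of scale, here's a quick back-of-envelope in Python (the ~4 bytes of English text per token is just a common rule of thumb I'm assuming, not a figure from OpenAI):

    # Back-of-envelope only; bytes_per_token is a rough assumption,
    # not something taken from the GPT-3 paper.
    raw_text_bytes = 45e12      # the ~45 TB of raw text mentioned above
    bytes_per_token = 4         # assumed average for English BPE tokens

    raw_tokens = raw_text_bytes / bytes_per_token
    print(f"~{raw_tokens / 1e12:.1f} trillion tokens of raw text before filtering")

    book_tokens = 67e9          # the ~67 billion book tokens mentioned above
    book_share = 0.16           # books' share of the training mix
    implied_corpus = book_tokens / book_share
    print(f"Implied training mix at that weighting: ~{implied_corpus / 1e9:.0f} billion tokens")

Taking those figures at face value, the raw crawl is on the order of ten trillion tokens, while the curated training mix works out to a few hundred billion.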


SlowThePath t1_je7xmaz wrote

Definitely not denying that it was trained on a massive amount of data, because it was, but calling it internet-sized is not accurate. I guess you were speaking in hyperbole and I just didn't read it that way. I know what you mean.
