MegavirusOfDoom t1_j4oelbd wrote

Less than 500MB is used for learning code, while 690GB is used for culture, geography, history, fiction, and non-fiction... 2GB for cats, 2GB for bread, plus horses, dogs, cheese, wine, Italy, France, politics, television, music, Japan, Africa. Less than 1% of the training is on science and technology, i.e. 300MB biology, 200MB chemistry, 100MB physics, 400MB maths...
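As a quick sanity check on that "less than 1%" claim, here's a throwaway Python sketch; all sizes are the rough estimates above, not measured figures:

```python
# Back-of-the-envelope check of the proportions claimed above
# (all sizes are the commenter's rough estimates, not measured data).
sizes_gb = {
    "code": 0.5,
    "culture/geography/history/fiction": 690,
    "everyday topics (cats, bread, wine, ...)": 4,
    "biology": 0.3,
    "chemistry": 0.2,
    "physics": 0.1,
    "maths": 0.4,
}

total = sum(sizes_gb.values())
science = sum(sizes_gb[k] for k in ("biology", "chemistry", "physics", "maths"))
print(f"total ≈ {total:.1f} GB, science+maths ≈ {science:.1f} GB "
      f"({100 * science / total:.2f}% of the corpus)")
# -> science+maths ≈ 1.0 GB, about 0.14% of ~695 GB total
```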

2

yahma t1_j4owot0 wrote

This may be the size of the datasets, but it's hard to say how many parameters would be needed for an LLM that's just really good at explaining code.

5

MegavirusOfDoom t1_j4pfdi1 wrote

Then we'd have to crawl all of Stack Exchange, all of Wikipedia, and a terabyte of programming books... This "generalist NLP" is for article writing, for poetry.

I'm a big fan of teaching ChatGPT how to interpret graphs and their origin lines, recording them in a vector engine coupled with the NLP. For a coding engine, I believe the NLP should be paired with a compiler, just as a maths-specialized NLP should also have a MATLAB-type engine.
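A minimal sketch of what "paired with a compiler" could look like, assuming a hypothetical `generate_code()` stand-in for the model and using Python's built-in `compile()` in place of a real compiler front end:

```python
# Compiler-in-the-loop sketch. generate_code() is a hypothetical stand-in
# for an LLM call; compile() is Python's built-in syntax checker, standing
# in for a real compiler front end.
def generate_code(prompt: str, feedback: str = "") -> str:
    raise NotImplementedError("replace with an actual model call")

def generate_with_compiler_loop(prompt: str, max_attempts: int = 3) -> str:
    feedback = ""
    source = ""
    for _ in range(max_attempts):
        source = generate_code(prompt, feedback)
        try:
            compile(source, "<generated>", "exec")  # syntax check only
            return source  # compiles cleanly; hand it back
        except SyntaxError as err:
            # Feed the compiler error back to the model and retry.
            feedback = f"line {err.lineno}: {err.msg}"
    return source  # best effort after max_attempts
```

The same loop generalizes to shelling out to gcc or rustc via subprocess and feeding stderr back to the model as the retry feedback.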

2