MegavirusOfDoom t1_j4oelbd wrote

Less than 500MB is used for learning code, while 690GB is used for culture, geography, history, fiction, and non-fiction... 2GB for cats, 2GB for bread, plus horses, dogs, cheese, wine, Italy, France, politics, television, music, Japan, Africa. Less than 1% of the training is on science and technology, i.e. 300MB biology, 200MB chemistry, 100MB physics, 400MB maths...
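As a quick sanity check on that "less than 1%" claim, here's a throwaway Python sketch; all sizes are the rough estimates above, not measured figures:

```python
# Back-of-the-envelope check of the proportions claimed above
# (all sizes are the commenter's rough estimates, not measured data).
sizes_gb = {
    "code": 0.5,
    "culture/geography/history/fiction": 690,
    "everyday topics (cats, bread, wine, ...)": 4,
    "biology": 0.3,
    "chemistry": 0.2,
    "physics": 0.1,
    "maths": 0.4,
}

total = sum(sizes_gb.values())
science = sum(sizes_gb[k] for k in ("biology", "chemistry", "physics", "maths"))
print(f"total ≈ {total:.1f} GB, science+maths ≈ {science:.1f} GB "
      f"({100 * science / total:.2f}% of the corpus)")
# -> science+maths ≈ 1.0 GB, about 0.14% of ~695 GB total
```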

2

yahma t1_j4owot0 wrote

This may be the size of the datasets, but it's hard to say how many parameters would be needed for an LLM that's just really good at explaining code.

5

MegavirusOfDoom t1_j4pfdi1 wrote

Then we'd have to crawl all of Stack Exchange, all of Wikipedia, and a terabyte of programming books... This "generalist NLP" is for article writing, for poetry.

I'm a big fan of teaching ChatGPT how to interpret graphs and their origin lines, recording them in a vector engine coupled with the NLP. For a coding engine, I believe the NLP should be paired with a compiler, just as a maths-specialized NLP should also have a MATLAB-type engine.
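A minimal sketch of what "paired with a compiler" could look like, assuming a hypothetical `generate_code()` stand-in for the model and using Python's built-in `compile()` in place of a real compiler front end:

```python
# Compiler-in-the-loop sketch. generate_code() is a hypothetical stand-in
# for an LLM call; compile() is Python's built-in syntax checker, standing
# in for a real compiler front end.
def generate_code(prompt: str, feedback: str = "") -> str:
    raise NotImplementedError("replace with an actual model call")

def generate_with_compiler_loop(prompt: str, max_attempts: int = 3) -> str:
    feedback = ""
    source = ""
    for _ in range(max_attempts):
        source = generate_code(prompt, feedback)
        try:
            compile(source, "<generated>", "exec")  # syntax check only
            return source  # compiles cleanly; hand it back
        except SyntaxError as err:
            # Feed the compiler error back to the model and retry.
            feedback = f"line {err.lineno}: {err.msg}"
    return source  # best effort after max_attempts
```

The same loop generalizes to shelling out to gcc or rustc via subprocess and feeding stderr back to the model as the retry feedback.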

2