
shoegraze t1_irwlpoz wrote

But surely with a dataset as large as the Pile and enough weights, the model will learn at least decently well how to interpret misspellings and abbreviations. If anything, wouldn't this data "issue" help improve an LLM's robustness? I'm not sure I see the problem in the context of LLMs, though to be fair I agree with you if you're trying to train a small model on a small amount of context-specific text data (but then you shouldn't be using the Pile, should you?)
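(For what it's worth, this kind of robustness is easy to probe with synthetic noise. Here's a minimal Python sketch, my own rather than either commenter's, that injects character-level typos as an augmentation step; the function name and the 5% rate are made-up placeholders.)

```python
import random

def add_typos(text, rate=0.05, seed=None):
    """Randomly drop, transpose, or duplicate letters to mimic real typos."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["drop", "swap", "dup"])
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose with the next character
                out.append(c)
                i += 1
            elif op == "dup":
                out.append(c)
                out.append(c)  # duplicate the letter
            # "drop": append nothing, so the letter is deleted
        else:
            out.append(c)
        i += 1
    return "".join(out)

print(add_typos("large language models handle misspellings", seed=0))
```

Feeding corrupted variants like these alongside the clean text is one standard way to test (or train for) exactly the robustness being argued about here.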


freezelikeastatue t1_irx3brk wrote

Yeah, so this gets philosophical and theoretical pretty quickly. Also, interpretation of data is unique to every individual. I did constrain my remarks to my own purposes, which, admittedly, don't require such large models; I can achieve similar if not better results with a smaller, more narrowly defined model.

I also haven't created a curated dataset on the level of CLIP, OpenAI's, or OPT's. I tried scaling my data by applying a text generator to each data field I had, replicating faux variables to grow the dataset toward roughly 1/1000th of GPT-3's parameter count, but I got noise in return.
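(To make that failure mode concrete, here's a minimal sketch of generator-based expansion of that kind. It's my own illustration, not the commenter's actual pipeline; the gpt2 model choice, the helper name, and the sampling parameters are all assumptions.)

```python
# Naive synthetic expansion: prompt a small text generator with each
# original record and keep its continuations as "new" samples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def expand_record(record, n_variants=5):
    """Generate n_variants continuations of one record as faux samples."""
    outputs = generator(
        record,
        num_return_sequences=n_variants,
        max_new_tokens=40,
        do_sample=True,  # sampling is required for distinct sequences
    )
    return [o["generated_text"] for o in outputs]

faux_samples = expand_record("patient reported mild headache after dosage")
```

Because every variant is conditioned on the same seed text, this adds volume without adding much new information, which is one plausible way to end up with the noise described above.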

My takeaway is that a model's viability depends wholly on the unique properties and guaranteed individuality of each variable. I can say I've hit higher benchmarks in few-shot and zero-shot settings, the highest being 89.2% few-shot, but that was on a very specialized dataset.
