Submitted by hapliniste t3_10g5r52 in MachineLearning
wind_dude t1_j50x6ad wrote
Yea, unless they master continual learning, the models will get stale quickly, or they'll need to rely on iterative retraining, which is very expensive and slow. I don't see hardware catching up soon.
I think you'll still need to run a fairly sophisticated LLM as the base model for a query-based architecture. But you can probably reduce the cost of running it by distilling it and by curating the input data. I actually don't think there has been a ton of research on curating the input data before training (OpenAI did something similar by curating responses in ChatGPT with RLHF, so it's a similar concept), although concerns/critiques may arise over what counts as junk, which may be why it hasn't been looked at in depth before. I believe SD did this in the latest checkpoint by removing anything "pornographic", which is over-censorship.
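For the distillation part of that, here's a minimal sketch of what "distilling the base model" usually looks like in practice: train a smaller student on the larger model's temperature-softened logits plus the normal cross-entropy loss. The temperature, loss weighting, and function name are illustrative assumptions, not something from the comment above.

```python
# Hedged sketch: distill a large "base" LM into a smaller student.
# T (temperature) and alpha (loss mix) are assumed hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with standard cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary next-token cross-entropy against the real labels.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```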
You could take something like CC, which makes up a fairly large portion of the training data, and run it through a classifier to remove junk before training. Even within CC text, a lot of it is probably landing-type pages or paywall-blocked messaging. To my knowledge the percentage of CC made up of these hasn't even been measured, let alone trimmed from the training datasets used.
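Roughly what that pre-training filter could look like, as a sketch: cheap heuristics (and optionally a trained classifier on top) that drop landing pages, paywall notices, and other junk before anything reaches the tokenizer. The phrase list and thresholds here are illustrative assumptions, not measured values.

```python
# Hedged sketch: heuristic junk filter for Common Crawl-style text.
import re

# Assumed phrases typical of paywall / landing-page boilerplate.
PAYWALL_OR_LANDING = re.compile(
    r"subscribe to continue|please enable javascript|accept (all )?cookies|"
    r"create a free account|404 not found|page not found",
    re.IGNORECASE,
)

def looks_like_junk(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.3) -> bool:
    words = doc.split()
    if len(words) < min_words:                          # too short to be real prose
        return True
    if PAYWALL_OR_LANDING.search(doc):                  # paywall / landing-page messaging
        return True
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    if symbols / max(len(doc), 1) > max_symbol_ratio:   # likely markup or encoding debris
        return True
    return False

def filter_corpus(docs):
    """Yield only documents that pass the junk heuristics."""
    for doc in docs:
        if not looks_like_junk(doc):
            yield doc
```

In practice you'd probably replace or augment the regex with a lightweight classifier trained on labeled "junk vs. content" pages, since the whole point is that nobody has measured how much of CC falls into these buckets.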