Comments
Shelfrock77 OP t1_ivoxqcj wrote
Merging the bridge between natural and synthetic to reunite once again.
TheRidgeAndTheLadder t1_ivrgobo wrote
Training on generated data seems like it would reinforce local maxima
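A rough way to picture that feedback loop, as a toy sketch and not anyone's actual training setup: fit a Gaussian to a small sample, draw the next "dataset" from the fitted Gaussian, and repeat. The function name `refit_loop` and the parameters (`generations`, `n`, `seed`) are made up for illustration. Strictly speaking this shows a distribution narrowing around what the model already produces, not a neural net stuck in a local maximum, but it is the same kind of self-reinforcement.

```python
import random
import statistics

def refit_loop(generations=200, n=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Hypothetical toy: each generation sees only data generated by the
    previous generation's model, never fresh real-world data.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" distribution
    stds = [sigma]
    for _ in range(generations):
        # Generate a small synthetic dataset from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        # Refit the model on its own output (maximum-likelihood estimates).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        stds.append(sigma)
    return stds

stds = refit_loop()
print(f"std at generation 0: {stds[0]:.3g}")
print(f"std at generation {len(stds) - 1}: {stds[-1]:.3g}")
```

With small samples the fitted spread tends to shrink generation after generation, so the model ends up covering less and less of the original distribution. Mixing fresh real data into each generation is the obvious thing this toy leaves out.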
LeavingTheCradle t1_ivpxao5 wrote
Boo
Down_The_Rabbithole t1_ivsks9k wrote
Yeah, no. This doesn't work if you actually understand the math involved. Like another commenter said, it would reinforce local maxima, which means it would work well in very specific, isolated cases but wouldn't generalize well.
Training data generation is the largest problem current AI models face, and it's going to make the entire industry stagnate over the next couple of years as we slowly run out of data to train bigger models with. Synthetic data training, however, is very unlikely to be a viable solution.
If anything, we'd probably need jobs where actual humans generate organic data to train AI better in the future.
userbrn1 t1_ivtrjuo wrote
Maybe someone can help explain this to me.
It doesn't make sense to me that an AI model is able to generate synthetic data that results in another AI model being better trained in the real world than if it was trained on real world data. Seems like sorcery to generate better real world performance by using synthetic data.
TemetN t1_ivpj7wb wrote
Frankly, I'm not concerned about copyright, but synthetic data is a promising area given how data-hungry models have gotten since the new scaling laws.