Comments

TemetN t1_ivpj7wb wrote

Frankly, I'm not concerned about copyright, but synthetic data is a promising area given how data-hungry models have become under the newer scaling laws.

23

Shelfrock77 OP t1_ivoxqcj wrote

Merging the bridge between natural and synthetic to reunite once again.

15

TheRidgeAndTheLadder t1_ivrgobo wrote

Training on generated data seems like it would reinforce local maxima.

6

Down_The_Rabbithole t1_ivsks9k wrote

Yeah, no. This doesn't work if you actually understand the math involved. As another commenter said, it would reinforce local maxima, which means it would work well in very specific, isolated cases but wouldn't generalize well.
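
For anyone who wants to poke at the local-maxima worry, here is a toy sketch of one way it could play out. It is not a claim about how any real model is trained. The assumptions are all mine: the "model" is just a Gaussian fitted to its training data, each generation is trained only on samples from the previous generation's fit, and no real data is ever mixed back in. The function name `simulate_collapse` and its parameters are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples and record the spread each time."""
    rng = random.Random(seed)

    # Generation 0: stand-in for "real" data, drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]

    spreads = []
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)

        # The next generation sees only synthetic samples from this fit.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]

    # Spread of the last synthetic dataset.
    spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    print(f"spread of the real data:       {spreads[0]:.4f}")
    print(f"spread after {len(spreads) - 1} generations: {spreads[-1]:.3e}")
```

In this setup the fitted spread tends to shrink from one generation to the next, so the synthetic data ends up covering far less of the original distribution. Whether anything like this happens with large models, or when real and synthetic data are mixed, is a separate question the toy doesn't answer.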

Training data generation is the largest problem current AI models face, and it's going to make the entire industry stagnate over the next couple of years as we slowly run out of data to train bigger models on. Synthetic data training, however, is very unlikely to be a viable solution.

If anything, we'd probably need jobs where actual humans generate organic data to train AI better in the future.

2

userbrn1 t1_ivtrjuo wrote

Maybe someone can help explain this to me.

It doesn't make sense to me that an AI model can generate synthetic data that leaves another AI model better trained for the real world than if it had been trained on real-world data. It seems like sorcery to get better real-world performance out of synthetic data.

2