Submitted by von-hust t3_11jyrfj in MachineLearning
PacmanIncarnate t1_jb5vofk wrote
Reply to comment by AuspiciousApple in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
You sell it short. You could deduplicate while merging the associated text, which solves half your problem. And since the goal of base SD is to be as generic as possible, there's little value in letting duplicates skew the weights in most situations, and there's a significant downside in overfitting. Fine-tuning then allows more customized models to choose where weights are adjusted.
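For illustration, here's a minimal Python sketch of what "deduplicate while merging the associated text" could look like, assuming duplicates have already been clustered upstream (e.g., by a perceptual hash or embedding similarity, as in the paper). The record fields, `cluster_id` key, and function name are all made up for the example, not taken from the paper's code:

```python
from collections import defaultdict

def dedupe_merge_captions(records):
    """records: iterable of dicts like
    {"url": ..., "caption": ..., "cluster_id": ...}.
    Returns one record per duplicate cluster, keeping a single
    image but attaching every distinct caption from the cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster_id"]].append(rec)

    deduped = []
    for cluster in clusters.values():
        keeper = cluster[0]  # keep one copy of the image
        captions = {rec["caption"] for rec in cluster}  # merge all text
        deduped.append({
            "url": keeper["url"],
            "captions": sorted(captions),  # distinct captions, stable order
            "num_duplicates": len(cluster),
        })
    return deduped

# Example: two copies of the same image with different captions
# collapse into one entry carrying both captions.
if __name__ == "__main__":
    sample = [
        {"url": "a.jpg", "caption": "oil painting of a ship", "cluster_id": 7},
        {"url": "b.jpg", "caption": "ship at sea, 1890", "cluster_id": 7},
        {"url": "c.jpg", "caption": "a cat", "cluster_id": 9},
    ]
    for row in dedupe_merge_captions(sample):
        print(row)
```

The point of keeping all the captions rather than one is that the duplicate copies often carry different text, so merging preserves the caption diversity while removing the repeated pixels that drive overfitting.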
The only downside is if the dataset ends up with fewer quality images overall because, say, 100,000 legit painting dupes got removed, leaving a larger percentage of memes and other junk.