Submitted by von-hust t3_11jyrfj in MachineLearning
PacmanIncarnate t1_jb5vofk wrote
Reply to comment by AuspiciousApple in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
You sell it short. You could deduplicate while merging the associated text, which solves half your problem. And since the goal of base SD is to be as generic as possible, there's little value in letting duplicates skew the weights in most situations, and there's a significant downside in overfitting. Fine-tuning then allows more customized models to choose where weights are adjusted.
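For illustration, here's a minimal Python sketch of what "deduplicate while merging the associated text" could look like, assuming duplicates have already been clustered upstream (e.g., by a perceptual hash or embedding similarity, as in the paper). The record fields, `cluster_id` key, and function name are all made up for the example, not taken from the paper's code:

```python
from collections import defaultdict

def dedupe_merge_captions(records):
    """records: iterable of dicts like
    {"url": ..., "caption": ..., "cluster_id": ...}.
    Returns one record per duplicate cluster, keeping a single
    image but attaching every distinct caption from the cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster_id"]].append(rec)

    deduped = []
    for cluster in clusters.values():
        keeper = cluster[0]  # keep one copy of the image
        captions = {rec["caption"] for rec in cluster}  # merge all text
        deduped.append({
            "url": keeper["url"],
            "captions": sorted(captions),  # distinct captions, stable order
            "num_duplicates": len(cluster),
        })
    return deduped

# Example: two copies of the same image with different captions
# collapse into one entry carrying both captions.
if __name__ == "__main__":
    sample = [
        {"url": "a.jpg", "caption": "oil painting of a ship", "cluster_id": 7},
        {"url": "b.jpg", "caption": "ship at sea, 1890", "cluster_id": 7},
        {"url": "c.jpg", "caption": "a cat", "cluster_id": 9},
    ]
    for row in dedupe_merge_captions(sample):
        print(row)
```

The point of keeping all the captions rather than one is that the duplicate copies often carry different text, so merging preserves the caption diversity while removing the repeated pixels that drive overfitting.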
The only downside is if the dataset ends up with fewer quality images overall because, say, 100,000 legit painting dupes got removed, leaving a larger percentage of memes and other junk.