Submitted by von-hust t3_11jyrfj in MachineLearning
AuspiciousApple t1_jb557q8 wrote
Reply to comment by JrdnRgrs in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Not obviously so.
First, de-duplicating text data didn't help much in the Cramming paper. Second, even if the images are duplicates, the captions might differ, so you still learn more than if you only had one copy of each image.
Finally, even with exact copies of both text and image, duplication would just weight those images more heavily than the rest - which could harm performance, not matter at all, or even help performance (for instance, if the duplicated images tend to be higher quality or more interesting).
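For intuition on that weighting point, here's a minimal PyTorch sketch (an illustration, not anything from the paper): k exact copies of an example under a summed loss yield the same gradient as a single copy with its loss scaled by k.

```python
import torch

# Toy setup: one linear model and one training example that the
# dataset contains k times.
model = torch.nn.Linear(2, 1)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
k = 5  # number of duplicates

# Gradient from summing the loss over the k duplicates...
model.zero_grad()
loss_dup = sum(((model(x) - y) ** 2).sum() for _ in range(k))
loss_dup.backward()
grad_dup = model.weight.grad.clone()

# ...matches the gradient from one copy with its loss weighted by k.
model.zero_grad()
loss_weighted = k * ((model(x) - y) ** 2).sum()
loss_weighted.backward()
grad_weighted = model.weight.grad.clone()

print(torch.allclose(grad_dup, grad_weighted))  # True
```

So whether duplicates hurt or help reduces to whether up-weighting those particular images happens to be good for the model.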
currentscurrents t1_jb5hswo wrote
AuspiciousApple t1_jb6gzcd wrote
Can't wait to see this replicated!
astrange t1_jb6hn1a wrote
StableDiffusion claims they also dedupe along these lines, in SD 2.x at least.
Though deduplicating whole images feels incomplete to me - what if the same thing appears in different images? Seeing the same subject across distinct photos is partly what you want the model to learn from, but it can still skew the weights in much the same way duplicates do.
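For reference, near-duplicate detection is usually done with perceptual hashes or embedding similarity rather than byte-level hashes. Here's a minimal sketch using the imagehash library (the threshold and the linear scan are illustrative choices, not what the paper used; at LAION scale you'd need approximate nearest-neighbour search over embeddings):

```python
from PIL import Image
import imagehash  # pip install imagehash

def dedup(paths, max_distance=4):
    """Keep one representative per group of perceptually similar images.

    phash is robust to resizing and re-encoding, so it catches
    near-duplicates that exact (MD5-style) hashing would miss.
    max_distance is a Hamming-distance threshold in bits.
    """
    kept = []  # (hash, path) pairs for retained images
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # imagehash overloads `-` to return the Hamming distance.
        if all(h - kh > max_distance for kh, _ in kept):
            kept.append((h, path))
    return [path for _, path in kept]
```

Note this only merges near-identical images; it says nothing about the same subject photographed twice, which is the harder question above.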
NotARedditUser3 t1_jb5kq1q wrote
Honestly I think this is the answer here
PacmanIncarnate t1_jb5vofk wrote
You sell it short. You could de-duplicate while merging the associated text, which solves half your problem (a sketch of that merging step follows this comment). And the goal of base SD is to be as generic as possible, so there's little value in letting duplicates skew the weights in most situations, and a significant downside in overfitting. Fine-tuning then allows more customized models to choose where weights are adjusted.
The only downside is if the dataset ends up with fewer high-quality images overall because 100,000 legit painting dups got removed, leaving a larger percentage of memes and other junk.
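A minimal sketch of that caption-merging step, assuming images are already keyed by some content hash (exact or perceptual; the keys below are placeholders):

```python
from collections import defaultdict

def dedup_merge_captions(samples):
    """samples: iterable of (image_key, caption) pairs.

    Returns one entry per unique image with all distinct captions
    kept, so the text signal from duplicates isn't thrown away."""
    merged = defaultdict(set)
    for key, caption in samples:
        merged[key].add(caption)
    return {key: sorted(caps) for key, caps in merged.items()}

# Hypothetical rows: two copies of one image with different captions.
rows = [("abc123", "a cat on a sofa"),
        ("abc123", "tabby cat sleeping"),
        ("def456", "a red bicycle")]
print(dedup_merge_captions(rows))
# {'abc123': ['a cat on a sofa', 'tabby cat sleeping'],
#  'def456': ['a red bicycle']}
```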
midasp t1_jb5p7v0 wrote
In my experience, all of these outcomes occur, and it varies from model to model. To be certain, you still have to run an objective test to determine whether the impact is positive or negative, and to measure how significant it is.
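One way to make that test concrete: score both model variants (trained with and without dedup) per-image on the same held-out set, then bootstrap the mean difference. A sketch with placeholder scores - in practice these would be, say, per-image CLIP similarities from the two models:

```python
import numpy as np

def bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired bootstrap over per-image score differences.

    Returns the mean difference and a 95% confidence interval;
    an interval excluding 0 suggests the impact isn't just noise."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    boot_means = np.array(
        [diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    )
    return diffs.mean(), np.percentile(boot_means, [2.5, 97.5])

# Placeholder data standing in for real per-image scores.
rng = np.random.default_rng(1)
with_dedup = rng.normal(0.32, 0.05, 1000)
without_dedup = rng.normal(0.31, 0.05, 1000)
print(bootstrap_diff(with_dedup, without_dedup))
```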