Submitted by von-hust t3_11jyrfj in MachineLearning
AuspiciousApple t1_jb557q8 wrote
Reply to comment by JrdnRgrs in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Not obviously so.
First, de-duplicating text data didn't help much in the Cramming paper. Second, even if the images are duplicates, the captions might differ, so you still learn more than if you only had one copy of each image.
Finally, even with exact copies of both text and image, duplication would just weight those images more heavily than the rest - which could harm performance, not matter at all, or even help performance (for instance, if the duplicated images tend to be higher quality or more interesting).
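For intuition on that weighting point, here's a minimal PyTorch sketch (an illustration, not anything from the paper): k exact copies of an example under a summed loss yield the same gradient as a single copy with its loss scaled by k.

```python
import torch

# Toy setup: one linear model and one training example that the
# dataset contains k times.
model = torch.nn.Linear(2, 1)
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[1.0]])
k = 5  # number of duplicates

# Gradient from summing the loss over the k duplicates...
model.zero_grad()
loss_dup = sum(((model(x) - y) ** 2).sum() for _ in range(k))
loss_dup.backward()
grad_dup = model.weight.grad.clone()

# ...matches the gradient from one copy with its loss weighted by k.
model.zero_grad()
loss_weighted = k * ((model(x) - y) ** 2).sum()
loss_weighted.backward()
grad_weighted = model.weight.grad.clone()

print(torch.allclose(grad_dup, grad_weighted))  # True
```

So whether duplicates hurt or help reduces to whether up-weighting those particular images happens to be good for the model.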
currentscurrents t1_jb5hswo wrote
AuspiciousApple t1_jb6gzcd wrote
Can't wait to see this replicated!
astrange t1_jb6hn1a wrote
StableDiffusion claims they also dedupe along these lines, in SD 2.x at least.
Though deduplicating whole images feels incomplete to me - what if the same thing appears in different images? Seeing the same subject across distinct photos is partly what you want the model to learn from, but it can still skew the weights in much the same way duplicates do.
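For reference, near-duplicate detection is usually done with perceptual hashes or embedding similarity rather than byte-level hashes. Here's a minimal sketch using the imagehash library (the threshold and the linear scan are illustrative choices, not what the paper used; at LAION scale you'd need approximate nearest-neighbour search over embeddings):

```python
from PIL import Image
import imagehash  # pip install imagehash

def dedup(paths, max_distance=4):
    """Keep one representative per group of perceptually similar images.

    phash is robust to resizing and re-encoding, so it catches
    near-duplicates that exact (MD5-style) hashing would miss.
    max_distance is a Hamming-distance threshold in bits.
    """
    kept = []  # (hash, path) pairs for retained images
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # imagehash overloads `-` to return the Hamming distance.
        if all(h - kh > max_distance for kh, _ in kept):
            kept.append((h, path))
    return [path for _, path in kept]
```

Note this only merges near-identical images; it says nothing about the same subject photographed twice, which is the harder question above.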
NotARedditUser3 t1_jb5kq1q wrote
Honestly I think this is the answer here
PacmanIncarnate t1_jb5vofk wrote
You sell it short. You could de-duplicate while merging the associated text, which solves half your problem (a sketch of that merging step follows this comment). And the goal of base SD is to be as generic as possible, so there's little value in letting duplicates skew the weights in most situations, and a significant downside in overfitting. Fine-tuning then allows more customized models to choose where weights are adjusted.
The only downside is if the dataset ends up with fewer high-quality images overall because 100,000 legit painting dups got removed, leaving a larger percentage of memes and other junk.
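A minimal sketch of that caption-merging step, assuming images are already keyed by some content hash (exact or perceptual; the keys below are placeholders):

```python
from collections import defaultdict

def dedup_merge_captions(samples):
    """samples: iterable of (image_key, caption) pairs.

    Returns one entry per unique image with all distinct captions
    kept, so the text signal from duplicates isn't thrown away."""
    merged = defaultdict(set)
    for key, caption in samples:
        merged[key].add(caption)
    return {key: sorted(caps) for key, caps in merged.items()}

# Hypothetical rows: two copies of one image with different captions.
rows = [("abc123", "a cat on a sofa"),
        ("abc123", "tabby cat sleeping"),
        ("def456", "a red bicycle")]
print(dedup_merge_captions(rows))
# {'abc123': ['a cat on a sofa', 'tabby cat sleeping'],
#  'def456': ['a red bicycle']}
```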
midasp t1_jb5p7v0 wrote
In my experience, all of these outcomes occur, and it varies from model to model. To be certain, you still have to run an objective test to determine whether the impact is positive or negative, and to measure how significant it is.
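One way to make that test concrete: score both model variants (trained with and without dedup) per-image on the same held-out set, then bootstrap the mean difference. A sketch with placeholder scores - in practice these would be, say, per-image CLIP similarities from the two models:

```python
import numpy as np

def bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Paired bootstrap over per-image score differences.

    Returns the mean difference and a 95% confidence interval;
    an interval excluding 0 suggests the impact isn't just noise."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    boot_means = np.array(
        [diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)]
    )
    return diffs.mean(), np.percentile(boot_means, [2.5, 97.5])

# Placeholder data standing in for real per-image scores.
rng = np.random.default_rng(1)
with_dedup = rng.normal(0.32, 0.05, 1000)
without_dedup = rng.normal(0.31, 0.05, 1000)
print(bootstrap_diff(with_dedup, without_dedup))
```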