
AuspiciousApple t1_jb557q8 wrote

Not obviously so.

First, de-duplicating text data didn't help much in the cramming paper. Second, even if the images are duplicates, the captions might differ, so you still learn more than if you kept only one copy of each image.

Finally, even with exact copies of both text and image, duplication just weights those images more heavily than the rest, which could hurt performance, have no effect at all, or even help (for instance, if the duplicated images tend to be higher quality or more interesting).
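
A minimal sketch of that up-weighting point, with made-up numbers (illustrative only, not part of any real training pipeline): under a standard mean loss, keeping k exact copies of an example is equivalent to keeping one copy and weighting its loss by k.

```python
# Illustrative only: with a mean loss, k exact copies of an example
# are equivalent to one copy whose loss is weighted by k.

def mean_loss(unique_losses, counts):
    """counts[i] = how many copies of unique example i the dataset contains."""
    return sum(l * c for l, c in zip(unique_losses, counts)) / sum(counts)

losses = [0.9, 0.4, 0.7]   # per-example losses of the unique images
counts = [1, 5, 1]         # the second image appears 5 times

# Averaging over the fully expanded dataset gives the same number.
expanded = [l for l, c in zip(losses, counts) for _ in range(c)]
assert abs(mean_loss(losses, counts) - sum(expanded) / len(expanded)) < 1e-12
```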

98

AuspiciousApple t1_jb6gzcd wrote

Can't wait to see this replicated!

6

astrange t1_jb6hn1a wrote

The Stable Diffusion team claims they also dedupe following this, at least in SD 2.x.

Though deduplicating images feels incomplete to me: what if the same thing appears in different images? Seeing it in varied contexts is partly what you want the model to learn from, but it can also reintroduce the over-representation you were trying to remove.
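
As a hedged sketch of one way to go beyond exact dedup (the embedding-similarity approach and the 0.95 threshold are assumptions for illustration, not something SD 2.x is documented to do): flag near-duplicates by cosine similarity of L2-normalized image embeddings.

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.95):
    """embeddings: (n, d) array of L2-normalized image embeddings.
    Returns index pairs whose cosine similarity meets the threshold."""
    sims = embeddings @ embeddings.T
    pairs = []
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb[1] = emb[0] + 0.01 * rng.normal(size=8)          # two near-identical images
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # L2-normalize rows
print(near_duplicate_pairs(emb))                      # expect roughly [(0, 1, ~1.0)]
```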

11

PacmanIncarnate t1_jb5vofk wrote

You sell it short. You could de-duplicate while merging the associated text, which solves half the problem (see the sketch after this comment). And the goal of base SD is to be as generic as possible, so there's little value in letting duplicates skew the weights in most situations, and a significant downside in overfitting. Fine-tuning then lets more customized models choose where the weights are adjusted.

The only downside is if the dataset ends up with fewer quality images overall because 100,000 legitimate painting duplicates got removed, leaving a larger share of memes and other junk.
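
For illustration, a hypothetical version of that de-duplicate-and-merge-captions idea (exact hashing stands in for whatever near-duplicate detection a real pipeline would use; none of this reflects how SD's dataset was actually built):

```python
import hashlib
from collections import defaultdict

def dedup_merge_captions(samples):
    """samples: iterable of (image_bytes, caption) pairs.
    Keeps one copy of each exact-duplicate image with all of its distinct captions."""
    merged = defaultdict(lambda: {"image": None, "captions": set()})
    for image_bytes, caption in samples:
        key = hashlib.sha256(image_bytes).hexdigest()
        merged[key]["image"] = image_bytes
        merged[key]["captions"].add(caption)
    return [(entry["image"], sorted(entry["captions"])) for entry in merged.values()]

samples = [
    (b"<jpeg bytes A>", "a painting of a ship"),
    (b"<jpeg bytes A>", "stormy sea, oil on canvas"),  # same image, different caption
    (b"<jpeg bytes B>", "a photo of a cat"),
]
print(dedup_merge_captions(samples))  # 2 unique images, the first with 2 captions
```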

9

midasp t1_jb5p7v0 wrote

In my experience, all of these outcomes do occur; which one you get varies from model to model. To know for certain, you still have to run an objective test to determine whether the impact is positive or negative, and to measure how significant it is.

2