graphicteadatasci t1_jb9afw5 wrote
Reply to comment by enjakuro in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Really? Copying all your data once is the same as running through your dataset twice per epoch instead of once, so that doesn't sound right. Unless your test data is drawn from the same dataset and the duplication happens before splitting, in which case you would certainly expect metric improvements. Or was this a case of duplicating rare text? That would be the opposite of having duplicate images in LAION.
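(For illustration, a minimal sketch of what "deduplicate before you split" looks like. The helper names are made up, and hashing only catches exact duplicates; near-duplicates like the ones found in LAION would need perceptual hashes or embeddings instead.)

```python
import hashlib
import random

def content_hash(example: bytes) -> str:
    # Exact-duplicate detection via hashing; near-duplicate detection
    # would need perceptual hashes or embedding similarity.
    return hashlib.sha256(example).hexdigest()

def split_without_leakage(examples, test_frac=0.1, seed=0):
    # Drop duplicates *before* splitting so copies of the same example
    # can't land in both train and test and inflate test metrics.
    seen, unique = set(), []
    for ex in examples:
        h = content_hash(ex)
        if h not in seen:
            seen.add(h)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    n_test = int(len(unique) * test_frac)
    return unique[n_test:], unique[:n_test]  # train, test
```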
enjakuro t1_jb9l86l wrote
Ah, it was the rare-text thing, I believe. Now that I'm more awake, I also realize that they copied the source to the target, i.e. the same language on both the source and target side, while keeping the rest bilingual. If I recall correctly, you can have up to 50% copied data, which makes the training set much bigger. I guess if the images aren't exactly the same this would have the same effect. Basically training a language model.
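(A rough sketch of the copied-data setup described above, assuming the usual "copy monolingual sentences as identical source/target pairs" augmentation for NMT; the function name and ratio handling are illustrative, not from the thread.)

```python
import random

def build_training_set(bilingual_pairs, monolingual_sents, copy_ratio=0.5, seed=0):
    """Mix real bilingual pairs with copied (sent, sent) pairs.

    copy_ratio is the maximum fraction of the final set made of copies,
    e.g. 0.5 means up to half the training data is copied monolingual text.
    """
    rng = random.Random(seed)
    max_copies = int(len(bilingual_pairs) * copy_ratio / (1 - copy_ratio))
    copies = [(s, s) for s in monolingual_sents[:max_copies]]
    data = list(bilingual_pairs) + copies
    rng.shuffle(data)
    return data

# e.g. 100k real pairs + up to 100k copied pairs -> copies are at most 50% of the set
```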
graphicteadatasci t1_jbdt33t wrote
Yeah, because there are some very nice results on classification models where they removed data that doesn't contribute to learning, and it made training both faster and more accurate. But of course I can't remember what the paper was called.
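(The paper isn't named here, so this is only an illustrative sketch of that kind of data pruning: score each training example, e.g. by its per-example loss from an early checkpoint, and drop the least-informative fraction before continuing training.)

```python
import numpy as np

def prune_by_score(examples, scores, keep_frac=0.8):
    """Keep the `keep_frac` highest-scoring (e.g. highest-loss) examples."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]            # descending: hardest first
    keep_idx = order[: int(len(examples) * keep_frac)]
    return [examples[i] for i in keep_idx]
```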
enjakuro t1_jbf0yco wrote
Same hahaha, would've linked it otherwise xD