
enjakuro t1_jb9l86l wrote

Ah, it was the rare-text thing, I believe. Now that I'm more awake I also realized that they copied the source to the target, i.e. the same language on both the source and target side, while keeping the rest bilingual. If I recall correctly, you can have up to 50% copied data, which makes the training set much bigger. I guess if the images aren't exactly the same this would have the same effect. Basically training a language model.
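For concreteness, a rough sketch of what that copy-augmentation could look like. Everything here (the list-of-pairs dataset format, the function name, the way the copy ratio is applied) is made up for illustration, not taken from whatever paper this was:

```python
import random

def augment_with_copies(parallel_pairs, copy_ratio=0.5, seed=0):
    """Augment a bilingual corpus with 'copied' pairs where source == target.

    parallel_pairs: list of (source_sentence, target_sentence) tuples.
    copy_ratio: fraction of the *final* dataset allowed to be copied data
                (the comment above mentions up to ~50%).
    """
    rng = random.Random(seed)
    # How many copies are needed so that copies make up copy_ratio of the result.
    n_copies = int(len(parallel_pairs) * copy_ratio / (1.0 - copy_ratio))
    # Take target-language sentences and pair each one with itself,
    # so the model also sees monolingual "copy" examples.
    monolingual = [tgt for _, tgt in parallel_pairs]
    copied = [(s, s) for s in rng.choices(monolingual, k=n_copies)]
    augmented = parallel_pairs + copied
    rng.shuffle(augmented)
    return augmented

# Example: 3 real pairs plus copied data making up ~50% of the final set.
pairs = [("guten morgen", "good morning"),
         ("danke", "thank you"),
         ("wie geht es dir", "how are you")]
print(augment_with_copies(pairs, copy_ratio=0.5))
```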


graphicteadatasci t1_jbdt33t wrote

Yeah, because there are some very nice results on classification models where removing data that doesn't contribute to learning made training both faster and more accurate. But of course I can't remember at all what the paper was called.
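Since neither of us can recall the actual paper, here's only a generic sketch of one common way to do that kind of pruning: score examples by their per-example loss at an early checkpoint and drop the ones the model already finds trivially easy. The function name, the keep fraction, and the use of loss as the "contribution" proxy are all just assumptions for illustration:

```python
import numpy as np

def prune_easy_examples(losses, keep_fraction=0.7):
    """Keep only the examples with the highest per-example loss.

    losses: 1-D array of per-example training losses from an early
            checkpoint (a rough proxy for 'contribution to learning').
    keep_fraction: fraction of the dataset to keep.
    Returns the indices of the examples to keep.
    """
    losses = np.asarray(losses)
    n_keep = max(1, int(len(losses) * keep_fraction))
    # argsort is ascending, so the tail holds the highest-loss (hardest) examples.
    return np.argsort(losses)[-n_keep:]

# Example: drop the ~30% of examples the model already gets almost for free.
example_losses = [0.02, 1.3, 0.9, 0.01, 0.6, 2.1]
keep_idx = prune_easy_examples(example_losses, keep_fraction=0.7)
print(sorted(keep_idx.tolist()))  # [1, 2, 4, 5]
```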


enjakuro t1_jbf0yco wrote

Same hahaha, would've linked it otherwise xD
