enjakuro t1_jb9l86l wrote
Reply to comment by graphicteadatasci in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Ah, it was the rare-text thing, I believe. Now that I'm more awake, I also realized they copied the source to the target, meaning the same language on both the source and target side, while keeping the rest bilingual. If I recall correctly, you can have up to 50% copied data, which makes the training set much bigger. I guess if the images aren't exactly the same, this would have the same effect. Basically training a language model.
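The copied-data trick described above (adding source→source pairs so that copies make up some fraction, up to half, of the final corpus) could be sketched roughly like this; the function name and the way the ratio is handled are my own assumptions for illustration, not from any specific paper:

```python
import random

def augment_with_copies(bitext, copy_ratio=0.5, seed=0):
    """Sketch of copied-monolingual-data augmentation for NMT.

    bitext: list of (source, target) sentence pairs.
    copy_ratio: fraction of the *final* corpus made of copied
    (source, source) pairs; 0.5 doubles the corpus size.
    (Hypothetical helper, not from the thread/paper.)
    """
    assert 0 <= copy_ratio < 1
    rng = random.Random(seed)
    # Number of copies needed so copies / (original + copies) == copy_ratio
    n_copies = round(copy_ratio / (1 - copy_ratio) * len(bitext))
    sources = [src for src, _ in bitext]
    # Sample source sentences (with replacement) and pair each with itself
    copies = [(s, s) for s in rng.choices(sources, k=n_copies)]
    return bitext + copies
```

With `copy_ratio=0.5` an N-pair corpus becomes 2N pairs, half of them monolingual source→source copies, matching the "up to 50% copied data" figure mentioned above.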
enjakuro t1_jb8thcg wrote
Yeah, but copying data in a corpus has yielded better results, at least in NLP translation tasks. It's always good to know what's in your data, though. Just saying it might not be a bad thing.
enjakuro t1_jbf0yco wrote
Reply to comment by graphicteadatasci in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Same hahaha, would've linked it otherwise xD