enjakuro t1_jb9l86l wrote
Reply to comment by graphicteadatasci in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Ah, it was the rare-text thing, I believe. Now that I'm more awake, I also realized they copied the source to the target, meaning the same language on both the source and target side, while keeping the rest bilingual. If I recall correctly, you can have up to 50% copied data, which makes the training set much bigger. I guess if the images aren't exactly the same, this would have the same effect. Basically training a language model.
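The copied-data trick described above (adding source→source pairs so that copies make up some fraction, up to half, of the final corpus) could be sketched roughly like this; the function name and the way the ratio is handled are my own assumptions for illustration, not from any specific paper:

```python
import random

def augment_with_copies(bitext, copy_ratio=0.5, seed=0):
    """Sketch of copied-monolingual-data augmentation for NMT.

    bitext: list of (source, target) sentence pairs.
    copy_ratio: fraction of the *final* corpus made of copied
    (source, source) pairs; 0.5 doubles the corpus size.
    (Hypothetical helper, not from the thread/paper.)
    """
    assert 0 <= copy_ratio < 1
    rng = random.Random(seed)
    # Number of copies needed so copies / (original + copies) == copy_ratio
    n_copies = round(copy_ratio / (1 - copy_ratio) * len(bitext))
    sources = [src for src, _ in bitext]
    # Sample source sentences (with replacement) and pair each with itself
    copies = [(s, s) for s in rng.choices(sources, k=n_copies)]
    return bitext + copies
```

With `copy_ratio=0.5` an N-pair corpus becomes 2N pairs, half of them monolingual source→source copies, matching the "up to 50% copied data" figure mentioned above.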
enjakuro t1_jb8thcg wrote
Yeah, but copying data in a corpus has yielded better results, at least in NLP translation tasks. It's always good to know what's in your data, though. Just saying it might not be a bad thing.
enjakuro t1_jbf0yco wrote
Reply to comment by graphicteadatasci in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
Same hahaha, would've linked it otherwise xD