alushamir
alushamir t1_jb9fdgy wrote
Reply to comment by TikiTDO in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
I agree that mislabels are also an issue.
You can see some examples in this video:
https://www.youtube.com/watch?v=s6qamoFzyis&t=7s
We used fastdup to analyse LAION-400M.
alushamir t1_jb6o8jd wrote
Hi, I'm one of the authors of fastdup. In an analysis we ran 5 months ago, we found only around 15% duplicates in LAION-400M.
You can check out a short video on the matter here: https://www.youtube.com/watch?v=s6qamoFzyis
For additional info read here: https://visual-layer.readme.io
alushamir t1_jbao8ez wrote
Reply to comment by TikiTDO in [R] We found nearly half a billion duplicated images on LAION-2B-en. by von-hust
>BLIP VQA
Thanks for sharing! You can try fastdup. It's free and it scales. It's also very easy to use.
https://github.com/visual-layer/fastdup
Would love to get your feedback. PM me or join our Slack channel. I'll be happy to talk more.