Submitted by von-hust t3_11jyrfj in MachineLearning
Using our new method, we found that at least 25% of the LAION-2B-en dataset is near-duplicated (with respect to the image data). You may find the deduplicated set and the code to verify this result here:
https://github.com/ryanwebster90/snip-dedup
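For intuition, here is a minimal sketch of one common way to find near-duplicate images at this scale: embed each image (e.g., with a CLIP image encoder), index the L2-normalized embeddings with FAISS, and flag pairs whose cosine similarity exceeds a threshold. The embedding source, index type, and threshold below are illustrative assumptions, not necessarily the snip-dedup pipeline.

```python
# Illustrative sketch only -- not the actual snip-dedup pipeline.
# Assumes precomputed, L2-normalized image embeddings (e.g., CLIP features)
# stored as a float32 array of shape (num_images, dim).
import numpy as np
import faiss

def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95, k: int = 5):
    """Return (i, j) index pairs whose cosine similarity exceeds `threshold`."""
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)  # inner product == cosine sim for normalized vectors
    index.add(embeddings)
    sims, nbrs = index.search(embeddings, k)  # k nearest neighbors per image
    pairs = set()
    for i in range(embeddings.shape[0]):
        for sim, j in zip(sims[i], nbrs[i]):
            if i != j and sim >= threshold:
                pairs.add((min(i, j), max(i, j)))  # canonical ordering avoids double counting
    return pairs

# Toy usage: 10k random embeddings of dimension 512 (real features would come from an encoder).
emb = np.random.randn(10_000, 512).astype("float32")
faiss.normalize_L2(emb)
dupes = find_near_duplicates(emb, threshold=0.95)
print(f"{len(dupes)} near-duplicate pairs found")
```

At the scale of 2B images you would use a compressed, approximate FAISS index rather than the brute-force flat index shown here; the flat index just keeps the sketch short.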
In addition, using the duplicate histograms, we found a handful of "verbatim copied" images generated by Stable Diffusion, with far fewer resources than DeepMind (our process runs on a standard computer), like the following:
[Image: Stable Diffusion verbatim copy]
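A rough sketch of how a duplicate histogram can be used for this (the thresholds and helper names below are hypothetical, not the exact procedure from the post): prioritize the most heavily duplicated training images, generate from their captions, and flag a generation whose embedding is nearly identical to a training image as a likely verbatim copy.

```python
# Illustrative sketch only -- assumes you already have a per-image duplicate
# count (the "duplicate histogram") and L2-normalized embeddings for both
# training and generated images. The 0.99 threshold is an assumption.
import numpy as np

def candidates_from_histogram(dup_counts: np.ndarray, top_k: int = 100) -> np.ndarray:
    """Indices of the most-duplicated training images; heavily duplicated
    images are the most likely to be reproduced verbatim by the model."""
    return np.argsort(dup_counts)[::-1][:top_k]

def looks_verbatim(gen_emb: np.ndarray, train_emb: np.ndarray, thresh: float = 0.99) -> bool:
    """Cosine similarity near 1 between normalized embeddings suggests a near-pixel copy."""
    return float(gen_emb @ train_emb) >= thresh
```

The histogram does the heavy lifting: instead of checking all of LAION, you only generate and compare against the small set of images that appear many times, which is why this kind of check can run on a standard computer.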
Disclaimer: This is a fairly new result; we'll publish once we've done more verification. Take it with a grain of salt. You are welcome to explore and verify the deduplicated set we've released.
JrdnRgrs t1_jb53xvx wrote
Very interesting. So what is the implication for Stable Diffusion?
Does this mean that if the dataset were corrected for these duplicated images, a model retrained on the corrected dataset would be of even "higher quality"? Can't wait.