Submitted by t3_1043mb2 in MachineLearning

A friend and I are working on a project that requires us to take images as input and find images that match them in our database. What is the most effective way to do this? We've tried SIFT and a few similar solutions, but nothing's been super effective so far. Does anyone have any suggestions? Are there any solid open-source solutions?




t1_j32mj4h wrote

For every image in your database, you could extract features from the penultimate layer of a CNN and index them.

Then to search over images, simply calculate the distance between the query image features and the database features.
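As a rough illustration of that search step, here is a minimal brute-force sketch. It assumes the features have already been extracted; the file names and the 4-dimensional vectors are made up for the example, and real CNN features would be much longer.

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_images(query, database, k=3):
    """Return the k (name, distance) pairs closest to the query features."""
    scored = [(name, euclidean(query, feats)) for name, feats in database.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical pre-computed features, one vector per database image.
database = {
    "cat_1.jpg": [0.9, 0.1, 0.0, 0.2],
    "cat_2.jpg": [0.8, 0.2, 0.1, 0.1],
    "car_1.jpg": [0.1, 0.9, 0.7, 0.0],
}
query = [0.9, 0.1, 0.05, 0.15]

for name, distance in nearest_images(query, database, k=2):
    print(name, round(distance, 3))
```

This compares the query against every stored vector, so the cost grows linearly with the size of the database.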

This can be expensive, both computationally and memory-wise, if you have a lot of images. Some solutions could be to cluster your database embeddings, use sparse matrices, use approximate KNN, add some explore-exploit heuristics (take the image with the lowest distance compared to the first 37% of images in the database; this cuts down search time by up to 63%, but might not be great). There is possibly more out there in SoTA, but I am not up to date there.
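One way to read the clustering suggestion is sketched below: bucket the database embeddings by their nearest centroid, then compare the query only against the bucket it falls into. The centroids here are hard-coded assumptions for the example; in practice they might come from something like k-means. The result is approximate, since a true neighbour sitting just across a cluster boundary can be missed.

```python
import math
from collections import defaultdict

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def build_buckets(database, centroids):
    """Group database entries by their nearest centroid."""
    buckets = defaultdict(list)
    for name, feats in database.items():
        buckets[nearest_centroid(feats, centroids)].append((name, feats))
    return buckets

def search(query, buckets, centroids, k=1):
    """Rank only the entries that share the query's nearest centroid."""
    candidates = buckets.get(nearest_centroid(query, centroids), [])
    ranked = sorted(candidates, key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical 2-D embeddings and centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
database = {
    "a.jpg": [0.0, 1.0],
    "b.jpg": [1.0, 0.0],
    "c.jpg": [9.0, 10.0],
    "d.jpg": [10.0, 9.0],
}
buckets = build_buckets(database, centroids)
print(search([9.5, 10.0], buckets, centroids, k=1))
```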


t1_j32rcxp wrote

> Some solutions could be to cluster your database embeddings, use sparse matrices, use approximate KNN, add some explore-exploit heuristics

Pretty sure Faiss can help with that


I'd recommend this course to anyone who wants to try it out.


t1_j32sw4k wrote

Looks cool, thanks for pointing it out


t1_j37aaxd wrote

I've used Faiss before to retrieve similar images based on CLIP embeddings (so I could do text-to-image searches). It works okay, but it doesn't order the results very well. It had 'favorite' images it preferred returning over everything else. So, for my use case, I found Faiss worked best as a good first-pass tool as opposed to a complete solution here.

If you do this approach, I would recommend asking Faiss to retrieve a few more images than you need, then calculating cosine similarity yourself on the images Faiss retrieves to get the 'best' matched images.
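A minimal sketch of that re-ranking step, assuming the approximate index has already returned a list of candidate ids. The ids, embeddings, and candidate list below are invented for the example and stand in for whatever Faiss actually gives back.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(query, candidates, embeddings, k):
    """Re-order candidate ids by exact cosine similarity and keep the top k."""
    scored = [(cid, cosine_similarity(query, embeddings[cid])) for cid in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical stored embeddings, keyed by image id.
embeddings = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}

# Stand-in for an over-fetched, loosely ordered result from the first pass.
candidates = ["c", "b", "a"]

top = rerank([1.0, 0.1], candidates, embeddings, k=2)
print(top)
```

The exact similarity is computed only for the handful of candidates, not for the whole database.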

Edit: Also this was the tutorial I followed to get Faiss working. I found it pretty easy to follow and adapt to CLIP.


t1_j3mirpm wrote

>If you do this approach, I would recommend asking Faiss to retrieve a few more images than you need, then calculating cosine similarity yourself on the images Faiss retrieves to get the 'best' matched images.

Why not just index by cosine distance in the first place?


t1_j3mlz40 wrote

Well, if you're storing 1 million images in the database, it's going to take a long time to do the cosine distance for all 1 million images. FAISS will give you very roughly the 1000 nearest and you can do the cosine distance from there. My usage was anybody could enter any text phrase and search my dataset. I can't precompute the cosine distance for every query somebody might make.


t1_j3mtoyy wrote

What I mean is that Faiss can compute kNN for a variety of metrics, including cosine distance, so you can index by cosine distance directly instead of L2.
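For context, the usual route to cosine search is to L2-normalise the vectors first: on unit vectors the inner product equals the cosine similarity, and the squared L2 distance equals 2 - 2·cos, so both rank neighbours the same way. The sketch below only checks that identity numerically in plain Python with made-up vectors; it does not call Faiss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = normalize(a), normalize(b)

print("cosine(a, b)        =", cosine(a, b))
print("dot of unit vectors =", dot(ua, ub))
print("squared L2 on units =", squared_l2(ua, ub))
print("2 - 2 * cosine      =", 2 - 2 * cosine(a, b))
```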


t1_j3mx169 wrote

Ah, I see. I didn’t know. I guess you could do it that way.


t1_j34t0cm wrote

Depends if they want to match nearly exact images or match images that are just similar in visual appearance to a human. If it is the latter, then the distances in these later layers need not be close for similar images. A popular example of this is adversarial images.


t1_j33xh4d wrote

SIFT is good if you want to match images of the same building or cereal box seen from another point of view or with different lighting.

If you want to match images that have dogs or cars or Bavarian houses, you might need some sort of convolutional autoencoder as a featuriser.

If you have a lot of GPUs available, you can use ViT, a transformer-based architecture, to compute features.

Once you have features you might use a nearest neighbors library to find close representations.


t1_j33ya2o wrote

What if you wanted to match faces? OpenCV has a NN module that detects faces; is there a good solution for face recognition against a database?


t1_j34fk6v wrote

In the last month I came across a blog post about vector databases. The post argued that there are a few basic types of distances (L1, L2, cosine) and that you are going to have better fortune using a vector database that supports those than searching using your own heuristic and hybrid solutions. So my suggestion would be to represent faces in some space that you can search over with a vector database or with some nearest neighbors index


t1_j34k8y6 wrote

You probably want something like a perceptual hash that finds invariants in an image and has an efficient retrieval algorithm for a huge database.


t1_j35m42b wrote

I have tried the imagehash Python library, and its perceptual hashing and difference hashing techniques have given good results.
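To give a feel for how difference hashing behaves, here is a toy sketch on small hand-written grayscale grids. It is not the imagehash implementation: a real difference hash first shrinks the image to a small fixed-size grayscale thumbnail, and the pixel values below are invented.

```python
def dhash(pixels):
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits

def hamming(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

# Hypothetical 3x4 grayscale "thumbnails".
original = [
    [10, 20, 30, 40],
    [40, 30, 20, 10],
    [10, 50, 20, 60],
]
brighter = [[p + 15 for p in row] for row in original]  # uniform brightness shift
mirrored = [row[::-1] for row in original]              # horizontally flipped

print(hamming(dhash(original), dhash(brighter)))
print(hamming(dhash(original), dhash(mirrored)))
```

Because only the sign of each neighbouring difference is kept, a uniform brightness change leaves the hash untouched, while a flipped image ends up far away in Hamming distance.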


t1_j373c3z wrote

What about using a vector search engine like Weaviate?

Grab a pre-trained autoencoder if you don't already have one, batch your images through it and into Weaviate, then use its search functionality to compute image similarity.


t1_j34uve3 wrote

Try a Siamese network trained with the triplet loss function as one baseline. This works if you can label or construct a database with pairs labeled as “similar” and “dissimilar”, and it suits cases where the definition of similarity is easy for a human to understand but hard to code up as a simple algorithm.

I’m assuming you aren’t just searching for nearly exact replicas of some input image and your definition of “similar” is more complex, as the former should be fairly trivial, no?
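For reference, a common form of the triplet loss is max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below evaluates it on made-up 2-D embeddings using plain Euclidean distance; some formulations use squared distances instead, and the margin of 1.0 is an arbitrary choice for the example.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on how much closer the positive is to the anchor than the negative."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # labeled "similar" to the anchor
easy_negative = [3.0, 4.0]   # labeled "dissimilar" and already far away
hard_negative = [0.0, 1.5]   # labeled "dissimilar" but still close

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The loss is zero once the dissimilar image sits at least a margin further from the anchor than the similar one, so training effort concentrates on the hard cases.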


t1_j32skz3 wrote


t1_j33hg7a wrote

Amazing work and bless you for writing it up!!

How does it handle translated or rotated images?


t1_j3mn2ji wrote

It should be able to capture some transformations of the original images, but maybe I should think about measuring that. Thanks for the idea!


t1_j33bvxn wrote

As the other comment suggests, you can use some kind of dense vector representation to search for nearest neighbours. I think the most effective method would be to use the latent vectors learned by an autoencoder. There is a great tutorial on how you could achieve this.


t1_j341bqs wrote

If you want really fast retrieval after encoding the images with a NN, you can use an LSH algorithm to quickly reduce the search space.
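One common LSH family for this is random hyperplanes: each bit of the key records which side of a random hyperplane the embedding falls on, so vectors pointing in similar directions tend to land in the same bucket. A minimal sketch with invented embeddings follows; the number of bits and the seed are arbitrary assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

def build_table(database, hyperplanes):
    """Map each LSH key to the names of the images hashed to it."""
    table = defaultdict(list)
    for name, vec in database.items():
        table[lsh_key(vec, hyperplanes)].append(name)
    return table

# Hypothetical 3-D embeddings.
database = {
    "a.jpg": [1.0, 0.2, 0.0],
    "b.jpg": [2.0, 0.4, 0.0],    # same direction as a.jpg, twice the length
    "c.jpg": [-1.0, -0.2, 0.0],  # opposite direction
}
hyperplanes = make_hyperplanes(num_bits=8, dim=3)
table = build_table(database, hyperplanes)

query = [1.0, 0.2, 0.0]
print(table.get(lsh_key(query, hyperplanes), []))
```

Only the images in the query's bucket need an exact distance check afterwards. Real setups usually use several tables to reduce the chance of missing a true neighbour.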


t1_j34427r wrote

What you’re looking for is embeddings. Take an auto encoder and produce an embedding (latent code) for every image in your database. When you need to query for an image, produce an embedding for that image and use a nearest neighbors algorithm to find the most similar images.


t1_j35vetc wrote

Apple’s “neural hash” algorithm (the one they were using to detect CSAM) does this. People have extracted the model sans weights, so you could use that, and then do a distance calculation on the hash of the query and the hashes in your DB


t1_j36cm8w wrote

Hey, you can use any deep learning framework: take a pre-trained model, remove the top layer, and get a feature vector of 2000 values or so for each image. Then build a matrix with the image names on both the rows and the columns, and store the cosine similarity (or some other similarity score) between each pair of images as the matrix elements.
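A small sketch of such a pairwise matrix, assuming the feature vectors are already available. The names and 3-dimensional vectors are made up. Note that the matrix has one entry per pair of images, so it grows quadratically with the size of the database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(features):
    """Return (names, matrix) where matrix[i][j] compares names[i] with names[j]."""
    names = list(features)
    matrix = [
        [cosine_similarity(features[row], features[col]) for col in names]
        for row in names
    ]
    return names, matrix

# Hypothetical feature vectors keyed by image name.
features = {
    "img_1": [1.0, 0.0, 0.5],
    "img_2": [0.9, 0.1, 0.4],
    "img_3": [0.0, 1.0, 0.0],
}
names, matrix = similarity_matrix(features)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```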


t1_j36jldd wrote

If your matching is based on keypoint matching, you can use SuperGlue, a state-of-the-art deep learning matching model that is very effective for this task.

