This is a really cool idea. I'm currently using the CLIP model for an image retrieval task at university. We're using a Ball Tree to find the images closest to the text in the vector space. What algorithm are you using to find the nearest neighbors?
hermlon t1_j29a939 wrote
Reply to comment by RingoCatKeeper in [P]Run CLIP on your iPhone to Search Photos offline. by RingoCatKeeper
So for every query you go through all the images and compute the cosine similarity between each one and the text?
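
For anyone following along, the brute-force approach being asked about could look roughly like the sketch below. This is only an illustration under assumptions: the function names (`normalize`, `top_k_by_cosine`) and the toy 3-dimensional vectors are made up here and are not taken from the app, and real CLIP embeddings have far more dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k_by_cosine(text_embedding, image_embeddings, k=3):
    """Linear scan: score every image against the text, return the k best.

    Returns a list of (image_index, cosine_similarity), best match first.
    """
    q = normalize(text_embedding)
    scored = []
    for idx, emb in enumerate(image_embeddings):
        e = normalize(emb)
        score = sum(a * b for a, b in zip(q, e))
        scored.append((score, idx))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Hypothetical toy data standing in for precomputed image embeddings.
images = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [1.0, 0.1, 0.0]  # stand-in for the encoded text query

results = top_k_by_cosine(query, images, k=2)
for idx, score in results:
    print(f"image {idx}: cosine similarity {score:.3f}")
```

The cost is one dot product per image per query, so it grows linearly with the size of the photo library. A tree-based index such as a Ball Tree trades some build time for faster lookups, though how much that helps with high-dimensional embeddings depends on the data.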