External_Oven_6379 OP t1_itpo4t5 wrote on October 25, 2022 at 12:08 PM

Reply to comment by Dear-Acanthisitta698 in Combining image and text embedding [P] by External_Oven_6379

I used the pretrained VGG19 for the image. Regarding CLIP, I had the doubts above. I thought the categories were already the densest form of information representation. Can you recommend a model, apart from CLIP?

Dear-Acanthisitta698 t1_itpqu2j wrote on October 25, 2022 at 12:33 PM

I think the problem is concatenating the visual and text features. Since the dimension of the text feature is a lot smaller than that of the visual feature, the text information might get washed out. So you could follow LastVariation's idea (first get the images with the same categories, then search within them) or scale up the text vector (maybe multiply it by 80; this is a hyperparameter).
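
To make the two options concrete, here is a rough sketch in NumPy. Everything named here is made up for illustration (`combine`, `search_within_category`, `text_weight`), the features are assumed to be plain arrays you have already extracted, and cosine similarity is just one possible choice for the search step. The 80 is only the starting guess from above and would need tuning on your data.

```python
import numpy as np

def combine(image_feat, text_feat, text_weight=80.0):
    """Option 2: scale the short text vector up before concatenating.

    text_weight is a hyperparameter (80 is only a starting guess).
    """
    img = np.asarray(image_feat, dtype=np.float64)
    txt = np.asarray(text_feat, dtype=np.float64) * text_weight
    return np.concatenate([img, txt], axis=-1)

def l2_normalize(x, eps=1e-12):
    # Divide each vector by its length so a dot product becomes cosine similarity.
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def search_within_category(query_img, query_cat, image_feats, categories, top_k=5):
    """Option 1: keep only items of the same category, then rank by image similarity.

    Returns indices into the original collection, best match first.
    """
    idx = np.flatnonzero(np.asarray(categories) == query_cat)
    if idx.size == 0:
        return idx  # nothing shares the query's category
    candidates = l2_normalize(np.asarray(image_feats)[idx])
    sims = candidates @ l2_normalize(query_img)
    order = np.argsort(-sims)[:top_k]
    return idx[order]
```

With the second function the categories act as a hard filter, so their small dimension no longer matters. With the first one, the weight controls how much a category mismatch costs relative to a visual difference.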