External_Oven_6379 OP t1_itpo4t5 wrote

I used the pretrained VGG-19 for the images. Regarding CLIP, I had the doubts mentioned above. I thought the categories were already the densest form of information representation. Can you recommend a model other than CLIP?

3

Dear-Acanthisitta698 t1_itpqu2j wrote

I think the problem is concatenating the visual and text features. Since the dimension of the text feature is a lot smaller than that of the visual feature, its information might get washed out. So you could follow LastVariation's idea (first get the images with the same categories, then search within them) or scale up the text vector (maybe multiply it by 80; this is a hyperparameter).
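
Here is a rough sketch of both options in NumPy, in case it helps. The function names (`fuse`, `search_within_category`), the `text_weight` parameter, and the choice of cosine similarity are my own assumptions for illustration, not something from your pipeline; 80 is just the starting guess mentioned above and would need tuning.

```python
import numpy as np

def fuse(visual, text, text_weight=80.0):
    """Concatenate visual and text features, scaling the text part up
    so the much larger visual vector does not drown it out.
    text_weight is a hyperparameter (80 is only a starting guess)."""
    return np.concatenate([visual, text_weight * text], axis=-1)

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def search_within_category(query_visual, query_cat, visuals, cats, k=5):
    """Two-stage search: keep only items that share the query's category,
    then rank those by cosine similarity of their visual features.
    Returns indices into the original arrays, best match first."""
    idx = np.flatnonzero(cats == query_cat)            # stage 1: category filter
    sims = l2_normalize(visuals[idx]) @ l2_normalize(query_visual)
    order = np.argsort(-sims)[:k]                      # stage 2: visual ranking
    return idx[order]
```

With `fuse`, distances between fused vectors are the visual distance plus `text_weight**2` times the text distance (for squared Euclidean), so raising the weight directly raises how much a category mismatch costs. The two-stage version sidesteps the weighting question entirely, at the price of never returning items from a different category.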

4