Dear-Acanthisitta698 t1_itpqu2j wrote on October 25, 2022 at 12:33 PM

Reply to comment by External_Oven_6379 in Combining image and text embedding [P] by External_Oven_6379

I think the problem is in concatenating the visual and text features. Since the dimension of the text feature is a lot smaller than that of the visual feature, the text information might get washed out. So you could follow LastVariation's idea (first get images in the same category, then search within them) or scale up the text vector (maybe multiply it by 80; this is a hyperparameter).
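
A rough sketch of the scaling idea, in case it helps. Everything here is an assumption for illustration: the 2048-d and 32-d sizes, the all-ones vectors, and the helper names `combine` and `text_share` are made up, not taken from your setup. It just shows how much of the concatenated vector's squared norm comes from the text part as the scale grows.

```python
import numpy as np


def combine(image_vec, text_vec, text_scale=80.0):
    """Concatenate an image embedding with a scaled-up text embedding.

    text_scale is a hyperparameter; 80 is only a starting guess.
    """
    image_vec = np.asarray(image_vec, dtype=np.float64)
    text_vec = np.asarray(text_vec, dtype=np.float64)
    return np.concatenate([image_vec, text_scale * text_vec])


def text_share(image_vec, text_vec, text_scale):
    """Fraction of the combined vector's squared norm contributed by the text part."""
    image_vec = np.asarray(image_vec, dtype=np.float64)
    text_vec = np.asarray(text_vec, dtype=np.float64)
    text_energy = np.sum((text_scale * text_vec) ** 2)
    image_energy = np.sum(image_vec ** 2)
    return text_energy / (text_energy + image_energy)


# Hypothetical sizes: a 2048-d image feature and a 32-d text feature.
image_vec = np.ones(2048)
text_vec = np.ones(32)

for scale in (1.0, 8.0, 80.0):
    combined = combine(image_vec, text_vec, scale)
    share = text_share(image_vec, text_vec, scale)
    print(f"scale={scale:5.1f}  dim={combined.shape[0]}  text share={share:.3f}")
```

With real embeddings the right scale depends on the actual norms of both vectors, so it is probably worth sweeping a few values and checking retrieval quality rather than trusting any single number.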