LastVariation t1_itps1fq wrote on October 25, 2022 at 12:43 PM

Reply to comment by External_Oven_6379 in Combining image and text embedding [P] by External_Oven_6379

Re the scale of the one-hot vectors: it's a little hard to say, and it probably depends on your data and task. One option is to scale the one-hot vectors up by sqrt(K), where K is the average cosine similarity of two images with the same label. That way, if you're concatenating the one-hot vector onto a unit-normalised image embedding, sharing a label adds K to the dot product (sqrt(K) * sqrt(K) = K), which is about as much as the image embeddings of two typically similar same-label images contribute. In practice you'd probably want to treat K as a hyperparameter and tune it on some training data.
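Here's a rough sketch of that scaling in NumPy, in case it helps. It assumes the one-hot vector is concatenated onto a unit-normalised image embedding, which is my reading of the setup rather than something stated in the thread. The function names `estimate_k` and `combine` are made up for illustration, and K is clamped at zero so the square root is defined.

```python
import numpy as np

def estimate_k(img_emb, labels):
    """Mean cosine similarity over pairs of distinct images sharing a label."""
    unit = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # ignore each image's similarity with itself
    # Assumes at least one label occurs twice; otherwise the mean is NaN.
    return float(sims[same].mean())

def combine(img_emb, labels, num_classes, k):
    """Concatenate unit image embeddings with one-hot labels scaled by sqrt(k)."""
    unit = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    one_hot = np.eye(num_classes)[labels]
    scale = np.sqrt(max(k, 0.0))  # assumption: clamp negative K to 0
    return np.hstack([unit, scale * one_hot])

# Toy data: 6 random "image embeddings" with 3 labels (illustrative only).
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])

k = estimate_k(emb, labels)  # starting point; tune as a hyperparameter
combined = combine(emb, labels, num_classes=3, k=k)
```

With this layout, the dot product of two combined vectors is the image cosine similarity plus K when the labels match, and just the image cosine similarity when they don't.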

Re CLIP: you can pass categorical labels to the model as raw text and it's decent at interpreting them. I believe it's common practice to make the text a bit more like natural language in that case, so "a photo of a <object>" rather than just "<object>".
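For the prompt wording, something like this would do. It's only a sketch: `label_to_prompt` is a hypothetical helper, the example labels are invented, and replacing underscores with spaces is my own assumption about how the labels might be stored. The resulting strings would then go to whatever CLIP text encoder you're using.

```python
def label_to_prompt(label, template="a photo of a {}"):
    """Turn a categorical label into a more natural-language prompt."""
    # Assumption: labels may use underscores instead of spaces.
    return template.format(label.replace("_", " ").strip())

labels = ["dog", "sports_car"]  # made-up example labels
prompts = [label_to_prompt(label) for label in labels]
```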