Submitted by External_Oven_6379 t3_yd0549 in MachineLearning

I am currently working on a database retrieval framework that takes an image and categorical text data, creates an embedding of each, and calculates the distance from the combined embedding to other known data points. However, my results seem to be off.

So I was wondering, what would be an appropriate way of combining these embeddings?

The details about the embedding:

  • image features are embedded with a pretrained vgg19 model
  • categorical text features are embedded by creating one-hot vectors
  • both embeddings are combined by concatenating the vectors

So in the end, I get a vector that looks like this: [image embedding (1, 8192) + text embedding (1, 137)]

Use of the embeddings:

The embeddings are then used to find the NearestNeighbors by calculating the cosine distance.
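For reference, the setup described above can be sketched roughly as follows. This is only an illustration: the random arrays stand in for real VGG19 features, and the names `image_emb`, `text_emb`, and `cosine_distance_matrix` are made up for this example, not taken from the actual project.

```python
import numpy as np

def cosine_distance_matrix(queries, database):
    """Pairwise cosine distance (1 - cosine similarity) between rows."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    return 1.0 - q @ d.T

rng = np.random.default_rng(0)
n_items, img_dim, n_labels = 5, 8192, 137

# Stand-in for the image features (assumption: random non-negative values).
image_emb = rng.uniform(0.0, 9.7, size=(n_items, img_dim))
labels = rng.integers(0, n_labels, size=n_items)
text_emb = np.eye(n_labels)[labels]            # one-hot rows, shape (5, 137)

# Plain concatenation, as in the post: shape (5, 8192 + 137).
combined = np.concatenate([image_emb, text_emb], axis=1)

dist = cosine_distance_matrix(combined, combined)
np.fill_diagonal(dist, np.inf)                 # ignore self-matches
nearest = dist.argmin(axis=1)                  # nearest neighbor per item
```

Note that each one-hot row contributes a single 1 next to 8192 image values, so in this toy setup the label barely moves the cosine distance.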

Question/Issue:

My question is, would that be an appropriate way of combining features of a sample in n-dimensional space? Are there any other/preferred ways?

30

Comments

LastVariation t1_itpa0b2 wrote

Maybe the distance between two similar images is on a different scale from the distance between two different categorical labels. Using one-hot for the categoricals means two different labels are always a cosine distance of 1 apart. It could be worth looking at the cosine distances between all image embeddings with a given label, and between those embeddings and some average of them, to get a sense of the scale.
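A rough way to run that check, assuming the image embeddings sit in a NumPy array with one row per image. The helper `label_scale_report` and the toy data are hypothetical, and using the mean of the normalized embeddings as "some average" is just one possible choice.

```python
import numpy as np

def unit_rows(x):
    """Scale every row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def label_scale_report(image_emb, labels, label):
    """Mean pairwise cosine distance within one label, and mean cosine
    distance from each embedding to the label's average embedding."""
    group = unit_rows(image_emb[labels == label])
    sims = group @ group.T
    iu = np.triu_indices(len(group), k=1)      # each unordered pair once
    mean_pairwise = float(np.mean(1.0 - sims[iu]))
    centroid = unit_rows(group.mean(axis=0, keepdims=True))
    mean_to_centroid = float(np.mean(1.0 - group @ centroid.T))
    return mean_pairwise, mean_to_centroid

rng = np.random.default_rng(0)
image_emb = rng.uniform(0.0, 9.7, size=(20, 64))   # toy stand-in for real features
labels = np.array([0] * 10 + [1] * 10)
pairwise, to_centroid = label_scale_report(image_emb, labels, label=0)
```

If the within-label distances turn out to be much smaller than 1, a label mismatch and an image mismatch are on very different scales.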

Also one-hot might not be best if the categorical labels aren't actually orthogonal - e.g. you'd expect there to be correlations between images of "cats" and "kittens".

Have you thought about just using something like CLIP for embedding both image and label?

17

Dear-Acanthisitta698 t1_itpckwk wrote

I suggest using a recent pretrained model to extract features. OpenAI's CLIP might be a good starting point.

10

External_Oven_6379 OP t1_itpny3q wrote

Thank you for your input!

I checked the scale of the VGG19 feature embedding. All values are in the range [0, 9.7]. So in that case, should the values of the one-hot vector be either 0 or 9.7?

The labels are textures like floral or leopard. So you are right, they are not necessarily orthogonal, but it's difficult to estimate the correlation among these classes. So one-hot vectors were the most accessible to me.

I have read about CLIP when starting this. My thoughts were that CLIP input consists of images and a text input like an image description, e.g. "Flowers in the middle of a blue floor" (which is not categorical). Could categorical text be used?

4

Dear-Acanthisitta698 t1_itpqu2j wrote

I think the problem is concatenating the visual and text features. Since the dimension of the text feature is a lot smaller than that of the visual feature, its information might be washed out. So you could follow LastVariation's idea (first get the images with the same category, then search within them) or scale up the text vector (maybe multiply it by 80; this is a hyperparameter).
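The "filter by category, then search" option could look something like this. It is a sketch only: `search_within_category` is a made-up helper, and the random array stands in for real image features.

```python
import numpy as np

def search_within_category(query_img, query_label, image_emb, labels, k=3):
    """Indices of the k most similar images (cosine) among items that
    share the query's label."""
    candidates = np.flatnonzero(labels == query_label)
    if candidates.size == 0:
        return candidates                      # no item carries this label
    db = image_emb[candidates]
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    q = query_img / np.linalg.norm(query_img)
    order = np.argsort(-(db @ q))[:k]          # highest similarity first
    return candidates[order]

rng = np.random.default_rng(0)
image_emb = rng.uniform(0.0, 9.7, size=(12, 32))   # toy stand-in
labels = np.array([0, 1, 2] * 4)
hits = search_within_category(image_emb[0], 0, image_emb, labels, k=3)
```

This treats the label as a hard constraint, so an item with a different label can never be returned, which may or may not be what the task needs.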

4

acerb14 t1_itprawa wrote

Have a look at Jina AI. They have really good examples of text and image search combinations.

1

LastVariation t1_itps1fq wrote

Re the scale of the one-hot vectors: it's a little hard to say, and it probably depends on your data and task. Essentially, you could scale the one-hot vectors up by sqrt(K), where K is the average similarity of two images with the same label. That way, sharing a label contributes as much to the similarity as two images being averagely similar for that label. In practice you'd probably want to fit K as a hyperparameter with some training data.
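One possible reading of the sqrt(K) idea, under the extra assumption (not stated in the comment) that the image embeddings are L2-normalized first, so that "similarity" means cosine similarity. The helper name and toy data are made up.

```python
import numpy as np

def mean_same_label_similarity(unit_img, labels):
    """Average cosine similarity over pairs of distinct items sharing a label."""
    sims = unit_img @ unit_img.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)              # drop self-pairs
    return float(sims[same].mean())

rng = np.random.default_rng(0)
n_labels = 3
labels = np.array([0, 1, 2] * 4)
image_emb = rng.uniform(0.0, 9.7, size=(12, 32))   # toy stand-in
unit_img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

K = mean_same_label_similarity(unit_img, labels)
one_hot = np.eye(n_labels)[labels]
combined = np.concatenate([unit_img, np.sqrt(K) * one_hot], axis=1)

# The dot product splits into: image similarity, plus K if the labels match.
dots = combined @ combined.T
```

Every row of `combined` has the same norm, sqrt(1 + K), so ranking by this dot product gives the same order as ranking by cosine similarity.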

Re CLIP: you can input categorical text labels as raw text, and the model is decent at interpreting them. I believe it's common practice to make the text a bit more like natural language in that case, so "a photo of a <object>" rather than just "<object>".

3

killver t1_itpw963 wrote

The easiest way that works well in practice is to just concatenate them. You can also normalize them separately before concatenation. If one embedding has a much smaller dimensionality than the other, you can concatenate the smaller one multiple times to weight them similarly, or apply dimensionality reduction to the larger one beforehand.
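A minimal sketch of "normalize separately, then concatenate". The `text_weight` parameter is an addition for illustration: repeating a block m times has the same effect on dot products as scaling it by sqrt(m), so one weight can stand in for the repetition trick.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def weighted_concat(image_emb, text_emb, text_weight=1.0):
    """L2-normalize each block, scale the text block, then concatenate."""
    return np.concatenate(
        [l2_normalize(image_emb), text_weight * l2_normalize(text_emb)],
        axis=1,
    )

rng = np.random.default_rng(0)
image_emb = rng.uniform(0.0, 9.7, size=(4, 8192))  # toy stand-in
text_emb = np.eye(137)[[0, 0, 1, 2]]               # one-hot labels
combined = weighted_concat(image_emb, text_emb, text_weight=1.0)
```

With `text_weight=1.0` both blocks have unit norm, so image and label each account for half of the squared length of the combined vector, regardless of their dimensionality.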

Another way is to just calculate two similarities separately and then average them (or weighted average).
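As a sketch of that second option: for one-hot labels, the cosine similarity between two label vectors is simply 1 if the labels match and 0 otherwise, so the text similarity reduces to a label comparison. The function name and `image_weight` are hypothetical.

```python
import numpy as np

def combined_similarity(q_img, q_label, image_emb, labels, image_weight=0.5):
    """Weighted average of image cosine similarity and label match (1 or 0)."""
    db = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    img_sim = db @ (q_img / np.linalg.norm(q_img))
    label_sim = (labels == q_label).astype(float)
    return image_weight * img_sim + (1.0 - image_weight) * label_sim

rng = np.random.default_rng(0)
image_emb = rng.uniform(0.0, 9.7, size=(6, 32))    # toy stand-in
labels = np.array([0, 1, 0, 1, 2, 2])
scores = combined_similarity(image_emb[0], 0, image_emb, labels, image_weight=0.5)
ranking = np.argsort(-scores)                      # best match first
```

Keeping the two similarities separate makes the weight easy to tune without re-embedding anything.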

You can take a look at this kaggle competition's solutions for inspiration: https://www.kaggle.com/competitions/shopee-product-matching/discussion

3

NonFocusNorm t1_itq4vts wrote

I believe robust backbone models are crucial, since they are the feature extractors and determine how good your embeddings are. So I suggest using CLIP from OpenAI, a very powerful model that works well for zero-shot learning tasks. I personally use it, and it surprisingly outperformed the others in a text-image retrieval task. I highly recommend you try it out.

3

DigThatData t1_itqbww4 wrote

CLIP is definitely what you want here, and it's unclear to me why you are so convinced that a categorical text representation is an important feature considering you're planning on projecting it to a dense text embedding anyway.

You should really learn about CLIP or at least survey the state of multi-modal representation learning before committing to your current layout.

10

londons_explorer t1_itqixfz wrote

Before concatenating them, I would want to be sure that the mean and variance of both embedding vectors are normalized.
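One way to do that is per-dimension standardization, sketched below. The comment does not say which normalization is meant, so this is an assumption. The statistics are fitted on the database and would be reused for queries; `fit_standardizer` and `apply_standardizer` are made-up names.

```python
import numpy as np

def fit_standardizer(x, eps=1e-8):
    """Per-column mean and standard deviation; eps guards constant columns."""
    mean = x.mean(axis=0)
    std = np.maximum(x.std(axis=0), eps)
    return mean, std

def apply_standardizer(x, mean, std):
    """Shift and scale columns with previously fitted statistics."""
    return (x - mean) / std

rng = np.random.default_rng(0)
image_emb = rng.uniform(0.0, 9.7, size=(50, 16))   # toy stand-in
text_emb = np.eye(4)[rng.integers(0, 3, size=50)]  # label 3 never occurs

img_mean, img_std = fit_standardizer(image_emb)
txt_mean, txt_std = fit_standardizer(text_emb)
img_z = apply_standardizer(image_emb, img_mean, img_std)
txt_z = apply_standardizer(text_emb, txt_mean, txt_std)
combined = np.concatenate([img_z, txt_z], axis=1)
```

A one-hot column for a label that never occurs is constant, so it ends up as all zeros instead of causing a division by zero.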

2

Appropriate_Ant_4629 t1_itr458y wrote

Currently working on the same thing.

I think you'll want to keep them as separate vectors.

The Jina guys had an interesting demo where you could assign different weights to the text-based-vector and the image-based-vector to fine-tune ranking.

2

External_Oven_6379 OP t1_itysdcc wrote

Thank you for your input. Since I'm doing this project by myself, I have no one to bounce ideas off. This is the first time I am getting input from an experienced audience. I don't know exactly when I settled on this architecture, but I remember that I also had OpenAI's CLIP on the table and must have concluded that the approach I described would work better... how wrong I was!

1