
DigThatData wrote

CLIP is definitely what you want here. It's unclear to me why you're so convinced that a categorical text representation is an important feature, considering you're planning to project it to a dense text embedding anyway.
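
To illustrate the point about projecting a categorical representation to a dense embedding, here is a minimal sketch. It assumes "categorical" means a one-hot encoding over a fixed label set, which the thread does not actually specify; the labels, weights, and function names are all made up for the example. Under that assumption, the projection is nothing more than a row lookup in the weight matrix:

```python
# Hypothetical toy setup (not from the thread): 4 category labels
# and a 4 x 3 projection matrix with arbitrary values.
VOCAB = ["shirt", "dress", "shoe", "bag"]
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(vec, weights):
    """Dense projection: multiply a length-n vector by an n x d matrix."""
    dim = len(weights[0])
    return [
        sum(vec[i] * weights[i][j] for i in range(len(vec)))
        for j in range(dim)
    ]


def embed_via_one_hot(category):
    """One-hot encode the category, then project it to a dense vector."""
    idx = VOCAB.index(category)
    return project(one_hot(idx, len(VOCAB)), W)


def embed_via_lookup(category):
    """Skip the one-hot step and read the matching row of W directly."""
    return W[VOCAB.index(category)]


if __name__ == "__main__":
    for cat in VOCAB:
        print(cat, embed_via_one_hot(cat), embed_via_lookup(cat))
```

Both functions return the same vector for every label, so in this toy setup the categorical step adds no information beyond the dense embedding itself. The rows of `W` also start out unrelated to one another, whereas a pretrained text encoder such as CLIP's already places related descriptions near each other.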

You should really learn about CLIP or at least survey the state of multi-modal representation learning before committing to your current layout.

10

External_Oven_6379 OP wrote

Thank you for your input. Since I'm doing this project by myself, I have no one to bounce ideas off. This is the first time I'm getting input from an experienced audience. I don't know exactly when I settled on that architecture, but I remember that I also had OpenAI's CLIP on the table and must have concluded that the approach I described could work better... how wrong I was!

1