DigThatData t1_itqbww4 wrote on October 25, 2022 at 3:08 PM

CLIP is definitely what you want here, and it's unclear to me why you are so convinced that a categorical text representation is an important feature considering you're planning on projecting it to a dense text embedding anyway.

You should really learn about CLIP or at least survey the state of multi-modal representation learning before committing to your current layout.
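
To make the point about the categorical representation concrete, here is a minimal sketch in plain Python. It assumes the "projection" is a single linear layer without bias; the names `one_hot`, `project`, and `weights` are made up for illustration and are not from your code. It shows that projecting a one-hot category vector just selects one row of the weight matrix, so the categorical step is equivalent to a learned embedding lookup and adds nothing on its own.

```python
import random

def one_hot(index, size):
    """Categorical representation: all zeros except a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def project(vector, weights):
    """Linear projection without bias: vector (n) times weights (n x dim)."""
    dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vector, weights)) for j in range(dim)]

random.seed(0)
n_categories, dim = 5, 3

# Hypothetical projection weights; in a real model these would be learned.
weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_categories)]

category = 2
projected = project(one_hot(category, n_categories), weights)
looked_up = weights[category]  # plain embedding lookup, no one-hot needed

# The dense projection of the one-hot vector equals the looked-up row.
same = all(abs(a - b) < 1e-12 for a, b in zip(projected, looked_up))
print(same)
```

Since the two paths give the same vector, you lose nothing by skipping the categorical encoding and going straight to a dense text embedding, which is what CLIP's text encoder gives you.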

External_Oven_6379 OP t1_itysdcc wrote on October 27, 2022 at 8:45 AM

Thank you for your input. Since I am doing this project on my own, I have no one to bounce ideas off. This is the first time I am getting input from an experienced audience. I don't remember exactly when I settled on this architecture, but I do remember that OpenAI's CLIP was also on the table; I must have concluded that the approach I described would work better... how wrong I was!