Submitted by External_Oven_6379 t3_yd0549 in MachineLearning
DigThatData t1_itqbww4 wrote
CLIP is definitely what you want here, and it's unclear to me why you are so convinced that a categorical text representation is an important feature considering you're planning on projecting it to a dense text embedding anyway.
You should really learn about CLIP or at least survey the state of multi-modal representation learning before committing to your current layout.
External_Oven_6379 OP t1_itysdcc wrote
Thank you for your input. Since I am doing this project by myself, I have no one to bounce ideas off. This is the first time I am getting input from an experienced audience. I don't remember exactly when I settled on this architecture, but I did have OpenAI's CLIP on the table and must have concluded that the approach I described would work better... how wrong I was!