visarga t1_j0whrvn wrote
Reply to comment by Tyanuh in Prediction: De-facto Pure AGI is going to be arriving next year. Pessimistically in 3 years. by Ace_Snowlight
DALL-E 1, Flamingo and Gato are like that. It is possible to concatenate the image tokens with the text tokens into one sequence and have the model learn cross-modal inference.
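To make the concatenation idea concrete, here is a rough sketch of how text tokens and image tokens could share one sequence. The vocabulary sizes and the helper names (`build_sequence`, `split_sequence`) are made up for illustration and are not taken from any of these models; the only point is that image codes get shifted past the text vocabulary so a single transformer can treat everything as one token stream.

```python
# Hypothetical sizes, chosen only for this example.
TEXT_VOCAB_SIZE = 1000      # ids 0..999 are text tokens
IMAGE_CODEBOOK_SIZE = 512   # image codes 0..511 from some image tokenizer


def build_sequence(text_ids, image_codes):
    """Concatenate text ids and image codes into one shared-vocabulary sequence."""
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {c}")
    # Shift image codes so they never collide with text ids.
    shifted = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return list(text_ids) + shifted


def split_sequence(sequence):
    """Recover the text ids and the original image codes from a joint sequence."""
    text_ids = [t for t in sequence if t < TEXT_VOCAB_SIZE]
    image_codes = [t - TEXT_VOCAB_SIZE for t in sequence if t >= TEXT_VOCAB_SIZE]
    return text_ids, image_codes


joint = build_sequence([5, 42], [0, 511])
print(joint)
print(split_sequence(joint))
```

A model trained with next-token prediction on sequences like `joint` sees the caption first and the image codes after it, which is one way it can learn to go from text to image.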
Another way is to use a very large collection of text-image pairs and train a pair of models to match the right text to the right image (CLIP).
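As a toy sketch of that matching objective: for a batch of pairs, score every image against every text and push the correct pairs to score highest in both directions. The embeddings below are hand-written stand-ins for encoder outputs, and the temperature of 0.07 is just an assumed constant here, so treat this as an illustration of the idea rather than CLIP's actual training code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image k should match text k, and text k should match image k."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, matched_texts))  # small: pairs line up
print(contrastive_loss(images, swapped_texts))  # large: pairs are mismatched
```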
They both display generalisation. For example, CLIP works as a zero-shot image classifier, which is very convenient, and it can also guide diffusion models to generate images.
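Zero-shot classification with a CLIP-style model roughly comes down to this: embed the image, embed one text prompt per candidate label, and pick the label whose prompt is most similar. The vectors below are toy numbers standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def zero_shot_classify(image_emb, prompt_embs):
    """Return the best-matching prompt and the similarity score of every prompt."""
    scores = {prompt: cosine(image_emb, emb) for prompt, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings; a real setup would get these from the text and image encoders.
prompt_embs = {
    "a photo of a cat": [0.9, 0.1],
    "a photo of a dog": [0.1, 0.9],
}
image_emb = [0.8, 0.2]

label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

No classifier is trained for the new labels; swapping in a different set of prompts is all it takes to change the classes.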
The BLIP model can even generate captions, which has been used to fix low-quality captions in the training set.