visarga t1_j0whrvn wrote on December 19, 2022 at 10:56 PM

Reply to comment by Tyanuh in Prediction: De-facto Pure AGI is going to be arriving next year. Pessimistically in 3 years. by Ace_Snowlight

DALL-E 1, Flamingo and Gato are like that. You can concatenate the image tokens with the text tokens into one sequence and have the model learn cross-modal inference.
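
To make the concatenation idea concrete, here is a minimal sketch. The vocabulary sizes and the helper names (`build_sequence`, `next_token_pairs`) are illustrative assumptions, not taken from any of those models' code. The point is just that image token ids get shifted past the text vocabulary, so a single autoregressive model sees one token stream.

```python
# Illustrative sizes only (assumption): a text tokenizer and an image
# tokenizer (e.g. a discrete VAE codebook) with separate id ranges.
TEXT_VOCAB = 16384
IMAGE_VOCAB = 8192


def build_sequence(text_tokens, image_tokens):
    """Concatenate text and image tokens into one stream with a shared vocabulary."""
    if any(not 0 <= t < TEXT_VOCAB for t in text_tokens):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB for i in image_tokens):
        raise ValueError("image token id out of range")
    # Shift image ids so they never collide with text ids.
    shifted = [TEXT_VOCAB + i for i in image_tokens]
    return list(text_tokens) + shifted


def next_token_pairs(sequence):
    """Inputs and targets for next-token prediction over the joint stream."""
    return sequence[:-1], sequence[1:]


if __name__ == "__main__":
    seq = build_sequence([5, 7], [0, 3])
    print(seq)                    # [5, 7, 16384, 16387]
    print(next_token_pairs(seq))
```

Because the targets for the image positions are conditioned on the text positions before them, training on this stream is what lets the model pick up the text-to-image mapping.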

Another way is to use a very large collection of text-image pairs and train a pair of models to match the right text to the right image (CLIP).
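
A rough sketch of that matching objective, in plain Python so it runs anywhere. This is a toy version of a symmetric contrastive loss, not CLIP's actual training code: the embeddings would really come from an image encoder and a text encoder, and the fixed `temperature=0.07` is an assumption (in practice it is usually a learned parameter).

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: pair i's image should match pair i's text, and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j] = scaled similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(images, matched))  # low: each image sits next to its text
    print(contrastive_loss(images, swapped))  # high: the pairs are mismatched
```

Every other text in the batch acts as a negative for a given image, which is why a very large collection of pairs helps.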

Both approaches generalise. For example, CLIP works as a zero-shot image classifier, which is so convenient, and it can guide diffusion to generate images.
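
Zero-shot classification with a CLIP-style model boils down to something like the sketch below: embed one text prompt per class, embed the image, and pick the closest prompt. The toy vectors and the `zero_shot_classify` helper are made up for illustration. Real embeddings would come from the trained encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs):
    """Return the best label and all scores.

    label_embs maps each label to the text embedding of a prompt for it,
    e.g. "a photo of a cat" (assumed prompt format).
    """
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (assumption).
    label_embs = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    image_emb = [0.9, 0.1, 0.0]
    print(zero_shot_classify(image_emb, label_embs))
```

No classifier head is trained here. Adding a new class only means writing a new prompt, which is where the convenience comes from.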

The BLIP model can even generate captions, which is used to fix low-quality captions in the training set.