Comments

currentscurrents t1_jai5dk2 wrote

Basically all of the text-to-image generators available today are diffusion models based around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers.
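For a concrete view of that architecture, here is a minimal sketch (assuming the Hugging Face `diffusers`/`transformers` libraries and the public `runwayml/stable-diffusion-v1-5` checkpoint, used purely for illustration) that loads a typical pipeline and pushes one denoising step through its convolutional U-Net:

```python
# Minimal sketch: inspect a typical text-to-image pipeline and run a single
# denoising step through its convolutional U-Net. Assumes `torch`, `diffusers`
# and `transformers` are installed and the checkpoint can be downloaded.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.unet).__name__)          # UNet2DConditionModel -- the convolutional U-Net denoiser
print(type(pipe.text_encoder).__name__)  # CLIPTextModel -- the text encoder

# Encode a prompt with the pipeline's text encoder ...
tokens = pipe.tokenizer("an astronaut riding a horse", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        return_tensors="pt")
prompt_embeds = pipe.text_encoder(tokens.input_ids).last_hidden_state

# ... then let the U-Net predict the noise in a random latent, conditioned on it.
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64)
noise_pred = pipe.unet(latents, timestep=999,
                       encoder_hidden_states=prompt_embeds).sample
print(noise_pred.shape)  # same shape as the latent: (1, 4, 64, 64)
```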

There is more variety in the text encoder, which turns out to matter more than the diffusion backbone itself. CLIP is very popular, but large language model encoders like T5 show better performance and are probably the future.
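As a rough illustration of that split (the checkpoint names below are small public models chosen only for illustration; Imagen-style systems actually condition on much larger T5 variants), both encoder families can be loaded from `transformers` and used as drop-in sources of conditioning embeddings:

```python
# Minimal sketch: the two text-encoder families used to condition the diffusion model.
# Checkpoint names are small public models chosen for illustration only.
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, AutoTokenizer

prompt = "a watercolor painting of a fox in the snow"

# CLIP text tower (the encoder Stable Diffusion 1.x conditions on)
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state
print("CLIP:", clip_emb.shape)  # (1, tokens, 768)

# T5 encoder (the family Imagen-style models condition on, at much larger scale)
t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base")
t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state
print("T5:  ", t5_emb.shape)  # (1, tokens, 768)

# Either way, the diffusion U-Net just receives a (batch, tokens, dim) tensor
# through cross-attention; what differs is the quality of those embeddings.
```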
