Submitted by AImSamy t3_11f637p in MachineLearning
There are dozens of companies offering text-to-image models: NightCafe, Dream by WOMBO, DALL-E 2, Midjourney, Stable Diffusion (Stability AI), etc.
What are the different architectures they use? Or do they differ only in their training datasets?
currentscurrents t1_jai5dk2 wrote
Basically all of the text-to-image generators available today are diffusion models based around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers.
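To make that concrete, here's a minimal sketch of the denoising U-Net at the core of Stable Diffusion, using the Hugging Face `diffusers` library. The repo id and tensor shapes below are for the public v1.5 checkpoint and are illustrative; other models differ in detail:

```python
# Minimal sketch, assuming the `diffusers` library and the public
# Stable Diffusion v1.5 weights (repo id is illustrative).
import torch
from diffusers import UNet2DConditionModel

# The denoising backbone is a convolutional U-Net conditioned on
# text embeddings through cross-attention layers.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# One denoising step: predict the noise in a latent, given a timestep
# and the text-encoder output (random placeholders here, just to show shapes).
latents = torch.randn(1, 4, 64, 64)        # 64x64 latent for a 512x512 image
timestep = torch.tensor([10])
text_embeddings = torch.randn(1, 77, 768)  # 77 CLIP tokens, 768-dim each
noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```

The sampler calls this U-Net repeatedly, subtracting a bit of predicted noise each step until a clean latent remains.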
There is more variety in the text encoder, which turns out to matter more than the diffusion backbone itself. CLIP is very popular, but large language models like T5 show better performance and are probably the future.
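For example, here's roughly what the two kinds of text encoders look like with the `transformers` library. Stable Diffusion v1.x uses a CLIP ViT-L/14 text encoder, while Imagen-style models swap in a frozen T5 encoder; either way, the encoder just maps a prompt to a sequence of embeddings that the diffusion model cross-attends over. The T5 checkpoint below is a small stand-in, not the one any production model actually uses:

```python
# Minimal sketch, assuming the `transformers` library.
from transformers import CLIPTokenizer, CLIPTextModel, T5Tokenizer, T5EncoderModel

prompt = "an astronaut riding a horse"

# CLIP text encoder (the one used by Stable Diffusion v1.x).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
clip_inputs = clip_tok(prompt, padding="max_length", max_length=77,
                       return_tensors="pt")
clip_emb = clip_enc(**clip_inputs).last_hidden_state
print(clip_emb.shape)  # torch.Size([1, 77, 768])

# T5 encoder (the style of text encoder Imagen uses; Imagen itself
# uses the much larger T5-XXL, frozen during training).
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-base")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base")
t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state
print(t5_emb.shape)  # e.g. torch.Size([1, 8, 768]), length varies with the prompt
```

Because the encoder output is just a sequence of vectors fed into cross-attention, swapping CLIP for a bigger language model is architecturally cheap, which is part of why T5-style encoders are gaining ground.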