Comments


currentscurrents t1_jai5dk2 wrote

Basically all of the text-to-image generators available today are diffusion models based around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers.

There is more variety in the text encoder, which turns out to be more important than the diffuser. CLIP is very popular, but large language models like T5 show better performance and are probably the future.
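To make the "text encoder conditions the diffuser" idea concrete, here's a toy NumPy sketch of the cross-attention path through which prompt embeddings steer a U-Net-style denoiser. All names, shapes, and the lookup-table "encoder" are made up for illustration; real pipelines use a pretrained CLIP or T5 encoder here.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_text_encoder(tokens, d=64):
    """Stand-in for CLIP/T5: maps token ids to a sequence of embeddings."""
    table = rng.standard_normal((1000, d))   # fake embedding table
    return table[tokens]                     # (seq_len, d)

def cross_attention(image_feats, text_feats):
    """Image features attend over text embeddings (the conditioning path)."""
    scores = image_feats @ text_feats.T / np.sqrt(text_feats.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ text_feats              # (n_pixels, d)

# One (greatly simplified) denoising step: the U-Net's noise prediction
# is conditioned on the prompt via cross-attention like this.
tokens = np.array([5, 42, 7])                # pretend this is "a cat photo"
text_feats = toy_text_encoder(tokens)        # (3, 64)
noisy_latent = rng.standard_normal((16, 64)) # 16 "pixels", 64 channels
conditioned = cross_attention(noisy_latent, text_feats)
```

Swapping CLIP for T5 only changes what `text_feats` looks like, which is why the encoder can be upgraded somewhat independently of the diffuser.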

6

ninjasaid13 t1_jaj678q wrote

>T5

But isn't it much heavier?

2

currentscurrents t1_jaj8jze wrote

Yup. But in neural networks, bigger is better!
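For a sense of "heavier": a back-of-envelope comparison using roughly the publicly reported sizes (treat both numbers as approximate) of Stable Diffusion's CLIP text encoder versus the T5-XXL encoder used by Imagen and DeepFloyd IF.

```python
# Rough, publicly reported parameter counts (approximate, not exact):
encoders = {
    "CLIP ViT-L/14 text encoder (Stable Diffusion)": 123e6,
    "T5-XXL encoder (Imagen / DeepFloyd IF)": 4.7e9,
}

for name, params in encoders.items():
    gib = params * 2 / 2**30   # fp16: 2 bytes per parameter, weights only
    print(f"{name}: ~{params / 1e9:.2f}B params, ~{gib:.1f} GiB in fp16")
```

So the T5-XXL encoder alone is on the order of 30-40x the size of the CLIP text encoder, before counting activations or the diffuser itself.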

1

ninjasaid13 t1_jajamez wrote

But in industry, don't we want things to be cheap? Cost might be a bigger factor than performance.

1

currentscurrents t1_jajh007 wrote

That's always a balance you'll have to make. You can only run what fits on your available hardware.
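A quick sanity check along those lines: will the weights even fit in VRAM? Here's an illustrative helper; the 20% overhead factor is a made-up placeholder for activations and buffers, not a measured number.

```python
def fits_in_vram(n_params, vram_gib, bytes_per_param=2, overhead=1.2):
    """Back-of-envelope: weight memory (plus an assumed ~20% overhead
    for activations/buffers) compared against available VRAM.
    bytes_per_param=2 corresponds to fp16 weights."""
    need_gib = n_params * bytes_per_param * overhead / 2**30
    return need_gib <= vram_gib

# A T5-XXL-sized encoder (~4.7B params) in fp16:
fits_in_vram(4.7e9, 24)   # comfortable on a 24 GiB card
fits_in_vram(4.7e9, 8)    # too big for an 8 GiB card
```

Quantizing to 8-bit or 4-bit (smaller `bytes_per_param`) is the usual escape hatch when the answer is no.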

1

ninjasaid13 t1_jal8x3x wrote

>You can only run what fits on your available hardware.

Precisely.

1

AImSamy OP t1_jbacun0 wrote

Thanks a lot for the reply.
Do you have documentation for that?

1