Submitted by AImSamy t3_11f637p in MachineLearning
currentscurrents t1_jai5dk2 wrote
Basically all of the text-to-image generators available today are diffusion models built around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers instead.
There's more variety in the text encoder, which turns out to be more important than the diffusion backbone. CLIP is very popular, but large language models like T5 show better performance and are probably the future.
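Roughly, the conditioning path looks like this (a minimal sketch assuming the Hugging Face `transformers` API; the checkpoint name is just an example, and the U-Net call at the end is pseudocode since the exact signature depends on the diffusion library):

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Sketch only: encode a prompt with T5 and get the per-token embeddings
# that a diffusion U-Net would cross-attend to. "t5-large" is an
# illustrative checkpoint, not the one any particular generator uses.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large")

tokens = tokenizer("a corgi riding a skateboard", return_tensors="pt")
with torch.no_grad():
    text_emb = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

# Pseudocode for the denoising step -- the U-Net conditions on text_emb
# via cross-attention at every timestep:
# noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb)
```

The U-Net just cross-attends to whatever embeddings you hand it, which is why the text encoder can be swapped out independently of the diffusion backbone.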
ninjasaid13 t1_jaj678q wrote
>T5
but isn't it much heavier?
currentscurrents t1_jaj8jze wrote
Yup. But in neural networks, bigger is better!
ninjasaid13 t1_jajamez wrote
but in industry, don't we want things to be cheap? Cost might be a bigger factor than performance.
currentscurrents t1_jajh007 wrote
That's always a balance you'll have to make. You can only run what fits on your available hardware.
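Back-of-envelope for the weights alone (parameter counts are the commonly cited approximate sizes; this ignores activations and the diffusion model itself):

```python
# Rough fp16 weight memory (2 bytes/param); counts are approximate
# published sizes, not measured.
encoders = {
    "CLIP ViT-L/14 text encoder": 123e6,
    "T5-XXL (full 11B model)": 11e9,
}
for name, n_params in encoders.items():
    print(f"{name}: ~{n_params * 2 / 1e9:.1f} GB in fp16")
# CLIP ViT-L/14 text encoder: ~0.2 GB in fp16
# T5-XXL (full 11B model): ~22.0 GB in fp16
```

That's roughly two orders of magnitude apart before you even load the diffusion model.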
ninjasaid13 t1_jal8x3x wrote
>You can only run what fits on your available hardware.
Precisely.
bjergerk1ng t1_jakt9hi wrote
Source about Google using ViT?
xEdwin23x t1_jals78j wrote
I'm guessing he's referring to this one: https://parti.research.google/
bjergerk1ng t1_jalzy46 wrote
That's not diffusion though; Parti is an autoregressive transformer.
AImSamy OP t1_jbacun0 wrote
Thanks a lot for the reply.
Do you have documentation for that?