
currentscurrents t1_jai5dk2 wrote

Basically all of the text-to-image generators available today are diffusion models based around convolutional U-Nets. Google has an (unreleased) one that uses vision transformers.

There is more variety in the text encoder, which turns out to be more important than the diffuser. CLIP is very popular, but large language models like T5 show better performance and are probably the future.

6

ninjasaid13 t1_jaj678q wrote

>T5

but isn't it much heavier?

2

currentscurrents t1_jaj8jze wrote

Yup. But in neural networks, bigger is better!

1

ninjasaid13 t1_jajamez wrote

but in industry, don't we want things to be cheap? cost might be a bigger factor than performance.

1

currentscurrents t1_jajh007 wrote

That's always a balance you'll have to strike. You can only run what fits on your available hardware.
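
A rough back-of-envelope sketch of that "does it fit" check, assuming the weights dominate memory use (activations, the diffusion model itself, and framework overhead are ignored). The parameter counts and the 8 GiB budget below are hypothetical round numbers for illustration, not measurements of CLIP or T5.

```python
# Bytes needed per parameter for common weight dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float16") -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits(num_params: int, vram_gib: float, dtype: str = "float16") -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_memory_gib(num_params, dtype) <= vram_gib


# Hypothetical parameter counts, for illustration only.
small_encoder = 100_000_000      # a text encoder on the order of 0.1B parameters
large_encoder = 5_000_000_000    # a text encoder on the order of 5B parameters

for name, n in [("small", small_encoder), ("large", large_encoder)]:
    size = weight_memory_gib(n)
    print(f"{name}: {size:.2f} GiB in float16, fits in 8 GiB: {fits(n, 8.0)}")
```

Dropping to a smaller dtype such as int8 halves the float16 figure again, which is one common way to trade some quality for a model that fits.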

1

ninjasaid13 t1_jal8x3x wrote

>You can only run what fits on your available hardware.

Precisely.

1

AImSamy OP t1_jbacun0 wrote

Thanks a lot for the reply.
Do you have documentation for that?

1