SuperSpaceEye t1_iwhjfk1 wrote on November 15, 2022 at 6:10 PM

Well, if you want to generate coherent text you need a fairly large model, because logical and writing errors are easy to spot and smaller models produce artifacts that ruin the quality of the output. The same goes for music, since we are quite sensitive to small inaccuracies. Images, on the other hand, can have "large" errors and still be beautiful to look at. Images can also vary widely in textures, backgrounds, etc., which makes it easier for a model to produce a "good enough" picture. That doesn't work for text or audio, so image models can get away with being much smaller.

Jordan117 OP t1_iwhs5cw wrote on November 15, 2022 at 7:05 PM

Is there a reason the language model part of image diffusion requires a lot less horsepower than running a language model by itself? I'm still amazed SD works quickly on my 2016-era PC, but apparently something like GPT-J requires dozens or hundreds of GB of memory to even store. Is it the difference between generating new text vs. working with existing text?

SuperSpaceEye t1_iwht6hf wrote on November 15, 2022 at 7:12 PM

Two different tasks. The language model in SD just encodes the text into an abstract representation that the diffusion part of the model then uses. A text-to-text model such as GPT-J does a different task, which is much harder. Also, GPT-J has 6B parameters, which only takes about 12GB of VRAM at 16-bit precision, not hundreds.
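
As a rough sanity check on that 12GB figure, here is a back-of-the-envelope sketch. It assumes the weights are stored at 2 bytes per parameter (fp16) and counts only the weights themselves, not activations or other runtime overhead. The helper name `weight_memory_gb` and the rounded 6e9 parameter count are just for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# "6B parameters" as quoted above, rounded (assumption for illustration).
GPT_J_PARAMS = 6e9

fp16_gb = weight_memory_gb(GPT_J_PARAMS, 2)  # 16-bit floats: 2 bytes each
fp32_gb = weight_memory_gb(GPT_J_PARAMS, 4)  # 32-bit floats: 4 bytes each

print(f"fp16 weights: {fp16_gb:.0f} GB")
print(f"fp32 weights: {fp32_gb:.0f} GB")
```

Storing the same weights in 32-bit floats would double that, which is still tens of GB rather than hundreds.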

Jordan117 OP t1_iwhtnxu wrote on November 15, 2022 at 7:15 PM

Thanks for the clarification, I must have misread an older post talking about CPU memory requirements instead of GPU.