Submitted by Jordan117 t3_yw1uxc in singularity
When it comes to digital media file size, generally speaking text < images < audio < video. This seems to reflect the typical "information density" of each medium (alphanumeric text vs. still image vs. waveform vs. moving image). Processing large amounts of text is lightning-fast, while video usually takes much longer because there's just more there there. Etc.
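As a rough back-of-the-envelope check on that ordering, here's a small sketch comparing approximate raw (uncompressed) data per minute of each medium. The specific figures (reading speed, resolution, sample rate, frame rate) are illustrative assumptions, not measurements:

    # Back-of-the-envelope raw data sizes for one minute of each medium.
    # All constants are illustrative assumptions, not measurements.

    def mbytes(n_bytes):
        return n_bytes / 1e6

    # Text: ~200 words per minute, ~6 bytes per word including spaces.
    text = 200 * 6

    # Still image: one 1920x1080 RGB frame, 3 bytes per pixel.
    image = 1920 * 1080 * 3

    # Audio: CD quality, 44,100 samples/s, 16-bit, stereo, for 60 s.
    audio = 44_100 * 2 * 2 * 60

    # Video: 1080p at 30 fps for 60 s, uncompressed RGB.
    video = image * 30 * 60

    for name, size in [("text", text), ("image", image),
                       ("audio", audio), ("video", video)]:
        print(f"{name:>5}: ~{mbytes(size):10.3f} MB (image = single frame)")

Even with generous assumptions for text, the ordering text < image < audio < video falls out by several orders of magnitude.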
But in terms of AI media synthesis, the compute times seem really out of whack. A desktop PC with an older consumer graphics card can generate a high quality Stable Diffusion image in under a minute, but generating a 30-second OpenAI Jukebox clip takes many hours on the best Colab-powered GPUs, while decent text-based LLMs are difficult-to-impossible to run locally. What factors explain the disparity? Can we expect the relative difficulty of generating text/audio/images/video to hew closer to what you'd expect as the systems are refined?
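One way to see where the disparity comes from is to count sequential forward passes rather than output bytes, since sequential steps, not file size, dominate wall-clock time. The sketch below uses approximate numbers: ~50 denoising steps for a latent-diffusion image, a few hundred tokens for a paragraph of text, and roughly 345 VQ-VAE codes per second at Jukebox's coarsest level (44.1 kHz divided by its 128x compression factor), before the upsampling stages:

    # Rough count of *sequential* model forward passes per generated output.
    # All constants are approximations for illustration only.

    # Stable Diffusion: the whole 64x64x4 latent is denoised in parallel,
    # so the sequential cost is roughly the number of sampler steps.
    sd_steps = 50

    # Text LLM: one forward pass per generated token; say ~300 for a paragraph.
    llm_steps = 300

    # Jukebox-style autoregressive audio: one forward pass per VQ-VAE code.
    # Coarsest level is ~345 codes/s; the two upsampling levels add far more.
    seconds = 30
    jukebox_top_steps = int(345 * seconds)
    jukebox_all_levels = int((345 + 1378 + 5512) * seconds)

    print(f"image (diffusion):       ~{sd_steps} sequential steps")
    print(f"text  (paragraph):       ~{llm_steps} sequential steps")
    print(f"audio (30s, top prior):  ~{jukebox_top_steps} sequential steps")
    print(f"audio (30s, all levels): ~{jukebox_all_levels} sequential steps")

Each of those audio steps also runs through a multi-billion-parameter prior, which helps explain why 30 seconds of audio can take hours even on a good GPU.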
SuperSpaceEye t1_iwhjfk1 wrote
Well, if you want to generate coherent text you need quite a large model, because smaller models produce artifacts (logical and writing errors) that are easy to spot and ruin the quality of the output. The same goes for music, since we're very sensitive to small inaccuracies. Images, on the other hand, can contain "large" errors and still be beautiful to look at. Images also allow big variations in textures, backgrounds, etc., which makes it easier for a model to produce a "good enough" picture with far fewer parameters; that shortcut doesn't work for text or audio.
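For a sense of scale, rough publicly reported parameter counts line up with that argument; the figures below are approximate and the model choices are just illustrative:

    # Approximate publicly reported parameter counts (order of magnitude only).
    approx_params = {
        "Stable Diffusion v1 (U-Net + VAE + text encoder)": 1.0e9,
        "Jukebox top-level prior":                          5.0e9,
        "GPT-3 (davinci)":                                  175.0e9,
    }

    for model, n in approx_params.items():
        print(f"{model:<50} ~{n / 1e9:6.0f}B parameters")

A model that only has to produce a plausible-looking image can get away with roughly two orders of magnitude fewer parameters than one that has to stay logically consistent over a page of text.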