Submitted by dahdarknite t3_10r5gku in MachineLearning
Stable Diffusion seems to be a departure from the trend of building larger and larger models.
It has roughly 10x fewer parameters than other image-generation models like DALL-E 2.
What allows Stable Diffusion to work so well with far fewer parameters? Are there any drawbacks to this, like Stable Diffusion requiring more fine-tuning than DALL-E 2, for example?
LetterRip t1_j6v57y5 wrote
Mostly the language model. Imagen uses T5-XXL (a 4.6-billion-parameter text encoder), and DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger variants used for ChatGPT). SD just uses CLIP's text encoder, nothing else. The more sophisticated the language model, the better the image generator can understand what you want. CLIP is close to a bag-of-words model.
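For a concrete sense of the size gap, here's a minimal sketch using the Hugging Face `transformers` library. It loads the CLIP text encoder that SD v1.x conditions on (the `openai/clip-vit-large-patch14` checkpoint) and counts its parameters; the encoding step shows the conditioning tensor the rest of the pipeline consumes. This is an illustration of the parameter comparison above, not part of either model's training code:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder Stable Diffusion v1.x conditions on.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Per-token hidden states: this is what SD's UNet cross-attends to.
tokens = tokenizer("a photo of an astronaut riding a horse",
                   padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    hidden = text_encoder(**tokens).last_hidden_state  # shape (1, 77, 768)

n_params = sum(p.numel() for p in text_encoder.parameters())
print(f"CLIP text encoder: {n_params / 1e6:.0f}M parameters")  # ~123M
print(f"conditioning tensor shape: {tuple(hidden.shape)}")
```

The CLIP text encoder comes in at roughly 123M parameters, versus the ~4.6B-parameter T5-XXL encoder cited above for Imagen, so the choice of text encoder alone accounts for a large chunk of the headline parameter difference.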