Submitted by minimaxir t3_z733uy in MachineLearning

I just published a blog post with many academic experiments on getting good results from Stable Diffusion 2.0, showing that negative prompts are the key with its new text encoder:

https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/

I also released Colab Notebooks to reproduce the workflow and use the negative embeddings yourself (links in a comment due to antispam filters for too many URLs).
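
For anyone who just wants the basic mechanics rather than the full notebooks, here is a minimal sketch of passing a negative prompt to Stable Diffusion 2.0 with the Hugging Face diffusers library. The model ID, prompts, and settings are illustrative assumptions, not the post's exact workflow.

```python
# Minimal sketch: Stable Diffusion 2.0 with a negative prompt via diffusers.
# Model ID, prompts, and sampler settings are illustrative, not the blog's setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",  # SD 2.0 checkpoint on the Hugging Face Hub
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="portrait photo of an astronaut, studio lighting",
    negative_prompt="lowres, text, error, blurry, ugly, jpeg artifacts",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("astronaut_negative_prompt.png")
```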

182

Comments

sam__izdat t1_iy5dmp7 wrote

You may get generally better results if you remove the nonsense from the embedding, like "too many fingers" and "bad anatomy." It made some people on /r/StableDiffusion very angry, but I ran a comparison for those (several, actually), and it went exactly as expected. Some of the words in the original embedding (e.g. lowres, text, error, blurry, ugly, etc) are probably doing something like what was intended. Most of the rest are a superstitious warding ritual.
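
A comparison like that is straightforward to reproduce: hold the prompt and the seed fixed and swap only the negative prompt, so any difference in the outputs comes from the negative prompt alone. A rough sketch with diffusers, where the model ID, prompts, and word lists are just illustrative:

```python
# Sketch: same prompt and seed, two negative prompts, to compare their effect.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

prompt = "photo of a person waving at the camera"
negatives = {
    "full": "too many fingers, bad anatomy, lowres, text, error, blurry, ugly",
    "trimmed": "lowres, text, error, blurry, ugly",
}

for name, neg in negatives.items():
    generator = torch.Generator(device="cuda").manual_seed(1234)  # same seed per run
    image = pipe(prompt, negative_prompt=neg, generator=generator).images[0]
    image.save(f"compare_{name}.png")
```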

55

ReginaldIII t1_iy5li80 wrote

That is both hilarious and a really interesting result. Thanks for sharing this.

How did you search for that improved negative prompt?

It would be interesting to see a third column for the same prompts/seeds without any negative prompt as a baseline.

15

sam__izdat t1_iy5lq3w wrote

Just typed in literally the first nonsense that came to mind. I doubt there's anything special about it. I imagine a string of random characters will have roughly the same effect.

To be clear, I was just comparing the finger situation. Words like "ugly" do seem to have an effect -- i.e. in my limited testing it seems to smooth out faces, remove blemishes and wrinkles, and generally make people look a little more like headshots of supermodels.

18

ReginaldIII t1_iy5motj wrote

What metric are you using to say this is an improved prompt? I think it's fair to say it is somewhat comparable but I think you'd need a set of metrics to define an improvement.

For example, the proportion of N generated images in which the hands are correct, or a comparative user study where participants see the image pairs side by side (randomly swapped) and choose which they prefer.
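
For the first of those, the bookkeeping is simple: label each of N generations per condition as hands-correct or not and compare the proportions, ideally with an interval so a small N doesn't overstate the difference. A minimal sketch with hypothetical counts:

```python
# Sketch: compare "hands correct" rates between two negative-prompt conditions.
# Counts are hypothetical; the interval is a normal approximation.
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Point estimate and approximate 95% confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

for name, correct, n in [("original negative prompt", 31, 100),
                         ("nonsense negative prompt", 29, 100)]:
    p, lo, hi = proportion_ci(correct, n)
    print(f"{name}: {p:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```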

And it definitely needs a comparison to a baseline of no negative prompt.

It will also be interesting to see if this still applies to SD 2 since it uses a different language model.

3

sam__izdat t1_iy5ms2r wrote

> What metric are you using to say this is an improved prompt?

It isn't an improved prompt. It was just a silly joke and a spoof on a research paper. Like you say, I think they are comparable -- that is, equally useless for correcting terrible hands. At least on the base model. I don't know how the anime ones are trained, so maybe that's different if someone actually went and captioned anatomical errors.

9

ReginaldIII t1_iy5nmcv wrote

I hate that I'm going to be "that guy", but it's not obvious enough that it's just a joke, because it does actually produce reasonably similar results. "Improved" is, at least from this, somewhat plausible, so I'd be careful saying it: you don't actually mean it seriously, but that isn't clear.

You'd have been dunking on them just as well if you'd said a bullshit random prompt performs comparatively.

2

sam__izdat t1_iy5nsm3 wrote

It wasn't my intention to deceive anyone. I thought it was pretty clear that this is humor and not serious research.

6

JanssonsFrestelse t1_iy73zgf wrote

Should have used the negative prompt "a bullshit random prompt that performs comparatively"

2

sam__izdat t1_iy5q0d5 wrote

> It would be interesting to see a third column for the same prompts/seeds without any negative prompt as a baseline.

Oh, and I don't have any with three columns, but here's one of the "too many fingers" prompt vs no negative prompt. Apologies for the lazy layout.

2

ReginaldIII t1_iy5si81 wrote

I do think there's something interesting here: the presence of a negative prompt does seem beneficial.

I wonder if having "any" negative prompt is almost taking up some of the "slack" in the latent manifold. A better-defined negative prompt might run into diminishing returns with regard to quality, but it does seem able to significantly influence the style, colour palette, and composition of the images.
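
For what it's worth, in diffusers-style pipelines the negative prompt's embedding simply takes the place of the empty-string "unconditional" embedding in classifier-free guidance, so every sampling step gets pushed away from it. A toy sketch of that combination step, with dummy tensors standing in for the UNet's predictions:

```python
# Toy illustration of classifier-free guidance with a negative prompt.
# Real pipelines get these from the UNet, conditioned on the prompt embedding
# and on the negative-prompt embedding (or "" when no negative prompt is given).
import torch

guidance_scale = 7.5
noise_pred_text = torch.randn(1, 4, 64, 64)      # prediction conditioned on the prompt
noise_pred_negative = torch.randn(1, 4, 64, 64)  # prediction conditioned on the negative prompt

# Start from the negative/unconditional branch and move away from it,
# toward the prompt-conditioned branch.
noise_pred = noise_pred_negative + guidance_scale * (noise_pred_text - noise_pred_negative)
```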

1

sam__izdat t1_iy5swts wrote

Fingers aside, I don't see much improvement, but if there is any -- and I am only guessing -- I reckon "blurry" and "ugly" are pulling a lot of weight. If you do something like:

> ugly, hands, blurry, low resolution, lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, long neck [etc]

- it will definitely have a pronounced effect. Is it the one you want? Well - maybe, maybe not. But it does seem to make things more professional-looking and the subjects more conventionally attractive. It'll also try to obscure hands completely, which is probably the right call all things considered.

And on top of that there's also the blue car effect. It's entirely possible that putting in "close up photo of a plate of food, potatoes, meat stew, green beans, meatballs, indian women dressed in traditional red clothing, a red rug, donald trump, naked people kissing" will amplify some of what you want and cut out some of what's (presumably) a bunch of irrelevant or low-quality SEO spam. Here's somebody's hypothesis on what might be happening.

1

ReginaldIII t1_iy5w27q wrote

For the images in the blue car post, I'd argue that while the cars themselves reached good fidelity and stopped improving, the backgrounds kept improving and grounded the cars in their scenes better.

I think because this is treading into human subjective perception and aesthetic and compositional preferences, this sort of idea can only be tested by a wide scale blind comparative user study.

Similar to how such studies are conducted in lossy compression research.

> It's entirely possible that putting in "close up photo of a plate of food, potatoes, meat stew, green beans, meatballs, indian women dressed in traditional red clothing, a red rug, donald trump, naked people kissing" will amplify some of what you want and cut out some of what's (presumably) a bunch of irrelevant or low-quality SEO spam.

I think the nature of the datasets and language models is always going to mean you need a specialized negative prompt, tailored to where your image is located in the latent space, to tune that image to its optimum output for whatever composition you are aiming for. It lets you nudge it around. How much wiggle room that area of the latent manifold has to give for variation will vary greatly.

1

astrange t1_iy5mm2j wrote

Yeah, "bad anatomy" and things like that come from NovelAI because its dataset has images literally tagged with that. It doesn't work on other models.

SD is scraped off the internet, so something that might work is negative keywords associated with websites whose images you don't like. Like "zillow", "clipart", "coindesk", etc.

Or try clip-interrogator or textual inversion against bad looking images (but IMO clip-interrogator doesn't work very well yet either).
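
Roughly like this, if I have the clip-interrogator API right -- treat the class names and the CLIP model string as assumptions and check its README:

```python
# Sketch: run clip-interrogator on a bad-looking output to see what text CLIP
# associates with it; those terms are candidates for a negative prompt.
# The Config/Interrogator interface and model name are assumptions.
from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
bad_image = Image.open("bad_hands_example.png").convert("RGB")
print(ci.interrogate(bad_image))  # caption plus style/site-like tags
```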

10

sam__izdat t1_iy5myz9 wrote

> from NovelAI because its dataset has images literally tagged with that

That makes a lot more sense now, thanks. I thought they were also just using LAION 5B or some subset.

2

Jonno_FTW t1_iy6b2uc wrote

> superstitious warding ritual

I prefer "cargo cult prompting"

3

hadaev t1_iy4hp4q wrote

So it's all im2im?

3

DigThatData t1_iy5j8en wrote

actually it's all text2im, but "text" includes some custom learned tokens.
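
i.e. something along these lines with diffusers, where the learned token stands in for a long negative word list. The embedding file and token name below are placeholders, not the actual ones from the post:

```python
# Sketch: load a textual-inversion ("negative") embedding and use its token
# in the negative prompt. File name and token are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("negative_embedding.bin", token="<bad-prompt>")

image = pipe(
    prompt="portrait photo of a woman in a park",
    negative_prompt="<bad-prompt>",  # the learned token replaces a long word list
).images[0]
image.save("with_learned_negative.png")
```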

3

shadowknight094 t1_iy69mv8 wrote

Does anyone know how much SD 2 cost to train? Also, is their model open sourced? Like, can I take the code and train it myself, assuming I had the money and resources?

3

ThatInternetGuy t1_iy777n6 wrote

About $500,000.

However, if the dataset was carefully filtered, you could bring the cost down to $120,000.

Most people can only afford to finetune it with less than 10 hours of A100, which would cost less than $50. This approach is probably better for most people.
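
The arithmetic behind numbers like these is just GPU-hours times an hourly rate; a rough sketch with assumed values (not figures from the model card):

```python
# Back-of-the-envelope GPU cost: gpu_hours * hourly rate.
# The rate and hour counts are illustrative assumptions.
A100_HOURLY_RATE = 2.50  # USD per A100-hour, assumed blended cloud rate

def gpu_cost(gpu_hours: float, rate: float = A100_HOURLY_RATE) -> float:
    return gpu_hours * rate

print(gpu_cost(200_000))  # ~$500k at the assumed rate, the order of magnitude of full training
print(gpu_cost(10))       # ~$25, consistent with "less than $50" for a short fine-tune
```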

7

sam__izdat t1_iy6o4z0 wrote

  1. They had 4,000 A100s chewing on it, toward the end. I think it's 5,000 now. You can probably do the math from the info in the model card to figure out how much that is in power bills.

  2. It's licensed under RAIL-M. It is questionable whether this licensing has any legal basis because it's unclear whether models themselves are copyrightable. They allow permissive sublicensing with their inference code. You'll have to look at the wording of the license to see how this is reconciled with RAIL-M's usage-based restrictions.

  3. Yes. You can finetune it cheaply and pretty quickly (maybe an hour or two or even less, depending on GPU and settings) with DreamBooth. Retraining a general-purpose model from scratch is probably out of the reach of most people. There is some code available for training from scratch, though, and a special-purpose model might be doable without millions in resources. I think there's been one or two of those, if I'm not mistaken.

5

DigThatData t1_iy5j566 wrote

excellent work, thanks for digging so deeply into this phenomenon and writing up!

2