Submitted by cloneofsimo t3_zfkqjh in MachineLearning
TLDR: People use DreamBooth or textual inversion to fine-tune their own Stable Diffusion models. There is a better way: use LoRA to fine-tune twice as fast, with the end result being less than 4MB. A dedicated CLI, package, and pre-trained models are available at https://github.com/cloneofsimo/lora
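For a taste of how the package is used, here is roughly how a trained LoRA gets applied to a diffusers pipeline (a sketch following the repo's README at the time; the weight path and prompt are placeholders, and the API may change, so check the repo):

```python
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Patch the UNet's attention weights with the trained low-rank factors (<4MB file).
monkeypatch_lora(pipe.unet, torch.load("lora_weight.pt"))
# 0.0 = base model only, 1.0 = LoRA fully applied; values in between interpolate.
tune_lora_scale(pipe.unet, 1.0)

image = pipe("a corgi, modern disney style", num_inference_steps=50).images[0]
image.save("corgi.png")
```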
Fine-tuned LoRA on Pixar footage. Inspired by modern-disney-diffusion.
Fine-tuned LoRA on a pop-art style.
Thanks to the generous work of Stability AI and Hugging Face, many people have enjoyed fine-tuning Stable Diffusion models to fit their needs and generate higher-fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.
Also, the final result (a fully fine-tuned model) is very large. Consequently, merging checkpoints to find a user's best fit is a painstakingly SSD-consuming process. Some people work with textual inversion as an alternative, but this is clearly suboptimal: textual inversion only learns a small word embedding, and the final images are not as good as a fully fine-tuned model's.
I've managed to make an alternative work out pretty well with Stable Diffusion: adapters. Parameter-efficient adaptation has been a thing for quite a long time now, and LoRA in particular seems to work robustly in many scenarios according to multiple studies (https://arxiv.org/abs/2112.06825, https://arxiv.org/abs/2203.16329).
LoRA was originally proposed as a method for LLMs, but it is model-agnostic: it applies wherever there is room for a low-rank decomposition of the weight update (which literally every linear layer has). No one seems to have tried it on Stable Diffusion, other than perhaps NovelAI with their hypernetworks (not sure if they did, because they used another form of adapter).
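To make the idea concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch (my own illustration of the technique, not the repo's exact code): the pretrained weight stays frozen, and only the low-rank factors A and B are trained, so each layer contributes just 2 * rank * dim parameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight stays frozen
        # Low-rank factors: A maps down to `rank` dims, B maps back up.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale  # blending strength of the adapter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B(A x); B starts at zero, so training begins
        # exactly at the pretrained model's output.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# A 320-dim attention projection at rank 4 trains 2 * 4 * 320 = 2,560
# parameters instead of the full 320 * 320 = 102,400.
layer = LoRALinear(nn.Linear(320, 320), rank=4)
```

Summing these small factors over the UNet's attention layers is how the whole checkpoint stays in the single-digit-MB range.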
# But is it really good though?
I've tried my best to validate my answer: yes, it is sometimes even better than full fine-tuning. Note that even though we are only fine-tuning ~3MB of parameters, beating full fine-tuning is not surprising: the original paper's benchmarks showed similar results.
What do I mean by better? I could have used a zero-shot FID score on some shifted dataset, but that would literally take years, as generating 50,000 images on a single 3090 takes forever.
Instead, I used Kernel Inception Distance (https://arxiv.org/abs/1801.01401), which has a small standard deviation that lets me use it reliably as a metric. For the shifted dataset, I gathered 2,358 icon images and fine-tuned on them for 12,000 steps, both with full fine-tuning and with LoRA fine-tuning. The end result is as follows:
[Image: KID scores, LoRA vs. full fine-tuning on the icon dataset]
LoRA clearly beats full fine-tuning in terms of KID. But in the end, perceptual results are all that matter, and I think end users will prove their effectiveness. I haven't had enough time to play with these to say anything conclusive about their superiority, but I did train LoRA on three different datasets (vector illustrations, Disney style, pop-art style), all available in my repo. The results seem pleasing enough to validate the perceptual quality.
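For reference, KID can be computed with torchmetrics (an assumption on my part, not what the post used; it needs the image extras installed, and the random tensors below are placeholders for batches of real and generated images):

```python
import torch
from torchmetrics.image.kid import KernelInceptionDistance

# subset_size must be <= the number of samples in the smaller image set.
kid = KernelInceptionDistance(subset_size=100)

# Placeholders: real dataset images and model samples as uint8 NCHW batches.
real_images = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)

kid.update(real_images, real=True)
kid.update(fake_images, real=False)
kid_mean, kid_std = kid.compute()  # KID reports mean and std over random subsets
print(f"KID: {kid_mean:.4f} +/- {kid_std:.4f}")
```

The reported standard deviation is what makes KID usable at small sample sizes, unlike FID.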
# How fast is it?
Tested on a 3090 with a 5950X CPU: LoRA takes 36 min for 12,000 steps, while full fine-tuning takes 1 hour 20 min. This is more than twice the speed. You also keep much of Adam's optimizer-state memory free, and since most parameters don't require grad, that's extra VRAM saved as well.
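The memory saving follows from how the training loop is set up. A generic sketch (the `"lora"` naming convention and the toy model below are my assumptions for illustration, not the repo's code):

```python
import torch
from torch import nn

def trainable_lora_parameters(model: nn.Module):
    """Freeze the whole model, then re-enable grads only for LoRA factors."""
    model.requires_grad_(False)
    for name, param in model.named_parameters():
        # Assumes injected LoRA params carry "lora" in their name (e.g. lora_A/lora_B).
        if "lora" in name:
            param.requires_grad_(True)
            yield param

# Toy stand-in for the UNet so the snippet runs on its own; in practice
# `model` would be the LoRA-injected pipe.unet.
model = nn.Linear(320, 320)
model.lora_A = nn.Parameter(torch.randn(4, 320) * 0.01)
model.lora_B = nn.Parameter(torch.zeros(320, 4))

# Adam keeps two extra state tensors per *trainable* parameter, so optimizer
# state now scales with the ~MBs of LoRA weights, not the ~GBs of UNet
# weights. Frozen parameters also skip gradient storage, saving more VRAM.
optimizer = torch.optim.AdamW(trainable_lora_parameters(model), lr=1e-4)
```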
Contributions are welcome! This repo has only been tested on Linux, so if something doesn't work, please leave an Issue/PR. If you've managed to train your own LoRA models, please share them!
LetterRip t1_izdam40 wrote
Just tried this and it ran great on a 6GB VRAM card in a laptop with only 16GB of RAM (barely fit into VRAM; using bitsandbytes and xformers, I think). I've only tried the corgi example, but it seemed to work fine. Trying it with a person now.