Submitted by cloneofsimo t3_zfkqjh in MachineLearning
TLDR: People use DreamBooth or textual inversion to fine-tune their own Stable Diffusion models. There is a better way: use LoRA to fine-tune twice as fast, with the end result being less than 4MB. A dedicated CLI, package, and pre-trained models are available at https://github.com/cloneofsimo/lora
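For a taste of how the package is used, here is roughly how a trained LoRA gets applied to a diffusers pipeline (a sketch following the repo's README at the time; the weight path and prompt are placeholders, and the API may change, so check the repo):

```python
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import monkeypatch_lora, tune_lora_scale

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Patch the UNet's attention weights with the trained low-rank factors (<4MB file).
monkeypatch_lora(pipe.unet, torch.load("lora_weight.pt"))
# 0.0 = base model only, 1.0 = LoRA fully applied; values in between interpolate.
tune_lora_scale(pipe.unet, 1.0)

image = pipe("a corgi, modern disney style", num_inference_steps=50).images[0]
image.save("corgi.png")
```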
Fine-tuned LoRA on Pixar footage. Inspired by modern-disney-diffusion.
Fine-tuned LoRA on a pop-art style.
Thanks to the generous work of Stability AI and Hugging Face, many people have enjoyed fine-tuning Stable Diffusion models to fit their needs and generate higher-fidelity images. However, the fine-tuning process is very slow, and it is not easy to find a good balance between the number of steps and the quality of the results.
Also, the final result (a fully fine-tuned model) is very large. Consequently, merging checkpoints to find a user's best fit is a painstakingly SSD-consuming process. Some people work with textual inversion as an alternative, but this is clearly suboptimal: textual inversion only learns a small word embedding, and the final images are not as good as a fully fine-tuned model's.
I've managed to make an alternative work out pretty well with Stable Diffusion: adapters. Parameter-efficient adaptation has been a thing for quite a long time now, and LoRA in particular seems to work robustly in many scenarios according to multiple studies (https://arxiv.org/abs/2112.06825, https://arxiv.org/abs/2203.16329).
LoRA was originally proposed as a method for LLMs, but it is model-agnostic: it applies wherever there is room for a low-rank decomposition of the weight update (which literally every linear layer has). No one seems to have tried it on Stable Diffusion, other than perhaps NovelAI with their hypernetworks (not sure if they did, because they used another form of adapter).
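To make the idea concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch (my own illustration of the technique, not the repo's exact code): the pretrained weight stays frozen, and only the low-rank factors A and B are trained, so each layer contributes just 2 * rank * dim parameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight stays frozen
        # Low-rank factors: A maps down to `rank` dims, B maps back up.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale  # blending strength of the adapter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B(A x); B starts at zero, so training begins
        # exactly at the pretrained model's output.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# A 320-dim attention projection at rank 4 trains 2 * 4 * 320 = 2,560
# parameters instead of the full 320 * 320 = 102,400.
layer = LoRALinear(nn.Linear(320, 320), rank=4)
```

Summing these small factors over the UNet's attention layers is how the whole checkpoint stays in the single-digit-MB range.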
# But is it really good though?
I've tried my best to validate my answer: yes, it is sometimes even better than full fine-tuning. Note that even though we are only fine-tuning ~3MB of parameters, beating full fine-tuning is not surprising: the original paper's benchmarks showed similar results.
What do I mean by better? I could have used a zero-shot FID score on some shifted dataset, but that would literally take years, as generating 50,000 images on a single 3090 takes forever.
Instead, I used Kernel Inception Distance (https://arxiv.org/abs/1801.01401), which has a small standard deviation that lets me use it reliably as a metric. For the shifted dataset, I gathered 2,358 icon images and fine-tuned on them for 12,000 steps, both with full fine-tuning and with LoRA fine-tuning. The end result is as follows:
[Image: KID scores, LoRA vs. full fine-tuning on the icon dataset]
LoRA clearly beats full fine-tuning in terms of KID. But in the end, perceptual results are all that matter, and I think end users will prove their effectiveness. I haven't had enough time to play with these to say anything conclusive about their superiority, but I did train LoRA on three different datasets (vector illustrations, Disney style, pop-art style), all available in my repo. The results seem pleasing enough to validate the perceptual quality.
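For reference, KID can be computed with torchmetrics (an assumption on my part, not what the post used; it needs the image extras installed, and the random tensors below are placeholders for batches of real and generated images):

```python
import torch
from torchmetrics.image.kid import KernelInceptionDistance

# subset_size must be <= the number of samples in the smaller image set.
kid = KernelInceptionDistance(subset_size=100)

# Placeholders: real dataset images and model samples as uint8 NCHW batches.
real_images = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (200, 3, 299, 299), dtype=torch.uint8)

kid.update(real_images, real=True)
kid.update(fake_images, real=False)
kid_mean, kid_std = kid.compute()  # KID reports mean and std over random subsets
print(f"KID: {kid_mean:.4f} +/- {kid_std:.4f}")
```

The reported standard deviation is what makes KID usable at small sample sizes, unlike FID.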
# How fast is it?
Tested on a 3090 with a 5950X CPU: LoRA takes 36 min for 12,000 steps, while full fine-tuning takes 1 hour 20 min. This is more than twice the speed. You also keep much of Adam's optimizer-state memory free, and since most parameters don't require grad, that's extra VRAM saved as well.
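The memory saving follows from how the training loop is set up. A generic sketch (the `"lora"` naming convention and the toy model below are my assumptions for illustration, not the repo's code):

```python
import torch
from torch import nn

def trainable_lora_parameters(model: nn.Module):
    """Freeze the whole model, then re-enable grads only for LoRA factors."""
    model.requires_grad_(False)
    for name, param in model.named_parameters():
        # Assumes injected LoRA params carry "lora" in their name (e.g. lora_A/lora_B).
        if "lora" in name:
            param.requires_grad_(True)
            yield param

# Toy stand-in for the UNet so the snippet runs on its own; in practice
# `model` would be the LoRA-injected pipe.unet.
model = nn.Linear(320, 320)
model.lora_A = nn.Parameter(torch.randn(4, 320) * 0.01)
model.lora_B = nn.Parameter(torch.zeros(320, 4))

# Adam keeps two extra state tensors per *trainable* parameter, so optimizer
# state now scales with the ~MBs of LoRA weights, not the ~GBs of UNet
# weights. Frozen parameters also skip gradient storage, saving more VRAM.
optimizer = torch.optim.AdamW(trainable_lora_parameters(model), lr=1e-4)
```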
Contributions are welcome! This repo has only been tested on Linux, so if something doesn't work, please leave an Issue/PR. If you've managed to train your own LoRA models, please share them!
LetterRip t1_izdam40 wrote
Just tried this and it ran great on a 6GB VRAM card in a laptop with only 16GB of RAM (barely fit into VRAM; using bitsandbytes and xformers, I think). I've only tried the corgi example, but it seemed to work fine. Trying it with a person now.