
Co0k1eGal3xy t1_jdqfxcr wrote

  1. Most Stable Diffusion UIs DO merge weights by averaging them.
  2. Averaging weights between checkpoints works really well with CLIP fine-tuning, improving performance over both checkpoints on their respective validation sets (a sketch of this kind of interpolation follows after this list). https://github.com/mlfoundations/wise-ft
  3. Git Re-Basin found that their method of merging weights works even for checkpoints trained with completely different pretraining data and init weights, and improves accuracy on a mixed validation set over using either model alone. https://arxiv.org/abs/2209.04836
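For concreteness, here is a minimal sketch of the kind of checkpoint interpolation wise-ft describes, assuming two PyTorch state dicts with identical keys and shapes; the function name and checkpoint paths are hypothetical:

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Blend two checkpoints with identical keys and tensor shapes.

    alpha = 0.0 returns model A's weights, alpha = 1.0 returns model B's.
    """
    merged = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b[key]
        assert tensor_a.shape == tensor_b.shape, f"shape mismatch at {key}"
        if tensor_a.is_floating_point():
            merged[key] = (1.0 - alpha) * tensor_a + alpha * tensor_b
        else:
            # integer buffers (e.g. BatchNorm's num_batches_tracked) can't be interpolated
            merged[key] = tensor_a.clone()
    return merged

# hypothetical usage:
# merged = interpolate_state_dicts(torch.load("a.ckpt"), torch.load("b.ckpt"), alpha=0.3)
# model.load_state_dict(merged)
```

Sweeping alpha between 0 and 1 traces a path between the two models; the wise-ft result is that intermediate values often beat both endpoints.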

You're right that merging the model outputs gives higher quality than merging the weights, but OP was asking whether it was possible, and it very much is as long as the weight tensors have the same shapes.
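To make the output-merging vs. weight-merging distinction concrete, a hedged sketch of output ensembling, assuming PyTorch models that share an input/output interface:

```python
import torch

@torch.no_grad()
def ensemble_outputs(models, x):
    # Run every model and average their predictions (an ensemble).
    # This costs one forward pass per model, unlike a single weight-merged network.
    return torch.stack([model(x) for model in models]).mean(dim=0)
```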

12

tdgros t1_jdqjc8q wrote

There's also the weight averaging used in ESRGAN, which I knew about, but it has always irked me. The permutation argument from your third point is the usual reason I bring up on this subject, and the paper does show why it's not as simple as just blending weights! The same reasoning also shows why blending subsequent checkpoints from one training run isn't like blending independently trained networks.
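A small toy illustration of the permutation symmetry Git Re-Basin builds on (not from the paper): reordering hidden units leaves a network's function unchanged, so two independently trained nets can compute the same thing with very different-looking weights, and naively averaging them mixes neurons that don't correspond to each other.

```python
import torch

# Permuting the hidden units of a layer (rows of W1, entries of b1) and applying the
# matching permutation to the next layer's input columns (W2) leaves the function unchanged.
torch.manual_seed(0)
W1, b1 = torch.randn(8, 4), torch.randn(8)
W2, b2 = torch.randn(3, 8), torch.randn(3)
perm = torch.randperm(8)

x = torch.randn(4)
original = W2 @ torch.relu(W1 @ x + b1) + b2
permuted = W2[:, perm] @ torch.relu(W1[perm] @ x + b1[perm]) + b2
assert torch.allclose(original, permuted, atol=1e-6)
```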

2

_Arsenie_Boca_ t1_jdqy1n8 wrote

Merging model outputs also means you have to run both models at inference time. I think the best option is to merge the weights, then recover performance by training on data from both domains with distillation from the respective expert models.
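A minimal sketch of the distillation part of that recipe, assuming a classification setup and hypothetical student/teacher models (standard temperature-scaled knowledge distillation, not any particular paper's procedure):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Temperature-scaled KL divergence: the merged "student" matches the soft
    # predictions of the domain expert "teacher" on that expert's data.
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

# per batch x drawn from expert_a's domain (hypothetical names):
#   with torch.no_grad(): teacher_logits = expert_a(x)
#   loss = distillation_loss(merged_model(x), teacher_logits)
```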

2