Submitted by floppy_llama t3_1266d02 in MachineLearning
gmork_13 t1_je9e6wu wrote
Reply to comment by CasulaScience in [R] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention by floppy_llama
I'm wondering the same thing.
In the LoRA paper they had some pros vs cons on other adapters (where LoRA won out). Though you technically could do both, you'd probably pick one.
Indeed, this adapter wins out vs LoRA when looking at weight size, but since we're talking about MB it's an almost negligible difference (in this scenario). It's a shame they didn't include LoRA training time in their comparison.
They say 1hr on 8*A100, whereas the alpaca-LoRA github says 4-5 hrs on 1*4090.
8*A100's is 640GB VRAM (assuming 80GB model) as opposed to the 4090's 24GB - there are also differences in speed and the fact that the alpaca-LoRA github may have run the inference on an 8bit quantized model.
Since the adapter paper says nothing about quantization, I'm assuming it's 640GB VRAM used for the full fp32 7B model for one hour (or fp16?), compared to the alpaca-LoRA git which runs 24GB VRAM on 8int 7B model for 4.5 hrs.
They both train on the stanford dataset, but alpaca-LoRA-git trains for 3 epochs on the cleaned dataset whereas llama-adapter trains on the full dataset for 5 epochs.
That's a lot of small differences to account for if you're trying to figure out what's faster.
It can be done, but the question remains whether the end result is comparable and whether it was trained to an optimal point.
Since the authors trained alpaca-LoRA, why didn't they write how long alpaca-LoRA took in their comparison table? They trained on the same hardware and dataset, I assume.
If the only difference between this adapter and others is, as they mention in the paper, the gating, zero init and multi-modality then the downsides mentioned in the LoRA paper might still hold (bottlenecks). I'm no expert though.
Viewing a single comment thread. View all comments