
suflaj t1_ix4bbi0 wrote

3. and 4. in your case are probably intertwined, and likely the reason you are stuck. You should probably keep the learning rate constant across all layers, and at most freeze some of them if you're dealing with a big distribution shift when finetuning.
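
Freezing in PyTorch is just flipping requires_grad, roughly like this (the stand-in encoder and the choice of 4 frozen layers are only placeholders, swap in your own model and numbers):

```python
import torch
import torch.nn as nn

# Stand-in encoder just for illustration; use your own model here.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)

# Freeze the first few layers so only the upper layers adapt to the new
# domain; everything that stays trainable gets one constant learning rate.
n_frozen = 4
for layer in encoder.layers[:n_frozen]:
    for param in layer.parameters():
        param.requires_grad = False

# Single optimizer, single learning rate for the remaining parameters.
optimizer = torch.optim.AdamW(
    (p for p in encoder.parameters() if p.requires_grad),
    lr=1e-5,
)
```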

You should use warmup and a low learning rate (what that means depends on the setup, but since music data is similar to text, I'd say a maximum learning rate of 1e-6 to 1e-5), and increase the batch size if you get stuck as training progresses.

Without warmup, your network will not converge.
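
A minimal warmup sketch in PyTorch, assuming AdamW (the 4000 warmup steps and the toy model are placeholders, not recommendations for your exact setup):

```python
import torch

# Linear warmup to the peak learning rate, then hold it constant.
peak_lr = 1e-5
warmup_steps = 4000  # assumption, tune for your data and batch size

model = torch.nn.Linear(512, 512)  # stand-in for your transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

# Call scheduler.step() once per optimizer step in your training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```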

With a high learning rate, it will likely diverge on a nasty sample, or even have its gradients explode. In practice, even with gradient clipping, your network might just run in circles, depending on your samples.
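
For reference, clipping in PyTorch is usually just this (max_norm=1.0 is a common default, not something tuned for your setup, and the model and loss here are dummies):

```python
import torch

model = torch.nn.Linear(512, 512)           # stand-in for your transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()                # dummy loss, just to get gradients
loss.backward()

# Clip the global gradient norm before stepping the optimizer.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```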

Lowering the learning rate when you're stuck tends to hurt generalization, but increasing the batch size (even if you slightly increase the learning rate while you're at it) seems to fix the problem; you just have to find the right numbers. I work on text, so whenever I doubled the batch size, I increased the learning rate by a factor of the square or cube root of 2 to keep the "learning pressure" the same. YMMV.
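
To make that rule of thumb concrete (the base values here are made up, only the scaling factor matters):

```python
# When the batch size doubles, scale the learning rate by 2**(1/2) or
# 2**(1/3) to keep the "learning pressure" roughly constant.
base_lr = 1e-5
base_batch_size = 32  # assumption

for doublings in range(1, 4):
    batch_size = base_batch_size * 2 ** doublings
    lr_sqrt = base_lr * 2 ** (doublings / 2)   # square-root scaling
    lr_cbrt = base_lr * 2 ** (doublings / 3)   # cube-root scaling
    print(f"batch {batch_size}: lr ~ {lr_sqrt:.2e} (sqrt) or {lr_cbrt:.2e} (cbrt)")
```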

EDIT: And as other people said, make sure you have a large enough dataset. Transformers have almost no inductive biases, meaning they have to learn them from data. Unless your augmentations are really good, I wouldn't recommend even attempting to train a transformer with fewer than 100k-1mil unique samples. For the size you're mentioning, the model would ideally want 1-10mil samples for finetuning and 1-10bil for pretraining.

28

parabellum630 OP t1_ix4eozs wrote

Thank you so much for these insights!! I will try these out.

5