suflaj t1_iym8e85 wrote on December 2, 2022 at 1:10 PM

NVLink will not pool your memory. You were already told this in your previous post.

Tbose models already require more than 24GB RAM if you do not accumulate your gradients, and it's unlikely they'll need more than 24 GB per batch even for their auccessors. 4090s will be faster, obviously.

normie1990 t1_iym8leb wrote on December 2, 2022 at 1:11 PM

I thought memory pooling was the whole point of NVLink?

>Tbose models already require more than 24GB RAM if you do not accumulate your gradients

Could you elaborate?

suflaj t1_iym8sa5 wrote on December 2, 2022 at 1:13 PM

NVLink itself does not pool memory. It just increases bandwidth. Memory pools are a software feature, partially made easier by NVLink.

> Could you elaborate?

Those model are trained with batch sizes that are too large to fit on any commercial GPU, meaning you will have to accumulate them either way.

normie1990 t1_iym8yq4 wrote on December 2, 2022 at 1:15 PM

I probably should have specified that I'll do fine tuning, not training from scratch, if that makes any difference.

>Memory pools are a software feature.

I know it's a software feature, AFAIK pytorch supports it, right?

suflaj t1_iym94jr wrote on December 2, 2022 at 1:16 PM

> I probably should have specified that I'll do fine tuning, not training from scratch, if that makes any difference.

Unless you're freezing layers, it doesn't.

> I know it's a software feature, AFAIK pytorch supports it, right?

No. PyTorch supports Data Parallelism. To get pooling in its full meaning, you need Model Parallelism, for which you'd have to write your own multi-GPU layers and a load balancing heuristic.

Be as it be, using Pytorch itself, NVLink gets you less than 5% gains. Obviously not worth compared to 30-90% gains from a 4090. You need stuff like Apex to see visible improvements, but they do not compare to generational leaps, nor do they parallelize the model (you still have to do it yourself). Apex' data parallelism is similar to PyTorches anyways.

Once you parallelize your model, however, you're bound to be bottlenecked by bandwidth. This is the reason it's not done more often, as it makes sense only if the model itself is very large, yet its gradients fit in pooled memory. NVLink provides only 300 GB/s of bandwidth in the best case scenario, amounting to roughly 30% performance gains in bandwidth bottlenecked tasks in the best case.

normie1990 t1_iyma2hh wrote on December 2, 2022 at 1:25 PM

>Be as it be, using Pytorch itself, NVLink gets you less than 5% gains. Obviously not worth compared to 30-90% gains from a 4090.

Thanks, I think I have my answer.

Obviously I'm new to ML and didn't understand everything that you tried to explain (which I appreciate). I know that much - I will be freezing layers when fine-tuning, so from your earlier comment I guess I won't need more than 24GB.

Will I ever need more than 24GB VRAM to train models like Detectron2 and YOLOv5?

Comments