Submitted by soupstock123 t3_106zlpz in deeplearning
hjups22 t1_j3k2kei wrote
What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?
Or if you are doing inference, what types of models do you intend to run?
The configuration you suggested is really only good for training / inferencing many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also, don't forget about system RAM. Depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and DeepSpeed requires a lot more than that (upwards of 4x). I would probably go with at least 128GB for the setup you described.
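As a rough illustration of the rule of thumb above, the arithmetic can be sketched like this. The function name and the example of four 24 GB cards are assumptions for illustration only; the 1.5x and 4x multipliers are the approximate figures mentioned above, and real requirements depend on the models.

```python
def system_ram_estimate_gb(num_gpus, vram_per_gpu_gb, multiplier):
    """Rough system-RAM target in GB: total VRAM times a rule-of-thumb multiplier."""
    total_vram_gb = num_gpus * vram_per_gpu_gb
    return total_vram_gb * multiplier


# Assumed example: four cards with 24 GB each (hypothetical build).
typical = system_ram_estimate_gb(4, 24, 1.5)   # ~1.5x rule of thumb
deepspeed = system_ram_estimate_gb(4, 24, 4)   # "upwards of 4x" case
print(f"typical: {typical} GB, DeepSpeed-style offload: {deepspeed} GB")
```

This is only an upper-end planning estimate; smaller models will get by with much less.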
VinnyVeritas t1_j3l04w8 wrote
Each time someone asks this question, someone repeats this misinformed answer.
This is incorrect: NVLink doesn't make much difference.
hjups22 t1_j3l1l6n wrote
That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which has a significant advantage in speed compared to the older GPUs. I'm not aware of any benchmarks that explicitly tested this though.
Also, Puget benchmarked what I would consider "small" models. If the model is small enough, then the interconnect won't really matter all that much, as you're going to spend more time in communication setup than in the transfer itself.
But for the bigger models, you'd better believe it matters!
Although to be fair, my original statement is based on a node with 4x A6000 GPUs, configured in a pair-wise NVLink configuration. When you jump from 2 paired GPUs over to 4 GPUs with batch-parallelism, the training time (for big models - ones which barely fit in the 3090) will only increase by about 20% rather than the expected 80%.
It's possible that the same scaling will not be seen on 3090s, but I would expect the scaling to be worse in the system described by the OP, since the 4x system allocated a full 16 lanes to each GPU via dual sockets.
Note that this is why I asked about the type of training being done, since if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.
qiltb t1_j3l9suz wrote
that also depends on input image size though...
hjups22 t1_j3lk4e5 wrote
Could you elaborate on what you mean by that?
The advantage of NVLink is gradient / weight communication, which is independent of image size.
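To make that concrete, here is a minimal sketch of how much data moves between GPUs per training step. It assumes fp32 gradients and a ring all-reduce (one common scheme, not necessarily what any given framework picks); the function names are made up for the example. Note that image size appears nowhere in the calculation.

```python
def gradient_bytes(num_params, bytes_per_param=4):
    """Size of one full gradient copy; depends only on the parameter count."""
    return num_params * bytes_per_param


def ring_allreduce_bytes_per_gpu(num_params, num_gpus, bytes_per_param=4):
    """Bytes each GPU sends per step in a ring all-reduce: 2*(N-1)/N * payload."""
    payload = gradient_bytes(num_params, bytes_per_param)
    return 2 * (num_gpus - 1) / num_gpus * payload


# Assumed example: a 200M-parameter model on 4 GPUs with fp32 gradients.
print(gradient_bytes(200_000_000))                   # 800000000
print(ring_allreduce_bytes_per_gpu(200_000_000, 4))  # 1200000000.0
```

Dividing that per-step figure by the effective interconnect bandwidth gives a ballpark for the communication time, which is where a faster link would show up.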
qiltb t1_j3mmcca wrote
Sorry, I was referring explicitly to your last paragraph (that it's quick for small models).
hjups22 t1_j3nqeim wrote
Then I agree. If you are doing ResNet inference on 8K images, then it will probably be quite slow. However, 8K segmentation will probably be even slower (that was the point of comparison I was thinking of).
Also, when you get to large images, I suspect the PCIe will become a bottleneck (sending data to the GPUs), which will not be helped by the setup described by the OP.
qiltb t1_j3l9q8s wrote
Well, even in the most basic tasks, like plain ResNet-100 training (classification), using NVLink makes a huge difference.
VinnyVeritas t1_j3ng2u9 wrote
Do you have some numbers or a link because all benchmarks I've seen point to the contrary? I'm happy to update my opinion if things have changed and there's data to support it.
[deleted] t1_j3nxy3d wrote
[deleted]
soupstock123 OP t1_j3l2srl wrote
Right now mostly CNNs, RNNs, and playing around with style transfers with GANs. Future plans include running computer vision models trained on videos and testing inferencing, but still researching how demanding that would be.
hjups22 t1_j3l3ln2 wrote
Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. Although, I would still recommend parallel training rather than trying to link them together (4 GPUs means you can run 4 experiments in parallel - or 8 if you double up on a single GPU).
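A minimal sketch of that "one experiment per GPU" approach, assuming each experiment is a separate command and that pinning a process to a card via the `CUDA_VISIBLE_DEVICES` environment variable is enough for your framework. The helper names are made up for the example.

```python
import os
import subprocess


def build_jobs(commands, num_gpus):
    """Pair each command with an environment exposing a single GPU (round-robin)."""
    jobs = []
    for i, cmd in enumerate(commands):
        # Each process only sees one physical GPU, which it addresses as device 0.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % num_gpus))
        jobs.append((cmd, env))
    return jobs


def launch(jobs):
    """Start every job as its own process and wait for all of them to finish."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [p.wait() for p in procs]
```

For instance, `launch(build_jobs([["python", "train.py", "--lr", "0.1"], ["python", "train.py", "--lr", "0.01"]], num_gpus=4))` would run two experiments side by side, where `train.py` and its `--lr` flag are hypothetical. Passing 8 commands with `num_gpus=4` doubles up two experiments per card.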
Regarding RAM speed, it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (reduced the RAM speed so that I could increase the capacity). The speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month-long experiments, an extra day is irrelevant).