Submitted by N3urAlgorithm t3_1115h5o in deeplearning

Hey everyone, I'm going to build a new workstation for work and I would like some help weighing the pros/cons of the different GPUs; since they're fairly new, there isn't much information online.

I was deciding between 4x RTX 6000 Ada or 1 Hopper H100. The general idea is to train various deep learning models, principally vision transformers, and build some kind of service on top of them. What about NVLink? The cloud option is not being considered at the moment due to the recent bills.

Any suggestions or clarifications are highly appreciated.

16

Comments


Zeratas t1_j8cwg7l wrote

You're not going to be putting an H100 in a workstation. That's a server card.

With the GPUs you were mentioning, are you prepared to spend 30 to 50 thousand dollars just on the GPUs?

IIRC, the A6000s are the top of the line desktop cards.

IMHO, take a look at the specs and performance on your own workload. You'd get better value doing something like one or two A6000s, and maybe investing in a longer-term server-based solution.

10

N3urAlgorithm OP t1_j8cwwcr wrote

Yes, since I'm going to use it for work, it'll be OK to build a server instead.

2

artsybashev t1_j8e2dmj wrote

I understand that you have given up hope on the cloud. Just so you understand the options: $50k gives you about 1000 days of 4x A100 from vast.ai at today's pricing. Since in 3 years there is going to be at least one new generation, you would probably get more like 6 years of 4x A100, or one year of 4x A100 + 1 year of 4x H100. Keeping your rig at 100% utilization for 3 years might be hard if you plan to have holidays.
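For what it's worth, a quick back-of-envelope check of that claim (the $50k and 1000-day figures are from the estimate above; the hourly rate is implied by them, not a quoted vast.ai price):

```python
# Sanity check: what hourly rate for 4x A100 does "$50k buys ~1000 days" imply?
budget_usd = 50_000   # rough cost of the multi-GPU build
days_rented = 1_000   # estimate above for 4x A100 rental on that budget
implied_rate = budget_usd / (days_rented * 24)
print(f"implied rate: ${implied_rate:.2f}/hr for 4x A100")  # ~$2.08/hr, ~$0.52 per GPU-hour
```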

3

Appropriate_Ant_4629 t1_j8h5l44 wrote

> Keeping your rig at 100% utilization for 3 years might be hard if you plan to have holidays.

With his ask, he probably has jobs big enough they'll run through the holidays.

1

artsybashev t1_j8i33cp wrote

Yeah, might be. I've only seen companies do machine learning in two ways. One is to rent a cluster of GPUs and train something big for a week or two to explore something interesting. The other pattern is to retrain a model every week with fresh data. Maybe that's the case for OP: retraining a model each week and serving it with some cloud platform. It makes sense to build a dedicated instance for a recurring task if you know there will be a need for it for more than a year. I guess it is also cheaper than using the upfront payment option in AWS.

1

CKtalon t1_j8dbtpk wrote

RTX 6000 Ada has no NVLink. Speed-wise, 2x RTX 6000 Ada should be roughly equal to 1x H100, extrapolating from last gen's A6000 vs A100. 4x RTX 6000 Ada should be faster, and has more total VRAM than a single H100.

One thing to note is the likely lack of a Tensor Memory Accelerator (TMA) on the RTX 6000 Ada, which is present on the H100. That matters if you plan on training FP8 models.
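The TMA isn't something you call directly; it's a hardware path that H100 kernels, e.g. in NVIDIA's Transformer Engine, can exploit. As a rough sketch of what FP8 training looks like in practice (assuming Transformer Engine is installed and the GPU supports FP8):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024).cuda()      # FP8-capable drop-in replacement for nn.Linear
x = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)                        # matmul runs on FP8 tensor cores
out.sum().backward()
```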

4

N3urAlgorithm OP t1_j8dqtv4 wrote

Thank you, the TMA is actually a big deal for speeding things up. But as far as I've found, even though the 4x RTX setup has more total VRAM, it can't be used for memory pooling. So if I'm not wrong, even with this limitation I can still distribute training across the 4 GPUs, but each copy of the model is still capped at 48 GB.
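To make that constraint concrete, a minimal sketch (hypothetical toy model) of plain data-parallel training with PyTorch DDP: every GPU holds a full replica, so each copy is capped at one card's 48 GB even though the four cards total 192 GB.

```python
# launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 10).cuda(rank)   # stand-in for a ViT
    model = DDP(model, device_ids=[rank])          # full replica on each GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()        # gradients all-reduced over PCIe (no NVLink on these cards)
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```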

1

CKtalon t1_j8e0j48 wrote

You'll have to use something like DeepSpeed to split the layers across multiple GPUs. Of course, if the model can fit on one GPU, then you can go crazier with bigger batch sizes.
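A minimal sketch of that with DeepSpeed ZeRO stage 3 (hypothetical toy model): stage 3 partitions parameters, gradients, and optimizer states across the GPUs, so a model bigger than one card's 48 GB can still train. This is sharding, not memory pooling.

```python
# launch with: deepspeed --num_gpus=4 train_zero3.py
import torch
import deepspeed

# stand-in for a large vision transformer
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},  # shard params/grads/optimizer states
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(4, 4096, device=engine.device)
loss = engine(x).sum()
engine.backward(loss)   # DeepSpeed handles gradient partitioning/communication
engine.step()
```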

1

lambda_matt t1_j8db78v wrote

No more NVLink on the-cards-formerly-known-as-Quadro, so if your models are VRAM-hungry you may be constrained by the Ada 6000s. PCIe 5 and Genoa/Sapphire Rapids might even this out, but I am not on the product development side of things, am not fully up to speed on next-gen, and there have been lots of delays on the CPUs/motherboards.

Also, the TDPs for pretty much all of the Ada cards are massive and will make multi-GPU configurations difficult, likely limited to 2x.

NVIDIA has killed off the DGX Workstation, so they are pretty committed to keeping the H100 a server platform.

There still isn't much real-world info, as there are very few of any of these cards in the wild.

Here at least are some benchmarks for the H100, useful for comparing against the Ampere generation: https://lambdalabs.com/gpu-benchmarks

Disclaimer: I work for Lambda

1

N3urAlgorithm OP t1_j8dokrh wrote

So basically the RTX 6000 does not support shared memory, and so a stack of Ada RTX cards will only be useful to accelerate things, right?

For the H100, on the other hand, is it possible to do something like that?

Is the price difference legit, ~$7.5k for the 6000 vs. ~$30k for the H100?

1

N3urAlgorithm OP t1_j8drfq9 wrote

You said that NVIDIA has killed off the DGX Workstation, but from what I can see here there's still something for the H100?

1

BellyDancerUrgot t1_j8e6rq6 wrote

The top reply presents it well. Also, I think Jeff Heaton might make a video on the RTX 6000, since he just posted an unboxing recently. Might want to check that out in case he talks about it in detail.

1