
bentheaeg t1_jcsvdy1 wrote

Not able to answer for sure right now (there are few open benchmarks for the A6000 Ada), and I don't think many people can. I work at a scale-up though (PhotoRoom), and we're getting a 4x A6000 Ada server next week; we were planning to publish benchmarks vs. our other platforms (DGXs, custom servers... from A100 to A6000 and 3090), so stay tuned!

From a distance, a semi-educated guess:

- The A6000 Ada is really, really good at compute. Models that are compute-bound (think Transformers with very large embeddings) should do well; models that are more IO-bound (convnets, for instance) will not do as well, especially vs. the A100, which has much faster memory.
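To make the compute- vs. memory-bound point concrete, here's a quick roofline-style back-of-envelope. The spec numbers are rough public figures (dense fp16 tensor TFLOPS, memory bandwidth), used purely for illustration, not measurements:

```python
# Back-of-envelope roofline: a kernel is compute-bound when its arithmetic
# intensity (FLOPs per byte moved) exceeds the GPU's FLOPS/bandwidth ratio.
# Spec numbers below are rough public figures, for illustration only.

def ridge_point(peak_tflops: float, mem_bw_tbs: float) -> float:
    """FLOPs/byte above which the GPU is compute-bound, below: memory-bound."""
    return peak_tflops / mem_bw_tbs  # (TFLOP/s) / (TB/s) = FLOPs/byte

def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    """Arithmetic intensity of an m*k @ k*n fp16 matmul (reads A, B; writes C)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

a100 = ridge_point(312, 2.0)       # A100 80GB: HBM2e is very fast
ada6000 = ridge_point(364, 0.96)   # 6000 Ada: GDDR6 is much slower

big = matmul_intensity(4096, 4096, 4096)  # large transformer-style matmul
small = matmul_intensity(64, 64, 4096)    # skinny, bandwidth-hungry shape

print(f"ridge A100: {a100:.0f}, ridge 6000 Ada: {ada6000:.0f} FLOPs/byte")
print(f"big matmul: {big:.0f}, skinny matmul: {small:.0f} FLOPs/byte")
```

The skinny shape sits below both ridge points (memory-bound everywhere), while the big matmul clears both; the Ada card just needs a much higher intensity before its compute stops being starved by GDDR6.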

- The impact of NVLink is not super clear to me; its bandwidth was not that big to begin with anyway. My guess is that it may be more useful for latency-bound inter-GPU communications, like when using SyncBatchNorm.
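Rough intuition for why SyncBatchNorm would be latency- rather than bandwidth-bound: it all-reduces tiny per-channel statistics at every layer, so the fixed link latency dominates. The latency and bandwidth numbers below are assumptions for illustration, not measured values:

```python
# Simple linear cost model: transfer time = latency + size / bandwidth.
# Link numbers are assumed for illustration, not measured.

def transfer_us(size_bytes: float, latency_us: float, bw_gbs: float) -> float:
    """Time in microseconds to move size_bytes over one link."""
    return latency_us + size_bytes / (bw_gbs * 1e3)  # GB/s -> bytes/us

SYNC_BN_MSG = 2 * 4 * 512   # mean + var, fp32, 512 channels: ~4 KiB
GRAD_MSG = 2.5e9            # full fp16 gradient all-reduce: ~2.5 GB (hypothetical model)

nvlink = dict(latency_us=5.0, bw_gbs=100.0)  # assumed link parameters
pcie = dict(latency_us=10.0, bw_gbs=25.0)    # assumed link parameters

bn_nvlink = transfer_us(SYNC_BN_MSG, **nvlink)
bn_pcie = transfer_us(SYNC_BN_MSG, **pcie)
grad_nvlink = transfer_us(GRAD_MSG, **nvlink)
grad_pcie = transfer_us(GRAD_MSG, **pcie)

print(f"SyncBN msg:  {bn_nvlink:.1f} us (NVLink) vs {bn_pcie:.1f} us (PCIe)")
print(f"grad msg: {grad_nvlink:.0f} us (NVLink) vs {grad_pcie:.0f} us (PCIe)")
```

For the 4 KiB SyncBatchNorm message the transfer term is negligible and only latency matters, while for the multi-GB gradient all-reduce it's almost all bandwidth.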

- There are a lot of training tweaks you can use (model or pipeline parallelism, FSDP, gradient accumulation to cut down on comms...), so the best training setup may differ per platform; it's also a game of apples to oranges, and that's by design.
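As an example of how one of those tweaks trades comms for compute: with gradient accumulation, data-parallel workers all-reduce gradients once per optimizer step instead of once per micro-batch. A toy sketch (all sizes and counts are made up, not from any real run):

```python
# Toy model of data-parallel gradient traffic: one all-reduce of the full
# gradient buffer per optimizer step. All numbers are hypothetical.

GRAD_SIZE_GB = 2.5    # fp16 gradients of a ~1.3B-param model (assumed)
MICRO_BATCHES = 64    # micro-batches processed per GPU
ACCUM = 8             # accumulation steps between synchronizations

def allreduce_traffic_gb(micro_batches: int, accum: int, grad_gb: float) -> float:
    """Total gradient traffic per GPU over the run."""
    optimizer_steps = micro_batches // accum
    return optimizer_steps * grad_gb

naive = allreduce_traffic_gb(MICRO_BATCHES, 1, GRAD_SIZE_GB)       # sync every micro-batch
accumulated = allreduce_traffic_gb(MICRO_BATCHES, ACCUM, GRAD_SIZE_GB)

print(f"sync every batch: {naive:.0f} GB, with accumulation: {accumulated:.0f} GB")
```

Traffic drops by exactly the accumulation factor, which is why a platform with weaker inter-GPU links (no NVLink, say) may prefer a different accumulation setting than a DGX, and why like-for-like benchmarks are hard.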

- I would take extra care with the cooling system; if you're not a cloud operator, a server going down will be a mess in your lab. This happened to us 3 times in the past 6 months, always because of the cooling. These machines can draw 2 kW+ around the clock, and that heat has to be extracted; from our limited experience some setups (even from really big names) are not up to the task and go belly up in the middle of a job. 80 GB A100s are 400 to 450 W, A6000s (Ada or not) are 300 W, so they're easier to cool if you're not fully prepared. Not a point against the A100 per se, but a point against the A100 plus unproven cooling, let's say.
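The power arithmetic behind that worry, if you want to sanity-check your own room: GPU TDPs are from the figures above, and the host overhead (CPUs, fans, NICs, PSU losses) is a guess that varies a lot by chassis:

```python
# Rough continuous heat budget for a 4-GPU box: essentially all electrical
# draw ends up as heat the room has to remove. Host overhead is assumed.

GPU_TDP_W = {"A100-80GB": 450, "A6000/Ada": 300}
HOST_OVERHEAD_W = 800  # CPUs, fans, NICs, PSU losses -- assumed, varies a lot

def server_heat_kw(gpu: str, n_gpus: int = 4) -> float:
    """Approximate continuous heat load of an n-GPU server, in kW."""
    return (n_gpus * GPU_TDP_W[gpu] + HOST_OVERHEAD_W) / 1000

print(f"4x A100:  {server_heat_kw('A100-80GB'):.1f} kW continuous")
print(f"4x A6000: {server_heat_kw('A6000/Ada'):.1f} kW continuous")
```

Even with the guessed overhead, the A100 box lands in the 2 kW+ range mentioned above, and the A6000 box comes in a fair bit cooler.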
