Submitted by Outrageous_Room_3167 t3_zpw2ew in deeplearning

Hey, we're new to infrastructure builds. We're a small start-up (https://www.axibo.com/) and we're curious about the biggest 3090 deep learning rigs people have built.

How do we scale past one machine? My guess is a very fast direct connection between the machines. Is this feasible with 3090s?

The cost of these has gone down dramatically, and on a per-unit basis they're almost as good as an A100.

3

Comments

sigmoid_amidst_relus t1_j0wsqyz wrote

3090 is not as good as an A100 in terms of pure performance.

It's much better than an A100 in perf/$

A single consumer-grade deep learning node won't scale past 3x 3090s without diminishing returns, unless all you work with are datasets that fit in memory or you have a great storage solution. Top-end prosumer and server-grade platforms will do fine with up to 4-6x in a non-rack-mounted setting, but not without custom cooling. The problem isn't just how well you can feed the GPUs; 3090s are simply not designed to work at such high node densities the way server-grade cards are. That's why companies are happy to pay a pretty penny for A100s and other server-grade cards (even if we ignore the need for certifications and Nvidia mandates): the infrastructure and running costs of a good-quality server facility, plus the money lost to potential downtime, far outweigh GPU costs.

Multi-node setups are connected through high-bandwidth interconnects, such as Mellanox InfiniBand.
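
As a rough illustration of why the interconnect matters, here is a back-of-envelope sketch of how long one gradient synchronization takes over a given link. It assumes data-parallel training with a ring all-reduce, where each GPU moves about 2*(N-1)/N times the gradient size per step; the thread does not say which scheme is in use. The function name, the model size, and the link speeds are hypothetical placeholders, and the estimate ignores latency and any overlap with compute.

```python
def ring_allreduce_seconds(param_count, bytes_per_param, num_gpus, link_gbytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce over the gradients.

    Assumption: every GPU sends and receives 2*(N-1)/N of the gradient
    payload, and the slowest link runs at link_gbytes_per_s (GB/s).
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    payload_bytes = param_count * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (link_gbytes_per_s * 1e9)


if __name__ == "__main__":
    # Hypothetical example: 1e9 parameters, 2-byte gradients, 8 GPUs.
    for label, gbytes_per_s in [("10 Gb/s link", 1.25), ("100 Gb/s link", 12.5)]:
        t = ring_allreduce_seconds(1_000_000_000, 2, 8, gbytes_per_s)
        print(f"{label}: ~{t:.2f} s per synchronization")
```

If a training step takes only a few hundred milliseconds of compute, a synchronization that costs seconds dominates the step, which is why slow links erase most of the benefit of adding nodes.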

Most mining farms don't run GPUs on full PCIe x16, as mining doesn't need much host-to-GPU bandwidth, so a deep learning rig is not going to scale as easily as a mining farm does.

You can very well scale to a 64-GPU "farm", but it's going to be a pain in a consumer-grade-only setup, especially in terms of interconnects, not to mention terribly space- and cooling-inefficient.

3

peder2tm t1_j0xtfsu wrote

I have seen 10x RTX 3090 in a single rack-mounted server node with 2x 40-core Intel CPUs. This is a university setup, and the nodes are connected with InfiniBand and managed with Slurm.
If you need to mount 10 RTX 3090s in the same node, you must get ones with blower-style fans to get the heat out, and get the most powerful case fans you can.
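
On a Slurm-managed cluster like this, each training process usually has to work out which GPU it owns. A minimal sketch, assuming one task per GPU launched through `srun`: Slurm exports `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` for every task, and the hypothetical helper below just reads them. The function name and the fallback defaults are my own, not anything from the setup described above.

```python
import os


def slurm_rank_info(env=None):
    """Read the global rank, world size and node-local rank set by Slurm.

    Falls back to a single-process layout when the variables are absent,
    e.g. when the script is run outside of srun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),         # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total tasks in the job
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # task index on this node
    }


if __name__ == "__main__":
    info = slurm_rank_info()
    print(f"rank {info['rank']} of {info['world_size']}, "
          f"local GPU index {info['local_rank']}")
```

The `local_rank` value is what a training script would typically use to pick its GPU on the node, while `rank` and `world_size` go to whatever distributed backend is in use.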

1

VinnyVeritas t1_j0w0gvh wrote

I'm not following: you're doing a start-up on infrastructure builds and you have to ask for advice on Reddit to scale past one machine? That gives a terrible image of your startup. To the average person like me, it sounds like you don't know what you're doing.

0

TheMrZZ0 t1_j0wawcb wrote

I don't think they're an infra startup - they want a GPU rig for ML tasks, that's all. Their website doesn't promote anything infra-related

2

VinnyVeritas t1_j0xc7l5 wrote

Thanks, that makes sense. I thought they were a startup in the business of building computers, and I was completely confused!

2

Outrageous_Room_3167 OP t1_j16hyqb wrote

That would be funny: an infrastructure startup that doesn't know anything about infrastructure, LOL. We're a robotics company :)

2

MeMyself_And_Whateva t1_j0w8yup wrote

If you have more than three, maybe you can set them up in a mining rig. Go to NVIDIA's CUDA website. They should have information.

0