
CKtalon t1_j8dbtpk wrote

RTX 6000 Ada has no NVLink. Speed-wise, 2x RTX 6000 Ada should be roughly on par with 1x H100, extrapolating from last gen's A6000 vs A100. 4x RTX 6000 Ada should be faster, and they have more total VRAM than a single H100.

One thing to take note of is the likely lack of a Tensor Memory Accelerator (TMA) on the RTX 6000 Ada; it's present on the H100, and it matters if you plan on training FP8 models.
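For context, here's a minimal sketch of what FP8 training looks like with NVIDIA's Transformer Engine (the path where TMA/H100 matters). The layer sizes and dummy loss are placeholders, not anything from this thread:

```python
# Minimal sketch of FP8 training with NVIDIA Transformer Engine.
# Layer sizes, data, and loss are placeholders for illustration only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Simple block built from Transformer Engine modules (FP8-capable GEMMs)
model = torch.nn.Sequential(
    te.Linear(1024, 4096, bias=True),
    te.Linear(4096, 1024, bias=True),
).cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# HYBRID = E4M3 for forward, E5M2 for backward
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

x = torch.randn(8, 1024, device="cuda")

# GEMMs inside the fp8_autocast context run in FP8
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)

loss = out.float().pow(2).mean()
loss.backward()
optimizer.step()
```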

4

N3urAlgorithm OP t1_j8dqtv4 wrote

Thank you, the TMA is actually a big deal for speeding things up. As far as I've found, even though the 4x RTX setup has more total VRAM, it can't be pooled into a single memory space. But if I'm not wrong, even with this limitation I can still distribute training across the 4 GPUs, just with each GPU capped at 48GB (see the sketch below).
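A rough sketch of that situation with plain PyTorch DDP: each GPU holds a full replica of the model (so the model itself has to fit in one card's 48GB), while batches are split across the 4 GPUs. The model and sizes here are placeholders:

```python
# Rough sketch: data-parallel training with PyTorch DDP across 4 GPUs.
# Each GPU holds a full model copy (limited by its own 48GB); batches are split.
# Launch with: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; swap in the real network
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```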

1

CKtalon t1_j8e0j48 wrote

You’ll have to use something like DeepSpeed to split the layers across multiple GPUs. Of course, if the model can fit on one GPU, then you can go crazier with bigger batch sizes.
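DeepSpeed's pipeline parallelism splits layers literally; a simpler route is ZeRO stage 3, which partitions parameters, gradients, and optimizer states across the GPUs instead of pooling memory. A minimal sketch of the ZeRO-3 route, with placeholder config values and a placeholder model:

```python
# Minimal sketch of sharding training state across GPUs with DeepSpeed ZeRO stage 3.
# A model larger than any single 48GB card can still be trained this way.
# Launch with: deepspeed --num_gpus=4 train.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},  # partition params/grads/optimizer states
}

# Placeholder model; a real run would use an actual transformer here
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(4, 1024, device=model_engine.device)
loss = model_engine(x).pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```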

1