
learn-deeply t1_iv3tuol wrote

Thanks for creating the benchmark!

FYI, these results aren't exactly accurate: CUDA 12, which adds support for the Hopper architecture, isn't out yet, so none of the FP8 cores are being used and the benchmark isn't taking advantage of Hopper-specific optimizations. From the Nvidia whitepaper:

> With the new FP8 format, the GeForce RTX 4090 delivers 1.3 PetaFLOPS of performance for AI inference workloads.

CUDA 12 will be released sometime in 2023, whenever Nvidia starts delivering H100 GPUs, and it'll take some time after that for frameworks to add support.
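To give an idea of what that framework support might look like, here's a rough sketch using NVIDIA's Transformer Engine library (PyTorch bindings). The recipe arguments and API below are from its quickstart and may change before CUDA 12 ships, so treat it as illustrative only:

```python
# Sketch of FP8 matmuls with NVIDIA Transformer Engine (PyTorch bindings).
# Assumes transformer_engine is installed and the GPU has FP8 tensor cores.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# te.Linear is a drop-in replacement for torch.nn.Linear.
model = te.Linear(1024, 1024, bias=True).cuda()

# Delayed scaling tracks amax history to choose FP8 scaling factors.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

inp = torch.randn(16, 1024, device="cuda")

# Matmuls inside this context run on the FP8 tensor cores.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.float().sum().backward()
```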

Also, the multi-GPU test is missing some details that would be really helpful to know: how many PCIe 4.0 lanes is each GPU using? Is the test doing model parallel or data parallel?

12

Flag_Red t1_iv4h3kz wrote

I'm super hyped for FP8 support in CUDA. Combined with some other techniques, it could put LLM inference (GPT-175B, for example) within reach of consumer hardware.
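In the meantime, the closest thing available on consumer cards is int8 rather than FP8, e.g. LLM.int8() via bitsandbytes. A rough sketch with Hugging Face Transformers (the model id is just a placeholder, pick whatever fits your VRAM):

```python
# Sketch: 8-bit (int8, not FP8) inference with Hugging Face Transformers + bitsandbytes.
# The model id is a placeholder; pick something that fits your GPU(s).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit quantizes the linear layers to int8 at load time (roughly
# half the memory of fp16); device_map="auto" spreads layers across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("FP8 inference on consumer GPUs", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```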

7

whata_wonderful_day t1_iv57znb wrote

Performance will definitely get better as time goes on, but FP8 is going to be extra work to use, just like FP16 was.
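For comparison, this is the kind of extra plumbing FP16 already asks for in PyTorch: wrap the forward pass in autocast and manage a loss scaler (minimal sketch, not tied to any particular benchmark):

```python
# The kind of "extra work" FP16 already needs in PyTorch: autocast for the
# forward pass plus a GradScaler so small gradients don't underflow in fp16.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # ops run in fp16 where it's safe
        loss = model(x).sum()
    scaler.scale(loss).backward()    # scale the loss before backward
    scaler.step(optimizer)           # unscales grads, skips step on overflow
    scaler.update()                  # adjust the scale factor for next step
```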

5

chuanli11 t1_ivhkx9p wrote

Hey, thanks for the comment. We made sure each GPU uses x16 PCIe 4.0 lanes. The test is data parallel (PyTorch DDP, specifically).
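For anyone curious about the setup, the pattern is standard PyTorch DDP, roughly like the sketch below (illustrative only, not our actual benchmark script):

```python
# Minimal PyTorch DDP sketch (one process per GPU), launched with e.g.:
#   torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    # Each process holds a full model replica; gradients are all-reduced.
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = ddp_model(x).sum()
        optimizer.zero_grad()
        loss.backward()      # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```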

We look forward to the FP8/CUDA 12 update too.

2