
pommedeterresautee OP t1_j7tk4fx wrote

On large DL models like Whisper large, CPU is never on par with GPU because a CPU is latency-oriented hardware while a GPU is throughput-oriented. The only way to run large models on CPU is to reduce the number of operations to perform, for instance through sparsification or pruning.
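
To make that concrete, here is a minimal pruning sketch with torch.nn.utils.prune (toy layer and numbers are made up); masking weights alone isn't enough, you also need sparse-aware kernels or structured pruning to actually skip work:

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical toy layer standing in for one block of a large model.
layer = torch.nn.Linear(4096, 4096)

# Zero out 90% of the weights by L1 magnitude. On its own this only masks
# weights; real CPU speedups also need sparse-aware kernels (or structured
# pruning that actually shrinks the matmul shapes).
prune.l1_unstructured(layer, name="weight", amount=0.9)
prune.remove(layer, "weight")  # bake the mask into the weight tensor

print(f"zero weights: {(layer.weight == 0).float().mean().item():.0%}")
```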

Moreover, PyTorch is mostly C++ with a Python layer over it (for now at least; PyTorch 2.0 may be the start of a change in this architecture). The Python layer brings most of the PyTorch latency.
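
A rough way to see that per-op overhead (just a sketch, assumes a CUDA device; how it splits between Python and C++ is a separate question):

```python
import time
import torch

# Rough micro-benchmark: many tiny ops end up dominated by per-op launch
# overhead (Python, dispatcher, CUDA launch), not by the GPU math itself.
x = torch.randn(64, 64, device="cuda")
for _ in range(10):          # warm-up
    x = x + 1

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(1000):
    x = x + 1                # each iteration pays the full per-op overhead
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"~{elapsed / 1000 * 1e6:.1f} µs per tiny op")
```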

And then, even a C++ engine launching operations on the GPU cannot be on par with CUDA graphs (most of the time at least), because you still have to send one instruction at a time, and there is still some latency overhead associated with running things that way, just much less than with Python. With CUDA graphs there is almost none at all. There is a second thing not discussed here: the graph of instructions itself is optimized.
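
For reference, this is roughly what capture and replay look like with PyTorch's torch.cuda.CUDAGraph API (toy linear layer and shapes are placeholders, not the Whisper setup):

```python
import torch

# Static input buffer: CUDA graphs replay on fixed memory addresses,
# so fresh data must be copied into this tensor before each replay.
model = torch.nn.Linear(512, 512).cuda().eval()
static_input = torch.randn(8, 512, device="cuda")

# Warm-up on a side stream, required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: the whole sequence of kernel launches is recorded once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# ...and replayed as a single launch, with almost no per-op launch cost.
static_input.copy_(torch.randn(8, 512, device="cuda"))  # the input copy discussed below
g.replay()
print(static_output.sum().item())
```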

The main drawback of CUDA graphs is the memory overhead: you need at least to double the space taken by input tensors. On generative models with a K/V cache, that matters, as explained in this post. Plus you need to copy input tensors, which offsets a very small part of the gains (at least that's what we saw in our tests on Whisper and BERT / RoBERTa).

That is why TensorRT (a big piece of C++), for instance, supports CUDA graphs.

Still, TBH, as you pointed out, the most important thing is that ... it's easier to build and run :-)

14

programmerChilli t1_j7toust wrote

> The Python layer brings most of the PyTorch latency.

This actually isn't true - I believe most of the per-operator latency comes from C++.

3

pommedeterresautee OP t1_j7tp663 wrote

I guess you know better than me :-)

Which part? The dispatcher thing, or is it spread across several steps?

3