
tysam_and_co OP t1_j6h2qlh wrote

That's a good question, and I'm somewhat curious what you mean by HPC/MLHPC techniques or code optimizations. Do you mean something like distributed computing? (That's an interesting rabbit hole to get into all on its own....)

Regardless -- yep! I'm a professional in this industry, and there's a lot of detail underlying a ton of seemingly-simple changes (which can be even more frustrating when a simple, understandable change shaves a large chunk off what was previously the world record). So basically everything I'm doing is informed by years of doing this exact same thing over and over again. Something I've found is that younger/newer ML engineers (myself included when I was at that point) are often really attracted to the new shiny, when in reality good HPC on a smaller scale is like writing a really nice, quiet, tight haiku. Less is more, but a single 'syllable' can make or break the whole thing.

Lots of people use models inefficiently. This model is still somewhat inefficient in its own ways, though I think it's far more efficient than nearly all of the ones it's currently competing with. When I design a model, I'm thinking about keeping GPU occupancy high, utilizing tensor cores as much as possible, mathematically fusing operations to reduce overhead, managing memory layout so the right paths get activated in the GPU (tensor cores, etc.), and looking for good approximations or alternatives that are much cheaper mathematically (or alternate paths with specialized kernels that I can boutique-design the network around).
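
To make the memory-layout/tensor-core point concrete, here's a minimal sketch (the layer and shapes here are made up for illustration, not the actual hlb-CIFAR10 code): on recent NVIDIA GPUs, convolutions tend to land on tensor-core kernels when the tensors are in half precision and channels-last layout.

```python
import torch
import torch.nn as nn

# Hypothetical conv block for illustration -- not the actual hlb-CIFAR10 network.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.GELU(),
).cuda()

# channels_last layout lets cuDNN pick tensor-core-friendly convolution kernels.
net = net.to(memory_format=torch.channels_last)

x = torch.randn(512, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)

# Half-precision autocast keeps the convolutions on the tensor cores.
with torch.autocast("cuda", dtype=torch.float16):
    out = net(x)
```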

I'll have to cut myself short, but I'll leave you with a single example: a technical breakdown of what was, in practice, a very 'simple' change in the backend. I also made another comment in this reddit thread (https://www.reddit.com/r/MachineLearning/comments/10op6va/comment/j6h0z6b/?utm_source=share&utm_medium=web2x&context=3) with a technical breakdown of one other very 'simple' change. Don't get pulled away by the shiny, fancy techniques that are slow; sometimes the simplest is the best!

Here's the breakdown: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomment-1379711156

Let me know if this answered your question at all or if you have any follow-ups, much love, cheers, and thanks! <3 :D :D :D :D :))))


batrobin t1_j6h8d9d wrote

Thank you. You have answered what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, custom CUDA kernels, fusing operations, reducing overheads, etc., some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue; that should be an interesting read.
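
As a rough sketch of what that kind of kernel-level profiling looks like in PyTorch (the model and input here are placeholder stand-ins, not from the repo):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input -- swap in the real network under study.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(4096, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

# Sort by total CUDA time to see which kernels dominate a step.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```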

I was previously working on some large-scale transformer code optimization; this repo seems like it would be a good one to learn from. Thanks a lot.


tysam_and_co OP t1_j6h8nhh wrote

Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D

I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() path handles it well (I am somewhat wary, since compiled territory can be more rigid and error-prone).
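
For what it's worth, the hands-off version looks roughly like this (a minimal sketch with made-up shapes; how much actually gets fused depends on what TorchInductor can do with the traced graph):

```python
import torch
import torch.nn.functional as F

def mlp_block(x, w1, w2):
    # Matmul -> pointwise GELU -> matmul: a classic fusion target.
    return F.gelu(x @ w1) @ w2

# torch.compile traces the function and, where it can, fuses the
# pointwise ops into neighboring kernels via TorchInductor.
compiled_block = torch.compile(mlp_block)

x  = torch.randn(1024, 512, device="cuda")
w1 = torch.randn(512, 2048, device="cuda")
w2 = torch.randn(2048, 512, device="cuda")

out = compiled_block(x, w1, w2)  # first call compiles; later calls reuse the kernels
```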
