batrobin
batrobin t1_j6h1o7u wrote
I am surprised to see that most of the work you have done is on hyperparameter tuning and model tricks. Have you tried any HPC/MLHPC techniques, profiling, or code optimizations? Are they on a future roadmap, out of scope for this project, or is there just not much room for improvement in that direction?
batrobin t1_j6h8d9d wrote
Reply to comment by tysam_and_co in [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!) by tysam_and_co
Thank you. You have answered what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, writing custom CUDA kernels, fusing operations, reducing overheads, etc., some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue, which should be interesting to read through.
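For anyone unfamiliar with what "fusing operations" means here: instead of running two elementwise ops as separate passes (each reading and writing a full intermediate buffer), a fused version applies both per element in one pass. This is a toy CPU-side sketch of the idea only (the function names are made up for illustration, and this is not code from the repo); the real wins come from fused GPU kernels that avoid extra trips to device memory.

```python
# Toy illustration of operator fusion: the unfused version materializes
# a temporary buffer between two elementwise ops; the fused version
# applies both ops in a single pass with no intermediate storage.

def unfused(xs):
    # Pass 1: scale; materializes a temporary list.
    tmp = [x * 2.0 for x in xs]
    # Pass 2: shift; a second full traversal of the data.
    return [t + 1.0 for t in tmp]

def fused(xs):
    # Single pass: both ops applied per element, no temporary buffer.
    return [x * 2.0 + 1.0 for x in xs]

data = [0.5, 1.0, 1.5]
assert unfused(data) == fused(data)  # same result, one fewer pass
```

On a GPU the same transformation (e.g. via a hand-written fused CUDA kernel or a compiler like those surveyed in the linked paper) removes whole kernel launches and global-memory round trips, which is where the overhead reduction actually shows up.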
I was previously working on some large-scale transformer code optimization; this repo seems like a good one to learn from. Thanks a lot.