batrobin

batrobin t1_j6h8d9d wrote

Thank you, that answers what I had in mind. I was thinking about techniques like changing memory access patterns, changing memory layout, writing custom CUDA kernels, fusing operations, and reducing overhead, some of which are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you did some profiling in your issue, which should be interesting to read through.

I previously worked on optimizing some large-scale transformer code, and this repo seems like a good one to learn from. Thanks a lot.
