Viewing a single comment thread. View all comments

batrobin t1_j6h8d9d wrote

Thank you. You have answered what I had in mind. I was thinking about techniques like changing memory access pattern, changing memory layout, custom cuda kernels, fusing operations, reducing overheads etc. which some of them are mentioned in this paper: https://arxiv.org/abs/2007.00072. I also see that you have done some profiling in your issue, it should be interesting to read into.

I was previously working on some large scale transformer code optimization, seems like this repo would be good to learn from, thanks a lot.

3

tysam_and_co OP t1_j6h8nhh wrote

Excellent, and thank you very much for sharing that paper, I shall have to take a look at it! :D

I might need to do some operator fusion manually at some point in the future, though I'm hoping the torch.compile() command does it well (but I am somewhat scared because compiling territory can be more rigid and error-prone).

1