I have a feeling that's because the heuristic is likely not giving you flash attention at all, but instead this kernel, which appropriately fits your usecase and is in the list of possibly-automatically-selected kernels: https://arxiv.org/abs/2112.05682
Viewing a single comment thread. View all comments