tripple13 t1_jck9593 wrote

Does anyone know why they didn't add FlashAttention directly into the MultiheadAttention modules? Seems it is integrated, awesome!

7

programmerChilli t1_jcnydmw wrote

I think it is used in PyTorch's nn.TransformerEncoder, but a lot of people like implementing their own.

2

mike94025 t1_jcv94un wrote

SDPA is used by F.multi_head_attention_forward (if need_weights=False), which is used by nn.MHA and nn.Transformer* as well as other libraries. (source)

Public service announcement: need_weights defaults to True, and guts performance. (Because allocating and writing the attention-weight tensor defeats the memory-bandwidth advantages of FlashAttention.)
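
For illustration, a minimal sketch of the difference (shapes, sizes, and dtype are made up; the fast path also depends on other conditions like running on a supported GPU):

```python
import torch
import torch.nn as nn

# Hypothetical module sizes, just to show the need_weights flag.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True).cuda().half()

x = torch.randn(4, 1024, 256, device="cuda", dtype=torch.half)  # (batch, seq, embed)

# need_weights=False: no attention-weight tensor is requested, so the call can
# dispatch to scaled_dot_product_attention (and potentially FlashAttention).
out, attn_weights = mha(x, x, x, need_weights=False)   # attn_weights is None

# need_weights=True (the default): the full attention-weight tensor is
# allocated and written out, which is exactly what FlashAttention avoids.
out_slow, attn_weights_slow = mha(x, x, x, need_weights=True)
```
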

Also, if `key_padding_mask is not None`, performance will suffer (because it is converted into an attention mask, and only the causal attention mask is supported by FlashAttention). Use Nested Tensors for variable-sequence-length batches.
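
A hedged sketch of the nested-tensor alternative (assuming the inference-mode fast path with batch_first=True and need_weights=False applies; the sequence lengths are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True).eval()

# Three sequences of different lengths, packed without padding; no
# key_padding_mask is needed because the nested tensor carries the lengths.
seqs = [torch.randn(s, 256) for s in (37, 512, 190)]
x = torch.nested.nested_tensor(seqs)

with torch.inference_mode():
    out, _ = mha(x, x, x, need_weights=False)   # out is also a nested tensor
```
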

1

mike94025 t1_je5ojaw wrote

It is. Follow the call tree into F.multi_head_attention_forward.
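
If you want to check that the fused kernel is actually reachable, one sketch using the PyTorch 2.0 torch.backends.cuda.sdp_kernel context manager (the shapes and dtype here are just an assumption that satisfies FlashAttention's constraints):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) in fp16 on CUDA, which FlashAttention supports.
q = k = v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.half)

# Restrict SDPA to the FlashAttention backend only; the call below will raise
# an error if that backend cannot be used for these inputs.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)
```
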

1

tripple13 t1_je5seed wrote

Is that right? I somehow end up here when trying to assess what the F.multi_head_attention call does in the class definition.

But I trust you're right; it would only make sense. I just couldn't identify the calls myself.

1