Submitted by super_deap t3_11tmpc5 in MachineLearning
oathbreakerkeeper t1_jd0lu2p wrote
Reply to comment by mike94025 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Am I looking in the wrong place? It seems like the torch 2.0 code still requires training == False in order to use FlashAttention.
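For context, the flag in question is the standard nn.Module training flag, so the path being described is only reachable in eval mode. A minimal sketch of that setup (the encoder and sizes are made up for illustration):

```
import torch
import torch.nn as nn

# Hypothetical encoder just to illustrate the eval-mode requirement; sizes are arbitrary.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda()
encoder = nn.TransformerEncoder(layer, num_layers=2).cuda()

encoder.eval()                      # sets training = False on every submodule
with torch.inference_mode():
    x = torch.rand(2, 1024, 512, device="cuda")   # (batch, seq, d_model)
    y = encoder(x)                  # eligible for the fused path only with training == False
```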
Dependent_Ad5120 t1_jd3m0ce wrote
try fp16, that doesn't require training=False apparently.
oathbreakerkeeper t1_jd43931 wrote
I'm using AMP mixed precision, which should be using fp16. It still requires training==False.
But the torch code also disables flash attention if autocast is enabled; I'm not sure how to deal with that one.
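For reference, "AMP mixed precision" here means the usual autocast-plus-GradScaler setup, roughly along these lines (the tiny model and data are placeholders just to keep the sketch self-contained):

```
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.rand(8, 512, device="cuda")
optimizer.zero_grad()
# The forward pass runs under autocast -- the condition being discussed above.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```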
Dependent_Ad5120 t1_jdec7kx wrote
I don't know. I was using pure fp16, no autocast and it works.
oathbreakerkeeper t1_jdgjte0 wrote
How do you use pure fp16, out of curiosity? I've only ever trained with mixed precision, letting PyTorch handle the fp16 stuff from there.
Do you have an example of a github repo that does it?
Dependent_Ad5120 t1_je5qfmp wrote
I don't have a github repo for this, but it is pretty simple:
```
import torch
import torch.nn as nn
from torch.backends.cuda import sdp_kernel

# Pure fp16: cast the model and the inputs with .half(); no autocast anywhere.
model = nn.Transformer().cuda().half()
src = torch.rand(10, 2, 512, device="cuda").half()  # example shapes: (seq, batch, d_model)
tgt = torch.rand(10, 2, 512, device="cuda").half()
# Enable only the flash attention kernel for scaled dot product attention.
with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    output = model(src, tgt)
```
These few lines should be enough.
mike94025 t1_je5nrdi wrote
You’re looking in the wrong place. What you’re looking at is the BT gen 1 fastpath, not the BT gen 2 custom kernels.
You need to look at F.multi_head_attention_forward().
For now, the fastpath still services inference, pending a full rewrite of activation.py that will hopefully happen in a future release. (There’s always a tension between refactoring and introducing new features under a time- and staffing-constrained problem formulation.)
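For anyone following along, the custom kernels can also be exercised directly through F.scaled_dot_product_attention, which is the op the new backends sit behind. A rough sketch with made-up shapes (fp16, CUDA):

```
import torch
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel

# Made-up (batch, heads, seq_len, head_dim) tensors in fp16 on CUDA.
q = torch.rand(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.rand(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.rand(2, 8, 1024, 64, dtype=torch.float16, device="cuda")
# With only the flash kernel enabled, this call should dispatch to flash attention
# (or error out if flash attention can't handle the inputs).
with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```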