RobbinDeBank t1_jcki2vl wrote on March 17, 2023 at 2:26 PM

What are its biggest improvements over pytorch 1?

royalemate357 t1_jckqgsr wrote on March 17, 2023 at 3:22 PM

Pretty sure the main improvement is "torch.compile", which can optimize your code in a nice, easy one-liner. There are some other nice quality-of-life improvements like the built-in flash attention OP is using, and I think some distributed training stuff. But it's fully backwards compatible, which is great (looking at you, TensorFlow): https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
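A minimal sketch of the "one-liner" mentioned above. The toy model, its sizes, and the helper name `compile_and_run` are made up for illustration. It passes `backend="eager"` only to keep the sketch quick and free of a C++ toolchain; that backend traces the model but does not generate optimized code, so dropping the argument (to get the default TorchInductor backend from the linked post) is what you would do in practice. The fallback to the uncompiled module is an assumption about environments where compilation is unavailable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; any nn.Module or plain function can be wrapped.
model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 4)).eval()


def compile_and_run(module, x, backend="eager"):
    """Wrap `module` with torch.compile and run it once on `x`."""
    try:
        # The one-liner. Compilation is lazy and happens on the first call.
        compiled = torch.compile(module, backend=backend)
        return compiled(x)
    except Exception:
        # Assumption: torch.compile may be unsupported on some platforms or
        # Python versions; fall back to the ordinary eager module there.
        return module(x)


x = torch.randn(8, 16)
with torch.no_grad():
    expected = model(x)               # plain eager execution
    actual = compile_and_run(model, x)  # compiled (or fallback) execution
```

Both paths should produce the same numbers; the compiled module is a drop-in replacement for the original one.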

MoistYogurtcloset400 t1_jclv6r1 wrote on March 17, 2023 at 7:44 PM

Is torch.compile only compatible with CUDA devices?

royalemate357 t1_jclz4t0 wrote on March 17, 2023 at 8:09 PM

hmm, I am not too sure but their blogpost says this:

>TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs.

so it seems like they support CPU. I also tried it briefly on Google Colab CPU-only, and it seems to work (I didn't benchmark speed though). I doubt it supports non-CUDA GPUs, but then again support for those isn't very good even in the general case.

mike94025 t1_jcn7ksu wrote on March 18, 2023 at 1:26 AM

Works for all. You need a compiler backend that can code-gen for your target, and need a frontend for the optimizer that can process the IR.

Alternatively, you need a backend for Triton (or another already supported optimizer) that can codegen for your target architecture.

royalemate357 t1_jcnjaeo wrote on March 18, 2023 at 3:04 AM

oh cool, thanks for the clarification. Nice that you folks made it more backend-independent. It would be interesting to try it out on AMD/MPS devices; I wonder if those requirements are met on those devices, though.

mike94025 t1_jcv7ltl wrote on March 19, 2023 at 8:30 PM

You might look into https://github.com/pytorch/pytorch/pull/95793.

programmerChilli t1_jcny4qx wrote on March 18, 2023 at 5:33 AM

We currently officially support Cuda and CPU, although in principle it could be used for other backends too.

[deleted] t1_jckkb5t wrote on March 17, 2023 at 2:41 PM

[deleted]

Competitive-Rub-1958 t1_jcl97q0 wrote on March 17, 2023 at 5:22 PM

> - Either autograd is disabled (using torch.inference_mode or torch.no_grad) or no tensor argument requires_grad
> - training is disabled (using .eval())

What's the point of FlashAttention if you can't use it during training? 🤔

https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html

mike94025 t1_jcly3mi wrote on March 17, 2023 at 8:03 PM

Documentation was not updated. Yes, you can use flash attention for training.

The first version included only forward() as we were resolving some issues with backward(). Docstring will be updated.

Competitive-Rub-1958 t1_jcm5ahk wrote on March 17, 2023 at 8:50 PM

cool! So I just need to enable `flash_sdp`, then ensure I'm basically computing self-attention and have `batch_first=True`. Would that be correct?

mike94025 t1_jcmho8t wrote on March 17, 2023 at 10:16 PM

Don't call flash_sdp directly. That way you're locked into particular hardware and create non-portable models. You can either use F.scaled_dot_product_attention() or use nn.MultiheadAttention. In either case it will pick the right implementation based on the hardware you have and the constraints. Ideally, the constraints will be weakened in the future, and/or new kernels might support other operating points in an optimized manner, and then the kernel picker can dispatch to that implementation.

See the kernel-picker logic that dispatches based on input characteristics in the source code, and/or the SDPA tutorial here => https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html
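A minimal sketch of the portable route described above: call `F.scaled_dot_product_attention()` and let PyTorch choose the kernel for the current device, dtype, and shapes. The tensor shapes are made up for illustration, and the hand-written reference is only there to show what the fused call computes.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical shapes: (batch, heads, sequence length, head dim).
q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)

# Portable call: the kernel picker decides which implementation runs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Reference: the same causal attention written out step by step.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
causal_mask = torch.ones(16, 16, dtype=torch.bool).tril()
scores = scores.masked_fill(~causal_mask, float("-inf"))
reference = torch.softmax(scores, dim=-1) @ v
```

The same code runs unchanged on CPU or GPU; only the kernel selected underneath differs.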

Competitive-Rub-1958 t1_jcn8bti wrote on March 18, 2023 at 1:32 AM

cool. I just wanted to make it explicit to make sure I'm running `FlashAttention`. Perhaps there's an easy way to check that?

mike94025 t1_jcv83hu wrote on March 19, 2023 at 8:34 PM

Yes - use the backend context manager to disable all other backends, so you can see that you're running the one you want. (If that backend can't handle your inputs, you'll get an error, since all other backends are disabled.)

The SDPA context manager is intended to facilitate debugging (for perf or correctness), and is not (and should not be) required for normal operational usage.

Check out the SDPA tutorial at https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#explicit-dispatcher-control

Competitive-Rub-1958 t1_jd40cwb wrote on March 21, 2023 at 5:56 PM

would that mean for forcing MHA to use it, I should wrap the ctxmanager around the line where I forward through it?

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_mem_efficient=True):
    x = x + self.attn_head(x, x, x, need_weights=False)[0]

because that doesn't really seem to work :(

mike94025 t1_je5mfa8 wrote on March 29, 2023 at 4:10 PM

This doesn't force it. It says that flash is enabled, and so are others. To force it, you have to disable all other kernels. Then it’s flash or bust.

You can find more in our blog which got published today and the SDPA tutorial. Both are linked here https://www.linkedin.com/posts/michael-gschwind-3704222_pytorch-activity-7046773418288955393-gOSh

PS: the context manager can be used anywhere outside the call as well, including around the call to model.forward.
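A minimal sketch of "flash or bust" around an `nn.MultiheadAttention` forward, using the PyTorch 2.0 context manager from the thread with every other backend disabled. The helper name `flash_only_forward` and the layer sizes are made up for illustration. It assumes a CUDA GPU and fp16 inputs, and it is skipped otherwise. Later PyTorch releases deprecate `torch.backends.cuda.sdp_kernel` in favor of `torch.nn.attention.sdpa_kernel`, so treat the exact call as version-dependent.

```python
import torch
import torch.nn as nn


def flash_only_forward(mha, x):
    """Run self-attention with every SDPA backend except flash disabled.

    Returns the attention output, or None if no flash kernel accepts
    these inputs (PyTorch raises a RuntimeError in that case).
    """
    try:
        with torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_math=False, enable_mem_efficient=False
        ):
            return mha(x, x, x, need_weights=False)[0]
    except RuntimeError:
        return None


if torch.cuda.is_available():
    # Hypothetical layer sizes; fp16 weights and inputs, as flash requires.
    mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
    mha = mha.cuda().half()
    x = torch.randn(2, 128, 64, device="cuda", dtype=torch.float16)
    out = flash_only_forward(mha, x)
    print("flash kernel ran" if out is not None else "no flash kernel available")
else:
    out = None
    print("CUDA not available; skipping the flash-only check")
```

Catching `RuntimeError` here is a convenience for the sketch; when debugging, letting the error surface shows why the kernel was rejected.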

JustOneAvailableName t1_jcobgnf wrote on March 18, 2023 at 8:36 AM

That is a very nice surprise

oathbreakerkeeper t1_jd0lu2p wrote on March 20, 2023 at 11:34 PM

Am I looking in the wrong place? It seems like the torch 2.0 code still requires training==False in order to use FlashAttention:

https://github.com/pytorch/pytorch/blob/663e7c9eeb66fb049b8487a6a5a7ea4311fb53d3/torch/nn/modules/activation.py#L1139

Dependent_Ad5120 t1_jd3m0ce wrote on March 21, 2023 at 4:25 PM

try fp16, that doesn't require training=False apparently.

oathbreakerkeeper t1_jd43931 wrote on March 21, 2023 at 6:14 PM

I'm using AMP mixed precision, which should be using fp16. It still requires training==False.

But the torch code also disables flash attention if autocast is enabled. I'm not sure how to deal with that one.

Dependent_Ad5120 t1_jdec7kx wrote on March 23, 2023 at 7:57 PM

I don't know. I was using pure fp16, no autocast and it works.

oathbreakerkeeper t1_jdgjte0 wrote on March 24, 2023 at 6:14 AM

How do you use pure fp16 out of curiosity? I've only ever trained with mixed precision, letting pytorch handle the fp16 stuff from there.

Do you have an example of a github repo that does it?

Dependent_Ad5120 t1_je5qfmp wrote on March 29, 2023 at 4:35 PM

I don't have a github repo for this, but it is pretty simple:

```
model = nn.Transformer().cuda().half()
input = torch.rand(...).cuda().half()
with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):  # enable only flash attn
    output = model(input)
```

These 4 lines should be enough.

mike94025 t1_je5nrdi wrote on March 29, 2023 at 4:18 PM

You’re looking in the wrong place. What you’re looking at is the BT gen 1 fastpath, not the BT gen 2 custom kernels.

You need to look at F.multi_head_attention_forward().

For now, the fastpath still services inference, pending a full rewrite of activation.py that will hopefully land in a future release. (There’s always a tension between refactoring and introducing new features under a time- and staffing-constrained problem formulation.)

Dependent_Ad5120 t1_jd1d00j wrote on March 21, 2023 at 2:53 AM

It seems to me that I have to call model.eval() to use the memory_efficient attention. Otherwise, it throws an error of no available kernel.

I tried on both an RTX 3090 and an A100. In both cases, having only enable_flash=True resulted in the same "no available kernel" error, even with model.eval().

So my questions are:

1. With model.eval(), does that mean dropout is not enabled during training?
2. Am I doing something wrong for flash attention? How do I actually enable it?

Thanks a lot!
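On the first question: `model.eval()` switches modules such as `nn.Dropout` to their inference behavior, so dropout is not applied while the model is in that mode. A minimal sketch with a standalone dropout layer (the tensor size and drop probability are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # roughly half the entries zeroed, the rest scaled by 1 / (1 - p)

drop.eval()
y_eval = drop(x)   # dropout is a no-op in eval mode

zeroed_fraction = (y_train == 0).float().mean().item()
print(f"train mode zeroed {zeroed_fraction:.0%} of entries")
print("eval mode output equals input:", torch.equal(y_eval, x))
```

So training with the model left in eval mode means training without dropout.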

Dependent_Ad5120 t1_jd3knio wrote on March 21, 2023 at 4:17 PM

OK, I found out why. To use flash attention, I had to use fp16. It is a bit faster than using memory_efficient attention in my test.

mike94025 t1_je5o126 wrote on March 29, 2023 at 4:20 PM

https://www.linkedin.com/posts/michael-gschwind-3704222_pytorch-activity-7046773418288955393-gOSh

cthorrez t1_jclwi3d wrote on March 17, 2023 at 7:52 PM

Fast inference is also important to anyone who wants to deploy these things.

mike94025 t1_jcmepd6 wrote on March 17, 2023 at 9:55 PM

With Better Transformer, ¿Por que no los dos?

cthorrez t1_jcmhg8m wrote on March 17, 2023 at 10:14 PM

Because the algorithm seems to only work on inference. Probably due to memory management of the cached activations or something. (Idk the actual technical reasons)

mike94025 t1_jcmlddm wrote on March 17, 2023 at 10:42 PM

Better Transformer supports both, today. Some optimizations are still inference-only (in particular, support for variable-sequence-length Nested Tensor) and the inference fastpath is a bit siloed, but nothing that a future PyTorch update could not fix.

[deleted] t1_jcmtgb4 wrote on March 17, 2023 at 11:41 PM

[removed]

ChuckSeven t1_jcjt0je wrote on March 17, 2023 at 10:40 AM

This is the way.

[D] PyTorch 2.0 Native Flash Attention 32k Context Window

No-Belt7582 t1_jcjqk6s wrote on March 17, 2023 at 10:08 AM