Submitted by super_deap t3_11tmpc5 in MachineLearning
royalemate357 t1_jckqgsr wrote
Reply to comment by RobbinDeBank in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
Pretty sure the main improvement is `torch.compile`, which can optimize your code with a nice, easy one-liner. There are some other quality-of-life improvements too, like the built-in flash attention OP is using, and I think some distributed training stuff. But it's fully backwards compatible, which is great (looking at you, TensorFlow): https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
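For anyone curious what the built-in attention looks like in practice, here is a minimal sketch, assuming the op in question is `torch.nn.functional.scaled_dot_product_attention` from PyTorch 2.0. The tensor shapes and variable names are made up for illustration. It only checks that the fused call matches a hand-written causal attention; whether a flash kernel is actually dispatched depends on your device, dtype, and build, and on CPU it likely won't be.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes only: (batch, heads, sequence length, head dimension).
batch, heads, seq_len, head_dim = 2, 4, 16, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Built-in fused attention; PyTorch picks the kernel for the current device.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Hand-written reference: softmax(Q K^T / sqrt(d) + causal mask) V.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
causal_mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
scores = scores.masked_fill(~causal_mask, float("-inf"))
reference = torch.softmax(scores, dim=-1) @ v

max_abs_diff = (fused - reference).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The reference version materializes the full `seq_len x seq_len` score matrix, which is the memory cost that fused kernels are designed to avoid at long context lengths.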
MoistYogurtcloset400 t1_jclv6r1 wrote
Is torch.compile compatible only with CUDA devices?
royalemate357 t1_jclz4t0 wrote
Hmm, I'm not too sure, but their blog post says this:
>TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs.
So it seems like they support CPU. I also tried it briefly on a CPU-only Google Colab instance, and it seems to work (I didn't benchmark speed, though). I doubt it supports non-CUDA GPUs, but then again, support for those isn't very good even in the general case.
mike94025 t1_jcn7ksu wrote
Works for all. You need a compiler backend that can codegen for your target, and a frontend for the optimizer that can process the IR.
Alternatively, you need a backend for Triton (or another already supported optimizer) that can codegen for your target architecture.
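To make the "backend" part concrete, here is a toy sketch, assuming the documented custom-backend hook where `torch.compile(..., backend=fn)` passes `fn` a captured `torch.fx.GraphModule` plus example inputs and expects a callable back. The names `toy_backend`, `captured_ops`, and `f` are hypothetical. This backend does no code generation at all; it records the ops it was handed and runs the graph as-is, which is roughly where a real backend for a new target would plug in its own codegen.

```python
import torch

captured_ops = []  # filled in by the backend when compilation is triggered


def toy_backend(gm: torch.fx.GraphModule, example_inputs):
    # A real backend would lower `gm` to code for its target here.
    # This one only records the function-call nodes it was given.
    for node in gm.graph.nodes:
        if node.op == "call_function":
            captured_ops.append(node.target)
    # Returning gm.forward runs the captured graph unchanged.
    return gm.forward


def f(x):
    return torch.relu(x) + 1.0


compiled_f = torch.compile(f, backend=toy_backend)

x = torch.randn(4)
out = compiled_f(x)  # first call triggers graph capture and toy_backend
print("ops seen by the backend:", captured_ops)
```

Since nothing is lowered to Triton or C++ here, this path doesn't need a compiler toolchain, which makes it handy for inspecting what the frontend captures.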
royalemate357 t1_jcnjaeo wrote
Oh cool, thanks for the clarification. Nice that you folks made it more backend-independent. It would be interesting to try it out on AMD/MPS devices; I wonder whether those requirements are met on those devices, though.
mike94025 t1_jcv7ltl wrote
You might look into https://github.com/pytorch/pytorch/pull/95793.
programmerChilli t1_jcny4qx wrote
We currently officially support CUDA and CPU, although in principle it could be used for other backends too.