Submitted by fxmarty t3_z1titt in MachineLearning
Demo: https://huggingface.co/spaces/fxmarty/bettertransformer-demo
Hi everyone,
In the latest PyTorch stable release, 1.13, the BetterTransformer feature was marked as stable! It is a free-lunch optimization that gives x1.25 - x4 speedups on inference for Transformer-based models. Notably, it leverages kernel fusion and the sparsity coming from padding tokens.
To support BetterTransformer with the canonical Transformer models from the Transformers library, an integration was done in the open-source library Optimum, as a one-liner:
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
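For context, here is a minimal end-to-end sketch of what that looks like in practice (the checkpoint name and inputs are just examples, not what the demo uses):

from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

# Load any supported encoder model from the Transformers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# One-liner conversion to BetterTransformer (fused, padding-aware kernels)
model = BetterTransformer.transform(model)

# Padded batch: the attention mask lets BetterTransformer skip work on padding tokens
inputs = tokenizer(["Hello world!", "A longer sentence that gets padded."], padding=True, return_tensors="pt")
outputs = model(**inputs)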
I made a Space to showcase the speedups we can get in an end-to-end case, with TorchServe used to deploy the model on a cloud instance (AWS EC2 g4dn, using one T4 GPU): https://huggingface.co/spaces/fxmarty/bettertransformer-demo
The idea of the Space is to show two use case scenarios:
- One-shot input (batch size = 1), where we would like to optimize for latency (this is not where BetterTransformer shines; it performs better with larger batch sizes and some padding)
- Heavy workload, with many inference requests, where we would like to optimize for throughput (samples/s)
The TL;DR is:
- You can reduce your latency by x1.25 - x4 depending on your hardware (even better results on Ampere; CPU can be leveraged as well), batch size, sequence length, and padding ratio (see the rough benchmark sketch after this list).
- TorchServe is great for out-of-the-box deployment, although it requires some configuration. Achieving maximum throughput is not super straightforward, and I am convinced we could get even better results than in the demo by tuning TorchServe or using other serving tools.
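If you want to measure the effect on your own model, here is a rough benchmark sketch, assuming a CUDA GPU is available (the checkpoint, batch composition, and iteration counts are arbitrary placeholders):

import time
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vanilla = AutoModel.from_pretrained("bert-base-uncased").eval().to("cuda")
better = BetterTransformer.transform(AutoModel.from_pretrained("bert-base-uncased")).eval().to("cuda")

# A padded batch mixing short and long sequences, so padding sparsity actually matters
texts = ["short", "a much longer sentence " * 20] * 16
inputs = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")

def bench(model, n=50):
    with torch.inference_mode():
        for _ in range(5):  # warmup
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n):
            model(**inputs)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / n

print(f"vanilla: {bench(vanilla) * 1000:.1f} ms/batch, bettertransformer: {bench(better) * 1000:.1f} ms/batch")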
For more precise benchmarks, also check out the blog post about the integration on PyTorch's Medium, and the Optimum documentation for more details on the implementation!
If you would like to deploy a BetterTransformer-powered model very straightforwardly, I would recommend trying HF's Inference Endpoints with a custom handler (rough sketch below). In the future I'll try Nvidia Triton as well, although I hear it can be a bit more involved to configure than TorchServe.
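A custom handler is just a handler.py at the root of the model repository exposing an EndpointHandler class; a minimal sketch of one that swaps in BetterTransformer could look like this (the task and the shape of the request payload are assumptions for illustration):

# handler.py, placed at the root of the model repository
from transformers import pipeline
from optimum.bettertransformer import BetterTransformer

class EndpointHandler:
    def __init__(self, path=""):
        # Load the pipeline from the repository, then convert the underlying model
        self.pipeline = pipeline("text-classification", model=path)
        self.pipeline.model = BetterTransformer.transform(self.pipeline.model)

    def __call__(self, data):
        # Inference Endpoints sends a dict like {"inputs": ...}
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)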
Kudos to Hamid Shojanazeri from PyTorch for his great advice on the demo!
visarga t1_ixd4ks5 wrote
Does it include Flash Attention?