Submitted by fxmarty t3_z1titt in MachineLearning
Demo: https://huggingface.co/spaces/fxmarty/bettertransformer-demo
Hi everyone,
In the latest PyTorch stable release, 1.13, the BetterTransformer feature was marked as stable! It is a free-lunch optimization that gives x1.25 - x4 speedups on inference for Transformer-based models. Notably, it leverages kernel fusion and the sparsity coming from padding tokens.
To support BetterTransformer with the canonical Transformer models from the Transformers library, an integration was done in the open-source library Optimum, as a one-liner:
from optimum.bettertransformer import BetterTransformer
model = BetterTransformer.transform(model)
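For context, here is a minimal end-to-end sketch of what that looks like in practice (the checkpoint name and inputs are just examples, not what the demo uses):

from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

# Load any supported encoder model from the Transformers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# One-liner conversion to BetterTransformer (fused, padding-aware kernels)
model = BetterTransformer.transform(model)

# Padded batch: the attention mask lets BetterTransformer skip work on padding tokens
inputs = tokenizer(["Hello world!", "A longer sentence that gets padded."], padding=True, return_tensors="pt")
outputs = model(**inputs)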
I made a Space to showcase the speedups we can get in an end-to-end case, with TorchServe used to deploy the model on a cloud instance (AWS EC2 g4dn, using one T4 GPU): https://huggingface.co/spaces/fxmarty/bettertransformer-demo
The idea of the Space is to show two use case scenarios:
- One-shot input (batch size = 1), where we would like to optimize for latency (this is not where BetterTransformer shines; it performs better with larger batch sizes and some padding)
- Heavy workload, with many inference requests, where we would like to optimize for throughput (samples/s)
The TL;DR is:
- You can reduce your latency by x1.25 - x4 depending on your hardware (even better results on Ampere; CPU can be leveraged as well), batch size, sequence length, and padding ratio (see the rough benchmark sketch after this list).
- TorchServe is great for out-of-the-box deployment, although it requires some configuration. Achieving maximum throughput is not super straightforward, and I am convinced we could get even better results than in the demo by tuning TorchServe or using other serving tools.
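If you want to measure the effect on your own model, here is a rough benchmark sketch, assuming a CUDA GPU is available (the checkpoint, batch composition, and iteration counts are arbitrary placeholders):

import time
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vanilla = AutoModel.from_pretrained("bert-base-uncased").eval().to("cuda")
better = BetterTransformer.transform(AutoModel.from_pretrained("bert-base-uncased")).eval().to("cuda")

# A padded batch mixing short and long sequences, so padding sparsity actually matters
texts = ["short", "a much longer sentence " * 20] * 16
inputs = tokenizer(texts, padding=True, return_tensors="pt").to("cuda")

def bench(model, n=50):
    with torch.inference_mode():
        for _ in range(5):  # warmup
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n):
            model(**inputs)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / n

print(f"vanilla: {bench(vanilla) * 1000:.1f} ms/batch, bettertransformer: {bench(better) * 1000:.1f} ms/batch")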
For more precise benchmarks, also check out the blog post about the integration on PyTorch's Medium, and the Optimum documentation for more details on the implementation!
If you would like to deploy a BetterTransformer-powered model very straightforwardly, I would recommend trying HF's Inference Endpoints with a custom handler (rough sketch below). In the future I'll try Nvidia Triton as well, although I hear it can be a bit more involved to configure than TorchServe.
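A custom handler is just a handler.py at the root of the model repository exposing an EndpointHandler class; a minimal sketch of one that swaps in BetterTransformer could look like this (the task and the shape of the request payload are assumptions for illustration):

# handler.py, placed at the root of the model repository
from transformers import pipeline
from optimum.bettertransformer import BetterTransformer

class EndpointHandler:
    def __init__(self, path=""):
        # Load the pipeline from the repository, then convert the underlying model
        self.pipeline = pipeline("text-classification", model=path)
        self.pipeline.model = BetterTransformer.transform(self.pipeline.model)

    def __call__(self, data):
        # Inference Endpoints sends a dict like {"inputs": ...}
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)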
Kudos to Hamid Shojanazeri from PyTorch for his great advice on the demo!
visarga t1_ixd4ks5 wrote
Does it include Flash Attention?