fxmarty OP t1_ixhpdui wrote

It's a vast question, really. If you are able to convert your model to ONNX and get meaningful outputs, that's a good start: it means you don't have dynamic control flow and your model is traceable.

I would recommend giving OpenVINO or ONNX Runtime a try. Both can consume the ONNX intermediate representation.
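For instance, a minimal export-then-run sketch could look like this (toy model and arbitrary shapes, just to illustrate the flow, not a recipe for your specific model):

```python
import torch
import onnxruntime as ort

# Toy stand-in model; replace with your own (in eval mode).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

dummy_input = torch.randn(1, 128)

# Tracing-based export: this only works cleanly if there is no
# data-dependent control flow (the "traceable" requirement above).
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)

# Consume the exported graph with ONNX Runtime on CPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy_input.numpy()})
print(outputs[0].shape)  # (1, 10)
```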

If you are specifically dealing with transformer-based models built on the implementations in the Transformers library, I would recommend taking a look at https://huggingface.co/blog/openvino and the documentation (and at Optimum for ONNX Runtime, which could make your life easier).
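To give a rough idea of the Optimum path, something along these lines (the checkpoint is just an example; depending on the Optimum version the export flag is export=True or the older from_transformers=True):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Exports the Transformers model to ONNX and wraps it for ONNX Runtime inference.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

inputs = tokenizer("This now runs through ONNX Runtime under the hood.", return_tensors="pt")
logits = model(**inputs).logits
```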

Overall, compression techniques like structured pruning and quantization can be leveraged on CPUs, but once you get into edge cases there may be diminishing returns compared to the time spent trying to optimize. Neural Magic has a closed-source inference engine that seems to have good recipes for exploiting sparsity on CPUs.

Did not read it but this paper from Intel looks interesting: https://arxiv.org/abs/2211.07715

3

killver t1_ixhqi87 wrote

Thanks for the reply. Yeah, ONNX and OpenVINO are already promising, but quantization on top makes the accuracy awful and actually makes things slower; maybe I am doing something wrong. I also had no luck with the Optimum library, which honestly has very bad documentation and API, and is a bit too tailored to using the Transformers library out of the box.

1

fxmarty OP t1_ixi140r wrote

Are you doing dynamic or static quantization? Static quantization can be tricky, while dynamic quantization is usually more straightforward. Also, if you are dealing with encoder-decoder models, it could be that quantization error accumulates in the decoder.

As for the slowdowns you are seeing, there could be many reasons. The first thing to check is whether running through ONNX Runtime / OpenVINO is at least on par with (if not better than) PyTorch eager. If not, there may be an issue at a higher level (e.g. here). If yes, it could be, for example, that your CPU does not support AVX VNNI instructions. Also, depending on batch size and sequence length, the speedups from quantization may vary greatly.
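As a rough sanity check (a sketch only, with a toy model standing in for yours), comparing the two on the same float model could look like:

```python
import time
import torch
import onnxruntime as ort

# Toy stand-in model exported to ONNX; substitute your own model and inputs.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 2)
).eval()
x = torch.randn(1, 768)
torch.onnx.export(model, x, "fp32.onnx", input_names=["x"], output_names=["y"])

def bench(fn, warmup=10, iters=200):
    # Average latency in milliseconds after a short warmup.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3

session = ort.InferenceSession("fp32.onnx", providers=["CPUExecutionProvider"])
x_np = x.numpy()
ort_ms = bench(lambda: session.run(None, {"x": x_np}))

with torch.inference_mode():
    eager_ms = bench(lambda: model(x))

print(f"ONNX Runtime: {ort_ms:.3f} ms/call, PyTorch eager: {eager_ms:.3f} ms/call")
```

If the float ONNX Runtime numbers already lose to eager PyTorch, quantization is not the place to look first.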

Yes, Optimum's documentation is unfortunately not yet in the best shape. I would be really thankful if you filed an issue detailing where the docs can be improved: https://github.com/huggingface/optimum/issues . Also, if you have feature requests, such as a more flexible API, we are eager for community contributions and suggestions!

3

killver t1_ixi5dns wrote

I actually only tried dynamic quantization by using onnxruntime.quantization.quantize_dynamic - is there anything better?

1

fxmarty OP t1_ixi7sge wrote

Not that I know of (at least in the ONNX ecosystem). I would recommend tuning the available arguments: https://github.com/microsoft/onnxruntime/blob/9168e2573836099b841ab41121a6e91f48f45768/onnxruntime/python/tools/quantization/quantize.py#L414
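For instance, something like the following (the argument values are only there to illustrate the knobs, not a recommended recipe; what actually helps is model- and CPU-dependent):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,      # QUInt8 is the other option
    per_channel=True,                 # per-channel weight quantization can help accuracy
    reduce_range=False,               # worth trying True on CPUs without VNNI
    op_types_to_quantize=["MatMul"],  # e.g. restrict quantization to MatMul nodes
    nodes_to_exclude=[],              # or exclude specific sensitive nodes by name
)
```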

If you are dealing with a canonical model, feel free to file an issue as well!

1

killver t1_ixiah49 wrote

Thanks a lot for all these replies. I have one more question, if you don't mind: sometimes I have Hugging Face models as a backbone in my model definitions. How would I go about applying the transformer-based quantization only to the backbone? Usually these tools are called on the full model, but if my full model is already in ONNX format it gets complicated.

1