Submitted by big_dog_2k t3_yg1mpz in MachineLearning

TL;DR I am trying to work out the ‘best’ options for speeding up model inference and model serving.

Specifically, I am looking to host a number of PyTorch models and want -

  1. the fastest inference speed,
  2. an easy-to-use, easy-to-deploy model serving framework that is also fast.

For 1), what is the easiest way to speed up inference (assume PyTorch only, primarily GPU but also some CPU)? I have been using ONNX and TorchScript, but there is a bit of a learning curve and it can sometimes be tricky to get the model to actually work. Is there anything else worth trying? I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models, which also looks interesting. Are things like SageMaker Neo or NeuralMagic worth trying? My only reservation with some of these is that they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into them unless I know others have had some success first.
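
For reference, the appeal of TorchDynamo to me is that it is roughly a one-line opt-in. A minimal sketch of what I mean, assuming a PyTorch build that exposes torch.compile as the TorchDynamo entry point (the model here is just a stand-in):

```python
# Minimal sketch of the TorchDynamo-style "one-liner" speedup; assumes a PyTorch
# build that ships torch.compile. The model and shapes are placeholders.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 10)
).eval().to(device)

compiled_model = torch.compile(model)  # TorchDynamo captures and optimizes the graph

x = torch.randn(8, 512, device=device)
with torch.inference_mode():
    out = compiled_model(x)  # first call compiles; subsequent calls reuse the fast path
print(out.shape)  # torch.Size([8, 10])
```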

For 2), I am aware of a few options. Triton inference server is an obvious one, as is the ‘transformer-deploy’ version from LDS. My only reservation here is that they require model compilation or are architecture-specific. I am aware of others like Bento, Ray Serve and TorchServe. Ideally I would have something that allows any PyTorch model to be used without the extra compilation effort (or at least makes it optional) and has some conveniences: easy to use, easy to deploy, easy to host multiple models, and able to perform some dynamic batching. Anyway, I am really interested to hear people's experience here, as I know there are now quite a few options! Any help is appreciated!
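
To give a sense of what I mean by ease of use on the serving side, here is a rough client-side sketch of calling a Triton-style server, assuming the `tritonclient` package and illustrative model/tensor names (server-side options like dynamic batching would live in the model's config, not shown here):

```python
# Rough client-side sketch for a Triton-style server; assumes the `tritonclient`
# package and a loaded model named "my_model" with tensors "input"/"logits".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.randn(4, 512).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("logits")],
)
print(result.as_numpy("logits").shape)
```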

Disclaimer - I have no affiliation with, and am not connected in any way to, the libraries or companies listed here. These are just the ones I know of.

Thanks in advance.

55

Comments


sobagood t1_iu6ucfo wrote

If you intend to run on CPU or other Intel hardware, OpenVINO is a great choice. It is optimised for Intel hardware and is indeed faster than the alternatives there.

9

whata_wonderful_day t1_iu81vzp wrote

I tried OpenVINO ~1.5 years back and it didn't match ONNX Runtime on transformers. For CNNs it's the fastest, though. I also found OpenVINO to be pretty buggy and not user-friendly; I needed to fix their internal transformer conversion script.

4

big_dog_2k OP t1_iu6yc7b wrote

Thanks! Does it work with non-Intel chipsets, and how easy have you found it to use?

1

sobagood t1_iu6zuhk wrote

If you mean NVIDIA GPUs, there is a CUDA plugin to run it on them, but I have never tried it. It has several other plugins, so you could check those out. It also provides its own deployment server. NVIDIA Triton also supports the OpenVINO runtime, without GPU support, for an obvious reason. Like ONNX, they have a process that transforms the graph into their intermediate representation with the ‘model optimizer’, which can go wrong. If you can successfully create this representation, there should be no new bottleneck.
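
Roughly, the workflow looks like this (a sketch, assuming the 2022-era openvino.runtime Python API; file names are illustrative):

```python
# Sketch: run an OpenVINO IR produced by the model optimizer (e.g.
# `mo --input_model model.onnx`); assumes the 2022-era openvino.runtime API.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")           # IR from the model optimizer
compiled = core.compile_model(model, "CPU")    # device plugin: "CPU", "GPU", ...

dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([dummy])                     # results keyed by output ports
print(result[compiled.output(0)].shape)
```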

1

big_dog_2k OP t1_iu7paw3 wrote

Thanks. I might need to take a closer look. I was also thinking of AMD- and ARM-based CPUs. I was surprised at how good CPU-based inference can be for some models these days.

1

sobagood t1_iu801e4 wrote

I don't think they support AMD, as the two are rivals.

1

BestSentence4868 t1_iu8fluq wrote

Love love love Triton inference server for the framework flexibility. Super mature, with so much stuff I never thought I'd need, like model warmup. ORT and TensorRT are cool, but if all else fails, the Python backend is awesome.

6

big_dog_2k OP t1_iu8gcr1 wrote

Great! Does Triton allow something like native PyTorch models? Or is it just ONNX, TensorRT, and TorchScript?

1

BestSentence4868 t1_iu8gi68 wrote

Yep! Fire up Triton (I used their Docker container), install PyTorch via pip or just put it in the Dockerfile, and you're off to the races! I actually deployed Triton + PyTorch + Flask for a web app just this week :)
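
Roughly, the Python backend just needs a model.py along these lines (a sketch, not a drop-in file; tensor names and the model path are illustrative):

```python
# Sketch of a Triton Python-backend model.py wrapping a native PyTorch model.
# Tensor names ("INPUT__0"/"OUTPUT__0") and the model path are illustrative.
import torch
import triton_python_backend_utils as pb_utils  # available inside the Triton container


class TritonPythonModel:
    def initialize(self, args):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Load a plain PyTorch model once per model instance.
        self.model = torch.load("/models/my_model/1/model.pt",
                                map_location=self.device).eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            inp = pb_utils.get_input_tensor_by_name(request, "INPUT__0").as_numpy()
            with torch.inference_mode():
                out = self.model(torch.from_numpy(inp).to(self.device)).cpu().numpy()
            out_tensor = pb_utils.Tensor("OUTPUT__0", out)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```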

1

big_dog_2k OP t1_iu8gxg1 wrote

Wow! I did not know that! I think I have answers to my questions now.

1

poems_4_you t1_iu7sovr wrote

I use Triton and have been very pleased thus far.

3

big_dog_2k OP t1_iu86jfr wrote

Thanks! I have now seen quite a consistent theme from people that Triton is worth it. I might then bite the bullet and invest more time in getting ONNX conversions right.

2

robdupre t1_iu7z0uu wrote

We use ONNX models deployed using NVIDIA's TensorRT. We have been impressed with it so far.
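
For reference, the basic pipeline is something like this (a sketch, assuming onnxruntime-gpu built with the TensorRT execution provider; the model and names are placeholders):

```python
# Sketch: export a PyTorch model to ONNX, then run it through ONNX Runtime's
# TensorRT execution provider (falls back to CUDA/CPU if TensorRT is unavailable).
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(128, 10).eval()          # stand-in for a real model
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}})

session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
logits = session.run(None, {"input": np.random.randn(4, 128).astype(np.float32)})[0]
print(logits.shape)  # (4, 10)
```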

3

pommedeterresautee t1_iu9zg8x wrote

Hi, author of transformer-deploy and Kernl here. Whatever option you choose, something to keep in mind alongside speed is being able to maintain output precision. I can tell you it's our number one pain point, on both TensorRT and ONNX Runtime. We have even built some tooling to help with that; it helps, but it's not yet perfect. Triton inference server is really a cool option with good documentation.

3

big_dog_2k OP t1_iua8wgp wrote

Thanks. I was aware of this and had some difficulty with it in the past. My evaluation criteria now compare precision loss across model outputs as well as performance (accuracy or equivalent) measured on the full system. What methods have you found to mitigate this? I would love to know!

1

pommedeterresautee t1_iuacchj wrote

To mitigate precision issues:

  • on ONNX-related engines, we built a tool to check the output of each node and tag those that won't behave well in fp16 or bf16 (a rough sketch of that kind of check is below). Described here: https://www.reddit.com/r/MachineLearning/comments/uwkpmt/p_what_we_learned_by_making_t5large_2x_faster/
  • on Kernl, we "just" understand what happens, as the code is simple (and we wrote it). We chose not to do terrible things to make inference faster: basically no approximations in our kernels, and accumulation is in fp32 (so it's actually even better than vanilla mixed precision, and still much faster). IMO that's the most robust approach...
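
As a rough illustration of the kind of check involved (not our actual tooling, just a sketch comparing full- and reduced-precision outputs; model and shapes are placeholders):

```python
# Sketch: quantify drift between a full-precision module and a reduced-precision
# copy before committing to fp16/bf16 deployment.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16

ref_model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).eval().to(device)
low_model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).eval().to(device)
low_model.load_state_dict(ref_model.state_dict())
low_model = low_model.to(low_dtype)

x = torch.randn(8, 256, device=device)
with torch.inference_mode():
    ref = ref_model(x)
    out = low_model(x.to(low_dtype)).float()

max_abs = (ref - out).abs().max().item()
cos = torch.nn.functional.cosine_similarity(ref.flatten(), out.flatten(), dim=0).item()
print(f"max abs diff: {max_abs:.2e}  cosine similarity: {cos:.6f}")
```
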
1

big_dog_2k OP t1_iuaexff wrote

Thank you! I think I will try Kernl today as well. If I understand correctly, only Ampere-generation cards are supported? Also, does it work on any Hugging Face model, or are there still exceptions?

1

pommedeterresautee t1_iuaodj2 wrote

Yes for Ampere.

For HF models, the kernels will work for most of them out of the box, but you need to have search/replace patterns for your specific architecture. That's why we do not have our own implementations of X and Y.

Check https://github.com/ELS-RD/kernl/blob/main/src/kernl/optimizer/linear.py for an example.

1

big_dog_2k OP t1_iuaw55q wrote

Great. I might try this out, as I like the direction this is going, plus it seems like PyTorch is heading in a similar direction. I'll let you know if I have questions, or I will raise them on GitHub. I appreciate all the information!

2

jukujala t1_iu7x05z wrote

Has anyone tried to transform ONNX to a TF SavedModel and use TF Serving? TF, at least in the past, was good at inference.
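
(In case it helps, the conversion I have in mind is roughly this, assuming the onnx-tf package; file names are illustrative:)

```python
# Sketch: convert an ONNX graph to a TF SavedModel with onnx-tf; TF Serving can
# then load the exported directory. Assumes the `onnx-tf` package is installed.
import onnx
from onnx_tf.backend import prepare

onnx_model = onnx.load("model.onnx")
tf_rep = prepare(onnx_model)            # wrap the ONNX graph as a TensorFlow module
tf_rep.export_graph("saved_model_dir")  # SavedModel layout that TF Serving understands
```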

2

ibmw t1_iu7zr9l wrote

In my previous company, we used NVIDIA Triton + ONNX + ONNX Runtime. It works well, but with some engineering, because the models we used were not fully supported by ONNX, and we did some work to be able to swap out some components (like replacing the Python/conda environment with a more generic and faster solution).

In addition, we had some models that run on CPU (without OpenVINO -- actually, we didn't have time to test that), and we used a k8s cluster to deploy and handle the scaling. It works, but we still needed to improve the inference time to align with the use cases... I don't know if they have managed to tackle this part since my departure.

Finally, we did some benchmarks (Triton, KServe, TorchServe, SageMaker), and with Triton (plus engineering) we managed to get the best result for throughput (our target, though I know we could have done the same for latency).

2

big_dog_2k OP t1_iu86mb7 wrote

Thanks! It sounds like investing time in ONNX and using Triton is the best bet.

2

yubozhao t1_iu6huvq wrote

Hello. I am the founder of BentoML. We are working on integration with Triton and other high-performance serving runtimes.

−5

big_dog_2k OP t1_iu6yjf3 wrote

Hi! Can you give the elevator pitch for Bento? When should I use it, and which parts of my model serving problem will it solve? If you integrate with another serving solution, how much complexity does that add, and how are you thinking about deployment?

4

braintampon t1_iu9n6k7 wrote

Why is this dude downvoted?

1

yubozhao t1_iu9qzm3 wrote

I guess others see this as spammy or an ad? Honestly, I disclosed who I am and didn't try to sell (from my POV). I guess that's not welcome in this sub. shrug.jpg

Edit: typo

3

braintampon t1_iua3jkt wrote

Hahah, I mean your answer is quite pertinent to OP's post, and also I don't see how selling is wrong lmao

But being the founder of BentoML, what is your answer to OP's question tho? Which is the fastest, most dev-friendly model serving framework according to you? Which model serving framework, in your opinion, is the biggest threat (competitor) to BentoML? Is there some benchmarking you guys have done that indicates potential inference speedups?

My organisation uses BentoML and I personally love what y'all have done with it, btw. Would be awesome to get your honest opinion on OP's question.

TIA!

1

big_dog_2k OP t1_iua8imm wrote

Great! Exactly this, I just want someone to provide feedback. Do you see throughput improvements using Bento with dynamic batching vs without? Is the throughput good in general, or is the biggest benefit ease of use?

2

programmerChilli t1_iufqn15 wrote

Well, you disclosed who you are, but that's pretty much all you did :P

The OP asked a number of questions, and you didn't really answer any of them. You didn't explain what BentoML can offer, you didn't explain how it can speed up inference, you didn't really even explain what BentoML is.

Folks will tolerate "advertising" if it comes in the form of interesting technical content. However, you basically just mentioned your company and provided no technical content, so it's just pure negative value from most people's perspective.

1

yubozhao t1_iufs7wp wrote

Fair enough. I will probably get to it. I don't know about you, but I need to “charge up” and make sure my answer is good. That takes time, and it was Halloween week, after all.

1