Submitted by big_dog_2k t3_yg1mpz in MachineLearning

TL;DR I am trying to work out the ‘best’ options for speeding up model inference and model serving.

Specifically, I am looking to host a number of PyTorch models and want -

  1. the fastest inference speed,
  2. an easy to use and deploy model serving framework that is also fast.

For 1), what is the easiest way to speed up inference (assume only PyTorch and primarily GPU, but also some CPU)? I have been using ONNX and TorchScript, but there is a bit of a learning curve and it can sometimes be tricky to get the model to actually work. Is there anything else worth trying? I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models, which also looks interesting. Are things like SageMaker Neo or NeuralMagic worth trying? My only reservation with some of these is that they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into these unless I know others have had some success first.

For 2), I am aware of a few options. Triton Inference Server is an obvious one, as is the ‘transformer-deploy’ version from LDS. My only reservation here is that they require model compilation or are architecture specific. I am aware of others like BentoML, Ray Serve and TorchServe. Ideally I would have something that allows any PyTorch model to be used without the extra compilation effort (or at least makes it optional) and has some conveniences: ease of use, easy deployment, easy hosting of multiple models, and some dynamic batching. Anyway, I am really interested to hear people's experience here as I know there are now quite a few options! Any help is appreciated!
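
For anyone unfamiliar with the dynamic batching mentioned above, here is a minimal sketch of the idea in plain Python. It is not how Triton, BentoML or TorchServe implement it; `DynamicBatcher`, `fake_model`, `max_batch_size` and `max_wait_s` are hypothetical names chosen for illustration. Requests are queued individually, and a worker thread groups them until the batch is full or a short wait budget runs out, then makes one batched model call.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Collects single requests into batches before calling the model.

    `model_fn` is a hypothetical stand-in for a batched forward pass: it takes
    a list of inputs and returns a list of outputs of the same length.
    """

    def __init__(self, model_fn, max_batch_size=8, max_wait_s=0.01):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Queue one input; the returned Future resolves to its output."""
        future = Future()
        self._requests.put((item, future))
        return future

    def _loop(self):
        while True:
            # Block until at least one request arrives.
            batch = [self._requests.get()]
            deadline = time.monotonic() + self.max_wait_s
            # Keep filling the batch until it is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            inputs = [item for item, _ in batch]
            try:
                outputs = self.model_fn(inputs)
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)
            except Exception as exc:
                # Propagate a failed model call to every caller in the batch.
                for _, future in batch:
                    future.set_exception(exc)


batch_sizes = []  # records how many requests each model call served


def fake_model(inputs):
    """Toy batched 'model' (an assumption for the demo): doubles every input."""
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
futures = [batcher.submit(i) for i in range(10)]
results = [f.result(timeout=5) for f in futures]
```

The trade-off is the usual one: a longer `max_wait_s` gives fuller batches and better throughput, at the cost of added latency for the first request in each batch.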

Disclaimer - I have no affiliation or connection of any kind with the libraries or companies listed here. These are just the ones I know of.

Thanks in advance.

55

Comments


yubozhao t1_iu6huvq wrote

Hello. I am the founder of BentoML. We are working on integration with Triton and other high-performance serving runtimes.

−5

sobagood t1_iu6ucfo wrote

If you intend to run on CPU and other Intel hardware, OpenVINO is a great choice. They optimised it for their hardware, and it is indeed faster than the alternatives there.

9

big_dog_2k OP t1_iu6yjf3 wrote

Hi! Can you give the elevator pitch for Bento? When should I use it, and which part of my model serving problems will it solve? If you integrate with another serving solution, how much more complexity is that going to add, and how are you thinking about deployment?

4

sobagood t1_iu6zuhk wrote

If you mean NVIDIA GPUs, it has a CUDA plugin to run on them, but I have never tried it. It has several other plugins, so you could check those out. It also provides its own deployment server. NVIDIA Triton also supports the OpenVINO runtime, without GPU support for obvious reasons. Like ONNX, it has a conversion step that transforms the graph into its own intermediate representation with the ‘model optimizer’, and that step can go wrong. If you can successfully create this representation, there should be no new bottleneck.

1

big_dog_2k OP t1_iu7paw3 wrote

Thanks. I might need to take a closer look. I was also thinking of AMD and Arm-based CPUs. I was surprised at how good CPU-based inference can be for some models these days.

1

poems_4_you t1_iu7sovr wrote

I use triton and have been very pleased thus far

3

jukujala t1_iu7x05z wrote

Has anyone tried to transform ONNX to TF SavedModel and use TF serving? TF at least in the past was good at inference.

2

robdupre t1_iu7z0uu wrote

We use ONNX models deployed using NVIDIA's TensorRT. We have been impressed with it so far.

3

ibmw t1_iu7zr9l wrote

In my previous company, we used the NVIDIA stack: Triton + ONNX + ONNX Runtime. It works well, but it took some engineering, because the models we used were not fully supported by ONNX, and we did some work to be able to swap out some components (like replacing the python/conda env with a more generic and faster solution).

In addition, we had some models that ran on CPU (without OpenVINO; we didn't have time to test that), and we used a k8s cluster to deploy and handle the scaling. It worked, but we still needed to improve the inference time to align with the use cases... I don't know if they have managed to tackle this part since my departure.

Finally, we ran some benchmarks (Triton, KServe, TorchServe, SageMaker), and with Triton (with engineering) we got the best result for throughput (our target, though I know we could have done the same for latency).
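
For readers who want to run a comparison like this themselves, here is a minimal sketch of a benchmark harness that reports both throughput and latency. It is not the setup described above; `benchmark`, `toy_predict` and the `warmup` parameter are hypothetical, and `predict_fn` stands in for one request to whichever server or runtime is being measured.

```python
import math
import statistics
import time


def benchmark(predict_fn, inputs, warmup=3):
    """Time `predict_fn` on each input and summarize latency and throughput."""
    # Warm-up calls are excluded so one-off costs (lazy init, caches)
    # do not skew the measurements.
    for item in inputs[:warmup]:
        predict_fn(item)

    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        predict_fn(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(latencies),
        "throughput_rps": len(latencies) / elapsed,
        "latency_p50_ms": statistics.median(latencies) * 1000,
        "latency_p95_ms": latencies[p95_index] * 1000,
    }


def toy_predict(x):
    """Placeholder workload (an assumption); replace with a real client call."""
    return sum(i * i for i in range(1000 + x))


report = benchmark(toy_predict, list(range(50)))
```

This version sends requests one at a time, so it mostly measures latency. To measure the throughput a server with dynamic batching can reach, the requests would need to be issued concurrently.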

2

whata_wonderful_day t1_iu81vzp wrote

I tried OpenVINO ~1.5 years back and it didn't match ONNXRuntime on transformers. For CNNs it's the fastest though. I also found OpenVINO to be pretty buggy and not user friendly. I needed to fix their internal transformer conversion script

4

big_dog_2k OP t1_iu86jfr wrote

Thanks! I have now seen quite a consistent theme from people that Triton is worth it. I might then bite the bullet and invest more time in getting onnx conversions right.

2

BestSentence4868 t1_iu8fluq wrote

Love love love Triton inference server for the framework flexibility. Super mature with so much stuff I never thought I'd need like model warmup etc. ORT and TensorRT are cool, but if all else fails Python backend is awesome.

6

BestSentence4868 t1_iu8gi68 wrote

Yep! Fire up Triton (I'd used their Docker container), install PyTorch via pip or just put it in the Dockerfile and you're off to the races! I actually did just deploy Triton + PyTorch + Flask for a web app this week :)

1

yubozhao t1_iu9qzm3 wrote

I guess others see this as spammy or as an ad? Honestly, I disclosed who I am and didn’t try to sell anything (from my POV). I guess that’s not welcome in this sub. shrug.jpg

Edit: typo

3

pommedeterresautee t1_iu9zg8x wrote

Hi, author of transformer-deploy and Kernl here. Whatever option you choose, something to keep in mind besides speed is being able to maintain output precision. I can tell you it’s our number one pain point, on both TensorRT and ONNX Runtime. We have even built some tooling to help with that; it helped, but it’s not yet perfect. Triton Inference Server is a really cool option with good documentation.

3

braintampon t1_iua3jkt wrote

Hahah i mean your answer is quite pertinent to OPs post and also i dont see how selling is wrong lmao

But being the founder of BentoML what is your answer to OPs question tho? Which is the fastest, most dev friendly model serving framework acc to you? Which model serving framework in your opinion is the biggest threat (competitor) to BentoML? Is there some benchmarking you guys have done that indicates some potential inferencing speed ups?

My organisation uses BentoML and I personally love what yall have done w it btw. Would be awesome to get your honest opinion on OPs question

TIA!

1

big_dog_2k OP t1_iua8imm wrote

Great! Exactly this, I just want someone to provide feedback. Do you see throughput improvements using Bento with dynamic batching vs without? Is the throughput good in general, or is the biggest benefit ease of use?

2

big_dog_2k OP t1_iua8wgp wrote

Thanks. I was aware of this and had some difficulty with it in the past. My evaluation criteria now compare precision loss across model outputs as well as the performance (accuracy or equivalent) measured on the full system. What methods have you found to mitigate this? I would love to know!
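
A minimal sketch of what such an output comparison can look like, assuming the reference and optimized model outputs have already been flattened into lists of floats. `compare_outputs` and the `atol`/`rtol` defaults are hypothetical and would need tuning per model.

```python
import math


def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-2):
    """Compare optimized-model outputs against the reference model's outputs.

    A value counts as a mismatch when it is outside both the absolute
    tolerance `atol` and the relative tolerance `rtol`.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    max_abs_diff = 0.0
    mismatches = 0
    for ref, cand in zip(reference, candidate):
        max_abs_diff = max(max_abs_diff, abs(ref - cand))
        if not math.isclose(ref, cand, rel_tol=rtol, abs_tol=atol):
            mismatches += 1
    return {
        "max_abs_diff": max_abs_diff,
        "mismatches": mismatches,
        "passed": mismatches == 0,
    }


# Made-up numbers for illustration: the last value drifts too far.
reference = [0.1, 2.5, -3.0]
candidate = [0.1004, 2.51, -3.2]
check = compare_outputs(reference, candidate)
```

A check like this only covers raw outputs; the system-level metric mentioned above (accuracy or equivalent) still has to be measured separately.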

1

pommedeterresautee t1_iuacchj wrote

To mitigate precision issues:

  • on ONNX related engines, we built a tool to check the output of each node and tag those that won't behave well in fp16 or bf16. Described here: https://www.reddit.com/r/MachineLearning/comments/uwkpmt/p_what_we_learned_by_making_t5large_2x_faster/
  • on Kernl, we "just" understand what happens as the code is simple (and we wrote it). We choose to not do terrible things to make the inference faster, basically no approx in our kernels, and accumulation is in fp32 (basically it's even better than vanilla mixed precision, and still much faster). IMO that's the most robust approach...
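
To illustrate why accumulating in fp32 matters, here is a small stdlib-only sketch. It is not Kernl code: it simulates half precision with the `struct` module's `"e"` format, and the helper names and the 4096-element example are assumptions made for the demo. Both versions use fp16 inputs; only the precision of the running sum differs.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_accum(a, b):
    """Dot product where products and the running sum are rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


def dot_fp32_accum(a, b):
    """Dot product with fp16 inputs but products and running sum kept in fp32."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp32(to_fp16(x) * to_fp16(y))
        acc = to_fp32(acc + product)
    return acc


n = 4096
a = [0.1] * n
b = [0.1] * n

# Reference: the same fp16 inputs, accumulated in double precision.
exact = sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

err_fp16 = abs(dot_fp16_accum(a, b) - exact)
err_fp32 = abs(dot_fp32_accum(a, b) - exact)
```

With fp16 accumulation the running sum eventually stops growing, because each new product is smaller than half the spacing between adjacent fp16 values at that magnitude. The fp32 accumulator does not hit that wall at these sizes.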

1

big_dog_2k OP t1_iuaexff wrote

Thank you! I think I will try Kernl today as well. If I understand correctly, only Ampere generation cards are supported? Also, does it work on any Hugging Face model or are there still exceptions?

1

big_dog_2k OP t1_iuaw55q wrote

Great. I might try this out as I like the direction this is going, plus it seems like PyTorch is heading in a similar way. I'll let you know if I have questions or I will raise them on GitHub. I appreciate all the information!

2

programmerChilli t1_iufqn15 wrote

Well, you disclosed who you are, but that's pretty much all you did :P

The OP asked a number of questions, and you didn't really answer any of them. You didn't explain what BentoML can offer, you didn't explain how it can speed up inference, you didn't really even explain what BentoML is.

Folks will tolerate "advertising" if it comes in the form of interesting technical content. However, you basically just mentioned your company and provided no technical content, so it's just pure negative value from most people's perspective.

1

yubozhao t1_iufs7wp wrote

Fair enough. I will probably get to it. I don’t know about you. I need to “charge up” and make sure my answer is good. That takes time and it was the Halloween weeks after all.

1