Submitted by big_dog_2k t3_yg1mpz in MachineLearning
TL;DR I am trying to work out the ‘best’ options for speeding up model inference and model serving.
Specifically, I am looking to host a number of PyTorch models and want:
1) the fastest inference speed,
2) an easy to use and deploy model serving framework that is also fast.
For 1), what is the easiest way to speed up inference (assume only PyTorch, primarily on GPU but also some CPU)? I have been using ONNX and TorchScript, but there is a bit of a learning curve and it can be tricky to get the model to actually work. Is there anything else worth trying? I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models, which also looks interesting. Are things like SageMaker Neo or NeuralMagic worth trying? My only reservation with some of these is that they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into these unless I know others have had some success first.
For 2), I am aware of a few options. Triton Inference Server is an obvious one, as is the ‘transformer-deploy’ version from LDS. My only reservation here is that they require model compilation or are architecture specific. I am aware of others like BentoML, Ray Serve and TorchServe. Ideally I would have something that allows any PyTorch model to be used without the extra compilation effort (or at least makes it optional) and that has some conveniences: easy to use, easy to deploy, easy to host multiple models, and able to perform some dynamic batching. Anyway, I am really interested to hear people's experience here as I know there are now quite a few options! Any help is appreciated!
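To make the "dynamic batching" requirement concrete, here is a minimal sketch of the idea using only the Python standard library. It is not how Triton, BentoML, Ray Serve or TorchServe implement it; `DynamicBatcher`, `predict_batch`, `max_batch_size`, `max_wait_s` and `fake_model` are all hypothetical names chosen for illustration, and `fake_model` stands in for a real batched PyTorch forward pass.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Groups single requests into batches and calls the model once per batch."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        # predict_batch is a hypothetical callable: list of inputs -> list of outputs
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Enqueue one input and return a Future that will hold its result."""
        future = Future()
        self._requests.put((item, future))
        return future

    def _loop(self):
        while True:
            # Block until at least one request arrives.
            batch = [self._requests.get()]
            deadline = time.monotonic() + self.max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            items = [item for item, _ in batch]
            futures = [future for _, future in batch]
            try:
                outputs = self.predict_batch(items)
                for future, output in zip(futures, outputs):
                    future.set_result(output)
            except Exception as exc:
                # Propagate a model failure to every caller in the batch.
                for future in futures:
                    future.set_exception(exc)


batch_sizes = []  # records how many requests each model call received


def fake_model(inputs):
    # Stand-in for a batched model call; doubles each input.
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
futures = [batcher.submit(i) for i in range(10)]
results = [f.result(timeout=5) for f in futures]
print(results)      # each input doubled, in submission order
print(batch_sizes)  # sizes of the batches the worker actually formed
```

The two knobs are the usual trade-off: a larger `max_batch_size` improves throughput on GPU, while a larger `max_wait_s` adds latency to requests that arrive when traffic is light.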
Disclaimer - I have no affiliation or connection with any of the libraries or companies listed here. These are just the ones I know of.
Thanks in advance.
yubozhao t1_iu6huvq wrote
Hello. I am the founder of BentoML. We are working on integration with Triton and other high-performance serving runtimes.