BestSentence4868 t1_j1bjak3 wrote
Run int8 instead of fp16. GG, RIP.
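For what that looks like in practice, here's a minimal sketch of dropping to int8 with PyTorch's dynamic quantization; the toy model and layer choice are placeholders, not whatever was actually running in the thread.

```python
import torch
import torch.nn as nn

# Illustrative model; stand-in for whatever was too big in fp16.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly, cutting weight memory roughly 2x vs fp16
# with basically no code changes.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)
```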
BestSentence4868 t1_iu8h0kj wrote
Reply to comment by big_dog_2k in [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
Feel free to DM me with any further questions.
BestSentence4868 t1_iu8gi68 wrote
Reply to comment by big_dog_2k in [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
Yep! Fire up Triton (I used their Docker container), install PyTorch via pip or just add it to the Dockerfile, and you're off to the races! I actually just deployed Triton + PyTorch + Flask for a web app this week :)
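Roughly, the Flask side just forwards requests to Triton over HTTP with `tritonclient`. A minimal sketch, where the model name, tensor names, and shapes are placeholders you'd match to your own `config.pbtxt`:

```python
import numpy as np
import tritonclient.http as httpclient
from flask import Flask, jsonify, request

app = Flask(__name__)
# Triton's default HTTP port when launched from the NGC container.
triton = httpclient.InferenceServerClient(url="localhost:8000")

@app.route("/predict", methods=["POST"])
def predict():
    # Placeholder tensor names/dtypes; match them to your model config.
    data = np.array(request.json["inputs"], dtype=np.float32)
    inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = httpclient.InferRequestedOutput("OUTPUT__0")
    result = triton.infer(
        model_name="my_torchscript_model", inputs=[inp], outputs=[out]
    )
    return jsonify({"outputs": result.as_numpy("OUTPUT__0").tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```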
BestSentence4868 t1_iu8fluq wrote
Reply to [D] How to get the fastest PyTorch inference and what is the "best" model serving framework? by big_dog_2k
Love, love, love Triton Inference Server for the framework flexibility. It's super mature, with so much stuff I never thought I'd need, like model warmup. The ORT and TensorRT backends are cool, but if all else fails, the Python backend is awesome.
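To show what the Python backend looks like, here's a minimal `model.py` sketch in the shape Triton expects; the tensor names and the toy computation are placeholders, and a real model would load its weights in `initialize`.

```python
# model.py for Triton's Python backend; lives under
# models/<model_name>/1/model.py next to a config.pbtxt.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load weights, tokenizers, etc. here; runs once per model instance.
        pass

    def execute(self, requests):
        responses = []
        for req in requests:
            # "INPUT0" / "OUTPUT0" are placeholder names from config.pbtxt.
            x = pb_utils.get_input_tensor_by_name(req, "INPUT0").as_numpy()
            y = x.astype(np.float32) * 2.0  # toy computation
            out = pb_utils.Tensor("OUTPUT0", y)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Optional cleanup hook.
        pass
```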
BestSentence4868 t1_j1cackq wrote
Reply to comment by _underlines_ in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
Is DeepSpeed-MII publicly available, or only on Azure?