Submitted by johnhopiler t3_11a8tru in MachineLearning

Let's assume for a minute one has:

  • the necessary compute instances
  • enough $ to cough up to rent those instances somewhere

What are the latest "easy" solutions to get OPT, BLOOMZ, and flan-t5 hosted as API endpoints?

I spent about two weeks trying to get seldon-core and MLServer to work with its Hugging Face wrapper, but I've lost hope at this point. There are so many parameters and tweaks to be mindful of, and I feel like I'm acting as a very crude operating-system replacement when I pass a device_map to a Python function to tell it how much RAM to use for which instance. In what world could Windows 95 manage four DIMMs of DDR RAM, yet in 2023 we can't auto-assign model data to the right GPUs?
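
For context, the kind of device_map juggling I mean looks roughly like this (a minimal sketch using the transformers + accelerate integration; the model name and memory caps are placeholder values, not a recommendation):

    # Rough sketch of manual device placement with transformers + accelerate.
    # Model name and max_memory values are illustrative placeholders.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "google/flan-t5-xl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # "auto" lets accelerate spread layers across the visible GPUs;
    # max_memory caps how much each device is allowed to hold.
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_name,
        device_map="auto",
        max_memory={0: "20GiB", 1: "20GiB", "cpu": "60GiB"},
    )

    inputs = tokenizer("Translate to German: Hello world", return_tensors="pt").to(0)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))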

So. What's the "right way" to do this? I am aware of

Any pointers would be appreciated. We have a goal to get 2-3 models up and running as API endpoints within two weeks, and I have a lot of people waiting for me to get this done...


Edit:

I am talking about self-hosted solutions where the inference input & output are "under your control".


Edit:

What about K8S + a Ray cluster + alpa.ai? It feels like the most industrialised version of all the things I've seen so far, after reading up on Ray (which feels like a Spark cluster for ML).
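
If we go that route, the Ray Serve piece would look roughly like this (a sketch only; the model name, replica count and GPU count are placeholders, and Alpa would presumably take over the multi-GPU sharding that plain transformers does here):

    # Minimal Ray Serve sketch for exposing a seq2seq model as an HTTP endpoint.
    # Model name, replica count and GPU count are illustrative placeholders.
    from ray import serve
    from starlette.requests import Request
    from transformers import pipeline

    @serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
    class FlanT5Deployment:
        def __init__(self):
            # device=0 puts the pipeline on the GPU assigned to this replica.
            self.pipe = pipeline("text2text-generation", model="google/flan-t5-xl", device=0)

        async def __call__(self, request: Request) -> dict:
            payload = await request.json()
            result = self.pipe(payload["prompt"], max_new_tokens=64)
            return {"generated_text": result[0]["generated_text"]}

    # serve.run() makes the deployment reachable over HTTP, e.g.:
    #   curl -X POST http://localhost:8000/ -d '{"prompt": "Translate to German: Hello"}'
    serve.run(FlanT5Deployment.bind())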

10

Comments


Desticheq t1_j9qo0mu wrote

Hugging Face actually allows a fairly easy deployment process for models trained with their framework
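
Once an endpoint is up, calling it is basically one POST request, something like this (the URL and token are placeholders for your own endpoint):

    # Rough sketch of querying a deployed Hugging Face Inference Endpoint.
    # ENDPOINT_URL and HF_TOKEN are placeholders for your own endpoint and token.
    import requests

    ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
    HF_TOKEN = "hf_xxx"

    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": "Translate to German: Hello world"},
    )
    print(response.json())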

8

theLastNenUser t1_j9uwhcd wrote

You will have to message them if you want to use the larger GPU boxes, and the autoscaling isn’t great for larger models. The customizability of the “handler.py” file is nice though
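
For reference, that handler.py file is roughly this shape (a sketch; the pipeline task and parameter handling are just examples, not a recipe):

    # Sketch of a custom handler.py for a Hugging Face Inference Endpoint.
    # The task and pre/post-processing here are illustrative placeholders.
    from typing import Any, Dict, List
    from transformers import pipeline

    class EndpointHandler:
        def __init__(self, path: str = ""):
            # `path` points at the model repository checked out on the endpoint.
            self.pipe = pipeline("text2text-generation", model=path, device=0)

        def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
            inputs = data["inputs"]
            params = data.get("parameters", {})
            return self.pipe(inputs, **params)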

2

Desticheq t1_j9xiv9l wrote

Well, in terms of "out-of-the-box," I'm not sure what else could be better. AWS, Azure or Google provide empty units basically, and you'd have to configure all the "Ops" stuff like network, security, load balancing, etc. That's not that difficult if you do it once in a while, but for a "test-it-and-forget-it" project it might be too difficult.

2

CKtalon t1_j9r2k9j wrote

Probably FasterTransformer with Triton Inference Server
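
Once the model is converted to FasterTransformer format and Triton is serving it, the client side is roughly this (model name, tensor names, dtypes and shapes are placeholders that depend on your conversion config):

    # Rough sketch of calling a model behind Triton Inference Server over HTTP.
    # Model name, tensor names, dtypes and shapes are placeholders; they depend
    # on the FasterTransformer conversion/config actually deployed.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)
    infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
    infer_input.set_data_from_numpy(input_ids)

    result = client.infer(model_name="flan_t5_ft", inputs=[infer_input])
    print(result.as_numpy("output_ids"))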

3

whata_wonderful_day t1_ja3kh4d wrote

Yeah, this is what the big bois use. It'll give you max performance, but it isn't exactly user-friendly.

1

memberjan6 t1_j9rsdvk wrote

Cohere, deepset, ....

1