Submitted by fgp121 t3_yhjpo2 in MachineLearning
I'm building an application that runs AI model inference on GPU servers. Based on the demand profile (number of requests and GPU usage), I want to autoscale the GPU servers up/down.
I don't want to use Kubernetes for orchestration/autoscaling, as it is overkill for my application, which is pretty experimental right now.
Also, I don't need the full MLOps lifecycle management, since I'm using an open-source model that doesn't need frequent updates.
All I'm looking for right now are suggestions on how to implement a simple approach for scaling GPU servers based on incoming demand (e.g., requests per minute or GPU utilization).
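To make it concrete, the kind of loop I have in mind is something like the sketch below. It assumes the GPU workers sit behind an AWS Auto Scaling group (the group name, thresholds, and instance limits are made up), reads GPU utilization from nvidia-smi on one host purely for illustration, and nudges the desired capacity up or down. In practice the metric would be aggregated across workers or come from request counts instead.

```python
# Minimal polling autoscaler sketch (assumptions: an Auto Scaling group named
# "gpu-inference", boto3 credentials available, nvidia-smi on the local host).
import subprocess
import time

import boto3

ASG_NAME = "gpu-inference"      # hypothetical Auto Scaling group name
SCALE_UP_THRESHOLD = 80         # % GPU utilization that triggers scale-up
SCALE_DOWN_THRESHOLD = 20       # % GPU utilization that triggers scale-down
MIN_INSTANCES, MAX_INSTANCES = 1, 8
POLL_SECONDS = 60

asg = boto3.client("autoscaling")


def gpu_utilization() -> float:
    """Average GPU utilization (%) reported by nvidia-smi on this host."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    values = [float(line) for line in out.splitlines() if line.strip()]
    return sum(values) / len(values) if values else 0.0


def current_capacity() -> int:
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    return group["DesiredCapacity"]


def set_capacity(count: int) -> None:
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME,
                             DesiredCapacity=count)


while True:
    util = gpu_utilization()
    capacity = current_capacity()
    if util > SCALE_UP_THRESHOLD and capacity < MAX_INSTANCES:
        set_capacity(capacity + 1)
    elif util < SCALE_DOWN_THRESHOLD and capacity > MIN_INSTANCES:
        set_capacity(capacity - 1)
    time.sleep(POLL_SECONDS)
```

Is something this simple reasonable, or are there better patterns for this?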
alibrarydweller t1_iuedvyb wrote
You might look at Nomad -- it manages containers like K8s, but it's significantly simpler. We run GPU jobs on it, although we don't currently autoscale.
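If you do end up wanting autoscaling, there's the Nomad Autoscaler project, but even a small script driving `nomad job scale` from whatever demand metric you already have gets you a long way. Rough sketch, assuming a job named "inference" with a task group "gpu-worker" (both hypothetical) and the nomad CLI on PATH:

```python
# Demand-driven scaling glue for Nomad (job/group names and the per-worker
# capacity figure are placeholders, not from any real deployment).
import math
import subprocess
import time

JOB, GROUP = "inference", "gpu-worker"   # hypothetical Nomad job / task group
REQS_PER_WORKER_PER_MIN = 120            # rough capacity of one GPU worker
MIN_WORKERS, MAX_WORKERS = 1, 8


def get_requests_per_min() -> float:
    # Stand-in: pull this from your load balancer or application metrics.
    return 0.0


def scale_to(count: int) -> None:
    # `nomad job scale <job> <group> <count>` sets the group's instance count.
    subprocess.run(["nomad", "job", "scale", JOB, GROUP, str(count)],
                   check=True)


while True:
    demand = get_requests_per_min()
    target = math.ceil(demand / REQS_PER_WORKER_PER_MIN)
    scale_to(max(MIN_WORKERS, min(MAX_WORKERS, target)))
    time.sleep(60)
```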