Submitted by fgp121 t3_yhjpo2 in MachineLearning
I'm building an application that runs AI model inference on GPU servers. Based on the demand profile (number of requests and GPU usage), I want to autoscale the GPU servers up/down.
I don't want to use Kubernetes for orchestration/autoscaling, as it is overkill for my application, which is pretty experimental right now.
Also, I don't need the full MLOps lifecycle management, since I'm using an open-source model that doesn't need frequent updates.
All I'm looking for right now are suggestions on how to implement a simple approach for scaling GPU servers based on incoming demand (e.g., requests per minute or GPU utilization).
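To make it concrete, the kind of loop I have in mind is something like the sketch below. It assumes the GPU workers sit behind an AWS Auto Scaling group (the group name, thresholds, and instance limits are made up), reads GPU utilization from nvidia-smi on one host purely for illustration, and nudges the desired capacity up or down. In practice the metric would be aggregated across workers or come from request counts instead.

```python
# Minimal polling autoscaler sketch (assumptions: an Auto Scaling group named
# "gpu-inference", boto3 credentials available, nvidia-smi on the local host).
import subprocess
import time

import boto3

ASG_NAME = "gpu-inference"      # hypothetical Auto Scaling group name
SCALE_UP_THRESHOLD = 80         # % GPU utilization that triggers scale-up
SCALE_DOWN_THRESHOLD = 20       # % GPU utilization that triggers scale-down
MIN_INSTANCES, MAX_INSTANCES = 1, 8
POLL_SECONDS = 60

asg = boto3.client("autoscaling")


def gpu_utilization() -> float:
    """Average GPU utilization (%) reported by nvidia-smi on this host."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    values = [float(line) for line in out.splitlines() if line.strip()]
    return sum(values) / len(values) if values else 0.0


def current_capacity() -> int:
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    return group["DesiredCapacity"]


def set_capacity(count: int) -> None:
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME,
                             DesiredCapacity=count)


while True:
    util = gpu_utilization()
    capacity = current_capacity()
    if util > SCALE_UP_THRESHOLD and capacity < MAX_INSTANCES:
        set_capacity(capacity + 1)
    elif util < SCALE_DOWN_THRESHOLD and capacity > MIN_INSTANCES:
        set_capacity(capacity - 1)
    time.sleep(POLL_SECONDS)
```

Is something this simple reasonable, or are there better patterns for this?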
alibrarydweller t1_iuedvyb wrote
You might look at Nomad -- it manages containers like K8s, but it's significantly simpler. We run GPU jobs on it, although we don't currently autoscale.
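If you do end up wanting autoscaling, there's the Nomad Autoscaler project, but even a small script driving `nomad job scale` from whatever demand metric you already have gets you a long way. Rough sketch, assuming a job named "inference" with a task group "gpu-worker" (both hypothetical) and the nomad CLI on PATH:

```python
# Demand-driven scaling glue for Nomad (job/group names and the per-worker
# capacity figure are placeholders, not from any real deployment).
import math
import subprocess
import time

JOB, GROUP = "inference", "gpu-worker"   # hypothetical Nomad job / task group
REQS_PER_WORKER_PER_MIN = 120            # rough capacity of one GPU worker
MIN_WORKERS, MAX_WORKERS = 1, 8


def get_requests_per_min() -> float:
    # Stand-in: pull this from your load balancer or application metrics.
    return 0.0


def scale_to(count: int) -> None:
    # `nomad job scale <job> <group> <count>` sets the group's instance count.
    subprocess.run(["nomad", "job", "scale", JOB, GROUP, str(count)],
                   check=True)


while True:
    demand = get_requests_per_min()
    target = math.ceil(demand / REQS_PER_WORKER_PER_MIN)
    scale_to(max(MIN_WORKERS, min(MAX_WORKERS, target)))
    time.sleep(60)
```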