Submitted by nateharada t3_10do40p in MachineLearning

Hey /r/machinelearning,

Long-time reader, first time posting non-anonymously. I've been training models on various cloud services, but as an individual user it's stressful to worry about shutting down instances if training fails or stops. Crashes, bad code, etc. can cause GPU utilization to drop without the program successfully "finishing", and that idle time can cost a lot of money if you don't catch it quickly.

Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance and performs an action if it drops too low for too long. For example, shut down the instance if GPU usage drops under 30% for 5 minutes.

It's easy to install and use: just pip install gpu_sentinel
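
As a rough sketch, here is what the 30% / 5 minute example could look like in code, using the Sentinel / get_gpu_usage interface I sketch in a comment below; the shutdown callback is just an illustration and assumes a Linux box where sudo shutdown works:

import subprocess
import time

from gpu_sentinel import Sentinel, get_gpu_usage

def shutdown_instance():
    # Illustrative kill action: power the machine off so it stops billing.
    subprocess.run(["sudo", "shutdown", "now"])

sentinel = Sentinel(
    arm_duration=60,        # only arm once the GPU has actually been busy (durations assumed to be seconds)
    arm_threshold=0.3,
    kill_duration=300,      # 5 minutes below the threshold...
    kill_threshold=0.3,     # ...at under 30% utilization
    kill_fn=shutdown_instance,
)

while True:
    sentinel.tick(get_gpu_usage(device_ids=[0]))
    time.sleep(1)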

If this is useful, please leave comments here or on the GitHub page: https://github.com/moonshinelabs-ai/gpu_sentinel

I'm hoping it helps save some other folks money!

86

Comments

Zealousideal_Low1287 t1_j4n2ahm wrote

Looks nice. I probably wouldn’t use it for shutting down or anything, but a notification on failure might be useful!

5

nateharada OP t1_j4ne979 wrote

Nice! Right now you can use the end_process trigger, which just returns 0 from the process when it fires, but it should be fairly straightforward to externalize the API a bit more. That would let you do something like this in your script:

import time

from gpu_sentinel import Sentinel, get_gpu_usage

# kill_fn is your own callback; it's invoked once the kill condition is met.
sentinel = Sentinel(
    arm_duration=10,
    arm_threshold=0.7,
    kill_duration=60,
    kill_threshold=0.7,
    kill_fn=my_callback_fn,
)

while True:
    # Poll utilization across the GPUs you care about and feed it to the sentinel.
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)

Is that something that would be useful? You can define the callback function yourself, so you could trigger an alert, etc.
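
For instance (a hypothetical sketch: the webhook URL and message are made up, and I'm assuming the callback takes no arguments), kill_fn could post to a chat webhook instead of shutting anything down:

import json
import urllib.request

# Hypothetical webhook endpoint -- swap in your own Slack/Discord/etc. URL.
WEBHOOK_URL = "https://hooks.example.com/gpu-alerts"

def my_callback_fn():
    # Called by the sentinel when GPU usage has been too low for too long.
    payload = {"text": "GPU utilization has been low for a while -- check your training job."}
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)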

5

MuonManLaserJab t1_j4o2mlx wrote

I have a little script called gpu_Speed that blows up my laptop if it drops below 50 mph % GPU utilization

41

Fit_Schedule5951 t1_j4obl4w wrote

Nice, I think an extension where this could be beneficial is when your process hangs: it's using full GPU memory but not training. This happened to me recently while training models with fairseq. (I'm not sure how you can catch these conditions.)

2

nateharada OP t1_j4otocf wrote

This tool actually doesn't look at memory right now, just actual compute. Usually loading your model eats up close to the maximum memory until training is done, even if compute usage is very low.

If your training is hanging but still burning GPU cycles, that'd be harder to detect, I think.
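
(For reference, if you want to eyeball both numbers yourself, a quick snippet like this works; it reads NVML directly via pynvml rather than going through gpu_sentinel:)

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Percent of time one or more kernels were executing over the last sample period.
util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
# Bytes currently allocated on the device, whether or not they're being used.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"compute utilization: {util}%")
print(f"memory used: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")

pynvml.nvmlShutdown()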

4

Kinwwizl t1_j4slam2 wrote

That's one of the reasons GCP is nice for ML training workloads: you can kill the VM once training is finished by calling poweroff at the end of the training bash script.
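
The same idea in Python, if you'd rather keep it in the launcher (train.py here is just a stand-in for whatever your entry point is):

import subprocess

# Run the training job, then power the VM off so it stops billing --
# even if training crashed. Equivalent to `python train.py; sudo poweroff` in bash.
subprocess.run(["python", "train.py"])
subprocess.run(["sudo", "poweroff"])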

1

MrAcurite t1_j4t9ch1 wrote

At work, we've got this thing that will notify you if a cloud instance has been running for 24 hours. However, it does this by messaging your work email; you can't configure it to go to a personal device or anything. Meaning, if you set a job to run at the end of the week, you can come back on Monday to over a thousand dollars in cloud charges and like fifty angry emails about it.

1