Submitted by nateharada t3_10do40p in MachineLearning
Hey /r/machinelearning,
Long-time reader, first time posting non-anonymously. I've been training models using various cloud services, but as an individual user it's stressful to worry about shutting down instances if training fails or stops. Crashes, bad code, etc. can cause GPU utilization to drop without the program successfully "finishing", and this idle time can cost a lot of money if you don't catch it quickly.
Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance and performs an action if it drops too low for too long. For example, it can shut down the instance if GPU usage stays under 30% for 5 minutes.
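For a rough idea of the "too low for too long" logic, here's a minimal sketch. This is not the actual gpu_sentinel code: the names `IdleWatchdog`, `read_gpu_utilization`, and `watch` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and that taking the busiest GPU's reading is a sensible way to handle multi-GPU machines.

```python
import subprocess
import time
from typing import Callable, Optional


class IdleWatchdog:
    """Tracks how long utilization has stayed below a threshold (hypothetical helper)."""

    def __init__(self, threshold_pct: float = 30.0, max_idle_s: float = 300.0):
        self.threshold_pct = threshold_pct
        self.max_idle_s = max_idle_s
        self._idle_since: Optional[float] = None  # start of the current idle stretch

    def update(self, utilization_pct: float, now: float) -> bool:
        """Record one reading; return True once the idle stretch reaches max_idle_s."""
        if utilization_pct >= self.threshold_pct:
            self._idle_since = None  # GPU is busy again, so reset the timer
            return False
        if self._idle_since is None:
            self._idle_since = now
        return now - self._idle_since >= self.max_idle_s


def read_gpu_utilization() -> float:
    """Highest utilization (percent) across all GPUs, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return max(float(line) for line in out.splitlines() if line.strip())


def watch(action: Callable[[], None], poll_s: float = 10.0,
          watchdog: Optional[IdleWatchdog] = None) -> None:
    """Poll the GPU until it has been idle for too long, then run the action once."""
    watchdog = watchdog or IdleWatchdog()
    while not watchdog.update(read_gpu_utilization(), time.monotonic()):
        time.sleep(poll_s)
    action()
```

With the defaults above, something like `watch(lambda: subprocess.run(["sudo", "shutdown", "-h", "now"]))` would mirror the 30% for 5 minutes example, assuming the instance lets you shut down that way. Keeping the timing logic in its own class means you can test it with fake readings and no GPU.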
The tool is easy to install and use: just pip install gpu_sentinel
If this is useful, please leave comments here or on the GitHub page: https://github.com/moonshinelabs-ai/gpu_sentinel
I'm hoping it helps save some other folks money!
scaredoftheinternet t1_j4mkrqk wrote
Wow this is actually really cool, thanks for sharing.