nateharada OP t1_j4ne979 wrote

Nice! Right now you can use the end_process trigger to just return 0 from the process when the trigger is hit, but it should be fairly straightforward to expose a bit more of the API. That would let you do something like this in your script:

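import time

from gpu_sentinel import Sentinel, get_gpu_usage

# my_callback_fn is your own function, defined elsewhere in the script.
sentinel = Sentinel(
    arm_duration=10,
    arm_threshold=0.7,
    kill_duration=60,
    kill_threshold=0.7,
    kill_fn=my_callback_fn,
)
while True:
    # Sample usage on the listed devices and feed it to the sentinel once per second.
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)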

Is that something that would be useful? You can define the callback function yourself, so you could trigger an alert, for example.
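
To make the arm/kill idea concrete, here is a rough, self-contained sketch of how a tick-based watchdog with a callback could behave. It is not gpu_sentinel's actual implementation: the class name SimpleSentinel is hypothetical, and the semantics are assumptions (it arms after usage stays at or above arm_threshold for arm_duration consecutive ticks, then calls kill_fn once usage stays below kill_threshold for kill_duration consecutive ticks, with durations counted in ticks rather than seconds).

```python
from typing import Callable


class SimpleSentinel:
    """Toy arm/kill watchdog (hypothetical, not the gpu_sentinel implementation)."""

    def __init__(
        self,
        arm_duration: int,
        arm_threshold: float,
        kill_duration: int,
        kill_threshold: float,
        kill_fn: Callable[[], None],
    ) -> None:
        self.arm_duration = arm_duration
        self.arm_threshold = arm_threshold
        self.kill_duration = kill_duration
        self.kill_threshold = kill_threshold
        self.kill_fn = kill_fn
        self.armed = False
        self.fired = False
        self._streak = 0  # consecutive ticks meeting the current condition

    def tick(self, usage: float) -> None:
        if self.fired:
            return  # the callback runs at most once
        if not self.armed:
            # Assumed arming rule: sustained usage at or above the threshold.
            self._streak = self._streak + 1 if usage >= self.arm_threshold else 0
            if self._streak >= self.arm_duration:
                self.armed = True
                self._streak = 0
        else:
            # Assumed kill rule: sustained usage below the threshold.
            self._streak = self._streak + 1 if usage < self.kill_threshold else 0
            if self._streak >= self.kill_duration:
                self.fired = True
                self.kill_fn()


# Usage: the callback records an alert instead of killing anything.
alerts = []
sentinel = SimpleSentinel(
    arm_duration=2,
    arm_threshold=0.7,
    kill_duration=3,
    kill_threshold=0.7,
    kill_fn=lambda: alerts.append("gpu idle"),
)
for usage in [0.9, 0.9, 0.1, 0.1, 0.1]:
    sentinel.tick(usage)
```

Because the sentinel only fires after it has been armed, a job that never gets the GPU busy in the first place would not trigger the callback under these assumptions.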

5

Zealousideal_Low1287 t1_j4neybb wrote

Yeah, that’s something which would be useful indeed. Don’t worry yourself though, I can put in a PR.

5

nateharada OP t1_j4ngy65 wrote

It's actually almost entirely ready now; I just need to alter a few things and do some final tests. I'll go ahead and push it soon!

EDIT: The above code should work! See the README on GitHub for a complete example.

6