nateharada t1_jeh5bir wrote

I personally feel we need large-scale collaboration, not each lab getting a small increase. Something like the James Webb Space Telescope or CERN. If they make a large cluster that's just time-shared between labs, that's not as useful IMO as letting many universities collaborate on a truly public LLM that competes with the biggest private AI organizations.

5

nateharada OP t1_j4otocf wrote

This tool doesn't look at memory right now, just actual computation. Loading your model usually eats close to the maximum memory until training is done, even when compute usage is very low, so memory isn't a useful signal here.

If your training is hanging but still burning GPU cycles, that would be harder to detect, I think.
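
To make the distinction concrete, here's roughly how compute utilization vs. memory usage can be read with pynvml; this is just an illustration of the two metrics, not necessarily how the tool queries them internally:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Compute utilization: percent of recent time the GPU spent running kernels.
util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

# Memory: stays near its peak once the model is loaded, even if compute is idle.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"compute: {util}%  memory: {mem.used / mem.total:.0%}")

pynvml.nvmlShutdown()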

4

nateharada OP t1_j4ne979 wrote

Nice! Right now you can use the end_process trigger, which just exits the process with return code 0 when the trigger fires, but it should be fairly straightforward to expose the API a little more. That would let you do something like this in your script:

import time

from gpu_sentinel import Sentinel, get_gpu_usage

sentinel = Sentinel(
    arm_duration=10,      # seconds of sustained usage before the sentinel arms
    arm_threshold=0.7,    # usage fraction considered "active"
    kill_duration=60,     # seconds of low usage before kill_fn is called
    kill_threshold=0.7,   # usage fraction considered "idle"
    kill_fn=my_callback_fn,   # user-defined callback, see below
)

# Poll GPU usage once per second and feed it to the sentinel.
while True:
    gpu_usage = get_gpu_usage(device_ids=[0, 1, 2, 3])
    sentinel.tick(gpu_usage)
    time.sleep(1)

Is that something that would be useful? You can define the callback function yourself, so you could trigger an alert, send a notification, etc.
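
For instance, here's a rough sketch of a callback that posts an alert instead of killing the process (the webhook URL and message are placeholders, not part of gpu_sentinel):

import json
import urllib.request

def my_callback_fn():
    # Placeholder alert: POST a message to an incoming webhook
    # (e.g. Slack); swap in whatever alerting you already use.
    payload = {"text": "GPU usage dropped below the kill threshold."}
    req = urllib.request.Request(
        "https://hooks.example.com/alert",  # placeholder URL
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)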

5