
VirtualHat t1_izfu724 wrote

I use three scripts:

- `train.py` (trains my model)
- `worker.py` (picks up the next job and runs it using `train.py`)
- `runner.py` (basically a list of jobs, plus code to display what's happening)

I then have multiple machines running multiple instances of worker.py. When a new job is created, the workers see it and start processing it. Work is broken into 5-epoch blocks, and at the end of each block, a new job from the priority queue is selected.
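Roughly, each worker's loop looks like the sketch below. This is illustrative rather than my actual code: the JSON-files-in-a-shared-folder queue, the job fields, and the `run_block` helper are all stand-ins.

```python
import json
import time
from pathlib import Path

JOBS_DIR = Path("jobs")   # shared directory every worker can see (illustrative)
BLOCK_EPOCHS = 5          # work is chunked into 5-epoch blocks

def next_job():
    """Return the highest-priority job with epochs remaining, or None."""
    jobs = [json.loads(p.read_text()) for p in JOBS_DIR.glob("*.json")]
    pending = [j for j in jobs if j["epoch"] < j["total_epochs"]]
    return max(pending, key=lambda j: j["priority"], default=None)

def run_block(job):
    """Stand-in for invoking train.py for BLOCK_EPOCHS epochs,
    resuming from the job's latest checkpoint."""
    ...

while True:
    job = next_job()
    if job is None:
        time.sleep(60)    # nothing queued; poll again in a minute
        continue
    run_block(job)                      # train for one 5-epoch block
    job["epoch"] += BLOCK_EPOCHS        # record progress back to the queue
    (JOBS_DIR / f"{job['id']}.json").write_text(json.dumps(job))
```

In practice you'd also want an atomic claim step (a lock file, a rename, or a database transaction) so two workers don't pick up the same job.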

This way I can simply add a new job and, within 30 minutes or so, one of the workers will finish its current block and pick it up. Also, because of the chunking, I get early results on all the jobs rather than having to wait for each one to finish. This matters because I can often tell early on whether a run is worth continuing.

I evaluate the results in a Jupyter notebook using the logs that each job creates.
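The evaluation step can be as simple as pulling every job's log into a single DataFrame. A minimal sketch, assuming CSV logs at `jobs/<id>/log.csv` with `epoch` and `val_loss` columns (all made-up names):

```python
import pandas as pd
from pathlib import Path

# Collect each job's training log into one DataFrame (CSV format assumed).
frames = []
for log_path in Path("jobs").glob("*/log.csv"):
    df = pd.read_csv(log_path)
    df["job"] = log_path.parent.name   # tag rows with the job id
    frames.append(df)
results = pd.concat(frames, ignore_index=True)

# Example: compare validation loss across jobs as training progresses.
results.pivot_table(index="epoch", columns="job", values="val_loss").plot()
```

Keeping evaluation in a notebook, separate from training, means you can re-slice and re-plot results without re-running anything.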

edit: fixed links.


moyle t1_izgsce9 wrote

Guild.ai can easily automate this process. I really recommend checking it out.


RSchaeffer t1_izgxqod wrote

These links don't work for me. Can you double check them?


thundergolfer t1_izgyu6x wrote

They're not actually links, they've just been formatted like they are. They point to train.py, which isn't a website.


VirtualHat t1_izjmbm0 wrote

Oh my bad, I didn't realise Reddit automatically creates links when you write something like abc.xyz. I've edited the reply to include links to my code.
