Submitted by I_will_delete_myself t3_115z9hc in MachineLearning
I really like training in the cloud; for some reason it feels satisfying. However, here are a couple of things I wish I had known beforehand when getting started.
- Use a spot instance unless the job absolutely must not be interrupted. Your wallet will thank you later.
- Make sure the NVIDIA drivers are installed, and don't experiment with operating systems. You are paying by the hour.
- Use something like tmux to keep the sessions in your terminal running, so you don't have to start from scratch if you disconnect from the VM (as long as the VM itself isn't shut down). That way you can detach from the terminal and leave the job alone until it's done.
- If you don't have a CUDA-capable GPU locally, debug on your local machine's CPU first. A CPU is perfectly fine for debugging the model.
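Spot instances can be reclaimed at any time, so the spot-instance tip pairs naturally with saving progress periodically and resuming from the last save. Below is a minimal sketch of that pattern, assuming a plain Python job. The JSON "state" is a hypothetical stand-in for a real framework checkpoint (model and optimizer weights), and the names `save_checkpoint`, `load_checkpoint` and `train` are illustrative, not from any library. Each checkpoint is written to a temporary file and then renamed over the old one with `os.replace`, so an interruption mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
from typing import Optional

# Hypothetical default location; on a spot VM this should live on storage
# that outlives the instance (persistent disk, object storage sync, etc.).
CKPT_PATH = "checkpoint.json"


def save_checkpoint(state: dict, path: str = CKPT_PATH) -> None:
    """Atomically write `state`: dump to a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit the disk before the rename
    os.replace(tmp_path, path)  # atomic rename on POSIX


def load_checkpoint(path: str = CKPT_PATH) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss_history": []}
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, save_every: int, path: str = CKPT_PATH,
          stop_at: Optional[int] = None) -> dict:
    """Toy training loop that resumes from the last checkpoint.

    `stop_at` is a hypothetical hook that simulates a spot preemption
    by raising after that step has run.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        loss = 1.0 / (step + 1)  # placeholder for a real optimizer step
        state["loss_history"].append(loss)
        state["step"] = step + 1
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            save_checkpoint(state, path)
        if stop_at is not None and state["step"] == stop_at:
            raise RuntimeError("simulated preemption")
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        try:
            train(total_steps=10, save_every=5, path=path, stop_at=7)
        except RuntimeError:
            pass  # the "VM" went away after step 7; the step-5 checkpoint survives
        print(load_checkpoint(path)["step"])  # 5
        final = train(total_steps=10, save_every=5, path=path)
        print(final["step"], len(final["loss_history"]))  # 10 10
```

Steps 6 and 7 are lost and recomputed after the simulated preemption, so `save_every` trades checkpointing overhead against how much work an interruption can cost you.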
Now what about you all?
royalemate357 t1_j94ax4h wrote
Depending on the scale you're working at, egress (data transfer) fees can be something to watch out for. Be aware of them whenever you move data around or data leaves the cloud (e.g. when you download a model checkpoint).
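To get a feel for when egress fees start to matter, the bill is roughly the amount of data leaving multiplied by a per-GB rate. The sketch below is a back-of-the-envelope estimate under simplifying assumptions: a flat rate with an optional free allowance, whereas real providers typically use tiered pricing that varies by destination. The $0.10/GB figure is a made-up placeholder rather than any provider's actual price, and `estimate_egress_cost` is a hypothetical helper.

```python
# Decimal gigabytes; providers differ on GB vs GiB billing, so check yours.
BYTES_PER_GB = 10**9


def estimate_egress_cost(bytes_out: int, price_per_gb: float,
                         free_gb: float = 0.0) -> float:
    """Rough egress cost: billable gigabytes times a flat per-GB rate.

    Assumes a single flat rate; real pricing is usually tiered and depends
    on where the data goes (internet, another region, same zone).
    """
    gb = bytes_out / BYTES_PER_GB
    billable = max(0.0, gb - free_gb)
    return billable * price_per_gb


if __name__ == "__main__":
    # A hypothetical 10 GB checkpoint downloaded 5 times at an assumed $0.10/GB.
    checkpoint_bytes = 10 * BYTES_PER_GB
    cost = estimate_egress_cost(5 * checkpoint_bytes, price_per_gb=0.10)
    print(f"${cost:.2f}")  # $5.00
```

Because the cost scales linearly with bytes moved, repeatedly pulling large checkpoints or datasets out of the cloud is where it adds up.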