I really enjoy training in the cloud; for some reason it feels satisfying. Still, here are a couple of things I wish I had known before getting started.

Use a spot instance unless you absolutely must make sure it isn't interrupted. Your wallet will thank you later.
Make sure the NVIDIA drivers are installed, and don't experiment with operating systems. You are paying by the hour.
Use something like tmux to keep your terminal sessions alive, so you don't have to start from scratch if you disconnect from the VM (as long as the VM itself isn't shut down). That way you can just close the terminal and not bother with it until the job is done.
Debug on your local machine on CPU if you don't have CUDA. You can debug the model on a CPU perfectly fine.

Now what about you all?

Comments

royalemate357 t1_j94ax4h wrote on February 19, 2023 at 3:20 AM

#1,865,851

Depending on what scale you're working at, egress fees / data transfer fees can be something to look out for. Be aware of them if you are moving data around or data is leaving (e.g. you are downloading a model checkpoint).

I_will_delete_myself OP t1_j94c8d1 wrote on February 19, 2023 at 3:31 AM

#1,865,947

Replying to royalemate357 (#1,865,851)

This is something most cloud services use to lock you in to their services and discourage migrations to another vendor.

Demortus t1_j94q9zd wrote on February 19, 2023 at 5:42 AM

#1,866,975

Running Linux on your desktop/laptop makes it significantly easier to run projects on the cloud. Namely, you will be familiar with all the dependencies needed to run your project and how to install them on the remote machine. Moreover, you will not need to make many changes, if any, to your scripts to get them to work.

I_will_delete_myself OP t1_j94qohm wrote on February 19, 2023 at 5:46 AM

#1,866,996

Replying to Demortus (#1,866,975)

I agree. It also helps with deploying an API for your model. Also, systemd is useful for keeping things running if the server gets reset for whatever reason.

Tgs91 t1_j954pam wrote on February 19, 2023 at 8:42 AM

#1,867,961

If you work in a job where you're frequently asked to apply your code using different cloud environments (AWS, Azure, Google, local machines, etc, etc), then it's good to dev/test code locally and have a mix of Windows and Mac on your team. If your tests pass on both Mac and Windows, then they'll probably also pass on just about any Linux based environment in a cloud service. Dev local, train on cloud with minimal debugging because you pay by the hour.

Lifaux t1_j9572rl wrote on February 19, 2023 at 9:15 AM

#1,868,109

Replying to Demortus (#1,866,975)

Alternatively, you can always use WSL2 if you don't want to dual boot.

RideOrDieRemember t1_j957mmn wrote on February 19, 2023 at 9:22 AM

#1,868,149

Is there a trick to spot instances on aws? In the past when I tried to spot instance a gpu it was never available.

Lifaux t1_j957q53 wrote on February 19, 2023 at 9:24 AM

#1,868,158

If you're having to debug code, VSCode has really good integrations for running on your remote server. Unless you're already very familiar with vim, it's going to be quicker to set this up.

Ensure you've got rsync experience - no one wants to include venv when pulling your changes back from the remote side.
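A minimal sketch of the rsync point above, wrapped in Python so it can be scripted. The host name, the paths, and the exclude list are made-up placeholders, and `build_rsync_pull` is a hypothetical helper name; `-a`, `-z`, and `--exclude` are standard rsync options.

```python
import shlex


def build_rsync_pull(remote: str, remote_dir: str, local_dir: str,
                     excludes=("venv/", ".venv/", "__pycache__/")) -> list[str]:
    """Build an rsync command that pulls remote_dir into local_dir, skipping excludes."""
    cmd = ["rsync", "-az"]  # -a: archive mode, -z: compress during transfer
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd += [f"{remote}:{remote_dir.rstrip('/')}/", local_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder host and paths; print the command instead of running it.
    print(shlex.join(build_rsync_pull("user@gpu-box", "project", "./project")))
```

Once the printed command looks right, passing the list to `subprocess.run(cmd, check=True)` would actually run the transfer.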

Run the image you're using remotely locally via docker first. Check your code works, you don't want to be messing around with fixes while your GPUs sit idle.

If you're running compiled code, check the CPU architecture. I wasted a day debugging a fault that was due to compiling starspace on a build server that had different architecture to our remote server.

Tmux is a godsend.

skippy_nk t1_j95aurm wrote on February 19, 2023 at 10:08 AM

#1,868,338

The discovery of tmux was one of my greatest achievements of early 2022.

Mefaso t1_j95hjkm wrote on February 19, 2023 at 11:44 AM

#1,868,762

Replying to Demortus (#1,866,975)

>Running Linux on your desktop/laptop makes it significantly easier to run projects on the cloud

Just as a note, this can easily be done in a Docker container on Windows as well.

Mefaso t1_j95hl4n wrote on February 19, 2023 at 11:44 AM

#1,868,764

Replying to RideOrDieRemember (#1,868,149)

Maybe try different regions?

__lawless t1_j95ixov wrote on February 19, 2023 at 12:02 PM

#1,868,861

Use code-server (VS Code in the browser). It is amazing.

I_will_delete_myself OP t1_j95u7e7 wrote on February 19, 2023 at 2:04 PM

#1,869,825

Replying to RideOrDieRemember (#1,868,149)

AWS isn't the only provider offering spot instances.

dancingnightly t1_j95wa9s wrote on February 19, 2023 at 2:22 PM

#1,870,014

Replying to RideOrDieRemember (#1,868,149)

Try multiple regions and zones. Availability has peaks and troughs; most notably, the weekend is a good time to get spot instances. There are sites and scripts online that use the AWS CLI to check for you.
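A rough sketch of what such a script could look like, assuming the AWS CLI is installed and configured. It shells out to `aws ec2 describe-spot-price-history` and picks the cheapest availability zone per region. The function names, the regions, and the `p3.2xlarge` instance type are illustrative choices, not something from this thread. Note that the spot price is only a proxy: a low price does not guarantee that capacity is actually available.

```python
import json
import subprocess
from datetime import datetime, timezone


def spot_price_command(region: str, instance_type: str) -> list[str]:
    """Build the AWS CLI call that asks for current spot prices in one region."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        "aws", "ec2", "describe-spot-price-history",
        "--region", region,
        "--instance-types", instance_type,
        "--product-descriptions", "Linux/UNIX",
        "--start-time", now,  # start at "now" to get prices currently in effect
        "--output", "json",
    ]


def cheapest_zone(payload: dict):
    """Return (availability_zone, price) for the lowest price, or None if empty."""
    entries = payload.get("SpotPriceHistory", [])
    if not entries:
        return None
    best = min(entries, key=lambda e: float(e["SpotPrice"]))
    return best["AvailabilityZone"], float(best["SpotPrice"])


def scan(regions, instance_type: str) -> dict:
    """Query each region in turn; requires AWS credentials to be set up."""
    results = {}
    for region in regions:
        out = subprocess.run(
            spot_price_command(region, instance_type),
            capture_output=True, text=True, check=True,
        ).stdout
        results[region] = cheapest_zone(json.loads(out))
    return results

# Example (needs credentials): scan(["us-east-1", "us-west-2"], "p3.2xlarge")
```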

VaxxBetrayal t1_j975a7w wrote on February 19, 2023 at 7:43 PM

#1,873,913

Replying to Lifaux (#1,868,109)

Embrace extend extinguish.

Appropriate_Ant_4629 t1_j97gjhy wrote on February 19, 2023 at 9:01 PM

#1,874,830

Replying to royalemate357 (#1,865,851)

>egress fees / data transfer fees

On the bright side, ingress is often free.

It costs surprisingly little to stream live video ***into*** the cloud and spew back tiny embedding vectors from models running there.

fasttosmile t1_j97r2fc wrote on February 19, 2023 at 10:15 PM

#1,875,597

byobu > tmux

No_Goat277 t1_j98oklc wrote on February 20, 2023 at 2:34 AM

#1,878,533

What is the total cost of cloud vs. running your own servers on-prem? I need to start a project with 2/4 RTX cards to train my Stable Diffusion model.

I_will_delete_myself OP t1_j98p0vg wrote on February 20, 2023 at 2:38 AM

#1,878,580

Replying to No_Goat277 (#1,878,533)

I've been running an A100 the entire weekend, and so far it has cost me under 20 bucks. If you only need it for around an hour, it would probably cost you between 1 and 3 dollars.

I would recommend planning a budget before you get started; it will almost always be cheaper on a yearly basis. Try Colab first and see whether you will need it for longer than 12 hours.

No_Goat277 t1_j98pvwn wrote on February 20, 2023 at 2:45 AM

#1,878,673

Replying to I_will_delete_myself (#1,878,580)

Thank you. I have a scientific team, and our PhD is requesting GPUs for SD training. Our other team is using Midjourney, but it has no API, so they are happy while we can't move forward.

I_will_delete_myself OP t1_j98ql8h wrote on February 20, 2023 at 2:51 AM

#1,878,747

Replying to No_Goat277 (#1,878,673)

You can get free credits for research, up to thousands of dollars' worth, if you ask for them online:

https://aws.amazon.com/government-education/research-and-technical-computing/cloud-credit-for-research/

https://www.microsoft.com/en-us/azure-academic-research/

https://edu.google.com/intl/ALL_us/programs/credits/research/?modal_active=none

The cloud vs local debate depends on your needs though.

milleeeee t1_j9cbrxg wrote on February 20, 2023 at 9:50 PM

#1,890,654

Replying to No_Goat277 (#1,878,673)

Azure has cheap A100 spot instances: only $1 per hour per A100. Up until now I have always gotten my instances immediately, and I have only been kicked out twice in over 100 training runs (each run lasts a couple of hours). So I am very happy with it at the moment and would highly recommend it.
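For budgeting, the arithmetic behind numbers like these is simple enough to script. A back-of-the-envelope sketch using the figures above, where "a couple of hours" is taken as 2 hours per run (an assumption) and storage and egress fees are ignored. `training_cost` is a made-up helper name.

```python
def training_cost(hourly_rate_usd: float, gpus: int,
                  hours_per_run: float, runs: int) -> float:
    """Compute cost only: rate x GPUs x hours x runs (no storage or egress)."""
    return hourly_rate_usd * gpus * hours_per_run * runs


# $1 per hour per A100, one GPU, roughly 2 hours per run, 100 runs.
total = training_cost(hourly_rate_usd=1.0, gpus=1, hours_per_run=2.0, runs=100)
print(f"${total:.2f}")  # $200.00
```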

danielgafni t1_j9cjwet wrote on February 20, 2023 at 10:46 PM

#1,891,355

Replying to skippy_nk (#1,868,338)

Time to learn about Zellij

DeepDeeperRIPgradien t1_j9eo5uw wrote on February 21, 2023 at 11:05 AM

#1,898,358

Can you recommend a tutorial or something that explains the steps to move from (e.g. PyTorch) training on your own machine to training that model in the cloud (e.g. AWS)? What type of instances to choose, how/where to store data, making sure the NVIDIA/CUDA stuff is working properly, etc.?

gamerx88 t1_j9evm62 wrote on February 21, 2023 at 12:33 PM

#1,899,065

How do you utilize a spot instance for training? How do you automatically resume training from a checkpoint? Or are you referring to something like Sagemaker's managed spot training?

I_will_delete_myself OP t1_j9fodao wrote on February 21, 2023 at 4:15 PM

#1,902,181

Replying to DeepDeeperRIPgradien (#1,898,358)

>Can you recommend a tutorial or something that explains the steps to move from (e.g. pytorch) training on your own machine to training that model in the Cloud (e.g. AWS)?

Same as running on your own machine.

>What type of instances to chose, how/where to store data, making sure Nvidia/CUDA stuff is working properly, etc.?

Just look up an EC2 instance or VM that has the GPU you want and there you go. nvidia-smi is the command that should tell you which GPU you have; the drivers are working if it lists your GPU. I would also suggest checking from your code that CUDA is available.
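A small sketch of both checks, assuming Python on the instance. `driver_visible` and `pick_device` are made-up helper names; `torch.cuda.is_available()` is the standard PyTorch call. If PyTorch is not installed, the sketch simply falls back to CPU, which also covers the "debug locally on CPU" case.

```python
import shutil
import subprocess


def driver_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing to run on the GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("driver visible:", driver_visible())
    print("device:", pick_device())
```

If the first check passes and the second still reports "cpu", the installed PyTorch build may be a CPU-only one; the drivers alone are not enough.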

I prefer to use an EC2 instance or VM because it's normally cheaper, but you have to do your own research on pricing. Cloud is a competitive market, so there is always someone ready to offer an A100 at a cheaper price. I heard Lambda Cloud was super cheap for on-demand.

I_will_delete_myself OP t1_j9fp5fh wrote on February 21, 2023 at 4:20 PM

#1,902,264

Replying to gamerx88 (#1,899,065)

Try looking into whether they have an API. Shutdowns are rare, but they do happen; I have only run into one once. Having the cloud console on your mobile device is great: it lets you check in from anywhere and do simple things quickly.

https://aws.amazon.com/about-aws/whats-new/2013/01/08/use-amazon-cloudwatch-to-detect-and-shut-down-unused-amazon-ec2-instances/
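On the checkpoint question itself, one common pattern (a generic sketch, not necessarily what anyone in this thread runs) is to save state after every epoch and have the training script resume from the last saved epoch on startup. The example below uses JSON and a fake loss so it runs anywhere. With PyTorch you would instead `torch.save` and `torch.load` the model and optimizer `state_dict`s. All function names and the file path are illustrative, and the checkpoint should live on storage that survives the instance.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_epochs: int) -> dict:
    """Resume from the last completed epoch and checkpoint after each one."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training epoch
        state = {"epoch": epoch + 1, "loss": loss}
        save_checkpoint(path, state)
    return state
```

If the instance is reclaimed partway through, rerunning `train` with the same path picks up at the first epoch that was not saved, so at most one epoch of work is lost.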