Submitted by Not-Banksy t3_126a1dm in singularity
No_Ninja3309_NoNoYes t1_je90yip wrote
Formally it means minimizing error, as in curve fitting: for example, fitting a line to a set of data points.
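As a minimal sketch of that idea (the toy data and variable names are made up for illustration), here is a line fit that minimizes squared error with NumPy:

```python
import numpy as np

# Toy data: points roughly on y = 2x + 1, with noise added
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# Fit a degree-1 polynomial (a line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")

# The "error" being minimized is the sum of squared residuals
residuals = y - (slope * x + intercept)
print("sum of squared errors:", float(np.sum(residuals**2)))
```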
There are roughly these steps:

- Defining the problem
- Choosing an architecture
- Getting data
- Exploring the data
- Cleaning the data
- Coding up some experiments
- Splitting the data into training, validation, and test sets. The test set is only used to evaluate the error, like an exam, and the validation set is used to tweak hyperparameters. The training set is bigger than the other two (see the split sketch after this list).
- Setting up the infrastructure
- Doing a run that is close to the real training job for a while, like a rehearsal, just to make sure everything works
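A minimal sketch of the split mentioned above, using scikit-learn's train_test_split; the 80/10/10 proportions and the random data are just assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)   # toy features
y = np.random.rand(1000)       # toy targets

# First carve off 20% as a holdout, then split that holdout in half
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100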
Once training starts, you have to be able to monitor it through logs and diagnostic plots, and you need to be able to take snapshots (checkpoints) of the system so you can resume or roll back. It's basically like running a Google search, but one that takes a very long time: Google has internal systems that actually do the search, and no one can actually know all the details.
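A minimal sketch of what that logging and snapshotting might look like, assuming a PyTorch training loop; the model, data, loss, and file names are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(32, 10)                    # placeholder batch
    y = torch.randn(32, 1)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

    if step % 100 == 0:
        # Logging: watch the loss curve for divergence or plateaus
        print(f"step {step}: loss {loss.item():.4f}")
        # Snapshot: save enough state to resume exactly where you left off
        torch.save(
            {"step": step, "model": model.state_dict(), "optim": opt.state_dict()},
            f"checkpoint_{step}.pt",
        )
```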
Adding more machines does help, but the speedup is limited by network latency and by Amdahl's law: the part of the job that can't be parallelized puts a ceiling on how much faster it can get.
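As a quick illustration of that ceiling (the 95% parallel fraction is just an assumed number), Amdahl's law says the speedup from n machines is 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup from n workers when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of the work parallelizable, even unlimited machines top out near 20x
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```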