
No_Ninja3309_NoNoYes t1_je90yip wrote

Formally, it means minimizing an error, much like curve fitting, for example fitting a line to data points. There are several steps involved:

  1. Defining the problem

  2. Choosing architecture

  3. Getting data

  4. Exploring the data

  5. Cleaning the data

  6. Coding up some experiments

  7. Splitting the data into training, validation, and test sets. The test set is only used to evaluate the final error, like an exam. The validation set is the data you use to tweak hyperparameters. The training set is bigger than the other two.

  8. Setting up the infrastructure

  9. Running something close to the real training job for a while, like a rehearsal, just to make sure everything works.
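To make step 7 concrete, here is a minimal sketch of a data split. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are illustrative assumptions, not anything prescribed above; real projects often split by time, user, or document instead of shuffling at random.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then carve out test, validation, and training sets.

    The fractions and the seed are assumptions chosen for illustration.
    """
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                 # touched only for the final "exam"
    val = shuffled[n_test:n_test + n_val]    # used to tweak hyperparameters
    train = shuffled[n_test + n_val:]        # the largest share, used for fitting
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

The three sets are disjoint by construction, which is what keeps the test set an honest exam.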

Once training starts, you have to be able to monitor it through logs and diagnostic plots. You also need to be able to take snapshots (checkpoints) of the system so that a run can be inspected or resumed. It's basically like running a Google search, but one that takes a long time: Google has internal systems that actually do the search, and no one person knows all the details.
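As a toy illustration of monitoring and snapshots, the sketch below fits a line by gradient descent, logs the loss periodically, and writes a JSON snapshot of the parameters. Everything here (the function `train`, the learning rate, the snapshot interval, the JSON file layout) is a hypothetical stand-in; real training jobs use a framework's own checkpoint format and a proper metrics dashboard.

```python
import json
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("train")


def train(xs, ys, steps=200, lr=0.05, snapshot_every=50, snapshot_dir=None):
    """Fit y = w * x + b by minimizing mean squared error.

    Hyperparameter values are assumptions picked for this tiny example.
    """
    snapshot_dir = snapshot_dir or tempfile.mkdtemp()
    w, b = 0.0, 0.0
    n = len(xs)
    history = []  # loss per step, the raw material for a diagnostic plot

    for step in range(1, steps + 1):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n  # measured before this update
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(loss)

        if step % snapshot_every == 0:
            log.info("step=%d loss=%.6f", step, loss)
            path = os.path.join(snapshot_dir, f"step_{step}.json")
            with open(path, "w") as f:
                # Parameters after the update, plus the loss logged above.
                json.dump({"step": step, "w": w, "b": b, "loss": loss}, f)

    return w, b, history, snapshot_dir


w, b, history, snapshot_dir = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Plotting `history` against the step number gives the simplest possible diagnostic plot, and any of the JSON files is enough to restart this toy run from that step.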

The speedup from adding more machines is limited by network latency and Amdahl's law, but it does help.
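For a rough feel of that limit, Amdahl's law says that if a fraction p of the work can be parallelized, the best speedup on n machines is 1 / ((1 - p) + p / n). A small sketch follows; the 95% parallel fraction is an assumed number for illustration, and the formula ignores network latency entirely, so real speedups are lower.

```python
def amdahl_speedup(parallel_fraction, n_machines):
    """Ideal speedup under Amdahl's law (no communication cost modeled)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_machines)


# Assumed workload: 95% of the job parallelizes perfectly.
for n in (1, 8, 64, 512):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With 5% of the work stuck being serial, the speedup can never exceed 1 / 0.05 = 20 no matter how many machines are added, which is why more hardware helps but with diminishing returns.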
