ab3rratic t1_izgrfxy wrote

Mini-batch gradient descent (the usual method) does not require the entire dataset to fit into memory, only one batch at a time.
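The point about only one batch needing to fit in memory can be sketched as follows. This is a minimal illustration, not anyone's production setup: `stream_batches` is a hypothetical generator standing in for reading from disk, and the model is a toy one-variable linear regression on synthetic data.

```python
import random

def stream_batches(n_samples, batch_size, seed=0):
    """Yield (xs, ys) one batch at a time; a stand-in for reading from disk."""
    rng = random.Random(seed)
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        ys = [3.0 * x + 0.5 for x in xs]  # synthetic target: y = 3x + 0.5
        yield xs, ys

def train(epochs=50, n_samples=1000, batch_size=32, lr=0.1):
    """Fit y = w*x + b with mini-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Only the current batch is held in memory at any point.
        for xs, ys in stream_batches(n_samples, batch_size):
            n = len(xs)
            errs = [w * x + b - y for x, y in zip(xs, ys)]
            grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
            grad_b = 2.0 / n * sum(errs)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w, b = train()
```

Memory use here depends on `batch_size`, not on `n_samples`, which is why dataset size alone does not force a move to multiple machines.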

1

IdeaEnough443 OP t1_izgwvp5 wrote

But wouldn't the training process be slower than with parallelization? Is batch gradient descent the industry standard for handling large datasets in NN training?

1

PassionatePossum t1_izi24ow wrote

You can still parallelize when using batch gradient descent. For example, if you use MirroredStrategy in TensorFlow, each batch is split across multiple GPUs. The only downside is that this approach doesn't scale well if you want to train on more than one machine, since the model needs to be synced after each iteration.
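To make the idea concrete, here is a plain-Python sketch of synchronous data parallelism. It is not TensorFlow code and does not use any GPU; the function names are made up for illustration. Each "replica" computes gradients on its slice of the batch, the results are summed (the step an all-reduce performs), and every replica applies the same update.

```python
def grad_sum(w, xs, ys):
    """Sum of per-sample gradients of (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def split(seq, n_replicas):
    """Split a batch into n_replicas contiguous, near-equal slices."""
    k, r = divmod(len(seq), n_replicas)
    out, start = [], 0
    for i in range(n_replicas):
        end = start + k + (1 if i < r else 0)
        out.append(seq[start:end])
        start = end
    return out

def data_parallel_step(w, xs, ys, n_replicas, lr):
    """One synchronous step: local gradients, then a simulated all-reduce."""
    slices = zip(split(xs, n_replicas), split(ys, n_replicas))
    local = [grad_sum(w, sx, sy) for sx, sy in slices]
    # Sum across replicas and normalize by the global batch size.
    g = sum(local) / len(xs)
    return w - lr * g

def single_device_step(w, xs, ys, lr):
    """The same update computed on the whole batch at once."""
    return w - lr * grad_sum(w, xs, ys) / len(xs)

xs = [0.5, -1.0, 2.0, 1.5, -0.5, 1.0, 0.25, -2.0]
ys = [2.0 * x for x in xs]
w_parallel = data_parallel_step(0.0, xs, ys, n_replicas=4, lr=0.05)
w_single = single_device_step(0.0, xs, ys, lr=0.05)
```

Up to floating-point rounding, the two updates match. The sum across replicas is the synchronization point that has to happen every iteration, and it is the part that gets expensive once replicas sit on different machines.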

But you should think long and hard about whether training on multiple machines is really necessary, since that brings a whole new set of problems. 700GB is not that large. We do that all the time. I don't know what kind of model you are trying to train, but we have a GPU server with 8 GPUs and I've never felt the need to go beyond the normal MirroredStrategy for parallelization. And should you run into the problem that you cannot fit the data onto the machine where you are training: load it over the network.

You just need to make sure that your input pipeline supports that efficiently. Shard your dataset so you can have many concurrent I/O operations.
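A rough sketch of what sharding buys you, using only the Python standard library. The file naming scheme and helper names are assumptions for illustration, and local temp files stand in for network storage. With the data split across several files, multiple reads can be in flight at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_shards(records, n_shards, directory):
    """Write records round-robin into n_shards text files; return their paths."""
    paths = [
        os.path.join(directory, f"shard-{i:05d}-of-{n_shards:05d}.txt")
        for i in range(n_shards)
    ]
    files = [open(p, "w") for p in paths]
    try:
        for i, rec in enumerate(records):
            files[i % n_shards].write(f"{rec}\n")
    finally:
        for f in files:
            f.close()
    return paths

def read_shard(path):
    """Read one shard; each line holds a single integer record."""
    with open(path) as f:
        return [int(line) for line in f]

def read_all(paths, max_workers=4):
    """Read shards concurrently, up to max_workers files at a time."""
    loaded = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for shard in pool.map(read_shard, paths):
            loaded.extend(shard)
    return loaded

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(range(100), n_shards=8, directory=tmp)
    records = read_all(shard_paths)
```

For brevity this collects everything into one list; a real input pipeline would yield batches as shards arrive. In TensorFlow, `tf.data.Dataset.interleave` with `num_parallel_calls` covers the same pattern.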

And in case scaling is really important to you, may I suggest you look into Horovod?

2

SwordOfVarjo t1_izgx533 wrote

It's the industry standard for NN training, period. Your dataset isn't that big; just train on one machine.

1

IdeaEnough443 OP t1_izgyjq8 wrote

Our dataset takes close to a day to finish training. If we had 5x the data, it wouldn't work for our application. That's why we are trying to see whether distributed training would help lower the training time.

1