
nwatab t1_iy75dep wrote

I was training on a 10GB dataset on an AWS EC2 instance (AMI: Deep Learning AMI GPU TensorFlow 2.10.0 (Amazon Linux 2) 20221116). After about half an epoch, the instance becomes very slow due to lack of memory. Does anyone know why? I don't understand why it slows down after about half an epoch (less than 10 minutes in) rather than at the beginning of training.


I-am_Sleepy t1_iy7dqu4 wrote

I am not sure, but maybe the data being read is cached? Try disabling that first, or there may be a memory leak somewhere in the code.

If your data is a single large file, it will try to read the whole thing into memory as one tensor first. So if it is too large, try implementing your dataset as a generator (with batching; see the sketch below), or speed up preprocessing by saving the processed input as protobuf files.

But a single-large-file dataset shouldn't slow down at half an epoch, so that is up for debate, I guess.
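Roughly what I mean by the generator idea, as a minimal sketch (load_sample, load_label, the shapes, and NUM_SAMPLES are placeholders for your own reading code, not anything from your setup):

```python
import numpy as np
import tensorflow as tf

NUM_SAMPLES = 1000  # placeholder

def load_sample(i):
    # Stand-in for reading and decoding one example from disk.
    return np.random.rand(224, 224, 3).astype("float32")

def load_label(i):
    # Stand-in for looking up the label of example i.
    return i % 10

def sample_generator():
    # Yields one (image, label) pair at a time, so only the current
    # batch has to live in memory instead of the whole dataset.
    for i in range(NUM_SAMPLES):
        yield load_sample(i), load_label(i)

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=(
            tf.TensorSpec(shape=(224, 224, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int64),
        ),
    )
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```

If preprocessing is the slow part, writing the processed examples out once as TFRecord (protobuf) files and reading them back with tf.data.TFRecordDataset gives a similar effect without redoing the work every epoch.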


nwatab t1_iy7yfr8 wrote

Thanks. My data is one CSV plus a lot of JPGs, and I'm using tf.data input pipelines. Based on your insights, .cache() could be causing the problem. I'll check it.
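If it is the cache, that would also explain the timing: as I understand it, .cache() with no argument holds every element it has produced in RAM, so memory only fills up as the first epoch progresses rather than right at the start. A minimal sketch of what I'll try (the paths and decode step are placeholders, not my actual pipeline):

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("images/*.jpg", shuffle=False)  # placeholder path

def decode_image(path):
    # Read and decode one JPG, then resize and normalize it.
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(img, (224, 224)) / 255.0

dataset = (
    files.map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)
    # .cache()                    # in-memory cache: grows until RAM runs out
    .cache("/tmp/train_cache")    # file-backed cache keeps the reuse benefit off RAM
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```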


nwatab t1_iy8bssy wrote

Yes, it was the cache that was causing the problem. It works well now. Somehow that didn't occur to me. Thanks!
