spaccetime t1_j01j8j7 wrote on December 13, 2022 at 12:16 PM

Reply to comment by Shardsmp in [D] Does Google TPU v4 compete with GPUs in price/performance? by Shardsmp

I’d love to have a Dell Precision with an A5500 as a daily workstation, but our hardware budget can’t cover it. 😀

For us, so far, anything with fp32 tensor cores and 16GB of VRAM has been sufficient for developing and debugging our models, mainly RNNs, Transformers, CNNs and GANs. But the moment we want to train on millions of samples with a larger batch size, we have to switch to a faster machine or cluster.

spaccetime t1_izw55mj wrote on December 12, 2022 at 7:41 AM

Reply to comment by pommedeterresautee in [D] Does Google TPU v4 compete with GPUs in price/performance? by Shardsmp

Yes, just as /u/Mrgod2u82 mentioned - it’s expensive.

You should debug and prepare your model on a less expensive machine - your experimental and development machine - and then run the top model with all the data on the TPU - your production-grade machine.

For example, we trained BERT for 4 days. If we hadn’t paid enough attention when setting up the training, we could have spent another $800 just on experimenting, which is too expensive for us. Of course, companies like Google Brain and OpenAI probably don’t care about cost minimization. There you can use a TPU as your daily workstation. 😄
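To make the arithmetic concrete, here is a rough sketch of what one repeated run costs. It assumes the $8/hour TPU rate quoted elsewhere in this thread; the comment does not say which machine the 4-day run actually used, and `run_cost` is just an illustrative helper.

```python
# Rough cost of one training run: hours on the machine times the hourly price.
# Assumption: $8/hour is the TPU v3 rate quoted elsewhere in this thread;
# the comment does not state which machine the 4-day BERT run used.

def run_cost(days: float, dollars_per_hour: float) -> float:
    """Return the cost in dollars of keeping one machine busy for `days`."""
    return days * 24 * dollars_per_hour

bert_run = run_cost(days=4, dollars_per_hour=8.0)
print(f"One 4-day run: ${bert_run:.0f}")  # prints: One 4-day run: $768
```

Under that assumption a single wasted run comes to $768, which is in line with the roughly $800 mentioned above.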

Use one machine for development and another for the long, heavy training runs.

spaccetime t1_izsnm7n wrote on December 11, 2022 at 3:41 PM

Reply to [D] Does Google TPU v4 compete with GPUs in price/performance? by Shardsmp

8x NVIDIA A100 = $25/hour

TPU v3-4 = $8/hour

TPU v4-4 = $12/hour

When training BERT on 27B tokens, I measured faster training times on the TPU.
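Cost per run is hourly price times training time, so the pricier machine only wins on cost if it finishes faster by more than the price ratio. A minimal sketch using only the three hourly prices listed above; the function name is made up for illustration and no training times are assumed.

```python
# Hourly prices as listed in the comment (USD per hour).
PRICES = {"8x A100": 25.0, "TPU v3-4": 8.0, "TPU v4-4": 12.0}

def break_even_speedup(expensive: str, cheap: str) -> float:
    """How many times faster `expensive` must finish a run
    for that run to cost the same as on `cheap`."""
    return PRICES[expensive] / PRICES[cheap]

for tpu in ("TPU v3-4", "TPU v4-4"):
    ratio = break_even_speedup("8x A100", tpu)
    print(f"8x A100 must be {ratio:.2f}x faster than {tpu} to match its cost per run")
```

If the TPU is also the faster machine for your workload, as it was in the BERT run, it comes out ahead on both price and time.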

NVIDIA’s GPUs are great for deep learning, but deep learning is not what they were designed for. They have CUDA cores and even RT cores. You pay extra for them being good at rendering, but you don’t use that, or use it just a little, when training deep learning models.

Google’s TPU is engineered only for deep learning. The MXU is unrivaled.

For short-term usage take the TPU, and for long-term usage a DGX Station or another cluster.

The TPU is not for experimental usage. Use it only when you are sure that your model, data and parameterization make sense.