melgor89 t1_j5u6pdr wrote

As said in the topic, gradient accumulation is not a solution. However, gradient checkpointing could be: https://paperswithcode.com/method/gradient-checkpointing It recomputes some of the feature maps during the backward pass instead of keeping them in memory, so you can fit a bigger batch size at the cost of some extra computation.
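To make the idea concrete, here is a rough pure-Python sketch of the recompute trick on a toy chain of scalar layers. Everything in it (the `layer` function, the `segment` parameter, the "peak" counter) is made up for illustration and is not taken from any library; in practice you would use your framework's built-in utility, e.g. `torch.utils.checkpoint` in PyTorch.

```python
import math

def layer(w, x):
    # Toy "layer": a scalar tanh unit (hypothetical, for illustration only).
    return math.tanh(w * x)

def layer_grad(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return w * (1.0 - y * y)

def grad_full(weights, x):
    """Baseline backprop: keep every activation from the forward pass."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    g = 1.0
    # acts[:-1] holds the input of each layer, in order.
    for w, a_in in zip(reversed(weights), reversed(acts[:-1])):
        g *= layer_grad(w, a_in)
    return acts[-1], g, len(acts)

def grad_checkpointed(weights, x, segment):
    """Keep only the input of every `segment`-th layer; recompute the rest."""
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = a  # stored activation
        a = layer(w, a)         # everything else is discarded
    out = a

    g = 1.0
    peak = len(checkpoints)  # rough count of activations held at once
    for start in sorted(checkpoints, reverse=True):
        seg_w = weights[start:start + segment]
        # Recompute the inputs of the layers in this segment.
        acts = [checkpoints[start]]
        for w in seg_w[:-1]:
            acts.append(layer(w, acts[-1]))
        peak = max(peak, len(checkpoints) + len(acts))
        # Backprop through the segment, then drop its activations.
        for w, a_in in zip(reversed(seg_w), reversed(acts)):
            g *= layer_grad(w, a_in)
    return out, g, peak

weights = [0.5 + 0.1 * i for i in range(12)]
out_full, g_full, mem_full = grad_full(weights, 0.3)
out_ckpt, g_ckpt, mem_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(out_full, g_full, mem_full)
print(out_ckpt, g_ckpt, mem_ckpt)
```

Both functions should give the same output and gradient; the checkpointed one just holds fewer activations at a time and pays for that with a second forward pass per segment. The freed memory is what lets you raise the batch size.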
