melgor89 t1_j5u6pdr wrote on January 25, 2023 at 4:27 PM

Reply to comment by mgwizdala in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996

As said in the topic, gradient accumulation is not a solution. However, gradient checkpointing could be. https://paperswithcode.com/method/gradient-checkpointing It recompute some of the features map during backwards pass so that they are not stored in memory. So you can fit bigger batch size