Submitted by shingekichan1996 t3_10ky2oh in MachineLearning
Paedor t1_j5ur6tx wrote
Reply to comment by altmly in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
The trouble is that contrastive methods often compare elements from the same batch, instead of treating elements as independent like pretty much all other ML (except batchnorm).
As a simple example, take a really weird version of contrastive learning: with one batch of 2N elements, the loss might use all (2N)^2 = 4N^2 distances between batch elements, while with two accumulated batches of N, it could only use 2 * N^2 = 2N^2 pairs, because pairs that straddle the two batches are never formed.
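To make the counting concrete, here is a minimal sketch of the arithmetic above. The function names are hypothetical, and it assumes ordered pairs with self-pairs included, which is what makes the 4N^2 and 2N^2 figures come out exactly.

```python
def pairs_full_batch(batch_size: int) -> int:
    """Ordered pairs (i, j), self-pairs included, within one batch."""
    return batch_size * batch_size


def pairs_accumulated(batch_size: int, num_chunks: int) -> int:
    """Ordered pairs available when the batch is split into equal chunks
    and each chunk can only compare its own elements."""
    if batch_size % num_chunks != 0:
        raise ValueError("batch_size must be divisible by num_chunks")
    chunk = batch_size // num_chunks
    return num_chunks * chunk * chunk


N = 128
full = pairs_full_batch(2 * N)        # (2N)^2 = 4N^2
accum = pairs_accumulated(2 * N, 2)   # 2 * N^2 = 2N^2
print(full, accum, full / accum)
```

With k equal chunks the ratio of full-batch pairs to accumulated pairs is k, so the more pieces the batch is split into, the larger the share of pairs that never enter the loss.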
satireplusplus t1_j5v24u2 wrote
If you don't have 8 GPUs you can always run the same computation 8x in series on one GPU. Then you merge the results the same way the parallel implementation would. In most cases that's probably gonna end up being a form of gradient accumulation. Think of it this way: you compute your distances on a subset of n, but since there are far fewer pairs of distances, the gradient will be noisy. So you run it a couple of times and average the results to get an approximation of the real thing. Very likely this is what the parallel implementation does too.
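A toy sketch of that "run in series and average" idea, under loud assumptions: the loss here is just the mean squared distance between all pairs of 1-D embeddings `w * x` with a single scalar weight `w`, which is a stand-in and not a real contrastive loss, and every name below is hypothetical. It only illustrates how averaging per-chunk gradients compares with the gradient of the full batch.

```python
from typing import List


def mean_pairwise_sq_dist(xs: List[float]) -> float:
    """Mean of (a - b)^2 over all ordered pairs, self-pairs included."""
    n = len(xs)
    return sum((a - b) ** 2 for a in xs for b in xs) / (n * n)


def full_batch_grad(xs: List[float], w: float) -> float:
    """dL/dw for the toy loss L(w) = mean_{i,j} (w*x_i - w*x_j)^2,
    which equals w^2 * mean_{i,j} (x_i - x_j)^2."""
    return 2.0 * w * mean_pairwise_sq_dist(xs)


def accumulated_grad(xs: List[float], w: float, num_chunks: int) -> float:
    """Process equal-sized chunks one after another and average their
    gradients, i.e. plain gradient accumulation on the toy loss."""
    if len(xs) % num_chunks != 0:
        raise ValueError("len(xs) must be divisible by num_chunks")
    size = len(xs) // num_chunks
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    grads = [full_batch_grad(chunk, w) for chunk in chunks]
    return sum(grads) / len(grads)


batch = [0.0, 1.0, 2.0, 3.0]
print(full_batch_grad(batch, w=1.0))                 # all 16 ordered pairs
print(accumulated_grad(batch, w=1.0, num_chunks=2))  # only within-chunk pairs
```

The two printed values differ because the cross-chunk pairs never contribute to the accumulated gradient, so the average is an approximation of the full-batch gradient rather than a reproduction of it. How close it gets depends on the loss and on how the data is split; this sorted toy batch is a deliberately unfavorable split.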