This thread is dedicated to exploring the various techniques used in self-supervised contrastive learning that utilize standard batch sizes. I am seeking information on the current methods in this field, specifically those that do not rely on large batch sizes.

I am familiar with the SimSiam paper published by Meta AI (FAIR), which uses a batch size of 256 on 8 GPUs. However, for individuals with limited resources such as myself, access to that many GPUs may not be feasible. As a result, I am interested in learning about other methods that work with smaller batch sizes on a single GPU, for example when training on 1024x1024 input images.

Additionally, I am curious about any more efficient architectures that have been developed in this field. This includes, but is not limited to, techniques used in natural language processing that may have applications in other areas of artificial intelligence.

*** I posted the same question on the PyTorch forums and am reposting here for wider reach.

Comments

IntelArtiGen t1_j5tijjx wrote on January 25, 2023 at 1:41 PM

#1,484,001

I managed to use SwAV on 1 GPU (8GB), batch size 240, 224x224 images, FP16, ResNet18.

Of course it works; the problem isn't just the batch size but the accuracy/batch-size trade-off, and the accuracy was quite bad (still usable for my task though). If 50% top-5 on ImageNet is OK for you, you can do it. But I'm not sure there are many tasks where it makes sense.

Perhaps contrastive learning isn't the best for single GPU. I'm not sure about the current SOTA on this task.

shingekichan1996 OP t1_j5tiupg wrote on January 25, 2023 at 1:44 PM

#1,484,045

Replying to IntelArtiGen (#1,484,001)

For 224x224 images, sure. But for images with large sizes, for example satellite images, it is hard to get 200+ batch size for a single gpu.

shingekichan1996 OP t1_j5tjy44 wrote on January 25, 2023 at 1:53 PM

#1,484,170

Replying to IntelArtiGen (#1,484,001)

I think single GPU for SSL contrastive learning is a research direction to pursue, I'm not sure if anyone published papers on it, but if there's none, I'm surprised.

Purple_noise_84 t1_j5tl26u wrote on January 25, 2023 at 2:01 PM

#1,484,286

How about mocov2? That should work on a single gpu

mgwizdala t1_j5tyf1f wrote on January 25, 2023 at 3:34 PM

#1,485,836

If you are willing to trade time for batch size you can try with gradient accumulation

RaptorDotCpp t1_j5u0yxq wrote on January 25, 2023 at 3:50 PM

#1,486,117

Replying to mgwizdala (#1,485,836)

Gradient accumulation is tricky for contrastive methods that rely on having lots of negatives in a batch.

shingekichan1996 OP t1_j5u22zn wrote on January 25, 2023 at 3:57 PM

#1,486,235

Replying to mgwizdala (#1,485,836)

Curious about this, I have not read any paper related. What is its effect on the performance (accuracy, etc) ?

mgwizdala t1_j5u2mgr wrote on January 25, 2023 at 4:01 PM

#1,486,307

Replying to shingekichan1996 (#1,486,235)

It depends on the implementation. Naive gradient accumulation will probably give better results than small batches, but as u/RaptorDotCpp mentioned, if you rely on many negative samples inside one batch, it will still be worse than large-batch training.

There is also a cool paper about gradient caching, which somehow solves this issue, but again with an additional penalty on training speed. https://arxiv.org/pdf/2101.06983v2.pdf

shingekichan1996 OP t1_j5u40dx wrote on January 25, 2023 at 4:09 PM

#1,486,460

Replying to mgwizdala (#1,486,307)

exactly the paper I need to read! Thanks!

melgor89 t1_j5u6pdr wrote on January 25, 2023 at 4:27 PM

#1,486,744

Replying to mgwizdala (#1,485,836)

As said in the topic, gradient accumulation is not a solution. However, gradient checkpointing could be. https://paperswithcode.com/method/gradient-checkpointing It recomputes some of the feature maps during the backward pass so that they are not stored in memory, which lets you fit a bigger batch size.

melgor89 t1_j5u766t wrote on January 25, 2023 at 4:30 PM

#1,486,794

There is a great paper analyzing the correlation between batch size and accuracy. They propose a loss function that lets SimCLR train at bs=256 instead of 4k. So, there is some research in this domain. https://arxiv.org/abs/2110.06848
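For readers who want the gist of that loss: as far as I understand the paper (Decoupled Contrastive Learning), the core change is dropping the positive pair from the denominator of the InfoNCE loss. The sketch below is a simplified, one-directional version in plain Python; the function names, the toy embeddings and `tau=0.5` are my own assumptions, and the paper's full loss also uses both augmented views as negatives.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(queries, keys, tau=0.5, decoupled=False):
    """keys[i] is the positive for queries[i]; all other keys are negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        # Standard InfoNCE keeps the positive in the denominator;
        # the decoupled variant leaves it out.
        denominator_terms = [math.exp(l) for j, l in enumerate(logits)
                             if not (decoupled and j == i)]
        losses.append(math.log(sum(denominator_terms)) - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D unit vectors standing in for embeddings of two augmented views.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
view_b = [[0.8, 0.6], [-0.6, 0.8], [-0.8, -0.6], [0.6, -0.8]]

standard = contrastive_loss(view_a, view_b)
decoupled = contrastive_loss(view_a, view_b, decoupled=True)
print(standard, decoupled)
```

The decoupled value is always lower here because its denominator has one term fewer; whether that helps at small batch sizes is the paper's empirical claim, not something this toy shows.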

Irate_Librarian1503 t1_j5uaqwq wrote on January 25, 2023 at 4:52 PM

#1,487,218

Barlow twins, maybe? Easy to implement and batch size effective.

No_Cryptographer9806 t1_j5ufip4 wrote on January 25, 2023 at 5:22 PM

#1,487,787

FastSiam: a SimSiam variant that fits on one GPU with small batch sizes (down to around 32) https://dl.acm.org/doi/abs/10.1007/978-3-031-16788-1_4

altmly t1_j5uglpx wrote on January 25, 2023 at 5:29 PM

#1,487,906

Replying to RaptorDotCpp (#1,486,117)

I'm confused. Gradient accumulation is exactly equivalent to batching as long as the data is the same, unless you use things like batch norm (you shouldn't).
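A small check of that equivalence for a loss that treats samples independently. This is a toy sketch in plain Python, assuming a 1-D linear model with squared error and equal-sized micro-batches; the data and names are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 7.0, 9.0]
w = 0.5

# Gradient computed on the full batch of 4.
full = grad_mse(w, xs, ys)

# Two accumulation steps over micro-batches of 2, then averaged.
micro = [grad_mse(w, xs[i:i + 2], ys[i:i + 2]) for i in (0, 2)]
accumulated = sum(micro) / len(micro)

print(full, accumulated)
```

Both values agree because each sample's loss term does not depend on the other samples in the batch. That independence is the assumption that in-batch contrastive losses break.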

koolaidman123 t1_j5ujfpv wrote on January 25, 2023 at 5:46 PM

#1,488,224

Replying to altmly (#1,487,906)

contrastive methods require in-batch negatives, you can't replicate that with grad accumulation

koolaidman123 t1_j5uk2ai wrote on January 25, 2023 at 5:50 PM

#1,488,308

cache your predictions on each smaller batch w/ labels until you get a similar batch size, then run your loss function

so instead of calculating loss and accumulating like gradient accumulation, you only calculate loss once you reach the target batch size
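A minimal sketch of the difference, in plain Python with made-up 2-D embeddings. `info_nce`, the toy vectors and `tau=0.5` are assumptions for illustration. It only compares loss values; it does not show how to backpropagate through the cached embeddings, which is the hard part the gradient caching paper addresses.

```python
import math

def info_nce(queries, keys, tau=0.5):
    """Mean InfoNCE loss; keys[i] is the positive for queries[i], the rest are negatives."""
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings standing in for encoder outputs of two augmented views.
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
view_b = [[0.8, 0.6], [-0.6, 0.8], [-0.8, -0.6], [0.6, -0.8]]

cache_a, cache_b, chunk_losses = [], [], []
for start in (0, 2):                      # two micro-batches of size 2
    a, b = view_a[start:start + 2], view_b[start:start + 2]
    chunk_losses.append(info_nce(a, b))   # what plain accumulation optimises
    cache_a.extend(a)                     # what the caching approach stores instead
    cache_b.extend(b)

accumulated_loss = sum(chunk_losses) / len(chunk_losses)
cached_loss = info_nce(cache_a, cache_b)  # one loss over all 4 samples
print(accumulated_loss, cached_loss)
```

The cached loss is larger because every query now sees three negatives instead of one, so the two procedures optimise different objectives.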

[deleted] t1_j5uooap wrote on January 25, 2023 at 6:18 PM

#1,488,872

[deleted]

Paedor t1_j5ur6tx wrote on January 25, 2023 at 6:33 PM

#1,489,181

Replying to altmly (#1,487,906)

The trouble is that contrastive methods often compare elements from the same batch, instead of treating elements as independent like pretty much all other ML (except batchnorm).

As a simple example with a really weird version of contrastive learning: with a batch of 2N, contrastive learning might use the 4N^2 distances between batch elements to calculate a loss, while with two accumulated batches of N, contrastive learning could only use 2N^2 pairs for loss.
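The arithmetic in that example, as a tiny Python check. It counts ordered pairs including self-pairs, which matches the 4N^2 figure; `N = 128` is an arbitrary choice.

```python
def in_batch_pairs(batch_size):
    # Every ordered pair (i, j) of elements within one batch, including i == j.
    return batch_size * batch_size

N = 128
one_big_batch = in_batch_pairs(2 * N)     # 4 * N^2 pairs
two_accumulated = 2 * in_batch_pairs(N)   # 2 * N^2 pairs

print(one_big_batch, two_accumulated)
```

With K accumulated micro-batches instead of two, the accumulated version only ever sees a fraction 1/K of the pairs available to the single large batch.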

satireplusplus t1_j5v24u2 wrote on January 25, 2023 at 7:39 PM

#1,490,434

Replying to Paedor (#1,489,181)

If you don't have 8 GPUs you can always run the same computation 8x in series on one GPU. Then you merge the results the same way the parallel implementation would do it. In most cases that's probably gonna end up being a form of gradient accumulation. Think of it this way: you basically compute your distances on a subset of n, but since there are much fewer pairs of distances, the gradient would be noisy. So you just run it a couple of times and average the result to get an approximation of the real thing. Very likely that this is what the parallel implementation does too.

draconicmoniker t1_j5v76ho wrote on January 25, 2023 at 8:10 PM

#1,491,056

Non-contrastive approaches, e.g. SwAV, can handle lower batch sizes https://arxiv.org/abs/2006.09882

squidward2022 t1_j5vmb95 wrote on January 25, 2023 at 9:41 PM

#1,492,861

(https://arxiv.org/pdf/2106.04156.pdf) This was a cool paper from NeurIPS 2021 which aimed to theoretically explain the success of CL by relating it to spectral clustering. They present a loss with a very similar form to InfoNCE, which they use for their theory. One of the upsides they found was that it worked well with small batch sizes.

(https://arxiv.org/abs/2110.06848) I skimmed this work a while back, one of their main claims is that this approach works with small batch sizes.

[deleted] t1_j5w9rbv wrote on January 26, 2023 at 12:21 AM

#1,495,807

Replying to koolaidman123 (#1,488,308)

[deleted]

koolaidman123 t1_j5wbk37 wrote on January 26, 2023 at 12:34 AM

#1,496,031

Replying to [deleted] (#1,495,807)

That's not the same thing...

Gradient accumulation calculates the loss on each batch. It doesn't work with in-batch negatives because you need to compare inputs from batch 1 to inputs from batch 2, hence offloading and caching the predictions, then calculating the loss as one batch.

That's why gradient accumulation doesn't work to simulate large batch sizes for contrastive learning, if you're familiar with it.

maximalentropy t1_j5wl3gx wrote on January 26, 2023 at 1:40 AM

#1,497,176

Momentum encoder based approaches don’t need large batch size because they use a queue for storing negatives rather than taking negatives from the mini-batch

https://github.com/facebookresearch/moco
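A rough sketch of the queue idea in plain Python. `NegativeQueue` and the tuple placeholders are hypothetical names for illustration, not the API of the linked repository; the momentum update of the key encoder is left out entirely.

```python
from collections import deque

class NegativeQueue:
    """FIFO buffer of past key embeddings that are reused as negatives."""

    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically.
        self.buffer = deque(maxlen=max_size)

    def negatives(self):
        return list(self.buffer)

    def enqueue(self, keys):
        self.buffer.extend(keys)

queue = NegativeQueue(max_size=6)
for step in range(5):
    batch_keys = [(step, i) for i in range(2)]  # stand-ins for key embeddings
    negatives = queue.negatives()               # negatives come from earlier batches
    queue.enqueue(batch_keys)                   # current keys become future negatives

print(len(negatives), queue.negatives()[0])
```

The number of negatives is set by `max_size` rather than by the batch size, which is why a batch of 2 can still be contrasted against 6 negatives in this toy run.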

shingekichan1996 OP t1_j5wlavz wrote on January 26, 2023 at 1:42 AM

#1,497,196

Replying to melgor89 (#1,486,794)

I saw an implementation of that paper here: https://github.com/raminnakhli/Decoupled-Contrastive-Learning

I also saw that the same paper was rejected from NeurIPS'21 because of its similar impact on other methods like Barlow Twins, SimSiam, BYOL, etc.

However, at first glance at the re-implemented results, it works great on small batch-size indeed.

rapist1 t1_j5xmv9n wrote on January 26, 2023 at 6:59 AM

#1,501,672

Replying to koolaidman123 (#1,488,308)

How do you implement the caching? You have to cache all the activations to do the backward pass.

kdqg t1_j5xzfx4 wrote on January 26, 2023 at 9:47 AM

#1,502,725

VICReg

youngintegrator t1_j61dfqk wrote on January 27, 2023 at 12:39 AM

#1,518,340

Is there any reason you'd like a contrastive algorithm? (intra-class discrimination?)

Barlow Twins was shown to work quite well with small batches (32), and HSIC-SSL is a nice variant on this style of learning if you only care about clusters. I'm sure SimSiam is fine too (avoid BYOL for small batches).

In terms of contrastive approaches, methods that avoid the "coupling" described in DCL for the negative terms will work with smaller batch sizes (contrastive estimates converge to the MLE assuming a large number of noise samples). This is seen in the spectral algorithm or in align-uniform. These work because they avoid comparing the representations from the same augmented samples. SwAV also does this by contrasting against prototypes, which are basically free variables whose gradients don't conflict with any alignment goal. I think it's fair to say that algorithms with LSE (log-sum-exp) transforms are less stable for small batch sizes, since the gradients will be biased toward randomly coupled terms. With sufficiently many terms this coupling matters less.

From what I've noticed, methods that avoid comparing the augmented views of the same base sample will require slightly more tuning to get things just right. (align + weight * diversity)
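One way to read "align + weight * diversity" is the alignment/uniformity objective from the align-uniform line of work. The sketch below is a simplified plain-Python version under my own assumptions: the uniformity term is computed on one view only, `t=2.0` and `weight=1.0` are illustrative defaults, and the embeddings are toy unit vectors.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_loss(xs, ys):
    """Mean squared distance between the two views of each sample (positives only)."""
    return sum(sq_dist(x, y) for x, y in zip(xs, ys)) / len(xs)

def uniform_loss(xs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of different samples."""
    pairs = list(combinations(xs, 2))
    return math.log(sum(math.exp(-t * sq_dist(u, v)) for u, v in pairs) / len(pairs))

def total_loss(xs, ys, weight=1.0):
    # The weight is the knob that tends to need tuning.
    return align_loss(xs, ys) + weight * uniform_loss(xs)

xs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ys = [[0.8, 0.6], [-0.6, 0.8], [-0.8, -0.6]]
print(align_loss(xs, ys), uniform_loss(xs), total_loss(xs, ys, weight=0.5))
```

Note that the positive pair never appears in the diversity term and the negatives never appear in the alignment term, which is the decoupling the comment above describes.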

Notes: NNCLR is nicer than MoCo imo. VICReg is good but is a mess to finetune. I am assuming you're using a CNN and have omitted transformer and mask-based algorithms.