Submitted by Dartagnjan t3_zqtmf7 in MachineLearning

I have a rather successful model which I have trained to the point that the loss has now plateaued. The loss over my training dataset follows a power-law-type curve:

https://preview.redd.it/qotu2k09237a1.png?width=825&format=png&auto=webp&s=b16ca887ce8e259f8de4a20609e35ff7f7298df9

That means 80% of the training examples have a loss well below my tolerance threshold, 15% have a loss slightly above the threshold, 4% have a loss significantly above it, and 1% have a very high loss.

This results from the inherent complexity of the training examples themselves: some are simple, some are complex. I was wondering whether there are any techniques for continuing to optimize a model in this situation. It is surely a common scenario, so I expected someone to have come up with strategies or algorithms for it, but my Google-fu has failed me. Please refer me to literature on the topic if it exists.

So far I have tried pre-selecting and training on the hard examples only, and I have tried multiplying the loss gradients by a scalar that depends on the loss itself. Neither of these approaches has given me satisfactory results.
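For the second approach, what I did looks roughly like the sketch below (not my exact code; the weighting function and the exponent are placeholders for my hand-crafted scaling):

```python
import torch

def reweighted_loss(pred, target, alpha=2.0):
    # Per-sample loss, scaled by a factor computed from its own detached loss,
    # so that high-loss (hard) examples contribute larger gradients.
    per_sample = torch.nn.functional.mse_loss(pred, target, reduction="none")
    per_sample = per_sample.view(per_sample.shape[0], -1).mean(dim=1)
    weights = (1.0 + per_sample.detach()) ** alpha  # no gradient flows through the weight
    return (weights * per_sample).mean()
```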

Maybe it is just that the model is not complex enough. But I am already maxing out my GPU RAM (Nvidia A100s), so I cannot scale it up much further. Still, I am not sure I have yet reached the limits of complexity with this model.

107

Comments


dumbmachines t1_j0ztjdq wrote

Have you tried something like this?

You're not able to overfit on the hard examples alone? Why not?

20

Dartagnjan OP t1_j103ef6 wrote

  1. I have already tried my own version of selective backprop (a sketch of what I mean is below), but thanks for the link; this is exactly what I was looking for. I want to know how other people implement it and whether I did something wrong.
  2. Overfitting on the hard examples alone is a test I have carried out multiple times, but not yet on the latest experiments, so thanks for the reminder. If I cannot overfit, I can infer that the model's capacity is definitely too low. But even if I can overfit on the hard examples, that still does not mean the model can handle the easy and hard examples at the same time.
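Roughly, my version is along these lines (a simplified sketch: the selective backprop paper selects samples probabilistically by loss percentile, whereas this just keeps everything above a hard cutoff, and the quantile is a placeholder):

```python
import torch

def selective_backprop_step(model, batch, optimizer, keep_quantile=0.8):
    # Compute per-sample losses for the whole batch, then backpropagate
    # only through the samples whose loss is above the chosen quantile.
    x, y = batch
    pred = model(x)
    per_sample = torch.nn.functional.mse_loss(pred, y, reduction="none")
    per_sample = per_sample.view(per_sample.shape[0], -1).mean(dim=1)

    cutoff = torch.quantile(per_sample.detach(), keep_quantile)
    mask = per_sample.detach() >= cutoff      # hardest ~20% of the batch
    loss = per_sample[mask].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```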
15

-Rizhiy- t1_j10nstz wrote

Can you collect more data similar to hard examples?

People like to focus on the architecture or training techniques, but most real problems can be solved by collecting more relevant data.

If the loss remains high even after getting more data, two potential problems come to mind:

  • There is not enough information in your data to correctly predict the target.
  • Your model is not complex/expressive enough to properly estimate the target.
13

carbocation t1_j103ehe wrote

Have you tried focal loss? If I’m reading you correctly, it’s appropriate for this type of question, although if the hard samples are distributed evenly across classes it is probably not actually going to help. I don’t think you mention what type of problem you’re solving (classification, regression, segmentation, etc.), so it’s hard to guess.
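For reference, a minimal sketch of binary focal loss (Lin et al., 2017); `gamma` and `alpha` are the usual tunable knobs, with the commonly used defaults assumed here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma factor,
    # so the well-classified majority contributes little to the gradient.
    # `targets` are 0/1 floats of the same shape as `logits`.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```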

11

Dartagnjan OP t1_j105tp0 wrote

It's a regression problem, but I already tried something similar. I scaled the loss according to how hard each example is, as judged by a hand-crafted heuristic, but I did not get good results with it.

4

trajo123 t1_j0zyd7i wrote

You are maxing out your GPU RAM even with a batch size of 1? If not, then you can set the batch size to 1 and set accumulate_grad_batches (or whatever that is called in your DL framework) to whatever you want your effective batch size to be. https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html
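In PyTorch Lightning that is just `Trainer(accumulate_grad_batches=32)`; a minimal manual PyTorch sketch of the same idea (the toy model, data, and effective batch size of 32 are made up for illustration):

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs; replace with your own model and dataloader.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(128)]  # micro-batches of 1

accum_steps = 32                                   # desired effective batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                                # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```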

Note that your loss will never be exactly 0 unless you run into numerical issues. However, your metric of interest, such as accuracy or F1 score, can still be perfect on the training set even if the loss is not 0. Can you get a perfect score on the training set? If not, then it seems that your model is not big/complex enough for your training data. Actually, this is a good sanity check for your model building and training: being able to get a perfect score on the training set.

Depending on the problem you can also look into focal loss, hard-example mining, etc. But not achieving a perfect score on the training set is not necessarily a bad thing. For instance, if you have mislabelled examples in your training set then you actually want the model to assign a high loss to those. Are you sure your high-loss training examples are labelled correctly?

7

Dartagnjan OP t1_j103e5a wrote

Yes, I already have batch_size=1. I am looking into sharding the model across multiple GPUs now. In my case, not being able to predict on the 1% of super hard examples means that those examples have features that the model has not learned to understand yet. The labeling is very close to perfect, with mathematically proven error bounds...

> focal loss, hard-example mining

I think these are exactly the keywords that I was missing in my search.

5

dumbmachines t1_j133fcs wrote

If focal loss is interesting, check out polyloss, which is a generalization of the focal loss idea.
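For reference, the Poly-1 variant from that paper is just cross-entropy plus an extra term in the true-class probability (a sketch; `epsilon` is the tunable coefficient):

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, labels, epsilon=1.0):
    # Poly-1 loss: cross-entropy plus epsilon * (1 - p_t), where p_t is the
    # predicted probability of the true class. `labels` are class indices.
    ce = F.cross_entropy(logits, labels, reduction="none")
    p_t = torch.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - p_t)).mean()
```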

2

techni_24 t1_j123prp wrote

Maybe this is the novice in me showing, but how does reducing the batch size to 1 affect model performance? I thought it only affected the speed of training.

1

trajo123 t1_j13gu3z wrote

Reducing the batch size to 1 frees memory, which can let you train a bigger model and thereby reach a lower loss on the training set. Note that accumulate_grad_batches takes on the role of batch_size when the latter is set to 1.

1

JustOneAvailableName t1_j107lf3 wrote

Perhaps something like keeping track of the harder data points and sampling half of each batch from them? What happened exactly when you trained on the hard examples only?
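For reference, a minimal sketch of that sampling scheme (the 50/50 split, the pool update rule, and a map-style dataset returning (x, y) tensor pairs are all assumptions):

```python
import random
import torch

def make_mixed_batch(dataset, hard_indices, batch_size=64):
    # Half of the batch comes from the tracked pool of hard examples
    # (a list of dataset indices); the other half is sampled uniformly.
    n_hard = min(batch_size // 2, len(hard_indices))
    idx = random.sample(hard_indices, n_hard)
    idx += random.sample(range(len(dataset)), batch_size - n_hard)
    xs, ys = zip(*(dataset[i] for i in idx))
    return torch.stack(xs), torch.stack(ys), idx

# After each step, refresh the pool with the samples whose loss is still high, e.g.:
# hard_indices = [i for i, l in zip(idx, per_sample_loss.tolist()) if l > threshold]
```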

2

Dartagnjan OP t1_j108k4y wrote

That is what I have already done. So far, the loss just oscillates but remains high, which leads me to believe that either I am not training in the right way (maybe the difference between the easy and hard training examples is too drastic to bridge), or my model is just not capable of handling the harder examples.

1

JustOneAvailableName t1_j1096lz wrote

Sounds like you need a larger batch size. What happens when you take a plateaued model and train it on the hard examples with a huge batch size?

2

FreddieM007 t1_j12qcmz wrote

Since your current model is perhaps not complex or expressive enough and VRAM is limited: have you tried building a classification model first that partitions the data into two classes? What is the quality there? Then you can build separate regression models for each class, each using all the VRAM.
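For reference, a minimal sketch of that two-stage setup (the easy/hard labels coming from thresholding the current model's per-example loss, the toy architectures, and the routing at inference time are all assumptions):

```python
import torch
from torch import nn

# Stage 1: a gating classifier predicts whether an input is "easy" (0) or "hard" (1).
gate = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

# Stage 2: two regressors, each trained only on its own partition, so each
# training run can spend the full memory budget on one part of the data.
easy_regressor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
hard_regressor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def predict(x):
    route = gate(x).argmax(dim=-1, keepdim=True)   # 0 = easy, 1 = hard
    return torch.where(route == 1, hard_regressor(x), easy_regressor(x))
```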

1

solresol t1_j10nsf2 wrote

Epistemic status: I don't know what I'm talking about, and I know I'm not fully coherent. Be kind in replies.

I *think* that your data might not have a finite mean and finite variance. If so, then there's no obvious "best" regression at all. As you get more data, optimality will change. A different random subsample of data will lead to wildly different results.

I have done some research on problems like this in linguistic data, and I was able to do dirty stuff by swapping out the underlying metric so that the notion of where "infinity" sits changed. But if you have real-valued data, I don't think this can help.

−4


solresol t1_j16ed0l wrote

As a real-world example that I encountered with a client that sells software to ecommerce stores... they wanted to know the number of products in a typical ecommerce store.

It turns out that there's a power law at work. If you sample N stores and count the number of products in all stores in total, you get XN products. Great! The mean is X.

But if you sample 2N stores, the number of products in total in all the stores is 4XN. That's because you have doubled your chances of finding a store that on its own has 2XN products, and the rest of the stores contribute the 2XN that you would have expected.

When you only sampled N stores, the average number of products per store was X. When you doubled the size of the sample, the average number of products was 2X.

Similar things happen to the variance.

As you increase the sample size, the average number of products keeps going up.

In a sane universe you would expect this to end eventually. This particular client is still growing, still analysing more stores, and they are always finding bigger and bigger stores (stores which on their own have more products than all other stores put together). Eventually they will have analysed every store in the world, and then they will be able to answer the question of "what's the average number of products in an ecommerce store that exists right now?"

But who knows? Maybe stores are being created algorithmically. It wouldn't surprise me. Certainly there will be more ecommerce stores in the future, so we probably can't answer "what's the average number of products in an ecommerce store over all time?" either.

Anyway, the punchline is, you can't sample this data to find out the mean nor can you find its variance.
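A quick simulation illustrates the point: for a Pareto distribution with tail exponent at or below 1, the sample mean keeps growing with the sample size instead of converging (a sketch; the exponent 0.9 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9                      # tail exponent <= 1: the distribution has no finite mean

for n in [1_000, 10_000, 100_000, 1_000_000]:
    sample = rng.pareto(alpha, size=n) + 1.0   # Pareto with minimum value 1
    print(n, sample.mean())                    # the running "mean" keeps climbing with n
```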

The original poster is finding that his residuals follow a power law. Depending on how steep the exponent is, it's possible that there is no well-defined mean for his residuals: as he collects more data, his mean will keep growing with the number of data points. If he is defining his loss function in terms of the mean of the residuals (or anything along those lines) then gradient descent is going to have some unresolvable[*] problems. If this is true, gradient descent will take his parameters on an exciting adventure through fractal saddles, where there is always a direction that reduces the loss function yet makes no improvement on the majority of his data.

This looks to me like what is happening to him.

[*] Unresolvable with the state of the art at the moment, AFAICT. I'm going to put this on my PhD research to-do list.

1