Submitted by redditnit21 t3_y5qn9h in deeplearning

I am training a ViT for image classification and I am getting a testing accuracy 2-4% higher than the training accuracy. Is there a problem with my model, and if so, how can I prevent this from happening? And why shouldn't testing accuracy be higher than training accuracy?

My dataset has 9 classes, and I have split it into train and test sets 80-20.

13

Comments

danielgafni t1_isl8voi wrote

Are you using dropout or other regularization that affects training but not testing? If so, you've got your answer.

14

DeskFan9 t1_isl9qfe wrote

More info would help answer your question.

How big is your dataset? Small data sets could have this issue.

What accuracy are you getting?

Are the classes balanced overall, in the training set, and in the test set? It could be that the network is biased towards certain classes and the test set happens to have more of those classes.

3

znihilist t1_islbm58 wrote

Almost always this is an issue of sampling. Make sure everything is well represented everywhere.

> And why testing accuracy shouldn’t be higher than training?

There is no law that says this shouldn't happen, but in 99.99% of cases it is a sampling issue. However, sometimes when doing off-time testing this issue can crop up, and it doesn't necessarily mean your model is flawed (in this specific context).

I had an issue with a model we were working on: we needed to prove that the model worked for different time periods, so we removed the last two months of data from training and left them for validation. It turned out that in those last months a specific subset of the data was overrepresented compared to the previous months, and it was the "good" data.

4

pornthrowaway42069l t1_isldxay wrote

One reason, besides those mentioned in other comments, is that sometimes the test set is just easier than the train set. Not saying that's your issue, but it might be worth trying different splits.

10

DrXaos t1_ismok0x wrote

In this case the train and test sets probably weren't split stratified by class, there's an imbalance in the relative proportions of the classes, and there is some bias in the predictor.

And it's probably measuring top-1 accuracy, which isn't the loss function being optimized directly.

  1. do a stratified test/train split
  2. measure more statistics on train vs. test (see the sketch below)
  3. check dropout or other regularization differences
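
For point 2, a minimal sketch of one such check, assuming the labels live in pandas DataFrames named `train_df`/`test_df` with a `Class` column (the names follow the OP's snippet further down in the thread):

```python
import pandas as pd

# Illustrative check: compare per-class proportions in the two splits.
# `train_df` and `test_df` are assumed to already exist (see the OP's split below).
train_dist = train_df["Class"].value_counts(normalize=True)
test_dist = test_df["Class"].value_counts(normalize=True)

dist = pd.DataFrame({"train": train_dist, "test": test_dist}).sort_index()
dist["diff"] = dist["train"] - dist["test"]
print(dist)  # large per-class differences suggest the split is not well stratified
```
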
2

redditnit21 OP t1_isn2joq wrote

I am using a stratified test/train split:

```python
from sklearn import model_selection

train_df, test_df = model_selection.train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['Class']
)
```

All the classes are equally proportioned except one class. I am using a dropout layer in the model for training. Is the dropout layer creating this issue?

1

redditnit21 OP t1_isn2qbq wrote

There are 9 classes in the dataset, labelled 1-9. There are 7916 images in training and 1979 in testing. All the classes are equally proportioned except one class, which has more images.

Training accuracy is around 96% and testing accuracy is around 98%.

2

DrXaos t1_isn3k9e wrote

It certainly could be dropout. Dropout is on during training (stochastically perturbing activations, in its usual form in most packages) and off during test.

Take out dropout, use other regularization, and report directly on your optimized loss function for both train and test: often the NLL, if you're using a conventional softmax + CE loss function, which is the most common for multinomial outcomes.
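
As a hedged sketch, assuming a compiled tf.keras model with `metrics=['accuracy']` and datasets named `train_ds`/`test_ds` (framework and names are assumptions; in PyTorch the equivalent is calling `model.eval()` before the evaluation pass):

```python
# Keras disables dropout outside of training, so both numbers below use the
# final weights with dropout off, and report the optimized CE/NLL loss directly.
train_loss, train_acc = model.evaluate(train_ds, verbose=0)
test_loss, test_acc = model.evaluate(test_ds, verbose=0)

print(f"train: loss={train_loss:.4f}  acc={train_acc:.4f}")
print(f"test:  loss={test_loss:.4f}  acc={test_acc:.4f}")
```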

3

redditnit21 OP t1_isn465e wrote

Yeah, I am using the conventional softmax + CE loss function, which is the most common for multinomial outcomes. Which regularization method would you suggest, and what's the main reason test accuracy should be lower than train accuracy?

1

DrXaos t1_isn4fs5 wrote

Top-1 accuracy is a noisy measurement, particularly because it's a binary 0/1 measurement per example.

A continuous performance statistic is more likely to show the expected behavior of train performance being better than test. Note that for loss functions, lower is better.

There's lots of regularization possible, but start with L2, weight decay, and/or limiting the size of your network.
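
For illustration only, a Keras-style sketch of those two options (the layer is a made-up classification head; `AdamW` with a `weight_decay` argument is available in recent TensorFlow releases, older ones have it in tensorflow_addons):

```python
import tensorflow as tf

# L2 penalty on the weights of a (hypothetical) 9-class classification head.
head = tf.keras.layers.Dense(
    9,
    activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)

# Decoupled weight decay via the optimizer.
optimizer = tf.keras.optimizers.AdamW(learning_rate=3e-4, weight_decay=1e-4)
```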

1

Dmytro_P t1_isnljcv wrote

It depends on how large and diverse your dataset is, but in most cases you should expect train accuracy to be higher than test; with a smaller or less diverse dataset you'd see an even larger difference between the train and test sets.

You can also try multiple folds, e.g., train the model 5 times with a different test set each time, to check whether the test set you selected accidentally contains simpler samples (a stratified k-fold sketch follows below).
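
A hedged sketch with scikit-learn's `StratifiedKFold`, reusing the `df`/`Class` names from the OP's snippet; `train_and_evaluate` is a hypothetical placeholder for the existing training code:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(df, df["Class"])):
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    # Placeholder: plug in the existing ViT training + evaluation here.
    train_acc, test_acc = train_and_evaluate(train_df, test_df)
    print(f"fold {fold}: train_acc={train_acc:.3f}  test_acc={test_acc:.3f}")

# If test accuracy beats train accuracy on every fold, a "lucky" split is unlikely.
```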

2

_Arsenie_Boca_ t1_isnyhtb wrote

You should test whether this happens only during training or also when evaluating on the train set afterwards. As others have mentioned, dropout could be a factor. But you should also consider that the train accuracy is calculated during the training process, while the model is still learning, i.e., the final weights are not reflected in the averaged train accuracy.
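
A minimal sketch of that check, assuming a Keras setup compiled with `metrics=['accuracy']` (dataset names are placeholders):

```python
# The accuracy Keras logs during an epoch is averaged over batches while the
# weights are still changing and dropout is active.
history = model.fit(train_ds, validation_data=test_ds, epochs=20)
running_train_acc = history.history["accuracy"][-1]

# A clean pass afterwards uses the final weights with dropout off.
_, final_train_acc = model.evaluate(train_ds, verbose=0)
print(f"logged during training: {running_train_acc:.4f}, re-evaluated: {final_train_acc:.4f}")
```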

1

BrotherAmazing t1_iso8emo wrote

It’s what other people have already said: This is extremely common to see if you’re using dropout. There’s nothing necessarily wrong here either, and this network might outperform a network (on test data—the data we care about!) that is trained without dropout and gets higher training accuracy.

Here is how you can prove it to yourself:

  1. You can keep dropout activated during test time as an experiment and see that the test accuracy, when dropout remains on, does indeed decrease to be below the training accuracy (see the sketch after this list).

  2. You can keep everything else fixed and just parametrically dial down the dropout percentage in each dropout layer. Usually 0.5 (50%) is the default, but you'll see, for a fixed training/test split, that as that parameter goes from 0.5 -> 0.25 -> 0.1 -> 0.05 -> 0, the training accuracy increases back to be at or above the test accuracy.
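
For experiment 1, a minimal sketch assuming a Keras model and a `test_ds` of (image, integer-label) batches; calling the model with `training=True` keeps dropout active during the forward pass:

```python
import tensorflow as tf

# Measure test accuracy with dropout deliberately left on.
acc = tf.keras.metrics.SparseCategoricalAccuracy()
for x_batch, y_batch in test_ds:
    probs = model(x_batch, training=True)  # training=True keeps dropout active
    acc.update_state(y_batch, probs)
print("test accuracy with dropout on:", float(acc.result()))
```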

You can also rule out the possibility that you had a rare split with an easy test set and a hard training set by splitting randomly over and over and seeing that this phenomenon is not rare but the norm across nearly all splits. However, if 1 and 2 above behave consistently with dropout being the reason, I see this last exercise as a waste of time, unless you just want to win an argument against someone who insists it is due to a "bad" split. If they really insist on that rather than just proposing it as a possible reason, they don't have much real-world experience using dropout! This is very common, nothing wrong, a telltale sign of dropout.

1

BrotherAmazing t1_iso8zyl wrote

If you have a different problem where this happens without dropout, then you may indeed want to make sure the training/test split isn't a "bad" one and do k-fold validation.

The other thing to check would be other regularizers you may be using during training but not test that make it harder for the network to do well on the training set; e.g., you can dial down data augmentation if you are using it, and so on.

Things people have touched upon already for the most part, but this is very common to see when using dropout layers.

1

redditnit21 OP t1_isoa0lc wrote

I commented from a different account by mistake. After looking at everyone's comments, I tried without dropout and the same thing is happening. I am not using any data augmentation except the rescaling (1/.255).

1