Submitted by thanderrine t3_zntz2d in MachineLearning

So I was wondering, basically the title.

If my CNN model is trained to classify images into cat and dog, and I show it an image of a horse, the model has to answer either dog or cat, but the confidence of this answer (the softmax output for the horse image) should be low.

But I have found that the models are mostly quite cocky, giving high confidence that it is indeed a dog. Or a cat.

So is there a better way? Is there a technique or method or algorithm that gives accurate confidence on a classification?

53

Comments

ttt05 t1_j0jbubh wrote

What you are describing is exactly the problem of OOD detection in image classification. The canonical reference for that would be Dan Hendrycks et al. (2019). In short, no: softmax probability might not be the best indicator of confidence, as deep networks are often overconfident on wrong outputs.

29

3nilBarca t1_j0jtiw0 wrote

Could you please share the title of the paper you’re referring to since Hendrycks published multiple highly cited papers in 2019?

6

madhatter09 t1_j0jtxaq wrote

There are several papers on this idea - the best one is probably On Calibration of Modern Neural Networks by Guo et al. The gist is that you want your softmax output to match the probability of your prediction being correct. For an architecture like yours, they do this through something called temperature scaling. Why this works is a more involved topic, but you can get a better handle on it by looking at the consequences of training with cross entropy on hard labels (1s and 0s) vs. soft labels (not strictly 1 or 0).
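
If it helps, here is a minimal sketch of temperature scaling in PyTorch, assuming you have already collected logits and labels on a held-out validation set (all names here are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=100):
    """Learn a single scalar T on held-out logits so that
    softmax(logits / T) is better calibrated (Guo et al.)."""
    log_T = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_T.exp().item()

# usage: fit T on the validation set, then rescale logits at test time
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```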

I think going into OOD detection, as the others suggest, would then be more fruitful. The whole deal with distribution shift, and then the extremes of OOD, gets very murky and detached from what happens in practice, but ultimately the goal is to be able to know that a mismatch between input and model is happening, rather than just seeing low confidence.

24

vwings t1_j0l4gq9 wrote

Great question and comment! I think the first thing to say here is that CNNs are usually overconfident.

One thing the original post is looking for is calibration of the classifier on a calibration set: on that set, the softmax values can be re-adjusted so that they come closer to actual probabilities. This is essentially what Conformal Prediction and Platt Scaling do.
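
As a rough sketch of the Platt-scaling side of that (binary case, scikit-learn; the variable names are just illustrative):

```python
from sklearn.linear_model import LogisticRegression

def fit_platt(val_scores, val_labels):
    """Platt scaling: fit sigmoid(a * score + b) on a held-out calibration set
    to map raw decision scores to probabilities (binary case)."""
    platt = LogisticRegression()
    platt.fit(val_scores.reshape(-1, 1), val_labels)  # val_scores: 1-D NumPy array
    return platt

# usage:
# platt = fit_platt(val_scores, val_labels)
# calibrated = platt.predict_proba(test_scores.reshape(-1, 1))[:, 1]
```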

I strongly recommend this year's talk on Conformal Prediction which provides insights into these problems. Will try to find the link...

1

Extra_Intro_Version t1_j0ldo9z wrote

I have a classification model that uses conformal prediction. This has been helpful in working towards building out a high confidence dataset.

2

bremen79 t1_j0jgr6l wrote

The only approach that gives valid uncertainty quantification is conformal prediction; a quick Google search should turn up a good number of tutorials.

7

vwings t1_j0l4q82 wrote

This is not true... There are many other works, e.g. Platt Scaling, that also provide calibrated classifiers (I suppose this is what you call "valid"). But conformal prediction indeed tackles this problem...

4

bremen79 t1_j0n8wsn wrote

Platt scaling does not have any guarantee, and in fact it is easy to construct examples where it fails. On the other hand, on the multiclass problem of the question, conformal prediction methods would, under very weak assumptions, give you a set of labels that is guaranteed to contain the true label with a specified probability.
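
For illustration, a split-conformal sketch for the multiclass case, assuming you have softmax scores and a held-out calibration set (names are illustrative):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Split conformal prediction: nonconformity = 1 - softmax score of the
    true class on a held-out calibration set. The returned threshold yields
    prediction sets that contain the true label with probability >= 1 - alpha,
    assuming calibration and test data are exchangeable."""
    n = len(cal_labels)
    nonconformity = 1.0 - cal_probs[np.arange(n), cal_labels]
    # finite-sample corrected quantile level, clipped to 1
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(nonconformity, level, method="higher")

def prediction_set(test_probs, threshold):
    """For each test example, return all classes whose score clears the threshold."""
    return [np.where(1.0 - p <= threshold)[0] for p in test_probs]
```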

3

vwings t1_j0osaa7 wrote

Now you are completely deviating from the original scope of the discussion. We were discussing which approach is more general, and - since you changed scope - you seem to agree with me on that point.

About "guarantees": also for CP, it is easy to construct examples where it fails. If the distribution of the new data is different from the calibration set, it's not exchangeable anymore, and the guarantee is gone.

1

Extra_Intro_Version t1_j0lduaf wrote

IIRC, CP extends Platt.

2

vwings t1_j0ljgfr wrote

That's what the CP guys say. :)

I would even say that Platt generalizes CP. Whereas CP focuses on the empirical distribution of the prediction scores only around a particular location in the tails, e.g. at a 5% confidence level, Platt scaling tries to mold the whole empirical distribution into calibrated probabilities -- thus Platt considers the whole range of the distribution of scores.

4

TheCockatoo t1_j0j4raw wrote

To answer your title question, no. You may wanna look up Bayesian deep learning!

6

visarga t1_j0k46hz wrote

It's a hard problem, nobody has a definitive solution. From my lectures and experience:

  • interval calibration (easy)

  • temperature scaling (easy)

  • ensembling (expensive)

  • Monte Carlo dropout (not great)

  • using prior networks or auxiliary networks (for OOD detection)

  • error correcting output codes (ECOC)

  • conformal prediction (slightly different task than confidence estimation)

Here's my Zeta-Alpha confidence estimation paper feed.

6

vwings t1_j0l5778 wrote

A hard problem indeed. The methods in your list use different settings. Deep Ensembles and MC Dropout don't require a calibration set. Prior networks (I love this paper) assume that OOD samples are available during training. Conformal prediction assumes the availability of a calibration set that follows the distribution of future data... For the other methods, I would have to check...

2

zeyus t1_j0kfuu5 wrote

Quick question. Wouldn't a simple solution be to include a 'neither'/'other' output class?

Even if a network is supposed to classify an image as a dog or a cat, in reality a lot of use cases actually want a multi-class prediction rather than a binary one, because a picture of a monkey should be neither a dog nor a cat. Just on a hunch, I would guess the performance goes down significantly, and it obviously requires more training data.

1

trajo123 t1_j0kullx wrote

Yes, that's an option, but you have absolutely no guarantees about its ability to produce anything meaningful. What images do you introduce in the "other" class? There are infinitely more images falling in the other category than there are cat-or-dog images. For any training set you come up with for the "other" class, the model can still be tested with an image totally different from your training set, and the model output will have no reason whatsoever to favour "other" for the new image.

4

visarga t1_j0m7kn4 wrote

I can confirm this. I did NER, and most tokens are not named entities, so they are "other". It's really hard to define what "other" means; even with lots of text the model is unsure. No matter how much "other" I provided, I couldn't train a negative class properly.

2

zeyus t1_j0kxp2n wrote

True, the thought did occur to me, but I figured you could train the other category with a diverse set of animals and also people, nature, cars, landscapes etc. While there is a vastly larger set of "non-dog" or "non-cat" images, it must be possible to pick up features that absolutely don't indicate a dog or cat... I don't think it's the most effective method, perhaps... though it would be interesting to give it a go; maybe after my exams I'll try...

I can't shake the feeling that it might somehow be informative for the classification layer, either by reducing the confidence of the other categories or by weighting them somehow.

1

trajo123 t1_j0l5hwj wrote

You will get some results, for sure. Depending on your application, that may even be good enough. But as a general probability that an image is something other than a cat or a dog, not really.

As other commenters have mentioned, the general problem is known as OOD (out-of-distribution) sample detection. There are deep learning models which model probabilities explicitly and can in principle be used for OOD sample detection: Variational Autoencoders. The original formulation of this model performs poorly in practice at OOD sample detection, but there is work addressing some shortcomings, for instance Detecting Out-of-distribution Samples via Variational Auto-encoder with Reliable Uncertainty Estimation. But with VAEs things get very mathematical, very fast.

Coming back to your initial question: no, softmax is not appropriate for "confidence", but this is an open problem in Deep Learning.

1

visarga t1_j0m7wj7 wrote

How about a bronze statue of a dog, a caricature of a cat, a fantasy image that is hard to classify, etc? Are they "other"?

1

HateRedditCantQuitit t1_j0jgtkj wrote

The root of your problem is that you are stuck assigning 100% confidence to the prediction that your horse is a cat or a dog. It’s perfectly rational that you might have p(cat)=1e-5 and p(dog)=1e-7 for a horse picture, right?

So when you normalize those (1e-5 / (1e-5 + 1e-7) ≈ 0.99), you get basically p(cat)=1.

Try binary classification of cat vs not cat and dog vs not dog. Don’t make them sum to one.

5

harponen t1_j0kg9uf wrote

The hacky solutions don't really work... the fact is that if you're going to show the model ones and zeros, the model will learn to predict ones and zeros. If the labels, however, were probabilities, then yes, the output would be a good measure of confidence.

3

ChuckSeven t1_j0u37tj wrote

But what is confidence really? It's a measure of how likely an outcome is given a specific model. The idea of confidence is completely broken if you are not certain about your model. E.g., if you think that your error is normally distributed with a certain variance, you can make statements about whether a deviation from the expected value is noise or not. But this assumes that your normal-distribution assumption is correct! If you cannot be certain about the model, which you never really are if you use neural networks, then the confidence is measured against your own implicit model. And since NNs are very different from your own brain, the models used in both cases likely compute different functions, and the NN is not trained to predict confidence (from a human perspective), there is no meaningful way of talking about confidence.

1

zimonitrome t1_j13698p wrote

Softmax is not good. I would look into conformal prediction. It's a beautiful solution to this problem but it requires extra data. Worth it imo.

1

Cherubin0 t1_j0lflpn wrote

Don't you need a Bayesian approach for that?

0

seba07 t1_j0lg760 wrote

In my experience, the loss function also plays an important part. Cross entropy forces the model to be very certain about a decision; focal loss can produce a smoother output distribution.
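
For reference, a minimal focal loss sketch in PyTorch (illustrative, not a drop-in from any particular library; gamma is the usual knob):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss for multiclass classification: down-weights easy,
    high-confidence examples, so the model is pushed less hard toward
    saturated, one-hot-like outputs than with plain cross entropy."""
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, reduction="none")  # per-sample cross entropy
    pt = torch.exp(-ce)                                     # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```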

0

moist_buckets t1_j0m9hnb wrote

You should use a Bayesian neural network. Monte Carlo dropout is an easy place to start; then you can get a measure of the uncertainty of your predictions.
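
As a rough sketch, MC dropout at inference time in PyTorch, assuming the model already contains dropout layers (everything else here is illustrative):

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Keep dropout active at test time, run several stochastic forward
    passes, and use the spread of the predictions as an uncertainty signal."""
    model.eval()
    # re-enable only the dropout layers
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread
```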

0

Pickypalace t1_j0nz41i wrote

What do you mean by confidence? Is that a new metric or smth

0

sanjuromack t1_j0l3kx3 wrote

Softmax class probabilities always have to sum to 100%, so unless you create an “other” class, this will continue to be an issue.

Just replace the activation on your output layer with sigmoid activations, which will convert your model to a multilabel output.
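
Roughly, the change is just the output activation and the loss; a sketch in PyTorch with placeholder layer sizes:

```python
import torch
import torch.nn as nn

# one independent "is it a cat?" / "is it a dog?" logit per label,
# trained with binary cross entropy instead of softmax + cross entropy
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 128),  # placeholder backbone, not a real CNN
    nn.ReLU(),
    nn.Linear(128, 2),            # raw logits, one per label
)
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

# at inference each label gets its own probability; they need not sum to 1,
# so a horse image can score low on *both* outputs:
# probs = torch.sigmoid(model(images))
```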

−1

sanjuromack t1_j0pvx5v wrote

Not sure why I got downvoted. This is a very common problem in industry, and often how I have my team handle it.

1

suflaj t1_j0jbo92 wrote

Depends on what you mean by confidence. With softmax, you model probability. You can train your network to give out near 100% probabilities per class, but this tells you nothing about how confident it is.

Instead, what you could do is the following:

  • get a prediction
  • define the target of your prediction as the resolved label
  • calculate loss on this
  • now define your target as the inverse of the initial one
  • calculate the loss again
  • divide the 2nd one by the first one

Voila, you've got a confidence score for your sample. However, this only gives you a number that is comparable across samples; it will not give you a percentage of confidence. You do know, however, that the higher the confidence score, the closer the confidence is to 100%, and the smaller the score, the closer the confidence is to 0%. Based on the ranges of your loss, you can probably figure out how to map it to whatever range you want.

For a multi-class problem, you could just sum, over all classes other than your predicted class, the ratio of that class's loss to the loss of your prediction. So, if you had a classifier that classifies an image into dog, cat, and fish, and your softmax layer spits out 0.9, 0.09, 0.01, your confidence score would be loss(cat)/loss(dog) + loss(fish)/loss(dog).
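
A rough sketch of what that loss-ratio score could look like with cross entropy (illustrative, not a standard metric):

```python
import torch
import torch.nn.functional as F

def loss_ratio_confidence(logits):
    """Confidence score described above: for every class other than the
    predicted one, divide its cross-entropy loss by the loss of the predicted
    class, then sum the ratios. Only meaningful for comparing samples."""
    n, num_classes = logits.shape
    pred = logits.argmax(dim=1)
    # per-sample loss if each class in turn were treated as the target
    losses = torch.stack([
        F.cross_entropy(logits, torch.full((n,), c, dtype=torch.long), reduction="none")
        for c in range(num_classes)
    ], dim=1)                                        # shape (n, num_classes)
    pred_loss = losses.gather(1, pred.unsqueeze(1))  # loss of the predicted class
    ratios = losses / pred_loss
    # drop the predicted class itself (its ratio is always exactly 1)
    mask = torch.ones_like(ratios).scatter_(1, pred.unsqueeze(1), 0.0)
    return (ratios * mask).sum(dim=1)                # one score per sample
```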

−4