Submitted by arcxtriy t3_zy5ddz in MachineLearning

Hello everyone,

I have spent some time trying to figure out how to calibrate my multi-class prediction model, which predicts K values between 0 and 1 for K classes (which haven't been softmaxed).

As far as I understand, I can train a model and calibrate it post-training, i.e. training and calibration are completely independent. Is that right? If yes, I'm wondering what the current SOTA is for calibrating my model. It seems like there is no up-to-date resource, and I am too new to the field to find the "best" method.

Thanks in advance!

14

Comments


Zealousideal_Low1287 t1_j240c9r wrote

I would just try isotonic regression or Platt scaling.

4

arcxtriy OP t1_j2451e2 wrote

But how would that work?

Assume I have predictions for one sample [0.01, 0.03, 0.2, 0.8, 0.04] and for another one [0.3, 0.2, 0.1, 0.1, 0.05].
Do you suggest learning the Platt scaling across samples or across classes?

1

Zealousideal_Low1287 t1_j24at04 wrote

Classes

2

arcxtriy OP t1_j24b90c wrote

But then it is not guaranteed that the probabilities for a sample sum up to 1. That seems strange, right?!

1

HateRedditCantQuitit t1_j24cg3y wrote

If you train a model on a dataset of dogs and cats, then show it a picture of a horse, do you want p(dog)+p(cat) = 1?

3

arcxtriy OP t1_j24yx4p wrote

If p(dog)=p(cat)=0.5 then it's fine, because it tells me the classifier is uncertain. Isn't it?

1

HateRedditCantQuitit t1_j24zv0q wrote

You’re still implicitly saying that you’re 100% certain that it’s either a cat or a dog, which is wrong. If a horse picture has p(cat)=1e-5 and p(dog) = 1e-7, that should also be fine, right? And if you normalize those such that p(cat) + p(dog) = 1, you end up with basically p(cat)=1. Testing for (approximately) p(cat) = p(dog) when it can be neither is a messy way to go about doing calibration.

It’s just a long way of saying that having the probabilities not sum to one is fine.

4

ObjectManagerManager t1_j27i9n5 wrote

Actually, you're completely right. SOTA in open set recognition is still max logit / max softmax, which is to say that the maximum softmax probability is a useful measure of certainty.

1

PK_thundr t1_j24int4 wrote

There are a couple of approaches you can try

  • Temperature scaling https://arxiv.org/abs/1706.04599
  • One-vs-rest classification per class. Take the one-vs-rest probability for each class and then normalize. You can choose whichever calibration method you'd like here (isotonic, Platt, beta calibration); see the sketch below.
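A rough sketch of the one-vs-rest option using scikit-learn's isotonic regression; the variable names and the held-out calibration split are assumptions, not anything from the original post:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Assumed inputs:
# cal_scores:  (n, K) uncalibrated per-class scores on a held-out calibration set
# cal_labels:  (n,)   integer class labels for that set
# test_scores: (m, K) scores you want to calibrate

def fit_one_vs_rest_isotonic(cal_scores, cal_labels):
    """Fit one isotonic regressor per class (class k vs. rest)."""
    K = cal_scores.shape[1]
    calibrators = []
    for k in range(K):
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(cal_scores[:, k], (cal_labels == k).astype(float))
        calibrators.append(iso)
    return calibrators

def calibrate(test_scores, calibrators):
    """Apply the per-class calibrators, then renormalize each row to sum to 1."""
    cols = [c.predict(test_scores[:, k]) for k, c in enumerate(calibrators)]
    probs = np.stack(cols, axis=1)
    return probs / (probs.sum(axis=1, keepdims=True) + 1e-12)
```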
3

ObjectManagerManager t1_j27h8xa wrote

Platt ("temperature") scaling works well, and it's very simple. Yes, you do it post-training, usually on a held-out calibration set. Some people will then retrain on all of the data and reuse the learned temperature, but that doesn't always work out as well as you want it to.

FTR, "multiclass classification" means each instance belongs to exactly one of many classes. When each label can be 0 / 1 irrespective of the other labels, it's referred to as "multilabel classification".

3

Bot-69912020 t1_j24hkd7 wrote

It might be more transparent to split your approach into two steps. First, we try to get a valid probability vector for each prediction (i.e. the vector sums up to 1). Second, we recalibrate the probabilities in each vector to improve the correctness of the predicted probabilities.

For the first point, it is important to know the range of your invalid outputs: if they can be negative as well as positive, you might want to transform your whole output via the softmax function. If you only have positive values v1, ..., vm, but the sum of the vector is not 1, then it is sufficient to compute vi / (v1 + ... + vm) to get valid probabilities.
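A small sketch of those two cases (the score vector is just an illustrative example):

```python
import numpy as np
from scipy.special import softmax

scores = np.array([0.3, 0.2, 0.1, 0.1, 0.05])  # example raw model outputs

# Case 1: outputs can be negative as well as positive -> push through softmax
probs_softmax = softmax(scores)

# Case 2: outputs are all positive but don't sum to 1 -> divide by the sum
probs_normalized = scores / scores.sum()
```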

Now, we can try to improve the predicted probabilities via post-hoc recalibration. Several methods have been proposed for this, but the simplest baseline, which works surprisingly well in most cases, is temperature scaling. Start with that and try to make it work; it almost always gives at least minor improvements in ECE and NLL (don't use ECE alone, it is unreliable; see Fig. 2). Once TS works, you can still try out ensemble temperature scaling, parametrized temperature scaling, intra order-preserving scaling, splines, ...
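For reference, a minimal sketch of the standard binned ECE, assuming you already have calibrated probability vectors and integer labels (all names are placeholders):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: average |accuracy - confidence| per bin, weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```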

Some of these methods (including temperature scaling) take logits as inputs, and their outputs are logits again. So, to obtain logits, apply the multivariate logit function if you already have probabilities, or simply use your untransformed outputs as logits if you would have applied softmax in the first step.
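A tiny sketch of that input convention (names assumed; the clipping only avoids log(0), and the additive constant doesn't matter since softmax is shift-invariant):

```python
import numpy as np

def to_logits(probs, eps=1e-12):
    """Multivariate logit: log-probabilities serve as logits up to an additive constant."""
    return np.log(np.clip(probs, eps, 1.0))

# If the model output was never softmaxed, pass its raw scores straight through as logits.
```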

2