Submitted by arcxtriy t3_zy5ddz in MachineLearning

Hello everyone,

I have spent some time trying to figure out how to calibrate my multi-class prediction model, which outputs K values between 0 and 1 for K classes (the outputs have not been softmaxed).

As far as I understand, I can train a model and calibrate it post-training, i.e. training and calibration are completely independent. Is that right? If so, I'm wondering what the current SOTA is for calibrating my model. It seems like there is no up-to-date resource, and I am too new to the field to find the "best" method.

Thanks in advance!

14

Comments


arcxtriy OP t1_j2451e2 wrote

But how would that work?

Assume I have predictions for one sample [0.01, 0.03, 0.2, 0.8, 0.04] and for another one [0.3, 0.2, 0.1, 0.1, 0.05].
Do you suggest learning the Platt scaling across samples or across classes?

1

Bot-69912020 t1_j24hkd7 wrote

It might be more transparent to split your approach into two steps. First, we try to get a valid probability vector for each prediction (i.e. the vector sums to 1). Second, we try to recalibrate the probabilities in each vector to improve the correctness of the predicted probabilities.

For the first step, it is important to know the range of your invalid outputs: if they are negative as well as positive, you might want to transform your whole output via the softmax function. If you only have positive values v1, ..., vm, but the sum of the vector is not 1, then it is sufficient to compute vi / (v1 + ... + vm) to get valid probabilities.
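A minimal sketch of that first step in plain numpy (the function name is mine, not from any library):

```python
import numpy as np

def to_probability_vector(scores):
    """Turn one raw score vector into a valid probability vector."""
    scores = np.asarray(scores, dtype=float)
    if np.any(scores < 0):
        # Scores can be negative: exponentiate and normalize (softmax).
        exp = np.exp(scores - scores.max())  # shift for numerical stability
        return exp / exp.sum()
    # Only non-negative scores: divide by the sum.
    return scores / scores.sum()

print(to_probability_vector([0.01, 0.03, 0.2, 0.8, 0.04]))  # sums to 1
```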

Now we can try to improve the predicted probabilities via post-hoc recalibration. Several methods have been proposed for this, but the simplest baseline, which works surprisingly well in most cases, is temperature scaling. Start with that and try to make it work - it almost always gives at least minor improvements in ECE and NLL (don't use ECE alone, it is unreliable; see Fig. 2). Once TS works, you can still try out ensemble temperature scaling, parametrized temperature scaling, intra-order-preserving scaling, splines, ...

Some of these methods (including temperature scaling) take logits as inputs and their outputs are logits again. So, to obtain logits, apply the multivariate logit function if you already have probabilities, or simply use your untransformed outputs as logits if you would have applied softmax in the first step.
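Roughly, in code (just a sketch, not taken from any of the papers; here T is a fixed value, in practice you fit it on a held-out calibration set):

```python
import numpy as np

def temperature_scale(logits, T):
    """Temperature scaling: softmax(logits / T)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)

# If you only have probabilities, log(p) recovers logits up to an additive
# constant, which softmax ignores, so it is enough for temperature scaling.
probs = np.array([[0.01, 0.03, 0.2, 0.72, 0.04]])
logits = np.log(probs)
print(temperature_scale(logits, T=1.5))  # T > 1 softens, T < 1 sharpens
```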

2

PK_thundr t1_j24int4 wrote

There are a couple of approaches you can try:

  • Temperature scaling https://arxiv.org/abs/1706.04599
  • One-vs-rest calibration per class. Calibrate the one-vs-rest probability for each class and then normalize. You can use whichever binary method you'd like here (isotonic regression, Platt scaling, beta calibration); see the sketch after this list.
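A rough sketch of the second option with per-class isotonic regression (sklearn; the function names are mine):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_one_vs_rest_calibrators(scores_cal, labels_cal):
    """Fit one isotonic calibrator per class on a held-out calibration set.

    scores_cal: (n_samples, n_classes) array of uncalibrated scores in [0, 1]
    labels_cal: (n_samples,) array of integer class labels
    """
    calibrators = []
    for k in range(scores_cal.shape[1]):
        iso = IsotonicRegression(out_of_bounds="clip")
        # Calibrate class k's score against the binary "is class k" target.
        iso.fit(scores_cal[:, k], (labels_cal == k).astype(float))
        calibrators.append(iso)
    return calibrators

def calibrate(scores, calibrators):
    """Apply the per-class calibrators, then renormalize each row to sum to 1."""
    cal = np.column_stack([c.predict(scores[:, k])
                           for k, c in enumerate(calibrators)])
    cal = np.clip(cal, 1e-12, None)  # avoid all-zero rows before normalizing
    return cal / cal.sum(axis=1, keepdims=True)
```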
3

HateRedditCantQuitit t1_j24zv0q wrote

You’re still implicitly saying that you’re 100% certain that it’s either a cat or a dog, which is wrong. If a horse picture has p(cat)=1e-5 and p(dog) = 1e-7, that should also be fine, right? And if you normalize those such that p(cat) + p(dog) = 1, you end up with basically p(cat)=1. Testing for (approximately) p(cat) = p(dog) when it can be neither is a messy way to go about doing calibration.

It’s just a long way of saying that having the probabilities not sum to one is fine.

4

ObjectManagerManager t1_j27h8xa wrote

Platt ("temperature") scaling works well, and it's very simple. Yes, you do it post-training, usually on a held-out calibration set. Some people will then retrain on all of the data and reuse the learned temperature, but that doesn't always work out as well as you want it to.

FTR, "multiclass classification" means each instance belongs to exactly one of many classes. When each label can be 0 / 1 irrespective of the other labels, it's referred to as "multilabel classification".

3