Submitted by neuralbeans t3_10puvih in deeplearning

I'd like to train a neural network where the softmax output has a minimum possible probability. During training, none of the probabilities should go below this minimum. Basically I want to prevent the logits from becoming too different from each other, so that no output category is ever completely excluded from a prediction, a sort of smoothing. What's the best way to do this during training?

6

Comments


FastestLearner t1_j6mhjd2 wrote

Use a composite loss, i.e. add extra terms to the loss function so that the optimizer forces the logits to stay within a fixed range.

For example, if current min logit = m and allowed minimum = u, current max logit = n and allowed maximum = v, then the following loss function should help:

Overall loss = CrossEntropy loss + lambda1 * max(u - m, 0) + lambda2 * max(n - v, 0)

The max terms ensure that no loss is added when the logits are all within the allowed range. Use lambda1 and lambda2 to scale each term so that they roughly match the CE loss in strength.
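A minimal PyTorch sketch of that composite loss (the bounds u, v and the lambda weights below are just placeholder values):

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, targets, u=-10.0, v=10.0, lambda1=0.1, lambda2=0.1):
    """Cross-entropy plus hinge penalties that push logits back into [u, v]."""
    ce = F.cross_entropy(logits, targets)
    m = logits.min()  # current minimum logit
    n = logits.max()  # current maximum logit
    # Both penalties are zero while all logits stay inside the allowed range.
    penalty_low = torch.clamp(u - m, min=0.0)
    penalty_high = torch.clamp(n - v, min=0.0)
    return ce + lambda1 * penalty_low + lambda2 * penalty_high
```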

5

like_a_tensor t1_j6mcv1v wrote

I'm not sure how to fix a minimum probability, but you could try softmax with a high temperature.
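A quick sketch of what that looks like, with an arbitrary temperature value:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, -2.0])
T = 5.0  # temperature > 1 flattens the distribution
probs = F.softmax(logits / T, dim=-1)
# roughly [0.54, 0.30, 0.16] instead of the untempered [0.95, 0.05, 0.00]
```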

2

neuralbeans OP t1_j6md46u wrote

That will just make the model learn larger logits to undo the effect of the temperature.

0

_vb__ t1_j6ocec9 wrote

No, it would make the logits be closer to one another and the overall model a bit less confident in its probabilities.

2

Lankyie t1_j6mf6pt wrote

max[softmax, lowest accepted probability]

1

neuralbeans OP t1_j6miw6o wrote

It needs to remain a valid softmax distribution.

2

Lankyie t1_j6mjvpy wrote

yeah true, though you can fix that by renormalizing everything so it sums back to 1
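Roughly, that clamp-and-renormalize post-processing would look like this (assuming PyTorch; the floor value is arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, 1.0, -2.0]])
floor = 0.05  # lowest accepted probability

probs = F.softmax(logits, dim=-1)
probs = torch.clamp(probs, min=floor)             # max[softmax, floor]
probs = probs / probs.sum(dim=-1, keepdim=True)   # renormalize so it sums to 1
# Caveat: the renormalization divides by a sum > 1, so clamped entries can end
# up marginally below the floor again.
```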

2

emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is actually there to keep the final distribution of the attention weights smooth.

1

neuralbeans OP t1_j6mjhog wrote

What's this about del attention?

1

emilrocks888 t1_j6mjnk7 wrote

Sorry, dictionary issue. I meant self-attention (I've edited the previous answer).

1

chatterbox272 t1_j6myph4 wrote

If the goal is to keep all predictions above a floor, the easiest way is to make the activation into floor + (1 - floor * num_logits) * softmax(logits). This doesn't have any material impact on the model, but it imposes a floor.

If the goal is to actually change something about how the predictions are made, then adding a floor isn't going to be the solution though. You could modify the activation function some other way (e.g. by scaling the logits, normalising them, etc.), or you could impose a loss penalty for the difference between the logits or the final predictions.

1

neuralbeans OP t1_j6n0ima wrote

I want the output to remain a proper distribution.

1

chatterbox272 t1_j6n3vx6 wrote

My proposed function does that. Let's say you have two outputs, and don't want either to go below 0.25. Your minimum values already add up to 0.5, so you rescale the softmax to add up to 0.5 as well, giving you a sum of 1 and a valid distribution.
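A small sketch of that activation, using the numbers from this example (assuming PyTorch):

```python
import torch
import torch.nn.functional as F

def floored_softmax(logits, floor):
    """floor + (1 - floor * num_logits) * softmax(logits): a valid distribution
    in which every entry is at least `floor`."""
    num_logits = logits.shape[-1]
    return floor + (1.0 - floor * num_logits) * F.softmax(logits, dim=-1)

# Two outputs with a 0.25 floor:
logits = torch.tensor([3.0, -1.0])
probs = floored_softmax(logits, floor=0.25)
print(probs, probs.sum())  # roughly [0.74, 0.26]; both >= 0.25 and the sum is 1.0
```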

2

nutpeabutter t1_j6n2eaf wrote

Taking a leaf out of RL, you can add an additional entropy loss.

Alternatively, clip the logits but apply an STE (straight-through estimator, i.e. copy the gradients) on backprop.
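A rough sketch of the entropy-bonus idea (assuming PyTorch; the beta coefficient is a placeholder):

```python
import torch
import torch.nn.functional as F

def loss_with_entropy_bonus(logits, targets, beta=0.01):
    """Cross-entropy minus an entropy bonus, as in many RL policy losses.
    Rewarding entropy keeps the softmax from collapsing onto one class."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - beta * entropy
```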

1

No_Cryptographer9806 t1_j6nfqhq wrote

I'm curious, why do you want to do that? You can always post-process the logits, but forcing the network to learn it will harm the underlying representation imo.

1

neuralbeans OP t1_j6nmccc wrote

It's for reinforcement learning to keep the model exploring possibilities.

1