Submitted by neuralbeans t3_10puvih in deeplearning

I'd like to train a neural network where the softmax output has a minimum possible probability. During training, none of the probabilities should go below this minimum. Basically I want to prevent the logits from becoming too different from each other, so that no output category is ever completely excluded from a prediction, a sort of smoothing. What's the best way to do this during training?

6

Comments


FastestLearner t1_j6mhjd2 wrote

Use a composite loss, i.e. add extra terms to the loss function so that the optimizer forces the logits to stay within a fixed range.

For example, if current min logit = m and allowed minimum = u, current max logit = n and allowed maximum = v, then the following loss function should help:

Overall loss = CrossEntropy loss + lambda1 * max(u - m, 0) + lambda2 * max(n - v, 0)

The max terms ensure that no loss is added when the logits are all within the allowed range. Use lambda1 and lambda2 to scale each term so that they roughly match the CE loss in strength.
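A minimal PyTorch sketch of that composite loss (the bounds u, v and the lambda weights below are just placeholder values):

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, targets, u=-10.0, v=10.0, lambda1=0.1, lambda2=0.1):
    """Cross-entropy plus hinge penalties that push logits back into [u, v]."""
    ce = F.cross_entropy(logits, targets)
    m = logits.min()  # current minimum logit
    n = logits.max()  # current maximum logit
    # Both penalties are zero while all logits stay inside the allowed range.
    penalty_low = torch.clamp(u - m, min=0.0)
    penalty_high = torch.clamp(n - v, min=0.0)
    return ce + lambda1 * penalty_low + lambda2 * penalty_high
```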

5

like_a_tensor t1_j6mcv1v wrote

I'm not sure how to fix a minimum probability, but you could try softmax with a high temperature.
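A quick sketch of what that looks like, with an arbitrary temperature value:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, -2.0])
T = 5.0  # temperature > 1 flattens the distribution
probs = F.softmax(logits / T, dim=-1)
# roughly [0.54, 0.30, 0.16] instead of the untempered [0.95, 0.05, 0.00]
```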

2

neuralbeans OP t1_j6md46u wrote

That will just make the model learn larger logits to undo the effect of the temperature.

0

_vb__ t1_j6ocec9 wrote

No, it would make the logits be closer to one another and the overall model a bit less confident in its probabilities.

2

Lankyie t1_j6mf6pt wrote

max[softmax, lowest accepted probability]

1

neuralbeans OP t1_j6miw6o wrote

It needs to remain a valid softmax distribution.

2

Lankyie t1_j6mjvpy wrote

yeah true, though you can fix that by renormalizing everything so it sums back to 1
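Roughly, that clamp-and-renormalize post-processing would look like this (assuming PyTorch; the floor value is arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, 1.0, -2.0]])
floor = 0.05  # lowest accepted probability

probs = F.softmax(logits, dim=-1)
probs = torch.clamp(probs, min=floor)             # max[softmax, floor]
probs = probs / probs.sum(dim=-1, keepdim=True)   # renormalize so it sums to 1
# Caveat: the renormalization divides by a sum > 1, so clamped entries can end
# up marginally below the floor again.
```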

2

emilrocks888 t1_j6mjf7m wrote

I would scale the logits before the softmax, like it's done in self-attention. That scaling in self-attention is actually there to keep the final distribution of the attention weights smooth.

1

neuralbeans OP t1_j6mjhog wrote

What's this about del attention?

1

emilrocks888 t1_j6mjnk7 wrote

Sorry, dictionary issue. I meant self-attention (I've edited the previous answer).

1

chatterbox272 t1_j6myph4 wrote

If the goal is to keep all predictions above a floor, the easiest way is to make the activation into floor + (1 - floor * num_logits) * softmax(logits). This doesn't have any material impact on the model, but it imposes a floor.

If the goal is to actually change something about how the predictions are made, then adding a floor isn't going to be the solution though. You could modify the activation function some other way (e.g. by scaling the logits, normalising them, etc.), or you could impose a loss penalty for the difference between the logits or the final predictions.

1

neuralbeans OP t1_j6n0ima wrote

I want the output to remain a proper distribution.

1

chatterbox272 t1_j6n3vx6 wrote

My proposed function does that. Let's say you have two outputs, and don't want either to go below 0.25. Your minimum values already add up to 0.5, so you rescale the softmax to add up to 0.5 as well, giving you a sum of 1 and a valid distribution.
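A small sketch of that activation, using the numbers from this example (assuming PyTorch):

```python
import torch
import torch.nn.functional as F

def floored_softmax(logits, floor):
    """floor + (1 - floor * num_logits) * softmax(logits): a valid distribution
    in which every entry is at least `floor`."""
    num_logits = logits.shape[-1]
    return floor + (1.0 - floor * num_logits) * F.softmax(logits, dim=-1)

# Two outputs with a 0.25 floor:
logits = torch.tensor([3.0, -1.0])
probs = floored_softmax(logits, floor=0.25)
print(probs, probs.sum())  # roughly [0.74, 0.26]; both >= 0.25 and the sum is 1.0
```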

2

nutpeabutter t1_j6n2eaf wrote

Taking a leaf out of RL, you can add an additional entropy loss.

Alternatively, clip the logits but apply an STE (straight-through estimator, i.e. copy the gradients) on backprop.
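A rough sketch of the entropy-bonus idea (assuming PyTorch; the beta coefficient is a placeholder):

```python
import torch
import torch.nn.functional as F

def loss_with_entropy_bonus(logits, targets, beta=0.01):
    """Cross-entropy minus an entropy bonus, as in many RL policy losses.
    Rewarding entropy keeps the softmax from collapsing onto one class."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - beta * entropy
```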

1

No_Cryptographer9806 t1_j6nfqhq wrote

I'm curious, why do you want to do that? You can always post-process the logits, but forcing the network to learn it will harm the underlying representation imo.

1

neuralbeans OP t1_j6nmccc wrote

It's for reinforcement learning to keep the model exploring possibilities.

1