Submitted by AbIgnorantesBurros t3_y0y3q6 in MachineLearning

Watching this old Keras video from TF Summit 2017. Francois shows this slide https://youtu.be/UeheTiBJ0Io?t=936 where the last layer in his classifier does not have a softmax activation. Later he explains that the loss function he's using can take unscaled inputs and apply a softmax to them. Great.

My question: why would you use a final layer like that? What am I missing? Looks like the client would need to softmax the model output in order to get a useful prediction, no? If so, what would be a sane reason to do this? Or is he merely demonstrating that softmax_cross_entropy_with_logits is so smart that it can apply softmax before computing the cross entropy?

6

Comments

Seankala t1_iruw37k wrote

Can't speak on behalf of Keras, but in PyTorch's implementation of cross-entropy loss (nn.CrossEntropyLoss) the softmax is computed inside the loss function itself. Therefore, you'd feed unscaled logits into the loss function.
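
For concreteness, here's a minimal PyTorch sketch of that pattern (the layer sizes and data are made up for illustration):

```python
import torch
import torch.nn as nn

# Final layer has no activation: it outputs raw logits.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 5),              # 5-class classifier, no softmax here
)

criterion = nn.CrossEntropyLoss()  # applies log-softmax + NLL internally

x = torch.randn(8, 20)             # batch of 8 examples
y = torch.randint(0, 5, (8,))      # integer class labels
loss = criterion(model(x), y)      # unscaled logits go straight into the loss
```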

16

nullspace1729 t1_irv5z1j wrote

It’s because of something called the log-sum-exp trick. If you combine the activation with the loss, you can compute the log-softmax in a numerically stable way, which matters when the predicted probabilities get very close to 0 or 1 (i.e., when the logits are large in magnitude).
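
A small NumPy illustration of the stability issue (the extreme logit values are chosen just to force the overflow):

```python
import numpy as np

def log_softmax_naive(z):
    # log(softmax(z)) computed directly: exp() overflows for large logits.
    return np.log(np.exp(z) / np.exp(z).sum())

def log_softmax_stable(z):
    # Log-sum-exp trick: subtract the max logit before exponentiating.
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

z = np.array([1000.0, 0.0, -1000.0])
print(log_softmax_naive(z))   # [nan -inf -inf]  (overflow in exp)
print(log_softmax_stable(z))  # [0. -1000. -2000.]
```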

7

sanderbaduk t1_irv1lo8 wrote

For classification, you get the same answer taking the argmax of the logits as taking the argmax of the probabilities, since softmax is monotonic. For training, combining the softmax or sigmoid with the loss function can be more numerically stable.
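
A quick check of the argmax point, with some arbitrary toy logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1,  3.0, 2.9]])
probs = F.softmax(logits, dim=-1)

# Softmax is monotonic, so the ranking (and hence the argmax) is unchanged.
print(logits.argmax(dim=-1))  # tensor([0, 1])
print(probs.argmax(dim=-1))   # tensor([0, 1])
```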

3

rx303 t1_irvia6u wrote

Summing log-probs is more stable than multiplying probs.
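
For example (a made-up run of 1000 probabilities, just to show the underflow):

```python
import numpy as np

p = np.full(1000, 0.1)    # 1000 independent probabilities of 0.1

print(np.prod(p))         # 0.0 -- the product underflows
print(np.log(p).sum())    # about -2302.6 -- the log-sum is fine
```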

2

pocolai t1_irw0b82 wrote

This is just for numerical stability when computing the loss. The user can apply a softmax to the model's output during inference.
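
In Keras terms, that might look something like this (a sketch, with a toy model standing in for a real trained classifier):

```python
import tensorflow as tf

# Toy stand-in for a classifier whose last Dense layer has no activation.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3),                      # outputs logits
])
# Training would use a loss that accepts logits, e.g.
# tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True).

x_batch = tf.random.normal((2, 4))
logits = model(x_batch)                            # raw, unnormalized scores
probs = tf.nn.softmax(logits, axis=-1)             # probabilities for the client
```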

2

Shot_Expression8647 t1_irw2ihl wrote

Consider a neural network with no hidden layers, i.e., logistic regression. You can think of it as learning a linear function wx + b with the logistic loss. If we replace the logistic loss with the hinge loss, we get the SVM. There are other losses you could use as well (e.g., the exponential loss), but I'm not sure if they all have names. Thus, leaving the output untransformed exposes a part of the model that we can easily swap out to get other models.
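
A rough sketch of that view in PyTorch, assuming binary labels in {-1, +1} and an arbitrary linear score wx + b; only the loss changes between the two models:

```python
import torch
import torch.nn.functional as F

w = torch.randn(10, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,)).float() * 2 - 1       # labels in {-1, +1}
s = x @ w + b                                         # untransformed scores

logistic_loss = F.softplus(-y * s).mean()             # log(1 + exp(-ys)): logistic regression
hinge_loss = torch.clamp(1 - y * s, min=0).mean()     # max(0, 1 - ys): linear SVM
```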

In practice, you might also leave the last layer untransformed if you want to use its output as an embedding for another task.

2

mrpogiface t1_irz4o45 wrote

The theoretical justification for having the softmax in the loss is nice, too. Aside from the numerical stability bit, using softmax with cross-entropy makes sense probabilistically.
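
One way to make that concrete (my reading of the probabilistic point): softmax turns the logits into a categorical distribution, and cross-entropy on the logits is exactly the negative log-likelihood of the observed labels under that distribution. A quick PyTorch check:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)               # 4 samples, 3 classes
y = torch.tensor([0, 2, 1, 1])           # observed class labels

ce = F.cross_entropy(logits, y)          # cross-entropy on raw logits

# Negative log-likelihood under the categorical distribution softmax(logits).
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(4), y].mean()

print(torch.allclose(ce, nll))           # True
```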

1