Submitted by cthorrez t3_xsq40j in MachineLearning
Some of you reading this might not even realize that most of modern machine learning is based on the logistic distribution. What I'm referring to is the sigmoid function. Its technical name is the logistic function, and the version that permeates the ML community is the cumulative distribution function of the logistic distribution with location 0 and scale 1.
This little function is used by many to map real numbers into the (0,1) interval, which is extremely useful when trying to predict probabilities.
I even came across a statement in the scikit-learn documentation that astounded me. It suggests that the log loss is actually named after the logistic distribution, because it is the loss function for logistic regression.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

> Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model
Now I think this is a mistake. Log loss should be short for logarithmic loss, since it takes the natural logarithm of the predicted probabilities, but it has become so unthinkable in the ML community to generate probabilities with anything other than the logistic sigmoid that the two names have been conflated.
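To make that concrete, here is a minimal sketch (using `sklearn.metrics.log_loss`, with made-up labels and probabilities) showing that the loss only depends on the predicted probabilities and the labels, not on which link function produced them:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities; the probabilities could have
# come from any model and any squashing function.
y_true = np.array([0, 1, 1, 0])
p_pred = np.array([0.1, 0.8, 0.6, 0.3])

# Average negative log-likelihood of a Bernoulli model, written out by hand.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(manual)                    # ≈ 0.299
print(log_loss(y_true, p_pred))  # same value
```

Nothing in that formula refers to the logistic distribution; only the name does.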
I fell into this camp until recently, when I realized that the CDF of ANY distribution can perform the same task. For example, if you use the CDF of a standard Gaussian, you get probit regression. And I think it makes sense to pick a CDF based on the problem you are working on.
But how often do you see a neural net whose final activation is a Gaussian CDF?
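For anyone who wants to see the two links side by side, here is a small sketch (assuming scipy is available) using the logistic CDF (`scipy.special.expit`) and the standard Gaussian CDF (`scipy.stats.norm.cdf`) to squash the same linear predictor:

```python
import numpy as np
from scipy.special import expit   # logistic sigmoid, i.e. the logistic CDF
from scipy.stats import norm      # standard Gaussian

z = np.linspace(-4, 4, 9)         # values of the linear predictor ("logits")

p_logistic = expit(z)             # logistic link  -> logistic regression
p_probit   = norm.cdf(z)          # Gaussian link  -> probit regression

# Both map the real line into (0, 1); they mainly differ in tail heaviness.
print(np.round(p_logistic, 3))
print(np.round(p_probit, 3))
```

In a neural net, swapping one for the other is literally a one-line change to the final activation, which makes it all the more striking that you almost never see it done.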
So is there a good reason why everyone only seems to care about the logistic sigmoid in ML? Some potential explanations I thought of are that it's relatively simple mathematically, that the logarithm can help with numerical stability via the log-sum-exp trick, and that it might be easier to extend to multi-class problems.
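On the numerical-stability point, the logistic link does compose unusually well with the logarithm, since log(sigmoid(z)) = -log(1 + exp(-z)), which can be evaluated stably with logaddexp; the Gaussian CDF needs a dedicated log-CDF routine (scipy ships `norm.logcdf` for this). A quick sketch, assuming numpy/scipy, with an extreme logit just to show the failure mode:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

z = -800.0  # an extreme negative logit

# Naive route: the probability underflows to exactly 0, so its log is -inf.
print(np.log(expit(z)))        # -inf
print(np.log(norm.cdf(z)))     # -inf

# Stable route: stay in log space from the start.
print(-np.logaddexp(0.0, -z))  # -800.0, since log(sigmoid(z)) = -log(1 + exp(-z))
print(norm.logcdf(z))          # large negative but finite, no underflow
```

So the stability argument is less about the logistic link being uniquely stable and more about its log-likelihood having a particularly simple closed form.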
Have any of you experimented with using the CDFs of other distributions to generate probabilities, and do you think it would make sense to explore in that direction?
seba07 t1_iqltogp wrote
I think the real answer for many ML problems is "because it works". Why are we using relu (= max(x, 0)) instead of sigmoid or tanh as layer activations nowadays? Math would discourage this, since the derivative at 0 is not defined, but it's fast and it works.
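For what it's worth, autograd frameworks just pick a subgradient at the kink and move on; a quick check, assuming PyTorch is installed:

```python
import torch

# Gradient of relu at exactly 0: PyTorch picks the subgradient 0 at the kink.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.)
```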