Submitted by cthorrez t3_xsq40j in MachineLearning
Some of you reading this might not even realize that most of modern machine learning is based on the logistic distribution. What I'm referring to is the sigmoid function. Its technical name is the logistic function, and the version that permeates the ML community is the cumulative distribution function of the logistic distribution with location 0 and scale 1.
This little function is used by many to map real numbers into the (0,1) interval, which is extremely useful when trying to predict probabilities.
I even came across a statement in the scikit-learn documentation that astounded me. It suggests that the log loss is actually named after the logistic distribution, because it is the loss function for logistic regression.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

> Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model
Now I think this is a mistake. Log loss should be short for logarithmic loss, since it takes the natural logarithm of the predicted probabilities, but it has become so unthinkable in the ML community to generate probabilities with anything other than the logistic sigmoid that the two names have been conflated.
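To make that concrete, here is a minimal sketch (using `sklearn.metrics.log_loss`, with made-up labels and probabilities) showing that the loss only depends on the predicted probabilities and the labels, not on which link function produced them:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities; the probabilities could have
# come from any model and any squashing function.
y_true = np.array([0, 1, 1, 0])
p_pred = np.array([0.1, 0.8, 0.6, 0.3])

# Average negative log-likelihood of a Bernoulli model, written out by hand.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(manual)                    # ≈ 0.299
print(log_loss(y_true, p_pred))  # same value
```

Nothing in that formula refers to the logistic distribution; only the name does.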
I fell into this camp until recently, when I realized that the CDF of ANY distribution can perform the same task. For example, if you use the CDF of a standard Gaussian, you get probit regression. And I think it makes sense to pick a CDF based on the problem you are working on.
But how often do you see a neural net whose final activation is a Gaussian CDF?
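For anyone who wants to see the two links side by side, here is a small sketch (assuming scipy is available) using the logistic CDF (`scipy.special.expit`) and the standard Gaussian CDF (`scipy.stats.norm.cdf`) to squash the same linear predictor:

```python
import numpy as np
from scipy.special import expit   # logistic sigmoid, i.e. the logistic CDF
from scipy.stats import norm      # standard Gaussian

z = np.linspace(-4, 4, 9)         # values of the linear predictor ("logits")

p_logistic = expit(z)             # logistic link  -> logistic regression
p_probit   = norm.cdf(z)          # Gaussian link  -> probit regression

# Both map the real line into (0, 1); they mainly differ in tail heaviness.
print(np.round(p_logistic, 3))
print(np.round(p_probit, 3))
```

In a neural net, swapping one for the other is literally a one-line change to the final activation, which makes it all the more striking that you almost never see it done.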
So is there a good reason why everyone only seems to care about the logistic sigmoid in ML? Some potential explanations I thought of are that it's relatively simple mathematically, that the logarithm can help with numerical stability via the log-sum-exp trick, and that it might be easier to extend to multi-class problems.
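On the numerical-stability point, the logistic link does compose unusually well with the logarithm, since log(sigmoid(z)) = -log(1 + exp(-z)), which can be evaluated stably with logaddexp; the Gaussian CDF needs a dedicated log-CDF routine (scipy ships `norm.logcdf` for this). A quick sketch, assuming numpy/scipy, with an extreme logit just to show the failure mode:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

z = -800.0  # an extreme negative logit

# Naive route: the probability underflows to exactly 0, so its log is -inf.
print(np.log(expit(z)))        # -inf
print(np.log(norm.cdf(z)))     # -inf

# Stable route: stay in log space from the start.
print(-np.logaddexp(0.0, -z))  # -800.0, since log(sigmoid(z)) = -log(1 + exp(-z))
print(norm.logcdf(z))          # large negative but finite, no underflow
```

So the stability argument is less about the logistic link being uniquely stable and more about its log-likelihood having a particularly simple closed form.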
Have any of you experimented with using the CDFs of other distributions to generate probabilities, and do you think it would make sense to explore in that direction?
seba07 t1_iqltogp wrote
I think the real answer for many ML problems is "because it works". Why are we using relu (= max(x, 0)) instead of sigmoid or tanh as layer activations nowadays? Math would discourage this, since the derivative at 0 is not defined, but it's fast and it works.
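For what it's worth, autograd frameworks just pick a subgradient at the kink and move on; a quick check, assuming PyTorch is installed:

```python
import torch

# Gradient of relu at exactly 0: PyTorch picks the subgradient 0 at the kink.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.)
```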