Suppose that you are interviewing for a data scientist role. You are asked about logistic regression, and you answer all sorts of questions: how to run it in Python, how you would perform feature selection, and how you would use it for prediction. For the last question you answer that if you have the estimates of the regression coefficients and the values of the features, you perform the necessary multiplications and additions, and the result is L=log(p/(1-p)), where p is the probability of the event to be predicted. This transformation is known as the logit transformation. From L you can calculate p as exp(L)/(1+exp(L)). Then comes the critical question: why is that?
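The prediction step described above can be sketched in a few lines of Python. This is only an illustration: the function name predict_probability, the intercept, the coefficients, and the feature values are all made up for the example and do not come from a fitted model.

```python
import math

def predict_probability(coefficients, intercept, features):
    """Return p for one observation, given estimated coefficients (hypothetical helper)."""
    # Linear predictor: L = intercept + sum of (coefficient * feature value)
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Inverse logit: exp(L) / (1 + exp(L)), written in the equivalent form 1 / (1 + exp(-L))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Made-up estimates and feature values, for illustration only
intercept = -1.0
coefficients = [0.5, 2.0]
features = [2.0, 0.0]  # L = -1 + 0.5*2 + 2.0*0 = 0

p = predict_probability(coefficients, intercept, features)
print(p)
```

Here L works out to 0, so the predicted probability is one half. The sketch does not guard against overflow for very large negative values of L, which a production implementation would have to handle.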
One possible answer is that since p is between 0 and 1, L is between -∞ and ∞, that is, it can be any real number, and therefore the logistic regression is transformed into “regular” linear regression. However, this answer is wrong. In the training data, the values of the event/label to be predicted or classified are either 0 or 1. You cannot apply the logit transformation to zeros and ones: log(p/(1-p)) is -∞ at p=0 and +∞ at p=1. And even if you could, the linear regression assumptions do not hold.
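To see the problem concretely, here is a small sketch that tries to apply the logit transformation to raw labels. The logit helper and the label list are hypothetical, and raising an error at 0 and 1 is my own choice for the illustration.

```python
import math

def logit(p):
    """Log-odds log(p / (1 - p)); finite only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is not finite at p = 0 or p = 1")
    return math.log(p / (1.0 - p))

labels = [0, 1, 1, 0]  # made-up training labels

undefined = 0
for y in labels:
    try:
        logit(y)
    except ValueError:
        undefined += 1  # every 0/1 label ends up here

print(undefined, "of", len(labels), "labels have no finite logit")
print(logit(0.5), logit(0.9))  # probabilities strictly between 0 and 1 are fine
```

Every label fails, while probabilities strictly inside (0, 1) transform without trouble. So there is no transformed response on which to run an ordinary linear regression.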
A more sophisticated answer is to say that the logit transformation is the link function of choice. This choice has a nice property: if β is the coefficient of a feature X, then exp(β) is a (biased) estimate of the corresponding odds ratio. This is useful if you want to identify risk factors, e.g. for a disease like cancer. But if you are interested in predictions, you may not care about the odds ratio (although you should).
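The odds ratio property can be checked numerically. The sketch below assumes a model with one feature and made-up coefficients. It compares the odds at x=3 with the odds at x=2 and shows that their ratio equals exp(β) for the slope β.

```python
import math

def probability(beta0, beta1, x):
    """Model probability under a logit link (hypothetical one-feature model)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1.0 - p)

beta0, beta1 = -2.0, 0.7  # made-up coefficients, for illustration only

# Ratio of the odds when x increases by one unit
odds_ratio = odds(probability(beta0, beta1, 3.0)) / odds(probability(beta0, beta1, 2.0))

print(odds_ratio, math.exp(beta1))  # the two numbers agree up to rounding
```

The same ratio comes out for any pair of x values that differ by one unit, which is what makes the coefficient easy to interpret.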
Let’s assume that we know how to explain what a link function is. The question still remains: why not choose another link function? Almost any inverse distribution function will do the trick. Why not choose the inverse of the normal distribution function as the link function?
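To make the point tangible, the sketch below maps the same linear predictor to a probability in two ways: with the logistic distribution function (the inverse of the logit link) and with the standard normal distribution function (the inverse of the probit link). The function names and the grid of values are my own choices for the illustration.

```python
import math
from statistics import NormalDist

def inverse_logit(eta):
    """Logistic distribution function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def inverse_probit(eta):
    """Standard normal distribution function: also maps any real number into (0, 1)."""
    return NormalDist().cdf(eta)

for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(eta, inverse_logit(eta), inverse_probit(eta))
```

Both functions increase from 0 to 1 and both give one half at zero, so either one produces valid probabilities. Nothing in this comparison singles out the logit.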
Moreover, history does not support this answer. The logit transformation and the logistic regression model came first. This model was developed by Sir David Cox in the 1960’s, the same David Cox who later introduced the proportional hazards model (see https://papers.tinbergen.nl/02119.pdf). The extension of the model to generalized linear models using various link functions came later.
So the question remains: what is logistic in logistic regression?
The key is in the statistical model of the logistic regression, or any other binary regression. Let’s review the model.
Suppose you have data of a response/label/outcome Y that takes the values zero and one. Let’s assume for the sake of simplicity that you have only one feature/predictor X, which can be any type of variable.
The key assumption of the model is that there exists a continuous/latent/unobservable Y* that relates somehow to the observed values of Y. Note that Y* is not a part of your data. It is a part of your model.
The next assumption is about the relationship between Y and Y*. You assume that Y equals 1 if the value of Y* is above some threshold, and otherwise Y equals zero. Furthermore, assume, without loss of generality, that this threshold is zero.
This model assumption is not new, and most readers are familiar with this approach. This is how the perceptron, the building block of neural networks, works. The idea itself is much older. Karl Pearson used similar modelling when he attempted to develop a correlation coefficient for categorical data, back in the 1910’s.
Assuming you know the values of Y*, you can model the relationship between Y* and X using simple linear regression:
Y* = β₀ + β₁X + ε
The third and last assumption is about the distribution of the error ε. As I said before, you can choose any distribution you like. If you assume, for example, that ε is normally distributed, you will get something that is called probit regression. But if you assume that ε follows the logistic distribution:
F(t) = P(ε ≤ t) = exp(t)/(1+exp(t))
then with these three assumptions and some basic probability and algebra you get a logistic regression, that is, a regression with a logit link function.
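The three assumptions can be turned into a small simulation. The sketch below is only an illustration under made-up coefficients: it draws logistic errors by inverting the logistic distribution function, builds the latent Y*, thresholds it at zero, and compares the share of ones with exp(L)/(1+exp(L)), where L is the linear predictor. All names in the code are hypothetical.

```python
import math
import random

def draw_logistic(rng):
    """Draw from the standard logistic distribution by inverting its distribution function."""
    u = rng.random()
    while u == 0.0:  # random() can return exactly 0.0, which would give log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_share_of_ones(beta0, beta1, x, n, rng):
    """Share of observations with Y = 1 when Y* = beta0 + beta1*x + error and Y = 1 if Y* > 0."""
    ones = 0
    for _ in range(n):
        y_star = beta0 + beta1 * x + draw_logistic(rng)  # latent variable
        if y_star > 0:                                   # threshold at zero
            ones += 1
    return ones / n

rng = random.Random(0)    # fixed seed so the run is repeatable
beta0, beta1 = -1.0, 0.8  # made-up coefficients, for illustration only

results = {}
for x in (0.0, 1.0, 2.0, 3.0):
    simulated = simulate_share_of_ones(beta0, beta1, x, 100_000, rng)
    model = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    results[x] = (simulated, model)
    print(x, simulated, model)
```

For each value of x, the simulated share should land close to the model probability, up to sampling noise. Swapping draw_logistic for a normal draw such as rng.gauss(0.0, 1.0) would give the probit model instead.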
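For simplicity I will assume that X is a discrete variable. One can do the whole trick for any X by using density functions for a continuous X.
Let
p(x) = P(Y=1 | X=x)
be the conditional probability that Y equals 1 given that X is equal to some value x.
Since Y=1 if and only if Y*>0, we get that
p(x) = P(Y*>0 | X=x)
By the linear model for Y*, Y* = β₀ + β₁X + ε, and assuming that ε is independent of X, we get that
p(x) = P(β₀ + β₁x + ε > 0) = P(ε > -(β₀ + β₁x)) = 1 - F(-(β₀ + β₁x))
where F is the distribution function of ε.
Using the third assumption, which states that the distribution of ε is logistic, that is, F(t) = exp(t)/(1+exp(t)), we get that
p(x) = 1 - exp(-(β₀ + β₁x))/(1 + exp(-(β₀ + β₁x))) = exp(β₀ + β₁x)/(1 + exp(β₀ + β₁x))
and therefore
log(p(x)/(1-p(x))) = β₀ + β₁x
which is exactly the logit transformation from the interview question.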
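The last step can also be checked numerically. The sketch below uses made-up coefficients and a hypothetical helper logistic_cdf. It computes P(ε > -η) for a logistic ε, where η is the linear predictor, and confirms that the log-odds of that probability give back η.

```python
import math

def logistic_cdf(t):
    """Distribution function of the standard logistic distribution."""
    return math.exp(t) / (1.0 + math.exp(t))

beta0, beta1 = 0.3, -1.2  # made-up coefficients, for illustration only

checks = []
for x in (-2.0, 0.0, 1.5):
    eta = beta0 + beta1 * x           # linear predictor
    p = 1.0 - logistic_cdf(-eta)      # P(error > -eta) = P(Y* > 0)
    log_odds = math.log(p / (1.0 - p))
    checks.append((eta, log_odds))
    print(x, eta, log_odds)           # eta and log_odds agree up to rounding
```

The agreement is no coincidence of the chosen numbers: it holds for any coefficients because the logit is the inverse of the logistic distribution function.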
Good luck in your interview!