Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post explains the logistic regression and implements R code for the estimation of its parameters.

# Logistic Regression

Logistic Regression is a benchmark machine learning model. This model have a binary response variable ($$Y$$) which takes on 0 or 1. We can find lots of this kind of variables, among them are sucess/failure, bankrupcy/solvency, exit/stay. To understand logistic regression model, we should know PMF(probability mass function) of the binomial distribution since its log likelihood function is constructed using this PMF. Given this log likelihood function, we can estimate its parameters by maximizing the log likelihood functio by MLE (maximum likelihood estimation).

### PMF for the binomial distribution

Given success probability $$p$$ and the number of trials $$n$$, binomial distribution produces the probability of occuring $$x( \le n)$$ success as follows. \begin{align} P(X=x;n,p) = {n \choose x} p^{x} (1-p)^{n-x} \end{align}
From the above equation for $$P(X=x;n,p)$$, $${n \choose x}$$ denotes the case when the number of success is $$x$$ from $$n$$ Bernoulli trials. Let’s take an example: suppose $$n=3, x=1$$. \begin{align} & {n \choose x} = {3 \choose 1} = 3 \\ & \textbf{case 1} : 1, 0, 0 \\ & \textbf{case 2} : 0, 1, 0 \\ & \textbf{case 3} : 0, 0, 1 \end{align}
$$p^{x} (1-p)^{n-x}$$ denotes the probability of $${n \choose x}$$. This probability of the above example ($$n=3, x=1$$) is as follows. \begin{align} & \textbf{case 1} : 1, 0, 0 : p(1-p)(1-p) = p^{1}(1-p)^{3-1}\\ & \textbf{case 2} : 0, 1, 0 : (1-p)p(1-p) = p^{1}(1-p)^{3-1}\\ & \textbf{case 3} : 0, 0, 1 : (1-p)(1-p)p = p^{1}(1-p)^{3-1} \end{align}
It is worth noticed that since $$P(X=x;n,p)$$ have one $$p$$, the symbol of combination is used. This logic is similar to that of calculating one weighted average.

Standard linear regression is used for the calculation of the conditional mean, which is varying with the realization of covariates ($$X$$). This means that in addition to the unconditional mean, additional information from covariates are useful for the improvement of forecast performance. This story is also applied to logistic regression. Assuming $$p$$ is neither a fixed nor an unconditional probability, we can estimate the conditional probability using explanatory variables ($$X$$).

### Logit transformation

Since explanatory variables $$X$$ have useful information for explaining variation of a binary variable $$Y$$, we can set up a linear regression model for this binary response variable. But output from a standard linear regression can be of less than zero or more than one. This is, therefore, not appropriate because $$Y$$ is the binary variable which have a range from 0 and 1. From this reason, we need the following logit transformation which converts 0~1 binary variable to $$-\infty$$~$$\infty$$ real number. \begin{align} p & : (0,1) \\ \frac{p}{1-p} & : (0, \infty )\\ \log \left(\frac{p}{1-p}\right) & : (- \infty , \infty ) \end{align}
Using this transformation, we can obtain the logistic regression model in the following way.

\begin{align} & \log \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} \\ & \rightarrow p_i = \frac{exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})}{1+exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})} \end{align}

### Log-likelihood function

Since each $$p_i (= \frac{exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})}{1+exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})})$$ is different with covariates, likelihood function of logistic regression as successive Bernoulli trials, has the following form.

\begin{align} L(B;y,x) = \prod_{i=1}^{N} p_i^{y_i} (1-p_i)^{1-y_i} \end{align}
We can find that there is no term like $${n \choose x}$$ in the likelihood function. The reason is that $$p_i (= \frac{exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})}{1+exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})})$$ is different with covariates as mentioned previously. Combination is used only when occurrence probability are fixed and all the same. In particular, to facilitate a numerical optimization, we use the following log likelihood function.
\begin{align} l(B;y,x) = \sum_{i=1}^{N} y_i \log(p_i) + (1-y_i) \log(1-p_i) \end{align}
It is typical to estimate parameters using MLE by calling numerical optimization. But for ready-made models like linear regression, logistic regression, etc, it is convenient to use glm() R package which performs optimization automatically.

### R code for Logistic Regression

The following R code shows how to implement logistic regression using R. We provide two-method: the one uses glm() conveniently and the other maximize log likelihood function by numerical MLE directly. To check for the accuracy of estimation, we assume population parameters as $$\beta_0 = 0.3, \beta_1 = 0.5, \beta_2 = -0.2$$. It is, therefore, expected that two methods produce the same results.

 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283 #=========================================================================## Financial Econometrics & Derivatives, ML/DL using R, Python, Tensorflow  # by Sang-Heon Lee #————————————————————————-## Logistic Regression#=========================================================================# graphics.off()  # clear all graphsrm(list = ls()) # remove all files from your workspace #—————————————————–# Simulated data#—————————————————–     # number of observation    nobs <– 10000        # parameters assumed    beta <– c(0.3, 0.5, –0.2)        # simulated x    x1 <– rnorm(n = nobs)    x2 <– rnorm(n = nobs)        bx = beta[1] + beta[2]*x1 + beta[3]*x2    p = 1 / (1 + exp(–bx))        # simulated y    y <– rbinom(n = nobs, size = 1, prob = p)        # simulated y and x    df.yx <– data.frame(y, x1, x2)          #—————————————————–# Estimation by glm#—————————————————–     fit_glm = glm(y ~ ., data = df.yx, family = binomial)    summary(fit_glm)    #—————————————————–# Estimation by optimization#—————————————————–        # objective function = negative log-likelihood    obj.func <– function(b0, df.yx) {                beta <– b0        y <– df.yx[,1]        x <– as.matrix(cbind(1,df.yx[,–1]))                bx = x%*%beta        p = 1 / (1 + exp(–bx))                 # we can calculate log-likelihood        # using one of the two below                # 1. vector operation        #        loglike <– sum(y*log(p) + (1–y)*log(1–p))         # 2. loop operation        #        # loglike <- 0        # for(i in 1:nrow(df.yx)) {        #    loglike <- loglike + y[i]*log(p[i]) +         #               (1-y[i])*log(1-p[i])        # }         return(–1*loglike)    }        # run optimization    m<–optim(c(0.1,0.1,0.1), obj.func,             control = list(maxit=5000, trace=2),             method=c(“Nelder-Mead”),df.yx=df.yx)    #—————————————————–# Comparison of results#—————————————————–    beta    m\$par    coef(fit_glm)Colored by Color Scripter cs

The above R program delivers the parameter estimates from two approaches.

From the estimation output, we can find that two approaches show the same results. Numerical optimziation for MLE is used only for educational purpose. It is, therefore, recommended to use glm() function.$$\blacksquare$$