Logistic regression is a type of regression used when the dependant variable is binary or ordinal (e.g. when the outcome is either “dead” or “alive”). It is commonly used for predicting the probability of occurrence of an event, based on several predictor variables that may either be numerical or categorical. For example, suppose a researcher is interested in how Graduate Record Exam scores (GRE) and grade point average (GPA) effect admission into graduate school. By deriving a logistic regression model from previously observed admissions (we will use an hypothetical dataset from the UCLA Academic Technology Services here), it becomes possible to predict future admissions.
mydata = read.csv(url('http://www.ats.ucla.edu/stat/r/dae/binary.csv'))
In this dataset, gre and gpa represent the predictor variables, and admit the outcome. For a simple start, we will ignore the categorical variable rank. A typical regression analysis using pre-established packages from R could then be applied as follows:
mylogit = glm(admit~gre+gpa, family=binomial, data=mydata)
However, in order to understand the mechanisms of logistic regression we can write a function ouselves. We will use maximum likelihood estimation (MLE) to find the parameters values, here represented by the unknown regression coefficients:
################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# x : a [n x p] vector with the data
# y : a [n x 1] vector with the outcomes (0 or 1)
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variane-covariance matrix
# ll : -2ln L (deviance)
#
################################################################################
# Author : Thomas Debray
# Version : 31 jul 2011
################################################################################
mle.logreg = function(x,y)
{
# Define the negative log likelihood function
logl <- function(theta,x,y){
y <- y
x <- as.matrix(x)
beta <- theta[1:ncol(x)]
# Use the log-likelihood of the Bernouilli distribution, where p is
# defined as the logistic transformation of a linear combination
# of predictors, according to logit(p)=(x%*%beta)
loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
return(-loglik)
}
# Prepare the data
x = as.matrix(x)
y = as.matrix(y)
# Define initial values for the parameters
theta.start = rep(0,(dim(x)[2]+1))
names(theta.start) = c("Intercept",colnames(x))
# Allow Intercept Estimation
X = cbind(1,x)
colnames(X)[1] = "Intercept"
# Calculate the maximum likelihood
mle = optim(theta.start,logl,x=X,y=y,hessian=T)
out = list(beta=mle$par,vcov=solve(mle$hessian),ll=2*mle$value)
}
################################################################################
We can implement this function as follows:
x = cbind(mydata$gre,mydata$gpa)
y = mydata$admit
colnames(x) = c("gre","gpa")
mylogit = mle.logreg(x,y)
Instead of obtaining the observed information matrix from the numerically differentiated Hessian matrix (through the optim-command), it is possible to calculate an unbiased estimate directly from the data:
################################################################################
# Calculates the maximum likelihood estimates of a logistic regression model
#
# x : a [n x p] vector with the data
# y : a [n x 1] vector with the outcomes (0 or 1)
#
# OUTPUT
# beta : the estimated regression coefficients
# vcov : the variane-covariance matrix
# dev : -2logLkh (deviance)
################################################################################
# Author : Thomas Debray
# Version : 29 nov 2011
################################################################################
mle.logreg = function(x,y)
{
# Define the negative log likelihood function
logl <- function(theta,x,y){
y <- y
x <- as.matrix(x)
beta <- theta[1:ncol(x)]
# Use the log-likelihood of the Bernouilli distribution, where p is
# defined as the logistic transformation of a linear combination
# of predictors, according to logit(p)=(x%*%beta)
loglik <- sum(-y*log(1 + exp(-(x%*%beta))) - (1-y)*log(1 + exp(x%*%beta)))
return(-loglik)
}
# Prepare the data
x = as.matrix(x)
y = as.matrix(y)
# Define initial values for the parameters
theta.start = rep(0,(dim(x)[2]+1))
names(theta.start) = c("Intercept",colnames(x))
# Allow Intercept Estimation
X = cbind(1,x)
colnames(X)[1] = "Intercept"
# Calculate the maximum likelihood
mle = optim(theta.start,logl,x=X,y=y,hessian=F)
# Obtain regression coefficients
beta = mle$par
# Calculate the Information matrix
# The variance of a Bernouilli distribution is given by p(1-p)
p = 1/(1+exp(-X%*%beta))
V = array(0,dim=c(dim(X)[1],dim(X)[1]))
diag(V) = p*(1-p)
IB = t(X)%*%V%*%X
# Return estimates
out = list(beta=beta,vcov=solve(IB),dev=2*mle$value)
}
################################################################################
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).
