Logistic Regression in R – Part One

[This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Please note that an earlier version of this post had to be retracted because it contained some content which was generated at work. I have since chosen to rewrite the document in a series of posts. Please recognize that this may take some time. Apologies for any inconvenience.

Logistic regression is used to analyze the relationship between a dichotomous dependent variable and one or more categorical or continuous independent variables. It specifies the likelihood of the response variable as a function of various predictors. The model expressed as log(odds) = \beta_0 + \beta_1*x_1 + ... + \beta_n*x_n , where \beta refers to the parameters and x_i represents the independent variables. The log(odds), or log of the odds ratio, is defined as ln[\frac{p}{1-p}]. It expresses the natural logarithm of the ratio between the probability that an event will occur, p(Y=1), to the probability that an event will not occur, p(Y=0).

The models estimates, \beta, express the relationship between the independent and dependent variable on a log-odds scale. A coefficient of 0.020 would indicate that a one unit increase in \beta_i is associated with a log-odds increase in the occurce of Y by 0.020. To get a clearer understanding of the constant effect of a predictor on the likelihood that an outcome will occur, odds-ratios can be calculated. This can be expressed as odds(Y) = \exp(\beta_0 + \beta_1*x_1 + ... + \beta_n*x_n) , which is the exponentiate of the model. Alongside the odd-ratio, it’s often worth calculating predicted probabilities of Y at specific values of key predictors. This is done through p = \frac{1}{1 + \exp^{-z}} where z refers to the log(odds) regression equation.

Using the GermanCredit dataset in the Caret package, we will construct a logistic regression model to estimate the likelihood of a consumer being a good loan applicant based on a number of predictor variables.

Train <- createDataPartition(GermanCredit$Class, p=0.6, list=FALSE)
training <- GermanCredit[ Train, ]
testing <- GermanCredit[ -Train, ]
mod_fit_one <- glm(Class ~ Age + ForeignWorker + Property.RealEstate + Housing.Own +
CreditHistory.Critical, data=training, family="binomial")
summary(mod_fit_one) # estimates 
exp(coef(mod_fit$finalModel)) # odds ratios
predict(mod_fit_one, newdata=testing, type="response") # predicted probabilities

Great, we’re all done, right? Not just yet. There are some critical questions that still remain. Is the model any good? How well does the model fit the data? Which predictors are most important? Are the predictions accurate? In the next few posts, I’ll provide an overview of how to evaluate logistic regression models in R.

To leave a comment for the author, please follow the link and comment on their blog: Mathew Analytics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)