Predicting Dichotomous Outcomes I

[This article was first published on Kevin Davenport » R, and kindly contributed to R-bloggers.]

We are trying to predict a dependent dichotomous variable (male/female, yes/no, like/dislike, etc.) with independent "predictor" variables. Let's say we want to determine whether or not an employee will quit based on the percentage of their tenure spent traveling. We assemble the data from HR and erroneously employ simple linear regression to model the relationship, a mistake that is best shown graphically below:

[Figure: logit-lm — a simple linear regression line fit to the binary employed/quit outcome]

One of the obvious problems is that the simple regression line can generate predictions outside the range of 0 and 1 (employed and quit). Another issue is that the model assumes a constant marginal effect: that moving from 15% to 16% time spent traveling changes the probability of quitting by the same amount as moving from 25% to 26%. If we were to examine a residuals plot, we would also notice the classic pattern of heteroscedasticity.

To better understand the economic law of diminishing marginal utility, consider the consumption of your favorite dessert. After a certain point, each additional unit of consumption will not yield an equal increase in satisfaction or satedness. In this example, perhaps 30% travel time is the breaking point for an employee, and anything past that won't add much predictive value to the model.

A logit model addresses these issues by fitting a non-linear function to our data:

[Figure: a sigmoid (logistic) curve fit to the same quit/travel data]

The second plot displays a sigmoid curve that respects the bounds of the dependent dichotomous variable (0 to 1) and addresses the heteroscedasticity. The plot also shows different rates of change at the high and low ends of travel-time density.
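The sigmoid shape itself is easy to reproduce in R, since the logistic cdf is built in as plogis(). A minimal sketch (the axis labels here are illustrative, not taken from the original plot):

```r
# plogis() is the logistic cdf: it maps any real-valued linear
# predictor into the open interval (0, 1), giving the S-shaped
# curve described above.
z <- seq(-6, 6, by = 0.1)
plot(z, plogis(z), type = "l",
     xlab = "linear predictor",
     ylab = "predicted probability",
     main = "Logistic (sigmoid) curve")
```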

The logit model uses a function (f) to transform the linear model to a non-linear model:

$$! \hat{y}=\alpha+\beta x$$
$$! \hat{y}=f(\alpha+\beta x)$$
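For the logit model, f is the logistic cdf, so the fitted probability is bounded between 0 and 1:

$$! f(z)=\frac{1}{1+e^{-z}}$$
$$! \hat{y}=\frac{1}{1+e^{-(\alpha+\beta x)}}$$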

A logit function models the cumulative distribution function (cdf) of the logistic distribution. The cdf describes "the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x." It is important to note "given a probability distribution," as it reinforces the importance of determining a dataset's distribution before selecting tools for any analysis. Novice statistical practitioners often mistakenly apply statistical models that assume normality to non-normal data.

Below are the built-in distribution families for the glm() function in R.

glm(formula, family=familytype(link=linkfunction), data=)

Family            Default Link Function
binomial          (link = "logit")
gaussian          (link = "identity")
Gamma             (link = "inverse")
inverse.gaussian  (link = "1/mu^2")
poisson           (link = "log")
quasi             (link = "identity", variance = "constant")
quasibinomial     (link = "logit")
quasipoisson      (link = "log")
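Putting this together, the travel-vs-quit example can be fit with a binomial glm. The data below are simulated purely for illustration, and the variable names (travel_pct, quit) are hypothetical, not from the original post:

```r
# Simulate a hypothetical HR dataset: % of tenure spent traveling
# and a 0/1 indicator for whether the employee quit.
set.seed(42)
travel_pct <- runif(200, min = 0, max = 50)
p_quit     <- plogis(-4 + 0.15 * travel_pct)        # true underlying probability
quit       <- rbinom(200, size = 1, prob = p_quit)  # observed dichotomous outcome

# Fit the logit model: family = binomial with the default logit link.
fit <- glm(quit ~ travel_pct, family = binomial(link = "logit"))
summary(fit)

# Unlike the linear model, predicted probabilities stay within [0, 1]:
predict(fit,
        newdata = data.frame(travel_pct = c(10, 25, 40)),
        type = "response")
```

Note that type = "response" returns probabilities on the 0-to-1 scale, whereas the default type = "link" returns values of the linear predictor.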
