Logistic Regression & Factors in R

April 24, 2011
By

(This article was first published on mickeymousemodels, and kindly contributed to R-bloggers)

Factors are R's enumerated type. Suppose you define the variable cities -- a vector of strings -- whose possible values are "New York," "Paris," "London" and "Beijing." Instead of representing each city as a string of characters, you might prefer to define an encoding, eg {1="New York", 2="Paris", 3="London", 4="Beijing"}, and store a vector of integers. Behold the factor:

cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2))

cities # Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing"

cities.factorized <- as.factor(cities)

cities.factorized # Returns essentially the same as above

as.integer(cities.factorized) # Returns 3 3 3 4 2 1 1

Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here's an example of a logistic regression made simple using factors:

n <- 1000

df <- data.frame(x = runif(n, 0, 5))

ProbabilityOfYGivenX <- function(x) {

return((sin((x - 3) ^ 2) + 1.5) / 3)

}
df$y <- runif(n) < ProbabilityOfYGivenX(df$x)

df <- df[order(df$x), ] dev.new(height=6, width=8) plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2) points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5) legend("topleft", "Pr[Y=1|X]", bty="n", lty=1) legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5) savePlot("logit_data.png") The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function: # Define a factor that chops x into 20 disjoint buckets df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))

# By construction we have (n / 20) = 50 observations in each bucket

summary(df$f) logit.model <- glm(y ~ f, family=binomial("logit"), data=df) df$y_hat <- predict(logit.model, newdata=df, type="response")

dev.new(height=6, width=8)

plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)

points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)

lines(df$x, df$y_hat, col="darkred", lty=2, lwd=1.5)

legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)

legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)

legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred")

savePlot("logit_data_and_estimate.png")

Et voilà. Nothing too fancy -- this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df$y, df$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that's a story for another time.