Logistic Regression & Factors in R

April 24, 2011
By

(This article was first published on mickeymousemodels, and kindly contributed to R-bloggers)

Factors are R's enumerated type. Suppose you define the variable cities -- a vector of strings -- whose possible values are "New York," "Paris," "London" and "Beijing." Instead of representing each city as a string of characters, you might prefer to define an encoding, eg {1="New York", 2="Paris", 3="London", 4="Beijing"}, and store a vector of integers. Behold the factor:

cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2))

cities # Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing"

cities.factorized <- as.factor(cities)

cities.factorized # Returns essentially the same as above

as.integer(cities.factorized) # Returns 3 3 3 4 2 1 1

Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here's an example of a logistic regression made simple using factors:

n <- 1000

df <- data.frame(x = runif(n, 0, 5))

ProbabilityOfYGivenX <- function(x) {

  return((sin((x - 3) ^ 2) + 1.5) / 3)

}
df$y <- runif(n) < ProbabilityOfYGivenX(df$x)

df <- df[order(df$x), ]

dev.new(height=6, width=8)

plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)

points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)

legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)

legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)

savePlot("logit_data.png")


The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function:

# Define a factor that chops x into 20 disjoint buckets

df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))

# By construction we have (n / 20) = 50 observations in each bucket

summary(df$f)

logit.model <- glm(y ~ f, family=binomial("logit"), data=df)

df$y_hat <- predict(logit.model, newdata=df, type="response")

dev.new(height=6, width=8)

plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)

points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)

lines(df$x, df$y_hat, col="darkred", lty=2, lwd=1.5)

legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)

legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)

legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred")

savePlot("logit_data_and_estimate.png")


Et voilà. Nothing too fancy -- this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df$y, df$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that's a story for another time.

To leave a comment for the author, please follow the link and comment on his blog: mickeymousemodels.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.