Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Factors are R’s enumerated type. Suppose you define the variable cities — a vector of strings — whose possible values are “New York,” “Paris,” “London” and “Beijing.” Instead of representing each city as a string of characters, you might prefer to define an encoding, eg {1=”New York”, 2=”Paris”, 3=”London”, 4=”Beijing”}, and store a vector of integers. Behold the factor:

```cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2)) cities # Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing" cities.factorized <- as.factor(cities) cities.factorized # Returns essentially the same as above as.integer(cities.factorized) # Returns 3 3 3 4 2 1 1 ```
Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here's an example of a logistic regression made simple using factors:

```n <- 1000 df <- data.frame(x = runif(n, 0, 5)) ProbabilityOfYGivenX <- function(x) {   return((sin((x - 3) ^ 2) + 1.5) / 3) } df\$y <- runif(n) < ProbabilityOfYGivenX(df\$x) df <- df[order(df\$x), ] dev.new(height=6, width=8) plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2) points(df\$x, df\$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5) legend("topleft", "Pr[Y=1|X]", bty="n", lty=1) legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5) savePlot("logit_data.png")```

The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function:

```# Define a factor that chops x into 20 disjoint buckets df\$f <- cut(df\$x, breaks=c(-Inf, quantile(df\$x, probs=((1:19)/20)), Inf)) # By construction we have (n / 20) = 50 observations in each bucket summary(df\$f) logit.model <- glm(y ~ f, family=binomial("logit"), data=df) df\$y_hat <- predict(logit.model, newdata=df, type="response") dev.new(height=6, width=8) plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2) points(df\$x, df\$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5) lines(df\$x, df\$y_hat, col="darkred", lty=2, lwd=1.5) legend("topleft", "Pr[Y=1|X]", bty="n", lty=1) legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5) legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred") savePlot("logit_data_and_estimate.png")```

Et voilà. Nothing too fancy -- this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df\$y, df\$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that's a story for another time.