**mickeymousemodels**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Factors are R’s enumerated type. Suppose you define the variable cities — a vector of strings — whose possible values are “New York,” “Paris,” “London” and “Beijing.” Instead of representing each city as a string of characters, you might prefer to define an encoding, eg {1=”New York”, 2=”Paris”, 3=”London”, 4=”Beijing”}, and store a vector of integers. Behold the factor:

`cities <- c(rep("New York", 3), "Paris", "London", rep("Beijing", 2))`

```
```cities # Returns "New York" "New York" "New York" "Paris" "London" "Beijing" "Beijing"

cities.factorized <- as.factor(cities)

cities.factorized # Returns essentially the same as above

`as.integer(cities.factorized) # Returns 3 3 3 4 2 1 1 `

Internally, R is using those integers to represent our cities. So what? Until recently I thought factors were useless, but I changed my mind when I realized that a single factor can hold a large set of disjoint indicator variables. Here’s an example of a logistic regression made simple using factors:

`n <- 1000`

```
```df <- data.frame(x = runif(n, 0, 5))

ProbabilityOfYGivenX <- function(x) {

return((sin((x - 3) ^ 2) + 1.5) / 3)

}

df$y <- runif(n) < ProbabilityOfYGivenX(df$x)

df <- df[order(df$x), ]

dev.new(height=6, width=8)

plot(ProbabilityOfYGivenX, main="Some Hypothetical Data", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)

points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)

legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)

legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)

`savePlot("logit_data.png")`

The fun part is trying to recover Pr[Y=1|X], given only the observed data. This can be somewhat tricky, since Pr[Y|X] curves all over the place. One straightforward solution is to chop x up into a bunch of indicator variables: one that is true iff x falls below its 5th percentile, another iff x lies between the 5th and 10th percentiles, and so on. This is pleasantly easy using factors and the cut function:

`# Define a factor that chops x into 20 disjoint buckets`

```
```df$f <- cut(df$x, breaks=c(-Inf, quantile(df$x, probs=((1:19)/20)), Inf))

# By construction we have (n / 20) = 50 observations in each bucket

summary(df$f)

logit.model <- glm(y ~ f, family=binomial("logit"), data=df)

df$y_hat <- predict(logit.model, newdata=df, type="response")

dev.new(height=6, width=8)

plot(ProbabilityOfYGivenX, main="Fun with Factors", xlab="x", ylab="y", type="l", xlim=c(0, 5), ylim=c(-0.1, 1.1), lwd=2)

points(df$x, df$y, col=rgb(0, 100, 0, 20, maxColorValue=255), pch=16, cex=1.5)

lines(df$x, df$y_hat, col="darkred", lty=2, lwd=1.5)

legend("topleft", "Pr[Y=1|X]", bty="n", lty=1)

legend("topright", "Observed Data", bty="n", pch=16, col=rgb(0, 100, 0, 100, maxColorValue=255), pt.cex=1.5)

legend("top", "Estimate of Pr[Y=1|X]", bty="n", lty=2, col="darkred")

`savePlot("logit_data_and_estimate.png")`

Et voilà. Nothing too fancy — this is essentially the same as calculating average values of y for each bucket of x, which you can do directly with by(df$y, df$f, mean). Logistic regression gives you the option of getting much fancier: you could, for example, regress y on some combination of continuous variables and factors, but that’s a story for another time.

**leave a comment**for the author, please follow the link and comment on their blog:

**mickeymousemodels**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.