# Modeling Trick: Impact Coding of Categorical Variables with Many Levels

July 23, 2012
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some of the advantages of regression — namely, the model’s explicit estimates of variables’ explanatory value, and explicit insight into and control of variable to variable dependence.

Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.For this example, we will use a data set that contains all the police incidents (except homicide and manslaughter) that were reported in San Francisco for the month of June, 2012. The link to the most recent past-month’s incident data is here. The data set contains the date, time of day, and day of week that each incident was reported, along with the incident’s category, a brief description, and location (as police district, lat-long coordinates, and address to the nearest block).

Supposed we are interested in predicting the likelihood of a given incident being a violent crime, as a function of time, day, and location. We will define violent crimes ourselves, as assault, robbery, rape, kidnapping, and purse snatching. Here is the R code:

# June 2012
as.is=T)

# violent - LARCENY/THEFT:
#           -- Descript = "GRAND THEFT PURSESNATCH"
#                         "ATTEMPTED GRAND THEFT PURSESNATCH"
#           ASSAULT
#           ROBBERY
#           SEX OFFENCES, FORCIBLE
#           KIDNAPPING

# Create violent indicator
incidents$violent = with(incidents, Category %in% c("ASSAULT", "ROBBERY", "SEX OFFENSES, FORCIBLE", "KIDNAPPING") | Descript %in% c("GRAND THEFT PURSESNATCH", "ATTEMPTED GRAND THEFT PURSESNATCH")) table(incidents$violent)/length(incidents$violent) # FALSE TRUE # 0.8733564 0.1266436  We see that the month’s average rate of violent incidents (excluding homicide/manslaughter) was about 13%. Is there a relationship between violence and the day of week? modelDay = glm("violent ~ DayOfWeek", data=incidents, family=binomial(link="logit")) summary(modelDay)  We see that the relationship is not strong at all. Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.89213 0.07762 -24.377 <2e-16 *** DayOfWeekMonday -0.02830 0.11874 -0.238 0.8116 DayOfWeekSaturday -0.02067 0.10941 -0.189 0.8502 DayOfWeekSunday 0.19863 0.11116 1.787 0.0739 . DayOfWeekThursday -0.12835 0.12112 -1.060 0.2893 DayOfWeekTuesday -0.16031 0.12359 -1.297 0.1946 DayOfWeekWednesday -0.20421 0.12055 -1.694 0.0903 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 6588.4 on 8669 degrees of freedom Residual deviance: 6573.7 on 8663 degrees of freedom AIC: 6587.7  What about time of day? The dataset records the time on the minute resolution; we will just look at the hourly resolution. We will call this model modelHr. incidents$TimeHr = substr(incidents$Time, 1,2) modelHr = glm("violent ~ TimeHr", data=incidents, family=binomial(link="logit")) summary(modelHr)  A little stronger, but still not very strong. The rate of violent incidents increases significantly (relative to midnight) at about 1 am; the rate is significantly lower than at midnight around late morning and noon. Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.69669 0.13191 -12.862 < 2e-16 *** TimeHr01 0.58410 0.19004 3.074 0.00212 ** TimeHr02 0.36289 0.21475 1.690 0.09106 . TimeHr03 0.48928 0.24939 1.962 0.04977 * TimeHr04 0.32365 0.28873 1.121 0.26232 TimeHr05 -0.03791 0.30957 -0.122 0.90254 TimeHr06 -0.55460 0.35763 -1.551 0.12096 TimeHr07 -0.39458 0.26627 -1.482 0.13837 TimeHr08 -0.31428 0.22528 -1.395 0.16299 TimeHr09 -0.44337 0.23940 -1.852 0.06403 . TimeHr10 -0.32360 0.21912 -1.477 0.13972 TimeHr11 -0.61794 0.22513 -2.745 0.00605 ** TimeHr12 -0.54447 0.20232 -2.691 0.00712 ** TimeHr13 -0.39928 0.20580 -1.940 0.05237 . TimeHr14 -0.46279 0.20539 -2.253 0.02424 * TimeHr15 -0.33861 0.19483 -1.738 0.08222 . TimeHr16 -0.39278 0.18848 -2.084 0.03717 * TimeHr17 -0.45359 0.18741 -2.420 0.01550 * TimeHr18 -0.46279 0.18971 -2.439 0.01471 * TimeHr19 -0.40045 0.19352 -2.069 0.03852 * TimeHr20 -0.19632 0.19043 -1.031 0.30258 TimeHr21 0.03771 0.19051 0.198 0.84309 TimeHr22 0.05408 0.17827 0.303 0.76162 TimeHr23 -0.37506 0.20095 -1.866 0.06197 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 6588.4 on 8669 degrees of freedom Residual deviance: 6493.6 on 8646 degrees of freedom AIC: 6541.6  Here's the graph. It's interesting to see that the rate of incidents (violent or otherwise) peaks at about 5 pm. library(ggplot2) ggplot(incidents, aes(x=TimeHr, fill=violent)) + geom_bar()  Now we add location. A visual inspection tells us that there are indeed violent incident "hot-spots", even relative to regions where incidents in general are more frequent. # plot location (on a map) # specifically this map: http://www.webgis.com/MAPS/ca/lulcgeo/san_francisco.zip # make sure that you unzip the file into your current R working directory library(maptools) library(gpclib) gpclibPermit() sfo = readShapeSpatial("san_francisco/san_francisco.shp") # the map is actually of the greater San Francisco Bay Area. # Let's try to zoom in just on SF proper long_min = min(incidents$X)
long_max = max(incidents$X) lat_min = min(incidents$Y)
lat_max = max(incidents$Y) plot(sfo, xlim=c(long_min, long_max), ylim=c(lat_min, lat_max)) points(incidents[!incidents$violent, c("X", "Y")], pch=20, col="blue")
points(incidents[incidents$violent, c("X", "Y")], pch= 20, col="red")  The bad news is that location (at the block level) is a variable with a lot of possible values.  head(incidents$Address)
# [1] "0 Block of COLLEGE TR"     "1400 Block of KIRKWOOD CT" "900 Block of POLK ST"
# [4] "1200 Block of LAGUNA ST"   "2400 Block of 47TH AV"     "3900 Block of ALEMANY BL"
> length(unique(incidents$Address)) [1] 3915 # out of 8670 rows  We can try using a coarser variable, like police district, but we would see that the model is still not very predictive modelTimeDistrict = glm("violent ~ TimeHr + PdDistrict", data=incidents, family=binomial(link="logit")) 1 - modelTimeDistrict$deviance/modelTimeDistrict$null.deviance [1] 0.0228061 In fact, using time of day and district only reduces the deviance from the "null model" -- that is, simply predicting the global rate of violent incidents -- by 2%. There is always the danger, of course, that using block-level data will lead to overfit, but let's give it a try, anyway. To do that, we replace the categorical variable with a submodel that returns the probability of a violent incident, conditional on each category value. (in this case, the city block). In our case, it's possible that there are city blocks that had no reported incidents -- this month. That may change next month. We guard against this contingency by smoothing novel levels to the grand average. We choose to call this trick impact coding because it summarizes the impact of each category value on the outcome. (This is not standard terminology.) # return a model of the conditional probability # of dependent variable (depvar) by level # assumes outcome is logical and not null impactModel = function(xcol, depvar) { n <- length(depvar) p <- sum(depvar)/n # duplicate output for NA (average NA towards grand uniform average) x <- c(xcol,xcol) y <- c(depvar, depvar) x[(1+n):(2*n)] <- NA levelcounts <- table(x, y, useNA="always") condprobmodel <- (levelcounts[,2]+p)/(levelcounts[,1]+levelcounts[,2]+1.0) # apply model example: applyImpactModel(condprobmodel,data[,varname]) condprobmodel } # apply model to column to essentially return condprobmodel[rawx] # both NA's and new levels are smoothed to original grand average applyImpactModel <- function(condprobmodel, xcol) { naval <- condprobmodel[is.na(names(condprobmodel))] dim <- length(xcol) condprobvec <- numeric(dim) + naval for(nm in names(condprobmodel)) { if(!is.na(nm)) { condprobvec[xcol==nm] <- condprobmodel[nm] } } condprobvec } # convert Address variable to its impact model outcome impact_addr_mod = impactModel(incidents$Address, incidents$violent) incidents$impactAddr = applyImpactModel(impact_addr_mod, incidents$Address) # check the performance of the impact model -- much more predictive than the district model positive <- incidents$violent
modelDeviance <- 2*(sum(-log(incidents$impactAddr[positive]))+ sum(-log(1-incidents$impactAddr[!positive])))
[1] 3794.362
p <- mean(positive)
nullDeviance <- 2*(sum(positive)*(-log(p)) + sum(!positive)*(-log(1-p)))

1 -  modelDeviance/nullDeviance
# [1] 0.4240879


The resulting impact model alone explains 42% of the null deviance. Can we explain a little more by adding back time? We will call this model modelHrAddr.

modelHrAddr = glm("violent ~  impactAddr + TimeHr", data=incidents, family=binomial(link="logit"))


Location is the most predictive variable; but even after controlling for location, the proportion of violent incidents peaks in the early hours of the morning: from about 1 am to 4 am. The resulting model explains about 48% of the null deviance. Not great, but not bad, either.

 Call:
glm(formula = "violent ~  impactAddr + TimeHr", family = binomial(link = "logit"),
data = incidents)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.8946  -0.2529  -0.2018  -0.1629   2.9599

Coefficients:
Estimate Std. Error z value   Pr(>|z|)
(Intercept) -4.53184    0.21611 -20.970   <2e-16 ***
impactAddr  11.85131    0.29659  39.959   <2e-16 ***
TimeHr01     0.58929    0.28372   2.077   0.0378 *
TimeHr02     0.74714    0.32154   2.324   0.0201 *
TimeHr03     0.76336    0.36599   2.086   0.0370 *
TimeHr04     0.63613    0.39868   1.596   0.1106
TimeHr05     0.40867    0.46908   0.871   0.3836
TimeHr06    -0.07554    0.52383  -0.144   0.8853
TimeHr07     0.02188    0.38592   0.057   0.9548
TimeHr08     0.17270    0.33201   0.520   0.6030
TimeHr09     0.00157    0.33696   0.005   0.9963
TimeHr10     0.20057    0.31156   0.644   0.5197
TimeHr11    -0.41266    0.32594  -1.266   0.2055
TimeHr12    -0.28582    0.29254  -0.977   0.3286
TimeHr13    -0.30503    0.29004  -1.052   0.2929
TimeHr14    -0.10243    0.29074  -0.352   0.7246
TimeHr15     0.02639    0.27241   0.097   0.9228
TimeHr16     0.04850    0.26910   0.180   0.8570
TimeHr17     0.02524    0.27319   0.092   0.9264
TimeHr18    -0.17420    0.28018  -0.622   0.5341
TimeHr19     0.10241    0.27336   0.375   0.7079
TimeHr20     0.21612    0.28104   0.769   0.4419
TimeHr21     0.35495    0.28354   1.252   0.2106
TimeHr22     0.32642    0.26853   1.216   0.2242
TimeHr23    -0.07702    0.29624  -0.260   0.7949
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 6588.4  on 8669  degrees of freedom
Residual deviance: 3428.8  on 8645  degrees of freedom
AIC: 3478.8

> 1 - modelHrAddr$deviance/modelHrAddr$null.deviance
[1] 0.4795713

When we compare modelHr and modeHrAddr, we see that, overall, the time coefficients of modelHrAddr are less significant than those of modelHr. This indicates some correlation between time and location. This is where logistic regression's handling of dependence is useful -- a Naive Bayes model that used time and location would tend to overestimate the proportion of violent incidents at hotspots. The logistic regression model can compensate for these dependencies, and provides more accurate estimates.

And even though the impactAddr variable is less transparent than the corresponding categorical variable, the effect of time is clearer, since we have pulled out the effect of location. This is how we can make statements like "time has the following impact on the proportion of violent incidents, even after controlling for address," even though address is a very large categorical variable, which is itself subject to possible overfitting -- as well as being too large for typical logistic regression code.

Thus, while impact coding has limitations, it lets us do more than we would otherwise be able to do, especially with other categorical variables.

Related posts:

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...