# Encoding categorical variables: one-hot and beyond

April 15, 2017
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## (or: how to correctly use `xgboost` from `R`)

`R` has "one-hot" encoding hidden in most of its modeling paths. Asking an `R` user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

``````dTrain <-  data.frame(x= c('a','b','b', 'c'),
y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))``````
``````##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
##          1          2          3          4
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:   -0.5
## F-statistic:   0.5 on 2 and 1 DF,  p-value: 0.7071``````

Much of the encoding in `R` is essentially based on "contrasts" implemented in `stats::model.matrix()` Note: do not use `base::data.matrix()` or use hashing before modeling- you might get away with them (especially with tree based methods), but they are not in general good technique as we show below:

``data.matrix(dTrain)``
``````##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 2 1
## [4,] 3 2``````

`stats::model.matrix()` does not store its one-hot plan in a convenient manner (it can be inferred by pulling the "`contrasts`" attribute plus examining the column names of the first encoding, but the levels identified are not conveniently represented). When directly applying `stats::model.matrix()` you can not safely assume the same formula applied to two different data sets (say train and application or test) are using the same encoding! We demonstrate this below:

``````dTrain <- data.frame(x= c('a','b','c'),
stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~x, dTrain)
print(encTrain)``````
``````##   (Intercept) xb xc
## 1           1  0  0
## 2           1  1  0
## 3           1  0  1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")\$x
## [1] "contr.treatment"``````
``````dTest <- data.frame(x= c('b','c'),
stringsAsFactors = FALSE)
stats::model.matrix(~x, dTest)``````
``````##   (Intercept) xc
## 1           1  0
## 2           1  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")\$x
## [1] "contr.treatment"``````

The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many `R` users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model. `Python` `scikit-learn` users coming to `R` often ask "where is the one-hot encoder" (as it isn’t discussed as much in `R` as it is in `scikit-learn`) and even supply a number of (low quality) one-off packages "porting one-hot encoding to `R`."

The main place an `R` user needs a proper encoder (and that is an encoder that stores its encoding plan in a conveniently re-usable form, which many of the "one-off ported from `Python`" packages actually fail to do) is when using a machine learning implementation that isn’t completely `R`-centric. One such system is `xgboost` which requires (as is typical of machine learning in `scikit-learn`) data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a `data.frame`). This requires explicit conversion on the part of the `R` user, and many `R` users get it wrong (fail to store the encoding plan somewhere). To make this concrete let’s work a simple example.

Let’s try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms)- just plugging the obvious variable into `xgboost`. As we said: `xgboost` requires a numeric matrix for its input, so unlike many `R` modeling methods we must manage the data encoding ourselves (instead of leaving that to `R` which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).

We bring in our data:

``````# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)``````
``````## 'data.frame':    891 obs. of  12 variables:
##  \$ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  \$ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  \$ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  \$ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  \$ Sex        : chr  "male" "female" "female" "female" ...
##  \$ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  \$ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  \$ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  \$ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  \$ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  \$ Cabin      : chr  "" "C85" "" "C123" ...
##  \$ Embarked   : chr  "S" "C" "S" "S" ...``````
``summary(titanic_train)``
``````##   PassengerId       Survived          Pclass          Name
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000
##
##      Sex                 Age            SibSp           Parch
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000
##                     NA's   :177
##     Ticket               Fare           Cabin             Embarked
##  Length:891         Min.   :  0.00   Length:891         Length:891
##  Class :character   1st Qu.:  7.91   Class :character   Class :character
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character
##                     Mean   : 32.20
##                     3rd Qu.: 31.00
##                     Max.   :512.33
## ``````
``````outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))

dTrain <- titanic_train``````

And design our cross-validated modeling experiment:

``````library("xgboost")
library("sigr")
library("WVPlots")
library("vtreat")

set.seed(4623762)
crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain),
10,
dTrain,
dTrain[[outcome]])

evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) {
preds <- rep(NA_real_, nrow(xMatrix))
for(ci in crossValPlan) {
nrounds <- 1000
cv <- xgb.cv(data= xMatrix[ci\$train, ],
label= outcomeV[ci\$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0,
nfold= 5)
#nrounds  <- which.min(cv\$evaluation_log\$test_rmse_mean) # regression
nrounds  <- which.min(cv\$evaluation_log\$test_error_mean) # classification
model <- xgboost(data= xMatrix[ci\$train, ],
label= outcomeV[ci\$train],
objective= 'binary:logistic',
nrounds= nrounds,
verbose= 0)
preds[ci\$app] <-  predict(model, xMatrix[ci\$app, ])
}
preds
}``````

Our preferred way to encode data is to use the `vtreat` package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).

``````set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
minFraction= 0,
verbose=FALSE)
# restrict to common varaibles types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan\$scoreFrame
newvars <- sf\$varName[sf\$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
varRestriction = newvars))
print(dim(trainVtreat))``````
``## [1] 891  20``
``print(colnames(trainVtreat))``
``````##  [1] "Pclass_lev_x.1"   "Pclass_lev_x.2"   "Pclass_lev_x.3"
##  [4] "Sex_lev_x.female" "Sex_lev_x.male"   "Age_clean"
## [10] "Parch_lev_x.1"    "Parch_lev_x.2"    "Parch_lev_x.3"
## [13] "Parch_lev_x.4"    "Parch_lev_x.5"    "Parch_lev_x.6"
## [16] "Fare_clean"       "Embarked_lev_x."  "Embarked_lev_x.C"
## [19] "Embarked_lev_x.Q" "Embarked_lev_x.S"``````
``````dTrain\$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predVtreatZ',
outcome, target)``````
``## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.86, s.d.=0.017, p<1e-05)."``
``````WVPlots::ROCPlot(dTrain,
'predVtreatZ',
outcome, target,
'vtreat encoder performance')``````

Model matrix can perform similar encoding when we only have a single data set.

``````set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model matrix skips rows with NAs by default,
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action='na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f),
dTrain)
# note model.matrix does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training
# data.
options(na.action=oldOpt)
print(dim(trainModelMatrix))``````
``## [1] 891  16``
``print(colnames(trainModelMatrix))``
``````##  [1] "Pclass1"   "Pclass2"   "Pclass3"   "Sexmale"   "Age"
##  [6] "SibSp"     "Parch1"    "Parch2"    "Parch3"    "Parch4"
## [11] "Parch5"    "Parch6"    "Fare"      "EmbarkedC" "EmbarkedQ"
## [16] "EmbarkedS"``````
``````dTrain\$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predModelMatrix',
outcome, target)``````
``## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.87, s.d.=0.019, p<1e-05)."``
``````WVPlots::ROCPlot(dTrain,
'predModelMatrix',
outcome, target,
'model.matrix encoder performance')``````

The `caret` package also supplies an encoding functionality properly split between training (`caret::dummyVars()`) and application (called `predict()`).

``library("caret")``
``````## Loading required package: lattice

``````set.seed(4623762)
f <- paste('~', paste(vars, collapse = ' + '))
encoder <- caret::dummyVars(as.formula(f), dTrain)
trainCaret <- predict(encoder, dTrain)
print(dim(trainCaret))``````
``## [1] 891  19``
``print(colnames(trainCaret))``
``````##  [1] "Pclass.1"  "Pclass.2"  "Pclass.3"  "Sexfemale" "Sexmale"
##  [6] "Age"       "SibSp"     "Parch.0"   "Parch.1"   "Parch.2"
## [11] "Parch.3"   "Parch.4"   "Parch.5"   "Parch.6"   "Fare"
## [16] "Embarked"  "EmbarkedC" "EmbarkedQ" "EmbarkedS"``````
``````dTrain\$predCaret <- evaluateModelingProcedure(trainCaret,
dTrain[[outcome]]==target,
crossValPlan)
sigr::permTestAUC(dTrain,
'predCaret',
outcome, target)``````
``## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.85, s.d.=0.017, p<1e-05)."``
``````WVPlots::ROCPlot(dTrain,
'predCaret',
outcome, target,
'caret encoder performance')``````

We usually forget to teach `vtreat::designTreatmentsZ()` as it is often dominated by the more powerful y-aware methods `vtreat` supplies (though not for this simple example). `vtreat::designTreatmentsZ` has a number of useful properties:

• Does not look at the outcome values, so does not require extra care in cross-validation.
• Saves its encoding, so can be used correctly on new data.

The above two properties are shared with `caret::dummyVars()`. Additional features of `vtreat::designTreatmentsZ` (that differ from `caret::dummyVars()`‘s choices) include:

• No `NA` values are passed through by `vtreat::prepare()`.
• `NA` presence is added as an additional informative column.
• A few derived columns (such as pooling of rare levels are made available).
• Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.
• Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by `vtreat::prepare()` (`caret::dummyVars()` considers this an error).

The `vtreat` y-aware methods include proper nested modeling and y-aware dimension reduction.

`vtreat` is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high cardinality categorical variables can have a huge positive impact in modeling performance. In many cases `vtreat` works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach `vtreat` sore of "bimodally" in both a "fire and forget" mode and a "all the details on deck" mode (suitable for formal citation). Either way `vtreat` can make your modeling procedures stronger, more reliable, and easier.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.