Encoding categorical variables: one-hot and beyond
(or: how to correctly use xgboost from R)
R has “one-hot” encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.
For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
dTrain <- data.frame(x= c('a','b','b', 'c'), y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
## 
## Call:
## lm(formula = y ~ x, data = dTrain)
## 
## Residuals:
##          1          2          3          4 
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:  0.5,   Adjusted R-squared:  -0.5 
## F-statistic: 0.5 on 2 and 1 DF,  p-value: 0.7071
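To see the hidden encoding explicitly, we can ask for the design matrix that lm() built from the same formula (a quick illustrative check of our own, not from the original write-up):

# the design matrix behind lm(): 'a' is the reference level,
# 'b' and 'c' each get their own indicator column
stats::model.matrix(y ~ x, data = dTrain)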
Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix(). Note: do not use base::data.matrix() or hashing before modeling. You might get away with them (especially with tree-based methods), but they are not, in general, good technique, as we show below:
data.matrix(dTrain)
##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 2 1
## [4,] 3 2
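Notice that data.matrix() silently replaced the string values 'a', 'b', 'c' with the arbitrary integer codes 1, 2, 3, imposing an order and equal spacing the data never had. A small sketch of our own (not from the original article) of how this distorts a downstream model:

# fitting on the integer codes forces a single slope across the arbitrary
# codes 1, 2, 3 instead of estimating separate effects per level
dCoded <- as.data.frame(data.matrix(dTrain))
summary(lm(y ~ x, data = dCoded))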
stats::model.matrix() does not store its one-hot plan in a convenient manner (it can be inferred by pulling the "contrasts" attribute plus examining the column names of the first encoding, but the levels identified are not conveniently represented). When directly applying stats::model.matrix() you cannot safely assume that the same formula applied to two different data sets (say train and application or test) is using the same encoding! We demonstrate this below:
dTrain <- data.frame(x= c('a','b','c'), stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~x, dTrain)
print(encTrain)
##   (Intercept) xb xc
## 1           1  0  0
## 2           1  1  0
## 3           1  0  1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
dTest <- data.frame(x= c('b','c'), stringsAsFactors = FALSE)
stats::model.matrix(~x, dTest)
##   (Intercept) xc
## 1           1  0
## 2           1  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
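One defensive pattern (a sketch of our own, not from the original article) is to record the level set seen during training and re-apply it as an explicit factor before encoding new data; the vtreat and caret approaches shown later handle this bookkeeping for you:

# record the levels seen during training
trainLevels <- sort(unique(dTrain$x))
# force the test column onto the same levels before encoding
dTest$x <- factor(dTest$x, levels = trainLevels)
stats::model.matrix(~x, dTest)   # now yields both xb and xc columns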
The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model. Python scikit-learn users coming to R often ask "where is the one-hot encoder" (as it isn’t discussed as much in R as it is in scikit-learn) and even supply a number of (low quality) one-off packages "porting one-hot encoding to R."
The main place an R user needs a proper encoder (and that is an encoder that stores its encoding plan in a conveniently re-usable form, which many of the "one-off ported from Python" packages actually fail to do) is when using a machine learning implementation that isn’t completely R-centric. One such system is xgboost, which requires (as is typical of machine learning in scikit-learn) data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a data.frame). This requires explicit conversion on the part of the R user, and many R users get it wrong (fail to store the encoding plan somewhere). To make this concrete let’s work a simple example.
Let’s try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms); we are just plugging the obvious variables into xgboost. As we said: xgboost requires a numeric matrix for its input, so unlike many R modeling methods we must manage the data encoding ourselves (instead of leaving that to R, which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).
We bring in our data:
# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic_train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
## 
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare            Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
  titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
And design our cross-validated modeling experiment:
library("xgboost") library("sigr") library("WVPlots") library("vtreat") set.seed(4623762) crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain), 10, dTrain, dTrain[[outcome]]) evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) { preds <- rep(NA_real_, nrow(xMatrix)) for(ci in crossValPlan) { nrounds <- 1000 cv <- xgb.cv(data= xMatrix[ci$train, ], label= outcomeV[ci$train], objective= 'binary:logistic', nrounds= nrounds, verbose= 0, nfold= 5) #nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification model <- xgboost(data= xMatrix[ci$train, ], label= outcomeV[ci$train], objective= 'binary:logistic', nrounds= nrounds, verbose= 0) preds[ci$app] <- predict(model, xMatrix[ci$app, ]) } preds }
Our preferred way to encode data is to use the vtreat package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
                                   minFraction= 0,
                                   verbose=FALSE)
# restrict to common variable types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
                                         varRestriction = newvars))
print(dim(trainVtreat))
## [1] 891 20
print(colnames(trainVtreat))
## [1] "Pclass_lev_x.1" "Pclass_lev_x.2" "Pclass_lev_x.3" ## [4] "Sex_lev_x.female" "Sex_lev_x.male" "Age_clean" ## [7] "Age_isBAD" "SibSp_clean" "Parch_lev_x.0" ## [10] "Parch_lev_x.1" "Parch_lev_x.2" "Parch_lev_x.3" ## [13] "Parch_lev_x.4" "Parch_lev_x.5" "Parch_lev_x.6" ## [16] "Fare_clean" "Embarked_lev_x." "Embarked_lev_x.C" ## [19] "Embarked_lev_x.Q" "Embarked_lev_x.S"
dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
                                                dTrain[[outcome]]==target,
                                                crossValPlan)
sigr::permTestAUC(dTrain, 'predVtreatZ', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.86, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predVtreatZ', outcome, target, 'vtreat encoder performance')
stats::model.matrix() can perform similar encoding when we only have a single data set.
set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model.matrix skips rows with NAs by default,
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action='na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f), dTrain)
# note model.matrix does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training
# data.
options(na.action=oldOpt)
print(dim(trainModelMatrix))
## [1] 891 16
print(colnames(trainModelMatrix))
## [1] "Pclass1" "Pclass2" "Pclass3" "Sexmale" "Age" ## [6] "SibSp" "Parch1" "Parch2" "Parch3" "Parch4" ## [11] "Parch5" "Parch6" "Fare" "EmbarkedC" "EmbarkedQ" ## [16] "EmbarkedS"
dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
                                                    dTrain[[outcome]]==target,
                                                    crossValPlan)
sigr::permTestAUC(dTrain, 'predModelMatrix', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.87, s.d.=0.019, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predModelMatrix', outcome, target, 'model.matrix encoder performance')
The caret package also supplies encoding functionality that is properly split between training (caret::dummyVars()) and application (via predict()).
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(4623762)
f <- paste('~', paste(vars, collapse = ' + '))
encoder <- caret::dummyVars(as.formula(f), dTrain)
trainCaret <- predict(encoder, dTrain)
print(dim(trainCaret))
## [1] 891 19
print(colnames(trainCaret))
## [1] "Pclass.1" "Pclass.2" "Pclass.3" "Sexfemale" "Sexmale" ## [6] "Age" "SibSp" "Parch.0" "Parch.1" "Parch.2" ## [11] "Parch.3" "Parch.4" "Parch.5" "Parch.6" "Fare" ## [16] "Embarked" "EmbarkedC" "EmbarkedQ" "EmbarkedS"
dTrain$predCaret <- evaluateModelingProcedure(trainCaret,
                                              dTrain[[outcome]]==target,
                                              crossValPlan)
sigr::permTestAUC(dTrain, 'predCaret', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.85, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predCaret', outcome, target, 'caret encoder performance')
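Note that the encoder object returned by caret::dummyVars() stores its encoding plan, so the same predict() call can later be applied to held-out or future data and produce identical columns. A small check of our own (not in the original article), using a few training rows to stand in for new data:

# re-applying the stored encoder to a few rows stands in for encoding new data
testCaret <- predict(encoder, head(dTrain))
stopifnot(identical(colnames(testCaret), colnames(trainCaret)))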
We usually forget to teach vtreat::designTreatmentsZ() as it is often dominated by the more powerful y-aware methods vtreat supplies (though not for this simple example). vtreat::designTreatmentsZ() has a number of useful properties:
- Does not look at the outcome values, so does not require extra care in cross-validation.
- Saves its encoding, so can be used correctly on new data.
The above two properties are shared with caret::dummyVars(). Additional features of vtreat::designTreatmentsZ() (that differ from caret::dummyVars()‘s choices) include:
- No NA values are passed through by vtreat::prepare(). NA presence is instead added as an additional informative column.
- A few derived columns (such as pooling of rare levels) are made available.
- Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.
- Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by vtreat::prepare() (caret::dummyVars() considers this an error); see the sketch after this list.
The vtreat y-aware methods include proper nested modeling and y-aware dimension reduction.
vtreat is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high-cardinality categorical variables can have a huge positive impact on modeling performance. In many cases vtreat works around problems that would otherwise kill the analysis pipeline (such as discovering new variable levels during test or application). We teach vtreat sort of "bimodally": in both a "fire and forget" mode and an "all the details on deck" mode (suitable for formal citation). Either way vtreat can make your modeling procedures stronger, more reliable, and easier.
All code for this article can be found here.