Vtreat: designing a package for variable treatment


When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

  • Missing values (NA or blanks)
  • Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1)
  • Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
  • Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

Missing Values; Missing Category Levels

First, we’ll look at what to do when there are missing values or NAs in the data, and how to guard against category levels that don’t appear in the training data. Let’s make a small example data set that manifests these issues.

set.seed(9394092)

levels = c('a', 'b', 'c', 'd')
levelfreq = c(0.3, 0.3, 0.3, 0.1)
means = c(1, 6, 2, 7)
names(means) = levels
NArate = 1/30

X = sample(levels, 200, replace=TRUE, prob=levelfreq)
Y  = rnorm(200) + means[X]

train = data.frame(x=X[1:150], y=Y[1:150], stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], y=Y[151:200], stringsAsFactors=FALSE)

# remove a level from training
train = subset(train, x !='d')

# sprinkle in some NAs
ntrain = dim(train)[1] ; ntest = dim(test)[1]
train$x = ifelse(runif(ntrain) < NArate, NA, train$x)
test$x = ifelse(runif(ntest) < NArate, NA, test$x)

table(train$x)

##  a  b  c 
## 40 44 42

sum(is.na(train$x))
## [1] 4

sum(is.na(test$x))
## [1] 2

This simulates a situation where a rare level failed to be collected in the training data. In addition, we've simulated a missing-value mechanism. In this example, it's a "faulty sensor" mechanism (missing values show up at random, as if a sensor were intermittently and randomly failing), though in general the mechanism may also be systematic, where the NA means something specific, like the measurement not applying (say, "most recent pregnancy date" for a male subject).

We can build a linear regression model for predicting y from x:

# build a model
model1 = lm("y~x", data=train)

train$pred = predict(model1, newdata=train) # this works

predict(model1, newdata=test) # this fails
## Error in model.frame.default(Terms, newdata, na.action = na.action, 
## xlev = object$xlevels) : factor x has new levels d

The model fails on the holdout data because the new data has a value of x which was not observed in the training data. You can always refuse to predict in such cases, of course, but in some situations even a not-so-good prediction may be better than no prediction at all. Note also that lm quietly omitted the rows where x was missing while training, and the resulting model will return NA as the predicted outcome in such cases. This is again perfectly reasonable, but not always what you want, especially in cases where a large fraction of the data has missing values.

Are there alternative ways to handle these issues? If NAs show up in the data, the conservative assumption is that they are missing systematically; in this situation (when x is a categorical variable), we can then treat them as just another category value, for example by pretreating the variable to convert NA to "Unknown." When novel values show up in the test data (or when NAs appear in the holdout data, but not in the training data), the best assumption we can make is that the novel value is in fact one of the values we have already observed, with the probability of being any given value proportional to the training-set frequencies.
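
The first of these treatments is a one-line recoding. A quick sketch on the training frame built above (xRecoded is just a throwaway name for illustration):

# recode NA as just another category level
xRecoded = ifelse(is.na(train$x), "Unknown", train$x)
table(xRecoded)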

We've implemented these data treatments, and others, in an R package called vtreat. The package is very much at the alpha stage, and is not yet available on CRAN; we'll explain how you can get the package later on in the post. For now, let's see how it works.

The first step is to use the training data to create a set of variable treatments, one for each variable of interest.

library(vtreat)  # our library, not public; we'll show how to install later
treatments = designTreatmentsN(train, c("x"), "y")

The function designTreatmentsN() takes as input the data frame of training data, the list of input columns, and the (numerical) outcome column. There is a similar function designTreatmentsC() for binary classification problems. The output of the function is an object of class treatmentplan, which holds a list of variable treatments, one per input variable, along with some summary information.

treatments

## $treatments
## $treatments[[1]]
## [1] "vtreat 'Categoric Indicators'('x'->'x_lev_NA','x_lev_x.a','x_lev_x.b','x_lev_x.c')"
## 
## 
## $vars
## [1] "x_lev_NA"  "x_lev_x.a" "x_lev_x.b" "x_lev_x.c"
## 
## $varScores
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c 
##    1.0310    0.6948    0.2439    0.8959 
## 
## $varMoves
##  x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c 
##      TRUE      TRUE      TRUE      TRUE 
## 
## $outcomename
## [1] "y"
## 
## $meanY
## [1] 3.246
## 
## $ndat
## [1] 130
## 
## attr(,"class")
## [1] "treatmentplan"

The vars field of a treatmentplan object gives the names of the new variables that were formed from the original variable: a categorical variable like x is converted to several indicator variables, one for each known level of x -- including NA, if it is observed in the training data. varMoves is TRUE if the new variable in question varies (that is, if it has more than one value in the training data). meanY is the base mean of the outcome variable (unconditioned on the inputs). ndat is the number of data points.
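
For comparison, base R's model.matrix() builds similar indicator columns, but it silently drops rows where x is NA, and on new data it emits columns for whatever levels that data happens to contain, so novel levels break the correspondence with the training design. These are exactly the gaps the treatment plan fills. A quick sketch:

# base-R indicators for comparison; the default na.action drops the NA
# rows, and there is no indicator for NA or for unseen levels
head(model.matrix(~ x - 1, data=train))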

The field varScores is a rough indicator of variable importance, based on the PRESS statistic. The PRESS (predicted residual sum of squares) statistic of a model is the sum of its squared hold-one-out prediction errors: that is, the sum of (y_i - f_i)^2, where y_i is the outcome corresponding to the ith data point, and f_i is the prediction of the model built from all the training data except the ith data point. We calculate the varScore of the jth input variable x_j as the PRESS statistic of the one-dimensional linear regression model that uses only x_j, divided by the PRESS statistic of the unconditioned mean of y. A varScore of 0 means the model predicts perfectly. A varScore close to one means that the variable predicts only about as well as the global mean; a varScore above one means that the model predicts the outcome worse than the global mean does. So the lower the varScore, the better. You can use varScores to prune uninformative variables, as we will show later.
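
To make the score concrete, here is a sketch of how such a score can be computed for a single input variable (our reading of the definition, not the package's internal code). For linear regression the hold-one-out residuals come from the hat matrix, without refitting n models:

# sketch of a PRESS-based variable score; assumes complete cases
varScoreSketch = function(x, y) {
  fit = lm(y ~ x)
  # hold-one-out residuals via the hat-matrix shortcut: e_i/(1 - h_ii)
  rModel = residuals(fit)/(1 - hatvalues(fit))
  # hold-one-out residuals of the grand mean: (n/(n-1)) * (y_i - mean(y))
  n = length(y)
  rMean = (n/(n-1)) * (y - mean(y))
  sum(rModel^2)/sum(rMean^2)
}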

Once you have created the treatment plan using designTreatmentsN(), you can treat the training and test data frames using the function prepare(). This creates new data frames that express the outcome in terms of the new transformed variables. prepare() takes as input a treatment plan and a data set to be treated. The optional argument pruneLevel lets you specify a threshold for varScores; variables with a varScore higher than pruneLevel will be eliminated. By default, prepare() will prune away any variables with a varScore greater than 0.99; we will use pruneLevel=NULL to force prepare() to create all possible variables.

# pruneLevel=NULL  turns pruning OFF
train.treat = prepare(treatments, train, pruneLevel=NULL)
test.treat = prepare(treatments, test, pruneLevel=NULL)

train.treat[1:4,]
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1        0         0         1         0 7.037
## 2        0         0         0         1 1.209
## 3        0         0         0         1 2.819
## 4        0         0         0         1 2.099

subset(train.treat, is.na(train$x))  # similarly for test
##    x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12        1         0         0         0 -0.4593
## 48        1         0         0         0  6.4741
## 49        1         0         0         0  5.3387
## 81        1         0         0         0  2.2319

The listing above shows that instead of the training data frame (x, y), we now have a training data frame with four x indicator variables, one for each of the known x-values "a", "b", and "c" -- plus NA. According to the listing, the first four values of x in the training data were c("b", "c", "c", "c"). NAs are encoded by the variable x_lev_NA.

We can see how prepare() handles novel values in the test data:

#
# when we encounter a new variable value, we assign it all levels, 
# proportional to training set frequencies
#
subset(test.treat, test$x=='d')
##   x_lev_NA x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8  0.03077    0.3077    0.3385    0.3231 4.622

Looking back at the process by which we generated y, we can see that in this case the "d" level isn't actually a proportional combination of the other levels; still, this is the best assumption in the absence of any other information. Furthermore, in the more common situation of multiple input variables, this assumption allows us to take advantage of information that is available through those other variables.
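
We can verify that the weights assigned to the "d" row are exactly the level frequencies in the training data:

# check: the weights are the training-set level frequencies
# (a=40/130, b=44/130, c=42/130, NA=4/130 -- matching the row above)
table(train$x, useNA="ifany")/nrow(train)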

Now we can fit a model using the transformed variables:

# get the names of the x variables
vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
## 
## Call:
## lm(formula = fmla, data = train.treat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.856 -0.756 -0.026  0.782  3.078 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.103      0.168   12.54  < 2e-16 ***
## x_lev_NA       1.293      0.569    2.27  0.02461 *  
## x_lev_x.a     -0.830      0.240   -3.46  0.00075 ***
## x_lev_x.b      4.014      0.234   17.13  < 2e-16 ***
## x_lev_x.c         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.789 
## F-statistic:  162 on 3 and 126 DF,  p-value: 

The significance levels of the variables are consistent with the variable importance scores we observed in the treatment plan. The fact that one of the levels is reported as NA is to be expected; four levels imply only 3 degrees of freedom (plus the intercept). The standard practice is to omit one level of a categorical variable as redundant. We don't do this in our treatment plan, as regularized models can actually benefit from having the extra level left in. You will get warnings about a possibly misleading fit when applying the model; in this case, we know how the variables were constructed, and that there are no hidden degeneracies in the variables (at least none that we created), so we can disregard the warning.
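
As an aside, here is a sketch of fitting a regularized model on all four indicators (this assumes the glmnet package, which is not part of vtreat); the L2 penalty resolves the redundant level instead of dropping it:

library(glmnet)
xmat = as.matrix(train.treat[, vars])
# alpha=0 is ridge regression; cross-validation picks the penalty
ridgeModel = cv.glmnet(xmat, train.treat$y, alpha=0)
head(predict(ridgeModel, newx=as.matrix(test.treat[, vars]), s="lambda.min"))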

# you get the warnings about rank-deficient fits
train.treat$pred = predict(model2, newdata=train.treat)
## Warning: prediction from a rank-deficient fit may be misleading

test.treat$pred = predict(model2, newdata=test.treat) # works!
## Warning: prediction from a rank-deficient fit may be misleading

# no NAs
summary(train.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.27    1.27    2.10    3.25    6.12    6.12
summary(test.treat$pred)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.27    1.27    2.10    3.37    6.12    6.12

# note that this model gives the same answers on training data 
# as the default model
sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 9.566e-13

The last command of the above listing confirms that on the training data, the model learned from the treated data is equivalent to the model learned on the original data. Now we can look at model accuracy.

rmse = function(y, pred) {
  se = (y-pred)^2
  sqrt(mean(se))
}

# model does well where it really has x values
with(subset(train, !is.na(x)), rmse(y, pred))
## [1] 0.973

# not too bad on NAs
with(train.treat, rmse(y,pred))
## [1] 1.07

# model generalizes well on levels it's observed
with(subset(test.treat, test$x != "d"), rmse(y,pred))
## [1] 1.08

# less well on novel values
with(test.treat, rmse(y,pred))
## [1] 1.272

subset(test.treat, test$x=='d')[,c("y", "pred")]
##       y  pred
## 8 4.622 3.246

As expected, the model does not perform as well on novel data values (x = "d"), but at least it returns a prediction without crashing. Furthermore, if the novel levels are rare (as we would expect), then predicting them poorly will not affect the overall performance of the model too much.

Let's try preparing the data with the default pruning parameters (pruneLevel=0.99):

train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

# The x_lev_NA variable has been pruned away
train.treat[1:4,]
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 1         0         1         0 7.037
## 2         0         0         1 1.209
## 3         0         0         1 2.819
## 4         0         0         1 2.099

# NAs are now encoded as (0,0,0)
subset(train.treat, is.na(train$x))
##    x_lev_x.a x_lev_x.b x_lev_x.c       y
## 12         0         0         0 -0.4593
## 48         0         0         0  6.4741
## 49         0         0         0  5.3387
## 81         0         0         0  2.2319

# d is now encoded as the relative frequencies of a, b, and c.
subset(test.treat, test$x=='d')
##   x_lev_x.a x_lev_x.b x_lev_x.c     y
## 8    0.3077    0.3385    0.3231 4.622

We no longer keep NA as a level, because in this data set it is no more informative than the global mean; novel levels are still encoded as "all the known levels," proportionally weighted. If we use this data representation to model, we don't get a rank-deficient fit.

vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model2 = lm(fmla, data=train.treat)
summary(model2)
## 
## Call:
## lm(formula = fmla, data = train.treat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.856 -0.756 -0.026  0.782  3.078 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.396      0.543    6.25  5.8e-09 ***
## x_lev_x.a     -2.123      0.570   -3.73  0.00029 ***
## x_lev_x.b      2.721      0.567    4.80  4.5e-06 ***
## x_lev_x.c     -1.293      0.569   -2.27  0.02461 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.789 
## F-statistic:  162 on 3 and 126 DF,  p-value: 

The model performance is similar to that of the model that included x_lev_NA.

train.treat$pred = predict(model2, newdata=train.treat)
test.treat$pred = predict(model2, newdata=test.treat)

sum(abs(train$pred - train.treat$pred), na.rm=TRUE)
## [1] 6.297e-13
with(train.treat, rmse(y,pred))
## [1] 1.07
with(test.treat, rmse(y,pred))
## [1] 1.272

Numerical Variables and Categorical Variables with Many Levels

The above examples looked at data treatment for a simple categorical variable with a moderate number of levels, some possibly missing. There are two other cases to consider. First, we would like basic data treatment for numerical variables, to protect against bad values like NA, NaN or Inf.

Second, we'd like to gracefully manage categorical variables with a large number of possible levels, such as ZIP code, telephone area code, or even city or other geographical region. Such categorical variables can be problematic because they introduce computational or data size issues for some modeling algorithms. For example, the size of the design matrix when computing linear or logistic regression models grows as the square of the number of variables -- and a categorical variable with N levels is represented as N-1 indicator variables. The randomForest implementation in R cannot handle categorical variables with more than 32 levels. Categoricals with a large number of levels are also a problem because it is more likely that some of the rarer levels will not appear in the training set, triggering the "novel level" problem on new data: if only a few of your customers come from Alaska or Rhode Island, then those states may not show up in your training set -- but they may show up when you deploy the model to your website.

There are often domain specific ways to handle categories with many levels. For example, a common trick with zip codes is to map them to a new variable whose value is related to zip code and relevant to the problem, such as average household income within that zip code. Obviously, this mapping won't be appropriate in all situations, so it's good to have an automatic procedure to fall back on.
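
As a sketch of that trick (the lookup table zipIncome below is hypothetical, invented purely for illustration):

# hypothetical lookup: zip code -> average household income in that zip
zipIncome = c("10001"=85000, "60614"=98000, "94110"=110000)

mapZipToIncome = function(zips, fallback=median(zipIncome)) {
  v = zipIncome[zips]
  ifelse(is.na(v), fallback, v)  # unseen zips fall back to a typical value
}

mapZipToIncome(c("10001", "99999"))  # the second zip is novel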

Previously, we've discussed a technique that we call "impact coding" to manage this issue; see our earlier posts on the Win-Vector blog, and also Chapter 6 of Practical Data Science with R. Impact coding converts a categorical variable x_cat into a numerical variable that corresponds to a one-variable Bayesian model for the outcome as a function of x_cat. The vtreat library implements impact coding as discussed in those posts, with a few improvements.
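
The essence of the idea is easy to sketch by hand (no smoothing here; the package's version is more careful than this):

# naive impact coding: replace each level with the difference between
# that level's conditional mean outcome and the grand mean outcome
impactCodeSketch = function(xcol, ycol) {
  grandMean = mean(ycol)
  condMeans = tapply(ycol, xcol, mean)
  code = condMeans[xcol] - grandMean
  ifelse(is.na(code), 0, code)  # novel/NA levels get the grand mean
}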

Let's build another simple example, to demonstrate impact coding and the treatment of numerical variables.

N = 100 # a variable with 100 levels
levels = paste('gp', 1:N, sep='') 
fhi = c(0.15, 0.1, 0.1)
  # the first three levels account for 35% of the data
fx = (1-sum(fhi))/(N-length(fhi))  # frequency of each remaining level
levelfreq = c(fhi, numeric(N-length(fhi))+fx)

means = sample.int(10, size=N, replace=TRUE)
names(means) = levels

X = sample(levels, 200, replace=TRUE, prob=levelfreq)
U = rnorm(200, mean=0.5)  # numeric variable
Y  = rnorm(200) + means[X] + U

length(unique(X)) # the data set is missing levels
## [1] 68
train = data.frame(x=X[1:150], u = U[1:150], y=Y[1:150], 
         stringsAsFactors=FALSE)
test = data.frame(x=X[151:200], u= U[151:200], y=Y[151:200], 
         stringsAsFactors=FALSE)

# sprinkle a few NAs into u (for demonstration purposes)
train$u = ifelse(runif(150) < 0.01, NA, train$u)

length(setdiff(unique(test$x), unique(train$x)))  
     # and test has some levels train doesn't
## [1] 11

The designTreatmentsN function has two parameters that control when a categorical variable is impact coded. The parameter minFraction (default value: 0.02) controls what fraction of the time an indicator variable has to be "on" (that is, not zero) to be used (this is separate from the pruneLevel parameter in prepare). The purpose is to eliminate rare variables or rare levels. By default, we eliminate variables that are on less than 2% of the time.

When a categorical variable has a large number of levels, it's likely that many of them will be on less than 2% of the time. In that case, the corresponding indicator variables are eliminated, and all of those rare levels will encode to c(0, 0, ...), in the way the NA level did in our second example above. Let's call the fraction of the data that gets encoded to zero due to rare levels the fraction of the data that we "lose". The parameter maxMissing (default value: 0.04) specifies what fraction of the data we are allowed to "lose" before automatically switching to an impact coded variable. By default, if the eliminated levels correspond to more than 4% of the data, then the treatment plan will switch to impact coding.
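
Both parameters can be set explicitly when designing the treatments. A sketch, writing out the default values:

# same as the defaults; raising maxMissing tolerates losing more of the
# data to rare levels before impact coding kicks in
designTreatmentsN(train, c("x", "u"), "y",
                  minFraction=0.02, maxMissing=0.04)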

In the example above, three levels of the variable x account for 35% of the data, so each of the other 97 levels accounts for roughly (1-0.35)/97 = 0.0067 of the data -- less than 1% of the mass each. All 97 of those levels would therefore be eliminated, and we would "lose" 65% of the data if we kept the indicator representation! So the data treatment automatically converts x to an impact-coded variable.

#
# create the treatment plan.
#
treatments = designTreatmentsN(train, c("x", "u"), "y")
treatments  
## $treatments
## $treatments[[1]]
## [1] "vtreat 'Scalable Impact Code'('x'->'x_catN')"
## 
## $treatments[[2]]
## [1] "vtreat 'Scalable pass through'('u'->'u_clean')"
## 
## $treatments[[3]]
## [1] "vtreat 'is.bad'('u'->'u_isBAD')"
## 
## 
## $vars
## [1] "x_catN"  "u_clean" "u_isBAD"
## 
## $varScores
##  x_catN u_clean u_isBAD 
##  0.1717  0.9183  1.0116 
## 
## $varMoves
##  x_catN u_clean u_isBAD 
##    TRUE    TRUE    TRUE 
## 
## $outcomename
## [1] "y"
## 
## $meanY
## [1] 5.493
## 
## $ndat
## [1] 150
## 
## attr(,"class")
## [1] "treatmentplan"

The variable x_catN is the impact-coded variable corresponding to x. If we write y|x for the mean of y conditioned on x, and meanY for the grand (unconditioned) mean of y, then x_catN = y|x - meanY. Note that x_catN has a low varScore, indicating that it is a good, informative variable.

The variable u_clean is the numerical variable u, with all "bad" values (NA, NaN, Inf) converted to the mean of the "non-bad" u (we'll call this the "clean mean" of u). The variable u_isBAD is an indicator variable that is one whenever u is bad. If the bad values are due to a "faulty sensor" (that is, they occur at random), then converting to the clean mean value of u is the right thing to do. If the bad values are systematic, then u_isBAD can be used by the modeling algorithm to adjust for the systematic effect (assuming it survives the pruning, which in this case, it won't).
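
The same treatment is easy to describe by hand; a sketch of what prepare() will do to u (the _byhand names are ours, for illustration):

# bad values become the clean mean, with an indicator recording where
# they were; cleanMean should match the value 0.5014 shown below
isBad = is.na(train$u) | is.nan(train$u) | is.infinite(train$u)
cleanMean = mean(train$u[!isBad])
u_clean_byhand = ifelse(isBad, cleanMean, train$u)
u_isBAD_byhand = as.numeric(isBad)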

We can see how this works concretely by preparing the test and training sets.

train.treat = prepare(treatments, train)
test.treat = prepare(treatments, test)

train.treat[1:5,]  # isBAD column didn't survive
##     x_catN u_clean     y
## 1  0.04809  1.0749 5.328
## 2  1.37053 -0.4429 5.413
## 3 -2.32535  1.4380 2.372
## 4 -5.02863  0.6611 0.464
## 5 -1.04404  0.8327 4.449

# ------------------------
# "bad" u values map to the "clean mean" of u
# ------------------------

train.treat[is.na(train$u),]  
##     x_catN u_clean     y
## 74   1.371  0.5014 5.269
## 133  3.053  0.5014 8.546

# compare to u_clean, above
mean(train$u, na.rm=TRUE)
## [1] 0.5014

#-----------------------
# confirm (x_catN | x = xlevel) is mean(y | x=xlevel) - mean(y)
# ------------------------

subset(train.treat, train$x==levels[1])[1:2,]
##    x_catN u_clean     y
## 3  -2.325  1.4380 2.372
## 15 -2.325  0.0661 2.381

# compare to x_catN, above
mean(subset(train, x==levels[1])$y) - mean(train$y)
## [1] -2.325

#-----------------------
# missing levels map to 0, which is equivalent to
# mapping them to all known levels proportional to frequency
#-----------------------

missingInTest = setdiff(unique(test$x), unique(train$x))

subset(test.treat, test$x %in% missingInTest)[1:2,]
##       x_catN u_clean     y
## 1  4.737e-16  1.3802 1.754
## 13 4.737e-16  0.9062 6.862

Finally, we use the treated data to model.

vars = setdiff(colnames(train.treat), "y")
fmla = paste("y ~ ", paste(vars, collapse=" + "))
model = lm(fmla, data=train.treat)
summary(model)
## 
## Call:
## lm(formula = fmla, data = train.treat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2077 -0.6131 -0.0113  0.5237  2.5923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.1347     0.0800    64.2   

As in the first example, the model performs better on categories that it saw during training, but it still handles novel levels gracefully -- and remember, some modeling algorithms can't handle a large number of categories at all.

That describes the most basic data treatment procedures that our package implements. For binary classification and logistic regression problems, the package has another function, designTreatmentsC(), which creates treatment plans when the outcome is a binary class variable.
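
We haven't demonstrated designTreatmentsC() here, but we expect the call to have the same shape, with the addition of the target class (a sketch; the exact argument order is an assumption on our part):

# for a binary outcome column "z", with TRUE as the target class:
# treatmentsC = designTreatmentsC(train, c("x", "u"), "z", TRUE)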

Loading the vtreat package

We have made vtreat available on github; remember, this is an alpha package, so it will be rough around the edges. To install the package, download the vtreat tar file (at this writing, vtreat_0.2.tar.gz), as shown in the figure below:

[Figure: Step 1 -- downloading the raw vtreat tar file from GitHub]

Once you've downloaded it, you can install it from the R command line, as you would any other package. If your R working directory is the same directory where you've downloaded the tar file, then the command looks like this:

install.packages('vtreat_0.2.tar.gz',repos=NULL,type='source')

Once it's installed, library(vtreat) will load the package. Type help(vtreat) to get a short description of how to use the package, along with some example code snippets.

vtreat has a few more features that we will cover in future posts, but this post has given you enough to get you started. Remember, automatic data treatment procedures are not a substitute for inspecting and exploring your data before modeling. However, once you've gotten a feel for the data, you will find that the procedures we have implemented are applicable to a wide variety of situations.

If you try the package, please do send along feedback, including any errors or bugs that you might discover.


For more on data treatment, see Chapter 4 of Practical Data Science with R.
