New vtreat Feature: Nested Model Bias Warning
For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later naively used to train a model leads to an undesirable nested model bias. The vtreat package (both the R version and the Python version) incorporates a cross-frame method that allows one to use all the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).
The next version of vtreat will warn the user if they have improperly used the same data for both vtreat impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks for and warns against this situation. vtreat has had methods for avoiding nested model bias for a very long time; we are now adding new warnings to confirm users are using them.
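To make the bias concrete, here is a small base-R sketch. It does not use vtreat, and every name in it (`impact_code`, `x_code`, the synthetic data) is an illustrative assumption rather than anything from the package. It impact-codes a pure-noise categorical variable using the training rows, then scores that code both on the same training rows and on held-out rows.

```r
set.seed(2020)

# Synthetic data: x is a high-cardinality categorical variable that is
# pure noise with respect to y (hypothetical example data).
n <- 1000
d <- data.frame(
  x = sample(paste0("level_", seq_len(400)), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE
)
is_train <- seq_len(n) <= n / 2
train <- d[is_train, ]
test <- d[!is_train, ]

# Naive impact code: per-level mean of y minus the grand mean,
# both estimated on the training rows only.
grand_mean <- mean(train$y)
level_means <- tapply(train$y, train$x, mean)
impact_code <- function(x) {
  code <- as.numeric(level_means[match(x, names(level_means))]) - grand_mean
  code[is.na(code)] <- 0  # levels never seen in training get a neutral code
  code
}

# The mistake: apply the code to the same rows it was estimated from.
train$x_code <- impact_code(train$x)
test$x_code <- impact_code(test$x)

# Apparent fit on the re-used training rows versus fit on held-out rows.
r2_train <- summary(lm(y ~ x_code, data = train))$r.squared
r2_test <- cor(test$y, test$x_code)^2
print(c(r2_train = r2_train, r2_test = r2_test))
```

Under these assumptions `r2_train` should come out far larger than `r2_test`, even though `x` carries no information about `y`. The re-encoded variable has memorized training outcomes, and a downstream model fit on those same rows would trust it far too much.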
Set up the Example
This example is excerpted from some of our classification documentation.
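The code below expects a `training_data` frame and a `test_data` frame, each with a numeric outcome column `y`, a logical outcome column `yc`, and some explanatory columns. As a stand-in, the following sketch builds data of that shape. It is an assumption made for illustration: the function name `make_example_data`, the column names `x`, `xc`, and `x2`, and the distributions are all hypothetical, not the data used in the vtreat documentation.

```r
set.seed(2020)

# Hypothetical data generator: one numeric variable, one categorical
# variable derived from the outcome, and one pure-noise variable.
make_example_data <- function(nrows) {
  d <- data.frame(x = rnorm(nrows))
  d$y <- sin(d$x) + 0.1 * rnorm(nrows)
  d$xc <- paste0("level_", round(d$y, 1))  # many-level categorical variable
  d$x2 <- rnorm(nrows)                     # noise variable
  d$yc <- d$y > 0.5                        # logical classification outcome
  d
}

all_data <- make_example_data(500)

# Random split into training and test rows.
is_train <- runif(nrow(all_data)) <= 0.5
training_data <- all_data[is_train, , drop = FALSE]
test_data <- all_data[!is_train, , drop = FALSE]
```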
Demonstrate the Warning
One way to design variable treatments for binomial classification problems in vtreat is to design a cross-frame experiment.
# For this example we want vtreat version 1.5.1 or newer
# remotes::install_github("WinVector/vtreat")
library(vtreat)
packageVersion("vtreat")
## [1] '1.5.1'
...
transform_design = vtreat::mkCrossFrameCExperiment(
    # data to learn transform from
    dframe = training_data,
    # columns to transform
    varlist = setdiff(colnames(training_data), c('y', 'yc')),
    # outcome variable
    outcomename = 'yc',
    # outcome of interest
    outcometarget = TRUE
)
Once we have that we can pull the data transform and correct cross-validated training frame off the returned object as follows.
transform <- transform_design$treatments
train_prepared <- transform_design$crossFrame
train_prepared, the $crossFrame returned by mkCrossFrameCExperiment(), is prepared in the correct way: it lets us use the same training data both for inferring the impact-coded variables and for fitting the downstream model.
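The idea behind a cross-frame can be sketched in base R as out-of-fold encoding: each training row is encoded using level statistics estimated only from rows in other folds. This is a rough illustration under that assumption, not vtreat's actual implementation; the helper `impact_code`, the fold count, and the synthetic noise data are all hypothetical.

```r
set.seed(2020)

# Synthetic data: x is pure noise with respect to y (hypothetical example).
n <- 500
d <- data.frame(
  x = sample(paste0("level_", seq_len(400)), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE
)

# Impact-code x_new using level means of y estimated only from (x_fit, y_fit).
impact_code <- function(x_fit, y_fit, x_new) {
  grand_mean <- mean(y_fit)
  level_means <- tapply(y_fit, x_fit, mean)
  code <- as.numeric(level_means[match(x_new, names(level_means))]) - grand_mean
  code[is.na(code)] <- 0  # levels not seen in the fitting rows get a neutral code
  code
}

# Naive encoding: estimate and apply on the same rows.
naive_code <- impact_code(d$x, d$y, d$x)

# Out-of-fold encoding: each row is coded from the other folds only.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cross_code <- numeric(n)
for (f in seq_len(k)) {
  held_out <- fold == f
  cross_code[held_out] <- impact_code(d$x[!held_out], d$y[!held_out], d$x[held_out])
}

r2_naive <- cor(d$y, naive_code)^2
r2_cross <- cor(d$y, cross_code)^2
print(c(r2_naive = r2_naive, r2_cross = r2_cross))
```

In this sketch the naive code should appear strongly related to `y` while the out-of-fold code should not, which is the behavior we want from training data handed to a downstream model. All rows still contribute to the encoding, but no row's code depends on its own outcome.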
We prepare new test or application data as follows.
test_prepared <- prepare(transform, test_data)
The issue is: for training data we should not call prepare(), but instead use the cross-frame that is produced during transform design.
The point is we should not do the following:
train_prepared_wrong <- prepare(transform, training_data)
## Warning in prepare.treatmentplan(transform, training_data):
## possibly called prepare() on same data frame as designTreatments*()/
## mkCrossFrame*Experiment(), this can lead to over-fit. To avoid this, please use
## mkCrossFrame*Experiment$crossFrame.
Notice we now get a warning that we should not have done this, and that in doing so we may have introduced a nested model bias data leak.
And that is the new nested model bias warning feature.
The full R example can be found here, and a full Python example can be found here.