Prepare real-world data for analysis with the vtreat package

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

As anyone who's tried to analyze real-world data knows, any number of problems may be lurking in the data, and any of them can prevent you from fitting a useful predictive model:

  • Categorical variables can include infrequently used levels, which will cause problems if sampling leaves them unrepresented in the training set.
  • Numerical variables can be on wildly different scales, which can cause instability when fitting models.
  • The data set may include several highly correlated columns, some of which could be pruned from the data without sacrificing predictive power.
  • The data set may include missing values that need to be dealt with before analysis can begin.
  • … and many others

The vtreat package is designed to counter common data problems like these in a statistically sound manner. It's a data frame preprocessor that applies a number of data-cleaning steps to the input data before analysis, using techniques such as impact coding and categorical variable encoding (the methods are described in detail in this paper). Further details can be found on the vtreat GitHub page, where authors John Mount and Nina Zumel note:

Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.
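To make the idea of impact coding mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not vtreat's implementation or API: the function names `fit_impact_code` and `apply_impact_code` and the `smoothing` parameter are hypothetical, chosen only for illustration. The sketch assumes a numeric outcome and replaces each categorical level with the difference between the (optionally smoothed) mean outcome for that level and the overall mean.

```python
from collections import defaultdict


def fit_impact_code(levels, outcomes, smoothing=0.0):
    """Map each level to (smoothed mean outcome for the level) - (grand mean).

    Illustrative only: `smoothing` is a hypothetical parameter that pulls
    levels with few observations toward the grand mean.
    """
    if not levels or len(levels) != len(outcomes):
        raise ValueError("levels and outcomes must be non-empty and equal length")
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    return {
        level: (sums[level] + smoothing * grand_mean) / (counts[level] + smoothing)
        - grand_mean
        for level in counts
    }


def apply_impact_code(code, levels):
    """Replace levels with their impact; unseen levels get 0.0 (the grand mean)."""
    return [code.get(level, 0.0) for level in levels]


# Tiny usage example with made-up data.
train_levels = ["a", "a", "b", "b", "c"]
train_outcomes = [1.0, 3.0, 5.0, 7.0, 4.0]
code = fit_impact_code(train_levels, train_outcomes)
print(apply_impact_code(code, ["a", "b", "z"]))  # prints [-2.0, 2.0, 0.0]
```

The result is a single numeric column however many levels the original variable has, and a level never seen in training falls back to a neutral value instead of causing an error. A real implementation would also need to guard against overfitting when the codes are estimated from the same rows used to fit the model, which this sketch ignores.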


One final note: the main function in the package, prepare, is a little like model.matrix in that categorical variables are converted into numeric variables using contrast codings. This means that the output is suitable for many machine-learning functions (like xgboost) that don't accept categorical variables.
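As a rough illustration of turning a categorical column into purely numeric columns, the sketch below uses simple indicator (one-hot) columns in plain Python. This is an assumption made for the example, not necessarily the coding vtreat applies, and the names `fit_indicator_plan`, `apply_indicator_plan`, `min_count`, and the `lev_*` column names are all hypothetical. The encoding is designed on training data and then applied to new data, with missing values and rare or unseen levels each routed to their own column.

```python
from collections import Counter


def fit_indicator_plan(values, min_count=2):
    """Return the levels frequent enough to get their own indicator column."""
    counts = Counter(v for v in values if v is not None)
    return sorted(level for level, n in counts.items() if n >= min_count)


def apply_indicator_plan(kept_levels, values):
    """Encode each value as a dict of 0.0/1.0 indicator columns."""
    rows = []
    for v in values:
        row = {f"lev_{level}": 1.0 if v == level else 0.0 for level in kept_levels}
        # Missing values get an explicit indicator rather than being dropped.
        row["lev_NA"] = 1.0 if v is None else 0.0
        # Rare or previously unseen levels are pooled into one column.
        row["lev_rare"] = 1.0 if (v is not None and v not in kept_levels) else 0.0
        rows.append(row)
    return rows


# Tiny usage example with made-up data.
train = ["red", "red", "blue", "blue", "green", None]
plan = fit_indicator_plan(train)
for row in apply_indicator_plan(plan, ["red", "green", None, "purple"]):
    print(row)
```

Every output column is numeric, so the rows can be passed to a learner that does not accept categorical inputs, and a level such as "purple" that never appeared in training still receives a valid encoding.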

The vtreat package is available on CRAN now, and you can find a worked example using vtreat in the blog post linked below.

Win-Vector Blog: vtreat: prepare data
