vtreat up on CRAN!

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers.]

Nina Zumel and I are proud to announce our R vtreat variable treatment library has just been accepted by CRAN!

It will take some time for the vtreat package to progress to various CRAN mirrors, but as of now you can install vtreat with the command:

install.packages('vtreat', repos='http://cran.r-project.org/')

This replaces the earlier need to use devtools to install the development version from GitHub, as in:

devtools::install_github('WinVector/vtreat')

The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but cover most of the truly necessary yet typically ignored precautions. The library is designed to produce a data.frame that is entirely numeric, and it takes common precautions to guard against the following real world data issues:

The above are all awful things that often lurk in real world data. Automating these steps makes them easy enough that you actually perform them, and leaves the analyst time to look for additional data issues. For example, this allowed us to essentially automate a number of the steps taught in chapters 4 and 6 of Practical Data Science with R (Zumel, Mount; Manning 2014) into a very short worksheet (though we think that, for understanding, it is essential to work all the steps by hand as we did in the book).

The idea is: data.frames prepared with the vtreat library are somewhat safe to train on, as some precaution has been taken against all of the above issues. Also of interest are the vtreat variable significances (which help with initial variable pruning, a necessity when there are a large number of columns) and vtreat::prepare(scale=TRUE), which re-encodes all variables into effect units, making them suitable for y-aware dimension reduction (variable clustering, or principal component analysis) and for geometry sensitive machine learning techniques (k-means, knn, linear SVM, and more). You may want to do more than the vtreat library does (such as Bayesian imputation, variable clustering, and more), but you certainly do not want to do less.
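As a rough illustration of the workflow described above, here is a minimal sketch that designs a treatment plan on one part of a data set and applies it to a held-out part. The toy data, the column names (x, z, y) and the 150/50 split are made up for this example, and argument defaults (in particular pruneSig) and the exact contents of the score frame may differ between vtreat versions, so please treat this as a sketch rather than a definitive recipe.

```r
library(vtreat)

# Hypothetical toy data: a categorical column with missing values,
# a numeric column with missing values, and a logical outcome.
set.seed(2015)
n <- 200
d <- data.frame(
  x = sample(c('a', 'b', 'c', NA), n, replace = TRUE),
  z = rnorm(n),
  stringsAsFactors = FALSE
)
d$z[sample.int(n, 10)] <- NA
d$y <- ifelse(is.na(d$x), FALSE, d$x == 'a') | (runif(n) < 0.2)

# Design the treatments on one set of rows, apply them to another.
dTrain <- d[1:150, ]
dTest  <- d[151:200, ]

# Treatment plan for a binary classification outcome.
plan <- designTreatmentsC(dTrain, varlist = c('x', 'z'),
                          outcomename = 'y', outcometarget = TRUE,
                          verbose = FALSE)

# The score frame carries the per-variable significance estimates
# that can be used for initial variable pruning.
print(plan$scoreFrame)

# Apply the plan to the held-out rows. pruneSig = NULL keeps every
# derived variable; scale = TRUE re-encodes them in effect units.
treated <- prepare(plan, dTest, pruneSig = NULL, scale = TRUE)
print(head(treated))
```

The treated frame keeps the outcome column and replaces x and z with derived numeric columns, so it can be handed to a modeling function that does not accept NAs or character columns.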

The original announcement is getting a bit out of date, so we hope to be able to write a new article on vtreat soon. Until then, we suggest running vignette('vtreat') in R to produce a rendered version of the package vignette. You can also check out the package manual, now available online.

There have been a number of recent substantial improvements to the library, including:

Some of our related articles (which should make clear some of our motivations, and design decisions):

A short example of current best practice using vtreat (variable coding, train, test split) is here.
