# Preparing Data for Supervised Classification

This article was first published on **R – Win-Vector Blog** and kindly contributed to R-bloggers.

Nina Zumel has been polishing up new `vtreat` for `Python` documentation and tutorials. They are coming out so well that, to be fair to the `R` community, I must start back-porting this new documentation to `vtreat` for `R`.

`vtreat` is a package for systematically preparing data for supervised machine learning tasks such as classification or regression. `vtreat` designs a data transform that takes in messy data (with missing values and high-cardinality categorical variables) and delivers transformed data that is purely numeric with no missing values (essentially the data format needed by most scikit-learn machine learning procedures). The transformation is designed to retain almost all of the information relating the explanatory variables to the dependent variable, in a model-usable format. This transformation can be saved and then applied to future test or application data.
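To make the "design a transform, then apply it later" idea concrete, here is a minimal standard-library sketch. It is not `vtreat`'s implementation or API: the function names `fit_treatment` and `apply_treatment`, the column names, and the toy data are all hypothetical, and the encodings (mean imputation with a missingness indicator, and a per-level outcome-rate effect for the categorical column) are simplified stand-ins for the kinds of variables such a tool produces.

```python
from statistics import mean

def fit_treatment(rows, num_col, cat_col, outcome_col):
    """Learn a toy treatment plan from training rows (a list of dicts).

    Hypothetical helper for illustration only; not part of vtreat.
    """
    observed = [r[num_col] for r in rows if r[num_col] is not None]
    grand_rate = mean(1.0 if r[outcome_col] else 0.0 for r in rows)
    # Collect 0/1 outcomes per level of the categorical column.
    by_level = {}
    for r in rows:
        by_level.setdefault(r[cat_col], []).append(1.0 if r[outcome_col] else 0.0)
    return {
        "num_col": num_col,
        "cat_col": cat_col,
        "num_fill": mean(observed),  # value used to replace missing numerics
        # Numeric code per level: level outcome rate minus overall rate.
        "level_effect": {lvl: mean(ys) - grand_rate for lvl, ys in by_level.items()},
    }

def apply_treatment(plan, rows):
    """Apply a saved plan to new rows; output is all-numeric with no missing values."""
    treated = []
    for r in rows:
        x = r.get(plan["num_col"])
        level = r.get(plan["cat_col"])
        treated.append({
            plan["num_col"]: plan["num_fill"] if x is None else float(x),
            plan["num_col"] + "_is_bad": 1.0 if x is None else 0.0,
            # Levels never seen in training get a neutral effect of 0.0.
            plan["cat_col"] + "_effect": plan["level_effect"].get(level, 0.0),
        })
    return treated

# Toy training data (made up for this sketch).
train = [
    {"x": 1.0, "city": "a", "y": True},
    {"x": None, "city": "a", "y": True},
    {"x": 3.0, "city": "b", "y": False},
    {"x": 5.0, "city": "b", "y": False},
]
plan = fit_treatment(train, "x", "city", "y")

# Later: apply the saved plan to new data, including a missing value
# and a categorical level that was not present in training.
new_rows = [{"x": None, "city": "zzz"}, {"x": 2.0, "city": "a"}]
treated = apply_treatment(plan, new_rows)
```

The plan is learned once from training data and reused unchanged on new rows, which is the property the paragraph above describes. A production tool also has to guard against overfitting when the same rows are used both to learn outcome-based encodings and to fit the downstream model; this sketch does not attempt that.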

If you aren’t using something like `vtreat` in your data science projects, you are *really* missing out (and making more work for yourself).

Of course all of this is easier to evaluate with examples. And that is what Nina Zumel has been working on (in addition to supervising the semantics and theory; she invented many of the techniques, so we look to her for supervision).

Our first new `Python` example is here: `vtreat` for Classification in `Python`.

As I said, this example came out so well that I have ported it from `Python` to `R` here: `vtreat` for Classification in `R`.

If I get some free time I will also back-port `vtreat` for regression in `Python` and `vtreat` for unsupervised tasks in `Python` to `R`. I would also like to note an upcoming treat for `R` users: chapter 8, “Advanced Data Preparation”, of the second edition of *Practical Data Science with R* (Zumel, Mount; 2019) is all about `vtreat`!
