vtreat version 0.5.26 released on CRAN

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN.

‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

(from the package documentation)

‘vtreat’ is an R package that incorporates a number of transforms and simulated out of sample (cross-frame simulation) procedures that can:

‘vtreat’ can be used to prepare data for either regression or classification.
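As a minimal sketch of the regression case: ‘designTreatmentsN’ builds a plan for a numeric outcome, whereas ‘designTreatmentsC’ (used later in this post) takes a categorical outcome plus a target value. The toy data frame below is hypothetical and only meant to show the call pattern.

```r
library('vtreat')

# Hypothetical toy data with a numeric outcome y (not from the original post).
dTrainN <- data.frame(x=c('a','a','a','a','b','b',NA,NA),
                      z=c(1,2,3,4,5,NA,7,NA),
                      y=c(0,0,0,1,0,1,1,1))

# Regression: numeric outcome, so no outcome target is supplied.
treatmentsN <- designTreatmentsN(dTrainN, c('x','z'), 'y', verbose=FALSE)

# pruneSig=NULL keeps all derived variables (no significance pruning).
dTrainNTreated <- prepare(treatmentsN, dTrainN, pruneSig=NULL)
print(dTrainNTreated)
```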

Please read on for what ‘vtreat’ does and what is new.

About vtreat

The primary functions of ‘vtreat’ are re-coding high-cardinality categorical variables, re-coding missing data, and out-of-sample estimation of variable effects and significances. You can use ‘vtreat’ as a pre-processor, with ‘vtreat::prepare’ serving as a powerful replacement for ‘stats::model.matrix’. Using ‘vtreat’ should quickly get you into the competitive ballpark of best performance on a real-world data problem (such as KDD2009), leaving you time to apply deeper domain knowledge and model tuning for even better results.

‘vtreat’ achieves this by using the assumption that you have a modeling “y” (or outcome to predict) throughout, and that all preparation and transformation should be designed to use knowledge of this “y” during training (and anticipate not having the “y” during test or application).
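One way to see the difference from ‘stats::model.matrix’ is to compare how each handles missing values. The sketch below assumes R's default ‘na.action’ option (‘na.omit’) and uses a small data frame of the same shape as the example that follows; the names ‘d’, ‘mm’, ‘plan’ and ‘treated’ are illustrative.

```r
library('vtreat')

# Small frame with missing values in both x and z.
d <- data.frame(x=c('a','a','a','b','b',NA,NA),
                z=c(1,2,3,4,NA,6,NA),
                y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))

# Under the default na.action (na.omit), model.matrix drops incomplete rows.
mm <- stats::model.matrix(~ x + z, data=d)
print(nrow(mm))

# vtreat re-codes the missing values instead, so every row is retained.
plan <- designTreatmentsC(d, c('x','z'), 'y', TRUE, verbose=FALSE)
treated <- prepare(plan, d, pruneSig=NULL)
print(nrow(treated))
```

Only the four complete rows survive ‘model.matrix’ here, while the treated frame keeps all seven rows and contains no missing values.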

More simply: the purpose of ‘vtreat’ is to quickly take a messy real-world data frame similar to:

library('htmlTable')
library('vtreat')
dTrainC <- data.frame(x=c('a','a','a','b','b',NA,NA),
                      z=c(1,2,3,4,NA,6,NA),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))
htmlTable(dTrainC)

   x    z    y
1  a    1    FALSE
2  a    2    FALSE
3  a    3    TRUE
4  b    4    FALSE
5  b    NA   TRUE
6  NA   6    TRUE
7  NA   NA   TRUE

And build a treatment plan:

treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)

The treatment plan can then be used to clean up the original data and also be applied to any future application or test data:

dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneSig=0.5)
nround <- function(x) { if(is.numeric(x)) { round(x,2) } else { x } }
htmlTable(data.frame(lapply(dTrainCTreated,nround)))

   x_lev_NA  x_lev_x.a  x_catP  x_catB  z_clean  z_isBAD  y
1  0         1          0.43    -0.54   1        0        FALSE
2  0         1          0.43    -0.54   2        0        FALSE
3  0         1          0.43    -0.54   3        0        TRUE
4  0         0          0.29    -0.13   4        0        FALSE
5  0         0          0.29    -0.13   3.2      1        TRUE
6  1         0          0.29    0.56    6        0        TRUE
7  1         0          0.29    0.56    3.2      1        TRUE
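
Applying the plan to future data can be sketched as follows. The application frame ‘dTestC’ is hypothetical: it has no “y” column, contains a level (‘c’) never seen in training, and has a missing ‘z’. The exact derived column names may differ between ‘vtreat’ versions, so none are assumed here.

```r
library('vtreat')

dTrainC <- data.frame(x=c('a','a','a','b','b',NA,NA),
                      z=c(1,2,3,4,NA,6,NA),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE))
treatmentsC <- designTreatmentsC(dTrainC, c('x','z'), 'y', TRUE, verbose=FALSE)

# Hypothetical application data: no outcome column, a novel level, a missing value.
dTestC <- data.frame(x=c('a','c',NA),
                     z=c(10,NA,2))

# The plan learned on training data is applied as-is; y is not needed here.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)
print(dTestCTreated)
```

The treated application frame should be all-numeric with no missing values, ready to hand to a model fit on the treated training data.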

‘vtreat’ is designed to be concise, yet implement substantial data preparation and cleaning.

What is new

This release concentrates on code cleanup and convenience functions inspired by Nina Zumel’s recent article series on y-aware PCA/PCR (my note on why you should read this series is here). In particular, we now have user-facing functions and documentation on:

‘vtreat’ now has essentially two workflows:

We think analysts and data scientists will be well served by learning both workflows and picking the workflow most appropriate to the data set at hand.
