vtreat data cleaning and preparation article now available on arXiv

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Nina Zumel and I are happy to announce that a formal article on data preparation and cleaning with the vtreat methodology is now available from arXiv.org as arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that the data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include: infinity, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). vtreat::prepare should be your first choice for real-world data preparation and cleaning.
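
As a rough illustration of the workflow described above, here is a minimal sketch for a numeric outcome. The toy data frames and the column names (x_num, x_cat, y) are invented for this example and are not taken from the article; pruneSig = NULL is passed explicitly, which asks prepare not to prune variables by significance. The sketch designs a treatment plan on training data, then applies it to application data that contains an NA and a categorical level never seen in training.

```r
library(vtreat)

# Hypothetical toy training data containing the problems listed above:
# NA and Inf in a numeric column, NA and a rare level in a categorical column.
dTrain <- data.frame(
  x_num = c(1, 2, NA, 4, Inf, 6, 7, 8),
  x_cat = c("a", "a", "b", "b", NA, "a", "b", "rare"),
  y     = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0),
  stringsAsFactors = FALSE
)

# Design a treatment plan for a numeric outcome (regression).
treatments <- designTreatmentsN(
  dTrain,
  varlist = c("x_num", "x_cat"),
  outcomename = "y",
  verbose = FALSE
)

# Hypothetical application-time data: one NA and one novel level ("zzz").
dApp <- data.frame(
  x_num = c(3, NA),
  x_cat = c("a", "zzz"),
  stringsAsFactors = FALSE
)

# Apply the plan; pruneSig = NULL keeps all derived variables.
dAppTreated <- prepare(treatments, dApp, pruneSig = NULL)
print(dAppTreated)
```

The treated frame is intended to be all numeric with no missing or infinite values, so it can be handed to a downstream model without further special-casing. For a binary classification outcome, designTreatmentsC plays the role that designTreatmentsN plays here. On real data the plan should be designed on far more rows than this toy example uses.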

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.

We have also submitted a formal draft to The Journal of Statistical Software. JSS is a bit of a new venue for us, so we would appreciate any help we can get with the review process.

You can cite the current article as:

@misc{vtreatarticle,
  title = {vtreat: a data.frame Processor for Predictive Modeling},
  author = {Nina Zumel and John Mount},
  year = {2016},
  month = {November},
  journal = {arXiv},
  date = {2016-11-29},
  howpublished = {arXiv:1611.09477 [stat.AP] \url{https://arxiv.org/abs/1611.09477}},
  url = {https://arxiv.org/abs/1611.09477},
  urldate = {2016-11-29},
  eprinttype = {arxiv},
  pages = {1--40},
  eprint = {arXiv:1611.09477 [stat.AP]}
}

Zumel, N. and Mount, J. (2016). vtreat: a data.frame processor for predictive modeling. arXiv:1611.09477 [stat.AP] https://arxiv.org/abs/1611.09477.

And you can cite the vtreat package as:

@misc{vtreatpackage,
  title = {vtreat: A Statistically Sound data.frame Processor/Conditioner},
  author = {John Mount and Nina Zumel},
  year = {2016},
  note = {R package version 0.5.28},
  howpublished = {\url{https://CRAN.R-project.org/package=vtreat}},
  url = {https://CRAN.R-project.org/package=vtreat}
}

Mount, J. and Zumel, N. (2016). vtreat: A statistically sound data.frame processor/conditioner. https://CRAN.R-project.org/package=vtreat. R package version 0.5.28.

For more articles on vtreat please try here or here.
