vtreat data cleaning and preparation article now available on arXiv

November 30, 2016

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Nina Zumel and I are happy to announce that a formal article on data preparation and cleaning with the vtreat methodology is now available from arXiv.org as arXiv:1611.09477 [stat.AP].

vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that the data has fewer exceptional cases, making it easier to safely use models in production. Common problems vtreat defends against include infinite values, NA values, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application but not during training). vtreat::prepare should be your first choice for real-world data preparation and cleaning.
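As a rough illustration of the workflow described above, here is a minimal sketch using a made-up data frame. The data and the names `dTrain`, `dApp`, and `plan` are assumptions for illustration only. The sketch uses the outcome-free `designTreatmentsZ()` to build a treatment plan and then `vtreat::prepare()` to apply it; the article itself covers the outcome-aware designers and the statistical details.

```r
library(vtreat)

# Hypothetical training data: a missing numeric value, an infinite value,
# and a missing categorical value.
dTrain <- data.frame(
  x = c(1, 2, NA, 4, Inf, 6),
  grp = c("a", "a", "b", NA, "b", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical application data with a level ("z") never seen in training.
dApp <- data.frame(
  x = c(NA, 3),
  grp = c("z", "a"),
  stringsAsFactors = FALSE
)

# Design an outcome-free treatment plan from the training data.
plan <- designTreatmentsZ(dTrain, c("x", "grp"), verbose = FALSE)

# Apply the plan; pruneSig = NULL keeps every derived variable.
trainTreated <- prepare(plan, dTrain, pruneSig = NULL)
appTreated <- prepare(plan, dApp, pruneSig = NULL)

print(appTreated)
```

The treated frames should contain only numeric columns with no NA or infinite entries, including for the novel level in `dApp`. The exact names of the derived columns vary between vtreat versions, so it is safer to inspect them than to hard-code them.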

We hope this article will make getting started with vtreat much easier. We also hope this helps with citing the use of vtreat in scientific publications.

We have also submitted a formal draft to The Journal of Statistical Software. JSS is a bit of a new venue for us, so we would appreciate any help we can get with the review process.

You can cite the current article as:

    title = {vtreat: a data.frame Processor for Predictive Modeling},
    author = {Nina Zumel and John Mount},
    year = {2016},
    month = {November},
    journal = {arXiv},
    date = {2016-11-29},
    howpublished = {arXiv:1611.09477 [stat.AP] \url{https://arxiv.org/abs/1611.09477}},
    url = {https://arxiv.org/abs/1611.09477},
    urldate = {2016-11-29},
    eprinttype = {arxiv},
    pages = {1--40},
    eprint = {arXiv:1611.09477 [stat.AP]}

Zumel, N. and Mount, J. (2016). vtreat: a data.frame processor for predictive modeling. arXiv:1611.09477 [stat.AP] https://arxiv.org/abs/1611.09477.

And you can cite the vtreat package as:

    title = {vtreat: A Statistically Sound data.frame Processor/Conditioner},
    author = {John Mount and Nina Zumel},
    year = {2016},
    note = {R package version 0.5.28},
    howpublished = {\url{https://CRAN.R-project.org/package=vtreat}},
    url = {https://CRAN.R-project.org/package=vtreat}

Mount, J. and Zumel, N. (2016). vtreat: A statistically sound data.frame processor/conditioner. https://CRAN.R-project.org/package=vtreat. R package version 0.5.28.

For more articles on vtreat please try here or here.
