Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.

vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

Very roughly, vtreat accepts an arbitrary “from the wild” data frame (with mixed column types, NAs, NaNs, and so forth) and returns a transformation that reliably and repeatably converts similar data frames into numeric (matrix-like) frames, with all independent variables numeric and free of NAs, NaNs, infinities, and so on, ready for predictive modeling. This gives a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and which also bring a danger of statistical over-fitting), and it leaves the analyst more time for domain-specific data preparation (as vtreat tries to handle as much of the common work as practical). For more of an overall description please see here.
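As a rough sketch of that design-then-prepare workflow (not an official example): the toy data frame, its column names, and the seed below are made up for illustration, and the vtreat calls are only run if the package is installed. The "design" step learns a treatment plan from training data, and the "prepare" step applies it to new data, including a categorical level never seen during design.

```r
# Hypothetical toy data; column names and values are invented for illustration.
set.seed(2016)
n <- 60
d <- data.frame(
  x_num = c(rnorm(n - 2), NA, NaN),                       # numeric with NA and NaN
  x_cat = sample(c("a", "b", "c", NA), n, replace = TRUE), # categorical with NA
  stringsAsFactors = FALSE
)
# Outcome loosely tied to x_cat == "a", plus some noise.
d$y <- ifelse(is.na(d$x_cat), FALSE, d$x_cat == "a") | (runif(n) < 0.2)

# New data to be treated; "z" is a level not present in the design data.
new_d <- data.frame(
  x_num = c(0.5, NA, NaN),
  x_cat = c("a", "z", NA),
  stringsAsFactors = FALSE
)

treated <- NULL
if (requireNamespace("vtreat", quietly = TRUE)) {
  # "design": learn the treatment plan for a binary classification outcome
  plan <- vtreat::designTreatmentsC(d, c("x_num", "x_cat"), "y", TRUE,
                                    verbose = FALSE)
  # "prepare": apply the plan to new data (pruneSig = NULL keeps all variables)
  treated <- vtreat::prepare(plan, new_d, pruneSig = NULL)
  print(plan$scoreFrame$varName)
  print(treated)
}
```

The plan and the prepared frames should come from the same vtreat version, which is why re-running the design step after an upgrade is the safer habit.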

We suggest all users update (and you will want to re-run any “design” steps rather than mixing “design” and “prepare” results from two different versions of vtreat).

vtreat 0.5.27 is a maintenance release. User-visible improvements include:

• Switching catB encodings to a logit scale (instead of the previous log scale).
• Increasing the degree of parallelism by separately parallelizing the level pruning steps (using the methods outlined here).
• Changing the default for catScaling to FALSE. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
• Falling back to stats::chisq.test() instead of insisting on stats::fisher.test() for large counts. This calculation is used for level pruning and is only relevant if rareSig < 1 (the default is 1). We caution that setting rareSig < 1 remains a fairly expensive setting. We are also trying to make significance estimation much more transparent. For example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called extraModelDegrees (found in designTreatments*()$scoreFrame).
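To make two of the items above concrete, here is a minimal base-R sketch. It is not vtreat's internal code: the function names, the smoothing constant eps, and the max_fisher_n threshold are all hypothetical choices for illustration. The first part compares a log-scale and a logit-scale impact code for the levels of a categorical variable; the second picks between stats::fisher.test() and stats::chisq.test() based on the total count.

```r
# Sketch only: base-R illustrations, not vtreat's implementation.

# (1) Impact-style code per level, on a log scale and on a logit scale.
# eps is a hypothetical clamp that keeps the logs finite.
impact_codes <- function(x, y, eps = 1e-3) {
  p_all <- mean(y)                      # unconditional rate P(y)
  p_lev <- tapply(y, x, mean)           # conditional rate P(y | level)
  lev <- names(p_lev)
  p_lev <- pmin(pmax(as.numeric(p_lev), eps), 1 - eps)
  data.frame(
    level = lev,
    log_scale = log(p_lev) - log(p_all),
    logit_scale = qlogis(p_lev) - qlogis(p_all),
    stringsAsFactors = FALSE
  )
}

x <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")
y <- c(TRUE, TRUE, TRUE, FALSE,
       TRUE, FALSE, FALSE, FALSE,
       TRUE, TRUE, FALSE, FALSE)
codes <- impact_codes(x, y)
print(codes)

# (2) Significance of "is this row in the level?" versus the outcome.
# max_fisher_n is a hypothetical cutoff: exact test for small tables,
# chi-squared approximation once the counts are large.
level_sig <- function(in_level, y, max_fisher_n = 1000) {
  tab <- table(factor(in_level, levels = c(FALSE, TRUE)),
               factor(y, levels = c(FALSE, TRUE)))
  if (sum(tab) <= max_fisher_n) {
    stats::fisher.test(tab)$p.value
  } else {
    stats::chisq.test(tab)$p.value
  }
}

p_small <- level_sig(x == "a", y)                      # 12 rows: exact test
p_large <- level_sig(rep(x == "a", 200), rep(y, 200))  # 2400 rows: chi-squared
print(c(p_small = p_small, p_large = p_large))
```

In this toy data levels a and b have rates 0.75 and 0.25, equally far from the overall rate of 0.5. On the logit scale their codes are equal in magnitude with opposite signs, while on the log scale they are not, which is one way to see why a logit scale is a natural fit for classification.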

The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune many more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize “y-aware” (or outcome-aware) processing (using the training outcome to generate useful re-encodings of the data).

We have pre-rendered a lot of the package documentation, examples, and tutorials here.