(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN.

`vtreat` is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

Very roughly, `vtreat` accepts an arbitrary “from the wild” data frame (with different column types, `NA`s, `NaN`s and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric, free of `NA`s, `NaN`s, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain specific data preparation (as `vtreat` tries to handle as much of the common stuff as practical). For more of an overall description please see here.
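As a rough illustration of the “design” then “prepare” workflow described above, here is a minimal sketch for a classification outcome. The toy data frame, its column names (`x_cat`, `x_num`, `y`), and the split into a design half and an application half are assumptions made up for this example; only `designTreatmentsC()` and `prepare()` come from `vtreat`.

```r
library(vtreat)

set.seed(2016)
n <- 200

# Toy "from the wild" data (hypothetical): a character column with NAs
# and a numeric column with NA and NaN entries.
d <- data.frame(
  x_cat = sample(c("a", "b", "c", NA), n, replace = TRUE),
  x_num = rnorm(n),
  stringsAsFactors = FALSE
)
d$x_num[c(3, 150)] <- NA
d$x_num[c(7, 160)] <- NaN
d$y <- (!is.na(d$x_cat) & d$x_cat == "a") | (runif(n) < 0.2)

# Design the treatment plan on one set of rows and apply it to another.
d_design <- d[1:100, ]
d_new <- d[101:200, ]

# "design" step: learn the transformation (outcome target is TRUE).
treatments <- designTreatmentsC(d_design, c("x_cat", "x_num"), "y", TRUE,
                                verbose = FALSE)

# "prepare" step: apply the learned transformation to similar data.
# pruneSig = NULL keeps all derived variables (no significance pruning).
d_treated <- prepare(treatments, d_new, pruneSig = NULL)

# The derived independent variables should all be numeric with no NA/NaN.
vars <- setdiff(colnames(d_treated), "y")
print(vapply(d_treated[vars], is.numeric, logical(1)))
print(any(is.na(d_treated[vars])))
```

Designing on one set of rows and preparing another is a deliberate choice in this sketch: the treatment plan uses the outcome, so re-using the same rows for design and for downstream model fitting can overstate variable quality.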

We suggest that all users update (and you will want to re-run any “design” steps rather than mixing “design” and “prepare” from two different versions of `vtreat`).

For what is new in version 0.5.27 please read on.

`vtreat` 0.5.27 is a maintenance release. User-visible improvements include:

- Switching `catB` encodings to a logit scale (instead of the previous log scale).
- Increasing the degree of parallelism by separately parallelizing the level pruning steps (using the methods outlined here).
- Changing the default for `catScaling` to `FALSE`. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
- Falling back to `stats::chisq.test()` instead of insisting on `stats::fisher.test()` for large counts. This calculation is used for level pruning and is only relevant if `rareSig < 1` (the default is `1`). We caution that setting `rareSig < 1` remains a fairly expensive setting. We are trying to make significance estimation much more transparent; for example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called `extraModelDegrees` (found in `designTreatments*()$scoreFrame`).
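To make the `stats::fisher.test()` versus `stats::chisq.test()` fallback concrete, here is a small base R sketch. It is not `vtreat`'s internal code: the helper `level_sig()`, its `max_exact_total` threshold, and the example counts are all hypothetical and only illustrate the general idea of using an exact test for small tables and the chi-squared approximation for large ones.

```r
# Hypothetical helper (not a vtreat function): use the exact test while the
# total count is small, and the chi-squared approximation once it is large.
level_sig <- function(tab, max_exact_total = 10000) {
  if (sum(tab) <= max_exact_total) {
    stats::fisher.test(tab)$p.value
  } else {
    stats::chisq.test(tab)$p.value
  }
}

# Made-up 2x2 contingency table: is a given level associated with the outcome?
small_tab <- matrix(
  c(3, 1,
    40, 56),
  nrow = 2, byrow = TRUE,
  dimnames = list(level = c("rare", "other"), outcome = c("TRUE", "FALSE"))
)
# Same proportions, 1000 times the data.
large_tab <- small_tab * 1000

p_small <- level_sig(small_tab)  # exact test branch
p_large <- level_sig(large_tab)  # chi-squared branch

print(c(small = p_small, large = p_large))
```

With the same proportions but far more rows, the association becomes much more significant, and the chi-squared approximation is the cheaper calculation at that scale.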

The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune many more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize “y-aware” (or outcome aware) processing (using the training outcome to generate useful re-encodings of the data).
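To give a feel for what a “y-aware” re-encoding on a logit scale can look like, here is a minimal base R sketch of an impact-style code for one categorical variable. This is an illustration under stated assumptions, not `vtreat`'s actual `catB` implementation: the toy data, the smoothing pseudo-count `eps`, and all variable names are invented for the example.

```r
set.seed(27)
n <- 200

# Toy data (hypothetical): level "a" has a higher outcome rate than "b" or "c".
d <- data.frame(
  x = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- runif(n) < ifelse(d$x == "a", 0.8, 0.3)

logit <- function(p) log(p / (1 - p))

eps <- 1                               # smoothing pseudo-count (an assumption)
p_all <- mean(d$y)                     # overall outcome rate
n_pos <- tapply(d$y, d$x, sum)         # positives per level
n_all <- tapply(d$y, d$x, length)      # rows per level
p_level <- (n_pos + eps * p_all) / (n_all + eps)  # smoothed per-level rate

# Logit-scale code: change in log-odds relative to the overall rate.
code_logit <- logit(p_level) - logit(p_all)
# Log-scale code (for comparison): change in log-probability.
code_log <- log(p_level) - log(p_all)

# Replace the categorical column by a single numeric column.
d$x_code <- as.numeric(code_logit[d$x])

print(round(cbind(logit = code_logit, log = code_log), 3))
```

Both scales agree in sign for every level; the logit scale treats rates near 0 and near 1 symmetrically, which is one reason it lines up naturally with logistic-regression style models. Because the code is built from the outcome, it should be estimated on data separate from the rows used to fit the final model.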

We have pre-rendered a lot of the package documentation, examples, and tutorials here.
