‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. (From the package documentation.)
‘vtreat’ is an R package that incorporates a number of transforms and simulated out-of-sample (cross-frame simulation) procedures that can:
- Decrease the amount of hand-work needed to prepare data for predictive modeling.
- Improve actual model performance on new out-of-sample or application data.
- Lower your procedure documentation burden (through ready vtreat documentation and tutorials).
- Increase model reliability (by re-coding unexpected situations).
- Increase model expressiveness (by allowing use of more variable types, especially large cardinality categorical variables).
‘vtreat’ can be used to prepare data for either regression or classification.
Please read on for what ‘vtreat’ does and what is new.
The primary functions of ‘vtreat’ are the re-coding of high-cardinality categorical variables, the re-coding of missing data, and out-of-sample estimation of variable effects and significances. You can use ‘vtreat’ as a pre-processor and use ‘vtreat::prepare’ as a powerful replacement for ‘stats::model.matrix’. Using ‘vtreat’ should get you quickly into the competitive ballpark of best performance on a real-world data problem (such as KDD2009), leaving you time to apply deeper domain knowledge and model tuning for even better results.
‘vtreat’ achieves this by assuming that you have a modeling “y” (or outcome to predict) throughout, and that all preparation and transformation should be designed to use knowledge of this “y” during training (while anticipating not having the “y” during test or application).
More simply: the purpose of ‘vtreat’ is to quickly take a messy real-world data frame similar to:
And build a treatment plan:
The treatment plan can then be used to clean up the original data and also be applied to any future application or test data:
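A minimal sketch of this workflow using vtreat’s documented ‘designTreatmentsN’/‘prepare’ calls (the data frame and column names here are made up for illustration):

```r
library("vtreat")

# A messy example data frame: a numeric outcome "y", a categorical
# variable with missing entries, and a numeric variable with NAs.
dTrain <- data.frame(
  x = c("a", "a", "b", "b", NA),
  z = c(1, 2, NA, 4, 5),
  y = c(1, 2, 3, 4, 5)
)

# Build a treatment plan using knowledge of the outcome "y".
treatments <- designTreatmentsN(dTrain, varlist = c("x", "z"),
                                outcomename = "y")

# Apply the plan to clean up the original data ...
dTrainTreated <- prepare(treatments, dTrain)

# ... and to any future application or test data, including rows
# with novel levels ("c") or missing values.
dTest <- data.frame(x = c("b", "c", NA), z = c(3, NA, 6))
dTestTreated <- prepare(treatments, dTest)
```

The treated frames are all-numeric, with missing values and unexpected categorical levels re-coded into safe indicator columns.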
‘vtreat’ is designed to be concise, yet implement substantial data preparation and cleaning.
What is new
This release concentrates on code-cleanup and convenience functions inspired by Nina Zumel’s recent article on y-aware PCA/PCR (my note on why you should read this series is here). In particular we now have user-facing functions and documentation on:
- ‘vtreat’ y-aware scaling. This implements the y-aware methods described here. This includes the important variation that all calculations are “simulated out of sample” or “cross-validated” when you replace the ‘vtreat::designTreatments[N/C/Z]’/’vtreat::prepare’ pattern with the ‘vtreat::mkCrossFrame[N/C]Experiment’ pattern. We suggest the ‘designTreatments[N/C/Z]’/‘prepare’ pattern for speed and ease of description, and the ‘mkCrossFrame[N/C]Experiment’ pattern if you suspect over-fitting in the preparation step.
- y-stratified splitting. Procedures in ‘vtreat’ tend to be “y-aware,” so we upgraded the cross-validation (simulated out-of-sample) methods to also be “y-aware” and stratify on the outcome. This can greatly decrease model fitting variance, especially when modeling rare outcomes.
- User-controlled data grouping. ‘vtreat’ uses cross-validation methods throughout to attempt to simulate model performance on new application data. In many situations cross-validation or data splitting must preserve record or row grouping to be at all meaningful. For example, it may not make sense to split multiple records taken from a single patient between the training and test groups; we may instead prefer that each patient have all their records in either training or test. ‘vtreat’ now allows specification of record groups or even user-supplied splitting functions, as documented here.
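As one example from the list above, y-aware scaling is exposed through ‘prepare’’s ‘scale’ argument. A short sketch on synthetic data (the variable names are made up):

```r
library("vtreat")

set.seed(2016)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100, sd = 0.1)

treatments <- designTreatmentsN(d, varlist = "x", outcomename = "y")

# scale = TRUE re-scales each treated variable so a unit change in the
# variable corresponds to a unit change in predicted outcome
# ("y-aware" units), the natural preparation for y-aware PCA/PCR.
dScaled <- prepare(treatments, d, scale = TRUE)
```

In y-aware units, variables with little relation to the outcome are scaled toward zero, so downstream variance-based methods such as PCA concentrate on outcome-relevant directions.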
‘vtreat’ now has essentially two workflows:
- ‘vtreat::designTreatments[N/C/Z]’/’vtreat::prepare’: a faster, more intuitive, more deterministic, and easier-to-explain workflow.
- ‘vtreat::mkCrossFrame[N/C]Experiment’: a workflow that is a bit more expensive in time and space, but more resistant to over-fitting.
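The second workflow looks like the following (a sketch on synthetic data; the example column names are made up):

```r
library("vtreat")

set.seed(2016)
d <- data.frame(x = sample(c("a", "b", "c"), 100, replace = TRUE))
d$y <- ifelse(d$x == "a", 1, 0) + rnorm(100)

# One call builds both the treatment plan and a "cross-frame":
# treated training data where each row's derived values were
# computed without using that row's own outcome.
cfe <- mkCrossFrameNExperiment(d, varlist = "x", outcomename = "y")
dTrainTreated <- cfe$crossFrame  # use this to fit your model
treatments <- cfe$treatments     # use this (via prepare()) on new data
```

The key point is to fit your downstream model on ‘crossFrame’ rather than on ‘prepare(treatments, d)’, which avoids the nested over-fit that can arise when the same rows are used both to design and to apply y-aware transforms.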
We think analysts/data-scientists will be well served by learning both workflows and picking the workflow most appropriate to the data set at hand.