(This article was first published on **R – Win-Vector Blog**, and kindly contributed to R-bloggers)

Data preparation and cleaning are some of *the* most important steps of predictive analytics and data science tasks. They are laborious, they are where most of the errors are made, they are your last line of defense against wild data, and they hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they still seem like a neglected topic. Data preparation isn’t as self-contained or genteel as tweaking machine learning models or tuning hyperparameters, and that is one of the reasons it represents such an important practical opportunity for improvement.

Photo: NY – http://nyphotographic.com/, License: Creative Commons 3 – CC BY-SA 3.0

Our group is distributing a detailed write-up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what `vtreat` does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-`R` environments (such as `Python`/`Pandas`/`scikit-learn`, `Spark`, and many others).

We have submitted this article for formal publication, so our intent is that you can cite it (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Alternatively, below is the tl;dr (“too long; didn’t read”) form.

Our concrete advice is: when building a supervised model (regression or classification) in `R`, prepare your training, test, and application data by doing the following.

```r
# load the vtreat package
library("vtreat")

# use your training data to design
# data treatment plan
ce <- mkCrossFrameCExperiment(trainData, vars, yName, yTarget)

# look at the variable scores
varScores <- ce$treatments$scoreFrame
print(varScores)

# prune variables based on significance
pruneSig <- 1/nrow(varScores)
modelVars <- varScores$varName[varScores$sig<=pruneSig]

# instead of preparing training data, use
# "simulated out of sample data" to reduce modeling bias
treatedTrainData <- ce$crossFrame

# prepare any other data (test, future application)
# using the treatment plan
treatedTestData <- prepare(ce$treatments, testData,
                           varRestriction= modelVars,
                           pruneSig= NULL)
```
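The threshold `pruneSig <- 1/nrow(varScores)` in the recipe above can be read as a budget for noise. If we assume that the significance of a variable with no real relation to the outcome behaves roughly like a uniform draw on [0, 1], then out of k candidate variables about k × (1/k) = 1 pure-noise variable is expected to pass the filter, however large k is. The base-`R` simulation below is a minimal sketch of that arithmetic only; it does not call `vtreat`, and the uniform-significance assumption and the names `nVarsGrid` and `expectedNoisePassing` are ours, introduced for illustration.

```r
set.seed(2016)

# ASSUMPTION (illustration only): the significance of a pure-noise
# variable is approximately Uniform(0, 1)
nVarsGrid <- c(10, 100, 1000)   # hypothetical numbers of candidate variables
nRep <- 500                     # simulated experiments per setting

expectedNoisePassing <- sapply(nVarsGrid, function(nVars) {
  pruneSig <- 1 / nVars         # same rule as 1/nrow(varScores)
  # in each simulated experiment, count noise variables with sig <= pruneSig
  nPassing <- replicate(nRep, sum(runif(nVars) <= pruneSig))
  mean(nPassing)
})
names(expectedNoisePassing) <- nVarsGrid

# average number of pure-noise variables admitted, by number of candidates;
# each entry should come out close to 1
print(expectedNoisePassing)
```

Under this assumption a looser threshold admits proportionally more noise variables, and a stricter one risks discarding weak but real signals, so the rule is a starting point rather than a guarantee.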

Then work through our examples to find out what all these steps are doing for you.
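To see why the recipe uses `ce$crossFrame` rather than treating the training data directly, it can help to reproduce the underlying idea by hand. The sketch below is not `vtreat`'s implementation. It is a base-`R` illustration on made-up data, assuming a simple "impact code" (per-level mean of the outcome minus the grand mean) and a 5-fold split; the helper `impactCode` and all variable names are hypothetical. The categorical variable is pure noise by construction, so any apparent relation to `y` is an artifact of how the coding was estimated.

```r
set.seed(2016)

# hypothetical data: a high-cardinality categorical with NO relation to y
n <- 1000
d <- data.frame(
  x = sample(paste0("lev", 1:200), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE
)

# hypothetical helper: estimate per-level (mean of y - grand mean) on
# `fitData`, then look those values up for the rows of `applyData`;
# levels not seen in `fitData` are coded as 0
impactCode <- function(fitData, applyData) {
  grandMean <- mean(fitData$y)
  levelEffect <- vapply(split(fitData$y, fitData$x), mean, numeric(1)) - grandMean
  code <- unname(levelEffect[applyData$x])
  code[is.na(code)] <- 0
  code
}

# naive approach: design the coding and apply it on the same rows
naiveCode <- impactCode(d, d)

# cross-frame style: each row is coded using only the other folds,
# so its own y never influences its own code
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
crossCode <- numeric(n)
for (i in seq_len(k)) {
  held <- fold == i
  crossCode[held] <- impactCode(d[!held, , drop = FALSE],
                                d[held, , drop = FALSE])
}

# the naive code looks predictive of y even though x is noise;
# the cross-coded version should show little or no relation
corNaive <- cor(naiveCode, d$y)
corCross <- cor(crossCode, d$y)
print(c(naive = corNaive, cross = corCross))
```

A model trained on the naive coding would see a signal that cannot recur on new data. Coding each row from data that excludes it is the sense in which the training frame is "simulated out of sample data".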
