I was excited to start using the new set of ‘tidymodels’ packages from Max Kuhn (creator of caret): rsample, recipes, yardstick, parsnip and dials. These are still under development but seem promising. The one I have found most useful so far is recipes. Here I’ll give a quick overview of how to use it for some simple data preparation for machine learning.
R’s approach to machine learning has always been a bit haphazard and fragmented. There has never been an equivalent to Python’s scikit-learn. I have never really got along with caret (the main contender) or mlr. I found their APIs difficult to learn, and I’ve never liked the amount of control you give up by using them. I like the fact that this new set of packages is modular, so each one can be used without fully giving up on other approaches.
Basically, recipes provides a set of tools for preparing data and creating design matrices, which is a form of feature engineering. These matrices can then be used as training data for ML models. This is done in four steps:
- Create a recipe made up of steps (e.g. missing data imputation and skew correction; many are provided in the package)
- Prep that recipe using the training data (e.g. use the training data to learn imputation values)
- Create a model matrix by applying the prepped recipe to the training data
- (Optional) Create another model matrix using the same steps but applied to a new dataset (a test or production dataset, say).
Here is a quick example that does median imputation, then centres and scales the airquality dataset, to give an idea of how it works.
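A minimal sketch of that workflow, assuming a recent release of recipes. The choice of Ozone as the outcome, the random 100-row training split and the variable names are my own assumptions for illustration. Function names have also changed between versions: older releases called the imputation step `step_medianimpute()` and spelled the `bake()` argument `newdata`, so adjust to whatever you have installed.

```r
library(recipes)

# Illustrative train/test split (an assumption, not part of the dataset)
set.seed(42)
train_idx <- sample(nrow(airquality), 100)
train <- airquality[train_idx, ]
test  <- airquality[-train_idx, ]

# 1. Create a recipe made up of steps.
#    Ozone is treated as the outcome here, so its missing values are left alone.
rec <- recipe(Ozone ~ ., data = train) %>%
  step_impute_median(all_predictors()) %>%  # fill NAs with the training median
  step_center(all_predictors()) %>%         # subtract the training mean
  step_scale(all_predictors())              # divide by the training sd

# 2. Prep the recipe: medians, means and sds are learned from the training data
prepped <- prep(rec, training = train)

# 3. Apply the prepped recipe to the training data
train_baked <- bake(prepped, new_data = train)

# 4. (Optional) Apply the same learned values to new data
test_baked <- bake(prepped, new_data = test)
```

Because the statistics are learned once in `prep()`, the test set is imputed and scaled with the training medians, means and standard deviations rather than its own.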
After doing this you can go off and do what you want with the model matrix. Changing the composition argument lets you choose the output format: “tibble”, “matrix”, “data.frame”, or “dgCMatrix”.
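As a rough illustration of the composition argument, the sketch below bakes the same prepped recipe into two different formats. It assumes a recent recipes release, where `bake()` takes `new_data` and `composition`. Passing `all_predictors()` to `bake()` to keep only the predictor columns is my own choice here, as are the object names.

```r
library(recipes)

# Same recipe as in the overview: impute, centre and scale the predictors
rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_impute_median(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

prepped <- prep(rec, training = airquality)

# Predictors only, as a plain numeric matrix (all columns must be numeric)
x_mat <- bake(prepped, new_data = airquality, all_predictors(),
              composition = "matrix")

# All columns, as a base data.frame rather than the default tibble
x_df <- bake(prepped, new_data = airquality, composition = "data.frame")
```

The matrix form is handy for modelling functions that take an `x`/`y` interface instead of a formula.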
The recipes package is really useful and I’ve been using it a lot lately. It has streamlined a part of my workflow that I’d been struggling with. It still has a few rough edges but is really worth trying out.