How to do feature engineering in R with the recipes package

[This article was first published on Ortom | R, and kindly contributed to R-bloggers].

I was excited to start using the new set of ‘tidymodels’ packages from Max Kuhn (creator of caret): rsample, recipes, yardstick, parsnip and dials. These are still under development but seem promising. The one I have found most useful so far is recipes. Here I’ll give a quick overview of how you use it to do some simple data preparation for machine learning.

R’s approach to machine learning has always been a bit haphazard and fragmented. There has never been an equivalent to Python’s scikit-learn. I have never really got along with caret (the main contender) or mlr. I found their APIs difficult to learn and I’ve never liked the amount of control you give up by using them. I like the fact that this new set of packages is modular, so each one can be used without fully giving up on other approaches.


[Image from the excellent book Salt, Fat, Acid, Heat]

Basically, recipes provides a bunch of tools for preparing data and creating design matrices, which is a form of feature engineering. These matrices can then be used as training data for ML models. This is done in four steps:

  1. Create a recipe made up of steps (e.g. missing data imputation and skew correction – many are provided in the package)
  2. Prep that recipe using the training data (e.g. use the training data to learn imputation values)
  3. Create a model matrix by applying the prepped recipe to the training data
  4. (Optional) Create another model matrix using the same steps but applied to a new dataset (a test or production dataset, say).

Here is a quick example that does median imputation on the airquality dataset, then centres and scales it, to give an idea of how this works.

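library(recipes)  # also makes the %>% pipe available

aq_train = airquality[1:100, ]
aq_test = airquality[-(1:100), ]

#make recipe
recipe_1 = recipe(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day,
                  data = aq_train) %>%
  #add steps: median imputation, then centre and scale every numeric column
  #(all_numeric() also selects the outcome, Ozone)
  #step_impute_median() is the current name for step_medianimpute()
  step_impute_median(all_numeric()) %>%
  step_center(all_numeric())  %>%
  step_scale(all_numeric())  %>%
  #prep recipe: estimate medians, means and standard deviations from aq_train
  prep(training = aq_train, retain = TRUE,  verbose = TRUE)

#make model matrices using the values learned from aq_train
mm_train = bake(recipe_1, new_data = aq_train, composition = 'matrix')
mm_test = bake(recipe_1, new_data = aq_test, composition = 'matrix')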

After doing this you can go off and do what you want with the model matrix. Changing the composition argument lets you get the result back as a “tibble”, “matrix”, “data.frame” or “dgCMatrix”.
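As a rough illustration of the composition argument, the sketch below bakes the same prepped recipe three ways. It assumes a recent version of recipes in which the median imputation step is called step_impute_median(); the names rec, out_tbl, out_mat and out_df are made up for this example. “dgCMatrix” is left out because it additionally needs the Matrix package.

```r
library(recipes)

aq_train = airquality[1:100, ]

# Same kind of recipe as in the example above, prepped on the training rows
rec = recipe(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq_train) %>%
  step_impute_median(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(training = aq_train)

# Only the container type changes; the processed values are the same
out_tbl = bake(rec, new_data = aq_train, composition = "tibble")
out_mat = bake(rec, new_data = aq_train, composition = "matrix")
out_df  = bake(rec, new_data = aq_train, composition = "data.frame")

class(out_tbl)
class(out_mat)
class(out_df)
```

The matrix form only works here because every column in airquality is numeric; with factor or character columns you would probably want dummy variables first, or a tibble instead.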

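This approach is flexible and allows a prepped recipe to be applied to a new dataset. Because everything the recipe needs (imputation values, means, standard deviations) is learned from the training data only, it avoids data leakage problems. A list of available functions is here. You can also write your own user-defined steps.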
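To make the data leakage point concrete, here is a base R sketch of the idea behind prep and bake: learn the parameters from the training rows, then reuse them unchanged on new rows. This is not how recipes is implemented internally; impute_with(), learn_params() and apply_params() are hypothetical helpers written just for this illustration.

```r
# Hypothetical helpers for illustration only; they are not part of recipes.

# Replace missing values in each column with a previously learned value.
impute_with = function(df, values) {
  for (col in names(values)) {
    df[[col]][is.na(df[[col]])] = values[[col]]
  }
  df
}

# "Prep": estimate every parameter from the training rows only.
learn_params = function(train) {
  medians = vapply(train, median, numeric(1), na.rm = TRUE)
  imputed = impute_with(train, medians)
  list(
    medians = medians,
    means   = vapply(imputed, mean, numeric(1)),
    sds     = vapply(imputed, sd, numeric(1))
  )
}

# "Bake": apply the stored parameters to any data with the same columns.
apply_params = function(df, params) {
  imputed = impute_with(df, params$medians)
  # scale() accepts one centre and one scale value per column
  scale(as.matrix(imputed), center = params$means, scale = params$sds)
}

aq_train = airquality[1:100, ]
aq_test  = airquality[-(1:100), ]

params   = learn_params(aq_train)
mm_train = apply_params(aq_train, params)
mm_test  = apply_params(aq_test, params)  # nothing is re-estimated here
```

The training matrix ends up with column means of zero and standard deviations of one, while the test matrix generally does not, because it is shifted and scaled by the training values rather than its own. That is the behaviour you want: no information from the test rows feeds into the preprocessing.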

The recipes package is really useful and I’ve been using it a lot lately. It has streamlined a part of my workflow that I’d been struggling with. It still has a few rough edges but is really worth trying out.
