Ode to the joy of R data packages
I have been meaning to write my first R package for some time and was fortunate enough to be guided through the process at a Tidy Tools workshop run by Hadley Wickham (which I thoroughly recommend if anyone gets the chance). This got me (somewhat overly) excited about the workflow possibilities, and I hunted out nails for my newfound hammer. The result has revolutionised the way I read in and clean data in preparation for analysis, which I will now evangelise.
I will not go through the technical details of package development. If you want to read about that, I found Hilary Parker’s post really helpful for getting started (supplemented by the fantastic usethis package), and Erik Howard’s invaluable post, as well as Dave Kleinschmidt’s, for then applying this to data storage.
Loading and cleaning data in the Dark Ages, before the R Package workflow
Previously, in my unchecked desire to get started with the ‘real’ analysis, I would get the data in as quickly and dirtily as possible: initially just the first few lines of an R script would load the data. Then came several lines of code to clean it, piece by piece. Functions would grow to tidy things up, and if I was feeling particularly structured I would create a separate R script for loading the data into the environment, and sometimes another for the cleaning functions, and then source these from other scripts and markdown documents when needed. I thought this was pretty organised and efficient, albeit sometimes laborious.
I shudder at the naivety of my former self.
New and improved data loading – cleaning workflow, with R packages
Now, in my R-package-induced enlightened state (surely temporary, until I find a better way), I have the following workflow:
1. “Sketch” the data; maybe some quick plots but usually using head(df) and plenty of unique(df$var).
2. Identify the data cleaning tasks and write unit tests for these jobs.
3. Write functions that will (hopefully / eventually) pass the unit tests.
4. Build / test package; repeat steps 2 – 4.
5. Run cleaning functions on data from data-raw folder, exporting using use_data(df) and documenting where desired.
6. Load into analysis with library(rdatapackage).
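The exporting step might look something like the script below. This is only a sketch: the file name data-raw/df.R, the toy raw data and the inline cleaning step are all assumptions made for illustration. In a real package the raw data would be read from a file in the data-raw folder and cleaned with the package's own tested functions.

```r
# data-raw/df.R (hypothetical file name)

# Toy stand-in for the raw data; normally read from a file in data-raw/.
raw <- data.frame(
  var = c(" Lesson", "document ", "Lesson"),
  stringsAsFactors = FALSE
)

# Stand-in cleaning step; normally a tested function from the package.
df <- raw
df$var <- tolower(trimws(df$var))

# Save the cleaned data to data/df.rda. The guard simply skips the save
# when the script is run outside a package directory.
if (file.exists("DESCRIPTION") && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data(df, overwrite = TRUE)
}
```

Keeping this script in data-raw means the route from raw file to saved data frame stays reproducible and lives next to the functions that do the cleaning.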
Now this is certainly not foolproof – I still sometimes find myself part way through an analysis and needing to head back and do some cleaning – but my transition from wrangling to analysis has been much smoother since implementing this workflow.
But why use a package for data?
Indeed, why not just write some well named and structured cleaning functions as part of the analysis? I have found a few unexpected (to me, at least) gems in using an R package for this kind of task:
RStudio shortcuts + usethis
The keyboard shortcuts (such as Ctrl-Shift-T for testing a package) combined with the usethis package really speed up the process; both seem designed to streamline the package development workflow.
Unit tests
These help gamify the whole process – I find it really satisfying achieving each next task. They also help ensure that solving some more complicated problem does not unknowingly break an earlier cleaning task that you thought was solved.
I also find that writing the code for unit tests is fairly easy, and a natural place to start. For example, in a recent project I had the following entries in a variable:
resource/x-bb-lesson, resource/x-bb-document, ...
Clearly this would read nicer as ‘lesson’ and ‘document’, so writing the unit test was easy enough:
test_that("clean_data_var works", {
  expect_equal(clean_data_var("resource/x-bb-lesson"), "lesson")
  expect_equal(clean_data_var("resource/x-bb-document"), "document")
})
At this stage I have not written the function clean_data_var yet, but I have clearly defined what I need it to do. Now, regex is a powerful tool, but I rarely get expressions right first time, and I have found having the testing built in and only a quick keyboard shortcut away immensely helpful.
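One function that would pass those two tests is sketched below. It is a hypothetical implementation, not necessarily the one I ended up with, and it assumes every entry shares the resource/x-bb- prefix seen in the examples above.

```r
# Hypothetical implementation: drop the "resource/x-bb-" prefix.
# `^` anchors the pattern at the start of the string, so values
# without the prefix pass through unchanged.
clean_data_var <- function(x) {
  sub("^resource/x-bb-", "", x)
}

clean_data_var(c("resource/x-bb-lesson", "resource/x-bb-document"))
```

If another kind of prefix turns up later in the data, the natural next move is to add an expectation for it to the test first and then adjust the pattern until everything passes again.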
Anonymising data
I work a lot with student activity data so being able to move quickly from names / numbers to Student34 or similar is nice.
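A minimal sketch of that kind of relabelling is below. The function name anonymise_ids and its prefix argument are made up for illustration; it simply numbers each distinct identifier in order of first appearance.

```r
# Hypothetical helper: map each distinct identifier to a label such as
# "Student1", "Student2", ... numbered by order of first appearance.
anonymise_ids <- function(x, prefix = "Student") {
  paste0(prefix, match(x, unique(x)))
}

anonymise_ids(c("Ann", "Bob", "Ann"))
```

Because the numbering follows the order of the data, a file sorted by name would still leak a little information; shuffling the distinct identifiers first (for example with sample()) is one way around that.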
Documentation
This has been really helpful in keeping track of what actually was in that data frame, and if there were any issues involved.
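For a data frame exported as df, the documentation can live in a roxygen2 comment block above the quoted object name. The sketch below shows the general shape only: the title, the column names and their descriptions are hypothetical placeholders, not my actual data.

```r
# R/data.R (a common location for dataset documentation)

#' Cleaned student activity data (hypothetical example)
#'
#' Describe where the raw data came from and any known issues here.
#'
#' @format A data frame with one row per activity record:
#' \describe{
#'   \item{student}{Anonymised student label, e.g. "Student34".}
#'   \item{var}{Resource type, e.g. "lesson" or "document".}
#' }
"df"
```

Running devtools::document() turns this into a help page, so ?df in a later analysis answers the "what was in that column again?" question.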
Portability
Not just of the data, but of the functions used to clean the data. Being able to quickly grab an old function to solve a similar task using rdatapackage::functionname is super handy.
Proselytizing over
This might be old hat to some people out there, but for anyone struggling to get stuck into this key part of data science work, this change of approach has really helped me do a better, cleaner job, and it’s fun.