Tidying “messy data” in R

March 9, 2014

(This article was first published on R - Data School, and kindly contributed to R-bloggers)

I watched Hadley Wickham’s excellent talk on tidy data and tidy tools, and decided to use this as an opportunity to learn about a few of his R packages. (In case you’re unfamiliar with Hadley, he is well-known for his contributions to the R ecosystem, most notably ggplot2; he is also the Chief Scientist for RStudio.)

The principles of tidy data are simple: every variable (or “feature”) is a column, every observation is a row, and each dataset holds one type of “observational unit.” Tidy datasets, Wickham argues, are easier to model, visualize, and aggregate.
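
To make the first two principles concrete, here is a minimal sketch using reshape2's melt(). The data frame and its column names are made up for illustration and are not one of the datasets from the talk. In the "messy" version a variable (the treatment) is hidden in the column headers; melting moves it into its own column so that each row is a single observation.

```r
library(reshape2)

# Hypothetical messy data: one row per person, one column per treatment,
# so the "treatment" variable lives in the column names.
messy <- data.frame(
  name        = c("Ana", "Ben", "Cho"),
  treatment_a = c(12, 7, 9),
  treatment_b = c(15, 4, 11)
)

# melt() keeps the id column and stacks the remaining columns into
# a pair of columns: one naming the treatment, one holding the result.
tidy <- melt(
  messy,
  id.vars       = "name",
  variable.name = "treatment",
  value.name    = "result"
)

print(tidy)
```

The result has one row per person and treatment, with the columns name, treatment, and result.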

Here are the packages covered in the talk:

  • reshape2: for restructuring and aggregating data
  • plyr (pronounced “plier”): for manipulating and transforming data
  • stringr: for string operations and regular expressions

While watching the talk, I ran the code from the slides on the actual datasets, and annotated the code with comments. If you’re interested in doing the same, you can find the commented code and data files in my GitHub repo. (Note that my code also contains the “Billboard” example which is described in his tidy data paper and classroom slides, but not shown in the video.)

Although I mostly used the code verbatim as presented in the talk, I decided to use the dplyr package instead of plyr. I chose dplyr because it is a successor to plyr focused on data frames (that’s where the “d” in “dplyr” comes from), and because it has a cleaner syntax and runs faster than plyr. (Here’s more from Hadley on why you should use dplyr.)
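
As a rough illustration of the syntax difference, here is a small grouped summary written with dplyr. This is a sketch on a made-up data frame, not code from the talk, and it assumes a dplyr version that provides the %>% pipe. The comment shows what I believe is the equivalent plyr call.

```r
library(dplyr)

# Hypothetical tidy data: one row per person and treatment.
tidy <- data.frame(
  name      = rep(c("Ana", "Ben", "Cho"), times = 2),
  treatment = rep(c("a", "b"), each = 3),
  result    = c(12, 7, 9, 15, 4, 11)
)

# Group by treatment, then compute the mean result within each group.
# A plyr version would look roughly like:
#   plyr::ddply(tidy, "treatment", plyr::summarise, mean_result = mean(result))
by_treatment <- tidy %>%
  group_by(treatment) %>%
  summarise(mean_result = mean(result))

print(by_treatment)
```

Each step reads left to right as a verb applied to the data frame, which is a large part of why the dplyr version is easier to follow.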

If you’re interested in learning more about dplyr, there are some excellent vignettes on the CRAN package page, especially the “Introduction to dplyr.”

