Notes on Engineering Data Analysis (with R and ggplot2)

July 8, 2011

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

Hadley Wickham gave a Google Tech Talk a couple weeks back titled Engineering Data Analysis (with R and ggplot2). These are my notes.

The data analysis cycle is to iteratively transform, visualize and model. Leading into the cycle is data access and the output of the process is knowledge, insight and understanding which can be communicated to others. Transforming the data is almost always necessary to bring data into a workable form. Visualization and modeling have something of a duality where visualization is good at revealing the unexpected but has problems scaling. Models scale better, but will only find expected relationships. A larger cycle comes about when answers to one question lead to more questions.

Hadley makes a case for data analysis in code, rather than GUIs and for R in particular. Working in a programming language gives you a means of:

  • reproducibility
  • automation
  • version control
  • communication

Advantages of R:

  • open source
  • runs anywhere
  • well established community
  • huge library of packages
  • connectivity to other languages

Downsides of R are it’s learning curve, strangeness relative to other programming languages, lack of programming infrastructure and prickliness of the community. R scales well up to about a million observations. How to scale the interactive analysis cycle up to billions of observations is an open question. Programming infrastructure is an area where programmers can contribute.

DSLs help express and think clearly about common problems in data analysis. Hadley views his libraries as DSLs (domain specific languages) within R for the phases of the analysis cycle. For visualization, there’s ggplot2. DSLs align nicely with ggplot’s philosophy as a grammar of graphics. R’s model formula is the DSL for modeling. Plyr is the DSL for data transformation.

The four key verbs of data transformation are:

  • subset
  • mutate
  • arrange
  • summarize


  • group by
  • join

Data can be divided by subsetting or filtering; mutated, for example adding new columns to a table that are functions of other columns; rearranged or sorted; and summarized, condensing a data set down to a smaller number of values. These actions can be combined with a group by operator. Finally, data sets can be joined to other related data sets.

The second half of a talk is a case study, dissecting a set of cause-of-death statistics from the Mexican government. Finally, Hadley makes a familiar sounding point about the tension between making new things and making well-engineered user-friedly software that does old things.

To leave a comment for the author, please follow the link and comment on their blog: Digithead's Lab Notebook. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)