A recent Wall Street Journal article ruminated about the degree to which language shapes thought (rather than the other way around). This idea has rather profound implications in the more specific domain of programming languages. We initially learn a programming language but later “think” in terms of the language.
To some degree, we are constrained in our ability to solve problems if we only know a single language. This situation has been recognized in different ways by the programming community. The Logo programming language was built upon constructionist learning theory and was intended to provide a “mental model” for children to come to understand mathematical constructs. In recent times, many programmers have committed to being polyglots, learning new languages as a part of professional development. Their concern is not always to learn the latest language that they will need for work, but to discover new ways of conceptualizing problems and structuring solutions. This leads to a more subtle goal of ggplot2.
The ggplot2 package is appealing because it makes it possible to quickly create attractive graphs and charts. Beyond that, it is based upon an underlying “grammar of graphics”. This grammar serves a number of purposes. It provides a structure for the API implementation. The API is designed so that you specify what you want rather than how to create it.
Another, perhaps more subtle effect is that it can also influence the way that an R programmer thinks about creating a graph. With this in mind, it is helpful to “think through” the process of creating a chart in the terms presented by ggplot2 in a more disciplined fashion. In the grammar, a plot is made up of the following components:
- Default Data Set
- Set of Aesthetic Mappings
- Multiple Layers (points, jittered points, box plots, histogram)
- Scale for Each Aesthetic
- Faceting Specification
- Coordinate System
In this case, a layer comprises several elements of its own:
- Data set and Aesthetic Mapping
- Geometric Object
- Position Adjustment
Data is not included as a part of ggplot2. In addition, algebra (a component identified by Wilkinson) is not included, as it is in the realm of data transformation rather than actual chart creation.
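To make the component lists concrete, here is a minimal sketch that spells out every piece explicitly instead of relying on defaults. It assumes the ggplot2 package is installed and uses R's built-in iris data; the particular variables, axis names, and facet choice are mine, picked only for illustration. Note that `layer()` also asks for a stat, which is set to `"identity"` here.

```r
library(ggplot2)

# Default data set and aesthetic mappings, shared by every layer
explicit_plot <- ggplot(data = iris,
                        mapping = aes(x = Petal.Length, y = Petal.Width)) +
  # One layer: geometric object, stat, and position adjustment
  layer(geom = "point", stat = "identity", position = "identity") +
  # A scale for each aesthetic
  scale_x_continuous(name = "Petal length") +
  scale_y_continuous(name = "Petal width") +
  # Faceting specification: one panel per species
  facet_wrap(~ Species) +
  # Coordinate system
  coord_cartesian()
```

Printing `explicit_plot` draws the chart. In everyday use most of these pieces are filled in by defaults, which is why a typical call is much shorter.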
The individual components of the grammar are fairly well defined regardless of where they appear on a list. The possible interactions between the components are rather complex. The construction of a traditional chart is defined by a distinct combination of components. For example, the combination of a geom and a stat is significant. At other times, the coordinate system is a defining factor. (A pie chart is a one-column stacked bar chart that is mapped to a polar coordinate system.)
| Chart | Geom | Stat | Coordinate System |
|---|---|---|---|
| Scatterplot | point | identity | cartesian |
| Histogram | bar | bin | cartesian |
| Pie Chart | bar | identity | polar |
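The three combinations in the table can be sketched directly in ggplot2. This is an illustration under my own assumptions: it uses R's built-in iris data, an arbitrary bin count of 20, and a small `species_counts` data frame (a name I made up) so that the pie chart can use the identity stat on precomputed counts.

```r
library(ggplot2)

# Scatterplot: point geom + identity stat + Cartesian coordinates
scatter <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(stat = "identity") +
  coord_cartesian()

# Histogram: bar geom + bin stat + Cartesian coordinates
histogram <- ggplot(iris, aes(x = Petal.Length)) +
  stat_bin(geom = "bar", bins = 20) +
  coord_cartesian()

# Pie chart: a single stacked bar (identity stat) in polar coordinates
species_counts <- as.data.frame(table(Species = iris$Species))
pie <- ggplot(species_counts, aes(x = "", y = Freq, fill = Species)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")
```

Swapping `coord_polar(theta = "y")` for `coord_cartesian()` in the last plot turns the pie back into a stacked bar, which is the point the table makes.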
Iris Data Set
The iris data set is a well-known set of multivariate data introduced by Ronald Fisher in the 1930s. The first few rows of the set are as follows:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
The charts below will use the following components:
- Data – iris data set
- Mapping to aesthetic
  - x – Petal.Length
  - y – Petal.Width
  - Color – Species
- Geometric Object – point
A scatterplot that includes these components can be created using qplot. However, see Harlan’s comment below (and his blog, which I appear to be echoing): this is probably not the best way to start if one wants to “think” in the grammar rather than simply produce a good-looking graph in the smallest number of keystrokes. Better to use ggplot as demonstrated later in the post.
qplot(Petal.Length, Petal.Width, data=iris, color=Species)
The main decision that needed to be made to construct this call was how to map the aesthetics. It is important to consider whether each variable being mapped is discrete or continuous to create a meaningful (and not just grammatically correct) result.
| Aesthetic | Discrete variable | Continuous variable |
|---|---|---|
| Color | Distinct color | Gradient (red to blue) |
| Size | Distinct steps | Radius based on value |
| Shape | Distinct shape | N/A |
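A short sketch of the discrete versus continuous distinction for the color aesthetic, using R's built-in iris data. The choice of Sepal.Length as the continuous variable is mine, and the red and blue endpoints are set explicitly rather than taken from the package defaults.

```r
library(ggplot2)

# Species is a factor (discrete); the four measurements are numeric (continuous)
sapply(iris, class)

base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width))

# Discrete variable mapped to color: one distinct color per species
discrete_color <- base + geom_point(aes(color = Species))

# Continuous variable mapped to color: a gradient between two endpoints
continuous_color <- base +
  geom_point(aes(color = Sepal.Length)) +
  scale_color_gradient(low = "red", high = "blue")
```

Both calls are grammatically valid; whether the result is meaningful depends on matching the aesthetic to the kind of variable.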
We can even include more information by mapping another attribute (sepal area – derived from sepal length times width) to size.
qplot(Petal.Length, Petal.Width, data=iris, color=Species,
      size=Sepal.Length * Sepal.Width)
A great deal of information can be encoded using the various data attributes. The plot gives some indication of the petal length and width (based upon position), species (based upon color) and sepal area (based upon size). However, not every value is clearly in view, because points can overlap. A few changes can reveal the overlap. A jitter might be used. A better alternative is to set an alpha value, which provides a degree of transparency.
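qplot(Petal.Length, Petal.Width, data=iris, color=Species,
      size=Sepal.Length * Sepal.Width, alpha=I(0.7))  # 0.7 is an illustrative value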
We can be more explicit about what is going on using ggplot rather than qplot. The basic scatterplot can be created and in this case will be stored in a variable.
p = ggplot(data=iris,
  aes(x=Petal.Length, y=Petal.Width, color=Species)
) + geom_point()
With the original plot in a variable, we can add components and immediately see their effect as it is rendered. A line might help discern a trend in the original scatterplot. When applying a stat, you need to – well – think statistically. Consider the following.
p + stat_abline()
The line created doesn’t mean much – this is because it is simply a line with a slope of 1 and intercept of zero. A more meaningful line can be created by determining the line of best fit.
coef(lm(Petal.Width ~ Petal.Length, data=iris))
# (Intercept) Petal.Length
# -0.3630755 0.4157554
So calling a given stat did something in this case. To get it to do something meaningful required additional work.
Distinction between Grammar Components
The distinction between a statistic and a geometric object is not always clear (at least in terms of the ggplot2 API). A line with a slope and intercept might be thought of as a statistic or a geometric object.
p + geom_abline(intercept=-0.363, slope=0.416, color='purple')
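The two views of the same line can be put side by side. This is a sketch under a few assumptions: it builds its own base plot without the Species color mapping so that a single line is fitted, and it uses `stat_smooth()` for the statistic version because, as far as I know, `stat_abline()` was dropped in ggplot2 2.0.0.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + geom_point()

# Line as a geometric object: fit the model, then pass the coefficients in
fit <- coef(lm(Petal.Width ~ Petal.Length, data = iris))
line_as_geom <- base +
  geom_abline(intercept = fit[["(Intercept)"]],
              slope = fit[["Petal.Length"]],
              color = "purple")

# Line as a statistic: let the stat compute the linear fit itself
line_as_stat <- base +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "purple")
```

Both plots show the same least-squares line over the range of the data. The first makes you do the statistics yourself; the second hides it inside the layer.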
Likewise, a position adjustment (like a jitter) can be thought of in both geometric and positional terms.
qplot(Petal.Length, Petal.Width, data=iris) +
  geom_jitter()
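A small sketch of the two ways a jitter can be expressed, using ggplot rather than qplot. The jitter amounts of 0.05 are arbitrary values I chose for illustration.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width))

# Jitter expressed as a geometric object
jitter_as_geom <- base + geom_jitter(width = 0.05, height = 0.05)

# Jitter expressed as a position adjustment on the point geom
jitter_as_position <- base +
  geom_point(position = position_jitter(width = 0.05, height = 0.05))
```

Under the hood both layers pair the point geom with a jitter position, so `geom_jitter()` is best read as a convenient shorthand.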
I point out these idiosyncrasies because, as with many formal abstractions of real world concepts, edge cases exist. For example, in Western music theory, one dutifully learns the rules of counterpoint only to find out that they are not always observed by composers in practice and that certain constructs are not easily classified. This doesn’t eliminate the usefulness of studying music theory. It simply highlights the difficulty in neatly categorizing every aspect of a specific creation in an accurate and meaningful way. And for what it’s worth, I think that Hadley Wickham has done a marvelous job, and he appears to have taken an approach of providing an interface to underlying functionality when it appears in more than one category.
Note that the order in which geoms and stats are applied matters! For instance:
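A minimal sketch of the ordering effect, assuming Petal.Length grouped by Species as the variables (my choice for illustration) and R's built-in iris data.

```r
library(ggplot2)

by_species <- ggplot(iris, aes(x = Species, y = Petal.Length))

# Points first, boxplot second: the boxes are drawn over the points
points_then_box <- by_species + geom_point() + geom_boxplot()
```

Layers are drawn in the order they are added, so the later layer sits on top.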
The boxplot obscures the original points. These can be added back on after applying the boxplot.
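Drawing the points last keeps them visible. Again a sketch, assuming Petal.Length grouped by Species as illustrative variables.

```r
library(ggplot2)

by_species <- ggplot(iris, aes(x = Species, y = Petal.Length))

# Boxplot first, points second: the points sit on top of the boxes
box_then_points <- by_species + geom_boxplot() + geom_point()
```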
This gives a glimpse of the flexibility and sophistication of the system. The fundamental elements of chart design that comprise the grammar can be combined in new and flexible ways. Not every grammatically correct possibility is aesthetically pleasing or accurate as interpreted by human perception. But ggplot2 is worth learning not only for its own sake, but for the insights it can provide into the creative activity of constructing charts and graphs.