Thinking about Graphs

Posted on July 30, 2010 by C in R bloggers | 0 Comments

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]


A recent Wall Street Journal article ruminated about the degree to which language shapes thought (rather than the other way around). This idea has rather profound implications in the more specific domain of programming languages. We initially learn a programming language, but later we “think” in terms of that language.

To some degree, we are constrained in our ability to solve problems if we only know a single language. This situation has been recognized in different ways by the programming community. The Logo programming language was built upon constructionist learning theory and was intended to provide a “mental model” through which children could come to understand mathematical constructs. In recent times, many programmers have committed to being polyglots, learning new languages as a part of professional development. Their concern is not always to learn the latest language they will need for work, but to discover new ways of conceptualizing problems and structuring solutions. This leads to a more subtle goal of ggplot2.

The ggplot2 package is appealing because it makes it possible to quickly create attractive graphs and charts. However, it is also based upon an underlying “grammar of graphics”. This grammar serves a number of purposes. It provides a structure for the API implementation, and the API is designed so that you specify what you want rather than how to create it.

Another, perhaps more subtle effect is that the grammar can influence the way that an R programmer thinks about creating a graph. With this in mind, it is helpful to “think through” the process of creating a chart in the terms presented by ggplot2 in a more disciplined fashion.
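The “what rather than how” idea can be made concrete with a small contrast. The following is a minimal sketch, not part of the original post: it uses the built-in iris data set (introduced later in the post), and the color choices and the names species_cols and p_decl are illustrative assumptions.

```r
library(ggplot2)

# "How": with base graphics we pick the colors and draw the legend ourselves.
# species_cols is a hypothetical lookup table made up for this sketch.
species_cols <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
plot(iris$Petal.Length, iris$Petal.Width,
     col = species_cols[as.character(iris$Species)], pch = 19)
legend("topleft", legend = names(species_cols), col = species_cols, pch = 19)

# "What": with ggplot2 we state the data, the aesthetic mappings and the geom.
# The color scale and the legend are derived from that specification.
p_decl <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
print(p_decl)
```

Both calls produce a comparable scatterplot, but only the second one reads as a description of the chart.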

Components of a Plot

According to Hadley Wickham (the author of ggplot and the ggplot book), the following components make up a plot:

Data
Aesthetic Mappings
Geometric Objects
Statistical Transformations
Position Adjustment
Faceting
Coordinate System

The Reference Manual is also organized around these components:

Geoms (Geometric Objects)
Statistics (Statistical Transformations)
Scales
Coordinate System
Faceting
Position Adjustment

He organized the material slightly differently in a presentation at Vanderbilt:

Default Data Set
Set of Aesthetic Mappings
Multiple Layers (points, jittered points, box plots, histogram)
Scale for Each Aesthetic
Faceting Specification
Coordinate System

In this case a layer comprises several of the elements listed earlier:

Data set and Aesthetic Mapping
Geometric Object
Statistics
Position Adjustment

Data is not included as a part of ggplot2. In addition, algebra (a component identified by Wilkinson) is not included, as it is in the realm of data transformation rather than actual chart creation.
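The layer composition described above can be spelled out in code. This is a minimal sketch that assumes a reasonably recent ggplot2 (2.0.0 or later), where layer() expects the geom, stat and position to be named explicitly; the variable names p_layer and p_short are made up for the example.

```r
library(ggplot2)

# A layer written out in full: geometric object, statistic and position
# adjustment are all named. Data and aesthetic mapping are inherited from
# the ggplot() call.
p_layer <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  layer(geom = "point", stat = "identity", position = "identity")

# The usual shorthand: geom_point() fills in the same stat and position.
p_short <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()
```

Seen this way, the geom_*() and stat_*() helpers are convenient ways of building a layer with sensible defaults for the parts you do not mention.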

The individual components of the grammar are fairly well defined regardless of where they appear on a list. The possible interactions between the components are rather complex. Each traditional chart is defined by a distinct combination of components. For example, the combination of a geom and a stat is significant. At other times, the coordinate system is the defining factor. (A pie chart is a one-column stacked bar chart that is mapped to a polar coordinate system.)

| Chart       | Geom  | Stat     | Coordinate System |
|-------------|-------|----------|-------------------|
| Scatterplot | point | identity | cartesian         |
| Histogram   | bar   | bin      | cartesian         |
| Pie Chart   | bar   | identity | polar             |
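The three combinations in the table can be sketched directly in ggplot2. This is only an illustration under assumptions: it uses the iris data set, the bin count of 20 is arbitrary, and the pie chart pre-computes species counts (species_counts is a made-up name) so that the bar geom can use the identity stat as the table states.

```r
library(ggplot2)

# Scatterplot: point geom + identity stat + Cartesian coordinates.
scatter <- ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point(stat = "identity") +
  coord_cartesian()

# Histogram: bar geom + bin stat + Cartesian coordinates.
histogram <- ggplot(iris, aes(Petal.Length)) +
  stat_bin(geom = "bar", bins = 20) +
  coord_cartesian()

# Pie chart: a one-column stacked bar chart (bar geom + identity stat)
# mapped to polar coordinates.
species_counts <- as.data.frame(table(Species = iris$Species))
pie <- ggplot(species_counts, aes(x = "", y = Freq, fill = Species)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")
```

Removing coord_polar() from the last plot leaves an ordinary stacked bar, which shows that the coordinate system alone is what turns it into a pie chart.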

Iris Data Set

The iris data set is a well known set of multivariate data introduced by Ronald Fisher in the 1930s. The first few rows of the set are as follows:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
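The rows above can be reproduced in an R session, since iris ships with base R. A short sketch for inspecting the data before mapping it to aesthetics (the name column_types is made up for the example):

```r
# iris is available without loading any package.
head(iris)   # the first six rows, as listed above
str(iris)    # column types: four numeric measurements and one factor

# Which columns are continuous (numeric) and which are discrete (factor)
# matters later, when variables are mapped to aesthetics.
column_types <- vapply(iris, function(col) class(col)[1], character(1))
print(column_types)
```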

The charts below will use the following components:

Data – iris data set
Mapping to aesthetic

x – Petal.Length
y – Petal.Width
Color – Species

Geometric Object – point

A scatterplot that includes these components can be created using qplot. However, see Harlan’s comment below (and his blog, which I appear to be echoing): this is probably not the best way to start if one wants to “think” in the grammar rather than simply produce a good-looking graph in the smallest number of keystrokes. It is better to use ggplot, as demonstrated later in the post.

library(ggplot2)

qplot(Petal.Length, Petal.Width, data=iris, color=Species)

The main decision that needed to be made to construct this call was how to map the aesthetics. It is important to consider whether each variable being mapped is discrete or continuous to create a meaningful (and not just grammatically correct) result.

| Aesthetic | Discrete       | Continuous             |
|-----------|----------------|------------------------|
| Color     | Distinct color | Gradient (red to blue) |
| Size      | Distinct steps | Radius based on value  |
| Shape     | Distinct shape | N/A                    |
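The table can be checked against the iris data, where Species is discrete (a factor) and the measurements are continuous. This is a minimal sketch in ggplot() syntax; the variable names are illustrative and the exact colors come from the package defaults, which may differ from the red-to-blue gradient mentioned above.

```r
library(ggplot2)

base <- ggplot(iris, aes(Petal.Length, Petal.Width))

# Discrete variable mapped to color: one distinct color per species.
p_discrete <- base + geom_point(aes(color = Species))

# Continuous variable mapped to color: values placed along a gradient.
p_continuous <- base + geom_point(aes(color = Sepal.Length))

# Shape only makes sense for discrete variables. Mapping a numeric column
# such as Sepal.Length to shape raises an error when the plot is built.
p_shape <- base + geom_point(aes(shape = Species))
```

Printing p_discrete and p_continuous side by side shows the difference in the legends: a set of keys for the factor, a color bar for the numeric variable.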

We can even include more information by mapping another attribute (sepal area – derived from sepal length times width) to size.

qplot(Petal.Length,
      Petal.Width,
      data=iris,
      size=Sepal.Length * Sepal.Width,
      color=Species)

A great deal of information can be encoded using the various data attributes. The plot gives some indication of the petal length and width (based upon position), species (based upon color) and sepal area (based upon size). However, not every value is clearly in view, because points can overlap. A few changes can reveal that overlap. A jitter might be used. A better alternative is to set an alpha value, which provides a degree of transparency.

qplot(Petal.Length,
      Petal.Width,
      data=iris,
      size=Sepal.Length * Sepal.Width,
      color=Species,
      alpha=I(0.3))  # I() sets a constant alpha instead of mapping it as data
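In grammar terms, transparency here is a fixed setting rather than a mapping from data. A minimal sketch of the same plot in ggplot() syntax, where that distinction is visible in the code (the names base, p_set and p_mapped are illustrative):

```r
library(ggplot2)

base <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species))

# Setting: alpha is given outside aes(), so every point gets the constant 0.3
# and no legend entry is created for it.
p_set <- base +
  geom_point(aes(size = Sepal.Length * Sepal.Width), alpha = 0.3)

# Mapping: alpha inside aes() is treated as data, passed through a scale and
# shown in a legend, which is rarely what is wanted for a constant.
p_mapped <- base +
  geom_point(aes(size = Sepal.Length * Sepal.Width, alpha = 0.3))
```

With p_set, overlapping points show up as darker regions because the transparent colors accumulate.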

We can be more explicit about what is going on using ggplot rather than qplot. The basic scatterplot can be created and in this case will be stored in a variable.

p <- ggplot(data=iris,
            aes(Petal.Length,
                Petal.Width,
                color=Species)
     ) + geom_point()

With the original plot in a variable, we can add components and immediately see their effect as it is rendered. A line might help discern a trend in the original scatterplot. When applying a stat, you need to – well – think statistically. Consider the following.

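# Note: stat_abline() belongs to the 2010-era API. It was removed in ggplot2 2.0.0,
# where geom_abline() with no arguments draws the same default line.
p + stat_abline()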

The line created doesn’t mean much – this is because it is simply a line with a slope of 1 and intercept of zero. A more meaningful line can be created by determining the line of best fit.

coef(lm(Petal.Width ~ Petal.Length, data=iris))
# this returns
# (Intercept) Petal.Length
#  -0.3630755    0.4157554

p +
  stat_abline(intercept=-0.363,
              slope=0.416,
              color='purple')

So calling a given stat did something in this case. To get it to do something meaningful required additional work.
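The additional work can also be done in code instead of copying rounded coefficients by hand. The following is a sketch under assumptions: it uses geom_abline() and geom_smooth() from current ggplot2 rather than the stat_abline() call shown in the post, and it redefines p so that the block runs on its own.

```r
library(ggplot2)

p <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()

# Option 1: fit the model once and pass the coefficients to the line layer.
fit <- coef(lm(Petal.Width ~ Petal.Length, data = iris))
p_fit <- p + geom_abline(intercept = fit[["(Intercept)"]],
                         slope = fit[["Petal.Length"]],
                         color = "purple")

# Option 2: let a smoothing layer fit the linear model.
# aes(group = 1) requests one line for all species rather than one per color.
p_smooth <- p + geom_smooth(aes(group = 1), method = "lm", formula = y ~ x,
                            se = FALSE, color = "purple")
```

Option 1 keeps the statistical step explicit, while Option 2 hands it to the plotting layer; both should draw the same least-squares line over the range of the data.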

Distinction between Grammar Components

The distinction between a statistic and a geometric object is not always clear (at least in terms of the ggplot2 API). A line with a slope and intercept might be thought of as a statistic or as a geometric object.

p + geom_abline(intercept=-0.363, slope=0.416, color='purple')
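One reason for the blurred line is that every geom carries a default stat and every stat carries a default geom, so the same layer can often be reached from either side. A minimal sketch using a histogram (the bin count of 10 and the variable names are arbitrary choices for the example):

```r
library(ggplot2)

# Starting from the geometric object: geom_histogram() uses the bin stat.
by_geom <- ggplot(iris, aes(Petal.Length)) + geom_histogram(bins = 10)

# Starting from the statistic: stat_bin() draws its result with the bar geom.
by_stat <- ggplot(iris, aes(Petal.Length)) + stat_bin(geom = "bar", bins = 10)

# Both routes compute the same binned counts.
counts_geom <- ggplot_build(by_geom)$data[[1]]$count
counts_stat <- ggplot_build(by_stat)$data[[1]]$count
print(identical(counts_geom, counts_stat))
```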

Likewise, a position adjustment (like a jitter) can be thought of in both geometric and positional terms.

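# Note: the position argument of qplot() is part of the 2010-era API
# and is deprecated in recent ggplot2 releases.
qplot(Petal.Length,
      Petal.Width,
      data=iris,
      position='jitter') +
  geom_abline(intercept=-0.363, slope=0.416)

qplot(Petal.Length, Petal.Width, data=iris) +
  geom_jitter()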
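The same overlap between categories can be seen in ggplot() syntax by inspecting the layers that the two spellings create. This is a sketch; p_geom and p_pos are illustrative names.

```r
library(ggplot2)

base <- ggplot(iris, aes(Petal.Length, Petal.Width))

# Jitter requested as if it were a geometric object...
p_geom <- base + geom_jitter()

# ...and as a position adjustment applied to the point geom.
p_pos <- base + geom_point(position = "jitter")

# Both layers use the point geom together with the jitter position.
print(class(p_geom$layers[[1]]$geom)[1])
print(class(p_geom$layers[[1]]$position)[1])
print(class(p_pos$layers[[1]]$geom)[1])
print(class(p_pos$layers[[1]]$position)[1])
```

In other words, geom_jitter() is a convenience name for a point layer whose position adjustment is a jitter.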

I point out these idiosyncrasies because, as with many formal abstractions of real world concepts, edge cases exist. For example, in Western music theory, one dutifully learns the rules of counterpoint only to find out that they are not always observed by composers in practice and that certain constructs are not easily classified. This doesn’t eliminate the usefulness of studying music theory. It simply highlights the difficulty of neatly categorizing every aspect of a specific creation in an accurate and meaningful way. And for what it’s worth, I think that Hadley Wickham has done a marvelous job, and he appears to have taken the approach of providing an interface to underlying functionality wherever it appears in more than one category.

Order of Application

Note that the order in which geoms and stats are applied matters! For instance:

p+geom_boxplot()

The boxplot obscures the original points. These can be added back on after applying the boxplot.

p+geom_boxplot()+geom_point()
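Layers are drawn in the order in which they are added, so the drawing order can be read off the plot object. A minimal sketch, with p redefined so the block runs on its own and layer_geoms as a hypothetical helper written for this example:

```r
library(ggplot2)

p <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()

# Points first, boxplots second: the boxes are drawn over the points.
boxes_on_top <- p + geom_boxplot()

# Boxplots first, points second: the points stay visible.
points_on_top <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_boxplot() +
  geom_point()

# Hypothetical helper: list the geom class of each layer in drawing order.
layer_geoms <- function(plot) {
  unname(vapply(plot$layers, function(l) class(l$geom)[1], character(1)))
}

print(layer_geoms(boxes_on_top))
print(layer_geoms(points_on_top))
```

The second plot differs slightly from the call in the post, which adds a second point layer on top of the existing one; the visible result is the same.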

This gives a glimpse of the flexibility and sophistication of the system. The fundamental elements of chart design that comprise the grammar can be combined in new and flexible ways. Not every grammatically correct possibility is aesthetically pleasing or accurate as interpreted by human perception. But ggplot2 is worth learning not only for its own sake, but for the insights it can provide into the creative activity of constructing charts and graphs.
