For a much better looking version of this post (where code is actually readable!), see this GitHub repository, which also contains some of the example datasets I use and a literate programming version of this tutorial.
Let’s start with a preview of what ggplot2 can do.
Given Fisher’s iris data set and one simple command…
qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
…we can produce this plot of sepal length vs. petal length, colored by species.
You can download R here. After installation, you can launch R in interactive mode by either typing
R on the command line or opening the standard GUI (which should have been included in the download).
Vectors are a core data structure in R, and are created with
c(). Elements in a vector must be of the same type.
numbers = c(23, 13, 5, 7, 31)
names = c("edwin", "alice", "bob")
Elements are indexed starting at 1, and are accessed with [].
numbers[1] # 23
names[1] # edwin
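To make the same-type rule concrete, here is a small sketch (the variable names mixed and flags are just made up for illustration). When you mix types in c(), R quietly converts everything to a common type rather than raising an error.

```r
# Mixing numbers and strings: everything is coerced to character.
mixed = c(1, "two", 3)
class(mixed)      # "character"
mixed[1]          # "1" (a string, not the number 1)

# Mixing logicals and numbers: TRUE/FALSE become 1/0.
flags = c(TRUE, FALSE, 2)
class(flags)      # "numeric"
flags[1]          # 1

# Indexing starts at 1; a vector of indices selects several elements.
numbers = c(23, 13, 5, 7, 31)
numbers[1]        # 23
numbers[c(2, 4)]  # 13 7
```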
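Data frames are tables with named columns, where each column can hold a different type. They are created with data.frame().
books = data.frame(
  title = c("harry potter", "war and peace", "lord of the rings"),
  author = c("rowling", "tolstoy", "tolkien"),
  num_pages = c(350, 875, 500)
)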
You can access columns of a data frame with $.
books$title # c("harry potter", "war and peace", "lord of the rings")
books$author[1] # "rowling"
You can also create new columns with $.
books$num_bought_today = c(10, 5, 8)
books$num_bought_yesterday = c(18, 13, 20)
books$total_num_bought = books$num_bought_today + books$num_bought_yesterday
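Column arithmetic like this works element by element, one row at a time. Here is a minimal self-contained sketch of that (the sales data frame and its column names are invented for this example):

```r
sales = data.frame(
  title = c("harry potter", "war and peace"),
  today = c(10, 5),
  yesterday = c(18, 13)
)

# Each column is a vector, so + adds the two columns row by row.
sales$total = sales$today + sales$yesterday
sales$total     # 28 18

# names() lists the columns, including the one we just added.
names(sales)    # "title" "today" "yesterday" "total"
```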
Suppose you want to import a TSV file into R as a data frame.
tsv file without header
For example, consider the
data/students.tsv file (with columns describing each student’s age, test score, and name).
13	100	alice
14	95	bob
13	82	eve
We can import this file into R using read.table().
students = read.table(
  "data/students.tsv",
  header = F,
  sep = "\t",
  col.names = c("age", "score", "name")
)
- header = F means that the file does not contain a header (F is shorthand for FALSE in R).
- sep = "\t" means that the file is tab-delimited.
- col.names = c("age", "score", "name") tells R the column names.
We can now access the different columns in the data frame with students$age, students$score, and students$name.
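If you don't have the example file handy, the import can be tried end to end with a temporary file. This is only a sketch: the path comes from tempfile() rather than the data/ directory, and the three rows are the ones shown above.

```r
# Write the three example rows to a temporary tab-delimited file.
path = tempfile(fileext = ".tsv")
writeLines(c("13\t100\talice", "14\t95\tbob", "13\t82\teve"), path)

students = read.table(
  path,
  header = FALSE,
  sep = "\t",
  col.names = c("age", "score", "name")
)

students$age          # 13 14 13
mean(students$score)  # 92.33333
nrow(students)        # 3
```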
csv file with header
For an example of a file in a different format, look at the following file.
age,score,name
13,100,alice
14,95,bob
13,82,eve
Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with
students = read.table("data/students.tsv", header = T, sep = ",")
With header = T, we tell R that the first line of the file contains column names, so we can immediately access
students$age and so on. (Note: there is also a
read.csv function that uses
sep = "," by default.)
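As a quick sketch of the read.csv shortcut, here is the same comma-delimited data read from a temporary file (the path is generated by tempfile(), not taken from the example datasets):

```r
# Write the header line and the three rows to a temporary CSV file.
path = tempfile(fileext = ".csv")
writeLines(c("age,score,name", "13,100,alice", "14,95,bob", "13,82,eve"), path)

# read.csv assumes header = TRUE and sep = ",", so neither needs to be spelled out.
students = read.csv(path)

names(students)   # "age" "score" "name"
students$score    # 100 95 82
```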
There are many more options that
read.table can take. For a full list of these, just type
help(read.table) (or equivalently,
?read.table) at the prompt to access documentation.
This works for other functions as well.
With these R basics in place, let’s dive into the ggplot2 package.
One of R’s greatest strengths is its excellent set of packages. To install a package, you can use the install.packages() function.
To load a package into your current R session, use library().
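For ggplot2 specifically, the two steps might look like the sketch below. The requireNamespace() check is my own addition so that the package is only downloaded when it is missing; in an interactive session R may ask you to pick a CRAN mirror the first time.

```r
# Install ggplot2 from CRAN only if it is not already available.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load it into the current session so that qplot() can be called directly.
library(ggplot2)
```

Installing is a one-time step, while library(ggplot2) has to be run again in every new R session.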
Scatterplots with qplot()
Let’s look at how to create a scatterplot in ggplot2. We’ll use the
iris data frame that’s automatically loaded into R.
What does the data frame contain? We can use the
head function to look at the first few rows.
head(iris) # by default, head displays the first 6 rows
head(iris, n = 10) # we can also explicitly set the number of rows to display

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa
         5.4         3.9          1.7         0.4  setosa
(The data frame actually contains three types of species: setosa, versicolor, and virginica.)
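Since head only shows setosa rows, a few other base R calls can help confirm what else is in the data frame. A small sketch using only the built-in iris data:

```r
nrow(iris)             # 150
levels(iris$Species)   # "setosa" "versicolor" "virginica"
table(iris$Species)    # 50 flowers of each species
tail(iris, n = 3)      # the last 3 rows, which are all virginica
```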
Let’s plot Sepal.Length against Petal.Length using ggplot2’s qplot() function.
qplot(Sepal.Length, Petal.Length, data = iris) # Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.
To see where each species is located in this graph, we can color each point by adding a
color = Species argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species) # dude!
Similarly, we can let the size of each point denote petal width, by adding a
size = Petal.Width argument.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width) # We see that Iris setosa flowers have the narrowest petals.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7)) # By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
Finally, let’s fix the axis labels and add a title to the plot.
qplot(Sepal.Length, Petal.Length, data = iris, color = Species, xlab = "Sepal Length", ylab = "Petal Length", main = "Sepal vs. Petal Length in Fisher's Iris data")
Other common geoms
In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().
# These two invocations are equivalent.
qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
qplot(Sepal.Length, Petal.Length, data = iris)
But we can also easily use other types of geoms to create more kinds of plots.
Barcharts: geom = "bar"
movies = data.frame(
  director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
  movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
  minutes = c(124, 163, 195, 600, 187)
)

# Plot the number of movies each director has.
qplot(director, data = movies, geom = "bar", ylab = "# movies")
# By default, the height of each bar is simply a count.
# But we can also supply a different weight.
# Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")
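To check what heights the two bar charts should show, the same numbers can be computed directly in base R. This is just a sanity-check sketch that rebuilds the director and minutes columns from the example; it is not part of how qplot works internally.

```r
movies = data.frame(
  director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
  minutes = c(124, 163, 195, 600, 187)
)

# Unweighted bar heights: the number of rows per director.
counts = table(movies$director)
counts    # jackson 2, spielberg 3

# Weighted bar heights: the sum of minutes per director.
totals = tapply(movies$minutes, movies$director, sum)
totals    # jackson 787, spielberg 482
```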
Line charts: geom = "line"
qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species) # Using a line geom doesn't really make sense here, but hey.
# `Orange` is another built-in data frame that describes the growth of orange trees.
qplot(age, circumference, data = Orange, geom = "line", colour = Tree, main = "How does orange tree circumference vary with age?")
# We can also plot both points and lines.
qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
And that’s all I’ll cover.
I skipped over a lot of aspects of R and ggplot2 in this intro.
- There are many geoms (and other functionalities) in ggplot2 that I didn’t cover, e.g., boxplots and histograms.
- I didn’t talk about ggplot2’s layering system, or the grammar of graphics it’s based on.
So I’ll end with some additional resources on R and ggplot2.
- I don’t use it myself, but RStudio is a popular IDE for R.
- The official ggplot2 documentation is great and has lots of examples. There’s also an excellent book.
- plyr is another fantastic R package that’s also by Hadley Wickham (the author of ggplot2).
- The official R introduction is okay, but definitely not great. I haven’t found any R tutorials I really like, but I’ve heard good things about The Art of R Programming.