Basic Introduction to ggplot2

[This article was first published on W. Andrew Barr's Paleoecology Blog, and kindly contributed to R-bloggers].

This is a very basic introduction to the ggplot2 package. A much more detailed description of the package can be found in the book ggplot2: Elegant Graphics for Data Analysis.

On his website, package author Hadley Wickham describes ggplot2 as
"a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics."
There are two major functions that you will use in ggplot2:
  • qplot() – for quick plots 
  • ggplot() – for fine, granular control of everything (not going to get into this in this post)
Let's start with qplot() to see how easy and pretty things can be with ggplot2.

Importing Data and Loading Package
Let's play around with a published dataset of measurements taken on antelope leg bones. The data come from this paper:
DeGusta D, and Vrba E. 2003. A method for inferring paleohabitats from the functional morphology of bovid astragali. Journal of Archaeological Science 30:1009–1022
For simplicity’s sake, I took out a bunch of columns that we don’t need. If you are actually interested in the data themselves, you should examine the paper itself to find all the data that I took out.
#load the package
library(ggplot2)

We are going to use a nice trick here and read the text file directly from a remote URL.

#read in the tab delimited text file using the url() function
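The original read-in call and its URL did not survive in this copy of the post, so what follows is only a sketch of the pattern. It assumes read.table() was the reader and uses a hypothetical address: to stay runnable offline, it writes a tiny made-up file to a temporary location and reads it back through url() with a file:// address. The names toy, tmp and dataURL are illustrative, not from the original.

```r
# Stand-in for the remote file: a tiny tab-delimited file with made-up values
toy <- data.frame(
  Tribe = c("Aepycerotini", "Aepycerotini", "Alcelaphini", "Alcelaphini"),
  Hab   = c("L", "L", "O", "O"),
  BM    = c(56.2, 56.2, 150, 150),
  var1  = c(36.5, 40.9, 49, 50)
)
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical address; swap in the real http(s) URL of the data file
dataURL <- paste0("file://", tmp)

# read.table() accepts a connection, so url() lets it read straight from a URL.
# stringsAsFactors = TRUE gives factor columns on R 4.0 and later, where
# character columns are no longer converted by default.
myData <- read.table(url(dataURL), header = TRUE, sep = "\t",
                     stringsAsFactors = TRUE)
```

With a real web address in dataURL, the read.table() line stays exactly the same.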

Now we can look at the structure of the dataframe we just imported.
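The summary printed here has the shape of str() output, so the call was presumably str(myData). A minimal sketch on a small made-up stand-in for myData (same column names, invented values); the printout for the real data is the one shown in the post.

```r
# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# Compact overview: one line per column with its type and first few values
str(myData)
```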

'data.frame':    218 obs. of  4 variables:
 $ Tribe : Factor w/ 8 levels "Aepycerotini",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Hab   : Factor w/ 4 levels "F","H","L","O": 3 3 3 3 3 3 3 3 3 3 ...
 $ BM    : num  56.2 56.2 56.2 56.2 56.2 ...
 $ var1  : num  36.5 40.9 37 36.2 36.6 37.7 37.3 39 37.7 35.3 ...

You can see that there are 218 rows, each representing an individual antelope. The dataset has 2 factors (categorical variables): one recording the taxonomic tribe of the specimen, and one recording what type of habitat the individual occupies. Then there are 2 continuous variables: BM is average species body mass and var1 is a measurement of the leg bone.

A simple histogram in ggplot2
For a histogram, all we need to tell qplot() is which dataframe to look in and which variable is on the x axis. I also added a plot title with the main= argument. Pretty easy!
qplot(data=myData,x=BM,main="Histogram of BodyMass")

A basic scatterplot
Let's say I want to plot a variable against body mass, color coded by taxonomic tribe. qplot() works just like regular plot(), only much smarter. For instance, you tell qplot() to do the color coding with a single argument, color=Tribe. Note also that I indicate I want to log both variables with the log="xy" argument. Note further: the legend is handled AUTOMATICALLY BY DEFAULT! If you have done a lot of graphing in R previously, then your mouth is right now hanging open in astonishment. And the default graph is beautiful!
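The scatterplot call itself is missing from this copy, but the faceting examples later in the post reuse it with one extra argument, so it was most likely the call sketched here. The data frame is a small made-up stand-in for myData, not the published measurements. One caveat: newer ggplot2 releases (3.4.0 and later) flag qplot() as deprecated, so expect a warning, though the call still runs.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# Scatterplot of var1 against body mass, both axes on log scales,
# points colored by tribe; the legend is generated automatically
p <- qplot(data = myData, x = BM, y = var1, log = "xy", color = Tribe)
print(p)
```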

NOTE: The appearance of these log-scales reflects the new default in the new version of ggplot2, which was just published on CRAN 2012-03-01. I personally prefer the old log-scale default, where tick bars were evenly spaced, but I haven’t figured out how to change this behavior yet.

UPDATE: to get a plot like the former default plot, simply transform the variables yourself in the call, and take out the log="xy" part, like the following:
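The updated call is also missing here. A sketch of what transforming the variables inside the call can look like, again on made-up stand-in data. Using the natural log via log() is an assumption; log10() would work the same way if base-10 values are preferred.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# Log-transform inside the call instead of passing log = "xy";
# the axes then show evenly spaced ticks in log units
pLog <- qplot(data = myData, x = log(BM), y = log(var1), color = Tribe)
print(pLog)
```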


Boxplots – changing the geom
OK, now let's say that I want to see how the raw values of that same variable are distributed over different habitat types. I could just do this: (graph not shown)
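The call is not preserved in this copy. Presumably it simply puts the habitat factor on the x axis, as in this sketch on made-up stand-in data:

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# With x and y both supplied and no geom given, qplot() draws points,
# one column of points per habitat category
pHab <- qplot(data = myData, x = Hab, y = var1)
print(pHab)
```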

It works, but we probably don't want to represent the data as points (the default), but rather as a boxplot. We tell qplot() this by setting the geom= argument. Geom means "geometric object": it tells qplot() how to represent the data, and there are many options. To get a boxplot, we just tell it geom="boxplot".
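A sketch of the boxplot version, assuming the same x and y as before and only the geom argument added. The data frame is a made-up stand-in for myData.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# geom = "boxplot" swaps the default points for one box per habitat
pBox <- qplot(data = myData, x = Hab, y = var1, geom = "boxplot")
print(pBox)
```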

Also, you could try the "jitter" geom, which I think sometimes works better than boxplots. (Go ahead... try it!)
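For anyone taking up that invitation, a sketch of the jitter variant on made-up stand-in data. Jittering nudges each point sideways by a small random amount so overlapping observations become visible, which means the exact picture changes from run to run.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# geom = "jitter" draws every observation, randomly offset within its category
pJit <- qplot(data = myData, x = Hab, y = var1, geom = "jitter")
print(pJit)
```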

Doing the same thing to different data subsets – Faceting
It is extremely common to want to do the same type of plot for different subsets of your data (i.e. based on the values of some factor, aka a categorical variable). qplot() makes this amazingly easy by providing a facets= argument.

Going back to our first scatterplot, we can plot the relationship between our variable and BM for each taxonomic tribe, with each tribe separated into the various habitat categories. We simply add one argument to do this: facets = Hab~Tribe. Note the use of the "~" character. This is because you are passing a formula to the facets argument. It will do the same plot for every combination of levels of the factors you give it.
qplot(data=myData,x=BM,y=var1,log="xy",color=Tribe,facets = Hab~Tribe)

This isn't actually very useful in this case because many tribe and habitat combinations are empty. I probably just want to do it for each tribe, like this. Note you still need the "~".
qplot(data=myData,x=BM,y=var1,log="xy",color=Tribe,facets = ~Tribe)

Trend lines – adding statistical transformation layers.
We probably want to add a trend line to each plot as well.  In ggplot2, you think of a plot as made up of different layers.  A trendline is a statistical transformation layer that is overlaid on the graph.  The easiest way to do this is to use qplot() to recreate the exact plot we just made, only this time instead of plotting it, we save it in a variable to do further stuff to it. 
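The assignment is missing from this copy. Since the later code refers to the object as myGG, it presumably looked like the sketch here, shown on made-up stand-in data.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# Same faceted scatterplot as before, stored instead of drawn;
# nothing appears on screen until the object is printed
myGG <- qplot(data = myData, x = BM, y = var1, log = "xy",
              color = Tribe, facets = ~Tribe)
```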

Then we can ADD A LAYER to this object: a smoothing statistic with method="lm", which fits an ordinary least squares trend line. We could use other smoothing functions, like LOESS, among others. The function stat_smooth() creates a statistical transformation layer that can be added to our existing plot. To add it we... literally... just add it.

myGG <- myGG + stat_smooth(method="lm")

Now, to plot a ggplot2 object all you have to do is type its name. By default, it plots itself to the graphics device. It is kind of hard to see here, but by default all ggplot trend lines show confidence intervals as gray areas surrounding the line….nice!
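Putting the steps together, here is a sketch that runs from start to finish on made-up stand-in data. At the console, typing myGG alone draws the plot; inside a script or a function, print(myGG) is the explicit equivalent. The se = FALSE variant at the end is an addition not covered in the post, for when the gray band is unwanted.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# Build the faceted scatterplot, then add the least squares trend line layer
myGG <- qplot(data = myData, x = BM, y = var1, log = "xy",
              color = Tribe, facets = ~Tribe)
myGG <- myGG + stat_smooth(method = "lm")

# Draw it; the gray confidence band around each line is on by default
print(myGG)

# Variant without the band
myGGnoBand <- qplot(data = myData, x = BM, y = var1, log = "xy",
                    color = Tribe, facets = ~Tribe) +
  stat_smooth(method = "lm", se = FALSE)
```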

As if that wasn’t enough – saving graphs to file is a snap with ggsave()
ggplot2 comes with a great function called ggsave() that takes all the headache out of exporting graphics from R. The only required parameter is a filename, like so.
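The example call is missing from this copy. A minimal sketch with a hypothetical file name, written to a temporary folder so that nothing lands in the working directory; the data are a made-up stand-in for myData.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

myGG <- qplot(data = myData, x = BM, y = var1, log = "xy", color = Tribe)
print(myGG)

# Hypothetical output file; the .pdf extension selects the PDF format
outFile <- file.path(tempdir(), "myGraph.pdf")

# With only a file name, ggsave() writes the most recent plot
ggsave(outFile)
```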

BY DEFAULT, it detects the desired format based on the file extension you give it and handles everything silently and efficiently. Again... if you have done a lot of graphing in R previously, then your mouth is right now hanging open in astonishment. The default behavior is to save the last ggplot object that you plotted to the graphics device, but of course you can pass it the name of any saved ggplot object. You can also change things like output dimensions and DPI. Just have a look at ?ggsave.
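A sketch of the more explicit form, using the plot, width, height, units and dpi arguments of ggsave(). The file name and the sizes are arbitrary choices for illustration, and the data are a made-up stand-in for myData.

```r
library(ggplot2)

# Hypothetical stand-in for myData: made-up values, same column names
myData <- data.frame(
  Tribe = factor(rep(c("Aepycerotini", "Alcelaphini"), each = 6)),
  Hab   = factor(rep(c("L", "O"), each = 6)),
  BM    = rep(c(40, 56.2, 80, 120, 150, 200), times = 2),
  var1  = c(33, 36.5, 40.9, 46, 49, 54, 34, 37, 41, 47, 50, 55)
)

# A stored plot that is never printed to the screen
myGG <- qplot(data = myData, x = BM, y = var1, log = "xy", color = Tribe)

# Hypothetical output file in a temporary folder
outFile <- file.path(tempdir(), "myGraphWide.pdf")

# plot = names the object to save; width and height set the output size.
# dpi matters for raster formats such as PNG and has no visible effect on a PDF.
ggsave(outFile, plot = myGG, width = 8, height = 4, units = "in", dpi = 300)
```

Changing the extension in outFile (for example to .png) is all it takes to switch formats.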
