Summarising data using box and whisker plots

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers.]

A box and whisker plot is a graphical display that summarises a set of data using its five-number summary: the median, the lower and upper quartiles (the 25th and 75th percentiles), and the minimum and maximum values.
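As a quick illustration, the five-number summary can be computed directly in R. The sketch below uses a small made-up vector of temperatures (an assumption for illustration, not the Southampton data) with the base functions fivenum and quantile. Note that fivenum returns Tukey's hinges, which can differ slightly from the quartiles that quantile returns by default.

```r
# Hypothetical temperatures, for illustration only (not the Met Office data)
temps <- c(7.7, 10.3, 13.0, 13.6, 17.9, 20.1, 21.4, 19.8, 17.2, 13.5, 10.1, 8.2)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(temps)
names(five) <- c("min", "lower", "median", "upper", "max")
print(five)

# Quartiles via quantile(); the default method may differ slightly from the hinges
quart <- quantile(temps, probs = c(0.25, 0.5, 0.75))
print(quart)
```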

The box and whisker plot is an effective way to investigate the distribution of a set of data. Because the display makes no assumptions about the underlying distribution, features such as skewness show up directly, for example as a median that sits off-centre within the box or as whiskers of unequal length. Extreme values at either end of the scale are sometimes drawn as individual points to show how far they extend beyond the majority of the data.
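To see how extreme values are separated from the whiskers, the base function boxplot.stats returns the numbers that boxplot would draw. The sketch below uses a small made-up sample (an assumption for illustration). With the default coef = 1.5 the whiskers stop at the most extreme points within 1.5 times the box length of the hinges, and anything further out is reported separately; with coef = 0 the whiskers run to the minimum and maximum.

```r
# Hypothetical right-skewed sample with one extreme value (illustration only)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 25)

# boxplot.stats() returns the values that boxplot() would draw
s <- boxplot.stats(x)  # default coef = 1.5

print(s$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$out)    # points beyond the whiskers, which are drawn individually

# coef = 0 extends the whiskers to the data extremes and flags no outliers
s_full <- boxplot.stats(x, coef = 0)
print(s_full$stats)
```

In this sample the median lies closer to the lower hinge than to the upper hinge, which is the kind of asymmetry that suggests right skew.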

To illustrate creating box and whisker plots we consider monthly meteorological data collected at Southampton, UK between 1950 and 1999, which is publicly available from the UK Met Office. We will compare the range of temperatures recorded in each month of the year over this period by creating box and whisker plots with base graphics, lattice and ggplot2.

The data is assumed to have been imported into R and stored in a data frame called soton.df. An extract of the data is shown here:

    Year Month Max.Temp Min.Temp Frost  Rain
1   1950   Jan      7.7      2.8     7  20.1
2   1950   Feb     10.3        4     4 127.0
3   1950   Mar     13.0      4.5     2  39.4
4   1950   Apr     13.6      4.7     0  62.0
5   1950   May     17.9      7.8     0  32.2
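One detail worth checking after import is how Month is stored, since the plotting functions arrange the boxes by factor level and a plain character column would be sorted alphabetically (Apr, Aug, Dec, ...). The following is a minimal sketch using a simulated stand-in called soton.sim (a hypothetical name with random values, not the Met Office records) and the built-in month.abb vector to put the levels in calendar order.

```r
# Simulated stand-in for soton.df (values are random, not the Met Office records)
set.seed(1)
soton.sim <- data.frame(
  Year     = rep(1950:1951, each = 12),
  Month    = rep(month.abb, times = 2),
  Max.Temp = round(rnorm(24, mean = 14, sd = 5), 1)
)

# Store Month as a factor with levels in calendar order, so that
# plotting functions arrange the boxes Jan, Feb, ..., Dec
soton.sim$Month <- factor(soton.sim$Month, levels = month.abb)

print(levels(soton.sim$Month))
str(soton.sim)
```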

Base Graphics

The base graphics approach uses the boxplot function to create box and whisker plots. Rather than passing separate vectors of data, we can supply a formula of the form response ~ group together with a data argument naming the data frame that holds the variables. For the temperature data we use this code:

boxplot(Max.Temp ~ Month, data = soton.df,
  xlab = "Month", ylab = "Maximum Temperature",
  main = "Temperature at Southampton Weather Station (1950-1999)"
)

The horizontal and vertical axis labels are specified using the xlab and ylab arguments respectively, and the title of the plot is set using the main argument. The box and whisker plot is shown here:

Base Graphics Box and Whisker Plot

The function boxplot makes it easy to create a reasonably attractive box and whisker plot. The variation in the distribution of temperatures across the year can be seen from the graph.
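The formula interface described above is a convenience over passing the groups directly. As a rough sketch, assuming a simulated data frame sim.df (a hypothetical stand-in with random values, not the Southampton data), the two calls below should produce the same statistics. The plot = FALSE argument makes boxplot return the numbers it would draw instead of plotting them.

```r
# Simulated stand-in for soton.df (random values, for illustration only)
set.seed(42)
sim.df <- data.frame(
  Month    = factor(rep(month.abb, times = 10), levels = month.abb),
  Max.Temp = rnorm(120, mean = 14, sd = 5)
)

# Formula interface, with plot = FALSE to return the statistics instead of drawing
b_formula <- boxplot(Max.Temp ~ Month, data = sim.df, plot = FALSE)

# Equivalent call without a formula: a list of vectors, one per month
b_list <- boxplot(split(sim.df$Max.Temp, sim.df$Month), plot = FALSE)

# Each column of $stats holds the five values drawn for one month
colnames(b_formula$stats) <- b_formula$names
print(round(b_formula$stats, 1))
```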

Lattice Graphics

The lattice graphics package provides the function bwplot to create box and whisker plots. It also takes a formula to specify the x and y variables for the graph, and the arguments used here are the same as those passed to boxplot in base graphics:

library(lattice)

bwplot(Max.Temp ~ Month, data = soton.df,
  xlab = "Month", ylab = "Maximum Temperature",
  main = "Temperature at Southampton Weather Station (1950-1999)"
)

The variable Month is categorical, so a separate box and whisker summary is created for each month. The lattice version of the graph is shown here:

Lattice Graphics Box and Whisker Plot

This is very similar to the box and whisker plot created by base graphics, with a similar level of effort required. The main difference is that lattice by default marks the median with a dot rather than a line.
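If a line is preferred for the median, the lattice panel function accepts the special value pch = "|". The sketch below again uses a simulated data frame sim.df (a hypothetical stand-in with random values) rather than the real data.

```r
library(lattice)

# Simulated stand-in for soton.df (random values, for illustration only)
set.seed(42)
sim.df <- data.frame(
  Month    = factor(rep(month.abb, times = 10), levels = month.abb),
  Max.Temp = rnorm(120, mean = 14, sd = 5)
)

# pch = "|" asks the box and whisker panel to mark the median with a line
p <- bwplot(Max.Temp ~ Month, data = sim.df, pch = "|",
  xlab = "Month", ylab = "Maximum Temperature"
)

print(p)  # lattice plots are drawn when the trellis object is printed
```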

ggplot2

In the ggplot2 package there is a general function ggplot that is used to create graphs of any type. We make use of the boxplot geom to create a box and whisker plot following the standard approach. The first step is to specify the data frame to use and then map its columns, via the aes function, to the axes or other aesthetics (such as colour or symbol shape). The geom then specifies the type of plot that we want to create. Our final step is to add the axis labels and an overall title to the graph.

library(ggplot2)

ggplot(soton.df, aes(Month, Max.Temp)) + geom_boxplot() +
  ylab("Maximum Temperature") +
  ggtitle("Temperature at Southampton Weather Station (1950-1999)")

The ggplot2 version of box and whisker plots is shown here:

ggplot2 Graphics Box and Whisker Plot

The distinctive gray background used by ggplot2 is an obvious visual difference compared to the default clear background used in the other two approaches. The boxes themselves have a cleaner look in this graph than the other two methods and the overall look is slick.
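The gray background is only the default theme. As a minimal sketch, assuming the ggplot2 package is installed and using a simulated data frame sim.df (a hypothetical stand-in with random values), adding theme_bw gives a white panel that sits closer to the look of the other two approaches.

```r
library(ggplot2)

# Simulated stand-in for soton.df (random values, for illustration only)
set.seed(42)
sim.df <- data.frame(
  Month    = factor(rep(month.abb, times = 10), levels = month.abb),
  Max.Temp = rnorm(120, mean = 14, sd = 5)
)

# theme_bw() swaps the default gray panel for a white one
p <- ggplot(sim.df, aes(Month, Max.Temp)) +
  geom_boxplot() +
  ylab("Maximum Temperature") +
  theme_bw()

print(p)  # ggplot objects are drawn when printed
```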

This blog post is summarised in a pdf leaflet on the Supplementary Material page.
