# Summarising data using box and whisker plots

This article was first published on **Software for Exploratory Data Analysis and Statistical Modelling** and kindly contributed to R-bloggers.


A box and whisker plot is a graphical display that summarises a set of data using its five-number summary: the median, the lower and upper quartiles (the 25th and 75th percentiles) and the minimum and maximum values.
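As a rough illustration, the five-number summary can be computed directly in R with **fivenum**. The vector below simply reuses the five **Max.Temp** values visible in the extract of **soton.df** (January to May 1950), which is far too few for a meaningful summary, and the name `max.temp` is ours. Note that **fivenum** returns Tukey's hinges, which can differ slightly from the quartiles given by **quantile** for other sample sizes.

```r
# Five Max.Temp values taken from the extract of soton.df (Jan-May 1950).
# Illustration only: the real analysis would use the whole column.
max.temp <- c(7.7, 10.3, 13.0, 13.6, 17.9)

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(max.temp)
names(five) <- c("min", "lower", "median", "upper", "max")
print(five)

# Quartiles from quantile() for comparison (default type 7 definition).
print(quantile(max.temp, probs = c(0.25, 0.5, 0.75)))
```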

The box and whisker plot is an effective way to investigate the distribution of a set of data. For example, skewness shows up as a median that sits off-centre in the box or as whiskers of unequal length, and the display makes no assumptions about the underlying distribution of the data. Extreme values at either end of the scale are sometimes drawn as individual points to show how far they extend beyond the majority of the data.
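With R's default settings the whiskers do not always reach the minimum and maximum: they stop at the most extreme observation within 1.5 times the box length, and anything beyond is treated as an extreme value. A minimal sketch using **boxplot.stats** on a made-up vector (the values in `x` are hypothetical, chosen so that one point is clearly extreme):

```r
# Hypothetical data with one value far from the rest.
x <- c(2, 3, 4, 5, 6, 7, 30)

# boxplot.stats() returns the numbers that a box and whisker plot is drawn from.
s <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(s$stats)

# Observations lying beyond the whiskers, drawn as individual points.
print(s$out)
```

Here the upper whisker ends at 7 rather than at the maximum, and 30 is reported separately in `s$out`.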

To illustrate creating box and whisker plots we consider publicly available UK Met Office data collected monthly at Southampton, UK between 1950 and 1999. We compare the range of temperatures recorded in each month of the year over this period by creating box and whisker plots with three graphics systems: base graphics, lattice and ggplot2.

The data is assumed to have been imported into **R** and stored in a data frame called **soton.df**. An extract of the data is shown here:

      Year Month Max.Temp Min.Temp Frost  Rain
    1 1950   Jan      7.7      2.8     7  20.1
    2 1950   Feb     10.3        4     4 127.0
    3 1950   Mar     13.0      4.5     2  39.4
    4 1950   Apr     13.6      4.7     0  62.0
    5 1950   May     17.9      7.8     0  32.2
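One detail worth checking before plotting is how **Month** is stored. If it is a character column, or a factor with default levels, R orders the groups alphabetically rather than by calendar month. The sketch below assumes this might be the case and rebuilds a small data frame, which we call `extract.df`, from the first five rows of the extract; the same **factor** call could be applied to **soton.df**.

```r
# Rebuild the first rows of the extract (only the columns needed here).
extract.df <- data.frame(
  Year     = rep(1950, 5),
  Month    = c("Jan", "Feb", "Mar", "Apr", "May"),
  Max.Temp = c(7.7, 10.3, 13.0, 13.6, 17.9),
  stringsAsFactors = FALSE
)

# Default factor levels are sorted alphabetically.
print(levels(factor(extract.df$Month)))

# Use the built-in constant month.abb to force calendar order.
extract.df$Month <- factor(extract.df$Month, levels = month.abb)
print(levels(extract.df$Month))
```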

**Base Graphics**


The **base** graphics approach uses the **boxplot** function to create box and whisker plots. Rather than passing two separate vectors of data, we can give the function a formula together with a **data** argument that names the data frame holding the variables. For the temperature data we use this code:

    boxplot(Max.Temp ~ Month, data = soton.df,
            xlab = "Month", ylab = "Maximum Temperature",
            main = "Temperature at Southampton Weather Station (1950-1999)")

The horizontal and vertical axis labels are set with the **xlab** and **ylab** arguments respectively, and the title of the plot with the **main** argument. The box and whisker plot is shown here:

The function **boxplot** makes it easy to create a reasonably attractive box and whisker plot. The variation in the distribution of temperatures across the year can be seen from the graph.
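The numbers behind the graph can also be inspected, since **boxplot** returns them as a list and accepts `plot = FALSE` to skip the drawing. Because **soton.df** is not reproduced here, this sketch uses R's built-in `airquality` data (daily temperatures in New York, May to September 1973) as a stand-in; the column names `Temp` and `Month` belong to that dataset, not to the Southampton data.

```r
# airquality ships with R; Month is numeric (5 to 9) in this dataset.
bp <- boxplot(Temp ~ Month, data = airquality, plot = FALSE)

# One column per month: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
stats <- bp$stats
colnames(stats) <- bp$names
print(stats)

# Number of observations in each group.
print(bp$n)
```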

**Lattice Graphics**


The **lattice** graphics package provides the function **bwplot** to create box and whisker plots. The function call also uses a formula to specify the **x** and **y** variables, and the arguments are the same as those passed to **boxplot** in **base** graphics:

    library(lattice)

    bwplot(Max.Temp ~ Month, data = soton.df,
           xlab = "Month", ylab = "Maximum Temperature",
           main = "Temperature at Southampton Weather Station (1950-1999)")

The variable **Month** is categorical, so a separate box and whisker summary is created for each month. The **lattice** version of the graph is shown here:

This is very similar to the box and whisker plot created by **base** graphics with a similar level of effort required. The main difference is the use of a circle rather than a line to identify the location of the median of the data.
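If a line is preferred for the median, **bwplot** treats `pch = "|"` as a special case and draws a line instead of the dot. A minimal sketch, again using the built-in `airquality` data as a stand-in for **soton.df** (so `Temp` and `Month` are that dataset's columns):

```r
library(lattice)

# pch = "|" replaces the median dot with a line, as in base graphics.
p <- bwplot(Temp ~ factor(Month), data = airquality,
            pch = "|",
            xlab = "Month", ylab = "Temperature")

# Lattice functions return an object; print() draws it.
print(p)
```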

**ggplot2**


The **ggplot2** package has a general function **ggplot** that is the starting point for graphs of any type. We use the boxplot geom to create a box and whisker plot, following the standard approach. The first step is to specify the data frame to plot and then map its columns, via the **aes** function, to the axes or other aesthetics (such as colour or symbol shape). The geom then specifies the type of plot that we want to create. The final step is to add the axis labels and an overall title to the graph.

    library(ggplot2)

    ggplot(soton.df, aes(Month, Max.Temp)) +
      geom_boxplot() +
      ylab("Maximum Temperature") +
      ggtitle("Temperature at Southampton Weather Station (1950-1999)")

The **ggplot2** version of box and whisker plots is shown here:

The distinctive gray background used by **ggplot2** is an obvious visual difference compared to the default clear background used in the other two approaches. The boxes themselves have a cleaner look in this graph than the other two methods and the overall look is slick.
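If the gray background is not wanted, one option is to add a different theme such as **theme_bw**. The sketch below uses the built-in `airquality` data as a stand-in for **soton.df**, with **labs** setting both axis labels and the title in one call; the label text is ours.

```r
library(ggplot2)

# factor(Month) turns the numeric month into a grouping variable.
p <- ggplot(airquality, aes(factor(Month), Temp)) +
  geom_boxplot() +
  labs(x = "Month", y = "Temperature",
       title = "Daily temperature by month (airquality)") +
  theme_bw()  # white panel background instead of the default gray

print(p)
```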

This blog post is summarised in a pdf leaflet on the Supplementary Material page.
