Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Posted on April 10, 2014 by Eric Cai - The Chemical Statistician in R bloggers | 0 Comments

[This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only. I gladly investigated how to do this in R, and I want to share my method and an example of what the final result looks like with you.

In generating a fictitious data set for this example, I will also demonstrate how to use the melt() function from the “reshape2” package in R to stack a data set while keeping categorical labels for the individual observations. For now, here is a sneak peek at what we will produce at the end; the stripes are the harder pattern to produce.

Read the rest of this post to learn how to generate side-by-side box plots with patterns like the ones above!

Generating the Data

Due to confidentiality, I cannot use my co-worker’s data set on my public blog, so I generated a data set for my example of pollution in 3 cities involving 2 gases.

# set the working directory to your folder of choice
setwd("INSERT YOUR DIRECTORY PATH HERE")

# set seed for random number generation to allow you to replicate my data
set.seed(1)
# set the standard deviation for the data
sd = 3
# generate fictitious pollution data for New York and London
co2.ny = rnorm(120, 7, sd)
co2.london = rnorm(120, 9, sd)
co2.la = rnorm(120, 10.5, sd)
ch4.ny = c(rnorm(100, 12, sd), rep(NA, 20))
ch4.london = c(rnorm(100, 15, sd), rep(NA, 20))
ch4.la = c(rnorm(100, 18, sd), rep(NA, 20))

# set vectors for labelling the data with location and pollutant
location = c('New York', 'London', 'Los Angeles', 'New York', 'London', 'Los Angeles')
pollutant = c('CO2', 'CO2', 'CO2', 'CH4', 'CH4', 'CH4')

# combine the data
all.data = data.frame(rbind(co2.ny, co2.london, co2.la, ch4.ny, ch4.london, ch4.la))

# add locations and pollutants to the data
all.data$location = location
all.data$pollutant = pollutant

I used the set.seed() function to allow you to replicate my randomly generated numbers from various normal distributions. There are 2 important notes about my code above:

I added 20 values of “NA” for CH4 for all 3 cities. This was done to ensure that the resulting data frame has the same number of observations for all 6 of the vectors. (Try to combine the data into a data frame without doing this, and look at the resulting data frame – it’s not what we want.)
The data frame all.data is sorted in the same order as the vectors are listed in the rbind() function when all.data was created. As a good data manipulation practice, it is always a good idea to explore your data set; specifically, use the head() function to look at a newly generated data table or matrix and check that it is correct or conforms to the correct format.

Stacking the Data

I ultimately need a data table that has 3 columns:

the concentration of the pollutant (the value to be plotted)
the name of the pollutant
the city

Thus, I need to stack the data frame all.data while retaining the name of the pollutant and the city. The melt() function from the “reshape2” package can do this. (Install it first if you don’t have it yet.)

# open the reshape2 library
library(reshape2)
# stack the data while retaining the location and pollutant label by 
stacked.data = melt(all.data, id = c('location', 'pollutant'))
# remove the column that gives the column name of the concentration from all.data
stacked.data = stacked.data[, -3]

If you look at the data table stacked.data when it is first created by melt(), it has the extra, unnecessary column that I deleted in the last line of code.

Generating the Box Plots

Now that the data set has the needed format, let’s plot the box plots. Note that my code below

creates an object called boxplots.triple for the box plots that I will use later in the code
uses the formula $\text{value ~ location + pollutant}$ to separate the plots into the 2 groups of triplets
uses the “at = ” option to specify the locations of the box plots along the horizontal axis
suppresses the default horizontal axis with the option xaxt = ‘n’
adds my custom axis and title with the axis() and title() functions, respectively

png('triple box plots with patterns.png')
boxplots.triple = boxplot(value~location + pollutant, data = stacked.data, at = c(1, 1.8, 2.6, 6, 6.8, 7.6), xaxt='n', ylim = c(min(0, min(co2.ny, co2.london, co2.la)), max(ch4.ny, ch4.london, ch4.la, na.rm = T)), col = c('white', 'white', 'gray'))
axis(side=1, at=c(1.8, 6.8), labels=c('Methane (ppb)\nNumber of Collections = 100', 'Carbon Dioxide (ppb)\nNumber of Collections = 120'), line=0.5, lwd=0)
title('Comparing Pollution in London, Los Angeles, and New York')

Note that the resulting box plot from above gives the grey pattern to the right-most box plot (New York) for each pollutant. That was easy with the “col = ” option in boxplot(). I now have 2 patterns: white and grey. A natural third pattern would be stripes, and this is the (moderately) hard part.

The code below draws a rectangle with stripes over the middle (Los Angeles) box plots. Based on the horizontal locations that I set above and the resulting inter-quartile ranges of the 2 box plots for Los Angeles, I specified where these 2 striped rectangles should be plotted.

Let’s look for the interquartile ranges for Los Angeles. Recall that I created the object boxplots.triple from the boxplot() function. One of the outputs from this object is the collection of the 5-number summaries underlying the box plots; another output is the name of each 5-number summary as given by the city-pollutant label.

I now have all of the information needed to draw my 2 striped rectangles for Los Angeles. (Notice that the order of the cities have been changed from the original order in rbind() to alphabetical order in box plots’ statistics.)

The left coordinates for the 2 rectangles are the left coordinates of the middle box plots as given in the “at =” option in boxplot() above. In my example, they are 1.4 and 6.4.
The right coordinates for the 2 rectangles are the right coordinates of the middle box plots as given in the “at =” option in boxplot() above. In my example, they are 2.2 and 7.2.
The bottom coordinates for the 2 rectangles are the 1st quartiles of the middle box plots as shown in boxplots.statistics above. In my example, they are the 2nd and 5th columns in the 2nd row: 15.713810 and 9.228822.
The top coordinates for the 2 rectangles are the 3rd quartiles of the middle box plots as shown in boxplots.statistics above. In my example, they are the 2nd and 5th columns in the 4th row: 20.109108 and 12.926979.

I encourage you to read the documentation of rect() to learn more about how it works.

# draw the rectangles for Los Angeles, specifying the left, bottom, right, and top positions of the box plots, the density of the lines, and the angles of the lines
rect(c(1.4, 6.4), boxplots.triple$stats[2, c(2, 5)], c(2.2, 7.2), boxplots.triple$stats[4, c(2, 5)], density=12, angle=45)
# add city labels near the box plots
text(c(1, 1.8, 2.6, 6, 6.8, 7.6), c(6, 8, 1, 17.75, 20, 15.5), c('London', 'Los\nAngeles', 'New\nYork', 'London', 'Los\nAngeles', 'New\nYork'))
dev.off()

For visual ease, I added the labels of the cities directly below or above each box plot. I chose not to use a legend because

it’s tiring to look back and forth between the box plots and the legend to determine which pattern denotes which city. If there are only 2 cities, then it’s OK, but 3 or more makes it hard to read.
The legend() function does not work with both the “fill = ” option and the “density = ” option; it seems like it can only work with 1 of them. I searched extensively but unsuccessfully for a solution to this problem.

While this is a nice outcome and likely useful for many other situations with grouped box plots, this approach does seem cumbersome and inelegant. Do you have any suggestions for how else to plot side-by-side box plots with patterns? If so, please leave them in the comments below!