Stop the madness – no more pie charts

Posted on July 1, 2015 by Christopher Wardell in R bloggers | 0 Comments

[This article was first published on cpwardell.com » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There has been a trend in the last few years to put interesting-looking but non-informative figures in papers; the pie chart is the worst recurrent offender. I have no idea how they keep getting included, as they’re famously misleading and awful. I want my work to look as much like the cockpit of a mecha or Iron Man’s HUD as possible, but I know that it’s not going to be as clear or concise as it should be.

A recent example is this otherwise very interesting and good paper:

Subclonal diversification of primary breast cancer revealed by multiregion sequencing by Yates et al, 2015

Figure 2 is confusing at best and reproduced below. These pie-chart-like plots were made famous by Florence Nightingale 150 years ago and have been called “coxcomb plots”. Wikipedia claims that that’s a mistake and they are really called “polar area diagrams” or “courbes circulaires”.

I admit that they look pretty cool and are fine if you’re building some sort of hip infographic about consumer tastes or Facebook trends. However, I think that they’re inappropriate and misleading because:

They look unusual. The reader expends a lot of energy working out what they’re looking at instead of processing the relationships in the data
These data contain only one variable that changes, but the plots used can encode two variables (arc length and radius can vary, discussed below). The plot is therefore unnecessarily complex
In pie charts, the arc length is proportional to the value represented. In these plots, the arc length is identical for each slice of pie… but you might not know this and you may infer that there is less of a certain segment. This is appropriate for the type of data that Florence Nightingale was plotting (deaths in each month of the year), but not for arbitrary divisions
In these plots, the radius of each segment (i.e. how far it extends from the centre) is informative. You’re supposed to read off an almost-invisible grey scale of concentric rings, but it’s not easy. Also, the visual effect is non-linear because the area of a circle is πr^2, which means that a small increase in radius has a disproportionately large gain
It’s really hard to compare between plots; visual comparison is difficult and reading the scale to get at the numbers is even harder
Multiple data types are represented in a single plot. I’m not sure mixing somatic mutation variant allele frequencies and copy number log ratios is very effective

Some potential solutions

Criticizing without offering a solution isn’t very helpful, so I offer two solutions:

Use barplots or boxplots. Yes, they’re boring, but they’re also familiar and easy to read. I suppose you could even put a scale at either edge and mix data types without too much complaint (e.g. somatic VAFs and copy number logRs on the same plot).

A type of 3D barplot. I generally think 3D is a bad thing when displayed in a 2D plane, particularly if it’s non interactive. For example, a 3D shape on a 2D display is fine if you can rotate and zoom it freely. “Lego plots” have been popularized by the Broad Institute in their sequencing papers, usually to show the sequence context of mutations (see below; taken from Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity by Dulak et al, 2013). The irony of the pie charts isn’t lost on me.

They’re relatively appealing and a relatively concise way of showing what would otherwise be a barplot of 96 values; it would get tricky if the data weren’t carefully arranged so as not to obscure anything, though.

If 3D barplots are now acceptable, why don’t we make a 3D barplot that encodes two variables? The height would represent one variable and the depth another.

I’ve created a small data set and some R code to illustrate these points below:

Alternative plots

The data set represents 4 mutations (labeled A to D) in a set of 100 samples. Each sample has a variant allele frequency (VAF) for each mutation, between 0 (not present) and 1 (every sequencing read contains it).

# A is frequent (80/100 samples) and usually subclonal (VAF ~0.1)
# B is infrequent (20/100 samples) and usually clonal (VAF ~0.8)
# C is frequent (90/100 samples) and usually clonal (VAF ~0.9)
# D is infrequent (15/100 samples) and usually subclonal (VAF ~0.15)

Coxcomb / polar area plot. Arc length represents proportion of samples, radius (wedge length) represents median VAF. Can be difficult to interpret.

Boxplot. Good representation of the data, but hard to tell how many samples are represented. For example, B is far less common than C but appears to be very similar. Likewise for D and A.

Barplot. No representation of the spread of the data. The proportion of affected samples is encoded using colour (darker mutations affect more samples). Perhaps the colours could be applied to the boxplot for the best result?

3D barplot. It’s very basic (missing axes and labels), but shows the relationships between proportion and VAF more clearly than other plots. I added some transparency so no column totally obscures any others. It’s more convincing when you can rotate the object yourself (download the code and try it yourself), but I think even a static image is better than a Coxcomb/polar coordinate plot.

Code is below and available on GitHub here.

## Load packages
library(rgl)
library(ggplot2)

#### START OF FUNCTIONS

## Functions modified from the "demo(hist3d)" examples in the rgl package:
# library(rgl)
# demo(hist3d)
## Would it have killed the author to comment their code?

## Draws a single "column" or "stack".
## X and Y coordinates determine the area of the stack
## The Z coordinate determines the height of the stack
stackplot.3d<-function(x,y,z=1,alpha=1,topcol="#078E53",sidecol="#aaaaaa"){

## These lines allow the active rgl device to be updated with multiple changes
save <- par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Determine the coordinates of each surface of the stack and its edges
x1<-c(rep(c(x[1],x[2],x[2],x[1]),3),rep(x[1],4),rep(x[2],4))
z1<-c(rep(0,4),rep(c(0,0,z,z),4))
y1<-c(y[1],y[1],y[2],y[2],rep(y[1],4),rep(y[2],4),rep(c(y[1],y[2],y[2],y[1]),2))
x2<-c(rep(c(x[1],x[1],x[2],x[2]),2),rep(c(x[1],x[2],rep(x[1],3),rep(x[2],3)),2))
z2<-c(rep(c(0,z),4),rep(0,8),rep(z,8) )
y2<-c(rep(y[1],4),rep(y[2],4),rep(c(rep(y[1],3),rep(y[2],3),y[1],y[2]),2) )

## These lines create the sides of the stack and its top surface
rgl.quads(x1,z1,y1,col=rep(sidecol,each=4),alpha=alpha)
rgl.quads(c(x[1],x[2],x[2],x[1]),rep(z,4),c(y[1],y[1],y[2],y[2]),
col=rep(topcol,each=4),alpha=1)
## This line adds black edges to the stack
rgl.lines(x2,z2,y2,col="#000000")
}
# Example:
#stackplot.3d(c(-0.5,0.5),c(4.5,5.5),3,alpha=0.6)

## Calls stackplot.3d repeatedly to create a barplot
# x is a constant distance along x axis
# y is the depth of column
# z is the height of column
barz3d<-function(x,y,z,alpha=1,topcol="#078E53",sidecol="#aaaaaa",scaley=1,scalez=1){
## These lines allow the active rgl device to be updated with multiple changes
save <- par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Plot each of the columns
n=length(x)
breaks.x = seq(0,n-1)
for(i in 1:n){
stackplot.3d(c(breaks.x[i],breaks.x[i]+1),c(0,-y[i])*scaley,z[i]*scalez,alpha=alpha,topcol=topcol)
}
## Set the viewpoint
rgl.viewpoint(theta=30,phi=25)
}
# Example
#barz3d(x=LETTERS[1:4],y=c(0.8,0.2,0.9,0.15),z=c(0.11,0.75,0.89,0.16),alpha=0.4,scaley=2,scalez=2)

#### END OF FUNCTIONS

## Example data:
# 4 mutations in 100 samples
# VAF range is from 0 to 1
# A is frequent and usually subclonal
# B is infrequent and usually clonal
# C is frequent and usually clonal
# D is infrequent and usually subclonal

Avaf=rnorm(80,0.1,0.05)
Bvaf=rnorm(20,0.8,0.1)
Cvaf=rnorm(90,0.9,0.05)
Dvaf=rnorm(15,0.15,0.05)

## Summarize data in new object
vafsum=data.frame(median=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),median),
proportion=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),function(x){length(x)/100}))
rownames(vafsum)=c(LETTERS[1:4])

## Code to produce coxcomb/polar coordinate plot adapted from:
## http://robinlovelace.net/r/2013/12/27/coxcomb-plots-spiecharts-R.html
## https://github.com/Robinlovelace/lilacPlot
pos = 0.5 * (cumsum(vafsum$proportion) + cumsum(c(0, vafsum$proportion[-length(vafsum$proportion)])))
p = ggplot(vafsum, aes(x=pos)) + geom_bar(aes(y=median), width=vafsum$proportion, color = "black", stat = "identity") + scale_x_continuous(labels = rownames(vafsum), breaks = pos) # Linear version is ok
p + coord_polar(theta = "x")
# (ignore warnings thrown)

## A traditional boxplot
boxplot(Avaf,Bvaf,Cvaf,Dvaf,names=LETTERS[1:4])

## A barplot where height represents median VAF and the color of the bar represents
## how many samples contain each mutation
barplot(vafsum$median,names=LETTERS[1:4],col=rgb(0.1,0.1,0.1,vafsum$proportion))

## Our new 3D barplot function
barz3d(x=LETTERS[1:4],y=vafsum$proportion,z=vafsum$median,alpha=0.4,scaley=2,scalez=2)
rgl.snapshot("3dbarplot.png", fmt = "png", top = TRUE )

To leave a comment for the author, please follow the link and comment on their blog: cpwardell.com » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Stop the madness – no more pie charts

Some potential solutions

Alternative plots

Related

Some potential solutions

Alternative plots

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)