Beeswarm Boxplot (and plotting it with R)

March 10, 2011
(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

(The image above is called a “Beeswarm Boxplot”; the code for producing it is provided at the end of this post.)

The above plot is implemented under different names in different software packages. This “Scatter Dot Beeswarm Box Violin plot” (for lack of an agreed-upon term) is a one-dimensional scatter plot that resembles a “stripchart”, but with closely packed, non-overlapping points; the spread of the points at each value reflects how frequent that value is, much as the outline of a violin plot does. The plot can be superimposed with a boxplot to give a very rich description of the underlying distribution.
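To see the overlap problem that this kind of plot addresses, here is a minimal sketch using only base R's stripchart. The simulated data and the rounding (which forces ties) are assumptions made for the illustration, not part of any of the implementations discussed in this post.

```r
set.seed(42)
x <- round(rnorm(150), 1)   # rounding creates ties, so points overlap

# the three placement methods built into stripchart, side by side
op <- par(mfrow = c(1, 3))
for (m in c("overplot", "jitter", "stack")) {
  stripchart(x, method = m, vertical = TRUE, pch = 16, main = m)
}
par(op)
```

With "overplot" the ties collapse onto one another, "jitter" separates them at random, and "stack" lines them up side by side, which is closest in spirit to the plots discussed here.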

This plot has been implemented in various statistical packages. In this post I will list the few I have come across so far; if you know of an implementation I’ve missed, please tell me about it in the comments.

Implementations in commercial statistical packages

GraphPad implements this graph under the name “column scatter plot” (with a line drawn at the mean), made from the “Frequency distribution” sample data. So does OriginLab.
(My thanks go to nico for finding these examples.)

I imagine there is also something similar in the “big” packages (SAS, JMP, SPSS etc…), but I have not yet found an example.

Implementations in Free Open-Source statistical packages

I’ve noticed that GGobi has a “texture” 1D plot, which is a very similar implementation of this plot. But the main focus of this post will (expectedly) be R.

In his blog “SAS and R”, Ken Kleinman wrote about creating a dot-box-plot about half a year ago.
He wrapped his code in a helper script, so it can be run using the following commands:

    source("http://www.math.smith.edu/sasr/examples/wild-helper.R") # getting the boxpoints3 function
    ds = read.csv("http://www.math.smith.edu/r/data/help.csv") # getting some data
    female = subset(ds, female==1)
    with(female,boxpoints3(pcs, homeless, "PCS", "Homeless")) # plotting...

With the following pleasing output:

In a follow-up post, Ken shared some suggestions he received from his readers on how to improve the plot (through other functions, as well as ggplot2 implementations).

On the R-help mailing list, a question was recently asked on this topic (which is what led me to write this post), asking for:

A band of dots on the plot are the data point. The density of dots and the “fatness” of the band present the frequency of a particular value in Y-axis. This property is similar to the violin plot: showing the probability density of the data at different values. Instead of showing a shape in violin plot, this plot shows the actual distribution of the data points.
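To make the request concrete, here is a minimal sketch in base R of the idea it describes. It is not taken from the mailing-list thread: the helper band_offsets, its arguments, and the choice of 15 equal-width bins are all assumptions made for the illustration. Points that fall in the same bin are spread symmetrically around zero, so a bin holding more points produces a wider band.

```r
# Hypothetical helper (not from the thread): spread the points that share
# a bin symmetrically around zero, so fuller bins give a "fatter" band.
band_offsets <- function(y, nbins = 15, step = 1) {
  bins <- cut(y, breaks = nbins)   # equal-width bins covering range(y)
  # position of each point inside its bin: 1, 2, ..., n_bin
  rank_in_bin <- ave(seq_along(y), bins, FUN = seq_along)
  # number of points sharing the bin
  n_in_bin <- ave(seq_along(y), bins, FUN = length)
  # centre the positions so that every bin is symmetric around 0
  (rank_in_bin - (n_in_bin + 1) / 2) * step
}

set.seed(1)
y <- rnorm(200)
x_off <- band_offsets(y)

# the width of the band at each height follows the bin counts
plot(x_off, y, pch = 16, cex = 0.6, xaxt = "n",
     xlab = "", ylab = "value")
```

This is deliberately crude: it does not check whether neighbouring points actually touch, which is what a proper packing algorithm would do.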

Joshua Wiley responded by pointing to some R code he had worked on, based on an algorithm from Leland Wilkinson. However, it is not yet release-ready and does not handle multiple groups (though that is on his to-do list).

Jim Lemon (the author of the wonderful plotrix R package) has also offered his solution to the problem:

    x<-list(runif(90),runif(100),runif(80))

    dendroPlot<-function(x,breaks=NA,nudge=NA) {
     if(is.na(breaks[1]))
      breaks=seq(min(unlist(x),na.rm=TRUE),
       max(unlist(x),na.rm=TRUE),length.out=10)
     plot(c(0,length(x)+1),range(unlist(x)),type="n")
     if(is.na(nudge)) nudge<-strwidth("o")/2
     for(list_element in 1:length(x)) {
      binvar<-cut(x[[list_element]],breaks=breaks)
      for(bin in 1:length(levels(binvar))) {
       thisbin<-which(as.numeric(binvar)==bin)
       offset<-(1:length(x[[list_element]][thisbin])-1)*nudge
       offset[seq(2,length(offset),by=2)]<-
        -offset[seq(2,length(offset),by=2)]
       points(list_element+offset,sort(x[[list_element]][thisbin]))
      }
     }
    }

    dendroPlot(x)

Another solution came from Joris Meys, who wrote the following Make.Funny.Plot function (I ran it with rnorm(1000) and added an overlay of a boxplot):

    Make.Funny.Plot <- function(x){
        unique.vals <- length(unique(x))
        N <- length(x)
        N.val <- min(N/20,unique.vals)

        if(unique.vals>N.val){
          x <- ave(x,cut(x,N.val),FUN=min)
          x <- signif(x,4)
        }
        # construct the outline of the plot
        outline <- as.vector(table(x))
        outline <- outline/max(outline)

        # determine some correction to make the V shape,
        # based on the range
        y.corr <- diff(range(x))*0.05

        # Get the unique values
        yval <- sort(unique(x))

        plot(c(-1,1),c(min(yval),max(yval)),
            type="n",xaxt="n",xlab="")

        for(i in 1:length(yval)){
            n <- sum(x==yval[i])
            x.plot <- seq(-outline[i],outline[i],length=n)
            y.plot <- yval[i]+abs(x.plot)*y.corr
            points(x.plot,y.plot,pch=19,cex=0.5)
        }
    }

    x <- rnorm(1000)
    Make.Funny.Plot(x)
    # my thanks go to Greg Snow for the tip on the transparency colour (from 2007):
    # https://stat.ethz.ch/pipermail/r-help/2007-October/142934.html
    boxplot(x, add = TRUE, at = 0, col="#0000ff22")

And here is the output:

Finally, I saved the best (IMHO) implementation for last: the beeswarm package, written by Aron Charles Eklund, which looks like the most promising solution I have come across so far. From the help page:

A bee swarm plot is a one-dimensional scatter plot similar to “stripchart”, except that would-be overlapping points are separated such that each is visible.

This function seems to offer the most options for customization such as several methods for placing the points and controlling the characters and colors. This function is intended to be mostly compatible with calls to stripchart or boxplot. Thus, code that works with these functions should work with beeswarm with minimal modification.
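As a rough illustration of that compatibility claim, the sketch below passes the same formula to stripchart, to boxplot and (only if the package is installed) to beeswarm. The ToothGrowth data set that ships with R is an assumption chosen for the illustration; it does not come from the beeswarm help page.

```r
# ToothGrowth ships with R; it is used here only as a stand-in data set.
# One formula, three plotting functions:
stripchart(len ~ dose, data = ToothGrowth, vertical = TRUE,
           method = "jitter", pch = 16)
boxplot(len ~ dose, data = ToothGrowth)

# run the beeswarm version only if the package is available
if (requireNamespace("beeswarm", quietly = TRUE)) {
  beeswarm::beeswarm(len ~ dose, data = ToothGrowth, pch = 16)
}
```

The only change between the stripchart call and the beeswarm call is dropping the stripchart-specific vertical and method arguments.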

Here is an example of using the beeswarm function (many thanks go to Shane for writing about this solution!):

    if(!require(beeswarm)) {
        install.packages("beeswarm")
        library(beeswarm) # load the package after a fresh install
    }
    data(breast)
    beeswarm(time_survival ~ event_survival, data = breast,
        method = 'swarm',
        pch = 16, pwcol = as.numeric(ER),
        xlab = '', ylab = 'Follow-up time (months)',
        labels = c('Censored', 'Metastasis'))
    legend('topright', legend = levels(breast$ER),
        title = 'ER', pch = 16, col = 1:2)

And the output is the following:

To get the plot I presented at the beginning of the post, you’ll need to call boxplot after running beeswarm:

    if(!require(beeswarm)) {
        install.packages("beeswarm")
        library(beeswarm) # load the package after a fresh install
    }
    data(breast)

    beeswarm(time_survival ~ event_survival, data = breast,
        method = 'swarm',
        pch = 16, pwcol = as.numeric(ER),
        xlab = '', ylab = 'Follow-up time (months)',
        labels = c('Censored', 'Metastasis'))

    # my thanks go to Greg Snow for the tip on the transparency colour (from 2007):
    # https://stat.ethz.ch/pipermail/r-help/2007-October/142934.html
    boxplot(time_survival ~ event_survival, data = breast, add = TRUE,
        names = c("",""), col="#0000ff22")
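A short note on the col = "#0000ff22" argument used for the boxplot overlay: it is a hex colour in "#RRGGBBAA" form, where the last two digits are the opacity, so the box is drawn in a very faint blue that does not hide the points underneath. The sketch below builds the same colour from its components; the variable name faint_blue and the simulated data are assumptions made for the illustration.

```r
# decode the colour: red, green, blue and alpha, each on a 0-255 scale
col2rgb("#0000ff22", alpha = TRUE)

# the same colour built from its components (0x22 is 34 out of 255)
faint_blue <- rgb(0, 0, 1, alpha = 0x22 / 255)

# overlay a see-through boxplot on a jittered strip chart
set.seed(1)
x <- rnorm(100)
stripchart(x, method = "jitter", vertical = TRUE, pch = 16)
boxplot(x, add = TRUE, at = 1, col = faint_blue)
```

Raising the alpha component makes the box more opaque; the graphics device has to support transparency for the effect to show.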

I hope you found this post useful. If you know of more ways to make such a plot, please let me (and others) know about it in the comments.
