One of the most common types of datasets people want to plot is one where there is a continuous dependent variable – like age, company profit, or beard length – as a function of a categorical independent variable – like gender, a specific company, or one’s pirate ship.
Unfortunately, despite the vast improvements in computational graphing capabilities in the past decades, way too many people are still stuck on tired old plotting types like barplots that can reduce a complex dataset to a single, potentially misleading summary statistic.
A good plot needs to show more than just a single summary statistic. Instead, it should have the trifecta that I call RDI: Raw data, Descriptive statistics, and Inferential statistics. Most plots that people use show just one of these attributes. Recent plot types like violin plots and beanplots (Kampstra, 2008) are huge improvements over basic bar and box plots, but they still don’t hit on the full RDI trifecta.
Thankfully, there is a true RDI plot – and it’s called The Pirate plot. The pirate plot goes beyond existing plots by combining Raw data, Descriptive statistics and Inferential statistics into a single elegant plot. And yes, it was developed by a group of R pirates.
You can easily create a Pirate plot using the pirateplot function contained in the yarrr R package. You can download the yarrr package from GitHub at www.github.com/ndphillips/yarrr.
Pirate Beard Length Data
To show you how the function works and the kinds of plots it creates, let’s analyze a set of data. The dataframe we’ll use is called BeardLengths. It contains results from a survey of 150 pirates on three different pirate ships: The Angry Badger, the Fearless Snake, and the Nervous Goat. For each pirate, I recorded both the name of the ship he/she works on, and measured his/her beard length. Our goal is to see if there is a relationship between the ship a pirate works on and the length of his/her beard.
Here’s how the first four (of the 150 total) rows of the dataframe look:
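To peek at the top of any dataframe yourself, head() does the job. If you don't have the survey data handy, you can follow along with a simulated stand-in. The sketch below is a hypothetical illustration, not the real BeardLengths data: the object name sim.beards, the 50 pirates per ship, and every mean and standard deviation are assumptions chosen only to roughly echo the patterns described in this post. Only the column names Ship and Beard are taken from the code used here.

```r
# Simulated stand-in for the survey data (all numbers are made up)
set.seed(42)  # arbitrary seed so the simulation is reproducible

sim.beards <- data.frame(
  Ship = rep(c("Angry Badger", "Fearless Snake", "Nervous Goat"), each = 50),
  Beard = round(c(
    rnorm(50, mean = 25, sd = 5),   # one broad cluster
    rnorm(50, mean = 20, sd = 2),   # one tight cluster
    rnorm(25, mean = 3, sd = 1),    # Nervous Goat: short-beard subgroup
    rnorm(25, mean = 35, sd = 3)    # Nervous Goat: long-beard subgroup
  ), 1)
)

# A beard cannot have negative length, so clip any stray negatives to 0
sim.beards$Beard <- pmax(sim.beards$Beard, 0)

# Show the first four rows
first.rows <- head(sim.beards, 4)
print(first.rows)
```

If you want to run the rest of the code in this post on the simulated data, pass sim.beards wherever the data argument appears.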
Now, let’s plot these data using barplots and boxplots. For the barplot, we’ll first calculate the mean beard length of pirates on each ship and assign them to an object called length.means. We can run the boxplot function directly on the data.
length.means <- aggregate(Beard ~ Ship, FUN = mean, data = BeardLengths)
barplot(length.means$Beard, names.arg = length.means$Ship, main = "Beard Length by Ship")
boxplot(Beard ~ Ship, data = BeardLengths, main = "Beard Length by Ship")
Looking at the barplot, it looks like pirates from the Angry Badger tend to have the longest beards, while those from the Fearless Snake and the Nervous Goat tend to be a bit shorter. But let’s face it – the plot does not look terribly nice. Most importantly, it completely ignores distributional information: for example, is there larger variability in beard lengths for one ship than another?
To answer this question, we can move up to a boxplot. In addition to showing a measure of central tendency (specifically the median), boxplots display the lower and upper quartiles, which reflect the variability in a distribution. The boxplot to the right of the barplot tells us a bit more about the data. Most importantly, it looks like pirates from the Angry Badger and the Fearless Snake have smaller variability in beard length than those from the Nervous Goat.
However, that’s not the end of the story. While boxplots are an improvement over barplots, they are notorious for hiding distributions that are not unimodal. Specifically, it could be the case that one (or more) of these distributions contains two (or more) subgroups. Additionally, neither the barplot nor the boxplot allows us to easily make inferences about the group populations. That’s where the Pirate Plot comes in.
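To make the point about hidden subgroups concrete, here is a small base R sketch. The numbers are invented for illustration (they are not the survey data), and the object names are hypothetical. It builds a sample with two well-separated clusters and then looks at the five numbers a boxplot actually draws.

```r
# Hypothetical beard lengths with two clear subgroups (made-up numbers)
short.beards <- rep(1:5, each = 5)                  # 25 pirates, lengths 1 to 5
long.beards  <- rep(seq(30, 38, by = 2), each = 5)  # 25 pirates, lengths 30 to 38
two.groups   <- c(short.beards, long.beards)

# The five numbers a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box.numbers <- boxplot.stats(two.groups)$stats
print(box.numbers)

# Count how many pirates have a beard within 10 units of the median
box.median    <- median(two.groups)
n.near.median <- sum(abs(two.groups - box.median) < 10)
print(n.near.median)

# The box itself gives no hint that the middle of the range is empty
boxplot(two.groups, main = "Two subgroups hidden in one box")
```

The median line sits in the gap between the two clusters, where not a single pirate was observed, yet the box around it looks perfectly ordinary.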
The Pirate Plot
Let’s create a Pirate Plot of the data using the pirateplot function in the yarrr package. You can install the yarrr package using the following code:
# To use the install_github function, you also need to have the devtools library installed and loaded!
# install.packages("devtools")
library(devtools)
install_github("ndphillips/yarrr")
library("yarrr")
Now that we’ve installed and loaded the package, we can use the function. To do so, we specify the names of the dependent and independent variables by their string name. We then specify the dataframe in the data argument.
pirateplot(dv.name = "Beard", iv.name = "Ship", data = BeardLengths, main = "Beard Length by Ship" )
Now that looks like a modern plot! Just like in the previous plots, we have a descriptive measure of central tendency – the mean of each distribution is shown as a dark horizontal line in the middle of each bean. However, Pirate plots have the full RDI trifecta. Let’s walk through each one:
Raw Data: All data points
The Pirate Plot shows all the raw data behind each group as an open circle. The points are randomly jittered horizontally to make them easier to see (you can control the amount of jittering with the jitter argument). Thus, unlike bar and boxplots, we can quickly see things like outliers or suspicious gaps in the data.
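If you're curious what horizontal jittering does, base R's jitter() function gives a feel for it. This is only a sketch of the general idea with made-up beard lengths; I'm not claiming it matches how pirateplot places its points internally.

```r
set.seed(1)  # arbitrary seed so the jitter is reproducible

# Made-up beard lengths with several ties that would overlap without jitter
beard <- c(21.5, 22.0, 22.0, 22.5, 23.0, 23.0, 23.0, 24.5)

# Every point belongs to group 1. jitter() nudges each x position by a
# random amount between -0.1 and +0.1; the beard lengths are not changed.
x.jittered <- jitter(rep(1, length(beard)), amount = 0.1)

plot(x.jittered, beard,
     xlim = c(0.5, 1.5), xaxt = "n",
     xlab = "", ylab = "Beard length",
     main = "Raw data with horizontal jitter")
axis(1, at = 1, labels = "One ship")
```

Because only the horizontal position is randomized, the vertical position of every point still shows the exact observed value.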
Descriptive Statistics: Central tendency and Densities
In addition to showing a measure of central tendency, the Pirate Plot also shows full densities for each group. If you’re not familiar with densities, they simply show us how crowded or sparse the data are at every possible value. In the plot above, we can quickly see something interesting from the densities: while the beard length distributions of Angry Badger and Fearless Snake pirates look rather normal, the distribution of beard lengths for Nervous Goat pirates is clearly bimodal. This means that Nervous Goat pirates tend to have either very short beards (less than 5) or very long beards (greater than 30).
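You can compute a density yourself with base R's density() function. The sketch below uses simulated data loosely shaped like the Nervous Goat description (the group sizes, means, and standard deviations are my assumptions, not the survey values) and then looks for the peaks of the estimated density. The 10% height cutoff for what counts as a peak is an arbitrary choice for this example.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated bimodal sample: a short-beard and a long-beard subgroup
goat.like <- c(rnorm(25, mean = 3, sd = 1), rnorm(25, mean = 35, sd = 3))

# Kernel density estimate with R's default settings
dens <- density(goat.like)

# A local peak is where the slope switches from rising to falling
slope.sign <- sign(diff(dens$y))
peak.idx   <- which(diff(slope.sign) < 0) + 1

# Keep only pronounced peaks (at least 10% as tall as the highest one)
peak.idx <- peak.idx[dens$y[peak.idx] >= 0.1 * max(dens$y)]

# Beard lengths at which the data are most crowded
peak.locations <- dens$x[peak.idx]
print(round(peak.locations, 1))

plot(dens, main = "Density of a bimodal sample", xlab = "Beard length")
abline(v = peak.locations, lty = 2)
```

Two separate peaks in the density are exactly the kind of structure that the single box of a boxplot cannot show.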
Inferential Statistics: 95% Highest Density Intervals (HDI)
Finally, the Pirate Plot utilises the BEST (Bayesian Estimation Supersedes the T-Test) package to generate 95% Highest Density Intervals (HDIs) of the mean of each group. These intervals are like confidence intervals – but much better. For each interval, we can state that there is a 95% probability that the true population mean falls within that interval. In the pirate plot, 95% HDIs are shown as solid bands around the sample mean. How can we use these bands to make inferences? Well, looking at the band for the Angry Badger, we can conclude that there is a 95% probability that the true mean beard length of Angry Badger pirates is somewhere between 23 and 26. In contrast, for Fearless Snake pirates there is a 95% chance that the true mean falls between around 20 and 21. Because these intervals do not overlap, we can conclude with high confidence that pirates on the Angry Badger have longer beards on average than those on the Fearless Snake. Note that these are not the same as 95% Confidence Intervals (CIs), which do not license this kind of direct probability statement about the population mean (see XXX).
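To give a feel for what an HDI is, here is a rough base R sketch that finds the narrowest interval containing 95% of a set of posterior draws. Please treat it as an illustration under simplifying assumptions rather than what the BEST package does: the helper name hdi.of.samples is hypothetical, the beard lengths are simulated, and the posterior draws come from a plain normal model with a standard noninformative prior instead of the model BEST fits.

```r
# Narrowest interval that contains `mass` of the draws (a sample-based HDI)
hdi.of.samples <- function(samples, mass = 0.95) {
  sorted       <- sort(samples)
  n            <- length(sorted)
  n.inside     <- ceiling(mass * n)     # draws the interval must contain
  n.candidates <- n - n.inside + 1      # possible starting positions
  widths       <- sorted[n.inside:n] - sorted[1:n.candidates]
  start        <- which.min(widths)     # the narrowest candidate wins
  c(lower = sorted[start], upper = sorted[start + n.inside - 1])
}

set.seed(123)  # arbitrary seed for reproducibility
beard <- rnorm(50, mean = 25, sd = 5)   # made-up beard lengths for one ship

# Stand-in posterior for the mean: under a normal model with a standard
# noninformative prior, the mean follows a shifted and scaled t
# distribution with n - 1 degrees of freedom.
n          <- length(beard)
mean.draws <- mean(beard) + sd(beard) / sqrt(n) * rt(20000, df = n - 1)

hdi.95 <- hdi.of.samples(mean.draws)
print(round(hdi.95, 2))
```

For a symmetric posterior like this one, the HDI comes out very close to the central 95% interval. The two differ more visibly when the posterior is skewed, because the HDI slides toward the region where the draws are densest.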
Customizing the look of pirate plots with additional arguments
The pirateplot function has lots of arguments that allow you to easily customize the look of your Pirate Plot. For example, you can manipulate the transparency of the specific elements (e.g., points vs. density line vs. HDI) using the trans.vec argument, or the color scheme using the my.palette argument.
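If transparency values are new to you, this base R sketch shows the general idea. It assumes that a transparency of 0 means fully solid and 1 means invisible, so opacity is simply 1 minus transparency. The helper trans.col is hypothetical and only illustrates the concept; I'm not claiming pirateplot computes its colors this way.

```r
# Hypothetical helper: turn a color plus a transparency value into a
# semi-transparent color. Assumes opacity = 1 - transparency.
trans.col <- function(col, trans) {
  adjustcolor(col, alpha.f = 1 - trans)
}

bold.points  <- trans.col("black", trans = 0.2)  # mostly solid
faint.points <- trans.col("black", trans = 0.8)  # mostly see-through

# The last two hex digits of each color string encode its opacity
print(c(bold = bold.points, faint = faint.points))

# Same made-up points drawn twice: solid on the left, faint on the right
plot(rep(1, 10), 1:10, pch = 16, cex = 3, col = bold.points,
     xlim = c(0.5, 2.5), xaxt = "n", xlab = "", ylab = "")
points(rep(2, 10), 1:10, pch = 16, cex = 3, col = faint.points)
axis(1, at = 1:2, labels = c("trans = 0.2", "trans = 0.8"))
```

Lower transparency makes an element stand out, which is why dropping the transparency of one element and raising it for the others shifts the emphasis of the whole plot.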
Let’s make a black and white version with a stronger emphasis on the raw data. To do this, I’ll drop the transparency of the raw data down to .2, and increase the transparency of the rest of the elements to .9. I’ll make the plot black and white by setting my.palette to “black”:
pirateplot(dv.name = "Beard", iv.name = "Ship", data = BeardLengths, my.palette = "black", trans.vec = c(.2, .9, .9, .9, .9), main = "Beard Length by Ship" )
Now, let’s make a really colorful plot that de-emphasizes the raw data, and really highlights the density and HDI. To do this, I’ll set the transparency of the raw data to .8, and decrease the transparency of the rest of the elements to .2. To try some different colors, I’ll use the “google” palette stored in the piratepal (Pirate Palette) function in the yarrr package.
pirateplot(dv.name = "Beard", iv.name = "Ship", data = BeardLengths, my.palette = "google", trans.vec = c(.8, .2, .2, .2, .2), main = "Beard Length by Ship" )
As you can see, just like a sweet pair of jeans, you can dress a Pirate Plot up or down depending on how you want to show it off.
Don’t use barplots. Don’t use boxplots. In fact, pretty much don’t use any default plot from Excel or SPSS. These plots are not only damn ugly and outdated, but they can hide critical distributional information. Instead, use an "RDI" plot like the Pirate Plot that shows the trifecta of 1) Raw data, 2) Descriptive statistics, and 3) Inferential statistics.
Kampstra, P. (2008) Beanplot: A Boxplot Alternative for Visual Comparison of Distributions. Journal of Statistical Software, Code Snippets, 28(1), 1-9. URL http://www.jstatsoft.org/v28/c01/