Data flow visuals – alluvial vs ggalluvial in R

R on head spin - the Heads or Tails blog

3 years ago

[This article was first published on R on head spin - the Heads or Tails blog, and kindly contributed to R-bloggers.]

I have long been a fan of creative data visualisation techniques. For me, the choice of visual representation is driven by both the type of data and the kind of question one wants to examine.

Powerful visualisation tools were a major strength of the R language well before the ggplot2 package and the tidyverse burst onto the scene. Today’s post will be an introductory examination of two similar packages that allow us to study the connection and flow of data between different categorical features via alluvial plots. Those packages are alluvial and ggalluvial.

All in all we need the following libraries:

libs <- c('dplyr', 'stringr', 'forcats',     # wrangling
          'knitr','kableExtra',               # table styling
          'ggplot2','alluvial','ggalluvial',  # plots
          'nycflights13')                     # data
invisible(lapply(libs, library, character.only = TRUE))

Alluvial plots are best explained by showing one. For illustrating the following examples we will take on board the flights data from the nycflights13 library. This comprehensive data set contains all flights that departed from the New York City airports JFK, LGA, and EWR in 2013. For this analysis, we will only look at three features – the 1st-class features if you will: airport of origin, destination airport, and carrier (i.e. airline code). From the metaphorical front of the cabin, here are the first 4 rows:

origin	carrier	dest
EWR	UA	IAH
LGA	UA	IAH
JFK	AA	MIA
JFK	B6	BQN
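
A minimal sketch of how a preview like the one above could be produced, assuming dplyr and nycflights13 are attached. The name first_rows is only illustrative, and the table styling via knitr and kableExtra is left out here.

```r
library(dplyr)
library(nycflights13)

# Keep only the three features of interest and show the first 4 rows
first_rows <- flights %>%
  select(origin, carrier, dest) %>%
  head(4)

first_rows
```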

The alluvial package was introduced in 2014 to fill a niche in the landscape of visualisations. I have enjoyed using it in the past in several Kaggle Kernels. Here’s what a plot looks like:

top_dest <- flights %>%
  count(dest) %>%
  top_n(5, n) %>%
  pull(dest)

top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(origin = fct_relevel(as.factor(origin), c("EWR", "LGA", "JFK")))

alluvial(fly %>% select(-n),
         freq=fly$n, border=NA, alpha = 0.5,
         col=case_when(fly$origin == "JFK" ~ "red",
                       fly$origin == "EWR" ~ "blue",
                       TRUE ~ "orange"),
         cex=0.75,
         axis_labels = c("Origin", "Carrier", "Destination"),
         hide = fly$n < 150)

The features are arranged horizontally, with their value counts stacked vertically. This corresponds to a stacked barplot: e.g. for the destinations “BOS” has fewer flights than “LAX”. Here we only look at the top 5 destinations and their top 4 carriers (that’s the first two segments of the code above).
The “alluvia” are the bands that connect the features from left to right. Alluvia break down all feature combinations, with complexity increasing also from left to right. These sub-segments are called “flows”.
This means that starting from the 3 origin airports on the left there are 4 “flows” each (i.e. 12 in total) connecting to the 4 main carriers. Between carrier and destination these then fan out into 5 flows each for a theoretical total of 60 different flows. In practice, we want to use the hide parameter to exclude those flows that only have a few observations so that we can focus on the big picture.
For further styling, a border colour can be assigned to each alluvium. This would allow us to distinguish the different flows on the left side that then break into sub-flows on the right side. Feel free to try it out. Personally, I think the plot looks better without border colours.
We chose a colour coding (argument col) that puts focus on the origin airports. The first argument of the alluvial function is the data set, followed by the frequency column (freq). Note that alluvial expects the data already to be in the shape of grouped counts (as prepared via count in the third code segment above).
In my view, the best transparency for alluvia is the default alpha = 0.5. As usual, cex does the scaling and axis_labels is pretty self-explanatory.
The alluvial function has an ordering parameter, but it’s generally better to do the ordering through factor re-levelling when preparing the data (via the tidyverse forcats library). Here we only change the order for the origin feature.
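
To make the flow arithmetic from the notes above concrete, here is a small sketch that counts the theoretical, observed, and displayed origin-carrier-destination combinations. It rebuilds the grouped counts without the factor re-levelling (which does not affect the counts); the names n_theoretical, n_observed, and n_shown are illustrative only.

```r
library(dplyr)
library(nycflights13)

# Same selection as in the plotting code: top 5 destinations, top 4 carriers
top_dest <- flights %>% count(dest) %>% top_n(5, n) %>% pull(dest)
top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest)

# Upper bound on the number of flows: origins x carriers x destinations
n_theoretical <- n_distinct(fly$origin) * n_distinct(fly$carrier) * n_distinct(fly$dest)

# Combinations that actually occur in the data
n_observed <- nrow(fly)

# Combinations that remain visible with hide = fly$n < 150
n_shown <- sum(fly$n >= 150)

c(theoretical = n_theoretical, observed = n_observed, shown = n_shown)
```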

So, other than looking pretty, what insights does it give us? Well, for instance we see that (for this subset) EWR is dominated by UA (United Airlines) and has almost no AA (American Airlines) flights. In turn, UA flights are not frequent at LGA or JFK. Neither Boston (BOS) nor Los Angeles (LAX) is connected to LGA (orange). Thus, the alluvial plot shows us – pretty literally in this case – the flow of flight volume between airports through airline carriers.
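
Visual impressions like these can be cross-checked numerically. The following sketch computes each carrier's share of flights per origin airport for the same subset; carrier_share, flights_total, and share are hypothetical names introduced for illustration.

```r
library(dplyr)
library(nycflights13)

# Rebuild the subset used for the plot (top 5 destinations, top 4 carriers)
top_dest <- flights %>% count(dest) %>% top_n(5, n) %>% pull(dest)
top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

# Share of each carrier among the subset's flights, per origin airport
carrier_share <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, name = "flights_total") %>%
  group_by(origin) %>%
  mutate(share = flights_total / sum(flights_total)) %>%
  ungroup() %>%
  arrange(origin, desc(share))

carrier_share
```

A carrier that dominates an airport shows up with a share close to 1 in its origin group.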

Now, the alluvial tool has a rather specific syntax and doesn’t integrate seamlessly with the tidyverse. Enter the ggalluvial library:

fly %>%
  mutate(origin = fct_rev(as.factor(origin)),
         carrier = fct_rev(as.factor(carrier)),
         dest = fct_rev(as.factor(dest))) %>%
  filter(n > 150) %>%
  ggplot(aes(y = n, axis1 = origin, axis2 = carrier, axis3 = dest)) +
  geom_alluvium(aes(fill = origin), aes.bind=TRUE, width = 1/12) +
  geom_stratum(width = 1/4, fill = "white", color = "black") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +  # replaces label.strata = TRUE, deprecated in newer ggalluvial versions
  scale_x_discrete(limits = c("Origin", "Carrier", "Destination"),
                   expand = c(.05, .05)) +
  scale_fill_manual(values = c("red", "orange", "blue")) +
  labs(y = "Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  ggtitle("NYC flights volume for top destinations and airlines")

Here I purposefully chose the styling parameters to (broadly) reproduce the above plot. It is evident that ggalluvial integrates much more smoothly into the ggplot2 grammar. Specifically:

The alluvia and the vertical features (the “strata”; here: origin, carrier, and destination) are implemented as different geometry layers. Note that the default order of the strata features is reversed compared to alluvial. Also, there are no gaps between the strata here, in contrast to alluvial. This makes it easier to add a y-axis.
I decided not to change the default y-axis and subtle background grid lines, which provide quantitative information and guide the eye. Replace theme_minimal() by theme_void() to get very close to the alluvial plot style.
By default, ggalluvial plots the same number of flows between neighbouring strata. This behaviour can be changed by the aes.bind=TRUE parameter in geom_alluvium. Remove it to see the difference with a larger number of narrower flows between the origin and carrier strata.
We are setting the colours manually. One advantage of ggalluvial is that instead of a manual setting you can use any ggplot2 (or add-on) scale such as brewer or viridis. Similarly we can modify the plot theme.
Instead of geom_text you can use geom_label, e.g. in combination with a different fill colour in geom_stratum.
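
As a sketch of the last two suggestions, the variant below swaps the manual colours for a viridis scale and uses geom_label on lightly shaded strata. It assumes reasonably recent versions of ggplot2 (for scale_fill_viridis_d and after_stat) and ggalluvial; aes.bind and the factor reversal are left out for brevity, so the ordering differs from the plot above.

```r
library(dplyr)
library(ggplot2)
library(ggalluvial)
library(nycflights13)

# Rebuild the grouped counts (top 5 destinations, top 4 carriers)
top_dest <- flights %>% count(dest) %>% top_n(5, n) %>% pull(dest)
top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest)

p <- fly %>%
  filter(n > 150) %>%
  ggplot(aes(y = n, axis1 = origin, axis2 = carrier, axis3 = dest)) +
  geom_alluvium(aes(fill = origin), width = 1/12) +
  geom_stratum(width = 1/4, fill = "grey90", color = "black") +
  # Boxed labels instead of plain text
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Origin", "Carrier", "Destination"),
                   expand = c(.05, .05)) +
  # Discrete viridis palette instead of manually chosen colours
  scale_fill_viridis_d() +
  labs(y = "Cases") +
  theme_minimal() +
  theme(legend.position = "none")

p
```

Swapping scale_fill_viridis_d() for scale_fill_brewer() works the same way, which is the main convenience over choosing colours by hand.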

In closing: both packages are versatile and provide somewhat different approaches to creating alluvial plots. If you are frequently working within the tidyverse then ggalluvial might be more intuitive for you. Specific (edge) cases might be better handled by one tool than the other.

For more information, check out the respective vignettes for ggalluvial and alluvial as well as their pages on GitHub.

Have fun!
