Data flow visuals – alluvial vs ggalluvial in R

[This article was first published on R on head spin - the Heads or Tails blog, and kindly contributed to R-bloggers].

I have long been a fan of creative data visualisation techniques. For me, the choice of visual representation is driven by both the type of data and the kind of question one wants to examine.

The power of its visualisation tools was a major strength of the R language well before the ggplot2 package and the tidyverse burst onto the scene. Today’s post is an introductory look at two similar packages that let us study the connection and flow of data between different categorical features via alluvial plots: alluvial and ggalluvial.

All in all we need the following libraries:

libs <- c('dplyr', 'stringr', 'forcats',     # wrangling
          'knitr','kableExtra',               # table styling
          'ggplot2','alluvial','ggalluvial',  # plots
          'nycflights13')                     # data
invisible(lapply(libs, library, character.only = TRUE))

Alluvial plots are best explained by showing one. To illustrate the following examples we will use the flights data from the nycflights13 package. This comprehensive data set contains all flights that departed from the New York City airports JFK, LGA, and EWR in 2013. For this analysis, we will only look at three features – the 1st-class features if you will: airport of origin, destination airport, and carrier (i.e. airline code). From the metaphorical front of the cabin, here are the first 4 rows:

origin carrier dest
EWR UA IAH
LGA UA IAH
JFK AA MIA
JFK B6 BQN
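
For reference, a table like the one above can be produced with a short sketch along these lines. The name first_rows is my own choice for illustration; the columns are the ones from the flights data described above.

```r
library(dplyr)
library(nycflights13)

# Keep only the three categorical features used in this post
# and take the first four rows of the flights data.
first_rows <- flights %>%
  select(origin, carrier, dest) %>%
  head(4)

print(first_rows)
```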

The alluvial package was introduced in 2014 to fill a niche in the landscape of visualisations. I have enjoyed using it in the past in several Kaggle Kernels. Here’s what a plot looks like:

# The 5 most frequent destinations
top_dest <- flights %>%
  count(dest) %>%
  top_n(5, n) %>%
  pull(dest)

# The 4 most frequent carriers serving those destinations
top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

# Flight counts per origin/carrier/destination combination
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(origin = fct_relevel(as.factor(origin), c("EWR", "LGA", "JFK")))

# One alluvium per row of fly, coloured by origin airport;
# combinations with fewer than 150 flights are hidden
alluvial(fly %>% select(-n),
         freq=fly$n, border=NA, alpha = 0.5,
         col=case_when(fly$origin == "JFK" ~ "red",
                       fly$origin == "EWR" ~ "blue",
                       TRUE ~ "orange"),
         cex=0.75,
         axis_labels = c("Origin", "Carrier", "Destination"),
         hide = fly$n < 150)

So, other than looking pretty, what insights does it give us? Well, for instance we see that (for this subset) EWR is dominated by UA (United Airlines) and has almost no AA (American Airlines) flights. In turn, UA flights are not frequent in LGA or JFK. Neither Boston (BOS) nor Los Angeles (LAX) is connected to LGA (orange). Thus, the alluvial plot shows us – pretty literally in this case – the flow of flight volume between airports through airline carriers.
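
Readings like these can be cross-checked against the underlying counts. The following is only a sketch: it rebuilds the same subset as above so that it runs on its own, and the names fly_check, origin_carrier, and origin_dest are mine, chosen for illustration. Note that it tabulates all combinations, including those with fewer than 150 flights that the plot hides.

```r
library(dplyr)
library(nycflights13)

# Rebuild the subset used for the plot: top 5 destinations,
# then the top 4 carriers serving those destinations.
top_dest <- flights %>%
  count(dest) %>%
  top_n(5, n) %>%
  pull(dest)

top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

fly_check <- flights %>%
  filter(dest %in% top_dest, carrier %in% top_carrier) %>%
  count(origin, carrier, dest)

# Flight counts by origin and carrier (rows: origin, columns: carrier)
origin_carrier <- xtabs(n ~ origin + carrier, data = fly_check)
print(origin_carrier)

# Share of each carrier within an origin airport (rows sum to 1)
print(round(prop.table(origin_carrier, margin = 1), 2))

# Flight counts by origin and destination; a zero means no connection
origin_dest <- xtabs(n ~ origin + dest, data = fly_check)
print(origin_dest)
```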

Now, the alluvial tool has a rather specific syntax and doesn’t integrate seamlessly with the tidyverse. Enter the ggalluvial library:

fly %>%
  # Reverse the factor levels so the strata stack in the same order as above
  mutate(origin = fct_rev(as.factor(origin)),
         carrier = fct_rev(as.factor(carrier)),
         dest = fct_rev(as.factor(dest))) %>%
  # Keep the same combinations that the alluvial() call above does not hide
  filter(n >= 150) %>%
  ggplot(aes(y = n, axis1 = origin, axis2 = carrier, axis3 = dest)) +
  geom_alluvium(aes(fill = origin), aes.bind=TRUE, width = 1/12) +
  geom_stratum(width = 1/4, fill = "white", color = "black") +
  # Recent ggalluvial versions label strata via after_stat(stratum);
  # older versions used the label.strata = TRUE argument instead
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Origin", "Carrier", "Destination"),
                   expand = c(.05, .05)) +
  scale_fill_manual(values = c("red", "orange", "blue")) +
  labs(y = "Cases") +
  theme_minimal() +
  theme(legend.position = "none") +
  ggtitle("NYC flights volume for top destinations and airlines")

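Here I purposefully chose the styling parameters to (broadly) reproduce the above plot. It is evident that ggalluvial integrates much more smoothly into the ggplot2 grammar. Specifically, the axes are mapped inside aes(), the flows and the strata are drawn by separate layers (geom_alluvium and geom_stratum), and scales, labels, and themes are set with the usual ggplot2 functions.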
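
To see the layered structure in isolation, here is a minimal sketch on a tiny hand-made data frame. The data (toy, with columns from, to, and n) is entirely hypothetical and only serves to keep the example small; it assumes a reasonably recent ggalluvial and ggplot2, where strata are labelled via after_stat(stratum).

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical toy data: one row per combination of two
# categorical features, with n giving the number of cases.
toy <- data.frame(
  from = c("A", "A", "B", "B"),
  to   = c("X", "Y", "X", "Y"),
  n    = c(30, 10, 5, 25)
)

# ggalluvial helper: checks that the first two columns can serve
# as axes of an alluvial plot in this "wide" format
stopifnot(is_alluvia_form(toy, axes = 1:2, silent = TRUE))

p <- ggplot(toy, aes(y = n, axis1 = from, axis2 = to)) +
  geom_alluvium(aes(fill = from), width = 1/12) +               # the flows
  geom_stratum(width = 1/4, fill = "white", color = "black") +  # the boxes
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("From", "To"), expand = c(.05, .05)) +
  theme_minimal()

print(p)
```

Each piece is an ordinary ggplot2 layer or scale, so any of them can be swapped or restyled without touching the rest.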

In closing: both packages are versatile and provide somewhat different approaches to creating alluvial plots. If you are frequently working within the tidyverse then ggalluvial might be more intuitive for you. Specific (edge) cases might be better handled by one tool than the other.

For more information, check out the respective vignettes for ggalluvial and alluvial as well as their pages on GitHub.

Have fun!
