Parallel Coordinate Plots for Discrete and Categorical Data in R — A Comparison
Parallel Coordinate Plots are useful to visualize multivariate data. R provides several packages/functions to draw Parallel Coordinate Plots (PCPs):
- ggparcoord in the package GGally
- the package ggparallel
- plain ggplot2 with geom_path
In this post I will compare these approaches using a randomly generated data set with three discrete variables.
Generate a data set
We need some multivariate categorical data for our PCPs. As a practical example, assume that we conducted a survey with several questions. Each question is asked three times, each time in a different context, and is answered on a discrete scale from 1 to 7. So each question has three “dimensions”. The distribution of answers across the three dimensions should be displayed per question. A PCP is well suited to this, because the three dimensions share the same unit and scale and hence can easily be compared on parallel coordinates (you can also use different units and scales on parallel coordinates, but the interpretation can become quite tricky then). More than three dimensions could be displayed just as easily.
Let’s generate a data set for one question with the three dimensions (q1_d1 to q1_d3):
    library(triangle)
    set.seed(0)
    q1_d1 <- round(rtriangle(1000, 1, 7, 5))
    q1_d2 <- round(rtriangle(1000, 1, 7, 6))
    q1_d3 <- round(rtriangle(1000, 1, 7, 2))
    df <- data.frame(q1_d1 = factor(q1_d1),
                     q1_d2 = factor(q1_d2),
                     q1_d3 = factor(q1_d3))
We're using a triangular distribution here in order to get random numbers r with r in [1, 7] around a different mode c for each dimension (5, 6 and 2). So here is what our raw data looks like:
    head(df)
      q1_d1 q1_d2 q1_d3
    1     6     4     3
    2     4     5     5
    3     4     6     6
    4     5     4     5
    5     6     6     3
    6     3     3     2
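To make the role of the mode c more concrete, here is a minimal base R sketch of one common way to draw from a triangular distribution, namely inverse-transform sampling. The helper rtri_sketch is a hypothetical name used only for illustration; it is not part of the triangle package and is not claimed to be how rtriangle is implemented.

```r
# Hypothetical helper (not from the 'triangle' package):
# draw n values from a triangular distribution on [a, b] with mode c
# via inverse-transform sampling.
rtri_sketch <- function(n, a, b, c) {
  u <- runif(n)
  f_c <- (c - a) / (b - a)  # value of the CDF at the mode
  ifelse(u < f_c,
         a + sqrt(u * (b - a) * (c - a)),        # left of the mode
         b - sqrt((1 - u) * (b - a) * (b - c)))  # right of the mode
}

set.seed(0)
x <- round(rtri_sketch(1000, 1, 7, 5))

# frequency of each answer on the 1 to 7 scale
tab <- table(factor(x, levels = 1:7))
print(tab)
```

With a mode of 5, most of the rounded answers should pile up around 4 to 6, which is the kind of skew we want for the survey example.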
Prepare the data
We basically want to know the main "answer paths". So which answer combinations across the three dimensions occur the most? For this, we need to group by all three dimensions, giving us the unique answer combinations, and then count the rows of each group. We're using the dplyr package for this:
    library(dplyr)

    # group by combinations and count
    df_grouped <- df %>% group_by(q1_d1, q1_d2, q1_d3) %>% count()

    # set an id string that denotes the value combination
    df_grouped <- df_grouped %>% mutate(id = factor(paste(q1_d1, q1_d2, q1_d3, sep = '-')))

    # sort by count and select top rows
    df_grouped <- (df_grouped %>% arrange(desc(n)))[1:10,]
The count per group is automatically stored in a column "n". We additionally set an "id" column which denotes the unique answer combination. We also sorted by count and then only selected the rows with the most counts, hence the most popular answer combinations. This is optional, but will generate less chaotic plots.
    head(df_grouped)
      q1_d1 q1_d2 q1_d3  n    id
    1     4     5     3 25 4-5-3
    2     5     5     2 25 5-5-2
    3     5     6     3 25 5-6-3
    4     4     6     3 24 4-6-3
    5     5     6     2 23 5-6-2
    6     4     6     2 21 4-6-2
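If you want to check what the group-and-count step does without dplyr, the same idea can be sketched in base R on a tiny made-up data frame. The names toy and counts, and the toy values, are assumptions for illustration only and are not taken from the survey data.

```r
# Tiny made-up example: three identical answer combinations and one other
toy <- data.frame(a = c(1, 1, 2, 1),
                  b = c(3, 3, 4, 3))

# add a helper column of ones and sum it per unique (a, b) combination
counts <- aggregate(n ~ a + b, data = cbind(toy, n = 1), FUN = sum)

# id string that denotes the value combination
counts$id <- paste(counts$a, counts$b, sep = "-")

# sort by count, most frequent combination first
counts <- counts[order(-counts$n), ]
print(counts)
```

The combination 1-3 ends up in the first row with a count of 3, mirroring how the most popular answer paths end up at the top of df_grouped.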
ggparcoord from GGally
We can now plot our data. Let's try out ggparcoord, which is easy to use:
    library(GGally)
    ggparcoord(df_grouped, columns = 1:3, groupColumn = 'id', scale = 'globalminmax')
We only need to supply the grouped data frame, set the columns which should appear on the x-axis (the first three columns with the answer combinations) and a column that identifies groups for coloring ("id").
Unfortunately, it's not possible to make the line width of the PCP dependent on "n", therefore it only gives a general idea about the most popular "answer paths".
ggparallel
ggparallel is specially designed for categorical data and does not produce a "classical" parallel coordinate output like ggparcoord. It implements several methods for this purpose: "hammock plots, parallel sets plots, common angle plots, and common angle plots with a hammock-like adjustment for line widths" [ggparallel manual]. It is well suited to displaying "movements" of groups. For example, you could use it to display voter movement between parties across different elections.
We can directly use the raw data (df) with it by only specifying the columns which should be used on the x-axis:
    library(ggparallel)
    ggparallel(list('q1_d1', 'q1_d2', 'q1_d3'), df, order = 0)
We fixed the order on the y-axis, but still this produces hardly readable output, because every single combination (of about 200) gets displayed. We can fix this by using our grouped and filtered data frame that only contains the top ten combinations:
    df_pcp <- as.data.frame(df_grouped)  # this is important!
    ggparallel(list('q1_d1', 'q1_d2', 'q1_d3'), df_pcp, weight = 'n', order = 0)
This produces a much clearer output. ggparallel can't handle dplyr's "tbl" data frames, so we have to convert it to a traditional data frame first. By specifying a weight, we can make the width of the lines dependent on "n".
ggplot2
Another solution is to use geom_path from ggplot2. For this, we need some data preparation first. We need to convert our grouped data frame into a "long format" using melt() from the package reshape2 (see this explanation on melt on r-bloggers.com) so that our three dimensions are contained in a column named "variable" and the respective values are in the column "value":
    library(reshape2)

    # create long format
    df_pcp <- melt(df_grouped, id.vars = c('id', 'n'))
    df_pcp$value <- factor(df_pcp$value)

    df_pcp
          id  n variable value
    1  4-5-3 25    q1_d1     4
    2  5-5-2 25    q1_d1     5
    3  5-6-3 25    q1_d1     5
    ...
    11 4-5-3 25    q1_d2     5
    12 5-5-2 25    q1_d2     5
    13 5-6-3 25    q1_d2     6
    ...
    21 4-5-3 25    q1_d3     3
    22 5-5-2 25    q1_d3     2
    23 5-6-3 25    q1_d3     3
    ...
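To see what the wide-to-long conversion does, the reshaping can be sketched by hand in base R. This is only an illustration of the idea, not a replacement for melt(); the data frames wide and long and the vector dims are hypothetical names, and the two rows are copied from the top of the grouped output.

```r
# Two rows in wide format: one column per dimension
wide <- data.frame(id = c("4-5-3", "5-5-2"),
                   n = c(25, 25),
                   q1_d1 = c(4, 5),
                   q1_d2 = c(5, 5),
                   q1_d3 = c(3, 2))

dims <- c("q1_d1", "q1_d2", "q1_d3")

# build one block of rows per dimension and stack the blocks
long <- do.call(rbind, lapply(dims, function(d) {
  data.frame(id = wide$id, n = wide$n, variable = d, value = wide[[d]])
}))

long$value <- factor(long$value)
print(long)
```

Each answer combination now occupies three rows, one per dimension, which is the shape geom_path needs in order to connect the points of one "id" across the x-axis.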
We can then specify the levels that should always be drawn on the y-axis (1 to 7). In the ggplot() function we define an aesthetic that uses the "variable" column for the x-axis and the "value" column for the y-axis. We also specify to group the values for each of the three dimensions by using the "id" column. This is very important, because otherwise the connections between the three dimensions won't be drawn. We use geom_path() to draw the connection lines and make the size (i.e. width) dependent on the "n" column and colorize by "id" group.
    y_levels <- levels(factor(1:7))
    ggplot(df_pcp, aes(x = variable, y = value, group = id)) +   # group = id is important!
      geom_path(aes(size = n, color = id),
                alpha = 0.5,
                lineend = 'round', linejoin = 'round') +
      scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
      scale_size(breaks = NULL, range = c(1, 7))
The result is quite similar to ggparcoord but the line width is dynamic and we can customize the plot more easily.
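A side note for newer ggplot2 versions: as far as I know, from ggplot2 3.4.0 on the size aesthetic for lines is deprecated in favour of linewidth. The following is a minimal sketch under that assumption, using a small made-up long-format data frame (toy_long is a hypothetical name, with two answer paths and invented counts) instead of the survey data.

```r
library(ggplot2)  # assumes ggplot2 >= 3.4.0 for the linewidth aesthetic

# two made-up answer paths in long format
toy_long <- data.frame(
  id = rep(c("4-5-3", "5-5-2"), times = 3),
  n = rep(c(25, 20), times = 3),
  variable = rep(c("q1_d1", "q1_d2", "q1_d3"), each = 2),
  value = factor(c(4, 5, 5, 5, 3, 2), levels = 1:7)
)

p <- ggplot(toy_long, aes(x = variable, y = value, group = id)) +
  geom_path(aes(linewidth = n, color = id),
            alpha = 0.5, lineend = "round", linejoin = "round") +
  scale_y_discrete(limits = as.character(1:7)) +      # always show 1 to 7
  scale_linewidth(range = c(1, 7), guide = "none")    # width depends on n

print(p)
```

Apart from the renamed aesthetic and scale, the structure is the same as before: group by "id", map the count to the line width, and fix the y-axis levels.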
Conclusion
All in all, the packages available in R are well suited to generating parallel coordinate plots. ggparcoord is good for quickly drawing PCPs, but is not well equipped for discrete or categorical variables. ggparallel, on the other hand, specializes in categorical data and produces plots that are clear and easy to interpret, provided you filter your data beforehand. Parallel coordinate plots in plain ggplot2 require more effort in preparing your data and setting up the right functions and parameters, but once set up, this approach gives you the most freedom in designing and fine-tuning your plot.
For the full source code, see this gist.