Improved net stacked distribution graphs via ggplot2 trickery

Posted on September 13, 2012 by Ethan Brown

[This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers].

Net stacked distribution graphs are a nice way of comparing data on a Likert scale (i.e. when respondents are asked whether they “Strongly Disagree”, “Disagree”, etc. with a statement). They strip out the neutral responses and center the remaining responses around the middle of the graph, so you can quickly compare agreement and disagreement on different issues. Here we’ll learn how to do this in ggplot2; it takes a dose of deviousness.

Jason Becker provides some code for doing this. I’ve taken his basic idea and made it more readable and flexible, including the ability to plot multiple questions at the same time.

net_stacked(), shown below, takes a single argument, x, which is a data.frame where each column is an ordered factor containing Likert-style responses. The factor levels must be ordered from the “most negative” possible response (e.g. “Strongly Disagree”) to the “most positive” (e.g. “Strongly Agree”). If there is an odd number of possible responses/levels, such as in a 5- or 7-point Likert scale, net_stacked chops out the central level (assumed to be “Neutral”, “Neither Agree nor Disagree”, or similar).
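
To make the level-splitting rule concrete, here is a minimal sketch that mirrors the logic inside net_stacked() on a 5-point scale. The scale labels are made up for illustration.

```r
## Hypothetical 5-point scale, ordered from most negative to most positive
all_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                "Agree", "Strongly Agree")
n <- length(all_levels)

## With an odd number of levels, the middle one is treated as neutral
neutral <- if (n %% 2 == 1) all_levels[ceiling(n / 2)] else NULL

## The lower half is "negative"; whatever remains is "positive"
negatives <- all_levels[seq_len(floor(n / 2))]
positives <- setdiff(all_levels, c(negatives, neutral))

neutral    # "Neutral"
negatives  # "Strongly Disagree" "Disagree"
positives  # "Agree" "Strongly Agree"
```

With an even number of levels, neutral would be NULL and every level would land in one of the two halves.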

All the columns of the data.frame need to have the same levels. The function also accepts a list whose factor elements have different lengths. NAs are omitted from each column before plotting.
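
As a rough illustration of the summarizing step, the sketch below applies the same na.omit() and prop.table(table(...)) calls used inside net_stacked() to a list whose elements have different lengths. The data and the names lv, responses, Q1 and Q2 are invented for the example.

```r
## Hypothetical 3-point scale and a list with elements of different lengths
lv <- c("Disagree", "Neutral", "Agree")
responses <- list(
  Q1 = ordered(c("Agree", "Disagree", NA, "Agree"), levels = lv),
  Q2 = ordered(c("Neutral", "Agree"), levels = lv)
)

## Drop NAs, then tabulate each element as proportions over all levels
props <- lapply(responses, function(f) prop.table(table(na.omit(f))))

props$Q1  # Disagree 1/3, Neutral 0, Agree 2/3 (the NA is not counted)
props$Q2  # Disagree 0, Neutral 1/2, Agree 1/2
```

Because each element is turned into proportions separately, questions with different numbers of respondents still end up on the same scale.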

How do we actually accomplish this effect in ggplot2? Here’s the full text of the function:

net_stacked <- function(x) {
 
  ## x: a data.frame or list, where each column is an ordered factor with the same levels
  ## lower levels are presumed to be "negative" responses; middle value presumed to be neutral
  ## returns a ggplot2 object of a net stacked distribution plot
 
  ## Test that all elements of x have the same levels, are ordered, etc.
  all_levels <- levels(x[[1]])
  n <- length(all_levels)
  levelscheck <- all(sapply(x, function(y)
                            is.ordered(y) && identical(levels(y), all_levels)
                            ))
  if(!levelscheck)
    stop("All elements of x must be ordered factors with the same levels")
 
  ## Reverse order of columns (to make ggplot2 output look right after coord_flip)
  x <- x[length(x):1]
 
  ## Identify middle and "negative" levels
  if(n %% 2 == 1)
    neutral <- all_levels[ceiling(n/2)]
  else
    neutral <- NULL
 
  negatives <- all_levels[seq_len(floor(n/2))]
  positives <- setdiff(all_levels, c(negatives, neutral))
 
  ## remove neutral, summarize as proportion
  listall <- lapply(names(x), function(y) {
    column <- (na.omit(x[[y]]))
    out <- data.frame(Question = y, prop.table(table(column)))
    names(out) <- c("Question", "Response", "Freq")
 
    if(!is.null(neutral))
      out <- out[out$Response != neutral,]
 
    out
  })
 
  dfall <- do.call(rbind, listall)
 
  ## split by positive/negative
  pos <- dfall[dfall$Response %in% positives,]
  neg <- dfall[dfall$Response %in% negatives,]
 
  ## Negate the frequencies of negative responses, reverse order
  neg$Freq <- -neg$Freq
  neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))
 
  stackedchart <- ggplot() +
    aes(Question, Freq, fill = Response, order = Response) + 
    geom_bar(data = neg, stat = "identity") +
    geom_bar(data = pos, stat = "identity") + geom_hline(yintercept=0) +
    scale_y_continuous(name = "",
                       labels = paste0(seq(-100, 100, 20), "%"),
                       limits = c(-1, 1),
                       breaks = seq(-1, 1, .2)) +
    scale_fill_discrete(limits = c(negatives, positives)) +
    coord_flip()
 
  stackedchart
}

Once we have the function, here's the code for the image above:

library(ggplot2)
 
## generate fake likert data
set.seed(200)
response_scale <- c("Strongly Disagree",
                    "Disagree",
                    "Neither Agree or Disagree",
                    "Agree", 
                    "Strongly Agree")
x <- replicate(5, ordered(sample(response_scale, 20, replace = TRUE),
                          levels = response_scale), simplify = FALSE)
x <- as.data.frame(x)
names(x) <- paste0("Q", 1:5)
 
## plot it as net stacked distribution
net_stacked(x)

This gives a warning, since ggplot2 really isn't sure why we're stacking negative numbers. But that is, in fact, what we're intending to do here: embrace the devious!

Jason Becker's post provides some colors to heuristically represent the intensity of feelings. We can add these, and any other customizations, to the ggplot object returned by our function in the usual ways.
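
For instance, a diverging palette can be attached with scale_fill_manual(). The sketch below is self-contained, so it uses a small stand-in plot p in place of the object returned by net_stacked(x); the data, the hex colors (which are not Jason Becker's), and the name likert_colors are all assumptions made for illustration.

```r
library(ggplot2)

## Stand-in for the ggplot object that net_stacked(x) would return
toy <- data.frame(
  Question = "Q1",
  Response = factor(c("Disagree", "Agree"), levels = c("Disagree", "Agree")),
  Freq     = c(-0.4, 0.6)
)
p <- ggplot(toy, aes(Question, Freq, fill = Response)) +
  geom_bar(stat = "identity") +
  coord_flip()

## Hypothetical palette: one named color per response level
likert_colors <- c("Disagree" = "#d7191c", "Agree" = "#2c7bb6")

## Add the fill scale to the plot like any other ggplot2 component
p2 <- p + scale_fill_manual(values = likert_colors,
                            limits = names(likert_colors))
```

Since net_stacked() already sets a fill scale, adding a new one to its result should replace the existing scale (ggplot2 typically reports this with a message), which is why limits is passed again here to keep the legend in negative-to-positive order.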

Most of the function is simply preparing and summarizing the data in the form of proportions for each level of the ordered factor, applied to each column of the data.frame; but notice that we separate the "positive" (more-agreeing) and "negative" (more-disagreeing) levels into two separate objects:

## split by positive/negative
pos <- dfall[dfall$Response %in% positives,]
neg <- dfall[dfall$Response %in% negatives,]

And then we make the frequencies negative because we want them to actually show up on the negative side of 0 in our plot:

neg$Freq <- -neg$Freq

We also reverse the order of the levels, so that the "most neutral" responses are stacked first, right next to zero, followed by progressively "more negative" responses extending in the negative direction:

neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))
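
These three steps can be tried on their own with a toy summary table. The sketch below assumes a 4-point scale (so there is no neutral level) and invented proportions for a single question; only the split, negate, and reverse lines come from the function.

```r
## Hypothetical 4-point scale and summarized proportions for one question
all_levels <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")
dfall <- data.frame(Question = "Q1",
                    Response = factor(all_levels, levels = all_levels),
                    Freq     = c(0.1, 0.2, 0.3, 0.4))
negatives <- all_levels[1:2]
positives <- all_levels[3:4]

## split by positive/negative
pos <- dfall[dfall$Response %in% positives, ]
neg <- dfall[dfall$Response %in% negatives, ]

## Negate the negative frequencies and reverse the level order
neg$Freq <- -neg$Freq
neg$Response <- ordered(neg$Response, levels = rev(levels(neg$Response)))

neg$Freq              # -0.1 -0.2
levels(neg$Response)  # "Strongly Agree" "Agree" "Disagree" "Strongly Disagree"
```

Note that only the level order changes; each row of neg still carries its original response label.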

And here's where we bring that home: the plot command actually has two different layers, one for the negative half and one for the positive half, each drawing on its own dataset. We need to keep them separate; otherwise ggplot2 will get confused stacking positives and negatives together.

geom_bar(data = neg, stat = "identity") +
geom_bar(data = pos, stat = "identity") +

This is the clever trick that Jason Becker does that makes this whole thing possible!
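
The two-layer idea can also be seen in isolation. The following is a minimal sketch with invented proportions for two questions and a two-level scale; it leaves out the order aesthetic and the axis scales from the full function, so it shows the layering only.

```r
library(ggplot2)

## Hypothetical summarized proportions for two questions
neg <- data.frame(Question = c("Q1", "Q2"), Response = "Disagree",
                  Freq = c(-0.3, -0.5))
pos <- data.frame(Question = c("Q1", "Q2"), Response = "Agree",
                  Freq = c(0.6, 0.4))

## One bar layer per half, each reading from its own data frame
p <- ggplot() +
  aes(Question, Freq, fill = Response) +
  geom_bar(data = neg, stat = "identity") +
  geom_bar(data = pos, stat = "identity") +
  geom_hline(yintercept = 0) +
  coord_flip()

length(p$layers)  # 3: two bar layers plus the zero line
```

Each geom_bar() call only ever sees values of one sign, so the stacking within a layer never has to mix positives and negatives.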

Also, notice that in specifying the mapping, we explicitly tell ggplot2 to order the levels by Response (a column containing the text of each Likert-type response in an ordered factor):

aes(Question, Freq, fill = Response, order = Response)

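This is important because the negative side won't be in the right order unless we do this explicitly, and AFTER reversing the order of the negative levels so that they fan out away from zero. (The order aesthetic was deprecated in ggplot2 2.0.0, so this step applies to the ggplot2 releases available when this post was written.)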

Then we make the whole thing horizontal with coord_flip(). coord_flip() makes later columns in the data appear on top, which isn't what we want here, so earlier in the function I simply reverse the order of the elements in the input data:

x <- x[length(x):1]
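
The reversal is plain list-style indexing, so it behaves the same way on a data.frame and on a list. A quick check on a made-up three-question data frame:

```r
## Hypothetical three-question data frame
x <- data.frame(Q1 = 1:2, Q2 = 3:4, Q3 = 5:6)

## Indexing with a descending sequence reverses the column order
x_rev <- x[length(x):1]

names(x_rev)  # "Q3" "Q2" "Q1"
```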

Happy net stacked distribution visualizing!
