Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

January 27, 2019
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn). Each respondent chose and ranked their top 5 from those 10, so the remaining 5 were unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate the importance of the factors, and a mean seemed the best way to do it.) Of course, then we noticed that one of the factors had a pretty high average rank even though few people had ranked it in the top 5, because the mean only reflected the respondents who ranked it. Oops.
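To make the pitfall concrete, here is a minimal sketch with made-up numbers (the data frame `toy` and its values are hypothetical, not the survey data). It assumes unranked factors are stored as NA, which is one plausible way the missing ranks could have dropped out of the mean:

```r
# Hypothetical toy data: 6 respondents; NA means the factor was not
# in that respondent's top picks.
toy <- data.frame(
  Salary      = c(1, 2, 1, 3, 2, 1),
  Recognition = c(NA, NA, NA, NA, 1, NA)
)

# Mean of the ranks that were actually given (missing ranks are dropped)
mean_rank <- colMeans(toy, na.rm = TRUE)

# Share of respondents who ranked each factor at all
share_ranked <- colMeans(!is.na(toy))

print(mean_rank)
print(share_ranked)
```

Recognition ends up with the best (lowest) mean rank even though only one of the six respondents ranked it, which is why looking at the counts per rank is more informative than the mean alone.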

So how could I present these results? One idea I had was a stacked bar chart, which took a bit of data wrangling: the rankings were all in separate variables, but I wanted them all on the same chart. Basically, I needed to create a dataset with:

  • 1 variable to represent the factor being ranked
  • 1 variable to represent the ranking given (1-5, or 6 that I called “Not Ranked”)
  • 1 variable to represent the number of people giving that particular rank to that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merge them together with rbind. I then created the chart with ggplot. Here’s some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let’s read in our sample dataset – note that these data were generated only for this example and are not real data:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
## RespID = col_integer(),
## Salary = col_integer(),
## Recognition = col_integer(),
## PTO = col_integer(),
## Insurance = col_integer(),
## FlexibleHours = col_integer(),
## OptoLearn = col_integer()
## )
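
If you don't have the CSV file, a stand-in with the same column names can be simulated. This is only a sketch under assumptions: the name `sim_ranks`, the sample size, the seed, and the purely random rankings are all made up, and coding “Not Ranked” as 4 is inferred from the breaks used in the chart code rather than confirmed by the file itself.

```r
set.seed(42)  # arbitrary seed, only so the simulation is reproducible

factors <- c("Salary", "Recognition", "PTO", "Insurance",
             "FlexibleHours", "OptoLearn")
n_resp <- 50  # hypothetical sample size

# For one respondent: pick 3 factors at random and rank them 1-3;
# the other 3 factors get 4, assumed here to stand for "Not Ranked".
rank_one <- function() {
  r <- rep(4L, length(factors))
  r[sample(length(factors), 3)] <- 1:3
  r
}

# replicate() returns one column per respondent, so transpose to get rows
rank_matrix <- t(replicate(n_resp, rank_one()))
colnames(rank_matrix) <- factors

sim_ranks <- data.frame(RespID = seq_len(n_resp), rank_matrix)
head(sim_ranks)
```

Because the ranks are assigned at random, a chart built from `sim_ranks` will not show the patterns discussed in this post; it is only useful for running the code end to end.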

This dataset contains 7 variables – 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I’m sure there are much cleaner ways to do this (and please share in the comments!), but here’s one not so pretty way:

salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
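# table() returns Var1 as a factor; go through character so we keep the
# actual rank values rather than the factor's internal level codes
rank_chart$Var1 <- as.numeric(as.character(rank_chart$Var1))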
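
One possibly cleaner route to the same three-column shape is to reshape to long format and count. The sketch below is an alternative, not what I ran at work: `toy_ranks` is a tiny hypothetical stand-in for `ranks`, 4 is assumed to mean “Not Ranked”, and `pivot_longer()` requires tidyr 1.0.0 or later (older versions would need `gather()` instead).

```r
library(dplyr)
library(tidyr)

# Tiny hypothetical stand-in for `ranks`; 4 is assumed to mean "Not Ranked"
toy_ranks <- data.frame(
  RespID    = 1:4,
  Salary    = c(1, 1, 2, 1),
  PTO       = c(2, 3, 1, 4),
  Insurance = c(3, 2, 4, 2)
)

rank_counts <- toy_ranks %>%
  # one row per respondent-factor pair
  pivot_longer(-RespID, names_to = "Name", values_to = "Rank") %>%
  # number of respondents giving each rank to each factor
  count(Name, Rank, name = "Freq")

print(rank_counts)
```

The `Name` column holds the raw column names here, so the display labels (such as "Paid Time Off") would still need to be recoded before plotting.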

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                        breaks = c(1:4),
                        labels = c("1", "2", "3", "Not Ranked")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
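
Because the code above keeps Var1 numeric, the legend is a continuous color gradient. A variation is to treat the rank as a labelled factor, so each rank gets its own discrete color. This is a sketch of that alternative; `toy_chart` and its counts are hypothetical and only mimic the shape of `rank_chart`.

```r
library(ggplot2)

# Hypothetical counts in the same shape as rank_chart (Name, Var1, Freq)
toy_chart <- data.frame(
  Name = rep(c("Salary", "Insurance"), each = 4),
  Var1 = rep(1:4, times = 2),
  Freq = c(20, 10, 5, 5, 8, 12, 10, 10)
)

# Turn the numeric rank into a labelled factor so the legend is discrete
toy_chart$Ranking <- factor(toy_chart$Var1, levels = 1:4,
                            labels = c("1", "2", "3", "Not Ranked"))

p <- ggplot(toy_chart, aes(x = Name, y = Freq, fill = Ranking)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  labs(title = "Ranking of Factors Most Important in a Job",
       x = "Job Factors", y = "Frequency") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

print(p)
```

With a factor, the legend labels come from the factor levels, so no `breaks` or `labels` arguments are needed on the fill scale.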

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are clearly the number 2 and number 3 factors. Most people didn’t rank Recognition, but those who did considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn’t make the top 3 for most respondents.
