Slopegraphs in R

[This article was first published on rud.is » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I updated the code to use ggsave and tweaked some of the font & line size values for more consistent (and pretty) output. This also means that I really need to get this up on github.

If you even remotely follow this blog, you’ll see that I’m kinda obsessed with slopegraphs. While I’m pretty happy with my Python implementation, I do quite a bit of data processing & data visualization in R these days and had a few free hours on a recent trip to Seattle, so I whipped up some R code to do traditional and multi-column rank-order slopegraphs in R, mostly due to a post over at Microsoft’s security blog.

#
# multicolumn-rankorder-slopegraph.R
#
# 2013-01-12 - formatting tweaks
# 2013-01-10 - Initial version - boB Rudis - @hrbrmstr
#
# Pretty much explained by the script title. This is an R script which is designed to produce
# 2+ column rank-order slopegraphs with the ability to highlight meaningful patterns
#
 
library(ggplot2)
library(reshape2)
 
# transcription of table from:
# http://blogs.technet.com/b/security/archive/2013/01/07/operating-system-infection-rates-the-most-common-malware-families-on-each-platform.aspx
#
# You can download it from: 
# https://docs.google.com/spreadsheet/ccc?key=0AlCY1qfmPPZVdHpwYk0xYkh3d2xLN0lwTFJrWXppZ2c
 
df = read.csv("~/Desktop/malware.csv")
 
# For this slopegraph, we care that #1 is at the top and that higher value #'s are at the bottom, so we 
# negate the rank values in the table we just read in
 
df$Rank.Win7.SP1 = -df$Rank.Win7.SP1
df$Rank.Win7.RTM = -df$Rank.Win7.RTM
df$Rank.Vista = -df$Rank.Vista
df$Rank.XP = -df$Rank.XP
 
# Also, we are really comparing the end state (ultimately) so sort the list by the end state.
# In this case, it's the Windows 7 malware data.
 
df$Family = with(df, reorder(Family, Rank.Win7.SP1))
 
# We need to take the multi-columns and make it into 3 for line-graph processing 
 
dfm = melt(df)
 
# We need to take the multi-columns and make it into 3 for line-graph processing 
 
dfm = melt(df)
 
# We define our color palette manually so we can highlight the lines that "matter".
# This means you'll need to generate the slopegraph at least one time prior to determine
# which lines need coloring. This should be something we pick up algorithmically, eventually
 
sgPalette = c("#990000", "#990000",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#990000", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#CCCCCC", "#CCCCCC",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#000000", "#000000",  "#000000", "#000000", "#000000","#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000")
 
 
# start the plot
#
# we do a ton of customisations to the plain ggplot canvas, but it's not rocket science
 
sg = ggplot(dfm, aes(factor(variable), value, 
                     group = Family, 
                     colour = Family, 
                     label = Family)) +
  scale_colour_manual(values=sgPalette) +
  theme(legend.position = "none", 
        axis.text.x = element_text(size=5),
        axis.text.y=element_blank(), 
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        panel.grid.major = element_line("black", size = 0.1),
        panel.grid.major = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.background = element_blank())
 
# plot the right-most labels
 
sg1 = sg + geom_line(size=0.15) + 
  geom_text(data = subset(dfm, variable == "Rank.Win7.SP1"), 
            aes(x = factor(variable), label=sprintf(" %-2d %s",-(value),Family)), size = 1.75, hjust = 0) 
 
# plot the left-most labels
 
sg1 = sg1 + geom_text(data = subset(dfm, variable == "Rank.XP"), 
                     aes(x = factor(variable), label=sprintf("%s %2d ",Family,-(value))), size = 1.75, hjust = 1)
 
# this ratio seems to work well for png output
# you'll need to tweak font size for PDF output, but PDF will make post-processing in 
# Illustrator or Inkscape much easier.
 
ggsave("~/Desktop/malware.pdf",sg1,w=8,h=5,dpi=150)

malware
Click for larger version

I really didn’t think the table told a story well and I truly believe slopegraphs are pretty good at telling stories.

This bit of R code is far from generic and requires the data investigator to do some work to make it an effective visualization, but (I think) it’s one of the better starts at a slopegraph library in R. It suffers from the same issues I’ve pointed out before, but it’s far fewer lines of code than my Python version and it handles multi-column slopegraphs quite nicely.

To be truly effective, you’ll need to plot the slopegraph first and then figure out which nodes to highlight and change the sgPalette accordingly to help the reader/viewer focus on what’s important.

I’ll post everything on github once I recover from cross-country travel and—as always–welcome feedback and fixes.

To leave a comment for the author, please follow the link and comment on their blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)