A while back I tweeted about a really cool technique that can be used with ggplot2 in R to highlight a subset of your data while keeping the trend of the full data set in perspective. I found out about this trick through a bit of a tangled web, so please stay with me while I lay it out for you. It all started with a tweet that Mara Averick shared from a blog that Simon Jackson wrote about a technique that Hadley Wickham discussed in his ggplot2 book. Confused yet? Well, the good news is that actually implementing the technique is a lot easier than following the discovery path!
Highlighting The Old School Way
To implement this idea, we don’t need any fancy packages other than ggplot2. The steps are simple:
- Using ggplot2, create a plot with your full data set in grey.
- Create a new data frame that has been subset to only include the data which you would like to highlight.
- Add the highlighted data on to your plot created in step 1. Set the color to something other than grey.
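The three steps above can be sketched on a tiny made-up data set before we move on to the real one. The `toy` data frame, its column names and the choice of series "b" are all hypothetical and only there for illustration; this is a minimal sketch, not the code used for the plots in this post.

```r
library(ggplot2)

# Toy data (hypothetical, not the Seattle 911 data): three series over six months
toy <- data.frame(
  month  = rep(1:6, times = 3),
  series = rep(c("a", "b", "c"), each = 6),
  n      = c(10, 12, 11, 13, 12, 14,
             20, 40, 60, 80, 100, 120,
             15, 14, 16, 15, 17, 16)
)

# Step 2: subset to the series we want to highlight
toy_highlight <- toy[toy$series == "b", ]

# Steps 1 and 3: a grey layer for the full data, then a coloured layer for the subset
p <- ggplot() +
  geom_line(aes(month, n, group = series), data = toy, colour = "grey70") +
  geom_line(aes(month, n, colour = series), data = toy_highlight)
p
```

Because the grey layer is drawn first, the coloured line sits on top of it and the full data set stays visible as context.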
For our example, we are going to examine the crime incident dataset from Seattle 911 Calls on data.gov. Note that I have already covered this data set in several blog posts, such as map plots in R and time-based heat maps.
Install and Load Libraries
# Install the packages (only needed once)
install.packages("lubridate")
install.packages("ggplot2")
install.packages("ggmap")
install.packages("data.table")
install.packages("ggrepel")
install.packages("dplyr")
install.packages("magrittr")

# Load the packages
library(lubridate)
library(ggplot2)
library(ggmap)
library(dplyr)
library(data.table)
library(ggrepel)
library(magrittr)
Download the Data
# Read the sample of Seattle 911 incidents
incidents <- fread('https://raw.githubusercontent.com/lgellis/MiscTutorial/master/ggmap/i2Sample.csv', stringsAsFactors = FALSE)
str(incidents)

# Create some color variables for graphing later
custGrey <- "#A9A9A9"

# Add the date parts to the incidents data frame
# (columns are referenced explicitly instead of relying on attach())
incidents$ymd <- mdy_hms(incidents$Event.Clearance.Date)
incidents$month <- lubridate::month(incidents$ymd)
incidents$year <- lubridate::year(incidents$ymd)
incidents$wday <- lubridate::wday(incidents$ymd, label = TRUE)
incidents$hour <- lubridate::hour(incidents$ymd)

# Create a more manageable data frame with only 2017 data
i2 <- incidents[year >= 2017, ]

# Only include complete cases (the result must be assigned back to take effect)
i2 <- i2[complete.cases(i2), ]
head(i2)
Create a basic time series plot showing the count of 911 event types by month.
# Group the data into a new data frame which has the count of events per month by subgroup
groupSummaries <- i2 %>%
  group_by(month, Event.Clearance.SubGroup) %>%
  summarize(N = n())

# View the new data set
head(groupSummaries, n = 100)

# Graph the data set with ggplot2
# month is numeric, so a continuous scale with explicit breaks is used for the axis
ggplot(groupSummaries, aes(x = month, y = N, color = Event.Clearance.SubGroup)) +
  geom_line() +
  theme(legend.position = "bottom", legend.text = element_text(size = 7), legend.title = element_blank()) +
  scale_x_continuous(name = "Month", breaks = c(3, 6, 9, 12))
Create a Graph Highlighting Data with a Max Month Count Greater Than 95
# Create a data frame with only the event types that have peaked at more than 95 calls in a month
groupSummariesF <- groupSummaries %>%
  group_by(Event.Clearance.SubGroup) %>%
  filter(max(N) > 95) %>%
  ungroup()
head(groupSummariesF)

# Create a layered plot with one layer of grey data for the full data set
# and one layer of color data for the subset data set
ggplot() +
  geom_line(aes(month, N, group = Event.Clearance.SubGroup), data = groupSummaries, colour = alpha("grey", 0.7)) +
  geom_line(aes(month, N, group = Event.Clearance.SubGroup, colour = Event.Clearance.SubGroup), data = groupSummariesF) +
  scale_x_continuous(name = "Month", breaks = c(3, 6, 9, 12)) +
  theme(legend.position = "bottom", legend.text = element_text(size = 7), legend.title = element_blank())
One of the great things about the “old school way” of doing this type of highlighting is that it can be done with presumably every extension to the ggplot2 package. For example, you can use this same technique to highlight with the ggmap package. The code for these graphs is incredibly simple and has been included in my GitHub repo.
Highlighting The New School Way
While the above methodology is quite easy, it can be a bit of a pain at times to create and add the new data frame. Further, you have to tinker more with the labelling to really call out the highlighted data points.
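To give a feel for the labelling work mentioned above, here is a rough sketch of one way to label manually highlighted lines with ggrepel. The `toy` data, the rule of highlighting series "b" and "c", and the choice to label each line at its last month are assumptions made for illustration only.

```r
library(ggplot2)
library(ggrepel)

# Toy data (hypothetical): three series over six months
toy <- data.frame(
  month  = rep(1:6, times = 3),
  series = rep(c("a", "b", "c"), each = 6),
  n      = c(10, 12, 11, 13, 12, 14,
             20, 40, 60, 80, 100, 120,
             30, 50, 70, 60, 90, 110)
)

# Subset to highlight, plus one row per highlighted series to carry its label
toy_highlight <- toy[toy$series %in% c("b", "c"), ]
toy_labels <- toy_highlight[toy_highlight$month == max(toy_highlight$month), ]

# Grey context layer, coloured highlight layer, then repelled text labels
p <- ggplot() +
  geom_line(aes(month, n, group = series), data = toy, colour = "grey70") +
  geom_line(aes(month, n, colour = series), data = toy_highlight) +
  geom_text_repel(aes(month, n, label = series), data = toy_labels) +
  theme(legend.position = "none")
p
```

That is a third data frame and a third layer just to get the labels on, which is the kind of bookkeeping that makes the manual approach a bit tedious.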
Thanks to Hiroaki Yutani, we now have the gghighlight package, which does most of the work for us with a small function call! Please note that a lot of this code was created by looking at examples in the package's introduction document.
The new school way is even simpler:
- Using ggplot2, create a plot with your full data set.
- Add the gghighlight() function to your plot with the conditions set to identify your subset.
For our first example, we are going to create the same time series graph from above. However, we are going to perform the highlighting with gghighlight rather than manual layering.
# Install the gghighlight package
install.packages("gghighlight")
library(gghighlight)

# Create the highlighted graph
ggplot(groupSummaries, aes(month, N, colour = Event.Clearance.SubGroup)) +
  geom_line() +
  gghighlight(max(N) > 95, label_key = Event.Clearance.SubGroup) +
  scale_x_continuous(name = "Month", breaks = c(3, 6, 9, 12))
Well, that was so easy that we are going to try a few more ggplot2 plot types to see how we fare. Below are both a scatterplot and a histogram.
# Try a scatterplot chart
# use_group_by is an argument of gghighlight(), not an aesthetic, so it belongs outside aes()
ggplot(groupSummaries, aes(month, N, colour = Event.Clearance.SubGroup)) +
  geom_point() +
  gghighlight(N > 200, label_key = Event.Clearance.SubGroup, use_group_by = FALSE) +
  scale_x_continuous(name = "Month", breaks = c(3, 6, 9, 12))

# Try a histogram chart
ggplot(groupSummaries, aes(N, fill = Event.Clearance.SubGroup)) +
  geom_histogram() +
  theme(legend.position = "bottom", legend.text = element_text(size = 7), legend.title = element_blank()) +
  gghighlight(N > 100, label_key = Event.Clearance.SubGroup, use_group_by = FALSE) +
  facet_wrap(~ Event.Clearance.SubGroup)
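The line chart used a condition on max(N), while the scatterplot and histogram set use_group_by to FALSE and test N directly. As far as I understand the gghighlight documentation, that switch decides whether the condition is evaluated once per group or once per row. The difference can be sketched in plain dplyr, mirroring the filter from the old school section; the `toy` data frame here is hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): two series with three monthly counts each
toy <- data.frame(
  series = rep(c("a", "b"), each = 3),
  n      = c(10, 20, 30, 50, 150, 60)
)

# Grouped condition: keeps every row of any series whose maximum exceeds 95
by_group <- toy %>%
  group_by(series) %>%
  filter(max(n) > 95) %>%
  ungroup()

# Row-wise condition: keeps only the individual rows that exceed 95
by_row <- toy %>%
  filter(n > 95)

nrow(by_group)  # 3: all of series "b"
nrow(by_row)    # 1: only the single 150 observation
```

A grouped condition suits lines, where you want the whole series highlighted, while a row-wise condition suits points, where each observation stands on its own.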
Thanks for reading along while we explored data highlighting through layers and gghighlight. Please share your thoughts and creations with me on Twitter.