Data visualization with R and ggplot2

Posted on March 28, 2013 by Kevin Davenport in R bloggers

[This article was first published on Kevin Davenport » R, and kindly contributed to R-bloggers].

I’m working on a one-hour ggplot2 lecture for the San Diego R users group, which I will post here when I’m done. There are already many great introductory R data visualization resources out there, so I’ll only share working examples on my blog.

A retail chain client employs a few hundred field agents who perform a variety of standardized audits across thousands of their retail locations. This client wanted to identify how the tenure of a field agent (how long they have been with the company) affected the way they scored locations they audited.

The audit scores and tenure information were extracted from two separate SQL databases and will be joined in R.

df <- read.csv("df.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = FALSE)
tenure <- read.csv("tenure.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = FALSE)

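The audit dataset contains two years of completed audits, grouped into two-month “rounds”. That is, every retail location was audited once every two months, so there are six rounds per year. Here I define my outlier function, which replaces any value more than 1.5 times the interquartile range below the first quartile or above the third quartile with NA: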

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
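As a quick sanity check, here is the function applied to a small made-up vector. The values are invented for illustration and are not from the audit data; the function is repeated only so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up scores (assumption): five similar values and one extreme value
toy_scores <- c(10, 11, 12, 13, 14, 100)

cleaned <- remove_outliers(toy_scores)
print(cleaned)
# With R's default quantile method the quartiles are 11.25 and 13.75,
# so the fences sit at 7.5 and 17.5; only 100 falls outside and becomes NA
```

Note that the function keeps the vector the same length and marks outliers as NA rather than dropping them, which is what makes it usable with ave() later on.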

Because the audit changes slightly from round to round, scores vary between rounds, so I use the outlier function to remove outliers within each round rather than applying it to the entire sample.

df$score_outrm <- ave(df$Score, df$Round, FUN = remove_outliers) # apply remove_outliers within each Round
dfoutrm <- df[!is.na(df$score_outrm), ] # remove rows with NA in score_outrm column
df <- subset(dfoutrm, select = -c(score_outrm)) # remove column used to identify outliers
remove(dfoutrm)
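To see why the grouping matters, here is a minimal sketch on invented data (the toy data frame below is an assumption, not the client's data). Two rounds are scored on different scales, so a value that is extreme within its round looks ordinary once the rounds are pooled.

```r
# Same function as above, repeated so this snippet is self-contained
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Invented data: two rounds whose typical scores differ
toy <- data.frame(
  Round = rep(c(1, 2), each = 6),
  Score = c(90, 91, 92, 93, 94, 40,   # round 1: 40 is unusual for this round
            60, 61, 62, 63, 64, 95)   # round 2: 95 is unusual for this round
)

# Outliers judged within each round (the approach used in this post)
per_round <- ave(toy$Score, toy$Round, FUN = remove_outliers)

# Outliers judged against the pooled sample, for comparison
whole_sample <- remove_outliers(toy$Score)

print(sum(is.na(per_round)))     # values flagged within rounds
print(sum(is.na(whole_sample)))  # values flagged in the pooled sample
```

In this toy case the per-round version flags both 40 and 95, while the pooled version flags nothing because mixing the two rounds inflates the interquartile range.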

The two datasets will be easy to join since the tenure data frame is structured as a lookup table:

merged <- merge(df, tenure) # merge df and tenure data on the column(s) they share
merged <- subset(merged, Storetype != "Mall") # remove Mall locations as there are too few to visualize properly
#Just a quick examination of the relationship with a linear model:
lm1<- lm(Score ~ tenureYears,data=merged)
coef(lm1)
#Pasted output:
  #(Intercept)   tenureYearsone-two tenureYearstwo-three 
  #93.5756995           -0.6065633           -1.1380081

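This shows a negative relationship between tenure and score. Because tenureYears is categorical, the intercept (about 93.58) is the mean score for the baseline group, “one”, and the other coefficients are differences from that baseline: agents in the “one-two” group score about 0.61 points lower on average, and agents in the “two-three” group about 1.14 points lower. Let’s take a quick look at summaries with plyr’s ddply, which splits a data frame by the values of one or more columns, applies a function to each piece, and combines the results. In this case we are looking at the mean and median of the Score column for each value of the tenureYears column: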

library(plyr)
ddply(merged, .(tenureYears), summarise, "mean" = mean(Score)) # get summary of means for scores by tenure
#Pasted output
#  tenureYears     mean
#1         one 93.57570
#2     one-two 92.96914
#3   two-three 92.43769
ddply(merged, .(tenureYears), summarise, "median" = median(Score)) # get summary of median for scores by tenure
#Pasted output
#  tenureYears median
#1         one 94.220
#2     one-two 93.345
#3   two-three 92.980
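The group means above line up with the linear model coefficients: 93.5757 - 0.6066 is roughly 92.969, and 93.5757 - 1.1380 is roughly 92.438. The sketch below reproduces that relationship on invented data using only base R. The toy scores are an assumption, and treating tenureYears as a factor with “one” as the baseline level is inferred from the coefficient names in the pasted output.

```r
# Invented data with the same three tenure groups as in the post
toy <- data.frame(
  tenureYears = factor(rep(c("one", "one-two", "two-three"), each = 4)),
  Score = c(94, 95, 93, 94,   # "one"
            93, 92, 94, 93,   # "one-two"
            92, 91, 93, 92)   # "two-three"
)

# Mean score per tenure group (base R stand-in for the ddply call)
group_means <- tapply(toy$Score, toy$tenureYears, mean)

# Linear model with a categorical predictor
fit <- lm(Score ~ tenureYears, data = toy)

print(group_means)
print(coef(fit))

# The intercept is the baseline group's mean; every other coefficient
# is that group's mean minus the baseline mean
print(group_means[-1] - group_means[1])
```

So with a single categorical predictor, lm is simply another way of reporting differences in group means, which is why the two summaries in this post agree.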

Now for ggplot2:

library(ggplot2)
# Written for ggplot2 >= 3.4.0: `fun` replaces the deprecated `fun.y`, and `linewidth` replaces `size` for lines
ggplot(merged, aes(x = tenureYears, y = Score, fill = Storetype)) + 
  geom_boxplot(alpha = .6, linewidth = 1) + 
  scale_fill_brewer(palette = "Set1") + 
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 2, fill = "white") + # means as white diamonds
  geom_smooth(alpha = 2/10, aes(group = 1), method = "lm", se = FALSE, color = "black", linewidth = .7) + # one linear trend across all groups
  ggtitle("Audit scores by tenure and store type") + 
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  labs(fill = "store type")

I added the mean (white diamonds) to the boxplot to show how extreme values have pulled it one way or the other away from the median. I could have used a more robust gam line instead of lm, but that wouldn’t be as salient unless this were a time series with geom_point. I might fine-tune outlier detection and examine the distributions of the data a bit more before proceeding, as this was just a preliminary look. We could gain more insight by faceting the visualization on other variables such as state, time of day, management hierarchy, etc. with the facet_grid() function.
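As a rough sketch of what faceting could look like, the snippet below splits the same kind of boxplot into one panel per state. Everything in it is illustrative: the data are randomly generated, and the State column and its values are hypothetical, since the post does not show the names of those extra variables.

```r
library(ggplot2)

set.seed(1)

# Randomly generated stand-in for the merged data; `State` is a hypothetical column
toy <- data.frame(
  tenureYears = factor(rep(c("one", "one-two", "two-three"), times = 40)),
  Storetype   = rep(c("Strip", "Standalone"), each = 60),
  State       = rep(c("CA", "NV"), times = 60),
  Score       = round(rnorm(120, mean = 93, sd = 3), 2)
)

# Same boxplot idea as above, with one column of panels per state
p <- ggplot(toy, aes(x = tenureYears, y = Score, fill = Storetype)) +
  geom_boxplot(alpha = .6) +
  scale_fill_brewer(palette = "Set1") +
  facet_grid(. ~ State) +
  labs(fill = "store type")

print(p)
```

The formula `. ~ State` lays the panels out side by side; `State ~ .` would stack them vertically, and a second variable can be placed on the other side of the formula to build a grid.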
