Data visualization with R and ggplot2

March 28, 2013

(This article was first published on Kevin Davenport » R, and kindly contributed to R-bloggers)

I’m working on a one-hour ggplot2 lecture for the San Diego R users group, which I will post here when I’m done. There are already many great introductory resources for R data visualization, so I’ll only share working examples on my blog.

A retail chain client employs a few hundred field agents who perform a variety of standardized audits across thousands of their retail locations. This client wanted to identify how the tenure of a field agent (how long they have been with the company) affected the way they scored locations they audited.

The audit scores and tenure information were extracted from two separate SQL databases and are joined in R.

```r
# Read the audit scores and the tenure lookup table from CSV
df <- read.csv("df.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = FALSE)
tenure <- read.csv("tenure.csv", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = FALSE)
```

The audit dataset contains two years of completed audits, grouped into two-month “rounds”: every retail location was audited once every two months, so there are six rounds per year. Here I define my outlier function, which replaces any value more than 1.5 × IQR below the first quartile or above the third quartile with NA:

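```r
remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles of x
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  # Allowed distance beyond the quartiles: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA  # too far below the first quartile
  y[x > (qnt[2] + H)] <- NA  # too far above the third quartile
  y
}
```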
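As a quick sanity check, the function can be tried on a small made-up vector. The values below are purely illustrative and are not taken from the audit data; the sketch repeats the definition so that it runs on its own:

```r
# Same outlier rule as above: flag values beyond 1.5 * IQR from the quartiles.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up scores with one obviously extreme value at the end.
toy_scores <- c(10, 12, 11, 13, 12, 100)
cleaned <- remove_outliers(toy_scores)

# The vector keeps its length; flagged values are replaced by NA.
print(cleaned)
```

Note that the function returns a vector of the same length as its input, which is what makes it usable with group-wise helpers such as ave().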

Because the audit changes slightly from round to round, scores vary between rounds, so I apply the outlier function within each round rather than to the entire sample.

```r
# Apply remove_outliers() within each Round; outliers become NA
df$score_outrm <- ave(df$Score, df$Round, FUN = remove_outliers)
# Remove rows with NA in the score_outrm column
dfoutrm <- df[!(is.na(df$score_outrm) | df$score_outrm == ""), ]
# Remove the helper column used to identify outliers
df <- subset(dfoutrm, select = -c(score_outrm))
remove(dfoutrm)
```
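To illustrate why the grouping matters, here is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration only). It compares flagging outliers within each round against flagging them on the pooled sample:

```r
# Same outlier rule as in the post, repeated so the sketch is self-contained.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Two hypothetical rounds whose typical scores differ by about 20 points.
toy <- data.frame(
  Round = rep(c("R1", "R2"), each = 5),
  Score = c(90, 91, 92, 93, 60, 70, 71, 72, 73, 74)
)

# Outliers judged against each round's own quartiles.
per_round <- ave(toy$Score, toy$Round, FUN = remove_outliers)

# Outliers judged against the quartiles of the pooled sample.
pooled <- remove_outliers(toy$Score)

print(which(is.na(per_round)))  # rows flagged within their round
print(which(is.na(pooled)))     # rows flagged in the pooled sample
```

In this toy case the score of 60 is flagged within its own round, but it goes unnoticed in the pooled sample because the between-round difference widens the interquartile range.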

The two datasets will be easy to join since the tenure data frame is structured as a lookup table:

```r
merged <- merge(df, tenure)  # merge df and tenure data on their shared column(s)
merged <- subset(merged, Storetype != "Mall")  # remove Mall locations as there are too few to visualize properly
# Just a quick examination of the relationship with a linear model:
lm1 <- lm(Score ~ tenureYears, data = merged)
coef(lm1)
# Pasted output:
# (Intercept)   tenureYearsone-two tenureYearstwo-three
#  93.5756995           -0.6065633           -1.1380081
```

The coefficients show a negative relationship between tenure and score: relative to the baseline group (one), the mean score is about 0.61 points lower for one-two and about 1.14 points lower for two-three. Let’s take a quick look at summaries with plyr’s ddply, which splits a data frame by the values of one column and applies a summary function to each piece. In this case we are looking at the mean and median of the Score column by the tenureYears column:

```r
library(plyr)
# Mean score by tenure
ddply(merged, .(tenureYears), summarise, "mean" = mean(Score))
# Pasted output:
#   tenureYears     mean
# 1         one 93.57570
# 2     one-two 92.96914
# 3   two-three 92.43769

# Median score by tenure
ddply(merged, .(tenureYears), summarise, "median" = median(Score))
# Pasted output:
#   tenureYears median
# 1         one 94.220
# 2     one-two 93.345
# 3   two-three 92.980
```
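The mean table and the lm coefficients carry the same information, because tenureYears is a factor. The following sketch uses made-up scores (the `toy` data frame is hypothetical) and base R's tapply in place of ddply, so it runs without plyr. It builds the mean and median in one table and then checks that the group means can be recovered from the regression coefficients:

```r
# Made-up scores, three per tenure group (illustrative only).
toy <- data.frame(
  tenureYears = factor(rep(c("one", "one-two", "two-three"), each = 3)),
  Score = c(95, 94, 90, 94, 93, 89, 93, 92, 88)
)

# Mean and median per group in a single table, using base R.
group_means <- tapply(toy$Score, toy$tenureYears, mean)
group_medians <- tapply(toy$Score, toy$tenureYears, median)
summary_tbl <- data.frame(
  tenureYears = names(group_means),
  mean = as.vector(group_means),
  median = as.vector(group_medians)
)
print(summary_tbl)

# With a factor predictor, lm() reports the baseline group's mean as the
# intercept and every other level as a difference from that baseline.
fit <- lm(Score ~ tenureYears, data = toy)
means_from_lm <- c(coef(fit)[1], coef(fit)[1] + coef(fit)[-1])

# TRUE when the reconstructed means match the table (up to rounding error).
print(all.equal(unname(means_from_lm), summary_tbl$mean))
```

The same arithmetic applies to the pasted output above: adding each coefficient to the intercept gives the corresponding row of the mean table.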

Now for ggplot2:

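```r
library(ggplot2)
# The fill column is written as Storetype to match the subset() call earlier (R is case-sensitive)
ggplot(merged, aes(x = tenureYears, y = Score, fill = Storetype)) +
  geom_boxplot(alpha = .6, linewidth = 1) +
  scale_fill_brewer(palette = "Set1") +
  # White diamonds mark the group means (on ggplot2 < 3.3.0 use fun.y = instead of fun =)
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 2, fill = "white") +
  # One linear trend line across tenure groups (on ggplot2 < 3.4.0 use size = instead of linewidth =)
  geom_smooth(alpha = 2/10, aes(group = 1), method = "lm", se = FALSE, color = "black", linewidth = .7) +
  ggtitle("Audit scores by tenure and store type") +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  labs(fill = "store type")
```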

I added the mean (white diamonds) to the boxplot to show how extreme values have pulled it one way or the other away from the median. I could have used a more flexible gam smoother instead of lm, but that wouldn’t be as salient unless this were a time series drawn with geom_point. I might fine-tune the outlier detection and examine the distributions of the data a bit more before proceeding, as this was just a preliminary look. We could gain more insight by faceting the visualization on other variables such as state, time of day, management hierarchy, etc. with facet_grid().
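As a rough sketch of what faceting could look like, the example below splits the boxplot into one panel per audit round. The data are randomly generated stand-ins (the `toy` data frame, its values, and the round labels are assumptions), and Round is used only because it is an extra grouping column the dataset is known to have; any other categorical column could take its place:

```r
library(ggplot2)

set.seed(1)  # made-up data; the seed only makes the sketch reproducible
toy <- data.frame(
  tenureYears = factor(rep(c("one", "one-two", "two-three"), times = 40)),
  Round = factor(rep(c("Round 1", "Round 2"), each = 60)),
  Score = round(rnorm(120, mean = 93, sd = 3), 2)
)

# facet_grid(rows ~ cols): ". ~ Round" lays the rounds out as columns.
p <- ggplot(toy, aes(x = tenureYears, y = Score)) +
  geom_boxplot(alpha = .6) +
  facet_grid(. ~ Round) +
  ggtitle("Audit scores by tenure, one panel per round (toy data)")

print(p)
```

Swapping the formula to `Round ~ .` would stack the panels as rows instead, which tends to be easier to read when the x axis has many categories.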
