Relearn boxplot and label the outliers

February 5, 2013
(This article was first published on StaTEAstics., and kindly contributed to R-bloggers)

Although the box plot is used almost everywhere and is taught in undergraduate statistics classes, I recently had to relearn it in order to work out how to label the outliers.

This Stack Overflow post explains how the outliers and whiskers of Tukey box plots are defined in R and ggplot2:
In ggplot2, what do the end of the boxplot lines represent?

This second post shows how to label the outliers using base graphics:
How to label all the outliers in a boxplot
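
To make the rule from the first post concrete, here is a minimal sketch on a small made-up vector (the data and variable names are invented for illustration). It assumes the usual Tukey convention: a point is an outlier if it lies more than 1.5 times the interquartile range beyond the box, and each whisker ends at the most extreme observation that is still inside that fence.

```r
## Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

## Quartiles as computed by quantile() with its default settings
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

## Tukey fences: anything beyond 1.5 * IQR from the box is an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- x[x < lower | x > upper]

## Whiskers end at the most extreme observations still inside the fences
whiskers <- range(x[x >= lower & x <= upper])

## Cross-check against base R, which flags the same point here
baseOutliers <- boxplot.stats(x)$out
```

Note that base R's boxplot.stats() builds the box from hinges (fivenum()) rather than from quantile(), so the two approaches can disagree slightly on the fences for some sample sizes, even though they agree on this vector.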

Since this task required ggplot2, I wrote some quick-and-dirty code to label the outliers in a ggplot2 box plot.


Here is the code:

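## Install the FAOSTAT package (only if it is missing) to obtain the data
if(!requireNamespace("FAOSTAT", quietly = TRUE))
    install.packages("FAOSTAT")
library(FAOSTAT)

## plyr provides ddply() and ggplot2 provides the plotting functions used below
library(plyr)
library(ggplot2)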


## Download data on Cassava production
cp.lst = getFAOtoSYB(name = "cassava_production", domainCode = "QC",
    itemCode = 125, elementCode = 5510)


## Use the country level data, and take only data for 2011 and remove the NA's
cp.df = cp.lst$entity[!is.na(cp.lst$entity$cassava_production) &
                      cp.lst$entity$Year == 2011, ]

## Merge with the country profile to obtain the country names for labelling
ccp.df = merge(cp.df, FAOcountryProfile[, c("FAOST_CODE", "ABBR_FAO_NAME")],
    all.x = TRUE)


## Merge with the regional profile to obtain the UNSD M49 macro region
## of each country, so that one boxplot can be drawn per region.
rcp.df = merge(ccp.df, FAOregionProfile[, c("FAOST_CODE", "UNSD_MACRO_REG")],
    all.x = TRUE)


## Compute the lower and upper quartiles within each region
qrcp.df = ddply(.data = rcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lQntl = quantile(cassava_production, probs = 0.25, na.rm = TRUE),
    uQntl = quantile(cassava_production, probs = 0.75, na.rm = TRUE))


## Compute the lower and upper bounds beyond which a point is an outlier
brcp.df = ddply(.data = qrcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lBound = lQntl - 1.5 * (uQntl - lQntl),
    uBound = uQntl + 1.5 * (uQntl - lQntl))



## Blank out the country name when the value lies within the bounds,
## so that only the outliers keep a label
inBounds = with(brcp.df, cassava_production <= uBound &
                         cassava_production >= lBound)
brcp.df[inBounds, "ABBR_FAO_NAME"] = NA

## Plot the data
set.seed(587)
ggplot(data = brcp.df, aes(x = UNSD_MACRO_REG, y = cassava_production)) +
    geom_boxplot(outlier.colour = NA) +
    geom_text(aes(label = ABBR_FAO_NAME), size = 2,
              position = position_jitter(width = 0.1)) +
    labs(x = NULL, y = NULL, title = "Production of Cassava by region")


Here is the final product. To avoid over-plotting of the text labels I used position_jitter, which is not an elegant solution, but I could not find an algorithm that works well in general.
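
For readers who want to try the labelling idea without downloading anything from FAOSTAT, the following is a rough, self-contained sketch on invented data. The data frame and its column names (region, value, name, label) are hypothetical, and it uses base R's ave() in place of ddply(). For label placement it tries geom_text_repel() from the ggrepel package if that happens to be installed, which is one possible alternative to jittering and not something the code above uses; otherwise it falls back to position_jitter.

```r
## Toy data invented for illustration; all column names are hypothetical
toy <- data.frame(
    region = rep(c("A", "B"), each = 6),
    value  = c(1, 2, 3, 4, 5, 40, 10, 11, 12, 13, 14, -30),
    name   = paste0("c", 1:12),
    stringsAsFactors = FALSE
)

## Per-region quartiles; ave() returns a vector aligned with the rows
q1 <- ave(toy$value, toy$region, FUN = function(v) quantile(v, 0.25))
q3 <- ave(toy$value, toy$region, FUN = function(v) quantile(v, 0.75))

## Keep a label only for points outside the 1.5 * IQR fences
isOutlier <- toy$value < q1 - 1.5 * (q3 - q1) |
             toy$value > q3 + 1.5 * (q3 - q1)
toy$label <- ifelse(isOutlier, toy$name, NA)

## Draw the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
    p <- ggplot(toy, aes(x = region, y = value)) +
        geom_boxplot(outlier.colour = NA)
    if (requireNamespace("ggrepel", quietly = TRUE)) {
        ## Assumption: ggrepel is installed; it pushes labels apart
        p <- p + ggrepel::geom_text_repel(aes(label = label), size = 3,
                                          na.rm = TRUE)
    } else {
        ## Fallback: the jitter approach used in this post
        set.seed(587)
        p <- p + geom_text(aes(label = label), size = 3, na.rm = TRUE,
                           position = position_jitter(width = 0.1))
    }
    print(p)
}
```

Whether repelled labels look better than jittered ones will depend on how many outliers fall close together, so treat this as a starting point rather than a general solution.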



