How to do a quantitative literature review in R

May 17, 2011

Recently I’ve been working on a review of urban energy modelling literature. It’s a very broad field and a quick search through Web of Knowledge turns up about 400 papers that look relevant. How on earth can you distill these reams of paper into something sensible?

One technique I’ve found helpful, especially at the earlier stages of a literature review when you’re trying to get the big picture straight, is to use clustering techniques. There are many different algorithms depending on what the goal of your analysis is, but here I use a two-step process.

1. Hierarchical clustering These methods start by assuming that every point in your data set represents a unique cluster. The algorithms then successively merge clusters, by measuring the “distance” or “dissimilarity” between data points, until there is only one large cluster containing all of the data. The results of this process can be plotted as a dendrogram, from which the number of clusters within the data can be identified.
2. Partitioning clustering Once you know how many clusters are in your data set, you can use a partitioning method. These methods divide the data into a fixed number of clusters and can report data about the typical characteristics and membership of each group.
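To illustrate the two-step process before diving into the literature data, here is a minimal sketch on a standard built-in dataset (the numeric columns of iris; the dataset and cluster count are purely for demonstration):

```r
library(cluster)

# Step 1: hierarchical clustering with agnes, inspected as a dendrogram
d <- daisy(iris[, 1:4])   # dissimilarity matrix
hc <- agnes(d)
plot(hc)                  # look for large vertical gaps to choose a cluster count

# Step 2: partition into the chosen number of clusters with pam
k <- 3
pc <- pam(d, k)
table(pc$clustering)      # membership counts for each cluster
```

The same two calls, agnes then pam, are used on the review data below.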

For both of these steps, R’s cluster package provides all of the methods you will need.

To demonstrate, here are the first few rows of the data within my literature review. As you can see, I’ve categorized each paper along several different criteria representing the spatial and temporal scale of the model, the application domain, and the model’s treatment of energy supply and demand variables. Your data will of course have different categories depending on the subject of interest but the general structure is likely to be similar.

> head(data)
Spatial   Temporal       Family     Supply     Demand          Category
1 Technology      Daily Optimization endogenous       none        Technology
2       City      Daily   Regression       none endogenous Demand estimation
3   Building     Annual   Regression  exogenous endogenous Demand estimation
4       City     Annual   Regression       none endogenous Demand estimation
5 Technology Sub-hourly   Simulation endogenous       none        Technology
6   Building     Annual      Various endogenous  exogenous       Descriptive
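If you want to follow along without my full dataset, a small stand-in data frame with the same columns can be built by hand (the six rows below are just the ones shown above, so any clustering on them is purely illustrative):

```r
# Stand-in for the full literature review dataset
data <- data.frame(
  Spatial  = c("Technology", "City", "Building", "City", "Technology", "Building"),
  Temporal = c("Daily", "Daily", "Annual", "Annual", "Sub-hourly", "Annual"),
  Family   = c("Optimization", "Regression", "Regression",
               "Regression", "Simulation", "Various"),
  Supply   = c("endogenous", "none", "exogenous", "none", "endogenous", "endogenous"),
  Demand   = c("none", "endogenous", "endogenous", "endogenous", "none", "exogenous"),
  Category = c("Technology", "Demand estimation", "Demand estimation",
               "Demand estimation", "Technology", "Descriptive"),
  stringsAsFactors = FALSE
)
```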

The data in this table represent a mix of ordinal and nominal data; that is, categorical data with and without an inherent order respectively. Both data types are represented in R by a factor object, but these have to be declared before attempting the analysis. The following code therefore modifies the data frame and categorizes the variables appropriately. (Note that I’ve also explicitly coded the levels; this isn’t necessary with nominal data but often is with ordinal values. The ordered=TRUE argument creates an ordered factor.)

data <- transform(data,
Spatial=factor(Spatial,
levels=c("Individuals","Technology","Building","Sub 1km","District","City","National/regional","Various"),
ordered=TRUE),
Temporal=factor(Temporal,
ordered=TRUE),
Family=factor(Family,
levels=c("Empirical","Regression","Optimization","Simulation","Various")),
Category=factor(Category,
levels=c("Building design","Demand estimation","Descriptive","Impact assessment","Policy assessment","System design","Technology",
"Transport","Urban climate","Urban planning")),
Supply=factor(Supply,
levels=c("none","endogenous","endogenous (indirect)","exogenous")),
Demand=factor(Demand,
levels=c("none","endogenous","endogenous (indirect)","exogenous")))

Next we want to calculate a dissimilarity matrix. The cluster package’s daisy function will do this, automatically detecting whether the input variables are ordinal or nominal and falling back to Gower’s coefficient for non-numeric data. This is an important step because most clustering algorithms assume that the input variables are numerical. In the literature review case, each paper’s attributes are non-numerical factors, so the dissimilarity matrix must be calculated first.

library(cluster)
diss <- daisy(data)
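You can confirm what daisy did by inspecting the attributes of the returned dissimilarity object. A self-contained sketch with a tiny made-up two-column data frame (just so the snippet runs on its own):

```r
library(cluster)

# Tiny stand-in: one nominal and one ordinal column
df <- data.frame(x = factor(c("a", "b", "a")),
                 y = ordered(c("lo", "hi", "hi"), levels = c("lo", "hi")))
diss <- daisy(df)

summary(diss)          # prints the metric used and the variable types
attr(diss, "Metric")   # "mixed" here, since the columns are factors
```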

Now we can run the hierarchical clustering to determine how many clusters are in the data. We do this with the agnes function, which can process the dissimilarity matrix directly.

# Run the agnes hierarchical clustering
agnes.clust <- agnes(diss)
# Plot the result
plot(agnes.clust)

This gives the following figure:

Dendrogram showing the result of the agnes hierarchical clustering

From this clustering hierarchy, we can judge that there are about 5 clusters within the data. This is a somewhat subjective decision but in general, you want to identify the points where there are large vertical gaps between successive levels of the tree (a height of just below 0.6 on this plot). This document (PDF) provides a nice summary of how to interpret hierarchical clustering results.
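The visual judgement can be cross-checked numerically, for example by comparing average silhouette widths across candidate cluster counts. A sketch of that check, run on a built-in dataset since the full review data isn’t shown here:

```r
library(cluster)
d <- daisy(iris[, 1:4])

# Average silhouette width for each candidate number of clusters;
# larger values indicate better-separated clusters
widths <- sapply(2:8, function(k) pam(d, k)$silinfo$avg.width)
names(widths) <- 2:8
round(widths, 3)

# The k with the largest average silhouette width
best.k <- as.integer(names(which.max(widths)))
```

If the silhouette-based choice roughly agrees with what the dendrogram suggests, that is reassuring; if not, it is worth looking at both more closely.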

We can then run the pam analysis, specifying the number of clusters. The pam object contains several useful elements: a medoids element which describes the representative objects at the cluster centres (id.med is a useful alternative, giving the row ids of those representative centres), and the clustering element which tells you which group each data point has been assigned to. We can then make some summary plots as below.

# Calculate 5 pam clusters, directly from the dissimilarity matrix
pam.cl <- pam(diss, 5)
# Show the medoids (with a dissimilarity input these are row ids)
pam.cl$medoids
# Use ggplot2 to make a summary plot
# Note that since there are six dimensions in the raw data
# the figure can't show the clustering perfectly
library(ggplot2)
# Define the category labels
cats <- as.character(data[pam.cl$id.med, ]$Category)
# Create the ggplot object
# (theme() and element_text() replace the old opts()/theme_text() syntax)
gg2 <- ggplot(data, aes(x = Spatial, y = Temporal)) +
  geom_jitter(aes(colour = factor(pam.cl$clustering, labels = cats))) +
  scale_colour_brewer(name = "Category", palette = "Paired") +
  theme_bw(11) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_wrap(~Family)
print(gg2)


Jitter plot of lit review data by cluster. Visible clusters include the optimization-based system design models and the high-temporal-resolution urban climate simulation models

Not every literature review will be amenable to this type of analysis. But if you have a fairly large set of papers to get through, where it’s hard to see the forest for the trees, a clustering analysis with R can be a great way to get a bit of perspective.

Further reading: Quick-R also has a brief summary of cluster analysis with R.
