# Pivot tables in R

January 9, 2010
By

A common data-munging operation is to compute cross tabulations of measurements by categories. SQL Server and Excel have a nice feature called pivot tables for this purpose. Here we'll figure out how to do pivot operations in R.

Let's imagine an experiment where we're measuring the gene activity of an organism under different conditions -- exposure to different nutrients and toxins. Our conditions are silly: copper, beer, pizza, and cheetos. First we make a list of genes. Then expand.grid generates all combinations of genes and conditions. Finally, we tack on a column of randomly generated measurements.

> genes = paste('MMP', sprintf("%04d",1:10), sep="")
> data = expand.grid(gene=genes, condition=c('copper', 'cheetos', 'beer', 'pizza'))
> data$value = rnorm(40) > data gene condition value 1 MMP0001 copper 0.90412805 2 MMP0002 copper 0.92664376 3 MMP0003 copper 0.27772147 4 MMP0004 copper 0.08958930 5 MMP0005 copper -0.20132304 6 MMP0006 copper 0.34524729 7 MMP0007 copper -0.33910206 8 MMP0008 copper 1.21006486 9 MMP0009 copper 0.78008022 10 MMP0010 copper 1.05364315 11 MMP0001 cheetos -2.31796229 12 MMP0002 cheetos 0.76706591 13 MMP0003 cheetos -2.93692935 14 MMP0004 cheetos 0.25452306 15 MMP0005 cheetos 0.24168329 16 MMP0006 cheetos 0.28739734 17 MMP0007 cheetos 0.69233543 18 MMP0008 cheetos 0.48865250 19 MMP0009 cheetos -0.11129319 20 MMP0010 cheetos 0.53322842 21 MMP0001 beer -0.74965948 22 MMP0002 beer 0.27105205 23 MMP0003 beer -0.99261363 24 MMP0004 beer 0.65143639 25 MMP0005 beer -0.35589696 26 MMP0006 beer 1.40147484 27 MMP0007 beer 0.37492710 28 MMP0008 beer 0.64453865 29 MMP0009 beer 0.35925345 30 MMP0010 beer 0.96394785 31 MMP0001 pizza -1.91818504 32 MMP0002 pizza 0.31690523 33 MMP0003 pizza -1.20566043 34 MMP0004 pizza -1.91750166 35 MMP0005 pizza 1.98010023 36 MMP0006 pizza 0.90468249 37 MMP0007 pizza 0.04284970 38 MMP0008 pizza -0.08141461 39 MMP0009 pizza -0.72471771 40 MMP0010 pizza -0.01085060  We want to pivot the conditions into columns so that we end up with one column for each condition and one row for each gene. The easy way is to use the reshape package by Hadley Wickham, which is made for restructuring data and does this job nicely. If you don't already have it, you'll have to run install.packages, then load the library. > install.packages('reshape') > library(reshape)  Using cast to move conditions into columns is a snap. > cast(data, gene ~ condition) gene copper cheetos beer pizza 1 MMP0001 0.9041281 -2.3179623 -0.7496595 -1.91818504 2 MMP0002 0.9266438 0.7670659 0.2710521 0.31690523 3 MMP0003 0.2777215 -2.9369294 -0.9926136 -1.20566043 4 MMP0004 0.0895893 0.2545231 0.6514364 -1.91750166 5 MMP0005 -0.2013230 0.2416833 -0.3558970 1.98010023 6 MMP0006 0.3452473 0.2873973 1.4014748 0.90468249 7 MMP0007 -0.3391021 0.6923354 0.3749271 0.04284970 8 MMP0008 1.2100649 0.4886525 0.6445386 -0.08141461 9 MMP0009 0.7800802 -0.1112932 0.3592535 -0.72471771 10 MMP0010 1.0536432 0.5332284 0.9639479 -0.01085060  Done! That was too easy Just as an exercise, what would we have to do without reshape? And, just to keep ourselves honest, let's make sure we can deal with missing data (as reshape can). Make some data go missing: > data.incomplete <- data[data$value > -1.0,]
> dim(data.incomplete)
[1] 35  3


Now, split the data frame up by condition. This produces a list where each element is a data frame containing a subset of the data for each condition. Notice that the cheetos data frame has values for 8 of the 10 genes.

> data.by.condition <- split(data.incomplete, data.incomplete$condition) > typeof(data.by.condition) [1] "list" > names(data.by.condition) [1] "copper" "cheetos" "beer" "pizza" > data.by.condition$cheetos
gene condition      value
12 MMP0002   cheetos  0.7670659
14 MMP0004   cheetos  0.2545231
15 MMP0005   cheetos  0.2416833
16 MMP0006   cheetos  0.2873973
17 MMP0007   cheetos  0.6923354
18 MMP0008   cheetos  0.4886525
19 MMP0009   cheetos -0.1112932
20 MMP0010   cheetos  0.5332284


We're going to recombine the data into a data frame with one row for each gene, so let's get that started:

> result = data.frame(gene=genes)


Now comes some executable line noise. We're going to loop through the list and add a column to the result data frame during each iteration of the loop. We pull the column out of the data frame in the list, but we have to make sure the column has an element for each gene. Merging with the all parameter set is like an outer join. We get a row for each gene, inserting NA's where there data is missing.

> for (i in seq(along=data.by.condition)) { result[[names(data.by.condition)[i]]] <- merge(data.by.condition[[i]], genes, by.x='gene', by.y=1, all=T)\$value }

> result
gene     copper    cheetos       beer       pizza
1  MMP0001  0.9041281         NA -0.7496595          NA
2  MMP0002  0.9266438  0.7670659  0.2710521  0.31690523
3  MMP0003  0.2777215         NA -0.9926136          NA
4  MMP0004  0.0895893  0.2545231  0.6514364          NA
5  MMP0005 -0.2013230  0.2416833 -0.3558970  1.98010023
6  MMP0006  0.3452473  0.2873973  1.4014748  0.90468249
7  MMP0007 -0.3391021  0.6923354  0.3749271  0.04284970
8  MMP0008  1.2100649  0.4886525  0.6445386 -0.08141461
9  MMP0009  0.7800802 -0.1112932  0.3592535 -0.72471771
10 MMP0010  1.0536432  0.5332284  0.9639479 -0.01085060


Extra finesse points if you can figure out how to do that last step with Reduce instead of a loop.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...