Reshape and aggregate data with the R package reshape2

October 31, 2013
By

(This article was first published on Thiago G. Martins » R, and kindly contributed to R-bloggers)

Creating molten data

Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided in two groups: identifier and measured variables.

Identifier variables (id) identify the unit that measurements take place on. In the data.frame below subject and time are the two id variables and age, weight and height are the measured variables.

require(reshape2)
x = data.frame(subject = c("John", "Mary"), 
               time = c(1,1),
               age = c(33,NA),
               weight = c(90, NA),
               height = c(2,2))
x
  subject time age weight height
1    John    1  33     90      2
2    Mary    1  NA     NA      2

We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. This operation, called melting, produces molten data and can be obtained with the melt function of the R package reshape2.

molten = melt(x, id = c("subject", "time"))
molten
  subject time variable value
1    John    1      age    33
2    Mary    1      age    NA
3    John    1   weight    90
4    Mary    1   weight    NA
5    John    1   height     2
6    Mary    1   height     2

All measured variables must be of the same type, e.g., numeric, factor, date. This is required because molten data is stored in a R data frame, and the value column can assume only one type.

Missing values

In a molten data format it is possible to encode all missing values implicitly, by omitting that combination of id variables, rather than explicitly, with an NA value. However, by doing this you are throwing data away that might be useful in your analysis. Because of that, in order to represent NA implicitly, i.e., by removing the row with NA, we need to set na.rm = TRUE in the call to melt. Otherwise, NA will be present in the molten data by default.

molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
molten
  subject time variable value
1    John    1      age    33
3    John    1   weight    90
5    John    1   height     2
6    Mary    1   height     2

Reshaping your data

Now that you have a molten data you can reshape it into a data frame using dcast function or into a vector/matrix/array using the acast function. The basic arguments of *cast is the molten data and a formula of the form x1 + x2 ~ y1 + y2. The order of the variables matter, the first varies slowest, and the last fastest. There are a couple of special variables: “...” represents all other variables not used in the formula and “.” represents no variable, so you can do formula = x1 ~ .

It is easier to understand the way it works by doing it yourself. Try different options and see what happens:

dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time  ~ variable)
dcast(molten, formula = subject  ~ variable)
dcast(molten, formula = ...  ~ variable)

It is also possible to create higher dimension arrays by using more than one ~ in the formula. For example,

acast(molten, formula = subject  ~ time ~ variable)

From [3], we see that we can also supply a list of quoted expressions, in the form list (.(x1, x1), .(y1, y2), .(z)). The advantage of this form is that you can cast based on transformations of the variables: list(.(a + b), (c = round(c))). To use this we need the function ‘.‘ from the plyr package to form a list of unevaluated expressions.

The function ‘.‘ can also be used to form more complicated expressions to be used within the subset argument of the *cast function:

data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
library(plyr) # needed to access . function
acast(aqm, variable ~ Month, mean, 
      subset = .(variable == "Ozone"))

Aggregation

Aggregation occurs when the combination of variables in the *cast formula does not identify individuals observations. In this case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the fun.aggregate argument, which default to length (with a warning). Further arguments can be passed to the aggregation function through ‘...‘ in *cast.

# Melt French Fries dataset
data(french_fries)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)

# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length) 

# Passing further arguments through ...
dcast(ffm, treatment ~ ., sum)
dcast(ffm, treatment ~ ., sum, trim = 0.1)

Margins

To add statistics to the margins of your table, we can set the argument margins appropriately. Set margins = TRUE to display all margins or list individual variables in a character vector. Note that changing the order and position of the variables in the *cast formula affects the margins that can be computed.

dcast(ffm, treatment + rep ~ time, sum, 
      margins = "rep")
dcast(ffm, rep + treatment ~ time, sum, 
      margins = "treatment")
dcast(ffm, treatment + rep ~ time, sum, 
      margins = TRUE)

A nice piece of advice

A nice piece of advice from [1]: As mentioned above, the order of variables in the formula matters, where variables specified first vary slowest than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the early variables should be thought of as conditioning variables.

References:

[1] Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software 21(12), 1-20. – in the INLA website.
[2] melt R documentation.
[3] cast R documentation.

To leave a comment for the author, please follow the link and comment on his blog: Thiago G. Martins » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.