Creating molten data
Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided in two groups: identifier and measured variables.
Identifier variables (id) identify the unit that measurements take place on. In the
time are the two id variables and
height are the measured variables.
require(reshape2) x = data.frame(subject = c("John", "Mary"), time = c(1,1), age = c(33,NA), weight = c(90, NA), height = c(2,2)) x subject time age weight height 1 John 1 33 90 2 2 Mary 1 NA NA 2
We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. This operation, called melting, produces molten data and can be obtained with the
melt function of the R package
molten = melt(x, id = c("subject", "time")) molten subject time variable value 1 John 1 age 33 2 Mary 1 age NA 3 John 1 weight 90 4 Mary 1 weight NA 5 John 1 height 2 6 Mary 1 height 2
All measured variables must be of the same type, e.g., numeric, factor, date. This is required because molten data is stored in a R data frame, and the value column can assume only one type.
In a molten data format it is possible to encode all missing values implicitly, by omitting that combination of id variables, rather than explicitly, with an
NA value. However, by doing this you are throwing data away that might be useful in your analysis. Because of that, in order to represent
NA implicitly, i.e., by removing the row with
NA, we need to set
na.rm = TRUE in the call to melt. Otherwise,
NA will be present in the molten data by default.
molten = melt(x, id = c("subject", "time"), na.rm = TRUE) molten subject time variable value 1 John 1 age 33 3 John 1 weight 90 5 John 1 height 2 6 Mary 1 height 2
Reshaping your data
Now that you have a molten data you can reshape it into a data frame using
dcast function or into a vector/matrix/array using the
acast function. The basic arguments of
*cast is the molten data and a formula of the form
x1 + x2 ~ y1 + y2. The order of the variables matter, the first varies slowest, and the last fastest. There are a couple of special variables: “
...” represents all other variables not used in the formula and “
.” represents no variable, so you can do
formula = x1 ~ .
It is easier to understand the way it works by doing it yourself. Try different options and see what happens:
dcast(molten, formula = time + subject ~ variable) dcast(molten, formula = subject + time ~ variable) dcast(molten, formula = subject ~ variable) dcast(molten, formula = ... ~ variable)
It is also possible to create higher dimension arrays by using more than one
~ in the formula. For example,
acast(molten, formula = subject ~ time ~ variable)
From , we see that we can also supply a list of quoted expressions, in the form list
(.(x1, x1), .(y1, y2), .(z)). The advantage of this form is that you can cast based on transformations of the variables:
list(.(a + b), (c = round(c))). To use this we need the function ‘
.‘ from the plyr package to form a list of unevaluated expressions.
The function ‘
.‘ can also be used to form more complicated expressions to be used within the
subset argument of the
data(airquality) aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE) library(plyr) # needed to access . function acast(aqm, variable ~ Month, mean, subset = .(variable == "Ozone"))
Aggregation occurs when the combination of variables in the
*cast formula does not identify individuals observations. In this case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the
fun.aggregate argument, which default to
length (with a warning). Further arguments can be passed to the aggregation function through ‘
# Melt French Fries dataset data(french_fries) ffm <- melt(french_fries, id = 1:4, na.rm = TRUE) # Aggregate examples - all 3 yield the same result dcast(ffm, treatment ~ .) dcast(ffm, treatment ~ ., function(x) length(x)) dcast(ffm, treatment ~ ., length) # Passing further arguments through ... dcast(ffm, treatment ~ ., sum) dcast(ffm, treatment ~ ., sum, trim = 0.1)
To add statistics to the margins of your table, we can set the argument
margins appropriately. Set
margins = TRUE to display all margins or list individual variables in a character vector. Note that changing the order and position of the variables in the
*cast formula affects the margins that can be computed.
dcast(ffm, treatment + rep ~ time, sum, margins = "rep") dcast(ffm, rep + treatment ~ time, sum, margins = "treatment") dcast(ffm, treatment + rep ~ time, sum, margins = TRUE)
A nice piece of advice
A nice piece of advice from : As mentioned above, the order of variables in the
formula matters, where variables specified first vary slowest than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the early variables should be thought of as conditioning variables.