Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Creating molten data

Instead of thinking about data in terms of a matrix or a data frame where we have observations in the rows and variables in the columns, we need to think of the variables as divided in two groups: identifier and measured variables.

Identifier variables (id) identify the unit that measurements take place on. In the `data.frame` below `subject` and `time` are the two id variables and `age`, `weight` and `height` are the measured variables.

```require(reshape2)
x = data.frame(subject = c("John", "Mary"),
time = c(1,1),
age = c(33,NA),
weight = c(90, NA),
height = c(2,2))
x
subject time age weight height
1    John    1  33     90      2
2    Mary    1  NA     NA      2
```

We can go further and say that there are only id variables and a value, where the id variables also identify what measured variable the value represents. Then each row will represent one observation of one variable. This operation, called melting, produces molten data and can be obtained with the `melt` function of the R package `reshape2`.

```molten = melt(x, id = c("subject", "time"))
molten
subject time variable value
1    John    1      age    33
2    Mary    1      age    NA
3    John    1   weight    90
4    Mary    1   weight    NA
5    John    1   height     2
6    Mary    1   height     2
```

All measured variables must be of the same type, e.g., numeric, factor, date. This is required because molten data is stored in a R data frame, and the value column can assume only one type.

Missing values

In a molten data format it is possible to encode all missing values implicitly, by omitting that combination of id variables, rather than explicitly, with an `NA` value. However, by doing this you are throwing data away that might be useful in your analysis. Because of that, in order to represent `NA` implicitly, i.e., by removing the row with `NA`, we need to set `na.rm = TRUE` in the call to melt. Otherwise, `NA` will be present in the molten data by default.

```molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
molten
subject time variable value
1    John    1      age    33
3    John    1   weight    90
5    John    1   height     2
6    Mary    1   height     2
```

Now that you have a molten data you can reshape it into a data frame using `dcast` function or into a vector/matrix/array using the `acast` function. The basic arguments of `*cast` is the molten data and a formula of the form `x1 + x2 ~ y1 + y2`. The order of the variables matter, the first varies slowest, and the last fastest. There are a couple of special variables: “`...`” represents all other variables not used in the formula and “`.`” represents no variable, so you can do `formula = x1 ~ .`

It is easier to understand the way it works by doing it yourself. Try different options and see what happens:

```dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time  ~ variable)
dcast(molten, formula = subject  ~ variable)
dcast(molten, formula = ...  ~ variable)
```

It is also possible to create higher dimension arrays by using more than one `~` in the formula. For example,

```acast(molten, formula = subject  ~ time ~ variable)
```

From , we see that we can also supply a list of quoted expressions, in the form list `(.(x1, x1), .(y1, y2), .(z))`. The advantage of this form is that you can cast based on transformations of the variables: `list(.(a + b), (c = round(c)))`. To use this we need the function ‘`.`‘ from the plyr package to form a list of unevaluated expressions.

The function ‘`.`‘ can also be used to form more complicated expressions to be used within the `subset` argument of the `*cast` function:

```data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
library(plyr) # needed to access . function
acast(aqm, variable ~ Month, mean,
subset = .(variable == "Ozone"))
```

Aggregation

Aggregation occurs when the combination of variables in the `*cast` formula does not identify individuals observations. In this case an aggregation function reduces the multiple values to a single one. We can specify the aggregation function using the `fun.aggregate` argument, which default to `length` (with a warning). Further arguments can be passed to the aggregation function through ‘`...`‘ in `*cast`.

```# Melt French Fries dataset
data(french_fries)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)

# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length)

# Passing further arguments through ...
dcast(ffm, treatment ~ ., sum)
dcast(ffm, treatment ~ ., sum, trim = 0.1)
```

Margins

To add statistics to the margins of your table, we can set the argument `margins` appropriately. Set `margins = TRUE` to display all margins or list individual variables in a character vector. Note that changing the order and position of the variables in the `*cast` formula affects the margins that can be computed.

```dcast(ffm, treatment + rep ~ time, sum,
margins = "rep")
dcast(ffm, rep + treatment ~ time, sum,
margins = "treatment")
dcast(ffm, treatment + rep ~ time, sum,
margins = TRUE)
```

A nice piece of advice from : As mentioned above, the order of variables in the `formula` matters, where variables specified first vary slowest than those specified last. Because comparisons are made most easily between adjacent cells, the variable you are most interested in should be specified last, and the early variables should be thought of as conditioning variables.

References:

 Wickham, H. (2007). Reshaping Data with the reshape Package. Journal of Statistical Software 21(12), 1-20. – in the INLA website.
 melt R documentation.
 cast R documentation. 