How to reshape data in R: tidyr vs reshape2

June 20, 2016
By

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

We often find ourselves tidying and reshaping data. Here we consider the two packages tidyr and reshape2, our aim is to see where their purposes overlap and where they differ by comparing the functions gather(), separate() and spread(), from tidyr, with the functions melt(), colsplit() and dcast(), from reshape2.

Data tidying

Data tidying is the operation of transforming data into a clear and simple form that makes it easy to work with. “Tidy data” represent the information from a dataset as data frames where each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). Compare the two data frames below (cf.Wickham (2014)) to get an idea of the differences: example.tidy is the tidy version of example.messy, the same information is organized in two different ways.

example.messy

##              treatmenta treatmentb
## John Smith           NA          2
## Jane Doe             16         11
## Mary Johnson          3          1

example.tidy

##           name        trt result
## 1   John Smith treatmenta     NA
## 2     Jane Doe treatmenta     16
## 3 Mary Johnson treatmenta      3
## 4   John Smith treatmentb      2
## 5     Jane Doe treatmentb     11
## 6 Mary Johnson treatmentb      1

From the wide to the long format: gather() vs melt()

We now begin by seeing in action how we can bring data from the “wide” to the “long” format.

Let’s start loading the packages we need:

library(tidyr)
library(reshape2)

and some data (from RStudio Blog – Introducing tidyr): we have measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

set.seed(10)
messy <- data.frame(id = 1:4,
                    trt = sample(rep(c('control', 'treatment'), each = 2)),
                    work.T1 = runif(4),
                    home.T1 = runif(4),
                    work.T2 = runif(4),
                    home.T2 = runif(4))
messy

##   id       trt    work.T1   home.T1   work.T2    home.T2
## 1  1 treatment 0.08513597 0.6158293 0.1135090 0.05190332
## 2  2   control 0.22543662 0.4296715 0.5959253 0.26417767
## 3  3 treatment 0.27453052 0.6516557 0.3580500 0.39879073
## 4  4   control 0.27230507 0.5677378 0.4288094 0.83613414

Our first step is to put the data in the tidy format, to do that we use tidyr’s functions gather() and separate(). Following Wickham’s tidy data definition, this data frame is not tidy because some variable values are in the column names. We bring this messy data frame from the wide to the long format by using the gather() function (give a look at Sean C. Anderson – An Introduction to reshape2 to get an idea of the wide/long format). We want to gather all the columns, except for the id and trt ones, in two columns key and value:

gathered.messy <- gather(messy, key, value, -id, -trt)
head(gathered.messy)

##   id       trt     key      value
## 1  1 treatment work.T1 0.08513597
## 2  2   control work.T1 0.22543662
## 3  3 treatment work.T1 0.27453052
## 4  4   control work.T1 0.27230507
## 5  1 treatment home.T1 0.61582931
## 6  2   control home.T1 0.42967153

Note that in gather() we used bare variable names to specify the names of the key, value, id and trt columns.

We can get the same result with the melt() function from reshape2:

molten.messy <- melt(messy, 
                     variable.name = "key",
                     value.names = "value",
                     id.vars = c("id", "trt"))
head(molten.messy)

##   id       trt     key      value
## 1  1 treatment work.T1 0.08513597
## 2  2   control work.T1 0.22543662
## 3  3 treatment work.T1 0.27453052
## 4  4   control work.T1 0.27230507
## 5  1 treatment home.T1 0.61582931
## 6  2   control home.T1 0.42967153

We now compare the two functions by running them over the data without any further parameter and see what happen:

head(gather(messy))

## Warning: attributes are not identical across measure variables; they will
## be dropped

##   key     value
## 1  id         1
## 2  id         2
## 3  id         3
## 4  id         4
## 5 trt treatment
## 6 trt   control

head(melt(messy))

## Using trt as id variables

##         trt variable      value
## 1 treatment       id 1.00000000
## 2   control       id 2.00000000
## 3 treatment       id 3.00000000
## 4   control       id 4.00000000
## 5 treatment  work.T1 0.08513597
## 6   control  work.T1 0.22543662

We see a different behaviour: gather() has brought messy into a long data format with a warning by treating all columns as variable, while melt() has treated trt as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. Indeed, if melt() does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather() requires the columns that needs to be treated as ids, all the other columns are going to be used as key-value pairs.

Despite those last different results, we have seen that the two functions can be used to perform the exactly same operations on data frames, and only on data frames! Indeed, gather() cannot handle matrices or arrays, while melt() can as shown below.

set.seed(3)
M <- matrix(rnorm(6), ncol = 3)
dimnames(M) <- list(letters[1:2], letters[1:3])

melt(M)

##   Var1 Var2       value
## 1    a    a -0.96193342
## 2    b    a -0.29252572
## 3    a    b  0.25878822
## 4    b    b -1.15213189
## 5    a    c  0.19578283
## 6    b    c  0.03012394

gather(M)

## Error in UseMethod("gather_"): no applicable method for 'gather_' applied to an object of class "c('matrix', 'double', 'numeric')"

Split a column: separate() vs colsplit()

Our next step is to split the column key into two different columns in order to separate the location and time variables and obtain a tidy data frame:

tidy <- separate(gathered.messy,
                 key, into = c("location", "time"), sep = "\.") 
res.tidy <- cbind(molten.messy[1:2], 
                  colsplit(molten.messy[, 3], "\.", c("location", "time")),
                  molten.messy[4])

head(tidy)

##   id       trt location time      value
## 1  1 treatment     work   T1 0.08513597
## 2  2   control     work   T1 0.22543662
## 3  3 treatment     work   T1 0.27453052
## 4  4   control     work   T1 0.27230507
## 5  1 treatment     home   T1 0.61582931
## 6  2   control     home   T1 0.42967153

head(res.tidy)

##   id       trt location time      value
## 1  1 treatment     work   T1 0.08513597
## 2  2   control     work   T1 0.22543662
## 3  3 treatment     work   T1 0.27453052
## 4  4   control     work   T1 0.27230507
## 5  1 treatment     home   T1 0.61582931
## 6  2   control     home   T1 0.42967153

Again, the result is the same but we need a workaround: because colsplit() operates only on a single column we usecbind() to insert the new two columns in the data frame. separate() performs all the operation at once reducing the possibility of making mistakes.

From the long to the wide format: spread() vs dcast()

Finally, we compare spread() with dcast() using the data frame example for the spread() documentation itself. Briefly,spread() is complementary to gather() and brings data from the long to the wide format.

set.seed(14)
stocks <- data.frame(time = as.Date('2009-01-01') + 0:9,
                     X = rnorm(10, 0, 1),
                     Y = rnorm(10, 0, 2),
                     Z = rnorm(10, 0, 4))
stocksm <- gather(stocks, stock, price, -time)
spread.stock <- spread(stocksm, stock, price)
head(spread.stock)

##         time           X          Y          Z
## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590
## 2 2009-01-02  1.71895416  0.5988432 -0.7943331
## 3 2009-01-03  2.12166699  1.3484795  0.5554631
## 4 2009-01-04  1.49715368 -0.5856326 -1.1173440
## 5 2009-01-05 -0.03614058  0.9761067  2.8356777
## 6 2009-01-06  1.23194518  1.7656036 -3.0664418

cast.stock <- dcast(stocksm, formula = time ~ stock, value.var = "price")
head(cast.stock)

##         time           X          Y          Z
## 1 2009-01-01 -0.66184983 -0.7656438 -5.0672590
## 2 2009-01-02  1.71895416  0.5988432 -0.7943331
## 3 2009-01-03  2.12166699  1.3484795  0.5554631
## 4 2009-01-04  1.49715368 -0.5856326 -1.1173440
## 5 2009-01-05 -0.03614058  0.9761067  2.8356777
## 6 2009-01-06  1.23194518  1.7656036 -3.0664418

Again, the same result produced by spread() can be obtained using dcast() by specifying the correct formula.

In the next session, we are going to modify the formula parameter in order to perform some data aggregation and compare further the two packages.

Data aggregation

Up to now we made reshape2 following tidyr, showing that everything you can do with tidyr can be achieved by reshape2, too, at the price of a some workarounds. As we now go on with our simple example we will get out of the purposes of tidyr and have no more functions available for our needs. Now we have a tidy data set – one observation per row and one variable per column – to work with. We show some aggregations that are possible with dcast() using the tips data frame from reshape2. Tips contains the information one waiter recorded about each tip he received over a period of a few months working in one restaurant.

head(tips)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

m.tips <- melt(tips)

## Using sex, smoker, day, time as id variables

head(m.tips)

##      sex smoker day   time   variable value
## 1 Female     No Sun Dinner total_bill 16.99
## 2   Male     No Sun Dinner total_bill 10.34
## 3   Male     No Sun Dinner total_bill 21.01
## 4   Male     No Sun Dinner total_bill 23.68
## 5 Female     No Sun Dinner total_bill 24.59
## 6   Male     No Sun Dinner total_bill 25.29

We use dcast() to get information on the average total bill, tip and group size per day and time:

dcast(m.tips, day+time ~ variable, mean)

##    day   time total_bill      tip     size
## 1  Fri Dinner   19.66333 2.940000 2.166667
## 2  Fri  Lunch   12.84571 2.382857 2.000000
## 3  Sat Dinner   20.44138 2.993103 2.517241
## 4  Sun Dinner   21.41000 3.255132 2.842105
## 5 Thur Dinner   18.78000 3.000000 2.000000
## 6 Thur  Lunch   17.66475 2.767705 2.459016

Averages per smoker or not in the group.

dcast(m.tips, smoker ~ variable, mean)

##   smoker total_bill      tip     size
## 1     No   19.18828 2.991854 2.668874
## 2    Yes   20.75634 3.008710 2.408602

There is no function in the tidyr package that allows us to perform a similar operation, the reason is that tidyr is designed only for data tidying and not for data reshaping.

Conclusions

At the beginning we have seen tidyr and reshape2 functions performing the same operations, therefore, suggesting that the two packages are similar, if not equal in what they do; lately, we have seen that reshape2’s functions can do data aggregation that is not possible with tidyr. Indeed, tidyr’s aim is data tidying while reshape2 has the wider purpose of data reshaping and aggregating. It follows that tidyr syntax is easier to understand and to work with, but its functionalities are limited. Therefore, we use tidyr gather() and separate() functions to quickly tidy our data and reshape2dcast() to aggregate them.

Further readings:

References:

Wickham, Hadley. 2014. “Tidy data.” Journal of Statistical Software 59 (10).

The post How to reshape data in R: tidyr vs reshape2 appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)