reshape (from base) Explained: Part II

Posted on May 5, 2012 by tylerrinker in R bloggers | 0 Comments

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Part II Explains More Complex Wide to Long With base reshape

In part I of this base reshape tutorial we went over the basics of reshaping data with reshape. We learned two rules that help us to be more efficient and effective in using this powerful base tool:

RULE 1: Stack repeated measures/Replicate and stack everything else

RULE 2: Naming your columns in a way R likes makes your life easier

In part II we will be looking at more complex wide to long reshapes (more than one series of repeated measures) by building on what we learned in part I. Let’s start by generating some data with two series/nested repeated measures):

set.seed(10)
dat <- data.frame(id=paste0("ID.", 1:5), 
    sex=sample(c("male", "female"), 5, replace=TRUE), 
    matrix(rpois(30, 10), 5, 6))
colnames(dat)[-c(1:2)] <- paste0(rep(1:2, each=3), 
    rep(c("work", "home", "church"), 2))
dat

Which looks like this:

    id    sex 1work 2home 1church 2work 1home 2church
1 ID.1 female     7     8       7    10     6      10
2 ID.2   male    10    13      10     7    13      15
3 ID.3   male    11    10       6    10    10       7
4 ID.4 female     6     8      12     9    15       7
5 ID.5   male     9    11      15    10    10      12

As you can see we have nested repeated measures at three different locations (work, home, church) at two different times (the 1 or 2 prefix). Now let’s follow Rule 2 and get our names in a way R likes them (You may ask why I didn’t name them correctly to begin with? Fair question. Let me ask one though. Have you ever got a data set 100% the way you wanted it to be?).

names(dat) <- gsub("([0-9]+)([a-z]+)", "\\2\\.\\1", names(dat))
###############################################################
# BASICALLY, THIS SAYS FIND THE NAMES THAT ARE NUMERICALPHA.  #
# OTHERWISE LEAVE IT ALONE. THE [0-9]+ SAYS FIND THE NUMERIC  #
# STRING (PLUS SIGN SAYS FIND ALL THE PROCEDING CHARACTERS 1  #
# OR MORE TIMES). THE [a-z]+ SAYS FIND THE ALPHA STRING (PLUS #
# AGAIN MEANS FIND THE ALPHAS 1 OR MORE TIMES). THE "." IS    #
# CHARACTERS I'M INSERTING AND THE 1 AND 2 CORRESPOND TO THE  #
# PARENTHESIS IN THE ARGUMENT OF gsub. BASICALLY FLIP FLOPING #
# THE POSITION OF 1 AND 2.                                    #
###############################################################
#==================================================================
##############################################################
# OR MANUAL REPLACEMENT. YOU CAN SEE WHERE REGEX CAN COME IN #
# HANDY AS THE DATA SET GROWS.                               #
##############################################################
#names(dat)[-c(1:2)] <- c("work.1", "home.2", "church.1", 
#    "work.2", "home.1", "church.2")

Which now looks like:

    id    sex work.1 home.2 church.1 work.2 home.1 church.2
1 ID.1 female      7      8        7     10      6       10
2 ID.2   male     10     13       10      7     13       15
3 ID.3   male     11     10        6     10     10        7
4 ID.4 female      6      8       12      9     15        7
5 ID.5   male      9     11       15     10     10       12

Alright in part I we learned the following arguments:

data – dataframe you’re supplying reshape
direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)
varying – the repeated measures columns we want to stack (takes indexes or column names but I’m lazy and will use indexes if you want names use: c(“colname1″, “colname2″, “colname…n”))
v.names - This is what we we call the measurements (values) of each repeated measure. Name it anything you want.
timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will). Name it anything you want.
times - Basically this is your:

(# of starting rept. meas. cols.) ÷ (final # of stacked cols.) = (times vector length)

In the first example we want to have a time 1 and time 2 column by stacking all the locations for time 1 in a column and all the locations for time 2 in a column (these are the v.names columns). Since we have two times we’ll need two column names (I called them TIME_1 and TIME_2 but this is up to you). We’ll need to keep track of these locations in the timevar column. If you notice the major difference between simple repeated measures and more complex repeated measures is that we don’t supply an index of columns to varying but a list of indexes. This is where rule 1 becomes important. What are you stacking? In this case we’re wanting to take everything in time 1 and stack it and the same for time 2 and using time.vars to keep track of the locations. In the example code below I have

The bare bones example (no time column)
An example with a time column (numeric values for cells)
An example with time column and locations for cell values (adj. w/ time.vars arg.)

################
# BARE MINIMUM #
################
reshape(dat,                      #dataframe
    direction="long",             #wide to long
    varying=list(c(3:5), c(6:8)), #repeated measures list of indexes
    idvar='id')

###################################################
# STACKING OF TIME 1 AND 2 REPEAT EVERYTHING ELSE #
###################################################
reshape(dat,                      #dataframe
    direction="long",             #wide to long
    varying=list(c(3:5), c(6:8)), #repeated measures list of indexes
    #idvar='id',                  #1 or more of what's left
    timevar="PLACE",              #the repeated measures times
    v.names=c("TIME_1", "TIME_2"))#the repeated measures values

##################################################
# STACKING OF TIME 1 AND 2 WITH NAMED TIME CELLS #
##################################################
dat2 <- reshape(dat,               #dataframe
    direction="long",              #wide to long
    varying=list(c(3:5), c(6:8)),  #repeated measures list of indexes
    #idvar='id',                   #1 or more of what's left
    timevar="PLACE",               #the repeated measures times
    v.names=c("TIME_1", "TIME_2"), #the repeated measures values
    times =c("wrk", "hom", "chr")) 
row.names(dat2) <- NULL
dat2

The final outcome is:

     id    sex PLACE TIME_1 TIME_2
1  ID.1 female   wrk      7     10
2  ID.2   male   wrk     10      7
3  ID.3   male   wrk     11     10
4  ID.4 female   wrk      6      9
5  ID.5   male   wrk      9     10
6  ID.1 female   hom      8      6
7  ID.2   male   hom     13     13
8  ID.3   male   hom     10     10
9  ID.4 female   hom      8     15
10 ID.5   male   hom     11     10
11 ID.1 female   chr      7     10
12 ID.2   male   chr     10     15
13 ID.3   male   chr      6      7
14 ID.4 female   chr     12      7
15 ID.5   male   chr     15     12

This may be what we want but what if we wanted to have a work, home and church column by stacking all the times for work on each other, all the times for home and all the times for church (these are the v.names columns)? Well we do this with the list of indexes we supply to varying. This again is rule number 1. We know we have three var.names columns (the locations) so we need three indexes to pass as a list to varying. We want to stack all the times for work so we supply the index of 3 (work.1) and 6 (work.2) and do the same for home (c(4, 7)) and play (c(5, 8)). We now switch timevar to TIME because it’s no longer keeping track of the locations and the v.names will be given the three locations as names. We also could supply a times argument to reshape but it doesn’t make sense considering the default numeric index (1, 2) already makes sense.

################################
# STACKING OF THE THREE PLACES #
################################
dat3 <- reshape(dat,                          #dataframe
    direction="long",                         #wide to long
    varying=list(c(3, 6), c(4, 7), c(5, 8)),  #repeated measures list of indexes
    #idvar='id',                              #1 or more of what's left
    timevar="TIME",                           #the repeated measures times
    v.names=c("WORK", "HOME", "CHURCH"))      #the repeated measures values
row.names(dat3) <- NULL
dat3

Remember rule 1? The rule about naming. It’s on these more complex reshapes (more than one series of repeated measures/nested repeated measures) that proper naming pays off. The idea of passing varying a list of indexes was because reshape can’t figure out who’s who if you haven’t named them correctly but since we named them to have the three locations followed by a period and then a numeric index our life is easy peesy cheesy. Look below and you’ll see all we do is tell varying what columns are repeated measures and he figures out what to stack from the names. Additionally, there’s no need to supply the argument v.names because R is such a smarty he figured it out all by himself (what a big boy). You ask well why didn’t this work for stacking above with two times (the dat2 example)? Good question. It doesn’t work because we need to have the form measurment_column_name.time_column. So our rename job at the beginning was work.time, home.time, church.time. In this example our three measurement columns will be work, home, and time and the numeric index after each name indicates which time. If we wanted to have it easy for the dat2 example we would to have named the repeated measures as time_1.1, time_1.2, time_1.3, time_2.1, time_2.2, time_2.3. The dot numeric index at the end stands for the three locations. If you’re interested in seeing this please see the link of the script of this demonstration found at the bottom of this article as it contains extra code not found in this post.

So you have three approaches

Name it correctly (just indexes 1:n)
Provide a list of indexes (who cares about names)
Both name correctly and list of indexes (safety my friend)

###############################################################
# STACKING OF THE THREE PLACES REWARDED BY GOOD COLUMN NAMING #
###############################################################
dat3 <- reshape(dat,                          #dataframe
    direction="long",                         #wide to long
    varying=3:8,                              #indexes
    #idvar='id',                              #1 or more of what's left
    timevar="TIME")                           #the repeated measures times
    #v.names=c("WORK", "HOME", "CHURCH"))     #Rewarded: no need for v.names
row.names(dat3) <- NULL
dat3

Which gives us:

     id    sex TIME WORK HOME CHURCH
1  ID.1 female    1    7    8      7
2  ID.2   male    1   10   13     10
3  ID.3   male    1   11   10      6
4  ID.4 female    1    6    8     12
5  ID.5   male    1    9   11     15
6  ID.1 female    2   10    6     10
7  ID.2   male    2    7   13     15
8  ID.3   male    2   10   10      7
9  ID.4 female    2    9   15      7
10 ID.5   male    2   10   10     12

Hold the phone Fenster!

So let me get this straight. If I’ve been a good R user and followed the Rule #2 (name the way R liketh) then all I have to provide reshape is data, direction and varying (maybe idvar)? Yep that’s right. See I told you that nameology was important, makes your life easy. don’t believe me try it out:

reshape(dat, direction="long", varying=3:8)

See reshape is actually pretty simple once you figure it out.

But sometimes we need to stack all the repeated measures into one column (for certain analysis and visualizations) and keep track of both time and location. To do this we simply supply all repeated measures columns to varying (indexes 3:8) as a vector (not a list as we only want one final column and lists are for when we want multiple repeated measures columns), provide v.names and timevar with appropriate names (I chose LOC_TIME for timevar as both the nested repeated measures of location and time will be in this column), and last give a vector of names to the times argument. Keep in mind that reshape will stack the columns you gave to varying in the order you supplied them. To figure out the number of times (as stated above) we take the original number of columns and divide by the total number of end columns (6 ÷ 1 = 6) which means we have to supply 6 names to the times argument (otherwise we have the numeric 1-6 default which can be pretty difficult to keep track of). This is where paste and R’s recycling rule comes in handy. Simply supply paste with the first vector of repeated measure series (location) and then the second, but use rep with the second providing each = (#of first series of repeated measures). The recycling rule will take care of the rest.

###############################################################
# DOUBLE STACK. STACK TIMES AND PLACES AND NOTE EACH TIME AND #
# PLACE.  # of TIMES = # OF COLUMNS STACKED.                  #
###############################################################
dat4 <- reshape(dat,              #dataframe
    direction="long",             #wide to long
    varying=3:8,                  #repeated measures list of indexes
    #idvar='id'),                 #1 or more of what's left
    timevar="LOC_TIME",           #the repeated measures times
    v.names=c("VALUE"),           #the repeated measures values
    times =paste(c("work", "home", "church"), rep(1:2, each=3)))
row.names(dat4) <- NULL
dat4

This gives us:

     id    sex LOC_TIME VALUE
1  ID.1 female   work 1     7
2  ID.2   male   work 1    10
3  ID.3   male   work 1    11
4  ID.4 female   work 1     6
5  ID.5   male   work 1     9
6  ID.1 female   home 1     8
7  ID.2   male   home 1    13
8  ID.3   male   home 1    10
.
.
.
29 ID.4 female church 2     7
30 ID.5   male church 2    12

This is nice but the information for the timevar (location and time) is all garbled and may make analysis or visualization functions difficult. The best approach would be to split this data into two different columns. Many people are familiar with Wickham’s colsplit
from the reshape2 package. This is one approach. I also have a function called colsplit2 that operates from the base package that I keep in my .Rprofile (I actually call it colsplit as well but for namespace purposes we’ll call it colsplit2). this is similar to Wickham’s but a little different. With Wickham’s you provide just the one column and it splits it into two and you then need to cbind it back to the original some how. My function takes the dataframe and the column to be split and outputs a new data frame with two columns in the same place as the original singular column. This is a base alternative if you’re attempting to avoid dependence. For this tutorial I’ll use my function but the downloadable script has both methods.

#############################################
# ALTERNATE BASE METHOD OF COLUMN SPLITTING #
#############################################
colsplit2 <- function(dataframe, splitcol, new.names=NULL, sep=""){
     if(is.numeric(dataframe[, splitcol])) stop("splitcol can not be numeric")
    X <- data.frame(do.call(rbind, strsplit(as.vector(
        dataframe[, splitcol]), split = sep)))
    z <- if (!is.numeric(splitcol)) match(splitcol, names(dataframe)) else splitcol
    if (!is.null(new.names)) colnames(X)  z) {
        cbind(dataframe[, 1:(z-1), drop=FALSE], X, 
            dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
    } else {
        if (z!=1 & ncol(dataframe) == z) {
            cbind(dataframe[, 1:(z-1), drop=FALSE], X)
        } else {
            if (z==1 & ncol(dataframe) > z) {
                cbind(X, dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
            } else {
                X
            }
        }
    }
} #END OF colsplit2 FUNCTION

dat4 <- colsplit2(dat4, "LOC_TIME", c("place", "time"), " ")

We now have:

     id    sex  place time VALUE
1  ID.1 female   work    1     7
2  ID.2   male   work    1    10
3  ID.3   male   work    1    11
4  ID.4 female   work    1     6
5  ID.5   male   work    1     9
6  ID.1 female   home    1     8
7  ID.2   male   home    1    13
8  ID.3   male   home    1    10
.
.
.
29 ID.4 female church    2     7
30 ID.5   male church    2    12

Let’s do a bit of visualization with one of my favorite packages, Wickham’s ggplot2. For social sciences (and particularly repeated measures) the faceting with facet_grid is pretty nice. One little change to the time column to make the labels on facet_grid nicer. I use a paste approach that alters the actual variable because it’s easier to explain but in real practice I don’t like to alter variable I prefer add another column or approach it with other means. The website Cookbook for Rprovides a very nice alternative to altering your variable content using the labeller argument of facet_grid (look under the heading Modifying facet label text in the link).

###############################################################
# MAKE THE NAMES ON LABELS PRETTY FOR GGPLOT FACETING (ONE OF #
# MANY APPROACHES)                                            #
###############################################################
dat4$time <- paste("time", dat4$time)
########################
# PLOT IT WITH GGPLOT2 #
########################
library(ggplot2)
ggplot(data=dat4, aes(sex, VALUE)) +
    geom_boxplot() + facet_grid(place~time)

ggplot(data=dat4, aes(place, VALUE)) +
    geom_boxplot() + facet_grid(time~sex)