reshape (from base) Explained: Part I

Posted on May 2, 2012 by tylerrinker in R bloggers | 0 Comments

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This Post Will Explain the Basics of Wide to Long With base reshape (part I)

Often your data set is in wide format and some sort of analysis or visualization requires putting the data set into long format. Hadely Wickham has a package for reshaping data called reshape2 that is pretty handy for quickly reshaping data with the melt and cast functions. I learned to use this long before I learned the base function reshape for doing the same task. I suspect many of you are in the same boat and may never have learned to use bases reshape period. There’s a reason for that: the arguments are not instinctive like Wickham’s package, the description of the function (LINK) is very difficult for beginners, and this function is actually two functions in 1 (a wide to long as well as a long to wide function).

So you’ve mastered Wickham’s reshape2 package and are thinking “Why the fudgesicle should I learn a confusing function like reshape when I got Hadley?” Here’s my list:

It’s powerful
It’s flexible
It’s in base (no dependencies)
It may be faster for large data sets

So which approach should you use? The best one for the job. Alright let’s tear into base’s reshape and take some mystery away from how to work it (I got 2 rules to help guide your thinking).

RULE 1: Stack repeated measures/Replicate and stack everything else

Basically you want to:

Take repeated measures columns and stack them as a measures column
Put their column names next to them in a new times column (so you can keep track of which time is which time)
And then replicate everything else that’s left as many times as you had repeated measures and stack it all.

So in the data frame below we have 3 repeated measures (time1, time2, time3) and the “everything else” is the id column.

    id time1 time2 time3
1 ID.1  5.01  5.12  8.62
2 ID.2 79.40 81.42 81.29
3 ID.3 80.37 83.12 85.92

We want to stack the last three columns, making sure to put their respective column name next to them and then we want to replicate the id part of the data frame and stack it 3 times because that’s how many repeated measures we have. So the final product will look like this:

    id time results
1 ID.1    1    5.01
2 ID.2    1   79.40
3 ID.3    1   80.37
4 ID.1    2    5.12
5 ID.2    2   81.42
6 ID.3    2   83.12
7 ID.1    3    8.62
8 ID.2    3   81.29
9 ID.3    3   85.92

RULE 2: Naming your columns in a way R likes makes your life easier

Here’s a stackoverflow.com example of someone with this very problem of not satisfying the the naming the way R likes it (this was added after this blog was written) [LINK click here].

When I got this little fact down reshape became a lot easier to operate. So how does R like your columns to look? Well R doesn’t give a rip what your “everything else” columns look like but the repeated measures it likes in the form “time.1″ or a word common to all repeated measures -> followed by a period -> followed by sequence of numbers or alpha.numeric

I promise you getting this down makes your life easier. It enables varying to figure out what columns are what with more complex problems. Alright let’s generate some data using the DFgen function and look at ways to rename the columns (you can source it if you haven’t saved it to your .Rprofile). The last three columns are our repeated measures.

###########################
# LOAD THE DFgen FUNCTION #
###########################
source("http://dl.dropbox.com/u/61803503/DFgen_fun.txt")
######################
# GENERATE SOME DATA #
######################
set.seed(10);dat <- DFgen()[1:5, -c(6:10)]

The data set looks like this:

    id   group hs.grad  race gender score time1 time2 time3
1 ID.1   treat     yes white   male -1.24 51.39 52.15 53.76
2 ID.2 control     yes black   male -0.46 32.21 35.07 33.10
3 ID.3 control     yes white   male -0.83 43.36 45.46 46.22
4 ID.4   treat      no white   male  0.34 71.63 72.06 74.49
5 ID.5 control     yes white female  1.07  9.26 12.24 11.02

Now let’s rename time1, time2 and time3 the way R likes (makes life easy peasy cheesy) . There’s two approaches: 1) I’ll do it manually because regex is kinda a pain to learn 2) I’ll use regex because a) I like to show off b) I am somehow brilliant and know how already c) my data set is huge (many # of vars) and it’s more of a pain to do it manually d) all of the above. I ain’t gonna lie regex takes some learning but can be a valuable asset and a time saver.

#Variable Rename Method 1
names(dat)[7:9] <- c("time.1", "time.2", "time.3")
dat

#Variable Rename Method 2
dat <- redat #reload the data set with the old names
names(dat) <- gsub("([a-z])([0-9])", "\\1\\.\\2", names(dat))
########################################################################
# Basically this says find all the letters a-z followed by all numbers #
# 0-9, slplit them apart into pieces 1 and two then the second part    #
# says take pieces one and two put a period between them and put them  #
# back together. If there's not a pattern of alpha followed by         #
# numeric then leave those names alone.                                #
########################################################################
dat

Alright we’ve satisfied the R beast’s desire for nicely formatted names, now our life is easy. Let’s learn the bare minimum of what reshape needs now. You have to tell reshape:

data – dataframe you’re supplying reshape
direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)
varying – the repeated measures columns we want to stack (takes indexes or column names but I’m lazy and will use indexes if you want names use: c(“colname1″, “colname2″, “colname…n”))

Alright let’s see what that gives us:

reshape(dat,
    direction="long",
    varying=7:9)

Which yields:

         id   group hs.grad  race gender score  time
ID.1.1 ID.1   treat     yes white   male -1.24 51.39
ID.2.1 ID.2 control     yes black   male -0.46 32.21
ID.3.1 ID.3 control     yes white   male -0.83 43.36
ID.4.1 ID.4   treat      no white   male  0.34 71.63
ID.5.1 ID.5 control     yes white female  1.07  9.26
ID.1.2 ID.1   treat     yes white   male -1.24 52.15
ID.2.2 ID.2 control     yes black   male -0.46 35.07
ID.3.2 ID.3 control     yes white   male -0.83 45.46
ID.4.2 ID.4   treat      no white   male  0.34 72.06
ID.5.2 ID.5 control     yes white female  1.07 12.24
ID.1.3 ID.1   treat     yes white   male -1.24 53.76
ID.2.3 ID.2 control     yes black   male -0.46 33.10
ID.3.3 ID.3 control     yes white   male -0.83 46.22
ID.4.3 ID.4   treat      no white   male  0.34 74.49
ID.5.3 ID.5 control     yes white female  1.07 11.02

This ain’t bad but (a) the row names are annoying, (b) time is the measurements and (c) speaking of time where’s that column? Well we need to add some cute little arguments to get what we want. Let’s look at some more arguments and see what they’ll give us:

v.names - This is what we we call the measurements (values) of each repeated measure. Name it anything you want.
timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will). Name it anything you want.

Basically these guys are column renamers. Also by specifying timevar it puts that column into the data set (remember he was no where to be found in the last step). Remember you can call them anything you want. Let’s see what they’re doing:

reshape(dat, 
    direction="long",
    varying=7:9,
    idvar='id',
    timevar="TIME",
    v.names="RESULTS")

Which yields (only show the first 6 rows of data):

         id   group hs.grad  race gender score TIME RESULTS
ID.1.1 ID.1   treat     yes white   male -1.24    1   51.39
ID.2.1 ID.2 control     yes black   male -0.46    1   32.21
ID.3.1 ID.3 control     yes white   male -0.83    1   43.36
ID.4.1 ID.4   treat      no white   male  0.34    1   71.63
ID.5.1 ID.5 control     yes white female  1.07    1    9.26
ID.1.2 ID.1   treat     yes white   male -1.24    2   52.15

But what if times in the TIME column weren’t really 1, 2, and 3 but were 3 locations like “work”, “home”, “church” and we want the data to represent this (this can make our life easier later on for analysis and visuals so we aren’t having to remember what 1,2 & 3 are)? Well we can via:

times - This guy is the way we specify what the 1, 2 and 3 are. As many numeric values that you have in this column you must have names.

If you don’t mind I’m also going to rename the rows at this point to because I can’t stand anything but ordinal numbers for rownames (but this is my blog so who’s going to stop me?).

dat2 <- reshape(dat,        #dataframe
    direction="long",       #wide to long
    varying=7:9,            #repeated measures index
    idvar='id',             #1 or more of what's left
    timevar="TIME",         #The repeated measures times
    v.names="RESULTS",      #the repeated measures values
    times =c("wrk", "hom", "chr"))
######################################
# RENAME THE ROWS TO ORDINAL NUMBERS #
######################################
row.names(dat2) <- NULL
dat2

Which yields:

     id   group hs.grad  race gender score TIME RESULTS
1  ID.1   treat     yes white   male -1.24  wrk   51.39
2  ID.2 control     yes black   male -0.46  wrk   32.21
3  ID.3 control     yes white   male -0.83  wrk   43.36
4  ID.4   treat      no white   male  0.34  wrk   71.63
5  ID.5 control     yes white female  1.07  wrk    9.26
6  ID.1   treat     yes white   male -1.24  hom   52.15
7  ID.2 control     yes black   male -0.46  hom   35.07
8  ID.3 control     yes white   male -0.83  hom   45.46
9  ID.4   treat      no white   male  0.34  hom   72.06
10 ID.5 control     yes white female  1.07  hom   12.24
11 ID.1   treat     yes white   male -1.24  chr   53.76
12 ID.2 control     yes black   male -0.46  chr   33.10
13 ID.3 control     yes white   male -0.83  chr   46.22
14 ID.4   treat      no white   male  0.34  chr   74.49
15 ID.5 control     yes white female  1.07  chr   11.02

If you’re the type who skipped over rule 2, rename columns the way R likes (throwing caution to the wind) in this case you’ll get away with it because the format is pretty simple. In fact I’d probably not rename the columns but if R squawks this is one of the first things to fix.

In part II we’ll explore more complex reshapes like double stacks and more than one set of repeated measures series.

In part II of this reshape series we’ll be looking at more complex reshapes

For a .txt version of this demonstration click here