reshape (from base) Explained: Part I
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This Post Will Explain the Basics of Wide to Long With base reshape (part I)
Often your data set is in wide format and some sort of analysis or visualization requires putting the data set into long format. Hadely Wickham has a package for reshaping data called reshape2 that is pretty handy for quickly reshaping data with the melt and cast functions. I learned to use this long before I learned the base function reshape for doing the same task. I suspect many of you are in the same boat and may never have learned to use bases reshape period. There’s a reason for that: the arguments are not instinctive like Wickham’s package, the description of the function (LINK) is very difficult for beginners, and this function is actually two functions in 1 (a wide to long as well as a long to wide function).
So you’ve mastered Wickham’s reshape2 package and are thinking “Why the fudgesicle should I learn a confusing function like reshape when I got Hadley?” Here’s my list:
- It’s powerful
- It’s flexible
- It’s in base (no dependencies)
- It may be faster for large data sets
So which approach should you use? The best one for the job. Alright let’s tear into base’s reshape and take some mystery away from how to work it (I got 2 rules to help guide your thinking).
RULE 1: Stack repeated measures/Replicate and stack everything else
Basically you want to:
- Take repeated measures columns and stack them as a measures column
- Put their column names next to them in a new times column (so you can keep track of which time is which time)
- And then replicate everything else that’s left as many times as you had repeated measures and stack it all.
So in the data frame below we have 3 repeated measures (time1, time2, time3) and the “everything else” is the id column.
id time1 time2 time3 1 ID.1 5.01 5.12 8.62 2 ID.2 79.40 81.42 81.29 3 ID.3 80.37 83.12 85.92
We want to stack the last three columns, making sure to put their respective column name next to them and then we want to replicate the id part of the data frame and stack it 3 times because that’s how many repeated measures we have. So the final product will look like this:
id time results 1 ID.1 1 5.01 2 ID.2 1 79.40 3 ID.3 1 80.37 4 ID.1 2 5.12 5 ID.2 2 81.42 6 ID.3 2 83.12 7 ID.1 3 8.62 8 ID.2 3 81.29 9 ID.3 3 85.92
RULE 2: Naming your columns in a way R likes makes your life easier
Here’s a stackoverflow.com example of someone with this very problem of not satisfying the the naming the way R likes it (this was added after this blog was written) [LINK click here].
When I got this little fact down reshape became a lot easier to operate. So how does R like your columns to look? Well R doesn’t give a rip what your “everything else” columns look like but the repeated measures it likes in the form “time.1″ or a word common to all repeated measures -> followed by a period -> followed by sequence of numbers or alpha.numeric
I promise you getting this down makes your life easier. It enables varying to figure out what columns are what with more complex problems. Alright let’s generate some data using the DFgen function and look at ways to rename the columns (you can source it if you haven’t saved it to your .Rprofile). The last three columns are our repeated measures.
########################### # LOAD THE DFgen FUNCTION # ########################### source("http://dl.dropbox.com/u/61803503/DFgen_fun.txt") ###################### # GENERATE SOME DATA # ###################### set.seed(10);dat <- DFgen()[1:5, -c(6:10)]
The data set looks like this:
id group hs.grad race gender score time1 time2 time3 1 ID.1 treat yes white male -1.24 51.39 52.15 53.76 2 ID.2 control yes black male -0.46 32.21 35.07 33.10 3 ID.3 control yes white male -0.83 43.36 45.46 46.22 4 ID.4 treat no white male 0.34 71.63 72.06 74.49 5 ID.5 control yes white female 1.07 9.26 12.24 11.02
Now let’s rename time1, time2 and time3 the way R likes (makes life easy peasy cheesy) . There’s two approaches: 1) I’ll do it manually because regex is kinda a pain to learn 2) I’ll use regex because a) I like to show off b) I am somehow brilliant and know how already c) my data set is huge (many # of vars) and it’s more of a pain to do it manually d) all of the above. I ain’t gonna lie regex takes some learning but can be a valuable asset and a time saver.
#Variable Rename Method 1 names(dat)[7:9] <- c("time.1", "time.2", "time.3") dat #Variable Rename Method 2 dat <- redat #reload the data set with the old names names(dat) <- gsub("([a-z])([0-9])", "\\1\\.\\2", names(dat)) ######################################################################## # Basically this says find all the letters a-z followed by all numbers # # 0-9, slplit them apart into pieces 1 and two then the second part # # says take pieces one and two put a period between them and put them # # back together. If there's not a pattern of alpha followed by # # numeric then leave those names alone. # ######################################################################## dat
Alright we’ve satisfied the R beast’s desire for nicely formatted names, now our life is easy. Let’s learn the bare minimum of what reshape needs now. You have to tell reshape:
- data – dataframe you’re supplying reshape
- direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)
- varying – the repeated measures columns we want to stack (takes indexes or column names but I’m lazy and will use indexes if you want names use: c(“colname1″, “colname2″, “colname…n”))
reshape(dat, direction="long", varying=7:9)
id group hs.grad race gender score time ID.1.1 ID.1 treat yes white male -1.24 51.39 ID.2.1 ID.2 control yes black male -0.46 32.21 ID.3.1 ID.3 control yes white male -0.83 43.36 ID.4.1 ID.4 treat no white male 0.34 71.63 ID.5.1 ID.5 control yes white female 1.07 9.26 ID.1.2 ID.1 treat yes white male -1.24 52.15 ID.2.2 ID.2 control yes black male -0.46 35.07 ID.3.2 ID.3 control yes white male -0.83 45.46 ID.4.2 ID.4 treat no white male 0.34 72.06 ID.5.2 ID.5 control yes white female 1.07 12.24 ID.1.3 ID.1 treat yes white male -1.24 53.76 ID.2.3 ID.2 control yes black male -0.46 33.10 ID.3.3 ID.3 control yes white male -0.83 46.22 ID.4.3 ID.4 treat no white male 0.34 74.49 ID.5.3 ID.5 control yes white female 1.07 11.02
- v.names - This is what we we call the measurements (values) of each repeated measure. Name it anything you want.
- timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will). Name it anything you want.
reshape(dat, direction="long", varying=7:9, idvar='id', timevar="TIME", v.names="RESULTS")
id group hs.grad race gender score TIME RESULTS ID.1.1 ID.1 treat yes white male -1.24 1 51.39 ID.2.1 ID.2 control yes black male -0.46 1 32.21 ID.3.1 ID.3 control yes white male -0.83 1 43.36 ID.4.1 ID.4 treat no white male 0.34 1 71.63 ID.5.1 ID.5 control yes white female 1.07 1 9.26 ID.1.2 ID.1 treat yes white male -1.24 2 52.15
- times - This guy is the way we specify what the 1, 2 and 3 are. As many numeric values that you have in this column you must have names.
dat2 <- reshape(dat, #dataframe direction="long", #wide to long varying=7:9, #repeated measures index idvar='id', #1 or more of what's left timevar="TIME", #The repeated measures times v.names="RESULTS", #the repeated measures values times =c("wrk", "hom", "chr")) ###################################### # RENAME THE ROWS TO ORDINAL NUMBERS # ###################################### row.names(dat2) <- NULL dat2
id group hs.grad race gender score TIME RESULTS 1 ID.1 treat yes white male -1.24 wrk 51.39 2 ID.2 control yes black male -0.46 wrk 32.21 3 ID.3 control yes white male -0.83 wrk 43.36 4 ID.4 treat no white male 0.34 wrk 71.63 5 ID.5 control yes white female 1.07 wrk 9.26 6 ID.1 treat yes white male -1.24 hom 52.15 7 ID.2 control yes black male -0.46 hom 35.07 8 ID.3 control yes white male -0.83 hom 45.46 9 ID.4 treat no white male 0.34 hom 72.06 10 ID.5 control yes white female 1.07 hom 12.24 11 ID.1 treat yes white male -1.24 chr 53.76 12 ID.2 control yes black male -0.46 chr 33.10 13 ID.3 control yes white male -0.83 chr 46.22 14 ID.4 treat no white male 0.34 chr 74.49 15 ID.5 control yes white female 1.07 chr 11.02
In part II of this reshape series we’ll be looking at more complex reshapes
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.