The “splitstackshape” package for R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A while ago, a friend of ours presented me with a data problem. Her questionnaire had some questions where the respondent could provide multiple responses. You know, the “Check as many as apply” type of questions. One way that this data is commonly stored is to put a comma separated value into a single cell in a spreadsheet. In fact, if you use something like Google Forms to collect your data and have questions that use check-boxes, that’s how your data will finally be stored in a Google Spreadsheet.
Thus, you might end up with something like this:
mydf <- data.frame( ID = c("A01", "A02", "A03", "A04", "B01", "B02"), Name = c("Boyd", "Rufus", "Dana", "Carole", "Ramona", "Kelley"), Likes = c("1,2,4,5,6", "1,2,5,6", "1,3,4", "2,3,6", "1,2,3,5", "1,4,6"), Siblings = c("Reynolds , Albert , Ortega", "Cohen , Bert , Montgomery", "Pierce", "Colon , Michelle , Ballard", "Snyder , Joann ,", "James , Roxanne ,")) mydf # ID Name Likes Siblings # 1 A01 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega # 2 A02 Rufus 1,2,5,6 Cohen , Bert , Montgomery # 3 A03 Dana 1,3,4 Pierce # 4 A04 Carole 2,3,6 Colon , Michelle , Ballard # 5 B01 Ramona 1,2,3,5 Snyder , Joann , # 6 B02 Kelley 1,4,6 James , Roxanne ,
There are two typical R approaches to split data like this up. First is strsplit
, which will create a list
of your split data:
X <- lapply(mydf, function(y) { strsplit(as.character(y), ",") }) X # $Likes # $Likes[[1]] # [1] "1" "2" "4" "5" "6" # # $Likes[[2]] # [1] "1" "2" "5" "6" # # $Likes[[3]] # [1] "1" "3" "4" # # $Likes[[4]] # [1] "2" "3" "6" # # $Likes[[5]] # [1] "1" "2" "3" "5" # # $Likes[[6]] # [1] "1" "4" "6" # # # $Siblings # $Siblings[[1]] # [1] "Reynolds " " Albert " " Ortega" # # $Siblings[[2]] # [1] "Cohen " " Bert " " Montgomery" # # $Siblings[[3]] # [1] "Pierce" # # $Siblings[[4]] # [1] "Colon " " Michelle " " Ballard" # # $Siblings[[5]] # [1] "Snyder " " Joann " # # $Siblings[[6]] # [1] "James " " Roxanne "
If required, you can also make those lists “columns” in your original data.frame
, but that can also create some problems later on, for instance if you try to use write.csv
.
The second common approach is to use read.csv
again on the concatenated columns of data.
Y <- lapply(mydf, function(y) { read.csv(text = as.character(y), header = FALSE, strip.white = TRUE, blank.lines.skip = FALSE, fill = TRUE, stringsAsFactors = FALSE) }) Y # $Likes # V1 V2 V3 V4 V5 # 1 1 2 4 5 6 # 2 1 2 5 6 NA # 3 1 3 4 NA NA # 4 2 3 6 NA NA # 5 1 2 3 5 NA # 6 1 4 6 NA NA # # $Siblings # V1 V2 V3 # 1 Reynolds Albert Ortega # 2 Cohen Bert Montgomery # 3 Pierce # 4 Colon Michelle Ballard # 5 Snyder Joann # 6 James Roxanne do.call(cbind, Y) # Likes.V1 Likes.V2 Likes.V3 Likes.V4 Likes.V5 Siblings.V1 # 1 1 2 4 5 6 Reynolds # 2 1 2 5 6 NA Cohen # 3 1 3 4 NA NA Pierce # 4 2 3 6 NA NA Colon # 5 1 2 3 5 NA Snyder # 6 1 4 6 NA NA James # Siblings.V2 Siblings.V3 # 1 Albert Ortega # 2 Bert Montgomery # 3 # 4 Michelle Ballard # 5 Joann # 6 Roxanne
From the above, it’s easy to use stack
to get the data into a “long” format, which would make it easy to do things like tabulate your data and do further analysis. Here are a few rows of both columns in this stack
ed form.
lapply(seq_along(Y), function(Z) head(stack(Y[[Z]]))) # [[1]] # values ind # 1 1 V1 # 2 1 V1 # 3 1 V1 # 4 2 V1 # 5 1 V1 # 6 1 V1 # # [[2]] # values ind # 1 Reynolds V1 # 2 Cohen V1 # 3 Pierce V1 # 4 Colon V1 # 5 Snyder V1 # 6 James V1
For the past year and a half, I’ve been hanging around Stack Overflow answering questions mostly under the R tag. One of the categories of questions that I typically like to answer are questions involving “reshaping” data into formats that might facilitate further analysis.
I’ve finally decided to take some of my experience from there to put together a few functions that I think might be generally useful into a package named “splitstackshape”. Below, I will demonstrate a few of the key functions.
Installation
The package is on CRAN, so you can install it with:
install.packages("splitstackshape")
Alternatively, you can check the development page on Github as new features may periodically be added there that are not present in the CRAN version.
concat.split
The main unifying function in this group of functions is concat.split.multiple
which allows you to split multiple columns of concatenated data and either keep the output in a wide format or convert it into a long format. Here’s an example.
CSW <- concat.split.multiple( mydf, split.cols = c("Likes", "Siblings"), seps = ",", direction = "wide") CSW # ID Name Likes_1 Likes_2 Likes_3 Likes_4 Likes_5 Siblings_1 # 1 A01 Boyd 1 2 4 5 6 Reynolds # 2 A02 Rufus 1 2 5 6 NA Cohen # 3 A03 Dana 1 3 4 NA NA Pierce # 4 A04 Carole 2 3 6 NA NA Colon # 5 B01 Ramona 1 2 3 5 NA Snyder # 6 B02 Kelley 1 4 6 NA NA James # Siblings_2 Siblings_3 # 1 Albert Ortega # 2 Bert Montgomery # 3 <NA> <NA> # 4 Michelle Ballard # 5 Joann <NA> # 6 Roxanne <NA> CSL <- concat.split.multiple( mydf, split.cols = c("Likes", "Siblings"), seps = ",", direction = "long") CSL # ID Name time Likes Siblings # A01.Boyd.1 A01 Boyd 1 1 Reynolds # A02.Rufus.1 A02 Rufus 1 1 Cohen # A03.Dana.1 A03 Dana 1 1 Pierce # A04.Carole.1 A04 Carole 1 2 Colon # B01.Ramona.1 B01 Ramona 1 1 Snyder # B02.Kelley.1 B02 Kelley 1 1 James # A01.Boyd.2 A01 Boyd 2 2 Albert # A02.Rufus.2 A02 Rufus 2 2 Bert # A03.Dana.2 A03 Dana 2 3 <NA> # A04.Carole.2 A04 Carole 2 3 Michelle # B01.Ramona.2 B01 Ramona 2 2 Joann # B02.Kelley.2 B02 Kelley 2 4 Roxanne # A01.Boyd.3 A01 Boyd 3 4 Ortega # A02.Rufus.3 A02 Rufus 3 5 Montgomery # A03.Dana.3 A03 Dana 3 4 <NA> # A04.Carole.3 A04 Carole 3 6 Ballard # B01.Ramona.3 B01 Ramona 3 3 <NA> # B02.Kelley.3 B02 Kelley 3 6 <NA> # A01.Boyd.4 A01 Boyd 4 5 <NA> # A02.Rufus.4 A02 Rufus 4 6 <NA> # A03.Dana.4 A03 Dana 4 NA <NA> # A04.Carole.4 A04 Carole 4 NA <NA> # B01.Ramona.4 B01 Ramona 4 5 <NA> # B02.Kelley.4 B02 Kelley 4 NA <NA> # A01.Boyd.5 A01 Boyd 5 6 <NA> # A02.Rufus.5 A02 Rufus 5 NA <NA> # A03.Dana.5 A03 Dana 5 NA <NA> # A04.Carole.5 A04 Carole 5 NA <NA> # B01.Ramona.5 B01 Ramona 5 NA <NA> # B02.Kelley.5 B02 Kelley 5 NA <NA>
As I mentioned, this is a “family” of functions. For instance, when splitting numeric data, you might want to create a “binary” matrix of your concatenated data. Here, a value of “1″ is mapped to the first column, a value of “2″ to the second, and so on.
concat.split.expanded(mydf, split.col = "Likes", sep = ",", mode = "binary", drop = FALSE, fill = 0) # ID Name Likes Siblings Likes_1 Likes_2 # 1 A01 Boyd 1,2,4,5,6 Reynolds , Albert , Ortega 1 1 # 2 A02 Rufus 1,2,5,6 Cohen , Bert , Montgomery 1 1 # 3 A03 Dana 1,3,4 Pierce 1 0 # 4 A04 Carole 2,3,6 Colon , Michelle , Ballard 0 1 # 5 B01 Ramona 1,2,3,5 Snyder , Joann , 1 1 # 6 B02 Kelley 1,4,6 James , Roxanne , 1 0 # Likes_3 Likes_4 Likes_5 Likes_6 # 1 0 1 1 1 # 2 0 0 1 1 # 3 1 1 0 0 # 4 1 0 0 1 # 5 1 0 1 0 # 6 0 1 0 1
The mode = "value"
setting is similar, but instead of recoding to “1″, the original values are retained.
concat.split.expanded(mydf, split.col = "Likes", sep = ",", mode = "value", drop = TRUE, fill = NA) # ID Name Siblings Likes_1 Likes_2 # 1 A01 Boyd Reynolds , Albert , Ortega 1 2 # 2 A02 Rufus Cohen , Bert , Montgomery 1 2 # 3 A03 Dana Pierce 1 NA # 4 A04 Carole Colon , Michelle , Ballard NA 2 # 5 B01 Ramona Snyder , Joann , 1 2 # 6 B02 Kelley James , Roxanne , 1 NA # Likes_3 Likes_4 Likes_5 Likes_6 # 1 NA 4 5 6 # 2 NA NA 5 6 # 3 3 4 NA NA # 4 3 NA NA 6 # 5 3 NA 5 NA # 6 NA 4 NA 6
Stacked
and Reshape
The Stacked
and Reshape
functions are designed for convenient (guess what) stacking and reshaping of wide datasets.
Consider, for example the CSW
object I created earlier when demonstrating the default direction = "wide"
setting in concat.split.multiple
. If we wanted to convert this into a “semi-long” format (that is, keep separate columns for “Likes” and “Siblings”), we would generally try to use reshape()
from base R. The problem is that these data are “unbalanced”. There are only three “times” for “Siblings”, while there are five times for “Likes”. (Ignore, for the time being, that such a dataset would not make much sense since I’m stacking unrelated categorical variables. A better example would be where a certain measured variable in a longitudinal study were introduced in a different wave, meaning it was not present in one or more earlier waves.)
Here’s what happens with reshape()
from base R:
reshape(CSW, direction = "long", idvar = 1:2, varying = 3:ncol(CSW), sep = "_") # Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : # 'varying' arguments must be the same length
Here is the Reshape
approach from “splitstackshape”:
Reshape(CSW, id.vars = c("ID", "Name"), var.stubs = c("Likes", "Siblings"), sep = "_") # ID Name time Likes Siblings # A01.Boyd.1 A01 Boyd 1 1 Reynolds # A02.Rufus.1 A02 Rufus 1 1 Cohen # A03.Dana.1 A03 Dana 1 1 Pierce # A04.Carole.1 A04 Carole 1 2 Colon # B01.Ramona.1 B01 Ramona 1 1 Snyder # B02.Kelley.1 B02 Kelley 1 1 James # A01.Boyd.2 A01 Boyd 2 2 Albert # A02.Rufus.2 A02 Rufus 2 2 Bert # A03.Dana.2 A03 Dana 2 3 <NA> # A04.Carole.2 A04 Carole 2 3 Michelle # B01.Ramona.2 B01 Ramona 2 2 Joann # B02.Kelley.2 B02 Kelley 2 4 Roxanne # A01.Boyd.3 A01 Boyd 3 4 Ortega # A02.Rufus.3 A02 Rufus 3 5 Montgomery # A03.Dana.3 A03 Dana 3 4 <NA> # A04.Carole.3 A04 Carole 3 6 Ballard # B01.Ramona.3 B01 Ramona 3 3 <NA> # B02.Kelley.3 B02 Kelley 3 6 <NA> # A01.Boyd.4 A01 Boyd 4 5 <NA> # A02.Rufus.4 A02 Rufus 4 6 <NA> # A03.Dana.4 A03 Dana 4 NA <NA> # A04.Carole.4 A04 Carole 4 NA <NA> # B01.Ramona.4 B01 Ramona 4 5 <NA> # B02.Kelley.4 B02 Kelley 4 NA <NA> # A01.Boyd.5 A01 Boyd 5 6 <NA> # A02.Rufus.5 A02 Rufus 5 NA <NA> # A03.Dana.5 A03 Dana 5 NA <NA> # A04.Carole.5 A04 Carole 5 NA <NA> # B01.Ramona.5 B01 Ramona 5 NA <NA> # B02.Kelley.5 B02 Kelley 5 NA <NA>
You’ll notice that this is the same as the output of CSL
, since the direction = "long"
argument in concat.split.multiple
simply calls the Reshape
function after splitting the concatenated data.
Here’s Stacked
. The syntax is similar to Reshape
. First, I’ll stack just one column, then I’ll stack both, to demonstrate the different features.
Stacked(CSW, id.vars = c("ID", "Name"), var.stubs = "Likes", sep = "_") # $Likes # ID Name .time_1 Likes Siblings_1 Siblings_2 Siblings_3 # 1: A01 Boyd 1 1 Reynolds Albert Ortega # 2: A01 Boyd 2 2 Reynolds Albert Ortega # 3: A01 Boyd 3 4 Reynolds Albert Ortega # 4: A01 Boyd 4 5 Reynolds Albert Ortega # 5: A01 Boyd 5 6 Reynolds Albert Ortega # 6: A02 Rufus 1 1 Cohen Bert Montgomery # 7: A02 Rufus 2 2 Cohen Bert Montgomery # 8: A02 Rufus 3 5 Cohen Bert Montgomery # 9: A02 Rufus 4 6 Cohen Bert Montgomery # 10: A02 Rufus 5 NA Cohen Bert Montgomery # 11: A03 Dana 1 1 Pierce NA NA # 12: A03 Dana 2 3 Pierce NA NA # 13: A03 Dana 3 4 Pierce NA NA # 14: A03 Dana 4 NA Pierce NA NA # 15: A03 Dana 5 NA Pierce NA NA # 16: A04 Carole 1 2 Colon Michelle Ballard # 17: A04 Carole 2 3 Colon Michelle Ballard # 18: A04 Carole 3 6 Colon Michelle Ballard # 19: A04 Carole 4 NA Colon Michelle Ballard # 20: A04 Carole 5 NA Colon Michelle Ballard # 21: B01 Ramona 1 1 Snyder Joann NA # 22: B01 Ramona 2 2 Snyder Joann NA # 23: B01 Ramona 3 3 Snyder Joann NA # 24: B01 Ramona 4 5 Snyder Joann NA # 25: B01 Ramona 5 NA Snyder Joann NA # 26: B02 Kelley 1 1 James Roxanne NA # 27: B02 Kelley 2 4 James Roxanne NA # 28: B02 Kelley 3 6 James Roxanne NA # 29: B02 Kelley 4 NA James Roxanne NA # 30: B02 Kelley 5 NA James Roxanne NA # ID Name .time_1 Likes Siblings_1 Siblings_2 Siblings_3 Stacked(CSW, id.vars = c("ID", "Name"), var.stubs = c("Likes", "Siblings"), sep = "_") # $Likes # ID Name .time_1 Likes # 1: A01 Boyd 1 1 # 2: A01 Boyd 2 2 # 3: A01 Boyd 3 4 # 4: A01 Boyd 4 5 # 5: A01 Boyd 5 6 # 6: A02 Rufus 1 1 # 7: A02 Rufus 2 2 # 8: A02 Rufus 3 5 # 9: A02 Rufus 4 6 # 10: A02 Rufus 5 NA # 11: A03 Dana 1 1 # 12: A03 Dana 2 3 # 13: A03 Dana 3 4 # 14: A03 Dana 4 NA # 15: A03 Dana 5 NA # 16: A04 Carole 1 2 # 17: A04 Carole 2 3 # 18: A04 Carole 3 6 # 19: A04 Carole 4 NA # 20: A04 Carole 5 NA # 21: B01 Ramona 1 1 # 22: B01 Ramona 2 2 # 23: B01 Ramona 3 3 # 24: B01 Ramona 4 5 # 25: B01 Ramona 5 NA # 26: B02 Kelley 1 1 # 27: B02 Kelley 2 4 # 28: B02 Kelley 3 6 # 29: B02 Kelley 4 NA # 30: B02 Kelley 5 NA # ID Name .time_1 Likes # # $Siblings # ID Name .time_1 Siblings # 1: A01 Boyd 1 Reynolds # 2: A01 Boyd 2 Albert # 3: A01 Boyd 3 Ortega # 4: A02 Rufus 1 Cohen # 5: A02 Rufus 2 Bert # 6: A02 Rufus 3 Montgomery # 7: A03 Dana 1 Pierce # 8: A03 Dana 2 NA # 9: A03 Dana 3 NA # 10: A04 Carole 1 Colon # 11: A04 Carole 2 Michelle # 12: A04 Carole 3 Ballard # 13: B01 Ramona 1 Snyder # 14: B01 Ramona 2 Joann # 15: B01 Ramona 3 NA # 16: B02 Kelley 1 James # 17: B02 Kelley 2 Roxanne # 18: B02 Kelley 3 NA
Anyway, test out the package. Hopefully there aren’t too many bugs, and if there are, do be sure to report them.
Update (28 August 2013)
Version 1.2.0 of “splitstackshape” has been accepted at CRAN. The main change is in the Stacked
function (and, by extension, merged.stack
). It now uses unlist(..., use.names = FALSE)
which is much faster than the previous implementation, which used stack
.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.