splitstackshape V1.4.0 for R
More than a year after splitstackshape V1.2.0, I’ve finally gotten around to making some major updates and submitting the package to CRAN.
So, if you have messed up datasets filled with concatenated cells of data, and you need to split that data up and reorganize it for later analysis, install and load the latest version (V1.4.0) of splitstackshape with:
install.packages("splitstackshape")
library(splitstackshape)
packageVersion("splitstackshape")
## [1] '1.4.0'
Read on for details!
What’s New?
cSplit
cSplit becomes one of the core functions for data processing. It splits the data into either a “wide” format or a “long” format. In the wide format, cells with an unbalanced set of delimiters get expanded out to fill a common number of columns.
The function is vectorized over splitCols and sep, so you can split multiple columns, each with its own delimiter, in one statement.
dat <- data.frame(id = 1:3,
                  V1 = c("a, b, c", "d, e, f, g", "h, i"),
                  V2 = c("1|2", "3|4|5|6", "7|8"))
dat
##   id         V1      V2
## 1  1    a, b, c     1|2
## 2  2 d, e, f, g 3|4|5|6
## 3  3       h, i     7|8

cSplit(dat, splitCols = c("V1", "V2"), sep = c(",", "|"))
##    id V1_1 V1_2 V1_3 V1_4 V2_1 V2_2 V2_3 V2_4
## 1:  1    a    b    c   NA    1    2   NA   NA
## 2:  2    d    e    f    g    3    4    5    6
## 3:  3    h    i   NA   NA    7    8   NA   NA

## Notice that other columns get recycled
cSplit(dat, "V1", sep = ",", direction = "long")
##    id V1      V2
## 1:  1  a     1|2
## 2:  1  b     1|2
## 3:  1  c     1|2
## 4:  2  d 3|4|5|6
## 5:  2  e 3|4|5|6
## 6:  2  f 3|4|5|6
## 7:  2  g 3|4|5|6
## 8:  3  h     7|8
## 9:  3  i     7|8
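To make the "fill a common number of columns" behaviour concrete, here is a base-R sketch of the idea for a single column. It is only an illustration of the padding step, not how cSplit is implemented, and the helper name pad_split and its prefix argument are made up for this example.

```r
# Hypothetical helper (not part of splitstackshape): split one character
# vector on a fixed delimiter and pad the shorter rows with NA.
pad_split <- function(x, sep, prefix = "V1") {
  pieces <- strsplit(as.character(x), sep, fixed = TRUE)
  pieces <- lapply(pieces, trimws)     # drop whitespace around each value
  width  <- max(lengths(pieces))       # the widest row sets the column count
  # Indexing past the end of a vector yields NA, which pads the short rows
  out <- do.call(rbind, lapply(pieces, function(p) p[seq_len(width)]))
  colnames(out) <- paste(prefix, seq_len(width), sep = "_")
  out
}

wide <- pad_split(c("a, b, c", "d, e, f, g", "h, i"), ",")
wide
```

The result is a character matrix with one row per input value and as many columns as the longest row needs, which is the same shape as the V1_1 to V1_4 columns in the cSplit output above.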
cSplit_f
cSplit_f has been added. The _f stands for both “fixed” and “fread”. Since the function depends on fread, it only works if the columns that need to be split have the same number of delimiters in every row; fread does not work with unbalanced/ragged data. In my tests, it’s much faster than cSplit when the data are balanced.
As with cSplit, the function is vectorized over splitCols and sep, letting you split multiple columns in one statement.
dat <- data.frame(id = 1:3,
                  V1 = c("a, b, c", "d, e, f", "g, h, i"),
                  V2 = c("1|2|3", "4|5|6", "7|8|9"))
dat
##   id      V1    V2
## 1  1 a, b, c 1|2|3
## 2  2 d, e, f 4|5|6
## 3  3 g, h, i 7|8|9

cSplit_f(dat, splitCols = c("V1", "V2"), sep = c(",", "|"))
##    id V1_1 V1_2 V1_3 V2_1 V2_2 V2_3
## 1:  1    a    b    c    1    2    3
## 2:  2    d    e    f    4    5    6
## 3:  3    g    h    i    7    8    9
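Because cSplit_f needs the same number of delimiters in every row, it can help to check a column before choosing between cSplit and cSplit_f. The following is a minimal base-R sketch of such a check; is_balanced is a hypothetical helper written for this example, not a function from the package, and it does not handle missing values.

```r
# Hypothetical helper (not part of splitstackshape): TRUE when every value
# in x contains the same number of occurrences of the delimiter sep.
is_balanced <- function(x, sep) {
  hits <- gregexpr(sep, as.character(x), fixed = TRUE)
  # gregexpr() reports -1 for "no match", so count only positive positions
  n_delims <- vapply(hits, function(h) sum(h > 0), integer(1))
  length(unique(n_delims)) == 1
}

balanced <- c("1|2|3", "4|5|6", "7|8|9")
ragged   <- c("1|2", "3|4|5|6", "7|8")

is_balanced(balanced, "|")   # every row has two delimiters
is_balanced(ragged, "|")     # rows have one, three and one delimiters
```

If the check fails for any column you want to split, cSplit is the safer choice, since it pads ragged rows with NA.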
stratified
Great for taking quick stratified random samples from a data.frame or a data.table. The sample can be either a fixed size per group or proportional to the group size.
set.seed(1)
dat <- data.frame(ID = 1:20,
                  A = sample(c("AA", "BB"), 20, replace = TRUE),
                  B = rnorm(20),
                  C = abs(round(rnorm(20), digits=1)),
                  D = sample(c("CA", "NY", "TX"), 20, replace = TRUE),
                  E = sample(c("M", "F"), 20, replace = TRUE))
dat
##    ID  A           B   C  D E
## 1   1 AA  1.51178117 1.4 NY F
## 2   2 AA  0.38984324 0.1 NY M
## 3   3 BB -0.62124058 0.4 CA M
## 4   4 BB -2.21469989 0.1 TX M
## 5   5 AA  1.12493092 1.4 NY F
## 6   6 BB -0.04493361 0.4 CA M
## 7   7 BB -0.01619026 0.4 CA F
## 8   8 BB  0.94383621 0.1 NY M
## 9   9 BB  0.82122120 1.1 TX M
## 10 10 AA  0.59390132 0.8 NY F
## 11 11 AA  0.91897737 0.2 TX F
## 12 12 AA  0.78213630 0.3 TX M
## 13 13 BB  0.07456498 0.7 NY M
## 14 14 AA -1.98935170 0.6 NY F
## 15 15 BB  0.61982575 0.7 CA F
## 16 16 AA -0.05612874 0.7 CA F
## 17 17 BB -0.15579551 0.4 TX F
## 18 18 BB -1.47075238 0.8 CA F
## 19 19 AA -0.47815006 0.1 NY F
## 20 20 BB  0.41794156 0.9 NY F

stratified(dat, "A", 2) ## Two from each group of A
##    ID  A           B   C  D E
## 1: 14 AA -1.98935170 0.6 NY F
## 2: 11 AA  0.91897737 0.2 TX F
## 3:  6 BB -0.04493361 0.4 CA M
## 4: 20 BB  0.41794156 0.9 NY F

stratified(dat, "E", .3) ## 30% sample from each group in column E
##    ID  A          B   C  D E
## 1: 17 BB -0.1557955 0.4 TX F
## 2: 11 AA  0.9189774 0.2 TX F
## 3:  5 AA  1.1249309 1.4 NY F
## 4: 15 BB  0.6198257 0.7 CA F
## 5:  2 AA  0.3898432 0.1 NY M
## 6: 12 AA  0.7821363 0.3 TX M

# Stratified by column D but only use rows where column E == "F"
stratified(dat, "D", .4, select = list(E = "F"))
##    ID  A           B   C  D E
## 1: 16 AA -0.05612874 0.7 CA F
## 2: 15 BB  0.61982575 0.7 CA F
## 3:  5 AA  1.12493092 1.4 NY F
## 4: 10 AA  0.59390132 0.8 NY F
## 5: 17 BB -0.15579551 0.4 TX F
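The row counts in the proportional examples are consistent with rounding (group size × proportion) to the nearest whole number: column E has 12 F rows and 8 M rows, and 12 × 0.3 = 3.6 and 8 × 0.3 = 2.4 give the 4 and 2 rows shown above. The base-R sketch below reproduces that arithmetic and draws row indices accordingly. The rounding rule is an assumption inferred from the output shown here, and the code is an illustration of proportional stratified sampling in general, not the package's implementation.

```r
# Group labels copied from column E of the example data above
E <- c("F", "M", "M", "M", "F", "M", "F", "M", "M", "F",
       "F", "M", "M", "F", "F", "F", "F", "F", "F", "F")

# Number of rows available in each group (a named integer vector)
group_sizes <- vapply(split(E, E), length, integer(1))

# Assumption: proportional sizes are rounded to the nearest whole row
n_per_group <- round(group_sizes * 0.3)
n_per_group

# Draw that many row indices from each group, without replacement
set.seed(42)
picked <- unlist(lapply(names(n_per_group), function(g) {
  rows <- which(E == g)
  rows[sample.int(length(rows), n_per_group[[g]])]
}))
table(E[picked])
```

The specific rows picked will differ from the stratified() output above, since the random draws are made differently; only the per-group counts are meant to match.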
What else is new?
cSplit has replaced splitstackshape:::read.concat (but the read.concat function is still included).
Reshape has been made faster (more like Stacked and merged.stack), and for the most part, the id.vars should now be optional in all of these functions.
getanID and expandRows have been added as utility functions.
concat.split.list and concat.split.expanded can now be called with cSplit_l and cSplit_e instead.
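For readers wondering what the two new utility functions are for, the ideas behind them can be sketched in base R: repeating each row according to a count column, and numbering rows within groups whose ID values are no longer unique. This is only a rough illustration of those two ideas under that reading of the functions' purpose; it is not the package's code, and the data and column names are invented for the example.

```r
# Invented example data: each row should be repeated `count` times
df <- data.frame(name = c("x", "y", "z"), count = c(2L, 1L, 3L))

# Row expansion: repeat each row index as many times as its count
expanded <- df[rep(seq_len(nrow(df)), df$count), "name", drop = FALSE]
rownames(expanded) <- NULL

# Secondary ID: after expansion `name` is no longer unique, so number
# the rows within each value of `name`
expanded$.id <- ave(seq_along(expanded$name), expanded$name, FUN = seq_along)
expanded
```

See ?expandRows and ?getanID after loading the package for the actual arguments and behaviour.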
I’m expecting that there will be some rough edges, but hopefully nothing has been seriously broken! If you find anything, send your bug reports over to GitHub.