You should not use split in production code

June 7, 2012

(This article was first published on R snippets, and kindly contributed to R-bloggers)

Recently I stumbled on a problem with the split function when it is applied to a list of factors: it can produce wrong splits when the levels of the splitting factors contain dots.
Here is an example of the problem. Invoking the following code:

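# stringsAsFactors = TRUE keeps x and y as factors (the default before R 4.0.0)
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)
# split by the first two columns (x and y)
split(df, df[, -3])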

produces:

$a.b.c
    x   y z
1   a b.c 1
2 a.b   c 2
3   a b.c 3
4 a.b   c 4
5   a b.c 5
6 a.b   c 6
$a.b.b.c
[1] x y z
<0 rows> (or 0-length row.names)
$a.c
[1] x y z
<0 rows> (or 0-length row.names)

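We can see that incorrect splits were produced: all six rows land in a single group, even though the data contain two distinct (x, y) combinations. The cause is that split uses interaction to combine the list of factors passed to it, and interaction builds the combined level names by pasting the original levels together with a dot. The pairs ("a", "b.c") and ("a.b", "c") therefore both become "a.b.c". One can see this problem by invoking: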

> interaction(df[, -3])
[1] a.b.c a.b.c a.b.c a.b.c a.b.c a.b.c
Levels: a.b.c a.b.b.c a.c
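
The collision can be reproduced without split at all. The following is a minimal sketch that pastes the level combinations by hand; the variable names x, y and labels are chosen here for illustration, and the "|" separator is only an assumption about a character that does not occur in any level:

```r
x <- factor(c("a", "a.b"))
y <- factor(c("b.c", "c"))

# All combinations of levels, joined with "." (the default separator of interaction)
labels <- as.vector(outer(levels(x), levels(y), paste, sep = "."))
print(labels)
# "a.b.c" "a.b.b.c" "a.c" "a.b.c"   (the first and last labels coincide)

# TRUE when at least two combinations share a label
print(anyDuplicated(labels) > 0)

# With a separator that appears in no level, all four combinations stay distinct
print(nlevels(interaction(x, y, sep = "|")))
```
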

This might not be a huge issue in interactive use, where the wrong grouping is easy to spot, but in production code a silently incorrect result is a serious problem. There are three obvious ways to improve how split works:

  1. Rewrite the split internals to avoid this problem;
  2. Allow passing a sep parameter to split that would be passed on to interaction;
  3. Issue a warning if the number of levels in the combined factor does not equal the product of the numbers of levels of the factors being combined (assuming drop = FALSE).
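
The second and third ideas can be approximated outside of split itself. The sketch below uses a hypothetical helper, split_checked, which is not part of base R or plyr; it combines the factors with a caller-supplied separator and stops (rather than warns, a choice made here for safety) when the level count does not match the expected product:

```r
# Hypothetical helper for illustration only; not part of base R or plyr.
split_checked <- function(data, factors, sep = ".") {
  factors <- lapply(factors, as.factor)
  # Combine the factors, keeping unused combinations (drop = FALSE)
  combined <- interaction(factors, sep = sep, drop = FALSE)
  # With drop = FALSE there should be one level per combination of input levels
  expected <- prod(vapply(factors, nlevels, integer(1)))
  if (nlevels(combined) != expected) {
    stop("combined factor has ", nlevels(combined), " levels, expected ",
         expected, "; choose a different 'sep'")
  }
  split(data, combined)
}

df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6)

# The default separator collides on this data, so the check fails
res_dot <- try(split_checked(df, df[, -3]), silent = TRUE)
print(inherits(res_dot, "try-error"))

# A separator absent from all levels passes the check
res_bar <- split_checked(df, df[, -3], sep = "|")
print(names(res_bar))
```

The check only detects collisions between level labels; the caller still has to pick a separator that does not occur in the data.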

Until this issue is solved, there is a workaround that still uses split, and two alternatives that use by and dlply (from the plyr package):

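# Workaround: split on the integer codes of the factors, which contain no dots
# (assumes the columns x and y are factors)
split(df, lapply(df[, -3], as.integer))
# Alternative 1
by(df, df[, -3], identity)
# Alternative 2
library(plyr)
dlply(df, .(x, y))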
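
To see what the workaround returns, here is a small check on the same example data. It is a sketch under two assumptions that go beyond the code above: the columns are explicitly created as factors via stringsAsFactors = TRUE, and drop = TRUE is added to discard the empty combinations.

```r
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)

# Split on integer codes; drop = TRUE removes combinations with no rows
groups <- split(df, lapply(df[, -3], as.integer), drop = TRUE)

# Group names are built from the integer codes, not from the original labels
print(names(groups))

# The z values that ended up in each group
print(lapply(groups, function(g) g$z))
```

The two (x, y) combinations are now separated correctly, at the cost of group names that no longer show the original factor levels.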
