(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

Recently I stumbled on a problem with the split function applied to a list of factors: it can produce wrong splits when the splitting factors contain dots. Here is an example of the problem. Invoking the following code:

df <- data.frame(x = rep(c("a", "a.b"), 3), y = rep(c("b.c", "c"), 3), z = 1:6)

split(df, df[, -3])

produces:

$a.b.c

x y z

1 a b.c 1

2 a.b c 2

3 a b.c 3

4 a.b c 4

5 a b.c 5

6 a.b c 6

$a.b.b.c

[1] x y z

<0 rows> (or 0-length row.names)

$a.c

[1] x y z

<0 rows> (or 0-length row.names)

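All six rows land in a single group, so the splits are clearly incorrect. The cause is that split uses interaction to combine the list of factors passed to it, and interaction joins the level names with a dot by default, so the pairs ("a", "b.c") and ("a.b", "c") both end up labelled a.b.c. One can see this by invoking:

> interaction(df[, -3])

[1] a.b.c a.b.c a.b.c a.b.c a.b.c a.b.c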

Levels: a.b.c a.b.b.c a.c
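To make the collision concrete, here is a minimal sketch that builds the combined labels by hand with paste, using "." as the separator (the default separator of interaction). The variable names are only for illustration.

```r
# Two different (x, y) pairs taken from the example data
key1 <- paste("a", "b.c", sep = ".")
key2 <- paste("a.b", "c", sep = ".")

# Both pairs collapse to the same label, so they cannot be told apart
key1
key2
identical(key1, key2)
```

Once the labels are equal, the combined factor has no way of knowing that the rows came from different (x, y) combinations.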

This may not be a big issue in interactive use, but in production code such behavior is a real problem. There are three obvious ways to improve how split works:

- Rewrite the split internals to avoid the problem;
- Allow passing a sep parameter to split that would be passed on to interaction;
- Issue a warning if the number of levels in the combined factor does not equal the product of the numbers of levels of the individual factors (assuming drop = FALSE).
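The second and third ideas can be approximated in user code without touching split itself. The sketch below is only an illustration: split_checked is a hypothetical helper, not part of base R, and it assumes that the chosen separator does not occur in any level name. It calls interaction directly with a custom sep and warns when the number of combined levels is smaller than expected.

```r
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6)

# Hypothetical helper: combine the factors with a caller-chosen separator
# and warn when the combined level names collide.
split_checked <- function(data, factors, sep = ".") {
  factors <- lapply(factors, as.factor)
  combined <- interaction(factors, drop = FALSE, sep = sep)
  expected <- prod(vapply(factors, nlevels, integer(1)))
  if (nlevels(combined) != expected) {
    warning("combined factor has ", nlevels(combined),
            " levels, expected ", expected,
            "; level names probably collide for sep = \"", sep, "\"")
  }
  split(data, combined)
}

# With "|" as the separator the two groups are kept apart
res <- split_checked(df, df[, -3], sep = "|")
names(res)
sapply(res, nrow)
```

Calling split_checked(df, df[, -3]) with the default separator still returns the wrong split, but now with a warning, which is the behavior the third idea asks for.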

#Workaround (relies on x and y being factors, so that as.integer returns the level codes)

split(df, lapply(df[, -3], as.integer))

#Alternative 1

by(df, df[, -3], identity)

#Alternative 2

library(plyr)

dlply(df, .(x, y))