Packing everything into a data.frame

August 23, 2010

(This article was first published on Struggling Through Problems » R, and kindly contributed to R-bloggers)

OK, I know I talk about R too much, but I like R, so I’m going to talk about it some more.

Common situation: you repeat a procedure many times; each run generates some large wadge of awfully structured data, and in the end you’d like to go back and look at it all.

OK, sounds reasonably simple, just

lapply(1:Num.Trials, function(N) {
    ...
    list(
        A = ...,
        B = ...
    )
})

and you’ve got a list of structs containing that data. It works, but I find it undesirable for a few reasons:

1. A list of lists is cumbersome to navigate. You have to subscript the first list before the second.
2. You can’t do nice data.frame things with it like plot(…, data=…). Basically, it should be a data.frame, because data.frames are pretty.
3. Having to explicitly put everything into the struct there at the end forces you to choose what gets remembered and what gets dropped. Rarely do I have such foresight.

So to get a data.frame, we can use the magic of sapply. Like this:

as.data.frame(t(sapply(1:N, function(I) {
    ...
    list(
        A = ...,
        B = ...
    )
})))

I have to admit I don’t actually know why sapply is smart enough to do this, but it turns the whole shebang into a matrix of mode “list”. t() transposes that matrix so the fields A, B… become the columns. as.data.frame() makes the whole thing a data frame. Excellent.
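For what it's worth, the intermediate shapes are easy to inspect. Here is a minimal sketch, assuming a made-up trial function one.trial (a hypothetical stand-in, not part of the procedure above) that returns two named fields. It only walks through the sapply / t / as.data.frame chain and makes no claim about what sapply does internally.

```r
# Hypothetical trial function: two named fields per trial.
one.trial = function(I) {
  list(A = I^2, B = paste("trial", I))
}

M = sapply(1:3, one.trial)   # fields in rows, trials in columns
dim(M)                       # 2 3
mode(M)                      # "list"
rownames(M)                  # "A" "B"

D = as.data.frame(t(M))      # trials in rows, fields in columns
colnames(D)                  # "A" "B"
sapply(D, is.list)           # TRUE for both columns
```

As far as I can tell, this depends on every trial returning the same number of fields. If the lengths differ, sapply hands back a plain list and t() then fails because it has no matrix to transpose.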

Well, there’s a little problem here. I didn’t realize this at first, but a data.frame is just a list() of columns plus some attributes() attached. And those columns are welcome to be of mode “list”, as they will be here. In one way that’s actually really convenient, because you can stick complex stuff inside a data.frame, as in, like anything, even whole other data.frames. But you can’t usefully call mean() or sd() or acf() on a vector of mode “list” (mean(), for one, just returns NA with a warning). Inconvenient.
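To make both halves of that concrete, here is a small sketch. The data frame D and its columns Id, A and Extra are invented for illustration; the list columns are attached with `$<-` after the frame is created.

```r
D = data.frame(Id = 1:3)
D$A = list(1.5, 2.5, 3.5)                           # plain numbers, but stored as a list
D$Extra = list(1:10, "text", data.frame(X = 1:2))   # a column can hold nearly anything

is.list(D$A)                         # TRUE
Bad = suppressWarnings(mean(D$A))    # NA; the warning says the argument is not numeric
Good = mean(unlist(D$A))             # 2.5 once the column is flattened by hand
```

The convenience and the inconvenience are the same feature: R accepts the Extra column without complaint, and for the same reason it cannot assume that A is numeric.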

(By the way, is there any other language in which every object has a type, a mode and a class, all of which mean different things? What is up with that?)

So the solution is this “clean” function, to convert, where possible, vectors of mode “list” to numeric or character vectors.

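clean = function(Data.Frame) {
    # TRUE if X is an atomic vector (not a list) holding exactly one value
    is.one = function(X) {
        is.atomic(X) && (length(X) == 1)
    }

    # TRUE if every element of the column passes is.one
    is.good = function(Col) {
        all(sapply(Col, is.one))
    }

    # Flatten each column that qualifies; leave the others as lists
    for (Col.Name in colnames(Data.Frame)) {
        Col = Data.Frame[[Col.Name]]
        if (is.good(Col)) {
            Data.Frame[[Col.Name]] = unlist(Col)
        }
    }

    Data.Frame
}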

Basically, for each column, check that all the elements are atomic vectors (i.e. not lists) of length 1; if so, flatten (“unlist”) the column. Columns that fail the check are left as lists.
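A quick check of that behaviour, as a sketch: the function is repeated in compact form so the snippet runs on its own, and the data frame D with columns Id, A and B is made up for the occasion.

```r
# Compact restatement of the clean function described above.
clean = function(Data.Frame) {
  is.one = function(X) is.atomic(X) && (length(X) == 1)
  is.good = function(Col) all(sapply(Col, is.one))
  for (Col.Name in colnames(Data.Frame)) {
    Col = Data.Frame[[Col.Name]]
    if (is.good(Col)) Data.Frame[[Col.Name]] = unlist(Col)
  }
  Data.Frame
}

D = data.frame(Id = 1:3)
D$A = list(1, 2, 3)        # every element is a single number: flattenable
D$B = list(1:2, "x", 3)    # first element has length 2: must stay a list

C = clean(D)
is.numeric(C$A)            # TRUE
is.list(C$B)               # TRUE
mean(C$A)                  # 2
```

One thing to keep in mind: unlist coerces to a common type, so a column that mixes single numbers and single strings would pass the check and come out as a character vector.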

And lastly, how about automatically grabbing everything you created along the way? Just end the body of each trial function with

as.list(environment())
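As a sketch of what that captures, assume a hypothetical trial function one.trial whose variables X and M are invented for illustration:

```r
one.trial = function(I) {
  X = rnorm(10)            # an intermediate value
  M = mean(X)              # a summary of it
  as.list(environment())   # returns every local variable, plus the argument I
}

R = one.trial(1)
sort(names(R))             # "I" "M" "X"
length(R$X)                # 10
```

Everything local to the function comes back, whether you meant to keep it or not. Here M is a single number, while X has ten elements and so would have to stay a list column in the final data frame.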

Putting this all together, we have:

do.trials = function(N, Func) {
    clean(as.data.frame(t(sapply(1:N, Func))))
}
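Here is a minimal end-to-end sketch of how this might be used. The trial body (X, Mean, SD) is invented for illustration, and clean is restated in compact form so the snippet runs on its own.

```r
clean = function(Data.Frame) {
  is.one = function(X) is.atomic(X) && (length(X) == 1)
  is.good = function(Col) all(sapply(Col, is.one))
  for (Col.Name in colnames(Data.Frame)) {
    Col = Data.Frame[[Col.Name]]
    if (is.good(Col)) Data.Frame[[Col.Name]] = unlist(Col)
  }
  Data.Frame
}

do.trials = function(N, Func) {
  clean(as.data.frame(t(sapply(1:N, Func))))
}

set.seed(1)                      # only to make the sketch reproducible
Results = do.trials(5, function(I) {
  X = rnorm(20)                  # raw data for this trial
  Mean = mean(X)
  SD = sd(X)
  as.list(environment())         # keep I, X, Mean and SD
})

nrow(Results)                    # 5
is.numeric(Results$Mean)         # TRUE: flattened by clean
is.list(Results$X)               # TRUE: each cell holds a vector of 20 numbers
mean(Results$SD)                 # ordinary vector functions now work
```

From here plot(SD ~ Mean, data = Results) works as for any data frame. Note that this sketch relies on as.list(environment()) returning the fields in the same order on every trial. I believe that holds for an ordinary function body, but it is an assumption rather than something I have seen promised.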
