Packing everything into a data.frame

[This article was first published on Struggling Through Problems » R, and kindly contributed to R-bloggers.]

OK, I know I talk about R too much, but I like R, so I’m going to talk about it some more.

Common situation: you repeat a procedure many times; each run generates some large wodge of awkwardly structured data, and in the end you’d like to go back and look at it all.

OK, sounds reasonably simple, just

lapply(1:Num.Trials, function(N) {
    ...
    list(
        A = ...,
        B = ...
    )
})

and you’ve got a list of structs containing that data. It works, but I find it undesirable for a few reasons:

  1. A list of lists is cumbersome to navigate. You have to subscript the first list before the second.
  2. You can’t do nice data.frame things with it like plot(…, data=…). Basically, it should be a data.frame, because data.frames are pretty.
  3. Having to explicitly put everything into the struct there at the end forces you to choose what gets remembered and what gets dropped. Rarely do I have such foresight.

So to get a data.frame, we can use the magic of sapply. Like this:

as.data.frame(t(sapply(1:N, function(I) {
    ...
    list(
        A = ...,
        B = ...
    )
})))

I have to admit I don’t actually know why sapply is smart enough to do this, but when every call returns a list of the same length, it turns the whole shebang into a matrix of mode “list”, with one row per field and one column per trial. t() transposes that matrix so the fields A, B… become the columns. as.data.frame() makes the whole thing a data frame. Excellent.
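To see what each step produces, here is a minimal sketch with a made-up trial function. The name trial and the values it puts in A and B are placeholders for illustration only.

```r
# Hypothetical trial: returns two named fields, like the skeleton above.
trial <- function(I) {
    list(
        A = I^2,        # a single number
        B = letters[I]  # a single string
    )
}

M <- sapply(1:3, trial)
# M is a matrix of mode "list": one row per field, one column per trial.
print(dim(M))
print(mode(M))

D <- as.data.frame(t(M))
# After transposing, the fields are the columns and the trials are the rows.
print(colnames(D))
print(nrow(D))
```

Note that this relies on every trial returning the same number of fields; if the lengths differ, sapply gives back a plain list instead of a matrix.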

Well, there’s a little problem here. I didn’t realize this at first, but a data.frame is just a list() of columns plus some attributes() attached. And those columns are welcome to be of mode “list”, as they will be here. In one way that’s actually really convenient, because you can stick complex stuff inside a data.frame, as in, like anything, even whole other data.frames. But you can’t call mean() or sd() or acf() on a vector of mode “list”. Inconvenient.
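A small sketch of the inconvenience, using a hand-built list column (the column names id and A and the values are made up for the example):

```r
# Build a data.frame and attach a list column by hand.
D <- data.frame(id = 1:3)
D$A <- list(1, 4, 9)

print(is.list(D$A))         # the column is a list, not a numeric vector
print(is.list(unclass(D)))  # underneath, the data.frame is a list of columns

# sd() refuses a list, so catch the error instead of stopping the script.
result <- tryCatch(sd(D$A), error = function(e) "error")
print(result)

# Flattening the column first makes the usual functions work.
print(mean(unlist(D$A)))
```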

(By the way, is there any other language in which every object has a type, a mode and a class, all of which mean different things? What is up with that?)
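For the curious, here is a quick illustration of the three answers disagreeing, using two ordinary objects:

```r
# The same object can answer typeof(), mode() and class() differently.
x <- 1:3
print(c(typeof(x), mode(x), class(x)))

D <- data.frame(A = 1:3)
print(c(typeof(D), mode(D), class(D)))
```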

So the solution is this “clean” function, to convert, where possible, vectors of mode “list” to numeric or character vectors.

clean = function(Data.Frame) {
	is.one = function(X) {
		is.atomic(X) && (length(X) == 1)
	}

	is.good = function(Col) {
		all(sapply(Col, is.one))
	}

	for (Col.Name in colnames(Data.Frame)) {
		Col = Data.Frame[[Col.Name]]
		if (is.good(Col)) {
			Data.Frame[[Col.Name]] = unlist(Col)
		}
	}

	Data.Frame
}

Basically, check to see that all the elements are atomic vectors (i.e. not lists) of length 1; if so, flatten (“unlist”).
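Here is the same per-column test written out on its own, applied to three made-up columns (scalars, mixed and with.null are hypothetical names):

```r
# The check clean() applies to each column.
is.one = function(X) is.atomic(X) && (length(X) == 1)
is.good = function(Col) all(sapply(Col, is.one))

scalars   = list(1, 2, 3)        # every element is a single number
mixed     = list(1, c(2, 3), 4)  # one element has length 2
with.null = list(1, NULL, 3)     # one element is empty

print(is.good(scalars))
print(is.good(mixed))
print(is.good(with.null))

# Only the first column is safe to flatten.
print(unlist(scalars))
```

The length check matters: unlist() on mixed or with.null would return a vector whose length no longer matches the number of rows.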

And lastly, how about automatically grabbing everything you created along the way? Just end the body of each trial function with

as.list(environment())
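A minimal sketch of what that returns, with a made-up trial body (x and y are hypothetical intermediate values):

```r
# Hypothetical trial body.
trial <- function(I) {
    x <- I * 2
    y <- x + 1
    # Returns every variable in this call's environment, including the
    # argument I, as a named list.
    as.list(environment())
}

result <- trial(5)
print(sort(names(result)))
print(result$y)
```

Every trial needs to define the same set of variables, otherwise the results no longer line up as a matrix when sapply collects them.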

Putting this all together, we have:

do.trials = function(N, Func) {
	clean(as.data.frame(t(sapply(1:N, Func))))
}
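As a usage sketch, here is the whole thing run end to end. clean and do.trials are repeated in compact form so the block runs on its own; the trial body, with its Samples and Mean variables, is a made-up placeholder.

```r
# clean() and do.trials() as above, repeated so this block is self-contained.
clean = function(Data.Frame) {
    is.one = function(X) is.atomic(X) && (length(X) == 1)
    is.good = function(Col) all(sapply(Col, is.one))
    for (Col.Name in colnames(Data.Frame)) {
        Col = Data.Frame[[Col.Name]]
        if (is.good(Col)) {
            Data.Frame[[Col.Name]] = unlist(Col)
        }
    }
    Data.Frame
}

do.trials = function(N, Func) {
    clean(as.data.frame(t(sapply(1:N, Func))))
}

# Hypothetical trial: the variable names and computations are placeholders.
Results = do.trials(4, function(I) {
    Samples = (1:I) * 10   # length varies per trial, so it stays a list column
    Mean = mean(Samples)   # a single number, so it gets flattened
    as.list(environment())
})

print(is.numeric(Results$Mean))
print(is.list(Results$Samples))
print(Results$Mean)
```

Columns that clean() could flatten, such as Mean, now work with mean(), sd() and friends; columns like Samples stay as lists and still need subscripting.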
