Passing columns of a dataframe to a function without quotes
I love the syntax of calls to lm and ggplot, wherein the dataframe is specified as a variable and specific columns are referenced as though they were separate variables. While developing some of my functions, I’d wanted to introduce something similar. I often find that I have a single large dataframe and want to apply the same function to many columns. I wanted the ability to do this interactively, which ruled out the brute force method of something like lapply. The resulting code in the called function was always a bit messy: I'd pass in a character string or position for the column and then write something like df[,MyColName]. Actually, looking at it now, it seems fairly straightforward. I suppose I just didn’t like the green colored font that RStudio uses for strings and wanted to know how it was done. If that smells like a caveat, it is. I’m not 100% certain of the purity of this convention and am open to other views and suggestions.
Turns out the answer is straightforward and relies on the eval function. eval lets you specify the environment in which an expression is evaluated, and that environment may be a dataframe. Here’s a very simple example, which sums the values in a column of a dataframe.
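someFunction = function(y, data) {
  # Capture the call with its arguments still unevaluated
  arguments <- as.list(match.call())
  # Evaluate the expression passed as y, looking in data first
  y = eval(arguments$y, data)
  sum(y)
}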
First, we pull the arguments out using match.call(). I’ll be honest. I read up on that last week until my brain melted. Here’s more or less what it amounts to. match.call() returns a call object in which every argument is matched to its name in the function signature and left unevaluated. This means that the arguments exist as quoted expressions. A quoted expression describes your variable and sits around waiting to be evaluated. Here, we’re grabbing them before anything else happens so that we can control how that evaluation happens. By default, eval evaluates in the environment it was called from, unless we tell it to use something else. In this case, we tell it to use the dataframe that we’ve passed in. This allows us to do something cool like the following:
myData = data.frame(A = c(1,2,3), B = c(10,9,8))
someFunction(A, data=myData)   # 6
someFunction(B, data=myData)   # 27
someFunction(A)                # error: no dataframe supplied
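To see what match.call() and eval are doing inside someFunction, it can help to look at the captured pieces directly. The following is a small sketch for exploration only; showCall is a hypothetical helper that is not part of the original code, and it does nothing except return the matched call as a list.

```r
# Hypothetical helper (not from the post): return the matched call as a list
# so we can inspect what the function actually receives.
showCall <- function(y, data) {
  as.list(match.call())
}

myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))

captured <- showCall(A, data = myData)
names(captured)     # "" (the function name itself), "y", "data"
class(captured$y)   # "name": the symbol A, not the values in the column

# Evaluating the symbol with the dataframe as the environment
# looks A up among the columns of myData.
colA <- eval(captured$y, myData)

# The same works for a whole expression, which is captured as a call.
capturedExpr <- showCall(A + B, data = myData)
class(capturedExpr$y)                        # "call"
colSum <- eval(capturedExpr$y, myData)       # element-wise A + B
```

The point of the sketch is that nothing in y has been evaluated at the moment match.call() runs, so the function is free to decide where the lookup happens.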
So that’s loads of fun and I love how the function calls look. I also like that I get an error if I try to pass in a column without specifying the dataframe. However, beware. There’s nothing that insists that the first argument to the function must live in the dataframe. Note what happens when we pass in something else:
X = c(1,2,3,4,5,6)
someFunction(X)                # error: no dataframe supplied
someFunction(X, data=myData)   # 21: X is not a column, so the variable X is used
This may not be catastrophic, but it’s probably a situation we’d want to be informed of, at least via a warning. I went to the trouble of creating the dataframe and passing it into a function, so I’d like to know if it’s being ignored. Things get murkier when a variable and a column share a name. If I create a variable called A, then someFunction(A) still fails, because eval needs the data argument. Meanwhile, someFunction(A, data=myData) quietly uses the column labelled A in the dataframe and ignores my new variable. Try the following:
A = c(1,2)
someFunction(A)
someFunction(A, data=myData)
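One possible way to get told when the first argument is not actually a column is to check the captured name against names(data) before doing anything else. This is only a sketch under my own assumptions, not a settled answer: strictSum and its error messages are hypothetical, and it uses substitute() rather than match.call() to grab the unevaluated argument.

```r
# Hypothetical stricter variant (an assumption, not the post's final code):
# y must be a bare column name that exists in data.
strictSum <- function(y, data) {
  if (missing(data)) {
    stop("'data' must be supplied")
  }
  yExpr <- substitute(y)   # the unevaluated argument, e.g. the symbol A
  if (!is.name(yExpr)) {
    stop("'y' must be a bare column name")
  }
  yName <- as.character(yExpr)
  if (!(yName %in% names(data))) {
    stop("column '", yName, "' not found in 'data'")
  }
  sum(data[[yName]])
}

myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))
A <- c(1, 2)
X <- c(1, 2, 3, 4, 5, 6)

fromColumn <- strictSum(A, data = myData)   # uses the column A, not the variable A

# X exists as a variable but is not a column, so this now fails loudly.
notAColumn <- tryCatch(
  strictSum(X, data = myData),
  error = function(e) conditionMessage(e)
)
```

The trade-off is that this version rejects expressions such as A + B, which the eval approach handles happily. Swapping stop() for warning() and falling back to eval would keep that flexibility while still flagging the situation.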
I’m still monkeying around with this, trying to sort out what looks right and is most robust. As always, other views are welcome.