R Tip: Put Your Values in Columns

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Today’s R tip is: put your values in columns.

Some R users rely on seemingly clever tricks to bring data into an analysis.

Here is an (artificial) example.

chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  

Notice: one of the variables came from a vector in the environment, not from the primary data.frame. chamber_sizes was first looked for in the data.frame, then in the environment where the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).
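
To make the hazard concrete, here is a small sketch (not from the original analysis) in which two formulas with identical text fit different models, because each one carries a different environment. The function make_form() and the made-up vector inside it are hypothetical names introduced only for this illustration.

```r
# Sketch: the same formula text can resolve to different data,
# depending on the environment in which the formula was created.

# Global vector, as in the example above.
chamber_sizes <- mtcars$disp / mtcars$cyl
form_global <- hp ~ chamber_sizes

# Hypothetical helper: builds the "same" formula, but inside a function
# that happens to have its own local chamber_sizes (32 made-up values,
# one per row of mtcars).
make_form <- function() {
  chamber_sizes <- rep(c(1, 2), 16)
  hp ~ chamber_sizes
}
form_local <- make_form()

# The two formulas print identically...
same_text <- identical(deparse(form_global), deparse(form_local))

# ...but lm() finds a different chamber_sizes for each one,
# so the fitted coefficients differ.
coef_global <- coef(lm(form_global, data = mtcars))
coef_local <- coef(lm(form_local, data = mtcars))
same_fit <- isTRUE(all.equal(coef_global, coef_local))

print(same_text)
print(same_fit)
```

Nothing in the printed formula or in the lm() call tells the reader which chamber_sizes was used; that information lives only in the formula's hidden environment.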

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous that all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

mtcars$chamber_sizes <- mtcars$disp/mtcars$cyl
form <- hp ~ chamber_sizes
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
#
# Coefficients:
#   (Intercept)  chamber_sizes  
#         2.937          4.104  

The only difference is we took the time to place the derived vector into the data frame we are working with (assigned to mtcars$chamber_sizes instead of the global environment in the first line). This is a very organized way to work, and as you see it does not take much effort.

Or use only existing values, as we show below.

form <- hp ~ I(disp/cyl)
model <- lm(form, data = mtcars)
print(model)
# Call:
# lm(formula = form, data = mtcars)
# 
# Coefficients:
# (Intercept)  I(disp/cyl)  
#       2.937        4.104  

This is something we teach: with some care you can reliably treat variables as strings, and this is in no way inferior to complex systems such as stats::formula or rlang::quosure. The fact that these objects carry around an environment in addition to the names is in fact a barrier to reliable code, not an unmitigated advantage.
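
As a minimal sketch of what "variables as strings" can look like in practice: the column names are held in ordinary character vectors, checked against the data.frame, and only then turned into a formula with stats::reformulate(). The names d, outcome, and predictors are our own choices for this illustration, not part of any API.

```r
# Sketch: treat model variables as plain strings naming columns.

# Work on a copy so the derived column lives with the rest of the data.
d <- mtcars
d$chamber_sizes <- d$disp / d$cyl

# Variables are just column names held in character vectors.
outcome <- "hp"
predictors <- c("chamber_sizes")

# Confirm every name is a column of d before any modeling happens.
stopifnot(all(c(outcome, predictors) %in% colnames(d)))

# Build the formula from the strings and fit.
form <- reformulate(predictors, response = outcome)
model <- lm(form, data = d)
print(model)
```

Because the check runs before the fit, a missing column is reported as an error up front instead of being silently resolved from some enclosing environment.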

I am not alone in this opinion.

If the formula was typed in by the user interactively, then the call came from the global environment, meaning that variables not found in the data frame, or all variables if the data argument was missing, will be looked up in the same way they would in ordinary evaluation. But if the formula object was precomputed somewhere else, then its environment is the environment of the function call that created it. That means that arguments to that call and local assignments in that call will define variables for use in the model [...] parent (that is, enclosing) environment of the call, which may be a package namespace. These rules are standard for R, at least once one knows that an environment attribute has been assigned to the formula. They are similar to the use of closures described in Section 5.4, page 126.

Where clear and trustworthy software is a priority, I would personally avoid such tricks. Ideally, all the variables in the model frame should come from an explicit, verifiable data source, typically a data frame object that is archived for future inspection (or equivalently, some other equally well-defined source of data, either inside or outside R, that is used explicitly to construct the data for the model).

Software for Data Analysis (Springer 2008), John M. Chambers, Chapter 6, section 9, page 221.

Chambers’ critique applies equally to stats::formula and rlang::quosure; roughly, he is calling over-use of such environment-carrying objects an anti-pattern.

This is why we say that, from the user’s point of view, variables can be treated as mere names or strings. With some care you can ensure all your values are coming from a single data.frame. And if that is the case, variables are column names.
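
One way to exercise that care is to check a formula against the data.frame before fitting. The helper below, vars_missing_from_frame(), is a hypothetical name of our own; it leans on base R's all.vars() and is only a sketch (for example, it would flag the "." shorthand in a formula such as hp ~ . as missing).

```r
# Sketch: list formula variables that are NOT columns of the data.frame.
# vars_missing_from_frame() is a hypothetical helper, not a library function.
vars_missing_from_frame <- function(form, d) {
  setdiff(all.vars(form), colnames(d))
}

form <- hp ~ chamber_sizes

# The stock dataset has no chamber_sizes column, so the name is reported.
missing_before <- vars_missing_from_frame(form, datasets::mtcars)

# After placing the derived values in a column, nothing is missing.
d <- datasets::mtcars
d$chamber_sizes <- d$disp / d$cyl
missing_after <- vars_missing_from_frame(form, d)

# Formulas over existing columns also pass, since all.vars() looks inside I().
missing_inline <- vars_missing_from_frame(hp ~ I(disp/cyl), datasets::mtcars)

print(missing_before)
print(missing_after)
print(missing_inline)
```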

Going to extra effort to carry around bound variables (variable names, plus an environment resolving the name to a value) is silly and a big source of reference leaks. Roughly: if you don’t know the value of a variable, then pass it as a name or string (as that is all an unbound variable or symbol is); if you do know the value, then use that value (the variable is serving little purpose at that point). Being able to replace variables with values is the hallmark of referential transparency: the property of expressions that are well-behaved in the sense that replacing them with the values they refer to does not change observable program behavior. There is code that breaks when you replace variables with values, but that should be considered a limitation of such code (not a merit).
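
The reference-leak point can be seen in a few lines. In this sketch, make_leaky_form() and big_scratch are hypothetical names; the formula mentions only hp and disp, yet it keeps the whole function environment (including an unrelated large vector) reachable for as long as the formula exists.

```r
# Sketch: a formula created inside a function retains that function's
# environment, including objects the formula never mentions.
make_leaky_form <- function() {
  big_scratch <- numeric(1e6)  # unrelated working data, about 8 MB
  hp ~ disp
}

f <- make_leaky_form()

# The formula text only names hp and disp...
print(all.vars(f))

# ...but its environment still holds big_scratch.
leaked_names <- ls(environment(f))
print(leaked_names)
```

A plain character vector such as c("hp", "disp") carries no such baggage, which is part of why we prefer passing names as strings.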
