Specifying Variables in R

September 25, 2012
By

(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

R has several ways to specify which variables to use in an analysis. Some of the most frustrating errors can result from not understanding the order in which R searches for variables. This post demonstrates that order, hopefully smoothing your future use of R.

If all your variables are vectors in your workspace, using them in an analysis is easy: simply name them. For example, you could build a linear model (regression) using the lm function like this:

lm(y ~ x)

However, data frames exist for a good reason. They help organize variables and keep the values of each observation (the rows) locked together. For example, when you sort a data frame, all the rows of a data frame are moved, not just the single variable you’re sorting on. Once variables are stored in a data frame however, referring to them gets more complicated. R can include variables from multiple places (e.g. two data frames or a data frame and the workspace) so it becomes important to know your options and how R views them.

You can specify the names of both a data frame and a variable using the compound forms mydata$myvar or mydata["myvar"]. However, that often means that you have to type the name of the data frame quite a lot. If you use the form “with(mydata,…” then R will look in that data frame for the “short” variable names before it looks elsewhere, like in your workspace. That allows you to type the data frame name only once per function call, but in a long program you would still end up typing it a lot. Modeling functions in R often let you specify “data = mydata” allowing you to use short variable names in formulas like “y ~ x”. The result is like the “with” function, you must type the data frame name once per function call. (SAS users take note: variables used outside of formulas will not be found with this approach!) Finally, you can attach the data frame with “attach(mydata)”. This copies the variables into a temporary space that lets you then refer to them by their short names. This has the big advantage of allowing all the following function calls to use short variable names. Unfortunately, it has the big disadvantage of being confusing. Confusion #1 is that people feel that variables they create will go into the data frame automatically; they will not. Unless you specify a data frame using either mydata$newvar or mydata["newvar"], new variables are created in your workspace. Confusion #2 is that R will look in your workspace before it looks at the attached versions of variables. So if variables with the same names exist there, those will be used instead. Confusion #3 is that even though detach(mydata) will reverse the process, if you run your program multiple times, you may have attached the data multiple times and detaching once does not fully undo the attached state. As confusing at that is, I use attach frequently and rarely get burned by it.

For example, with variables x and y stored in mydata (and nowhere else) you could do a linear regression model using any one of these approaches:

lm(mydata$y ~ mydata$x)

lm(mydata["y"] ~ mydata["x"])

with(mydata, lm(y ~ x))

lm(y ~ x, data = mydata)

attach(mydata)
lm(y ~ x)

As if that weren’t complicated enough, both x and y do not have to both be in the same data frame! The x variable could be in mydata and the y variable could be in the workspace or in an attached version of mydata or some other data frame. That would be dangerous, of course, since it would be up to you to ensure that the values of each observation match or the resulting model would be nonsense. However, this kind of flexibility can also be very useful.

With all this flexibility, it’s important to know the order in which R chooses variables. A simple example can show us the order R uses. Here I am creating four data frames whose x and y variables will have a slope that is indicated by the data frame name. For example, the variables in df10 have a slope of 10. This will make it easy for us to see which version of the variables R is using.

> y <- c(1,2,3,4,5,6,7,8,9,10)
> x <- c(1,2,5,5,5,5,5,8,9,10)
> df1    <- data.frame(x, y)
> df10   <- data.frame(x, y = y*10  )
> df100  <- data.frame(x, y = y*100 )
> df1000 <- data.frame(x, y = y*1000)
> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Notice that I have deleted the original x and y variables so at the moment, varibles x and y exist only within the data frames. Running a regression with lm(y ~ x) will not work since R does not look into data frames unless you tell it to. Even if it did, it would have no way to know which set of x’s and y’s to use. Next I will take two different approaches to “selecting” a data frame. I attach df1 and copy the variables from df10 into the workspace.

> attach(df1)
> y <- df10$y > x <- df10$x

Next, I do something rarely useful, calling a linear model using both “with” and “data=”. Which will dominate?

> with(df100, lm(y ~ x, data = df1000))

Call:
lm(formula = y ~ x, data = df1000)

Coefficients:
(Intercept)            x
0         1000

Since the slope is 1000, it’s clear that the “data=” argument was dominant. So R would look there first. If it found both x and y, it would stop looking. But if it only found one variable, it would continue to look elsewhere for the other. If the other variable where in the “with” data frame, it would then use it.

Next I’ll remove the “data” argument and see what happens.

> with(df100, lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
0          100

This time the “with” data frame was used for both variables. If variable either had not been in that data frame, R would have continued to look in the workspace and in the attached copy. But which would it use first? Next, I’m not specifying a data frame at all.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
0           10

The slope of 10 tells us that it found the copies of x and y that I copied from df10 into the workspace. Let’s delete those variables and list the objects in our workspace to ensure that they’re gone.

> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Both x and y are clearly gone. So lets see if we can still use them.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
0            1

We deleted x and y but we can still use them! However, we see from the slope of 1 that R has used a different pair of x and y variables. They’re the ones that were copied to my search path when I used “attach(myDf1)”. I had to remember that I had attached them. It’s this kind of confusion that makes many R users avoid using attach. Finally, I’ll detach df1 and see what happens.

> detach(df1)
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found

Now, even though all the data frames in our workspace contain an x and y variable, R does not look inside to find any of them. Even if it did, it would have no way of know which to choose.

We have seen that R looks in various places for variables. In order, they are: what you specify in “data=”, using “with(mydata,…”, your workspace and finally attached copies of your data frame. The most recently attached copies are the ones it will use first. I hope this will help you use R with both less typing and less confusion.