(This article was first published on

**Econometrics by Simulation**, and kindly contributed to R-bloggers)It is common for researchers to be concerned about unobserved effects being correlated with observed explanatory variables.

For instance, if we were curious about the effect of meditation on emotional stability we may be concerned that there might be some unobserved factor such as personal genetics that might predict both likelihood to meditate and emotional stability.

In order to remove this potentially biasing effect we could think about taking measurements over multiple periods for the same individual inquiring about frequency of meditation and emotional stability.

If we observe that within the same individual, removing the time constant effects which (presumably) genetics is a component of that there is still a relationship between meditation and emotional stability, then we may feel on firmer ground as to our hypothesis that mediation may lead to more emotional stability.

In order to accomplish the goal of estimating this relationship we may experiment with a "fixed effects" model defined as:

$$y_{it}=x_{it}\beta + a_i+u_{it}$$

In this typical linear model with panel data, there is no problem including an arbitrary number of dummy variables. Let's see this in action.

Think of the problem as:

$$y_{it}=x_{it}\beta + a_i+u_{it}$$

So $$y_{it}-mean_t(y_i)=(x_{it}-mean_t(x_{i}))\beta + a_i-mean_t(a_i)+u_{it}-mean(u_i)$$

Because the unobservable effect is constant over time it drops out. And so long as their was a term controlling for the average unobservable effect (the dummy variables) then the average per person unobserved error must by definition be equal to zero.

thus: $$y_{it}-mean_t(y_i)=(x_{it}-mean_t(x_{i}))\beta + u_{it}$$

Created by Pretty R at inside-R.org

For instance, if we were curious about the effect of meditation on emotional stability we may be concerned that there might be some unobserved factor such as personal genetics that might predict both likelihood to meditate and emotional stability.

In order to remove this potentially biasing effect we could think about taking measurements over multiple periods for the same individual inquiring about frequency of meditation and emotional stability.

If we observe that within the same individual, removing the time constant effects which (presumably) genetics is a component of that there is still a relationship between meditation and emotional stability, then we may feel on firmer ground as to our hypothesis that mediation may lead to more emotional stability.

In order to accomplish the goal of estimating this relationship we may experiment with a "fixed effects" model defined as:

$$y_{it}=x_{it}\beta + a_i+u_{it}$$

In this typical linear model with panel data, there is no problem including an arbitrary number of dummy variables. Let's see this in action.

nperson <- 300 # Number of persons

nobs <- 3 # Number of observations per person

# In order for unobserved person effects to be a problem they must be

# correlated with the explanatory variable.

# Let's say: x = x.base + fe

fe.sd <- 1 # Spefify the standard deviation of the fixed effed

x.sd <- 1 # Specify the base standard deviation of x

beta <- 2

# First generate our data using the time constant effects

constantdata <- data.frame(id=1:nperson, fe=rnorm(nperson))

# We expand our data by nobs

fulldata <- constantdata[rep(1:nperson, each=nobs),]

# Add a time index, first define a group apply function

# that applies by group index.

gapply <- function(x, group, fun) {

returner <- numeric(length(group))

for (i in unique(group))

returner[i==group] <- get("fun")(x[i==group])

returner

}

# Using the generalized apply function coded above

fulldata$t <- gapply(rep(1,length(fulldata$id)),

group=fulldata$id,

fun=cumsum)

# Or a more simplified function

indexer <- function(group) {

returner <- numeric(length(group))

for (i in unique(group))

returner[i==group] <- 1:sum(i==group)

returner

}

# Is a special case of gapply

fulldata$t <- indexer(fulldata$id)

# Now we are ready to caculate the time variant xs

fulldata$x <- fulldata$fe + rnorm(nobs*nperson)

# And our unobservable error

fulldata$u <- rnorm(nobs*nperson)

# Finally we are ready to simulate our y variables

fulldata$y <- .5*fulldata$x + .5*fulldata$fe + fulldata$u

# First lets see how our standard linear model performs:

summary(lm(y~x, data=fulldata))

# Adding a dummy variable removes the bias

summary(lm(y~x+factor(id), data=fulldata))

# The same result can be taken by removing the mean from

# both the explanatory variables and the dependent variables.

# Why is that?

Think of the problem as:

$$y_{it}=x_{it}\beta + a_i+u_{it}$$

So $$y_{it}-mean_t(y_i)=(x_{it}-mean_t(x_{i}))\beta + a_i-mean_t(a_i)+u_{it}-mean(u_i)$$

Because the unobservable effect is constant over time it drops out. And so long as their was a term controlling for the average unobservable effect (the dummy variables) then the average per person unobserved error must by definition be equal to zero.

thus: $$y_{it}-mean_t(y_i)=(x_{it}-mean_t(x_{i}))\beta + u_{it}$$

fulldata$ydemean <- fulldata$y-ave(fulldata$y, group=fulldata$id)

fulldata$xdemean <- fulldata$x-ave(fulldata$x, group=fulldata$id)

summary(lm(ydemean~xdemean-1, data=fulldata))

# We can also accomplish this by adding the Chamberlain device to the

# that regression is the total or mean of the explanatory variables at

# the level of each individual.

fulldata$xmean <- ave(fulldata$x, group=fulldata$id)

fulldata$xsum <- gapply(fulldata$x, group=fulldata$id, fun=sum)

# This is a little trickier to figure out how it accomplishes the task

# of differencing out the unobserved effect.

# This is how I think of it. The unobserved individual effect must be

# correlatedwith the explanatory variable in aggrogate to be a problem.

# However, that correlation can only be on the individual level since

# by definition the "fixed effect" is constant on the individual level.

# Thus by creating a new variable which is the average or total for

# each individual, we are allocating to that variable any variation

# which correlates with the explanatory variable.

summary(lm(y~x+xmean, data=fulldata))

summary(lm(y~x+xsum, data=fulldata))

To

**leave a comment**for the author, please follow the link and comment on his blog:**Econometrics by Simulation**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...