Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This tutorial illustrates how to use Bayesian Model Averaging (BMA) with panel data using the R package BMS.

## Introduction

Methods for estimating econometric models with panel data have been frequently discussed in the literature (see eg Mundlak, 1978). Two estimators seem to resurface most often: the fixed effects estimator (FE) and the random effects estimator (RE). Both have their separate virtues and underlying assumptions (for an exposition see Bartels, 2008). Since the FE estimator can be easily cast into the linear regression framework that is used for the BMA package BMS, it will be our focus in this tutorial. For an application of Bayesian Model Averaging employing the RE estimator please refer to Moral-Benito (2011). Furthermore, a great deal of the literature seems to pivot around the question of how to calculate standard errors (Bartels, 2008). At the current stage, we abstract from the calculation of so-called clustered standard errors since we stay in a pure Bayesian framework in BMS (and standard errors are a classical concept).

For the purpose of illustration we will use the data put forward in Moral-Benito (2011). The data contains K=35 variables (including the dependent variable, the growth rate of per capita GDP) for N=73countries and for the period 1960-2000. The appendix lists the variables together with a short description. The dependent variable, GDP growth, is calculated for five year averages resulting into eight observations per country. Moral-Benito (2011) argues in favor of calculating averages of flow variables, while stock variables have been measured at the first year of each five-year period. The data can be downloaded at Moralbenito.com (‘download data for the paper Determinants of Economic Growth: A Bayesian Panel Data Approach and unzip).

library(BMS)

After having started R and loaded the BMS package we can read in the data file:
data.raw=read.table("Dataset.csv",header=TRUE,sep=";")
panelDat=data.raw[,-c(1:4)]
rownames(panelDat)=paste(data.raw[,"code"],data.raw[,"year"],sep="_")
panelDat=panelDat[,-charmatch(c("GRW_WB","GDP_WB"),colnames(panelDat))]


For that purpose I have simply converted the ‘Dataset.xls’ file to a ‘Dataset.csv’ file, replaced all missing values by ‘NA’ in excel and put it online. You can also therefore directly download the data with the command:

load(url("http://bms.zeugner.eu/blog/images/2011-05/paneldat.rda"))


Let us now peek into the data with the following commands:

panelDat=as.matrix(panelDat)


The rownames of the data are a combination of the country code and the year of the observation. The data is provided in a data.frame consisting of stacked observations per column. That is, the first column containing the dependent variable consists Y1 = (y1,1,y1,T=8…,yN,1,…,yN,T) This is also the format we can use later on when calling the bms function.

We will use two approaches to estimate a fixed effects (FE) panel (country / time fixed effects). The first approach makes use of the Frisch-Waugh-Lovell theorem (see eg Lovell, 2008), boiling down to demeaning the data accordingly. That is, in the case of country fixed effects, subtract from each observation (dependent and independent variable) the within country mean. For the case of time fixed effects, subtract from each observation the mean across countries per time period. We will start with the country fixed effects first.

For that purpose we will have to re-shape the data frame and put it into the form of a three dimensional array (T x K x N). That can be achieved with the function panel_unstack. Since bms uses data in its stacked form, we have to make use of panel_stack as well. Both functions are not part of the BMS library and thus have to be copied and pasted into your R console by yourself:

 panel_unstack= function(stackeddata, tstep=NULL) {
# stackeddata is a stacked data frame/matrix ordered in the following way
#               Variable1  Variable2
# ctry1_time1
# ctry1_time2
# ...
# ctry2_time1
# ctry2_time2
# ...
# tstep is the number of time points
# panel_unstack produces a 3-dimensional array out of that

bigT=nrow(stackeddata);K=ncol(stackeddata);
if (is.null(tstep)) tstep=bigT
X1=aperm(array(as.vector(t(as.matrix(stackeddata))),dim=c(K,tstep,bigT/tstep)), perm=c(2,1,3))
try(dimnames(X1)[[1]] <-  unique(sapply(strsplit(rownames(stackeddata),"_"),
function(x) x[[2]])), silent=TRUE)
try(dimnames(X1)[[2]] <-  colnames(stackeddata), silent=TRUE)
try(dimnames(X1)[[3]] <-  unique(sapply(strsplit(rownames(stackeddata),"_"),
function(x) x[[1]])), silent=TRUE)
return(X1)
}

panel_stack = function(array3d) {
x1= apply(array3d,2,rbind)
try(rownames(x1) <-  as.vector(sapply(dimnames(array3d)[[3]],
FUN=function(x) paste(x, dimnames(array3d)[[1]], sep="_"))), silent=TRUE)
return(as.data.frame(x1))
}


We can now easily transform the data from its stacked form into the three-dimensional array via:

dat.array=panel_unstack(panelDat, tstep=8)


where we have set tstep=8 since we have eight time periods per country. The advantages of the three-dimensional array are that we can easily access each dimension of the data:

dat.array[,,"ZWE"]
dat.array["1965",,]
dat.array[,"GSH",]


## Fixed Effects Estimation by Demeaning the Data

The function demean (again not part of the BMS library, so copy and paste the following lines into your R console) demeans the data to estimate individual (eg country), time and individual and time fixed effects. It takes as argument the three dimensional data array we have created above (dat.array) and via margin we can specify over which dimension we want to demean the data (country / time).

demean = function(x, margin) {
#x is an array
#margin is the dimension along which should be demeaned
if (!is.array(x)) stop("x must be an array/matrix")
otherdims=(1:length(dim(x)))[-margin]
sweep(x,otherdims,apply(x,otherdims,mean))
}

Demeaning is now easily accomplished by:
timeDat=panel_stack(demean(dat.array,3))
countryDat=panel_stack(demean(dat.array,1))

where we have used panel_stack to re-transform the demeaned data into its stacked form that can be passed to the bms function. Since in the data frame only the first 12 explanatory variable show variation over time, we will restrict estimation to these variables only.
# only the first 12 variables show time variation, so estimate panel with those regressors only
modelCd=bms(countryDat[,1:13],user.int=F)
modelTd=bms(timeDat[,1:13],user.int=F)


We will briefly discuss the results in the next section (see Table 1 and Table 2). Note that demeaning the data yields the same posterior estimates for the coefficients as with incorporating the FE directly, the approach we opt for in the next section. However, the posterior variance for the coefficient estimates is not identical (though very similar). Also note that demeaning does not save you from the degrees of freedom problem when incorporating the large set of fixed effects by a set of dummy variables. For an application using BMA with country FEs see for example Crespo Cuaresma et al. (2009).

## Fixed Effects Estimation with Dummy Variables

We will now turn to the second possibility of estimating FEs, which is the dummy variable approach. The advantage of the dummy variable approach is also that it yields estimates for the FEs which can be important for some applications. For the dummy approach we will make use of the new bms feature of holding variables constant (not sampling) them by the bms argument fixed.reg. Please make sure that you have installed a BMS version >= 0.3. We start now with creating the country dummies:
# country dummies
bigT=nrow(panelDat);tstep=8;
countryDummies=kronecker(diag(bigT/tstep),rep(1,tstep));
colnames(countryDummies)=dimnames(dat.array)[[3]]
# we have to leave one out (first column in this example, but of course is not relevant which column we leave out)
countryDummies=countryDummies[,-1]

In a same fashion we can easily create a set of time dummies:
# time series dummies
timeDummies=matrix(diag(tstep),bigT,tstep,byrow=T);
colnames(timeDummies)=dimnames(dat.array)[[1]]
# we have to leave one out (first column in this example, but of course is not relevant which column we leave out)
timeDummies=timeDummies[,-1]
modelTdummy=bms(cbind(panelDat[,1:13],timeDummies),fixed.reg=colnames(timeDummies),user.int=F)

Running the two regressions (for the first 13 elements of the demeaned data frame only):
modelCdummy=bms(cbind(panelDat[,1:13],countryDummies),fixed.reg=colnames(countryDummies),user.int=F)
modelTdummy=bms(cbind(panelDat[,1:13],timeDummies),fixed.reg=colnames(timeDummies),user.int=F)


should yield the same results as with demeaning. Type coef(modelCd) and coef(modelCdummy) to get the results in R. These are summarized in the Table below:

          PIP Post Mean Post SD |  PIP Post Mean Post SD
GDP_PWT 1.000    -0.244   0.025 |1.000    -0.244   0.025
POP     1.000     0.000   0.000 |1.000     0.000   0.000
PGRW    0.286    -0.534   0.994 |0.286    -0.534   0.994
IPR     0.327     0.000   0.000 |0.327     0.000   0.000
OPEM    0.999     0.156   0.032 |0.999     0.156   0.032
CSH     0.999    -0.296   0.066 |0.999    -0.296   0.066
GSH     0.981    -0.458   0.139 |0.980    -0.458   0.139
ISH     0.523     0.133   0.148 |0.522     0.133   0.148
LBF     0.510     0.282   0.321 |0.509     0.281   0.321
LEX     0.129     0.000   0.001 |0.129     0.000   0.001
PDE     0.070     0.000   0.000 |0.070     0.000   0.000
URBP    0.083    -0.006   0.043 |0.083    -0.006   0.043
Table 1: Estimation of country fixed effects: Left panel based on demeaning the data,
right panel on the dummy variable estimation approach


As one can see the results are very similar to each other. Since we have used 'full enumeration' no stochastic variability should be expected for the two approaches. However, when using large data sets and thus the MCMC sampler in turn, please bear in mind that there might be some stochastic variation of results when running differen bms chains. Posterior coefficients for the model employing country fixed effects are to be interpreted with respect to the within variation: A positive coefficient on the variable measuring the country's openness (OPEM) means that if openness increases within a country GDP growth is incraesing. On the other hand, time fixed effects in the particular example look at the between variation of the data. That is, if openness across countries (at once) increases, does this affect GDP growth? From Table 2 (equivalent to coef(modelTd); coef(modelTdummy) we see that this effect is smaller compared to that for the within transformed data.

         PIP Post Mean Post SD   PIP Post Mean Post SD
GDP_PWT 1.000    -0.071   0.013 1.000    -0.071   0.013
POP     0.433     0.000   0.000 0.432     0.000   0.000
PGRW    0.994    -2.542   0.647 0.994    -2.542   0.647
IPR     0.771     0.000   0.000 0.770     0.000   0.000
OPEM    0.376     0.013   0.019 0.376     0.013   0.019
CSH     0.095    -0.004   0.016 0.095    -0.004   0.016
GSH     0.136    -0.012   0.039 0.136    -0.012   0.039
ISH     0.864     0.212   0.113 0.864     0.212   0.113
LBF     0.060     0.001   0.026 0.060     0.001   0.026
LEX     1.000     0.007   0.001 1.000     0.007   0.001
PDE     0.347     0.000   0.000 0.347     0.000   0.000
URBP    0.103    -0.005   0.021 0.102    -0.005   0.021
Table 2: Estimation of time fixed effects: Left panel based on demeaning the data,
right panel on the dummy variable estimation approach