Linear regression from a contingency table

September 7, 2013
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

This morning, Benoit sent me an email, about an exercise he found in an econometric textbook, about linear regression. Consider the following dataset,

Here, variable X denotes the income, and Y the expenses. The goal was to fit a linear regression (actually, in the email, it was mentioned that we should try to fit an heteroscedastic model, but let us skip this part). So Benoit’s question was more or less: how do you fit a linear regression from a contingency table?

Usually, when I got an email on Saturday morning, I try to postpone. But the kids had their circus class, so I had some time to answer. And this did not look like a complex puzzle… Let us import this dataset in R, so that we can start playing

> df=read.table("http://freakonometrics.free.fr/baseexo.csv",sep=";",header=TRUE)
> M=as.matrix(df[,2:ncol(df)])
> M[is.na(M)]<-0
> M
      X14 X19 X21 X23 X25 X27 X29 X31 X33 X35
 [1,]  74  13   7   1   0   0   0   0   0   0
 [2,]   6   4   2   7   4   0   0   0   0   0
 [3,]   2   3   2   2   4   0   0   0   0   0
 [4,]   1   1   2   3   3   2   0   0   0   0
 [5,]   2   0   1   3   2   0   6   0   0   0
 [6,]   2   0   2   1   0   0   1   2   1   0
 [7,]   0   0   0   2   0   0   1   1   3   0
 [8,]   0   1   0   1   0   0   0   0   2   0
 [9,]   0   0   0   0   1   1   0   1   0   1

The first idea I had was to use those counts as weights. Weighted least squares should be perfect. The dataset is built from this matrix,

> W=as.vector(M)
> x=df[,1]
> X=rep(x,ncol(M))
> y=as.numeric(substr(names(df)[-1],2,3))
> Y=rep(y,each=nrow(M))
> base1=data.frame(X1=X,Y1=Y,W1=W)

Here we have

> head(base1,10)
   X1 Y1 W1
1  16 14 74
2  23 14  6
3  25 14  2
4  27 14  1
5  29 14  2
6  31 14  2
7  33 14  0
8  35 14  0
9  37 14  0
10 16 19 13

The regression is the following,

> summary(reg1)

Call:
lm(formula = Y1 ~ X1, data = base1, weights = W1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    2.03022   2.145    0.038 *  
X1           0.68263    0.09016   7.572 3.04e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.892 on 40 degrees of freedom
Multiple R-squared:  0.589,	Adjusted R-squared:  0.5787 
F-statistic: 57.33 on 1 and 40 DF,  p-value: 3.038e-09

It looks like the output is the same as what Benoit found, so we should be happy. Now, I had a second thought. Why not create the implied dataset. Using replicates, we should be able to create the dataset that was used to get this contingency table,

> vX=vY=rep(NA,sum(W))
> sumW=c(0,cumsum(W))
> for(i in 1:length(W)){
+ if(W[i]>0){
+ vX[(1+sumW[i]):sumW[i+1]]=X[i]
+ vY[(1+sumW[i]):sumW[i+1]]=Y[i]
+ }}
> base2=data.frame(X2=vX,Y2=vY)

Here, the dataset is much larger, and there is no weight,

> tail(base2,10)
    X2 Y2
172 31 31
173 33 31
174 37 31
175 31 33
176 33 33
177 33 33
178 33 33
179 35 33
180 35 33
181 37 35

If we run a linear regression on this dataset, we obtain

> summary(reg2)

Call:
lm(formula = Y2 ~ X2, data = base2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    0.95972   4.538 1.04e-05 ***
X2           0.68263    0.04262  16.017  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.731 on 179 degrees of freedom
Multiple R-squared:  0.589,	Adjusted R-squared:  0.5867 
F-statistic: 256.5 on 1 and 179 DF,  p-value: < 2.2e-16

If we compare the two regressions, we have

> rbind(coefficients(summary(reg1)),
+ coefficients(summary(reg2)))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 4.3556857 2.03021637  2.145429 3.804237e-02
X1          0.6826296 0.09015771  7.571506 3.038443e-09

(Intercept) 4.3556857 0.95972279  4.538483 1.036711e-05
X2          0.6826296 0.04261930 16.016913 2.115373e-36

The estimators are exactly the same (which does not surprise me), but standard deviation (ans significance levels) are quite different. And to be honest, I find that surprising. Which approach here is the most legitimate (since they are finally not equivalent)?

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.