Linear regression from a contingency table

This morning, Benoit sent me an email about an exercise on linear regression that he found in an econometrics textbook. Consider the following dataset,

Here, variable X denotes the income, and Y the expenses. The goal was to fit a linear regression (actually, the email mentioned that we should try to fit a heteroscedastic model, but let us skip that part). So Benoit’s question was more or less: how do you fit a linear regression from a contingency table?

Usually, when I get an email on a Saturday morning, I try to postpone. But the kids had their circus class, so I had some time to answer. And this did not look like a complex puzzle… Let us import this dataset in R, so that we can start playing,

> df=read.table("http://freakonometrics.free.fr/baseexo.csv",sep=";",header=TRUE)
> M=as.matrix(df[,2:ncol(df)])
> M[is.na(M)]<-0
> M
      X14 X19 X21 X23 X25 X27 X29 X31 X33 X35
 [1,]  74  13   7   1   0   0   0   0   0   0
 [2,]   6   4   2   7   4   0   0   0   0   0
 [3,]   2   3   2   2   4   0   0   0   0   0
 [4,]   1   1   2   3   3   2   0   0   0   0
 [5,]   2   0   1   3   2   0   6   0   0   0
 [6,]   2   0   2   1   0   0   1   2   1   0
 [7,]   0   0   0   2   0   0   1   1   3   0
 [8,]   0   1   0   1   0   0   0   0   2   0
 [9,]   0   0   0   0   1   1   0   1   0   1

The first idea I had was to use those counts as weights. Weighted least squares should be perfect. The dataset is built from this matrix,

> W=as.vector(M)
> x=df[,1]
> X=rep(x,ncol(M))
> y=as.numeric(substr(names(df)[-1],2,3))
> Y=rep(y,each=nrow(M))
> base1=data.frame(X1=X,Y1=Y,W1=W)

Here we have

> head(base1,10)
   X1 Y1 W1
1  16 14 74
2  23 14  6
3  25 14  2
4  27 14  1
5  29 14  2
6  31 14  2
7  33 14  0
8  35 14  0
9  37 14  0
10 16 19 13

The regression is the following,

> reg1=lm(Y1~X1,data=base1,weights=W1)
> summary(reg1)

Call:
lm(formula = Y1 ~ X1, data = base1, weights = W1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    2.03022   2.145    0.038 *  
X1           0.68263    0.09016   7.572 3.04e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.892 on 40 degrees of freedom
Multiple R-squared:  0.589,	Adjusted R-squared:  0.5787 
F-statistic: 57.33 on 1 and 40 DF,  p-value: 3.038e-09
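
Note, by the way, that cells with a zero count end up with a zero weight, and lm() leaves them aside when counting residual degrees of freedom; that is (if I read the output correctly) why we get 40 degrees of freedom, i.e. the 42 cells with a positive count minus the two estimated parameters. A quick check,

> sum(W>0)          # cells with a positive count, 42 of them
> df.residual(reg1) # 42 - 2 = 40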

It looks like the output is the same as what Benoit found, so we should be happy. But then I had a second thought: why not create the implied dataset? Using replicates, we should be able to recover the dataset that was used to build this contingency table,

> vX=vY=rep(NA,sum(W))
> sumW=c(0,cumsum(W))
> for(i in 1:length(W)){
+ if(W[i]>0){
+ vX[(1+sumW[i]):sumW[i+1]]=X[i]
+ vY[(1+sumW[i]):sumW[i+1]]=Y[i]
+ }}
> base2=data.frame(X2=vX,Y2=vY)
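
As a side note, the same expansion can be obtained in one line with rep(), whose times argument can be a vector of counts (a quick sketch using the X, Y and W vectors defined above; base2b is just another name for the same dataset),

> base2b=data.frame(X2=rep(X,W),Y2=rep(Y,W)) # repeat each cell of the table W times
> dim(base2b)                                # should have sum(W) = 181 rows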

Here, the dataset is much larger, and there are no weights,

> tail(base2,10)
    X2 Y2
172 31 31
173 33 31
174 37 31
175 31 33
176 33 33
177 33 33
178 33 33
179 35 33
180 35 33
181 37 35

If we run a linear regression on this dataset, we obtain

> reg2=lm(Y2~X2,data=base2)
> summary(reg2)

Call:
lm(formula = Y2 ~ X2, data = base2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.35569    0.95972   4.538 1.04e-05 ***
X2           0.68263    0.04262  16.017  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.731 on 179 degrees of freedom
Multiple R-squared:  0.589,	Adjusted R-squared:  0.5867 
F-statistic: 256.5 on 1 and 179 DF,  p-value: < 2.2e-16

If we compare the two regressions, we have

> rbind(coefficients(summary(reg1)),
+ coefficients(summary(reg2)))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 4.3556857 2.03021637  2.145429 3.804237e-02
X1          0.6826296 0.09015771  7.571506 3.038443e-09

(Intercept) 4.3556857 0.95972279  4.538483 1.036711e-05
X2          0.6826296 0.04261930 16.016913 2.115373e-36

The estimates are exactly the same (which does not surprise me), but the standard errors (and hence the significance levels) are quite different. And to be honest, I find that surprising. Which approach here is the more legitimate (since, in the end, they are not equivalent)?
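
To see where the discrepancy comes from (a quick numerical check, not an answer to the question above): since the two fits have the same coefficients, the weighted residual sum of squares of reg1 should equal the residual sum of squares of reg2, so the standard errors should differ only through the residual degrees of freedom used to estimate the residual variance, 40 versus 179,

> sum(weighted.residuals(reg1)^2)           # weighted RSS of the first fit
> sum(residuals(reg2)^2)                    # should equal the weighted RSS above
> summary(reg1)$sigma/summary(reg2)$sigma   # ratio of residual standard errors
> sqrt(df.residual(reg2)/df.residual(reg1)) # sqrt(179/40), should match that ratio

And indeed, the ratio of the reported standard errors, 2.0302/0.9597 ≈ 2.115, is sqrt(179/40).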
