Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Gaussian and the (log) Poisson regressions share a very interesting property,

$\frac{1}{n}\sum_{i=1}^n \widehat{Y}_i=\frac{1}{n}\sum_{i=1}^n Y_i$

i.e. the average predicted value is the empirical mean of our sample.

```> mean(predict(lm(dist~speed,data=cars)))
[1] 42.98
> mean(cars\$dist)
[1] 42.98```

One can prove that it is also the prediction for the average individual in our sample

```> predict(lm(dist~speed,data=cars),
+ newdata=data.frame(speed=mean(cars\$speed)))
42.98```

The geometric interpretation is that the regression line passes through the centroid,

```> plot(cars)
> abline(lm(dist~speed,data=cars),col="red")
> abline(h=mean(cars\$dist),col="blue")
> abline(v=mean(cars\$speed),col="blue")
> points(mean(cars\$speed),mean(cars\$dist))```

But in all other cases, it is no longer the case. Consider for instance the case of a logistic regression. And to ask for something even more complicated, consider the case where we have only categorical explanatory variables. In that context, it is more difficult to get a prediction for the “average individual”. Unless we consider some fuzzy interpretation of the regression.

Consider the following dataset

`> source("http://freakonometrics.free.fr/import_data_credit.R")`

Just to get a simple model, consider the following regression model, on three covariates,

```> reg_f=glm(class~checking_status+duration+
+ credit_history,data=train.db,family=binomial)
> summary(reg_f)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                                             -1.3058     0.2765  -4.722 2.33e-06 ***
checking_statusCA > 200 euros                           -1.2297     0.4691  -2.621 0.008761 **
checking_statusCA in [0-200 euros[                      -0.6047     0.2314  -2.614 0.008962 **
checking_statusNo checking account                      -1.8756     0.2570  -7.298 2.92e-13 ***
duration(15,36]                                          0.7630     0.2102   3.629 0.000284 ***
duration(36,Inf]                                         1.3576     0.3543   3.832 0.000127 ***
credit_historycritical account                           1.9812     0.3679   5.385 7.24e-08 ***
credit_historyexisting credits paid back duly till now   0.8171     0.2497   3.273 0.001065 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1```

An alternative is to use the regression on dummy variables,

```> library(FactoMineR)
> credit_disj=data.frame(class=train.db\$class,
+ tab.disjonctif(train.db[,-which(names(
+ train.db)=="class")]))
> reg_d=glm(class~.,data=credit_disj[,1:11],
+ family=binomial)```

It is equivalent since it is exactly what R is doing while running the regression on the covariate. Well, not exactly. Reference modalities will be different, the output is different

```Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept)                               -1.0066     0.3753  -2.682 0.007310 **
CA...0.euros                               1.8756     0.2570   7.298 2.92e-13 ***
CA...200.euros                             0.6459     0.4855   1.330 0.183396
CA.in..0.200.euros.                        1.2709     0.2609   4.871 1.11e-06 ***
No.checking.account                            NA         NA      NA       NA
X.0.15.                                   -1.3576     0.3543  -3.832 0.000127 ***
X.15.36.                                  -0.5947     0.3410  -1.744 0.081161 .
X.36.Inf.                                      NA         NA      NA       NA
all.credits.paid.back.duly                -0.8171     0.2497  -3.273 0.001065 **
critical.account                           1.1641     0.3156   3.688 0.000226 ***
existing.credits.paid.back.duly.till.now       NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1```

But it is the same model. Hence, predictions are exactly the same

```> predict(reg_f,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422
> predict(reg_d,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422```

Based on that second regression, it is possible to get a prediction for the average individual of the dataset

```> tab.disj.comp <- tab.disjonctif(
+ train.db[,-which(names(train.db)=="class")])
> apply(tab.disj.comp,2,mean)
CA < 0 euros      CA > 200 euros    CA in [0-200 euros[
0.274844720         0.054347826            0.282608696
No checking account      (0,15]                        (15,36]
0.388198758         0.423913043            0.495341615```

Consider the regression on the contingency table

```> credit_disj=data.frame(class=train.db\$class,
+ tab.disj.comp)
> reg=glm(class~.,data=credit_disj,
+ family=binomial)```

and compute the prediction for the average individual

```> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358```

We are quite far away, here, compared with the average value

```> mean(as.numeric(train.db\$class)-1)
0.2981366```

but again, there is no reason to get the same value. Actually, if we were running a Gaussian regresision, it would be the same (even with that fuzzy interpretation of those categories),

```> credit_disj=data.frame(class=as.numeric(
+ train.db\$class)-1,tab.disj.comp)
> reg=lm(class~.,data=credit_disj)
> predict(reg,newdata=nd)
0.2981366```

Soon, we will see an application of that fuzzy regression…