Classification with Categorical Variables (the fuzzy side)

April 9, 2015
(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

The Gaussian and the (log) Poisson regressions share a very interesting property: the average predicted value is the empirical mean of our sample.

> mean(predict(lm(dist~speed,data=cars)))
[1] 42.98
> mean(cars$dist)
[1] 42.98
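The Poisson half of the claim can be checked in the same way. This is a minimal sketch on the same `cars` data; the object names `reg_p`, `avg_pred` and `avg_obs` are introduced here for illustration and are not part of the original post.

```r
# Poisson regression with the log link on the same data
reg_p <- glm(dist ~ speed, data = cars, family = poisson(link = "log"))

# type = "response" returns predictions on the scale of dist;
# the default for glm objects, type = "link", would return their logarithm
avg_pred <- mean(predict(reg_p, type = "response"))
avg_obs  <- mean(cars$dist)

# The two numbers should agree up to the convergence tolerance of glm()
c(average_prediction = avg_pred, empirical_mean = avg_obs)
```

Note that the comparison has to be made on the response scale: averaging the linear predictors and exponentiating afterwards would not give the same number.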

One can prove that this value is also the prediction for the average individual in our sample

> predict(lm(dist~speed,data=cars),
+ newdata=data.frame(speed=mean(cars$speed)))
42.98
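A short derivation sketch of both facts, for a simple linear model with an intercept fitted by least squares. The notation ($\hat\beta_0$, $\hat\beta_1$, $\bar x$, $\bar y$, $\hat y_i$) is introduced here and does not appear in the original post.

```latex
\begin{align*}
% First-order condition of least squares with respect to the intercept:
% the residuals sum to zero, so the fitted values have the same mean as y
\sum_{i=1}^n \bigl(y_i - \hat\beta_0 - \hat\beta_1 x_i\bigr) = 0
&\quad\Longrightarrow\quad
\frac{1}{n}\sum_{i=1}^n \hat y_i = \bar y,
\qquad \hat y_i = \hat\beta_0 + \hat\beta_1 x_i, \\[4pt]
% The prediction is linear in x, so predicting at the mean of x
% gives the mean of the predictions
\hat y(\bar x) = \hat\beta_0 + \hat\beta_1 \bar x
&= \frac{1}{n}\sum_{i=1}^n \bigl(\hat\beta_0 + \hat\beta_1 x_i\bigr)
= \frac{1}{n}\sum_{i=1}^n \hat y_i = \bar y.
\end{align*}
```

The second line relies on the prediction being linear in the covariate, which is the step that fails once a nonlinear link function is involved.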

The geometric interpretation is that the regression line passes through the centroid,

> plot(cars)
> abline(lm(dist~speed,data=cars),col="red")
> abline(h=mean(cars$dist),col="blue")
> abline(v=mean(cars$speed),col="blue")
> points(mean(cars$speed),mean(cars$dist))

But in other models, the prediction for the average individual generally no longer equals the empirical mean. Consider for instance the case of a logistic regression. And to make things even more complicated, consider the case where we have only categorical explanatory variables. In that context, it is harder to say what a prediction for the “average individual” means, unless we consider some fuzzy interpretation of the regression.
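Before turning to categorical covariates, here is a minimal sketch of the logistic case with a single numeric covariate. It uses the built-in `mtcars` data; the choice of `am` and `wt`, and the names `reg_l`, `pred_avg_ind` and `emp_mean`, are assumptions made for illustration only and have nothing to do with the credit data used below.

```r
# Logistic regression of a 0/1 response on one numeric covariate
reg_l <- glm(am ~ wt, data = mtcars, family = binomial)

# Prediction (on the probability scale) for the "average individual",
# i.e. a car whose weight is the sample mean of wt
pred_avg_ind <- unname(predict(reg_l,
                               newdata = data.frame(wt = mean(mtcars$wt)),
                               type = "response"))

# Empirical mean of the response
emp_mean <- mean(mtcars$am)

# The two values differ: the inverse logit is not linear
c(average_individual = pred_avg_ind, empirical_mean = emp_mean)
```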

Consider the following dataset

> source("http://freakonometrics.free.fr/import_data_credit.R")

Just to get a simple model, consider the following regression model, on three covariates,

> reg_f=glm(class~checking_status+duration+
+ credit_history,data=train.db,family=binomial)
> summary(reg_f)

Coefficients:
                                                       Estimate Std. Error z value Pr(>|z|)
(Intercept)                                             -1.3058     0.2765  -4.722 2.33e-06 ***
checking_statusCA > 200 euros                           -1.2297     0.4691  -2.621 0.008761 **
checking_statusCA in [0-200 euros[                      -0.6047     0.2314  -2.614 0.008962 **
checking_statusNo checking account                      -1.8756     0.2570  -7.298 2.92e-13 ***
duration(15,36]                                          0.7630     0.2102   3.629 0.000284 ***
duration(36,Inf]                                         1.3576     0.3543   3.832 0.000127 ***
credit_historycritical account                           1.9812     0.3679   5.385 7.24e-08 ***
credit_historyexisting credits paid back duly till now   0.8171     0.2497   3.273 0.001065 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
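As a reading aid, the estimates above can be turned into a fitted probability by applying the inverse logit to the sum of the relevant coefficients. The sketch below uses the rounded estimates copied from the printed table, so results are only accurate up to rounding; the second profile is picked purely for illustration.

```r
# Rounded estimates copied from the summary above
intercept   <- -1.3058   # reference profile: the level of each factor with no row in the table
no_checking <- -1.8756   # checking_statusNo checking account
dur_15_36   <-  0.7630   # duration(15,36]

# plogis() is the inverse logit, 1 / (1 + exp(-x))
p_reference <- plogis(intercept)
p_example   <- plogis(intercept + no_checking + dur_15_36)

# Fitted probabilities for the reference profile and for the example profile
c(reference = p_reference, example = p_example)
```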

An alternative is to use the regression on dummy variables,

> library(FactoMineR)
> credit_disj=data.frame(class=train.db$class,
+ tab.disjonctif(train.db[,-which(names(
+ train.db)=="class")]))
> reg_d=glm(class~.,data=credit_disj[,1:11],
+ family=binomial)

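It is equivalent, since building those dummy variables is exactly what R does when running the regression on the categorical covariates. Well, not exactly: here every modality gets its own column, so one dummy per variable is collinear with the intercept and R leaves its coefficient undefined (NA). The reference modalities are therefore different, and so is the output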

Coefficients: (3 not defined because of singularities)
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                               -1.0066     0.3753  -2.682 0.007310 **
CA...0.euros                               1.8756     0.2570   7.298 2.92e-13 ***
CA...200.euros                             0.6459     0.4855   1.330 0.183396
CA.in..0.200.euros.                        1.2709     0.2609   4.871 1.11e-06 ***
No.checking.account                            NA         NA      NA       NA
X.0.15.                                   -1.3576     0.3543  -3.832 0.000127 ***
X.15.36.                                  -0.5947     0.3410  -1.744 0.081161 .
X.36.Inf.                                      NA         NA      NA       NA
all.credits.paid.back.duly                -0.8171     0.2497  -3.273 0.001065 **
critical.account                           1.1641     0.3156   3.688 0.000226 ***
existing.credits.paid.back.duly.till.now       NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
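The NA rows can be reproduced on a small example. The sketch below uses the built-in `mtcars` data and base R's `model.matrix()` in place of `tab.disjonctif()` (a substitution made here to avoid extra packages); the variables `am` and `cyl` and all object names are arbitrary choices for illustration.

```r
# Toy data: a 0/1 response and one categorical covariate with three levels
db <- data.frame(am = mtcars$am, cyl = factor(mtcars$cyl))

# Regression on the factor: R drops the first level, which becomes the reference
fit_f <- glm(am ~ cyl, data = db, family = binomial)

# Complete set of indicator columns, one 0/1 column per level (cyl4, cyl6, cyl8)
dummies <- model.matrix(~ cyl - 1, data = db)
db_d    <- data.frame(am = db$am, dummies)

# Every row of the indicator matrix sums to 1, i.e. to the intercept column
all(rowSums(dummies) == 1)

# Regression on all the dummies: the last one is aliased and gets NA
fit_d <- glm(am ~ ., data = db_d, family = binomial)
coef(fit_f)
coef(fit_d)
```

In this toy fit the aliased column is the last level, which then plays the role of the reference, mirroring what happens in the output above.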

But it is the same model. Hence, predictions are exactly the same

> predict(reg_f,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422
> predict(reg_d,type="response")[1:10]
0.21319568 0.56568074 0.70452901 0.56814422 0.16780141 0.08593906 0.24094435 0.36753641 0.38020333 0.56814422
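The link between the two sets of estimates can also be checked by hand: moving from one set of reference modalities to the other only shifts the intercept and the contrasts. Here is a rough check using the rounded values printed in the two outputs above, so agreement is only expected up to rounding; the variable names are introduced here for illustration.

```r
# Rounded estimates from the first output
b0_f        <- -1.3058
no_checking <- -1.8756   # checking_statusNo checking account
dur_36_inf  <-  1.3576   # duration(36,Inf]
existing    <-  0.8171   # credit_historyexisting credits paid back duly till now

# In the second output the NA rows act as reference modalities, so its
# intercept should be the linear predictor of that profile in the first model
b0_d_rebuilt <- b0_f + no_checking + dur_36_inf + existing
b0_d_printed <- -1.0066

# Contrast "CA > 200 euros" versus "No checking account"
ca200_rebuilt <- -1.2297 - no_checking
ca200_printed <-  0.6459

c(b0_d_rebuilt, b0_d_printed, ca200_rebuilt, ca200_printed)
```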

Based on that second regression, it is possible to get a prediction for the average individual of the dataset

> tab.disj.comp <- tab.disjonctif(
+ train.db[,-which(names(train.db)=="class")])
> apply(tab.disj.comp,2,mean)
       CA < 0 euros      CA > 200 euros CA in [0-200 euros[
        0.274844720         0.054347826         0.282608696
No checking account              (0,15]             (15,36]
        0.388198758         0.423913043         0.495341615

Consider the regression on the complete disjunctive table

> credit_disj=data.frame(class=train.db$class,
+ tab.disj.comp)
> reg=glm(class~.,data=credit_disj,
+ family=binomial)

and compute the prediction for the average individual

> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358

Here, we are quite far from the empirical average of the response

> mean(as.numeric(train.db$class)-1)
0.2981366

but again, there is no reason to get the same value. Actually, if we were running a Gaussian regression, it would be the same (even with that fuzzy interpretation of those categories),

> credit_disj=data.frame(class=as.numeric(
+ train.db$class)-1,tab.disj.comp)
> reg=lm(class~.,data=credit_disj)
> predict(reg,newdata=nd)
0.2981366
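The same comparison can be reproduced without the credit data. This is a minimal sketch on the built-in `mtcars` data, with `model.matrix()` standing in for `tab.disjonctif()`; the variables `am` and `cyl` and all object names are assumptions made for illustration.

```r
# Toy data: a 0/1 response and the complete indicator matrix of one factor
db      <- data.frame(am = mtcars$am, cyl = factor(mtcars$cyl))
dummies <- model.matrix(~ cyl - 1, data = db)   # columns cyl4, cyl6, cyl8
db_d    <- data.frame(am = db$am, dummies)

# The "average individual": each dummy is replaced by its sample proportion
nd <- as.data.frame(t(colMeans(dummies)))

fit_logit <- glm(am ~ ., data = db_d, family = binomial)
fit_lin   <- lm(am ~ ., data = db_d)

# Depending on the R version, predict() may warn about the rank-deficient fit
# (one dummy is aliased with the intercept); the warning is silenced here
p_logit <- suppressWarnings(unname(predict(fit_logit, newdata = nd, type = "response")))
p_lin   <- suppressWarnings(unname(predict(fit_lin, newdata = nd)))

# Only the Gaussian (linear) prediction coincides with the empirical mean
c(logistic = p_logit, gaussian = p_lin, empirical_mean = mean(db$am))
```

The linear prediction is a weighted average of the group means, with the group proportions as weights, which is exactly the overall mean. The logistic prediction averages on the logit scale instead, so it lands somewhere else.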

Soon, we will see an application of that fuzzy regression…
