Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on the use of a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a correspondance analysis, and to run a similar code.

Consider here the dataset used in a recent post,

`> source("http://freakonometrics.free.fr/import_data_credit.R")`

If we consider a correspondance analysis, we get

```> library(FactoMineR)
> acm=MCA(train.db,quali.sup =
+ which(names(train.db,)=="class"),ncp=10)```

For the covariates (including also the variable we want to model, considered here as some supplementary variable), the visualisation – on the first two components – is and for the individuals The extension of the previous post (based on a principal component analysis) is easy : based on a point in the (projection) space above, we want to get the point in the original space, use our classifier to get a prediction, then visualise it on that projection. The problem is that when we get back to the original space, we might not end up in some categories: instead of having a gender “male” or “female“, our point might be 83% “male” and 17% “female“. We we can run our classifier in that fuzzy space, that’s fine. The good thing is that we can, in the context of a logistic regression (as discussed in a previous post, entitled Classification with Categorical Variables (the fuzzy side)).

I should also mentioned that when I wrote my initial code, I got some technical problem to get back to the original space (one dimension was missing)

```> X=tab.disjonctif(train.db[,-which(
+ names(train.db)=="class")])
> ncol(X)
 56
> res.mca = MCA(train.db,quali.sup =
+ which(names(train.db)=="class"),ncp=56)
> ncol(res.mca\$var\$coord)
 55```

So I asked Julie who kindly helped me by sharing some code she got.

```> don <- train.db[,-which(names(train.db)=="class")]
> moy.p <- function(V, poids) {
+   res <- sum(V * poids, na.rm = TRUE)/sum(poids[!is.na(V)])
+ }
> row.w = rep(1/nrow(don), nrow(don))
> tab.disj.comp=tab.disjonctif(don)
> M = apply(tab.disj.comp, 2, moy.p, row.w)/ncol(don)
> Z = sweep(tab.disj.comp, 2, apply(tab.disj.comp, 2, moy.p,
+     row.w), FUN = "/")
> Z = sweep(Z, 2, apply(Z, 2, moy.p, row.w), FUN = "-")
> Zscale = sweep(Z, 2, sqrt(M), FUN = "*")
> svd.Zscale = svd.triplet(Zscale, row.w = row.w)
> eig.shrunk = (svd.Zscale\$vs[1:2])```

Now, we can get a function that returns a point in the original space (in a fuzzy sense) given a location in the projected space

```> reconstruct=function(x1,x2){
+ rec = tcrossprod(sweep(cbind(x1,x2), 2,
+ eig.shrunk, FUN = "*"), svd.Zscale\$V[, 1:2])
+ tab.disj.rec = sweep(rec, 2, sqrt(M), FUN = "/") + matrix(1,
+ nrow(rec), ncol(rec))
+ tab.disj.rec = sweep(tab.disj.rec, 2,
+ apply(tab.disj.comp, 2, moy.p, row.w),
+ FUN = "*")
+ colnames(tab.disj.rec)=colnames(tab.disj.comp)
+ return(tab.disj.rec)
+ }```

For instance, if we consider the origin in the projected space

```> reconstruct(0,0)[1:10]
 0.25931677 0.06677019 0.27173913 0.40217391
 0.42546584 0.48447205 0.09006211 0.29503106
 0.09161491 0.61335404```

This point is actually the avereage individual in our dataset

```> (apply(tab.disj.comp,2,mean))[1:4]
CA < 0 euros      CA > 200 euros
0.25931677          0.06677019
CA in [0-200 euros[ No checking account
0.27173913          0.40217391```

And indeed, this individual belongs to the fuzzy set of our covariates: this individual has 40% chance to get no account, 26% chance to get a negative balance, 27% to get a positive balance, below 200 euros, and 7% chance to get a positive balance above 200 euros. So it cannot be an actual observation from our dataset.

Our classifier should be in that fuzzy space, not on the space of categorical covariates. This was discussed in our previous post

```> credit_disj=data.frame(class=train.db\$class,
+ tab.disjonctif(don))
> reg=glm(class~.,data=credit_disj,
+ family=binomial)
> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358```

Here, we get exactly the same prediction using our reconstruction function,

```> nd=data.frame(reconstruct(0,0))
> predict(reg,newdata=nd,type="response")
0.1934358```

We can define a function to get a prediction

```> prev=function(x1,x2){
+ newd=data.frame(retour(x1,x2))
+ return(predict(reg,newdata=newd,
+ type="response"))
+ }```

It is then possible to get predictions on a grid, on the projective space

```> xgrid=seq(-1,2,length=51)
> ygrid=seq(-1,1,length=51)
> xygrid=as.matrix(expand.grid(xgrid,ygrid))
> rec = tcrossprod(sweep(xygrid, 2,
+ eig.shrunk, FUN = "*"), svd.Zscale\$V[, 1:2])
> tab.disj.rec = sweep(rec, 2, sqrt(M),
+ FUN = "/") + matrix(1,
+ nrow(rec), ncol(rec))
> tab.disj.rec = sweep(tab.disj.rec, 2, apply(
+ tab.disj.comp, 2, moy.p, row.w), FUN = "*")
> colnames(tab.disj.rec)=colnames(tab.disj.comp)
> newd=data.frame(tab.disj.rec)
> z_vect=predict(reg,newdata=newd,
+ type="response")
> zgrid=matrix(z_vect,51,51)```

We can then visualize our classifier, on that projected space

```> plot(acm\$ind\$coord[,1:2],xlab="Dim 1",ylab="Dim 2",col="white")
> abline(h=0,col="grey")
> abline(v=0,col="grey")
> points(acm\$ind\$coord[train.db\$class=="0",1],
+ acm\$ind\$coord[train.db\$class=="0",2],
+ col="blue",pch=19,cex=.5)
> points(acm\$ind\$coord[train.db\$class=="1",1],
+ acm\$ind\$coord[train.db\$class=="1",2],
+ col="red",pch=19,cex=.5)
+ levels=prev(0,0),lwd=2) 