Visualising a Classification in High Dimension, part 2

April 9, 2015
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on the use of a principal component analysis, to get a projection on the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a correspondance analysis, and to run a similar code.

Consider here the dataset used in a recent post,

> source("http://freakonometrics.free.fr/import_data_credit.R")

If we consider a correspondance analysis, we get

> library(FactoMineR)
> acm=MCA(train.db,quali.sup = 
+ which(names(train.db,)=="class"),ncp=10)

For the covariates (including also the variable we want to model, considered here as some supplementary variable), the visualisation – on the first two components – is

and for the individuals

The extension of the previous post (based on a principal component analysis) is easy : based on a point in the (projection) space above, we want to get the point in the original space, use our classifier to get a prediction, then visualise it on that projection. The problem is that when we get back to the original space, we might not end up in some categories: instead of having a gender “male” or “female“, our point might be 83% “male” and 17% “female“. We we can run our classifier in that fuzzy space, that’s fine. The good thing is that we can, in the context of a logistic regression (as discussed in a previous post, entitled Classification with Categorical Variables (the fuzzy side)).

I should also mentioned that when I wrote my initial code, I got some technical problem to get back to the original space (one dimension was missing)

> X=tab.disjonctif(train.db[,-which(
+ names(train.db)=="class")])
> ncol(X)
[1] 56
> res.mca = MCA(train.db,quali.sup =
+ which(names(train.db)=="class"),ncp=56)
> ncol(res.mca$var$coord)
[1] 55

So I asked Julie who kindly helped me by sharing some code she got.

> don <- train.db[,-which(names(train.db)=="class")]
> moy.p <- function(V, poids) {
+   res <- sum(V * poids, na.rm = TRUE)/sum(poids[!is.na(V)])
+ }
> row.w = rep(1/nrow(don), nrow(don))
> tab.disj.comp=tab.disjonctif(don)
> M = apply(tab.disj.comp, 2, moy.p, row.w)/ncol(don)
> Z = sweep(tab.disj.comp, 2, apply(tab.disj.comp, 2, moy.p, 
+     row.w), FUN = "/")
> Z = sweep(Z, 2, apply(Z, 2, moy.p, row.w), FUN = "-")
> Zscale = sweep(Z, 2, sqrt(M), FUN = "*")
> svd.Zscale = svd.triplet(Zscale, row.w = row.w)
> eig.shrunk = (svd.Zscale$vs[1:2])

Now, we can get a function that returns a point in the original space (in a fuzzy sense) given a location in the projected space

> reconstruct=function(x1,x2){
+ rec = tcrossprod(sweep(cbind(x1,x2), 2, 
+ eig.shrunk, FUN = "*"), svd.Zscale$V[, 1:2])
+ tab.disj.rec = sweep(rec, 2, sqrt(M), FUN = "/") + matrix(1, 
+ nrow(rec), ncol(rec))
+ tab.disj.rec = sweep(tab.disj.rec, 2, 
+ apply(tab.disj.comp, 2, moy.p, row.w), 
+ FUN = "*")
+ colnames(tab.disj.rec)=colnames(tab.disj.comp)
+ return(tab.disj.rec)
+ }

For instance, if we consider the origin in the projected space

> reconstruct(0,0)[1:10]
 [1] 0.25931677 0.06677019 0.27173913 0.40217391
 [5] 0.42546584 0.48447205 0.09006211 0.29503106
 [9] 0.09161491 0.61335404

This point is actually the avereage individual in our dataset

> (apply(tab.disj.comp,2,mean))[1:4]
       CA < 0 euros      CA > 200 euros 
         0.25931677          0.06677019 
CA in [0-200 euros[ No checking account 
         0.27173913          0.40217391

And indeed, this individual belongs to the fuzzy set of our covariates: this individual has 40% chance to get no account, 26% chance to get a negative balance, 27% to get a positive balance, below 200 euros, and 7% chance to get a positive balance above 200 euros. So it cannot be an actual observation from our dataset.

Our classifier should be in that fuzzy space, not on the space of categorical covariates. This was discussed in our previous post

> credit_disj=data.frame(class=train.db$class,
+ tab.disjonctif(don))
> reg=glm(class~.,data=credit_disj,
+ family=binomial)
> nd=as.data.frame(t(apply(tab.disj.comp,
+ 2,mean)))
> names(nd)=names(credit_disj)[-1]
> predict(reg,newdata=nd,type="response")
0.1934358

Here, we get exactly the same prediction using our reconstruction function,

> nd=data.frame(reconstruct(0,0))
> predict(reg,newdata=nd,type="response")
0.1934358

We can define a function to get a prediction

> prev=function(x1,x2){
+ newd=data.frame(retour(x1,x2))
+ return(predict(reg,newdata=newd,
+ type="response"))
+ }

It is then possible to get predictions on a grid, on the projective space

> xgrid=seq(-1,2,length=51)
> ygrid=seq(-1,1,length=51)
> xygrid=as.matrix(expand.grid(xgrid,ygrid))
> rec = tcrossprod(sweep(xygrid, 2,
+ eig.shrunk, FUN = "*"), svd.Zscale$V[, 1:2])
> tab.disj.rec = sweep(rec, 2, sqrt(M), 
+ FUN = "/") + matrix(1,
+ nrow(rec), ncol(rec))
> tab.disj.rec = sweep(tab.disj.rec, 2, apply(
+ tab.disj.comp, 2, moy.p, row.w), FUN = "*")
> colnames(tab.disj.rec)=colnames(tab.disj.comp)
> newd=data.frame(tab.disj.rec)
> z_vect=predict(reg,newdata=newd,
+ type="response")
> zgrid=matrix(z_vect,51,51)

We can then visualize our classifier, on that projected space

> plot(acm$ind$coord[,1:2],xlab="Dim 1",ylab="Dim 2",col="white")
> abline(h=0,col="grey")
> abline(v=0,col="grey")
> points(acm$ind$coord[train.db$class=="0",1],
+ acm$ind$coord[train.db$class=="0",2],
+ col="blue",pch=19,cex=.5)
> points(acm$ind$coord[train.db$class=="1",1],
+ acm$ind$coord[train.db$class=="1",2],
+ col="red",pch=19,cex=.5)
> contour(xgrid,ygrid,zgrid,add=TRUE,
+ levels=prev(0,0),lwd=2)
> contour(xgrid,ygrid,zgrid,add=TRUE,col="grey")

 

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)