Visualising a Classification in High Dimension

March 6, 2015

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

So far, when discussing classification, we’ve been playing with my toy dataset (actually, I should not claim it’s mine: it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it becomes much harder to visualise. For instance, consider

MYOCARDE=read.table(
  "http://freakonometrics.free.fr/saporta.csv",
  header=TRUE, sep=";")

where we have observations on patients admitted to the E.R. for a myocardial infarction, and we want to understand who survived, in order to build a predictive model. But before running some classifier, let us visualise our data. Since we have seven explanatory variables plus our class (survival or death), we can go for a PCA.
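As a quick sanity check first, we can glance at what was actually read (output not reproduced here),

str(MYOCARDE)          # seven continuous covariates, plus the PRONO class
table(MYOCARDE$PRONO)  # class balance, deaths versus survivals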

library(FactoMineR) # PCA (on the continuous variables)
X=MYOCARDE[,1:7]
acp=PCA(X)

To add the death/survival variable, we treat it as a numerical 0/1 variable (at least to get a direction)

MYOCARDE2=MYOCARDE
MYOCARDE2$PRONO=(MYOCARDE2$PRONO=="SURVIE")*1
acp=PCA(MYOCARDE2,quanti.sup=8,graph=TRUE)

The nice thing is that we can see here which variables are collinear with that survival indicator. It is also possible to visualise the individuals, and the classes, too

acp=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp, habillage = 8,col.hab=c("red","blue"))

It looks like we should be able to derive a nice classifier here. At least, when projecting on the first two components, we can distinguish our classes.

Now, can’t we get a classifier and visualise it on the first two components? Since a PCA is simply based on an orthogonal projection, yes, we can (keeping in mind that the data were normalised here). Given two coordinates in the plane of the first two components (assuming the other five are zero), our transformation matrix, our normalising constants, and a classifier (here based on a logistic regression), we can map that point back to the original space, and classify it

acp=PCA(X,ncp=ncol(X))
M=acp$var$coord
m=apply(X,2,mean)
s=apply(X,2,sd)
pred=function(d1,d2,Mat,reg){
  # map the point (d1,d2) of the first factorial plane
  # (other components set to zero) back to the scaled space
  z=Mat %*% c(d1,d2,rep(0,ncol(X)-2))
  # undo the normalisation to recover the original units
  newd=data.frame(t(z*s+m))
  names(newd)=names(X)
  predict(reg,newdata=newd,type="response") }

Consider now a logistic regression. Just to get something not too complicated (and to get rid of non-significant variables), we use a stepwise procedure to simplify the model,

reg_tot=glm(PRONO~.,data=MYOCARDE,
family=binomial)
reg_tot=step(reg_tot)
Minv=solve(M)
p=function(d1,d2) pred(d1,d2,Minv,reg_tot)
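
As a quick sanity check (a small illustrative example, not in the original post), we can evaluate that function at the origin of the first factorial plane, which corresponds to an individual with every covariate at its sample mean,

# all components at zero, i.e. every covariate at its mean
p(0,0)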

To visualise the isoprobability line (where an individual has a 50% chance of surviving), use the following

# base R's outer() needs a vectorised function, so we fill the grid with explicit loops
Outer <- function(x,y,fun) {
  mat <- matrix(NA, length(x), length(y))
  for (i in seq_along(x)) {
    for (j in seq_along(y))
      mat[i,j] = fun(x[i],y[j]) }
  return(mat) }
 
xgrid=seq(-5,5,length=251)
ygrid=seq(-5,5,length=251)
zgrid=Outer(xgrid,ygrid,p)
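
As a side note, the same grid can be obtained with base R's outer, once our scalar function has been vectorised,

# equivalent to the Outer helper above
zgrid=outer(xgrid,ygrid,Vectorize(p))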

and then we add a contour line to our previous graph

acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp2, habillage = 8,col.hab=c("red","blue"))
contour(xgrid,ygrid,zgrid,add=TRUE,levels=.5)

It is not too bad, but we should be able to do better. What if we keep all the variables here (even those that were not significant)?

reg_tot=glm(PRONO~.,data=MYOCARDE,
            family=binomial)
Minv=solve(M)
p=function(d1,d2) pred(d1,d2,Minv,reg_tot)
zgrid=Outer(xgrid,ygrid,p)
acp2=PCA(MYOCARDE,quali.sup=8,graph=TRUE)
plot(acp2, habillage = 8,col.hab=c("red","blue"))
contour(xgrid,ygrid,zgrid,add=TRUE,levels=.5)

It is much better, isn’t it? Of course, in real life, to actually say something relevant about our classifier, we should fit our model on a subset of our observations, and then test it on the remaining ones (see the sketch below). But here, the goal was more to get a function to visualise our classification in some projected space.
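
A minimal sketch of such an out-of-sample check (the 70/30 split, the seed and the .5 cutoff are arbitrary choices, not part of the original analysis),

set.seed(1)
idx=sample(nrow(MYOCARDE),size=floor(.7*nrow(MYOCARDE)))
reg=glm(PRONO~.,data=MYOCARDE[idx,],family=binomial)
prob=predict(reg,newdata=MYOCARDE[-idx,],type="response")
# out-of-sample misclassification rate
mean((prob>.5)!=(MYOCARDE$PRONO[-idx]=="SURVIE"))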
