Supervised Classification, discriminant analysis


Another popular classification technique (or at least, one that used to be popular) is (linear) discriminant analysis, introduced by Ronald Fisher in 1936. Consider the same dataset as in our previous post

> clr1 <- c(rgb(1,0,0,1),rgb(0,0,1,1))
> x <- c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
> y <- c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
> z <- c(1,1,1,1,1,0,0,1,0,0)
> df <- data.frame(x,y,z)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])

The main interest of this technique is not so much its output as the fact that all computations here are simple and explicit, which helps to get a better understanding of the theoretical concepts behind classification.

The code to run a linear discriminant analysis is

> library(MASS)
> fit_lda <- lda (z ~ x+y, data=df)
> fit_lda
Call:
lda(z ~ x + y, data = df)
 
Prior probabilities of groups:
  0   1 
0.4 0.6 
 
Group means:
          x      y
0 0.4750000 0.3625
1 0.4583333 0.6950
 
Coefficients of linear discriminants:
        LD1
x -2.588390
y  4.762615

To visualize the prediction, consider

> pred_lda <- function(x,y) as.numeric(predict(fit_lda, 
+       newdata=data.frame(x=x,y=y))$class)
> x_grid<-seq(0,1,length=101)
> y_grid<-seq(0,1,length=101)
> z_grid <- outer(x_grid,y_grid,pred_lda)
> clr1=c(rgb(1,0,0,1),rgb(0,0,1,1))
> clr2=c(rgb(1,0,0,.2),rgb(0,0,1,.2))
> image(x_grid,y_grid,z_grid,col=clr2)
> points(x,y,pch=19,cex=2,col=clr1[z+1])
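For comparison, we can overlay the decision boundary of the logistic regression from the previous post on the same graph (a minimal sketch, assuming that model was simply glm(z ~ x + y, family = binomial)):

> fit_glm <- glm(z ~ x + y, data=df, family=binomial)
> p_glm <- function(x,y) predict(fit_glm,
+       newdata=data.frame(x=x,y=y), type="response")
> p_grid <- outer(x_grid,y_grid,p_glm)
> contour(x_grid,y_grid,p_grid,levels=.5,add=TRUE)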

Note that it is very close to what we got with our logistic regression. So, what did we actually do here? The idea is that

$$\boldsymbol{X}\mid Y=0\sim\mathcal{N}(\boldsymbol{\mu}_0,\boldsymbol{\Sigma})$$

and

$$\boldsymbol{X}\mid Y=1\sim\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma})$$

Using Bayes’ theorem and these two relationships, it is possible to compute the probability that $Y=1$ (for instance), given $\boldsymbol{X}$. Here, we can easily estimate the three parameters, since the means are simply the centroids of the two groups,

$$\widehat{\boldsymbol{\mu}}_k=\frac{1}{\text{card}\{i:Y_i=k\}}\sum_{i:Y_i=k}\boldsymbol{X}_i$$

and the common variance matrix is estimated by pooling the two groups,

$$\widehat{\boldsymbol{\Sigma}}=\frac{1}{n}\left(\sum_{i:Y_i=0}(\boldsymbol{X}_i-\widehat{\boldsymbol{\mu}}_0)(\boldsymbol{X}_i-\widehat{\boldsymbol{\mu}}_0)^{\mathsf{T}}+\sum_{i:Y_i=1}(\boldsymbol{X}_i-\widehat{\boldsymbol{\mu}}_1)(\boldsymbol{X}_i-\widehat{\boldsymbol{\mu}}_1)^{\mathsf{T}}\right)$$

> m0=fit_lda$means[1,]
> m1=fit_lda$means[2,]
> df$x0=df$x-m0[1]
> df$y0=df$y-m0[2]
> df$x0[z==1]=df$x[z==1]-m1[1]
> df$y0[z==1]=df$y[z==1]-m1[2]
> S=var(df[,c("x0","y0")])
> library(mnormt)
> d0 <- function(x,y) dmnorm(cbind(x,y),mean=m0,varcov=S)
> z_0 <- outer(x_grid,y_grid,d0)
> d1 <- function(x,y) dmnorm(cbind(x,y),mean=m1,varcov=S)
> z_1 <- outer(x_grid,y_grid,d1)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])
> contour(x_grid,y_grid,z_0,col=clr2[1],add=TRUE)
> contour(x_grid,y_grid,z_1,col=clr2[2],add=TRUE)
> points(fit_lda$means,col=clr1,pch=4,lwd=2)

To visualize the two regions, we can simply compare the two densities: if their ratio exceeds one, the point should be assigned to class 1, so the separating curve is where the ratio equals one,

> contour(x_grid,y_grid,z_1/z_0,levels=1)

But to be consistent with the Bayesian interpretation, the ratio of the two densities should not be compared with one, but with the ratio of the proportions of points in each category, since

$$\mathbb{P}(Y=y\mid\boldsymbol{X}=\boldsymbol{x})\propto\varphi_{\boldsymbol{\mu}_y,\boldsymbol{\Sigma}}(\boldsymbol{x})\cdot\mathbb{P}(Y=y)$$

Thus,

$$\mathbb{P}(Y=1\mid\boldsymbol{X}=\boldsymbol{x})=\mathbb{P}(Y=0\mid\boldsymbol{X}=\boldsymbol{x})$$

if and only if

$$\frac{\varphi_{\boldsymbol{\mu}_1,\boldsymbol{\Sigma}}(\boldsymbol{x})}{\varphi_{\boldsymbol{\mu}_0,\boldsymbol{\Sigma}}(\boldsymbol{x})}=\frac{\mathbb{P}(Y=0)}{\mathbb{P}(Y=1)}$$

> contour(x_grid,y_grid,z_1/z_0,levels=.4/.6)
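As a sanity check, we can also compute these posterior probabilities by hand, from the estimated densities and the empirical proportions, and compare them with the output of predict() (a sketch; the two columns should almost coincide, small differences being possible since var() uses the n-1 divisor while lda() pools the within-group covariance slightly differently):

> phi_0 <- dmnorm(cbind(x,y),mean=m0,varcov=S)
> phi_1 <- dmnorm(cbind(x,y),mean=m1,varcov=S)
> cbind(.6*phi_1/(.4*phi_0+.6*phi_1),
+       predict(fit_lda)$posterior[,"1"])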

The contour at level .4/.6 is exactly the decision boundary we obtained above from the predicted classes. Consider now the case where we have three classes (as in the previous post),

> clr1=c(rgb(1,0,0,1),rgb(1,1,0,1),rgb(0,0,1,1))
> clr2=c(rgb(1,0,0,.2),rgb(1,1,0,.2),rgb(0,0,1,.2))
> x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
> y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
> z=c(1,2,2,2,1,0,0,1,0,0)
> df=data.frame(x,y,z)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])

The code to run a linear discriminant analysis is the same,

> fit_lda <- lda (z ~ x+y, data=df)
> fit_lda
Call:
lda(z ~ x + y, data = df)
 
Prior probabilities of groups:
  0   1   2 
0.4 0.3 0.3 
 
Group means:
          x         y
0 0.4750000 0.3625000
1 0.2166667 0.5166667
2 0.7000000 0.8733333
 
Coefficients of linear discriminants:
       LD1       LD2
x 2.014510  4.237266
y 3.438693 -3.185543

To visualize the three classes, use

> pred_lda <- function(x,y) 
+ as.numeric(predict(fit_lda, 
+ newdata=data.frame(x=x,y=y))$class)
> z_grid <- outer(x_grid,y_grid,pred_lda)
> image(x_grid,y_grid,z_grid,col=clr2)
> points(x,y,pch=19,cex=2,col=clr1[z+1])

Here again, we have the interpretation of a mixture of three Gaussian vectors (with different means, but the same variance matrix)

> m0=fit_lda$means[1,]
> m1=fit_lda$means[2,]
> m2=fit_lda$means[3,]
> df$x0=df$x-m0[1]
> df$y0=df$y-m0[2]
> df$x0[z==1]=df$x[z==1]-m1[1]
> df$y0[z==1]=df$y[z==1]-m1[2]
> df$x0[z==2]=df$x[z==2]-m2[1]
> df$y0[z==2]=df$y[z==2]-m2[2]
> S=var(df[,c("x0","y0")])
> library(mnormt)
> d0 <- function(x,y) dmnorm(cbind(x,y),mean=m0,varcov=S)
> z_0 <- outer(x_grid,y_grid,d0)
> d1 <- function(x,y) dmnorm(cbind(x,y),mean=m1,varcov=S)
> z_1 <- outer(x_grid,y_grid,d1)
> d2 <- function(x,y) dmnorm(cbind(x,y),mean=m2,varcov=S)
> z_2 <- outer(x_grid,y_grid,d2)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])
> contour(x_grid,y_grid,z_0,col=clr2[1],add=TRUE)
> contour(x_grid,y_grid,z_1,col=clr2[2],add=TRUE)
> contour(x_grid,y_grid,z_2,col=clr2[3],add=TRUE)
> points(fit_lda$means,col=clr1,pch=4,lwd=2)

Again, to visualize the lines where two densities are equal, we use

> contour(x_grid,y_grid,z_1/z_0,levels=1)
> contour(x_grid,y_grid,z_2/z_0,levels=1)
> contour(x_grid,y_grid,z_1/z_2,levels=1)

or, using Bayes' theorem, to get the interpretation in terms of conditional probabilities,

> contour(x_grid,y_grid,z_1/z_0,
+ add=TRUE,levels=mean(z==0)/mean(z==1))
> contour(x_grid,y_grid,z_2/z_0,
+ add=TRUE,levels=mean(z==0)/mean(z==2))
> contour(x_grid,y_grid,z_1/z_2,
+ add=TRUE,levels=mean(z==2)/mean(z==1))
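Equivalently, the predicted class at (almost) every grid point should be the one maximising the prior-weighted density; a quick check, reusing the density grids above:

> w <- array(c(.4*z_0,.3*z_1,.3*z_2),dim=c(101,101,3))
> table(z_grid,apply(w,1:2,which.max))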

Now, this linear discriminant analysis is quite restrictive: all groups are assumed to share the same variance matrix. If we relax that assumption, we get what is called quadratic discriminant analysis.
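Since df currently contains the three-class labels, we first go back to the two-class dataset used at the beginning of the post:

> z <- c(1,1,1,1,1,0,0,1,0,0)
> df <- data.frame(x,y,z)

The code to fit the quadratic model is then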

> fit_qda <- qda (z ~ x+y, data=df)

Again, to visualize the classes, we use

> pred_qda <- function(x,y) as.numeric(predict(fit_qda, newdata=data.frame(x=x,y=y))$class)
> x_grid<-seq(0,1,length=101)
> y_grid<-seq(0,1,length=101)
> z_grid <- outer(x_grid,y_grid,pred_qda)
> clr1=c(rgb(1,0,0,1),rgb(0,0,1,1))
> clr2=c(rgb(1,0,0,.2),rgb(0,0,1,.2))
> image(x_grid,y_grid,z_grid,col=clr2)
> points(x,y,pch=19,cex=2,col=clr1[z+1])
> points(x,y,pch=1,cex=2)

Here again, the idea is that we have a mixture of two Gaussian vectors (with different variances, this time), i.e.

$$\boldsymbol{X}\mid Y=0\sim\mathcal{N}(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0)$$

while

$$\boldsymbol{X}\mid Y=1\sim\mathcal{N}(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_1)$$

> m0=fit_qda$means[1,]
> m1=fit_qda$means[2,]
> S0=var(df[z==0,c("x","y")])
> S1=var(df[z==1,c("x","y")])
> library(mnormt)
> d0 <- function(x,y) dmnorm(cbind(x,y),mean=m0,varcov=S0)
> z_0 <- outer(x_grid,y_grid,d0)
> d1 <- function(x,y) dmnorm(cbind(x,y),mean=m1,varcov=S1)
> z_1 <- outer(x_grid,y_grid,d1)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])
> contour(x_grid,y_grid,z_0,col=clr2[1],add=TRUE)
> contour(x_grid,y_grid,z_1,col=clr2[2],add=TRUE)
> points(fit_qda$means,col=clr1,pch=4,lwd=2)

Here they are,

To get the separation curve again, we use

> contour(x_grid,y_grid,z_1/z_0,levels=.4/.6)

The decision boundary, i.e.

$$\{\boldsymbol{x}:\mathbb{P}(Y=0\mid\boldsymbol{X}=\boldsymbol{x})=\mathbb{P}(Y=1\mid\boldsymbol{X}=\boldsymbol{x})\}$$

is a quadratic curve here since, up to an additive constant,

$$\log\mathbb{P}(Y=y\mid\boldsymbol{X}=\boldsymbol{x})=-\frac{1}{2}\log|\boldsymbol{\Sigma}_y|-\frac{1}{2}[\boldsymbol{x}-\boldsymbol{\mu}_y]^{\mathsf{T}}\boldsymbol{\Sigma}_y^{-1}[\boldsymbol{x}-\boldsymbol{\mu}_y]+\log\mathbb{P}(Y=y)$$
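We can evaluate this quadratic score directly on the grid and check that it gives back the same separation curve (a sketch, using the estimates above; mahalanobis() computes the quadratic form, and the result is overlaid on the previous contour):

> lscore <- function(xy,m,S,p) -log(det(S))/2 -
+       mahalanobis(xy,m,S)/2 + log(p)
> xy <- as.matrix(expand.grid(x=x_grid,y=y_grid))
> dif <- lscore(xy,m1,S1,.6)-lscore(xy,m0,S0,.4)
> contour(x_grid,y_grid,matrix(dif,101,101),levels=0,add=TRUE)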

What if we focus on the three-class case? We simply re-fit the quadratic model on the three-class labels (re-created below, together with the three-colour palette, since both were overwritten above). The classes are then

> z <- c(1,2,2,2,1,0,0,1,0,0); df <- data.frame(x,y,z)
> clr1 <- c(rgb(1,0,0,1),rgb(1,1,0,1),rgb(0,0,1,1))
> clr2 <- c(rgb(1,0,0,.2),rgb(1,1,0,.2),rgb(0,0,1,.2))
> fit_qda <- qda (z ~ x+y, data=df)
> z_grid <- outer(x_grid,y_grid,pred_qda)
> image(x_grid,y_grid,z_grid,col=clr2)
> points(x,y,pch=19,cex=2,col=clr1[z+1])
> points(x,y,pch=1,cex=2)

The idea is the same as in the linear case; we simply relax the assumption of an identical variance matrix,

> m0=fit_qda$means[1,]
> m1=fit_qda$means[2,]
> m2=fit_qda$means[3,]
> S0=var(df[z==0,c("x","y")])
> S1=var(df[z==1,c("x","y")])
> S2=var(df[z==2,c("x","y")])
> library(mnormt)
> d0 <- function(x,y) dmnorm(cbind(x,y),mean=m0,varcov=S0)
> z_0 <- outer(x_grid,y_grid,d0)
> d1 <- function(x,y) dmnorm(cbind(x,y),mean=m1,varcov=S1)
> z_1 <- outer(x_grid,y_grid,d1)
> d2 <- function(x,y) dmnorm(cbind(x,y),mean=m2,varcov=S2)
> z_2 <- outer(x_grid,y_grid,d2)
> plot(x,y,pch=19,cex=2,col=clr1[z+1])
> contour(x_grid,y_grid,z_0,col=clr2[1],add=TRUE)
> contour(x_grid,y_grid,z_1,col=clr2[2],add=TRUE)
> contour(x_grid,y_grid,z_2,col=clr2[3],add=TRUE)
> points(fit_qda$means,col=clr1,pch=4,lwd=2)
> contour(x_grid,y_grid,z_1/z_0,levels=1)
> contour(x_grid,y_grid,z_2/z_0,levels=1)
> contour(x_grid,y_grid,z_1/z_2,levels=1)

These last three calls compare the densities pairwise (at level one), but here we no longer have the conditional-probability interpretation.

Note that there is also a mixture discriminant analysis

> library(mda)
> fit_mda <- mda (z ~ x+y, data=df)          
> pred_mda <- function(x,y) 
+ as.numeric(predict(fit_mda,
+ newdata=data.frame(x=x,y=y),type="class"))

which can be used for two classes, as well as for the three-class case above.
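To visualize the corresponding regions for the three-class labels currently in df (a minimal sketch, mirroring the plotting code used earlier):

> z_grid <- outer(x_grid,y_grid,pred_mda)
> image(x_grid,y_grid,z_grid,col=clr2)
> points(x,y,pch=19,cex=2,col=clr1[z+1])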
