Classification with Linear Discriminant Analysis

December 23, 2016

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Classification with linear discriminant analysis is a common approach to predicting class membership of observations. A previous post explored the descriptive aspect of linear discriminant analysis with data collected on two groups of beetles. In this post, we will use the discriminant functions found in the first post to classify the observations. We will also employ cross-validation on the predicted groups to get a realistic sense of how the model would perform in practice on new observations. Linear classification analysis assumes the populations have equal covariance matrices (\Sigma_1 = \Sigma_2) but does not assume the data are normally distributed.

Classification with Linear Discriminant Analysis

The classification portion of LDA can be employed after calculating \bar{y}_1, \bar{y}_2 and S_{p1}. The procedure for classifying observations is based on the discriminant functions:

z = a'y = (\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}y

y is the vector of measurements to be classified. The mean discriminant scores \bar{z}_1 = a'\bar{y}_1 and \bar{z}_2 = a'\bar{y}_2 of the two groups are used to determine the group to which the observation vector belongs. The classification procedure assigns the observation vector y to group 1 if its discriminant score z = a'y is closer to \bar{z}_1, or to group 2 if it is closer to \bar{z}_2. Because \bar{z}_1 is always greater than \bar{z}_2, z is closer to \bar{z}_1 when it exceeds the midpoint of the two:

z > \frac{1}{2}(\bar{z}_1 + \bar{z}_2)

The midpoint can be expressed in terms of the group mean vectors:

\frac{1}{2}(\bar{z}_1 + \bar{z}_2) = \frac{1}{2}(\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}(\bar{y}_1 + \bar{y}_2)
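
The identity above follows from applying the discriminant function to each group mean. A short derivation using only the symbols already defined is sketched here; the final inequality assumes S_{p1} is positive definite and the two mean vectors differ.

```latex
\begin{align*}
\bar{z}_1 &= a'\bar{y}_1 = (\bar{y}_1 - \bar{y}_2)' S_{p1}^{-1} \bar{y}_1, \qquad
\bar{z}_2 = a'\bar{y}_2 = (\bar{y}_1 - \bar{y}_2)' S_{p1}^{-1} \bar{y}_2 \\
\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)
  &= \tfrac{1}{2}(\bar{y}_1 - \bar{y}_2)' S_{p1}^{-1} (\bar{y}_1 + \bar{y}_2) \\
\bar{z}_1 - \bar{z}_2
  &= (\bar{y}_1 - \bar{y}_2)' S_{p1}^{-1} (\bar{y}_1 - \bar{y}_2) > 0
\end{align*}
```

The last line is why a score above the midpoint is closer to \bar{z}_1 than to \bar{z}_2.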

Thus the observation vector y is assigned to a group determined by the following:

Assign y to group 1 if:

a'y = (\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}y > \frac{1}{2}(\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}(\bar{y}_1 + \bar{y}_2)

Or assign y to group 2 if:

a'y = (\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}y < \frac{1}{2}(\bar{y}_1 - \bar{y}_2)'S_{p1}^{-1}(\bar{y}_1 + \bar{y}_2)

This classification rule is where the discriminant function comes into play. Note the discriminant function acts as a linear classification function only in the two-group case.
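
The rule can be wrapped in a small function. The sketch below is only an illustration: the helper name classify_two_group and the toy inputs are assumptions, not part of the original analysis.

```r
# Hypothetical helper (not from the original post): two-group linear
# classification rule. mean1 and mean2 are the group mean vectors, sp is
# the pooled covariance matrix, and y is the observation to classify.
classify_two_group <- function(y, mean1, mean2, sp) {
  a <- solve(sp, mean1 - mean2)             # a = S_p1^{-1} (ybar_1 - ybar_2)
  z <- sum(a * y)                           # discriminant score z = a'y
  cutoff <- 0.5 * sum(a * (mean1 + mean2))  # midpoint of zbar_1 and zbar_2
  if (z > cutoff) 1 else 2
}

# Toy inputs, invented purely for illustration
m1 <- c(2, 2)
m2 <- c(0, 0)
sp <- diag(2)

classify_two_group(c(1.8, 2.1), m1, m2, sp)   # observation near the group 1 mean
classify_two_group(c(0.2, -0.1), m1, m2, sp)  # observation near the group 2 mean
```

Using solve(sp, b) rather than solve(sp) %*% b avoids forming the inverse explicitly; because the pooled covariance matrix is symmetric, both give the same coefficient vector.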

Classification with Linear Discriminant Analysis in R

The following steps should be familiar from the discriminant function post. We first calculate the group mean vectors \bar{y}_1 and \bar{y}_2 and the pooled sample covariance matrix S_{p1}. The beetle data were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher.

beetles <- read.table('BEETLES.DAT',
                      col.names = c('Measurement.Number', 'Species',
                                    'transverse.groove.dist', 'elytra.length',
                                    'second.antennal.joint.length',
                                    'third.antennal.joint.length'))

Find the group means and the pooled sample covariance matrix.

beetle1 <- beetles[beetles$Species == 1,][,3:6]
beetle2 <- beetles[beetles$Species == 2,][,3:6]

n1 <- nrow(beetle1)
n2 <- nrow(beetle2)

beetle1.means <- apply(beetle1, 2, mean)
beetle2.means <- apply(beetle2, 2, mean)

w1 <- (n1 - 1) * var(beetle1)
w2 <- (n2 - 1) * var(beetle2)

sp1 <- 1 / (n1 + n2 - 2) * (w1 + w2)
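
The pooled estimate is a weighted average of the two group covariance matrices, with weights n1 - 1 and n2 - 1. As a minimal sketch, the same calculation can be written as a reusable function and checked on simulated data; the name pooled_cov and the simulated matrices are assumptions made for illustration.

```r
# Hypothetical helper: pooled covariance matrix of two groups of observations
pooled_cov <- function(g1, g2) {
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
}

set.seed(1)  # simulated data, not the beetle measurements
g1 <- matrix(rnorm(40), ncol = 4)            # 10 observations, 4 variables
g2 <- matrix(rnorm(60, mean = 1), ncol = 4)  # 15 observations, 4 variables

sp <- pooled_cov(g1, g2)
dim(sp)          # one row and one column per variable
isSymmetric(sp)  # a covariance matrix should be symmetric
```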

The cutoff point to determine group membership of the observation vector is then found.

cutoff <- .5 * (beetle1.means - beetle2.means) %*% solve(sp1) %*% (beetle1.means + beetle2.means)
cutoff
##           [,1]
## [1,] -15.80538

Thus if z is greater than -15.81, the observation is assigned to group 1. Otherwise, it is assigned to group 2. We can apply the computed discriminant function to the beetle data already collected to determine how well it performs in classifying observations.

species.prediction <- apply(beetles[,3:6], 1, function(y) {
  # Calculate the discriminant function for the observation vector y
  z <- (beetle1.means - beetle2.means) %*% solve(sp1) %*% y
  ifelse(z > cutoff, 1, 2)
})
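
The row-by-row call inverts the pooled covariance matrix once per observation. A possible alternative, sketched below on simulated data (an assumption, since the beetle file may not be at hand), computes the coefficient vector once and scores every row with a single matrix product.

```r
set.seed(42)  # simulated stand-in for the beetle measurements
x1 <- matrix(rnorm(80, mean = 1), ncol = 4)  # group 1: 20 x 4
x2 <- matrix(rnorm(80, mean = 0), ncol = 4)  # group 2: 20 x 4
x  <- rbind(x1, x2)

m1 <- colMeans(x1)
m2 <- colMeans(x2)
sp <- ((nrow(x1) - 1) * var(x1) + (nrow(x2) - 1) * var(x2)) /
  (nrow(x1) + nrow(x2) - 2)

a      <- solve(sp, m1 - m2)        # coefficient vector, computed once
cutoff <- 0.5 * sum(a * (m1 + m2))  # midpoint between the group mean scores
z      <- as.vector(x %*% a)        # discriminant score for every row
pred   <- ifelse(z > cutoff, 1, 2)

# Row-by-row version, in the style used above, for comparison
pred_loop <- apply(x, 1, function(y) {
  zi <- (m1 - m2) %*% solve(sp) %*% y
  ifelse(zi > cutoff, 1, 2)
})
all(pred == as.vector(pred_loop))  # do the two approaches agree?
```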

Print a confusion matrix to display how the observations were assigned compared to their actual groups.

table(beetles$Species, species.prediction, dnn = c('Actual Group','Predicted Group'))
##             Predicted Group
## Actual Group  1  2
##            1 19  0
##            2  1 19

Our predictions classified all of group 1’s observations correctly but incorrectly assigned a group 2 observation to group 1. The error rate is simply the number of misclassifications divided by the total sample size.

n <- dim(beetles)[1]
1 / n
## [1] 0.02564103
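
The same figure can be computed directly from a confusion matrix, which avoids counting misclassifications by hand. The sketch below re-enters the counts printed above; the object names are illustrative.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
conf <- matrix(c(19, 1, 0, 19), nrow = 2,
               dimnames = list(Actual = c("1", "2"), Predicted = c("1", "2")))

# Correct classifications lie on the diagonal
error_rate <- 1 - sum(diag(conf)) / sum(conf)
error_rate
```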

Thus our predictions were rather close to the actual groups, with only a 2.6% error rate. However, error rates computed on the same data used to build the classifier tend to be optimistic, particularly when sample sizes are small, as in this case. Therefore, we will also perform leave-one-out cross-validation to find a more realistic error rate.

The lda() function from the MASS package can also be used to make predictions on the supplied data or a new data set.

library(MASS)

The predict() function accepts an lda object and, when no new data are supplied, returns predictions for the data used to fit the model.

beetle.lda <- lda(Species ~ .-Measurement.Number, data = beetles)
lda.pred <- predict(beetle.lda)$class

As before, print a confusion matrix to display the results of the predictions compared to the actual group memberships.

table(beetles$Species, lda.pred, dnn = c('Actual Group','Predicted Group'))
##             Predicted Group
## Actual Group  1  2
##            1 19  0
##            2  1 19
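
To classify observations that were not used to fit the model, predict() also takes a newdata argument. The sketch below is an illustration under stated assumptions: it uses two species from the built-in iris data as a stand-in for the beetle file, and the new observation is made up.

```r
library(MASS)

# Stand-in data: two species from the built-in iris data set,
# used only because the beetle file may not be at hand
two <- droplevels(subset(iris, Species != "setosa"))

fit <- lda(Species ~ ., data = two)

# A made-up new observation; column names must match the predictors
new_obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.8,
                      Petal.Length = 4.5, Petal.Width = 1.4)

pred <- predict(fit, newdata = new_obs)
pred$class      # predicted group
pred$posterior  # posterior probability of each group
```

By default lda() takes its prior probabilities from the group proportions in the training data, so when group sizes are unequal its cutoff can differ slightly from the midpoint rule used in the manual calculation.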

Cross-Validation of Predicted Groups

As mentioned previously, error rates estimated from the same data used to build the classifier tend to be optimistic, particularly when the sample is small. Cross-validation is a technique used to estimate how accurately a predictive model may perform in practice. When larger samples are available, the more common approach of splitting the data into training and test sets can be used instead. There are many different approaches to cross-validation, including leave-p-out and k-fold cross-validation. One particular case of leave-p-out cross-validation is the leave-one-out approach (p = 1), sometimes also called the holdout method.

Leave-one-out cross-validation is performed by using all but one of the sample observation vectors to determine the classification function and then using that classification function to predict the omitted observation's group membership. The procedure is repeated for each observation so that each is classified by a function of the other observations. The leave-one-out technique is demonstrated on the beetle data below. The approach to building the discriminant and classification functions remains the same as before, except that only N - 1 observations are used in each iteration.

cv.prediction <- c()

for (i in 1:n) {
  holdout <- beetles[-i,]
  
  holdout1 <- holdout[holdout$Species == 1,][,3:6]
  holdout2 <- holdout[holdout$Species == 2,][,3:6]
  
  holdout1.means <- apply(holdout1, 2, mean)
  holdout2.means <- apply(holdout2, 2, mean)
  
  n1 <- nrow(holdout1)
  n2 <- nrow(holdout2)

  w1 <- (n1 - 1) * var(holdout1)
  w2 <- (n2 - 1) * var(holdout2)

  sp1 <- 1 / (n1 + n2 - 2) * (w1 + w2)

  cutoff <- .5 * (holdout1.means - holdout2.means) %*% solve(sp1) %*% (holdout1.means + holdout2.means)
  
  ay <- (holdout1.means - holdout2.means) %*% solve(sp1) %*% as.numeric(beetles[i,3:6])
  group <- ifelse(ay > cutoff, 1, 2)
  cv.prediction <- append(cv.prediction, group)
}
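
The loop can also be packaged as a function that returns one prediction per observation. This is a minimal sketch on simulated data; the name loo_predict and its arguments are assumptions for illustration, and it expects the group labels to be coded 1 and 2.

```r
# Hypothetical helper: leave-one-out predictions for the two-group rule.
# x is a numeric matrix of measurements, g a vector of group labels (1 or 2).
loo_predict <- function(x, g) {
  vapply(seq_len(nrow(x)), function(i) {
    train <- x[-i, , drop = FALSE]  # all observations except the i-th
    lab   <- g[-i]
    t1 <- train[lab == 1, , drop = FALSE]
    t2 <- train[lab == 2, , drop = FALSE]
    m1 <- colMeans(t1)
    m2 <- colMeans(t2)
    sp <- ((nrow(t1) - 1) * var(t1) + (nrow(t2) - 1) * var(t2)) /
      (nrow(t1) + nrow(t2) - 2)
    a      <- solve(sp, m1 - m2)
    cutoff <- 0.5 * sum(a * (m1 + m2))
    if (sum(a * x[i, ]) > cutoff) 1 else 2  # classify the omitted observation
  }, numeric(1))
}

set.seed(7)  # simulated data, not the beetle measurements
x <- rbind(matrix(rnorm(60, mean = 1), ncol = 3),
           matrix(rnorm(60, mean = 0), ncol = 3))
g <- rep(c(1, 2), each = 20)

cv <- loo_predict(x, g)
table(g, cv, dnn = c("Actual Group", "Predicted Group"))
mean(cv != g)  # cross-validated error rate
```

vapply() allocates the result vector up front rather than growing it with append() on each pass, and keeping the intermediate quantities local to the function avoids overwriting objects such as sp1 and cutoff from the full-data fit.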

Construct a confusion matrix to display how the observations were classified.

table(beetles$Species, cv.prediction, dnn = c('Actual Group','Predicted Group'))
##             Predicted Group
## Actual Group  1  2
##            1 19  0
##            2  3 17

As before, all of group 1's observations were correctly classified; however, three of group 2's observations were incorrectly assigned to group 1. Although the cross-validated error rate has tripled to about 7.7% (3 of 39 observations), it is a more realistic estimate than the non-cross-validated result.

Leave-one-out cross-validation is also available in the lda() function through the CV argument.

beetle.cv <- lda(Species ~ .-Measurement.Number, CV=TRUE, data = beetles)

table(beetles$Species, beetle.cv$class, dnn = c('Actual Group','Predicted Group'))
##             Predicted Group
## Actual Group  1  2
##            1 19  0
##            2  3 17

Summary

This post explored the predictive aspect of linear discriminant analysis and gave a brief introduction to cross-validation through the leave-one-out method. As noted, it is often important to perform some form of cross-validation on datasets with few observations to get a more realistic indication of how accurate the model will be in practice. Future posts will examine classification with linear discriminant analysis for more than two groups as well as quadratic discriminant analysis.

References

Rencher, A. (n.d.). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.

