Discriminating Between Iris Species

August 4, 2012
By

(This article was first published on imDEV » R, and kindly contributed to R-bloggers)

The Iris data set is a famous for its use to compare unsupervised classifiers.

The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that no single variable nor bivariate combination can achieve this.

One approach to improve the separation between the two closely related Iris species, I.versicolor (blue) and I.virginica (green), is to use a combination of all 4 measurements, by constructing principal components (PCs).

Using the singular value decomposition to calculate PCs we see that the sample scores above are not resolved for the two species of interest.

Another approach is to use a supervised projection method like partial least squares (PLS), to identify Latent Variables (LVs) which are data projections similar to those of PCA, but  which are also correlated with the species label. Interestingly this approach leads to a projection which changes the relative orientation of  I. versicolor and I. verginica to I. setaosa. However,  this supervised approach is not enough to identify a hyperplane of separation between all three species.

Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores we can see that  this  approach is the most success for resolving the  two closely related species. However, the loadings from this method, which help relate how the variables are combined achieve the classification, are impossible to interpret. In the case of the function used above(nlPca, pcaMethods R package)  the loadings are literally NA.


To leave a comment for the author, please follow the link and comment on his blog: imDEV » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.