I picked up the AT&T Laboratories Cambridge database of faces for a clustering application. The database consists of images of 40 distinct subjects, each in 10 different facial positions and expressions. Typically, the goal of clustering these data is to recover the ‘true’ partition, the one that isolates images of distinct subjects. Each image is 92 x 112 pixels in dimension, taking black-and-white integer values in the 8-bit range (0 to 255). Such high-dimensional images (92 x 112 = 10304 pixels) are difficult to work with directly. We can look to data-squashing to help here. (Actually, I’m not sure the term ‘data-squashing’ was intended for methods like PCA, but it seems appropriate to me.)
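The AT&T images are stored in the binary (P5) PGM format, which is simple enough to parse by hand: an ASCII header giving the dimensions and maximum gray value, followed by raw pixel bytes. Here is a minimal sketch in Python (my actual code is in R; `read_pgm` is a hypothetical name, and the reader assumes a well-formed header with no comment lines, which holds for these files):

```python
import numpy as np

def read_pgm(path):
    """Read a binary (P5) PGM file into a 2-D numpy array of uint8.

    Minimal reader: assumes the header is three whitespace-separated
    lines (magic, 'width height', maxval) with no '#' comments.
    """
    with open(path, "rb") as f:
        magic = f.readline().split()[0]
        assert magic == b"P5", "only binary PGM is handled here"
        width, height = map(int, f.readline().split())
        maxval = int(f.readline())          # 255 for the AT&T faces
        data = np.frombuffer(f.read(width * height), dtype=np.uint8)
    return data.reshape(height, width)

# Flattening each 92 x 112 face to a row gives the 400 x 10304 data
# matrix that PCA operates on:
# X = np.vstack([read_pgm(p).ravel() for p in paths])
```

Stacking the flattened images row-wise is what makes each of the 10304 pixels a variable, and each of the 400 faces an observation.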
I used principal components analysis to identify a set of rotated pixels that were highly variable, and presumably most useful for discriminating between the images, resulting in this interesting image montage. The first 20 eigenimages (in reading order) each represent the rotation of a 92 x 112 black-and-white image onto a single pixel. Darker regions carry higher loadings in the rotation. Consequently, darker regions are important for discriminating between images in the dataset. The dark pixels in the top-left image account for about 18% of the variability in the entire dataset. In other words, these regions of the face may be the most useful for facial recognition.
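The eigenimages are just the principal axes of the centered data matrix, reshaped back to 92 x 112. A Python sketch of that computation (the post's actual code does this in R via PCA; `eigenimages` is a hypothetical name, and I use the SVD directly rather than the covariance matrix, which is equivalent and cheaper here):

```python
import numpy as np

def eigenimages(X, k=20):
    """Return the first k principal axes of X (one flattened image
    per row), plus the proportion of variance each axis explains."""
    Xc = X - X.mean(axis=0)                      # center each pixel
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / np.sum(s**2)                    # variance proportions
    return Vt[:k], var[:k]

# axes_, var = eigenimages(X)
# first = axes_[0].reshape(112, 92)   # loading vector back to an image
# Per the figure, var[0] is about 0.18 for the AT&T faces.
```

Each row of the returned matrix is one eigenimage; plotting them in reading order, with darker shades for larger loadings, reproduces the montage.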
I’ve put together an archive of the images, along with a function that reads the PGM image pixels into R, does the PCA, and recreates the graphic above, in fewer than 60 lines (though I shouldn’t boast, else someone will cut it to 20 lines and shame me). You can download the archive here ATTfaces.tar.gz (please be patient, ~3.7MB). From a shell prompt, recreate the graphic as follows:
$ tar -xvzf ATTfaces.tar.gz
$ R -q
> source("ATTfaces.R")
> pcaPlot()
Disclaimer: This image is a re-posting from my old website. However, the code and discussion were not given before.