Would hierarchical clustering be as popular without the dendrogram? Cannot the same be said of finite mixture modeling with its multidimensional spaces populated by normal distributions? I invite you to move your mouse over the figure on the introductory page of the website for the R package mclust and click through all the graphics that bring mixture modeling to life. So what is the compelling image for nonnegative matrix factorization (NMF)?
Dendrograms are constructed from distance matrices. We have some choice in the distance or divergence metric, but once a variable has been selected, it is included in all the distance calculations. Finite mixtures and k-means avoid such matrices, but still define fit as some variation on the ratio of between-cluster and within-cluster distances, once again computed from all the variables selected for the analysis. These are the clusters pictured in the above links to dendrograms and isodensity curves in low-dimensional spaces derived from the entire set of included variables.
However, such representations do not exhaust common ways of talking and thinking about similarity. For example, object substitution in a task or an activity is based on a more limited definition of shared functionality. These are goal-derived categories that I discussed at the end of my post showing how NMF can use top-contender rankings to reveal preference patterns for breakfast foods. Will a Danish pastry be good enough when all the donuts have been eaten? The thought of eating the donut evokes the criteria upon which substitutability and hence similarity will be judged (see Norm Theory: Comparing Reality to Its Alternatives). In the context of toast and other options for breakfast, the donut and the Danish may appear more similar in contrast, yet that is not what comes to mind when hungry for donuts. Similarity can only be defined within a context, as noted by Nelson Goodman.
Similarity Derived from Building Blocks in Localized Additive Representations
What did you do today? I could give you a list of activities and ask you to indicate how frequently you engaged in each activity. Who else is like you? Should we demand complete agreement across all the activities, or is it sufficient that you share some common task? If my list is a complete inventory, it will include many relatively infrequent activities performed only by specific subgroups. For example, caregivers for very young children come in all ages and gender from diverse backgrounds and with other responsibilities, yet they form a product category with an entire aisle of the supermarket dedicated to their shared needs. Situational demands pull together individuals in rows and activities in columns to mold a building block of relational data.
To be clear, a hierarchical clustering of respondents or the rows of our data matrix averages over all the columns. We start with an nxp data matrix, but perform the analysis with the nxn dissimilarity or distance matrix. Still, the data matrix contains relational data. Individuals are associated with the activities they perform. Instead of ignoring these relationships between the rows and the columns, we could seek simultaneous clustering of individuals and activities to form blocks running along the diagonal of the data matrix (a block diagonal matrix). Consequently, we may wish to alter our initial figure at the beginning of this post to be more precise and push the colored blocks out of a straight line to form a diagonal with each block demarcated by the intersection of individuals and their frequent activities.
A Toy Example
We can see how it is all accomplished in the following NMF analysis. We will begin with a blank data matrix and combine two blocks to form the final data matrix in the lower right of the figure below.
The final data matrix represents 6 respondents in the rows and 4 activities in the columns. The cells indicate the frequency of engaging in the activity ranging from 0=never to 6=daily. Since the rows and columns have been sorted into two 3×2 blocks along the diagonal, we have no problem directly interpreting this small final data matrix. The frequency of the 4 activities is greatest in the first column and least for third column. The first 3 and last 3 respondents are separated by the first 2 and last 2 activities. It appears that the 6×4 data matrix might be produced by only two latent features.
The following R code creates the final data matrix as the matrix product of respondent mixing weights and activity latent feature coefficients. That is, activities get organized into packets of stuff done by the same respondents, and respondents get clustered based on their activities. If you are familiar with the co-evolution of music genre and listening communities, you will not be surprised by the co- or bi-clustering of rows and columns into these diagonal building blocks. In a larger data matrix, however, we would expect to see both purist with nonzero mixing weights for only one latent feature and hybrids that spread their weights across several latent features. As noted in earlier posts, NMF thrives on sparsity in the data matrix especially when there is clear separation into non-overlapping blocks of rows and columns (e.g., violent action films and romantic comedies appealing to different audiences or luxury stores and discount outlets with customers tending to shop at one or the other).
# enter the data for the respondent mixing weights MX<-matrix(c(3,2,1,0,0,0,0,0,0,1,1,2), ncol=2) MX # enter the data for the latent features LP<-matrix(c(2,1,0,0,0,0,1,2), ncol=4, byrow=TRUE) LP # observed data is the matrix product DATA<-MX%*%LP DATA # load the NMF library library(NMF) # run with rank=2 fit<-nmf(DATA, 2, "lee", nrun=20) # output the latent feature coefficients lp<-coef(fit) round(lp,3) #output the respondent mixing weights mx<-basis(fit) round(mx,3) # reproduce the data matrix using NMF results data<-mx%*%lp round(data) # output residuals round(DATA-data,3) # same but only for 1st latent feature rank1<-as.matrix(mx[,1])%*%lp[1,] round(DATA-rank1,3) # same but only for 2nd latent feature rank2<-as.matrix(mx[,2])%*%lp[2,] round(DATA-rank2,3) # additive representation round(rank1+rank2,3) round(DATA-rank1-rank2,3)
The extensive comments in this R code reduce the need for additional explanation except to emphasize that you should copy and run the code in R (after installing NMF). I did not set a seed so that the order of the two parts may be switched. The exercise is intended to imprint the building block imagery. In addition, you might wish to think about how NMF deals with differences in respondent and activity intensity. For example, the first three respondents all engage in the first two activities but with decreasing frequency. Moreover, the same latent feature is responsible for the first two activities, yet the first activity is more frequent than the second.
I would suggest that the answer can be found in the following section of output from the above code. You must, of course, remember your matrix multiplication. The first cell in our data matrix contains a "6" formed by multiplying the first row of mx by the first column of lp or 0.5 x 12 + 0.0 x 0 = 6. Now, it is easy to see that the 0.500, 0.333 and 0.167 in mx reveal the decreasing intensity of the first latent feature. Examining the rest of mx suggest that the last respondent should have higher scores than the previous two and that is what we discover.
[,1] [,2] [,3] [,4]
[1,] 12 6 0 0
[2,] 0 0 4 8
[1,] 0.500 0.00
[2,] 0.333 0.00
[3,] 0.167 0.00
[4,] 0.000 0.25
[5,] 0.000 0.25
[6,] 0.000 0.50
When you see diagrams, such as the following from Wikipedia, you should take them literally.
The data matrix V is reproduced approximately by a reduced rank matrix of mixing weights W multiplied by a reduced rank matrix of latent features H. These interpretations of W and H depend on V being a respondents-by-variables data matrix. One needs to be careful because many applications of NMF reverse the rows and columns changing the meaning of W and H.
The number of columns in W and the number of rows in H can be much smaller than the number of observed variables, which is what is meant by data reduction. The same latent features are responsible for the clustering of respondents and variables. This process of co- or bi-clustering has redefined similarity by computing distances within the building blocks instead of across all the rows and columns. Something had to be done if we wish to include a complete inventory of activities. As the number of activities increase, the data become increasingly sparse and distances become more uniform (see Section 3 The Curse of Dimensionality).
The building block imagery seems to work in this example because different people engage in different activities. The data matrix is sparse due to such joint separation of row and columns. Those building blocks, the latent features, provide a localized additive representation from which we can reproduce the data matrix by stacking the blocks, or stated more accurately, by a convex combination of the latent features.