# How Much Can We Learn from Top Rankings using Nonnegative Matrix Factorization?

July 10, 2014
By

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

Purchases are choices from available alternatives. Post-purchase, we know what is the most preferred, but all the other options score the same. Regardless of differences in appeal, all the remaining items received the same score of not chosen. A second choice tells us more, as would the alternative selected as third most preferred. As we add top rankings from first to second to the kth choice, we seem to gain more and more information about preferences. Yet, what if we concentrated only on the top performers, what might be called the "pick of the litter" or the "top of the heap" (e.g., top k from J alternatives)? How much can be learn from such partial rankings?

Jan de Leeuw shows us what can be done with a complete ranking. What if we were to take de Leeuw breakfast food dataset and keep only the top-3 rankings so that all we know is what each respondent selected as their first, second and third choices? Everything that you would need to know is contained in the Journal of Statistical Software article by de Leeuw and Mair (see section 6.2). The data come in a matrix with 42 individuals and 15 breakfast foods. I have reproduce his plot below to make the discussion easier to follow. Please note that all the R code can be found at the end of this post.

The numbers running from 1 to 42 represent the location of each individual ranking the 15 different breakfast foods. That is, rows are individuals, columns are foods, and the cells are rankings from 1 to 15 for each row. What would you like for breakfast? Here are 15 breakfast foods, please order them in terms of your preference with "1" being your most preferred food and "15" indicating your least preferred.

The unfolding model locates each respondent's ideal and measures preference as distance from that ideal point. Thus, both rows (individuals) and columns (foods) are points that are positioned in the same space such that the distances between any given row number and the columns have the same ordering as the original data for that row. As a result, you can reproduce (approximately) an individual's preference ordering from the position of their ideal point relative to the foods. Who likes muffins? If you answered, #23 or #34 or #33 or anyone else nearby, then you understand the unfolding map.

Now, suppose that only the top-3 rankings were provided by each respondent. We will keep the rankings for first, second and third and recode everything else to zero. Now, what values should be assigned to the first, second and third picks? Although ranks are not counts, it is customary to simply reverse the ranks so that the weight for first is 3, second is 2, and third is 1. As a result, the rows are no longer unique values of 1 to 15, but instead contain one each of 1, 2 and 3 plus 12 zeroes. We have wiped out 80% of our data. Although there are other approaches for working with partial rankings, I will turn to nonnegative matrix factorization because I want a technique that works well with sparsity, for example, top 3 of 50 foods or top 5 of 100 foods. Specifically, we are seeking a general approach for dealing with any partial ranking that generates sparse data matrices. Nonnegative matrix factorization seems to be up for the task, as demonstrated in a large food consumption study.

We are now ready for the nmf R package as soon as we specify the number of latent variables. I will try to keep it simple. The data matrix is 42 x 15 with each row having 12 zeroes and three entries that are 1, 2 and 3 with 3 as the best (ranking reversed). Everything would be simpler if the observed breakfast food rankings resulted from a few latent consumption types (e.g., sweet-lovers tempted by pastries, the donuts-for-breakfast crowd, the muffin-eaters and the bread-slicers). Then, observed rankings could be accounted for by some combination of these latent types. "Pure Breads" select only toast or hard roll. "Pure Muffins" pick only the three varieties of muffin, though corn muffin may not be considered a real muffin by everyone. Coffee cakes may be its own latent type, and I have idea how nmf will deal with cinnamon toast (remember that the data is at least 40 years old). From these musings one might reasonably try three or four latent variables.

The nonnegative matrix factorization (nmf) was run with four latent variables. The first function argument is the data matrix, followed by the rank or number of latent variables, with the method next, and a number indicating the number of times you want the analysis rerun with different starting points. This last nrun argument works in the same manner as the nstart argument in kmeans. Local minimum can be a problem, so why not restart the nmf function with several different initialization and pick the best solution? The number 10 seemed to work with this data matrix, by which I mean that I obtained similar results each time I reran the function with nrun=10. You will note that I did not set the seed, so that you can try it yourself and see if you get a similar solution.

The coefficient matrix is shown below. The entries have been rescaled to fall along a scale from 0 to 100 for no other reason than it is relative value that is important and marketing research often uses such a thermometer scale. Because I will be interpreting these coefficients as if they were factor loadings, I borrowed the fa.sort() function from the psych R package. Hopefully, this sorting make it easier to see the underlying pattern.

Obviously, these coefficients are not factor loadings, which are correlations between the observed and latent variables. You might want to think of them as if they were coefficients from a principal component analysis. What are these coefficients? You might wish to recall that we are factoring our data matrix into two parts: this coefficient matrix and what is called a basis matrix. The coefficient matrix enables us to name the latent variables by seeing the association between the observed and latent variables. The basis matrix includes a row for every respondent indicating the contribution of each latent variable to their top 3 rankings. I promise that all this will become clearer as we work through this example.

 Coffee Cake Muffin Pastry Bread cofcake 70 1 0 0 cornmuff 2 0 0 2 engmuff 0 38 0 4 bluemuff 2 36 5 0 cintoast 0 7 0 3 danpastry 1 0 100 0 jdonut 0 0 25 0 gdonut 8 0 20 0 cinbun 0 6 20 0 toastmarm 0 0 12 10 toast 0 0 2 0 butoast 0 3 0 51 hrolls 0 0 2 22 toastmarg 0 1 0 14 butoastj 2 0 7 10

These coefficients indicate the relative contribution of each food. The columns are named as one would name a factor or a principal component or any other latent variable. That is, we know what a danish is and a glazed or jelly donut, but we know nothing about the third column except that interest in these three breakfast foods seem to covary together. Pastry seemed like a good, although not especially creative, name. These column names seem to correspond to the different regions in the joint configuration plot derived from the complete rankings. In fact, I borrowed de Leeuw's cluster names from the top of his page 20.

And what about the 42 rows in the basis matrix? The nmf package relies on a heatmap to display the relationship between the individuals and the latent variables.

Interpretation is made easier by the clustering of the respondents along the left side of the heatmap. We are looking for blocks of solid color in each column, for example, the last 11 rows or the 4 rows just above the last 11 rows. The largest block falls toward the middle of the third column associated with pastries, and the first several rows tend to have their largest values in the first column. although most have membership in more than one column. The legend tells us that lighter yellows indicate the lowest association with the column and the darkest reds or browns identify the strongest connection. The dendrogram divides the 42 individuals into the same groupings if you cut the tree at 4 clusters.

The dendrogram also illustrates that some of the rows are combinations of more than one type. The whole, meaning the 42 individuals, can be separated into four "pure" types. A pure type is an individual whose basis vector contains one value very near one and the remaining values very near zero. Everyone is a combination of the pure types or latent variables. Some are all pure types, and some are mixtures of different types. The last 4 rows are a good example of a mixture of muffins and breads (columns 4 and 2).

Finally, I have not compared the location of the respondents on the configuration plot with their color spectrum in the heatmap. There is a correspondence, for example, #37 is near the breads on the plot and in the bread column on the heatmap. And we could continue with #19 into pastries and #33 eating muffins, but we will not since one does not expect complete agreement when the heatmap has collapsed the lower 80% of the rankings. We have our answer to the initial question raised in the title. We can learn a great deal about attraction using only the top rankings. However, we have lost any avoidance information contained in the complete rankings.

So, What Is Nonnegative Matrix Factorization?

I answered this question at the end of a previous post, and it might be helpful for you to review another example. I show in some detail the equation and how the coefficient matrix and the basis matrix combine to yield approximations of the observed data.

What do you want for breakfast? Is it something light and quick, or are you hungry and want something filling? We communicate in food types. A hotel might advertise that their price includes a continental breakfast. Continental breakfast is a food type. Bacon and eggs are not included. This is the structure shaping human behavior that nonnegative matrix factorization attempts to uncover. There were enough respondents who wanted only the foods from each of the four columns that we were able to extract four breakfast food types. These latent variables are additive so that a respondent can select according to their own individual proportions how much they want the foods from each column.

Nonnegative matrix factorization will succeed to the extent that preferences are organized as additive groupings of observed choices. I would argue that a good deal of consumption is structured by goals and that these latent variables reflect goal-derived categories. We observe the selections made by individuals and infer their motivation. Those inferences are the columns of our coefficient matrix, and the rows of the heatmap tell us how much each respondent relies on those inferred latent constructs when making their selections.

R code needed to recreate all the tables and plots:

library(smacof)data(breakfast)breakfastres <- smacofRect(breakfast)plot(res, plot.type = "confplot") partial_rank<-4-breakfastpartial_rank[partial_rank<1]<-0apply(breakfast, 2, table)apply(partial_rank, 2, table)partial_rank library(NMF)fit<-nmf(partial_rank, 4, "lee", nrun=10)h<-coef(fit)library(psych)fa.sort(t(round(h,3)))w<-basis(fit)wp<-w/apply(w,1,sum)fa.sort(round(wp,3))basismap(fit)

Created by Pretty R at inside-R.org