Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.
I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like “Shotguns and Pickups” and “Upper Crust”? Yet, are “Big City Blues” and “Low-Rise Living” really separate clusters or simply variations on a common set of dwelling constraints?
Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences are not clumping or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.
Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.
This is the Old Faithful geyser data from the “datasets” R package showing the waiting time in minutes between successive eruptions on the y-axis and the duration of the eruptions along the x-axis. It is worth your time to get familiar with Old Faithful because it is one of those datasets that gets analyzed over and over again using many different programs. There seems to be two concentrations of points: shorter eruption that occurs more quickly and longer eruptions that have a longer waiting period. If we told the Mclust function from the mclust package that the scatterplot contains observations from G=2 groups, the function would produce a classification plot that looked something like this:
The red and the blue with their respective ellipses are the two normal densities that are getting mixed. It is such a straightforward example of finite mixture or latent class models (as these models are also called by analysts in other fields of research). If we discovered that there were two pools feeding the geyser, we could write a compelling narrative tying all the data together.
The mclust vignette or manual is comprehensive but not overly difficult. If you prefer a lecture, there is no better introduction to finite mixture than the MathematicalMonk YouTube video. The key to understanding finite mixture models is recognizing that the underlying latent variable responsible for the observed data is categorical, a latent class which we do not observe, but which explains the location and shape of the data points. Do you have a cold or the flu? Without medical tests, all we can observe are your symptoms. If we filled a room with coughing, sneezing, achy and feverish people, we would find a mixture of cold and flu with differing numbers of each type.
This appears straightforward, but for the general case, how does decide how many categories, in what proportions, and with what means and covariance matrices? That is, those two ellipses in the above figure are drawn using a vector with means for the x- and y-axes plus a 2×2 covariance matrix. The means move the ellipse over the space, and the covariance matrix changes the shape and orientation of the ellipse. A good summary of the possible forms is given in Table 1 of the mlcust vignette.
Unfortunately, the mathematics of the EM algorithm used to solve this problem gets complicated quickly. Fortunately, Chris Bishop provides an intuitive introduction in a 2004 video lecture. Starting at 44:14 of Part 4, you will find a step-by-step description of how the EM algorithm works with the Old Faithful data. Moreover, in Chapter 9 of his book, Bishop cleverly compares the workings of the EM and k-means algorithms, leaving us with a better understanding of both techniques.
If only my data showed such clear discontinuity and could be tied to such a convincing narrative.
Product Categories Are Structured around the Good, the Better, and the Best
Almost every product category offers a range of alternatives that can be described as good, better, and best. Such offerings reflect the trade-offs that customers are willing to make between the quality of the products they want and the amount that they are ready to spend. High-end customers demand the best and will pay for it. On the other end, one finds customers with fewer needs and smaller budgets accepting less and paying less. Clearly, we have heterogeneity, but are those differences continuous or discrete? Can we tell by looking at the data?
Unlike the Old Faithful data with well-separated grouping of data points, product quality-price trade-offs look more like the following plot of 300 consumers indicating how much they are willing to spend and what they expect to get for their money (i.e., product quality is a composite index combining desired features and services).
There is a strong positive relationship between demand for quality and willingness to pay, so a product manager might well decide that there was opportunity for at least a high-end and a low-end option. However, there is no natural breaks in the scatterplot. Thus, if this data cloud is a mixture of distinct distributions, then these distributions must be overlapping.
Another example might help. As shown by John Cook, the distribution of heights among adults is a mixture two overlapping normal distribution, one for men and another for women. Yet, as you can observe from Cook’s plots, the mixture of men’s and women’s height does not appear bimodal because the separation between the two distributions is not large enough. If you follow the links in Cook’s post, eventually you will find the paper “Is Human Height Bimodal?”, which clearly demonstrates that many mixtures of distributions appear to be homogeneous. We simply cannot tell that they are mixture by looking just at the shape of the distribution for the combined data. The Old Faithful data with its well-separated bimodal curve provides a nice contrast, especially when we focus only on waiting time as a single dimension (Fitting Mixture Models with the R Package mixtools).
Perhaps then, segmentation does not require gaps in the data cloud. As Wedel and Kamakura note, “Segments are not homogeneous groupings of customers naturally occurring in the marketplace, but are determined by the marketing manager’s strategic view of the market.” That is, one could look at the above scatterplot, see the presence of a strong first principal component from the closest of the data points to the principal axis of the ellipse and argue that customer heterogeneity is a single continuous dimension running from the low- to the high-end of the product spectrum. Or, one could look at the same scatterplot and see three overlapping segments seeking basic, value and premium products (good, better, best). Let’s run mclust and learn if we can find our three segments.
When instructed to look for three clusters, the Mclust function returned the above result. The 300 observed points are represented as a mixture of three distributions falling along the principal axis of the larger ellipse formed by all the observations. However, if I had not specified three clusters and asked Mclust to use its default BIC criterion to select the number of segments, I would have been told that there was no compelling evidence for more than one homogeneous group. Without any prior specification Mclust would have returned a single homogeneous distribution, although as you can see from the R code below, my 300 observations were a mixture of three equal size distributions falling along the principal axis and separated by one standard deviation.
Number of Segments = Number Yielding Value to Product Management
Market segmentation lies somewhere between mass marketing and individual customization. When mass marketing fails because customers have different preferences or needs and customization is too costly or difficult, the compromise is segmentation. We do not need “natural” grouping, but just enough coherence for customers to be satisfied by the same offering. Feet come in many shapes and sizes. The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager’s perspective.
My warning still holds. Reification can be dangerous. These segments are meaningful only within the context of the marketing problem created by trying to satisfy everyone with products and services that yield maximum profit. Some segmentations may return clusters that are well-separated and represent groups with qualitatively different needs and purchase processes. Many of these are obvious and define different markets. If you don’t have a dog, you don’t buy dog food. However, when segmentation seeks to identify those who feel that their dog is a member of the family, we will find overlapping clusters that we treat differently not because we have revealed the true underlying typology, but because it is in our interest. Don’t be fooled into believing that our creative segment names reveal the true workings of the consumer mind.
Finally, what is true for marketing segmentation is true for all of cluster analysis. “Clustering: Science or Art?” (a 2009 NIPS workshop) raises many of these same issues for cluster analysis in general. Videos of this workshop are available at Videolectures. Unlike supervised learning with its clear criterion for success and failure, clustering depends on users of the findings to tell us if the solution is good or bad, helpful or not. On the one hand, this seems to make everything more difficult. On the other hand, it frees us to be more open to alternative methods for describing heterogeneity as it is now and how it evolves over time.
Yet, even these three combinations cannot adequately account for all the forms of diversity we find in consumer data. Innovation might generate a structure appearing more like long arrays or streams of points seemingly pulled toward the periphery by an archetypal ideal or aspirational goal. And if the coordinate space of k-means and mixture models becomes too limiting, we can replace it with pairwise dissimilarity and graphical clustering techniques, such as affinity propagation or spectral clustering. Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.