What is Cluster Analysis? A Projective Test

[This article was first published on Engaging Market Research, and kindly contributed to R-bloggers.]

Supposedly, projective tests (e.g., the inkblots of psychoanalysis) contain sufficient ambiguity that “what you see” reveals some aspect of your thinking that has escaped your awareness. Although the following will provide no insight into your neurotic thoughts or feelings, it might help separate two different ways of performing and interpreting cluster analysis.

A light pollution map of the United States, a picture at night from a satellite orbiting the earth, is shown below.



Which of the following two representations more closely matches the way you think of this map?

Do you consider population density to be the mixture of distributions represented by the red spikes in the first option?




Or perhaps this mixture model is too passive for you, so that you prefer the air traffic representation in the second option showing separate airplane locations at some point in time.



The mclust package in R provides the more homeostatic first representation using density functions. Because mclust adjusts the shape of each normal distribution in the mixture, one can model the Northeast corridor from Boston to Philadelphia with a single elongated cluster. Moreover, the documentation enables you to perform the analysis without excessive pain and to understand how finite mixture models work. If you need a video lecture on Gaussian mixtures, MathematicalMonk (aka Jeff Miller) on YouTube is the place to start.
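
To make the "adjusts the shape" point concrete, here is a minimal sketch in plain Python (not R, and not the mclust API) of the EM algorithm for a two-component bivariate Gaussian mixture with unconstrained covariance matrices. Everything in it is an assumption made for illustration: the function names `gauss2d_pdf` and `fit_mixture`, the synthetic "corridor" and "city" points, and the hand-picked starting means. mclust itself compares several covariance parameterizations and initializes differently, so treat this only as a picture of the underlying idea.

```python
import math
import random


def gauss2d_pdf(p, mean, cov):
    """Density of a bivariate normal with full 2x2 covariance at point p."""
    (a, b), (_, d) = cov  # cov = [[a, b], [b, d]]
    det = a * d - b * b
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


def fit_mixture(points, start_means, n_iter=25):
    """EM for a Gaussian mixture; each component gets its own full covariance."""
    k = len(start_means)
    means = list(start_means)
    weights = [1.0 / k] * k
    covs = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(k)]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for p in points:
            dens = [weights[j] * gauss2d_pdf(p, means[j], covs[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted proportions, means, and covariances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            # A tiny ridge on the diagonal keeps the covariance invertible.
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + 1e-6
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + 1e-6
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            weights[j] = nj / len(points)
            means[j] = (mx, my)
            covs[j] = [[sxx, sxy], [sxy, syy]]
    return weights, means, covs


def correlation(cov):
    """Correlation implied by a 2x2 covariance matrix."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])


# Synthetic data (an assumption, not real population data):
# a long diagonal "corridor" plus one compact "city" off to the side.
random.seed(1)
corridor = []
for _ in range(60):
    t = random.uniform(0, 10)
    corridor.append((t + random.gauss(0, 0.3), t + random.gauss(0, 0.3)))
city = [(random.gauss(10, 0.5), random.gauss(0, 0.5)) for _ in range(40)]
points = corridor + city

# Hand-picked rough starting means, one near each group.
weights, means, covs = fit_mixture(points, [(5.0, 5.0), (10.0, 0.0)])

for w, m, c in zip(weights, means, covs):
    print(f"weight={w:.2f} mean=({m[0]:.1f}, {m[1]:.1f}) corr={correlation(c):.2f}")
```

If the fit behaves as intended, the first component should report a correlation close to 1 (a narrow ellipse stretched along the corridor), while the second stays roughly circular. That is the sense in which one Gaussian with a flexible covariance can stand in for an entire string of adjacent cities.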

On the other hand, if airplanes can be considered as messages passed between nodes with greater concentrations (i.e., cities with airports), then the R package performing affinity propagation, apcluster, offers the more “self-organizing” model shown in the second option, with many possible ways of defining similarity or affinity. Ease of use should not be a problem given a webinar, a comprehensive manual, and a link to the original Science article. However, the message propagation algorithm requires some work to comprehend in detail. Fortunately, one can run the analysis, interpret the output, and know enough to avoid serious mistakes without mastering all the computational intricacies.
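
For readers who want a feel for the message passing, the following is a minimal sketch in plain Python (not R, and not the apcluster API) of the responsibility and availability updates described in the affinity propagation literature. The function name `affinity_propagation`, the choice of negative squared Euclidean distance as the similarity, the median preference, the damping value, and the three synthetic "cities" are all assumptions chosen for illustration.

```python
import random
from statistics import median


def affinity_propagation(points, preference=None, damping=0.5, n_iter=200):
    """Return (exemplar indices, label per point) via damped message passing."""
    n = len(points)
    # Similarity: negative squared Euclidean distance between points.
    S = [[-((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) for q in points] for p in points]
    if preference is None:
        # Self-similarity controls how many exemplars emerge; use the median here.
        preference = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = preference

    R = [[0.0] * n for _ in range(n)]  # responsibilities: point i -> candidate k
    A = [[0.0] * n for _ in range(n)]  # availabilities: candidate k -> point i

    for _ in range(n_iter):
        # Responsibility: how much better k suits i than i's best alternative.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            top = scores[best]
            second = max(s for k, s in enumerate(scores) if k != best)
            for k in range(n):
                rival = second if k == best else top
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how much support candidate k has gathered from others.
        for k in range(n):
            support = sum(max(0.0, R[i][k]) for i in range(n) if i != k)
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - max(0.0, R[i][k]))
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    # A point is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Synthetic data (an assumption): three tight, well-separated "cities".
random.seed(7)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
          for cx, cy in centers for _ in range(6)]

exemplars, labels = affinity_propagation(points)
print("exemplars:", exemplars)
print("labels:", labels)
```

Note that the number of clusters is never specified. It emerges from the preference value, and each cluster is represented by an actual data point (its exemplar) rather than by a fitted density, which is the main contrast with the mixture model view.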

And the true representation is? As a marketer, I see it as a dynamic process with concentrations supported by the seaports, rivers, railroad tracks, roads, and airports that served commerce over time. Population clusters continually evolve (e.g., imagine Las Vegas without air travel). They are not natural kinds revealed by carving nature at its joints. Diversity comes in many shapes and forms, each requiring its own model with its unique assumptions concerning the underlying structures. More importantly, cluster analysis serves many different purposes, with each setting its own criteria. Haven’t we learned that one size does not fit all?


