# Warning: Clusters May Appear More Separated in Textbooks than in Practice

This article was first published on **Engaging Market Research** and kindly contributed to R-bloggers.


Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.

I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like “Shotguns and Pickups” and “Upper Crust”? Yet, are “Big City Blues” and “Low-Rise Living” really separate clusters or simply variations on a common set of dwelling constraints?

Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences are not clumping or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.

Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.
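As a rough illustration of "a little bit A but more B-like", the sketch below fits a two-component mixture to the `faithful` data and inspects the soft memberships rather than the hard assignments. It assumes the mclust package is installed; the object name `fit` and the 0.95 cutoff are my own illustrative choices, not part of the original analysis.

```r
# Minimal sketch, assuming mclust is installed; `fit` is an illustrative name.
library(mclust)

data(faithful)
fit <- Mclust(faithful, G = 2)

# Posterior probability of membership in each component, one row per eruption
head(round(fit$z, 3))

# Uncertainty is 1 minus the largest posterior probability in each row;
# larger values flag points that sit between the two clusters
summary(fit$uncertainty)

# Share of observations whose best cluster has posterior probability below 0.95
# (the 0.95 cutoff is an arbitrary choice for illustration)
mean(apply(fit$z, 1, max) < 0.95)
```

Even when a plot suggests clean separation, the rows of `fit$z` show how confidently each point belongs to its assigned cluster.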

**Product Categories Are Structured around the Good, the Better, and the Best**

**Number of Segments = Number Yielding Value to Product Management**

**R code for all figures and analysis**

    # attach faithful data set
    data(faithful)
    plot(faithful, pch="+")

    # run mclust on faithful data
    require(mclust)
    faithfulMclust<-Mclust(faithful, G=2)
    summary(faithfulMclust, parameters=TRUE)
    plot(faithfulMclust)

    # create 3 segment data set
    require(MASS)
    sigma <- matrix(c(1.0,.6,.6,1.0),2,2)
    mean1<-c(-1,-1)
    mean2<-c(0,0)
    mean3<-c(1,1)
    set.seed(3202014)
    mydata1<-mvrnorm(n=100, mean1, sigma)
    mydata2<-mvrnorm(n=100, mean2, sigma)
    mydata3<-mvrnorm(n=100, mean3, sigma)
    mydata<-rbind(mydata1,mydata2,mydata3)
    colnames(mydata)<-c("Desired Level of Quality", "Willingness to Pay")
    plot(mydata, pch="+")

    # run Mclust with 3 segments
    mydataClust<-Mclust(mydata, G=3)
    summary(mydataClust, parameters=TRUE)
    plot(mydataClust)

    # let Mclust decide on number of segments
    mydataClust<-Mclust(mydata)
    summary(mydataClust, parameters=TRUE)
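When `Mclust()` is left to choose the number of segments, it compares candidate models by BIC. The sketch below makes that comparison visible for the simulated three-segment data. It regenerates the data so that it runs on its own; the object names `bic` and `auto` and the candidate range `G = 1:5` are my own assumptions, not part of the original analysis.

```r
# Minimal sketch, assuming MASS and mclust are installed.
# `bic`, `auto`, and the range G = 1:5 are illustrative choices.
library(MASS)
library(mclust)

# Same simulation settings as the three-segment data set above
sigma <- matrix(c(1.0, 0.6, 0.6, 1.0), 2, 2)
set.seed(3202014)
mydata1 <- mvrnorm(n = 100, c(-1, -1), sigma)
mydata2 <- mvrnorm(n = 100, c(0, 0), sigma)
mydata3 <- mvrnorm(n = 100, c(1, 1), sigma)
mydata <- rbind(mydata1, mydata2, mydata3)

# BIC for every covariance structure and every number of components from 1 to 5
bic <- mclustBIC(mydata, G = 1:5)
summary(bic)  # lists the best-scoring models

# Let Mclust pick among the same candidates, then inspect its choice
auto <- Mclust(mydata, G = 1:5)
auto$G          # number of components selected
auto$modelName  # covariance structure selected
```

Comparing `auto$G` with the three segments used to simulate the data is one way to see whether overlapping segments are recoverable as distinct clusters.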
