**Engaging Market Research**, and kindly contributed to R-bloggers)

Cluster are groupings that have no external label. We start with entities described by a set of measurements but no rule for sorting them by type. Mixture modeling makes this point explicit with its equation showing how each measurement is an independent draw from one of K possible distributions.

Each row of our data matrix contains the measurements for a different object, represented by the vector x in the above equation. If all the rows came from a single normal distribution, then we would not need the subscript K. However, we have a mixture of populations so that measurements come from one of the K groups with probability given by the Greek letter *pi*. If we knew K, then we would know the mean *mu* and covariance matrix *sigma* that describe the Gaussian distribution generating our observation.

The above graphical model attempts to illustrate the entire process using plate notation. That is, the K and the N in the lower right corner of the two boxes indicate that we have chosen not to show all of the K or N different boxes, one for each group and one for each observation, respectively. The arrows represent directed effects so that group membership in the box with [K] is outside the measurement process. With K known, the corresponding mean and variance act as input to generate one of the i = 1,…,N observations.

This graphical model describes a production process that may be responsible for our data matrix. We must decide on a value for K (the number of clusters) and learn the probabilities for each of the K groups (*pi* is a K-valued vector). But we are not done estimating parameters. Each of the K groups has a mean vector and a variance-covariance matrix that must be estimated, and both depend on the number of columns (p) in the data matrix: (1) Kp means and (2) Kp(p+1)/2 variances and covariances. Perhaps we should be concerned that the number of parameters increases so rapidly with the number of variables p.

A commonly used example will help us understand the equation and the graphical model. The Old Faithful dataset included with the R package mclust illustrates that eruptions from the geyser can come from one of two sources: the brief eruptions in red with shorter waiting times and the extended eruptions in blue with longer waiting periods. There are two possible sources (K=2), and each source generates a bivariate normal distribution of eruption duration and waiting times (N=number of combined red squares and blue dots). Finally, our value of *pi* can be calculated by comparing the number of red and blue points in the figure.

**Scalability Issues in High Dimensions**

**leave a comment**for the author, please follow the link and comment on their blog:

**Engaging Market Research**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...