**ExploringDataBlog**, and kindly contributed to R-bloggers)

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly. An extremely useful idea in working with datasets like this one is that of *mixture distributions*, which describe random variables that are drawn from more than one parent population. I discuss mixture distributions in some detail in Chapter 10 of Exploring Data in Engineering, the Sciences, and Medicine (Section 10.6), and the objective of this post is to give a brief introduction to the basic ideas of mixture distributions, with some hints about how they can be useful in connection with problems like analyzing the rainfall data discussed last time. Subsequent posts will discuss these ideas in more detail, with pointers to specific *R* packages that are useful in applying these ideas to real data analysis problems.

*not*“mixture models,” which are the subject of considerable current interest in the statistics community. The distinction is important because these topics are very different: mixture distributions represent a useful way of describing heterogeneity in the distribution of a variable, whereas mixture models provide a foundation for incorporating both deterministic and random predictor variables in regression models.

*R*dataset

**faithful**). This data frame contains 272 observations of the durations of successive eruptions and the waiting time until the next eruption, both measured in minutes. The upper left plot above is the normal Q-Q plot for the duration values, computed using the

**qqPlot**procedure from the

**car**package that I have discussed previously, and the upper right plot is the corresponding normal Q-Q plot for the waiting times. Both of these Q-Q plots exhibit pronounced “kinks,” which I have noted previously often indicate the presence of a multimodal data distribution. The lower two plots are the corresponding nonparametric density estimates (computed using the

**density**procedure in

*R*with its default parameters). The point of these plots is that they give strong evidence that the distributions of both the durations and waiting times are bimodal.

*x*that we observe is modeled as a random variable that has some probability

*p*of being drawn from distribution

_{1}*D*, probability

_{1}*p*of being drawn from distribution

_{2}*D*, and so forth, with probability

_{2}*p*of being drawn from distribution

_{n}*D*, where

_{n}*n*is the number of components in our mixture distribution. The key assumption here is one of statistical independence between the process of randomly selecting the component distribution

*D*to be drawn and these distributions themselves. That is, we assume there is a random selection process that first generates the numbers 1 through

_{i}*n*with probabilities

*p*through

_{1}*p*. Then, once we have drawn some number

_{n}*j,*we turn to distribution

*D*and draw the random variable

_{j}*x*from this distribution. So long as the probabilities – or

*mixing percentages*–

*p*sum to 1, and all of the distributions

_{i}*D*are proper densities, the combination also defines a proper probability density function, which can be used as the basis for computing expectations, formulating maximum likelihood estimation problems, and so forth. The simplest case is that for

_{i}*n*= 2, which provides many different specific examples that have been found to be extremely useful in practice. Because it is the easiest to understand, this post will focus on this special case.

*p*when

_{2}= 1 – p_{1}*n = 2*, it simplifies the discussion to drop the subscripts, writing

*p*and

_{1}= p*p*. Similarly, let

_{2}= 1 – p*f(x)*denote the density associated with the distribution

*D*and

_{1}*g(x)*denote the density associated with the distribution

*D*. The overall probability density function

_{2}*p(x)*is then given by:

*p(x) = p f(x) + (1 – p)g(x)*

*f(x)*and

*g(x)*are Gaussian distributions. The figure below shows four specific examples.

*f(x)*and

*g(x)*as standard normal densities, for any choice of

*p*. I have included it here both because it represents what has been historically the most common distributional assumption, and because it represents a reference case in interpreting the other distributions considered here. The upper right plot corresponds to the

*contaminated normal outlier distribution*, widely adopted in the robust statistics literature. There, the idea is that measurement errors – traditionally modeled as zero-mean Gaussian random variables with some unknown standard deviation

*S*– mostly conform to this model, but some fraction (typically, 10% to 20%) of these measurement errors have larger variability. The traditional contaminated normal model defines

*p*as this contamination percentage and assumes a standard deviation of

*3S*for these measurements. The upper right plot above shows the density for this contaminated normal model with

*p = 0.15*(i.e., 15% contamination). Visually, this plot looks identical to the one on the upper left for the standard normal distribution: plotting them on common axes (as done in Fig. 10.17 of

*Exploring Data*) shows that these distributions are not identical, a point discussed further below on the basis of Q-Q plots.

*p = 0.5*), the first component

*f(x)*is the standard normal distribution as before, and the second component

*g(x)*is a Gaussian distribution with mean 3 and standard deviation 3. The first component contributes the sharper main peak centered at zero, while the second component contributes the broad “shoulder” seen in the right half of this plot. Finally, the lower right plot shows a mixture distribution with

*p = 0.40*where the first component is a Gaussian distribution with mean -2 and standard deviation 1, and the second component is a Gaussian distribution with mean +2 and standard deviation 1. The result is a bimodal distribution with the same general characteristics as the

*R*procedure listed below (the number 272 was chosen to make the sample size the same as that of the Old Faithful geyser data discussed above). As before, this Q-Q plot was generated using the procedure

**qqPlot**from the

**car**package. The heavy-tailed, non-Gaussian character of this distribution is evident from the fact that both the upper and lower points in this plot fall well outside the 95% confidence interval around the Gaussian reference line. This example illustrates the power of the Q-Q plot for distributional assessment: like the density plots shown above for the standard normal and contaminated normal distributions, nonparametric density plots (not shown) generated from 272 samples drawn from each of these data distributions are not markedly different.

**n**is the sample size,

**cpct**is the contamination percentage (expressed as a fraction, so

**cpct**= 0.15 corresponds to 15% contamination),

**mu1**and

**mu2**are the means of the component Gaussian distributions, and

**sig1**and

**sig2**are the corresponding standard deviations. The default values for

**cpct, mu1, mu2, sig1,**and

**sig2**are those for a typical contaminated normal outlier distribution. Finally, the parameter

**iseed**is the seed for the random number generator: specifying its value as a default means that the procedure will return the same set of pseudorandom numbers each time it is called; to obtain an independent set, simply specify a different value for

**iseed.**

**y0**corresponds to the first component distribution,

**y1**corresponds to the second, and

**flag**is a binomial random variable that determines which component distribution is selected for the final random number: when

**flag**= 1 (an event with probability

**cpct**), the contaminated value

**y1**is returned, and when

**flag**= 0 (an event with probability 1 –

**cpct**), the non-contaminated value

**y0**is returned.

*n*distinct distributions – can be applied to a very wide range of distributions, leading to an extremely wide range of data models. The examples described above give a very preliminary indication of the power of Gaussian mixture distributions as approximations to the distribution of heterogeneous phenomena that arise from multiple sources. Practical applications of more general (i.e., not necessarily Gaussian) mixture distributions include modeling particle sizes generated by multiple mechanisms (e.g., accretion of large particles from smaller ones and fragmentation of larger particles to form smaller ones, possibly due to differences in material characteristics), pore size distributions in rocks, polymer chain length or chain branching distributions, and many others. More generally, these ideas are also applicable to discrete distributions, leading to ideas like the zero-inflated Poisson and negative binomial distributions that I will discuss in my next post, or to combinations of continuous distributions with degenerate distributions (e.g., concentrated at zero), leading to ideas like zero-augmented continuous distributions that may be appropriate in applications like the analysis of the rainfall data I discussed last time. One of the advantages of

*R*is that it provides a wide variety of support for various useful implementations of these ideas.

**leave a comment**for the author, please follow the link and comment on his blog:

**ExploringDataBlog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...