(This article was first published on

**SAS and R**, and kindly contributed to R-bloggers)Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.

More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogenous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting is similar to an ordinary regression model where every observed covariate is interacted with the (unobserved) group membership variable.

**SAS**

SAS 9.3 includes the "experimental"

`FMM`procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.

As hinted at above, the generality implies by the models is fairly vast, and the FMM procedure includes a lot of generality. The most obvious limitation is that it requires independent observations.

We'll demonstrate with a simulated data set. We create a variable

`x`that predicts both group membership and an outcome

`y`with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept=-2 and slope=1. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data is plotted above using a default plot from

`proc glm`using the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS are shown, with R code, below-- they appear more or less normal.

data fmmtest;

do i = 1 to 5000;

x = normal(0);

group = (exp(-1 + 2*x)/ (1 + exp(-1 + 2*x)) ) gt uniform(0);

y = (group * 3) + ((1 + group/5) * x) + normal(0) * sqrt(1);

output;

end;

run;

title "Fixed k = 2";

proc fmm data = fmmtest;

model y = x / k=2 equate=scale;

probmodel x;

run;

In the

`model`statement, the option

`k`defines how many groups (or "components") are to be included. There are also options

`kmin`and

`kmax`which will fit models with various numbers of components and report results of one of them based on some model criterion. The

`equate`option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The

`probmodel`statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.

We show only the parameter estimates.

Standard

Component Effect Estimate Error z Value Pr > |z|

1 Intercept 2.9856 0.04856 61.48 <.0001

1 x 1.2060 0.03927 30.71 <.0001

2 Intercept -0.01978 0.02512 -0.79 0.4309

2 x 0.9889 0.02453 40.32 <.0001

1 Variance 1.0279 0.02531

2 Variance 1.0279 0.02531

Parameter Estimates for Mixing Probabilities

Standard

Effect Estimate Error z Value Pr > |z|

Intercept -1.0018 0.05665 -17.68 <.0001

x 1.9403 0.07413 26.17 <.0001

The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.

**R**

The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.

First, we'll make the data and take a look at the simple model diagnostics.

> x = rnorm(5000)

> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))

> group = ifelse(probgroup1 > runif(5000),1,0)

> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);

> resids = residuals(lm(y~x))

> hist(resids, freq = FALSE)

> dvals = seq(from=min(resids), to=max(resids),length=100)

> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))

The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.

Ron Pearson recently showed a simple example using the

`mixtools`package. We tried to replicate the example above using

`mixtools`but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:

> library(mixtools)

> mixout.mt = regmixEM(y,x,k=2,arbvar = FALSE)> summary(mixout.mt)

summary of regmixEM object:

comp 1 comp 2

lambda 0.5275915 0.472408

sigma 1.0553877 1.055388

beta1 -0.0035015 2.237736

beta2 1.5068146 2.004008

which is pretty awful.

Fortunately, the

`flexmix`package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!

library(flexmix)

mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),

concomitant=FLXPmultinom(~ x))

The

`flexmix`function uses a variety of special objects that are created by other functions the package provides. Here we use the

`FLXMRglmfix`function to force equal variances across the components and the

`FLXPmultinom`function to define the logistic regression on the covariate

`x`. The results are:

> parameters(mixout5)

Comp.1 Comp.2

coef.(Intercept) 3.000412 -0.008360171

coef.x 1.190454 1.047660263

sigma 0.972306 0.972306015

> parameters(mixout5, which = "concomitant")

1 2

(Intercept) 0 0.9605367

x 0 -1.9109429

which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.

To

**leave a comment**for the author, please follow the link and comment on his blog:**SAS and R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...