Finite mixture models (FMMs) can be used in settings where some unmeasured classification separates the observed data into groups with different exposure/outcome relationships. One familiar example of this is a zero-inflated model, where some observations come from a degenerate distribution with all mass at 0. In that case the exposure/outcome relationship is less interesting in the degenerate distribution group, but there would be considerable interest in the estimated probability of group membership. Another possibly familiar setting is the estimation of a continuous density as a mixture of normal distributions.
More generally, there could be several groups, with "concomitant" covariates predicting group membership. Each group might have different sets of predictors and outcomes from different distribution families. On the other hand, in a "homogenous" mixture setting, all groups have the same distributional form, but with different parameter values. If the covariates in the model are the same, this setting is similar to an ordinary regression model where every observed covariate is interacted with the (unobserved) group membership variable.
SAS 9.3 includes the "experimental" FMM procedure to fit these models. We're unsure what criteria SAS uses to decide when a procedure is experimental and when it becomes "production", but experimental procedures in SAS/STAT usually do become production eventually.
As hinted at above, the generality implies by the models is fairly vast, and the FMM procedure includes a lot of generality. The most obvious limitation is that it requires independent observations.
We'll demonstrate with a simulated data set. We create a variable x that predicts both group membership and an outcome y with different linear regression parameters depending on group. The mixing probability follows a logistic regression with intercept=-2 and slope=1. The intercept and slope for the outcome are (0, 1) and (3, 1.2) for groups 0 and 1, respectively. The resulting data is plotted above using a default plot from proc glm using the actual group membership. A histogram and nearest normal density for the residuals from a simple OLS are shown, with R code, below-- they appear more or less normal.
do i = 1 to 5000;
x = normal(0);
group = (exp(-1 + 2*x)/ (1 + exp(-1 + 2*x)) ) gt uniform(0);
y = (group * 3) + ((1 + group/5) * x) + normal(0) * sqrt(1);
title "Fixed k = 2";
proc fmm data = fmmtest;
model y = x / k=2 equate=scale;
In the model statement, the option k defines how many groups (or "components") are to be included. There are also options kmin and kmax which will fit models with various numbers of components and report results of one of them based on some model criterion. The equate option allows the user to force some elements of the component distributions to be equal. Here we force the residuals to be equal, since that's the model that generated our data. The probmodel statement is how the concomitant variables enter the model. The default settings model a logistic (or generalized logit) regression using the listed covariates.
We show only the parameter estimates.
Component Effect Estimate Error z Value Pr > |z|
1 Intercept 2.9856 0.04856 61.48 <.0001
1 x 1.2060 0.03927 30.71 <.0001
2 Intercept -0.01978 0.02512 -0.79 0.4309
2 x 0.9889 0.02453 40.32 <.0001
1 Variance 1.0279 0.02531
2 Variance 1.0279 0.02531
Parameter Estimates for Mixing Probabilities
Effect Estimate Error z Value Pr > |z|
Intercept -1.0018 0.05665 -17.68 <.0001
x 1.9403 0.07413 26.17 <.0001
The recovery of the true parameters is quite good. This may be unsurprising given the sample size and the distinctness of the components, but we think it's pretty impressive.
The CRAN Task View on Cluster Analysis & Finite Mixture Models provides an overview of packages available for R.
First, we'll make the data and take a look at the simple model diagnostics.
> x = rnorm(5000)
> probgroup1 = exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
> group = ifelse(probgroup1 > runif(5000),1,0)
> y = (group * 3) + ((1 + group/5) * x) + rnorm(5000);
> resids = residuals(lm(y~x))
> hist(resids, freq = FALSE)
> dvals = seq(from=min(resids), to=max(resids),length=100)
> lines(dvals, dnorm(dvals, mean(resids), sd(resids)))
The histogram of the residuals from the simple OLS model is shown below. They appear reasonably normal.
Ron Pearson recently showed a simple example using the mixtools package. We tried to replicate the example above using mixtools but were unable to find functionality for concomitant variables. If you fit the closest model, assuming the mixing probabilities depend on no covariates, this is what happens:
> mixout.mt = regmixEM(y,x,k=2,arbvar = FALSE)> summary(mixout.mt)
summary of regmixEM object:
comp 1 comp 2
lambda 0.5275915 0.472408
sigma 1.0553877 1.055388
beta1 -0.0035015 2.237736
beta2 1.5068146 2.004008
which is pretty awful.
Fortunately, the flexmix package, while a bit more difficult to use, offers the needed flexibility. As a side note, the generality of the package is pretty awe-inspiring. Nice work!
mixout.fm=flexmix(y~x, k=2, model=FLXMRglmfix(y~x, varFix=TRUE),
The flexmix function uses a variety of special objects that are created by other functions the package provides. Here we use the FLXMRglmfix function to force equal variances across the components and the FLXPmultinom function to define the logistic regression on the covariate x. The results are:
coef.(Intercept) 3.000412 -0.008360171
coef.x 1.190454 1.047660263
sigma 0.972306 0.972306015
> parameters(mixout5, which = "concomitant")
(Intercept) 0 0.9605367
x 0 -1.9109429
which also reproduces reality with impressive accuracy. Note that the concomitant variable model apparently predicts membership in the second component, so the signs are reversed from the generating model.