Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### A Problem

A major problem in secondary data analysis is that you didn't get to decide what data was collected. Lets say you were interested in how many times a student has read the Twilight books). Specifically, you want to know how effective the ads for the movies and books are. You come up with a model that says $$n_{T} = f(n_{VB}, \text{ads}, \text{gender}, \epsilon)$$ where $$n_{T}$$ is the number of times they read Twilight books and $$n_{VB}$$ is the number of other vampire books they have read in the past year, $$\text{ads}$$ is the number of ads they have seen for the Twilight books or movies and finally gender as Twilight seems to have a differential appeal on the basis of gender. You run off to your favorite data repo and find the prefect data set to answer your question. Someone sat down with a 1,000 high schoolers and asked them about their reading habits as well as how many times they laughed at Twilight trailers. You expect males and females to have difference preferences and therefore respond differently (e.g., different beta coefficients from a linear model) to the ads and the number of other vampire books read in the last year. Instead of creating a model: $n_{T} = \beta_0 + \beta_1 n_{VB} + \beta_2 \text{ads} + \beta_3 \text{gender} + \epsilon$ you decide to use some interactions, getting at a regression model something like: $n_{T} = (\beta_0 + \beta_4 \text{gender}) + (\beta_1 + \beta_5 \text{gender}) n_{VB} + (\beta_2 + \beta_6 \text{gender}) \text{ads} + \beta_3 \text{gender} + \epsilon$ It is a lovely model, it makes sense and neatly answers your question of interest. You fire up R (or your stats package of choice) only to realize when exploring the data that they didn't report if the student was male or female.

Now this is a problem. You already have theorize that gender is part of the true data generating process. To ignore it violates the assumptions of most forms of regression — your error term is now a $$f(\text{gender}, \epsilon)$$ and if gender is part of the data generating process, than Y and your error term are not independent. You can do a regression still but the wonderful properties may not hold. You need a way to account for the observed heterogeneity (e.g., gender of the study subjects).

You read up on the subject matter and discover that, on average, girls read more vampire books and that the ads for Twilight were published in venues with a large teenage female demographic (maybe?). Could you use this information to potentially “discover” the unobserved characteristic using the data you have?

### Expectation–Maximization (EM) Algorithm

EM is an an iterative unsupervised clustering method. To grossly oversimplify, EM takes a set of data ($$\bf X$$) and attempts to find the number of clusters represented in $$\bf X$$ and finds clusters in the data. It finds clusters by first assigning each point a probability of belonging to a each cluster and then re-estimating the probability distribution for each cluster. This process continues until the error has converged at which time the points are all assigned to a given cluster with some degree of uncertainty. The process repeats for various counts of clusters. The quality of each model is accessed using something like the Bayesian information criterion and the model with the number of clusters that maximizes the quality of the model is returned.

We are going to use this with some simulated data to see what risks we run if we ignore the unobserved variables and also how well we can recover that missing information with EM.

### Data Generation and Clustering

Lets put together a little simulation in R. Sticking to the Twilight example, lets have two observed continuous variables ($$x_1$$ and $$x_2$$), one unobserved binary variable ($$z$$) and differences between the two groups along the unobserved variable. Let $$x_1$$ be iid and $x_1 \sim \begin{cases} N(10, 9) &\mbox{if } z = 0 \\ N(16, 9) &\mbox{if } z = 1 \end{cases}$ For simplicity, let $$x_2$$ also iid and distributed in the same fashion as $$x_1$$. We end up having two populations, each with similar variances but unequal means. Lets generate these data in R:

nSample <- 2000  # this is equal to the size of each class (0.5*total)

# x1a and x2a are features x1 and x2 for z=0 (x11 just seemed a tad much)
x1a <- rnorm(nSample, 10, 3)
x2a <- rnorm(nSample, 10, 3)
# x1b and x2b are features x1 and x2 for z=1
x1b <- rnorm(nSample, 16, 3)
x2b <- rnorm(nSample, 16, 3)


One of the best things to do when doing any sort of statistics is to look at the data. I prefer kernel density plots to histograms, so lets take a look at the density plots.

library(ggplot2)
# put the data into a dataframe
clusterData <- data.frame(x1 = c(x1a, x1b), x2 = c(x2a, x2b), class = c(rep("z = 0",
nSample), rep("z = 1", nSample)))
clusterData.gg <- ggplot(clusterData, aes(x1 = x1, x2 = x2, type = class))

# plot the combined density in grey and then overlay with colored density
# for each class
clusterData.gg + stat_density(aes(x = x1), alpha = 0.5) + geom_density(aes(x = x1,
color = class, fill = class), alpha = 0.5) + theme(legend.position = "none") +
scale_x_continuous("Value of x1") + scale_y_continuous("Density") clusterData.gg + stat_density(aes(x = x2), alpha = 0.5) + geom_density(aes(x = x2,
color = class, fill = class), alpha = 0.5) + theme(legend.position = "none") +
scale_x_continuous("Value of x2") + scale_y_continuous("Density") The tails are pretty fat but as far as empirical data goes, this is pretty well behaved looking overall. Looking at the composite density, it would be hard to tell that there are actually two different densities represented in the curve. However, our theory/knowledge of the domain says that at least 2 classes exist.

The easiest way I know to do EM in R is to use the mclust package. Lets use that and the data in $$x_1$$ and $$x_2$$ to see what we find:

library(mclust)

## Package 'mclust' version 4.0

# put the data into a matrix just to make using Mclust a bit easier
clusterMatrix <- matrix(NA, nrow = nSample * 2, ncol = 2)
clusterMatrix[, 1] <- c(x1a, x1b)
clusterMatrix[, 2] <- c(x2a, x2b)
# fits the EM model to the data
model <- Mclust(clusterMatrix)


And that is all it takes. Lets see some of the results

summary(model)

## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EII (spherical, equal volume) model with 2 components:
##
##  log.likelihood    n df    BIC
##          -22111 4000  6 -44271
##
## Clustering table:
##    1    2
## 2052 1948


No surprise there, we expected and we found 2 clusters. Lets take a look at those clusters now.

clusterPlots <- data.frame(x1 = clusterData$x1, x2 = clusterData$x2, cluster = factor(model$classification, levels = c(1, 2), labels = c("z = 0", "z = 1")), uncertainity = model$uncertainty,
logUncertainity = log(model$uncertainty)) clusterPlots.gg <- ggplot(clusterPlots) clusterPlots.gg + geom_point(aes(x = x1, y = x2, color = cluster)) The clusters as we defined them were discovered and are centered more or less where we would expect. We can see that they do sit on top of each other a bit in the middle. One of the nice things about EM is it quantifies how uncertain it is in a points class. clusterPlots.gg + geom_point(aes(x = x1, y = x2, color = logUncertainity)) That is somewhat helpful, you can see the general trend (points in the middle have higher $$log(\text{uncerrtainty})$$ values and there are more uncertain) but it might be easier to tell if we split it into groups. clusterPlots$uncertainGroups <- ifelse(clusterPlots$uncertainity <= 0.2, "Highly certain", ifelse(clusterPlots$uncertainity <= 0.4, "Moderately certain", ifelse(clusterPlots$uncertainity <= 0.5, "Weakly certain", ifelse(clusterPlots$uncertainity <= 0.6, "Weakly uncertain",
ifelse(clusterPlots$uncertainity <= 0.8, "Moderately uncertain", "Extremely uncertain"))))) clusterPlots.gg <- ggplot(clusterPlots) clusterPlots.gg + geom_point(aes(x = x1, y = x2, color = uncertainGroups)) ### Application to Regression In the Twilight example, we were using EM to account for some unobserved heterogeneity. Now that we have $$\bf X$$, we might as well simulate some regression models and see how this does. To start with, lets define the data generating process as $y_i = \beta_0 + \beta_{1, z}x_{1, i} + \beta_{2, z}x_{2, i} + \epsilon_i$. For simplicity, we are assuming that the error distribution and the intercept are the same between the two groups. It would not be difficult to allow the intercept to vary with group membership by the inclusion of a dummy variable for group but there is enough going on already with group specific values for $$\beta_1$$ and $$\beta_2$$. Lets generate some data for $$y$$, $$\epsilon$$ and the$\beta$s. # beta estimates for z = 1 are 0.3 b1a <- rnorm(nSample, 1, 0.05) # specific beta1 for z = 0 b1b <- rnorm(nSample, 1.3, 0.05) # specific beta1 for z = 1 b2a <- rnorm(nSample, 1, 0.05) # specific beta2 for z = 0 b2b <- rnorm(nSample, 1.3, 0.05) # specific beta2 for z = 1 intercept <- rnorm(2 * nSample, 3, 0.075) # intercept e <- rnorm(2 * nSample, 0, 0.05) # error term x1 <- c(x1a, x1b) x2 <- c(x1b, x2b) beta1 <- c(b1a, b1b) beta2 <- c(b2a, b2b) y <- intercept + beta1 * x1 + beta2 * x2 + e  And lets test the regression model where we pretend like the $$\beta$$ values are constant $$\forall$$ $$z$$. naiveModel <- lm(y ~ x1 + x2) summary(naiveModel) ## ## Call: ## lm(formula = y ~ x1 + x2) ## ## Residuals: ## Min 1Q Median 3Q Max ## -14.00 -2.30 0.28 2.45 8.96 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -6.0538 0.3286 -18.4 <2e-16 *** ## x1 1.8902 0.0124 152.8 <2e-16 *** ## x2 1.1386 0.0175 65.1 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.32 on 3997 degrees of freedom ## Multiple R-squared: 0.873, Adjusted R-squared: 0.873 ## F-statistic: 1.37e+04 on 2 and 3997 DF, p-value: <2e-16 AIC(naiveModel) ##  20947  If we just look at the adjusted R-squared (0.873) and p values ($$<2e-16$$, that is, like, really small /sarcasm), it looks like the model did a very good job of explaining the system in question. However, we know what the true data generating process was and all three parameters are significant off. We aren't really getting an accurate idea of what the underlying data generating process is with this model. What happens when we include the EM predicted cluster membership? classOne <- model$classification == 1  # actually z = 1
emModel <- lm(y ~ x1 + I(x1 * classOne) + x2 + I(x2 * classOne))
summary(emModel)

##
## Call:
## lm(formula = y ~ x1 + I(x1 * classOne) + x2 + I(x2 * classOne))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -13.454  -0.969   0.038   1.037   8.851
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        3.5267     0.2780   12.69  < 2e-16 ***
## x1                 1.3463     0.0152   88.80  < 2e-16 ***
## I(x1 * classOne)  -0.1186     0.0190   -6.25  4.6e-10 ***
## x2                 1.1887     0.0153   77.92  < 2e-16 ***
## I(x2 * classOne)  -0.3328     0.0158  -21.04  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.35 on 3995 degrees of freedom
## Multiple R-squared: 0.936,   Adjusted R-squared: 0.936
## F-statistic: 1.47e+04 on 4 and 3995 DF,  p-value: <2e-16

AIC(emModel)

##  18190

anova(naiveModel, emModel)

## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + I(x1 * classOne) + x2 + I(x2 * classOne)
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1   3997 43958
## 2   3995 22043  2     21915 1986 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


A significant improvement the simple sweep-it-under-the-rug model but still not all that amazing. We are still not reliably uncovering the true underlying data generating process as well as we might hope. We have to be able to do better than that.

So far we haven't taken advantage of the uncertainty measurements from the EM process. We can use these as weights in the regression to reduce the importance of points which we are most uncertain about group membership. We could also find a way to fix it in with the regression model, perhaps instead of using classZero use something like $$Pr(z_i = 0)$$, but this is the simplest way to bring them into play.

uncertainity <- model\$uncertainty

emModelWeighted <- lm(y ~ x1 + I(x1 * classOne) + x2 + I(x2 * classOne), weights = -log(uncertainity))
summary(emModelWeighted)

##
## Call:
## lm(formula = y ~ x1 + I(x1 * classOne) + x2 + I(x2 * classOne),
##     weights = -log(uncertainity))
##
## Weighted Residuals:
##    Min     1Q Median     3Q    Max
## -22.40  -1.73  -0.14   1.59  12.36
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        3.2240     0.2003    16.1   <2e-16 ***
## x1                 1.3107     0.0110   119.7   <2e-16 ***
## I(x1 * classOne)  -0.2015     0.0146   -13.8   <2e-16 ***
## x2                 1.2655     0.0110   115.3   <2e-16 ***
## I(x2 * classOne)  -0.3304     0.0118   -28.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.62 on 3995 degrees of freedom
## Multiple R-squared: 0.975,   Adjusted R-squared: 0.975
## F-statistic: 3.96e+04 on 4 and 3995 DF,  p-value: <2e-16

AIC(emModelWeighted)

##  16438


If we weight the estimates by how certain we are in the group membership, we get a much better result. Here I used $$-log(uncertainity)$$ but say similar results using $$1 - uncertainity$$ and $$1/(1 - uncertainity)^{-1}$$. I decided on using $$-log(uncertainity)$$ as I really wanted to weight towards points that I was more certain in against points that I wasn't. The $$-log$$ transformation gives me this desired effect. Using rlm() from MASS gives similar results to any of these weight transformations.

Finally, if just for curiosity's sake, I wanted to see what was “the best I could do.” Since we had assigned the clusters, I wanted to repeat the above regression but using the true assignments instead of the predicted ones.

trueClassOne <- trueClass == 1  # z = 1
cheatingModel <- lm(y ~ x1 + I(x1 * trueClassOne) + x2 + I(x2 * trueClassOne))
summary(cheatingModel)

##
## Call:
## lm(formula = y ~ x1 + I(x1 * trueClassOne) + x2 + I(x2 * trueClassOne))
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.258 -0.724 -0.017  0.723  3.945
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           2.92893    0.11932    24.6   <2e-16 ***
## x1                    0.99962    0.00769   129.9   <2e-16 ***
## I(x1 * trueClassOne)  0.30467    0.00908    33.6   <2e-16 ***
## x2                    1.00149    0.00686   145.9   <2e-16 ***
## I(x2 * trueClassOne)  0.29878    0.00751    39.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 3995 degrees of freedom
## Multiple R-squared: 0.986,   Adjusted R-squared: 0.986
## F-statistic: 7.24e+04 on 4 and 3995 DF,  p-value: <2e-16

AIC(cheatingModel)

##  12015


Looking at this model and the performance of the EM enhanced models, I have to say while having actual group membership would be best, even in this simple 2D case we readily uncovered the clusters and improved out regression quality drastically compared to ignoring the unobserved variable. What I show here can be readily extended to highly dimensional cases and may detect true clusters more easily. If we had another several $$x$$ variables, we may have been more able to more accurately distinguish between $$z = 0$$ and $$z = 1$$. But that simulation is for another post.