Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A friend recently asked whether I trusted the inferences from heteroskedastic probit models. I said no, because the heteroskedastic probit does not allow a researcher to distinguish between non-constant variance and a mis-specified mean function.

In particular, my friend had a hypothesis that the variance of the latent outcome (commonly called “y-star”) should increase with an explanatory variable of interest. He was using the heteroskedastic probit model, which looks something like $Pr(y_i = 1) = \Phi(X_i\beta, e^{Z_i\gamma})$, where $\Phi()$ is the cumulative normal with mean $X_i\beta$ and $e^{Z_i\gamma}$.

He wanted to argue that his explanatory variable increased both the mean function ($X\beta$) and the variance function ($e^{Z\gamma}$). To do this, he included his variable in both the $X$ and $Z$ matrices and tested the statistical significance of the associated coefficients. He found that they were both significant. It would seem that his variance increases the mean and the variance of the latent outcome. He wanted to know if this was good evidence for his theory.

I replied that I did not think so, because a binary outcome variable doesn’t contain any direct information about a non-constant variance. Indeed, the variance of a Bernoulli random variable is tied directly to the probability of success. This implies that any inference about changes in $e^{Z\gamma}$ must come from observed changes in the probability of a success (i.e. changes in the mean function). Because we’ve assumed a specific (i.e. linear) functional form for the mean function, deviations from this will be attributed to the variance function. Because of this structure, the results are driven totally by our assumption of the linearity of the mean function. Indeed, it would not be hard to find a plausible non-linear mean function (e.g. quadratic specification) that makes the $\gamma$ parameter no longer significant.

## Example One

I thought a good way to illustrate this claim would be to show that for a large but plausible sample size of one million, the heteroskedastic probit will suggest a non-constant variance when the relationship is simply a logit.

To see an illustration of this, start by simulating data from a simple logit model. Then estimate a regular probit and a heteroskedastic probit.

n <- 10^6
x <- runif(n)
y <- rbinom(n, 1, plogis(-4 + 3*x))

library(glmx)
h1 <- hetglm(y ~ x)



Now if the coefficient for x is significant in the model of the scale, then we should conclude there is heteroskedasticity, right? No, because we already know that the latent variance is constant. However, we've barely mis-specified the link function (we're using a probit, the true model is logit). This slight mis-specification causes the results to point toward non-constant variance.

Coefficients (binomial model with probit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.090026   0.007269  -287.5   <2e-16 ***
x            1.589261   0.007418   214.2   <2e-16 ***

Latent scale model coefficients (with log link):
Estimate Std. Error z value Pr(>|z|)
x -0.20780    0.01475  -14.09   <2e-16 ***



Here is a plot of the predicted probabilities from the true, probit, and heteroskedastic probit models. Notice that in the range of the data, the heteroskedastic probit does a great job of representing the relationship. However, that's not because the variance is non-constant as the heteroskedastic probit would suggest. It's because the link function is slightly mis-specified. ## Example Two

I think the logit example makes the point powerfully, but let's look at a second example just for kicks. This time, let's say that we believe there's heteroskedasticity that can be accounted for by x, so we estimate a heteroskedastic probit and include x in the mean and variance function. However, again we're wrong. Actually the true model has a constant variance, but a non-linear mean function ($\beta_0 + \beta_1x^2$).

If we simulate the data and estimate the model, we see again that our mis-specified mean function leads us to conclude that the variance is non-constant. In this case though, the mis-specification is severe enough that you'll find significant results with much smaller sample. If we made conclusions about the non-constant variance from the statistical significance of coefficients in the model of the variance, then we would be led astray.

Coefficients (binomial model with probit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.6122     0.3726  -9.694   <2e-16 ***
x             3.1493     0.2809  11.210   <2e-16 ***

Latent scale model coefficients (with log link):
Estimate Std. Error z value Pr(>|z|)
x  -0.8194     0.2055  -3.988 6.66e-05 ***


Again, the plot shows that the heteroskedastic probit does a good job at adjusting for the mis-specified mean function (working much like a non-parametric model). ## So What Should You Do?

I think that researchers who have a theory that allows them to speculate about the mean and variance of a latent variable should go ahead and estimate a statistical model that maps cleanly onto their theory (like the heteroskedastic probit). However, these researchers should realize that this model does not allow them to distinguish between non-constant variance and a mis-specified mean function.