# Simple Distributions for Mixtures?

First published on **R-english – Freakonometrics**, and kindly contributed to R-bloggers.


The idea of GLMs is that, given some covariates $X$, $Y|X$ has a distribution in the exponential family (Gaussian, Poisson, Gamma, etc.). But that does not mean that $Y$ has a similar distribution, so there is no reason to test for a Gamma model for $Y$ before running a Gamma regression, for instance. But are there cases where it might work, where the non-conditional distribution is in the same family as the conditional ones?

For instance, if $(X,Y)$ has a joint Gaussian distribution, then both marginals are Gaussian, and so is $Y|X$. So, in that case, if the covariate $X$ is normally distributed, it is possible to have a Gaussian distribution also for $Y$. The econometric interpretation is that with a standard Gaussian linear model, if $X$ is normally distributed, not only is the conditional distribution $Y|X$ Gaussian, but so is the non-conditional distribution of $Y$.

    > set.seed(1)
    > n=1e3
    > X=rnorm(n,10,2)
    > Y=1+3*X+rnorm(n)
    > plot(X,Y,xlim=c(4,20))

Indeed, here the distribution of $Y$ is also Gaussian:

    > library(nortest)
    > ad.test(Y)
    Anderson-Darling normality test
    data:  Y
    A = 0.23155, p-value = 0.802
    > shapiro.test(Y)
    Shapiro-Wilk normality test
    data:  Y
    W = 0.99892, p-value = 0.8293

(This does not hold only from a statistical point of view: the theory of Gaussian random vectors confirms that the non-conditional distribution is indeed Gaussian.)
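The theory also gives the parameters explicitly. With $X\sim\mathcal{N}(10,2^2)$ and an independent standard Gaussian noise, $Y=1+3X+\varepsilon$ is Gaussian with mean $1+3\times 10=31$ and variance $3^2\times 2^2+1=37$. A minimal sketch comparing these values with the simulated sample (the names `mu_Y` and `sd_Y`, and the use of `ks.test`, are additions for illustration and not part of the original code):

```r
set.seed(1)
n <- 1e3
X <- rnorm(n, 10, 2)
Y <- 1 + 3 * X + rnorm(n)

# theoretical moments of Y = 1 + 3X + eps, with X ~ N(10, 2^2) and eps ~ N(0, 1)
mu_Y <- 1 + 3 * 10            # mean: 31
sd_Y <- sqrt(3^2 * 2^2 + 1)   # standard deviation: sqrt(37)

# empirical versus theoretical moments
print(c(empirical = mean(Y), theoretical = mu_Y))
print(c(empirical = sd(Y), theoretical = sd_Y))

# Kolmogorov-Smirnov test against the fully specified Gaussian distribution
print(ks.test(Y, "pnorm", mean = mu_Y, sd = sd_Y))
```

Unlike the Anderson-Darling and Shapiro-Wilk tests, which estimate the mean and variance from the data, this last test checks the sample against one specific Gaussian distribution.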

Here $X$ is continuous. What if we consider a finite mixture, i.e. $X$ takes only a finite number of values? Actually, Teicher (1963) proved that it is not possible to have a non-conditional Gaussian distribution for $Y$. But in practice, would we really reject the Gaussian assumption for $Y$? If the number of classes is too small, yes. But with a large number of classes (a sufficiently large number of mixture components), the test may well fail to reject it:

    > pv=function(k=2){
    +   n=1e4
    +   X=rnorm(n,10,2)
    +   Q=quantile(X,(0:k)/k)   # k+1 breaks, i.e. k equiprobable classes
    +   Q[1]=0                  # lower the first break so that min(X) falls in class 1
    +   Xc=cut(X,Q,labels=1:k)
    +   XcN=tapply(X,Xc,mean)   # mean of X within each class
    +   Xn=XcN[as.numeric(Xc)]  # replace each X by the mean of its class
    +   Y=1+3*Xn+rnorm(n)
    +   ad.test(Y)$p.value}
    > plot(2:100,Vectorize(pv)(2:100),type="l")
    > abline(h=.05,col="red")

So here, it could also be possible to have a Gaussian distribution for $Y$, or at least to accept that assumption, statistically.
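To make the mixture explicit: in the function `pv`, the $k$ classes are built from empirical quantiles, so each class has probability $1/k$, and within class $j$ the response is Gaussian with unit variance. Writing $m_j$ for the mean of $X$ in class $j$ and $\varphi$ for the standard Gaussian density (notation introduced here, not in the original code), the non-conditional density of $Y$ is

$$
f_Y(y)=\frac{1}{k}\sum_{j=1}^{k}\varphi\big(y-(1+3m_j)\big)
$$

With $k=2$ the two components are far apart relative to the unit noise and the mixture is clearly bimodal. As $k$ grows, neighbouring components overlap more and more, which is presumably why the test stops rejecting.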

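In the context of a Poisson regression, it is well known that it is not possible to have *at the same time* that $Y|X$ is Poisson distributed (that's a Poisson regression) and also that $Y$ is Poisson distributed. That simply comes from the fact that

$$E[Y]=E\big[E[Y|X]\big]$$

while

$$\text{Var}[Y]=E\big[\text{Var}[Y|X]\big]+\text{Var}\big[E[Y|X]\big]$$

and because of the conditional Poisson distribution, $\text{Var}[Y|X]=E[Y|X]$, so that

$$\text{Var}[Y]=E\big[E[Y|X]\big]+\text{Var}\big[E[Y|X]\big]$$

Thus, as soon as $E[Y|X]$ is not constant,

$$\text{Var}[Y]=E[Y]+\text{Var}\big[E[Y|X]\big]>E[Y]$$

So $Y$ cannot be Poisson distributed. But again, it could be possible, if heterogeneity is not too large, to accept the null assumption of a Poisson distribution for $Y$.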
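This overdispersion is easy to see on simulated data. A minimal sketch, assuming a Poisson regression with a logarithmic link and a Gaussian covariate (the coefficients `-1` and `0.2` are arbitrary choices made for illustration):

```r
set.seed(1)
n <- 1e4
X <- rnorm(n, 10, 2)
lambda <- exp(-1 + 0.2 * X)   # conditional mean E[Y|X], arbitrary coefficients
Y <- rpois(n, lambda)         # Y|X is Poisson distributed by construction

# variance decomposition: Var[Y] = E[Y] + Var[E[Y|X]]
print(c(mean = mean(Y), var = var(Y), mean_plus_var_lambda = mean(Y) + var(lambda)))

# dispersion index, which would be close to 1 for a Poisson distribution
print(var(Y) / mean(Y))
```

Shrinking the coefficient on `X` towards zero reduces the heterogeneity of `lambda`, and the dispersion index then moves back towards 1.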

More generally, it is very difficult to have a distribution family for $Y|X$ that is also the distribution of the non-conditional variable $Y$. In the context of a finite mixture ($X$ takes a finite number of values), Teicher (1963) proved that it is not possible, neither for the Gaussian distribution nor for the Gamma distribution. And to go further, check Monfrini (2002) (thanks Romuald for pointing out the reference).

Hence, as I keep saying, before running a regression model on $Y|X$ with some given family, it is never a good idea to check whether the non-conditional distribution of $Y$ belongs to the same family. Because there is no reason, usually, to remain in the same family.
