Maximum Likelihood versus Goodness of Fit


Thursday, I got an interesting question from a colleague of mine (JP). I mean, the way I understood the question turned out to be a nice puzzle (but I have to confess I might have misunderstood). The question is the following: consider an i.i.d. sample $\{X_1,\cdots,X_n\}$ of continuous variables. We would like to choose between two (parametric) families for the distribution, $\mathcal{F}=\{F_{\boldsymbol\alpha};\boldsymbol\alpha\in\mathcal{A}\}$ and $\mathcal{G}=\{G_{\boldsymbol\beta};\boldsymbol\beta\in\mathcal{B}\}$. If we use maximum likelihood techniques, we get two estimators, one for each family, $\widehat{\boldsymbol\alpha}$ and $\widehat{\boldsymbol\beta}$. Clearly, $F_{\widehat{\boldsymbol\alpha}}(\cdot)$ is much better than $G_{\widehat{\boldsymbol\beta}}(\cdot)$, in the sense of a standard goodness-of-fit test (e.g. Kolmogorov-Smirnov, since the sample is assumed to come from a continuous variable). Does that mean that family $\mathcal{F}$ is (somehow) better than family $\mathcal{G}$?

This is my interpretation of the question, and I found it amusing. So I will try to show (using simulated samples) that some odd situations can easily be obtained.

Consider a sample from a mixture of log-normal distributions,

>  set.seed(228)
>  X=exp(c(rnorm(50,1,1),rnorm(50,2,1.2)))
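
As a side note, since the sample mixes two lognormal components with equal weights, the true cdf can be written explicitly; a small helper (call it Ftrue, a name chosen here for convenience) gives a benchmark the fitted distributions can be compared against,

> # true cdf of the simulated 50/50 lognormal mixture (helper introduced here)
> Ftrue=function(x) .5*plnorm(x,1,1)+.5*plnorm(x,2,1.2)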

Consider two standard families for positive random variables: a Gamma distribution and a lognormal distribution.

> library(MASS)
> ab=fitdistr(X,"gamma")
> ms=fitdistr(X,"lognormal")

If we want to visualize those two fitted distributions against the empirical cdf, let us first define a grid of points and use

> u=seq(0,max(X)+1,length=501)  # grid of points where the fitted cdfs are evaluated
> vab=pgamma(u,ab$estimate[1],ab$estimate[2])
> vms=plnorm(u,ms$estimate[1],ms$estimate[2])
> plot(ecdf(X))
> lines(u,vab,col="red")
> lines(u,vms,col="blue")

Here, we get the following graph (fitted Gamma cdf in red, fitted lognormal cdf in blue, on top of the empirical cdf).

What else can we say? Actually, we can also compute the Kolmogorov-Smirnov statistic,

$$D_n=\sup_x |\widehat{F}_n(x)-F_\star(x)|$$

where

$$\widehat{F}_n(x)=\frac{1}{n}\sum_{i=1}^n \boldsymbol{1}_{X_i\leq x}$$

This can be done using

> ks.test(X,"plnorm",ms$estimate[1],ms$estimate[2])

One-sample Kolmogorov-Smirnov test

data:  X
D = 0.0693, p-value = 0.7231
alternative hypothesis: two-sided

> ks.test(X,"pgamma",ab$estimate[1],ab$estimate[2])

One-sample Kolmogorov-Smirnov test

data:  X
D = 0.148, p-value = 0.02507
alternative hypothesis: two-sided
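
As a sanity check, the statistic above can also be computed by hand. Here is a minimal sketch (assuming no ties in the sample; the helper ks.stat and its arguments are just names chosen here), which should return the same D values as ks.test(),

> ks.stat=function(X,cdf,...){
+  n=length(X)
+  Fx=cdf(sort(X),...)
+  # the empirical cdf jumps at each observation, so the supremum is
+  # attained either just before or at an observation
+  max(pmax(abs(Fx-(1:n)/n),abs(Fx-(0:(n-1))/n)))
+ }
> ks.stat(X,plnorm,ms$estimate[1],ms$estimate[2])
> ks.stat(X,pgamma,ab$estimate[1],ab$estimate[2])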

From a theoretical point of view, we should not look at the p-values, since the null distribution assumes a fixed distribution, not a fitted one (see the Lilliefors test for normal samples). But still. The fitted Gamma distribution seems to be very far from the true one: its statistic is twice the one we get with the fitted lognormal distribution, and one p-value is 72% while the other is 2.5%. Here, we should prefer the lognormal distribution to the Gamma one. But we considered only one distribution in each family. Does that mean that we cannot find a Gamma distribution that is better than all possible lognormal distributions? Better, for instance, according to the Kolmogorov-Smirnov statistic…
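
As an aside, if we really wanted p-values that account for the fact that the parameters were estimated, one standard option is a parametric bootstrap, in the spirit of Lilliefors' correction. Here is a rough sketch for the lognormal fit (the 500 replications and the object names are arbitrary choices here),

> D.obs=ks.test(X,"plnorm",ms$estimate[1],ms$estimate[2])$statistic
> D.boot=rep(NA,500)
> for(b in 1:500){
+  # simulate from the fitted lognormal, re-fit, and recompute the statistic
+  Xb=rlnorm(length(X),ms$estimate[1],ms$estimate[2])
+  msb=fitdistr(Xb,"lognormal")
+  D.boot[b]=ks.test(Xb,"plnorm",msb$estimate[1],msb$estimate[2])$statistic
+ }
> mean(D.boot>=D.obs)  # bootstrap p-value, accounting for estimation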

Well, it is possible to use another strategy to find appropriate parameters: we can actually minimize this statistic. Define

> ks1=function(ms) {m=ms[1];s=ms[2];ks.test(X,"plnorm",m,s)$statistic}
> ks2=function(ab) {a=ab[1];b=ab[2];ks.test(X,"pgamma",a,b)$statistic}

and compute

> n1=nlm(ks1,c(ms$estimate[1],ms$estimate[2]))
> n1
$minimum
[1] 0.05252692

$estimate
[1] 1.547437 1.121864
> n2=nlm(ks2,c(ab$estimate[1],ab$estimate[2]))
> n2
$minimum
[1] 0.04737725

$estimate
[1] 1.1449041 0.167041

So here, it is actually possible to find, within the Gamma family, a distribution that is even closer to the empirical sample (in the Kolmogorov-Smirnov sense) than the best lognormal one.

>  vab=pgamma(u,n2$estimate[1],n2$estimate[2])
>  vms=plnorm(u,n1$estimate[1],n1$estimate[2])
>  lines(u,vab,col="red",lwd=2)
>  lines(u,vms,col="blue",lwd=2)

What would be the point here? Maybe just the idea that the maximum likelihood estimator is only one estimator among many. And while it has interesting asymptotic properties, on small samples it might not be the best estimator to consider…
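
One quick way to see that trade-off is to compare the log-likelihoods of the two Gamma fits: by construction, the maximum likelihood fit should have the larger log-likelihood, even though the Kolmogorov-Smirnov-minimizing fit is closer to the empirical cdf,

> # log-likelihood of the maximum likelihood Gamma fit
> sum(dgamma(X,ab$estimate[1],ab$estimate[2],log=TRUE))
> # log-likelihood of the KS-minimizing Gamma fit (necessarily lower)
> sum(dgamma(X,n2$estimate[1],n2$estimate[2],log=TRUE))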

And to be completely honest, I've been cheating here… I mean, not really cheating (no more than any researcher using a statistical test to validate their findings). But here, I did fix the seed of the random number generator. Actually, such an example does not occur that frequently: out of 1000 samples, I got this odd conclusion almost 15 times. And the smaller the sample, the more likely we are to observe that story, where the maximum likelihood estimator can be far away from the best fit. Here is the proportion of opposite conclusions, as a function of the sample size,

> SIM=function(ns=1000,n=100){
+ t=0
+ for(s in 1:ns){
+  set.seed(s)
+  # simulated sample from the mixture of lognormal distributions
+  X=exp(c(rnorm(n/2,1,1),rnorm(n/2,2,1.2)))
+  # Kolmogorov-Smirnov statistics, as functions of the parameters
+  ks1=function(ms) {m=ms[1];s=ms[2];ks.test(X,"plnorm",m,s)$statistic}
+  ks2=function(ab) {a=ab[1];b=ab[2];ks.test(X,"pgamma",a,b)$statistic}
+  library(MASS)
+  # maximum likelihood estimators
+  ab=fitdistr(X,"gamma")
+  ms=fitdistr(X,"lognormal")
+  # minimum Kolmogorov-Smirnov estimators
+  n1=nlm(ks1,c(ms$estimate[1],ms$estimate[2]))
+  n2=nlm(ks2,c(ab$estimate[1],ab$estimate[2]))
+  # count the cases where the two criteria disagree on the preferred family
+  if( (ks.test(X,"plnorm",ms$estimate[1],ms$estimate[2])$statistic-
+  ks.test(X,"pgamma",ab$estimate[1],ab$estimate[2])$statistic)
+ *(n1$minimum-n2$minimum)<=0 ) t=t+1
+ }
+ return(t/ns)}

> VM=rep(NA,20)
> VS=seq(10,200,by=10)
> for(i in 1:20){VM[i]=SIM(n=VS[i],ns=1000)}
> plot(VS,VM,type="p")

So to provide a more complete answer to JP's question: with a very large sample, I guess that your intuition should be valid… but clearly not on a small sample.
