Optim, you’re doing it wrong?

Posted on May 28, 2012 by diffuseprior in R bloggers | 0 Comments

[This article was first published on DiffusePrioR » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook.

lfp <- read.csv("lfp.csv")
lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0)
formula=lfp ~ age+I(age^2)+log(inc)+ed+kids
data=lfp
aa <- lm(formula,data)
X <- model.matrix(aa)
y <- data.frame(lfp$lfp)
pll <- function(beta){
  sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta)))
  }

The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS",
+              control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   2.59    0.00    2.66 
> system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5))
   user  system elapsed 
   4.36    0.00    4.46 
> system.time(mod1 <- glm(formula,family=binomial(link="probit"),data))
   user  system elapsed 
   0.03    0.00    0.04 
> 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                     [,1]         [,2]        [,3]
(Intercept) -3.3929077241 -5.326569347 -5.32655206
age         -0.0565559741  0.128949421  0.12894921
I(age^2)     0.0004543657 -0.001675693 -0.00167569
log(inc)     0.4287758432  0.223211383  0.22320995
ed           0.0812164389  0.086559458  0.08655950
kids        -0.3168540384 -0.323633523 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                    [,1]         [,2]         [,3]
(Intercept) 9.387307e-01 1.6182304697 1.5569666788
age         8.835891e-03 0.0659315872 0.0642575252
I(age^2)    8.978703e-05 0.0007601266 0.0007413629
log(inc)    9.897385e-02 0.0998697296 0.0999188711
ed          2.302763e-02 0.0230260173 0.0229592562
kids        1.020938e-01 0.1017291904 0.1015104540

In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors.

> pgr <- function(beta){
+   t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - 
+     (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X
+ }
> system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS",
+                          control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   0.40    0.00    0.42 
> system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5))
   user  system elapsed 
   0.57    0.00    0.58 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                    [,1]        [,2]        [,3]
(Intercept) -5.326570797 -5.32656034 -5.32655206
age          0.128949476  0.12894917  0.12894921
I(age^2)    -0.001675693 -0.00167569 -0.00167569
log(inc)     0.223211464  0.22321101  0.22320995
ed           0.086559408  0.08655946  0.08655950
kids        -0.323633519 -0.32363352 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                  [,1]         [,2]         [,3]
(Intercept) 0.50590822 1.5510444988 1.5569666788
age                NaN 0.0640955648 0.0642575252
I(age^2)           NaN 0.0007394648 0.0007413629
log(inc)    0.09833250 0.0989695518 0.0999188711
ed          0.02293623 0.0230289024 0.0229592562
kids        0.10153358 0.1017304094 0.1015104540
Warning message:
In sqrt(diag(-solve(ml1$hessian))) : NaNs produced

The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function.

Based on my experience thus far, I would recommend the use of maxLik compared to optim.