Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Why optim() is out of date

## Once upon a time

Once upon a time, there was a young Oxford D.Phil. graduate with a thesis on quantum mechanics who — by virtue of a mixup in identities — got hired as an Agricultural Economist. He was actually better qualified than the person he was mistaken for, especially for the work that his employers wanted him to do. This was to carry out some rather cutting edge modelling calculations that needed pretty good linear algebra and function minimization software.

In this far away time, computers were very exotic pieces of machinery. Our minor hero was lucky enough to have access to what was essentially a personal computer. This was a “share” of a Data General Nova computer. There were about 6 shares. Each share had about 4K (that’s K, or 1000) bytes for program and data, though for some reason the shares were slightly different sizes. Storage for programs and re-readable output was via a paper tape reader/punch and there was a 10 character per second teletype printer. Ear protectors were available as operating equipment. (No kidding!)

In this early 1970s world, S had not yet been thought of, and R was two decades in the future, but a crude dialect of BASIC was available. Less than a decade earlier, John Nelder and Roger Mead had published their “simplex” method, which will forever be confused with the Dantzig linear programming method of a similar name. Less than five years before, Roger Fletcher published his variant of the Variable Metric code, though he and Reeves had earlier described their non-linear conjugate-gradient method. These three methods were coded in BASIC for the small partition of the Data General Nova. On R, these methods are called “œNelder-Mead”, “BFGS” and “CG”, though “CG” offers two other ways of computing the conjugate gradients search direction besides that of Fletcher-Reeves.

## How they came to R

The various tools for small computers were written up as my 1979 book “Compact numerical methods for computers: Linear algebra and function minimization” that was published by Adam Hilger. This had about 25 algorithms in step and description form. The Institute of Physics acquired Adam Hilger and published the second edition in 1990 in which the major update was to present the algorithms in Turbo Pascal with a diskette in a pocket at the back.

In 1995, I was visiting Oxford and had a meeting with Brian Ripley, who happened to have the office once occupied by my D.Phil. supervisor (C.A.Coulson FRS — Wikipedia gives a reasonable but rather bland biography). I agreed R could use the codes.

## Meanwhile etc.

Of course, the world of optimization did not stand still in the quarter century between Fletchers VM code and R getting it in optim(). There were developments in quasi-Newton minimizers, and the 1980s code L-BFGS-B from Nocedal et al. was eventually brought into R, as was a rather crude Simulated Annealing (method “SANN”). The last method I will not mention further as it is really a different beast from other minimizers.

Dennis and Schnabel’s approach was codified in the function nlm() while David Gay’s Bell Labs code nlminb() supplied a bounds constrained minimizer. However, the calling syntax and controls of these functions differ, and differ from those of optim().

On the other hand, there were a lot of other developments in optimization and function minimization that could be of interest to R users, and some of the shortcomings of the original three methods should be acknowledged. Moreover, in 1987, Mary Walker-Smith and I had published “Nonlinear parameter estimation: an integrated system in BASIC”, where the methods were slightly improved and bounds constraints and masks (fixed parameter constraints) were added. The book and BASIC software are available at https://archive.org/details/NLPE87plus.

## Weaknesses of Nelder-Mead, BFGS, and CG

Let us examine the results of applying some of these methods to the standard Rosenbrock banana-shaped valley. The original problem has starting point (-1.2, 1). The function is given in the code below. To save myself some work, I am calling the methods through the package optimrx which allows multiple methods to be applied via the single function opm(). Alternatively, they can be applied one at a time using method = "name" using the function optimr() which has the same syntax as the base R optim(). However, we can specify which gradient approximation to use. Here we apply a very simple forward difference approximation. I intend to blog about gradients in a later posting.

require(optimrx)
options(width=110)
fnRosenbrock <- function (x)
{
n <- length(x)
x1 <- x[2:n]
x2 <- x[1:(n - 1)]
sum(100 * (x1 - x2^2)^2 + (1 - x2)^2)
}
meth0 <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "nlm", "nlminb")
st0 <- c(-1.2, 1) # the standard start
result0 <- opm(st0, fnRosenbrock, "grfwd", method=meth0)
summary(result0, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## BFGS        0.9999999 0.9999998 9.784363e-15    138     41           0  TRUE TRUE 0.003
## nlm         0.9999971 0.9999941 8.697058e-12     NA     24           0  TRUE TRUE 0.001
## nlminb      0.9999970 0.9999940 9.025650e-12     44     36           0  TRUE TRUE 0.002
## L-BFGS-B    0.9999970 0.9999940 9.119830e-12     49     49           0  TRUE TRUE 0.002
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.003
## CG          0.9174150 0.8413668 6.802418e-03   3957   1001           1 FALSE TRUE 0.049

Here we see that most of the methods get a “reasonable” answer in a short time. However, CG has not terminated within the allowed number of gradient evaluations. This is indicated by the convergence code and also the number of gradient evaluations recorded.

Truthfully, I have never been very happy with the CG algorith or code as I specified it in the book or Pascal code. That is a pity, as it is a very compact code, uses only a few vectors of working storage and allows very nice vectorization. Fortunately, as we shall see later, there is a happy update to this code.

The issue may, of course, be the use of an approximate gradient. Just to be sure, let us try using analytic gradients.

grRosenbrock <- function (x)
{
n <- length(x)
g <- rep(NA, n)
g[1]  2) {
ii <- 2:(n - 1)
g[ii] <- 2 * (x[ii] - 1) + 400 * x[ii] * (x[ii]^2 - x[ii +
1]) + 200 * (x[ii] - x[ii - 1]^2)
}
g[n] <- 200 * (x[n] - x[n - 1]^2)
g
}
result0g <- opm(st0, fnRosenbrock, grRosenbrock, method=meth0)
summary(result0g, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## nlminb      1.0000000 1.0000000 4.291816e-22     43     36           0  TRUE TRUE 0.001
## nlm         1.0000000 1.0000000 1.182096e-20     NA     24           0  TRUE TRUE 0.002
## BFGS        1.0000000 1.0000000 9.594956e-18    110     43           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999997 0.9999995 2.267577e-13     47     47           0  TRUE TRUE 0.001
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## CG          0.9174130 0.8413632 6.802745e-03   3957   1001           1 FALSE TRUE 0.033

The methods needing gradients generally are doing better with the more precise gradient, except our CG code. And if we look at the timings (which may be untrustworthy since the calculations are very fast), the gradient methods appear to have taken less time.

## Improved codes

Wanting to incorporate the bounds and masks constraints that were in Nash and Walker-Smith, I prepared package Rvmmin a few years ago. Moreover, I had discovered a very nice approach to the CG code by Yuan and Dai that combined several formulas into one, giving a program that was simultaneouly better and simpler. I was also able to incorporate bounds and masks, and this became package Rcgmin. Both these packages are entirely written in R so the algorithms can be instrumented and reworked if people wish.

In the 1990s C T Kelley made some improvements to the Nelder-Mead code. Ravi Varadhan and Hans Werner Borchers put this code into package dfoptim as nmk() with a bounds constrained version nmkb(). However, the bounds are implemented via the transfinite function rescaling, and it is not possible to start on a bound. This limitation is an unfortunate and frequent nuisance, but having the bounds possibility is nevertheless welcome. Here, of course, we are not imposing bounds.

meth1 <- c(meth0, "nmkb", "Rcgmin", "Rvmmin")
result1 <- opm(st0, fnRosenbrock, "grfwd", method=meth1)
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Successful convergence Restarts
## for stagnation =0
summary(result1, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## BFGS        0.9999999 0.9999998 9.784363e-15    138     41           0  TRUE TRUE 0.003
## Rcgmin      1.0000003 1.0000006 1.009084e-13    869    515           0  TRUE TRUE 0.041
## Rvmmin      0.9999972 0.9999944 7.779352e-12    196     42           0  TRUE TRUE 0.008
## nlm         0.9999971 0.9999941 8.697058e-12     NA     24           0  TRUE TRUE 0.002
## nlminb      0.9999970 0.9999940 9.025650e-12     44     36           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999970 0.9999940 9.119830e-12     49     49           0  TRUE TRUE 0.002
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## nmkb        1.0002749 1.0006092 4.268664e-07    120     NA           0 FALSE TRUE 0.007
## CG          0.9174150 0.8413668 6.802418e-03   3957   1001           1 FALSE TRUE 0.056

result1g <- opm(st0, fnRosenbrock, grRosenbrock, method=meth1)
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Successful convergence Restarts
## for stagnation =0
summary(result1g, order=value)
##                    p1        p2        value fevals gevals convergence  kkt1 kkt2 xtime
## Rvmmin      1.0000000 1.0000000 1.232595e-32     59     39           0  TRUE TRUE 0.004
## nlminb      1.0000000 1.0000000 4.291816e-22     43     36           0  TRUE TRUE 0.001
## nlm         1.0000000 1.0000000 1.182096e-20     NA     24           0  TRUE TRUE 0.000
## Rcgmin      1.0000000 1.0000000 8.843379e-18    735    434           0  TRUE TRUE 0.022
## BFGS        1.0000000 1.0000000 9.594956e-18    110     43           0  TRUE TRUE 0.001
## L-BFGS-B    0.9999997 0.9999995 2.267577e-13     47     47           0  TRUE TRUE 0.001
## Nelder-Mead 1.0002601 1.0005060 8.825241e-08    195     NA           0 FALSE TRUE 0.001
## nmkb        1.0002749 1.0006092 4.268664e-07    120     NA           0 FALSE TRUE 0.006
## CG          0.9174130 0.8413632 6.802745e-03   3957   1001           1 FALSE TRUE 0.033

What do we see here? The conjugate gradients codes are still “slow”, but we now get a reasonable solution with Rcgmin. The all-R codes, while a bit slower in general than those from base-R written in C, are nevertheless likely fast enough for most users. And, for this problem, standard Nelder-Mead does as well as the “improved” version (which gives rise to the warning message about “stagnation”).

## A less extreme problem

The Rosenbrock problem has very bad scaling (note the factor of 100 in the definition). Let us try a somewhat simpler problem, but allow for more parameters at once. Here we will try a problem in 25 parameters. Note that when we do the optimization, we turn off the computation of the KKT (Kuhn Karush Tucker) tests for optimality because they require the computation of the 25 by 25 Hessian matrix approximation and then its eigenvalues. We could include the trace of the computations, and indeed had this set to 1 during early tests of this problem, but do not want a lot of output in the final blog. Similarly, we select just the first 4 parameters for display using par.select in the summary output.

fourth <- function(x){
n <- length(x)
val <- 10 # to ensure we don't get a very small number that keeps changing
for (i in 2:n){
t1 <- (x[i-1] - i + 1)
t2 <- (x[i] - i)
val <- val + (t1*t1+t2*t2)^2
}
val
}
stt <- rep(1,25)
resultt <- opm(stt, fourth, "grfwd", method=meth1, control=list(kkt=FALSE, trace=0))
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Maximum number of fevals
## exceeded Restarts for stagnation =0
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt, order=value, par.select=1:4)
##                    p1         p2       p3        p4        value fevals gevals convergence kkt1 kkt2 xtime
## Rcgmin      1.0000000   2.001618 3.000212  4.000777 1.000000e+01     46     41           0   NA   NA 0.070
## Rvmmin      1.0000000   2.000576 3.000087  4.000809 1.000000e+01    376   2501           1   NA   NA 4.288
## L-BFGS-B    1.0000000   1.996713 2.997969  3.997143 1.000000e+01     33     33           0   NA   NA 0.054
## CG          1.0000000   1.994371 2.996702  3.995209 1.000000e+01    389    193           0   NA   NA 0.332
## nlm         1.0000000   1.988893 3.003896  4.000518 1.000000e+01     NA    118           0   NA   NA 0.257
## nlminb      1.0000000   2.001322 2.988232  3.992054 1.000000e+01    155    151           1   NA   NA 0.261
## BFGS        1.0000000   2.007567 2.998411  4.002900 1.000000e+01    182     70           0   NA   NA 0.122
## nmkb        0.8702348   2.145185 2.977281  3.986003 1.001408e+01   2500     NA           1   NA   NA 0.423
## Nelder-Mead 4.5912291 -14.477946 4.450378 -3.549623 2.672876e+06   2502     NA           1   NA   NA 0.156

We can also compute with the analytic gradient. This gives the following code and results.

require(numDeriv)
fourth.g <- function(x){
n <- length(x)
gg <- rep(0,n)
for (i in 2:n){
t1 <- (x[i-1] - i + 1)
t2 <- (x[i] - i)
#      val <- val + (t1*t1+t2*t2)^2
gg[i-1] <- gg[i-1] + 4*(t1*t1+t2*t2)*t1
gg[i] <- gg[i] + 4*(t1*t1+t2*t2)*t2
}
gg
}
ga <- fourth.g(stt)
cat("max abs difference (analytic - numeric)=", max(abs(gn-ga)), "\n")
## max abs difference (analytic - numeric)= 6.714727e-05
cat("max rel difference (analytic - numeric)=", max(abs(gn-ga))/max(abs(ga)), "\n")
## max rel difference (analytic - numeric)= 3.445995e-10
resulttg <- opm(stt, fourth, fourth.g, method=meth1, control=list(kkt=FALSE, trace=0))
## Warning in optimr(par, fn, gr, method = meth, lower = lower, upper = upper, : Maximum number of fevals
## exceeded Restarts for stagnation =0
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resulttg, order=value, par.select=1:4)
##                    p1         p2        value fevals gevals convergence kkt1 kkt2 xtime
## Rcgmin      1.0000000   2.001618 1.000000e+01     46     41           0   NA   NA 0.075
## Rvmmin      1.0000000   2.000576 1.000000e+01    376   2501           1   NA   NA 4.371
## L-BFGS-B    1.0000000   1.996713 1.000000e+01     33     33           0   NA   NA 0.057
## CG          1.0000000   1.994371 1.000000e+01    389    193           0   NA   NA 0.341
## nlm         1.0000000   1.988893 1.000000e+01     NA    118           0   NA   NA 0.259
## nlminb      1.0000000   2.001322 1.000000e+01    155    151           1   NA   NA 0.271
## BFGS        1.0000000   2.007567 1.000000e+01    182     70           0   NA   NA 0.124
## nmkb        0.8702348   2.145185 1.001408e+01   2500     NA           1   NA   NA 0.419
## Nelder-Mead 4.5912291 -14.477946 2.672876e+06   2502     NA           1   NA   NA 0.160

Note that we ALWAYS test our gradient code using a good approximation such as that in the grad() function of package numDeriv.

Observations we can make on the results above are as follows:

• The heuristic direct-search methods Nelder-Mead and nmkb have to try too many test sets of parameters to find the minimum of the fourth function. Both methods hit the limit of function evaluations without even getting near the solution.

• The variable metric methods BFGS and Rvmmin both do quite well. In particular Rvmmin gets the best result with both approximate and analytic gradients. However, it takes quite a lot of function and gradient evaluations. With the trace flag set, we see these methods trying very hard to reduce the function long after a reasonable solution has been found. This is, in fact, part of the very conservative design of this method (Algorithm 21 in the original Compact Numerical Methods). We are getting the behaviour that was intended, but that might not quite give the performance users like.

• Both CG and its update, Rcgmin, do well. Despite being purely in R, we see Rcgmin is very fast for this problem, taking relatively few gradient and function evaluations and turning in a very fast timing, along with a respectable solution.

• Gradient methods do better with analytic gradients, both in result and timing. L-BFGS-B does particularly well (it shares some similarities with Rcgmin, but is written in C).

If we don’t have analytic gradients, can we do better than a simple forward difference approximation. Package optimrx allows for a default approximation if we set the gradient function NULL, but such a choice may leave the gradient undefined, and certainly does not allow us to say how it is computed. We can expect a backward difference to have similar properties. On the other hand, both central differences and extrapolated approximations are known to offer better accuracy. Let us try these as a final look at these methods.

# Central difference approximation
meth2<- c("BFGS", "CG", "L-BFGS-B", "nlm", "nlminb", "Rcgmin", "Rvmmin")
resultt2 <- opm(stt, fourth, "grcentral", method=meth2, control=list(kkt=FALSE, trace=0))
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt2, order=value, par.select=1:4)
##          p1       p2 value fevals gevals convergence kkt1 kkt2 xtime
## Rvmmin    1 1.998545    10    379   2501           1   NA   NA 8.197
## Rcgmin    1 1.996014    10     45     41           0   NA   NA 0.130
## L-BFGS-B  1 1.996731    10     33     33           0   NA   NA 0.103
## CG        1 1.994366    10    389    193           0   NA   NA 0.617
## nlm       1 1.988894    10     NA    118           0   NA   NA 0.520
## BFGS      1 1.994444    10    178     66           0   NA   NA 0.221
## nlminb    1 2.001275    10    155    151           1   NA   NA 0.500
resultt3 <- opm(stt, fourth, "grnd", method=meth2, control=list(kkt=FALSE, trace=0))
## Warning in Rvmminu(par = spar, fn = efn, gr = egr, control = mcontrol, ...): Too many gradient evaluations
summary(resultt3, order=value, par.select=1:2)
##          p1       p2 value fevals gevals convergence kkt1 kkt2  xtime
## Rcgmin    1 1.996722    10     44     38           0   NA   NA  0.503
## Rvmmin    1 1.996348    10    369   2501           1   NA   NA 33.771
## L-BFGS-B  1 1.996754    10     33     33           0   NA   NA  0.439
## CG        1 1.994370    10    389    193           0   NA   NA  2.593
## BFGS      1 1.996236    10    186     73           0   NA   NA  0.972
## nlm       1 1.988893    10     NA    117           0   NA   NA  2.031
## nlminb    1 2.001443    10    155    151           1   NA   NA  2.003

From these we see that the more accurate gradient approximations require more time. In conjunction with the very conservative and bulldog-like Rvmmin, they take so much time as to be a poor choice for regular usage. On the other hand, Rcgmin does “not too badly”.

## Conclusions

The function minimization methods I developed that are in optim() are still quite effective, but they were cutting edge four decades ago and showing a few wrinkles. There are very similar tools available in R that usually will do better.

From the examples above, we see what I have more generally observed:

• For large, not too badly scaled problems, Rcgmin and L-BFGS-B are good choices for solvers, especially if analytic gradients are available.

• For large problems it is wise to turn off the computation of the KKT tests when using opm(). There is a function kktchk() in the packages optextras which can be called separately once the trial solution has been found.

• For small problems, most of the solvers do quite well. Rvmmin is, by design, very aggressive in seeking solutions, so quite useful when problems are “difficult” but the function and gradient are relatively quick to compute.

• When gradients are not available, and the problem is not too “difficult”, a simple forward gradient approximation with a gradient method is often more efficient than a derivative-free method.

The above observations should be read with the proviso that this short discussion has omitted to include several of the solvers available to optimrx, some of which may be particularly suited to special cases users may encounter.