How much faster is calibration with parallel processing and/or R byte-code compilation?


The Svensson model is a 6-factor yield curve model derived from the Nelson-Siegel model in: Svensson, L. E. (1995). Estimating forward interest rates with the extended Nelson & Siegel method. Sveriges Riksbank Quarterly Review, 3(1), 13-26.
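
For reference, the zero rate at maturity $m$ produced by the model (this is the formula implemented as NSS2 below) is:

$$y(m) = \beta_1 + \beta_2\,\frac{1 - e^{-m/\tau_1}}{m/\tau_1} + \beta_3\left(\frac{1 - e^{-m/\tau_1}}{m/\tau_1} - e^{-m/\tau_1}\right) + \beta_4\left(\frac{1 - e^{-m/\tau_2}}{m/\tau_2} - e^{-m/\tau_2}\right)$$

with parameter vector $(\beta_1, \beta_2, \beta_3, \beta_4, \tau_1, \tau_2)$.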

In this post, I benchmark the R package mcGlobaloptim on calibrating this model to observed (here, fictitious) rates.

This will be done in pure R, with sequential or parallel processing, and with or without byte-code compilation (in short, a compact numeric code sitting between source code and machine code, into which the source code is 'translated'; please refer to this article or this one).
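
As a toy illustration of the idea (an aside added here, unrelated to the calibration itself), one can compile an explicit R loop and time both versions. Note that on recent versions of R, where the JIT compiler is enabled by default, the gap may be much smaller than it was when byte-code compilation had to be requested explicitly:

# toy example: an explicit-loop sum, interpreted vs. byte-compiled
f <- function(x) {
    s <- 0
    for (xi in x) s <- s + xi
    s
}
fc <- compiler::cmpfun(f)
x <- runif(1e5)
microbenchmark::microbenchmark(f(x), fc(x), times = 100)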

Here are the R packages required for this experiment:

# Support for simple parallel computing in R
require(snow)
# Global optimization using Monte Carlo and Quasi 
# Monte Carlo simulation
require(mcGlobaloptim)
# The R Compiler Package
require(compiler)
# Sub microsecond accurate timing functions
require(microbenchmark)
# Benchmarking routine for R
require(rbenchmark)

The R code for the objective function (the function to be minimized) used here appears in: Gilli, Manfred and Schumann, Enrico, A Note on 'Good Starting Values' in Numerical Optimisation (June 3, 2010). Available at SSRN: http://ssrn.com/abstract=1620083 or http://dx.doi.org/10.2139/ssrn.1620083

### Calibrating the Nelson-Siegel-Svensson (Svensson) model
NSS2 <- function(betaV, mats) {
    gam1 <- mats/betaV[5]
    gam2 <- mats/betaV[6]
    aux1 <- 1 - exp(-gam1)
    aux2 <- 1 - exp(-gam2)
    betaV[1] + betaV[2] * (aux1/gam1) +
        betaV[3] * (aux1/gam1 + aux1 - 1) +
        betaV[4] * (aux2/gam2 + aux2 - 1)
}

# True parameters: beta1, ..., beta4, then tau1 and tau2
betaTRUE <- c(5, -2, 5, -5, 1, 3)

# Input maturities, in months, converted to years
mats <- c(1, 3, 6, 9, 12, 15, 18, 21, 24, 30, 36, 48, 60,
          72, 84, 96, 108, 120)/12

# Corresponding yields to maturity
yM <- NSS2(betaTRUE, mats)
dataList <- list(yM = yM, mats = mats, model = NSS2)

# Bounds for the parameter search
settings <- list(min = c(0, -15, -30, -30, 0, 3),
                 max = c(15, 30, 30, 30, 3, 6), d = 6)

# Objective function
OF <- function(betaV, dataList) {
    mats <- dataList$mats
    yM <- dataList$yM
    model <- dataList$model
    y <- model(betaV, mats)
    aux <- y - yM
    # sum of squared deviations from the observed yields
    crossprod(aux)
}
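
As a quick sanity check (an addition to the original code), the true parameters should give an objective value of exactly zero, since yM was generated by the model itself:

# the model reproduces yM exactly at betaTRUE, so the sum of
# squared deviations is zero (crossprod returns a 1x1 matrix)
OF(betaTRUE, dataList)

##      [,1]
## [1,]    0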

A compiled version of the objective function is then defined, using the package compiler from L. Tierney (now part of base R):

# define compiled objective function
OFcmp <- cmpfun(OF)
OFcmp



## function(betaV, dataList) {
##     mats <- dataList$mats
##     yM <- dataList$yM
##     model <- dataList$model
##     y <- model(betaV, mats)
##     aux <- y - yM
##     crossprod(aux)
## }
## bytecode: 0x000000000dfa98b8
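
Out of curiosity (this spot check is an addition to the post), a single evaluation of each version can be timed directly:

# illustrative only: time one evaluation of the plain and compiled
# objective functions
microbenchmark(OF(betaTRUE, dataList),
               OFcmp(betaTRUE, dataList), times = 1000)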

The four situations being tested are the following:
* Optimization with sequential processing and no byte-code compilation
* Optimization with sequential processing and byte-code compilation
* Optimization with parallel processing and no byte-code compilation
* Optimization with parallel processing and byte-code compilation

Thus, the performance of multiStartoptim will be tested through the following four wrapper functions:

OF_classic <- function(.n) {
    multiStartoptim(objectivefn = OF, data = dataList,
                    lower = settings$min, upper = settings$max,
                    method = "nlminb", nbtrials = .n,
                    typerunif = "sobol")
}

OF_cmp <- function(.n) {
    multiStartoptim(objectivefn = OFcmp, data = dataList,
                    lower = settings$min, upper = settings$max,
                    method = "nlminb", nbtrials = .n,
                    typerunif = "sobol")
}

OF_classic_parallel <- function(.n) {
    multiStartoptim(objectivefn = OF, data = dataList,
                    lower = settings$min, upper = settings$max,
                    method = "nlminb", nbtrials = .n,
                    typerunif = "sobol", nbclusters = 2)
}

OF_cmp_parallel <- function(.n) {
    multiStartoptim(objectivefn = OFcmp, data = dataList,
                    lower = settings$min, upper = settings$max,
                    method = "nlminb", nbtrials = .n,
                    typerunif = "sobol", nbclusters = 2)
}

First of all, note that parallel processing is not necessarily faster than sequential processing when nbtrials is low: setting up the cluster and dispatching the trials carries a fixed overhead, which dominates when there is little work to distribute:

nbtrials <- 5
benchmark1 <- benchmark(OF_classic(nbtrials),
                        OF_classic_parallel(nbtrials),
                        columns = c("test", "replications",
                                    "elapsed", "relative"),
                        replications = 10,
                        relative = "elapsed")

benchmark1



##                            test replications elapsed relative
## 1          OF_classic(nbtrials)           10    0.38     1.00
## 2 OF_classic_parallel(nbtrials)           10    7.39    19.45

Moreover, a little attention should be paid to the following:
* the number of clusters should be chosen according to the cores actually available (see the snippet below);
* more clusters do not necessarily mean faster calibration.
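
For instance (an aside, not part of the original code), the parallel package shipped with base R reports how many cores the machine exposes:

# number of (logical) cores available on this machine
parallel::detectCores()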

Now, increasing the number of trials, we first verify that the four cases all return the same results:

nbtrials <- 500
t1 <- OF_classic(nbtrials)
t2 <- OF_cmp(nbtrials)
t3 <- OF_classic_parallel(nbtrials)
t4 <- OF_cmp_parallel(nbtrials)

We have:

t1$par

## [1]  5 -2  5 -5  1  3


t1$objective

## [1] 2.795e-25

and:

# all.equal() only compares two objects at a time (further
# arguments would be taken as tolerance, etc.), so the four
# results are checked pairwise
isTRUE(all.equal(t1$objective, t2$objective)) &&
    isTRUE(all.equal(t1$objective, t3$objective)) &&
    isTRUE(all.equal(t1$objective, t4$objective))

## [1] TRUE

And now, the final comparison of the four cases:

benchmark2 <- microbenchmark(OF_classic(nbtrials),
                             OF_cmp(nbtrials),
                             OF_classic_parallel(nbtrials),
                             OF_cmp_parallel(nbtrials), times = 100)

print(benchmark2, unit = "ms", order = "median")

## Unit: milliseconds
##                 expr  min   lq median   uq  max
##      OF_cmp_parallel 2697 2753   2787 2883 3194
##  OF_classic_parallel 2849 2925   2971 3061 4121
##               OF_cmp 3712 3753   3809 3860 4430
##           OF_classic 4060 4121   4153 4228 4801

Byte-code compilation gives a speed-up in both the sequential and the parallel settings, and the compiled parallel version has the lowest median time. Here is how ggplot2 displays the benchmark:

require(ggplot2)
# microbenchmark records times in nanoseconds;
# divide by 1e6 to plot milliseconds
plt <- ggplot2::qplot(y = time/1e6, data = benchmark2,
                      colour = expr, xlab = "execution date",
                      ylab = "execution time in milliseconds")
plt <- plt + ggplot2::scale_y_log10()
print(plt)

[Figure: execution times (log scale) of the four variants, coloured by expression]

That being said, a sequential implementation in C might well be much faster than any of these pure R examples.
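
As a rough sketch of that idea (an illustration added here, not benchmarked in this post, and assuming the Rcpp package is available), the Svensson formula could be moved to compiled code:

# illustrative sketch only: NSS2 rewritten in C++ via Rcpp
# (note the 0-based indexing: betaV[4] and betaV[5] below are R's
# betaV[5] and betaV[6], i.e. the two tau parameters)
Rcpp::cppFunction('
NumericVector NSS2_cpp(NumericVector betaV, NumericVector mats) {
    int n = mats.size();
    NumericVector y(n);
    for (int i = 0; i < n; i++) {
        double gam1 = mats[i] / betaV[4];
        double gam2 = mats[i] / betaV[5];
        double aux1 = 1.0 - exp(-gam1);
        double aux2 = 1.0 - exp(-gam2);
        y[i] = betaV[0] + betaV[1] * (aux1 / gam1) +
               betaV[2] * (aux1 / gam1 + aux1 - 1.0) +
               betaV[3] * (aux2 / gam2 + aux2 - 1.0);
    }
    return y;
}')

# the compiled version should match the pure R one
all.equal(NSS2_cpp(betaTRUE, mats), NSS2(betaTRUE, mats))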

