Automatic differentiation in pqR

I’ve released a version of my pqR implementation of R that has extensions for automatic differentiation. This is not a stable release, but it can be downloaded from pqR-project.org — look for the test version at the bottom — and installed the same as other pqR versions (from source, so you’ll need C and Fortran compilers).

Note that this version very likely has various bugs — mostly showing up only if you use automatic differentiation, I hope.

You can read about the automatic differentiation facilities here, or with help(Gradient) after installing the test version. Below are a few examples to show a bit of what you can do.

You can get derivatives (gradients) using the new with gradient construct. Here’s a simple example:

> with gradient (abc=7) abc^2+5
[1] 54
attr(,"gradient")
[1] 14

The with gradient construct creates a new local environment with a variable abc, which is initialized to 7. The expression abc^2+5 is evaluated in this environment, and the result is returned along with its derivative with respect to abc, attached as a gradient attribute.

The expression whose gradient is found needn’t be so simple. Here’s another example:

> f <- function (x) { 
+   s <- 0
+   for (i in 1:length(x)) s <- s + i*x[i]
+   s
+ }
> vec <- c(3,1,9)
> with gradient (vec) f(vec)
[1] 32
attr(,"gradient")
     [,1] [,2] [,3]
[1,]    1    2    3
> with gradient (vec) f(vec^2)
[1] 254
attr(,"gradient")
     [,1] [,2] [,3]
[1,]    6    4   54

Here, with gradient (vec) initializes the vec variable to the value of the existing variable of that name (same as if (vec=vec) had been written).
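
So the same result could also be obtained with an explicit initialization, either from the existing variable or from a literal value (a quick sketch; these should give the same value and gradient as the second call above):

with gradient (vec = vec) f(vec^2)
with gradient (vec = c(3,1,9)) f(vec^2)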

The output of with gradient matches what the existing numericDeriv function gives:

> numericDeriv (quote (f(vec^2)), "vec")
[1] 254
attr(,"gradient")
     [,1] [,2] [,3]
[1,]    6    4   54

However, with gradient computes the derivatives exactly (apart from roundoff error), whereas numericDeriv computes an approximation using finite differences.
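
For instance, a simple finite-difference approximation to the derivative of abc^2+5 at abc=7 comes out only close to the exact value of 14 (a small illustration in plain R, not specific to pqR's facilities):

# Forward-difference approximation of d/dx (x^2 + 5) at x = 7.  The exact
# derivative is 2*7 = 14; the approximation is off by about h, plus
# floating-point roundoff.
h <- 1e-6
(((7 + h)^2 + 5) - (7^2 + 5)) / h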

The output of both with gradient and numericDeriv in this example is a “Jacobian” matrix, with one row, since the value of f(vec^2) is a scalar, and three columns, giving the derivatives with respect to the three elements of vec.
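
Since the Jacobian has one row per element of the value, an expression whose value is a vector should give a multi-row Jacobian. Here's a small sketch (I'd expect a 3-by-3 Jacobian, diagonal in this case, with 2*vec on the diagonal):

# Sketch: vec^2 has three elements, so the Jacobian should have three rows
# as well as three columns, with 2*vec[i] in position [i,i] and zeros
# elsewhere.
with gradient (vec) vec^2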

When finding the gradient with respect to many variables (or vector elements), numericDeriv may be very slow, since getting numerical derivatives with respect to N variables requires at least N+1 evaluations of the expression. In contrast, with gradient evaluates the expression once, either computing gradients as it goes (called “forward mode” differentiation) or recording how the value was computed and later finding the gradient by “backpropagation” (also called “reverse mode” differentiation). The pqR implementation tries to choose which of these modes is best, though at present forward mode is always used for some operations for which reverse mode hasn’t been implemented yet.
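
Here's a rough sketch of the cost difference, using an arbitrary scalar function of 1000 inputs (the timings will of course depend on your machine):

# A scalar function of 1000 inputs.  numericDeriv needs at least 1001
# evaluations of g to approximate the gradient, while with gradient
# evaluates g just once (using reverse mode when it can).
g <- function (v) sum (sin(v) * exp(-v))
v <- runif(1000)
system.time (numericDeriv (quote (g(v)), "v"))
system.time (with gradient (v) g(v))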

Neural network training with multiple hidden layers is one application where reverse mode differentiation is crucial. Here’s a function to compute the output of a neural network with two hidden layers, given an input vector x and network parameters in L:

nnet <- function (x, L)
{
    h1 <- tanh (L$b1 + x %*% L$W1)
    h2 <- tanh (L$b2 + h1 %*% L$W2)
    as.vector (L$b3 + h2 %*% L$W3)
}

If the two hidden layers both have 10 units, so that there are 100 weights in L$W2, naive forward mode differentiation would record a 10-by-100 Jacobian matrix associated with h2, holding the derivatives of its 10 elements with respect to those 100 weights. But reverse mode differentiation starting from the final scalar output value only needs to compute Jacobian matrices with a single row (and up to 100 columns), as it works backwards.

The current automatic differentiation implementation in pqR manages to achieve this automatically (though not yet when certain other operations are involved).
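
For example, here's a sketch of getting the gradient of a single network output with respect to all of the parameters, using random parameter values with the layer sizes discussed above:

# Sketch: with a scalar network output, each Jacobian in the gradient list
# has a single row; in particular, dim(g$W2) should be 1 by 100.
x <- runif(3)
L <- list (b1 = rnorm(10,sd=0.1), W1 = matrix (rnorm(3*10,sd=0.1), 3, 10),
           b2 = rnorm(10,sd=0.1), W2 = matrix (rnorm(10*10,sd=0.1), 10, 10),
           b3 = rnorm(1,sd=0.1),  W3 = rnorm(10,sd=0.1))
r <- with gradient (L) nnet (x, L)
g <- attr (r, "gradient")
dim (g$W2)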

Here’s how this network function could be used for gradient descent training to minimize squared error when predicting responses in a training set, using a network with n1 and n2 units in the hidden layers:

train <- function (X, y, n1, n2, iters, step, sd=0.1)
{
    # Initialize parameters randomly.

    L <- list()
    n0 <- ncol(X)
    L$b1 <- rnorm (n1,sd=sd)
    L$W1 <- matrix (rnorm(n0*n1,sd=sd), n0, n1)
    L$b2 <- rnorm (n2,sd=sd)
    L$W2 <- matrix (rnorm(n1*n2,sd=sd), n1, n2)
    L$b3 <- rnorm (1,sd=sd)
    L$W3 <- rnorm (n2,sd=sd)

    # Train for 'iters' iterations to minimize squared 
    # error predicting y.

    for (i in 1:iters) {

        # Find gradient of squared error (summed over all
        # training examples) with respect to the parameters.

        r <- with gradient (L) {
            e <- 0
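            # The loop variable i below lives in the new environment created
            # by with gradient, so it doesn't disturb the outer iteration
            # counter i.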
            for (i in 1:nrow(X)) {
                o <- nnet (X[i,], L)
                e <- e + (y[i]-o)^2
            }
            e
        }

        g <- attr(r,"gradient")

        if (i %% 100 == 0) 
            cat ("Iteration", i, ": Error", round(r,4), "\n")

        # Update parameters to reduce squared error.

        L$b1 <- L$b1 - step * as.vector (g$b1)
        L$W1 <- L$W1 - step * as.vector (g$W1)
        L$b2 <- L$b2 - step * as.vector (g$b2)
        L$W2 <- L$W2 - step * as.vector (g$W2)
        L$b3 <- L$b3 - step * as.vector (g$b3)
        L$W3 <- L$W3 - step * as.vector (g$W3)
    }

    L
}

Note that when the gradient is with respect to a list (L here), the gradient is a list of the corresponding Jacobian matrices (which here have just one row).
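
As a small illustration of this structure, here's a sketch with a two-element list; I'd expect the gradient to come back as a list with a 1-by-1 Jacobian for p$a and a 1-by-2 Jacobian for p$b:

# Sketch: a gradient with respect to a list is a list of Jacobians.
# Here d/da of a*(b[1]+b[2]) is b[1]+b[2] = 7, and d/db[i] is a = 2.
r <- with gradient (p = list (a=2, b=c(3,4))) p$a * (p$b[1] + p$b[2])
attr (r, "gradient")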

Here’s code to use this train function to learn an example function of two inputs, based on 100 slightly noisy examples:

set.seed(1)
truef <- function (X) cos (2*sqrt(X[,1]*X[,2])) - 2*(0.4-X[,1])^2
N <- 100
X <- cbind (runif(N), runif(N))
y <- truef (X) + rnorm(N,sd=0.01)
print (system.time (L <- train (X, y, 10, 10, 30000, 0.001)))

The 30000 training iterations (each looking at all 100 training cases) take 30 seconds on my computer.

The result is pretty good, as can be seen from contour plots comparing the true function with the network's predictions.
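
Here's a sketch of how such plots might be produced with base R graphics (hypothetical plotting code, not taken from the repository mentioned below), evaluating the true function and the trained network on a grid over the unit square:

# Sketch: contour plots of the true function and of the network's
# predictions on a 50-by-50 grid over [0,1] x [0,1].
grid <- seq (0, 1, length=50)
G <- as.matrix (expand.grid (grid, grid))
z_true <- matrix (truef(G), 50, 50)
z_net  <- matrix (apply (G, 1, function (g) nnet (g, L)), 50, 50)
par (mfrow=c(1,2))
contour (grid, grid, z_true, main="true function")
contour (grid, grid, z_net,  main="network predictions")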

You can get the source for this and other examples from a repository of pqR automatic differentiation examples.

I’ll be talking about automatic differentiation for pqR at the RIOT workshop held in Toulouse on July 11, in conjunction with UseR! 2019.
