From datasets to algorithms in R

[This article was first published on Realizations in Biostatistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Many statistical algorithms are taught and implemented in terms of linear algebra. Statistical packages often borrow heavily from optimized linear algebra libraries such as LINPACK, LAPACK, or BLAS. When implementing these algorithms in systems such as Octave or MATLAB, it is up to you to translate the data from the use case terms (factors, categories, numerical variables) into matrices.

In R, much of the heavy lifting is done for you through the formula interface. Formulas resemble y ~ x1 + x2 + …, and are defined in relation to a data.frame. There are a few features that make this very powerful:

  • You can specify transformations automatically. For example, you can do y ~ log(x1) + x2 + … just as easily.
  • You can specify interactions and nesting.
  • You can automatically create a numerical matrix for a formula using model.matrix(formula).
  • Formulas can be updated through the update() function.

Recently, I wanted to create predictions via Bayesian model averaging method (bma library on CRAN), but did not see where the authors implemented it. However, it was very easy to create a function that does this:

predict.bic.glm <- function(,, {
    # predict.bic.glm
    #  Purpose: predict new values from a bma fit with values in a new dataframe
    # Arguments:
    # – an object fit by bic.glm using the formula method
    # – a data frame, which must have variables with the same names as the independent
    #             variables as was specified in the formula of
    #             (it does not need the dependent variable, and ignores it if present)
    # – a vectorized function representing the inverse of the link function
    # Returns:
    #  a vector of length nrow( with the conditional probabilities of the independent
    #  variable being 1 or TRUE
    # TODO: make not be specified, but rather extracted from of$call
    form <- formula($call$f)[-2] # extract formula from the call of the fit.bma, drop dep var
    des.matrix <- model.matrix(form,
    pred <- des.matrix %*% matrix($postmean,nc=1)
    pred <-

The first task of the function finds the formula that was used in the call of the bic.glm() call, and the [-2] subscripting removes the dependent variable. Then the model.matrix() function creates a matrix of predictors with the original function (minus dependent variable) and new data. The power here is that if I had interactions or transformations in the original call to bic.glm(), they are replicated automatically on the new data, without my having to parse it by hand. With a new design matrix and a vector of coefficients (in this case, the expectation of the coefficients over the posterior distributions of the models), it is easy to calculate the conditional probabilities.

In short, the formula interface makes it easy to translate from the use case terms (factors, variables, dependent variables, etc.) into linear algebra terms where algorithms are implemented. I’ve only scratched the surface here, but it is worth investing some time into formulas and their manipulation if you intend to implement any algorithms in R.

To leave a comment for the author, please follow the link and comment on their blog: Realizations in Biostatistics. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)