The Need for Speed Part 1: Building an R Package with Fortran (or C)

[This article was first published on R – Strange Attractors, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Genealogy tree of programming languages - Algol & Fortran family

Searching for Speed

Everyone who has ever used R has, at one time or another, wished for an increase in R’s speed. If you haven’t, you’re not using R hard enough! Recently, as part of some research on credibility, I was calculating layer loss costs for millions of simulated loss observations. As I progressed, the R markdown document I was was creating took longer and longer to render. Profiling the code showed one of the bottlenecks was the layer loss cost calculation. Being actively interested in compiled code for R for the better part of a decade, the simplicity of the function was a great opportunity for me to compare different implementations of compiled code.

I started writing one long post, but realized it would be better split into two parts. This post will describe the important steps in building an R package using Fortran. The next post will describe and compare the form and timings of the base R, C++, and Fortran implementations of the layer loss cost function. I hope that compiling (bad pun most definitely intended) this information and guidance in one place will make it easier for people to develop their own enhancements to R. Remember, a package doesn’t have to mean CRAN-ready code—a package is a collection of code that is easy to load and use in R. Don’t be afraid to experiment and push your own R boundaries!

Why Fortran?

In my opinion, Fortran gets a bad rap. It remains the oldest compiled language in active use. Yes, you wouldn’t want to write a web browser or a CRPG in it, but for mathematical calculation it remains difficult to beat Formula translator as a language. The majority of high-performance computing today remains in Fortran—it has been proven to work for the past 60+ years, it works well, and free-form Fortran is relatively easy to read. Searching for “why is Fortran still relevant” will return many answers, even in today’s world of Rust, Python, and Julia.

Fortran in R

R itself uses Fortran for much of its hard number crunching. There is a built-in base-R interface named .Fortran. However, according to Drew Schmidt, an expert in high-performance R computation, this interface is outdated, comes with significant overhead, and should not be used if at all possible. His comments on his Romp package on github, and the associated files, are very valuable in learning how to navigate the C/Fortran API for R. My Delaporte package’s conversion from C++ to C/Fortran, and the resulting speedup, was heavily influenced by his writing.

There are other pages discussing the .Fortran interface. Therefore, this post will focus on how to use use Fortran by building it into a package and using .Call. Readers will benefit more if they have basic knowledge of R package generation (NAMESPACE, DESCRIPTION, /man, etc.). For those who do not, reading the links below should prove valuable for basic package generation as well.

Before writing any code

Necessary pre-reading

At first glance, writing compiled code in R packages using the .Call interface is complicated. To say that the R API is obtuse and not-well-described would be charitable. It can be downright byzantine. This is why Dirk Eddelbuettel’s Rcpp package changed the world. It allowed authors to focus on the calculation code and leave the interfacing to Rcpp. As there is no Rcpp-like package for pure C or Fortran, some basic knowledge of the C API is necessary. The following links are not just highly recommended, they should be considered required reading:

Critical points to remember

  • Use the .Call interface. It’s modern, has lower overhead, and is safer than the .C, .Fortran, or .External interfaces.
  • Everything that moves between R and C is a SEXP. Everything. If it stays in C, C++, or Fortran, or moves between those, that’s OK, but R only understands SEXPs. It’s fine to not understand the deep mechanics of an S EXPpression pointer. Just know that data moving between R and C via .Call must be defined as SEXPs in C. Inside the function is where the actual type will be specified.
  • After GCC 4.5, it’s easy to call Fortran from C but only as a Fortran subroutine and not a function. Inside the Fortran code functions are fine, but any Fortran code that is going to be called by C and then passed to R should be a subroutine.
  • Read “Writing R Extensions” another five or ten times.

How to build a package using compiled code (Fortran)

Brief Description of the Function

The function precipitating these posts is the insured layer loss cost. This is a simple function whose implementation in base R is one line:

LLC_r <- function(x, l, a) sum(pmax(0, pmin(x - a, l)))

In C++ (Rcpp-style), the simplest implementation would be a loop:

// [[Rcpp::export]]
double LLC_c(NumericVector X, double L, double A) {
  int n = X.size();
  double LLC = 0.0;
  for (int i = 0; i < n; ++i) {
    LLC += fmax(0.0, fmin(X[i] - A, L));
  }
  return(LLC);
}

You will have to include cmath to get fmax and fmin. We’ll see this again in the next post.

First step: Write the Fortran

I’ve found it easiest to create Fortran files as modules. This allows simpler calling of functions and subroutines from different files, although it does require some care in writing the Makevars if there will be dependencies between separate Fortran files. For the actual calculation, the direct analogue of the loop in C++ would be a loop in Fortran. For this example, the complete Fortran file with some added comments looks like this:

module fortloop
    use, intrinsic :: iso_c_binding

    implicit none
    private
    public :: llc_f

contains

    subroutine llc_f(x, n, l, a, llc) bind(C, name = "llc_f_")

        real(kind = c_double), intent(in)               :: l, a !limit & attach
        integer(kind = c_int), intent(in), value        :: n    !Length of x
        real(kind = c_double), intent(in), dimension(n) :: x    !Vector of loss
        real(kind = c_double), intent(out)              :: llc  !Output variable
        integer                                         :: i    !Internal count

        llc = 0.0_c_double
        do i = 1, n
            llc = llc + max(0.0_c_double, min(x(i) - a, l))
        end do

    end subroutine llc_f

end module fortloop

The subroutine is passed a vector x of length n, the limit l, the attachment a, and the return variable llc. Using iso_c_binding allows the coder to explicitly state the name of the calling C program in the subroutine via bind. The F77_ prefixes in C add a trailing underscore, so remember to add it to the bound name. iso_c_binding also has built-in variable-type definitions; use them for easier interoperability. Using implicit none—thus needing to define the variable types—is helpful in ensuring well-formed code. Explicitly stating which routines are public and private is a matter of preference.

Save this file into the package’s src directory as ./src/fortloopf.f95. Why .f95? If you cannot answer that question, apparently you haven’t read Writing R Extensions enough! The next step is to write the C function which will call Fortran and the seperate C function which will communicate with R.

Second step: Write the C

C-to-Fortran

The first function is the C code which will call Fortran, and it’s rather simple:

void F77_NAME(llc_f)(double *x, int n, double *l, double *a, double *ret);

This function passes the variables to Fortran in the order it expects. This was the function “bound” in Fortran with a trailing underscore. Almost always, variables are passed to Fortran by reference, and thus need to be pointers in C. Fortran 2003 introduced the value keyword, which allows passing by value. In the Fortran subroutine written above, n was declared with the value keyword, so in C it does not need to be a pointer. The F77_NAME prefix is explained in Writing R extensions, which you should have read near a dozen times by now!

C-to-R

The second function is the one which will be called from R using the .Call interface:

extern SEXP c_llc_f(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  SEXP ret;
  PROTECT(ret = allocVector(REALSXP, 1));
  F77_CALL(llc_f)(REAL(x), n, REAL(l), REAL(a), REAL(ret));
  UNPROTECT(1);
  return(ret);
}

This function communicates between R and C. Therefore, the vector of losses x, the limit l, and the attachment a need to be SEXPs. The calculation of the length of x is done in C via LENGTH and assigned to n. Since it isn’t coming from R, it doesn’t need to be a SEXP. The return value, ret, is going back to R, so it needs to be defined as a SEXP. Since it is will be a double precision real variable, its flavor of SEXP is REALSXP. The PROTECT and UNPROTECT calls are needed to ensure that R’s garbage collection doesn’t free the memory until we are done with it. Since the Fortran loop returns a single value, ret(urn) is declared as SEXP and given protected allocation of a length 1 real vector. Remember, there are no scalars in R.

In the actual call (F77_CALL calling the F77_NAME), the SEXPs have to be appropriately typed. Since they are all doubles, they get cast via the REAL() call. The length variable, n, is not being passed in from R. Since it was defined inside the function as a const int, it doesn’t need casting in the call to the C-to-Fortran function. If it was an SEXP that came from R, it would need casting as INTEGER(n). In this case it would also need to be a pointer to that value as well. There are other types of SEXP, of which I’m sure you are now aware, having read all the links above.

Save these two functions into the package’s src directory as ./src/fortloopc.c. This file is incomplete, but the final C coding will be easier to understand once the R code is written.

Hard-learned lessons

Booleans

One valuable point, gleaned from much frustration, relates to passing logicals. iso_c_binding includes boolean bindings. For almost all Fortran compilers, this would allow easy passing of a logical out of R into a boolean in C, followed by the passing of that boolean into a logical Fortran variable. Sadly, that does not work with some Fortran implementations—specifically Solaris SPARC which has opposite endianness from most other CPUs. The foolproof way to pass booleans from R to Fortran via C is to cast them as integers—0 for FALSE and 1 for TRUE is the usual convention—and then test for equality to 0 or 1 in Fortran. Maybe one day all major implementations will work seamlessly. Until then, while it is less elegant, passing integers can save a whole host of headaches.

Portability

If your intent is to eventually create a CRAN-worthy package, pay special attention to portability. Make sure you know which elements of Fortran 90, 95, 2003, or 2008 are implemented in the regular compilers. For example, as of December 2018, the oldest CRAN compiler is GCC 4.6.3 for r-oldrel-windows. So to be compatible with r-oldrel-windows-ix86+x86_64 Fortran cannot use the intrinsic functionlog_gamma, which was only added in Fortran 2008. The user must write their own log gamma code or use an existing module.

Another example is that the current release of R on Windows can use Fortran with iso_c_binding for variable types and subroutines. That was implemented in GCC 4.3 and is in Oracle Developer Studio 12.5 for Solaris. However, the current release of R on Windows cannot use ieee_arithmetic, as that was only implemented in GFortran 5. The current Windows toolchain is based on version 4.9.3, although there is work on updating the Windows toolchain to GCC 8.

Third step: Write the R

In R, the function to call the C code is:

LLC_f2 <- function(x, l, a) {
  if (!is.double(x)) {storage.mode(x) <- 'double'}
  if (!is.double(l)) {storage.mode(l) <- 'double'}
  if (!is.double(a)) {storage.mode(a) <- 'double'}
  .Call(c_llc_f, x, l, a)
}

The user calls this function the same way they would the base R version. The mechanics are to first ensure that the variables are of the right type (passing the wrong type from R to C has been known to release a horde of demons somewhere within a ten mile radius of the infraction) and then .Calling the C. Save this under the package’s R directory as ./R/LLC.R.

Fourth step: Add the needed package code to C

Recently, R-core has become stricter with registration of native routines for compiled code. But you knew that because you follow R-devel and have all-but memorized Writing R Extensions at this point, right? So you already know that there are two steps: the first is to declare the interface methods and the second is to register the routines. In this case, there is only use of the .Call, so the registration will only require one type. Each function needs to be defined with the number of variables being passed between R and C.

static const R_CallMethodDef CallEntries[] = {
  {"c_llc_f",   (DL_FUNC) &c_llc_f,   3},
  {NULL,         NULL,                0}
};

void R_init_SimpFort (DllInfo *dll) {
  R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
  R_useDynamicSymbols(dll, FALSE);
}

The first function creates the definitions for all functions using the .Call interface—the only interface you should ever use—and the second registers the routines and passes FALSE to Dynamic Symbols, which is highly preferred by R. The entire C code file, fortloop.c now looks like this:

#include <R.h>
#include <Rinternals.h>
#include <stdlib.h> // for NULL
#include <Rmath.h>
#include <R_ext/Rdynload.h>

void F77_NAME(llc_f)(double *x, int n, double *l, double *a, double *ret);

extern SEXP c_llc_f(SEXP x, SEXP l, SEXP a){
  const int n = LENGTH(x);
  SEXP ret;
  PROTECT(ret = allocVector(REALSXP, 1));
  F77_CALL(llc_f)(REAL(x), n, REAL(l), REAL(a), REAL(ret));
  UNPROTECT(1);
  return(ret);
}

static const R_CallMethodDef CallEntries[] = {
  {"c_llc_f",   (DL_FUNC) &c_llc_f,   3},
  {NULL,         NULL,                0}
};

void R_init_SimpFort(DllInfo *dll) {
  R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
  R_useDynamicSymbols(dll, FALSE);
}

No matter what you write, the includes listed above are almost always needed. There may be others depending on what code you’re writing, of course. With the final function, R_init_XXXXX, the XXXXX should be the name of your package. If it is not, it’s possible that the registration will not work. In this case, we’re naming this toy package “SimpFort”.

Fifth step: Update other package files

NAMESPACE

The namespace needs to know that the registration occurred in C, so the following should be added as the first line of NAMESPACE: useDynLib(SimpFort, .registration=TRUE).

DESCRIPTION

The line NeedsCompilation: yes needs to be added, of course.

Makevars

For this simple package, a Makevars may not be necessary. When there are multiple Fortran modules, though, the compiler needs to know the dependency structure. Otherwise, especially on machines that use parallel make (like some CRAN binary builders), a later file may get compiled before an earlier one and myriads of demons of the abyss will be released to feast on the souls of R package developers. Even though it isn’t necessary for this package, the object file for the C code will be listed as depending on the module of the Fortran code.

Another issue with Fortran code is that the modules created by the Fortran compilation are platform dependent, so they cannot be part of the package itself and must be removed after package installation. This is most easily accomplished by creating a clean target in the Makevars which is called from the all target. For the SimpFort package, the Makevars is:

C_OBJS = fortloopc.o
FT_OBJS = fortloopf.o

all: $(SHLIB) clean

$(SHLIB): $(FT_OBJS) $(C_OBJS)

fortloop.mod: fortloopf.o
fortloopc.o: fortloop.mod

clean:
	@rm -rf *.mod *.o

While not always necessary, I usually put the Fortran and C objects into different variables and have Fortran listed before C in the shared libraries variable, $(SHLIB). The Makevars also creates a clean target which deletes all compiled object and module files, which is itself called in all AFTER the creation of the shared library. Lastly, although unnecessary here, the fortloop module is defined as depending on the fortloopf object file, and the fortloopc object file is defined as depending on the fortloop module. Note that object files have the same name as their corresponding source files (.c, .cpp, .cxx, .f95, etc.) whereas Fortran module files have the module name with the .mod ending. Looking above, the Fortran source file fortloopf.f95 has as its first line module fortloop so the module will be named fortloop.mod and not fortloopf.mod.

Everything else

At this point, the rest of the package needs completion. Help files need to be created, probably a README and a NEWS too, the LICENSE has to be selected, unit tests should be created, and the rest of the DESCRIPTION and NAMESPACE needs to be corrected or completed. That is left as an exercise to the reader, who at this point, should have read “Writing R extensions” two dozen times. Has that been mentioned already?

For future reference, I’ve expanded the code brought above by adding very basic help documentation and unit tests resulting in a minimal, but complete package which passes R CMD check --as-cran—at least on my computer. The SimpFort package can be found on github.

In closing

At first, building an R package which uses Fortran as its calculation engine seems complicated: we have an R function calling a C function calling another C function which calls Fortran. Conceptually, though, it makes sense. The compiled language R knows best is C, so we can create an R-to-C translator and a C-to-Fortran translator. Since the two translators both know C, they talk to each other in C. Each function in a foreign language needs its own pair of C-to-Fortran(C/C++) and C-to-R functions and entry in the methods definition. No matter how complicated the foreign coding gets, so long as the C-to-Fortran and C-to-R functions pass the proper variables back and forth, and the methods are defined properly in C, the power of C, C++, or Fortran can be leveraged by R.

The Rcpp package does a lot of this behind the scenes for C++ code. Every Rcpp package will have a RcppExports.cpp in its src directory and a RcppExports.R file in its R directory. More may be going on in each, but after reading this post, much of what is going on under the hood in those files should now be recognizable.

There is more to consider when writing compiled code, especially if dealing with random number generation, but that’s a topic for another time.

My next post will detail various flavors of the same function in R, C++, and Fortran, but I hope that this post proves valuable to anyone who wanted to use Fortran (or C) with R and was intimidated by the interface. Enjoy!

The post The Need for Speed Part 1: Building an R Package with Fortran (or C) appeared first on Strange Attractors.

To leave a comment for the author, please follow the link and comment on their blog: R – Strange Attractors.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)