The Need for Speed Part 1: Building an R Package with Fortran (or C)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Searching for Speed
Everyone who has ever used R has, at one time or another, wished for an increase in R’s speed. If you haven’t, you’re not using R hard enough! Recently, as part of some research on credibility, I was calculating layer loss costs for millions of simulated loss observations. As I progressed, the R markdown document I was was creating took longer and longer to render. Profiling the code showed one of the bottlenecks was the layer loss cost calculation. Being actively interested in compiled code for R for the better part of a decade, the simplicity of the function was a great opportunity for me to compare different implementations of compiled code.
I started writing one long post, but realized it would be better split into two parts. This post will describe the important steps in building an R package using Fortran. The next post will describe and compare the form and timings of the base R, C++, and Fortran implementations of the layer loss cost function. I hope that compiling (bad pun most definitely intended) this information and guidance in one place will make it easier for people to develop their own enhancements to R. Remember, a package doesn’t have to mean CRAN-ready code—a package is a collection of code that is easy to load and use in R. Don’t be afraid to experiment and push your own R boundaries!
Why Fortran?
In my opinion, Fortran gets a bad rap. It remains the oldest compiled language in active use. Yes, you wouldn’t want to write a web browser or a CRPG in it, but for mathematical calculation it remains difficult to beat Formula translator as a language. The majority of high-performance computing today remains in Fortran—it has been proven to work for the past 60+ years, it works well, and free-form Fortran is relatively easy to read. Searching for “why is Fortran still relevant” will return many answers, even in today’s world of Rust, Python, and Julia.
Fortran in R
R itself uses Fortran for much of its hard number crunching. There is a built-in base-R interface named .Fortran
. However, according to Drew Schmidt, an expert in high-performance R computation, this interface is outdated, comes with significant overhead, and should not be used if at all possible. His comments on his Romp package on github, and the associated files, are very valuable in learning how to navigate the C/Fortran API for R. My Delaporte package’s conversion from C++ to C/Fortran, and the resulting speedup, was heavily influenced by his writing.
There are other pages discussing the .Fortran
interface. Therefore, this post will focus on how to use use Fortran by building it into a package and using .Call
. Readers will benefit more if they have basic knowledge of R package generation (NAMESPACE, DESCRIPTION, /man, etc.). For those who do not, reading the links below should prove valuable for basic package generation as well.
Before writing any code
Necessary pre-reading
At first glance, writing compiled code in R packages using the .Call
interface is complicated. To say that the R API is obtuse and not-well-described would be charitable. It can be downright byzantine. This is why Dirk Eddelbuettel’s Rcpp package changed the world. It allowed authors to focus on the calculation code and leave the interfacing to Rcpp. As there is no Rcpp-like package for pure C or Fortran, some basic knowledge of the C API is necessary. The following links are not just highly recommended, they should be considered required reading:
- Writing R Extensions: Read this a minimum of six times from start to finish. Re-read it before every time you plan to ask a question of R-package-devel.
- Dr. Ripley’s slides on calling other languages from R: Over a decade old now, but the mechanics are the same.
- Hadley Wickham on the C API: Hadley has forgotten more about R than most of us will ever know.
- Simon Urbanek’s cheat sheet: Dense, but chock full of information. Once you have some idea, this is a good reminder/reference.
- Jonathan Callahan on
.C
vs..Call
: Applies to.Fortran
as well. See Simon Urbanek’s response to Hervé Pagès at the beginning. - Ben Bass on the C API: This wasn’t around when I started learning about the API; I wish it had been.
Critical points to remember
- Use the
.Call
interface. It’s modern, has lower overhead, and is safer than the.C
,.Fortran
, or.External
interfaces. - Everything that moves between R and C is a SEXP. Everything. If it stays in C, C++, or Fortran, or moves between those, that’s OK, but R only understands SEXPs. It’s fine to not understand the deep mechanics of an S EXPpression pointer. Just know that data moving between R and C via
.Call
must be defined as SEXPs in C. Inside the function is where the actual type will be specified. - After GCC 4.5, it’s easy to call Fortran from C but only as a Fortran subroutine and not a function. Inside the Fortran code functions are fine, but any Fortran code that is going to be called by C and then passed to R should be a subroutine.
- Read “Writing R Extensions” another five or ten times.
How to build a package using compiled code (Fortran)
Brief Description of the Function
The function precipitating these posts is the insured layer loss cost. This is a simple function whose implementation in base R is one line:
LLC_r <- function(x, l, a) sum(pmax(0, pmin(x - a, l)))
In C++ (Rcpp-style), the simplest implementation would be a loop:
// [[Rcpp::export]] double LLC_c(NumericVector X, double L, double A) { int n = X.size(); double LLC = 0.0; for (int i = 0; i < n; ++i) { LLC += fmax(0.0, fmin(X[i] - A, L)); } return(LLC); }
You will have to include cmath
to get fmax
and fmin
. We’ll see this again in the next post.
First step: Write the Fortran
I’ve found it easiest to create Fortran files as modules. This allows simpler calling of functions and subroutines from different files, although it does require some care in writing the Makevars if there will be dependencies between separate Fortran files. For the actual calculation, the direct analogue of the loop in C++ would be a loop in Fortran. For this example, the complete Fortran file with some added comments looks like this:
module fortloop use, intrinsic :: iso_c_binding implicit none private public :: llc_f contains subroutine llc_f(x, n, l, a, llc) bind(C, name = "llc_f_") real(kind = c_double), intent(in) :: l, a !limit & attach integer(kind = c_int), intent(in), value :: n !Length of x real(kind = c_double), intent(in), dimension(n) :: x !Vector of loss real(kind = c_double), intent(out) :: llc !Output variable integer :: i !Internal count llc = 0.0_c_double do i = 1, n llc = llc + max(0.0_c_double, min(x(i) - a, l)) end do end subroutine llc_f end module fortloop
The subroutine is passed a vector x
of length n
, the limit l
, the attachment a
, and the return variable llc
. Using iso_c_binding
allows the coder to explicitly state the name of the calling C program in the subroutine via bind
. The F77_
prefixes in C add a trailing underscore, so remember to add it to the bound name. iso_c_binding
also has built-in variable-type definitions; use them for easier interoperability. Using implicit none
—thus needing to define the variable types—is helpful in ensuring well-formed code. Explicitly stating which routines are public and private is a matter of preference.
Save this file into the package’s src
directory as ./src/fortloopf.f95
. Why .f95
? If you cannot answer that question, apparently you haven’t read Writing R Extensions enough! The next step is to write the C function which will call Fortran and the seperate C function which will communicate with R.
Second step: Write the C
C-to-Fortran
The first function is the C code which will call Fortran, and it’s rather simple:
void F77_NAME(llc_f)(double *x, int n, double *l, double *a, double *ret);
This function passes the variables to Fortran in the order it expects. This was the function “bound” in Fortran with a trailing underscore. Almost always, variables are passed to Fortran by reference, and thus need to be pointers in C. Fortran 2003 introduced the value
keyword, which allows passing by value. In the Fortran subroutine written above, n
was declared with the value
keyword, so in C it does not need to be a pointer. The F77_NAME
prefix is explained in Writing R extensions, which you should have read near a dozen times by now!
C-to-R
The second function is the one which will be called from R using the .Call
interface:
extern SEXP c_llc_f(SEXP x, SEXP l, SEXP a){ const int n = LENGTH(x); SEXP ret; PROTECT(ret = allocVector(REALSXP, 1)); F77_CALL(llc_f)(REAL(x), n, REAL(l), REAL(a), REAL(ret)); UNPROTECT(1); return(ret); }
This function communicates between R and C. Therefore, the vector of losses x
, the limit l
, and the attachment a
need to be SEXPs. The calculation of the length of x
is done in C via LENGTH
and assigned to n
. Since it isn’t coming from R, it doesn’t need to be a SEXP. The return value, ret
, is going back to R, so it needs to be defined as a SEXP. Since it is will be a double precision real variable, its flavor of SEXP is REALSXP. The PROTECT
and UNPROTECT
calls are needed to ensure that R’s garbage collection doesn’t free the memory until we are done with it. Since the Fortran loop returns a single value, ret
(urn) is declared as SEXP and given protected allocation of a length 1 real vector. Remember, there are no scalars in R.
In the actual call (F77_CALL
calling the F77_NAME
), the SEXPs have to be appropriately typed. Since they are all doubles, they get cast via the REAL()
call. The length variable, n
, is not being passed in from R. Since it was defined inside the function as a const int
, it doesn’t need casting in the call to the C-to-Fortran function. If it was an SEXP that came from R, it would need casting as INTEGER(n)
. In this case it would also need to be a pointer to that value as well. There are other types of SEXP, of which I’m sure you are now aware, having read all the links above.
Save these two functions into the package’s src
directory as ./src/fortloopc.c
. This file is incomplete, but the final C coding will be easier to understand once the R code is written.
Hard-learned lessons
Booleans
One valuable point, gleaned from much frustration, relates to passing logicals. iso_c_binding
includes boolean bindings. For almost all Fortran compilers, this would allow easy passing of a logical out of R into a boolean in C, followed by the passing of that boolean into a logical Fortran variable. Sadly, that does not work with some Fortran implementations—specifically Solaris SPARC which has opposite endianness from most other CPUs. The foolproof way to pass booleans from R to Fortran via C is to cast them as integers—0 for FALSE and 1 for TRUE is the usual convention—and then test for equality to 0 or 1 in Fortran. Maybe one day all major implementations will work seamlessly. Until then, while it is less elegant, passing integers can save a whole host of headaches.
Portability
If your intent is to eventually create a CRAN-worthy package, pay special attention to portability. Make sure you know which elements of Fortran 90, 95, 2003, or 2008 are implemented in the regular compilers. For example, as of December 2018, the oldest CRAN compiler is GCC 4.6.3 for r-oldrel-windows. So to be compatible with r-oldrel-windows-ix86+x86_64
Fortran cannot use the intrinsic functionlog_gamma
, which was only added in Fortran 2008. The user must write their own log gamma code or use an existing module.
Another example is that the current release of R on Windows can use Fortran with iso_c_binding
for variable types and subroutines. That was implemented in GCC 4.3 and is in Oracle Developer Studio 12.5 for Solaris. However, the current release of R on Windows cannot use ieee_arithmetic
, as that was only implemented in GFortran 5. The current Windows toolchain is based on version 4.9.3, although there is work on updating the Windows toolchain to GCC 8.
Third step: Write the R
In R, the function to call the C code is:
LLC_f2 <- function(x, l, a) { if (!is.double(x)) {storage.mode(x) <- 'double'} if (!is.double(l)) {storage.mode(l) <- 'double'} if (!is.double(a)) {storage.mode(a) <- 'double'} .Call(c_llc_f, x, l, a) }
The user calls this function the same way they would the base R version. The mechanics are to first ensure that the variables are of the right type (passing the wrong type from R to C has been known to release a horde of demons somewhere within a ten mile radius of the infraction) and then .Call
ing the C. Save this under the package’s R
directory as ./R/LLC.R
.
Fourth step: Add the needed package code to C
Recently, R-core has become stricter with registration of native routines for compiled code. But you knew that because you follow R-devel and have all-but memorized Writing R Extensions at this point, right? So you already know that there are two steps: the first is to declare the interface methods and the second is to register the routines. In this case, there is only use of the .Call
, so the registration will only require one type. Each function needs to be defined with the number of variables being passed between R and C.
static const R_CallMethodDef CallEntries[] = { {"c_llc_f", (DL_FUNC) &c_llc_f, 3}, {NULL, NULL, 0} }; void R_init_SimpFort (DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); }
The first function creates the definitions for all functions using the .Call
interface—the only interface you should ever use—and the second registers the routines and passes FALSE to Dynamic Symbols, which is highly preferred by R. The entire C code file, fortloop.c
now looks like this:
#include <R.h> #include <Rinternals.h> #include <stdlib.h> // for NULL #include <Rmath.h> #include <R_ext/Rdynload.h> void F77_NAME(llc_f)(double *x, int n, double *l, double *a, double *ret); extern SEXP c_llc_f(SEXP x, SEXP l, SEXP a){ const int n = LENGTH(x); SEXP ret; PROTECT(ret = allocVector(REALSXP, 1)); F77_CALL(llc_f)(REAL(x), n, REAL(l), REAL(a), REAL(ret)); UNPROTECT(1); return(ret); } static const R_CallMethodDef CallEntries[] = { {"c_llc_f", (DL_FUNC) &c_llc_f, 3}, {NULL, NULL, 0} }; void R_init_SimpFort(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); }
No matter what you write, the includes listed above are almost always needed. There may be others depending on what code you’re writing, of course. With the final function, R_init_XXXXX
, the XXXXX should be the name of your package. If it is not, it’s possible that the registration will not work. In this case, we’re naming this toy package “SimpFort”.
Fifth step: Update other package files
NAMESPACE
The namespace needs to know that the registration occurred in C, so the following should be added as the first line of NAMESPACE: useDynLib(SimpFort, .registration=TRUE)
.
DESCRIPTION
The line NeedsCompilation: yes
needs to be added, of course.
Makevars
For this simple package, a Makevars may not be necessary. When there are multiple Fortran modules, though, the compiler needs to know the dependency structure. Otherwise, especially on machines that use parallel make (like some CRAN binary builders), a later file may get compiled before an earlier one and myriads of demons of the abyss will be released to feast on the souls of R package developers. Even though it isn’t necessary for this package, the object file for the C code will be listed as depending on the module of the Fortran code.
Another issue with Fortran code is that the modules created by the Fortran compilation are platform dependent, so they cannot be part of the package itself and must be removed after package installation. This is most easily accomplished by creating a clean
target in the Makevars which is called from the all
target. For the SimpFort package, the Makevars is:
C_OBJS = fortloopc.o FT_OBJS = fortloopf.o all: $(SHLIB) clean $(SHLIB): $(FT_OBJS) $(C_OBJS) fortloop.mod: fortloopf.o fortloopc.o: fortloop.mod clean: @rm -rf *.mod *.o
While not always necessary, I usually put the Fortran and C objects into different variables and have Fortran listed before C in the shared libraries variable, $(SHLIB)
. The Makevars also creates a clean
target which deletes all compiled object and module files, which is itself called in all
AFTER the creation of the shared library. Lastly, although unnecessary here, the fortloop module is defined as depending on the fortloopf object file, and the fortloopc object file is defined as depending on the fortloop module. Note that object files have the same name as their corresponding source files (.c, .cpp, .cxx, .f95, etc.) whereas Fortran module files have the module name with the .mod ending. Looking above, the Fortran source file fortloopf.f95
has as its first line module fortloop
so the module will be named fortloop.mod
and not fortloopf.mod
.
Everything else
At this point, the rest of the package needs completion. Help files need to be created, probably a README and a NEWS too, the LICENSE has to be selected, unit tests should be created, and the rest of the DESCRIPTION and NAMESPACE needs to be corrected or completed. That is left as an exercise to the reader, who at this point, should have read “Writing R extensions” two dozen times. Has that been mentioned already?
For future reference, I’ve expanded the code brought above by adding very basic help documentation and unit tests resulting in a minimal, but complete package which passes R CMD check --as-cran
—at least on my computer. The SimpFort package can be found on github.
In closing
At first, building an R package which uses Fortran as its calculation engine seems complicated: we have an R function calling a C function calling another C function which calls Fortran. Conceptually, though, it makes sense. The compiled language R knows best is C, so we can create an R-to-C translator and a C-to-Fortran translator. Since the two translators both know C, they talk to each other in C. Each function in a foreign language needs its own pair of C-to-Fortran(C/C++) and C-to-R functions and entry in the methods definition. No matter how complicated the foreign coding gets, so long as the C-to-Fortran and C-to-R functions pass the proper variables back and forth, and the methods are defined properly in C, the power of C, C++, or Fortran can be leveraged by R.
The Rcpp package does a lot of this behind the scenes for C++ code. Every Rcpp package will have a RcppExports.cpp
in its src
directory and a RcppExports.R
file in its R
directory. More may be going on in each, but after reading this post, much of what is going on under the hood in those files should now be recognizable.
There is more to consider when writing compiled code, especially if dealing with random number generation, but that’s a topic for another time.
My next post will detail various flavors of the same function in R, C++, and Fortran, but I hope that this post proves valuable to anyone who wanted to use Fortran (or C) with R and was intimidated by the interface. Enjoy!
The post The Need for Speed Part 1: Building an R Package with Fortran (or C) appeared first on Strange Attractors.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.