
R 4.4.0 (“Puppy Cup”) was released on the 24th April 2024 and it is a beauty. In time-honoured tradition, here we summarise some of the changes that caught our eyes. R 4.4.0 introduces some cool features (one of which is experimental) and makes one of our favourite {rlang} operators available in base R. There are a few things you might need to be aware of regarding handling NULL and complex values.

The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.


### A tail-recursive tale

Years ago, before I’d caused my first stack overflow, my Grandad used to tell me a daft tale:

It was on a dark and stormy night,
And the skipper of the yacht said to Antonio,
"Antonio, tell us a tale",
So Antonio started as follows...
It was on a dark and stormy night,
And the skipper of the yacht .... [ad infinitum]

The tale carried on in this way forever. Or at least it would until you were finally asleep.

At around the same age, I was toying with BASIC programming and could knock out classics such as

>10 PRINT "Ali stinks!"
>20 GOTO 10

Burn! Infinite burn!

Those were two examples of processes that demonstrate recursion. Antonio’s tale quotes itself recursively, and my older brother will be repeatedly mocked unless someone intervenes.

Recursion is an elegant approach to many programming problems – this usually takes the form of a function that can call itself. You would use it when you know how to get closer to a solution, but not necessarily how to get directly to that solution. And unlike the un-ending examples above, when we write recursive solutions to computational problems, we include a rule for stopping.

An example from mathematics would be finding zeros for a continuous function. The sine function provides a typical example:

We can see that when x = π, there is a zero for sin(x), but the computer doesn’t know that.

One recursive solution to finding the zeros of a function, f(), is the bisection method, which iteratively narrows a range until it finds a point where f(x) is close enough to zero. Here’s a quick implementation of that algorithm. If you need to perform root-finding in R, please don’t use the following function. stats::uniroot() is much more robust…

bisect = function(f, interval, tolerance, iteration = 1, verbose = FALSE) {
  if (verbose) {
    msg = glue::glue(
      "Iteration {iteration}: Interval [{interval[1]}, {interval[2]}]"
    )
    message(msg)
  }
  # Evaluate 'f' at either end of the interval and return
  # any endpoint where f() is close enough to zero
  lhs = interval[1]; rhs = interval[2]
  f_left = f(lhs); f_right = f(rhs)

  if (abs(f_left) <= tolerance) {
    return(lhs)
  }
  if (abs(f_right) <= tolerance) {
    return(rhs)
  }
  stopifnot(sign(f_left) != sign(f_right))

  # Bisect the interval and rerun the algorithm
  # on the half-interval where y=0 is crossed
  midpoint = (lhs + rhs) / 2
  f_mid = f(midpoint)
  new_interval = if (sign(f_mid) == sign(f_left)) {
    c(midpoint, rhs)
  } else {
    c(lhs, midpoint)
  }
  bisect(f, new_interval, tolerance, iteration + 1, verbose)
}

We know that π is somewhere between 3 and 4, so we can find the zero of sin(x) as follows:

bisect(sin, interval = c(3, 4), tolerance = 1e-4, verbose = TRUE)
#> Iteration 1: Interval [3, 4]
#> Iteration 2: Interval [3, 3.5]
#> Iteration 3: Interval [3, 3.25]
#> Iteration 4: Interval [3.125, 3.25]
#> Iteration 5: Interval [3.125, 3.1875]
#> Iteration 6: Interval [3.125, 3.15625]
#> Iteration 7: Interval [3.140625, 3.15625]
#> Iteration 8: Interval [3.140625, 3.1484375]
#> Iteration 9: Interval [3.140625, 3.14453125]
#> Iteration 10: Interval [3.140625, 3.142578125]
#> Iteration 11: Interval [3.140625, 3.1416015625]
#> [1] 3.141602

It takes 11 iterations to get to a point where sin(x) is within 10^-4 of zero. If we tightened the tolerance, had a more complicated function, or had a less precise starting range, it might take many more iterations to approximate a zero.
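For comparison, here is a minimal sketch of the same search using stats::uniroot(), the function recommended above. One assumption to flag: the tol argument of uniroot() is a tolerance on x (the location of the root), whereas the tolerance in our bisect() function is applied to f(x), so the two numbers are not directly comparable.

```r
# Find the zero of sin(x) between 3 and 4 with stats::uniroot().
# `tol` controls the accuracy of the root's location, not of f(x).
result = uniroot(sin, interval = c(3, 4), tol = 1e-8)

result$root    # estimate of the zero (close to pi)
result$f.root  # value of sin() at that estimate
result$iter    # number of iterations that were needed
```

The returned list also includes an estimate of the precision of the root (result$estim.prec), which our toy implementation does not provide.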

Importantly, this is a recursive algorithm - in the last statement of the bisect() function body, we call bisect() again. The initial call to bisect() (with interval = c(3, 4)) has to wait until the second call to bisect() (interval = c(3, 3.5)) completes before it can return (which in turn has to wait for the third call to return). So we have to wait for 11 calls to bisect() to complete before we get our result.

Those function calls get placed on a computational object named the call stack. For each function call, this stores details about how the function was called and where from. While waiting for the first call to bisect() to complete, the call stack grows to include the details about 11 calls to bisect().
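To see the call stack growing, here is a small sketch that uses sys.nframe(), which reports how many function frames are currently on the stack. The function name depth_reached() is made up for this illustration. We compare two runs rather than looking at the raw numbers, because the raw depth depends on where the code is run from.

```r
# Recurse n times, then report how deep the call stack is
# at the innermost call
depth_reached = function(n) {
  if (n == 0) {
    depth = sys.nframe()
    return(depth)
  }
  depth_reached(n - 1)
}

# Ten extra recursive calls put ten extra frames on the stack
extra_frames = depth_reached(10) - depth_reached(0)
extra_frames
#> [1] 10
```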

Imagine our algorithm didn’t just take 11 function calls to complete, but thousands, or millions. The call stack would get really full and this would lead to a “stack overflow” error.

We can demonstrate a stack-overflow in R quite easily:

blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  blow_up(n + 1, max_iter)
}

The recursive function behaves nicely when we only use a small number of iterations:

blow_up(1, max_iter = 100)
#> [1] "Finished!"

But the call stack gets too large and the function fails when we attempt to use too many iterations. Note that R raises an error about the size of the call stack before we actually reach its limit, so the R process can continue after exploding the call stack.

blow_up(1, max_iter = 1000000)
# Error: C stack usage 7969652 is too close to the limit

In R 4.4, we are getting (experimental) support for tail-call recursion. This allows us (in many situations) to write recursive functions that won’t explode the size of the call stack.

How can that work? In our bisect() example, we still need to make 11 calls to bisect() to get a result that is close enough to zero, and those 11 calls will still need to be put on the call-stack.

Remember the first call to bisect()? It called bisect() as the very last statement in its function body. So the value returned by the second call to bisect() was returned to the user without modification by the first call. So we could return the second call’s value directly to the user, instead of returning it via the first bisect() call; indeed, we could remove the first call to bisect() from the call stack and put the second call in its place. This would prevent the call stack from expanding with recursive calls.

The key to this (in R) is to use the new Tailcall() function. That tells R “you can remove me from the call stack, and put this call on instead”. Our final line in bisect() should look like this:

bisect = function(...) {
  # ... snip ...
  Tailcall(bisect, f, new_interval, tolerance, iteration + 1, verbose)
}

Note that you are passing the name of the recursively-called function into Tailcall(), rather than a call to that function (bisect rather than bisect(...)).

To illustrate that the stack no longer blows up when tail-call recursion is used, let’s rewrite our blow_up() function:

# R 4.4.0
blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  Tailcall(blow_up, n + 1, max_iter)
}

We can still successfully use a small number of iterations:

blow_up(1, 100)
#> [1] "Finished!"

But now, even a million iterations of the recursive function can be performed:

blow_up(1, 1000000)
#> [1] "Finished!"

Note that the tail-call optimisation only works here because the recursive call was made as the very last step in the function body. If your function needs to modify the value after the recursive call, you may not be able to use Tailcall().
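As a hedged illustration of that last point, here is a sketch with two made-up functions. sum_to() adds to the result after the recursive call returns, so the recursive call is not in tail position. sum_to_tail() carries the running total along in an extra argument (an "accumulator"), which moves the recursive call to the final step so that Tailcall() can be used. The force(acc) line is a precaution on our part: R evaluates arguments lazily, and forcing the accumulator at each step avoids building up a long chain of unevaluated expressions.

```r
# Requires R >= 4.4.0 for Tailcall()

# Not tail-recursive: the addition happens *after* the recursive call
sum_to = function(n) {
  if (n == 0) {
    return(0)
  }
  n + sum_to(n - 1)
}

# Tail-recursive: the running total is passed along in `acc`
sum_to_tail = function(n, acc = 0) {
  force(acc)  # evaluate the running total now, rather than lazily
  if (n == 0) {
    return(acc)
  }
  Tailcall(sum_to_tail, n - 1, acc + n)
}

sum_to(100)
#> [1] 5050
sum_to_tail(100)
#> [1] 5050
```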

### Rejecting the NULL

Missing values are everywhere.

In a typical dataset you might have missing values encoded as NA (if you’re lucky) and invalid numbers encoded as NaN. You might have implicitly missing rows (for example, a specific date missing from a time series) or factor levels that aren’t present in your table. You might even have empty vectors, or data-frames with no rows, to contend with. When writing functions and data-science workflows where the input data may change over time, programming defensively and handling these kinds of edge-cases means your code will throw up fewer surprises in the long run. You don’t want a critical report to fail because a mathematical function you wrote couldn’t handle a missing value.

When programming defensively with R, there is another important form of missingness to be cautious of …

The NULL object.

NULL is an actual object. You can assign it to a variable, combine it with other values, index into it, pass it into (and return it from) a function. You can also test whether a value is NULL.

# Assignment
my_null = NULL
my_null
#> NULL

# Use in functions
my_null[1]
#> NULL
c(NULL, 123)
#> [1] 123
c(NULL, NULL)
#> NULL
toupper(NULL)
#> character(0)

# Testing NULL-ness
is.null(my_null)
#> [1] TRUE
is.null(1)
#> [1] FALSE
identical(my_null, NULL)
#> [1] TRUE

# Note that the equality operator shouldn't be used to
# test NULL-ness:
NULL == NULL
#> logical(0)
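To show why that matters in practice, here is a small sketch of what happens when the equality operator is used inside an if statement. Because x == NULL gives a zero-length logical vector, if has nothing to branch on and throws an error; we capture the error message with tryCatch() so that the script keeps running. The variable names are just for illustration.

```r
x = NULL

# `x == NULL` is logical(0), so `if` throws an error;
# tryCatch() stores the error message instead of stopping
comparison = tryCatch(
  if (x == NULL) "x is NULL" else "x is not NULL",
  error = function(e) conditionMessage(e)
)
comparison  # an error message, not either of the two strings

# is.null() always returns a single TRUE or FALSE
checked = if (is.null(x)) "x is NULL" else "x is not NULL"
checked
#> [1] "x is NULL"
```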

R functions that are solely called for their side-effects (write.csv() or message(), for example) often return a NULL value. Other functions may return NULL as a valid value - one intended for subsequent use. For example, list-indexing (which is a function call, under the surface) will return NULL if you attempt to access an undefined value:

config = list(user = "Russ")

# When the index is present, the associated value is returned
config$user
#> [1] "Russ"

# But when the index is absent, a `NULL` is returned
config$url
#> NULL

Similarly, you can end up with a NULL output from an incomplete stack of if / else clauses:

language = "Polish"

greeting = if (language == "English") {
  "Hello"
} else if (language == "Hawaiian") {
  "Aloha"
}

greeting
#> NULL
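One way to avoid an accidental NULL here is to finish the stack of clauses with a final else. The sketch below wraps the logic in a hypothetical greet() function; falling back to the English greeting is an arbitrary choice for the example, and in real code you might prefer to throw an error instead.

```r
greet = function(language) {
  if (language == "English") {
    "Hello"
  } else if (language == "Hawaiian") {
    "Aloha"
  } else {
    # Fallback, so that the function never returns NULL
    "Hello"
  }
}

greet("Hawaiian")
#> [1] "Aloha"
greet("Polish")
#> [1] "Hello"
```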

A common use for NULL is as a default argument in a function signature. A NULL default is often used for parameters that aren’t critical to function evaluation. For example, the function signature for matrix() is as follows:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

The dimnames parameter isn’t really needed to create a matrix, but when a non-NULL value for dimnames is provided, the values are used to label the row and column names of the created matrix.

matrix(1:4, nrow = 2)
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
matrix(1:4, nrow = 2, dimnames = list(c("2023", "2024"), c("Jan", "Feb")))
#>      Jan Feb
#> 2023   1   3
#> 2024   2   4
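The same pattern is easy to use in your own functions. Here is a minimal sketch; label_values() is a hypothetical helper written for this post, not a base-R function. The optional argument defaults to NULL, and the function only does the extra labelling work when a non-NULL value is supplied.

```r
# `labels` is optional: NULL means "leave the values unlabelled"
label_values = function(x, labels = NULL) {
  if (is.null(labels)) {
    return(x)
  }
  stopifnot(length(labels) == length(x))
  names(x) = labels
  x
}

names(label_values(1:2))
#> NULL
names(label_values(1:2, labels = c("Jan", "Feb")))
#> [1] "Jan" "Feb"
```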

R 4.4 introduces the %||% operator to help when handling variables that are potentially NULL. When working with variables that could be NULL, you might have written code like this:

# Remember there is no 'url' field in our `config` list

# Set a default value for the 'url' if one isn't defined in
# the config
my_url = if (is.null(config$url)) {
  "https://www.jumpingrivers.com/blog/"
} else {
  config$url
}
my_url
#> [1] "https://www.jumpingrivers.com/blog/"

Assuming config is a list:

• when the url entry is absent from config (or is itself NULL), then config$url will be NULL and the variable my_url will be set to the default value;
• but when the url entry is found within config (and isn’t NULL) then that value will be stored in my_url.

That code can now be rewritten as follows:

# R 4.4.0
my_url = config$url %||% "https://www.jumpingrivers.com/blog"
my_url
#> [1] "https://www.jumpingrivers.com/blog"

Note that the left-hand value must evaluate to NULL for the right-hand side to be evaluated, and that empty vectors aren’t NULL:

# R 4.4.0
NULL %||% 1
#> [1] 1

c() %||% 1
#> [1] 1

numeric(0) %||% 1
#> numeric(0)

This operator has been available in the {rlang} package for eight years and is implemented in exactly the same way. So if you have been using %||% in your code already, the base-R version of this operator should work without any problems, though you may want to wait until you are certain all your users are using R >= 4.4 before switching from {rlang} to the base-R version of %||%.
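If your script needs to run on older versions of R without depending on {rlang}, one possible approach is to define the operator yourself only when it is missing. This is a sketch under the assumption that the one-line definition below matches the behaviour described above (return the right-hand side only when the left-hand side is NULL); it is intended for scripts rather than packages.

```r
# Define %||% only when running on a version of R that lacks it
if (getRversion() < "4.4.0") {
  `%||%` = function(x, y) if (is.null(x)) y else x
}

config = list(user = "Russ")
config$url %||% "no url configured"
#> [1] "no url configured"
config$user %||% "anonymous"
#> [1] "Russ"
```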

A shorthand hexadecimal format (common in web-programming) for specifying RGB colours has been introduced. So, rather than writing the 6-digit hexcode for a colour “#112233”, you can use “#123”. This only works for those 6-digit hexcodes where the digits are repeated in pairs.
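A quick way to check the equivalence is to convert both forms to RGB values with col2rgb() from {grDevices}. This sketch assumes you are running R 4.4.0 or later; on older versions the 3-digit form is expected to be rejected.

```r
# R 4.4.0
short_form = col2rgb("#123")
long_form = col2rgb("#112233")

# Both forms describe the same red, green and blue values
all(short_form == long_form)
#> [1] TRUE
```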

Parsing and formatting of complex numbers has been improved. For example, as.complex("1i") now returns the complex number 0 + 1i; previously it returned NA.

There are a few other changes related to handling NULL that have been introduced in R 4.4. The changes highlight that NULL is quite different from an empty vector. Empty vectors contain nothing, whereas NULL represents nothing. For example, whereas an empty numeric vector is considered to be an atomic (unnestable) data structure, NULL is no longer atomic. Also, NCOL(NULL) (the number of columns in a matrix formed from NULL) is now 0, whereas it was formerly 1.
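The two changes mentioned above can be checked directly at the console. The outputs shown are the ones described for R 4.4.0; on earlier versions the NULL results would differ.

```r
# R 4.4.0
# An empty numeric vector is atomic, but NULL no longer is
is.atomic(numeric(0))
#> [1] TRUE
is.atomic(NULL)
#> [1] FALSE

# An empty vector is treated as a single column, NULL as none
NCOL(numeric(0))
#> [1] 1
NCOL(NULL)
#> [1] 0
```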

sort_by() is a new function for sorting objects based on values in a separate object. This can be used to sort a data.frame based on its columns (they should be specified as a formula):

mtcars |> sort_by(~ list(cyl, mpg)) |> head()
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
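sort_by() isn't limited to data frames. As a sketch of the "separate object" case, the first example below reorders one vector using the values in another. The second example assumes that extra arguments such as decreasing are passed through to order(), which is our reading of the documentation rather than something covered above.

```r
# R 4.4.0
# Sort a character vector by the values in a separate numeric vector
sort_by(c("b", "c", "a"), c(2, 3, 1))
#> [1] "a" "b" "c"

# Sort a data.frame from highest to lowest mpg
# (assumes `decreasing` is forwarded to order())
top_mpg = sort_by(mtcars, ~ mpg, decreasing = TRUE) |> head(3)
top_mpg$mpg
#> [1] 33.9 32.4 30.4
```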

To take away the pain of installing the latest development version of R, you can use Docker. To use the devel version of R, you can use the following commands:

docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy

Once R 4.4 is the released version of R and the r-docker repository has been updated, you should use the following command to test out R 4.4.

docker pull rstudio/r-base:4.4-jammy
docker run --rm -it rstudio/r-base:4.4-jammy