Let’s talk about NA-s!

Posted on July 22, 2025 by R on Biofunctor in R bloggers | 0 Comments

[This article was first published on R on Biofunctor, and kindly contributed to R-bloggers].

ARTHUR: Well, what is it you want?
HEAD KNIGHT: We want… a paste function that can deal with NA-s!

A language for statistical computing clearly needs to be able to deal with missing values, and R has various ways to do so. I will briefly go through some of them, then propose an interesting way to spice it up a bit.

Missingness is represented with NA values, or – rather – non-values. Every vector type can contain the intended values or NA-s. As mentioned in a previous post, this resembles an optional or maybe type in other languages. Most of the time, it is used to represent missing observations in a dataset, but you can equally return an NA from your function if it’s not able to calculate the expected result.

To illustrate this, let’s write a function that takes a lower case letter and returns the previous one in the alphabet.

previous_letter <- function(x) {
    # Only consider the first letter of the first string
    res <- letters[match(substr(x[1], 1, 1), letters) - 1]
    if (length(res) != 1L) NA_character_ else res
}

previous_letter("n")

## [1] "m"

previous_letter("a")

## [1] NA

previous_letter("#rstats")

## [1] NA

If the input starts with [b-z], it will return a lowercase letter, otherwise NA.
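The function above only looks at the first element of its input. As a minimal sketch of how the same idea extends to whole vectors, here is a variant under the hypothetical name previous_letter_vec (not part of the original post). Every element that has no previous letter becomes NA, including inputs that are already missing.

```r
# Hypothetical vectorised variant of previous_letter()
previous_letter_vec <- function(x) {
    # Position of the first character in the alphabet, shifted back by one
    idx <- match(substr(x, 1, 1), letters) - 1L
    # "a" has no predecessor: index 0 would silently drop the element
    idx[!is.na(idx) & idx == 0L] <- NA_integer_
    letters[idx]
}

prev <- previous_letter_vec(c("n", "a", "#rstats", NA))
prev
## [1] "m" NA  NA  NA
```

Note the explicit guard for index 0: subsetting with 0 returns nothing rather than NA, which would shorten the result.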

Producing missing values is no big deal. Dealing with them is a bit more complicated, and R has various ways to do so.

Worst first! You can set the na.action option globally, which may or may not be honoured by a given function or object. This is not recommended: it relies on something specific to your setup, so the same code can behave differently in another environment. Explicit is better than implicit, so spell out what you want to do with the missing values.

Luckily, the {stats} package comes with a few built-in utility functions:

na.pass: leave missing values alone
na.omit: omit missing values and record where they were
na.exclude: same as above, but different class (see help)
na.contiguous: find the longest consecutive stretch of non-missing values in an object
na.fail: throw error if there are missing values

Some examples:

vec_na <- c(1, 2, NA, 4)
na.pass(vec_na)

## [1]  1  2 NA  4

na.omit(vec_na)

## [1] 1 2 4
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "omit"

na.exclude(vec_na)

## [1] 1 2 4
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "exclude"

na.contiguous(vec_na)

## [1] 1 2
## attr(,"na.action")
## [1] 3 4
## attr(,"class")
## [1] "omit"

na.fail(vec_na) |> try()

## Error in na.fail.default(vec_na) : missing values in object
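On a plain vector, na.omit and na.exclude look almost identical. The difference tends to show up in model fits, where functions such as residuals() pad the result back to the original length for na.exclude. A minimal sketch, assuming a small made-up data frame (toy_df and its values are illustrative only):

```r
# Toy data with one missing response (illustrative values)
toy_df <- data.frame(
    x = c(1, 2, 3, 4, 5),
    y = c(2.1, NA, 6.2, 7.9, 10.1)
)

# Pass na.action explicitly instead of relying on the global option
fit_omit <- lm(y ~ x, data = toy_df, na.action = na.omit)
fit_excl <- lm(y ~ x, data = toy_df, na.action = na.exclude)

n_omit <- length(residuals(fit_omit))  # rows with NA are dropped
n_excl <- length(residuals(fit_excl))  # padded back to the original length
c(n_omit, n_excl)
## [1] 4 5
```

With na.exclude, the residual for the second row is NA, so the result still lines up with the rows of the input data.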

You can also write functions that use these internally, or that handle missingness in other ways. Have you ever used median(na.rm = TRUE) in your scripts? No? How about mean(na.rm = TRUE)? No? sd() then? You see where I’m going? All these (and other) functions implement NA handling separately. Could there be a way for developers to focus on the task at hand (e.g. calculating the median of a vector of values) without having to care about what to do with missing values?

From this perspective, there are two common function types:

Summarising functions: Reduce many values into a single scalar (e.g. mean() or paste(collapse = " "))
Plain functions: Keep the values in the vector realm (e.g. cumsum() or paste(collapse = NULL))
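To make the distinction concrete, here is a short sketch comparing the output lengths of the two kinds of functions on a complete vector (the variable names are made up for this illustration):

```r
x <- c(3, 1, 2)

# Summarising: many values in, one value out
s_sum   <- sum(x)
s_paste <- paste(x, collapse = "-")

# Plain: the result stays as long as the input
p_cumsum <- cumsum(x)
p_paste  <- paste(x, "a")

c(length(s_sum), length(s_paste), length(p_cumsum), length(p_paste))
## [1] 1 1 3 3
```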

Plain functions

The cumulative sum function mentioned above has a problem with missing values. Once it hits an NA, it produces NA-s for the rest of the result.

cumsum(vec_na)

## [1]  1  3 NA NA

For the most part, this is the required behaviour, but sometimes one would prefer to skip the missing values and keep the sum going. One could think of a function wrapper that sidesteps the NA-s in one or more function arguments, calculates the result on the remaining values, and puts the NA-s back in their original positions. Let’s call it dodge_NA. We can restrict which function arguments are searched for missing values, and only the complete cases are used in the calculation (vector recycling rules apply if the argument lengths differ).
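The core idea can be sketched for the simplest case, a function with a single vector argument. This is a stripped-down illustration under the hypothetical name dodge_na_simple, not the full wrapper; it ignores multiple arguments and recycling.

```r
# Hypothetical single-argument sketch of the "dodge" idea
dodge_na_simple <- function(f) {
    function(x, ...) {
        ok <- !is.na(x)               # positions with real values
        out <- rep(NA, length(x))     # start with NA everywhere
        out[ok] <- f(x[ok], ...)      # compute on complete values only
        out                           # NA-s remain in their original places
    }
}

cumsum_simple <- dodge_na_simple(cumsum)
cumsum_simple(c(1, 2, NA, 4))
## [1]  1  3 NA  7
```

The sketch assumes that f returns one value per input value, which is exactly what makes a function "plain" in the sense used here.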

dodge_NA() function definition

# Modified from base::Vectorize()
# Note: the %||% operator used below is part of base R from version 4.4.0
dodge_NA <- function(FUN, these_args = arg.names)
{
    my_name <- match.call()[1L]
    my_args <- names(match.call()[-1L])
    arg.names <- as.list(formals(args(FUN)))
    arg.names[["..."]] <- NULL
    arg.names <- names(arg.names)
    these_args <- as.character(these_args)
    if (!length(these_args))
        return(FUN)
    if (!all(these_args %in% arg.names))
        stop("must specify names of formal arguments for '", my_name, "'")
    collisions <- arg.names %in% my_args
    if (any(collisions))
        stop(sQuote("FUN"), " may not have argument(s) named ",
            paste(sQuote(arg.names[collisions]), collapse = ", "))
    rm(arg.names, collisions, my_args, my_name)
    (function() {
        FUNV <- function() {
            args <- lapply(as.list(match.call())[-1L], eval,
                parent.frame())

            names <- names(args) %||% character(length(args))

            to_consider <- (names %in% these_args) | names == ""

            max_length <- max(lengths(args[to_consider]))
            arg_df <- data.frame(lapply(args[to_consider], rep, length.out = max_length))

            has_fun <- complete.cases(arg_df)
            short_res <- do.call(
                what = FUN,
                args = c(as.list(arg_df[has_fun,,drop = FALSE]),
                         args[!to_consider]))
            # TODO: reconsider. this can give weird results if
            #       args[!to_consider] is not length 1

            long_res <- vector(
                mode = typeof(short_res),
                length = max_length
            )

            long_res[has_fun] <- short_res
            long_res[!has_fun] <- NA
            return(long_res)
        }
        formals(FUNV) <- formals(args(FUN))
        environment(FUNV) <- parent.env(environment())
        FUNV
    })()
}

Let’s look at some examples!

cumsum_na <- dodge_NA(cumsum)
cumsum_na(vec_na)

## [1]  1  3 NA  7

Now, the cumulative sum continues past the missing values.

paste_na <- dodge_NA(paste)

paste(vec_na, LETTERS[1:12], sep = "") |> noquote()

##  [1] 1A  2B  NAC 4D  1E  2F  NAG 4H  1I  2J  NAK 4L

paste_na(vec_na, LETTERS[1:12], sep = "") |> noquote()

##  [1] 1A   2B   <NA> 4D   1E   2F   <NA> 4H   1I   2J   <NA> 4L

There might be perfectly good reasons for wanting the string “NA” in your final text; I’ve just never come across one. I always had to clean up the vectors before passing them to the paste function. It’s quite likely that I’ll package this whole thing just to avoid this problem in the future.
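For comparison, the manual clean-up mentioned here might look like the following sketch (one of several possible ways to do it; the variable names are illustrative):

```r
vec_na <- c(1, 2, NA, 4)

# Paste first, then overwrite the positions that came from a missing value
labels <- paste(vec_na, LETTERS[1:4], sep = "")
labels[is.na(vec_na)] <- NA_character_
labels
## [1] "1A" "2B" NA   "4D"
```

It works, but the fix has to be repeated at every call site and for every argument that may contain NA-s.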

Summarising functions

The implementation is a bit easier in this case, because we don’t need to add back the NA values in the final vector (length of one). The rest is more or less the same as above.
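Reduced to a single argument, the summarising case is little more than a filter in front of the function. A minimal sketch under the hypothetical name dodge_na_collapse_simple (an illustration only, without the argument matching and recycling of the full wrapper):

```r
# Hypothetical single-argument sketch for summarising functions
dodge_na_collapse_simple <- function(f) {
    # Drop missing values, then summarise whatever is left
    function(x, ...) f(x[!is.na(x)], ...)
}

max_simple   <- dodge_na_collapse_simple(max)
paste_simple <- dodge_na_collapse_simple(paste)

max_simple(c(1, 2, NA, 4))
## [1] 4
paste_simple(c("a", NA, "c"), collapse = " ")
## [1] "a c"
```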

dodge_NA_collapse() function definition

dodge_NA_collapse <- function(FUN, these_args = arg.names) {
    my_name <- match.call()[1L]
    my_args <- names(match.call()[-1L])
    arg.names <- as.list(formals(args(FUN)))
    arg.names[["..."]] <- NULL
    arg.names <- names(arg.names)
    these_args <- as.character(these_args)
    if (!length(these_args))
        return(FUN)
    if (!all(these_args %in% arg.names))
        stop("must specify names of formal arguments for '", my_name, "'")
    collisions <- arg.names %in% my_args
    if (any(collisions))
        stop(sQuote("FUN"), " may not have argument(s) named ",
            paste(sQuote(arg.names[collisions]), collapse = ", "))
    rm(arg.names, collisions, my_args, my_name)
    (function() {
        FUNV <- function() {
            args <- lapply(as.list(match.call())[-1L], eval,
                parent.frame())

            names <- names(args) %||% character(length(args))

            to_consider <- (names %in% these_args) | names == ""

            max_length <- max(lengths(args[to_consider]))
            arg_df <- data.frame(lapply(
                args[to_consider],
                rep,
                length.out = max_length
            ))

            has_fun <- complete.cases(arg_df)

            res <- do.call(
                what = FUN,
                args = c(as.list(arg_df[has_fun,,drop = FALSE]),
                         args[!to_consider]))
            # TODO: reconsider. this can give weird results if
            #       args[!to_consider] is not length 1

            if (length(res) != 1L) warning(match.call()[1L], " produced vector of length ", length(res))

            return(res)
        }
        formals(FUNV) <- formals(args(FUN))
        environment(FUNV) <- parent.env(environment())
        FUNV
    })()
}

There are cases where plain paste can just blow up in your face. In those cases, you’d better use the dodge version.

pastecollapse_na <- dodge_NA_collapse(paste)

paste(
    vec_na  + 1,
    "palms",
    sep = "",
    collapse = " "
)

## [1] "2palms 3palms NApalms 5palms"

pastecollapse_na(
    vec_na + 1,
    "palms",
    sep = "",
    collapse = " "
)

## [1] "2palms 3palms 5palms"

As you know, the mean function uses the na.rm option. It’s a bit annoying to have to type it out each and every time, but let’s say you’re already used to that.

Let’s implement a new function, plusminus. It will add even numbers and subtract odd numbers in a vector.

# plusminus :: [ int ] -> int
plusminus <- function(x) {
    Reduce(`+`, -x * (2L * (x %% 2L) - 1L))
}


pm_data <- round(runif(20) * 46)

# The final plus-minus sum
plusminus(pm_data)

## [1] -45
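Because pm_data is drawn at random without a fixed seed, the result above will differ between runs. A small fixed input makes the arithmetic easier to follow. The sketch below repeats the plusminus definition so that it runs on its own, and checks it against a more literal formulation (plusminus_alt is a made-up name for this illustration):

```r
plusminus <- function(x) {
    Reduce(`+`, -x * (2L * (x %% 2L) - 1L))
}

# More literal version: keep even numbers, negate odd ones, then sum
plusminus_alt <- function(x) sum(ifelse(x %% 2 == 0, x, -x))

x <- c(2, 3, 4, 7)
plusminus(x)      # 2 - 3 + 4 - 7
## [1] -4
plusminus_alt(x)
## [1] -4
```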

Plusminus is a vectorized function, but I’d like to visualise how it works, so let’s create a function to plot the running plusminus sum at each position. The numbers to be added/subtracted are displayed along the curve.

plot_running_sum() function definition

plot_running_sum <- function(x, sumfun = plusminus) {
    pos <- seq_along(x)
    running_sum <- vapply(pos, \(p) sumfun(x[1:p]), 0)
    reinj <- range(running_sum, na.rm = TRUE) # A-ha!
    plot(pos, running_sum, type = "b", ylim = reinj * c(1, 1.1))
    text(x = pos, y = running_sum + reinj[2] * 0.1, labels = x)
    return(invisible(running_sum))
}

plot_running_sum(pm_data)

Now, should we add error handling to the function? Do we implement an na.rm option with if statements within the body… Or shall we just dodge this chore?

plusminus_NA_ready <- dodge_NA_collapse(plusminus)

# Break the data
pm_data[13] <- NA

# Can't handle it
plusminus(pm_data)

## [1] NA

# Can handle it
plusminus_NA_ready(pm_data)

## [1] -20

plot_running_sum(pm_data, sumfun = plusminus)

plot_running_sum(pm_data, sumfun = plusminus_NA_ready)

I hope this illustrates my point: focus on the logic, and not the NA handling, which is the same boring boilerplate in many different functions.
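For contrast, this is roughly what the hand-written alternative would look like. It is a sketch of the usual na.rm pattern applied to plusminus (plusminus_narm is a hypothetical name, not something from the post):

```r
# The same logic as plusminus(), plus the usual na.rm boilerplate
plusminus_narm <- function(x, na.rm = FALSE) {
    if (na.rm) x <- x[!is.na(x)]
    Reduce(`+`, -x * (2L * (x %% 2L) - 1L))
}

x <- c(2, 3, NA, 4)
plusminus_narm(x)
## [1] NA
plusminus_narm(x, na.rm = TRUE)   # 2 - 3 + 4
## [1] 3
```

It is only two extra lines here, but they would have to be written, documented and tested again in every function that needs them.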

Con[cl|f]usion

So what’s really going on here? The whole thinking started with monads. In a monadic realm, you’d usually need a bind (>>=) function to chain computations. That’s because the usual functions start from a plain (non-monadic) value and produce a monadic value (see equations below).

$$\displaylines{fun :: a \rightarrow m~b \\ bind :: m~a \rightarrow (a \rightarrow m~b) \rightarrow m~b}$$

In R, this is quite different. We are already in vector+maybe land, and we usually don’t leave it. Also, R functions rarely take only a single argument, and several arguments can contain NA-s (optional values) at once. A classic bind would only work on single-argument (or curried) functions, so it is ill-suited to this environment. On top of that, it would add another infix operator, which may annoy some (many?) people.
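To see what a classic bind would look like in R, here is a minimal sketch for length-one values. Both bind_na and safe_sqrt are hypothetical names introduced only for this illustration.

```r
# Hypothetical maybe-style bind for length-one values
bind_na <- function(x, f) {
    stopifnot(length(x) == 1L)
    if (is.na(x)) x else f(x)   # short-circuit on a missing value
}

# A function from a plain value to a "maybe" value
safe_sqrt <- function(x) if (x < 0) NA_real_ else sqrt(x)

bind_na(16, safe_sqrt)
## [1] 4
bind_na(NA_real_, safe_sqrt)
## [1] NA
16 |> bind_na(safe_sqrt) |> bind_na(safe_sqrt)
## [1] 2
```

This chains nicely for one argument, but it has no obvious answer for a call such as paste(x, y, sep = ""), where either x or y may hold the missing value.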

Instead, I decided to go for a function wrapper. It takes a function that can only handle non-missing input values, and imbues it with the capability of dodging the missing ones.

$$wrap :: (a \rightarrow \ldots \rightarrow b \rightarrow m~c) \rightarrow (m~a \rightarrow m~\ldots \rightarrow b \rightarrow m~c)$$

This approach has some disadvantages, for example, the need for two wrappers for the two types of functions (summarising and plain). On the other hand, it will cope quite well with some R-specifics, and the resulting function can be simply piped into with a native or magrittr pipe, like we’re used to.

Next, I’d like to check if this can be extended to the other monad which is native to R, the vector.
