Let’s talk about NA-s!
ARTHUR: Well, what is it you want?
HEAD KNIGHT: We want… a paste function that can deal with NA-s!
A language for statistical computing clearly needs to be able to deal with missing values, and R has various ways to do so. I will briefly go through some of them, then propose an interesting way to spice it up a bit.
Missingness is represented with `NA` values, or – rather – non-values. Every vector type can contain the intended values or `NA`-s. As mentioned in a previous post, this resembles an optional or maybe type in other languages. Most of the time, it is used to represent missing observations in a dataset, but you can equally return an `NA` from your function if it’s not able to calculate the expected result.
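To make the "every vector type" remark concrete, here is a minimal sketch using only base R. The vector `x` is made up for illustration.

```r
# Each atomic vector type has its own typed missing value
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# An NA sits next to ordinary values without changing the vector's type
x <- c(1.5, NA, 3)     # hypothetical example data
typeof(x)              # "double"
is.na(x)               # FALSE TRUE FALSE
```

A plain `NA` is logical, and it gets coerced to the type of the vector it lands in.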
To illustrate this, let’s write a function that takes a lower case letter and returns the previous one in the alphabet.
```r
previous_letter <- function(x) {
  # Only consider the first letter of the first string
  res <- letters[match(substr(x[1], 1, 1), letters) - 1]
  if (length(res) != 1L) NA_character_ else res
}

previous_letter("n")
## [1] "m"
previous_letter("a")
## [1] NA
previous_letter("#rstats")
## [1] NA
```
If the input starts with `[b-z]`, it will return a lowercase letter, otherwise `NA`.
Producing missing values is no big deal. Dealing with them is a bit more complicated, and R has various ways to do so.
Worst first! You can set the `na.action` option globally, which may or may not be used by some functions or objects. This is not recommended, because it relies on something specific to your setup, so it can lead to unintended behaviour in another environment. Explicit is better than implicit: you should spell out what you want to do with the missing values.
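As a rough sketch of the difference, assuming a toy data frame `df` invented for this example, `lm()` can be told what to do with incomplete rows in the call itself instead of relying on the global option:

```r
# The global default that modelling functions such as lm() fall back on;
# its value depends on your session setup
getOption("na.action")

# Hypothetical toy data with one missing response
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, NA, 6.2, 7.9, 10.1))

# Explicit: the call itself says that incomplete rows are dropped
fit <- lm(y ~ x, data = df, na.action = na.omit)
nobs(fit)  # 4 rows were used

# Explicit and strict: refuse to fit if anything is missing
strict <- try(lm(y ~ x, data = df, na.action = na.fail), silent = TRUE)
inherits(strict, "try-error")  # TRUE
```

Both calls behave the same way on any machine, whatever `options(na.action = ...)` happens to be set to.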
Luckily, the `{stats}` package comes with a few built-in utility functions:
- na.pass: leave missing values alone
- na.omit: omit missing values and record where they were
- na.exclude: same as above, but different class (see help)
- na.contiguous: find the longest consecutive stretch of non-missing values in an object
- na.fail: throw error if there are missing values
See examples
```r
vec_na <- c(1, 2, NA, 4)

na.pass(vec_na)
## [1] 1 2 NA 4

na.omit(vec_na)
## [1] 1 2 4
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "omit"

na.exclude(vec_na)
## [1] 1 2 4
## attr(,"na.action")
## [1] 3
## attr(,"class")
## [1] "exclude"

na.contiguous(vec_na)
## [1] 1 2
## attr(,"na.action")
## [1] 3 4
## attr(,"class")
## [1] "omit"

na.fail(vec_na) |> try()
## Error in na.fail.default(vec_na) : missing values in object
```
You can also create functions to make use of these internally, or handle missingness in other ways. Have you ever used `median(na.rm = TRUE)` in your scripts? No? How about `mean(na.rm = TRUE)`? No? `sd` then? You see where I’m going? All these (and other) functions have implemented `NA` handling separately. Could there be a way that developers could focus on the task at hand (e.g. calculate median of a vector of values), and not have to care about what to do with missing values?
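For reference, this is the familiar per-function pattern. The vector `x` is a stand-in with one missing value.

```r
x <- c(1, 2, NA, 4)

mean(x)                   # NA
mean(x, na.rm = TRUE)     # 2.333333
median(x, na.rm = TRUE)   # 2
sd(x, na.rm = TRUE)       # 1.527525
```

Each of these functions carries its own `na.rm` argument and its own code path for dropping the missing values.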
From this perspective, there are two common function types:
- Summarising functions: Reduce many values into a single scalar (e.g. `mean()` or `paste(collapse = " ")`)
- Plain functions: Keep the values in the vector realm (e.g. `cumsum()` or `paste(collapse = NULL)`)
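The distinction shows up in the length of the result. A small sketch with a made-up input vector:

```r
x <- c(3, 1, 2)

# Summarising functions: many values in, one value out
length(mean(x))                    # 1
length(paste(x, collapse = " "))   # 1

# Plain functions: the result stays as long as the input
length(cumsum(x))                  # 3
length(paste(x, "cm"))             # 3
```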
Plain functions
The above mentioned cumulative sum function has a problem with missing values. Once it hits an `NA`, it will produce `NA`-s for the rest of the result.
```r
cumsum(vec_na)
## [1] 1 3 NA NA
```
For the most part, this is the required behaviour, but sometimes one would prefer to treat `NA` as missing and keep the sum going. One could think of a function wrapper to sidestep the `NA`-s in one or more function arguments, calculate the result, and add back `NA`-s to the correct positions. Let’s call it `dodge_NA`. We can restrict which function arguments are considered when searching for missing values, and only the complete cases are used in the calculation (vector recycling rules apply if their lengths differ).
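Before the full definition, the core idea can be sketched for the simplest case of a single argument. `dodge_one()` is a hypothetical, stripped-down helper written for this illustration. It is not the `dodge_NA()` defined below, and it assumes the wrapped function returns as many values as it receives.

```r
# Hypothetical single-argument version of the idea:
# drop the NA-s, compute, then put NA-s back where they were
dodge_one <- function(f) {
  function(x) {
    keep <- !is.na(x)
    out <- rep(NA, length(x))   # logical NA-s, coerced on assignment
    out[keep] <- f(x[keep])     # assumes f() preserves length
    out
  }
}

cumsum_skip <- dodge_one(cumsum)
cumsum_skip(c(1, 2, NA, 4))
## [1]  1  3 NA  7
```

The full version has more to do because it must cope with several arguments, recycling, and arguments that should not be scanned for missing values.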
`dodge_NA()` function definition

```r
# Modified from base::Vectorize()
# Note: %||% is part of base R from version 4.4.0; on older versions load {rlang}
dodge_NA <- function(FUN, these_args = arg.names) {
  my_name <- match.call()[1L]
  my_args <- names(match.call()[-1L])
  arg.names <- as.list(formals(args(FUN)))
  arg.names[["..."]] <- NULL
  arg.names <- names(arg.names)
  these_args <- as.character(these_args)
  if (!length(these_args)) return(FUN)
  if (!all(these_args %in% arg.names))
    stop("must specify names of formal arguments for '", my_name, "'")
  collisions <- arg.names %in% my_args
  if (any(collisions))
    stop(sQuote("FUN"), " may not have argument(s) named ",
         paste(sQuote(arg.names[collisions]), collapse = ", "))
  rm(arg.names, collisions, my_args, my_name)
  (function() {
    FUNV <- function() {
      args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
      names <- names(args) %||% character(length(args))
      to_consider <- (names %in% these_args) | names == ""
      max_length <- max(lengths(args[to_consider]))
      arg_df <- data.frame(lapply(args[to_consider], rep, length.out = max_length))
      has_fun <- complete.cases(arg_df)
      short_res <- do.call(
        what = FUN,
        args = c(as.list(arg_df[has_fun,,drop = FALSE]), args[!to_consider]))
      # TODO: reconsider. this can give weird results if
      # args[!to_consider] is not length 1
      long_res <- vector(
        mode = typeof(short_res),
        length = max_length
      )
      long_res[has_fun] <- short_res
      long_res[!has_fun] <- NA
      return(long_res)
    }
    formals(FUNV) <- formals(args(FUN))
    environment(FUNV) <- parent.env(environment())
    FUNV
  })()
}
```
Let’s look at some examples!
```r
cumsum_na <- dodge_NA(cumsum)
cumsum_na(vec_na)
## [1] 1 3 NA 7
```
Now, the cumulative sum continues past the missing values.
```r
paste_na <- dodge_NA(paste)

paste(vec_na, LETTERS[1:12], sep = "") |> noquote()
## [1] 1A 2B NAC 4D 1E 2F NAG 4H 1I 2J NAK 4L

paste_na(vec_na, LETTERS[1:12], sep = "") |> noquote()
## [1] 1A 2B <NA> 4D 1E 2F <NA> 4H 1I 2J <NA> 4L
```
There might be perfectly good reasons for wanting the string “NA” in your final text, I’ve just never come across one. I always had to clean up the vectors before going into the paste function. It’s quite likely that I’ll package this whole thing just to avoid this problem in the future.
Summarising functions
The implementation is a bit easier in this case, because we don’t need to add back the `NA` values in the final vector (length of one). The rest is more or less the same as above.
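Stripped down to a single argument, the summarising case is little more than a filter in front of the function. `dodge_one_collapse()` is a hypothetical simplification for illustration, not the `dodge_NA_collapse()` defined below.

```r
# Hypothetical single-argument version for summarising functions:
# drop the NA-s and pass everything else through untouched
dodge_one_collapse <- function(f) {
  function(x, ...) f(x[!is.na(x)], ...)
}

mean_skip <- dodge_one_collapse(mean)
mean_skip(c(1, 2, NA, 4))
## [1] 2.333333

paste_skip <- dodge_one_collapse(paste)
paste_skip(c("a", NA, "c"), collapse = " ")
## [1] "a c"
```

There is nothing to put back afterwards, because the result has length one.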
`dodge_NA_collapse()` function definition

```r
dodge_NA_collapse <- function(FUN, these_args = arg.names) {
  my_name <- match.call()[1L]
  my_args <- names(match.call()[-1L])
  arg.names <- as.list(formals(args(FUN)))
  arg.names[["..."]] <- NULL
  arg.names <- names(arg.names)
  these_args <- as.character(these_args)
  if (!length(these_args)) return(FUN)
  if (!all(these_args %in% arg.names))
    stop("must specify names of formal arguments for '", my_name, "'")
  collisions <- arg.names %in% my_args
  if (any(collisions))
    stop(sQuote("FUN"), " may not have argument(s) named ",
         paste(sQuote(arg.names[collisions]), collapse = ", "))
  rm(arg.names, collisions, my_args, my_name)
  (function() {
    FUNV <- function() {
      args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
      names <- names(args) %||% character(length(args))
      to_consider <- (names %in% these_args) | names == ""
      max_length <- max(lengths(args[to_consider]))
      arg_df <- data.frame(lapply(
        args[to_consider],
        rep,
        length.out = max_length
      ))
      has_fun <- complete.cases(arg_df)
      res <- do.call(
        what = FUN,
        args = c(as.list(arg_df[has_fun,,drop = FALSE]), args[!to_consider]))
      # TODO: reconsider. this can give weird results if
      # args[!to_consider] is not length 1
      if (length(res) != 1L)
        warning(match.call()[1L], " produced vector of length ", length(res))
      return(res)
    }
    formals(FUNV) <- formals(args(FUN))
    environment(FUNV) <- parent.env(environment())
    FUNV
  })()
}
```
We can find some examples when simple paste can just blow up in your face. In those cases, you’d better use the dodge version.
```r
pastecollapse_na <- dodge_NA_collapse(paste)

paste(
  vec_na + 1,
  "palms",
  sep = "",
  collapse = " "
)
## [1] "2palms 3palms NApalms 5palms"

pastecollapse_na(
  vec_na + 1,
  "palms",
  sep = "",
  collapse = " "
)
## [1] "2palms 3palms 5palms"
```
As you know, the mean function uses the `na.rm` option. It’s a bit annoying to have to type it out each and every time, but let’s say you’re already used to that. Let’s implement a new function, `plusminus`. It will add even numbers and subtract odd numbers in a vector.
```r
# plusminus :: [ int ] -> int
plusminus <- function(x) {
  Reduce(`+`, -x * (2L * (x %% 2L) - 1L))
}

pm_data <- round(runif(20) * 46)

# The final plus-minus sum
plusminus(pm_data)
## [1] -45
```
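Since `pm_data` is random, a few hand-checkable inputs may help to confirm the sign logic. The sketch repeats the definition of `plusminus()` from above so that it runs on its own. The input vectors are made up.

```r
# Same definition as above, repeated so the block is self-contained
plusminus <- function(x) {
  Reduce(`+`, -x * (2L * (x %% 2L) - 1L))
}

# Even numbers are added, odd numbers are subtracted
plusminus(c(2, 3, 4))    #  2 - 3 + 4 = 3
plusminus(c(1, 3, 5))    # -1 - 3 - 5 = -9

# A single missing value poisons the whole sum
plusminus(c(2, NA, 4))   # NA
```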
Plusminus is a vectorized function, but I’d like to visualise how it works, so let’s create a function to plot the running plusminus sum at each position. The numbers to be added/subtracted are displayed along the curve.
`plot_running_sum()` function definition

```r
plot_running_sum <- function(x, sumfun = plusminus) {
  pos <- seq_along(x)
  running_sum <- vapply(pos, \(p) sumfun(x[1:p]), 0)
  reinj <- range(running_sum, na.rm = TRUE) # A-ha!
  plot(pos, running_sum, type = "b", ylim = reinj * c(1, 1.1))
  text(x = pos, y = running_sum + reinj[2] * 0.1, labels = x)
  return(invisible(running_sum))
}
```

```r
plot_running_sum(pm_data)
```

Now, should we add error handling to the function? Do we implement an `na.rm` option with if statements within the body… Or shall we just dodge this chore?
```r
plusminus_NA_ready <- dodge_NA_collapse(plusminus)

# Break the data
pm_data[13] <- NA

# Can't handle it
plusminus(pm_data)
## [1] NA

# Can handle it
plusminus_NA_ready(pm_data)
## [1] -20

plot_running_sum(pm_data, sumfun = plusminus)
```

```r
plot_running_sum(pm_data, sumfun = plusminus_NA_ready)
```

I hope this illustrates my point: focus on the logic, and not the `NA` handling, which is the same boring boilerplate in many different functions.
Con[cl|f]usion
So what’s really going on here? The whole thinking started with monads. In a monadic realm, you’d usually need a bind (`>>=`) function to chain computations. That’s because the usual functions start from a scalar and produce a monadic value (see equations below).
$$\displaylines{fun :: a \rightarrow m~b \\ bind :: m~a \rightarrow (a \rightarrow m~b) \rightarrow m~b}$$
In R, this is quite different. We are already in vector+maybe land, and we usually don’t leave it. Also, R functions rarely take only a single argument. Multiple arguments can contain `NA`-s (optional values). So a classic bind function would be ill-suited in this environment. Bind would only work on single-argument (or curried) functions. And on top of that, it would add another infix operator, which may annoy some (many?) people.
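To see what is being turned down here, a classic bind for scalars that may be `NA` could look roughly like the sketch below. The names `bind_na()`, `%>>=%` and `half_if_even()` are all hypothetical and exist only for this illustration.

```r
# Hypothetical bind for a scalar "maybe" value:
# stop at the first NA, otherwise apply the next step
bind_na <- function(x, f) if (is.na(x)) NA else f(x)
`%>>=%` <- bind_na

# A scalar-in, maybe-out function (a -> m b)
half_if_even <- function(n) if (n %% 2 == 0) n / 2 else NA_real_

8 %>>=% half_if_even %>>=% half_if_even   # 8 -> 4 -> 2
6 %>>=% half_if_even %>>=% half_if_even   # 6 -> 3 -> NA
```

It chains nicely, but only for one scalar argument at a time, which is exactly the limitation described above.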
Instead, I decided to go for a function wrapper. It takes a function that can handle non-missing input values only, and imbues it with the capability of dodging the missing ones.
$$wrap :: (a \rightarrow \ldots \rightarrow b \rightarrow m~c) \rightarrow (m~a \rightarrow m~\ldots \rightarrow b \rightarrow m~c)$$
This approach has some disadvantages, for example, the need for two wrappers for the two types of functions (summarising and plain). On the other hand, it will cope quite well with some R-specifics, and the resulting function can be simply piped into with a native or magrittr pipe, like we’re used to.
Next, I’d like to check if this can be extended to the other monad which is native to R, the vector.