Booleans & NAs

[This article was first published on R on Data & The World, and kindly contributed to R-bloggers.]

Missing values are inevitable in data science, and handling them is a constant issue. Boolean logic treats them differently from most other data types: whether a missing value propagates to the result depends on the operator and on the value of the other argument. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.

Most programming languages that I’m aware of support short-circuit evaluation, which lets certain Boolean expressions be evaluated from the first argument alone. Specifically, an OR is always true if the first argument is true, and an AND is always false if the first argument is false. Other operators, like XOR, can’t be short-circuited because both values are needed to determine the output.
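As a small illustration of short-circuiting, here is a sketch in plain Python. The helpers noisy and evaluated_operands are hypothetical names made up for this example; they only record which operands actually get evaluated by Python's own and/or operators.

```python
def noisy(value, log):
    """Return value unchanged, recording in log that this operand was evaluated."""
    log.append(value)
    return value


def evaluated_operands(op, first, second):
    """Return (result, list of operands actually evaluated) for "and" or "or"."""
    log = []
    if op == "and":
        result = noisy(first, log) and noisy(second, log)
    else:
        result = noisy(first, log) or noisy(second, log)
    return result, log


# The first operand settles the result, so the second is never evaluated.
print(evaluated_operands("or", True, False))   # (True, [True])
print(evaluated_operands("and", False, True))  # (False, [False])

# The first operand is not enough, so both operands are evaluated.
print(evaluated_operands("and", True, False))  # (False, [True, False])
print(evaluated_operands("or", False, True))   # (True, [False, True])
```

In the first two calls the log holds a single entry, which is the short-circuit at work.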

If the second argument is a missing value, you might expect the short-circuit cases to behave as before, and you’d be right. An AND can’t be short-circuited when the first argument is true, nor an OR when the first argument is false, so it isn’t surprising that those two cases come out missing.

# in Julia; R and pandas.NA in Python behave the same
> true & missing
missing
> false & missing
false
> true | missing
true
> false | missing
missing

If the missing data is the first argument, however, what should happen is less clear. There’s an argument to be made for Boolean expressions short-circuiting to “missing” or “NA” in that case, but it turns out that’s not what happens. Instead, it behaves exactly as if the missing value were second:

# in R
> NA & TRUE
NA
> NA & FALSE
FALSE
> NA | TRUE
TRUE
> NA | FALSE
NA

R’s documentation describes this like so:

NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE.

Julia and pandas.NA operate in the same way, so presumably they use the same reasoning.
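The "NA if the outcome is ambiguous" rule can be sketched in plain Python. This is only an illustration of the reasoning, not how R, Julia, or pandas implement it: MISSING (here just None) and the resolve helper are hypothetical names. The idea is to try every value a missing operand could take and return a definite answer only when all of the outcomes agree.

```python
import operator

MISSING = None  # hypothetical stand-in for NA / missing


def resolve(op, x, y):
    """Apply a Boolean operator, returning MISSING when the outcome is ambiguous."""
    # A missing operand could be either True or False, so try both.
    xs = [True, False] if x is MISSING else [x]
    ys = [True, False] if y is MISSING else [y]
    outcomes = {op(a, b) for a in xs for b in ys}
    # Exactly one possible outcome means the known operand decides the result.
    return outcomes.pop() if len(outcomes) == 1 else MISSING


print(resolve(operator.and_, MISSING, False))  # False
print(resolve(operator.and_, MISSING, True))   # None
print(resolve(operator.or_, MISSING, True))    # True
print(resolve(operator.or_, MISSING, False))   # None
print(resolve(operator.xor, MISSING, True))    # None
```

Because the helper treats both operands alike, swapping the arguments gives the same results, which matches the tables above. XOR never has a single possible outcome when an operand is missing, so under this rule it always comes out missing.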

A Note Regarding Pandas

Above, I always referred to Python’s missing data type as pandas.NA, for one particular reason. In the past, pandas would use numpy.nan to fill in missing values. Those don’t work in Boolean operations: introducing one coerces the Boolean data to floats, for which AND and OR are not defined.

> import numpy as np
> a = np.array([True, np.nan])
> a
array([ 1., nan])
> a.dtype
dtype('float64')
> a & a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could
not be safely coerced to any supported types according to the casting rule ''safe''

In version 1.0.0, however, pandas.NA was added, which can be used in conjunction with boolean operators:

> import pandas as pd
> pd.NA & True
<NA>
> pd.NA & False
False
> pd.NA | True
True
> pd.NA | False
<NA>
