Representation of numerical NA’s in R and the 1954 enigma

Posted on July 8, 2012 by mark in R bloggers

[This article was first published on Mark van der Loo, and kindly contributed to R-bloggers].

I’ve always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

A missing integer is represented by the most negative (4-byte) signed integer that can be represented by your computer.
A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

Missing integers

As you may know, a lot of R's core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we’ll assume that’s always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

$[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].$

(The range is asymmetric because 0 occupies the place of one positive number).
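As a quick check of the arithmetic, the two bounds can be computed in R itself. This is only a sketch under the 32-bit assumption made above; the names int_bits, c_int_min and c_int_max are made up for the illustration, and the values are computed as doubles rather than as R integers.

```r
# Width of a C int in bits (assumed to be 32, as in the text).
int_bits <- 32

# One bit encodes the sign, so the bounds are -2^(bits-1) and 2^(bits-1) - 1.
c_int_min <- -2^(int_bits - 1)
c_int_max <- 2^(int_bits - 1) - 1

# Doubles hold these values exactly; print them without scientific notation.
format(c(c_int_min, c_int_max), scientific = FALSE, trim = TRUE)
# [1] "-2147483648" "2147483647"
```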

Now let’s compare this with R's integer range. The maximum integer is easily found, since it is stored in the hidden .Machine variable.

> .Machine$integer.max
[1] 2147483647

So this corresponds with C's INT_MAX. The most negative integer is not present in .Machine but we can do some tests:

# store the second-smallest C integer. The L at the end forces the number
# to be "integer", not "numeric"
> x <- -2147483647L
> typeof(x)
[1] "integer"
 
# adding an integer works fine, since we move further into the range:
> typeof(x+1L)
[1] "integer"
 
# subtracting an integer gives a warning telling us that the result is out of range:
> typeof(x-1L)
[1] "integer"
Warning message:
In x - 1L : NAs produced by integer overflow
 
# subtracting a non-integer 1 ("numeric") yields a non-integer:
> typeof(x-1)
[1] "double"

The result of x - 1L would be $-2147483648$, which is outside R's integer range, so R returns NA instead. The integer range of R is $\pm 2147483647$: one integer less than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
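The claim that the missing integer is stored as INT_MIN can also be inspected from within R, without reading the C sources. The sketch below assumes that writeBin() writes out the 4-byte in-memory pattern of an integer unchanged; the variable names are only illustrative.

```r
# Serialize the integer NA to raw bytes, most significant byte first.
na_int_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_int_bytes)
# [1] 80 00 00 00

# 0x80000000 is the two's complement bit pattern of INT_MIN = -2^31.
# Reading that pattern back as an integer gives NA, not -2147483648.
int_min_bytes <- as.raw(c(0x80, 0x00, 0x00, 0x00))
round_trip <- readBin(int_min_bytes, what = "integer", size = 4, endian = "big")
is.na(round_trip)
# [1] TRUE
```

Only the sign bit is set in the four bytes, which is exactly the pattern a 32-bit C int uses for its most negative value.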

Missing doubles

To explain how real ( $\mathbb{R}$ ) missing values are represented, we first need to spend a few words on the double type. A double is short for double precision and it is the variable type used to represent (approximations to) the real numbers in a computer.

Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

$\textrm{sign}\times 2^{e} \times 1.F$ .

The sign is represented by 1 bit, the exponent $e$ by 11 bits and the mantissa $F$ by 52 bits, so we have 64 bits in total. The special values NaN and $\pm$Inf are coded using a value of $e$ that is not used to represent ordinary numbers, namely $e=$ 0x7ff (hexadecimal, all 11 exponent bits set). $\pm$Inf has $F=0$, while NaN has $F\not=0$. The important thing is that it does not matter which nonzero value $F$ has when representing NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose $F=1954$ in the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.
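The 1954 can be made visible from R by looking at the raw bytes of NA_real_. This is a sketch rather than a description of what R_IsNA does internally: it assumes that writeBin() writes the 8 bytes of a double unchanged, and low_word() is a helper invented for this example. The upper mantissa bits are deliberately not inspected, since they may differ between platforms.

```r
# Big-endian bytes of the double NA: sign and exponent come first,
# the lowest mantissa bytes come last.
na_dbl_bytes <- writeBin(NA_real_, raw(), endian = "big")

# Exponent field: low 7 bits of byte 1 followed by the high 4 bits of byte 2.
exponent <- (as.integer(na_dbl_bytes[1]) %% 128L) * 16L +
  as.integer(na_dbl_bytes[2]) %/% 16L
exponent == 0x7ff
# [1] TRUE

# Hypothetical helper: the lower 32 bits of a double, read as an integer.
low_word <- function(x) {
  bytes <- writeBin(x, raw(), endian = "big")
  readBin(bytes[5:8], what = "integer", size = 4, endian = "big")
}

low_word(NA_real_)
# [1] 1954

# An ordinary NaN has the same exponent but does not carry the 1954 payload,
# which is how the two can be told apart.
low_word(NaN) == 1954L
# [1] FALSE
```

At the R level the same distinction shows up as is.nan(NA_real_) being FALSE while is.na(NaN) is TRUE.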