# Representation of numerical NA’s in R and the 1954 enigma

July 8, 2012

(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

I've always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

• A missing integer is represented by the most negative (4-byte) signed integer that can be represented by your computer.
• A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

### Missing integers

As you may know, a lot of R's core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which is the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we'll assume that's always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

$[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].$

(The range is asymmetric because 0 takes up one of the non-negative bit patterns.)

Now let's compare this with R's integer range. The maximum integer is easily found,
since it is stored in the hidden .Machine variable.

    > .Machine$integer.max
    [1] 2147483647

So this corresponds with C's INT_MAX. The most negative integer is
not present in .Machine, but we can do some tests:

    # Store the second-smallest C integer. The trailing L forces the number
    # to be "integer", not "numeric".
    > x <- -2147483647L
    > typeof(x)
    [1] "integer"

    # Adding an integer works fine, since we move further into the range:
    > typeof(x+1L)
    [1] "integer"

    # Subtracting an integer gives a warning telling us that the result is out of range:
    > typeof(x-1L)
    [1] "integer"
    Warning message:
    In x - 1L : NAs produced by integer overflow

    # Subtracting a non-integer 1 ("numeric") yields a non-integer:
    > typeof(x-1)
    [1] "double"

The result is out of R's integer range. The integer range of R is $\pm 2147483647$: one integer less than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
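If you want to see the bit pattern for yourself, here is a minimal sketch using base R's writeBin. It assumes the usual 32-bit, two's-complement int; the variable names are just for illustration.

```r
# Write the missing integer out as raw bytes, most significant byte first.
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_bytes)

# Rebuild the signed value from the bytes. The arithmetic is done in
# doubles, because the result does not fit in R's own integer range.
unsigned <- sum(as.integer(na_bytes) * 256^(3:0))
signed <- unsigned - 2^32
print(signed)

# Compare with one below R's smallest regular integer.
print(signed == -.Machine$integer.max - 1)
```

On such a machine only the sign bit is set (bytes 80 00 00 00), which is exactly the pattern of INT_MIN.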

### Missing doubles

To explain how real ($\mathbb{R}$) missing values are represented, we first need to spend a few words on the double type. A double is short for double precision and it is the variable type used to represent (approximations to) the real numbers in a computer.

Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article on the double-precision floating-point format):

$\textrm{sign}\times 2^{e} \times 1.F$.

The sign is represented by 1 bit, the exponent $e$ by 11 bits (stored with an offset of 1023) and the mantissa $F$ by 52 bits, so we have 64 bits in total. The special value NaN (and also $\pm$Inf) is coded using values of the exponent field that are not used to represent numbers. NaN is represented by an exponent field of 0x7ff (hexadecimal) and $F\not=0$; the same exponent field with $F=0$ encodes $\pm$Inf. The important thing is that, as long as $F$ is non-zero, its exact value does not matter when representing NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose $F=1954$ in the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in the lower 32 bits of NaN values.
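The layout can be inspected from within R. The sketch below is only an illustration: decompose is a hypothetical helper written for this post (not part of R), built on base R's writeBin and rawToBits, and it assumes the standard 64-bit IEEE double. It reports the stored (offset) exponent field rather than the true exponent.

```r
# Hypothetical helper: split one double into its sign bit, 11-bit exponent
# field, 52-bit fraction, and the lower 32 bits of the fraction.
decompose <- function(x) {
  bytes <- writeBin(x, raw(), endian = "big")
  # rawToBits() returns the least significant bit of each byte first, so
  # reverse before and after to get all 64 bits, most significant first.
  bits <- as.integer(rev(rawToBits(rev(bytes))))
  list(
    sign     = bits[1],
    exponent = sum(bits[2:12] * 2^(10:0)),
    fraction = sum(bits[13:64] * 2^(51:0)),
    low_word = sum(bits[33:64] * 2^(31:0))
  )
}

str(decompose(1))         # ordinary number: exponent field 1023, fraction 0
str(decompose(Inf))       # exponent field 2047 (0x7ff), fraction 0
str(decompose(NaN))       # exponent field 2047, non-zero fraction
str(decompose(NA_real_))  # exponent field 2047, low word 1954
```

I only look at the low word for NA here, since that is the part R_IsNA inspects; the upper bits of the fraction may differ between platforms.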

A fun question is why the R developers chose 1954. Any ol' number would have been fine. Was it because

• It's the year of birth of one of the developers? (I couldn't find a match here)
• Alan Turing died in 1954? (macabre)
• President Eisenhower met with aliens in 1954? (ehm...)
• In 1954 Queen Elizabeth II became the first reigning monarch to visit Australia? (well...)

Leave an answer in the comments if you have a better idea...