Representation of numerical NA’s in R and the 1954 enigma

[This article was first published on Mark van der Loo, and kindly contributed to R-bloggers].

I’ve always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

  • A missing integer is represented by the most negative (4-byte) signed integer that can be represented by your computer.
  • A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

Missing integers

As you may know, a lot of R’s core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we’ll assume that’s always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].

(The range is asymmetric because 0 occupies the place of one positive number).

Now let’s compare this with R’s integer range. The maximum integer is easily found, since it is stored in the hidden .Machine variable.

> .Machine$integer.max
[1] 2147483647

So this corresponds with C’s INT_MAX. The most negative integer is not present in .Machine, but we can do some tests:

# Store the second-smallest C integer. The trailing L forces the number
# to be "integer", not "numeric".
> x <- -2147483647L
> typeof(x)
[1] "integer"

# Adding an integer works fine, since we move further into the range:
> typeof(x + 1L)
[1] "integer"

# Subtracting an integer gives a warning telling us that the result is out of range:
> typeof(x - 1L)
[1] "integer"
Warning message:
In x - 1L : NAs produced by integer overflow

# Subtracting a non-integer 1 ("numeric") yields a non-integer:
> typeof(x - 1)
[1] "double"

The result is out of R’s integer range. The integer range of R is \pm 2147483647: one integer less than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
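One way to see the reserved value directly is to dump the raw bytes of a few integers. The sketch below assumes a platform with 32-bit two's-complement int (as the text does); int_bytes is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: the four bytes of an R integer, most significant first.
# Assumes 32-bit two's-complement integers.
int_bytes <- function(x) writeBin(x, raw(), size = 4, endian = "big")

int_bytes(.Machine$integer.max)   # 7f ff ff ff  (INT_MAX)
int_bytes(-.Machine$integer.max)  # 80 00 00 01  (smallest non-missing integer)
int_bytes(NA_integer_)            # 80 00 00 00  (the INT_MIN bit pattern)

# Reading the INT_MIN pattern back gives NA, not -2147483648.
from_min <- readBin(as.raw(c(0x80, 0, 0, 0)), "integer", size = 4, endian = "big")
is.na(from_min)                   # TRUE
```

The byte dump uses big-endian order only to make the sign bit appear first; the value stored in memory is the same on either kind of machine.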

Missing doubles

To explain how real (\mathbb{R}) missing values are represented, we first need to spend a few words on the double type. A double is short for double precision and it is the variable type used to represent (approximations to) the real numbers in a computer.

Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

\textrm{sign}\times 2^{e} \times 1.F .

The sign is represented by 1 bit, the exponent e by 11 bits and the mantissa F by 52 bits, so we have 64 bits in total. The special values NaN and \pm Inf are coded using a value of e that is not used to represent ordinary numbers: e=0x7ff (hexadecimal), with F=0 for \pm Inf and F\not=0 for NaN. The important thing is that, as long as it is nonzero, it does not matter what the value of F is when representing NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose F=1954 in the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.
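The bit layout can be inspected from R itself. The following is a minimal sketch, not R's own implementation: double_fields is a hypothetical helper that splits a double into its sign, its stored exponent field and its fraction. Note that the exponent field is stored with a bias of 1023, so the e=0x7ff mentioned above refers to the stored field, not to the power of two. The exact fraction bits of NA may differ on unusual platforms, so the comments only state the exponent field and the low 32-bit word.

```r
# Hypothetical helper: split a double into its IEEE 754 fields.
double_fields <- function(x) {
  # Big-endian dump: byte 1 holds the sign bit and the top 7 exponent bits.
  b <- as.integer(writeBin(x, raw(), size = 8, endian = "big"))
  list(
    sign     = b[1] %/% 128L,                        # 1 bit
    exponent = (b[1] %% 128L) * 16L + b[2] %/% 16L,  # 11 bits, biased by 1023
    # 52-bit fraction as a hex string (too wide for an R integer)
    fraction = sprintf("%x%s", b[2] %% 16L,
                       paste(sprintf("%02x", b[3:8]), collapse = "")),
    # Lowest 32 bits of the fraction, read back as an integer
    low_word = readBin(as.raw(b[5:8]), "integer", size = 4, endian = "big")
  )
}

# An ordinary number: 6.5 = +1 x 2^2 x 1.625, so the stored exponent is 2 + 1023.
f <- double_fields(6.5)
f$exponent - 1023L   # 2
f$fraction           # "a000000000000", i.e. 0.625 in binary

# NA and NaN share the all-ones exponent field; only NA carries 1954.
na  <- double_fields(NA_real_)
nan <- double_fields(NaN)
na$exponent          # 2047 (0x7ff)
na$low_word          # 1954
nan$low_word == 1954 # FALSE
```

This also explains the asymmetry between the two predicates at the R level: is.na(NaN) is TRUE, while is.nan(NA_real_) is FALSE.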

A fun question is why the R developers chose 1954. Any ol’ number would have been fine. Was it because

  • It’s the year of birth of one of the developers? (I couldn’t find a match here)
  • Alan Turing died in 1954? (macabre)
  • President Eisenhower met with aliens in 1954? (ehm…)
  • In 1954 Queen Elizabeth II became the first reigning monarch to visit Australia? (well…)

Leave an answer in the comments if you have a better idea…
