Representation of numerical NA’s in R and the 1954 enigma

July 8, 2012

(This article was first published on Mark van der Loo, and kindly contributed to R-bloggers)

I've always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:

  • A missing integer is represented by the most negative 4-byte signed integer that your computer can represent.
  • A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?

Read on if you want to dig a little deeper.

Missing integers

As you may know, a lot of R's core is written in C. However, an int variable in C does not support the concept of a missing value. So R sets aside a single value of the integer range to represent a missing value. That value is INT_MIN (a C macro from limits.h), the most negative value that an int variable can hold. On most computers, an int is 32 bits (four 8-bit bytes). To make things easier, we'll assume that's always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].

(The range is asymmetric because 0 occupies the place of one positive number).
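The bounds are easy to check at the R prompt. This is a minimal sketch: the names n_bits, lower and upper are illustrative, and the arithmetic is done in doubles, which hold these values exactly.

```r
# Assumption from the text: an int is 32 bits wide.
n_bits <- 32

lower <- -2^(n_bits - 1)      # most negative value of a signed 32-bit int
upper <-  2^(n_bits - 1) - 1  # largest positive value

format(c(lower, upper), scientific = FALSE)

# Together with zero, the range covers every one of the 2^32 bit patterns.
upper - lower + 1 == 2^n_bits
```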

Now let's compare this with R's integer range. The maximum integer is easily found,
since it is stored in the hidden .Machine variable.

> .Machine$integer.max
[1] 2147483647

So this corresponds with C's INT_MAX. The largest negative integer is
not present in .Machine but we can do some tests:

# Store the second-smallest C integer. The trailing L forces the number
# to be "integer", not "numeric".
> x <- -2147483647L
> typeof(x)
[1] "integer"

# Adding an integer works fine, since we move further into the range:
> typeof(x + 1L)
[1] "integer"

# Subtracting an integer gives a warning telling us that the result is out of range:
> typeof(x - 1L)
[1] "integer"
Warning message:
In x - 1L : NAs produced by integer overflow

# Subtracting a non-integer 1 ("numeric") yields a non-integer:
> typeof(x - 1)
[1] "double"

The result of x - 1L is out of R's integer range. The integer range of R is \pm 2147483647: one integer fewer than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
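You can also look at the bytes directly. The sketch below uses base R's writeBin() and intToBits() to dump the missing integer; the variable names are illustrative, and it assumes a 32-bit int as above.

```r
# Serialise NA_integer_ to its 4 raw bytes. With big-endian order the
# byte that holds the sign bit comes first.
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
na_bytes
# [1] 80 00 00 00

# Read as an unsigned number, that bit pattern is 2^31, the magnitude of INT_MIN.
unsigned_value <- sum(as.integer(intToBits(NA_integer_)) * 2^(0:31))

# The smallest non-missing integer sits exactly one step above it.
min_bytes <- writeBin(-.Machine$integer.max, raw(), endian = "big")
min_bytes
# [1] 80 00 00 01
```

Only the sign bit is set in NA_integer_, which is the bit pattern of INT_MIN.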

Missing doubles

To explain how real (\mathbb{R}) missing values are represented, we first need to spend a few words on the double type. A double is short for double precision and it is the variable type used to represent (approximations to) the real numbers in a computer.

Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

\textrm{sign}\times 2^{e} \times 1.F .

The sign is represented by 1 bit, the exponent e by 11 bits and the mantissa F by 52 bits, so we have 64 bits in total. The special value NaN (and also \pmInf) is coded using a value of the exponent field that is not used to represent ordinary numbers. NaN is represented by e=0x7ff (hexadecimal) and F\not=0, while \pmInf uses the same e with F=0. The important thing is that any nonzero value of F represents NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose to store 1954 in the lower 32 bits of the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.
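The payload can be inspected from R itself. The following is a sketch, not R's own implementation: double_bytes and low_word are hypothetical helpers written for this illustration, built on base R's writeBin() and readBin().

```r
# Big-endian bytes of a double: sign and exponent first, low mantissa bits last.
double_bytes <- function(x) writeBin(x, raw(), endian = "big")

# Hypothetical helper: decode the last 4 bytes, i.e. the low 32 mantissa bits.
low_word <- function(x) {
  readBin(double_bytes(x)[5:8], what = "integer", size = 4L, endian = "big")
}

# The 11 exponent bits: the low 7 bits of byte 1 and the high 4 bits of byte 2.
exponent_field <- function(x) {
  b <- as.integer(double_bytes(x)[1:2])
  bitwAnd(b[1], 0x7fL) * 16L + bitwShiftR(b[2], 4L)
}

exponent_field(NA_real_)  # 2047, which is 0x7ff
low_word(NA_real_)        # 1954, the value R_IsNA looks for
low_word(NaN)             # an ordinary NaN does not carry the 1954 payload

is.na(NaN)                # TRUE: is.na() accepts any NaN
is.nan(NA_real_)          # FALSE: is.nan() excludes NA
```

The remaining mantissa bits may differ between platforms, which is why the sketch decodes only the exponent field and the low word and does not compare all eight bytes.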

A funny question is why did the R developers choose 1954? Any ol' number would have been fine. Was it because

  • It's the year of birth of one of the developers? (I couldn't find a match here)
  • Alan Turing died in 1954? (macabre)
  • Because president Eisenhower met with aliens in 1954? (ehm...)
  • In 1954 Queen Elizabeth II became the reigning monarch of Australia? (well...)

Leave an answer in the comments if you have a better idea...
