R in a 64 bit world
32 bit data structures (pointers, integer representations, single precision floating point) have been past their “best before date” for quite some time. R itself moved to a 64 bit memory model some time ago, but still has only 32 bit integers. This is going to get more and more awkward going forward. What is R doing to work around this limitation?
We discuss this in this article, the first of a new series of articles discussing aspects of “R as it is.”
Currently R’s only integer data type is a 32 bit signed integer. Such integers can only count up to about 2 billion. This range is in fact ridiculously small and unworkable. Some examples:
- The human population on the earth has been over 2 billion humans since around 1930.
- The U.S. Department of the Treasury prints over 2 billion $1 bills per year.
- Gangnam Style by Psy has been viewed well over 2 billion times.
- An obsolete computer can count through this set of values in around 1 second.
- It is becoming more and more likely that somebody will share data containing 64 bit integers with you (as many other languages do have a 64 bit integer data type).
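You can see the limit directly in R (this is just the standard .Machine constant, nothing package-specific):

```r
# R integers are 32 bit signed values; .Machine documents the ceiling
.Machine$integer.max
## [1] 2147483647
.Machine$integer.max + 1L   # integer overflow: R returns NA with a warning
## [1] NA
```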
What can we do about this in R?
Note we are not talking about big-data issues (which are addressed in R with tools like ff, data.table, dplyr, databases, streaming algorithms, Hadoop, see Revolutions big data articles for some tools and ideas). Fortunately, when working with data you really want to do things like aggregate, select, partition, and compute derived columns. With the right primitives you can (and should) perform all of these efficiently without ever referring to explicit row indices.
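As a quick sketch of that last point, here is one way to partition, aggregate, and derive columns without ever touching a row index, using dplyr (the tiny data frame d and its columns are made up purely for illustration):

```r
library(dplyr)

d <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 3))

d %>%
  mutate(doubled = 2 * value) %>%     # derived column
  group_by(group) %>%                 # partition
  summarise(total = sum(doubled))     # aggregate -- no explicit row indices anywhere
```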
What we are talking about is the possible embarrassment of:
- Not being able to represent 64 bit IDs as anything other than strings.
- Not being able to represent a 65536 by 65536 matrix in R (because R matrices are actually views over single vectors).
- Not being able to index into 3 billion doubles (or about $300 worth of memory; see the quick arithmetic after this list).
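The arithmetic behind the last two items is plain R, assuming only 8 byte doubles:

```r
65536 * 65536      # cells in a 65536 by 65536 matrix: 2^32, more than 2^31 - 1
## [1] 4294967296
3e9 * 8 / 2^30     # 3 billion 8 byte doubles, in GiB
## [1] 22.35174
```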
Now R actually dodged these bullets, but without introducing a proper 64 bit integer type. Let’s discuss how.
First, a lot of people think R has a 64 bit integer type. This is because R's notation for integer constants, "L", looks like Java's notation for longs (64 bit), but it probably derives from C's notation for long ("at least 32 bits"). But feast your eyes on the following R code:
```r
3000000000L
## [1] 3e+09
## Warning message:
## non-integer value 3000000000L qualified with L; using numeric value
```
Yes, “L” only means integer in R.
What the R designers did to delay paying the price of having only 32 bit integers was to allow doubles to be used as array indices (and as the return value of length())! Take a look at the following R code:
```r
c(1,2,3)[1.7]
## [1] 1
```
It looks like this was one of the big changes in moving from R 2.15.3 to R 3.0.1 in 2013 (see here). However, it feels dirty; in a more perfect world the above code would throw an error. This puts R in league with languages that force everything to be represented in way too few base types (JavaScript, Tcl, and Perl). IEEE 754 doubles carry a 53 bit mantissa (separate from the sign and exponent), so with a proper floating point implementation we expect a double to faithfully represent the integer range -2^53 through 2^53, but only as long as you don't accidentally convert to, or round-trip through, a string/character type.
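A quick check of that 2^53 boundary, using ordinary double arithmetic and no packages:

```r
(2^53 - 2) + 1 == 2^53 - 1   # TRUE: integer arithmetic is still exact below 2^53
## [1] TRUE
2^53 + 1 == 2^53             # also TRUE: 2^53 + 1 is not representable, it rounds back down
## [1] TRUE
```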
One of the issues is that underlying C and Fortran code (often used to implement R packages/libraries) are not going to be able to easily use longs as indices. However, I still would much prefer the introduction of a proper 64 bit integer type.
Of course Java is in a much worse position going forward than R. Because of Java's static type signatures, any class that implements the Collection interface is stuck with "int size()" pretty much forever (this includes ArrayList, Vector, List, Set, and many more). In much better shape is Python, which has been working on unifying ints and longs since 2001 (PEP 237) and uses only arbitrary precision integers in Python 3 (so it is just a matter of moving people from Python 2 to Python 3).
Enough about sizes and indexing; let's talk a bit about representation. What should we do if we try to import data and we know one of the columns holds 64 bit integers (assuming we are lucky enough to detect this and the column doesn't get converted in a non-reversible way to doubles)?
R has always been a bit “loosey-goosey” with ints. That is why you see weird stuff in summary:
```r
summary(55555L)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55560   55560   55560   55560   55560   55560
```
Here we are told that the integer 55555 is in the range 55560 to 55560 (in hindsight we can see R is printing as if the data were floating point and then, adding insult to injury, not signaling its use of four significant figures by having the decency to switch into scientific notation). This is also why I don't really trust R numerics to reliably represent integers like 2^50: some function accidentally round-trips your value through a string representation and back (such as reading/writing data to a CSV/TSV table) and you may not get back the value you started with, for the worst possible reason: you never wrote the correct value out.
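Here is a small demonstration of that round trip problem. It relies only on the documented fact that as.character() represents doubles to about 15 significant digits (and writers such as write.table() go through a similar text conversion):

```r
x <- 2^50 + 1                      # an integer valued double with 16 decimal digits
as.character(x)                    # only about 15 significant digits survive
as.numeric(as.character(x)) == x   # the round trip does not give back what we started with
## [1] FALSE
```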
In fact it becomes a bit of a bother to even check whether a floating point number represents an integer. Some standard advice is to check if floor(a)==ceiling(a), which works until it fails:
```r
a <- 2^50
b <- 2^50 + 0.5
c <- 5^25 + 1.5
for(v in c(a,b,c)) {
  print(floor(v)==ceiling(v))
}
## [1] TRUE
## [1] FALSE
## [1] TRUE
```
What went wrong for "c" is that "c" is an integer; it just isn't the number we typed in (due to the use of floating point). The true value is seen as follows:
```r
# Let's try the "doubles are good enough" path
# 5^25 < 2^60, so it would fit in a 64 bit integer
format(5L^25L, digits=20)
## [1] "298023223876953152"
# doesn't even end in a 5 as powers of 5 should

format(5^25 + 1.5, digits=20)
## [1] "298023223876953152"
# and can't see our attempt at addition

# let's get the right answer
library('Rmpfr')
mpfr(5, precBits=200)^25
## [1] 298023223876953125
# ends in 5 as we expect

mpfr(5, precBits=200)^25 + 1.5   # obviously not an integer!
## [1] 298023223876953126.5
```
Something like the following is probably pretty close to the correct test function:
```r
is.int <- function(v) {
  is.numeric(v) & v > -2^53 & v < 2^53 & (floor(v)==ceiling(v))
}
```
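Applied to the earlier examples (just a quick check of the sketch above):

```r
is.int(2^50)         # TRUE:  an exactly represented integer value
is.int(2^50 + 0.5)   # FALSE: not integral
is.int(2^60)         # FALSE: integral, but too large for doubles to represent faithfully
```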
But that (or whatever IEEE math library function actually does this) is hard to feel very good about. The point is we should not have to study What Every Computer Scientist Should Know About Floating-Point Arithmetic when merely trying to index into arrays. However, every data scientist should read this paper to understand some of the issues of general numeric representation and manipulation!
What are our faithful representation options in R?
- Force to strings (and pray they don’t try to convert to factors).
- Try to use doubles (this is what happens if you don’t know about the column, and will irreversibly mangle IDs).
- Try a package like Google's int64 package (kicked off CRAN in 2012 for lack of maintenance).
- Try a big integer package such as gmp, or a special math package such as Rmpfr (see the short sketch after this list).
Our advice is to first try representing 64 bit integers as strings. For functions like read.table() this means setting as.is to TRUE for the appropriate columns, rather than trying to convert a column back to string after it has already been damaged by the reader.
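A minimal sketch of that advice; the file ids.tsv and its column names are hypothetical:

```r
# Force the 64 bit ID column to stay character so its digits are never rounded
d <- read.table("ids.tsv", header = TRUE, sep = "\t", as.is = TRUE,
                colClasses = c(id = "character", value = "numeric"))
str(d$id)   # chr: the full ID, never parsed into a double
```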
And this is our first taste of “R as it is.”
(Thank you to Joseph Rickert and Nina Zumel for helpful comments on earlier drafts of this article.)