Read the docs before questioning R’s defaults

Posted on August 22, 2018 by Iñaki Úcar in R bloggers | 0 Comments

[This article was first published on R – Enchufa2, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The latest R tip in Win-Vector Blog encourages you to Use Radix Sort based on a simple benchmark showing a x35 speedup compared to the default method, but with no further explanation. In my opinion, though, the complete tip would be, instead, use radix sort… if you know what you are doing, because a quick benchmark shouldn’t spare you the effort of actually reading the docs. And here is a spoiler: you are already using it.

One may wonder why R’s default sorting algorithm is so bad, and why was even chosen. The thing is that there is a trick here, and to understand it, first we must understand the benchmark’s data and then read the docs. This is the function from the original code (slightly modified for subsequent reuse) that generates the data:

mk_data <- function(nrow, stringsAsFactors = FALSE) {
  alphabet <- paste("sym", seq_len(max(2, floor(nrow^(1/3)))), sep = "_")
  data.frame(col_a = sample(alphabet, nrow, replace=TRUE),
             col_b = sample(alphabet, nrow, replace=TRUE),
             col_c = sample(alphabet, nrow, replace=TRUE),
             col_x = runif(nrow),
             stringsAsFactors = stringsAsFactors)
}

set.seed(32523)
d <- mk_data(1e+6)

summary(d)
##     col_a              col_b              col_c          
##  Length:1000000     Length:1000000     Length:1000000    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      col_x          
##  Min.   :0.0000002  
##  1st Qu.:0.2496717  
##  Median :0.4991010  
##  Mean   :0.4996031  
##  3rd Qu.:0.7494089  
##  Max.   :0.9999999
length(table(d$col_a))
## [1] 99

There are three character columns sampled from 99 symbols (sym_1, sym_2, …, sym_99) and a numeric column sampled from a uniform. The first three columns are thus clearly factors, but they are not treated as such. Let’s see now what help(sort) has to tell us about the sorting method, which by default is method="auto":

The “auto” method selects “radix” for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, “shell”.

So, as I said in the opening paragraph, you are already using radix sort, except for characters. Let’s see then what happens if we treat such columns as proper factors:

library(microbenchmark)

set.seed(32523)
d <- mk_data(1e+6, stringsAsFactors = TRUE)

timings <- microbenchmark(
  order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), , 
                    drop = FALSE],
  order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
                        method = "radix"), ,
                  drop = FALSE],
  times = 10L)

print(timings)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##  order_default 289.4685 312.0257 388.5259 387.8308 418.2673 584.4771    10
##    order_radix 265.6491 321.8337 421.2072 376.1166 512.0047 667.0545    10
##  cld
##    a
##    a

Unsurprisingly, timings are the same, because R automatically selects "radix" for you when appropriate. But when is it considered appropriate and why isn’t it appropriate in general for character vectors? We should go back to the docs:

The implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

However, there are some caveats with the radix sort:

If x is a character vector, all elements must share the same encoding. Only UTF-8 (including ASCII) and Latin-1 encodings are supported. Collation always follows the “C” locale.

Long vectors (with more than 2^32 elements) and complex vectors are not supported yet.

An there it is: R is doing the right thing for you for the general case. So let us round up the tip: enforce method="radix" for character vectors if you know what you are doing. And, please, do read the docs.

Article originally published in Enchufa2.es: Read the docs before questioning R’s defaults.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Read the docs before questioning R’s defaults

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)