**R – Enchufa2**, and kindly contributed to R-bloggers)

The latest *R tip* in *Win-Vector Blog* encourages you to *Use Radix Sort* based on a simple benchmark showing a x35 speedup compared to the default method, but with no further explanation. In my opinion, though, the *complete* tip would be, instead, *use radix sort… if you know what you are doing*, because a quick benchmark shouldn’t spare you the effort of actually reading the docs. And here is a spoiler:

*you are already using it*.

One may wonder why R’s default sorting algorithm is so bad, and why was even chosen. The thing is that there is a trick here, and to understand it, first we must understand the benchmark’s data and then read the docs. This is the function from the original code (slightly modified for subsequent reuse) that generates the data:

```
mk_data <- function(nrow, stringsAsFactors = FALSE) {
alphabet <- paste("sym", seq_len(max(2, floor(nrow^(1/3)))), sep = "_")
data.frame(col_a = sample(alphabet, nrow, replace=TRUE),
col_b = sample(alphabet, nrow, replace=TRUE),
col_c = sample(alphabet, nrow, replace=TRUE),
col_x = runif(nrow),
stringsAsFactors = stringsAsFactors)
}
set.seed(32523)
d <- mk_data(1e+6)
summary(d)
```

```
## col_a col_b col_c
## Length:1000000 Length:1000000 Length:1000000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## col_x
## Min. :0.0000002
## 1st Qu.:0.2496717
## Median :0.4991010
## Mean :0.4996031
## 3rd Qu.:0.7494089
## Max. :0.9999999
```

`length(table(d$col_a))`

`## [1] 99`

There are three character columns sampled from 99 symbols (`sym_1`

, `sym_2`

, …, `sym_99`

) and a numeric column sampled from a uniform. The first three columns are thus clearly factors, but they are not treated as such. Let’s see now what `help(sort)`

has to tell us about the sorting method, which by default is `method="auto"`

:

The “auto” method selects “radix” for short (less than 2^31 elements) numeric vectors, integer vectors, logical vectors and factors; otherwise, “shell”.

So, as I said in the opening paragraph, you are *already* using radix sort, except for characters. Let’s see then what happens if we treat such columns as proper factors:

```
library(microbenchmark)
set.seed(32523)
d <- mk_data(1e+6, stringsAsFactors = TRUE)
timings <- microbenchmark(
order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), ,
drop = FALSE],
order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
method = "radix"), ,
drop = FALSE],
times = 10L)
print(timings)
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval
## order_default 289.4685 312.0257 388.5259 387.8308 418.2673 584.4771 10
## order_radix 265.6491 321.8337 421.2072 376.1166 512.0047 667.0545 10
## cld
## a
## a
```

Unsurprisingly, timings are the same, because R automatically selects `"radix"`

for you when appropriate. But when is it considered appropriate and why isn’t it appropriate *in general* for character vectors? We should go back to the docs:

The implementation is orders of magnitude faster than shell sort for character vectors, in part thanks to clever use of the internal CHARSXP table.

However, there are some caveats with the radix sort:

- If x is a character vector, all elements must share the same encoding. Only UTF-8 (including ASCII) and Latin-1 encodings are supported. Collation always follows the “C” locale.
- Long vectors (with more than 2^32 elements) and complex vectors are not supported yet.

An there it is: R is doing the right thing for you for the general case. So let us round up the tip: *enforce method="radix" for character vectors if you know what you are doing*. And, please, do read the docs.

*Article originally published in Enchufa2.es: Read the docs before questioning R’s defaults*.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Enchufa2**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...