Hash Table Performance in R: Part II In Part I of this series, I explained how R hashed…

[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hash Table Performance in R: Part II

In Part I of this series, I explained how R hashed environments are superior to vectors or lists when you need a hash table for your work. I also teased that in this post I would explain the caveats associated with that choice, but I’m saving that for later as I want to share with you the fastest ways of operating on them.

Operations on Hash Tables

There are three main operations you want to perform on hash tables. Given a key (as a string) and a value:

  • INSERT will add the key/value pair to the hash table,
  • LOOKUP will search the hash table for a key and return a value if it exists,
  • and DELETE will delete a value based on the key.

There are other operations of some importance. For instance EXISTS is a special case of LOOKUP which returns true if the key exists and false otherwise. But the main operations are the performance heavyweights so we’ll restrict our discussion to those.

INSERT

So what’s the fastest way to insert a key into an R hashed environment? We could use the assign function which will assign a variable specified as a string along with a value into an environment, but we could also use an indexing construct available to other R objects. Turns out this is not really documented where one would expect. Anyone have a reference?

Using assign to insert

# Our hash table ht is a hashed environment
ht <- new.env(); key <- "foo"; value <- 1

assign(key, value, ht, inherits=FALSE) # only look in this environment, not in the parent
ls.str(ht)

## foo :  num 1

Using the [[ construct

ht <- new.env(); key <- "foo"; value <- 1

ht[[key]] <- value
ls.str(ht)

## foo :  num 1

$ Doesn’t work as expected

ht <- new.env(); key <- "foo"; value <- 1

ht$key <- value # works, but not what you expect
ls.str(ht)

## key :  num 1

Uh oh! Our key is “foo” not “key”. R doesn’t expand the variable key to “foo” like we want, so this is not helpful.

We could do this:

ht <- new.env(); key <- "foo"; value <- 1

ht$"foo" <- value # works, but have you seen anything uglier?
ls.str(ht)

## foo :  num 1

but it’s more likeley that your key will already be assigned to a variable and not expressed as a constant directly in your code.

Total Fail on [

ht <- new.env(); key <- "foo"; value <- 1

ht[key] <- value # error

## Error in ht[key] <- value: object of type 'environment' is not subsettable

So we’ve got two ways to perform lookup, but which is faster?

Our test will compare the call to two similarly designed functions but with one using assign and the other using [[ to perform INSERT.

library(microbenchmark) # on CRAN, install with install.packages('microbenchmark')

INSERT_assign <- function(){
  ht <- new.env()

  assign("foo",1,ht,inherits=FALSE)
}

INSERT_subset <- function(){
  ht <- new.env()

  ht[["foo"]] <- 1
}

# Perform a high number of each, 10L^5. Might need more if you have a fast machine
microbenchmark(INSERT_assign(), INSERT_subset(), times=10L^5)

## Unit: microseconds
##             expr   min    lq     mean median    uq      max neval
##  INSERT_assign() 1.435 1.718 2.204589  1.899 2.138 3073.382 1e+05
##  INSERT_subset() 1.105 1.313 1.670967  1.470 1.663 1331.484 1e+05

Clear winner with [[!

LOOKUP

For constructing this operation, let’s compare the R function get which works with environments to the [[ construct. $ and [ obviously won’t work. We’ll set up our test just like the one in INSERT where the functions perform equivalent work except for the expression we want to test.

# Each returns the value of the key "foo"

LOOKUP_get <- function(){
  ht <- new.env()
  ht[["foo"]] <- 1

  get("foo",envir=ht,inherits=FALSE)
}

LOOKUP_subset <- function(){
  ht <- new.env()
  ht[["foo"]] <- 1

  ht[["foo"]]
}

microbenchmark(LOOKUP_get(), LOOKUP_subset(), times=10L^5)

## Unit: microseconds
##             expr   min    lq     mean median    uq      max neval
##     LOOKUP_get() 1.835 2.196 2.758611  2.417 2.661 2914.613 1e+05
##  LOOKUP_subset() 1.260 1.518 1.876052  1.687 1.863 1319.268 1e+05

And again, [[ is faster!

DELETE

So how does one delete a key/value pair from R hashed environments? Turns out there’s only one way: use rm as assigning NULL won’t work (like it would with lists):

ht <- new.env(); ht[["foo"]] <- 1

ht[["foo"]] <- NULL
ls.str(ht)

## foo :  NULL

rm(list="foo",envir=ht,inherits=FALSE)
ls.str(ht) # Produces nothing

In Conclusion

If you are going to use R hashed environments as a hash table, then it would behoove you to construct the main operations INSERT, LOOKUP, and DELETE with [[ rather than assign and get like so:

# Simple Hash Table Interface for R

HASH_TABLE <- function() new.env()
INSERT <- function(key, value, ht)  ht[[key]] <- value
LOOKUP <- function(key, ht) ht[[key]]
DELETE <- function(key, ht) rm(list=key,envir=ht,inherits=FALSE)

But! I’ll leave you with one last performance comparison:

ht <- HASH_TABLE()
INSERT("key",1L,ht)
ls.str(ht)

## key :  int 1

microbenchmark(
  LOOKUP("foo",ht)==1L,
  ht[["foo"]]==1L,
  times=10L^5
)

## Unit: nanoseconds
##                     expr min  lq     mean median  uq     max neval
##  LOOKUP("foo", ht) == 1L 410 457 633.9799    551 665 1036167 1e+05
##        ht[["foo"]] == 1L 172 198 258.6643    215 288   34987 1e+05

So is an interface really necessary?

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)