Hash Table Performance in R: Part III In Part I of this series, I explained how R hashed…

April 17, 2015

(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

Hash Table Performance in R: Part III

Part I
of this series, I explained how R hashed environments are superior
to vectors or lists when you’re in need of an associative array. In
Part II
I explained the three main operations you want to perform on hash tables
and how you should implement them for optimal performance.

This time around I want to introduce a package I’ve created that “peaks behind the curtain” of R environments, which I’ll use later to evaluate more performance characteristics.

Under Envestigation

The envestigate package
exposes the hash table and its statistics that back the R environment. I
consider this an experimental package so I don’t intend to
push it to CRAN any time soon. However it properly uses the R C API and
was developed the proper way, so it should
work on any late model R version across all platforms.

library(devtools) # Make sure you've already installed this

Hash Table Implementation

Hash tables are commonly implemented as an array of buckets with each bucket containing a linked list. They use a hash function to compute an index into the array of buckets so that the correct value can be found. A perfect hash function ‘hashes’ each key to a unique bucket to store the value, but they are rarely perfect. It is stated that when two keys hash to the same bucket a collision occures, and collisions are resolved by traversing the bucket’s linked list until the correct value is found.

In R, environment hash tables are a vector of lists where each
element of the list contains a binding from symbol to value. The
envestigate::hash_table function transforms that structure into a list
of character vectors where each element contains only the symbol name as a string and
ignores the value, thus exposing the internal structure so we can explore:

 # e - an environment who's hash table has 5 buckets
e <- new.env(size=5L)

e[['Tukey']] <- 1 # Values aren't important
e[['Fisher']] <- 1
e[['Bayes']] <- 1
e[['Pearson']] <- 1

ht <- envestigate::hash_table(e)

# List the bucket contents
## [[1]]
## [1] "Bayes" "Tukey"
## [[2]]
## [[3]]
## [[4]]
## [1] "Pearson"
## [[5]]
## [1] "Fisher"

So we see that the internal hash table of 5 buckets has two empty buckets, bucket 2 and 3, and 3 non-empty buckets with bucket 1 containing two elements.

hash_table function

Let’s explore the ‘hash_table’ object a little further. The envestigate package provides a simple print function for hash table objects:

## $size
## [1] 5
## $num_entries
## [1] 4
## $num_collisions
## [1] 1
## $num_empty_buckets
## [1] 2
## $num_non_empty_buckets
## [1] 3
## $load_factor
## [1] 0.6
## $max_bucket_length
## [1] 2


  • size – Number of buckets in the hash table, e.g. the size of the hash table.
  • num_entries – Total number of entries (or elements or variables) in the hash table.
  • num_collisions – Total number of collisions, counted as the sum of 1 – length of each list.
  • num_empty_buckets – Total number of buckets that contain zero elements.
  • num_non_empty_buckets – Total number of buckets that contain at least one element.
  • load_factor – Ratio of non empty buckets to size.
  • max_bucket_length – Length of bucket with the most elements.

And there’s more. The following items are also part of the object but are typically too large for a simple print summary. They are:

  • buckets – List of the buckets and their associated lists, which we saw above.
  • empty_buckets – Vector of indexes for the empty buckets.
  • non_empty_buckets – Vector of indexes for non empty buckets.
  • bucket_counts – Named intenger vector where names are indexes of and elements are the counts of each bucket.
  • bucket_size_dist – Data frame histogram of the number of buckets for each unique bucket size.

A note on the load factor. It is typically defined in the literature as the total number of elements (or entries) in the hash table divided by the size of the hash table. However, computing that value requires a complete scan of the table and this can have performance consequenses for instance when the table needs to be resized. So just using a ratio of non empty buckets to size is a cheaper to compute approximation. This is how R determines when to resize a hash table.

Computing Hash Values

envestigate also exposes the hash function that R uses to compute the hash value for strings, values that are used to map the strings to their buckets. Defining hash functions for computing hash values, a well researched topic, is beyond the scope of this note so I’ll just point to R’s implementation
here, authored by Peter J. Weinberger.

The function envestigate::hash_value maps character vectors to integer values:

##     Tukey    Fisher     Bayes   Pearson 
##   6013385  80780994   4755395 112761358

So those hash values are pretty meaningless at this point as you can tell non of them are in the 1 to 5 range, the size of our hash table. But they are a consistent, reproducible, one to one mapping from unique string to unique value, and we can use that result to compute the bucket index using the %% or modulo operator:

(hash_value(c('Tukey','Fisher','Bayes','Pearson')) %% ht$size) + 1
##   Tukey  Fisher   Bayes Pearson 
##       1       5       1       4

or use the convenient envestigate::hash_index function:

##   Tukey  Fisher   Bayes Pearson 
##       1       5       1       4

Now we can compute on R’s internal environment hash tables just like R. For instance looking up a variable in an environment is nearly equivalent to this operation for looking up the bucket for the string ‘Tukey’:

index <- hash_index('Tukey',ht)
bucket <- ht$buckets[[ index ]]
## [1] "Bayes" "Tukey"
for (i in 1:length(bucket)){
  if ('Tukey' == bucket[i]){
    cat('Found Tukey in bucket',index,'and list element',i,'!n')
## Found Tukey in bucket 1 and list element 2 !

So envestigate away, and share what you come up with!

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)