Hash Table Performance in R: Part IV

April 21, 2015
By Jeffrey Horner

In the last post I introduced the envestigate package, which exposes the hash table underlying an R environment along with some interesting statistics about it. Now I want to show you some performance characteristics of the R environment as a hash table.

I’ll be using a synthetic list of distinct words that mimics real-world data scraped from Wikipedia; you can scope it out here.
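If you don’t have that file handy, a comparable stand-in is easy to synthesize. The sketch below is an assumed substitute, not the actual Wikipedia-derived word list:

# Assumed stand-in: roughly 1 million distinct 8-letter "words"
set.seed(1)
m     <- matrix(sample(letters, 8 * 1.1e6, replace = TRUE), ncol = 8)
words <- unique(do.call(paste0, as.data.frame(m)))
writeLines(head(words, 1e6), 'DISTINCT.1mil')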

Hash Table Construction Time

As I alluded to in Part III, while R allows you to set the size of an environment with new.env(size=X), it will automatically resize the internal hash table once the load factor reaches a certain threshold (0.85, to be exact). Let’s see how that affects hash table construction time.
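To watch a resize happen on a small scale, here’s a minimal sketch. It relies only on the $size field that hash_table() is shown returning below, and it makes no assumption about exactly which insert crosses the threshold; it simply prints whenever the internal table grows:

library(envestigate)

# Start with a deliberately small pre-sized environment
e <- new.env(size = 100L)

last_size <- hash_table(e)$size
for (i in 1:300) {
  e[[paste0("key", i)]] <- TRUE      # insert one key
  s <- hash_table(e)$size
  if (s != last_size) {              # the internal table grew
    cat("resize after insert", i, ":", last_size, "->", s, "\n")
    last_size <- s
  }
}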

The code below creates a hash table and inserts each word from the DISTINCT.1mil set. We accumulate the time it takes to insert words, 1000 at a time, and we also note the size of the hash table after every 1000 inserts. This will reveal an interesting spike that occurs every time the hash table increases in size. Other spikes appear as well, and I suspect those are due to garbage collection.

library(envestigate)
library(readr); library(dplyr)
library(ggplot2); library(scales)

# Character vector of distinct words
DISTINCT <- read_lines('DISTINCT.1mil')
len_DISTINCT <- length(DISTINCT)

# Record data every ith_insert inserts
ith_insert <- 10^3

# Number of observations
n_observations <- len_DISTINCT/ith_insert

# Collect these items every ith_insert inserts
insert         <- numeric(n_observations)
size           <- numeric(n_observations)
construct_time <- numeric(n_observations)

# New environment with default hash size of 29L
e <- new.env()

i <- j <- 1L
elapsed_time <- 0.0
while(i <= len_DISTINCT){
  t1 <- proc.time()['elapsed']
  # Insert word into environment
  e[[DISTINCT[i]]] <- 1
  t2 <- proc.time()['elapsed']

  # t2 - t1 = time to insert one word
  # elapsed_time = accumulated time of inserts
  elapsed_time <- elapsed_time + (t2 - t1)

  # Every ith_insert inserts, collect data
  if (i %% ith_insert == 0){
    ht <- hash_table(e) # gather size of current hash table
    insert[j]         <- i
    size[j]           <- ht$size
    construct_time[j] <- elapsed_time

    elapsed_time <- 0.0
    j <- j + 1L
  }
  i <- i + 1L
}
# Our data frame of results
res <- data_frame(insert, size, construct_time)
res
## Source: local data frame [1,000 x 3]
## 
##    insert size construct_time
## 1    1000  849          0.006
## 2    2000 1758          0.009
## 3    3000 2530          0.004
## 4    4000 3643          0.005
## 5    5000 4371          0.003
## 6    6000 5245          0.005
## 7    7000 6294          0.007
## 8    8000 7552          0.011
## 9    9000 7552          0.003
## 10  10000 7552          0.003
## ..    ...  ...            ...
# Use an alternating two-color scheme to denote when the hash size changes
num_uniq_sizes <- length(unique(res$size))
cols <- rep(hue_pal()(2),ceiling(num_uniq_sizes/2))

p <- ggplot(res, aes(x=insert,y=construct_time,color=factor(size))) +
  geom_line(size=2) + scale_color_manual(values=cols) +
  guides(col=guide_legend(ncol=2,title="Hash Size")) +
  labs(x="Number of Elements in Environment", y="Construct Time (seconds)") +
  theme( axis.text = element_text(colour = "black")) +
    ggtitle('Environment Construct Time')
p

Some interesting takeaways:

  • There’s a clear spike in construct time every time the line color changes, signifying a change in the hash size. We presume the hash resize occurs during the spike, and the spikes also appear to grow roughly linearly, which would make sense if a resize rehashes every existing entry.
  • Other spikes appear even when the hash size doesn’t change. We presume those are due to the garbage collector either getting rid of old R objects or allocating more heap space.
  • The length of the colored lines along the x axis grows as more inserts occur, meaning the table goes longer between resizes as it gets larger. That growth might be linear, but I haven’t looked at the R source code to confirm; a quick empirical check is sketched below.
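One quick way to probe that last point is to compare successive hash sizes in the res data frame we already built. A roughly constant ratio would point to geometric rather than linear growth; keep in mind that res only samples the size every 1000 inserts, so the early windows may span several resizes:

# Ratio of each observed hash size to the one before it
sizes <- unique(res$size)
round(sizes[-1] / sizes[-length(sizes)], 2)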
