(This article was first published on the blog of **Jeffrey Horner**, and kindly contributed to R-bloggers)

In the last post I introduced the **envestigate** package, which exposes the hash table structure and interesting statistics associated with an R environment. Now I want to show you some performance characteristics of the R environment as a hash table.

I’ll be using a synthetic list of distinct words that mimics real-world data scraped from Wikipedia, which you can scope out here.

## Hash Table Construction Time

As I alluded to in Part III, while R allows you to set the initial size of an environment with **new.env(size=X)**, it automatically resizes the internal hash table once the load factor reaches a certain threshold (0.85, to be exact). Let’s see how that affects hash table construction time.
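To make that threshold concrete, here is a small base R sketch of the arithmetic (the 0.85 figure and the default initial size of 29 come from the discussion here; the exact comparison R uses internally may differ slightly, so treat the insert count as approximate):

```
# With a default initial size of 29 slots and a resize threshold
# of 0.85, the first resize should be triggered at roughly the
# point where entries / size first exceeds 0.85.
initial_size <- 29
threshold <- 0.85
first_resize_at <- ceiling(initial_size * threshold)
first_resize_at          # 25
25 / 29 > threshold      # TRUE: the 25th insert pushes the load factor past 0.85
```

So even a modest number of inserts into a default-sized environment will walk through many resizes before reaching a million elements.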

The code below creates a hash table and inserts each word from the DISTINCT.1mil set, accumulating the time it takes to insert words 1,000 at a time and noting the size of the hash table after every 1,000 inserts. This will reveal an interesting spike that occurs every time the hash table increases in size. Other spikes appear as well, and I suspect those are due to garbage collection.

```
library(envestigate)
library(readr); library(dplyr)
library(ggplot2); library(scales)

# Character vector of distinct words
DISTINCT <- read_lines('DISTINCT.1mil')
len_DISTINCT <- length(DISTINCT)

# Accumulate data each ith_insert
ith_insert <- 10^3

# Number of observations
n_observations <- len_DISTINCT/ith_insert

# Collect these items each ith_insert
insert <- numeric(n_observations)
size <- numeric(n_observations)
construct_time <- numeric(n_observations)

# New environment with default hash size of 29L
e <- new.env()

i <- j <- 1L
elapsed_time <- 0.0
while (i <= len_DISTINCT){
  t1 <- proc.time()['elapsed']
  # Insert word into environment
  e[[DISTINCT[i]]] <- 1
  t2 <- proc.time()['elapsed']

  # t2 - t1 = time to insert one word
  # elapsed_time = accumulated time of inserts
  elapsed_time <- elapsed_time + (t2 - t1)

  # ith_insert occurs here, collect data
  if (i %% ith_insert == 0){
    ht <- hash_table(e) # gather size of current hash table
    insert[j] <- i
    size[j] <- ht$size
    construct_time[j] <- elapsed_time
    elapsed_time <- 0.0
    j <- j + 1L
  }
  i <- i + 1L
}

# Our data frame of results
res <- data_frame(insert, size, construct_time)
res
```

```
## Source: local data frame [1,000 x 3]
##
##    insert size construct_time
## 1    1000  849          0.006
## 2    2000 1758          0.009
## 3    3000 2530          0.004
## 4    4000 3643          0.005
## 5    5000 4371          0.003
## 6    6000 5245          0.005
## 7    7000 6294          0.007
## 8    8000 7552          0.011
## 9    9000 7552          0.003
## 10  10000 7552          0.003
## ..    ...  ...            ...
```

```
# Use a bimodal color scheme to denote when the hash size changes
num_uniq_sizes <- length(unique(res$size))
cols <- rep(hue_pal()(2), ceiling(num_uniq_sizes/2))
p <- ggplot(res, aes(x = insert, y = construct_time, color = factor(size))) +
  geom_line(size = 2) + scale_color_manual(values = cols) +
  guides(col = guide_legend(ncol = 2, title = "Hash Size")) +
  labs(x = "Number of Elements in Environment", y = "Construct Time (seconds)") +
  theme(axis.text = element_text(colour = "black")) +
  ggtitle('Environment Construct Time')
p
```

Some interesting takeaways:

- There’s a clear spike in construct time every time the line color changes, signifying a change in the hash size; we presume the resize occurs during that spike. Construct time also appears to grow roughly linearly with the number of elements.
- Other spikes appear even when the hash size doesn’t change. We presume those are due to the garbage collector, either reclaiming old R objects or allocating more heap space.
- The colored segments grow longer along the x axis as more inserts occur. The growth might be linear, but I haven’t looked at the R source code to confirm.
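One practical implication of those resize spikes (a sketch of my own, not part of the benchmark above): if you know roughly how many elements you will insert, you can pre-size the environment so the load factor never crosses the 0.85 threshold, avoiding internal resizes entirely. The sizing arithmetic below is my own, derived from the threshold discussed earlier:

```
# Hypothetical number of elements we plan to insert
n <- 1e6

# Choose an initial size large enough that n inserts keep the
# load factor strictly below the 0.85 resize threshold.
initial_size <- as.integer(ceiling(n / 0.85)) + 1L
e <- new.env(size = initial_size)

# n / initial_size < 0.85, so inserting n elements into `e`
# should never trigger an internal resize.
(n / initial_size) < 0.85  # TRUE
```

This trades a larger up-front allocation for smoother, spike-free insert times.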
