**Blog - Data Science and the fst Package**, and kindly contributed to R-bloggers)

The *fst* package uses the xxHash algorithm for internal hashing of (meta-)data. With method *hash_fst* the user now has direct access to this extremely fast hashing algorithm.

# Hashing

Hashing is used to map data of any size to data of a (usually much smaller) fixed size. That’s very useful for creating fast lookup tables, a digital data signature or in cryptography. The *fst* package hashes all meta-data that is stored in the *fst* format and will support optional hashing of columnar data as well in the near future. Storing hashes in the format greatly adds to the security and stability as errors in the format can easily be detected by comparing hashes.

For hashing in *fst* the excellent and speedy xxHash algorithm is used, developed by Yann Collet. Method *hash_fst* provides direct access to the xxHash library and also includes a multi-threaded variant that hashes at extreme speeds.

# Multi-threaded hashing

To demonstrate the *hash_fst* interface, we use a 93 MB file downloaded from Kaggle.

```
# file downloaded from https://www.kaggle.com/stackoverflow/so-survey-2017
sample_file <- "survey_results_public.csv"
raw_vec <- readBin(sample_file, "raw", file.size(sample_file)) # read byte contents
# file size (in MB)
1e-6 * file.size(sample_file)
```

```
## [1] 93.09709
```

To calculate the hash value of data contained in the *raw_vec* vector, we use:

```
library(fst)
hash_fst(raw_vec)
```

```
## [1] 1853499107 -1914678989
```

The return value is a length two integer vector because the hashing algorithm is actually a 64-bit hashing algorithm (and a single integer occupies 32 bits in memory). Based on the already fast xxHash algorithm, the speed of the multi-threaded hash implementation in *fst* is pretty extreme:

```
threads_fst(8)
hash_timing <- microbenchmark(
hash_fst(raw_vec),
times = 1000
)
# hashing speed (GB/s)
as.numeric(object.size(raw_vec)) / median(hash_timing$time)
```

```
hash_timings[Threads == 8, Speed]
```

```
## [1] 46.72303
```

That’s a hashing speed of more than 46 GB/s!

# Dependency on number of cores

With a small benchmark, we can reveal how the multi-threaded hashing depends on the selected number of cores:

```
library(data.table)
# result table
bench <- as.list(rep(1, 20))
for (threads in 1:40) {
threads_fst(threads)
hash_timing <- microbenchmark(
hash_fst(raw_vec),
times = 1000
)
bench[[threads]] <- data.table(Threads = threads, Time = hash_timing$time)
}
bench <- rbindlist(bench)
```

The computer used for the benchmark has two Xeon E5 CPU’s (@2.5GHz) with 10 physical cores each. The results are displayed below:

*Figure 1: hashing speed vs the number of cores used for computation*

Using more than 8-10 threads doesn’t help performance. It’s clear that the Xeon is hitting other boundaries than computational speed, such as the maximum memory bandwidth and thread- or CPU synchronization issues. That’s also confirmed by the less than 100 percent pressure on CPU resources during the benchmark:

*Figure 2: CPU resources used during benchmark*

# Single threaded mode

Method *hash_fst* has a parameter *block_hash* that activates the multi-threaded hashing implementation. For compatibility with the default xxHash algorithm, *block_hash* can be set to *FALSE*. With that setting, the single threaded default (64-bit) xxHash mode is used.

This post is also available on R-bloggers

**leave a comment**for the author, please follow the link and comment on their blog:

**Blog - Data Science and the fst Package**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...