**mlampros**, and kindly contributed to R-bloggers)

The **nmslibR** package is a wrapper of *NMSLIB*, which according to the authors “… is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods”.

I’ve searched for some time (before wrapping NMSLIB) for a nearest neighbor library which can work with high dimensional data and can scale with big datasets. I’ve already written a package for k-nearest-neighbor search (KernelKnn), however, it’s based on brute force and unfortunately, it requires a certain computation time if the data consists of many rows. The *nmslibR* package, besides the main functionality of the NMSLIB python library, also includes an Approximate Kernel k-nearest function, which as I will show in the next lines is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found here.

The NMSLIB Library,

- is a collection of search methods for generic spaces
- has both metric and non-metric search algorithms
- has both exact and approximate search algorithms
- is an evaluation toolkit that simplifies experimentation and processing of results
- is extensible (new spaces and methods can be added)
- It was designed to be efficient

Details can be found in the NMSLIB-manual.

**The nmslibR package**

The *nmslibR* package includes the following R6-class / functions,

**class**

NMSlib |
---|

Knn_Query() |

knn_Query_Batch() |

save_Index() |

**functions**

KernelKnn_nmslib()

KernelKnnCV_nmslib()

dgCMatrix_2scipy_sparse()

mat_2scipy_sparse()

The package documentation includes details and examples for the R6-class and functions. I’ll start explaining how a user can work with sparse matrices as the input can also be a **python sparse matrix**.

**Sparse matrices as input**

The nmslibR package includes two functions (**mat_2scipy_sparse** and **dgCMatrix_2scipy_sparse**) which allow the user to convert from a *matrix* / *dgCMatrix* to a *scipy sparse matrix*,

```
library(nmslibR)
# conversion from a matrix object to a scipy sparse matrix
#----------------------------------------------------------
set.seed(1)
x = matrix(runif(1000), nrow = 100, ncol = 10)
x_sparse = mat_2scipy_sparse(x, format = "sparse_row_matrix")
print(dim(x))
[1] 100 10
print(x_sparse$shape)
(100, 10)
```

```
# conversion from a dgCMatrix object to a scipy sparse matrix
#-------------------------------------------------------------
data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
# by default column-oriented format
dgcM = Matrix::Matrix(data = data, nrow = 3,
ncol = 3, byrow = TRUE,
sparse = TRUE)
print(dim(dgcM))
[1] 3 3
x_sparse = dgCMatrix_2scipy_sparse(dgcM)
print(x_sparse$shape)
(3, 3)
```

**The NMSlib R6-class**

The parameter settings for the *NMSlib* R6-class can be found in the Non-Metric Space Library (NMSLIB) Manual, which explains the NMSLIB Library in detail. In the following code chunk, I’ll show the functionality of the methods included using a data set from my Github repository (it appears as .ipynb notebook in the nmslib Github repository)

```
library(nmslibR)
# download the data from my Github repository (tested on a Linux OS)
#-------------------------------------------------------------------
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt")
# load the data in the R session
#-------------------------------
sift_10k = read.table("~/sift_10k.txt", quote="\"", comment.char="")
# index parameters
#-----------------
M = 15
efC = 100
num_threads = 5
index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
'post' = 0, 'skip_optimized_index' = 1 )
# query-time parameters
#----------------------
efS = 100
query_time_params = list('efSearch' = efS)
# Number of neighbors
#--------------------
K = 100
# space to use
#---------------
space_name = 'l2sqr_sift'
# initialize NMSlib [ the data should be a matrix ]
#--------------------------------------------------
init_nms = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params,
Time_Params = query_time_params, space = space_name,
space_params = NULL, method = 'hnsw',
data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT',
index_filepath = NULL, print_progress = FALSE)
```

```
# returns a 1-dimensional vector
#-------------------------------
init_nms$Knn_Query(query_data_row = as.matrix(sift_10k[1, ]), k = 5)
```

```
[[1]]
[1] 2 6 4585 9256 140 # indices
[[2]]
[1] 18724 24320 68158 69067 70321 # distances
```

```
# returns knn's for all data
#---------------------------
all_dat = init_nms$knn_Query_Batch(as.matrix(sift_10k), k = 5, num_threads = 1)
str(all_dat)
```

```
# a list of indices and distances for all observations
#------------------------------------------------------
List of 2
$ knn_idx : num [1:10000, 1:5] 3 4 1 2 13 14 1 2 30 31 ...
$ knn_dist: num [1:10000, 1:5] 18724 14995 18724 14995 21038 ...
```

Details on the various methods and parameter settings can be found in the manual of the NMSLIB python Library.

**KernelKnn using the nmslibR package**

In the Vignette of the KernelKnn (*Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)*) package I experimented with the **mnist dataset** and a cross-validated kernel k-nearest-neighbors model gave **98.4 % accuracy** based on **HOG** (histogram of oriented gradients) features. However, it took almost **30 minutes** (depending on the system configuration) to complete using **6 threads**. I’ve implemented a similar function using NMSLIB (**KernelKnnCV_nmslib**), so in the next code chunk I’ll use the *same parameter setting* and I’ll compare *computation time* and *accuracy*.

First load the data,

```
# using system('wget..') on a linux OS
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
quote = "\"", sep = ",")
```

```
X = mnist[, -ncol(mnist)]
dim(X)
## [1] 70000 784
# the 'KernelKnnCV_nmslib' function requires that the labels are numeric and start from 1 : Inf
y = mnist[, ncol(mnist)] + 1
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958
# evaluation metric
acc = function (y_true, preds) {
out = table(y_true, max.col(preds, ties.method = "random"))
acc = sum(diag(out))/sum(out)
acc
}
```

then compute the HOG features,

```
library(OpenImageR)
hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)
##
## time to complete : 2.101281 secs
dim(hog)
## [1] 70000 324
```

then compute the **approximate** kernel k-nearest-neighbors using the **cosine** distance,

```
# parameters for 'KernelKnnCV_nmslib'
#------------------------------------
M = 30
efC = 100
num_threads = 6
index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
'post' = 0, 'skip_optimized_index' = 1 )
efS = 100
query_time_params = list('efSearch' = efS)
# approximate kernel knn
#-----------------------
fit_hog = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1,
weights_function = 'biweight_tricube_MULT',
Levels = sort(unique(y)), Index_Params = index_params,
Time_Params = query_time_params, space = "cosinesimil",
space_params = NULL, method = "hnsw", data_type = "DENSE_VECTOR",
dtype = "FLOAT", index_filepath = NULL, print_progress = FALSE,
num_threads = 6, seed_num = 1)
# cross-validation starts ..
# |=================================================================================| 100%
# time to complete : 32.88805 secs
str(fit_hog)
```

```
List of 2
$ preds:List of 4
..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
..$ : num [1:17500, 1:10] 0 0 0 0 1 ...
..$ : num [1:17500, 1:10] 0 0 0 0 0 ...
..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
$ folds:List of 4
..$ fold_1: int [1:17500] 49808 21991 42918 7967 49782 28979 64440 49809 30522 36673 ...
..$ fold_2: int [1:17500] 51122 9469 58021 45228 2944 58052 65074 17709 2532 31262 ...
..$ fold_3: int [1:17500] 33205 40078 68177 32620 52721 18981 19417 53922 19102 67206 ...
..$ fold_4: int [1:17500] 28267 41652 28514 34525 68534 13294 48759 47521 69395 41408 ...
```

```
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
function(x) acc(y[fit_hog$folds[[x]]],
fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.9768000 0.9786857 0.9763429 0.9760000
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.9769571
```

It took approx. **33 seconds** to return with an accuracy of **97.7 %** . Almost **47 times faster** than KernelKnn’s corresponding function (brute force) with a **slight lower accuracy** rate (the *braycurtis* distance metric might be better suited for this dataset).

I also run the corresponding brute-force algorithm of the NMSLIB Library by setting the *method* parameter to **seq_search**,

```
# brute force of NMSLIB [ here we set 'Index_Params' and 'Time_Params' to NULL ]
#----------------------
fit_hog_seq = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1,
weights_function = 'biweight_tricube_MULT',
Levels = sort(unique(y)), Index_Params = NULL,
Time_Params = NULL, space = "cosinesimil",
space_params = NULL, method = "seq_search",
data_type = "DENSE_VECTOR", dtype = "FLOAT",
index_filepath = NULL, print_progress = FALSE,
num_threads = 6, seed_num = 1)
# cross-validation starts ..
# |=================================================================================| 100%
# time to complete : 4.506177 mins
acc_fit_hog_seq = unlist(lapply(1:length(fit_hog_seq$preds),
function(x) acc(y[fit_hog_seq$folds[[x]]],
fit_hog_seq$preds[[x]])))
acc_fit_hog_seq
## [1] 0.9785143 0.9802286 0.9783429 0.9784571
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog_seq), '\n')
## mean accuracy for hog-features using cross-validation : 0.9788857
```

The brute-force algorithm of the NMSLIB Library is almost **6 times faster** than KernelKnn giving an accuracy of approx. **97.9 %**.

The *README.md* file of the *nmslibR* package includes the SystemRequirements and installation instructions.

An updated version of the nmslibR package can be found in my Github repository and to report bugs/issues please use the following link, https://github.com/mlampros/nmslibR/issues.

**leave a comment**for the author, please follow the link and comment on their blog:

**mlampros**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...