Non Metric Space (Approximate) Library in R

[This article was first published on mlampros, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The nmslibR package is a wrapper of NMSLIB, which according to the authors “… is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods”.

I’ve searched for some time (before wrapping NMSLIB) for a nearest neighbor library which can work with high dimensional data and can scale with big datasets. I’ve already written a package for k-nearest-neighbor search (KernelKnn), however, it’s based on brute force and unfortunately, it requires a certain computation time if the data consists of many rows. The nmslibR package, besides the main functionality of the NMSLIB python library, also includes an Approximate Kernel k-nearest function, which as I will show in the next lines is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found here.

The NMSLIB Library,

  • is a collection of search methods for generic spaces
  • has both metric and non-metric search algorithms
  • has both exact and approximate search algorithms
  • is an evaluation toolkit that simplifies experimentation and processing of results
  • is extensible (new spaces and methods can be added)
  • It was designed to be efficient

Details can be found in the NMSLIB-manual.

The nmslibR package

The nmslibR package includes the following R6-class / functions,

class

NMSlib
Knn_Query()
knn_Query_Batch()
save_Index()

functions

KernelKnn_nmslib()

KernelKnnCV_nmslib()

dgCMatrix_2scipy_sparse()

mat_2scipy_sparse()

The package documentation includes details and examples for the R6-class and functions. I’ll start explaining how a user can work with sparse matrices as the input can also be a python sparse matrix.

Sparse matrices as input

The nmslibR package includes two functions (mat_2scipy_sparse and dgCMatrix_2scipy_sparse) which allow the user to convert from a matrix / dgCMatrix to a scipy sparse matrix,

<span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">nmslibR</span><span class="p">)</span><span class="w">

</span><span class="c1"># conversion from a matrix object to a scipy sparse matrix</span><span class="w">
</span><span class="c1">#----------------------------------------------------------</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">

</span><span class="n">x_sparse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mat_2scipy_sparse</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sparse_row_matrix"</span><span class="p">)</span><span class="w">

</span><span class="n">print</span><span class="p">(</span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">100</span><span class="w">  </span><span class="m">10</span><span class="w">

</span><span class="n">print</span><span class="p">(</span><span class="n">x_sparse</span><span class="o">$</span><span class="n">shape</span><span class="p">)</span><span class="w">

</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
  
</span>

<span class="w">
</span><span class="c1"># conversion from a dgCMatrix object to a scipy sparse matrix</span><span class="w">
</span><span class="c1">#-------------------------------------------------------------</span><span class="w">

</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">

</span><span class="c1"># by default column-oriented format</span><span class="w">

</span><span class="n">dgcM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Matrix</span><span class="o">::</span><span class="n">Matrix</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w">

                      </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">

                      </span><span class="n">sparse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">print</span><span class="p">(</span><span class="nf">dim</span><span class="p">(</span><span class="n">dgcM</span><span class="p">))</span><span class="w">

</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">3</span><span class="w">

</span><span class="n">x_sparse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dgCMatrix_2scipy_sparse</span><span class="p">(</span><span class="n">dgcM</span><span class="p">)</span><span class="w">

</span><span class="n">print</span><span class="p">(</span><span class="n">x_sparse</span><span class="o">$</span><span class="n">shape</span><span class="p">)</span><span class="w">

</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
  
</span>

The NMSlib R6-class

The parameter settings for the NMSlib R6-class can be found in the Non-Metric Space Library (NMSLIB) Manual, which explains the NMSLIB Library in detail. In the following code chunk, I’ll show the functionality of the methods included using a data set from my Github repository (it appears as .ipynb notebook in the nmslib Github repository)

<span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">nmslibR</span><span class="p">)</span><span class="w">


</span><span class="c1"># download the data from my Github repository (tested on a Linux OS)</span><span class="w">
</span><span class="c1">#-------------------------------------------------------------------</span><span class="w">

</span><span class="n">system</span><span class="p">(</span><span class="s2">"wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt"</span><span class="p">)</span><span class="w">


</span><span class="c1"># load the data in the R session</span><span class="w">
</span><span class="c1">#-------------------------------</span><span class="w">

</span><span class="n">sift_10k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s2">"~/sift_10k.txt"</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="o">=</span><span class="s2">"\""</span><span class="p">,</span><span class="w"> </span><span class="n">comment.char</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span><span class="w">


</span><span class="c1"># index parameters</span><span class="w">
</span><span class="c1">#-----------------</span><span class="w">

</span><span class="n">M</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="w">
</span><span class="n">efC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">num_threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="w">

</span><span class="n">index_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'M'</span><span class="o">=</span><span class="w"> </span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="s1">'indexThreadQty'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">num_threads</span><span class="p">,</span><span class="w"> </span><span class="s1">'efConstruction'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">efC</span><span class="p">,</span><span class="w"> 
                    
                    </span><span class="s1">'post'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="s1">'skip_optimized_index'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="p">)</span><span class="w">


</span><span class="c1"># query-time parameters</span><span class="w">
</span><span class="c1">#----------------------</span><span class="w">

</span><span class="n">efS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">

</span><span class="n">query_time_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'efSearch'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">efS</span><span class="p">)</span><span class="w">


</span><span class="c1"># Number of neighbors </span><span class="w">
</span><span class="c1">#--------------------</span><span class="w">

</span><span class="n">K</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">


</span><span class="c1"># space to use</span><span class="w">
</span><span class="c1">#---------------</span><span class="w">

</span><span class="n">space_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l2sqr_sift'</span><span class="w">     


</span><span class="c1"># initialize NMSlib [ the data should be a matrix ]</span><span class="w">
</span><span class="c1">#--------------------------------------------------</span><span class="w">

</span><span class="n">init_nms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">NMSlib</span><span class="o">$</span><span class="n">new</span><span class="p">(</span><span class="n">input_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">sift_10k</span><span class="p">),</span><span class="w"> </span><span class="n">Index_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">index_params</span><span class="p">,</span><span class="w"> 
                      
                      </span><span class="n">Time_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">query_time_params</span><span class="p">,</span><span class="w"> </span><span class="n">space</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">space_name</span><span class="p">,</span><span class="w"> 
                      
                      </span><span class="n">space_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'hnsw'</span><span class="p">,</span><span class="w"> 
                      
                      </span><span class="n">data_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'DENSE_UINT8_VECTOR'</span><span class="p">,</span><span class="w"> </span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'INT'</span><span class="p">,</span><span class="w">
                      
                      </span><span class="n">index_filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">print_progress</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span>

<span class="w">
</span><span class="c1"># returns a 1-dimensional vector</span><span class="w">
</span><span class="c1">#-------------------------------</span><span class="w">

</span><span class="n">init_nms</span><span class="o">$</span><span class="n">Knn_Query</span><span class="p">(</span><span class="n">query_data_row</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">sift_10k</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span>

<span class="w">
</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">    </span><span class="m">2</span><span class="w">    </span><span class="m">6</span><span class="w"> </span><span class="m">4585</span><span class="w"> </span><span class="m">9256</span><span class="w">  </span><span class="m">140</span><span class="w">                    </span><span class="c1"># indices</span><span class="w">

</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">18724</span><span class="w"> </span><span class="m">24320</span><span class="w"> </span><span class="m">68158</span><span class="w"> </span><span class="m">69067</span><span class="w"> </span><span class="m">70321</span><span class="w">               </span><span class="c1"># distances</span><span class="w">
 
</span>

<span class="w">
</span><span class="c1"># returns knn's for all data</span><span class="w">
</span><span class="c1">#---------------------------</span><span class="w">

</span><span class="n">all_dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">init_nms</span><span class="o">$</span><span class="n">knn_Query_Batch</span><span class="p">(</span><span class="n">as.matrix</span><span class="p">(</span><span class="n">sift_10k</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">num_threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="n">str</span><span class="p">(</span><span class="n">all_dat</span><span class="p">)</span><span class="w">

</span>

<span class="w">
</span><span class="c1"># a list of indices and distances for all observations</span><span class="w">
</span><span class="c1">#------------------------------------------------------</span><span class="w">

</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">2</span><span class="w">
 </span><span class="o">$</span><span class="w"> </span><span class="n">knn_idx</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">13</span><span class="w"> </span><span class="m">14</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="m">30</span><span class="w"> </span><span class="m">31</span><span class="w"> </span><span class="n">...</span><span class="w">
 </span><span class="o">$</span><span class="w"> </span><span class="n">knn_dist</span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="m">18724</span><span class="w"> </span><span class="m">14995</span><span class="w"> </span><span class="m">18724</span><span class="w"> </span><span class="m">14995</span><span class="w"> </span><span class="m">21038</span><span class="w"> </span><span class="n">...</span><span class="w">

</span>

Details on the various methods and parameter settings can be found in the manual of the NMSLIB python Library.

KernelKnn using the nmslibR package

In the Vignette of the KernelKnn (Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)) package I experimented with the mnist dataset and a cross-validated kernel k-nearest-neighbors model gave 98.4 % accuracy based on HOG (histogram of oriented gradients) features. However, it took almost 30 minutes (depending on the system configuration) to complete using 6 threads. I’ve implemented a similar function using NMSLIB (KernelKnnCV_nmslib), so in the next code chunk I’ll use the same parameter setting and I’ll compare computation time and accuracy.

First load the data,

<span class="w">
</span><span class="c1"># using system('wget..') on a linux OS </span><span class="w">

</span><span class="n">system</span><span class="p">(</span><span class="s2">"wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip"</span><span class="p">)</span><span class="w">             

</span><span class="n">mnist</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="n">unz</span><span class="p">(</span><span class="s2">"mnist.zip"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mnist.csv"</span><span class="p">),</span><span class="w"> </span><span class="n">nrows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">70000</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> 
                    
                    </span><span class="n">quote</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\""</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">","</span><span class="p">)</span><span class="w">

</span>

<span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mnist</span><span class="p">[,</span><span class="w"> </span><span class="o">-</span><span class="n">ncol</span><span class="p">(</span><span class="n">mnist</span><span class="p">)]</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">

</span><span class="c1">## [1] 70000   784</span><span class="w">

</span><span class="c1"># the 'KernelKnnCV_nmslib' function requires that the labels are numeric and start from 1 : Inf</span><span class="w">

</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mnist</span><span class="p">[,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">mnist</span><span class="p">)]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">          
</span><span class="n">table</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">

</span><span class="c1">## y</span><span class="w">
</span><span class="c1">##    1    2    3    4    5    6    7    8    9   10 </span><span class="w">
</span><span class="c1">## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958</span><span class="w">


</span><span class="c1"># evaluation metric</span><span class="w">

</span><span class="n">acc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span><span class="w"> </span><span class="n">preds</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  
  </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">table</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span><span class="w"> </span><span class="n">max.col</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span><span class="w"> </span><span class="n">ties.method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"random"</span><span class="p">))</span><span class="w">
  
  </span><span class="n">acc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">diag</span><span class="p">(</span><span class="n">out</span><span class="p">))</span><span class="o">/</span><span class="nf">sum</span><span class="p">(</span><span class="n">out</span><span class="p">)</span><span class="w">
  
  </span><span class="n">acc</span><span class="w">
</span><span class="p">}</span><span class="w">

</span>

then compute the HOG features,

<span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">OpenImageR</span><span class="p">)</span><span class="w">

</span><span class="n">hog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">HOG_apply</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">cells</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">orientations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">rows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">28</span><span class="p">,</span><span class="w"> </span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">28</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">

</span><span class="c1">## </span><span class="w">
</span><span class="c1">## time to complete : 2.101281 secs  </span><span class="w">

</span><span class="nf">dim</span><span class="p">(</span><span class="n">hog</span><span class="p">)</span><span class="w">

</span><span class="c1">## [1] 70000   324</span><span class="w">

</span>

then compute the approximate kernel k-nearest-neighbors using the cosine distance,

<span class="w">
</span><span class="c1"># parameters for 'KernelKnnCV_nmslib'</span><span class="w">
</span><span class="c1">#------------------------------------</span><span class="w">

</span><span class="n">M</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">efC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">num_threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="w">

</span><span class="n">index_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'M'</span><span class="o">=</span><span class="w"> </span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="s1">'indexThreadQty'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">num_threads</span><span class="p">,</span><span class="w"> </span><span class="s1">'efConstruction'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">efC</span><span class="p">,</span><span class="w">
                    
                    </span><span class="s1">'post'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="s1">'skip_optimized_index'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="p">)</span><span class="w">


</span><span class="n">efS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w">

</span><span class="n">query_time_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'efSearch'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">efS</span><span class="p">)</span><span class="w">


</span><span class="c1"># approximate kernel knn</span><span class="w">
</span><span class="c1">#-----------------------</span><span class="w">

</span><span class="n">fit_hog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KernelKnnCV_nmslib</span><span class="p">(</span><span class="n">hog</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">folds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> 
                             </span><span class="n">weights_function</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'biweight_tricube_MULT'</span><span class="p">,</span><span class="w"> 
                             </span><span class="n">Levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">y</span><span class="p">)),</span><span class="w"> </span><span class="n">Index_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">index_params</span><span class="p">,</span><span class="w">
                             </span><span class="n">Time_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">query_time_params</span><span class="p">,</span><span class="w"> </span><span class="n">space</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cosinesimil"</span><span class="p">,</span><span class="w"> 
                             </span><span class="n">space_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hnsw"</span><span class="p">,</span><span class="w"> </span><span class="n">data_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DENSE_VECTOR"</span><span class="p">,</span><span class="w"> 
                             </span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"FLOAT"</span><span class="p">,</span><span class="w"> </span><span class="n">index_filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">print_progress</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> 
                             </span><span class="n">num_threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">seed_num</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">


</span><span class="c1"># cross-validation starts .. </span><span class="w">

</span><span class="c1"># |=================================================================================| 100%</span><span class="w">

</span><span class="c1"># time to complete : 32.88805 secs </span><span class="w">


</span><span class="n">str</span><span class="p">(</span><span class="n">fit_hog</span><span class="p">)</span><span class="w">


</span>

<span class="w">
</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">2</span><span class="w">
 </span><span class="o">$</span><span class="w"> </span><span class="n">preds</span><span class="o">:</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">4</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">num</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="n">...</span><span class="w">
 </span><span class="o">$</span><span class="w"> </span><span class="n">folds</span><span class="o">:</span><span class="n">List</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="m">4</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">fold_1</span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">]</span><span class="w"> </span><span class="m">49808</span><span class="w"> </span><span class="m">21991</span><span class="w"> </span><span class="m">42918</span><span class="w"> </span><span class="m">7967</span><span class="w"> </span><span class="m">49782</span><span class="w"> </span><span class="m">28979</span><span class="w"> </span><span class="m">64440</span><span class="w"> </span><span class="m">49809</span><span class="w"> </span><span class="m">30522</span><span class="w"> </span><span class="m">36673</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">fold_2</span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">]</span><span class="w"> </span><span class="m">51122</span><span class="w"> </span><span class="m">9469</span><span class="w"> </span><span class="m">58021</span><span class="w"> </span><span class="m">45228</span><span class="w"> </span><span class="m">2944</span><span class="w"> </span><span class="m">58052</span><span class="w"> </span><span class="m">65074</span><span class="w"> </span><span class="m">17709</span><span class="w"> </span><span class="m">2532</span><span class="w"> </span><span class="m">31262</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">fold_3</span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">]</span><span class="w"> </span><span class="m">33205</span><span class="w"> </span><span class="m">40078</span><span class="w"> </span><span class="m">68177</span><span class="w"> </span><span class="m">32620</span><span class="w"> </span><span class="m">52721</span><span class="w"> </span><span class="m">18981</span><span class="w"> </span><span class="m">19417</span><span class="w"> </span><span class="m">53922</span><span class="w"> </span><span class="m">19102</span><span class="w"> </span><span class="m">67206</span><span class="w"> </span><span class="n">...</span><span class="w">
  </span><span class="n">..</span><span class="o">$</span><span class="w"> </span><span class="n">fold_4</span><span class="o">:</span><span class="w"> </span><span class="n">int</span><span class="w"> </span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">17500</span><span class="p">]</span><span class="w"> </span><span class="m">28267</span><span class="w"> </span><span class="m">41652</span><span class="w"> </span><span class="m">28514</span><span class="w"> </span><span class="m">34525</span><span class="w"> </span><span class="m">68534</span><span class="w"> </span><span class="m">13294</span><span class="w"> </span><span class="m">48759</span><span class="w"> </span><span class="m">47521</span><span class="w"> </span><span class="m">69395</span><span class="w"> </span><span class="m">41408</span><span class="w"> </span><span class="n">...</span><span class="w">

</span>

<span class="w">
</span><span class="n">acc_fit_hog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">fit_hog</span><span class="o">$</span><span class="n">preds</span><span class="p">),</span><span class="w"> 
                            
                            </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">acc</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">fit_hog</span><span class="o">$</span><span class="n">folds</span><span class="p">[[</span><span class="n">x</span><span class="p">]]],</span><span class="w"> 
                                            
                                            </span><span class="n">fit_hog</span><span class="o">$</span><span class="n">preds</span><span class="p">[[</span><span class="n">x</span><span class="p">]])))</span><span class="w">
</span><span class="n">acc_fit_hog</span><span class="w">

</span><span class="c1">## [1] 0.9768000 0.9786857 0.9763429 0.9760000</span><span class="w">

</span><span class="n">cat</span><span class="p">(</span><span class="s1">'mean accuracy for hog-features using cross-validation :'</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">acc_fit_hog</span><span class="p">),</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1">## mean accuracy for hog-features using cross-validation : 0.9769571</span><span class="w">

</span>

It took approx. 33 seconds to return with an accuracy of 97.7 % . Almost 47 times faster than KernelKnn’s corresponding function (brute force) with a slight lower accuracy rate (the braycurtis distance metric might be better suited for this dataset).

I also run the corresponding brute-force algorithm of the NMSLIB Library by setting the method parameter to seq_search,

<span class="w">

</span><span class="c1"># brute force of NMSLIB   [ here we set 'Index_Params' and 'Time_Params' to NULL ]</span><span class="w">
</span><span class="c1">#----------------------</span><span class="w">

</span><span class="n">fit_hog_seq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">KernelKnnCV_nmslib</span><span class="p">(</span><span class="n">hog</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">folds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">weights_function</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'biweight_tricube_MULT'</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">Levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">y</span><span class="p">)),</span><span class="w"> </span><span class="n">Index_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
                                </span><span class="n">Time_Params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">space</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cosinesimil"</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">space_params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"seq_search"</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">data_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"DENSE_VECTOR"</span><span class="p">,</span><span class="w"> </span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"FLOAT"</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">index_filepath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">print_progress</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> 
                                </span><span class="n">num_threads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">seed_num</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">


</span><span class="c1"># cross-validation starts .. </span><span class="w">

</span><span class="c1"># |=================================================================================| 100%</span><span class="w">

</span><span class="c1"># time to complete : 4.506177 mins  </span><span class="w">


</span><span class="n">acc_fit_hog_seq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">fit_hog_seq</span><span class="o">$</span><span class="n">preds</span><span class="p">),</span><span class="w"> 
                                
                                </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">acc</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">fit_hog_seq</span><span class="o">$</span><span class="n">folds</span><span class="p">[[</span><span class="n">x</span><span class="p">]]],</span><span class="w"> 
                                                
                                                </span><span class="n">fit_hog_seq</span><span class="o">$</span><span class="n">preds</span><span class="p">[[</span><span class="n">x</span><span class="p">]])))</span><span class="w">
</span><span class="n">acc_fit_hog_seq</span><span class="w">

</span><span class="c1">## [1] 0.9785143 0.9802286 0.9783429 0.9784571</span><span class="w">

</span><span class="n">cat</span><span class="p">(</span><span class="s1">'mean accuracy for hog-features using cross-validation :'</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">acc_fit_hog_seq</span><span class="p">),</span><span class="w"> </span><span class="s1">'\n'</span><span class="p">)</span><span class="w">

</span><span class="c1">## mean accuracy for hog-features using cross-validation : 0.9788857</span><span class="w">


</span>

The brute-force algorithm of the NMSLIB Library is almost 6 times faster than KernelKnn giving an accuracy of approx. 97.9 %.

The README.md file of the nmslibR package includes the SystemRequirements and installation instructions.

An updated version of the nmslibR package can be found in my Github repository and to report bugs/issues please use the following link, https://github.com/mlampros/nmslibR/issues.

To leave a comment for the author, please follow the link and comment on their blog: mlampros.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)