**mlampros**, and kindly contributed to R-bloggers)

This blog post is about my recently released package on CRAN, **KernelKnn**. The package consists of three functions **KernelKnn**, **KernelKnnCV** and **knn.index.dist**. It also includes two data sets (*housing data*, *ionosphere*), which will be used here to illustrate the functionality of the package.

### k nearest neighbors

In pattern recognition the k nearest neighbors (KNN) is a non-parametric method used for classification and regression. Although KNN belongs to the 10 most influential algorithms in data mining, it is considered as one of the simplest in machine learning.

The most important parameters of the KNN algorithm are **k** and the **distance metric**. The parameter k specifies the number of neighbor observations that contribute to the output predictions. Optimal values for k can be obtained mainly through resampling methods, such as *cross-validation* or *bootstrap*. The distance metric is another important factor, which depending on the data set can affect the performance of the algorithm. Widely used distance metrics are the *euclidean*, *manhattan*, *chebyshev*, *minkowski* and *hamming*.

The simple KNN algorithm can be extended by giving different weights to the selected k nearest neighbors. A common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The purpose of the *KernelKnn* package is to use different weight functions (kernels) in order to optimize the output predictions in both regression and classification.

### KernelKnn function

The following code snippets appear in the package Vignettes. I’ll illustrate the package functionality using mainly classification and in-between I’ll point out the differences for regression and classification, which could lead to potential errors.

```
data(ionosphere, package = 'KernelKnn')
apply(ionosphere, 2, function(x) length(unique(x)))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## 2 1 219 269 204 259 231 260 244 267 246 269
## V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24
## 238 266 234 270 254 280 254 266 248 265 248 264
## V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 class
## 256 273 256 281 244 266 243 263 245 263 2
```

```
# the second column will be removed as it has a single unique value
ionosphere = ionosphere[, -2]
```

When using an algorithm where the output depends on distance calculation (as is the case in k-nearest-neighbors) it is recommended to first scale the data,

```
X = scale(ionosphere[, -ncol(ionosphere)])
y = ionosphere[, ncol(ionosphere)]
```

Both in regression and classification the **KernelKnn** function accepts a numeric vector as a response variable (here y). In **classification** the user should additionally give the unique values of the labels, which should range from 1:Inf. This is important otherwise the internal functions do not work. The *KernelKnn* function returns a vector of numeric values in case of regression or a matrix of class probabilities in case of classification.

```
# convert y from factor to numeric in classification
y = c(1:length(unique(y)))[ match(ionosphere$class, sort(unique(ionosphere$class))) ]
# random split in train-test
spl_train = sample(1:length(y), round(length(y) * 0.75))
spl_test = setdiff(1:length(y), spl_train)
```

```
str(spl_train)
```

```
## int [1:263] 56 10 224 249 109 223 221 146 93 194
```

```
str(spl_test)
```

```
## int [1:88] 2 4 7 9 11 15 20 23 33 34 ...
```

```
# evaluation metric
acc = function (y_true, preds) {
out = table(y_true, max.col(preds, ties.method = "random"))
acc = sum(diag(out))/sum(out)
acc
}
```

A simple k-nearest-neighbors model can be run with weights_function = NULL and the parameter ‘regression’ should be set to FALSE in case of classification.

```
library(KernelKnn)
preds_TEST = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 5 ,
method = 'euclidean', weights_function = NULL, regression = F,
Levels = unique(y))
```

```
head(preds_TEST)
```

```
## class_1 class_2
## [1,] 0 1
## [2,] 0 1
## [3,] 0 1
## [4,] 0 1
## [5,] 0 1
## [6,] 0 1
```

K-nearest-neigbor calculations in the KernelKnn function can be accomplished using the following distance metrics : *euclidean*, *manhattan*, *chebyshev*, *canberra*, *braycurtis*, *minkowski* (by default the order ‘p’ of the minkowski parameter equals k), *hamming*, *mahalanobis*, *pearson_correlation*, *simple_matching_coefficient*, *jaccard_coefficient* and *Rao_coefficient*. The last four are similarity measures and are appropriate for binary data [0,1].

There are two ways to use a kernel in the KernelKnn function. The **first option** is to choose one of the existing kernels (*uniform*, *triangular*, *epanechnikov*, *biweight*, *triweight*, *tricube*, *gaussian*, *cosine*, *logistic*, *silverman*, *inverse*, *gaussianSimple*, *exponential*). Here, I’ll use the canberra metric and the tricube kernel because they give optimal results (according to my RandomSearchR package),

```
preds_TEST_tric = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 10 ,
method = 'canberra', weights_function = 'tricube', regression = F,
Levels = unique(y))
head(preds_TEST_tric)
```

```
## [,1] [,2]
## [1,] 1.745564e-02 0.98254436
## [2,] 9.667304e-01 0.03326963
## [3,] 0.000000e+00 1.00000000
## [4,] 6.335040e-18 1.00000000
## [5,] 4.698239e-02 0.95301761
## [6,] 0.000000e+00 1.00000000
```

The **second option** is to give a self-defined kernel function. Here, I’ll pick the density function of the normal distribution with mean = 0.0 and standard deviation = 1.0 (the data are scaled to have mean zero and unit variance),

```
norm_kernel = function(W) {
W = dnorm(W, mean = 0, sd = 1.0)
W = W / rowSums(W)
return(W)
}
preds_TEST_norm = KernelKnn(X[spl_train, ], TEST_data = X[spl_test, ], y[spl_train], k = 10 ,
method = 'canberra', weights_function = norm_kernel, regression = F,
Levels = unique(y))
head(preds_TEST_norm)
```

```
## [,1] [,2]
## [1,] 0.26150003 0.7385000
## [2,] 0.84170089 0.1582991
## [3,] 0.00000000 1.0000000
## [4,] 0.07614579 0.9238542
## [5,] 0.09479386 0.9052061
## [6,] 0.00000000 1.0000000
```

The computations can be speed up by using the parameter **threads** (utilizes openMP). There is also the option to exclude **extrema** (minimum and maximum distances) during the calculation of the k-nearest-neighbor distances using extrema = TRUE. The bandwidth of the existing kernels can be tuned using the **h** parameter, which defaults to 1.0.

### The KernelKnnCV function

The *KernelKnnCV* function can be employed to return the prediction accuracy using n-fold cross-validation. The following parameter pairs give optimal results according to my RandomSearchR package,

k | method | kernel |
---|---|---|

10 | canberra | tricube |

9 | canberra | epanechnikov |

```
fit_cv_pair1 = KernelKnnCV(X, y, k = 10 , folds = 5, method = 'canberra',
weights_function = 'tricube', regression = F,
Levels = unique(y), threads = 5)
```

```
str(fit_cv_pair1)
```

```
## List of 2
## $ preds:List of 5
## ..$ : num [1:71, 1:2] 0.00648 0.25323 1 0.97341 0.92031 ...
## ..$ : num [1:70, 1:2] 0 0 0 0 0.999 ...
## ..$ : num [1:70, 1:2] 0.353 0 0.17 0.212 0.266 ...
## ..$ : num [1:70, 1:2] 0 0 0 0 0 ...
## ..$ : num [1:70, 1:2] 0.989 0 1 0 0 ...
## $ folds:List of 5
## ..$ fold_1: int [1:71] 5 26 233 243 30 41 237 229 19 11 ...
## ..$ fold_2: int [1:70] 262 89 257 67 58 266 253 85 275 268 ...
## ..$ fold_3: int [1:70] 127 128 295 287 134 288 130 277 125 101 ...
## ..$ fold_4: int [1:70] 313 301 317 318 316 142 175 157 146 147 ...
## ..$ fold_5: int [1:70] 195 326 225 332 342 347 206 219 218 214 ...
```

```
fit_cv_pair2 = KernelKnnCV(X, y, k = 9 , folds = 5,method = 'canberra',
weights_function = 'epanechnikov', regression = F,
Levels = unique(y), threads = 5)
```

```
str(fit_cv_pair2)
```

```
## List of 2
## $ preds:List of 5
## ..$ : num [1:71, 1:2] 0.0224 0.255 1 0.9601 0.8876 ...
## ..$ : num [1:70, 1:2] 0 0 0 0 0.998 ...
## ..$ : num [1:70, 1:2] 0.36 0 0.164 0.185 0.202 ...
## ..$ : num [1:70, 1:2] 0 0 0 0 0 ...
## ..$ : num [1:70, 1:2] 0.912 0 1 0 0 ...
## $ folds:List of 5
## ..$ fold_1: int [1:71] 5 26 233 243 30 41 237 229 19 11 ...
## ..$ fold_2: int [1:70] 262 89 257 67 58 266 253 85 275 268 ...
## ..$ fold_3: int [1:70] 127 128 295 287 134 288 130 277 125 101 ...
## ..$ fold_4: int [1:70] 313 301 317 318 316 142 175 157 146 147 ...
## ..$ fold_5: int [1:70] 195 326 225 332 342 347 206 219 218 214 ...
```

Each cross-validated object returns a list of length 2 ( the first sublist includes the predictions for each fold whereas the second gives the indices of the folds)

```
acc_pair1 = unlist(lapply(1:length(fit_cv_pair1$preds),
function(x) acc(y[fit_cv_pair1$folds[[x]]],
fit_cv_pair1$preds[[x]])))
acc_pair1
```

```
## [1] 0.9154930 0.9142857 0.9142857 0.9285714 0.9571429
```

```
cat('accurcay for params_pair1 is :', mean(acc_pair1), '\n')
```

```
## accurcay for params_pair1 is : 0.9259557
```

```
acc_pair2 = unlist(lapply(1:length(fit_cv_pair2$preds),
function(x) acc(y[fit_cv_pair2$folds[[x]]],
fit_cv_pair2$preds[[x]])))
acc_pair2
```

```
## [1] 0.9014085 0.9142857 0.9000000 0.9142857 0.9571429
```

```
cat('accuracy for params_pair2 is :', mean(acc_pair2), '\n')
```

```
## accuracy for params_pair2 is : 0.9174245
```

### Adding or multiplying kernels

In the KernelKnn package the user can also combine kernels by adding or multiplying from the existing ones. For instance, if I want to multiply the tricube with the gaussian kernel, then I’ll give the following character string to the weights_function, *“tricube_gaussian_MULT”*. On the other hand, If I want to add the same kernels then the weights_function will be *“tricube_gaussian_ADD”*.

I’ll illustrate this option of the package using two image data sets in form of matrices, i.e. the *MNIST* and the *CIFAR-10*. From within R one can download the data in a linux OS using,

```
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
and
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/cifar_10.zip")
```

Moreover, the **irlba** and the **OpenImageR** packages are needed for comparison purposes, which can be installed directly from CRAN using the install.packages() function. A 4-fold cross-validation using the KernelKnnCV function can take 40-50 minutes utilizing 6 threads (for each data set). An alternative to reduce the computation time would be a train-test split of the data at the cost of performance validation.

**MNIST data set**

The MNIST data set of handwritten digits has a training set of 70,000 examples and each row of the matrix corresponds to a 28 x 28 image. The unique values of the response variable *y* range from 0 to 9. More information about the data can be found in the *DataSets* repository (the folder includes also an Rmarkdown file).

```
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
quote = "\"", sep = ",")
```

```
X = mnist[, -ncol(mnist)]
dim(X)
## [1] 70000 784
# the KernelKnn function requires that the labels are numeric and start from 1 : Inf
y = mnist[, ncol(mnist)] + 1
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958
```

K nearest neighbors do not perform well in high dimensions due to the *curse of dimensionality* (k observations that are nearest to a given test observation x1 may be very far away from x1 in p-dimensional space when p is large [ An introduction to statistical learning, James/Witten/Hastie/Tibshirani, pages 108-109 ]), leading to a very poor k-nearest-neighbors fit. One option to overcome this problem would be to use truncated svd (irlba package) to reduce the dimensions of the data. A second option, which is appropriate in case of images, would be to use image descriptors. Here, I’ll compare those two approaches.

**KernelKnnCV using truncated svd**

I experimented with different settings and the following parameters gave the best results,

irlba_singlular_vectors | k | method | kernel |
---|---|---|---|

40 | 8 | braycurtis | biweight_tricube_MULT |

```
library(irlba)
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F) # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v # new_x using the 40 right singular vectors
```

```
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit)
```

```
acc_fit = unlist(lapply(1:length(fit$preds),
function(x) acc(y[fit$folds[[x]]],
fit$preds[[x]])))
acc_fit
## [1] 0.9742857 0.9749143 0.9761143 0.9741143
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.9748571
```

Utilizing truncated svd a 4-fold cross-validation KernelKnn model gives a 97.48% accuracy.

**KernelKnnCV and HOG (histogram of oriented gradients)**

In this chunk of code, besides KernelKnnCV I’ll also use HOG. The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy (Wikipedia).

```
library(OpenImageR)
hog = HOG_apply(X, cells = 6, orientations = 9, height = 28, width = 28, threads = 6)
##
## time to complete : 1.802997 secs
dim(hog)
## [1] 70000 324
```

```
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
#str(fit_hog)
```

```
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
function(x) acc(y[fit_hog$folds[[x]]],
fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.9833714 0.9840571 0.9846857 0.9838857
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.984
```

By changing from the simple svd-features to HOG-features the accuracy of a 4-fold cross-validation model increased from 97.48% to 98.4% (approx. 1% difference)

**CIFAR-10 data set**

CIFAR-10 is an established computer-vision dataset used for object recognition. The data I’ll use in this example is a subset of an 80 million tiny images dataset and consists of 60,000 32×32 color images containing one of 10 object classes ( 6000 images per class ). Furthermore, the data were converted from RGB to gray, normalized and rounded to 2 decimal places (to reduce the storage size). More information about the data can be found in the *DataSets* repository (the folder includes also an Rmarkdown file).

I’ll build the kernel k-nearest-neighbors models in the same way I’ve done for the mnist data set and then I’ll compare the results.

```
system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/cifar_10.zip")
cifar_10 <- read.table(unz("cifar_10.zip", "cifar_10.csv"), nrows = 60000, header = T,
quote = "\"", sep = ",")
```

**KernelKnnCV using truncated svd**

```
X = cifar_10[, -ncol(cifar_10)]
dim(X)
## [1] 60000 1024
# the KernelKnn function requires that the labels are numeric and start from 1 : Inf
y = cifar_10[, ncol(cifar_10)]
table(y)
## y
## 1 2 3 4 5 6 7 8 9 10
## 6000 6000 6000 6000 6000 6000 6000 6000 6000 6000
```

The parameter settings are similar to those of the mnist data,

```
svd_irlb = irlba(as.matrix(X), nv = 40, nu = 40, verbose = F) # irlba truncated svd
new_x = as.matrix(X) %*% svd_irlb$v # new_x using the 40 right singular vectors
```

```
fit = KernelKnnCV(as.matrix(new_x), y, k = 8, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit)
```

```
acc_fit = unlist(lapply(1:length(fit$preds),
function(x) acc(y[fit$folds[[x]]],
fit$preds[[x]])))
acc_fit
## [1] 0.4080667 0.4097333 0.4040000 0.4102667
cat('mean accuracy using cross-validation :', mean(acc_fit), '\n')
## mean accuracy using cross-validation : 0.4080167
```

The accuracy of a 4-fold cross-validation model using truncated svd is 40.8%.

**KernelKnnCV using HOG (histogram of oriented gradients)**

Next, I’ll run the KernelKnnCV using the HOG-descriptors,

```
hog = HOG_apply(X, cells = 6, orientations = 9, height = 32,
width = 32, threads = 6)
##
## time to complete : 3.394621 secs
dim(hog)
## [1] 60000 324
```

```
fit_hog = KernelKnnCV(hog, y, k = 20, folds = 4, method = 'braycurtis',
weights_function = 'biweight_tricube_MULT', regression = F,
threads = 6, Levels = sort(unique(y)))
# str(fit_hog)
```

```
acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
function(x) acc(y[fit_hog$folds[[x]]],
fit_hog$preds[[x]])))
acc_fit_hog
## [1] 0.5807333 0.5884000 0.5777333 0.5861333
cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
## mean accuracy for hog-features using cross-validation : 0.58325
```

By using hog-descriptors in a 4-fold cross-validation model the accuracy in the cifar-10 data increases from 40.8% to 58.3% (approx. 17.5% difference).

An updated version of the **KernelKnn** package can be found in the Github repository and to report bugs/issues please use the following link, https://github.com/mlampros/KernelKnn/issues.

**leave a comment**for the author, please follow the link and comment on their blog:

**mlampros**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...