Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is this second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement K nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA and next time we will deal with SVM and decision trees.

The K-nearest neighbors (KNN) is a simple yet efficient classification and regression algorithm. KNN assumes that an observation will be similar to its K closest neighbors. For instance, if most of the neighbors of a given point belongs to a given class, it seems reasonable to assume that the point will belong to the same given class.

## The mathematics of KNN

Now, let’s quickly derive the mathematics used for KNN regression (they are similar for classification).

Let be the observations of our training dataset. The points are in . We denote the variable we seek to estimate. We know its value for the train dataset.
Let be a new point in . We do not know and will estimate it using our train dataset.

Let be a positive and non-zero integer (the number of neighbors used for estimation). We want to select the points from the dataset which are the closest to . To do so, we compute the euclidean distance . From all the distance, we can compute , the smallest radius of the circle centered on which includes exactly points from the training sample.

An estimation of is now easy to construct. This is the mean of the of the closest points to :

### KNN regression in R

First, we build a “my_knn_regressor” object which stores all the training points, the value of the target variable and the number of neighbors to use.

my_knn_regressor <- function(x,y,k=5)
{
if (!is.matrix(x))
{
x <- as.matrix(x)
}
if (!is.matrix(y))
{
y <- as.matrix(y)
}
my_knn <- list()
my_knn[['points']] <- x
my_knn[['value']] <- y
my_knn[['k']] <- k
attr(my_knn, "class") <- "my_knn_regressor"
return(my_knn)
}


The tricky part of KNN is to compute efficiently the distance. We will use the function we created in our previous post on vectorization. The function and mathematical derivations are specified in this post.

gramMatrix <- function(X,Y)
{
tcrossprod(X, Y)
}
compute_pairwise_distance=function(X,Y)
{
xn <- rowSums(X ** 2)
yn <- rowSums(Y ** 2)
outer(xn, yn,'+') - 2 * tcrossprod(X, Y)
}


Now we can build our predictor:

predict.my_knn_regressor <- function(my_knn,x)
{
if (!is.matrix(x))
{
x=as.matrix(x)
}
##Compute pairwise distance
dist_pair <- compute_pairwise_distance(x,my_knn[['points']])
##as.matrix(apply(dist_pair,2,order) <= my_knn[['k']]) orders the points by distance and select the k-closest points
##The M[i,j]=1 if x_j is on the k closest point to x_i
t(as.matrix(apply(dist_pair,2,order) <= my_knn[['k']])) %*% my_knn[['value']] / my_knn[['k']]
}


The last line may seem complicated:

1. apply(dist_pair,2,order) orders the points by distance
2. apply(dist_pair,2,order) selects the k-closest points to each point in our new dataset
3. M=t(as.matrix(apply(dist_pair,2,order)  cast the matrix into a one hot matrix. if is one of the k closest points to . is zero otherwise.
4. M %*% my_knn[['value']] / my_knn  sums the value of the k closest points and normalises it by k

## KNN Binary Classification in R

The previous code can be reused as it is for binary classification. Your outcome should be encoded as a one-hot variable. If the estimated output is greater (resp. less) than 0.5, you can assume that your point belongs to the class encoded as one (resp. zero). We will use the classical Iris dataset and classify the setosa versus the virginica specy.

iris_class <- iris[iris[["Species"]]!="versicolor",]
iris_class[["Species"]] <- as.numeric(iris_class[["Species"]]!="setosa")
knn_class <- my_knn_regressor(iris_class[,1:2],iris_class[,5])
predict(knn_class,iris_class[,1:2])


Since, we only used 2 variables, we can easily plot the decision boundaries on a 2D plot.

#Build grid
x_coord <- seq(min(iris_class[,1]) - 0.2,max(iris_class[,1]) + 0.2,length.out = 200)
y_coord <- seq(min(iris_class[,2])- 0.2,max(iris_class[,2]) + 0.2 , length.out = 200)
coord <- expand.grid(x=x_coord, y=y_coord)
#predict probabilities
coord[['prob']] <- predict(knn_class,coord[,1:2])

library(ggplot2)
ggplot() +
geom_tile(data=coord,mapping=aes(x, y, fill=prob)) + scale_fill_gradient(low = "lightblue", high = "red") +
geom_point(data=iris_class,mapping=aes(Sepal.Length,Sepal.Width, shape=Species),size=3 ) +
#add the labels to the plots
xlab('Sepal length') + ylab('Sepal width') + ggtitle('Decision boundaries of KNN')+
#remove grey border from the tile
scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0))


And this gives us this cool plot:

## Possible extensions

Our current KNN is basic, but you can improve and test it in several ways:

• What is the influence of the number of neighbors ? (You should see some overfitting/underfitting)
• Can you implement other metrics than distance ? Can you create kernel KNNs ?
• Instead of doing estimations using only the mean, could you use a more complex mapping ?