
Machine Learning Explained: Vectorization and matrix operations


Today in Machine Learning Explained, we will tackle a central (yet often overlooked) aspect of Machine Learning: vectorization. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and sum them sequentially. This naive way is slow and gets even slower with large amounts of data and large data structures.
With vectorization, these operations can be seen as matrix operations, which are often more efficient than standard loops. Vectorized versions of an algorithm are several orders of magnitude faster and are easier to understand from a mathematical perspective.

A basic example of vectorization

Preliminary example – Python

Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its elements 1,000 times with each algorithm. The overall computation times will then be compared.

import numpy as np
import time
W=np.random.normal(0,1,100000)
n_rep=1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time()
for i in range(n_rep):
    loop_res=0
    for elt in W:
        loop_res+=elt
time_loop = time.time() - start_time

If $W$ is our vector of interest, the sum of its elements can be expressed as the dot product $\mathbf{1}^T W$:

start_time = time.time()
for i in range(n_rep):
    one_dot=np.ones(W.shape)
    vect_res=one_dot.T.dot(W)
time_vect = time.time() - start_time

Finally, we can check that both methods yield the same result and compare their runtimes. The vectorized version runs approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res-loop_res)<10e-10)
print(time_loop/time_vect)

Note: the same result can be obtained with np.sum. The numpy version has a very similar runtime to our vectorized version. Numpy being heavily optimized, this shows that our vectorized sum is reasonably fast.

start_time = time.time()
for i in range(n_rep):
    vect_res=np.sum(W)
time_vect_np = time.time() - start_time

Preliminary example – R

The previous experiments can be replicated in R:

##Creation of the vector
W=matrix(rnorm(100000))
n_rep=10000
#Naive way:
library(tictoc)
tic('Naive computation')
for (rep in 1:n_rep)
{
  res_loop=0
  for (w_i in W)
    res_loop=w_i+res_loop
}
toc()

tic('Vectorized computation')
# vectorized way
for (rep in 1:n_rep)
{
  ones=rep(1,length(W))
  res_vect= crossprod(ones,W)
}
toc()

tic('built-in computation')
# built-in way
for (rep in 1:n_rep)
{
  res_built_in= sum(W)
}
toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in sum achieves the best performance and is an order of magnitude faster than our vectorized way.

Preliminary example – Results

[Figure: computation times of the naive, vectorized, and built-in sums in Python and R]
Vectorization divides computation times by several orders of magnitude, and the gap with loops increases with the size of the data. Hence, if you want to deal with large amounts of data, rewriting the algorithm as matrix operations may lead to significant performance gains.

Why vectorization is (often) faster

Vectorized code is usually faster because the loop runs inside compiled, heavily optimized routines (such as BLAS) that use SIMD instructions and cache-friendly memory access, instead of being interpreted one element at a time.

Note 1: Though vectorization is often faster, it requires allocating memory for intermediate arrays (such as the vector of ones). If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: When you deal with large arrays or computationally intensive algebra (like inversion, diagonalization, eigenvalue computations, …), computations on GPU are orders of magnitude faster than on CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code makes it easier to translate CPU code to GPU (or to tensor-based frameworks), as sketched below.
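As an illustration, here is a minimal sketch of how the vectorized sum translates almost verbatim to a tensor framework (assuming PyTorch, which is not used in the original article; the same idea applies to other frameworks):

import torch

# Same 1^T W dot product as before, placed on GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
W_t = torch.randn(100000, device=device)
ones = torch.ones_like(W_t)
vect_res = torch.dot(ones, W_t)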

A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorizations and to end with a more complex example that illustrates the thought process underlying vectorization.

Column-wise matrix sum

The column-wise sum (and mean) can be expressed as a matrix product. Let $W \in \mathbb{R}^{n \times p}$ be our matrix of interest. Using the matrix multiplication formula, we have $(\mathbf{1}_n^T W)_j = \sum_{i=1}^{n} W_{i,j}$. Hence, the column-wise sum of $W$ is $\mathbf{1}_n^T W$.

Python code:

def colWiseSum(W):
    ones=np.ones((W.shape[0],1))
    return ones.T.dot(W)

R code:

colWiseSum=function(W)
{
  ones=rep(1,nrow(W))
  t(W)%*%ones
}

Row-wise matrix sum

Similarly, the row-wise sum is $W\,\mathbf{1}_p$.

Python code:

def rowWiseSum(W):
    ones=np.ones((W.shape[1],1))
    return W.dot(ones)

R code:

rowWiseSum=function(W)
{
  ones=rep(1,ncol(W))
  W%*%ones
}

Matrix sum

The sum of all the elements of a matrix is the sum of its row sums. Using the previous expressions, the sum of all the terms of $W$ is $\mathbf{1}_n^T W\,\mathbf{1}_p$.

Python code:

def matSum(W):
    rhs_ones=np.ones((W.shape[1],1))
    lhs_ones=np.ones((W.shape[0],1))
    return lhs_ones.T.dot(W).dot(rhs_ones)

R code:

matSum=function(W)
{
  rhs_ones=rep(1,ncol(W))
  lhs_ones=rep(1,nrow(W))
  t(lhs_ones) %*% W%*% rhs_ones
}
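As a quick sanity check (a sketch assuming numpy; these comparisons are not in the original post), the three helpers can be compared with numpy's built-in reductions:

W = np.random.normal(0, 1, (200, 50))
print(np.allclose(colWiseSum(W), W.sum(axis=0)))        # column-wise sums match
print(np.allclose(rowWiseSum(W)[:, 0], W.sum(axis=1)))  # row-wise sums match
print(np.allclose(matSum(W), W.sum()))                  # total sum matches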

Similarity matrix (Gram matrix)

Let’s say we have a set of words, and for each of these words we want to find the most similar words in a dictionary. We assume that the words have been projected into a space of dimension $p$ (using word2vec). Let $X$ (our set of words) and $Y$ (our dictionary) be two matrices in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$ respectively. To compute the similarity of all the observations of $X$ and $Y$, we simply need to compute $XY^T$.

Python code:

def gramMatrix(X,Y):
    return X.dot(Y.T)

R code:

gramMatrix=function(X,Y)
{
  X %*% t(Y)
}
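A toy usage sketch (with made-up 3-dimensional embeddings, not from the original post): for each row of X, the index of the most similar dictionary word is the argmax of the corresponding row of the similarity matrix.

X = np.random.normal(0, 1, (4, 3))   # 4 query words
Y = np.random.normal(0, 1, (10, 3))  # dictionary of 10 words
sim = gramMatrix(X, Y)               # similarity matrix of shape (4, 10)
print(sim.argmax(axis=1))            # most similar dictionary word for each query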

L2 distance

We want to compute the pairwise distances between two sets of vectors. Let $X$ and $Y$ be two matrices in $\mathbb{R}^{n_1 \times p}$ and $\mathbb{R}^{n_2 \times p}$. For each vector of $X$, we need to compute the distance to all the vectors of $Y$. Hence, the output matrix should be of size $n_1 \times n_2$.

If $x$ and $y$ are two vectors, their squared L2 distance is:

$$\|x - y\|^2 = \sum_{k=1}^{p} (x_k - y_k)^2 = x^T x + y^T y - 2\,x^T y$$

To compute all pairwise distances, some work is required on the last equality: each of the three terms needs to be expanded into a matrix of size $n_1 \times n_2$. The squared-norm terms are first built as vectors of size $n_1 n_2$, which are then reshaped into matrices of size $n_1 \times n_2$.

The first two terms, $x^T x$ and $y^T y$, need to be computed for each $x_i$ and each $y_j$. $X \odot X$ denotes the element-wise multiplication of $X$ with itself (its elements are $X_{i,k}^2$). Hence, the $i$-th element of $(X \odot X)\,\mathbf{1}_p$ is the sum of the squared coordinates of the $i$-th observation, $x_i^T x_i$.

However, this is a vector of shape $n_1 \times 1$. By replicating each of its elements $n_2$ times, we will get a vector of size $n_1 n_2$ that can be rearranged into a matrix of size $n_1 \times n_2$. The replication can be done easily if we multiply by the right matrix $\Delta_{n_1,n_2}$.

Let $\mathbf{1}_{n_2}$ be a vector of ones of size $n_2$. Let $\Delta_{n_1,n_2}$ be the matrix of size $n_1 n_2 \times n_1$ with $n_1$ repetitions of $\mathbf{1}_{n_2}$ on the “diagonal”:

$$\Delta_{n_1,n_2} = \begin{pmatrix} \mathbf{1}_{n_2} & 0 & \cdots & 0 \\ 0 & \mathbf{1}_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{1}_{n_2} \end{pmatrix}$$

Then, our final vector is $\Delta_{n_1,n_2}\,(X \odot X)\,\mathbf{1}_p$ (the same expression holds for $Y$ with $\Delta_{n_2,n_1}$). Denoting by $\mathrm{reshape}_{n_1,n_2}$ the operator that transforms such a vector of size $n_1 n_2$ into an $n_1 \times n_2$ matrix, and using the previous part on the similarity matrix, we get the following expression for the matrix of pairwise distances:

$$D = \mathrm{reshape}_{n_1,n_2}\big(\Delta_{n_1,n_2}(X \odot X)\mathbf{1}_p\big) + \mathrm{reshape}_{n_2,n_1}\big(\Delta_{n_2,n_1}(Y \odot Y)\mathbf{1}_p\big)^T - 2\,XY^T$$

The previous expression can seem complex, but it will help us a lot when coding the pairwise distance: we only have to translate the maths into Numpy or R.

Python code:

def L2dist(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    # Squared norms of the rows of X and of Y
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
    # Replication matrices (the Delta matrices from the derivation)
    delta_n1_n2=np.repeat(np.eye(n_1),n_2,axis=0)
    delta_n2_n1=np.repeat(np.eye(n_2),n_1,axis=0)
    # reshape(Delta.x_sq) + reshape(Delta.y_sq)^T - 2*X.Y^T
    return np.reshape(delta_n1_n2.dot(x_sq),(n_1,n_2))+np.reshape(delta_n2_n1.dot(y_sq),(n_2,n_1)).T-2*gramMatrix(X,Y)

R Code:

L2dist=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq,n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}
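As a sanity check (a sketch assuming scipy is available; this comparison is not in the original post), the result can be compared with scipy's pairwise squared euclidean distances:

from scipy.spatial.distance import cdist

X = np.random.normal(0, 1, (5, 3))
Y = np.random.normal(0, 1, (7, 3))
print(np.allclose(L2dist(X, Y), cdist(X, Y, metric="sqeuclidean")))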

L2 distance (improved)

Actually, the previous L2dist is not completely optimized: it requires a lot of memory, since $\Delta_{n_1,n_2}$ has $n_1^2 n_2$ cells and is mostly empty. Using Numpy's built-in repeat function, we can circumvent this matrix multiplication by directly repeating the vector (which reduces the memory footprint by a factor $n_1$):

Python code:

def L2dist_improved(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
    ## Replace the multiplication by a simple repeat
    X_rpt=np.repeat(x_sq,n_2).reshape((n_1,n_2))
    Y_rpt=np.repeat(y_sq,n_1).reshape((n_2,n_1)).T
    return X_rpt+Y_rpt-2*gramMatrix(X,Y)

R code:

L2dist_improved=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(rep(x_sq,each=n_2),n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(rep(y_sq,each=n_1),n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}
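Both versions should return exactly the same matrix; a quick equivalence check (a sketch, not part of the original post):

X = np.random.normal(0, 1, (50, 8))
Y = np.random.normal(0, 1, (60, 8))
print(np.allclose(L2dist(X, Y), L2dist_improved(X, Y)))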

L2 distance – benchmark

To show the interest of our previous work, let's compare the computation speed of the vectorized L2 distance, a naive double-loop implementation, and the scikit-learn implementation. The experiments are run on datasets of different sizes, with 100 repetitions each.
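For reference, here is a minimal sketch of the kind of naive double-loop baseline used in such a comparison (the original benchmark code is not shown in the post, so this implementation is an assumption):

def L2dist_naive(X, Y):
    # Squared L2 distances computed pair by pair with explicit loops
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            D[i, j] = np.sum((X[i] - Y[j]) ** 2)
    return D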

[Figure: computation time of the L2 distance]

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation, and on par with the scikit-learn one.
