Machine Learning Explained: Vectorization and matrix operations

Posted on May 7, 2018 by Antoine Guillot in R bloggers | 0 Comments

[This article was first published on Enhance Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today in Machine Learning Explained, we will tackle a central (yet under-looked) aspect of Machine Learning: vectorization. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and to sequentially sum them. This naive way is slow and tends to get even slower with large amounts of data and large data structures.
With vectorization these operations can be seen as matrix operations which are often more efficient than standard loops. Vectorized versions of algorithm are several orders of magnitudes faster and are easier to understand from a mathematical perspective.

A basic exemple of vectorization

Preliminary exemple – Python

Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its element 1,000 times with each algorithm. The overall computation time will then be compared.

import numpy as np
import time
W=np.random.normal(0,1,100000)
n_rep=1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time()
for i in range(n_rep):
    loop_res=0
    for elt in W:
        loop_res+=elt
time_loop = time.time() - start_time

If $\mathbf{W}$ is our vector of interest, the sum of its elements can be expressed as the dot product $\mathbf{1}^T \mathbf{W} = \sum W_i$ :

start_time = time.time()
for i in range(n_rep):
    one_dot=np.ones(W.shape)
    vect_res=one_dot.T.dot(W)
time_vect = time.time() - start_time

Finally, we can check that both methods yield the same results and compare their runtime. The vectorized version run approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res-loop_res)<10e-10)
print(time_loop/time_vect)

Note: The same results can be obtained with np.sum.The numpy version has a very similar runtime to our vectorized version. Numpy being very optimized, this show that our vectorized sum is reasonably fast.

start_time = time.time()
for i in range(n_rep):
    vect_res=np.sum(W)
time_vect_np = time.time() - start_time

Preliminary exemple – R

The previous experiments can be replicated in R:

##Creation of the vector
W=matrix(rnorm(100000))
n_rep=10000
#Naive way:
library(tictoc)
tic('Naive computation')
for (rep in 1:n_rep)
{
  res_loop=0
  for (w_i in W)
    res_loop=w_i+res_loop
}
toc()

tic('Vectorized computation')
# vectorized way
for (rep in 1:n_rep)
{
  ones=rep(1,length(W))
  res_vect= crossprod(ones,W)
}
toc()

tic('built-in computation')
# built-in way
for (rep in 1:n_rep)
{
  res_built_in= sum(W)
}
toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performances and is an order of magnitude faster than our vectorized way.

Preliminary exemple – Results

$\begin{tabular}{|l|c|c|r|} \hline Lang & Naive & Vectorized & Built in \\ \hline Python & 115 s & 0.55 s & 0.28 s \\ R & 25 s & 3.8 s & 0.9 s \\ \hline \end{tabular}$

$\text{Computation time of the sum of an array of 100,000 elements (10,000 rep)}$

Vectorization divides the computation times by several order of magnitudes and the difference with loops increase with the size of the data. Hence, if you want to deal with large amount of data, rewriting the algorithm as matrix operations may lead to important performances gains.

Why vectorization is (often) faster

R and Python are interpreted language, this means that your instructions are analyzed and interpreted at each execution. Since they are not statically typed, the loops have to assess the type of the operands at each iteration which leads to a computational overhead.
R and Python linear algebra relies on optimized back-end for matrix operations and linear algebra. These back-end are written in C/C++ and can process loops efficiently. Furthermore, while loops are sequential, these back-end can run operations in parallel which improves the computation speed on modern CPU.

Note 1: Though vectorization is often faster, it requires to allocate the memory of the array. If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: When you deal with large arrays or computationally intensive algebra ( like inversion, diagonalization, eigenvalues computations, ….) computations on GPU are even order of magnitudes faster than on CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code maked it easier to translate CPU code to GPU (or tensor-based frameworks).

A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.

Column-wise matrix sum

The column wise sum (and mean) can be expressed as a matrix product. Let $\mathbf{M}$ be our matrix of interest. Using the matrix multiplication formula, we have: $[\mathbf{1}^T\mathbf{M}]_j= \sum_i \mathbf{1}_i \mathbf{M}_{i,j} =\sum_i \mathbf{M}_{i,j}$ . Hence, the column-wise sum of $\mathbf{M}$ is $\mathbf{1}^T\mathbf{M}$ .

Python code:

def colWiseSum(W):
    ones=np.ones((W.shape[0],1))
    return ones.T.dot(W)

R code:

colWiseSum=function(W)
{
  ones=rep(1,nrow(W))
  t(W)%*%ones
}

Row-wise matrix sum

Similarly, the row-wise sum is $\mathbf{M}\mathbf{1}$ .

Python code:

def rowWiseSum(W):
    ones=np.ones((W.shape[1],1))
    return W.dot(ones)

R code:

rowWiseSum=function(W)
{
  ones=rep(1,ncol(W))
  W%*%ones
}

Matrix sum

The sum of all the elements of a matrix is the sum of the sum of its rows. Using previous expression, the sum of all the terms of $\mathbf{M}$ is $\mathbf{1}^T\mathbf{M}\mathbf{1}$ .

Python code:

def matSum(W):
    rhs_ones=np.ones((W.shape[1],1))
    lhs_ones=np.ones((W.shape[0],1))
    return lhs_ones.T.dot(W).dot(rhs_ones)

R code:

matSum=function(W)
{
  rhs_ones=rep(1,ncol(W))
  lhs_ones=rep(1,nrow(W))
  t(lhs_ones) %*% W%*% rhs_ones
}

Similarity matrix (Gram matrix)

Let’s say we have a set of words and for each of this words we want to find the most similar words from a dictionary. We assume that the words have been projected in space of dimension $p$ (using word2vect). Let $\mathbf{X}$ (our set of words) and $\mathbf{Y}$ (our dictionary) be two matrices resp. in $\mathbb{R}^{n_1,p}$ and $\mathbb{R}^{n_2,p}$ . To compute the similarity of all the observations of $\mathbf{X}$ and $\mathbf{Y}$ we simply need to compute $\mathbf{X} \mathbf{Y}^T$ .

Python code:

def gramMatrix(X,Y):
    return X.dot(Y.T)

R code:

gramMatrix=function(X,Y)
{
  X %*% t(Y)
}

L2 distance

We want to compute the pair-wise distance between two sets of vector. Let $\mathbf{X}$ and $\mathbf{Y}$ be two matrix in $\mathbb{R}^{n_1,p}$ and $\mathbb{R}^{n_2,p}$ . For each vector of $X$ we need to compute the distance with all the vectors of $Y$ .Hence, the output matrix should be of size $(n_1,n_2)$ .

If $x$ and $y$ are two vectors, their $\mathbf{L}^2$ distance is:

$\sum (x_i - y_i)^2 = \sum x_i^2 + \sum y_i^2 - 2 \sum x_i y_i = x^T x + y^T y - 2 x^T y$

. To compute all pairwise distance, some work is required on the last equality. All the matrices should be of size $(n_1 \times n_2,p)$ , then the output vector of distance will be of size $(n_1 \times n_2,1)$ , which can be reshaped into a vector of size $(n_1,n_2)$ .

The first two terms $x^T x$ and $y^T y$ need to be computed for each $x$ and $y$ . $\mathbf{X} \odot \mathbf{X}$ is the element-wise multiplication of X with itself (its elements are $x_{i,j}^2$ ). Hence, the i-th element of $x_{sq}=(\mathbf{X} \odot \mathbf{X}) \mathbf{1}$ is the squared sum of the coordinate of the i-th observation, $\sum_j x_{i,j}^2$ .

However, this is a vector and its shape is $(n_1,1)$ . By replicating each of its elements $n_2$ times, we will get a matrix of size $(n_1 \times n_2,1)$ . The replication can be done easily, if we consider the right matrix $\mathbf{\Delta}_{n_2}^{n_1}$ .

Let $\mathbf{1}_{n_2}$ be a vector of one of size $n_2$ . Let $\mathbf{\Delta}_{n_2}^{n_1}$ be the matrix of size $(n_1\times n_2, n_1)$ with $n_1$ repetitions of $\mathbf{1}_{n_2}$ on the “diagonal”:

$\mathbf{\Delta}_{n_2}^{n_1}=\begin{matrix} \mathbf{1}_{n_2} & 0 & \ldots & 0 \\ 0 & \mathbf{1}_{n_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 &\ldots & \mathbf{1}_{n_2} \end{matrix}$

Then, our final vector is $\Delta_{n_2}^{n_1}x_{sq}$ (The same expression holds for $\mathbf{Y}$ ). We denote $\Phi$ a reshape operator (used to transform the previous vector in matrix). With previous part on similarity matrix, we get the following expression of the pairwise distance:

$\Phi(\mathbf{\Delta}_{n_2}^{n_1} ((\mathbf{X} \odot \mathbf{X}) \mathbf{1}_{p}))+\Phi(\mathbf{\Delta}_{n_1}^{n_2} ((\mathbf{Y} \odot \mathbf{Y}) \mathbf{1}_{p})) -2 \mathbf{X} \mathbf{Y}^T$

The previous expression can seem complex, but this will help us a lot to code the pairwise distance. We only have to do the translation from maths to Numpy or R.

Python code:

def L2dist(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
    delta_n1_n2=np.repeat(np.eye(n_1),n_2,axis=0)
    delta_n2_n1=np.repeat(np.eye(n_2),n_1,axis=0)
    return np.reshape(delta_n1_n2.dot(x_sq),(n_1,n_2))+np.reshape(delta_n2_n1.dot(y_sq),(n_2,n_1)).T-2*gramMatrix(X,Y)

R Code:

L2dist=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq,n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}

L2 distance (improved)

Actually the previous L2dist is not completely optimized $\mathbf{\Delta}_{n_2}^{n_1} ((\mathbf{X} \odot \mathbf{X}) \mathbf{1}_{p})$ requires a lot of memory since $\Delta}_{n_2}^{n_1}$ has $n_2 \times n_1^2$ cells and is mostly empty. Using Numpy built-in function, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprints by a factor $max(n_1,n2)$ ):

Python code:

def L2dist_improved(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
##Replace multiplication by a simple repeat
    X_rpt=np.repeat(x_sq,n_2).reshape((n_1,n_2))
    Y_rpt=np.repeat(y_sq,n_1).reshape((n_2,n_1)).T
    return X_rpt+Y_rpt-2*gramMatrix(X,Y)

R code:

L2dist_improved=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(rep(x_sq,each=n_2),n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(rep(y_sq,each=n_1),n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}

L2 distance – benchmark

To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation and the scikit-learn implementation. The experiments are run on different size of dataset with 100 repetitions.

compute time L2 — Computation time of the L2 distance

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation and on par with the scikit implementation.

The post Machine Learning Explained: Vectorization and matrix operations appeared first on Enhance Data Science.

To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Machine Learning Explained: Vectorization and matrix operations

A basic exemple of vectorization

Preliminary exemple – Python

Preliminary exemple – R

Preliminary exemple – Results

Why vectorization is (often) faster

A small zoo of matrix operations

Column-wise matrix sum

Row-wise matrix sum

Matrix sum

Similarity matrix (Gram matrix)

L2 distance

L2 distance (improved)

L2 distance – benchmark

Related

A basic exemple of vectorization

Preliminary exemple – Python

Preliminary exemple – R

Preliminary exemple – Results

Why vectorization is (often) faster

A small zoo of matrix operations

Column-wise matrix sum

Row-wise matrix sum

Matrix sum

Similarity matrix (Gram matrix)

L2 distance

L2 distance (improved)

L2 distance – benchmark

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)