Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

1. Orignal GloVe paper
2. Jon Gauthier’s post, which provides detailed explanation of python implementation. This post helps me a lot with C++ implementation.

# Word embedding

After Tomas Mikolov et al. released word2vec tool, there was a boom of articles about words vector representations. One of the greatest is GloVe, which did a big thing while explaining how such algorithms work and refolmulating word2vec optimizations as special kind of factoriazation for word cooccurences matrix.

This post will consists of two main parts:

1. Very brief introduction into GloVe algorithm.
2. Details of implementation. I will show how to write fast, parallel asynchronous SGD with RcppParallel with adaptive learning rate in C++ using Intel TBB and RcppParallel.

# Introduction to GloVe algorithm

GloVe algorithm consists of following steps:

1. Collect word cooccurence statistics in a form of word coocurence matrix $X$. Each element $X_{ij}$ of such matrix represents measure of how often word i appears in context of word j. Usually we scan our corpus in followinf manner: for each term we look for context terms withing some area – window_size before and window_size after. Also we give less weight for more distand words (usually $decay = 1/offset$).
2. Define soft constraint for each word pair: $w_i^Tw_j + b_i + b_j = log(X_{ij})$. Here $w_i$ – vector for main word, $w_j$ – vector for context word, $b_i$, $b_j$ – scalar biases for main and context words.
3. Define cost function $J = sum_{i=1}^V sum_{j=1}^V ; f(X_{ij}) ( w_i^T w_j + b_i + b_j - log X_{ij})^2$. Here $f$ is a weighting function which help us to prevent learning only from exremly common word pairs. GloVe authors choose following fucntion:

# Implementation

Main challenges I faced during implementation:

1. Efficient cooccurence matrix creation.
2. Implementation of efficient SGD for objective cost function minimization.

## Cooccurence matrix creation

There are a few main issues with term cooccurence matrix (tcm):

1. tcm should be sparse. We should be able to construct tcm for large vocabularies ( > 100k words).
2. Fast lookups/inserts.

To meet requirement of sparsity we need to store data in associative array. `unordered_map` is good candidate because of $O(1)$ lookups/inserts complexity. I ended with `std::unordered_map< std::pair<uint32_t, uint32_t>, T >` as container for sparse matrix in triplet form. Performance of `unordered_map` heavily depends on underlying hash fucntion. Fortunately, we can pack `pair<uint32_t, uint32_t>` into single `uint64_t` in a determenistic way without any collisions.
Hash function for `std::pair<uint32_t, uint32_t>` will look like:

For details see this and this stackoverflow questions.

Also note, that our cooccurence matrix is symmetrical, so internally we will store only elements above main diagonal.

Now we should implement efficient parallel asynchronous stochastic gradient descent for word cooccurence matrix factorization, which is proposed in GloVe paper. Interesting thing - SGD inherently is serial algoritms, but when your problem is sparse, you can do asynchronous updates without any locks and achieve speedup proportional to number of cores on your machine! If you didn’t read HOGWILD!, I recomment to do that.

Let me remind formulation of SGD. We try to move $x_t$ parameters in a minimizing direction, given by $−g_t$ with a learning rate $alpha$:

So, we have to calculate gradients for our cost function:

$J = sum_{i=1}^V sum_{j=1}^V f(X_{ij}) ( w_i^T w_j + b_i + b_j - log X_{ij} )^2$:

We will use modification of SGD - AdaGrad algoritm. It automaticaly determines per-feature learning rate by tracking historical gradients, so that frequently occurring features in the gradients get small learning rates and infrequent features get higher ones. For AdaGrad implementation deteails see excellents Notes on AdaGrad by Chris Dyer.

Formulation of AdaGrad step $t$ and feature $i$ is following:

As we can see, at each iteration $t$ we need to keep track of sum over all historical gradients.

Actually we will use modification of AdaGrad - HOGWILD-style asynchronous AdaGrad 🙂 Main idea of HOGWILD! algorithm is very simple - don’t use any syncronizations. If your problem is sparse, allow threads to overwrite each other! This works and works fine. Again, see HOGWILD! paper for details and theoretical proof.

## Code

Now lets put all into the code.

As seen from analysis above, `GloveFit` class should consists following parameters:

1. word vecvors `w_i`, `w_j` (for main and context words).
2. biases `b_i`, `b_j`.
3. word vectors square gradients `grad_sq_w_i`, `grad_sq_w_j` for adaptive learning rates.
4. word biases square gradients `grad_sq_b_i`, `grad_sq_b_j` for adaptive learning rates.
5. `lerning_rate`, `max_cost` and other scalar model parameters.

### Single iteration

Now we should to initialize parameters and perform iteration of SGD:

For actual text2vec code (with a few tricks) check this loop.

### OpenMP

As discussed above, all these steps can be performed in parallel loop (over all non-zero word-coocurence scores). This can be easily done via OpenMP `parallel for` and reduction: `#pragma omp parallel for reduction(+:global_cost)`. But there is one significant issue with this approach - it is very hard to make portable R-package with OpenMP support. By default it will work only on linux distributions, because:

1. default `clang` on OS X don’t support OpenMP (of course you can install `clang-omp` or `gcc` from brew, but this also could be tricky).
2. Rtools begins support of OpenMP on Windows only in 2015. But even modern realization has substantial overheads.

For more details see OpenMP-support section of Writing R Extensions manual.

### Intel TBB

Luckily we have a better alternative - Intel Thread Building Blocks library and RcppParallel package which provides `RVector` and `RMatrix` wrapper classes for safe and convenient access to R data structures in a multi-threaded environment! Moreover it “just works” on main platforms - OS X, Windows, Linux. Have very positive experience with this library, thanks to Rcpp Core team and especially to JJ Allaire.

Using TBB is little bit trickier, then writing simple OpenMP `#pragma` directives. You should implement functor which operates on a chunk of data and call `parallelReduce` or `parallelFor` on entire data collection. You can find useful (and simple) examples at RcppParallel examples section.

### Putting all together

For now suppose, we have `partial_fit` method in `GloveFit` class with following signature (see actual code here):

It takes

1. tcm in sparse triplet form `<x_irow, x_icol, x_val>`
2. `begin` and `end` pointers for a range on which we want to perform our SDG.

And performs SGD steps over this range - updates word vectors, gradients, etc. At the end it retruns value of accumulated cost function. Note, that internally this method modifies values members of the class.

Also note, that signature of `partial_fit` is very similar to what we have to implement in our TBB functor. Now we are ready to write it:

As you can see, it is very similar to example form RcppParallel site. One diffrence - it has side-effects. By calling `partial_fit` it modifies internal state of the input instance of `GloveFit` class (which actually contains our GloVe model).

Now lets write `GloveFitter` class, which will be callable from R via `Rcpp-modules`. It will act as interface for fitting our model and take all input model parameters such as vocabulary size, desired word vectors size, initial AdaGrad learning rate, etc. Also we want to track cost between iterations and want to be able to perform some early stopping strategy between SGD iterations. For that purpose we keep our model in C++ class, so we can modify it “in place” at each SGD iteration (which can be problematic in R)

And create wrapper with `Rcpp-Modules`:

Now we can use `GloveFitter` class from R: