Moving around sparse matrices of text data – the limitations of as.h2o
This post is the resolution of a challenge I first wrote about in late 2016: moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package, the ability for as.h2o() to interpret a sparse Matrix object from R and convert it to an H2O frame. The Matrix and as.h2o method is OK for medium-sized data but broke down on my hardware with a larger dataset – a bag of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row, and most cells are empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix from the Matrix package or a simple triplet matrix from the slam package. I’m not sure what as.h2o does under the hood in converting from Matrix to an H2O frame, but it’s too much for my laptop.
My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext approach, but H2O for high-powered machine learning. tidytext makes it easy to create a sparse matrix with cast_dtm or cast_sparse, but uploading this to H2O can be a challenge.
How to write from R into SVMLight format
After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into an SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight"). This turned the problem from a difficult and possibly intractable memory management challenge into a difficult and possibly intractable data formatting and file writing challenge – how to efficiently write files in SVMLight format.
SVMLight format combines a data matrix with some modelling information, i.e. the response value of a model, or “label” as it is (slightly oddly, I think) often called in this world. Instead of a more conventional sparse matrix format that conveys information in row-column-value triples, it writes one line per row: the label first, followed by column:value pairs for the non-empty cells. It looks like this:
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3
That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1 as the response value, 3.4 in the 10th column of explanatory variables, 0.5 in the 123rd column, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1 in the 22nd column, and 0.3 in the 456th column.
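To make the correspondence with row-column-value triples concrete, here is a minimal base R sketch that reads those two example lines back into a response vector and a set of triplets. The helper name parse_svmlight is made up for illustration and is not part of any package mentioned in this post.

```r
# Hypothetical helper (illustration only): parse SVMLight-format lines into
# a response vector y and (row i, column j, value v) triplets.
parse_svmlight <- function(lines) {
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  # The first token on each line is the response ("label")
  y <- as.numeric(vapply(tokens, function(tk) tk[1], character(1)))
  # The remaining tokens are column:value pairs
  pairs <- lapply(tokens, function(tk) tk[-1])
  kv <- strsplit(unlist(pairs), ":", fixed = TRUE)
  list(
    y = y,
    i = rep(seq_along(lines), times = lengths(pairs)),
    j = as.integer(vapply(kv, function(p) p[1], character(1))),
    v = as.numeric(vapply(kv, function(p) p[2], character(1)))
  )
}

example_lines <- c("1 10:3.4 123:0.5 34567:0.231", "0.2 22:1 456:0.3")
parsed <- parse_svmlight(example_lines)
str(parsed)
```

The i, j and v vectors are exactly what a triplet-style sparse matrix stores, which is why converting between the two formats is mostly a matter of string handling.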
Writing data from R into this format is a known problem discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm, is reported as being slow, and it is also integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had an (also slow) solution for writing dense matrices into SVMLight format, but that didn’t help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simple triplet matrices and writing SVMLight format. My first version depended on nested paste statements that were applied to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data.table wizardry by @Roland I was able to reduce the expected time for writing my 300,000 by 83,000 New York Times matrix (having removed stop words) from several centuries to two minutes.
I haven’t turned this into a package – it would seem better to find an existing package to add it to than to create a package just for this one function; any ideas appreciated. The functions are available on GitHub but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R and hence creates memory limitations, but it is much, much faster than writing the file one line at a time (seconds rather than years in my test cases).
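As a rough illustration of that two-step idea (build a character vector, then write it out), here is a simplified base R sketch. It is a stand-in, not the data.table version described above: the function names svmlight_lines and write_svmlight are hypothetical, and it takes plain triplet vectors (row i, column j, value v) plus a response vector y rather than a slam object.

```r
# Hypothetical helpers (illustration only, not the functions from the GitHub repo).

# Step 1: build one SVMLight line per row from triplets and a response vector.
svmlight_lines <- function(i, j, v, y) {
  # Sort entries by row, then by column within each row
  ord <- order(i, j)
  # as.integer() stops large column numbers being printed as e.g. "1e+05"
  pairs <- paste0(as.integer(j[ord]), ":", v[ord])
  # Group the column:value strings by row; rows with no entries stay empty
  by_row <- split(pairs, factor(i[ord], levels = seq_along(y)))
  features <- vapply(by_row, paste, character(1), collapse = " ", USE.NAMES = FALSE)
  trimws(paste(y, features))
}

# Step 2: write the whole character vector to file in a single call.
write_svmlight <- function(lines, path) {
  writeLines(lines, con = path)
  invisible(path)
}

i <- c(1, 1, 1, 2, 2)
j <- c(10, 123, 34567, 22, 456)
v <- c(3.4, 0.5, 0.231, 1, 0.3)
y <- c(1, 0.2)

out <- svmlight_lines(i, j, v, y)
path <- write_svmlight(out, tempfile(fileext = ".svmlight"))
readLines(path)
```

Because every step works on whole vectors, nothing is pasted or written row by row, which is the property that matters for speed; the price is that the triplets, the pasted strings and the final lines all sit in memory at once.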
Example application - spam detection with the Enron emails
Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.
The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable about (or interested in) spam detection, so please take the analysis below as a crude and naive example only.
First the data need to be downloaded and unzipped. The files are stored as six Tape ARchive (tar) files.
This creates six folders with the names enron1, enron2, etc., each with a spam and a ham subfolder containing numerous text files. The files look like this example piece of ham (i.e. non-spam; a legitimate email), chosen at random:
Subject: re : creditmanager net meeting
yes , this will work for us .
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net - meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
aidan mc nulty
212 981 7422
The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc.
Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson’s tidytext package, which I never tire of saying is a game changer for convenient analysis of text by statisticians:
As well as basic word counts, I wanted to experiment with other characteristics of emails such as the number of words and the number and proportion of stopwords (frequently used words like “and” and “the”). I create a traditional data frame with a row for each email, identified by id, and columns indicating whether it is SPAM and those other characteristics of interest.
Source: local data frame [33,702 x 6]
Groups: id [33,702]
   id                                 SPAM  number_characters number_words number_stop_words
   <chr>                              <chr>             <int>        <int>             <int>
 1 0001.1999-12-10.farmer.ham.txt     ham                  28            4                 0
 2 0001.1999-12-10.kaminski.ham.txt   ham                  24            4                 3
 3 0001.2000-01-17.beck.ham.txt       ham                3486          559               248
 4 0001.2000-06-06.lokay.ham.txt      ham                3603          536               207
 5 0001.2001-02-07.kitchen.ham.txt    ham                 322           48                18
 6 0001.2001-04-02.williams.ham.txt   ham                1011          202               133
 7 0002.1999-12-13.farmer.ham.txt     ham                4194          432               118
 8 0002.2001-02-07.kitchen.ham.txt    ham                 385           64                40
 9 0002.2001-05-25.SA_and_HP.spam.txt spam                990          170                80
10 0002.2003-12-18.GP.spam.txt        spam               1064          175                63
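To show how columns like these can be derived, here is a minimal base R sketch on two made-up documents. It is not the tidytext pipeline that produced the table above: the ids and text are invented, the stop word list is a tiny illustrative one, and proportion_stop_words is an assumed name for the proportion column.

```r
# Toy documents standing in for emails; ids and text are invented for illustration.
docs <- data.frame(
  id   = c("a.ham.txt", "b.spam.txt"),
  text = c("yes , this will work for us .", "click here for cheap meds and software"),
  stringsAsFactors = FALSE
)

# A tiny illustrative stop word list; a real analysis would use a fuller one.
stop_words <- c("this", "will", "for", "us", "and", "here", "the")

# Split each document into lower-case word tokens, dropping punctuation.
tokens <- strsplit(tolower(docs$text), "[^a-z0-9']+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# One row per document, with the label taken from the file name.
features <- data.frame(
  id                = docs$id,
  SPAM              = ifelse(grepl("spam", docs$id, fixed = TRUE), "spam", "ham"),
  number_characters = nchar(docs$text),
  number_words      = lengths(tokens),
  number_stop_words = vapply(tokens, function(w) sum(w %in% stop_words), integer(1)),
  stringsAsFactors  = FALSE
)
features$proportion_stop_words <- features$number_stop_words / features$number_words
features
```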
I next make my sparse matrix as a document term matrix (which is a special case of a simple triplet matrix from the slam package), with a column for each word (having first limited myself to interesting words).
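For readers who want to see what such a cast amounts to, here is a small sketch that builds a sparse document-by-word matrix from tidy counts. Note the swap: it uses sparseMatrix() from the Matrix package rather than tidytext's cast_dtm, purely to keep the example self-contained, and the counts are made up.

```r
library(Matrix)

# Toy tidy word counts: one row per (document, word) pair; values are invented.
counts <- data.frame(
  id   = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  word = c("money", "click", "meeting", "money", "meds"),
  n    = c(2, 1, 3, 1, 4),
  stringsAsFactors = FALSE
)

docs  <- unique(counts$id)
vocab <- sort(unique(counts$word))

# Rows are documents, columns are words, and only non-zero counts are stored.
dtm <- sparseMatrix(
  i = match(counts$id, docs),
  j = match(counts$word, vocab),
  x = counts$n,
  dims = c(length(docs), length(vocab)),
  dimnames = list(docs, vocab)
)
dtm

# The row/column/value triplets, i.e. the raw material for an SVMLight file
triplets <- summary(dtm)
triplets
```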
Now we can load our two datasets onto an H2O cluster for analysis:
I now have an H2O frame with 33602 rows and 26592 columns. Most of the columns represent words, with the cells being counts, but some columns represent other variables such as the number of stopwords.
To give H2O a workout, I decided to fit four different types of models to work out which emails were ham and which were spam, including:
generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
gradient boosting machine
I split the data into training, validation and testing subsets, with the idea that the validation set would be used for choosing tuning parameters and the testing set for a final comparison of the predictive power of the final models. As things turned out, I didn’t have the patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I’m pretty confident better performance would be possible with more careful choice of some of the meta parameters. Here are the eventual results from my not-particularly-tuned models:
The humble generalized linear model (GLM) performs pretty well, clearly outperformed only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means that a higher count of the word makes an email less likely to be spam):
So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron - things may have changed) more likely to be spam.