Scaling up text processing and Shutting up R: Topic modelling and MALLET

October 29, 2013

(This article was first published on Quantifying Memory, and kindly contributed to R-bloggers)

In this post I show how a combination of MALLET, Python, and data.table means we can analyse quite big data in R, even though R itself buckles when confronted with textual data.

Topic modelling is great fun. Using topic modelling I have been able to separate articles that treat the 'Kremlin' as a) a building, b) an international actor, c) the adversary of the Russian opposition, and d) a political and ideological symbol. But for topic modelling to be really useful it needs to see a lot of text. I like to do my analysis in R, but R tends to disagree with large quantities of text. Reading digital humanities literature lately, I suspect I am not alone in confronting this four-step process, of which this post is a result:

1) We want more data to get better results (and really just because it's available)
2) More data makes our software crash. Sometimes spectacularly, and particularly if it's R.
3) Ergo we need new or more efficient methods.
4) We write a blog post about our achievements in reaching step 3


Recently I have found that I can use MALLET for topic modelling of my Russian media dataset. Topic modelling has become popular of late, and there are numerous excellent descriptions of what it is and how it works - see for instance Ted Underwood's piece here. Personally I've not made extensive use of topic models, which are usually fitted using LDA, because I thought the Russian case structure would be a barrier to getting useful results, and because most tools struggled to handle the quantity of data I wanted to analyse. The R topicmodels package, for instance, was perfectly OK, but it relies on the tm package and on R, both of which struggle when faced with thousands, let alone millions, of texts.

The obvious solution to these problems is MALLET. MALLET runs on Java and is consequently very much more efficient than R. David Mimno has recently released a very handy R wrapper (aptly named 'mallet'). The package is handy for getting used to working with MALLET, but as the texts to be analysed need to be loaded into R, and consequently into memory, this wasn't really an option either. Ben Marwick has released an excellent script that allows you to run the command line implementation of MALLET from R, as well as import the results afterwards, and this is probably the closest implementation of what we need.

All these implementations suffer when you try to scale up the analysis - mainly due to R, rather than MALLET, though MALLET is also happier if it can keep all its input in memory.* I fed my data to the program through the command line, which worked fine after increasing the heap space. In the end MALLET spat out a 7GB file, a matrix just shy of 1 million rows by 1000 columns or so, which was much too large to read into R. Or rather, I could read in the file, but not do much with it afterwards.
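The exact command is not recorded here, so what follows is only a sketch of what a command-line run of this kind might look like, assembled in Python for convenience. The `train-topics` subcommand and the `--input`, `--num-topics`, `--output-doc-topics` and `--output-topic-keys` options are part of MALLET's command-line interface; the helper name `mallet_train_command` and every file path are hypothetical. The heap size is not one of these options: it is normally raised by editing the memory setting in the `bin/mallet` launcher script, so check that detail against your own MALLET version.

```python
def mallet_train_command(mallet_bin, input_path, num_topics,
                         doc_topics_path, topic_keys_path):
    """Build the argument list for a MALLET topic-model training run.

    All paths are placeholders (assumptions); only the option names
    come from MALLET's command-line interface.
    """
    return [
        mallet_bin, "train-topics",
        "--input", input_path,                    # corpus previously imported into MALLET format
        "--num-topics", str(num_topics),          # 500 topics, as in this post
        "--output-doc-topics", doc_topics_path,   # the large doc-topics file
        "--output-topic-keys", topic_keys_path,   # top words per topic
    ]


# Hypothetical paths, for illustration only.
cmd = mallet_train_command("bin/mallet", "corpus.mallet", 500,
                           "doc_topics.txt", "topic_keys.txt")

# To actually run it (requires a MALLET installation):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.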

I was able to read the file in using read.table, but my laptop died when I tried to convert the data.frame to a data.table. In the end I used fread from the data.table package to get the file straight into the data.table format.

The MALLET output is in the format [ID] [filename], followed by topic-proportion codes ranked according to the quality of the match:


And this repeated for 500 topics. This is great for identifying the top subject matter for individual articles - no reshaping at all is needed, but not so great for finding the distribution of topics across articles. To reorder the data by topic rather than topic rank would require the mother of all reshape operations. The obvious way to do this is using reshape and dcast, following (or: copy-pasting) Ben Marwick:

outputtopickeysresult <- read.table(outputtopickeys, header=F, sep="\t")
outputdoctopicsresult <- read.table(outputdoctopics, header=F, sep="\t")
# manipulate outputdoctopicsresult to be more useful 
dat <- outputdoctopicsresult
l_dat <- reshape(dat, idvar=1:2, varying=list(topics=colnames(dat[,seq(3, ncol(dat)-1, 2)]),
                                             props=colnames(dat[,seq(4, ncol(dat), 2)])),   direction="long")
library(reshape2)
w_dat <- dcast(l_dat, V2 ~ V3)
rm(l_dat) # because this is very big but no longer needed

Ehrm, yeah. R didn't like that very much. It wanted to allocate some insane quantity of memory to this operation, and it was just not an option - not even close, probably not on some super-computer, and definitely not on my laptop. 

The best option I found in a single piece of code was using the splitstackshape package:

out <- merged.stack(d, id.vars=c("id", "text"), var.stubs=c("topic", "proportion"), sep="topic|proportion")

This worked great for up to about 100,000 rows, but it still makes a copy of the data.table, and it struggled to deal with the whole dataset. I tried some other data.table options, but they all involved melting the data into a very long table, then casting it back into a short or wide form, and at no point was I able to process more than about 200,000 rows of data. It boiled down to this: if I am going to have a copy of the data in memory, I won't have enough spare memory to do anything with it.

The solution? A bit of Python. Nice and slow: read one line, write one line, placing the proportions in order of topic rather than rank. None of this load-everything-into-memory-and-cross-my-fingers nonsense. Another approach would be to have a dictionary holding a dictionary for each file and to write them all out at the end, but this would mean keeping the data in memory. This little script requires virtually no memory, and by halving the number of columns (and rounding to five decimal places) the output file was a third of the size of the input file - about 2GB.
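The script itself is not reproduced here, so the following is a minimal sketch of the read-one-line, write-one-line idea rather than the original code. It assumes the doc-topics file is tab-separated, that comment lines start with `#`, and that topics are numbered from 0; the function name `reorder_doc_topics`, the output column names, and the choice to write a bare `0` for topics missing from a line are all assumptions made for illustration.

```python
import csv


def reorder_doc_topics(in_path, out_path, num_topics, digits=5):
    """Stream a rank-ordered doc-topics file into a topic-ordered table.

    Input rows:  id, name, topic, proportion, topic, proportion, ...
    Output rows: id, name, proportion for topic 0, topic 1, ...
    Only one line is held in memory at a time.
    """
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = ["id", "text"] + ["topic%d" % k for k in range(num_topics)]
        writer.writerow(header)
        for line in src:
            if line.startswith("#") or not line.strip():
                continue  # skip the header comment and blank lines
            # rstrip() also drops a trailing tab, if the file has one
            fields = line.rstrip().split("\t")
            doc_id, name, pairs = fields[0], fields[1], fields[2:]
            row = ["0"] * num_topics  # topics absent from the line stay at 0
            # pairs alternates topic number and proportion, ranked by proportion
            for topic, prop in zip(pairs[0::2], pairs[1::2]):
                row[int(topic)] = "%.*f" % (digits, float(prop))
            writer.writerow([doc_id, name] + row)
```

A call such as `reorder_doc_topics("doc_topics.txt", "doc_topics_by_topic.tsv", num_topics=500)` (file names hypothetical) would produce a file with one fixed column per topic, which is the shape fread() can load without any reshaping.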

A benchmark: using fread(), R loaded the processed data in about four minutes of elapsed time (roughly three and a half minutes of user CPU time):

  user  system elapsed 
 201.72    3.09  242.02 

The imported file occupied about 1.6GB of memory, and was much more manageable:

> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    534899   28.6     899071   48.1    818163   43.7
Vcells 219987339 1678.4  233753696 1783.4 220164956 1679.8
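As a cross-check on the memory figure, the gc() columns can be converted by hand. This is a small arithmetic sketch, assuming a 64-bit build of R, where vector cells (Vcells) are 8 bytes each and cons cells (Ncells) are 56 bytes each; the helper name `cells_to_mb` is made up for the example.

```python
BYTES_PER_VCELL = 8    # vector cells are 8 bytes each
BYTES_PER_NCELL = 56   # cons cells on a 64-bit build of R (assumption)
BYTES_PER_MB = 1024 ** 2


def cells_to_mb(cells, bytes_per_cell):
    """Convert a gc() cell count to megabytes, rounded as gc() prints it."""
    return round(cells * bytes_per_cell / BYTES_PER_MB, 1)


vcells_mb = cells_to_mb(219987339, BYTES_PER_VCELL)
ncells_mb = cells_to_mb(534899, BYTES_PER_NCELL)
print(vcells_mb, ncells_mb)  # 1678.4 28.6, matching the "used" columns above
```

Nearly all of the memory is vector data, that is, the table itself; 1678.4 Mb is a little over 1.6GB when divided by 1024.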

And much easier to work with too:

etc.

Using data.table's ability to conduct join operations, data in this format allows me to analyse how a particular topic varied over time, whether it was more or less present in one newspaper or another, and whether it was associated with a particular genre, or to feed it to a machine learning test, or whatever really.
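To make the join idea concrete outside R, here is a sketch in plain Python rather than data.table: topic proportions are matched to per-document metadata on the document id, and one topic is then averaged within each group. The function name `mean_topic_by_group`, the metadata fields, and the toy values are all hypothetical; in R this would be a keyed join followed by a grouped mean.

```python
from collections import defaultdict


def mean_topic_by_group(topic_rows, metadata, topic, key):
    """Join topic proportions to metadata on document id, then average
    the chosen topic's proportion within each metadata group.

    topic_rows: iterable of (doc_id, list_of_proportions)
    metadata:   dict mapping doc_id -> dict of attributes
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc_id, proportions in topic_rows:
        meta = metadata.get(doc_id)
        if meta is None:
            continue  # inner join: skip documents without metadata
        group = meta[key]
        totals[group] += proportions[topic]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy data, invented purely for illustration.
toy_topics = [("a", [0.25, 0.75]), ("b", [0.75, 0.25]), ("c", [1.0, 0.0])]
toy_meta = {
    "a": {"year": 2012, "paper": "X"},
    "b": {"year": 2012, "paper": "Y"},
    "c": {"year": 2013, "paper": "X"},
}
by_year = mean_topic_by_group(toy_topics, toy_meta, topic=0, key="year")
```

Changing `key` from `"year"` to `"paper"` gives the same topic's average presence per newspaper instead of per year.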

* Has anyone figured out a good way of working around this?
