Determining the number of “topics” in a corpus of documents
In my last post I finished by topic modelling a set of political blogs from 2004. I made a passing comment that it’s a challenge to know how many topics to set; the R
topicmodels package doesn’t do this for you. There’s quite a discussion on this out there, but nearly all the extant approaches amount to fitting your model with lots of different values of k (where k is the number of latent topics to identify) and seeing which is “best”. Naturally, this raises the question of what is meant by “best”.
A literature has grown around this question, and the convenient
ldatuning R package by Nikita Murzintcev provides metrics using four of the methods. A helpful vignette shows this in action with the
AssociatedPress dataset of 2,246 articles and 10,473 words. This dataset was presented at the First Text Retrieval Conference (TREC-1) in 1992 and is distributed with several of the commonly used R text-mining packages. The
ldatuning approach is super-simple to run; here’s all the code needed to calculate three of the measures for a range of different topic numbers from 10 to 350. To save computational effort, I omitted a fourth method that the vignette shows not to work well with this particular data.
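This is a sketch of the kind of call involved. The metric names are ldatuning’s own, but the exact grid of topic numbers, seed, and core count here are my own illustrative choices, and the omitted metric may differ from the one the vignette drops.

```r
library(ldatuning)
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# try three of the four metrics at topic numbers from 10 to 350
result <- FindTopicsNumber(
  AssociatedPress,
  topics   = seq(10, 350, by = 10),
  metrics  = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method   = "Gibbs",
  control  = list(seed = 77),
  mc.cores = 4L,
  verbose  = TRUE
)

# ldatuning's built-in plot of the metrics against number of topics
FindTopicsNumber_plot(result)
```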
And here are the results.
OK, pretty convincing, and there’s a good literature behind it to justify these methods. The various methods agree that somewhere between 90 and 140 topics is optimal for this dataset. Time-consuming, but rigorous. The code above took 10 hours to run on an old machine with four CPUs. The
ldatuning documentation has references and links to the underlying articles.
The abstracts of the (mostly paywalled unfortunately) articles implemented by
ldatuning look like the metrics they suggest are based on maximising likelihood, minimising Kullback-Leibler divergence or similar, using the same dataset that the model was trained on (rather than cross-validation).
While I was researching this I came across this question on Stack Overflow, as well as a few mentions elsewhere on the web of cross-validation of topic models to choose the optimal number of topics. As part of my familiarisation with the whole approach to topic modelling I decided to look further into this. For example, the vignette for topicmodels says the number of topics in their worked example was chosen by 10-fold cross-validation, but doesn’t say how this was done other than pointing out that it is possible in “a few lines of R code”.
A more complete description is given in this conference paper by Zhao, Chen, Perkins, Liu, Ge, Ding and Zou:
“Lacking such a heuristic to choose the number of topics, researchers have no recourse beyond an informed guess or time-consuming trial and error evaluation. For trial and error evaluation, an iterative approach is typical based on presenting different models with different numbers of topics, normally developed using cross-validation on held-out document sets, and selecting the number of topics for which the model is least perplexed by the test sets… Using the identified appropriate number of topics, LDA is performed on the whole dataset to obtain the topics for the corpus. We refer to this as the perplexity-based method. Although the perplexity-based method may generate meaningful results in some cases, it is not stable and the results vary with the selected seeds even for the same dataset.”
The key idea of cross-validation is that you divide the data into a number of subsets - conventionally 5 or 10; let’s say 5 from now on - and take turns at using one of the five as a validation set while the remaining four are used as a training set. This way each data point gets one turn in the hold-out validation set, and four turns as part of the training set. It’s useful for assessing the overall validity of the model on data that wasn’t involved in its training, and also for identifying good values of tuning hyper-parameters. For today, I want to use it to find the best value of the number-of-topics hyper-parameter k.
I’m going to use the
perplexity measure for the applicability of a topic model to new data. Perplexity is a measure of how well a probability model predicts a sample. Given the
perplexity function comes in the
topicmodels R package and is the obvious supplied way to test a trained model on new data, I think this is most likely what Grün and Hornik did for cross-validation in the
topicmodels vignette; and it is mentioned by Zhao et al as the method used in “time-consuming trial and error evaluation”. I’m not particularly worried for now about what Zhao et al say about the stability of cross-validation with perplexity; I might come back to that in a later post (my intuition is that this is unlikely to be a problem, but I’ll want to check that some time!).
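For reference, the standard definition (as in Blei, Ng and Jordan’s original LDA paper) of perplexity on a held-out set of M documents, where w_d are the words of document d and N_d its length, is the exponentiated negative mean log-likelihood per word:

$$\mathrm{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)$$

Lower is better: a model that assigns higher probability to the held-out words is less "perplexed" by them.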
Let’s do this one step at a time, building up to the full cross-validation at different values of k:
- first, I’m going to do validation (not cross-validation), on a single hold-out set of one fifth of the data, at a single value of k
- then I’m going to use cross-validation at a single value of k
- finally I’m going to use cross-validation at many values of k
Cross-validation is computationally intensive but embarrassingly parallel, so I’m going to parallelise the second and third of the efforts above.
Simple validation with a single number of topics
For the simple validation, I make a single
train_set version of 75% of rows from the
AssociatedPress document-term-matrix, randomly chosen; and a
valid_set of the remaining 25%. The model is fit with the
LDA function, and then (final two lines of code in the chunk below) I estimate how perplexed the resulting model is with both the training data set and the validation set.
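A sketch of that chunk follows. The burn-in and iteration settings, the seed, and the particular value of k are illustrative choices rather than tuned values, so exact perplexity numbers will vary.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
full_data <- AssociatedPress
n <- nrow(full_data)
k <- 5  # a single, arbitrary number of topics for this first test

# 75% of rows for training, the remaining 25% for validation
set.seed(123)
splitter  <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]

fitted <- LDA(train_set, k = k, method = "Gibbs",
              control = list(burnin = 1000, iter = 1000, keep = 50))

# how perplexed is the model by the data it was trained on,
# versus data it has never seen?
perplexity(fitted, newdata = train_set)
perplexity(fitted, newdata = valid_set)
```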
Perplexity was 2728 on the training data, and a much higher 4280 on the validation set. This is to be expected - the model has been fit to the training set, so naturally it finds the new validation data more perplexing; that’s the whole reason for confronting a model with fresh data. BTW, to check whether this was affected by the number of documents, I also tried the above with equally sized training and validation sets and got similar results.
Cross validation with a single number of topics
The next step is to do the validation with a single value of k, but splitting the data into five folds so that each row of the data gets a turn in the validation set. Here’s how I did that:
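The sketch below parallelises over the five folds with doParallel and foreach; the fold assignment, cluster size and control settings are all illustrative.

```r
library(topicmodels)
library(doParallel)

data("AssociatedPress", package = "topicmodels")
full_data <- AssociatedPress
n <- nrow(full_data)
k <- 100  # a single candidate number of topics

# randomly assign each document to one of five folds
folds <- 5
set.seed(123)
splitfolds <- sample(1:folds, n, replace = TRUE)

cluster <- makeCluster(detectCores(logical = TRUE) - 1)
registerDoParallel(cluster)
clusterEvalQ(cluster, library(topicmodels))
clusterExport(cluster, c("full_data", "k", "splitfolds"))

# each fold takes a turn as the hold-out validation set
results <- foreach(i = 1:folds, .combine = c) %dopar% {
  train_set <- full_data[splitfolds != i, ]
  valid_set <- full_data[splitfolds == i, ]
  fitted <- LDA(train_set, k = k, method = "Gibbs",
                control = list(burnin = 1000, iter = 1000, keep = 50))
  perplexity(fitted, newdata = valid_set)
}
stopCluster(cluster)

results  # five perplexity scores, one per held-out fold
```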
The results aren’t particularly interesting - just five numbers - so I won’t show them here.
Cross validation with many candidate numbers of topics
Finally, we now do this cross-validation many times, with different values of the number of latent topics to estimate. Here’s how I did that:
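A sketch of the full grid search is below. The candidate_k values follow those mentioned in the text (up to 100, then 200 and 300); the fold assignment and control settings are illustrative. Note that each parallel task is one value of k, with the five folds run sequentially inside it, which is one plausible reason why not all processors stay busy towards the end.

```r
library(topicmodels)
library(doParallel)

data("AssociatedPress", package = "topicmodels")
full_data <- AssociatedPress
n <- nrow(full_data)

folds <- 5
set.seed(123)
splitfolds  <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300)

cluster <- makeCluster(detectCores(logical = TRUE) - 1)
registerDoParallel(cluster)
clusterEvalQ(cluster, library(topicmodels))
clusterExport(cluster, c("full_data", "folds", "splitfolds", "candidate_k"))

# one parallel task per candidate k; five-fold CV within each task
results <- foreach(j = seq_along(candidate_k), .combine = rbind) %dopar% {
  k <- candidate_k[j]
  out <- matrix(0, nrow = folds, ncol = 2,
                dimnames = list(NULL, c("k", "perplexity")))
  for (i in 1:folds) {
    train_set <- full_data[splitfolds != i, ]
    valid_set <- full_data[splitfolds == i, ]
    fitted <- LDA(train_set, k = k, method = "Gibbs",
                  control = list(burnin = 1000, iter = 1000, keep = 50))
    out[i, ] <- c(k, perplexity(fitted, newdata = valid_set))
  }
  out
}
stopCluster(cluster)

results_df <- as.data.frame(results)
```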
This took 75 minutes to run on a modern machine (different to that used for
ldatuning) when I stopped the
candidate_k at 100. When I added the final values of 200 and 300 to
candidate_k it took 210 minutes. I did notice that towards the end, not all processors were being used at maximum capacity, which suggests the way I’ve parallelised it may not be optimal.
Exactly how number of documents, topics and words contribute to the time cost of fitting a topic model is a subject for a future post I think.
Here are the results from the cross-validation using perplexity:
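The plotting itself is straightforward; a minimal sketch with ggplot2, assuming the cross-validated results have been gathered into a data frame results_df with columns k and perplexity (those names are mine, not from the original code):

```r
library(ggplot2)

# one point per (fold, k) combination, with a smoother to show the trend
ggplot(results_df, aes(x = k, y = perplexity)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("Five-fold cross-validation of topic modelling",
          subtitle = "Five models fit for each candidate number of topics") +
  labs(x = "Candidate number of topics",
       y = "Perplexity on the hold-out validation set")
```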
We get something that at least is consistent with the measures from the
ldatuning package; there’s a distinct flattening out of the cross-validated perplexity somewhere between 50 and 200 topics. By the time we have 200 topics there has definitely been over-fitting, and the model starts getting worse when tested on the hold-out validation sets. There’s still judgement required as to exactly how many topics to use, but 100 looks like a good consensus number, supported by the three ldatuning metrics I tried as well as by the cross-validated perplexity.
Presentation of final model
I opted to fit a model with 90 topics. Here’s an animation of the results:
Code for creating the animation:
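A sketch only: the model object, file names and frame layout below are my own illustrative choices (using the magick package for the GIF), not necessarily how the original animation was made.

```r
library(topicmodels)
library(magick)

data("AssociatedPress", package = "topicmodels")
final_model <- LDA(AssociatedPress, k = 90, method = "Gibbs",
                   control = list(burnin = 1000, iter = 1000, keep = 50))

# one PNG frame per topic, showing its ten highest-probability terms
top_terms <- terms(final_model, 10)
dir.create("frames", showWarnings = FALSE)
for (i in seq_len(ncol(top_terms))) {
  png(sprintf("frames/topic_%03d.png", i), width = 600, height = 450)
  par(mar = c(1, 1, 3, 1))
  plot.new()
  title(main = paste("Topic", i))
  text(0.5, seq(0.9, 0.1, length.out = nrow(top_terms)), top_terms[, i])
  dev.off()
}

# stitch the frames together into an animated GIF
frames <- image_read(sprintf("frames/topic_%03d.png", seq_len(ncol(top_terms))))
image_write(image_animate(frames, fps = 2), "topics.gif")
```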
A small note on
topicmodels in Linux.
In writing this post I used both Windows and Linux (Ubuntu) machines, and it was my first time using the R
topicmodels package on Linux. I had some frustrations getting it installed; basically the same problem with a missing GNU Scientific Library described in this blog post and this Stack Overflow Q&A.
The trick was I needed not just
gsl-bin but also
libgsl0-dev. Also the
mpfr library which is “a GNU portable C library for arbitrary-precision binary floating-point computation with correct rounding, based on GNU Multi-Precision Library.” In Ubuntu:
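A minimal install sketch with the package names discussed above (note that on newer Ubuntu releases the GSL development package is named libgsl-dev rather than libgsl0-dev):

```shell
sudo apt-get update
sudo apt-get install gsl-bin libgsl0-dev libmpfr-dev
```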