The first thing to realize is that this problem is intrinsically different from classifying documents into topics, because the topics are not known beforehand. (This is also, in a way, what the ‘latent’ in ‘Latent Dirichlet Allocation’ refers to.) We want to simultaneously solve two problems: discovering the topic groups in our data, and assigning documents to those topics. (The assignment metaphor isn’t exact, and we’ll see why in just a sec.)
A conventional approach to grouping documents into topics might be to cluster them using some features and call each cluster one topic. LDA goes a step further: it allows a document to arise from a combination of topics. So, for example, comic 162 might be classified as 0.6 physics, 0.4 romance.
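To make that mixture idea concrete, here is a minimal sketch of LDA's generative story for a single document, using the hypothetical 0.6 physics / 0.4 romance split above (the topic names are made up for illustration):

```python
import random

random.seed(0)

# A hypothetical document-topic mixture, as in the example above.
mixture = {"physics": 0.6, "romance": 0.4}

# Under LDA's generative model, each word in the document independently
# picks its topic by sampling from the document's mixture; the word itself
# is then drawn from that topic's word distribution (not shown here).
word_topics = random.choices(list(mixture), weights=list(mixture.values()), k=10)
```

So a "0.6 physics" document isn't 60% likely to be about physics; rather, roughly 60% of its words are expected to be drawn from the physics topic.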
The Data and the Features
Processing and interpreting the contents of the images would be a formidable task, if not an impossible one. So, for now, we’re going to stick to the text of the comics. This is good not only because text is easier to parse, but also because it probably carries the bulk of the information. Accordingly, I scraped the transcriptions for the xkcd comics – an easy enough task from the command line. (Yes, they are crowd-transcribed! You can find a load of webcomics transcribed at OhnoRobot, but Randall Munroe has conveniently put them in the source of each xkcd comic page itself.)
Cleaning up the text required a number of judgement calls, and I usually went with whatever was simplest. These are explained in comments in the code – feel free to alter it and take a different approach.
Finally, the transcripts are converted into a bag of words – exactly the kind of input LDA works with. The code is shared via GitHub.
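The bag-of-words step itself is simple: throw away word order and keep only per-document word counts. A minimal sketch (the actual code in the repo makes more cleaning decisions than this, e.g. stopwords and meta-text):

```python
import re
from collections import Counter

def bag_of_words(transcript):
    """Lowercase the transcript, keep only word-like tokens,
    and count occurrences -- the representation LDA consumes."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return Counter(tokens)

bow = bag_of_words("Stand back -- I'm going to try science!")
# Counter({'stand': 1, 'back': 1, "i'm": 1, 'going': 1,
#          'to': 1, 'try': 1, 'science': 1})
```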
What to Expect
I’m not going to cover the details of how LDA works (there is an easy-to-understand layman’s explanation here, and a rigorous, technical one here), but I’ll tell you what output to expect: LDA is a generative modeling technique, and it will give us k topics, where each ‘topic’ is basically a probability distribution over the vocabulary (all words ever seen in the input data). The values indicate the probability of each word being selected if you were trying to generate a random document from the given topic.
Each topic can then be interpreted from the words that are assigned the highest probabilities.
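In code, that interpretation step amounts to reading off the highest-probability words. A toy sketch, with a hypothetical topic (the words and probabilities are made up; a real vocabulary would have thousands of entries):

```python
# A hypothetical 'topic': a probability distribution over the vocabulary.
topic = {"velocity": 0.30, "heart": 0.02, "energy": 0.25,
         "quantum": 0.20, "love": 0.01, "field": 0.22}

# Label the topic by its most probable words.
top_words = sorted(topic, key=topic.get, reverse=True)[:3]
# ['velocity', 'energy', 'field'] -- plausibly the 'physics' topic
```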