Despite believing we can treat anything we can represent in digital form as “data”, I’m still pretty flakey on understanding what sorts of analysis we can easily do with different sorts of data. Time series analysis is one area – the pandas Python library has all manner of handy tools for working with that sort of data that I have no idea how to drive – and text analysis is another.
So prompted by Sheila MacNeill’s post about textexture, which I guessed might be something to do with topic modeling (I should have read the about, h/t @mhawksey), here’s a quick round up of handy things the text analysts seem to be able to do pretty easily…
Taking the lazy approach, I has a quick look at the CRAN natural language processing task view to get an idea of what sort of tool support for text analysis there is in R, and a peek through the NLTK documentation to see what sort of thing we might be readily able to do in Python. Note that this take is a personal one, identifying the sorts of things that I can see I might personally have a recurring use for…
First up – extracting text from different document formats. I’ve already posted about Apache Tika, which can pull text from a wide range of documents (PDFs, extract text from Word docs, extract text from images), which seems to be a handy, general purpose tool. (Other tools are available, but I only have so much time, and for now Tika seems to do what I need…)
Second up, concordance views. The NLTK docs describe concordance views as follows: “A concordance view shows us every occurrence of a given word, together with some context.” So for example:
This can be handy for skimming through multiple references to a particular item, rather than having to do a lot of clicking, scrolling or page turning.
How about if we want to compare the near co-occurrence of words or phrases in a document? One way to do this is graphically, plotting the “distance” through the text on the x-axis, and then for categorical terms on y marking out where those terms appear in the text. In NLTK, this is referred to as a lexical dispersion plot:
I guess we could then scan across the distance axis using a windowing function to find terms that appear within a particular distance of each other? Or use co-occurrence matrices for example (eg Co-occurrence matrices of time series applied to literary works), perhaps with overlapping “time” bins? (This could work really well as a graph model – eg for 20 pages, set up page nodes 1-2, 2-3, 3-4,.., 18-19, 19-20, then an actor node for each actor, connecting actors to page nodes for page bins on which they occur; then project the bipartite graph onto just the actor nodes, connecting actors who were originally to the same page bin nodes.)
Something that could be differently useful is spotting common sentences that appear in different documents (for example, quotations). There are surely tools out there that do this, though offhand I can’t find any..? My gut reaction would be to generate a sentence list for each document (eg using something like the handy looking textblob python library), strip quotation marks and whitespace, etc, sort each list, then run a diff on them and pull out the matched lines. (So a “reverse differ”, I think it’s called?) I’m not sure if you could easily also pull out the near misses? (If you can help me out on how to easily find matching or near matching sentences across documents via a comment or link, it’d be appreciated…:-)
The more general approach is to just measure document similarity – TF-IDF (Term Frequency – Inverse Document Frequency) and cosine similarity are key phrases here. I guess this approach could also be applied to sentences to find common ones across documents, (eg SO: Similarity between two text documents), though I guess it would require comparing quite a large number of sentences (for ~N sentences in each doc, it’d require N^2 comparisons)? I suppose you could optimise by ignoring comparisons between sentences of radically different lengths? Again, presumably there are tools that do this already?
Unlike simply counting common words that aren’t stop words in a document to find the most popular words in a doc, TF-IDF moderates the simple count (the term frequency) with the inverse document frequency. If a word is popular in every document, the term frequency is large and the document frequency is large, so the inverse document frequency (one divided by the document frequency) is small – which in turn gives a reduced TF-IDF value. If a term is popular in one document but not any other, the document frequency is small and so the relative document frequency is large, giving a large TF-IDF for the term in the rare document in which it appears. TF-IDF helps you spot words that are rare across documents or uncommonly frequent within documents.
Topic models: I thought I’d played with these quite a bit before, but if I did the doodles didn’t make it as far as the blog… The idea behind topic modeling is generate a set of key terms – topics – that provide an indication of the topic of a particular document. (It’s a bit more sophisticated than using a count of common words that aren’t stopwords to characterise a document, which is the approach that tends to be used when generating wordclouds…) There are some pointers in the comments to A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets about topic modeling in R using the R topicmodels package; this ROpenSci post on Topic Modeling in R has code for a nice interactive topic explorer; this notebook on Topic Modeling 101 looks like a handy intro to topic modeling using the gensim Python package.
Automatic summarisation/text summary generation: again, I thought I dabbled with this but there’s no sign of it on this blog:-( There are several tools and recipes out there that will generate text summaries of long documents, but I guess they could be hit and miss and I’d need to play with a few of them to see how easy they are to use and how well they seem to work/how useful they appear to be. The python sumy package looks quite interesting in this respect (example usage) and is probably where I’d start. A simple description of a basic text summariser can be found here: Text summarization with NLTK.
So – what have I missed?
PS In passing, see this JISC review from 2012 on the Value and Benefits of Text Mining.