Generating Labels for Supervised Text Classification using CAT and R

February 4, 2013

(This article was first published on Solomon Messing » R, and kindly contributed to R-bloggers)

The explosion in the availability of text has opened new opportunities to exploit text as data for research. As Justin Grimmer and Brandon Stewart discuss in the above paper, there are a number of approaches to reducing human text to data, with various levels of computational sophistication and human input required. In this post, I’ll explain how to use the Coding Analysis Toolkit (CAT) to help you collect human evaluations of documents, which is a necessary part of many text analyses, and especially so when you have a specific research question that entails precisely characterizing whether a particular document contains a particular type of content. CAT facilitates fast data entry and handles data management when you have multiple human coders. Its default output can be tricky to deal with, however, so I’ll also provide R code to extract usable data from CAT’s XML output, which should serve as a good intro to data munging with XML for the uninitiated. I’ll also show you how to compute metrics that will help diagnose the reliability of your coding system, which entails using the melt and cast functionality in Hadley’s ‘reshape’ package to get the data in the right shape, then feeding the results to the ‘irr’ package.

In future posts, I’ll explain how to use these labels to train various machine learning algorithms (aka classification models) to automatically classify documents in a large corpus. I’ll talk about how to extract features using R and the ‘tm’ package; the problem of classification in high-dimensional spaces (e.g., there are many, many words in natural language) and how we can exploit the bias-variance tradeoff to get traction on this problem; the various models that are generally well suited for text classification, like the lasso, elastic net, SVMs and random forests; the importance of properly tuning these models; and how to use cross-validation to avoid overfitting these models to your data. (For a preview, check out my slides and R labs for my short course on Analyzing Text as Data that I presented at the Stanford Computational Social Science Workshop.)

Is human categorization/content analysis the right solution?

Social scientists have been trying to reduce the complexities and subjectivities of human language to objective data for a long time, calling it “content analysis.” It is no easy task. If you decide to use human coders, I suggest you read Kim Neuendorf’s book; you can also find some nice resources on her website that may prove helpful. You’d also do well to read Krippendorff’s eponymous classic.

If you are trying to characterize the type of things that occur or discover categories in your documents, it might make more sense to go with a fully computational approach, employing unsupervised machine learning methods that cluster documents based on word features (e.g., simple word counts, unigrams, or combinations thereof, N-grams). You can take a look at the Grimmer and Stewart paper above for more details on this approach, or check out the notes from lectures 5 – 7 from Justin’s course on the topic. If you are interested in additional computational approaches/features, have a look at Dan Jurafsky’s course From Language to Information.

But if you have a specific research question in mind, which entails precisely characterizing whether a particular document contains a particular type of content, you probably need people to read and classify each document according to a coding scheme. Of course, this can become expensive or impossible if you are dealing with a large corpus of documents. If so, you can first have people classify a sample of documents, which can then be used as labels to train supervised classifiers. Once you have a classifier that performs well, you can use it to classify the rest of the documents in your large data set. I have put together some slides on this process, along with an R lab with example code. Also of interest may be some slides I put together on acquiring and pre-processing text data from the web with R, along with computational labs on interacting with APIs and scraping/regex.

If instead you care about making generalizations about the proportion of documents in any given category (in your population of documents, based on your sample), check out the R package ReadMe, which implements A Method of Automated Nonparametric Content Analysis for Social Science. An industrial implementation of the method is offered by the social media analytics company Crimson Hexagon. If you decide on this approach, you’ll still need human-labeled categories to start, so keep reading.

Before we get to CAT, I want to talk about human classification with Amazon’s Mechanical Turk, which is great when the classification task is simple and the documents/units of text to classify are short. Mturk is especially useful when the categorization task becomes mind-numbingly boring with repetition, because when one Turker burns out on your task and stops working, fresh Turkers can continue the work. Hence, the main advantage to Mturk is that you can get simple classification tasks done much more quickly than by relying upon research assistants/undergraduates/employees/etc—in fact I’ve used Mturk to classify thousands of short open responses to the question “What’s the most important issue facing the nation” into Gallup’s categories in a matter of hours.

Often overlooked are the nice interfaces that Mturk provides both to its requesters (e.g., you), which make data management far easier than keeping track of Excel spreadsheets/Google Docs edited by multiple coders, and to its workers, which translate to fewer mouse clicks, less fatigue, and probably lower error rates. Panos Ipeirotis has some nice slides (best stuff in 30-50) + open source code to help ensure this kind of crowd-sourced data is of the highest quality.

But often the classification task is complicated and people need training to do it correctly and efficiently. I’d direct you to the various links above for guidance on building codebooks for complicated classification schemes and training coders. A great human-coder system is well-conceptualized, highly relevant to the corpus in question, and contains crystal clear instructions for your coders; providing flow charts and diagrams seems to be especially helpful. When Justin, Sean and I were implementing a coding protocol recently, we used this flow chart (from this LaTeX file) to complement our actual codebook.

Why is CAT better?

My experiences with Mechanical Turk spoiled me—all of the complexities of dealing with multiple coders entering data and managing that data were abstracted away by the system that Mturk has in place. What I’d done in the past—having analysts enter data in a Google Doc or worse, MS-Excel—was keystroke and mouse-click intensive, which meant it was time-consuming and error-prone for coders when they were entering data, and for me when I was merging/cleaning data from multiple coders.

CAT is the best solution I’ve come across yet. Its interface isn’t aesthetically perfect, but it gets the job done well. It minimizes keystrokes and requires no mouse clicks for data entry, so you’ll see your coders work faster and probably happier. It maintains your data, alleviating the need to manage spreadsheets and concerns about your coders making errors when transcribing codes to a spreadsheet. Because it also handles the back-end of things, there’s no need to maintain your own servers/SQL database/etc. But it’s open-source, so if you need to use your own servers, you can download the source and set it up yourself.

Getting Started

Head over to the CAT website and register for an account. Maybe poke around the website a bit and familiarize yourself with the menus. When you’re ready to upload a data set, go to Datasets –> Upload Raw Data Set. CAT wants a .zip file with all of your documents in individual .txt files.

If you have your text in a vector in R, say called text_vector, you can output each element to individual .txt files as follows:

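# writeLines() writes the raw text; capture.output() would also write
# R's print decoration (the leading [1] and the surrounding quotes).
for (i in seq_along(text_vector)) {
  writeLines(text_vector[i], paste("doc_number_", i, ".txt", sep=""))
}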


Next put these files in a zip archive and upload to CAT. You can upload another file that specifies the coding scheme, but it’s probably easier just to make the coding scheme using the interface on the website later.
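
If you’d rather build the archive without leaving R, the following is a minimal sketch, not part of CAT’s own workflow. The toy `text_vector`, the scratch directory, and the archive name are all assumptions for illustration. It relies on `utils::zip()`, which calls an external zip program, so the zipping step is skipped if none is found on your PATH.

```r
# Toy stand-in for your real documents (hypothetical data).
text_vector <- c("First press release.", "Second press release.")

# Write each document to its own .txt file in a scratch directory.
out_dir <- file.path(tempdir(), "cat_upload")
dir.create(out_dir, showWarnings = FALSE)
txt_files <- file.path(out_dir,
                       paste0("doc_number_", seq_along(text_vector), ".txt"))
for (i in seq_along(text_vector)) {
  writeLines(text_vector[i], txt_files[i])
}

# Bundle the files into one archive. utils::zip() shells out to an
# external zip program, so only try this if one is available.
zip_path <- file.path(tempdir(), "cat_upload.zip")
if (nzchar(Sys.which("zip"))) {
  zip(zip_path, files = txt_files, flags = "-j")  # -j: store file names only
}
```

The resulting `cat_upload.zip` (if it was created) is what you would upload under Datasets –> Upload Raw Data Set.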

When you’re done, go to Datasets –> View Raw Datasets and click on the dataset you just uploaded. From this page, you can manage coders associated with the data sets. If you click on Manage Sub-Accounts, you can easily create new accounts for your coders. Add yourself and other coders to the dataset and be sure to click the “Set Chosen Coders” button when you’re done.

Next, implement your carefully constructed coding scheme. From the same “View Raw Dataset” page, click on “Add or modify codes in the dataset” (on the right under “Toolbox”). Add a sensical name for each code and enter a shortcut key—this will make it so your coders can just hit a button to code a document and move on to the next. When you’re done, hit finished.

I highly recommend you test out your coding scheme yourself. You’ll also probably want to consult your coders and iteratively tweak your codebook according to qualitative input from your coders (this is discussed in full in the content analysis links above).

Then set your coders loose on a hundred documents or so.

Exporting Data from CAT

This process is a bit more complicated than it sounds. If you download data in CSV format, it comes out jagged (i.e., not rectangular), and hence it’s not immediately useful in R. The .CSV file becomes especially convoluted if you change your code labels, add coders, etc.

Better to just deal with the XML output. I’ll introduce data-munging/ETL with XML below. XML is like HTML, but tags describe data, not formatting. We’ll use those tags to create queries that return the data we want. XML is semi-structured data, with a general tree structure, rather than the rectangular structure that R likes. This tree structure saves space and is highly flexible, though it can be hard to work with initially. If you’ve never seen XML before, you’d do well to check out Jennifer Widom’s excellent lecture on XML in her Coursera course. XML is often used in various useful data APIs, which you can learn more about by checking out Sean Westwood’s short course on the topic.

To get the XML output for your project, from the “View Raw Dataset” page, select “Download Coded Text File (XML Format)” from the drop-down menu and then click on “Download Data” (on the right under “Toolbox”).

Here’s how to read in and clean the resulting file. You need to do this step because (1) R wants the encoding to be utf-8 but the CAT file says it’s utf-16, and (2) the XML package doesn’t like “&#x0;” strings (HTML notation for the NUL character), which frequently occur in the CAT output.

doc <- readLines("http://dl.dropbox.com/u/25710348/blog/sample.xml")

# check out what's in the file:
head(doc)

# Fix utf-8 issue:
doc <- gsub("utf-16", "utf-8", doc)

# Find, then strip, the NUL character references that the XML package rejects:
grep("&#x0;", doc)

doc <- gsub("&#x0;", "", doc)



First a bit of background on this particular data. These represent a small subsample of documents that Justin Grimmer, Sean Westwood and I were putting together for a project looking at how congressional representatives claim credit for expenditures in their district. We were most concerned with identifying press releases that claimed credit for an expenditure in the district (CC_expenditure). But we also wanted to code items that explicitly criticized earmarks or advocated for earmark reform (Egregious); items that were speaking to local constituents—advertising constituent service that the candidate performed or community events the candidate attended to build name recognition (Adv_par_const); and items that were explicitly taking a national policy position or speaking to non-local audiences (Position_taking_other).

Have a look at the file above in an editor so you get a sense of its structure. The file first provides a lot of meta-data about the project, about each of the codes used, the coders, then the full text of each document. After the full text comes the actual data we want—how each item (paragraphId) was coded (codeId) and by which coder (coderId). It looks like this:

<codedDataItem>
  <paragraphId>4334458</paragraphId>
  <codeId>142061</codeId>
  <coderId>4506</coderId>
</codedDataItem>


It’s the values inside each of the tags that we want. Here’s how we can get them: (1) parse the XML using the XML package so R recognizes the tags and values properly, and (2) extract those values with XPath and get them into a data frame for analysis. Step (2) involves telling R to traverse the XML data and return the value in each of the paragraphId, codeId and coderId tags.

# Parse the XML
# uncomment the line below to install the XML package
# install.packages('XML')

library('XML')
doc <- xmlInternalTreeParse(doc, asText=TRUE)

# That was easy, now for #2:
para = unlist(xpathApply(doc, "//paragraphId", xmlValue))
code = unlist(xpathApply(doc, "//codeId", xmlValue))
coderid = unlist(xpathApply(doc, "//coderId", xmlValue))

# Now put into a data frame:
alldat <- data.frame(para, coder=coderid, code)
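
If you want to experiment with these XPath queries without downloading anything, here is a small self-contained sketch. The inline XML is made up for illustration (the `<root>` wrapper and all the id values are assumptions, not CAT’s real layout); only the three tag names come from the snippet above. It uses `xpathSApply()`, the simplifying variant of `xpathApply()` in the XML package.

```r
library(XML)

# A tiny, made-up document that mimics the <codedDataItem> records.
toy_xml <- '<root>
  <codedDataItem><paragraphId>1</paragraphId><codeId>10</codeId><coderId>100</coderId></codedDataItem>
  <codedDataItem><paragraphId>1</paragraphId><codeId>11</codeId><coderId>101</coderId></codedDataItem>
  <codedDataItem><paragraphId>2</paragraphId><codeId>10</codeId><coderId>100</coderId></codedDataItem>
</root>'

toy_doc <- xmlInternalTreeParse(toy_xml, asText = TRUE)

# "//tag" matches every element with that name, anywhere in the tree;
# xmlValue pulls out the text inside each matched element.
toy <- data.frame(
  para  = xpathSApply(toy_doc, "//paragraphId", xmlValue),
  code  = xpathSApply(toy_doc, "//codeId", xmlValue),
  coder = xpathSApply(toy_doc, "//coderId", xmlValue),
  stringsAsFactors = FALSE
)
toy
```

Each query returns its matches in document order, which is why the three vectors line up row by row.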


That’s great, but if you want human-readable data, you need to do a few more things. Let’s pull each of the codeIds and codenames, then use that to map each of the codeIds in our data back to human-readable codes. We’ll do the same thing for our coders and give a number to each of the coding units (paragraphId).

# now map back to human readable values:
# CODES
codeids <- unlist(xpathApply(doc, "//code", xmlGetAttr, "codeId" ))
codenames <- unlist(xpathApply(doc, "//code", xmlValue))
alldat$codes <- codenames[match(alldat$code, codeids)]

# CODERS
coderids <- unlist(xpathApply(doc, "//coder", xmlGetAttr, "coderId" ))
codernames <- unlist(xpathApply(doc, "//coder", xmlValue))
alldat$coder <- codernames[match(alldat$coder, coderids)]

# paragraph num:
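# NOTE: paragraphIds and paragraphCodes are never defined in the listing as
# published. The next two lines are an assumption: that each <paragraph>
# element carries paragraphId and paragraphTag attributes, and that the tag
# has the form "name_number". Check them against your own XML file.
paragraphIds <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphId"))
paragraphCodes <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphTag"))
pgnum <- as.numeric(unlist(lapply(strsplit(paragraphCodes, "_"), function(x) x[[2]] )))
alldat$pgnum <- pgnum[match(para, paragraphIds)]

# paragraph tag:
paragraphTag <- unlist(xpathApply(doc, "//paragraph", xmlGetAttr, "paragraphTag"))
alldat$paragraphTag <- paragraphTag[match(para, paragraphIds)]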


Excellent, now we have our data in a very nice rectangular format.
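
The id-to-name mapping above hinges on the `names[match(values, ids)]` idiom. Here is a toy sketch of it in base R; the ids and their pairing with code names are made up for illustration.

```r
# Lookup table: ids and the human-readable names they stand for
# (made-up pairing, for illustration only).
code_ids   <- c("142061", "142062", "142063")
code_names <- c("CC_expenditure", "Egregious", "Adv_par_const")

# Ids as they might appear in the coded data, including one unknown id.
observed <- c("142062", "142061", "142061", "999999")

# match() returns, for each observed id, its position in code_ids
# (or NA if it is absent); indexing code_names by that position
# translates ids into names.
labels <- code_names[match(observed, code_ids)]
labels
```

An id that is missing from the lookup table comes back as `NA`, so counting `NA`s in the result is a quick check that every code and coder was mapped.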

Basic diagnostics

Two of the most helpful diagnostics when assessing inter-coder reliability are confusion matrices and Krippendorff’s Alpha. Confusion matrices are a bit easier to produce when the data is in this format so that’s where I’ll start.

A confusion matrix is just a contingency table, or incidence matrix, that helps us figure out if any two coders are scoring things in the same way. It consists of the incidence matrix of codes for a pair of coders where the entries are the sum of the incidences—if this sounds confusing, don’t worry this will become clear in the example below. One compact way to get this is to use the paragraph-code incidence matrix for each coder, then multiply each pair of matrices. Here’s how to do it:

# get paragraph-code incidence matrix for each coder:
alltabs <- table(alldat$para, alldat$codes, alldat$coder )
dimnames(alltabs)[[3]]
coder1 <- alltabs[,,1]
coder2 <- alltabs[,,2]
coder3 <- alltabs[,,3]

# Multiply together to get confusion matrix for each pair
# of coders:
coder12 <- t(coder1) %*% coder2
coder23 <- t(coder2) %*% coder3
coder13 <- t(coder1) %*% coder3

# Clean up column names so we can read things clearly:
dimnames(coder12)[[2]] <- substr( dimnames(coder12)[[2]], 1, 6)
dimnames(coder23)[[2]] <- substr( dimnames(coder23)[[2]], 1, 6)
dimnames(coder13)[[2]] <- substr( dimnames(coder13)[[2]], 1, 6)

# Take a look:
coder12
coder23
coder13

# Pay attention to the sum on the diagonal:
sum(diag(coder12))
sum(diag(coder23))
sum(diag(coder13))

Here’s what the first confusion matrix looks like:

> coder12
                        Adv_pa CC_exp Egregi Positi
  Adv_par_const             25      0      0      6
  CC_expenditure             4     12      0      6
  Egregious                  0      0      0      0
  Position_taking_other     11      1      0     35

It shows the incidence between coder 1 and coder 2’s codes, with coder 1’s codes on the rows and coder 2’s codes on the columns. So coder 1 and coder 2 coded 25 of the same items as “Adv_par_const” but coder 1 coded “Position_taking_other” when coder 2 coded “Adv_par_const” for 11 items. This can help diagnose which categories are creating the most confusion. We can see that our coders are confusing “Adv_par_const” and “Position_taking_other” more often than “CC_expenditure” and “Adv_par_const.” For us, that meant we focused on distinguishing these two categories in our training sessions.

It’s also useful to look at Krippendorff’s alpha to get a sense for the global agreement between all coders. We can compute Krippendorff’s alpha using the “irr” package. But first a little data-munging is in order. The irr package expects data in a matrix with a single row for each document and columns for each coder. But of course, currently our data is in “long” format, with one line for each document-coder pair.

Luckily, we can use the “reshape” package to “melt” our data then “cast” it into the format we want. In this case, “melt” does not actually change the shape of our data, which is already long. It simply adds a “variable” column and a “value” column, which is necessary to use “cast.” Next, transform the variable to numeric so that irr will be happy. Lastly, cast the data into the format we want, with a column for each coder.

library(reshape)
alltabsm <- melt(alldat, measure.vars=c("codes"))

# Make the "value" column numeric for irr. Going through as.factor()
# gives each code label an integer (labels sorted alphabetically),
# whether "value" arrives as a factor or as character strings:
alltabsm$value <- as.numeric(as.factor(alltabsm$value))

alltabsrs <- cast(alltabsm[,which(names(alltabsm)!="code")], ... ~ coder)

And lastly, run the kripp.alpha() function on the columns that contain the coders and codes.

# KRIPP ALPHA
library(irr)
kripp.alpha(t(as.matrix(alltabsrs[,5:7])) )

Now, all that’s left is to get our output. What we want is to take the mode value for each article (which in this case is the same as the median: with three coders, the median equals the mode whenever at least two of them agree). Let’s take a look at the histogram of the results.

alltabsrs$modecode <- apply(alltabsrs[,5:7], 1, median)
hist(alltabsrs$modecode)


You can see from the histogram that code 3 (Egregious) was rare, as we were expecting. The other codes look good.
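
One caveat on using the median as a stand-in for the mode: with three coders the two can differ when all three disagree, and with more coders they can differ more often. Here is a hedged sketch of an explicit mode function. The name `row_mode`, the tie-breaking rule (smallest code wins), and the toy codes are all assumptions for illustration.

```r
# Return the most frequent value in x. Ties go to the smallest value,
# an arbitrary rule chosen for this sketch.
row_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical numeric codes from three coders (columns)
# for four documents (rows).
toy_codes <- rbind(c(1, 1, 4),
                   c(2, 2, 2),
                   c(4, 1, 4),
                   c(1, 2, 4))

toy_mode   <- apply(toy_codes, 1, row_mode)
toy_median <- apply(toy_codes, 1, median)

# Rows where the two summaries disagree deserve a second look:
which(toy_mode != toy_median)
```

In this toy data only the last row, where all three coders disagree, gets a different answer from the two summaries. Such rows are probably better resolved by discussion than by either rule.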

And now we’ve got what we need! A data frame with each item, each code for each coder, and the most commonly occurring code that we can use as the actual label in our analysis.
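
As a closing aside, if the matrix-multiplication route to the confusion matrix still feels opaque, the toy sketch below may help. The data are made up, and it assumes each coder codes every paragraph exactly once. Under that assumption the product of two coders’ paragraph-by-code incidence matrices equals a plain cross-tabulation of their codes.

```r
# Hypothetical long-format data: two coders, four paragraphs, two codes.
toy <- data.frame(
  para  = c(1, 2, 3, 4, 1, 2, 3, 4),
  coder = rep(c("A", "B"), each = 4),
  codes = c("x", "x", "y", "y",    # coder A
            "x", "y", "y", "y"),   # coder B
  stringsAsFactors = FALSE
)

# Paragraph-by-code incidence matrix for each coder (0/1 entries).
tabs <- table(toy$para, toy$codes, toy$coder)

# Entry [i, j] counts paragraphs that A coded i and B coded j.
confusion_AB <- t(tabs[, , "A"]) %*% tabs[, , "B"]

# The same counts from a direct cross-tabulation of the two coders.
direct <- table(A = toy$codes[toy$coder == "A"],
                B = toy$codes[toy$coder == "B"])

confusion_AB
all(confusion_AB == direct)
```

The diagonal holds the agreements (three of the four paragraphs here), and the single off-diagonal count is the paragraph that coder A called “x” and coder B called “y”.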