Text mining in R – Automatic categorization of Wikipedia articles

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers.]

Text mining is currently a hot topic in data analysis. The enormous text resources available on the Internet have made it an important component of the Big Data world. The potential of the information hidden in words is the reason why I find it worth knowing what is going on.

I wanted to learn about R's text analysis capabilities, and this post is the result of my small research. More precisely, it is an example of (hierarchical) categorization of Wikipedia articles. I share the source code here and explain it, so that everyone can try it with various articles.

I use the tm package, which provides a set of tools for text mining. The stringi package is also useful here for string processing.

First of all, we have to load the data. In the variable titles I list the titles of some Wikipedia articles: 5 mathematical terms (3 of them about integrals), 3 painters and 3 writers. After loading the articles (as texts, i.e. HTML page sources), we make a container for them, called a “Corpus”. It is a structure for storing text documents: essentially a list that holds the documents together with their metadata.

library(tm)       # corpus container and text mining tools
library(stringi)  # string processing

wiki <- "http://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))

# download each page source and collapse its lines into a single string
for (i in seq_along(titles)) {
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), collapse = " ")
}

docs <- Corpus(VectorSource(articles))
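A download loop like this stops at the first page that cannot be read. The sketch below is one possible way to make the loading step more forgiving, using only base R. The helper name read_page is hypothetical (it is not part of tm or stringi), and a temporary file stands in for a web page so that the example runs offline.

```r
# Hypothetical helper: read one page (URL or local file path) and collapse
# its lines into a single string; returns NA if the read fails.
read_page <- function(path) {
    tryCatch(
        suppressWarnings(paste(readLines(path, warn = FALSE), collapse = " ")),
        error = function(e) NA_character_
    )
}

# Offline demonstration: a temporary file stands in for a downloaded page.
tmp <- tempfile(fileext = ".html")
writeLines(c("<p>first line</p>", "<p>second line</p>"), tmp)

page <- read_page(tmp)
missing_page <- read_page(file.path(tempdir(), "no_such_file.html"))

print(page)
print(is.na(missing_page))
```

With real articles, the same helper could be applied to each address, for example with vapply(paste0(wiki, titles), read_page, character(1)), and the NA entries dropped before building the corpus.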

Now that the data is loaded, we can start processing the text documents. This is the first step of text analysis, and it is important because the way the data is prepared strongly affects the results. We apply the function tm_map to the corpus, which is the equivalent of lapply for a list. What we do here is:

  1. Replace all “<…>” elements (HTML tags) with a space. We do this because they are not part of the document's text but HTML markup.
  2. Replace all “\t” (tab characters) with a space.
  3. Convert the previous result (the returned type was “string”) to “PlainTextDocument”, so that we can apply the other functions from the tm package, which require this type of argument.
  4. Remove extra whitespace from the documents.
  5. Remove from the documents the words which we find redundant for text mining (e.g. pronouns, conjunctions). We take these words from stopwords(“english”), a built-in list for the English language (this argument is passed to the function removeWords).
  6. Remove punctuation marks.
  7. Transform characters to lower case.

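# Written against tm 0.5.x. From tm 0.6 on, custom functions and tolower need to be
# wrapped in content_transformer() before being passed to tm_map.
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))  # strip HTML tags
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))    # replace tabs

docs4 <- tm_map(docs3, PlainTextDocument)                  # back to a tm document type
docs5 <- tm_map(docs4, stripWhitespace)                    # collapse extra whitespace
docs6 <- tm_map(docs5, removeWords, stopwords("english"))  # drop stopwords
docs7 <- tm_map(docs6, removePunctuation)                  # drop punctuation
docs8 <- tm_map(docs7, tolower)                            # lower case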


We can look at the results of the “cleaned” text. Instead of this:

“The volume of irregular objects can be measured with precision by the fluid <a href="/wiki/Displacement_(fluid)" title="Displacement (fluid)">displaced</a> as the object is submerged; see <a href="/wiki/Archimedes" title="Archimedes">Archimedes</a>’s Eureka.”

now we have this:

“the volume irregular objects can measured precision fluid displaced object submerged see archimedes s eureka”
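To see what each step contributes, the sentence above can be pushed through the same sequence of operations in base R, without tm. This is only a sketch: the function name clean_text is hypothetical, gsub stands in for the stringi and tm functions, and the short stopword list is an assumption that covers just this sentence (the post itself uses stopwords("english")).

```r
# Stand-in stopword list (assumption): only the words needed for this sentence.
stop_words <- c("of", "be", "with", "by", "the", "as", "is")

# Hypothetical helper mirroring the order of the cleaning steps in the post.
clean_text <- function(x, stop_words) {
    x <- gsub("<.+?>", " ", x, perl = TRUE)    # HTML tags -> space
    x <- gsub("\t", " ", x, fixed = TRUE)      # tabs -> space
    x <- gsub("[[:space:]]+", " ", x)          # collapse extra whitespace
    pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)     # drop stopwords (case-sensitive)
    x <- gsub("[[:punct:]]+", "", x)           # drop punctuation
    tolower(x)                                 # lower case
}

raw <- paste0(
    "The volume of irregular objects can be measured with precision by the fluid ",
    "<a href=\"/wiki/Displacement_(fluid)\" title=\"Displacement (fluid)\">displaced</a> ",
    "as the object is submerged; see ",
    "<a href=\"/wiki/Archimedes\" title=\"Archimedes\">Archimedes</a>'s Eureka."
)

cleaned <- clean_text(raw, stop_words)
# Collapse the gaps left by the removed words, for display only.
cleaned <- trimws(gsub(" +", " ", cleaned))
print(cleaned)
```

The order of the steps matters. Stopwords are removed before the text is lower-cased, so the capitalized “The” at the start of the sentence does not match “the” in the list and survives as “the” in the result. Lower-casing first would remove it as well.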

Now we are ready to proceed to the heart of the analysis. The starting point is creating a “term-document matrix”, which describes the frequency of each term in each document of the corpus. This is a fundamental object in text analysis. Based on it we create a matrix of dissimilarities, which measures how dissimilar each pair of documents is (the function dissimilarity returns an object of class dist, which is convenient because clustering functions require this type of argument). Finally, we apply the function hclust (any other clustering function could be used instead) and look at the result on the plot.
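Before turning to the real corpus, a toy version may make these two objects more concrete. The sketch below builds a term-document matrix by hand in base R and derives a cosine dissimilarity from it. The three mini-documents are invented, the helper cosine_dissim is hypothetical, and the dissimilarity is taken to be one minus the cosine similarity, which is an assumption about the convention rather than a statement about tm's internals.

```r
# Three tiny, already-cleaned "documents" (hypothetical data).
toy_docs <- c(
    integral   = "integral function area integral",
    derivative = "derivative function slope",
    painter    = "painter canvas colour painter"
)

# Term-document matrix: one row per term, one column per document,
# each cell holding the number of times the term occurs in the document.
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(
    tokens,
    function(tok) tabulate(match(tok, terms), nbins = length(terms)),
    integer(length(terms))
)
rownames(tdm) <- terms

# Hypothetical helper: cosine dissimilarity between the columns of a matrix,
# computed as 1 - cosine similarity and returned as a "dist" object.
cosine_dissim <- function(m) {
    norms <- sqrt(colSums(m^2))
    sim <- crossprod(m) / outer(norms, norms)
    as.dist(1 - sim)
}

d <- cosine_dissim(tdm)
print(tdm)
print(round(as.matrix(d), 3))
```

The two mathematical documents share the term “function”, so their dissimilarity is below 1, while the painter document shares no term with either of them and sits at the maximum dissimilarity of 1.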

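docsTDM <- TermDocumentMatrix(docs8)

# Cosine dissimilarity between documents, returned as a "dist" object.
# Note: dissimilarity() was removed in tm 0.6; with newer versions a possible
# substitute is proxy::dist(t(as.matrix(docsTDM)), method = "cosine").
docsdissim <- dissimilarity(docsTDM, method = "cosine")

# Matrix form with the article titles as labels, handy for inspection.
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles

# Since R 3.1.0 this method is called "ward.D"; "ward" is still accepted with a message.
h <- hclust(docsdissim, method = "ward")
plot(h, labels = titles, sub = "")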
[Figure: cluster dendrogram of the 11 articles]

As we can see, the result is perfect here. Of course, this is because the chosen articles are easy to categorize. On the left side, the writers form one small cluster and the painters a second one; these two clusters then merge into a bigger cluster of people. On the right side, the integrals form one cluster, which the two remaining terms then join to make a bigger cluster of mathematical terms.
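Groups like these can also be read off programmatically instead of from the plot. The following base R sketch cuts a hierarchical clustering into a fixed number of groups with cutree. The dissimilarity values and labels are made up for illustration and are not taken from the Wikipedia articles.

```r
# Hypothetical dissimilarities between six items forming two obvious groups.
labels <- c("math1", "math2", "math3", "person1", "person2", "person3")
m <- matrix(0.9, nrow = 6, ncol = 6, dimnames = list(labels, labels))
m[1:3, 1:3] <- 0.2   # mathematical items are close to each other
m[4:6, 4:6] <- 0.3   # so are the people
diag(m) <- 0
d <- as.dist(m)

# "ward.D" is the current name of the method formerly called "ward".
h <- hclust(d, method = "ward.D")

# Cut the tree into two groups and list the members of each.
groups <- cutree(h, k = 2)
print(split(names(groups), groups))
```

Applied to the tree h built from the articles, a call such as cutree(h, k = 2) should separate the mathematical terms from the people, and a larger k would split the people further.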

This example shows only a small part of R's text mining capabilities. I think you can just as easily carry out other kinds of text analysis, such as concept extraction, sentiment analysis and information extraction in general.

Some sources of further information about text mining in R: cran.r-project, r-bloggers, onepager.togaware.com, jstatsoft.org.

Norbert Ryciak
