Distributional Semantics in R: Part 2 Entity Recognition w. {openNLP}

January 2, 2017

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

The R code for this tutorial on Methods of Distributional Semantics in R is found in the respective GitHub repository. There you will find .R, .Rmd, and .html files corresponding to each part of this tutorial (e.g. DistSemanticsBelgradeR-Part2.R, DistSemanticsBelgradeR-Part2.Rmd, and DistSemanticsBelgradeR-Part2.html for Part 2). All auxiliary files are also uploaded to the repository.

Following my Methods of Distributional Semantics in R BelgradeR Meetup with Data Science Serbia, held at Startit Center, Belgrade, on 11/30/2016, several people asked me for the R code used for the analysis of William Shakespeare’s plays that was presented. I have decided to continue developing the code that I used during the Meetup, in order to turn the examples that I showed then into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work.

Part 2 will introduce named entity recognition with {openNLP}, an Apache project in Java interfaced by this nice R package that, in turn, relies on {NLP} classes. We will try to make machine learning (the MaxEnt models offered in {openNLP}) figure out the characters from Shakespeare’s plays, a quite difficult task given that the learning algorithms at our disposal were trained on contemporary English corpora.
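To make the setup concrete, here is a minimal sketch of person recognition with {openNLP}; it is not the code from the repository. It assumes that Java and the {NLP}, {openNLP}, and {openNLPmodels.en} packages are installed. The helper name find_persons() and both sample sentences are hypothetical, chosen only for illustration.

```r
# A minimal sketch, not the pipeline from the repository.
# Assumes Java plus {NLP}, {openNLP}, and {openNLPmodels.en} are installed.
library(NLP)
library(openNLP)

# Hypothetical helper: return the substrings that the MaxEnt
# person model tags as entities in a piece of text.
find_persons <- function(text) {
  s <- as.String(text)
  # Entity detection needs sentence and word annotations first,
  # so the three annotators are run as one pipeline.
  annotators <- list(
    Maxent_Sent_Token_Annotator(),
    Maxent_Word_Token_Annotator(),
    Maxent_Entity_Annotator(kind = "person")
  )
  a <- annotate(s, annotators)
  is_entity <- a$type == "entity"
  if (!any(is_entity)) return(character(0))
  s[a[is_entity]]
}

# Illustrative inputs only: a contemporary sentence and a line from Hamlet.
modern <- "Pierre Vinken will join the board as a nonexecutive director."
early_modern <- "Horatio, thou art e'en as just a man as e'er my conversation coped withal."

persons_modern <- find_persons(modern)
persons_early  <- find_persons(early_modern)
print(persons_modern)
print(persons_early)
```

Running the same helper over both kinds of text is a quick way to see how far models trained on contemporary English carry over to Early Modern English.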

The accuracy of character recognition from Shakespeare’s comedies, tragedies, and histories; the black dashed line is the overall density. The results are not realistic (explanation given in the respective .Rmd and .html files).

What I really want to show you here is how tricky and difficult it can be to do serious text-mining, and to help you by exemplifying some steps that are necessary to ensure the consistency of the results that you are expecting. The text-mining pipelines being developed here are in no sense perfect or complete; they are meant to demonstrate important problems and propose solutions rather than to provide copy-and-paste-ready chunks for future re-use. In essence, every text-mining study will need a specific pipeline of its own, except in those cases where a standardized information extraction + text-mining pipeline is being developed (a situation where, by assumption, one periodically processes large text corpora, e.g. web-scraped news and other media reports, from various sources, in various formats, and where one simply needs to learn to live with approximations). Relentlessly chaining tm_map() calls to various content_transformer() wrappers from {tm}, while ignoring the necessary changes in parameters and the different content-specific transformations (of which {tm} supports only a few), will simply not do.
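As a small illustration of that last point, the sketch below puts two content-specific steps in front of the generic {tm} transformations. It assumes the {tm} package is installed; the two sample lines, the regular expressions, and the helper name strip_pattern are hypothetical and are not taken from the repository.

```r
# A minimal sketch of mixing content-specific and generic {tm} steps.
# The sample lines and patterns below are illustrative assumptions.
library(tm)

play_lines <- c(
  "HAMLET. O, that this too too solid flesh would melt [Aside]",
  "HORATIO. Hail to your lordship!"
)
corpus <- VCorpus(VectorSource(play_lines))

# Custom transformer: replace whatever matches `pattern` with a space.
strip_pattern <- content_transformer(
  function(x, pattern) gsub(pattern, " ", x)
)

# Content-specific steps first: stage directions in square brackets,
# then speaker labels written in capitals at the start of a line.
corpus <- tm_map(corpus, strip_pattern, "\\[[^]]*\\]")
corpus <- tm_map(corpus, strip_pattern, "^[A-Z]+\\.")

# Generic steps afterwards.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

cleaned <- trimws(unname(vapply(
  corpus,
  function(doc) paste(content(doc), collapse = " "),
  character(1)
)))
print(cleaned)
```

The order matters here: if tolower and removePunctuation ran first, the capitalized speaker labels and the bracketed stage directions could no longer be told apart from the spoken text, and "hamlet" and "aside" would end up counted as ordinary words.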

Don’t get hooked on the results presented in the {ggplot2} figure above; {openNLP} is not that successful in recognizing personal names from Shakespeare’s plays (in spite of the fact that it works great for contemporary English documents). I have helped it a bit, by doing something that is not applicable to real-world situations; go take a look at the code from this GitHub repository.

See you soon.
