Distributional Semantics in R: Part 1 {tm} classes + read/write

December 24, 2016
(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

The R code for this tutorial on Methods of Distributional Semantics in R is found in the respective GitHub repository.

Following my Methods of Distributional Semantics in R BelgradeR Meetup with Data Science Serbia, held at Startit Center, Belgrade, on 11/30/2016, several people asked me for the R code used in the analysis of William Shakespeare’s plays presented there. I have decided to continue developing the code from the Meetup and to turn the examples I showed into a more or less complete and comprehensible text-mining tutorial with {tm}, {openNLP}, and {topicmodels} in R. All files in this GitHub repository are a product of that work.

The idea here is to provide an overview of selected R packages and functions for text-mining and modeling in distributional semantics. Instead of presenting functions and packages piecemeal, I have decided to develop a full text-mining pipeline that combines the essential steps in the order one would need to follow them to arrive at a useful Data Science product: data wrangling, integrity checks, text pre-processing, and modeling.

The first notebook, Part 1: The {tm} structures for text-mining in R, introduces the classes provided by the {tm} package and shows you how to index a text corpus with metadata prior to modeling and analytics. It also introduces the essential read and write operations from {tm} (i.e. VCorpus formation).
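To give a flavor of these operations, here is a minimal sketch, not the code from the notebook itself. It assumes the {tm} package is installed; the two toy texts, the `genre` tag, the file names, and the temporary output directory are all made up for illustration and stand in for the actual plays.

```r
library(tm)

# Two toy documents standing in for the plays (hypothetical text, not the real corpus)
texts <- c(hamlet  = "To be, or not to be, that is the question.",
           macbeth = "Tomorrow, and tomorrow, and tomorrow.")

# Build a volatile (in-memory) corpus from a character vector
corpus <- VCorpus(VectorSource(texts))

# Document-level metadata, stored with each document
meta(corpus[[1]], tag = "heading") <- "Hamlet"
meta(corpus[[2]], tag = "heading") <- "Macbeth"

# Indexed metadata: a data frame with one row per document, kept at the corpus level
meta(corpus, tag = "genre") <- c("tragedy", "tragedy")

# Write each document to a plain-text file (assumed location: a temp directory)
out_dir <- file.path(tempdir(), "plays")
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(corpus, path = out_dir, filenames = c("hamlet.txt", "macbeth.txt"))

# Read the directory back into a new VCorpus
corpus2 <- VCorpus(DirSource(out_dir, encoding = "UTF-8"),
                   readerControl = list(reader = readPlain, language = "en"))
```

Calling `meta(corpus)` returns the indexed metadata as a data frame, and `meta(corpus[[1]])` shows the tags attached to a single document. Note that metadata set on `corpus` is not written to disk by `writeCorpus()`, so `corpus2` would need to be indexed again after reading.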

The forthcoming Part 2 of this tutorial will cover Entity Recognition with {openNLP}. We will check how well machine learning can tell which characters appear in which of Shakespeare’s plays. In Part 3 we will deal with text pre-processing with {tm}, while Part 4 introduces topic modeling with Latent Dirichlet Allocation. Finally, Part 5 will present an analytical exploration of the topic model.

A semantic network of Shakespeare’s characters, produced with {igraph} from a previously developed LDA model fitted with {topicmodels}.

The video of the Meetup that motivated me to develop this tutorial is on YouTube; however, there are no English subtitles yet.

The exercise uses the complete plays of William Shakespeare, kindly provided by the Massachusetts Institute of Technology on its The Complete Works of William Shakespeare pages.

Stay tuned for more text-mining in R. 
