24 Days of R: Day 11

December 11, 2013

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

I don't know how often Michael Caine appeared in a Shakespearean work, but I'm sure that he has and I'm sure that he was excellent. I'm a bit pressed for time today, so this is just a simple word cloud featuring the full text of King Lear. I found the text at a website that I presume is associated with a university in Cambridge: http://shakespeare.mit.edu/lear/full.html. I stored a local copy.

My sister lives in Stratford-upon-Avon and can't stop talking about Shakespeare. Today's post is dedicated to her.

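# Read the local copy of the play, one element per line
aFile = readLines("./Data/Lear.txt")

library(tm)
myCorpus = Corpus(VectorSource(aFile))

# Normalise the text: lower case, then drop punctuation, digits and stop words.
# (From tm 0.6 on, tolower has to be wrapped: content_transformer(tolower).)
myCorpus = tm_map(myCorpus, tolower)
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

# Despite the name, myDTM is a term-document matrix: terms in rows, lines in columns.
# (Current tm documents the word-length option as wordLengths = c(1, Inf).)
myDTM = TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))

m = as.matrix(myDTM)

# Total count of each term across all lines, most frequent first
v = sort(rowSums(m), decreasing = TRUE)

library(wordcloud)
set.seed(1234)  # fix the random word placement so the plot is reproducible
wordcloud(names(v), v, min.freq = 15)  # only words that appear at least 15 times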

[Plot: word cloud of the King Lear text, chunk ReadData]

A lot of “king”, “lear”, “thee”, “thy” and “thou”.
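
The cloud gives an impression, but the counts behind it are easy to print as well. Here is a minimal sketch in base R of the same clean-and-count idea, assuming two lines quoted from the play as a stand-in for aFile. The helper name word_freq and its stop argument are made up for illustration and are not part of tm.

```r
# Hypothetical helper (not part of tm): lower-case the text, strip punctuation
# and digits, split on whitespace, drop any stop words, then tabulate.
word_freq = function(lines, stop = character(0)) {
  txt = tolower(paste(lines, collapse = " "))
  txt = gsub("[[:punct:]]", "", txt)
  txt = gsub("[[:digit:]]", "", txt)
  words = strsplit(txt, "[[:space:]]+")[[1]]
  words = words[nzchar(words)]          # discard empty strings
  words = words[!words %in% stop]       # crude stop-word filter
  sort(table(words), decreasing = TRUE)
}

# Stand-in for aFile: two lines quoted from the play
sampleLines = c("Nothing will come of nothing: speak again.",
                "Nothing, my lord.")

freq = word_freq(sampleLines, stop = c("of", "my", "will"))
head(freq, 3)
freq[["nothing"]]  # 3
```

On the real data, head(v, 10) gives the same kind of listing straight from the sorted term counts.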

And of course, in searching for a reference for the code above (I modified it from something else), I came across this: Text mining Shakespeare. I feel even lazier than I did before.

I can't leave it at that, so I'll very quickly determine the most frequent two- and three-word phrases in the text.

library(tau)

bigrams = textcnt(aFile, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
bigrams[1]
## king lear 
##       209
bigrams[2]
## my lord 
##      76
trigrams = textcnt(aFile, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
trigrams[1]
## king lear no 
##           13
trigrams[2]
## i know not 
##         12

No surprise that the most frequent bigram is “king lear”, which appears 209 times, and “my lord” is the sort of thing one would expect in a Shakespearean play. I like that the most frequent trigram is “king lear no”, with 13 occurrences. I'll have to have a look at the text to see what's behind that.
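
For anyone without tau installed, the idea behind those counts can be sketched in base R: slide a window of n words along the text and tabulate what falls inside it. This is an illustration of the technique, not a description of how textcnt works internally, and its tokenisation may differ (it treats the whole text as one stream of words and turns every non-letter into a space). The name count_ngrams is made up, and the two sample lines are quoted from the play as a stand-in for aFile.

```r
# Hypothetical helper (not part of tau): count word n-grams in base R.
count_ngrams = function(lines, n = 2) {
  # Treat the whole text as one stream of lower-case words
  txt = tolower(paste(lines, collapse = " "))
  txt = gsub("[^a-z]+", " ", txt)       # any run of non-letters becomes one space
  words = strsplit(txt, " ", fixed = TRUE)[[1]]
  words = words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  # Slide a window of n words along the text and paste each window together
  starts = seq_len(length(words) - n + 1)
  grams = vapply(starts,
                 function(i) paste(words[i:(i + n - 1)], collapse = " "),
                 character(1))
  sort(table(grams), decreasing = TRUE)
}

# Stand-in for aFile: two lines quoted from the play
sampleLines = c("Howl, howl, howl, howl!",
                "Never, never, never, never, never!")

bigrams_demo = count_ngrams(sampleLines, n = 2)
trigrams_demo = count_ngrams(sampleLines, n = 3)
bigrams_demo[1]  # "never never", 4 occurrences
```

Because the sketch joins the lines before counting, a phrase can straddle a line break, as “howl never” does here. That is worth keeping in mind when a count looks odd.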

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] wordcloud_2.4      RColorBrewer_1.0-5 Rcpp_0.10.6       
## [4] knitr_1.4.1        RWordPress_0.2-3   tau_0.0-15        
## [7] tm_0.5-9.1        
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    parallel_3.0.2
##  [5] RCurl_1.95-4.1 slam_0.1-30    stringr_0.6.2  tools_3.0.2   
##  [9] XML_3.98-1.1   XMLRPC_0.3-0
