24 Days of R: Day 11

December 11, 2013
By

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I don't know how often Michael Caine appeared in a Shakespearean work, but I'm sure that he has and I'm sure that he was excellent. A bit pressed for time today, so just a simple word cloud featuring the full text of King Lear. I found the text at a website that I presume is associated with a university in Cambridge. http://shakespeare.mit.edu/lear/full.html I stored a local copy.

My sister lives in Stratfrod-Upon-Avon and can't stop talking about Shakespeare. Today's post is dedicated to her.

aFile = readLines("./Data/Lear.txt")

library(tm)
myCorpus = Corpus(VectorSource(aFile))

myCorpus = tm_map(myCorpus, tolower)
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

myDTM = TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))

m = as.matrix(myDTM)

v = sort(rowSums(m), decreasing = TRUE)

library(wordcloud)
set.seed(1234)
wordcloud(names(v), v, min.freq = 15)

plot of chunk ReadData

A lot of “king”, “lear”, “thee”, “thy” and “thou”.

And of course in searching for a reference, for the code above (I modified from it something else), I came across this: Text mining Shakespeare. I feel even lazier than I did before.

I can't leave it at that, so I'll very quickly determine the most frequent 2 and 3 word phrases in the text.

library(tau)

bigrams = textcnt(aFile, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
bigrams[1]
## king lear 
##       209
bigrams[2]
## my lord 
##      76
trigrams = textcnt(aFile, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
trigrams[1]
## king lear no 
##           13
trigrams[2]
## i know not 
##         12

No surprises that the most frequent bigram is “king lear” at 209 times and “my lord” is the sort of thing one would expect in an Elizabethan play. I like that the most frequent trigram is “king lear no” at 13. I'll have to have a look at the text to see what's behind that.

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] wordcloud_2.4      RColorBrewer_1.0-5 Rcpp_0.10.6       
## [4] knitr_1.4.1        RWordPress_0.2-3   tau_0.0-15        
## [7] tm_0.5-9.1        
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    parallel_3.0.2
##  [5] RCurl_1.95-4.1 slam_0.1-30    stringr_0.6.2  tools_3.0.2   
##  [9] XML_3.98-1.1   XMLRPC_0.3-0

To leave a comment for the author, please follow the link and comment on their blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)