# 24 Days of R: Day 11

December 11, 2013

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

I don't know how often Michael Caine has appeared in a Shakespearean work, but I'm sure that he has and I'm sure that he was excellent. I'm a bit pressed for time today, so this is just a simple word cloud featuring the full text of King Lear. I found the text at a website that I presume is associated with a university in Cambridge, http://shakespeare.mit.edu/lear/full.html, and stored a local copy.

My sister lives in Stratford-upon-Avon and can't stop talking about Shakespeare. Today's post is dedicated to her.

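```r
# Read the locally stored copy of the play, one element per line
aFile = readLines("./Data/Lear.txt")

library(tm)
myCorpus = Corpus(VectorSource(aFile))

# Note: with tm >= 0.6, wrap base functions, e.g. content_transformer(tolower)
myCorpus = tm_map(myCorpus, tolower)
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

# wordLengths is the tm control for term length; minWordLength is not
# recognised by recent versions, so short words would be dropped by default
myDTM = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

m = as.matrix(myDTM)

# Total count of each term across all lines, most frequent first
v = sort(rowSums(m), decreasing = TRUE)

library(wordcloud)
set.seed(1234)
wordcloud(names(v), v, min.freq = 15)
```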

A lot of “king”, “lear”, “thee”, “thy” and “thou”.
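The cloud only hints at relative sizes, so it can help to look at the counts directly. Here is a minimal base-R sketch of that idea. It uses three invented lines (`toy_lines`, a hypothetical stand-in for `aFile`) so that it runs without the local copy of the play, and it skips stop-word removal. With the real data, `head(v, 10)` on the sorted vector built above should serve the same purpose.

```r
# Toy stand-in for aFile; these lines are invented for illustration.
toy_lines = c("My lord, the king is here.",
              "The king! My lord, thou art the king.",
              "Thou art my lord.")

# Lower-case, replace punctuation with spaces, split on whitespace.
clean = gsub("[[:punct:]]", " ", tolower(toy_lines))
words = unlist(strsplit(clean, "[[:space:]]+"))
words = words[words != ""]

# Tabulate and sort, most frequent first.
freq = sort(table(words), decreasing = TRUE)
head(freq, 4)
```

Unlike the tm pipeline, nothing is filtered here, so words such as "the" and "my" sit alongside "king" and "lord" at the top.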

And of course, in searching for a reference for the code above (I modified it from something else), I came across this: Text mining Shakespeare. I feel even lazier than I did before.

I can't leave it at that, so I'll very quickly determine the most frequent two- and three-word phrases in the text.

```r
library(tau)

bigrams = textcnt(aFile, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
bigrams[1]
```
```
## king lear
##       209
```
```r
bigrams[2]
```
```
## my lord
##      76
```
```r
trigrams = textcnt(aFile, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
trigrams[1]
```
```
## king lear no
##           13
```
```r
trigrams[2]
```
```
## i know not
##         12
```

No surprise that the most frequent bigram is “king lear”, at 209 occurrences, and “my lord” is the sort of thing one would expect in an Elizabethan play. I like that the most frequent trigram is “king lear no”, at 13. I'll have to look at the text to see what's behind that.
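To make the phrase counting less of a black box, here is a rough base-R sketch of word n-gram counting. `count_ngrams` is a hypothetical helper, not part of tau, and it makes no claim to match how `textcnt` tokenizes or handles line breaks. The toy lines are invented and assume a script-like layout with each speaker label on its own line. Under that assumption (not checked against the actual file), a label followed by a speech that opens with "No" would yield a trigram like the one above.

```r
# Count word n-grams in base R. count_ngrams is a hypothetical helper;
# it treats the whole input as one continuous stream of words.
count_ngrams = function(lines, n) {
  clean = gsub("[[:punct:]]", " ", tolower(lines))
  words = unlist(strsplit(clean, "[[:space:]]+"))
  words = words[words != ""]
  if (length(words) < n) return(integer(0))
  # One window per start position, each n words long.
  starts = seq_len(length(words) - n + 1)
  grams = vapply(starts,
                 function(i) paste(words[i:(i + n - 1)], collapse = " "),
                 character(1))
  sort(table(grams), decreasing = TRUE)
}

# Invented lines laid out like a script: speaker label, then speech.
toy_lines = c("KING LEAR", "No, I will not.", "KENT", "My lord.",
              "KING LEAR", "No more of that.")

toy_trigrams = count_ngrams(toy_lines, 3)
toy_trigrams[1]
```

In this toy input the top trigram is "king lear no", purely because the label and the first word of the speech end up adjacent in the word stream.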

```r
sessionInfo()
```
```
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] wordcloud_2.4      RColorBrewer_1.0-5 Rcpp_0.10.6
## [4] knitr_1.4.1        RWordPress_0.2-3   tau_0.0-15
## [7] tm_0.5-9.1
##
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    parallel_3.0.2
##  [5] RCurl_1.95-4.1 slam_0.1-30    stringr_0.6.2  tools_3.0.2
##  [9] XML_3.98-1.1   XMLRPC_0.3-0
```