The Eye of the World as word cloud

December 16, 2012

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

The Eye of the World is the first book in Robert Jordan's Wheel of Time series. As the last of these books will be published soon, I wondered whether natural language processing could be used to examine books like these. For this purpose I downloaded a copy from an undisclosed source and analyzed it.

During my experiments with this file I found that a word cloud is actually a good way to look at it. My first attempts, using correspondence analysis, did not give anything useful: with everything plotted on top of each other, the result is not an interesting plot. Clustering of chapters did not reveal anything nice either. The wordcloud package has comparison clouds, which can be used to differentiate between chapters.
I am sure readers can make their own interpretation of the result. Myself, I am surprised by the massive number of names of places and persons in this first book, even though I know the number of persons in the series is large.
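
To make concrete what differentiating between chapters means, here is a minimal sketch in base R of the weighting that, as far as I understand it, a comparison cloud is built on: each term's rate within a chapter is compared with its average rate over all chapters, and the term is shown for the chapter where it is most over-represented. The terms and counts below are invented for illustration and are not taken from the book.

```r
# Toy term-by-chapter counts; the numbers are invented for illustration.
counts <- matrix(c(10, 2, 1,
                    1, 8, 1,
                    5, 5, 5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("rand", "trolloc", "said"),
                                 c("ch1", "ch2", "ch3")))

# Turn counts into within-chapter rates so long chapters do not dominate.
rates <- sweep(counts, 2, colSums(counts), "/")

# Deviation of each chapter's rate from the term's average rate over chapters.
deviation <- rates - rowMeans(rates)

# Each term goes to the chapter where it is most over-represented,
# and that maximum deviation serves as its weight (size) in the cloud.
best_chapter <- colnames(deviation)[apply(deviation, 1, which.max)]
names(best_chapter) <- rownames(deviation)
weight <- apply(deviation, 1, max)

print(best_chapter)
print(round(weight, 3))
```

A term that occurs at the same rate in every chapter gets a deviation of zero everywhere, so it would barely show up; names that are concentrated in a few chapters get the largest weights.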


R code
r1 <- readLines("Robert Jordan - Wheel Of Time 01 - The Eye Of The World.txt")
#remove text page xxx
pagina <- grep('^Page [[:digit:]]+$',r1)
r1 <- r1[-pagina]
r1 <- sub('Page [[:digit:]]+$','',r1)
# remove empty lines
r1 <- r1[r1!='']
# locate chapter headers; the alternation is grouped so that ^ and $ apply to both branches
chapterrow <- grep('^(CHAPTER [[:digit:]]+|PROLOGUE)$',r1)
chapterrow <- c(chapterrow,length(r1)+1)
# extract chapters: skip the header and title lines, then collapse each chapter into one string
chapters <- sapply(1:(length(chapterrow)-1),function(i) 
      paste(r1[(chapterrow[i]+2):(chapterrow[i+1]-1)],collapse=' '))
chapterrow <- chapterrow[-length(chapterrow)]
#name the chapters
chapternames <- paste(sub('CHAPTER ','',r1[chapterrow]),r1[chapterrow+1])
names(chapters) <- chapternames

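# use example processing from tm
library(tm)
EotW <- Corpus(VectorSource(chapters))
EotW <- tm_map(EotW,stripWhitespace)
# tm 0.6 and later require base functions to be wrapped in content_transformer()
EotW <- tm_map(EotW,content_transformer(tolower))
EotW <- tm_map(EotW,removeWords,stopwords("english"))
# stemDocument needs the SnowballC package to be installed
EotW <- tm_map(EotW,stemDocument)
EotW <- tm_map(EotW,removePunctuation)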

library(wordcloud)
tdmEotW <- TermDocumentMatrix(EotW)

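# hclust to put related chapters together, on square-root transformed counts
# (in R >= 3.1.0 the method formerly called 'ward' is named 'ward.D')
h1 <- hclust(dist(t(sqrt(as.matrix(tdmEotW)))),method='ward.D')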

# and make a cloud, with the chapters in clustering order
library(colorspace)
tdmEotW2 <- as.matrix(tdmEotW)[,h1$order]

# one colour per chapter column
comparison.cloud(tdmEotW2,random.order=FALSE,scale=c(1.4,.6),title.size=.7,
    colors=rainbow_hcl(n=ncol(tdmEotW2)))
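
The chapter-splitting step is the part most likely to need adapting to another text file, so here is a small self-contained sketch of it on invented lines. The vector `toy` and its contents are made up for illustration; the sketch assumes, like the code above, that each header stands alone on its line and is followed by a title line. It groups the alternation in the pattern and uses `collapse` so that each chapter ends up as a single string.

```r
# Invented stand-in for the cleaned lines of a book (not the real text).
toy <- c("PROLOGUE", "First Title", "Line one.", "Line two.",
         "CHAPTER 1", "Second Title", "Line three.",
         "CHAPTER 2", "Third Title", "Line four.", "Line five.")

# A header is a whole line that is either CHAPTER <number> or PROLOGUE.
header <- grep('^(CHAPTER [[:digit:]]+|PROLOGUE)$', toy)
# Sentinel one past the end, so the last chapter also has an end point.
bounds <- c(header, length(toy) + 1)

# Body of chapter i: skip header and title, stop before the next header.
body <- sapply(seq_along(header), function(i)
  paste(toy[(bounds[i] + 2):(bounds[i + 1] - 1)], collapse = ' '))
# Name each chapter by its number (or PROLOGUE) followed by its title.
names(body) <- paste(sub('CHAPTER ', '', toy[header]), toy[header + 1])

print(body)
```

Note that a chapter consisting of only a header and a title would make the index range run backwards, so real input may need a check for that case.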

 
