Clustering the Words of William Shakespeare
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.
Preparing the Data
I will start with the Term Document Matrix (TDM) consisting of 71 words commonly used by Shakespeare.
> inspect(TDM.common[1:10,1:10]) A term-document matrix (10 terms, 10 documents) Non-/sparse entries: 94/6 Sparsity : 6% Maximal term length: 6 Weighting : term frequency (tf) Docs Terms 1 2 3 4 5 6 7 8 9 10 act 1 4 7 9 6 3 2 14 1 0 art 53 0 9 3 5 3 2 17 0 6 away 18 5 8 4 2 10 5 13 1 7 call 17 1 4 2 2 1 6 17 3 7 can 44 8 12 5 10 6 10 24 1 5 come 19 9 16 17 12 15 14 89 9 15 day 43 2 2 4 1 5 3 17 2 3 enter 0 7 12 11 10 10 14 87 4 6 exeunt 0 3 8 8 5 4 7 49 1 4 exit 0 6 8 5 6 5 3 31 3 2
This matrix is first converted from a sparse data format into a conventional matrix.
> TDM.dense dim(TDM.dense) [1] 71 182
Next the TDM is normalised so that the rows sum to unity. Each entry in the normalised TDM then represents the number of times that a word occurs in a particular document relative to the number of occurrences across all of the documents.
> TDM.scaled
Clustering
We will be using a hierarchical clustering technique which operates on a dissimilarity matrix. We will use the Euclidean distance between each of the rows in the TDM, where each row is treated as a vector in a space of 182 dimensions.
> TDM.dist = dist(TDM.scaled)
Finally we perform agglomerative clustering using agnes() from the cluster package.
> library(cluster) > > hclusters hclusters Call: agnes(x = TDM.dist, method = "complete") Agglomerative coefficient: 0.6256247 Order of objects: [1] act great way away hand stand life can hath yet [11] look see leav let shall make take thus made till [21] come well will good ill like now give upon know [31] may must man much think hear speak never one say [41] tell enter exeunt scene exit tis mean fear men keep [51] word name lord call two old sir first art thee [61] thou thi day live heart mine time part true eye [71] love Height (summary): Min. 1st Qu. Median Mean 3rd Qu. Max. 0.02495 0.04509 0.05722 0.06050 0.06897 0.14260 Available components: [1] "order" "height" "ac" "merge" "diss" "call" [7] "method" "order.lab"
Plotting a Dendrogram
Let’s have a look at the results of our labours.
plot(hclusters, which.plots = 2, main = "", sub = "", xlab = "")
This dendrogram reflects the tree-like structure of the word clusters. We can see that the words “enter”, “exeunt” and “scene” are clustered together, which makes sense since they are related to stage directions. Also “thee” and “thou” have similar usage. In the previous analysis we found that the occurrences of “love” and “eye” were highly correlated and consequently we find them clustered here too.
This is rather cool. No doubt a similar analysis applied to contemporary literature would yield extremely different results. Anybody keen on clustering the Complete Works of Terry Pratchett?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.