
Don Quijote — Word Statistics

[This article was first published on Mathematical Poetics, and kindly contributed to R-bloggers.]

Using Project Gutenberg’s free text of Don Quijote and the techniques from Unix for Poets, here are the most used (non-short) words in Miguel de Cervantes’ famous work:

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote.textfile | perl -e 'while (<>) { print if length($_)>5; }' | sort | uniq -c | sort -rn > quijote.hist
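To see what each stage of that pipeline contributes, here is a minimal sketch run on a made-up sentence rather than the Gutenberg file. The sample text and the `word_hist` function name are assumptions for illustration, and `awk` stands in for the Perl filter. Since Perl's `$_` still carries the trailing newline, `length($_)>5` keeps words of five or more letters, which the sketch mirrors with `length($0) > 4`. The sample is plain ASCII, so the accented-vowel class is left out.

```shell
#!/bin/sh
# Hypothetical sample text standing in for the Gutenberg file.
sample='the knight rode, the squire rode, the knight sang, the knight slept; the squire listened'

# word_hist: read text on stdin, print "count word" lines, most frequent first.
word_hist() {
  tr -sc 'A-Za-z' '\n' |      # turn every run of non-letters into a single newline
    awk 'length($0) > 4' |    # keep words of five or more letters
    sort |                    # bring identical words together so uniq can count them
    uniq -c |                 # prefix each distinct word with its number of occurrences
    sort -rn |                # highest counts first
    awk '{ print $1, $2 }'    # strip the padding that uniq -c adds
}

hist=$(printf '%s\n' "$sample" | word_hist)
printf '%s\n' "$hist"
```

The first two lines of output are `3 knight` and `2 squire`; the two words that occur once follow in whatever order `sort -rn` gives to ties.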

Here’s the power law distribution of non-short words in Don Quijote:

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote | perl -e 'while (<>) { print if length($_)>5; }' | sort | uniq -c | sort -rn | perl -e 'while (<>) { print $1 if $_ =~ /(\d+)/; print "\n"; } ' | uniq -c > quijote.countofcounts.powerlaw.hist

> par(bg="#fafaff", col="#111177")
> plot(quijote.countofcounts.powerlaw, log="y", type="s", lwd=4, xlab="Number of times a word appears in the text", ylab="Number of words with this frequency", main="Word Frequency in Don Quijote de la Mancha", col="#111177")
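The count-of-counts step that feeds this plot may be easier to follow on a tiny input. The sketch below is an illustration under assumptions: the five-line histogram and the `count_of_counts` function name are invented, and `awk` replaces the second Perl filter. It relies on the same fact as the pipeline above: after `sort -rn`, equal frequencies sit on adjacent lines, so `uniq -c` can count them without another sort. Unlike the raw `uniq -c` output, the sketch prints the frequency first and the number of words second.

```shell
#!/bin/sh
# Hypothetical "count word" histogram, already sorted by descending count,
# in the shape produced by `sort | uniq -c | sort -rn`.
hist='3 knight
2 squire
1 slept
1 listened
1 windmill'

# count_of_counts: print "frequency number_of_words" for each distinct frequency.
count_of_counts() {
  awk '{ print $1 }' |        # keep only the frequency column
    uniq -c |                 # equal frequencies are adjacent, so count each run
    awk '{ print $2, $1 }'    # frequency first, then how many words share it
}

coc=$(printf '%s\n' "$hist" | count_of_counts)
printf '%s\n' "$coc"
```

Here one word appears three times, one appears twice, and three appear once, so the output is `3 1`, `2 1`, `1 3`.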

Including the short words preserves the power law distribution:

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote | sort | uniq -c | sort -rn | perl -e 'while (<>) { print $1 if $_ =~ /(\d+)/; print "\n"; } ' | uniq -c > quijote.countofcounts.powerlaw.hist.shortwordstambien
