**Weird Data Science**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the previous post in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.

Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.

One of the more well-known arguments for the authenticity of the manuscript, in addition to its manufacture with period parchment and inks, is that the text appears to follow certain statistical properties associated with human language, and which were unknown at the time of its creation.

The most well-known of these properties is that the frequency of words in the Voynich Manuscript have been claimed to follow a phenomenon known as *Zipf’s Law*, whereby the frequency of a word’s occurrence in the text is inversely proportional to its rank in the list of words ordered by frequency.

In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs presented by the Voynich manuscript.

# Unnatural Laws

Zipf’s Law is an example of a discrete power law probability distribution. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.

In its original context of human langauge, Zipf’s Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, and three times as common as the third most common word. More precisely, this law holds *for much of the corpus*, as the law tends to break down somewhat at both the most-frequent and least-frequent words in the corpus^{1}. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.

The most well-known, if highly flawed, method to determine whether a distribution follows a power law is to plot it with both axes expressed as a log-scale: a so-called log-log plot. A power law, represented in such a way, will appear linear. Unfortunately, a hideous menagerie of other distributions will also appear linear in such a setting.

More generally, it is rarely sensible to claim that any natural phenomenon *follows* a given distribution or model, but instead to demonstrate that a distribution presents *a useful model* for a given set of observations. Indeed, it is possible to fit any set of observations to a power law, with the assumption that the fit will be poor. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more Bayesian approach would advocate building a range of models, demonstrating that the power law is most accurate. All truth, it seems, is relative.

Faced with the awful statistical horror of the universe, we are reduced to seeking evidence *against* a phenomenon’s adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines the Voynich Manuscript.

A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf’s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.

The poweRlaw package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to *disprove* the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the model is *not* drawn from a power law.

This is an inversion of the more typical frequentist null hypothesis scenario. Typically, in such approaches, we hope for a low p-value, typically below 0.05 or even 0.001, showing that the chance of the observations being consistent with the null hypothesis is extremely low. For this test, we instead hope that our p-value is *insufficiently* low to make such a claim, and thus that a power law *is* consistent with the data.

The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of `0.151`

. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.

Led further down twisting paths by this initial taste of success, we can now present the Voynich corpus against other human-language corpora to gain a faint impression of how similar or different it is to known languages. The following plot compares the frequency of words in the Voynich Manuscript to those of the twenty most popular languages in Wikipedia, taken from the dataset available here.

The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia. The most striking difference being the clustering of Voynich word frequencies in the lower half of the diagram, resulting from the smaller corpus of words in the Voynich Manuscript. This causes, in particular, lower-frequency words to occur an identical number of times, resulting in vertical leaps in the frequency graph towards the lower end.

To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the United Nations Declaration of Human Rights.

# A Refined Randomness

The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.

Sadly, however, as with all too many arguments in the literature regarding power law distributions arising in nature, there is a complication to this argument that again highlights the difference between proof and the failure to disprove. Certainly, if a power law had proved incompatible with the Voynich Manuscript then we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.

Rugg has argued that certain random mechanisms can produce text that adheres to Zipf’s Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by Li. (1992)