
Illuminating the Illuminated Part Two: Ipsa Scientia Potestas Est


In the previous post in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.

Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.

One of the better-known arguments for the authenticity of the manuscript, beyond its manufacture from period parchment and inks, is that the text appears to follow certain statistical properties associated with human language that were unknown at the time of its creation.

The most famous of these properties is that the word frequencies of the Voynich Manuscript have been claimed to follow a phenomenon known as Zipf’s Law, whereby the frequency of a word’s occurrence in the text is inversely proportional to its rank in the list of words ordered by frequency.

In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs of the Voynich Manuscript.

Unnatural Laws

Zipf’s Law is an example of a discrete power law probability distribution. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.

In its original context of human language, Zipf’s Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, three times as common as the third, and so on. More precisely, the law holds for much of the corpus, but tends to break down somewhat at both the most-frequent and least-frequent extremes. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.
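To make the idealised law concrete, here is a minimal sketch in R (the function name zipf_expected is our own):

# Idealised Zipfian frequencies: the word of rank r occurs with
# frequency proportional to 1/r^s (s = 1 in the classic formulation).
zipf_expected <- function( n_words, s = 1 ) {
	rank <- seq_len( n_words )
	freq <- ( 1 / rank^s ) / sum( 1 / rank^s )
	tibble::tibble( rank = rank, freq = freq )
}

zipf_expected( 3 )
# The top-ranked word is twice as frequent as the second,
# and three times as frequent as the third.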

The best-known, if highly flawed, method of determining whether a distribution follows a power law is to plot it with both axes on a log scale: a so-called log-log plot. A power law, represented in this way, appears linear. Unfortunately, a hideous menagerie of other distributions also appear linear in such a setting.
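As a sketch of the technique, using a hypothetical toy corpus rather than the Voynich data:

# A minimal log-log rank-frequency plot for an arbitrary corpus.
# (A sketch only: toy_words is a hypothetical character vector.)
library( tidyverse )

toy_words <- c( "the", "the", "the", "the", "of", "of", "and", "a" )

tibble( word = toy_words ) %>%
	count( word, sort=TRUE ) %>%
	mutate( rank = row_number() ) %>%
	ggplot( aes( x=log( rank ), y=log( n ) ) ) +
	geom_point() +
	labs( x="Log Rank", y="Log Frequency" )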

More generally, it is rarely sensible to claim that any natural phenomenon follows a given distribution or model; instead, we can only demonstrate that a distribution presents a useful model for a given set of observations. Indeed, it is possible to fit a power law to any set of observations; the fit may simply be dreadful. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more Bayesian approach would advocate building a range of candidate models and demonstrating that the power law describes the observations best. All truth, it seems, is relative.
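One way to act on that advice, sketched here with the poweRlaw package used below, is to fit a rival distribution such as the discrete log-normal to the same data and compare the two fits with Vuong's test (word_counts is a hypothetical vector of observed word counts):

# Fit a discrete power law and a discrete log-normal to the same
# counts, then compare the fits directly with Vuong's test.
library( poweRlaw )

m_pl <- displ$new( word_counts )
m_pl$setXmin( estimate_xmin( m_pl ) )
m_pl$setPars( estimate_pars( m_pl ) )

m_ln <- dislnorm$new( word_counts )
m_ln$setXmin( m_pl$getXmin() )   # comparison requires a common xmin
m_ln$setPars( estimate_pars( m_ln ) )

# A two-sided p-value near 1 means neither model is clearly better.
comparison <- compare_distributions( m_pl, m_ln )
comparison$p_two_sided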

Faced with the awful statistical horror of the universe, we are reduced to seeking evidence against a phenomenon’s adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines a Zipfian reading of the Voynich Manuscript.

Fitted Power Law of Voynich Corpus | (PDF Version)

A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf’s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.

The poweRlaw package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to reject the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the data is not drawn from a power law.

This is an inversion of the more typical frequentist null hypothesis scenario. In such approaches we usually hope for a low p-value, below 0.05 or even 0.001, indicating that observations as extreme as ours would be highly unlikely under the null hypothesis. For this test, we instead hope that our p-value is insufficiently low to make such a claim, and thus that a power law remains consistent with the data.

The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of 0.151. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.

Voynich Power Law Fit and Plot

voynich_powerlaw.r

library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( tidytext )
library( drlib )

library( poweRlaw )

library(cowplot)
library(magick)

_add( "voynich_", "/usr/shares/TTF/weird/voynich/eva1.ttf")
_add( "main_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
_add( "bold_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")\

showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <- 
	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

# Tokenize
voynich_words <-
	voynich_tbl %>%
	unnest_tokens( word, text ) 

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
	voynich_words %>%
	count( word, sort=TRUE ) %>%
	mutate( word = reorder( word, n ) ) %>%
	mutate( freq = n / sum(n) )

voynich_word_counts <-
	voynich_words %>%
	count( word, folio, sort = TRUE ) 

# (Following the poweRlaw vignette)
# Create a discrete power law distribution object from the word counts
voynich_powerlaw <- 
	voynich_common %>%
	extract2( "n" ) %>%
	displ$new()

# Estimate the lower bound
voynich_powerlaw_xmin <- 
	estimate_xmin( voynich_powerlaw )

# Set the parameters of the voynich_powerlaw to the estimated values
voynich_powerlaw$setXmin( voynich_powerlaw_xmin )

# Estimate parameters of the power law distribution
voynich_powerlaw_est <-
	estimate_pars( voynich_powerlaw )

# Calculate p-value of power law. See Section 4.2 of "Power-Law Distributions in Empirical Data" by Clauset et al.
# If the p-value is _greater_ than 0.1 then we cannot rule out a power-law distribution.
voynich_powerlaw_bootstrap_p <-
	bootstrap_p(voynich_powerlaw, no_of_sims=1000, threads=7 )
# p=0.143 power law cannot be ruled out

# Parameter uncertainty via bootstrapping
voynich_powerlaw_bootstrap <- 
	bootstrap( voynich_powerlaw, no_of_sims=1000, threads=7 )

# Plot data and power law fit
voynich_powerlaw_plot_data <- 
	plot( voynich_powerlaw, draw = F ) %>%
	mutate( 
			 log_x = log( x ),
			 log_y = log( y ) )

voynich_powerlaw_fit_data <- 
	lines( voynich_powerlaw, col=2, draw = F ) %>%
	mutate( 
			 log_x = log( x ),
			 log_y = log( y ) 
	)

# Plot the fitted power law data.
gp <-
	ggplot( voynich_powerlaw_plot_data ) + 
	geom_point( aes( x = log( x ), y = log( y ) ), colour="#8a0707" ) + 
	geom_line( data= voynich_powerlaw_fit_data, 
				 aes( 
					  x = log( x ),
					  y = log( y ) ), colour="#0b6788") +
	labs(
		  x = "Log Rank", 
		  y = "Log Frequency" )

gp <-
	gp +
	theme( 
			panel.background = element_rect(fill = "transparent", colour = "transparent"),
			plot.background = element_rect(fill = "transparent", colour = "transparent"),
			plot.title = element_text( family="bold_font", colour="#3c3f4a", size=22 ),
			plot.subtitle = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
			axis.text = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
			axis.title.x = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
			axis.title.y = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
			axis.line = element_line( colour="#3c3f4a" ),
			panel.grid.major.x = element_blank(),
			panel.grid.major.y = element_blank(),
			panel.grid.minor.x = element_blank(),
			panel.grid.minor.y = element_blank()
			) 

# Set the base cowplot theme for the combined plot
theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )  

# Cowplot trick for ggtitle
title <- 
	ggdraw() + 
	draw_label(
				  "Fitted Power Law of Voynich Corpus", 
				  family="bold_", 
				  colour = "#3c3f4a", 
				  size=20, 
				  hjust=0, vjust=1, 
				  x=0.02, y=0.88 ) +
	draw_label(
				  "http://www.weirddatascience.net | @WeirdDataSci", 
				  family="main_", 
				  colour = "#3c3f4a", 
				  size=12, 
				  hjust=0, vjust=1, 
				  x=0.02, y=0.40 )

data_label <- 
	ggdraw() +
	draw_label(
				  "Data: http://www.voynich.nu", 
				  family="main_", 
				  colour = "#3c3f4a", 
				  size=14, hjust=1, 
				  x=0.98 ) 

# Combine plots
tgp <- 
	plot_grid( 
				 title, 
				 gp, 
				 data_label, 
				 ncol=1, 
				 rel_heights=c( 0.1, 1, 0.1 ) ) 

# Add parchment underlay
parchment_plot <- 
	ggdraw() +
	draw_image("img/parchment.jpg", scale=1.4 ) +
	draw_plot(tgp)

save_plot("output/voynich_power_law.pdf", 
							parchment_plot,
							base_width = 16,
							base_height = 9,
			           	base_aspect_ratio = 1.78 )
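As a brief aside, the bootstrap objects computed in the script above can be summarised directly; a sketch of such an inspection (in poweRlaw, the $bootstraps data frame holds one row per simulation):

# Inspect the bootstrap results from the script above.
voynich_powerlaw_bootstrap_p$p                    # bootstrapped p-value
sd( voynich_powerlaw_bootstrap$bootstraps$xmin )  # uncertainty in the lower bound
sd( voynich_powerlaw_bootstrap$bootstraps$pars )  # uncertainty in the scaling exponent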

Led further down twisting paths by this initial taste of success, we can now present the Voynich corpus against other human-language corpora to gain a faint impression of how it compares to known languages. The following plot compares the word frequencies of the Voynich Manuscript to those of the twenty most popular languages on Wikipedia, taken from the dataset available here.

Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora | (PDF Version)

Wikipedia Word Frequency Plot


voynich_zipf_wikipedia.r


library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( tidytext )
library( drlib )

library(cowplot)
library(magick)

_add( "voynich_", "/usr/shares/TTF/weird/voynich/eva1.ttf")
_add( "main_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
_add( "bold_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")

message( "Reading raw Voynich data..." )
voynich_tbl <- 
	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

# Tokenize
# (No stemming or stopword removal: these are apparently not so effective anyway,
# according to Schofield et al.: <www.cs.cornell.edu/~xanda/winlp2017.pdf>)
voynich_words <-
	voynich_tbl %>%
	unnest_tokens( word, text ) 

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
	voynich_words %>%
	count( word, sort=TRUE ) %>%
	mutate( word = reorder( word, n ) ) %>%
	mutate( freq = n / sum(n) )

# Plot a log-log plot of Voynich word frequencies.
voynich_word_counts <-
	voynich_words %>%
	count( word, folio, sort = TRUE ) 

# Load other languages.
# Select frequency counts.
# Convert to long format, then normalise per-language.
message( "Loading common language statistics..." )
wiki_language <- 
	read.csv( "data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv" ) %>%
	head( 10000 ) %>%
	as_tibble %>%
	select( matches( "_FREQ$" ) ) %>%
	gather( key = "language", value = "count" ) %>%
	mutate( language = str_replace( language, "_FREQ", "" ) ) %>%
	group_by( language ) %>%
	transmute( freq = count / sum( count ) ) %>%
	ungroup

wiki_language_words <- 
	read.csv( "data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv" ) %>%
	head( 10000 ) %>%
	as_tibble 

# Combine with Voynich, assigning it the unassigned ISO 639-1 language code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
	voynich_common %>%
	transmute( language = "vy", freq = freq )

# Combine, then add per-language rank information
message( "Processing common and Voynich language statistics..." )
all_languages <- 
	bind_rows( wiki_language, voynich_language ) %>%
	mutate( colour = ifelse( str_detect( `language`, "vy" ), "red", "grey" ) ) %>%
	group_by( language ) %>%
	transmute( log_rank=log( row_number() ), log_freq=log( freq ), colour ) %>%
	ungroup 

# Plot a log-log plot of all language word frequencies.
message( "Plotting common and Voynich language statistics..." )
voynich_wikipedia_plot <-
	all_languages %>%
	ggplot( aes( x=log_rank, y=log_freq, colour=colour) ) +
	geom_point( alpha=0.4, shape=20 ) + 
	scale_color_manual( values=c("#3c3f4a", "#8a0707" ) ) +
	theme (
			 axis.title.y = element_text( angle = 90, family="main_font", size=12 ),
			 axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.line.x = element_line( color = "#3c3f4a" ),
			 axis.line.y = element_line( color = "#3c3f4a" ),
			 plot.title = element_blank(),
			 plot.subtitle = element_blank(),
			 plot.background = element_rect( fill = "transparent" ),
			 panel.background = element_rect( fill = "transparent" ) # bg of the panel
			 ) +
	#scale_colour_viridis_d( option="cividis", begin=0.4 ) +
	guides( colour="none" ) +
	labs( y="Log Frequency",
			x="Log Rank" )

theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )  

# Cowplot trick for ggtitle
title <- ggdraw() + 
	draw_label(
				  "Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora", 
				  family="bold_", colour = "#3c3f4a", size=20, 
				  hjust=0, vjust=1, x=0.02, y=0.88 ) +
	draw_label("http://www.weirddatascience.net | @WeirdDataSci", 
				  family="bold_", colour = "#3c3f4a", size=12, 
				  hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <- 
	ggdraw() +
	draw_label("Data from: http://www.voynich.nu | http://wikipedia.org", 
				  family="bold_", colour = "#3c3f4a", size=12, 
				  hjust=1, x=0.98 )
 
tgp <- 
	plot_grid(title, voynich_wikipedia_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) 

voynich_wikipedia_plot <- 
	ggdraw() +
	draw_image("img/parchment.jpg", scale=1.4 ) +
	draw_plot(tgp)

save_plot("output/voynich_wikipedia_plot.pdf", 
							voynich_wikipedia_plot,
							base_width = 16,
							base_height = 9,
			           	base_aspect_ratio = 1.78 )

The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia. The most striking difference is the clustering of Voynich word frequencies in the lower half of the diagram, a result of the Voynich Manuscript's much smaller corpus: many lower-frequency words occur an identical number of times, producing the vertical steps towards the lower end of the frequency graph.
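A quick simulation suggests how this arises: drawing a small corpus from an idealised Zipfian vocabulary yields many words with identical low counts (all names here are our own):

# Sample a small "corpus" from an idealised Zipf distribution and
# count how many words share identical frequencies.
set.seed( 4 )
vocabulary <- 1000
zipf_probs <- ( 1 / seq_len( vocabulary ) ) / sum( 1 / seq_len( vocabulary ) )
small_corpus <- sample( seq_len( vocabulary ), size=5000, replace=TRUE, prob=zipf_probs )

# Many words occur exactly once or twice: these tied frequencies
# appear as vertical steps on a log-log rank-frequency plot.
table( table( small_corpus ) )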

To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the Universal Declaration of Human Rights.

Voynich Manuscript Rank Frequency Distribution against UNDHR Translations | (PDF Version)

UNDHR Word Frequency Plot

voynich_zipf_udhr.r


library( tidyverse )
library( magrittr )

library( ggthemes )
library( showtext )

library( tidytext )

library(cowplot)
library(magick)

_add( "voynich_", "/usr/shares/TTF/weird/voynich/eva1.ttf")
_add( "main_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")
_add( "bold_", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf")

showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <- 
	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
	rename( folio = X1, text = X2 )

# Tokenize
# (No stemming or stopword removal: these are apparently not so effective anyway,
# according to Schofield et al.: <www.cs.cornell.edu/~xanda/winlp2017.pdf>)
voynich_words <-
	voynich_tbl %>%
	unnest_tokens( word, text ) 

# Most common words
message( "Calculating Voynich language statistics..." )
voynich_common <-
	voynich_words %>%
	count( word, sort=TRUE ) %>%
	mutate( word = reorder( word, n ) ) %>%
	mutate( freq = n / sum(n) )

# Combine with Voynich, assigning it the unassigned ISO 639-1 language code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
	voynich_common %>%
	transmute( language = "vy", freq = freq )

voynich_word_counts <-
	voynich_words %>%
	count( word, folio, sort = TRUE ) 

# UDHR corpus comparison (smaller text)
udhr_corpus_files <- list.files("data/udhr/udhr_txt", pattern="*.txt", full.names=TRUE )

# Helper function to read in a text file and calculate a frequency table
table_frequency_mapper <- function( x ) {

	# Read file and extract language code from filename
	udhr_text <- read_lines( x, skip=6, skip_empty_rows=TRUE )
	language <- basename( x ) %>% str_replace( "udhr_", "" ) %>% str_replace( "\\.txt$", "" ) 

	# Tokenize and remove punctuation
	udhr_words <-
		udhr_text %>%
		str_flatten( collapse=" " ) %>%   # join lines with spaces so words do not merge across line breaks
		str_remove_all( "[.,]" ) %>%
		str_split( "\\s+" ) %>%
		extract2( 1 ) %>%
		{ tibble( word=. ) }

	# Most common words
	udhr_common <-
		udhr_words %>%
		count( word, sort=TRUE ) %>%
		mutate( word = reorder( word, n ), language )

}

voynich_corpus <-
	voynich_language %>%
	transmute( language, log_rank=log( row_number() ), log_freq=log( freq ), colour="Voynich Text" )

udhr_corpus <-
	udhr_corpus_files %>%
	map( table_frequency_mapper ) %>%
	bind_rows %>%
	group_by( language ) %>%
	transmute( log_rank=log( row_number() ), log_freq=log( n / sum(n) ), colour="Known UDHR Language" ) %>%
	ungroup

voynich_udhr_corpus <-
	bind_rows( udhr_corpus, voynich_corpus )

voynich_udhr_frequency_plot <-
	voynich_udhr_corpus %>%
	ggplot( aes( x=log_rank, y=log_freq, colour=colour) ) +
	geom_point( alpha=0.4, shape=19 ) + 
	scale_color_manual( values=c( "Known UDHR Language" = "#3c3f4a", "Voynich Text" = "#8a0707" ) ) +
	theme (
			 axis.title.y = element_text( angle = 90, family="main_font", size=12 ),
			 axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
			 axis.line.x = element_line( color = "#3c3f4a" ),
			 axis.line.y = element_line( color = "#3c3f4a" ),
			 plot.title = element_blank(),
			 plot.subtitle = element_blank(),
			 plot.background = element_rect( fill = "transparent" ),
			 panel.background = element_rect( fill = "transparent" ), # bg of the panel
			 panel.grid.major.x = element_blank(),
			 panel.grid.major.y = element_blank(),
			 panel.grid.minor.x = element_blank(),
			 panel.grid.minor.y = element_blank(),
			 legend.text = element_text( family="bold_font", colour="#3c3f4a", size=10 ),
			 legend.title = element_blank(),
			 legend.key.height = unit(1.2, "lines"),
			 legend.position=c(.85,.5)
			 ) +
	labs( y="Log Frequency",
			x="Log Rank" )

theme_set( theme_cowplot( font_size=4, font_family = "main_font" ) )  

# Cowplot trick for ggtitle
title <- 
	ggdraw() + 
	draw_label("Voynich Manuscript Rank Frequency Distribution against UNDHR Translations", family="bold_", colour = "#3c3f4a", size=20, hjust=0, vjust=1, x=0.02, y=0.88) +
	draw_label("http://www.weirddatascience.net | @WeirdDataSci", family="bold_", colour = "#3c3f4a", size=12, hjust=0, vjust=1, x=0.02, y=0.40)

data_label <- 
	ggdraw() +
	draw_label("Data from: http://www.voynich.nu | http://unicode.org/udhr/", family="bold_", colour = "#3c3f4a", size=12, hjust=1, x=0.98 )
 
tgp <- 
	plot_grid(title, voynich_udhr_frequency_plot, data_label, ncol=1, rel_heights=c(0.1, 1, 0.1)) 

voynich_udhr_plot <- 
	ggdraw() +
	draw_image("img/parchment.jpg", scale=1.4 ) +
	draw_plot(tgp)

save_plot("output/voynich_udhr_plot.pdf", 
							voynich_udhr_plot,
							base_width = 16,
							base_height = 9,
			           	base_aspect_ratio = 1.78 )

A Refined Randomness

The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.

Sadly, however, as with all too many claims of power law distributions arising in nature, there is a complication that again highlights the difference between proof and the failure to disprove. Certainly, had a power law proved incompatible with the Voynich Manuscript, we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.

Rugg has argued that certain random mechanisms can produce text that adheres to Zipf’s Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by Li (1992): even randomly "typed" characters, with a space key acting as word separator, produce a Zipf-like rank-frequency distribution.
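A toy version of Li's random-typing model, using the libraries already loaded above (all other names here are our own), makes the point: uniform random keystrokes over a small alphabet plus a space bar already trace a near-linear curve on a log-log plot.

# "Random typing" after Li (1992): uniform random keystrokes,
# split into "words" wherever the space bar was struck.
library( tidyverse )
library( magrittr )

set.seed( 23 )
keystrokes <- sample( c( letters[1:5], " " ), size=1e5, replace=TRUE )

random_words <-
	paste( keystrokes, collapse="" ) %>%
	str_split( "\\s+" ) %>%
	extract2( 1 ) %>%
	{ .[ . != "" ] }

tibble( word = random_words ) %>%
	count( word, sort=TRUE ) %>%
	mutate( log_rank = log( row_number() ),
			  log_freq = log( n / sum( n ) ) ) %>%
	ggplot( aes( x=log_rank, y=log_freq ) ) +
	geom_point() +
	labs( x="Log Rank", y="Log Frequency" )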

Twisting Paths

These analyses can only present a dim outline of the text itself, and we resist the awful temptation to attempt any form of decipherment. The evidence here seems convincing enough that the Voynich Manuscript does represent a human language, but the statistics presented are of little use in such an effort. The most frequent words in the manuscript may, under certain assumptions, correspond to the most common words or particles of many languages: the definite article, the indefinite article, conjunctions, pronouns, and the like. Without deeper knowledge of the language, however, and given the range of scribal conventions and shortcuts commonplace in texts of the period, these techniques can do no more than tantalise us with what we may never know.

Credible Conclusions

Subjecting the text of the Voynich Manuscript to the crude frequency analyses presented here supports, although it cannot prove, the view that the manuscript, regardless of its true content, is not simply random gibberish. Nor is the text likely to be the result of a simple mechanical process designed without knowledge of the statistical patterns of human languages. Neither is it likely to be a cryptogram much more sophisticated than the simplest ciphers, as more complex schemes would tend to destroy the statistical properties that we have observed.

The demonstrated adherence to Zipf’s Law, and to a Gamma distribution of similar shape to those of known languages, strongly suggests that the text represents some natural language.

In the next post we will attempt blindly to wrench more secrets from the text itself through application of modern textual analysis techniques. Until then the Voynich Manuscript remains, silently obscure, beyond the reach of our faltering science.
