Illuminating the Illuminated: A First Look at the Voynich Manuscript

[This article was first published on Weird Data Science, and kindly contributed to R-bloggers].

The Voynich Manuscript

While the world abounds with strange phenomena ripe for analysis in their raw state, there is a peculiar pleasure in scrutinising arcane information curated and obscured by the human mind.

The Voynich Manuscript is one of the most well-known and studied volumes of occult knowledge. The book’s most recent history involves its purchase in 1912 by Wilfrid Voynich, a rare book dealer, from a sale of manuscripts by the Society of Jesus at the Villa Mondragone, Frascati. Following several fruitless years of attempts to decipher the manuscript and discover its origin, or to interest others in it, Wilfrid Voynich died. The book passed through a number of other hands before being donated to Yale University by the noted rare book dealer Hans P. Kraus in 1969. It now resides in Yale’s Beinecke Rare Book and Manuscript Library with the designation MS 408.

Written almost entirely in an unknown script, barring a small number of words apparently in Latin and High German, the manuscript is compellingly illustrated with depictions of plants, herbs, human figures, astronomical and astrological symbols. The manuscript has resisted all attempts at interpretation by cryptographers, historians, and linguists.

From a linguistic and cryptographic perspective, this lack of success in interpretation is not surprising. The two hundred or so folios of the manuscript, while beautifully illuminated, present a sadly limited corpus of text for the purposes of traditional analysis.

In this short series of posts we will subject the Voynich Manuscript to a range of text analysis techniques, delving into its structure, gaining horrific insight into its composition, and skeptically assessing its credibility. The manuscript has been subjected to almost fifty years of furtive attempts by cryptographers, including the US National Security Agency and a menagerie of others from the distinguished to the deranged. We will crudely mimic some earlier results, and hopefully add our own confusion to the roiling mass of current research into the Voynich Manuscript.


Since its discovery, and throughout the ongoing unsuccessful attempts to decipher its contents, many have questioned the authenticity of the Voynich Manuscript. The theory that the entire book is a hoax, either by contemporary scribes or by more modern players, has been raised repeatedly over the years.

Radiocarbon dating in 2010 asserted that the manuscript’s parchment likely dates from the early 15th century; the volume of parchment in the manuscript, and its consistency across the document, make it unlikely, although not impossible, that the book is a modern-day hoax.

Other supporting evidence has drawn from early mentions of the manuscript in correspondence. According to, which presents a far more detailed and thorough description of the research around the manuscript and its history than we could hope to offer here, the first extant mention of the manuscript can be found in a 1639 letter from Athanasius Kircher in Rome, replying to a letter forwarded from Georgius Barschius of Prague by the mathematician Theodor Moretus.

The letter refers to a “book of mysterious steganography” (“libellum… …steganographici mysterisi”) illustrated with pictures of plants, stars and chemical secrets that Kircher had not yet had time to decipher. Barschius had sought out Kircher’s expertise because of his fame at the time for claiming, erroneously as it later transpired, to have deciphered the hieroglyphic writing system of the Ancient Egyptian language. Later correspondence between Barschius and Kircher appears, according to Zandbergen3.

  • The manuscript is written from left to right, and not right to left, vertically, or in boustrophedon. This is uncontroversial and apparent from even a cursory inspection of the text itself; the horizontal flow of the writing is clear, with lines starting at the left margin and ending before the right. The text is separated into paragraphs, of which the final line is justified to the left.
    Data

    Due to the diligent activity of several generations of Voynich researchers, the text of the manuscript has been transcribed into a machine-readable format. As the alphabet is unknown, there are minor uncertainties in rendering the text, leading to a number of similar but competing transcriptions. The subtle details of the various transcription efforts, and their history, are available at:, with the raw data available at. We have settled on the v101 transliteration by Glen Claston, rendered in the Intermediate Voynich Transliteration File Format (IVTFF) of Zandbergen. This is one of the more recent and widely used transcriptions, and has the added advantage of being supported by the availability of a TrueType font. The underlying file is available here:

    Crude Manipulations

    We perform the following steps to make the data usable for our analyses. For many scenarios, we would develop a generalisable set of steps to allow conversion of many documents to an appropriate form. Until and unless, however, a new cache of documents in the same language is found, it is simpler and easier to perform these one-time steps manually.

    Firstly, we delete from the text all incomplete words, as marked in the IVTFF format. This includes:

    • all text in angle brackets
    • all words containing ?’s
    • all words containing []
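As a rough illustration of this first step, the following base R sketch applies the three rules to a single line. The sample line is made up for the purpose and is not taken from the manuscript, and the variable names are our own; it is a sketch of the idea rather than the procedure actually used, which was carried out in Vim.

```r
# Hypothetical sample line, loosely in the style of an IVTFF transcription.
# It is invented for illustration and is NOT taken from the manuscript.
sample_line <- "fa19s.9hae.<!note>ar.o?am.[o:a]r.9k.2oe"

# Remove all text in angle brackets; [^>]* keeps each match to a single tag.
no_tags <- gsub( "<[^>]*>", "", sample_line )

# Split into candidate words on "." (treated here as the word separator).
words <- strsplit( no_tags, ".", fixed = TRUE )[[1]]

# Drop words containing "?" or "[", which mark uncertain or incomplete readings.
is_incomplete <-
	grepl( "?", words, fixed = TRUE ) | grepl( "[", words, fixed = TRUE )
complete_words <- words[ !is_incomplete ]

print( complete_words )
```

Whole words are dropped here, not just the uncertain characters, so that no partial word forms enter later counts.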

    Secondly, we tokenize the text and remove punctuation. The transcription of the Voynich manuscript that we have chosen uses the following punctuation:

    • “.” is a space
    • “,” is a potential space. For simplicity, we do not treat these as a space.
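The punctuation rules above can be sketched in base R as follows. The sample text is again invented, and the variable names are illustrative assumptions, not part of the original workflow.

```r
# Made-up sample text; not taken from the manuscript.
sample_text <- "fa19s.9,hae.ar.a,m"

# Treat "," (a potential space) as no space at all by deleting it.
no_commas <- gsub( ",", "", sample_text, fixed = TRUE )

# Treat "." as a word separator by replacing it with a real space.
spaced <- gsub( ".", " ", no_commas, fixed = TRUE )

# Tokenise on single spaces.
tokens <- strsplit( spaced, " ", fixed = TRUE )[[1]]

print( tokens )
```

Had we instead chosen to treat potential spaces as real spaces, the first substitution would replace "," with " ", producing more and shorter tokens from the same text.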

    Finally, we organize the document in an appropriate form to be imported into an R data frame, or tidyverse tibble.

    The above steps were performed in the Vim text editor, and the commands used are reproduced in the code below:

    Show Vim text manipulation commands.
    # Delete all commented lines
    # Remove blank lines
    # Remove "," -- assume that potential spaces are /not/ spaces. 
    # Replace each folio's page marker (the initial <f1r>-style tag for each page) 
    # with its contents, followed by a comma. (<f1r> -> f1r,)
    # Remove all <> entries (non-greedy)
    # Join all paragraphs (all newlines followed by a character 
    # other than a newline are removed).
    # Replace "high ascii" rare characters from the IVTFF with their 
    # ASCII representation. ()
    # Replace full stops with spaces
    :%s/\./ /g

    The resulting raw data file is available here. This file can be read into R simply by use of the read_csv function from the readr package (part of the tidyverse):

    library( tidyverse )
    voynich_tbl <- 
    	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
    	rename( folio = X1, text = X2 )

    As a first, horrifying glance into the forms of analysis that this allows, we can now use our raw data to identify the most repeated words in the manuscript, according to our transcription. The following R code extracts the entirety of the text and encodes it as a run length encoding. This conveniently results in a sequential list of words and the number of times that each is repeated in sequence. We can then simply extract the largest number of repetitions for each word in the corpus:
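For readers unfamiliar with run length encoding, a toy example in base R may help. The word sequence below consists of arbitrary placeholders, not manuscript text, and the variable names are our own.

```r
# Arbitrary placeholder words; not taken from the manuscript.
words <- c( "ab", "ab", "cd", "ab", "cd", "cd", "cd" )

# rle() collapses each run of consecutive identical values into a single
# entry, recording the value and the length of the run.
runs <- rle( words )
print( runs$values )
print( runs$lengths )

# The longest run observed for each distinct word.
longest <- tapply( runs$lengths, runs$values, max )
print( longest )
```

Note that a word appearing in several separate runs, such as "ab" here, contributes one entry per run, which is why we take the maximum per word afterwards.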

    Count longest word repetition sequences in the Voynich Manuscript.
    library( tidyverse )
    library( magrittr )
    # Count the number of repeated words in the Voynich Manuscript text.
    # Load the raw data
    voynich_tbl <- 
    	read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
    	rename( folio = X1, text = X2 )
    # Extract the text as a vector of words
    voynich_vector <- 
    	voynich_tbl %>%
    	extract2( "text" ) %>%
    	paste( sep=" ", collapse=" " ) %>%
    	str_split( " " ) %>%
    	unlist
    # Create a run length encoding object from the vector
    voynich_rle <- 
    	voynich_vector %>%
    	rle
    # Convert rle object to a data frame and report the maximum number of repeated
    # cases for each word
    voynich_repetitions <- 
    	voynich_rle %>%
    	unclass %>%
    	as_tibble %>%
    	group_by( values ) %>%
    	summarise( max_repetitions = max( lengths ) ) %>%
    	ungroup %>%
    	arrange( desc( max_repetitions ) )

    This simple analysis shows that, in the transcription we have chosen, the longest sequences of repeated words are only three words in length, occurring a total of five times in the text. While there are many other arguments against the potential validity of the Voynich Manuscript, word repetition does not in itself present a compelling reason to doubt that the text is a human language.
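Counting how often the longest run occurs needs only a small extension of the run length encoding. The sketch below uses arbitrary placeholder words in place of the manuscript text, and its variable names are illustrative; applied to the real word vector, the same approach would give the kind of counts quoted above.

```r
# Arbitrary placeholder words; not taken from the manuscript.
words <- c( "ab", "ab", "ab", "cd", "ab", "cd", "cd", "cd", "ef" )

runs <- rle( words )

# Length of the longest run of any repeated word.
max_run <- max( runs$lengths )

# Number of separate runs that reach that length, and the words involved.
n_max_runs <- sum( runs$lengths == max_run )
max_words <- runs$values[ runs$lengths == max_run ]

print( max_run )
print( n_max_runs )
print( max_words )
```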

    We have now reduced the strange and beautiful elegance of the Voynich Manuscript’s centuries-old illuminations to a crude, utilitarian abstraction. With this particular act of artistic and literary desecration complete, in the next post we will examine Zipf’s Law in more detail, and interrogate the extent to which this law supports or undermines the text’s authenticity.

