Fake text generation the wrong way, and a contest

January 23, 2013
(This article was first published on Probability and statistics blog » r, and kindly contributed to R-bloggers)

As part of a bigger project, I needed to simulate a text string based on a source document, but at the character level. Just in case people find the code useful, I’ve uploaded it to MCMCtext.r.

In my simulated text, each character is chosen based on the transition probabilities from one character to the next in the source text. The result is (nearly complete) gibberish of little interest to anyone, except perhaps those looking for a replacement for the standard Lorem Ipsum dummy text. More interesting fake text could be generated by using transition probabilities over two (or more) characters, or by working at the level of words.
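For readers who want to see the idea in miniature, here is a rough sketch of first-order, character-level simulation in R. It is not the code from MCMCtext.r; the function names `build_transitions` and `simulate_text`, and the uniform fallback for a character with no observed successor, are my own assumptions for illustration.

```r
# Minimal sketch of first-order, character-level text simulation.
# All names here are illustrative and are NOT taken from MCMCtext.r.

# Count how often each character (row) is followed by each character (column).
build_transitions <- function(text) {
  chars  <- strsplit(text, "")[[1]]
  states <- sort(unique(chars))
  from   <- factor(chars[-length(chars)], levels = states)
  to     <- factor(chars[-1], levels = states)
  table(from, to)
}

# Generate n characters, drawing each one from the row of counts
# that belongs to the previous character.
simulate_text <- function(counts, n, start = NULL) {
  states <- rownames(counts)
  if (is.null(start)) start <- sample(states, 1)
  out <- character(n)
  out[1] <- start
  for (i in seq_len(n)[-1]) {
    w <- counts[out[i - 1], ]
    # Assumption: a character seen only at the very end of the source has
    # no successors, so fall back to a uniform draw over all characters.
    if (sum(w) == 0) w <- rep(1, length(states))
    out[i] <- sample(states, 1, prob = w)
  }
  paste(out, collapse = "")
}

# Usage with a toy source string (any character vector of length one works).
src    <- "the quick brown fox jumps over the lazy dog and the other dog"
counts <- build_transitions(src)
fake   <- simulate_text(counts, 80)
cat(fake, "\n")
```

`sample()` rescales the `prob` weights internally, so the raw counts can be passed without normalizing each row. Extending the sketch to two-character contexts would mean keying the rows on pairs of characters instead of single ones.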

Before moving on, I thought it might be interesting to see if anyone can “reverse engineer” my fake text output to figure out which original text was used as a source to generate it. Got that? The source text comes from Project Gutenberg. Hint: some features of the (fake) text could help you narrow the field of candidates.

First person to post a correct guess in the comments gets a copy of my comic and an unlimited supply of Hotpockets*. Limit one guess per person please.

* Hotpockets offer only valid if you are currently saving the planet from destruction.
