**joy of data » R**, and kindly contributed to R-bloggers)

Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure information contained in data – this time the question popped up in my head of whether or not and if then how permutation of a text’s words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …

library(stringr) library(ggplot2) compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) } calc_ratios <- function(title, N, sep=" ") { f <- paste("/media/Volume/temp/struc/",title,".txt",sep="") t <- scan(f, what=character()) y <- unlist(sapply(t, function(l) str_split(l,"[^a-zA-Z]"))) names(y) <- NULL y <- y[nchar(y)>1] ratios <- rep(NA, N) for(i in 1:N) { ratios[i] <- compress(paste(y[sample(length(y))],collapse=sep)) } results <- list("perms" = ratios, "original" = compress(paste(y,collapse=sep))) return(results) } res <- calc_ratios("portraitartistyoungman", N=10000) ggplot(data = data.frame(x = res[["perms"]])) + geom_histogram(aes(x = x),binwidth=.00001,stat="bin") + geom_vline(xintercept=res[["original"]], color="red") + labs(title="compression ratio of not permuted text in red", x="", y="") + theme(plot.title = element_text(vjust=1, size=12)) hist(res[["perms"]], breaks=100, main = "A Portrait of the Artist as a Young Man", xlab="distribution of compression ratios for word permutations with median in red", ylab="") abline(v=median(res[["perms"]]), col = "red")

So obviously the compression ratio is indeed affected by word order and even comes along with a very nice looking distribution. But the best part is that the compression ratio of the original word order is effectively statistically mega-super-über-hyper-significant! The same is true for all the book’s texts I checked out and, I guess, it is true for every natural language text in the world. Intuitively I thought that maybe there is a distribution which is typical for text but from what I see, they are pretty different from content to content. It even seems to me that there is a resemblance of shape per author (“authorship attribution”). Though, Ulysses – the literary troublemaker of last century – is of course not playing along nicely. Twist features a plateau, the other two Joyces a double hill and Dickens a bell-shape.

# Compression Ratio as a Measure for Structure

The difference between a normal text and a permutation of it is of course that we kill its structure by doing so. This makes me wonder if it would be possible to measure a texts degree of structure by relating its original compression ratio to its distribution of compression ratios. F.x.:

A *low* DOS would indicate a *high* degree of structure because of the relatively high difference to the lower boundary of the distribution of compression ratios for randomly permuted word order. Looking at the DOS column of the table below this interpretation seems reasonable when we compare the degree f.x. for Dickens vs. Joyce.

wc = word count original = compression ratio of original text (word order) dec_nth / median = nth decile / median of c.r. dist. for permuted word orderings title wc original dec_1st median dec_9th DOS Dubliners 65609 0.28288 0.31397 0.31446 0.31511 0.901 Ulysses 253927 0.32091 0.34607 0.34632 0.34663 0.927 Great Expectations 176096 0.26178 0.29518 0.29534 0.29556 0.887 David Copperfield 337291 0.25696 0.29184 0.292 0.29217 0.88 Huckleberry Finn 105441 0.26252 0.2999 0.30017 0.30056 0.875 Life on Mississippi 138867 0.27769 0.30453 0.30474 0.30534 0.912

# Technicalities

All texts are taken from Project Gutenberg – relieved of sections not belonging to the book itself – reduced to letters a to z and A to Z by replacing anything else with a space – the result is split at white spaces and all “words” of length 1 are discarded. This vector of words or string atoms is then permuted using sample().

The compression applied is of type bzip2 and performed using memCompress() with calculation of compression ratio being implemented as follows:

compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) }

**leave a comment**for the author, please follow the link and comment on his blog:

**joy of data » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...