[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].

One of the most fiercely fought debates in quantitative finance is whether financial markets, and the stock market in particular, are efficient, i.e. whether you can find patterns in them that can be used profitably.

If you want to learn about an ingenious method (already present on everyone's computer) to approach that question, read on!

The general idea of market efficiency is that markets are, conceptually, information-processing systems that incorporate all available information to arrive at the most accurate price, i.e. the best estimate of the current value of a company. The only possible cause of a price change is that new information becomes available. Because, as the name says, the information is new, it cannot be anticipated, and therefore it is impossible to beat the market: price changes are unpredictable!

The other side of the debate argues that there are certain patterns hidden in the ups and downs of the charts and you only have to understand the underlying logic to make use of them. One prominent candidate is so-called technical analysis, which tries to discern all kinds of structures within the data, e.g. head-and-shoulders or double top/bottom reversal patterns, lines of support or resistance, channels… and much more: price changes are predictable!

So, is it randomness vs. pattern recognition, or noise vs. signal: who is right?

Enter (algorithmic) information theory!

One of the basic ideas of information theory is that random sequences are incompressible. Let us illustrate this with a very simple example. A simple compression method is called run length encoding (RLE). It just computes the lengths and values of runs of equal values in a sequence:

# simple pattern
rle(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
## Run Length Encoding
##   lengths: int 10
##   values : num 1

# "random"
rle(c(0, 1, 1, 0, 1, 0, 1, 0, 0, 1))
## Run Length Encoding
##   lengths: int [1:8] 1 2 1 1 1 1 2 1
##   values : num [1:8] 0 1 0 1 0 1 0 1


As you can see, the encoding of the simple pattern is much shorter than the encoding of the “random” series. Put another way, there are far fewer “surprises” in the first series than in the second. Or, in the lingo of information theory, the algorithmic complexity (AC) of the second series is much higher.
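This intuition can be wrapped in a tiny helper that uses the number of runs reported by `rle()` as a crude complexity proxy (a simple illustration of the idea, not a formal complexity measure — the function name `n_runs` is my own):

```r
# Number of runs as a crude complexity proxy: structured sequences
# collapse into few runs, random-looking ones into many.
n_runs <- function(x) length(rle(x)$lengths)

n_runs(rep(1, 10))                       # 1 run: maximally simple
n_runs(c(0, 1, 1, 0, 1, 0, 1, 0, 0, 1))  # 8 runs: close to random
```

The more runs per element, the less compressible the sequence is under RLE.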

Now, this was a very simple pattern recognition engine; we all have something far more sophisticated on our computers.

Enter the ZIP compression tool!

Most of us use this little piece of software to compress files, e.g. before sending them over the internet. Few people realize that it is a very advanced piece of technology and a master at spotting patterns in the files it is supposed to compress. Well, we are not the first to recognize this: there are many renowned papers out there that examine all kinds of complex dynamical systems with this little tool!

Concerning the exact inner workings of the ZIP tool, we won’t go into the details, but you can think of the general idea as a further development of RLE, where not only simple runs of equal values are considered but all kinds of more complicated combinations/blocks of values. This makes it well suited to spotting patterns in all kinds of data.
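Before turning to market data, a quick sanity check with base R's `memCompress()` (which uses the same DEFLATE algorithm that gzip/ZIP are built on) shows this pattern-spotting at work: structured bytes shrink dramatically, random bytes hardly at all. The example data here are synthetic, chosen just for illustration:

```r
set.seed(1)
structured   <- as.raw(rep(c(0, 255), each = 500))          # 1000 highly regular bytes
random_bytes <- as.raw(sample(0:255, 1000, replace = TRUE)) # 1000 random bytes

length(memCompress(structured, type = "gzip"))    # just a handful of bytes
length(memCompress(random_bytes, type = "gzip"))  # roughly the original size
```

The structured sequence collapses to a tiny fraction of its size, while the random bytes stay essentially incompressible.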

In the following analysis, we load the Standard & Poor’s 500 price series from the beginning of 1990 until today, transform it into returns, scale (detrend) it, and “binarize” it into uptick and downtick data:

library(quantmod)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(BMS) # for function bin2hex()
getSymbols("^GSPC", from = "1990-01-01")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^GSPC"

returns <- GSPC |> Cl() |> ROC() |> coredata() |> na.omit()
if (length(returns) %% 8 != 0) returns <- returns[(length(returns) %% 8 + 1):length(returns)] # trim for hex conversion
returns_zscores <- returns |> scale()
returns_zscores_bin <- ifelse(returns_zscores > 0, 1, 0)
returns_zscores_bin |> as.vector() |> head(100)
##   [1] 0 0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0
##  [38] 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1 0 0
##  [75] 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0


After that we transform the binary data further into chunks of hexadecimal data because this is the format we need later on:

returns_zscores_hex <- returns_zscores_bin |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}()
##   [1] "12" "5a" "0b" "57" "07" "d5" "78" "f1" "1b" "03" "7f" "f3" "e7" "c3" "0e"
##  [16] "3e" "de" "4c" "c3" "b9" "1a" "e9" "30" "9d" "46" "74" "5c" "e9" "59" "e5"
##  [31] "1a" "b0" "37" "9c" "dd" "ac" "d5" "04" "2e" "2a" "7c" "63" "cd" "36" "f8"
##  [46] "16" "33" "4c" "e4" "2e" "23" "8f" "20" "1a" "e8" "c2" "fa" "0e" "1b" "92"
##  [61] "48" "8c" "bf" "e4" "69" "45" "8a" "35" "11" "9d" "89" "b3" "e1" "3a" "34"
##  [76] "cb" "54" "71" "4d" "5f" "42" "7c" "23" "86" "b1" "aa" "6e" "0b" "7c" "f4"
##  [91] "ac" "77" "8e" "06" "24" "4b" "8e" "56" "30" "dd"
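As a miniature illustration of this conversion (assuming `BMS::bin2hex()` maps the bit vector to hex digits in blocks of four, as used above): 16 bits become 4 hex digits, which we then split into two-character “bytes” with the same `substring()` trick:

```r
library(BMS)  # for bin2hex()

bits <- c(1, 0, 1, 0,  1, 1, 1, 1,  0, 0, 0, 0,  0, 1, 0, 1)
hx <- bin2hex(bits)              # 16 bits -> 4 hex digits
substring(hx, c(1, 3), c(2, 4))  # -> two 2-character chunks
```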


For comparison, we again create a long sequence of ones, as in our first example (here directly in hexadecimal form):

ones_hex <- rep(1, length(returns)) |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}()
##   [1] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [16] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [31] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [46] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [61] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [76] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"
##  [91] "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff" "ff"


As the last sequence we create a pseudo-random sequence of zeros and ones:

set.seed(123)
pseudorandom_hex <- sample(c(0, 1), length(returns), replace = TRUE) |> bin2hex() |> {\(x) substring(x, seq(1, nchar(x), 2), seq(2, nchar(x), 2))}()
##   [1] "17" "3a" "84" "35" "61" "61" "21" "a6" "42" "1b" "dc" "b6" "46" "c7" "9d"
##  [16] "46" "2f" "ee" "ed" "40" "74" "bc" "78" "66" "65" "ee" "e7" "a8" "c9" "be"
##  [31] "51" "d8" "2e" "a0" "50" "c7" "41" "cd" "22" "bc" "49" "70" "86" "88" "91"
##  [46] "8a" "8c" "97" "fa" "03" "02" "be" "80" "31" "aa" "37" "ee" "da" "68" "c2"
##  [61] "f6" "d7" "8f" "0a" "bb" "d4" "39" "e3" "9a" "ef" "6f" "a9" "64" "99" "53"
##  [76] "e1" "4b" "bf" "81" "47" "d9" "43" "d2" "0e" "3d" "16" "a7" "f8" "8d" "5c"
##  [91] "f9" "41" "a2" "17" "1a" "bf" "c9" "30" "3d" "47"


And now for the grand finale, we compress all three sequences with gzip and calculate the compression rate, i.e. the compressed size as a percentage of the original size in bytes:

n <- length(returns) / 8 # original size in bytes (8 bits per byte)
round(length(memCompress(as.raw(as.hexmode(ones_hex)))) / n * 100, 2)
## [1] 1.7

round(length(memCompress(as.raw(as.hexmode(pseudorandom_hex)))) / n * 100, 2)
## [1] 101.1

round(length(memCompress(as.raw(as.hexmode(returns_zscores_hex)))) / n * 100, 2)
## [1] 101.1


Now, that is interesting: while the sequence of ones is compressed to less than 2% of its original size, two things jump out at us concerning the other two sequences:

• The compression rates of the (pseudo-)random sequence and the market sequence are the same!
• Both are over 100%!

The first point means that the up- and downtick market data are indistinguishable from randomness; the second point is due to the fact that the zipped data contains some additional metadata (headers and checksums). Because no compression was possible (= randomness), this boils down to an inflation of the original size!
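The metadata overhead is easy to demonstrate with `memCompress()` itself: even an empty input produces a nonzero compressed size, and incompressible random bytes come out slightly larger than the original (a small illustrative check, not part of the market analysis):

```r
length(memCompress(raw(0), type = "gzip"))  # pure header/trailer overhead, > 0

set.seed(42)
incompressible <- as.raw(sample(0:255, 1000, replace = TRUE))
# incompressible input is inflated, not compressed:
length(memCompress(incompressible, type = "gzip")) > 1000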

Does that mean that markets are 100% efficient (= random)? It is at least another indication.

There remain some loopholes though:

• We were only looking at up and downtick data. It is a well-known (stylized) fact that certain market regimes exist. Taking volatility data into account could change the picture.
• We examine the whole time series at once. It could very well be that there are some pockets of predictability when we slice it up into smaller subsequences (e.g. yearly windows).
• We only look at one example, it could be that other indices, e.g. from developing countries, or single stocks are less efficient.
• Basically, we are only looking at technical analysis here. Taking other information into account (from other markets, company reports, the economy, etc.), i.e. fundamental analysis, could also bring back some predictability.
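The second loophole can be explored with a small helper that reuses the pipeline from above (a hypothetical sketch: the function name `window_rate` and the 252-trading-day window size are my own choices, not part of the analysis above):

```r
library(BMS)  # for bin2hex()

# Compression rate (in %) of a 0/1 vector, via the same hex/byte pipeline
window_rate <- function(bits) {
  bits <- bits[seq_len(length(bits) %/% 8 * 8)]  # trim to whole bytes
  hx <- bin2hex(bits)
  chunks <- substring(hx, seq(1, nchar(hx), 2), seq(2, nchar(hx), 2))
  bytes <- as.raw(as.hexmode(chunks))
  round(length(memCompress(bytes, type = "gzip")) / length(bytes) * 100, 2)
}

# Applied to the binarized returns from above, in ~yearly windows of
# 252 trading days:
# sapply(split(as.vector(returns_zscores_bin),
#              ceiling(seq_along(returns_zscores_bin) / 252)), window_rate)
```

Windows with a rate well below 100% would hint at local pockets of structure.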

Still, I think this is an interesting analysis with quite an unexpected result. Who would have thought that such an innocuous-looking tool has such analytical power!

Please share your thoughts on market efficiency and this analysis, or even the results of your own analyses, with us in the comments.
