Discovering power laws and removing “shit”

[This article was first published on sieste » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Imagine you perform a statistical analysis on a time series of stock market data. After some transformation, averaging, and “renormalization” you find that the resulting quantity, let’s call it x, behaves as a function of time like x(t)\propto t^{-\beta}. Since you are a physicist you get excited because you have just discovered a power law. Physicists love power laws.

Now you analyze some more financial time series using the same technique and find similar behavior. Power laws all over the place. You get even more excited. In the paper about the analysis (which you submit to a physics journal) you may throw buzzwords like “scale-free”, “critical phase transition”, and “universality”. You can also add to your CV that you contributed to the understanding of market dynamics.

An analysis of this sort was published in PNAS 108(19) [1]. The authors look at “trend switching in financial markets”. They analyze the behavior of financial time series between “turning points” and find power laws everywhere.

But then somebody else [2] has applied the same analysis to artificial data, simple Brownian motion, and discovered power laws as well. What does that mean? Is this effect not a bit too “universal” if it also occurs in integrated noise? It so turned out the observed power laws were an artefact of the statistical analysis. They have nothing to do with critical phase transitions and scale-freeness of financial market dynamics.

I reproduced the above mentioned analysis in R using the following script (feel free to use it to discover your own power laws):

#produce time series
N <- 1e6
x <- cumsum(rnorm(N+1))
dt <- 5
loc.max <- rep(0,N)

#find local maxima of order dt
for (i in seq(dt+1,N-dt)) {
    w <- x[(i-dt):(i+dt)]
    if ( order(w,decreasing=T)[1]==dt+1 ) loc.max[i] <- 1
}

#if w is the time difference between two local
#maxima l1 and l2, save all possible snippets
#l1...(w)...l2...(w)... of length 2*w+1 from x
loc.max.pos <- which(loc.max==1)
loc.max.N <- length(loc.max.pos)
profiles <- list()
for (j in seq(loc.max.N-5)) { #skip last 5 to keep indices below N
    w <- x[loc.max.pos[j]:(2*loc.max.pos[j+1]-loc.max.pos[j])]
    w <- w - min(w) #"normalize"
    profiles[[j]] <- w
}

#transform all profiles so they have the same length len.max
#and average over them
len.max <- 1000
profile.avg <- rep(0,len.max)
for (p in profiles) {
    p.len <- length(p)
    rep.vec <- rep( floor(len.max/p.len), p.len )
    rep.rest <- len.max - sum(rep.vec)
    rep.indplus1 <- sample(seq(p.len),rep.rest)
    rep.vec[rep.indplus1] <- rep.vec[rep.indplus1] + 1
    # now rep(p,rep.vec) returns a vector that "looks like" p
    # but has length len.max
    profile.avg <- profile.avg + rep(p,rep.vec)
}
profile.avg <- profile.avg/length(profiles)

#plot averaged profile
jpeg("avgProfile.jpg",width=400,height=400,quality=100)
t <- seq(0,2,length.out=len.max)
plot(t,profile.avg,type="l",ylab="x")
dev.off()

#plot in loglog-axes
jpeg("powerlaw.jpg",width=400,height=400,quality=100)
plot(NULL,xlim=c(-2.2,-0.5),ylim=c(.5,0.8),xlab=expression(log[10]*"|"*t-1*"|"),ylab=expression(log[10]*"x"))
ind.p <- seq(0.5*(len.max+1),len.max)
points(log10(abs(t[ind.p]-1)),log10(profile.avg[ind.p]),col="orange",lwd=3,pch=15)
ind.m <- seq(0.5*(len.max+1),1)
lines(log10(abs(t[ind.m]-1)),log10(profile.avg[ind.m]),col="blue",lwd=2)
#fit straight line over range of interest
fit.ind.p <- ind.p[12:150]
beta <- lm( log10(profile.avg[fit.ind.p])~log10(abs(t[fit.ind.p]-1)) )$coefficients
#... and plot it
lines(log10(abs(t[fit.ind.p]-1)), beta[1]+.02+beta[2]*log10(abs(t[fit.ind.p]-1)),lwd=2,lty=2)
text(x=-1,y=.75,labels=paste("beta=",round(beta[2],digits=4)))
legend(x=-2.1,y=.65,c("right slope","left slope"),lty=c(1,1),lwd=c(2,2),pch=c(15,-1),col=c("orange","blue"))
dev.off()

In the script, I analyzed a time series of Brownian motion of length 1 million. A point x_i in the time series is labelled as a local maximum if it is the largest value in the window (i-5,\cdots,i+5). Then the original time series is cut into snippets: Each snippet starts at a local maximum and has length 2w,where w is the distance to the next local maximum in the time series. Of course, the different snippets have varying lengths. In order to make them comparable, they are artificially stretched into series of equal lengths and shifted such that their minimum is equal to zero. The average over all snippets is the quantity of interest, plotted here:

The original idea was that this profile can be interpreted as characterizing the average behavior of the financial time series between and after turning points (local maxima). The slopes to the left and to the right of the central peak follow power laws with exponent \beta\approx-0.17, as shown by fitting a straight line in a log-log plot:

It bothers me a bit that unlike reported in [1] and [2], in my analysis the coefficients corresponding to the two sides of the peak come out the same. But the authors were not very specific in how the “stretching” of the individual snippets was actually performed. It seems my way of doing it (lines 30-34 in the code) made the results even more “universal”.

The reply which uncovered the mistake ultimately got rejected by PNAS, with the explanation of not adding significantly to the field. A funny part of the story is the letter [3] one of the authors of the rejected reply sent to the editor of PNAS. He complains about the bad state of science if simpler explanations of theories get rejected because of being too boring. The most remarkable sentence is this:

… a fundamental error can remain published as “truth” in PNAS without the normal debate that should be the domaim of real science. … In other words, we can add “shit” to the field but we cannot correct and remove “shit” from the field

It probably helps to be a famous professor to get away with writing a letter like this to the editor of PNAS.

It think publishing an analysis whose interpretation is not warranted by the data is nothing to be ashamed of. As can be nicely seen, the scientific process works; the error is found and reported. Future researchers are (ideally) warned and hopefully won’t make the same mistake.

However, rejecting the one who uncovers the error and offers a simpler, but less exciting explanation for the phenomenon, on grounds of not contributing anything new, is indeed something to be ashamed of. Shame on you, PNAS.

Another lesson learnt from this story is that, if you have performed any statistical analysis that is more complex than calculating the mean and the standard deviation, you should perform the same analysis on noise to make sure that whatever effect you observe is indeed a unique feature of your data and not an artefact of the analysis.

References:
[1] Preis et al (2011) “Switching processes in financial markets” PNAS doi: 10.1073/pnas.1019484108
[2] Filimonov & Sornette (2011) “Spurious trend switching phenomena in financial markets”, arxiv.org/abs/1112.3868
[3] D. Sornette (2011) Letter to the editor of PNAS

To leave a comment for the author, please follow the link and comment on their blog: sieste » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)