by Joseph Rickert

The ancient Egyptians were a people with long memories. The lists of their pharaohs went back thousands of years, and we still have the names and tax assessments for certain persons and institutions from the time of Ramesses II. When Herodotus began writing about Egypt and the Nile (c. 450 BC), the Egyptians, who knew that their prosperity depended on the river’s annual overflow, had been keeping records of the Nile’s high water mark for more than three millennia. So it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile.

The story of British hydrologist and civil servant H.E. Hurst, who earned the nickname “Abu Nil”, Father of the Nile, for his 62-year career of measuring and studying the river, is now fairly well known. Pondering an 847-year record of Nile overflow data, Hurst noticed that the series was persistent in the sense that heavy flood years tended to be followed by heavier-than-average flood years, while below-average flood years were typically followed by light flood years. Working from an ancient formula on optimal dam design, he devised the equation log(R/S) = K*log(N/2), where R is the range of the cumulative departures of the series from its mean, S is the standard deviation of the year-to-year flood measurements, and N is the number of years in the series. Hurst noticed that the value of K for the Nile series and other series related to climate phenomena tended to be about 0.73, consistently much higher than the value of 0.5 that one would expect from independent observations or from series with only short-range autocorrelation.
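As a rough illustration of Hurst's equation, the sketch below solves log(R/S) = K*log(N/2) for K. The helper name hurst_k is hypothetical, and R's built-in Nile dataset (annual flow at Aswan, 1871 to 1970) is used only as a convenient stand-in; it is not the 847-year record Hurst studied.

```r
# Solve log(R/S) = K * log(N/2) for K.
# hurst_k is a hypothetical helper name, not a function from any package.
hurst_k <- function(y) {
  y <- as.numeric(y)
  n <- length(y)
  s <- sd(y)                    # S: standard deviation of the observations
  dev <- cumsum(y - mean(y))    # cumulative departures from the mean
  r <- max(dev) - min(dev)      # R: range of the cumulative departures
  log(r / s) / log(n / 2)
}

# Stand-in data: the built-in annual Nile flow series
k_nile <- hurst_k(Nile)
k_nile

# For comparison: independent draws of the same length
set.seed(1)
k_noise <- hurst_k(rnorm(length(Nile)))
k_noise
```

With only 100 observations both values are noisy, so the comparison is indicative at best.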

Today, mostly due to the work of Benoit Mandelbrot, who rediscovered and popularized Hurst’s work in the early 1960s, Hurst’s Rescaled Range Analysis and the calculation of the Hurst exponent (Mandelbrot renamed “K” to “H”) are the point of departure for the modern study of Long Memory Time Series. To investigate, let’s look at some monthly flow data taken at the Dongola measurement station, which is just upstream from the high dam at Aswan. (Look here for the data, and here for a map and a nice Python-based analysis that covers some of the same ground as is presented below.) The data consist of average monthly flow measurements from January 1869 to December 1984.

head(nile_dat)
  Year  Jan  Feb  Mar Apr May  Jun  Jul   Aug   Sep  Oct  Nov  Dec
1 1869   NA   NA   NA  NA  NA   NA 2606  8885 10918 8699   NA   NA
2 1870   NA   NA 1146 682 545  540 3027 10304 10802 8288 5709 3576
3 1871 2606 1992 1485 933 731  760 2475  8960  9953 6571 3522 2419
4 1872 1672 1033  728 605 560  879 3121  8811 10532 7952 4976 3102
5 1873 2187 1851 1235 756 556 1392 2296  7093  8410 5675 3070 2049
6 1874 1340  847  664 516 466  964 3061 10790 11805 8064 4282 2904

To get a feel for the data we plot a portion of the time series.

The pattern is very regular and the short term correlations are apparent. The following boxplots show the variation in monthly flow.
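A minimal sketch of how plots of this kind could be drawn in base R, using only the four complete years (1871 to 1874) that appear in the head(nile_dat) output above. The names flows and nile_monthly are illustrative, and the plots in the post were presumably drawn from the full record.

```r
# Monthly flows for 1871-1874, copied from the head(nile_dat) output above
flows <- matrix(
  c(2606, 1992, 1485, 933, 731,  760, 2475,  8960,  9953, 6571, 3522, 2419,
    1672, 1033,  728, 605, 560,  879, 3121,  8811, 10532, 7952, 4976, 3102,
    2187, 1851, 1235, 756, 556, 1392, 2296,  7093,  8410, 5675, 3070, 2049,
    1340,  847,  664, 516, 466,  964, 3061, 10790, 11805, 8064, 4282, 2904),
  nrow = 4, byrow = TRUE,
  dimnames = list(1871:1874, month.abb)
)

# Transposing before flattening reads the matrix year by year,
# which puts the months in chronological order
nile_monthly <- ts(as.vector(t(flows)), start = c(1871, 1), frequency = 12)

# Time series plot of the monthly flows
plot(nile_monthly, ylab = "Average monthly flow",
     main = "Monthly flow, 1871-1874")

# One box per calendar month (boxplot() on a matrix works column by column)
boxplot(flows, xlab = "Month", ylab = "Average monthly flow",
        main = "Variation in monthly flow")
```

Even in this four-year slice the flood peak falls in August and September, in line with the seasonal pattern described above.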

Herodotus clearly knew what he was talking about when he wrote (The Histories: Book 2, 19):

I was particularly eager to find out from them (the Egyptian priests) why the Nile starts coming down in a flood at the summer solstice and continues flooding for a hundred days, but when the hundred days are over the water starts to recede and decreases in volume, with the result that it remains low for the whole winter, until the summer solstice comes round again.

To construct a long memory time series we aggregate the monthly flows to produce a yearly time series of total flow (dropping the years 1869 and 1870 because of the missing values).
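The aggregation step might look something like the sketch below. It rebuilds a four-row stand-in for nile_dat from the head() output shown earlier, so the exact construction used for the post is an assumption; the name nile.yr.ts is borrowed from the calls that appear later.

```r
# Four-row stand-in for nile_dat, copied from the head() output above
nile_dat <- data.frame(
  Year = 1869:1872,
  rbind(
    c(  NA,   NA,   NA,  NA,  NA,  NA, 2606,  8885, 10918, 8699,   NA,   NA),
    c(  NA,   NA, 1146, 682, 545, 540, 3027, 10304, 10802, 8288, 5709, 3576),
    c(2606, 1992, 1485, 933, 731, 760, 2475,  8960,  9953, 6571, 3522, 2419),
    c(1672, 1033,  728, 605, 560, 879, 3121,  8811, 10532, 7952, 4976, 3102)
  )
)
names(nile_dat)[-1] <- month.abb

# Keep only years with all twelve months present (drops 1869 and 1870 here).
# Assumption: any remaining years are contiguous, as ts() requires.
complete <- nile_dat[complete.cases(nile_dat), ]

# Yearly total: sum of the twelve monthly values in each row
yearly_total <- unname(rowSums(complete[, month.abb]))

nile.yr.ts <- ts(yearly_total, start = min(complete$Year), frequency = 1)
nile.yr.ts
```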

Plotting the ACF, we see that the autocorrelations persist for nearly 20 years!
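A sketch of an ACF plot of this kind, using base R's acf(). Because the Dongola data are not bundled with R, the built-in Nile series (annual flow at Aswan, 1871 to 1970) stands in here, so the picture will differ from the one described above.

```r
# Stand-in data: built-in annual Nile flow series, not the Dongola record
nile_acf <- acf(Nile, lag.max = 20,
                main = "ACF of annual Nile flow (built-in Nile data)")

# Sample autocorrelations at lags 0 through 20
round(as.numeric(nile_acf$acf), 3)

# Approximate 95% bound drawn by plot.acf() under a white-noise null
ci_bound <- qnorm(0.975) / sqrt(length(Nile))
ci_bound
```

Autocorrelations that stay outside this bound at long lags are the visual signature of persistence.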

So, let's compute the Hurst exponent. For our first try, we use a simple function suggested by an example in Bernhard Pfaff's classic text: Analysis of Integrated and Cointegrated Time Series with R.

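simpleHurst <- function(y){
  sd.y <- sd(y)                            # S: standard deviation of the series
  y <- y - mean(y)                         # center the series on its mean
  cum.y <- cumsum(y)                       # cumulative departures from the mean
  RS <- (max(cum.y) - min(cum.y)) / sd.y   # rescaled range R/S
  H <- log(RS) / log(length(y))            # note: divides by log(N), not log(N/2)
  return(H)
}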
simpleHurst(x)
[1] 0.7348662

Bingo! 0.73, just what we were expecting for a long memory time series. Unfortunately, things are not so simple. The function hurst() from the pracma package, which is a much more careful calculation than simpleHurst(), yields:

hurst(nile.yr.ts)
[1] 1.041862

This is mildly distressing, since H is supposed to be bounded above by 1. The function hurstexp() from the same package, which is based on Weron's MATLAB code and implements the small sample correction, seems to solve that problem.

> hurstexp(nile.yr.ts)
Corrected R over S Hurst exponent:   1.041862
Theoretical Hurst exponent:          0.5244148
Corrected empirical Hurst exponent:  0.7136607
Empirical Hurst exponent:            0.6975531 

0.71 is a more reasonable result. However, as a post on the Timely Portfolio blog pointed out a few years ago, computing the Hurst exponent is an estimation problem, not merely a calculation. So, where are the error bars?

I am afraid that confidence intervals and a look at several other methods available in R for estimating the Hurst exponent will have to wait for another time. In the meantime, the following references may be of interest. The first two are light reading from early champions of applying Rescaled Range analysis and the Hurst exponent to financial time series. The book by Mandelbrot and Hudson is especially interesting for its sketch of the historical background. The last two are relatively early papers on the subject.