Is September Bearish?
Traders love discussing seasonality, and September declines in US equity markets are a favorite topic. Historically, September has underperformed every other month of the year, with a mean return of -0.56% on the S&P 500 index from 1950 to 2012; 54% of Septembers were bearish over the same period – more than any other month. Empirically, September deserves its moniker: “The Cruelest Month.”
As a trading strategy, 54% isn’t a substantial win rate, and the sample size is small given that it only trades once a year. However, both as a portfolio overlay and as a trading position, it’s worth considering whether bearishness in September is a statistically significant anomaly or random noise.
Issues With Testing for Seasonality
There are plenty of tests for seasonality in time series data. Many rely on some form of autocorrelation to detect seasonal components in the underlying series. These methods are usually parametric and subject to lots of assumptions – not robust, especially for a nonstationary, noisy time series like the market.
Furthermore, sorting monthly returns by performance and declaring one month “the most bearish” introduces a data-snooping bias. We’re implicitly performing multiple hypothesis tests by doing so, and as such we need to correct for the problem of multiple comparisons. This gets even more interesting if we consider a spreading strategy between two months: the number of candidate month pairs grows much faster than the number of months, which introduces a multiple comparison bias closely related to the Birthday Paradox.
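To get a feel for how quickly multiple comparisons inflate false positives, here is a minimal Python sketch. It assumes the tests are independent, which monthly and (especially) pairwise comparisons are not, so treat the numbers as a rough illustration rather than an exact error rate. The function name is made up for this example.

```python
from math import comb

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumption: the tests are independent, which is only an approximation here.
    """
    return 1.0 - (1.0 - alpha) ** n_tests

months = 12
month_pairs = comb(months, 2)  # number of distinct two-month spreads

fwer_single = familywise_error_rate(0.05, months)       # about 0.46
fwer_pairs = familywise_error_rate(0.05, month_pairs)   # about 0.97

print(f"{months} single-month tests: {fwer_single:.3f}")
print(f"{month_pairs} month-pair tests: {fwer_pairs:.3f}")
```

With 12 months there are 66 possible pairs, so a "significant" spread somewhere in the calendar is close to guaranteed even if nothing is going on.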
Bootstrap to the Rescue
The nonparametric bootstrap is a general purpose tool for estimating the sampling distribution of a statistic from the data itself. The technique is a powerful, computationally intensive tool that’s easily applied to any sample statistic, works well on small samples, and makes few assumptions about the underlying data. However, one assumption that it does make is a biggie: the data must be independent and identically distributed (iid). That’s a deal breaker, unless you buy into the efficient market hypothesis, in which case this post is already pretty irrelevant to you. Bootstrapping dependent data is an active area of research and there’s no universal solution to the problem. However, there’s research suggesting that the bootstrap can remain reasonably robust when these assumptions are violated.
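As a rough illustration of the idea, here is a minimal Python sketch of a nonparametric bootstrap for the mean. The returns are synthetic (drawn from a normal distribution with made-up parameters), not S&P 500 data, and the function name is hypothetical.

```python
import random
import statistics

def bootstrap_means(sample, n_replicates=2000, seed=0):
    """Resample with replacement and return the mean of each replicate."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n))
            for _ in range(n_replicates)]

# Synthetic stand-in for 63 years of one calendar month's log returns.
data_rng = random.Random(42)
returns = [data_rng.gauss(0.005, 0.04) for _ in range(63)]

replicates = bootstrap_means(returns)

# The spread of the replicates estimates the standard error of the mean,
# which should land close to the textbook s / sqrt(n) for iid data.
boot_se = statistics.stdev(replicates)
analytic_se = statistics.stdev(returns) / len(returns) ** 0.5
print(f"bootstrap SE: {boot_se:.5f}, analytic SE: {analytic_se:.5f}")
```

The same loop works for any statistic: swap `statistics.fmean` for a median, a win rate, or whatever you care about.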
To address the question of seasonality we need a reasonable way to pose the hypothesis that we’re testing – one that minimizes issues arising from path dependence. One approach is to consider the distribution of labeled and unlabeled monthly returns. This flavor of bootstrap is also known as a permutation test. The premise is simple. Data is labeled as “control” and “experimental.” Under the null hypothesis that there’s no difference between the two groups, a distribution of the mean is bootstrapped over the unlabeled (pooled) data. A sample mean is calculated for the experimental data, and a p-value is computed as the fraction of bootstrap replicates at least as extreme as that sample mean. September returns form the experimental group, and all other months comprise the control.
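The procedure can be sketched in a few lines of Python. This is a minimal version under some assumptions of mine: it draws relabeled groups without replacement (the classic permutation flavor), it is one-sided (it counts replicates at or below the observed mean, since the hypothesis is bearishness), and the data is synthetic with a deliberately exaggerated September effect.

```python
import random
import statistics

def permutation_p_value(experimental, control, n_replicates=5000, seed=0):
    """One-sided p-value: how often a random relabeling yields a mean
    at or below the observed experimental mean."""
    rng = random.Random(seed)
    pooled = list(experimental) + list(control)  # drop the labels
    k = len(experimental)
    observed = statistics.fmean(experimental)
    hits = 0
    for _ in range(n_replicates):
        # Draw a pseudo-"September" group from the pooled returns.
        replicate = statistics.fmean(rng.sample(pooled, k))
        if replicate <= observed:
            hits += 1
    return hits / n_replicates

# Synthetic data: 63 "Septembers" with a lower mean, 693 other months.
data_rng = random.Random(1)
september = [data_rng.gauss(-0.03, 0.04) for _ in range(63)]
other_months = [data_rng.gauss(0.008, 0.04) for _ in range(693)]

p_value = permutation_p_value(september, other_months)
print(f"one-sided permutation p-value: {p_value:.4f}")
```

If you prefer resampling with replacement, swapping `rng.sample(pooled, k)` for `rng.choices(pooled, k=k)` gives the bootstrap variant of the same test.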
The Study
The monthly (log) return was calculated using the opening price on the first trading day of the month and the closing price on the last trading day. Adjusted returns on the S&P index were used in lieu of e-minis or SPY in the interest of a longer return series (we’re interested in the effect, not the execution, and at the monthly level there won’t be much of a difference in mean).
We’ll start by taking a look at the bootstrapped distributions of mean monthly returns. The heavy lifting is done by the fantastic xts, ggplot2, and quantmod libraries.
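For concreteness, the return calculation described above amounts to the following. This is a minimal Python sketch with made-up prices; the function name is just for illustration.

```python
import math

def monthly_log_return(first_open: float, last_close: float) -> float:
    """Log return from the month's first open to its last close."""
    return math.log(last_close / first_open)

# Hypothetical month: opens at 100, closes at 99 (a loss of roughly 1%).
r = monthly_log_return(100.0, 99.0)
print(f"{r:.5f}")

# Log returns add across periods, which is why the equity curve later
# in the post can be built as a running sum.
two_months = monthly_log_return(100.0, 99.0) + monthly_log_return(99.0, 103.0)
print(f"{two_months:.5f}")
```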
[Figure: bootstrap distributions of mean monthly return for each calendar month and for the control]

The image above shows a bootstrap distribution for each calendar month, as well as the control. The control distribution is much tighter as there is 12 times more data, which by the Central Limit Theorem should result in a distribution having $\sigma_\text{control} \approx \frac{\sigma}{\sqrt{12}}$, where $\sigma$ is the standard deviation of a single-month distribution.
Plots of bootstrapped distributions offer a nice visual representation of the probability of committing Type I and Type II errors when hypothesis testing. The tail of a single-month return distribution extending towards the mean of the control distribution shows how a Type II error can occur if the alternative hypothesis were indeed true. The converse holds for a Type I error, in which the tail of the null hypothesis distribution extends towards the mean of an alternative hypothesis distribution.
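The $\sqrt{12}$ factor is just the standard error of a mean at work. As a sketch, assume each calendar month contributes $n$ returns with a common standard deviation $s$ (both symbols are introduced here for illustration, and the equal-variance assumption is a simplification):

$$\sigma = \frac{s}{\sqrt{n}}, \qquad \sigma_\text{control} = \frac{s}{\sqrt{12n}} = \frac{\sigma}{\sqrt{12}} \approx 0.29\,\sigma$$

So the control distribution should be roughly 3.5 times narrower than any single-month distribution.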
Now for the permutation test.

My p-value was $p = .0134$ (your mileage will vary, as this is a random sample after all) – pretty statistically significant as a standalone hypothesis. However, we still have a lurking issue with multiple comparisons. We cherry-picked one month out of the calendar year – September – and we need to account for this bias. The Bonferroni correction is one approach, in which case we would need a per-test threshold of $\alpha = \frac{.05}{12} = .0042$ to test our hypothesis at an overall $\alpha = .05$ level. Equivalently, multiplying by the 12 comparisons gives an effective p-value of $12 \times .0134 = 0.1608$ – not super compelling.
Putting the statistics aside, had you traded this strategy from 1950 to present, your equity curve (in log returns) would look like:
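The Bonferroni arithmetic is simple enough to check by hand, but here it is as a small Python sketch using the p-value quoted above. The function name is hypothetical.

```python
def bonferroni(p_value: float, n_tests: int, alpha: float = 0.05):
    """Return the Bonferroni-adjusted p-value and the per-test threshold."""
    adjusted_p = min(1.0, p_value * n_tests)  # a p-value cannot exceed 1
    per_test_alpha = alpha / n_tests
    return adjusted_p, per_test_alpha

adjusted_p, per_test_alpha = bonferroni(0.0134, 12)
print(f"adjusted p-value: {adjusted_p:.4f}")        # 0.1608
print(f"per-test threshold: {per_test_alpha:.4f}")  # 0.0042
print("reject at 5%:", adjusted_p < 0.05)           # False
```

Bonferroni is known to be conservative, so failing it doesn't prove there is no effect; it only says this test alone isn't strong enough evidence once the cherry-picking is accounted for.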
[Figure: equity curve (in log returns) for the September trade, with high and low bands for the most bullish and most bearish month in each year]

The high and low bands show the return for the most bullish and bearish month in each year. It’s easy to see that September tends to hug the bottom band but it looks pretty dodgy as a trade – statistical significance does not a good trading strategy make.
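One way to build curves like these is sketched below in Python. It assumes the bands are running sums of each year's best and worst monthly log return, which is my reading of the description rather than something the post spells out, and it uses toy numbers rather than S&P 500 data.

```python
from itertools import accumulate

SEPTEMBER = 8  # zero-based index into a January..December list

def september_curves(returns_by_year):
    """Cumulative log-return curves for September and for the
    best and worst month of each year (assumed band definition)."""
    years = sorted(returns_by_year)
    sept = [returns_by_year[y][SEPTEMBER] for y in years]
    high = [max(returns_by_year[y]) for y in years]
    low = [min(returns_by_year[y]) for y in years]
    return {
        "september": list(accumulate(sept)),
        "high": list(accumulate(high)),
        "low": list(accumulate(low)),
    }

# Toy data: 12 monthly log returns per year (made-up values).
toy_returns = {
    2000: [0.01] * 8 + [-0.02] + [0.01] * 3,
    2001: [0.02] * 8 + [0.03] + [-0.01] * 3,
}

curves = september_curves(toy_returns)
for name, curve in curves.items():
    print(name, [round(x, 4) for x in curve])
```

By construction the September curve can never escape the bands, so how closely it tracks the bottom one is the interesting part.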
Closing Thoughts
Bootstrapping September returns in aggregate makes the assumption that year-over-year, September returns are independent and identically distributed. Given that the bearishness of September is common lore, it’s reasonable to hypothesize that at this point the effect is a self-fulfilling prophecy in which traders take into account how the previous few Septembers went, or the effect in general. If traders fear that September is bearish and tighten stops or liquidate intra-month, an anomaly born out of random variance might gain traction. Whatever the cause, the data indicates that September is indeed anomalous. As for a standalone trading strategy…not so much.