(This article was first published on ** RStudio Blog**, and kindly contributed to R-bloggers)

I’ve released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on GitHub) is fully reproducible, so you can see some data tidying in action, or make your own modifications to the data.

Below, I’ve listed the primary dataset found in each package. Most packages also include a number of supplementary datasets that provide additional information. Check out the docs for more details.

`babynames::babynames`

: US baby name data for each year from 1880 to 2013: the number of children of each sex given each name. All names used 5 or more times are included. 1,792,091 rows, 5 columns (year, sex, name, n, prop). (Source: Social Security Administration)

`fueleconomy::vehicles`

: Fuel economy data for all cars sold in the US from 1984 to 2015. 33,442 rows, 12 variables. (Source: Environmental Protection Agency)

`nasaweather::atmos`

: Data from the 2006 ASA data expo. Contains monthly atmospheric measurements from Jan 1995 to Dec 2000 on a 24 x 24 grid over Central America. 41,472 observations, 11 variables. (Source: ASA data expo)

`nycflights13::flights`

: This package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: `weather`, `planes`, `airports`, `airlines`. (Source: Bureau of Transportation Statistics)
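As a quick taste of what these data make easy (my own sketch, assuming dplyr and the babynames package are installed), here is the most popular name for each sex in 2013:

```r
library(dplyr)
library(babynames)

# Most popular baby name for each sex in the most recent year of data.
babynames %>%
  filter(year == 2013) %>%
  group_by(sex) %>%
  filter(n == max(n))
```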

NB: since the datasets are large, I’ve tagged each data frame with the `tbl_df` class. If you don’t use dplyr, this has no effect. If you do use dplyr, this ensures that you won’t accidentally print thousands of rows of data. Instead, you’ll just see the first 10 rows and as many columns as will fit on screen. This makes interactive exploration much easier.

```
library(dplyr)
library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#>
#> year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013 1 1 517 2 830 11 UA N14228
#> 2 2013 1 1 533 4 850 20 UA N24211
#> 3 2013 1 1 542 2 923 33 AA N619AA
#> 4 2013 1 1 544 -1 1004 -18 B6 N804JB
#> 5 2013 1 1 554 -6 812 -25 DL N668DN
#> 6 2013 1 1 554 -4 740 12 UA N39463
#> 7 2013 1 1 555 -5 913 19 B6 N516JB
#> 8 2013 1 1 557 -3 709 -14 EV N829AS
#> 9 2013 1 1 557 -3 838 -8 B6 N593JB
#> 10 2013 1 1 558 -2 753 8 AA N3ALAA
#> .. ... ... ... ... ... ... ... ... ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#> (dbl), distance (dbl), hour (dbl), minute (dbl)
```

To **leave a comment** for the author, please follow the link and comment on his blog: ** RStudio Blog**.


(This article was first published on ** Hyndsight » R**, and kindly contributed to R-bloggers)

When modelling data with ARIMA models, it is sometimes useful to plot the inverse characteristic roots. The following functions will compute and plot the inverse roots for any fitted ARIMA model (including seasonal models).

```
# Compute AR roots
arroots <- function(object)
{
  if(class(object) != "Arima" & class(object) != "ar")
    stop("object must be of class Arima or ar")
  if(class(object) == "Arima")
    parvec <- object$model$phi
  else
    parvec <- object$ar
  if(length(parvec) > 0)
  {
    last.nonzero <- max(which(abs(parvec) > 1e-08))
    if (last.nonzero > 0)
      return(structure(list(roots=polyroot(c(1,-parvec[1:last.nonzero])),
                            type="AR"), class='armaroots'))
  }
  return(structure(list(roots=numeric(0), type="AR"), class='armaroots'))
}

# Compute MA roots
maroots <- function(object)
{
  if(class(object) != "Arima")
    stop("object must be of class Arima")
  parvec <- object$model$theta
  if(length(parvec) > 0)
  {
    last.nonzero <- max(which(abs(parvec) > 1e-08))
    if (last.nonzero > 0)
      return(structure(list(roots=polyroot(c(1,parvec[1:last.nonzero])),
                            type="MA"), class='armaroots'))
  }
  return(structure(list(roots=numeric(0), type="MA"), class='armaroots'))
}

plot.armaroots <- function(x, xlab="Real", ylab="Imaginary",
    main=paste("Inverse roots of", x$type, "characteristic polynomial"),
    ...)
{
  oldpar <- par(pty='s')
  on.exit(par(oldpar))
  plot(c(-1,1), c(-1,1), xlab=xlab, ylab=ylab,
       type="n", bty="n", xaxt="n", yaxt="n", main=main, ...)
  axis(1, at=c(-1,0,1), line=0.5, tck=-0.025)
  axis(2, at=c(-1,0,1), label=c("-i","0","i"), line=0.5, tck=-0.025)
  circx <- seq(-1, 1, l=501)
  circy <- sqrt(1 - circx^2)
  lines(c(circx,circx), c(circy,-circy), col='gray')
  lines(c(-2,2), c(0,0), col='gray')
  lines(c(0,0), c(-2,2), col='gray')
  if(length(x$roots) > 0)
  {
    inside <- abs(x$roots) > 1
    points(1/x$roots[inside], pch=19, col='black')
    if(sum(!inside) > 0)
      points(1/x$roots[!inside], pch=19, col='red')
  }
}
```

The `arroots` function will return the autoregressive roots from the AR characteristic polynomial, while the `maroots` function will return the moving average roots from the MA characteristic polynomial. Both functions take an `Arima` object as their only argument. If a seasonal ARIMA model is passed, the roots from both polynomials are computed in each case.

The `plot.armaroots` function will plot the inverse of the roots on the complex unit circle. A causal, invertible model should have all the roots outside the unit circle. Equivalently, the inverse roots should lie inside the unit circle.

Here are a couple of examples demonstrating their use.

A simple example with three AR roots:

```
library(forecast)
plot(arroots(Arima(WWWusage, c(3,1,0))))
```

A more complicated example with ten AR roots and four MA roots. (This is not actually the best model for these data.)

```
library(forecast)
fit <- Arima(woolyrnq, order=c(2,0,0), seasonal=c(2,1,1))
par(mfrow=c(1,2))
plot(arroots(fit), main="Inverse AR roots")
plot(maroots(fit), main="Inverse MA roots")
```

Finally, here is an example where two inverse roots lie outside the unit circle (shown in red).

```
library(fma)
plot(arroots(ar.ols(jcars)))
```

Note that the `Arima` function will never return a model with inverse roots outside the unit circle. The `auto.arima` function is even stricter and will not select a model with roots close to the unit circle either, as such models are unlikely to be good for forecasting.
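As a quick sanity check of that claim (my own sketch, not from the original post, and assuming the `arroots` function above has been sourced), you can fit a model with `auto.arima` and verify that every inverse AR root lies inside the unit circle:

```r
library(forecast)

fit <- auto.arima(WWWusage)
# arroots() returns the roots of the AR characteristic polynomial;
# their reciprocals are the inverse roots plotted by plot.armaroots().
inv <- 1 / arroots(fit)$roots
all(abs(inv) < 1)
```

(If the selected model happens to have no AR terms, `arroots` returns an empty set of roots and the check is vacuously `TRUE`.)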

I won’t add these functions to the `forecast` package as I don’t think enough people would use them, and the package is big enough as it is. So I’m making them available here instead for anyone who wants to use them.


(This article was first published on ** librestats » R**, and kindly contributed to R-bloggers)

R has some extremely useful utilities for profiling, such as system.time(), Rprof(), the often overlooked tracemem(), and the rbenchmark package. But if you want more than just simple timings of code execution, you will mostly have to look elsewhere.

One of the best sources for profiling data is hardware performance counters, available in most modern hardware. This data can be invaluable to understanding what a program is really doing. The Performance Application Programming Interface (PAPI) library is a well-known profiling library, and allows users to easily access this profiling data. So we decided to bring PAPI to R. It's available now in the package pbdPAPI, and is supported in part as a 2014 Google Summer of Code project (thanks Googs!).

So what can you do with it? I'll show you.

Flops, or "floating point operations per second" is an important measurement of performance of some kinds of programs. A very famous benchmark known as the LINPACK benchmark is a measurement of the flops of a system solving a system of linear equations using an LU decomposition with partial pivoting. You can see current and historical data for supercomputer performance on the LINPACK Benchmark, or even see how your computer stacks up with the biggest computers in the world at Top 500.

For an example, let's turn to the pbdPAPI package's Principal Components Analysis demo. This demo measures the number of floating point operations (things like addition, subtraction, multiplication, and division) executed by your compter to perform a PCA, and compares it against the number of operations theoretically required to compute a PCA. This theoretical value is determined by evaluating the different compute kernels that make up a PCA. For an mxn matrix with PCA computed via SVD of the data matrix (as in R's `prcomp()`

), we need:

`2mn + 1`

operations to center the data.`6mn^2 + 20n^3`

operations for the SVD.`2mn^2`

operations for the projection onto the right singular vectors (the`retx=TRUE`

part).

We only add the count for centering (and not scaling), because that's the R default (for some reason...). For more details, see Golub and Van Loan's "Matrix Computations".
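Plugging the demo’s dimensions into those three kernel counts reproduces the theoretical figure in the demo output (a quick sketch; `pca_flops` is a hypothetical helper name, not part of pbdPAPI):

```r
# Theoretical flop count for PCA of an m-by-n matrix via SVD,
# using the three kernel counts listed above.
pca_flops <- function(m, n) {
  centering  <- 2 * m * n + 1           # centering the data
  svd        <- 6 * m * n^2 + 20 * n^3  # SVD of the data matrix
  projection <- 2 * m * n^2             # projection (the retx=TRUE part)
  centering + svd + projection
}

pca_flops(10000, 50)
#> [1] 203500001
```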

An example output from running the demo on this machine is:

```
      m  n  measured theoretical difference pct.error   mflops
1 10000 50 212563800   203500001    9063799  4.264037 2243.717
```

So pbdPAPI measured 212.6 million floating point operations, while the theoretical number is 203.5 million. That difference is actually quite small, and seems fairly reasonable. Also note that we clock in at around 2.2 Gflops (double precision). And we achieve all of this with a simple `system.flops()` call from pbdPAPI:

```
library(pbdPAPI)
m <- 10000
n <- 50
x <- matrix(rnorm(m*n), m, n)
flops <- system.flops(prcomp(x, center=FALSE, scale.=FALSE))
```

Another interesting thing you can do with pbdPAPI is easily measure cache misses. Remember when some old grumpy jerk told you that "R matrices are column-major"? Or that, when operating on matrices, you should loop over columns first, then rows? Why is that? Short answer: computers are bad at abstraction. Long answer: cache.

If you're not entirely familiar with CPU caches, I would encourage you to take a gander at our spiffy vignette. But the quick summary is that lots of cache misses is bad. To understand why, you might want to take a look at this interactive visualization of memory access speeds.

To show off how this works, we're going to measure the cache misses of a simple operation: allocate a matrix and set all entries to 1. We're going to use Rcpp to do this, mostly because measuring the performance of for loops in R is too depressing.

First, let's do this by looping over rows and then columns. Said another way, we fix a row and and fill all of its entries with 1 before moving to the next row:

```
SEXP rows_first(SEXP n_)
{
  int i, j;
  const int n = INTEGER(n_)[0];
  Rcpp::NumericMatrix x(n, n);

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      x(i, j) = 1.;

  return x;
}
```

Next, we'll loop over columns first, then rows. Here we fix a column and fill each row's entry in that column with 1 before proceeding:

```
SEXP cols_first(SEXP n_)
{
  int i, j;
  const int n = INTEGER(n_)[0];
  Rcpp::NumericMatrix x(n, n);

  for (j=0; j<n; j++)
    for (i=0; i<n; i++)
      x(i, j) = 1.;

  return x;
}
```

Assuming these have been compiled for use with R, say with the first as `bad()` and the second as `good()`, we can easily measure the cache misses like so:

```
library(pbdPAPI)
n <- 10000L
system.cache(bad(n))
system.cache(good(n))
```

Again using this machine as a reference we get:

```
$`Level 1 cache misses`
[1] 202536304

$`Level 2 cache misses`
[1] 168382934

$`Level 3 cache misses`
[1] 21552970
```

for `bad()`, and:

```
$`Level 1 cache misses`
[1] 15771212

$`Level 2 cache misses`
[1] 1889270

$`Level 3 cache misses`
[1] 1286338
```

for `good()`. Just staring at these huge values may not be easy on the eyes, so here’s a plot showing this same information:

Here, lower is better, and so the clear winner is, as the name implies, `good()`. Another valuable measurement is the ratio of total cache misses (data and instruction) to total cache accesses. Again, with pbdPAPI, measuring this is trivial:

```
system.cache(bad(n), events="l2.ratio")
system.cache(good(n), events="l2.ratio")
```

On this machine, we see:

```
L2 cache miss ratio
          0.8346856

L2 cache miss ratio
           0.112331
```

Here too, lower is better, and so we again see a clear winner. The full source for this example is available here.

pbdPAPI can measure much, much more than just flops and cache misses. See the package vignette for more information about what you can measure with pbdPAPI. The package is available now on GitHub and is permissively licensed under the BSD 2-clause license, and it will come to CRAN eventually.

OK, now the downside: at the moment, it doesn't work on Windows or Mac.

We have spent the last month working on extending support to Windows and/or Mac, but it's not entirely trivial for a variety of reasons, as PAPI itself only supports Linux and FreeBSD at this time. We are committed to platform independence, and I believe we'll get there soon, in some capacity. But for now, it works fantastically on your friendly neighborhood Linux cluster.

Finally, a quick thanks again to the Googs, and also thanks to the folks who run the R organization for Google Summer of Code, especially Brian. And thanks to our student, who I think is doing a great job so far.


(This article was first published on ** Timely Portfolio**, and kindly contributed to R-bloggers)

Another color experiment combining resources from R and JavaScript. I just wish I could do Mean Phylogenetic Distance in JavaScript like rPlotter. I enjoyed using the d3.js zoom behavior to pan and zoom the image on canvas. Also, filedrop.js made the drag-and-drop image easy. There are lots of mini lessons in this code for anyone who wants to dig inside it.

(This article was first published on ** R Enthusiast and R/C++ hero**, and kindly contributed to R-bloggers)

R 3.1.1 was released a few days ago, and as part of the policy we are trying to follow for `Rcpp11` releases, here is `Rcpp11` 3.1.1. Sorry for the 12-day delay, but I was away in California, and Rcpp11 travelled with me, so I could not properly test the package. I have now tested the package extensively on these combinations:

- `OS X/clang` at home
- `Ubuntu/gcc 4.6.3` through travis
- `Windows/gcc 4.6.3` under duress

Here is the extract of the `NEWS.md` file for this release:

- sugar `sum` now supports complex sugar vector expressions
- sugar `mean` implements the double pass algorithm for `numeric` and `complex` cases (#134)
- more C++ support for Rcomplex: `Rcomplex& operator+=( Rcomplex&, const Rcomplex& )` and `Rcomplex operator/( Rcomplex, double )`
- Internal refactoring/simplification of all api classes. Api classes are now parameterized by a class for the storage policy instead of a template as before.
- `Dots` and `NamedDots` handle the border case when `...` is missing (#123)
- If the macro `RCPP_DO_BOUNDS_CHECKS` is defined, vector classes will perform bounds checks. This is turned off by default because it kills performance. (#141)
- `Array` no longer generates spurious warnings (#154)
- Added the concept of lazy vectors. A lazy vector is similar to a sugar expression, but it only knows how to apply itself, i.e. we cannot call `operator[](int)` on it. This is used for the implementation of `create` and `fuse`.
- `create` can now also be used as a free function. For example: `IntegerVector x = create(1,2,3) ;`. When used as a free function, `create` chooses to create a lazy vector of the highest type. For example, `create(1,2.0)` makes a lazy vector of type `REALSXP` (what makes sense for `double`).
- Added the `list` function. It takes a variadic list of arguments and makes an R list from it. This uses the same underlying implementation as `List::create` but is nicer to use.
- `mapply` was reimplemented using variadic templates. `mapply` now accepts a function as first parameter, then a variable number of sugar expressions.
- `Array` gains a `fill` method to initialize all its data to the same value.
- `is<>` was broken.
- Initial implementation of `ListOf`. `ListOf<T>` is similar to `List` but it only exposes constructors that take `T` objects and methods that maintain this requirement. The implementation differs from Kevin Ushey's implementation in Rcpp, which IMHO tries to do too much.
- New sugar functions `Filter`, `Reduce` and `Map` (synonym of `mapply`) (#140)
- New functions `Negate` and `Compose` for an initial attempt at functional programming and function composition (#140)
- Support for long vectors has been added. Vectors are now indexed by the `R_xlen_t` type (64 bit on 64 bit platforms).
- The header `<Rcpp11>` has been added. It just includes `<Rcpp.h>`, but I like it better to have `#include <Rcpp11>`
- The `Rcpp11` namespace has been added as an alias to `Rcpp`, so that we can type `using namespace Rcpp11 ;`
- variadic trailing arguments can now be used in `sapply` (#189)
- Logical vectors internally use the `Rboolean` type instead of `int`
- Added the syntax `x != NA` to test if something is not the missing value
- New sugar function `import_n` so that we can do `import_n( iterator, n )` instead of `import( iterator, iterator + n ) ;`
- New sugar function `enumerate` (#153)
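As a small illustration of the free-function `create` entry above, here is a minimal sketch (assuming an Rcpp11 compilation unit; `demo_create` is a hypothetical function name, and the type promotion is as described in the NEWS entry):

```cpp
#include <Rcpp11>
using namespace Rcpp11;

// create() used as a free function: mixing int and double arguments
// produces a lazy vector of the highest type (REALSXP here), so the
// result can be assigned to a NumericVector.
NumericVector demo_create() {
    NumericVector x = create(1, 2.0, 3);
    return x;
}
```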


(This article was first published on ** RStudio Blog**, and kindly contributed to R-bloggers)

We’re excited to announce a new release of Packrat, a tool for making R projects more isolated and reproducible by managing their package dependencies.

This release brings a number of exciting features to Packrat that significantly improve the user experience:

- **Automatic snapshots** ensure that new packages installed in your project library are automatically tracked by Packrat.
- **Bundle and share your projects** with `packrat::bundle()` and `packrat::unbundle()`, whether you want to freeze an analysis or exchange it for collaboration with colleagues.
- **Packrat mode** can now be turned on and off at will, allowing you to navigate between different Packrat projects in a single R session. Use `packrat::on()` to activate Packrat in the current directory, and `packrat::off()` to turn it off.
- **Local repositories** (i.e., directories containing R package sources) can now be specified for projects, allowing local source packages to be used in a Packrat project alongside CRAN, BioConductor and GitHub packages (see this and more with `?"packrat-options"`).
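For example, a minimal sketch of the bundle/unbundle workflow described above (the file paths here are hypothetical):

```r
library(packrat)

packrat::on()                                 # turn on packrat mode in this project
packrat::bundle(file = "~/myproject.tar.gz")  # freeze the project into a tarball

# ...later, on a collaborator's machine:
packrat::unbundle("~/myproject.tar.gz", where = "~/projects")
packrat::off()                                # leave packrat mode
```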

In addition, Packrat is now tightly integrated with the RStudio IDE, making it easier to manage project dependencies than ever. Download today’s RStudio IDE 0.98.978 release and try it out!

You can install the latest version of Packrat from GitHub with:

`devtools::install_github("rstudio/packrat")`

Packrat will be coming to CRAN soon as well.

If you try it, we’d love to get your feedback. Leave a comment here or post in the packrat-discuss Google group.


(This article was first published on ** QuantStrat TradeR » R**, and kindly contributed to R-bloggers)

So between variations of different strategies, for those who have yet to come across it, my IKTrading package has a function called `quandClean`, which exists to get and clean daily futures data from quandl.com. The exact process can be found in this post on the Revolution Analytics blog.

While some of Quandl's futures data is quoted in other currencies or has a very short history, I’ve compiled a data file that fetches futures data with a long history. There are price histories for ags, precious metals, forex, and more.

Here’s the code:

```
require(IKTrading)
currency('USD')
Sys.setenv(TZ="UTC")

t1 <- Sys.time()

if(!"CME_CL" %in% ls()) {
  #Energies
  CME_CL <- quandClean("CHRIS/CME_CL", start_date=from, end_date=to, verbose=verbose) #Crude
  CME_NG <- quandClean("CHRIS/CME_NG", start_date=from, end_date=to, verbose=verbose) #NatGas
  CME_HO <- quandClean("CHRIS/CME_HO", start_date=from, end_date=to, verbose=verbose) #HeatingOil
  CME_RB <- quandClean("CHRIS/CME_RB", start_date=from, end_date=to, verbose=verbose) #Gasoline
  ICE_B  <- quandClean("CHRIS/ICE_B",  start_date=from, end_date=to, verbose=verbose) #Brent
  ICE_G  <- quandClean("CHRIS/ICE_G",  start_date=from, end_date=to, verbose=verbose) #Gasoil

  #Grains
  CME_C  <- quandClean("CHRIS/CME_C",  start_date=from, end_date=to, verbose=verbose) #Chicago Corn
  CME_S  <- quandClean("CHRIS/CME_S",  start_date=from, end_date=to, verbose=verbose) #Chicago Soybeans
  CME_W  <- quandClean("CHRIS/CME_W",  start_date=from, end_date=to, verbose=verbose) #Chicago Wheat
  CME_SM <- quandClean("CHRIS/CME_SM", start_date=from, end_date=to, verbose=verbose) #Chicago Soybean Meal
  CME_KW <- quandClean("CHRIS/CME_KW", start_date=from, end_date=to, verbose=verbose) #Kansas City Wheat
  CME_BO <- quandClean("CHRIS/CME_BO", start_date=from, end_date=to, verbose=verbose) #Chicago Soybean Oil

  #Softs
  ICE_SB <- quandClean("CHRIS/ICE_SB", start_date=from, end_date=to, verbose=verbose) #Sugar
  ICE_KC <- quandClean("CHRIS/ICE_KC", start_date=from, end_date=to, verbose=verbose) #Coffee
  ICE_CC <- quandClean("CHRIS/ICE_CC", start_date=from, end_date=to, verbose=verbose) #Cocoa
  ICE_CT <- quandClean("CHRIS/ICE_CT", start_date=from, end_date=to, verbose=verbose) #Cotton

  #Other Ags
  CME_LC <- quandClean("CHRIS/CME_LC", start_date=from, end_date=to, verbose=verbose) #Live Cattle
  CME_LN <- quandClean("CHRIS/CME_LN", start_date=from, end_date=to, verbose=verbose) #Lean Hogs

  #Precious Metals
  CME_GC <- quandClean("CHRIS/CME_GC", start_date=from, end_date=to, verbose=verbose) #Gold
  CME_SI <- quandClean("CHRIS/CME_SI", start_date=from, end_date=to, verbose=verbose) #Silver
  CME_PL <- quandClean("CHRIS/CME_PL", start_date=from, end_date=to, verbose=verbose) #Platinum
  CME_PA <- quandClean("CHRIS/CME_PA", start_date=from, end_date=to, verbose=verbose) #Palladium

  #Base
  CME_HG <- quandClean("CHRIS/CME_HG", start_date=from, end_date=to, verbose=verbose) #Copper

  #Currencies
  CME_AD <- quandClean("CHRIS/CME_AD", start_date=from, end_date=to, verbose=verbose) #Ozzie
  CME_CD <- quandClean("CHRIS/CME_CD", start_date=from, end_date=to, verbose=verbose) #Loonie
  CME_SF <- quandClean("CHRIS/CME_SF", start_date=from, end_date=to, verbose=verbose) #Franc
  CME_EC <- quandClean("CHRIS/CME_EC", start_date=from, end_date=to, verbose=verbose) #Euro
  CME_BP <- quandClean("CHRIS/CME_BP", start_date=from, end_date=to, verbose=verbose) #Cable
  CME_JY <- quandClean("CHRIS/CME_JY", start_date=from, end_date=to, verbose=verbose) #Yen
  CME_NE <- quandClean("CHRIS/CME_NE", start_date=from, end_date=to, verbose=verbose) #Kiwi

  #Equities
  CME_ES <- quandClean("CHRIS/CME_ES", start_date=from, end_date=to, verbose=verbose) #Emini
  CME_MD <- quandClean("CHRIS/CME_MD", start_date=from, end_date=to, verbose=verbose) #Midcap 400
  CME_NQ <- quandClean("CHRIS/CME_NQ", start_date=from, end_date=to, verbose=verbose) #Nasdaq 100
  CME_TF <- quandClean("CHRIS/CME_TF", start_date=from, end_date=to, verbose=verbose) #Russell Smallcap
  CME_NK <- quandClean("CHRIS/CME_NK", start_date=from, end_date=to, verbose=verbose) #Nikkei

  #Dollar Index and Bonds/Rates
  ICE_DX <- quandClean("CHRIS/CME_DX", start_date=from, end_date=to, verbose=verbose) #Dixie
  #CME_FF <- quandClean("CHRIS/CME_FF", start_date=from, end_date=to, verbose=verbose) #30-day fed funds
  CME_ED <- quandClean("CHRIS/CME_ED", start_date=from, end_date=to, verbose=verbose) #3 Mo. Eurodollar/TED Spread
  CME_FV <- quandClean("CHRIS/CME_FV", start_date=from, end_date=to, verbose=verbose) #Five Year TNote
  CME_TY <- quandClean("CHRIS/CME_TY", start_date=from, end_date=to, verbose=verbose) #Ten Year Note
  CME_US <- quandClean("CHRIS/CME_US", start_date=from, end_date=to, verbose=verbose) #30 year bond
}

CMEinsts <- c("CL", "NG", "HO", "RB", "C", "S", "W", "SM", "KW", "BO",
              "LC", "LN", "GC", "SI", "PL", "PA", "HG", "AD", "CD", "SF",
              "EC", "BP", "JY", "NE", "ES", "MD", "NQ", "TF", "NK", #"FF",
              "ED", "FV", "TY", "US")
ICEinsts <- c("B", "G", "SB", "KC", "CC", "CT", "DX")

CME <- paste("CME", CMEinsts, sep="_")
ICE <- paste("ICE", ICEinsts, sep="_")
symbols <- c(CME, ICE)
stock(symbols, currency="USD", multiplier=1)

t2 <- Sys.time()
print(t2-t1)
```

Note that you need your own quandl authorization token. However, beyond that, this process takes around 5 minutes or so to complete, so, similarly to my demoData.R file, it runs based on whether or not CME_CL (that is, the price history for crude oil) is present in your working environment.

The from (“yyyy-mm-dd”), to (same), and verbose (TRUE or FALSE) variables are meant to be set in a demo file, so you’ll have to input them yourself. Beyond that, you simply source this file, and you’ll have a large amount of futures data on which to run trading strategies. They don’t necessarily even have to be quantstrat types of trading strategies, as these are simply xts objects. I commented out CME_FF because it is generally characterized by rare spikes rather than steady and consistent price movements.
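Concretely, a minimal driver script might look like the following (the sourced file name is hypothetical; pick your own date range):

```r
# Set the variables the script above expects, then source it.
from <- "1990-01-01"  # "yyyy-mm-dd"
to <- "2014-07-01"    # "yyyy-mm-dd"
verbose <- TRUE
source("quandlFutures.R")  # hypothetical name for the script above
```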

Granted, I cannot vouch that this data will be perfect (probably a long way from it, considering that quandl isn’t the greatest source of it), but it *is* free, so for anyone who wishes to do any backtesting on futures data, well, here you go. Also, I may edit which exact instruments I use in the future if there are continuing data issues.

Thanks for reading.


(This article was first published on ** Exegetic Analytics » R**, and kindly contributed to R-bloggers)

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an “average runner” myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of *all* runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you’re interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

The histograms below show graphically how the distribution of runners’ ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varies somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year… Since 1990 though there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

```
> head(results)
       key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980
```

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

```
> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                         median.age = median(age, na.rm = TRUE),
+                         mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729
```

The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

```
> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)
```

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

```
> summary(fit.1)

Call:
lm(formula = mean.age ~ year, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max
-1.3181 -0.5322 -0.0118  0.4971  1.9897

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924, Adjusted R-squared:  0.923
F-statistic: 801 on 1 and 66 DF,  p-value: <2e-16
```

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly larger for males than females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

```
> summary(fit.2)

Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)

Residuals:
   Min     1Q Median     3Q    Max
-1.103 -0.522  0.024  0.388  2.287

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937, Adjusted R-squared:  0.935
F-statistic: 481 on 2 and 65 DF,  p-value: <2e-16
```

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

```
> summary(fit.3)

Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1038 -0.5225  0.0259  0.3866  2.2885

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937, Adjusted R-squared:  0.935
F-statistic: 480 on 2 and 65 DF,  p-value: <2e-16
```

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data does not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

```
> summary(fit.4)

Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0730 -0.5127 -0.0492  0.4225  2.1273

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26
year:genderMale    0.0195     0.0168    1.17     0.25
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938, Adjusted R-squared:  0.935
F-statistic: 322 on 3 and 64 DF,  p-value: <2e-16
```

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

```
> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
  Res.Df  RSS Df Sum of Sq     F  Pr(>F)
1     66 36.2
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02
4     64 29.5  1      0.62  1.36 0.24833
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% uncertainty interval. This was used to generate the second set of plots.

```
fit.data <- data.frame(year = rep(1980:2020, each = 2),
                       gender = c("Female", "Male"))
fit.data <- cbind(fit.data,
                  predict(fit.2, fit.data, level = 0.95, interval = "prediction"))
```
