Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When working with intraday data, analysts are often facing a large dataset problem. R is well equipped to deal with this but the standard approach has to be modified in some ways.

Large dataset means different things to different people. I’m talking here about a dataset of less than 10 columns and 2 to 5 million rows. Obviously this would be considered unmanageable for Excel users and tiny for people working with Hadoop/MapReduce on a regular basis.

Storing large datasets can be done in various ways. From personal experience and the kind of size I dealt with so far, CSV files are very efficient. There are easy to manipulate and work across all platforms.

Data pre processing

Obviously this largely depends on the analysis to be done but a few generic principles can be isolated. Most of the time converting object from one type to another is very time consuming. Thinking beforehand about what will be done with each column of a dataset is the best way to build an efficient R object. Regarding intraday data, what I do most of the time, is converting everything to numeric if possible. It’s easier to manipulate and makes most analysis run quicker. The typical dataset I deal with looks like the below: roughly speaking, standard open/high/low/close data with a few extra columns over several million rows.

> head(df)
cross     date  time   open   high    low  close volume tickCount
1 EURUSD 20090715 17.14 1.4096 1.4105 1.4096 1.4104      0       310
2 EURUSD 20090715 17.15 1.4104 1.4106 1.4103 1.4104      0       247
3 EURUSD 20090715 17.16 1.4104 1.4107 1.4103 1.4107      0       369
4 EURUSD 20090715 17.17 1.4106 1.4107 1.4103 1.4104      0       439
5 EURUSD 20090715 17.18 1.4104 1.4106 1.4102 1.4105      0       301
6 EURUSD 20090715 17.19 1.4105 1.4108 1.4103 1.4105      0       500

> str(df)
'data.frame':	3002676 obs. of  9 variables:
$cross : Factor w/ 3 levels "EURUSD","GBPUSD",..: 1 1 1 1 1 1 1 1 1 1 ...$ date     : int  20090715 20090715 20090715 20090715 20090715 20090715 20090715 20090715 20090715 20090715 ...
$time : num 17.1 17.1 17.2 17.2 17.2 ...$ open     : num  1.41 1.41 1.41 1.41 1.41 ...
$high : num 1.41 1.41 1.41 1.41 1.41 ...$ low      : num  1.41 1.41 1.41 1.41 1.41 ...
$close : num 1.41 1.41 1.41 1.41 1.41 ...$ volume   : num  0 0 0 0 0 0 0 0 0 0 ...
\$ tickCount: int  310 247 369 439 301 500 426 476 630 706 ...

Accessing data

Once the data pre-processing is completed, the dataset should be saved as an .RData object. Assuming analysts want to reuse this dataset for further analysis this is a more efficient way to access it in the future. More importantly if the dataset has to be accessed simultaneously by different computers/analysts (i.e. distributed analysis) this is a far better practice. The test below compares the time it takes to read an external CSV file vs. the time taken to load the equivalent .RData object.

> system.time(df <- read.csv(paste(inputPath,"someCrosses.csv",sep=""),sep=","))
user  system elapsed
19.44    0.34   19.84
user  system elapsed
4.18    0.01    4.20

The numbers speak for themselves I think.

Manipulating data

Once the dataset is in R, we want the object to be manipulated more efficiently than a standard data.frame. The data.table package is the perfect tool for the task. In a nutshell data.table allows “fast subset, fast grouping and fast merge in a short and flexible syntax”. The code below converts the data.frame above into a data.table and assigns a key (cross) to this table for quicker access.

dt <- data.table(df)
setkey(dt,cross)

I also noticed that data.table objects tend to have smaller size than the equivalent data.frame when the dataset mixes numeric and other types. I think this is mentioned somewhere in the data.table documentation.

> object.size(dt)
180162992 bytes
> object.size(df)
348312568 bytes

Let’s now compare how fast elements in this table can be accessed compared to a data.frame

> system.time(oneCrossDt <- dt["EURUSD"])
user  system elapsed
0.10    0.02    0.11
> system.time(oneCrossDf <- df[df[,"cross"] == "EURUSD",])
user  system elapsed
0.53    0.03    0.56

> system.time(aa <- dt[,mean(close),by=cross])
user  system elapsed
0.05    0.00    0.05
> system.time(bb <- tapply(df[,"close"],df[,"cross"],mean))
user  system elapsed
0.94    0.00    0.94

Concluding remarks

What I presented in this article is by no means exhaustive or perfectly optimal, it’s simply based on experience and made the analysis I conducted so far pretty efficient. I’ll be very happy to exchange on the topic if someone is interested?

One thing that I haven’t tried yet is the plyr package. The general idea behind the package is the so called SAC methodology (Split–Apply-Combine). From a very practical point of view, a lot of the code I wrote would have to be amended to fit into that framework hence my reluctance to do it so far.

All the tests above have been conducted on my office desktop running R 2.14.1 on Windows 7 with 4 GB RAM.

Lastely thanks to Pretty R at inside-R.org