Maintaining a database of price files in R

[This article was first published on The R Trader » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Doing quantitative research implies a lot of data crunching and one needs clean and reliable data to achieve this. What is really needed is clean data that is easily accessible (even without an internet connection). The most efficient way to do this for me has been to maintain a set of csv files. Obviously this process can be handled in many ways but I found very efficient and simple overtime to maintain a directory where I store and update csv files. I have one csv file per instrument and each file is named after the instrument it contains. The reason I do so is twofold: First, I don’t want to download (price) data from Yahoo, Google etc… every time I want to test a new idea but more importantly once I identified and fixed a problem, I don’t want to have to do it again the next time I need the same instrument. Simple yet very efficient so far. The process is summarized in the chart below.

processFlow

In everything that follows, I assume that data is coming from Yahoo. The code will have to be amended for data from Google, Quandl etc… In addition I present the process of updating daily price data. The setup will be different for higher frequency data and other type of dataset (i.e. different from prices).

1 – Initial data downloading (listOfInstruments.R & historicalData.R)

The file listOfInstruments.R is a file containing only the list of all instruments.

##########################################
## List of securities (Yahoo tickers) 
## [email protected] - Nov. 2015
##########################################
theInstruments = c("^GSPC",
                   "SPY",
                   "QQQ",
                   "DDM",
                   "EFA",
                   "EEM",
                   "EWJ")

If an instrument isn’t part of my list (i.e. no csv file in my data folder) or if you do it for the very first time you have to download the initial historical data set. The example below downloads a set of ETFs daily prices from Yahoo Finance back to January 2000 and store the data in a csv file.

##########################################
## Daily prices from Yahoo 
## [email protected] - Nov. 2015
##########################################
library(quantmod)

startDate = "2000-01-01"
thePath = "D:\daily\data\"
source(paste(thePath,"code\listOfInstruments.r",sep=""))

for (ii in theInstruments){
 print(ii)
 data = getSymbols(Symbols = ii, 
                   src = "yahoo", 
                   from = startDate, 
                   auto.assign = FALSE)
 colnames(data) = c("open","high","low","close","volume","adj.")
 write.zoo(data,paste(thePath,ii,".csv",sep=""),sep=",",row.names=FALSE)
}

2 – Update existing data (updateData.R)

The below code starts from existing files in the dedicated folder and updates all of them one after the other. I usually run this process everyday except when I’m on holiday. To add a new instrument, simply run step 1 above for this instrument alone.

##########################################
## Update data files 
## [email protected] - Nov. 2015
##########################################
library(quantmod)

lookback = 60
startDate = Sys.Date() - lookback
thePath = "D:\daily\data\"
theFiles = list.files(path=thePath,pattern=".csv")

for (ii in theFiles){
 data = read.csv(paste(thePath,ii,sep=""))
 data = xts(data[,c("open","high","low","close","volume","adj.")],
 order.by = as.Date(data[,"Index"],format="%Y-%m-%d"))
 lastHistoricalDate = index(data[nrow(data),])
 
 recent = getSymbols(Symbols = substr(ii,1,nchar(ii)-4), 
                      src = "yahoo", 
                      from = statDate, 
                      auto.assign = FALSE)
 colnames(recent) = c("open","high","low","close","volume","adj.")

 pos = match(as.Date(lastHistoricalDate,format="%Y-%m-%d"),index(recent))
 
 if (!is.na(pos)){ 
  if (pos == nrow(recent))
   print("File already up-to-date")
 
  if (pos < nrow(recent)){
   dt = NULL
   dt = rbind(data,recent[(pos+1):nrow(recent),])
   write.zoo(dt,paste(thePath,ii,sep=""),sep=",",row.names=FALSE) 
  }
 }
 
 if (is.na(pos))
  print("Error: dates do not match")

3 – Create a batch file (updateDailyPrices.bat)

Another important part of the job is creating a batch file that automates the updating process above (I’m a Windows user). This avoids opening R/RStudio and run the code from there. The code below is placed on a .bat file (the path has to be amended with the reader’s setup). Note that I added an output file (updateLog.txt) to track the execution.

cd ../..
C:progra~1RR-3.1.2binR.exe CMD BATCH --vanilla --slave "D:dailydatacodeupdateHistoricalData.R" "D:dailydatacodeupdateLog.txt"

The process above is extremely simple because it only describes how to update daily price data.  I’ve been using this for a while and it has been working very smoothly for me so far. For more advanced data and/or higher frequencies, things can get much trickier.

As usual any comments welcome

To leave a comment for the author, please follow the link and comment on their blog: The R Trader » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)