Nick Stoke’s Improvements

Posted on July 25, 2011 by Steven Mosher in R bloggers | 0 Comments

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Fast on the heels of getting RomanM’s code up and running in RghcnV3, Nick Stokes whipped out a version of his approach which he covered on his blog here:

We exchanged code and few mails thrashing through details and I’m now in a position to start the integration work of his approach into Rghcnv3. In addition to providing a new method Nick also contributed some code to make the import of data considerably faster. It’s a good lesson in some of the trade offs between code clarity and maintainability and raw speed. The function in question was readV3data(). My approach with that function was to build something that was maintainable, so the function relies on a set of constants or FILE.PARAMETERS. While GHCNV3 was in beta I wanted to build something that could be changed easily by changing constants in one place rather than by editing code. So the function relies on read.fwf() and that function is controlled by constants in parameters. With GhcnV3 NCDC has added 34 extra columns of data. That drives the read time up to 8 minutes. By tweaking the “widths’ parameter I was able to get this down to 2.5 minutes. Nick’s approach BLEW THE DOORS off that tweak. His approach comes at a small price. The price is code clarity and maintainability. After spending some time working back and forth we were able to fix a bug and make the code more clear and robust. If NCDC decides to change format, then I’ll have to change code and not just constants. I don’t think that is likely. So I choose speed here. How fast was Nick’s approach? 30-40 seconds. The approach depends upon using readLine() and then “manually” editing the lines using substr() to get the file into a format that can be read by read.table(). That requires writing a temporary file. I think that step may also be unnecessary, but I’ll leave that for another day.

Next up was integrating the code that Nick produced for doing his method. If you’re unfamilar with his method please have a look at the link above . The unique thing about Nick’s approach, from my software perspective, was his data structure. Most people who work at this use a 2D data structure of some sort. You have time series. You have stations. That has to mean you should use a 2D structure ! So all of RghcnV3 looks to get the data into the right 2D structure so that various functions can use them. Nick does something entirely different and it took me a while to wrap my head around it. Since the problem was bigger than my head that took some time and plodding. But the result has some wonderful properties for data manipulation which I will cover later.

The 3D data structure is an array. The dimensions are Stations, Months, Years. Picture a cube. The face of the cube has stations on the left, months across the top, and is years deep. That’s a really nice arrangement for some of the things we want to do with station data. Below find the code that takes a 2D v3 format ( 14 columns, with a row per station) and creates a 3D structure. Most importantly, this code will insert missing years into the data, a function that v3ToZoo() and v3ToMts() perform. That opens a new possibility that I’ll show at the end.

Nicks Code, adapted and formatted to be more in line with R style guides:

v3ToArray <- function(v3Data){ # author Nick Stokes # Steve Mosher: rewrite for style, interface, error checking and comments # check the input data and set up parameters for processing if (!isV3(v3Data)) stop("v3Data must be a v3 object") stations <- nrow(v3Data) # the first year gets an offset for computing some indices below start <- min(v3Data$Year) - 1 end <- max(v3Data$Year) # Inside a v3 format every station takes up a different number # of rows. we are going to process those rows in "chunks" # station index is a vector of where different stations start and stop. # a subset command could be used inside the loop but this quick stationIndex <- c(0, which(diff(v3Data$Id) > 0), stations) stationCount <- length(stationIndex) - 1 # the array is stations by months by number of years x <- array(NA,c(stationCount,12, end - start)) # I add dimnames because I like to use them to identify what is what # also, I can now easily reconcile a temperature data file with an inventory dimnames(x) <- list(unique(v3Data$Id),month.abb,min(v3Data$Year):max(v3Data$Year)) # the tricky part. You grab a stations work of data and insert it into # the 3D array. for(i in 1:stationCount){ # loop over stations # this gets you the start row and end row for a station records <- c(stationIndex[i]+1, stationIndex[i+1]) iy <- as.matrix(v3Data[records[1]:records[2], ]) oy <- (iy[ ,2] > start) & (iy[ ,2] <= end) if(sum(oy) < 2 ) next iz <- iy[oy, ] x[i, , iz[ ,2] - start] <- t(iz[ ,3:14]) } return(x) }

And now for the real treat. Because we’ve increased the speed of readV3data(), we can actually create Nicks data structure when we read the data in. So readv3Data() now has an output format

readV3Data(filename, output = c(“V3″, “ARRAY”), Parameters = FILE.PARAMETERS)

That means you can read the data into the program in about 10 seconds using the “V3″ output format, or select the ARRAY output format and be all set to call Nick’s method with the data output by the red function. Basically, v3ToArray() gets called immediately after the read.

I’ve got one more test to run on this code and then it goes into version 1.6. version 1.5 is on CRAN. Version 1.6 will have Roman’s Method, a demo of romans method and Nick Stokes faster readv3Data(). The rest of Nick’s code is slated for 1.7.

PASSED.