Package RghcnV3

Posted on June 22, 2011 by Steven Mosher in R bloggers, Uncategorized | 0 Comments

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am down to the last strokes of creating a package for downloading, importing and analyzing GHCN (Global Historical Climate Network0 Version 3. There were a few things holding me up, but I’ve manged to handle all of them except the last one. Let’s spend a little time looking at the package process so you can get an idea of everything that has to be done. I’m not an expert on this obviously, but this may help those who have issues.

If you take a look at my 10 step process for building in windows then you are familiar with the process. Basically you create a folder that contains all the resources ( R scripts and data ) that you want in your package. You can then call ‘package.skeleton’. That function will create a set of files that you need to edit.

EDITING THESE FILES CORRECTLY IS THE WHOLE CHALLENGE.

The files created for the manual are help files with a .Rd extension. Over and over again I found myself going back to the instructions for editing these files. If I had to build one tool for package developers it would be a better tool for editing these files. Maybe the guys at RStudio will do something. Online help during the editing process would be a big help. and WYSIWYG would be fantastic ( and hard ) let’s face it writing files in a mark up language is no joy!

The first problem I hit had to do with warnings over data objects. A quick question to the R-help list, got me the answer I needed. ASK THAT QUESTION on R-devel! doh! Also, I got another suggestion. GIYF, specifically use google HERE: http://tolstoy.newcastle.edu.au/R/

Problem solved and the warnings were removed. CRAN will not accept a package with warnings. The documentation and the code have to cross check and mine didn’t.

The next problem was I wanted to include demo code! I decided that rather than write a vignette I would do demo code. Why? Well, Vignettes are usually executed in during the CHECK process ( unless you set eval =FALSE). My Vignette would require the download of big data, so I figured that demo() is a better way to go.

I will probably rely heavily on using demo() to show you guys how the package works. So, I wrote some quick demo code complete with VERY CRUDE graphics. The point of the demo is just to show you how things work. I’ll do pretty later.

There are three demo programs. One downloads masks and displays them. One downloads all V3 files. and the last one downloads a V3 files and creates anomalies

The demo StationCount looks like this

meanAdj     <- downloadV3(url=V3.MEAN.ADJ.URL)
MeanAInv    <- readInventory(filename=meanAdj$InventoryFile)
meanAdata   <- readV3Data(filename=filename=meanAdj$DataFilename)
meanAnomaly <- createAnomaly(meanAdata)

DATA  <- intersectInvAnomalies(inv=MeanAInv,meanAnomaly)

That is it. Download the data from NOAA, selecting the Tmean files, adjusted version ( there are 6 different tar.gz). That function, downloads the data and decompresses the file to sub directories. Since V3 updates nearly every day your underlying file names change. When you download I pass back the new file names for the inventory file and the data file. so you get a fresh file and a handle to the files. Next you take those handles and use them to read the actual data. Inventory data is read in and temperature data is read in. Next we create Anomalies. The default is a base period of 1961-1990, which you can change of course, but I wouldnt.

When you create Anomalies you lose some of the stations in the inventory. There are 7280 stations is the dataset, But only about 5000 pass the screening test. the screening test is this: they have to have enough data in the base period ( 1961-90) So, after “createAnomaly” that dataset has one list of stations, while inventory has a superset.

You need to reconcile the two of them. That is what “intersectInvAnomaly” does. It makes sure that the inventory you are working with matches the datafile.

So, if want to only look at rural stations, for example, you subset the Inventory, then you “intersect” that inventory with the anomaly data.

DATA  <- intersectInvAnomalies(inv=MeanAInv,meanAnomaly)
plot(x=as.numeric(colnames(DATA$Anomalies)),y=DATA$Anomalies[3,],type="l")
plot(x=as.numeric(colnames(DATA$Anomalies)),y=colSums(!is.na(DATA$Anomalies)),type="l")

After you “intersect” these items you get a data structure back. DATA$Inventory and DATA$Anomalies. The stations in one will match the stations in the other. Here is why that is important. If at any point you want to change the inventory ( by looking only at airports for example) or if you want to eliminate data in your temperature dataset ( like only picking long records) you have to insure that in the END, the stations in the temperature file match the stations in the inventory.

The “plot” example here is mainly to show you how DATA$Anomalies is constructed: the column names contain TIME, every row in that matrix is a station record. the second plot example just shows you an easy way to get the Count of stations reporting in any given month.

The last bits before submissions. All the tests pass.The document warnings are all removed. The demo code is included and it works! WHAT”S LEFT?

1. figuring out if I have to do a separate submission for 32bit and 64 bit.

2. building for final submission

If folks want I will distribute the source from here..