CHCN: Canadian Historical Climate Network

Posted on August 4, 2011 by Steven Mosher in R bloggers

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers].

A reader asked a question about data from Environment Canada. He wanted to know if that data could somehow be integrated into the RGhcnV3 package. That turned out to be a bit more challenging than I expected. In short order I’d found a couple of other people who had done something similar. DrJ of course was in the house with his scraper. That scraper relies on another scraper found here. That let me know it was possible, and an email from Environment Canada let me know it was acceptable.

So, it’s OK to write a scraper and possible to write one. My goal was to do this in R. I really enjoy DrJ’s code and his other work, but I had to try this on my own. The folks at Environment Canada suggested that scraping was the best option as their SOAP and REST mechanism wasn’t entirely supported. That was fine with me as SOAP in R wasn’t a good option. I won’t go into the reasons. So my plan of attack was to leverage a small piece of DrJ’s work and do the rest in R.

The key master file is created by one of DrJ’s scrapers and can be downloaded here as a csv. At some point I will duplicate that in R, but for now I rely on this file. It lists all stations and contains the two bits of information we need to scrape data: the station webId and the FIRST YEAR of monthly data. So there is a function to get that csv file, cleverly named downloadMaster(). The next step is to read that file and select only those stations that report monthly data: writeMonthlyStations().

The next step is to scrape the data, and I looked at several ways of doing this. First I tried “on the fly” scraping. That means making a request and then parsing the file into a friendly R object. This was some nasty code since the csv file is in an unfriendly format: metadata in the first 17 lines and then 25 columns of climate data. It took some fiddling but I was able to manage it. It involved doing two reads on the connection: the first read to just get the metadata and the second read to skip the metadata and read the data. That function worked if the server cooperated. Alas, the server had a habit of crapping out and dumping the connection. Doing error trapping in R with tryCatch() wasn’t in the cards, but I suppose in a future version I will do that.

So I opted for brute force: download every file. As it turns out that is 7676 csv files. The good news is that with a downloaded CSV file I can work at my leisure and not try to debug things while hoping that the connection will time out so I can test the code.
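For the curious, the two-read idea looks roughly like this when applied to a file that is already on disk. This is only a sketch, not the package code: readStationCsv() is a made-up name, and the only thing taken from the real files is the 17-line metadata block described above. Everything else about the layout is an assumption.

```r
# Hypothetical helper, not part of the package: split one downloaded
# station file into its metadata block and its data table.
# nMeta = 17 follows the description in the post; adjust if the layout differs.
readStationCsv <- function(path, nMeta = 17) {
  # First read: grab only the metadata lines as raw text.
  meta <- readLines(path, n = nMeta, warn = FALSE)
  # Second read: skip the metadata and parse the rest as a table.
  data <- read.csv(path, skip = nMeta, header = TRUE,
                   stringsAsFactors = FALSE, check.names = FALSE)
  list(meta = meta, data = data)
}
```

Working from a local file means both reads can be repeated as often as needed while the parsing gets sorted out, which is the whole point of downloading first.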

Consequently we have a function scrapeToCsv() which takes the list of stations and makes an HTTP request for every one. That works pretty slick, though it takes a long time. Of course, that process is also prone to server timeouts. When it balks we have a function to clean up after that: getMissingScrape() looks at your download directory and the files there, looks at the list of stations you wanted to scrape, and figures out what is missing. Calling scrapeToCsv(get = getMissingScrape()) will restart a scrape and chug along. When the scrapes are finished you have to do one last check: getEmptyCsv(). There are times when the connection is made and the local file name is written, but no data is transmitted, so you get zero-sized files. No problem, we detect that and rescrape: scrapeToCsv(get = getEmptyCsv()). Clever folks can just write a while loop that exits on the condition that all files are there and non-empty.
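Here is one way that retry loop could be sketched. None of these functions are the package's own: findMissing(), findEmpty(), pendingIds() and scrapeUntilDone() are illustrative stand-ins, the "<webId>.csv" file naming is an assumption, and fetch is whatever function you use to download a single station. I have capped the number of passes rather than using an open-ended while loop, so a permanently dead station cannot keep it spinning forever.

```r
# Illustrative stand-ins, not the package functions. Assumes each station
# is saved as "<webId>.csv" in the download directory.
findMissing <- function(ids, dir) {
  have <- sub("\\.csv$", "", list.files(dir, pattern = "\\.csv$"))
  setdiff(as.character(ids), have)
}

findEmpty <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  sizes <- file.size(files)
  sub("\\.csv$", "", basename(files[!is.na(sizes) & sizes == 0]))
}

# Stations that still need a (re)scrape: missing, or present but zero-sized.
pendingIds <- function(ids, dir) {
  union(findMissing(ids, dir),
        intersect(as.character(ids), findEmpty(dir)))
}

# fetch is any function(id, destfile) that downloads one station.
# Returns TRUE if every file is present and non-empty when the loop ends.
scrapeUntilDone <- function(ids, dir, fetch, maxPasses = 5) {
  for (pass in seq_len(maxPasses)) {
    todo <- pendingIds(ids, dir)
    if (length(todo) == 0) return(TRUE)
    for (id in todo) {
      dest <- file.path(dir, paste0(id, ".csv"))
      # A failed request is swallowed here; the next pass picks it up again.
      tryCatch(fetch(id, dest), error = function(e) NULL)
    }
  }
  length(pendingIds(ids, dir)) == 0
}
```

The tryCatch() around each request is the minimum needed to keep one dropped connection from killing the whole run; the missing and empty checks do the real bookkeeping.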

After downloading 7676 files you then create an inventory with metadata: createInventory(). This includes the station name, lat, lon, province, and various identifiers (WMO etc.).

      "Id" "Lat" "Lon" "Altitude" "Name" "Province" "ClimateId" "WMO" "TCid"
  99111111 "49.91" "-99.95" "409.40" "BRANDON.A" "MANITOBA" "5010480" "71140" "YBR"
  99111112 "51.10" "-100.05" "304.50" "DAUPHIN.A" "MANITOBA" "5040680" "" "PDH"

Then you create a huge master datafile: createDataset(). This has all the data (temperatures, rain etc.). Next you can extract just the mean temperature with asChcn(), which creates a GHCN-like data structure of temperature data.
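To give a feel for what “GHCN-like” means here: GHCN monthly files carry one record per station and year with twelve monthly values. The sketch below reshapes a long table into that shape. It is not how asChcn() is implemented; toGhcnLike() is a made-up name, and the input column names (Id, Year, Month, MeanTemp) are assumptions chosen for the example.

```r
# Illustrative only: reshape long station-month records into one row per
# station-year with twelve monthly columns. Input column names are assumed.
toGhcnLike <- function(long) {
  # One output row per distinct station-year, sorted by station then year.
  keys <- unique(long[, c("Id", "Year")])
  keys <- keys[order(keys$Id, keys$Year), ]
  out <- matrix(NA_real_, nrow = nrow(keys), ncol = 12,
                dimnames = list(NULL, month.abb))
  # Locate each observation's row, then fill its month column.
  row <- match(paste(long$Id, long$Year), paste(keys$Id, keys$Year))
  out[cbind(row, long$Month)] <- long$MeanTemp
  cbind(Id = keys$Id, Year = keys$Year, out)
}
```

Months with no observation simply stay NA, so gaps in a station's record remain visible in the result.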

Version 1.1 is done and is being tested. It should hit CRAN when some outside users report back.