CHCN: Canadian Historical Climate Network

August 4, 2011

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

A reader asked a question about data from Environment Canada. He wanted to know if that data could somehow be integrated into the RGhcnV3 package. That turned out to be a bit more challenging than I expected. In short order I’d found a couple of other people who had done something similar. DrJ of course was in the house with his scraper. That scraper relies on another scraper found here. That let me know it was possible, and an email from Environment Canada let me know it was acceptable.

So, it’s OK to write a scraper and possible to write one. My goal was to do this in R. I really enjoy DrJ’s code and his other work, but I had to try this on my own. The folks at Environment Canada suggested that scraping was the best option as their SOAP and REST mechanism wasn’t entirely supported. That was fine with me as SOAP in R wasn’t a good option. I won’t go into the reasons. So my plan of attack was to leverage a small piece of DrJ’s work and do the rest in R.

The key master file is created by one of DrJ’s scrapers and can be downloaded here as a csv. At some point I will go duplicate that in R, but for now I rely on this file. The file is the master list of all stations. It contains the two bits of information we need to scrape data: the station webId and the FIRST YEAR of monthly data. So there is a function to get that csv file, cleverly named downloadMaster(). The next step is to read that file and select only those stations that report monthly data: writeMonthlyStations().

The next step is to scrape the data. I looked at several ways of doing this. First I tried “on the fly” scraping. That means making a request and then parsing the file into a friendly R object. This was some nasty code since the csv file is in an unfriendly format: metadata in the first 17 lines and then 25 columns of climate data. It took some fiddling but I was able to manage it. It involved doing two reads on the connection: the first read to get just the metadata and the second read to skip the metadata and read the data. That function worked if the server cooperated. Alas, the server had a habit of crapping out and dumping the connection. Doing error trapping in R with tryCatch() wasn’t in the cards, but I suppose in a future version I will do that.

So I opted for brute force: download every file. As it turns out that is 7676 csv files. The good news is that with a downloaded csv file I can work at my leisure and not try to debug things while hoping that the connection will time out so I can test the code.
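The two-read idea can be sketched against a local file. This is only an illustration, not the package code: readStationCsv() is a hypothetical name, the file contents are synthetic stand-ins, and it assumes the header row sits directly after the 17 metadata lines.

```r
# Hypothetical helper, not part of the package.
# Assumes the first `n_meta` lines are metadata and the next line is the header.
readStationCsv <- function(path, n_meta = 17) {
  # First read: grab only the metadata lines.
  meta_lines <- readLines(path, n = n_meta, warn = FALSE)
  # Second read: skip the metadata and parse the rest as a table.
  data <- read.csv(path, skip = n_meta, header = TRUE,
                   stringsAsFactors = FALSE, check.names = FALSE)
  list(meta = meta_lines, data = data)
}

# Synthetic stand-in for a downloaded station file (not real output).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  paste("meta line", 1:17),
  "Year,Month,MeanTemp",
  "1990,1,-15.2",
  "1990,2,-12.8"
), tmp)

station <- readStationCsv(tmp)
```

Working from a file on disk rather than a live connection is what makes the two reads painless: the file can be opened as many times as needed.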

Consequently we have a function scrapeToCsv() which takes the list of stations and makes an http request for every one. That works pretty slick and takes a long time. Of course, that process is also prone to server timeouts. When it balks we have a function to clean up after that: getMissingScrape() looks at your download directory and the files there, looks at the list of stations you wanted to scrape, and figures out what is missing. Calling scrapeToCsv(get = getMissingScrape()) will restart a scrape and chug along. When the scrapes are finished you have to do one last check: getEmptyCsv(). There are times when the connection is made and the local file name is written, but no data is transmitted, so you get zero-sized files. No problem, we detect that and rescrape: scrapeToCsv(get = getEmptyCsv()). Clever folks can just write a while loop that exits on the condition that all files are there and non-empty.
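For the loop-until-done idea, here is a minimal sketch. None of these names come from the package: findMissing(), findEmpty(), and scrapeUntilDone() are hypothetical, `fetch` is a stand-in for the real download step, and the "<webId>.csv" file naming is an assumption. A round limit replaces an open-ended while loop so a dead server cannot hang the session.

```r
# Hypothetical helpers, not the package's own functions.
# Assumes each station is saved as "<webId>.csv" in `dir`.
findMissing <- function(ids, dir) {
  have <- sub("\\.csv$", "", list.files(dir, pattern = "\\.csv$"))
  setdiff(as.character(ids), have)
}

findEmpty <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  sub("\\.csv$", "", basename(files[file.size(files) == 0]))
}

# `fetch(id, dest)` stands in for the download of one station file.
# Errors are trapped so one bad request does not stop the whole pass.
scrapeUntilDone <- function(ids, dir, fetch, max_rounds = 5) {
  for (round in seq_len(max_rounds)) {
    todo <- union(findMissing(ids, dir), findEmpty(dir))
    if (length(todo) == 0) return(TRUE)
    for (id in todo) {
      dest <- file.path(dir, paste0(id, ".csv"))
      tryCatch(fetch(id, dest), error = function(e) NULL)
    }
  }
  length(union(findMissing(ids, dir), findEmpty(dir))) == 0
}

# Demo with a fake fetch that writes a zero-byte file on its first try.
dir <- tempfile("scrape")
dir.create(dir)
attempts <- new.env()
flakyFetch <- function(id, dest) {
  n <- if (is.null(attempts[[id]])) 0 else attempts[[id]]
  attempts[[id]] <- n + 1
  if (n == 0) {
    file.create(dest)
  } else {
    writeLines("Year,Month,MeanTemp", dest)
  }
}
ok <- scrapeUntilDone(c("101", "102"), dir, flakyFetch)
```

In the demo the first round leaves two empty files, the second round refills them, and the third round finds nothing left to do.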

After downloading 7676 files you then create an inventory with metadata: createInventory(). This includes the station name, lat, lon, province, and various identifiers (WMO etc.).

      "Id" "Lat" "Lon" "Altitude" "Name" "Province" "ClimateId" "WMO" "TCid"
  99111111 "49.91" "-99.95" "409.40" "BRANDON.A" "MANITOBA" "5010480" "71140" "YBR"
  99111112 "51.10" "-100.05" "304.50" "DAUPHIN.A" "MANITOBA" "5040680" "" "PDH"

Then you create a huge master data file: createDataset(). This has all the data (temperatures, rain, etc.). Next you can extract just the mean temperature with asChcn(), which creates a GHCN-like data structure of temperature data.
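As a rough picture of what that last step involves, the sketch below reshapes long monthly records into one row per station and year with twelve monthly columns, which is how GHCN monthly files are laid out. It is not asChcn() itself: toGhcnLike() is a hypothetical name, and the input columns (Id, Year, Month, MeanTemp) are assumed for illustration.

```r
# Hypothetical helper, not the package's asChcn().
# Assumes `long` has columns Id, Year, Month (1-12) and MeanTemp.
toGhcnLike <- function(long) {
  # One output row per station-year, in a stable order.
  keys <- unique(long[, c("Id", "Year")])
  keys <- keys[order(keys$Id, keys$Year), ]
  # Start with all months missing, then fill in what was reported.
  out <- matrix(NA_real_, nrow = nrow(keys), ncol = 12,
                dimnames = list(NULL, month.abb))
  row <- match(paste(long$Id, long$Year), paste(keys$Id, keys$Year))
  out[cbind(row, long$Month)] <- long$MeanTemp
  data.frame(Id = keys$Id, Year = keys$Year, out, row.names = NULL)
}

# Synthetic records for illustration only.
long <- data.frame(
  Id       = c(99111111, 99111111, 99111112),
  Year     = c(1990, 1990, 1991),
  Month    = c(1, 2, 12),
  MeanTemp = c(-15.2, -12.8, -9.5)
)
wide <- toGhcnLike(long)
```

Months with no report stay NA, so gaps in a station's record remain visible in the wide table.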

Version 1.1 is done and is being tested. It should hit CRAN when some outside users report back.
