# CHCN: Canadian Historical Climate Network

August 4, 2011

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

A reader asked a question about data from Environment Canada. He wanted to know if that data could somehow be integrated into the RGhcnV3 package. That turned out to be a bit more challenging than I expected. In short order I'd found a couple of other people who had done something similar. DrJ of course was in the house with his scraper. That scraper relies on another scraper found here. That let me know it was possible, and an email from Environment Canada let me know it was acceptable.

So, it's OK to write a scraper and possible to write one. My goal was to do this in R. I really enjoy DrJ's code and his other work, but I had to try this on my own. The folks at Environment Canada suggested that scraping was the best option, as their SOAP and REST mechanisms weren't fully supported. That was fine with me, as SOAP in R wasn't a good option. I won't go into the reasons. So my plan of attack was to leverage a small piece of DrJ's work and do the rest in R.

Consequently we have a function scrapeToCsv() which takes the list of stations and makes an HTTP request for every one. That works pretty slickly, but it takes a long time, and the process is prone to server timeouts. When it balks, we have a function to clean up: getMissingScrape() looks at the files in your download directory, compares them with the list of stations you wanted to scrape, and figures out what is missing. Calling scrapeToCsv(get = getMissingScrape()) will restart the scrape and chug along. When the scrapes are finished you have one last check to do: getEmptyCsv(). There are times when the connection is made and the local file name is written, but no data is transmitted, so you get zero-sized files. No problem, we detect that and rescrape: scrapeToCsv(get = getEmptyCsv()). Clever folks can just write a while loop that exits on the condition that all files are present and non-empty.
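Piecing together the functions just described, the "clever folks" loop might look something like this. This is only a sketch: it assumes getMissingScrape() and getEmptyCsv() return the lists of stations to (re)fetch and that scrapeToCsv() accepts them via its get argument, as described above; exact signatures may differ in the package.

    # sketch of a scrape-until-complete loop, assuming the CHCN
    # functions behave as described in the text above
    library(CHCN)

    repeat {
      missing <- getMissingScrape()   # stations with no local csv yet
      empty   <- getEmptyCsv()        # zero-sized files from dropped connections
      if (length(missing) == 0 && length(empty) == 0) break
      if (length(missing) > 0) scrapeToCsv(get = missing)  # fetch what's missing
      if (length(empty)   > 0) scrapeToCsv(get = empty)    # rescrape empty files
    }

In practice you may want a retry cap or a Sys.sleep() between passes so a persistently down server doesn't spin the loop forever.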

After downloading 7676 files you then create an inventory with metadata: createInventory(). This includes the station name, lat, lon, province, and various identifiers (WMO, etc.):

             "Id"  "Lat"   "Lon"     "Altitude" "Name"      "Province"  "ClimateId" "WMO"   "TCid"
    99111111 "49.91" "-99.95"  "409.40" "BRANDON.A" "MANITOBA" "5010480"   "71140" "YBR"
    99111112 "51.10" "-100.05" "304.50" "DAUPHIN.A" "MANITOBA" "5040680"   ""      "PDH"

Then you create a huge master datafile: createDataset(). This has all the data (temperatures, rain, etc.). Next you can extract just the mean temperature with asChcn(), which creates a GHCN-like data structure of temperature data.
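End to end, the workflow reads as a short script. Again a sketch: the function names come from this post, but the stations object and whether these calls return objects or write files to disk are assumptions here.

    # sketch of the full CHCN workflow described above; "stations" is
    # assumed to be the list of stations you want to scrape
    library(CHCN)

    scrapeToCsv(stations)       # one HTTP request per station (slow!)
    inv  <- createInventory()   # metadata: name, lat, lon, province, WMO etc.
    data <- createDataset()     # master datafile with all variables
    temp <- asChcn(data)        # mean temperature, GHCN-like structure

From there temp should plug into the same tools that consume GHCN data, which was the reader's original goal.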

Version 1.1 is done and is being tested. It should hit CRAN when some outside users report back.
