Friday fun projects


What’s a “Friday fun project”? It’s a small computing project, perfect for a Friday afternoon, which serves the dual purpose of (1) keeping your programming/data analysis skills sharp and (2) providing a mental break from the grind of your day job. Ideally, the skills learned on the project are useful and transferable to your work projects.

This post describes one of my Friday fun projects: how good is the Sydney weather forecast? There’ll be some Ruby, some R and some MongoDB. Bioinformatics? No, but some thoughts on that at the end.

1. Define the problem
Step 1 in any project is to define the problem. The question “how good is a weather forecast?” might be addressed in many ways. Obviously, we need to collect two kinds of data: predictions and observations, then see how well they match.

I decided that the simplest approach was to focus only on temperatures: specifically, the observed and predicted maximum and minimum temperature each day, for a particular weather station.
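
Concretely, once both datasets exist, one plausible way to score the forecasts is the mean absolute error between predicted and observed temperatures. This sketch is purely illustrative and isn't computed anywhere in this post:

# a sketch: mean absolute error over [predicted, observed] temperature pairs
def mean_abs_error(pairs)
  pairs.map { |predicted, observed| (predicted - observed).abs }.inject(:+) / pairs.size.to_f
end

mean_abs_error([[18.0, 16.6], [8.0, 8.0]])  # => 0.7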

2. Get the data
The website of the Australian Bureau of Meteorology, known colloquially as “BoM”, is one of the most popular sites with Australian web users and quite a treasure trove of data.

2.1 Predictions
Here’s the Sydney forecast page. It provides predictions 6 days ahead, for the metropolitan area and a few other selected locations. We could try scraping the data from this web page, but down at the bottom it says “Product derived from IDN10064.” A quick Google search reveals this ASCII text file, containing the same data.
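
For reference, the two kinds of line that the parser cares about resemble the following. The “Issued at” line is real (it's quoted again below); the column layout of the “City Centre” line is illustrative, inferred from the regular expression in the code:

Issued at 8:47 am EST on Saturday 14 May 2011
...
City Centre           Min 8     Max 18
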
Here’s a first draft of some Ruby code to parse that text file. First, some libraries to use either now or in later operations:

require "rubygems"
require "open-uri"
require "json/pure"
require "mongo"
require "date"

Next, the get_predictions method:

def get_predictions(url)
  begin
    temps  = []
    record = []
    ts     = DateTime.now.to_s
    open(url) do |f|
      f.each_line do |line|
        # capture the forecast issue time; replace "EST" with the Sydney
        # UTC offset so that DateTime.parse gets the time zone right
        if line =~ /^Issued\s+at\s+/
          ts = line
          ts.gsub!("EST", "+10:00")
        end
        # capture the min/max temperatures for the City Centre forecasts
        if line =~ /^City\s+Centre.*?\s+Min\s+(-\d+|\d+)\s+Max\s+(-\d+|\d+)/
          temps << [$1, $2]
        end
      end
    end
    temps.each_with_index do |t, i|
      record << {'timestamp' => DateTime.parse(ts).to_s, 'date' => (Date.today + i + 1).to_s,
        'min' => t[0].to_f, 'max' => t[1].to_f, 'predicted' => 1, '_id' => "#{DateTime.parse(ts).to_s}/#{i}"}
    end
  rescue Exception => e
    puts e
  ensure
    return record
  end
end

Fun? You bet. There are a couple of nice Ruby features on display here. First, the text file contains a line that looks like this: “Issued at 8:47 am EST on Saturday 14 May 2011”. Believe it or not, you can pass that as-is to the DateTime.parse method and get a reasonable result. If you replace “EST” (which Ruby interprets as US EST, GMT-05:00) with the Sydney time zone, GMT+10:00, you get an even better result:

str = "Issued at 8:47 am EST on Saturday 14 May 2011"
DateTime.parse(str).to_s
=> "2011-05-14T08:47:00-05:00"
str.gsub! "EST", "+10:00"
DateTime.parse(str).to_s
=> "2011-05-14T08:47:00+10:00"

Second, Ruby’s many useful array methods. Maximum and minimum temperatures are stored as an array of 6 arrays. By using each_with_index we can get the index for each prediction (0, 1, 2…5), add 1 to it and then add that to today’s date to create the date for the prediction. We can also use the time stamp and index to create a unique key, since predictions are updated throughout the day and the week.

    temps.each_with_index do |t, i|
      record << {'timestamp' => DateTime.parse(ts).to_s, 'date' => (Date.today + i + 1).to_s,
        'min' => t[0].to_f, 'max' => t[1].to_f, 'predicted' => 1, '_id' => "#{DateTime.parse(ts).to_s}/#{i}"}
    end
# example value
=> { "_id" : "2011-05-14T08:47:00+10:00/0", "predicted" : 1, "timestamp" : "2011-05-14T08:47:00+10:00",
 "max" : 18, "date" : "2011-05-15", "min" : 8 }

If any of that throws an exception, the error is printed and an empty array is returned. Yes, I know “catch-all” exception code is bad practice. This is the first draft.
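
A more defensive second draft might rescue only the failures we actually expect from the network fetch. A minimal sketch, where fetch_forecast is a hypothetical helper rather than part of the code above:

require "rubygems"
require "open-uri"
require "timeout"

# rescue only the expected network errors, not Exception
def fetch_forecast(url)
  open(url).read
rescue OpenURI::HTTPError, SocketError, Timeout::Error => e
  warn "could not fetch #{url}: #{e.message}"
  nil
end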

2.2 Observations
Getting the observations is a little easier, but requires some additional processing. Weather observations for the past 72 hours are available here. For this product, BoM also provides links to the data in CSV and JSON formats. Here’s some Ruby code to obtain the latter:

def get_yesterday(url)
  begin
    record = {}
    data  = JSON.parse(open(url).read)
    ts    = data['observations']['header'].first['refresh_message']
    ts.gsub!("EST", "+10:00")
    # build [timestamp, temperature] pairs, then keep only yesterday's
    temps = data['observations']['data'].map { |obs| [obs['local_date_time_full'], obs['air_temp']] }
    temps.reject! {|t| Date.today - Date.parse(t[0]) != 1}
    min = temps.min_by { |t| t[1] }
    max = temps.max_by { |t| t[1] }
    record = {'timestamp' => DateTime.parse(ts).to_s, 'date' => (Date.today - 1).to_s,
              'min' => min[1], 'max' => max[1], 'predicted' => 0, '_id' => DateTime.parse(ts).to_s}
  rescue Exception => e
    puts e
  ensure
    return record
  end
end

Even more Ruby array fun. Here, we grab and parse the JSON data in one step, then use map to create an array of arrays in which [element 0 = time stamp, element 1 = temperature]. Next, the useful reject method removes those elements that don’t contain data from yesterday, this time by employing Date.parse:

temps.reject! {|t| Date.today - Date.parse(t[0]) != 1}

Two more useful methods, max_by and min_by, make finding yesterday’s maximum and minimum a breeze. Use the time stamp as the key, set the key predicted = 0 and you end up with this:

    record = {'timestamp' => DateTime.parse(ts).to_s, 'date' => (Date.today - 1).to_s,
              'min' => min[1], 'max' => max[1], 'predicted' => 0, '_id' => DateTime.parse(ts).to_s}
# example record
=> { "_id" : "2011-05-13T18:15:00+10:00", "predicted" : 0, "timestamp" : "2011-05-13T18:15:00+10:00",
 "max" : 16.6, "date" : "2011-05-12", "min" : 8 }

Final step: save to a MongoDB database named “bom”, in the collection “records”:

# connect to the MongoDB database "bom", collection "records"
db  = Mongo::Connection.new.db('bom')
col = db.collection('records')

# save yesterday's observation, if the record contains all 6 expected fields
observed  = get_yesterday("http://www.bom.gov.au/fwo/IDN60901/IDN60901.94768.json")
if observed.count == 6
  col.save(observed)
end

# save each daily prediction, with the same completeness check
predicted = get_predictions("http://www.bom.gov.au/fwo/IDN10064.txt")
predicted.each do |p|
  if p.count == 6
    col.save(p)
  end
end
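
To verify that the save step worked, a quick sanity check using the same col handle (this assumes the code above has already run; find and limit are from the mongo gem's cursor API of the era):

# count the stored records and print a few of them
puts col.count
col.find.limit(3).each { |doc| puts doc.inspect }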

3. Visualize the data
At some point, I’ll turn to my favourite web application combo: Sinatra + MongoDB + Highcharts, to visualize these data dynamically on a web page. For now though, we can get a quick idea and create even more Friday fun by learning how to use RApache to run and view R code in the browser.

At this point, the fun briefly wavered. I’ve compiled RApache on several machines but my home machine (Ubuntu 10.10 x86_64) always has a problem. This time around, after the initial make failed with an error code (127), I ran the failed command manually, then ran make again without error:

`/usr/bin/R RHOME`/bin/Rscript tools/config_http.R /usr/bin/apxs2 /usr/sbin/apache2
make
sudo make install
sudo /etc/init.d/apache2 restart

My /etc/apache2/conf.d/rapache.conf looks like this:

LoadModule R_module           /usr/lib/apache2/modules/mod_R.so

<Location "/R">
  ROutputErrors
  SetHandler r-script
  RHandler sys.source
  REvalOnStartup "library(lattice)"
</Location>
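
With this in place, any R script saved under the /R location is executed via sys.source and its output returned to the browser; that is how the script below, saved as /var/www/R/bom.R, comes to be served at http://localhost/R/bom.R.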

And my R code to plot the data, saved as /var/www/R/bom.R, looks like this:

library(grid)
library(digest, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(proto, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(plyr, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(bitops, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(RCurl, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(rjson, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13")
library(reshape, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13", warn.conflicts = F)
library(ggplot2, lib.loc = "/home/neil/R/x86_64-pc-linux-gnu-library/2.13", warn.conflicts = F)

setContentType("image/png")

# fetch all records from the MongoDB HTTP REST interface and
# convert the JSON rows to a data frame
d <- fromJSON(getURL("http://localhost:28017/bom/records/"))
d <- ldply(d$rows, function(x) as.data.frame(x))

# plot min and max temperature by date, coloured by predicted (1) or observed (0)
t <- tempfile()
png(t, type = "cairo", width = 640, height = 480)
gg <- ggplot(d) + geom_point(aes(as.Date(date), max, color = factor(predicted))) +
                  geom_point(aes(as.Date(date), min, color = factor(predicted))) +
                  xlab("Date") + ylab("Temperature")
print(gg)
dev.off()

# send the PNG to the browser and remove the temporary file
sendBin(readBin(t, 'raw', n = file.info(t)$size))
unlink(t)
DONE

[Figure: BoM data as rendered in browser using RApache]

Why, you might ask, am I loading all of those libraries explicitly? Two reasons. First, they are located in my $HOME/R and so RApache (running as the web server user www-data) would not find them by default. Second, if R prints any warnings or messages during library loading, we’ll get an invalid PNG file which will not display. That took quite some debugging and trial and error, I can tell you.

Briefly, this code grabs the records as JSON from the MongoDB HTTP REST interface, converts them to a data frame and plots them using ggplot2. So – run the Ruby code, check the database records, open up http://localhost/R/bom.R in a web browser and there you have it (see the figure above).
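
One gotcha worth noting: by default, mongod does not serve collection queries over HTTP. As far as I know, it must be started with the --rest option for the URL above to return records (the --dbpath here is illustrative):

mongod --rest --dbpath /data/db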

The plot needs some tweaking and for the first couple of days, prediction cannot be compared with observation (because we don’t have today’s observation or yesterday’s prediction). However, run the Ruby code as a cron job – for example at noon every day for several months – and we should soon have a nice dataset with which to work.
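
A crontab entry for a daily noon run might look something like this (the script path is hypothetical):

0 12 * * * /usr/bin/ruby /home/neil/bom/bom.rb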

Summary: fun and bioinformatics
What makes for a fun project?
First, it should be small and achieve results rapidly.

Second, the data analysis should appeal to your “inner data geek.” Yes, when I was a kid I built a weather station and made daily observations. Don’t laugh – you did too.

Third, writing the code should be educational and fun. Matz, the creator of Ruby, has stated that the primary purpose of Ruby is “to help every programmer in the world to be productive, and to enjoy programming, and to be happy.” I tend to agree with him. R is rather less fun but very powerful, once you get a feel for it.

It’s often easier to learn or develop skills in a “fun” setting, then transfer those skills to other projects. I have plenty of vague ideas for using Ruby, R, MongoDB and web applications in bioinformatics, but this is a concrete example that I can develop rapidly.

Fourth, projects are more fun when data are easy to access and process. At this point, I climb onto the bioinformatics soapbox and proclaim: still, in 2011, this is not the case for biological data. A statistician colleague told me that “statisticians don’t really worry about where the data comes from; we assume someone else takes care of that.” Well, bioinformaticians spend most of their time worrying about where data comes from – and most of the rest of their time cleaning it up so that they can do something useful with it. This activity used to be considered a large part of our core skill set but frankly, there’s no quicker way to take the fun out of your work.

Now, do I need to start a new blog for these fun side projects?


