APIs: I wish the life sciences would learn from social networks

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

I was prompted by a thread on the apparent decline of FriendFeed to look for evidence of declining participation in my networks.

First, a quick and dirty Ruby script, tls.rb, to grab The Life Scientists feed and count the likes and comments:


require 'rubygems'
require 'json/pure'
require 'net/http'
require 'open-uri'

# extract "YYYY-MM-DD,HH:MM:SS" from the API's ISO 8601 date string
def format_date(d)
  if d =~ /(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z/
    return "#{$1},#{$2}"
  else
    return d
  end
end

# likes/comments may be absent from an entry
def count_items(i)
  if i.nil?
    return 0
  else
    return i.count
  end
end

n = ARGV[0]
u = "http://friendfeed-api.com/v2/feed/the-life-scientists?start=#{n}"
f = open(u).read
j = JSON.parse(f)

j.each_pair do |k,v|
  if k == "entries"
    v.each do |entry|
      date = format_date(entry['date'])
      likes = count_items(entry['likes'])
      comments = count_items(entry['comments'])
      puts "#{entry['id']},#{date},#{likes},#{comments}"
    end
  end
end

By default, the API call returns the last 30 items, starting at item zero. You can move back in time by running the script with an offset, for example "tls.rb 30". Really, there should be a check that ARGV[0] is an integer, but the argument can also be absent altogether, in which case it is simply ignored. I did say quick and dirty.
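For what it's worth, a minimal guard along those lines (hypothetical, not part of the original script) could coerce the argument and fall back to an offset of zero:

```ruby
# Treat a missing or non-integer argument as an offset of 0
n = begin
      Integer(ARGV[0])
    rescue ArgumentError, TypeError
      0
    end
```

Kernel#Integer raises TypeError on nil (no argument given) and ArgumentError on non-numeric strings, so both bad cases collapse to the default.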

The script returns CSV with entry ID, date, time, likes count and comments count, one line per entry of the form:

entry-id,YYYY-MM-DD,HH:MM:SS,likes,comments
One big drawback of the FriendFeed API is that you cannot retrieve entries by date, or a range of dates. By experimenting with values of “?start=N” in the URL, it seemed that N=3600 retrieved entries from late 2008 onwards. And so:

for i in `seq 0 30 3600`; do
  ./tls.rb $i >> ffdata-raw.csv
done

Be aware that this will not retrieve every post from 2009, and there will also be duplicate entries, which we can filter out by entry ID. To remove duplicates and drop the 2008 entries:

sort -u ffdata-raw.csv | grep ",2009-" > ffdata-filtered.csv

We’re not quite there yet. We have unique records, but several can share the same date, so we need to sum the likes and comments for each date. I should have done that in the Ruby script really, but we can use awk to sum the likes, as follows:

awk -F"," '{OFS=",";cnt1[$2]+=$4}END{for (x in cnt1){print x,cnt1[x]}}' ffdata-filtered.csv > ffdata-likes.csv

Just substitute $5 to sum the comments.
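Spelled out, the substituted one-liner looks like this (ffdata-comments.csv is my name for the output file, not one from the original post):

```shell
awk -F"," '{OFS=",";cnt1[$2]+=$5}END{for (x in cnt1){print x,cnt1[x]}}' ffdata-filtered.csv > ffdata-comments.csv
```

Field 2 is the date and field 5 the comments count, so the associative array cnt1 accumulates one comments total per date.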

Last step: read the file into R, download Paul Bleicher’s calendarHeat.R code and generate plots:

> source("calendarHeat.R")
> fflikes <- read.csv("ffdata-likes.csv", check.names=F,header=F)
> png(filename="tls-likes.png", type="cairo", width=640)
> calendarHeat(fflikes$V1, fflikes$V2, varname="Likes",color="r2b")
> dev.off()

That was quick, relatively easy and most of all, fun.
In contrast, I’ve been trying to mine microarray data from the NCBI GEO database for the best part of 8 months now.
There’s an API of sorts but getting the results that I want is not quick, easy and most certainly not fun.

Is it any wonder that all the cool kids want to be web developers, not data scientists?

