The Life Scientists at FriendFeed: 2009 summary

Posted on December 23, 2009 by nsaunders in R bloggers | 0 Comments

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

The Life Scientists 2009

It’s Christmas Eve tomorrow and so I declare the year over. My Christmas gift to you is a summary of activity in 2009 at the FriendFeed Life Scientists group. It’s crafted using R + Ruby, with raw data and some code snippets available. If you want to see the most popular items from the group this year, head down to the bottom of this post.

(Note: this post is a work in progress)

The contributors
First of all, take a look at yourselves. There are, allegedly, 1250 subscribers to the group, but I can only retrieve profiles for 1053 of them.

248 of you are rather shy, opting for the default avatar and one or two of you look rather like porn stars. If nothing else, this illustrates the difficulty of compiling reliable user statistics.

Here’s how this image was assembled. User pictures were fetched using this script:

#!/usr/bin/ruby

require 'rubygems'
require 'json'
require 'net/http'
require 'open-uri'

# fetch the group info (including the subscriber list) as JSON
f = URI.open("http://friendfeed-api.com/v2/feedinfo/the-life-scientists").read
j = JSON.parse(f)

j.each_pair do |k,v|
  if k == "subscribers"
    v.each do |a|
      nick = a["id"]
      url  = "http://friendfeed.com/#{nick}/picture?size=small"
      # the Location header of the response is used as the image URL
      r    = Net::HTTP.get_response(URI.parse(url))
      if r['location'] =~ /nomugshot/
        # default avatar
        pic     = "http://friendfeed.com/static/images/nomugshot-small.png?v=0fa95dbfe38cc4f25334187ca5485a58"
        outfile = "#{nick}.png"
      else
        pic     = r['location']
        outfile = "#{nick}.jpg"
      end
      if pic
        URI.open(pic) {|file|
          File.open(outfile,"wb") do |out|
            # write, not puts: puts can append a newline to the image data
            out.write file.read
          end
        }
      end
    end
  end
end

ImageMagick was used to make the montage:

> montage *.jpg *.png -geometry +2+2 montage.png

Fetching the data
In a recent post, I presented Ruby code to fetch FriendFeed entries using the API. To recap: entries cannot be fetched by date, so we have to employ a loop and the “?start=N” URL parameter to go back in time. Naming the script tls2csv.rb, I went back to late 2008 like so:

# fetch entries
for i in `seq 0 30 4050`
  do tls2csv.rb $i > $i.csv
done
# concatenate
cat *.csv > tls-20091223.raw.csv
# remove duplicates and 2008 entries
sort -u tls-20091223.raw.csv | grep -v ",2008-" > tls-20091223.unique.csv

That generates a CSV file with 5 fields: entry id, date, time, likes count and comments count. If you wish, you can get it from Dropbox.
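
For anyone who would rather stay in Ruby, here is a minimal sketch of the same two ideas: the list of "?start=N" offsets and the "remove duplicates and 2008 entries" step. The method names page_offsets and unique_2009 are made up for illustration, the sample lines are invented, and the line layout (id,date,time,likes,comments) is assumed from the field list above. It does not call the FriendFeed API.

```ruby
# Offsets for the "?start=N" parameter: 0, 30, 60, ..., 4050,
# the same values that `seq 0 30 4050` produces.
def page_offsets(last = 4050, step = 30)
  (0..last).step(step).to_a
end

# Mimic `sort -u ... | grep -v ",2008-"`: drop repeated lines and
# any line containing ",2008-", then sort what is left.
# Assumes each line looks like: id,date,time,likes,comments
def unique_2009(lines)
  lines.map(&:chomp).uniq.reject { |line| line.include?(",2008-") }.sort
end

# Invented sample lines, for illustration only
raw = [
  "e/aaa,2009-01-02,10:00:00,3,1",
  "e/aaa,2009-01-02,10:00:00,3,1", # repeated line
  "e/bbb,2008-12-31,23:59:00,0,0"  # 2008 entry, to be dropped
]

puts page_offsets.first(3).inspect
puts unique_2009(raw)
```

Matching on the string ",2008-" is as blunt here as it is in the grep version: it relies on the date being the only field that can contain that text.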

Analysis using R
OK – let’s get to work. First, load the data into R:

> library(ggplot2) # also loads library plyr
> tls <- read.csv("tls-20091223.unique.csv", header=T)
> dim(tls)
[1] 2363    5

Posts, likes and comments per day
We’ve retrieved 2363 unique posts. First question: how much activity is there on any given day? We’d like to sum the posts, likes and comments for each day, then visualise the activity. Here’s one way to do that, again using the calendar heat map. We use table() to sum posts by date and the plyr function ddply() to create a data frame for both likes and comments, summed by date:

> posts <- as.data.frame(table(tls$date))
> dim(posts)
[1] 346   2
> lc <- ddply(tls, c("date"), function(df)data.frame(likes=sum(df$likes),comments=sum(df$comments)))
> source("calendarHeat.R")
# create calendar heatmap for posts
> png(filename="tls-posts.png", type="cairo", width=640)
> calendarHeat(posts$Var1, posts$Freq, varname="The Life Scientists 2009: Posts", color="r2b")
> dev.off()
# similarly for likes and comments (latter not shown)
> png(filename="tls-likes.png", type="cairo", width=640)
> calendarHeat(lc$date, lc$likes, varname="The Life Scientists 2009: Likes", color="r2b")
> dev.off()
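
The per-day summing that table() and ddply() do above can also be sketched in Ruby. This is only an illustration: the Row struct, the daily_totals method and the three sample rows are invented here, and the field order is assumed to match the CSV.

```ruby
# One row per post, mirroring the five CSV fields
Row = Struct.new(:id, :date, :time, :likes, :comments)

# Group rows by date, then count the posts and sum likes and comments
# for each date. Returns a hash keyed by date, in date order.
def daily_totals(rows)
  totals = rows.group_by(&:date).map do |date, day_rows|
    [date, { posts:    day_rows.size,
             likes:    day_rows.sum(&:likes),
             comments: day_rows.sum(&:comments) }]
  end
  totals.sort_by { |date, _| date }.to_h
end

# Invented rows, for illustration only
rows = [
  Row.new("e/x1", "2009-03-02", "09:00:00", 2, 1),
  Row.new("e/x2", "2009-03-01", "12:30:00", 0, 4),
  Row.new("e/x3", "2009-03-02", "18:45:00", 5, 0)
]

daily_totals(rows).each do |date, t|
  puts "#{date} posts=#{t[:posts]} likes=#{t[:likes]} comments=#{t[:comments]}"
end
```

Sorting by the date string works because the dates are in YYYY-MM-DD form, so alphabetical order is also chronological order.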

Life Scientists 2009: daily posts, likes and comments

So, we managed to retrieve data for 346 days; the plots are shown in the figure above.
Are they useful? The small height of the scale bar is a little annoying and I know that these plots can be over-used to little effect, but I think they’re pretty and kind of cool. Here’s what I see:

A moderate level of posts throughout the year
Weekends are quieter (obviously) – as was October for some reason?
Post “hot-spots” in June and July (conference season?)
It’s easier to like than to comment (generally, less dark colours across the likes plot)
Most-liked posts are not necessarily the most-commented

How many posts, likes and comments can we expect per day, or likes/comments per post?

# fivenum() = minimum, lower-hinge, median, upper-hinge, maximum
# stats per day
> fivenum(posts$Freq)
[1]  1  4  6  9 18
> fivenum(lc$likes)
[1]  0 11 20 34 95
> fivenum(lc$comments)
[1]   0   8  18  35 160
# stats per post
> fivenum(tls$likes)
[1]  0  0  2  5 36
> fivenum(tls$comments)
[1]  0  0  1  4 85

Quite a range. A median of 6 posts, 20 likes and 18 comments per day. Or, 2 likes and 1 comment per post.
To me, this indicates a reasonable level of activity in the room: there is new material and discussion most days and a post stands a good chance of receiving a like or a comment. Another interesting observation is that although the “barrier to like” is lower than the “barrier to comment”, posts tend to be highly-commented rather than highly-liked. Does this suggest that if a topic is interesting, important or controversial, it provokes comment and comments provoke more comments, whereas likes don’t generate more likes?
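
If you are wondering what the "hinges" in the five-number summary are, here is a minimal Ruby sketch of the calculation. It follows the hinge rule as I understand it from R's fivenum(), and it has not been checked against R on this data set. The method name simply mirrors the R function.

```ruby
# Tukey's five-number summary: minimum, lower hinge, median,
# upper hinge, maximum. Each value is the mean of the elements at
# floor(d) and ceiling(d) for a (possibly fractional) 1-based depth d.
def fivenum(values)
  x = values.sort
  n = x.length
  return [] if n.zero?

  n4 = ((n + 3) / 2) / 2.0                       # depth of the lower hinge
  depths = [1, n4, (n + 1) / 2.0, n + 1 - n4, n] # 1-based positions
  depths.map { |d| 0.5 * (x[d.floor - 1] + x[d.ceil - 1]) }
end

puts fivenum([1, 2, 3, 4, 5]).inspect
puts fivenum([4, 1, 3, 2]).inspect
```

With an even number of values the hinges and the median fall between two data points, which is why the results come back as floats rather than integers.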

Looking at those numbers is a little dull, though. Let’s do a simple qplot from ggplot2 to look at how many likes and comments each post receives and if they are related:

# qplot likes (comments done in similar way)
> png(filename="tls-likes-dist.png", type="cairo", width=640)
> qplot(likes, data=tls, main="Life Scientists 2009: likes distribution")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
> dev.off()

# likes/comments correlation
# first do cor.test(tls$likes, tls$comments, method="kendall")
# then do cor.test(lc$likes, lc$comments, method="kendall")
# then e.g.
> png(filename="date-lc.png", type="cairo", width=640)
> qplot(likes, comments, data=lc, main="Life Scientists 2009 likes/comments/by date: Kendall: z = 15.159, tau = 0.556")
> dev.off()

Life Scientists 2009: likes/comments distributions and correlations

Here are the plots. Unsurprisingly, most posts receive few likes, with zero by far the most common number. However, the most frequent comment count for a post is one. This may be because a comment is often posted automatically with an item, or because the person who posts often writes an explanatory comment.
Do likes and comments go hand in hand? We can do a quick Kendall rank correlation and add the results to a scatterplot. There is some association between likes and comments “by post”, but a stronger association “by day”. In other words, as the general level of discussion increases, both types of input, likes and comments, increase.
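
To make the idea of a rank correlation concrete, here is a minimal Ruby sketch of Kendall's tau. It compares every pair of observations and counts how often the two variables move in the same direction (concordant) or in opposite directions (discordant). This is the tau-b variant, which adjusts for ties. The method name kendall_tau_b is made up, and I have not checked that this sketch reproduces the cor.test() figures quoted in the plot title.

```ruby
# Kendall's tau-b by brute force over all pairs: O(n^2), fine for a
# few thousand points. Returns nil if either variable is constant.
def kendall_tau_b(xs, ys)
  raise ArgumentError, "lists must have equal length" unless xs.length == ys.length

  concordant = discordant = untied_x = untied_y = 0
  (0...xs.length).to_a.combination(2) do |i, j|
    dx = xs[i] <=> xs[j]
    dy = ys[i] <=> ys[j]
    untied_x += 1 unless dx.zero?
    untied_y += 1 unless dy.zero?
    next if dx.zero? || dy.zero? # tied pairs are neither concordant nor discordant

    dx == dy ? concordant += 1 : discordant += 1
  end

  denominator = Math.sqrt(untied_x * untied_y)
  return nil if denominator.zero?

  (concordant - discordant) / denominator
end

puts kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4])
puts kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])
```

A value near 1 means that days (or posts) with more likes also tend to have more comments; a value near 0 means the two rankings are unrelated.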

Posts that generated the most discussion
This is probably what you came for. Here are the top 10 posts, sorted first by comments and then by likes. First, here’s the R code to sort a data frame by one of its columns (comments, then likes):

# by comments
> tls[sort.list(tls$comments, decreasing=T),][1:10,]
                                     id       date     time likes comments
456  e/322d1c29cb704a07b19073753b4eaebe 2009-01-29 04:07:26    10       85
948  e/6a1b501459b34849be9ff93e8b2bcbaa 2009-05-22 19:16:07    16       65
1147 e/813658cfc851446fadf7b11550fc15ac 2009-11-20 12:08:16    27       65
867  e/603db6a4322d4ba98443b67268cfaa6b 2009-02-22 03:50:26    14       58
663  e/4950d4652b8c4570b2aa85c5317c8952 2009-01-25 09:14:55    12       55
2044 e/e2d349ddae164849a1d3c37fac0242c8 2009-04-22 07:45:41    32       54
400  e/2bfe2f28871e4e5cbb715ba687eba114 2009-01-29 11:51:00    10       50
2336 e/fe523209c16b4d909190ee81e8372c00 2009-10-30 18:10:35     0       50
1513 e/a842e6e56b2845dcb8b70537080922fb 2009-07-11 04:42:18    15       45
2284 e/f90daf1d0dc34de886e43376ebf37b1a 2009-09-15 15:51:18    12       45

# by likes
> tls[sort.list(tls$likes, decreasing=T),][1:10,]
                                     id       date     time likes comments
1840 e/cbbe2531a8e9439e833dff099b3d0573 2009-07-21 02:22:06    36       28
2044 e/e2d349ddae164849a1d3c37fac0242c8 2009-04-22 07:45:41    32       54
809  e/590c7cecbd3a4244a9526cc2a9a90941 2009-03-04 20:56:28    31       42
703  e/4d88961c9aff4278872d2b0dc3da783a 2009-05-31 03:04:26    28       27
782  e/56710bf33cad43ac8d063ad989e95aa0 2009-12-06 05:47:32    28        8
1760 e/c4fd9b133fa1477f805109add6897ac3 2009-09-15 16:33:49    28        1
1147 e/813658cfc851446fadf7b11550fc15ac 2009-11-20 12:08:16    27       65
1335 e/95230b310f0f44c6bd4d3403f09a3fe1 2009-07-14 13:11:18    27        5
2031 e/e0c29747484b4e16896b1e4e8a299aca 2009-05-13 03:11:38    27       20
685  e/4ad4515bfd304f9fbf5c6a1381692fdc 2009-01-21 02:07:08    26       30
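
The same "top N by one column" selection can be sketched in Ruby. The Entry struct and the top_by method are invented for illustration; the three sample rows are copied from the tables above.

```ruby
Entry = Struct.new(:id, :likes, :comments)

# Return the n entries with the largest value of the given field,
# largest first. The order among tied entries is not guaranteed.
def top_by(entries, field, n = 10)
  entries.max_by(n) { |e| e.public_send(field) }
end

# Three rows taken from the R output above
entries = [
  Entry.new("e/322d1c29cb704a07b19073753b4eaebe", 10, 85),
  Entry.new("e/e2d349ddae164849a1d3c37fac0242c8", 32, 54),
  Entry.new("e/cbbe2531a8e9439e833dff099b3d0573", 36, 28)
]

puts top_by(entries, :comments, 2).map(&:id)
puts top_by(entries, :likes, 2).map(&:id)
```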

Since we have the entry ID, we can retrieve the entries as JSON, e.g. using curl:

# write out list of entry IDs from R, e.g. for likes
> write(as.vector(tls[sort.list(tls$likes, decreasing=T),][1:10,1]), "likes.list")
# fetch JSON in bash
for i in $(cat likes.list)
  do curl "http://friendfeed-api.com/v2/entry/$i?pretty=1" >> likes.json
done

Without further ado, the most popular posts and their contributors.

Life Scientists: most commented entries

Life Scientists: most liked entries

Elsevier announces the ‘Article of the Future’
Contributor: Alexey
I’m thinking about putting a proposal to bring our data resources kicking and screaming into the Semantic Web
Contributor: Andrew Clegg
The All Results Journals – ‘Because all your results are good results’
Contributor: Shirley Wu
Signing up to be notified about google wave
Contributor: Steve Koch
Science: The looming crisis in human genetics
Contributor: Itachi
Human form [cross-sectional animated gif]
Contributor: Adriano
I’m going to do a round of looking at some of the Science Social Networking sites again
Contributor: Cameron Neylon
Great Tweets of Science
Contributor: Abhishek Tiwari
Univ. of Washington grad student Darren Begley plans to live stream his PhD Thesis defense
Contributor: Mary Canady
Comparison of biological wikis
Contributor: Andrew Su

As to why they were the most discussed entries – you’ll just have to go and read them. The discussions don’t always revolve around the initial subject of the post 🙂

Posted in R, statistics, web resources Tagged: 2009, friendfeed, ruby, summary, the life scientists, visualisation