BioStar users (of the world, unite)

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

Egon writes:

Can someone please plot the BioStar users on a Google Map?

Sounds like a challenge. Let’s go.

1. Harvesting user IP addresses
BioStar user profiles (here’s mine) include a location field. It’s free text and optional, which means that location is missing or inaccurate for many users. However, if you’re logged into BioStar (and perhaps, if you’re a moderator – I’m not sure), you’ll see a field that says:

Last activity: 4 hours ago from XXX.XXX.XXX.XXX

where “XXX.XXX.XXX.XXX” is either an IP address or, for your own page, the text “this IP address” (assuming your latest activity was from your current machine).

IP addresses can be used for geolocation – we’ll see how shortly. The problem is that they are only present when logged into BioStar, which uses OpenID for authentication. So to write code which automates the collection of user IP addresses, you’d have to convince BioStar that you were logged in.

I’m sure that it’s possible to write code which stores OpenID credentials and sends them to BioStar, but it would take some time to develop. So instead, I used a very ugly and largely manual approach. First, I wrote this simple Greasemonkey script:

// ==UserScript==
// @name           BioStar IP
// @namespace      http://twitter.com/neilfws
// @description    Get user IP
// @include        http://biostar.stackexchange.com/users/*
// ==/UserScript==

var d;
d = document.evaluate("//div[@class='summaryinfo']",
                      document,
                      null,
                      XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
                      null);

console.log(d.snapshotItem(0).innerHTML);

It captures the content of the DIV with class summaryinfo and writes it to the JavaScript console. That content looks something like this:

Last activity: <span title="2010-10-03 23:06:52Z UTC" class="relativetime">Oct 3 at 23:06</span> from XXX.XXX.XXX.XXX

Again, XXX.XXX.XXX.XXX is the IP address.

So I opened Firefox, installed the Greasemonkey and Firebug extensions, installed my user script, navigated to the BioStar users page, opened the Firebug console and started clicking through users. By choosing “Persist” and increasing the console log limit, I was able to record the IP address of each user in the console. When finished, I copied the console contents to a text file.

There is no worse solution, for a bioinformatician, than one that involves manual labour, copy and paste. Currently, there are 17 pages of users (16 x 35 + 1 x 11 = 571 total). My file contains 567 of them: at least one did not display an IP address and perhaps I missed a couple. This is why we learn to script.
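
A quick sanity check on the copied console text can be scripted. The following is only a minimal sketch, not the code used for this post: the helper name `extract_ips` and the sample lines are made up for illustration (the addresses come from the reserved documentation ranges), and it assumes each record contains "from" followed by a dotted-quad address, as in the summaryinfo content shown earlier.

```ruby
# Hypothetical helper: pull IP addresses out of text copied from the console.
# Assumes each record contains "from" followed by a dotted-quad address.
IP_PATTERN = /from\s+(\d{1,3}(?:\.\d{1,3}){3})/

def extract_ips(text)
  # scan returns one single-element array per match; flatten to plain strings
  text.scan(IP_PATTERN).flatten
end

# Made-up sample records (documentation-range addresses, not real users)
sample = <<~LOG
  Last activity: <span title="2010-10-03 23:06:52Z UTC" class="relativetime">Oct 3 at 23:06</span> from 192.0.2.10
  Last activity: <span title="2010-10-02 11:15:00Z UTC" class="relativetime">Oct 2 at 11:15</span> from 198.51.100.7
  Last activity: <span title="2010-10-01 08:00:00Z UTC" class="relativetime">Oct 1 at 8:00</span> from this IP address
LOG

ips      = extract_ips(sample)
expected = 16 * 35 + 1 * 11   # 16 full pages of 35 users plus one page of 11

puts "expected users: #{expected}"
puts "captured IPs:   #{ips.size} (#{ips.uniq.size} unique)"
```

Run against the real console dump (for example `extract_ips(File.read("ip.txt"))`), the gap between the two numbers shows how many profiles were missed or had no address displayed.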

2. Location using GeoIP
So how do we find location using IP? The answer is GeoIP.

First, head over to the MaxMind website and download their GeoIP C API. I installed it (for Ubuntu) like so:

wget http://geolite.maxmind.com/download/geoip/api/c/GeoIP.tar.gz
tar zxvf GeoIP.tar.gz
cd GeoIP-1.4.6
./configure --prefix=/opt/GeoIP
make
sudo make install
# install the city database
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
gunzip GeoLiteCity.dat.gz
sudo mv GeoLiteCity.dat /opt/GeoIP/share/GeoIP/

GeoIP comes with a free database of countries, located in /opt/GeoIP/share/GeoIP/GeoIP.dat. I also installed their free city database, as shown above.

Next, the Ruby gem for GeoIP:

[sudo] gem install mtodd-geoip -s http://gems.github.com/ -- --with-geoip-dir=/opt/GeoIP

Now, quick and very dirty Ruby code to read the text file containing IP addresses and look them up in the GeoIP database:

require "rubygems"
require "geoip"

ip  = "ip.txt"  # the text file containing IPs, copied from console.log
db  = GeoIP::City.new("/opt/GeoIP/share/GeoIP/GeoLiteCity.dat")

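# read the file one line at a time
File.foreach(ip) do |line|
  # capture the dotted-quad IP address that follows "from"
  if line =~ /from\s+(\d+\.\d+\.\d+\.\d+)/
    lookup = db.look_up($1)
    next if lookup.nil?  # skip if the lookup returned nothing
    locn = [lookup[:country_name], lookup[:country_code], lookup[:city], lookup[:latitude], lookup[:longitude]]
    puts locn.join("\t")
  end
end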

That prints out a tab-delimited file, which looks like this:

United States   US  East Lansing    42.7282981872559   -84.4881973266602
Italy           IT  Rome            41.9000015258789   12.4833002090454
Portugal        PT  Fafe            41.4500007629395   -8.16670036315918
China           CN  Wuhan           30.5832996368408   114.266700744629
United States   US  Oklahoma City   35.4715003967285   -97.5189971923828
...
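
Before plotting, it can be useful to see how the rows are distributed. The sketch below is an illustration rather than part of the original workflow: the function name `count_by_country` is hypothetical, and it assumes exactly the five tab-separated columns shown above (country name, country code, city, latitude, longitude).

```ruby
# Count rows per country in tab-delimited text shaped like the output above:
# country name, country code, city, latitude, longitude.
def count_by_country(text)
  counts = Hash.new(0)
  text.each_line do |line|
    fields = line.chomp.split("\t")
    next unless fields.size == 5   # skip blank or malformed rows
    counts[fields[0]] += 1
  end
  counts
end

# The five example rows shown above, joined with tabs
rows = [
  ["United States", "US", "East Lansing",  "42.7282981872559", "-84.4881973266602"],
  ["Italy",         "IT", "Rome",          "41.9000015258789", "12.4833002090454"],
  ["Portugal",      "PT", "Fafe",          "41.4500007629395", "-8.16670036315918"],
  ["China",         "CN", "Wuhan",         "30.5832996368408", "114.266700744629"],
  ["United States", "US", "Oklahoma City", "35.4715003967285", "-97.5189971923828"]
]
sample = rows.map { |r| r.join("\t") }.join("\n")

# Print the most frequent countries first, ties in alphabetical order
count_by_country(sample).sort_by { |country, n| [-n, country] }.each do |country, n|
  puts "#{country}\t#{n}"
end
```

For the full data set, pass the saved Ruby output instead of the sample, for example `count_by_country(File.read("biostar.tab"))`.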

3. Plotting maps using R
Before we go all Google-y, let’s look at plotting geographical data using R. There are many libraries and mapping solutions, but here’s a simple script to plot our users on a world map. It requires the packages ggplot2 and maps. Assuming that the output from the Ruby script is saved in a file, biostar.tab:

library(ggplot2)
library(maps)

biostar <- read.table("biostar.tab", header = FALSE, stringsAsFactors = FALSE, sep = "\t")
colnames(biostar) <- c("country", "code", "city", "lat", "long")
world <- map_data("world")

png(file = "biostar.png", width = 1024, height = 768)
p <- ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "darkslategrey") +
  # colour is a fixed value rather than a mapped aesthetic, so no legend is drawn
  geom_point(data = biostar, aes(long, lat), colour = "red")
print(p)
dev.off()

And here’s the result (click for the full-size version).

BioStar user locations

4. Plotting on a Google Map
There are many options for getting data into Google Maps. I figured that there must be a site where you can upload a simple CSV file containing latitude + longitude and display a Google Map. There is – it’s called ZeeMaps. It has many features – some free, some paid – which I have yet to investigate fully.

For CSV upload, your file requires a column headed “Name” (I chose the city in my file), plus columns of coordinates headed “Latitude” and “Longitude”. All you need to do is create a new map, upload the file and select “refresh”. Here’s the map that I created. Unfortunately, it cannot be embedded in this blog post (click image, right, for a full-size screenshot). I have no idea if that link is permanent and I suspect that anyone can make alterations to the map.

BioStar users at ZeeMaps
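
The CSV layout described above (columns headed Name, Latitude and Longitude) can be produced from the tab-delimited GeoIP output with a few lines of Ruby. This is a rough sketch under stated assumptions, not the method used for the map: the function name `to_zeemaps_csv` is hypothetical, the input is assumed to have the five columns printed by the lookup script, and falling back to the country name when the city is empty is my own choice.

```ruby
require "csv"

# Convert tab-delimited rows (country, code, city, lat, long) to CSV text
# with the headers described above: Name, Latitude, Longitude.
def to_zeemaps_csv(tab_text)
  CSV.generate do |csv|
    csv << ["Name", "Latitude", "Longitude"]
    tab_text.each_line do |line|
      country, _code, city, lat, long = line.chomp.split("\t")
      next if lat.nil? || long.nil?   # skip rows without coordinates
      # Assumption: use the country name when no city was found
      name = (city.nil? || city.empty?) ? country : city
      csv << [name, lat, long]
    end
  end
end

# Two of the example rows from the GeoIP output
sample = "Italy\tIT\tRome\t41.9000015258789\t12.4833002090454\n" \
         "Portugal\tPT\tFafe\t41.4500007629395\t-8.16670036315918\n"

puts to_zeemaps_csv(sample)
```

To build the upload file from the real data, something like `File.write("biostar.csv", to_zeemaps_csv(File.read("biostar.tab")))` should do; the output file name here is arbitrary.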

Of course, IPs can be spoofed, users move around and the location of a machine might not reflect the location of the user. However, I think it’s a more reliable geolocation approach than an arbitrary text description. Now, if I could just automate that IP-harvesting code…


