# BioStar users (of the world, unite)

October 9, 2010

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

Sounds like a challenge. Let’s go.

BioStar user profiles (here’s mine) include a location field. It’s free text and optional, which means that location is missing or inaccurate for many users. However, if you’re logged into BioStar (and perhaps, if you’re a moderator – I’m not sure), you’ll see a field that says:

Last activity: 4 hours ago from XXX.XXX.XXX.XXX


IP addresses can be used for geolocation – we’ll see how shortly. The problem is that they are only present when logged into BioStar, which uses OpenID for authentication. So to write code which automates the collection of user IP addresses, you’d have to convince BioStar that you were logged in.

I’m sure that it’s possible to write code which stores OpenID credentials and sends them to BioStar, but it would take some time to develop. So instead, I used a very ugly and largely manual approach. First, I wrote this simple Greasemonkey script:

// ==UserScript==
// @name           BioStar IP
// @description    Get user IP
// @include        http://biostar.stackexchange.com/users/*
// ==/UserScript==

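// Select the DIV(s) with class "summaryinfo"
var d = document.evaluate("//div[@class='summaryinfo']",
                          document,
                          null,
                          XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
                          null);

// Log the first match, if the page has one
if (d.snapshotLength > 0) {
  console.log(d.snapshotItem(0).innerHTML);
}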


It captures the content of the DIV with class summaryinfo and writes it to the JavaScript console. That content looks something like this:

Last activity: <span title="2010-10-03 23:06:52Z UTC" class="relativetime">Oct 3 at 23:06</span> from XXX.XXX.XXX.XXX


Again, XXX.XXX.XXX.XXX is the IP address.
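
As a minimal sketch of how one captured fragment could be picked apart afterwards, the Ruby below pulls the timestamp and the IP address out of a string of that shape. The method name parse_summary is hypothetical, and 192.0.2.10 is a made-up documentation address, not a real user's IP.

```ruby
# Hypothetical helper: extract the timestamp and IP from one summaryinfo fragment.
# Returns nil when the fragment carries no "from <IP>" part.
def parse_summary(html)
  ip = html[/from\s+(\d{1,3}(?:\.\d{1,3}){3})/, 1]
  return nil if ip.nil?
  # The title attribute holds the full timestamp, e.g. "2010-10-03 23:06:52Z UTC"
  stamp = html[/title="([^"]+?)Z? UTC"/, 1]
  { :ip => ip, :time => stamp }
end

# Sample fragment; the address is a placeholder, not real data.
sample = 'Last activity: <span title="2010-10-03 23:06:52Z UTC" ' \
         'class="relativetime">Oct 3 at 23:06</span> from 192.0.2.10'
result = parse_summary(sample)
puts result[:ip]
puts result[:time]
```

Returning nil for fragments without an address keeps users who show no IP from breaking later steps.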

So I opened Firefox, installed the Greasemonkey and Firebug extensions, installed my user script, navigated to the BioStar users page, opened the Firebug console and started clicking through users. By choosing “Persist” and increasing the console log limit, I was able to record the IP address of each user in the console. When finished, I copied the console contents to a text file.

There is no worse solution, for a bioinformatician, than one that involves manual labour, copy and paste. Currently, there are 17 pages of users (16 x 35 + 1 x 11 = 571 total). My file contains 567 of them: at least one did not display an IP address and perhaps I missed a couple. This is why we learn to script.
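
The bookkeeping above can be checked with a few lines of Ruby. This is only a sketch: the lines array stands in for the saved console text, its entries are made up, and count_ips is a hypothetical helper name.

```ruby
# Expected number of users from the page counts quoted above.
full_pages, per_page, last_page = 16, 35, 11
expected = full_pages * per_page + last_page

# Count the lines of a console dump that carry an IP address.
def count_ips(lines)
  lines.count { |line| line =~ /from\s+\d{1,3}(?:\.\d{1,3}){3}/ }
end

# Made-up stand-in for the saved console output.
lines = [
  "Last activity: ... from 192.0.2.10",
  "Last activity: ... from 198.51.100.7",
  "Last activity: ..."                    # a user with no IP shown
]
captured = count_ips(lines)
puts "expected #{expected}, captured #{captured}, missing #{expected - captured}"
```

Run against the real file, the same count would show how many users slipped through the clicking.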

2. Location using GeoIP
So how do we find location using IP? The answer is GeoIP.

First, head over to the MaxMind website and download their GeoIP C API. I installed it (for Ubuntu) like so:

wget http://geolite.maxmind.com/download/geoip/api/c/GeoIP.tar.gz
tar zxvf GeoIP.tar.gz
cd GeoIP-1.4.6
./configure --prefix=/opt/GeoIP
make
sudo make install
# install the city database (GeoLiteCity.dat.gz, downloaded separately from MaxMind)
gunzip GeoLiteCity.dat.gz
sudo mv GeoLiteCity.dat /opt/GeoIP/share/GeoIP/


GeoIP comes with a free database of countries, located in /opt/GeoIP/share/GeoIP/GeoIP.dat. I also installed their free city database, as shown above.

Next, the Ruby gem for GeoIP:

[sudo] gem install mtodd-geoip -s http://gems.github.com/ -- --with-geoip-dir=/opt/GeoIP


Now, quick and very dirty Ruby code to read the text file containing IP addresses and look them up in the GeoIP database:

require "rubygems"
require "geoip"

ip  = "ip.txt"  # the text file containing IPs, copied from console.log
db  = GeoIP::City.new("/opt/GeoIP/share/GeoIP/GeoLiteCity.dat")

File.foreach(ip) do |line|
  line.chomp!
  # match "from " followed by a dotted-quad IP address
  if line =~ /from\s+(\d+\.\d+\.\d+\.\d+)/
    locn = []
    lookup = db.look_up($1)
    next if lookup.nil?  # skip addresses with no database record
    locn.push(lookup[:country_name], lookup[:country_code], lookup[:city], lookup[:latitude], lookup[:longitude])
    puts locn.join("\t")
  end
end


That prints out a tab-delimited file, which looks like this:

United States   US  East Lansing    42.7282981872559   -84.4881973266602
Italy           IT  Rome            41.9000015258789   12.4833002090454
Portugal        PT  Fafe            41.4500007629395   -8.16670036315918
China           CN  Wuhan           30.5832996368408   114.266700744629
United States   US  Oklahoma City   35.4715003967285   -97.5189971923828
...
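
Once the data is in that shape, a quick per-country tally is easy to get. The sketch below assumes rows with the country name in the first tab-separated column; count_by_country is a hypothetical helper, and the rows are the five sample lines shown above.

```ruby
# Tally rows of the tab-delimited output by country name (first column).
def count_by_country(rows)
  counts = Hash.new(0)
  rows.each do |row|
    country = row.split("\t").first
    counts[country] += 1 unless country.nil? || country.empty?
  end
  # Most frequent first; ties broken alphabetically.
  counts.sort_by { |name, n| [-n, name] }
end

rows = [
  "United States\tUS\tEast Lansing\t42.7282981872559\t-84.4881973266602",
  "Italy\tIT\tRome\t41.9000015258789\t12.4833002090454",
  "Portugal\tPT\tFafe\t41.4500007629395\t-8.16670036315918",
  "China\tCN\tWuhan\t30.5832996368408\t114.266700744629",
  "United States\tUS\tOklahoma City\t35.4715003967285\t-97.5189971923828"
]
tally = count_by_country(rows)
tally.each { |name, n| puts "#{name}\t#{n}" }
```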


3. Plotting maps using R
Before we go all Google-y, let’s look at plotting geographical data using R. There are many libraries and mapping solutions, but here’s a simple script to plot our users on a world map. It requires the packages ggplot2 and maps. Assuming that the output from the Ruby script is saved in a file, biostar.tab:

library(ggplot2)
library(maps)

biostar <- read.table("biostar.tab", header = FALSE, stringsAsFactors = FALSE, sep = "\t")
colnames(biostar) <- c("country", "code", "city", "lat", "long")
world <- map_data("world")

png(file = "biostar.png", width = 1024, height = 768)
# the point colour is a fixed value, not a mapped aesthetic, so no legend scale is needed
p <- ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "darkslategrey") +
  geom_point(data = biostar, aes(long, lat), colour = "red")
print(p)
dev.off()

And here’s the result.

Figure: BioStar user locations

4. Plotting on a Google Map
There are many options for getting data into Google Maps. I figured that there must be a site where you can upload a simple CSV file containing latitude + longitude and display a Google Map. There is – it’s called ZeeMaps. It has many features – some free, some paid – which I’m yet to investigate fully.

For CSV upload, your file requires a column headed “Name” (I chose the city in my file), plus columns of coordinates headed “Latitude” and “Longitude”. All you need to do is create a new map, upload the file and select “refresh”. Here’s the map that I created. Unfortunately, it cannot be embedded in this blog post (see the screenshot). I have no idea if that link is permanent and I suspect that anyone can make alterations to the map.

Figure: BioStar users at ZeeMaps
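
The column layout described above can be produced from the tab-delimited file with Ruby's standard CSV library. This is a sketch rather than the exact conversion I used: to_zeemaps_csv is a hypothetical helper name, and it assumes rows in the order country, code, city, latitude, longitude.

```ruby
require "csv"

# Convert tab-delimited rows (country, code, city, lat, long) into CSV text
# with the Name, Latitude and Longitude headings.
def to_zeemaps_csv(rows)
  CSV.generate do |csv|
    csv << ["Name", "Latitude", "Longitude"]
    rows.each do |row|
      _country, _code, city, lat, long = row.chomp.split("\t")
      next if lat.nil? || long.nil?   # skip incomplete rows
      csv << [city, lat, long]
    end
  end
end

rows = [
  "United States\tUS\tEast Lansing\t42.7282981872559\t-84.4881973266602",
  "Italy\tIT\tRome\t41.9000015258789\t12.4833002090454"
]
csv_text = to_zeemaps_csv(rows)
puts csv_text
```

Using the CSV library rather than joining with commas means a city name containing a comma or a quote is escaped correctly.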

Of course, IPs can be spoofed, users move around and the location of a machine might not reflect the location of the user. However, I think it’s a more reliable geolocation approach than an arbitrary text description. Now, if I could just automate that IP-harvesting code…

