[This article was first published on R-Chart, and kindly contributed to R-bloggers].

 One of the top searches on rubyflow is “conference”.  A recent post showed how to create a map with the location of the 2010 R User Conference.  So why not expand on the subject and create a map with numerous conference locations throughout the world?

This post shows how to create a map of locations (upcoming Ruby conferences) scraped straight off the web using Ruby and R.

The Packages and APIs
Both R and Ruby have a ton of functionality baked in to accomplish this task.  Ruby can scrape the web using Hpricot and geocode the information using Google.  It can call R through Rserve, and the maps and ggplot2 libraries can be used to render the result.

The Process
Ruby code described here is available on Github.  Rserve must be running when the program executes.  The following block of code creates a connection; if one is not available, it starts Rserve and retries.

begin
  puts "Creating a new Rserve Connection."
  $c = Connection.new
rescue
  puts "Could not create an Rserve Connection: #{$!}"
  puts "Trying to start one now..."
  # Write a one-line script that loads and starts Rserve, then run it in R
  File.open('tmp.R', 'w') { |f| f.puts "library(Rserve)\nRserve()" }
  system('"R.exe" --no-save < tmp.R')
  sleep 3
  $c = Connection.new
  puts "Rserve Started."
end

Getting a list of conferences involves parsing HTML.  In R, the XML package includes useful functionality and can be handy when data is in an HTML table.  In this case, however, the data was not in an HTML table.  Instead, the Hpricot parser accepts an XPath expression and iterates over the relevant elements.  The data is extracted and stored in an array of hashes.

def get_conference_list
  u = 'http://blog.sphereinc.com/2010/08/13-upcoming-ruby-and-rails-conferences-you-dont-want-to-miss'
  doc = Hpricot(open(u))
  recs = []
  (doc/"//div[@id='post-216']/div/p/strong").entries.each_with_index { |e, i|
    h = {}
    e.inner_text.split("\n").each { |d|
      p = d.split(':')
      unless [nil, ''].include?(p[0]) or [nil, ''].include?(p[1])
        # puts ">>#{p[0].strip} = #{p[1].strip}<<"
        h[p[0].strip] = p[1].strip
      end
    }
    recs << h
  }
  recs
end

Hpricot can parse XML as well as HTML, and so is used to get the latitude and longitude for each location.

def get_location(str)
  u = URI.encode(
  )
  loc = (Hpricot.XML(open(u))) / '//location'
  h = {}
  h['lat'] = (loc/:lat).inner_text
  h['lng'] = (loc/:lng).inner_text
  h
end

A data file containing semicolon-delimited records is created.  This provides the input to the R program.
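The post does not show the file-writing step itself. A minimal sketch, assuming the array of hashes returned by get_conference_list with the lat/lng values merged in (the field names and output file name here are illustrative guesses, not from the original code):

```ruby
# Sketch only: the 'Conference', 'Location', 'lat', and 'lng' keys and
# the file name 'conferences.txt' are assumptions for illustration.
def write_data_file(recs, path = 'conferences.txt')
  File.open(path, 'w') do |f|
    recs.each do |h|
      # One semicolon-delimited record per conference
      f.puts [h['Conference'], h['Location'], h['lat'], h['lng']].join(';')
    end
  end
end
```

On the R side, a file like this can be read with read.table using sep=';'.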

Finally, an R program is used to plot the data.  Since the files involved are in the current working directory (of which Rserve has no knowledge), the directory path is substituted into the script prior to execution.
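A minimal sketch of that substitution step, assuming the R script marks the directory with a placeholder token; the token WORKING_DIR and the script name are illustrative, not taken from the original code:

```ruby
# Sketch: replace an assumed WORKING_DIR placeholder in the R source
# with the Ruby process's current directory before handing it to Rserve.
def prepare_r_script(path, workdir = Dir.pwd)
  File.read(path).gsub('WORKING_DIR', workdir)
end

# script = prepare_r_script('plot_conferences.R')
# $c.eval(script)  # $c is the Rserve connection created earlier
```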

Challenges
In many cases ggplot2 creates publication-quality graphics with a simple call.  In this case, the data was grouped very unevenly, and attempts to automatically add text (here, the city names) to the map can result in overlapping labels and large amounts of wasted space.

I thought of a variety of ways to address the problem.
• The simplest was to resize the image in various ways.  This was not successful, but I did later crop the images in a manual post-processing step.
• Another approach was to modify the scale in use in some way (e.g. the use of a log scale with scatter plots, as Tal pointed out in this post).  The ggplot2 package includes a large number of map projections (ways of representing a three-dimensional sphere in two dimensions).  The available projections are described in the help for mapproject.

None of these solved the problem in and of themselves, but provided some interesting variations.
• I decided instead to simply "zoom in" on the relevant part of the chart and split the chart (a sort of ad hoc faceting).  In some cases this worked perfectly.
In other cases the results were suboptimal.  Paths were sometimes drawn between points on the border that had been cropped, so extraneous map lines appear in some zooms.  I finally decided that a general-purpose automatic solution was not readily available (or at least not known to me), so I cleaned up the final images using image-editing software (Gimp).

If anybody has ideas about how this kind of processing can be used to automatically generate graphs that are “cleaned up” please comment below.  Scriptable solutions that create finished products with no manual intervention are preferable.

The code is available on Github.