The R User Conference 2010 will be underway shortly - and what better way to commemorate the event than to blast out some R code related to the conference? HTML tables on websites for the past three years list participants and organizations, and the 2010 version includes country as well. Using the XML package to scrape the pages and the sqldf package to treat data frames as tables, we can effectively "query" HTML tables as database tables from R! We will create some plots of the data using ggplot2.
# Populate each of the data frames
# Set the column names
Let's see which organizations have attended the R user conference every year for the past three years.
sqldf("select distinct useR2010.organization from useR2010
join useR2008 using(organization)
join useR2009 using(organization)
where useR2010.organization !=''")
2 Case Western Reserve University
5 Mango Solutions
6 Monash University
7 REvolution Computing
8 The Institute of Statistical Mathematics
10 University of Kent
11 Vanderbilt University
12 Vienna University of Technology
13 West Virginia University
14 Yale University
Now lets see if there is any trends related to attendance.
everyone=sqldf("select 2008 year, count(*) total
from useR2008 union
select 2009, count(*) from useR2009 union
select 2010, count(*) from useR2010")
1 2008 406
2 2009 460
3 2010 465
Based upon the names of organizations, we can make some guesses about educational participants. This query counts any organization that has "university" or "school" in the name (case-insensitive) as educational.
ed=sqldf("select 2008 year, count(*) total
where UPPER(useR2008.organization) like '%UNIVERSITY%' or
UPPER(useR2008.organization) like '%SCHOOL%'
select 2009, count(*)
where UPPER(useR2009.organization) like '%UNIVERSITY%' or
UPPER(useR2009.organization) like '%SCHOOL%'
select 2010 year, count(*) total
where UPPER(useR2010.organization) like '%UNIVERSITY%' or
UPPER(useR2010.organization) like '%SCHOOL%'")
You can plot this if you like:
qplot(data=everyone, factor(year), total, geom='bar')
We can take the data frames we just created and query them from sqldf and view plot the total conference participants for each year along with the educational participants only.
DF=sqldf("select 'All' name, yr, total from everyone union select 'Ed' name, yr, total from ed")
ggplot(DF, aes(yr, total, color=name)) + geom_line(stat="identity") + theme_bw() + xlab("Year") + opts(title="UseR Conference Attendance")
So there is a steady upward trend for the past three years for both educational participants and conference participants in general. Now lets see which organizations and countries are best represented this year. This involves the same basic query, so the following function can be used to produce the plot intended.
par(las=2, mar=c(10, 4, 4, 2) + 0.1, cex=0.85)
To limit the results a bit, we will only display entities that have at least two participants.
plotCount('select organization, count(*) from useR2010 group by organization having count(*) > 1 order by 2,1')
plotCount('select country, count(*) from useR2010 group by country
having count(*) > 1 order by 2,1')
The charts above don't tell the whole story - for instance, the conference had been held in European countries during the past few years, so the participating countries have varied each year. Still, a fun example of how you can treat online data as if it were relational data.