Analyze Online R User Conference Data

July 19, 2010
By

(This article was first published on R-Chart, and kindly contributed to R-bloggers)



The R User Conference 2010 will be underway shortly - and what better way to commemorate the event than to  blast out some R code related to the conference?  HTML tables on websites for the past three years list participants and organizations, and the 2010 version includes country as well.  Using the XML package to scrape the pages and the sqldf package to treat data frames as tables, we can effectively "query" HTML tables as database tables from R!  We will create some plots of the data using ggplot2.


library(XML)
library(sqldf)
library(ggplot2)


# Populate each of the data frames
useR2008=as.data.frame(
  readHTMLTable(
    'http://www.statistik.uni-dortmund.de/useR-2008/participants.html'
  )[2]
)


useR2009=as.data.frame(
  readHTMLTable(
    'http://www.agrocampus-ouest.fr/math/useR-2009/participants.html'
  )[2]
)


useR2010=as.data.frame(
  readHTMLTable(
    'http://user2010.org/participants.html'
  )[2]
)


# Set the column names
colnames(useR2010)=c('name','country','organization')
colnames(useR2009)=c('name','organization')
colnames(useR2008)=c('name','organization')


Let's see which organizations have attended the R user conference every year for the past three years.

sqldf("select distinct useR2010.organization from useR2010 
join useR2008 using(organization) 
join useR2009 using(organization) 
where useR2010.organization !=''")




1                                Agrocampus
2           Case Western Reserve University
3                                 Genentech
4                                      INRA
5                           Mango Solutions
6                         Monash University
7                      REvolution Computing
8  The Institute of Statistical Mathematics
9                                      UCLA
10                       University of Kent
11                    Vanderbilt University
12          Vienna University of Technology
13                 West Virginia University
14                          Yale University




Now lets see if there is any trends related to attendance.
everyone=sqldf("select 2008 year, count(*) total 
             from useR2008 union 
       select 2009, count(*) from useR2009 union 
       select 2010, count(*) from useR2010")



everyone
  year total
1 2008   406
2 2009   460
3 2010   465



Based upon the names of organizations, we can make some guesses about educational participants.  This query counts any organization that has "university" or "school" in the name (case-insensitive) as educational.


ed=sqldf("select 2008 year, count(*) total
from useR2008
where UPPER(useR2008.organization) like '%UNIVERSITY%' or 
UPPER(useR2008.organization) like '%SCHOOL%' 
union
select 2009, count(*)
from useR2009
where UPPER(useR2009.organization) like '%UNIVERSITY%' or 
UPPER(useR2009.organization) like '%SCHOOL%' 
union
select 2010 year, count(*) total
from useR2010
where UPPER(useR2010.organization) like '%UNIVERSITY%' or 
UPPER(useR2010.organization) like '%SCHOOL%'")


You can plot this if you like:


qplot(data=everyone, factor(year), total, geom='bar')

We can take the data frames we just created and query them from sqldf and view plot the total conference participants for each year along with the educational participants only.

DF=sqldf("select 'All' name, yr, total from everyone union select 'Ed' name, yr, total from ed")


ggplot(DF, aes(yr, total, color=name)) + geom_line(stat="identity") + theme_bw() + xlab("Year") + opts(title="UseR Conference Attendance")


So there is a steady upward trend for the past three years for both educational participants and conference participants in general.  Now lets see which organizations and countries are best represented this year.  This involves the same basic query, so the following function can be used to produce the plot intended.


plotCount=function(sql){


  m=as.matrix(sqldf(sql))
  rownames(m)=m[,1]
  m=m[,-1]
  par(las=2, mar=c(10, 4, 4, 2) + 0.1, cex=0.85)
  #barplot(as.numeric(m), names=names(m))
  qplot( names(m),m,geom='bar')+coord_flip()


}


To limit the results a bit, we will only display entities that have at least two participants.



plotCount('select organization, count(*) from useR2010 group by organization having count(*) > 1 order by 2,1')





plotCount('select country, count(*) from useR2010 group by country
having count(*) > 1 order by 2,1')





The charts above don't tell the whole story - for instance, the conference had been held in European countries during the past few years, so the participating countries have varied each year.  Still, a fun example of how you can treat online data as if it were relational data.



To leave a comment for the author, please follow the link and comment on his blog: R-Chart.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.