Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data

June 20, 2013

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was a Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, with gid numbers 0 and 1 respectively). Sheets 2..47 are formatted as follows:

Guardian data table uni 2014

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

gsqAPI = function(key,query,gid=0){
  tmp=getURL( paste( sep="",'', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )

The query is a query written using the SQL-like Google visualisation API query language.

The following routine will download non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject Rank”). An additional column is added to the table that denotes what subject the table is associated with.

  tmp=gsqAPI(key,"select * where B!=''", i)

We can now download the first subject specific sheet using the following call:


This loads in the data as follows:

seeding guardian ranking data table

We can then add data to this dataframe from all the other sheets, to give us a single data from containing ranking data for all the subjects listed.

for (i in 3:47){

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be viewed as numeric types as instead being treated as characters.

Data format - need to make numbers

We can fix this with a one liner, that forces the type on the columns we select (the fourth to the eleventh column):

gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]

Cross-subject rankings

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and satisfaction levels, whilst also paying heed to the value add score?

oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

oxfordbrookes start asking

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking te data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

To leave a comment for the author, please follow the link and comment on their blog: OUseful.Info, the blog... » Rstats. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)