Olimpic predictions – from an R web service provider’s point of view

Posted on September 17, 2012 by rapporter in R bloggers | 0 Comments

[This article was first published on rapporter, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello, world!

Back in July we have read Markus Gesmann’s great blogpost about a prediction for the 100m final in London. Soon we decided to create similar estimates about the forthcoming events and started to post our results on Facebook.

We would like to emphasise again that these kind of extrapolated estimates are rather just for fun and we also think that any sport fan can give more valid and even more reliable estimates about the results without any statistical knowledge. All our models assume that the performance of the winners, so the expected results also fit the historical data. So this blogpost has nothing to do with causal models, nor would we encourage anyone to bet on these kind of estimates in the forthcoming Olympics, but it might be interesting to see, visualise the lower and lower results from year to year.

In this article we will show how the models look like, what kind of tools we used to build and visualise those and also providing a demo web application where anyone could compile a similar plot with a decent amount of annotations with a single click.

As you could see in the above demo graph, we have built two models for all sport events. A so called “simple” prediction which stands for a log-linear, and a “power” estimate for a non-linear model.

Of course R provides some really great tools and databaseolympics.com has all kind of historical data to patch a plot like the above in a few minutes, but we tried to build some environment to even let the non-R users benefit from all these resources and also give an instrument to professional R users where they can concentrate on the code and do not have to deal with the wearisome process of editing/fine-tuning/finalising/re-thinking the style of the output.

We would ask all our kind readers without any interest in the technical details to check out our demo web application right now and also to leave some feedback below in the comment box to let us know what you would expect from a statistical tool running in your browser.

Hello, R users!

Before digging deeper in the models, you might be interested in our work flow. At least this article concentrates rather on this latter, as we are pretty sure you are familiar with the prior 🙂

Most R users have gotten used to or developed some kind of tools to export the results from their R console to some user-friendly format (maybe in pdf, HTML, odt, docx or xls beside others) available to share with clients/collaborators etc. They might have dedicated much time to create custom templates and helper functions to speed-up the dry part of the job, or even still puzzling which to choose from the wide variety of R packages for this end.

Although we are aware of sweave, Hmisc, xtable, R2HTML and hwriter, even odfWeave or ascii and brew beside the Windows specific libraries and others which we have not used in production (but also worth checking on the reproducible research CRAN task view), and have also seen the rise of knitr (although started to work on this project well before we first heard of this great package back in January), we decided to introduce some new packages to fit our custom needs: to provide some ways to run regular R commands with pretty GUI and output in a browser.

Introducing rapport

Aleksandar Blagotić and Gergely Daróczi started to work on a package almost one and a half year ago to create reproducible “R templates”. These templates hold some R code with optional parameters which could be specified in a simple R command.

Of course this sounds like the templates are some custom R functions, but actually the template body is a Pandoc’s markdown text with any number of inline R commands (called: “chunks”) specified between brew-like tags. The main difference from brew is that the “chunks” are automatically rendered as markdown and the resulting output of a template can be easily exported to bunch of formats (just to name a few: HTML, pdf, odt or docx). If you might be interested, please check out our examples (do not forget to click on the “exports” tabs in the list) and for more details, please see the above URL.

We started to build on the awesome ascii and evaluate packages with a bunch of document converter back-ends, but soon we realised that we had to tweak those too much for our specific needs, so decided to write our custom “evaluate R call(s) and render results in markdown” package.

Introducing pander

As stated above, we needed a package which can eval R commands with the following parameters:

catching all messages, warnings and errors,
catching the stdout (like capture.output) while evaluating,
storing the results of each R command (not per chunk!) in R objects,
running each R command and not halting on the first error,
rendering the results (let it be a character vector, table, data.frame or an lm object beside others) in markdown,
grabbing each plot (let it be a so-called base plot, lattice or ggplot2) and save to disk (also: show to the user on demand and also including those in the exported documents),
and last but not least trying to unify the images based on user preferences in the major graph libraries (graphics, lattice and ggplot2).

That latter required the most of coding (and tweaks), but now evals (the base command of the package) can be considered as quite stable and mature with even a custom caching mechanism.

evals takes a character vector of R commands to run as its main parameter with some optional arguments, and is frequently called in Pandoc.brew which was the main purpose of pander: to provide a brew-like engine resulting in markdown while keeping the great advantages of brew (running loops or printing parts of a report conditionally). Please check out the examples (and the exported documents of those in the bottom) to see the engine in action.

Beside Pandoc.brew there is also an R5 reference class to create styled reports in the R console based on the idea got from the ascii package.

Back to the Olympics

Why did we write that lengthy introduction without any interesting source code? Of course we used our packages to generate the above plot with some chatty annotations shown in our demo (or follow the direct link to “100m Freestyle Men”) . We encourage you to check that out before reading further.

Fetching data

First, we downloaded the required data set from databaseolympics.com if not found in our local cache:

## specifying the sport event based on the above URL
u <- 'SWI-120'

## datafile store (caching not to bother the server)
dir.create(file.path(getwd(), 'reports', 'data'), recursive = TRUE, showWarnings = FALSE)
datafile <- file.path(getwd(), 'reports', 'data', u)

## build the URL
uri <- strsplit(u, '-', fixed = TRUE)[[1]]
url <- paste0('http://www.databaseolympics.com/sport/sportevent.htm?sp=', uri[1], '&enum=', uri[2])

## fetching data from \url{databaseolympics.com} if not cached
if (!file.exists(datafile)) {
    library(XML)
    d <- readHTMLTable(readLines(url, warn = FALSE), which = 2, header = TRUE)
    saveRDS(d, file = datafile)
} else {
    d <- readRDS(datafile)
}

Data transformations

Now we have all cells of the above HTML table saved to d, which is to be filtered in most cases (we need only the first results from each year without any duplicates):

## defining events which seem to be invalid
dEvents         <- tail(as.character(d$Event[d$Event != '']), 1)
dEvents.invalid <- which(d$Event != dEvents)

## removing events which seem to be invalid
if (length(dEvents.invalid) > 0)
    d <- d[-dEvents.invalid, ]

## just dealing with the winners ATM
golddata <- subset(d, Medal %in% "GOLD")

## removing duplicated rows (we just need the Results)
if (any(table(golddata['Year']) > 1))
    golddata <- golddata[-which(duplicated(golddata['Year'])), ]

## are we dealing with time results?
resultInTime <- any(grepl(':', golddata$Result))

## transforming data
d$Event            <- as.character(d$Event)
golddata$Year      <- as.numeric(as.character(golddata$Year))
years              <- c(as.numeric(as.character(sort(unique(golddata$Year)))), 2012)
golddata$Result    <- rms(as.character(golddata$Result), resultInTime)
rownames(golddata) <- NULL

We called there a custom function, which transforms the human readable time format to seconds:

#' Revert pretty time printing
#' @param text character in \code{hours:minutes:seconds} format
#' @param transform if set to FALSE the input will be returned without any tweaks
#' @return numeric
rms <- function(text, transform = TRUE) {
    sapply(text, function(x) {
        if (transform) {
            x <- as.numeric(strsplit(x, ':')[[1]])
            if (length(x) == 1)
                return(x)
            if (length(x) == 2)
                return(x[1] * 60 + x[2])
            if (length(x) == 3)
                return(x[1] * 60^2 + x[2] * 60 + x[3])
        } else
            round(as.numeric(x), 2)
    }, USE.NAMES = FALSE)
}

Fitting models

Building the models cannot be easier in R. We decided to build a non-linear (power) and a log-linear model, which fit the data set in most cases:

nonLin         <- suppressWarnings(lm(Result ~ poly(Year, 4), data = golddata))
nonLin.predict <- suppressWarnings(predict(nonLin, newdata = data.frame(Year = years)))
logLin         <- lm(log(Result) ~ Year, data = golddata)
logLin.predict <- exp(predict(logLin, newdata = data.frame(Year = years)))

Of course checking the assumptions of the models also makes sense, where we build on the gvlma package for simplicity:

library(gvlma)
summary(gvlma(nonLin))
summary(gvlma(logLin))

Please note that in the beginning of August (when the first plot of this post was created) we used a more saturated model for the power prediction:

year2   <- golddata$Year**2
year3   <- golddata$Year**3
year4   <- golddata$Year**4
nonlin3 <- lm(Result~Year*year2*year3*year4, data=cbind(golddata, year2, year3, year4))

Visualising results

A quick ggplot2 solution can be seen below after some data transformation (adding the expected results in 2012 to the data set):

golddata2012        <- rbind(golddata, c(2012, rep(NA, times = ncol(golddata)-1)))
golddata2012$nonLin <- nonLin.predict
golddata2012$logLin <- logLin.predict
library(ggplot2)
g <- ggplot(golddata2012) + geom_point(aes(x=Year, y=Result)) + geom_smooth(aes(x=Year, y=nonLin), alpha=0.2) + geom_smooth(aes(x=Year, y=logLin, col = "#56B4E9"), alpha=0.2, col = "#009E73") + opts(title = d$Event[1]) + ylab("") + xlab("") + theme_bw() + opts(legend.position = "none")
if (resultInTime)
    g <- g + scale_y_continuous(labels = ms)
g

Which looks like (automatically styled with pander):

100m Freestaly Men (rapporter demo with ggplot2)

In the above ggplot call we used another custom function, which does the opposite of rms:

#' Pretty time printing
#' @param sec integer
#' @param transform if set to FALSE the input will be returned without any tweaks
#' @return character in \code{hours:minutes:seconds} format
ms <- function(sec, transform = TRUE) {
    sapply(sec, function(sec) {
        if (transform) {
            if (sec < 60) {
                msec <- sec %% 1
                if (msec == 0)
                    msec <- ''
                else
                    msec <- paste0('.', round(msec, 2) * 100)
            } else
                msec <- ''
            sec <- round(sec)
            minute <- sec %/% 60
            if (minute > 59) {
                sec    <- sprintf("%02d", sec - minute * 60)
                hours  <- minute %/% 60
                minute <- sprintf("%02d", minute - hours * 60)
                paste0(paste(hours, minute, sec, sep=":"), msec)
            } else {
                sec <- sprintf("%02d", sec - minute * 60)
                paste0(paste(minute, sec, sep=":"), msec)
            }
        } else
            round(sec, 2)
    })
}

Creating a reusable template for easy access

The above scripts (and even more: the above "complex plot" beside some chatty annotations and optionally added weights to the models - which can be seen in our demo also) were compiled to a brew file and can be found on GitHub.

Please note, that this solution uses only pander, but creating a template which shows the predictions building on rapport with a one-liner in R can be accomplished easily based on the above, on which we will focus in our next post.

Welcome, feedback!

We really hope you may find our packages useful. Please do not hesitate to share your opinion about pander or rapport on GitHub, or more generally below in the comment box!

To leave a comment for the author, please follow the link and comment on their blog: rapporter.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Olimpic predictions – from an R web service provider’s point of view

Hello, world!

Hello, R users!

Introducing rapport

Introducing pander

Back to the Olympics

Fetching data

Data transformations

Fitting models

Visualising results

Creating a reusable template for easy access

Welcome, feedback!

Related

Hello, world!

Hello, R users!

Introducing rapport

Introducing pander

Back to the Olympics

Fetching data

Data transformations

Fitting models

Visualising results

Creating a reusable template for easy access

Welcome, feedback!

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)