# Predicting the Six Nations

February 4, 2015
[This article was first published on Mango Solutions, and kindly contributed to R-bloggers].

I don’t know a lot about rugby, which can be a problem living in a rugby town, especially when the office sweepstake on the upcoming Wales vs England Six Nations game goes round: apparently 2-1 is not a valid rugby score. I’m not about to put a pound down without some research. Fortunately England and Wales have played each other before:

wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales

and R has some nice tools for grabbing data from the web.

## Scraping and cleaning

The rvest package has some really useful tools for getting the data into a usable form. With a few lines of code we can pull the HTML table into a data frame.
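The original scraping code is not shown, so here is a minimal sketch of what the step might look like with a current version of rvest. The inline HTML is a stand-in so the example runs offline, and its column names (`Date`, `Home`, `Score`) are assumptions rather than the real layout of the Wikipedia page.

```r
library(rvest)

# Stand-in for the Wikipedia page so the sketch runs offline. The column
# names and layout are assumptions, not the real page structure.
page <- read_html('
  <table>
    <tr><th>Date</th><th>Home</th><th>Score</th></tr>
    <tr><td>9 March 2014</td><td>England</td><td>29-18</td></tr>
    <tr><td>16 March 2013</td><td>Wales</td><td>30-3</td></tr>
  </table>')

# For the real thing, read_html() would be given the page URL instead.

# html_table() converts each <table> node into a data frame (a tibble).
tables <- html_table(html_elements(page, "table"))

# The page of interest may hold several tables; here we take the first.
rawData <- as.data.frame(tables[[1]])
str(rawData)
```

On a real page the right table usually has to be picked out by position or by a more specific CSS selector, which is where most of the trial and error goes.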

Some grepping and date parsing later we have a cleaned-up dataset.
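The cleaning step is not shown either. As a rough sketch in base R, assuming the raw rows arrive with text dates and a combined score that lists the home side's points first (a hypothetical layout, and `rawData` is an illustrative name), it might go something like this:

```r
# Hypothetical raw rows as they might come off the page: text dates and a
# combined score with the home side's points first (an assumed layout).
rawData <- data.frame(
  Date  = c("9 March 2014", "16 March 2013", "25 February 2012"),
  Home  = c("England", "Wales", "England"),
  Score = c("29-18", "30-3", "12-19"),
  stringsAsFactors = FALSE
)

# Parse the text dates; %B matches full month names in an English locale.
dates <- as.Date(rawData$Date, format = "%d %B %Y")

# Split each score on any run of non-digit characters.
points  <- do.call(rbind, strsplit(rawData$Score, "[^0-9]+"))
homePts <- as.integer(points[, 1])
awayPts <- as.integer(points[, 2])

# Reassign home/away points to fixed England and Wales columns.
englandHome <- rawData$Home == "England"
rugbyData <- data.frame(
  Date    = dates,
  Home    = rawData$Home,
  England = ifelse(englandHome, homePts, awayPts),
  Wales   = ifelse(englandHome, awayPts, homePts)
)
rugbyData
```

Splitting on any run of non-digits keeps the sketch indifferent to whether the page uses a hyphen or an en dash between the two scores.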

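| Date       | Home    | England | Wales |
|------------|---------|---------|-------|
| 2014-03-09 | England | 29      | 18    |
| 2013-03-16 | Wales   | 3       | 30    |
| 2012-02-25 | England | 12      | 19    |
| 2011-08-13 | Wales   | 9       | 19    |
| 2011-08-06 | England | 23      | 19    |
| 2011-02-04 | Wales   | 26      | 19    |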

A quick look at the data suggests things have been pretty even over the years, with some big wins for England around the turn of the millennium and Wales dominant in the 60s and 70s.

## Who’s going to win?

If we just look at who has won previous encounters we see that Wales have a slight edge but nothing statistically significant.

| Wales Wins | England Wins | Draw |
|------------|--------------|------|
| 56         | 52           | 12   |
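One way to check the "nothing statistically significant" claim is an exact binomial test on the decided games. This is a sketch rather than necessarily the test used for the article, and dropping the 12 draws is a simplifying assumption.

```r
walesWins   <- 56
englandWins <- 52

# Two-sided exact binomial test of H0: each decided game is a coin flip.
# Dropping the 12 draws is a simplifying assumption.
edgeTest <- binom.test(walesWins, walesWins + englandWins, p = 0.5)

edgeTest$p.value   # well above the usual 0.05 threshold
edgeTest$conf.int  # interval for Wales' share of decided games
```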

How about if we take into account home and away form? The game on the 6th will be in Cardiff. Will that give the edge to Wales?

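| Coefficient | Estimate | Std. Error | z value | Pr(>\|z\|) |
|-------------|----------|------------|---------|------------|
| homeEngland | -0.769   | 0.278      | -2.771  | 0.006      |
| homeOther   | 0.000    | 1.414      | 0.000   | 1.000      |
| homeWales   | 0.492    | 0.271      | 1.820   | 0.069      |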
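The model code is not shown, but the output looks like a logistic regression of "Wales win" on venue with no intercept, so each estimate is a log-odds and needs the inverse logit to read as a probability. The sketch below assumes that model form; the `toy` data frame is made up purely to show the fitting call, while `reported` holds the estimates from the table.

```r
# Toy data only: ten made-up games, not the real match history.
toy <- data.frame(
  home     = factor(rep(c("England", "Wales"), each = 5)),
  walesWin = c(0, 0, 1, 0, 1,  1, 1, 0, 1, 0)
)

# Dropping the intercept (0 + home) gives one log-odds estimate per venue.
fit <- glm(walesWin ~ 0 + home, family = binomial, data = toy)

# plogis() is the inverse logit; here it recovers each venue's win rate.
toyProb <- plogis(coef(fit))

# The same conversion applied to the reported estimates from the table.
reported <- c(homeEngland = -0.769, homeOther = 0.000, homeWales = 0.492)
winProb  <- round(plogis(reported), 2)
winProb
```

With no intercept and a single categorical predictor, the fitted probabilities are just the observed win rates at each venue, which makes the model easy to sanity-check by hand.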

This suggests the chance of a Wales win in Cardiff is 62%. So I’d say yes. It’s not a powerful prediction, but Wales tend to win in Wales. Good enough for me, I’ll go with Wales. OK, so what’s the damage going to be?

## What’s the score?

The sweepstake requires scores. This is the bit I really have no idea about. For a football fan used to scores such as 2-0, rugby scores seem arbitrarily large. Back to the data I guess. First up, what’s the total?

Interestingly, it looks like the total points have been going up since the 50s. At this point I’m desperate, so let’s predict the total score by fitting the data since the 50s and extrapolating. When Wales win there tend to be fewer points, so let’s throw that into the model as well. It will screen out those silly big English wins at the millennium.

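```r
# The start of this statement was lost; it builds rugbyData50,
# the subset of matches played since 1950:
# ... as.Date("1950-01-01"), ]

# Total points and winning margin, each modelled on date and winner
fitScore <- lm(englandScore + walesScore ~ Date * Winner, data = rugbyData50)
fitDiff <- lm(winningScore - losingScore ~ Date * Winner, data = rugbyData50)

# Extrapolate both models to match day, assuming a Wales win
matchDay <- data.frame(Date = as.Date("2015-02-06"), Winner = "Wales")
tScore <- predict(fitScore, matchDay)
dScore <- predict(fitDiff, matchDay)
```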

This predicts a total score on Friday of 39 and a difference of 9, giving my final prediction as

### Wales 24 – 15 England
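The individual scores follow from the predicted total and margin: the winner gets half of their sum and the loser half of their difference. A quick check, using the rounded values quoted above:

```r
tScore <- 39  # predicted total points (rounded, as quoted in the text)
dScore <- 9   # predicted winning margin (rounded, as quoted in the text)

# Solve winner + loser = total and winner - loser = margin.
winnerPts <- (tScore + dScore) / 2
loserPts  <- (tScore - dScore) / 2

c(Wales = winnerPts, England = loserPts)
```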

That’ll do for a pound I think.
