I don’t know a lot about rugby, which can be a problem when you live in a rugby town, especially when the office sweepstake on the upcoming Wales vs England Six Nations game goes round: apparently 2-1 is not a valid rugby score. I’m not about to put a pound down without some research. Fortunately England and Wales have played each other before, and R has some nice tools for grabbing data from the web.
Scraping and cleaning
The rvest package has some really useful tools for getting the data into a usable form. With a few lines of code we can pull the HTML table into a data frame.
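As a rough sketch of what that can look like, the snippet below parses a tiny inline HTML table with rvest (version 1.0 or later) instead of a live page, so it runs without a network connection. The table id `results` and the two rows are made up for illustration; for the real thing you would pass the URL of the results page to `read_html()`.

```r
library(rvest)

# Hypothetical stand-in for the real results page; the rows are made up.
page <- minimal_html('
  <table id="results">
    <tr><th>Date</th><th>Venue</th><th>Score</th><th>Winner</th></tr>
    <tr><td>1 January 2001</td><td>Cardiff</td><td>20-10</td><td>Wales</td></tr>
    <tr><td>2 February 2002</td><td>London</td><td>12-12</td><td>Draw</td></tr>
  </table>
')

# Select the table by its (assumed) id and convert it to a data frame.
# The first row of <th> cells becomes the column names.
rawData <- as.data.frame(html_table(html_element(page, "#results")))

print(rawData)
```

Everything comes back as text at this stage, which is why the cleaning step is needed.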
Some grepping and date parsing later, we have a cleaned-up dataset.
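The cleaning might look something like the following base R sketch. The column names and the raw values are assumptions made for illustration, not the real scraped data: the idea is just to pull the two numbers out of each score string and turn the date text into proper `Date` objects.

```r
# Made-up raw rows in the shape a scraped table might have
rawData <- data.frame(
  Date  = c("1 January 2001", "2 February 2002"),
  Score = c("20-10", "12 - 12"),
  stringsAsFactors = FALSE
)

# Pull the runs of digits out of each score string
scoreParts <- regmatches(rawData$Score, gregexpr("[0-9]+", rawData$Score))
rawData$homeScore <- as.integer(sapply(scoreParts, `[`, 1))
rawData$awayScore <- as.integer(sapply(scoreParts, `[`, 2))

# Parse day-month-year text into Date objects
# (%B matches full month names and depends on the session locale)
rawData$Date <- as.Date(rawData$Date, format = "%d %B %Y")

print(rawData)
```

Matching on digits rather than splitting on the separator means it does not matter whether the page uses a hyphen, a dash, or extra spaces between the scores.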
A quick look at the data suggests things have been pretty even over the years, with some big wins for England around the turn of the millennium and Wales dominant in the 60s and 70s.
Who’s going to win?
If we just look at who has won previous encounters we see that Wales have a slight edge but nothing statistically significant.
|Wales Wins||England Wins||Draw|
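One way to check whether an edge like this is statistically significant is an exact binomial test on the decisive games, leaving draws out. This is a sketch of that idea, not necessarily the test used for the post, and the tallies below are made up rather than the real head-to-head record.

```r
# Made-up tallies, NOT the real head-to-head record
walesWins   <- 12
englandWins <- 10

# Exact binomial test of the null hypothesis that each side is
# equally likely to win a decisive (non-drawn) game
winTest <- binom.test(walesWins, walesWins + englandWins, p = 0.5)

print(winTest$p.value)
```

A p-value well above 0.05 means a lead of that size is easily explained by chance.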
How about if we take into account home and away form? The game on the 6th will be in Cardiff. Will that give the edge to Wales?
|Estimate||Std. Error||z value||Pr(>|z|)|
This suggests the chance of a Wales win is 62%. So I’d say yes. It’s not a powerful prediction, but Wales tend to win in Wales. Good enough for me, I’ll go with Wales. OK, so what’s the damage going to be?
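The column headings in the table above match what `summary()` prints for a `glm`, so a logistic regression of "did Wales win?" on venue is a plausible guess at the model behind the 62%. Here is a minimal sketch under that assumption. The data frame `toyGames` and its contents are made up, so the probability it gives is not the figure from the post.

```r
# Made-up results, NOT the real fixtures:
# 1 = Wales win, 0 = anything else; six games at each venue
toyGames <- data.frame(
  walesWin = c(1, 1, 0, 1, 0, 1,  0, 0, 1, 0, 0, 1),
  venue    = factor(rep(c("Cardiff", "London"), each = 6))
)

# Logistic regression of a Wales win on where the game is played
fitVenue <- glm(walesWin ~ venue, family = binomial, data = toyGames)

# Coefficient table with Estimate, Std. Error, z value and Pr(>|z|) columns
print(summary(fitVenue)$coefficients)

# Predicted probability of a Wales win for a game in Cardiff
cardiffGame <- data.frame(venue = factor("Cardiff", levels = levels(toyGames$venue)))
pCardiff <- predict(fitVenue, cardiffGame, type = "response")
print(pCardiff)
```

With a single two-level predictor the fitted probability is just the observed win rate at that venue, which makes the model easy to sanity-check by hand.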
What’s the score?
The sweepstake requires scores. This is the bit I really have no idea about: for a football fan used to scores such as 2-0, rugby scores seem arbitrarily large. Back to the data I guess. First up, what’s the total?
Interestingly, it looks like the total points have been going up since the 50s. At this point I’m desperate, so let’s predict the total score by fitting since the 50s and extrapolating. When Wales win there tend to be fewer points, so let’s throw that into the model as well; it will screen out those silly big English wins at the millennium.
    # Subset to matches since 1950 (start of the statement truncated)
    ... as.Date("1950-01-01"), ]
    # Model total points and winning margin on date, winner, and their interaction
    fitScore <- lm(englandScore + walesScore ~ Date * Winner, data = rugbyData50)
    fitDiff <- lm(winningScore - losingScore ~ Date * Winner, data = rugbyData50)
    # Predict both for a Wales win on match day
    tScore <- predict(fitScore, data.frame(Date = as.Date("2015-02-06"), Winner = "Wales"))
    dScore <- predict(fitDiff, data.frame(Date = as.Date("2015-02-06"), Winner = "Wales"))
This predicts a total score on Friday of 39 and a winning margin of 9, giving my final prediction as:
Wales 24 – 15 England
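Getting from a total and a margin to a scoreline is just two equations in two unknowns: winner + loser = total and winner - loser = margin. A quick sketch using the rounded figures quoted above (the variable names are mine):

```r
# Rounded predictions quoted in the post
totalPoints <- 39
margin      <- 9

# Solve winner + loser = total and winner - loser = margin
winnerPoints <- (totalPoints + margin) / 2
loserPoints  <- (totalPoints - margin) / 2

cat("Wales", winnerPoints, "-", loserPoints, "England\n")
```

If the total and margin had different parity the halves would not be whole numbers, so in general some rounding would be needed.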
That’ll do for a pound I think.