(This article was first published on **Doodling with Data**, and kindly contributed to R-bloggers)

This year, the world chess championship will be played between Viswanathan Anand and 22-year-old Magnus Carlsen in Chennai, India, from the 9th to the 28th of November. Passions are sure to run strong. Both GMs have ardent supporters. Carlsen is in dreamlike form, and Anand has the experience and the home-field advantage. But what do the numbers say?

Let's use R to look at some data and see what we can infer.

The data comes from chessgames.com. I took the raw data and created two new columns to facilitate my analysis: a column called Anand.White (1 or 0), and a column called Anand.won (a factor with 3 values: 0, Draw, 1). The cleaned CSV files can be found here.
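The two helper columns could be derived along the following lines. This is only a sketch: the raw column names `White` and `Result`, the result strings, and the four sample rows are assumptions made for illustration, and the actual chessgames.com export may be laid out differently.

```r
# Toy rows standing in for the raw game list (hypothetical layout, not real games).
games <- data.frame(
  White  = c("Anand", "Carlsen", "Carlsen", "Anand"),
  Result = c("1-0", "1/2-1/2", "1-0", "0-1"),
  stringsAsFactors = FALSE
)

# 1 when Anand has the white pieces, 0 otherwise.
games$Anand.White <- as.integer(games$White == "Anand")

# Work out the outcome from Anand's point of view.
white_won <- games$Result == "1-0"
black_won <- games$Result == "0-1"
anand_won  <- (games$Anand.White == 1 & white_won) |
              (games$Anand.White == 0 & black_won)
anand_lost <- (games$Anand.White == 1 & black_won) |
              (games$Anand.White == 0 & white_won)

# "1" = Anand won, "0" = Anand lost, "Draw" = everything else.
outcome <- ifelse(anand_won, "1", ifelse(anand_lost, "0", "Draw"))
games$Anand.won <- factor(outcome, levels = c("0", "Draw", "1"))

print(games)
```

Storing the outcome as a factor with fixed levels keeps all three categories present in later tallies, even for a subset of games in which one of them never occurs.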

**1. Lifetime tally**

This is always a good place to start. We have data for a total of 62 games. Anand has a slight lead on this count, with 3 more wins than Carlsen. (Not distinguishing between rapid and standard games here.)

In R, we can simply run the **table()** command on the Anand.won column.

**Loss Win Draw**

11 14 37
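To show the shape of that call without the CSV at hand, the sketch below rebuilds a stand-in for the Anand.won column from the totals reported above (11 losses, 14 wins, 37 draws). The variable names are illustrative, and the level labels follow the printed output rather than the 0/Draw/1 coding.

```r
# Stand-in for the Anand.won column, rebuilt from the reported totals.
# The real column would come from the cleaned CSV instead.
anand_won <- factor(
  rep(c("Loss", "Win", "Draw"), times = c(11, 14, 37)),
  levels = c("Loss", "Win", "Draw")
)

# Count the games in each outcome category.
tally <- table(anand_won)
print(tally)

# Share of each outcome across all games, rounded to two decimals.
shares <- round(prop.table(tally), 2)
print(shares)
```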

**2. How has each GM grown in strength**

We can use Elo ratings since 2000 to see how both GMs have performed over time. Anand, of course, has been in the top 5 in the world for the past two decades, pretty much since Carlsen was born! But the visual showing Magnus' meteoric rise is quite striking. (The data comes from FIDE.com and can be found here.)

**3. Win-Loss-Draw record by Year**

Let's say we want to look at how the two GMs have fared against each other, year by year. R has a great package called "plyr" which is tailor-made for these kinds of "Split-Apply-Combine" analyses. We split the data by year, count the wins, losses and draws within each year, and combine the tallies for plotting. The plotting package ggplot2 plays well with the output of plyr.
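The same split-apply-combine idea can be sketched in base R. Note that this is not the author's plyr code: it swaps in `table()` and `xtabs()` so that it runs without extra packages, and the `Year` column and sample rows are hypothetical.

```r
# Toy records (hypothetical years and outcomes, not the actual games).
games <- data.frame(
  Year      = c(2007, 2007, 2007, 2012, 2012, 2012),
  Anand.won = factor(c("Win", "Win", "Draw", "Draw", "Loss", "Draw"),
                     levels = c("Loss", "Draw", "Win"))
)

# Split by year, count each outcome, and combine into long format:
# one row per (Year, outcome) pair, the shape a plotting call usually expects.
tally <- as.data.frame(
  table(Year = games$Year, Anand.won = games$Anand.won),
  responseName = "Games"
)
print(tally)

# Wide view for a quick look: rows are years, columns are outcomes.
by_year <- xtabs(Games ~ Year + Anand.won, data = tally)
print(by_year)
```

The long-format `tally` data frame is the kind of input a stacked bar chart would take, with the year on the x-axis and the outcome as the fill colour.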

Once we do the plotting, we get a sense of what has been happening. In the early 2000s, Anand had a much higher share of wins. Overall, the number of draws has gone up over the years. But Carlsen has had the upper hand in the last year or two. (Of course, the number of games is small, and we should be careful about "inferring" when the data is this tiny.) That said, we could make a strong case that Carlsen has the momentum going for him.

**4. Choice of Openings**

Finally, we know that both GMs are holed up somewhere with their teams of seconds and coaches, preparing. What do these experts prepare? Most of the time, they are preparing opening surprises to spring on their opponents. They are studying each other's games, looking for weaknesses. By looking at how their choice of openings helped them in the past, we can make a broad guess about what they might go for.

R allows us to slice the data by their choice of openings, and we can see how they fared.
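A minimal base-R sketch of such a slice follows. The `ECO` column name, the sample rows, and the "wins minus losses" ranking are all assumptions made for illustration; the author's own code may summarise the openings differently.

```r
# Toy records (hypothetical; the ECO codes serve only as labels here).
games <- data.frame(
  ECO       = c("D47", "D47", "A20", "A20", "C96", "A20"),
  Anand.won = factor(c("Win", "Draw", "Loss", "Draw", "Win", "Loss"),
                     levels = c("Loss", "Draw", "Win"))
)

# Cross-tabulate opening against outcome (rows: ECO code, columns: outcome).
by_opening <- table(games$ECO, games$Anand.won)
print(by_opening)

# Net score per opening from Anand's side: wins minus losses.
net <- by_opening[, "Win"] - by_opening[, "Loss"]

# Openings that worked best for Anand first, worst last.
ranked <- sort(net, decreasing = TRUE)
print(ranked)
```

With only a handful of games per opening, a ranking like this is a rough guide at best, which matches the caution about small samples earlier in the post.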

So we can expect that Anand will favor the openings that have more "green" (wins) for him, while Carlsen will try to play the openings that have been "red" (losses) for Anand.

By this logic, we can expect Anand to opt for the Queen's Gambit Declined, Semi-Slav (D47), the Ruy Lopez, Closed (C96) or the Sicilian, Closed (B23). Magnus will be trying to steer the game towards the English (A20) or the Benko (A58), which are slightly more unorthodox but have served him well against Anand. The expansions behind the ECO list can be found here.

Of course, there will always be surprises, and this is where it all gets game-theoretical. (If only it were this easy to predict...) That's why we should watch what unfolds in November.

The R code used can be found here.

Ram
