Who doesn’t like chess? Me! Sure, I like the idea of chess – intellectual masterminds battling each other using nothing but pure thought – the problem is that I tend to loose, probably because I don’t really know how to play well, and because I never practice. I do know one thing: How much the different pieces are worth, their point values:
This was among the first things I learned when I was taught chess by my father. Given these point values it should be as valuable to have a knight on the board as having three pawns, for example. So where do these values come from? The point values are not actually part of the rules for chess, but are rather just used as a guideline when trading pieces, and they seem to be based on the expert judgment of chess authorities. (According to the guardian of truth there are many alternative valuations, all in the same ballpark as above.) As I recently learned that it is very important to be able to write Big Data on your CV, I decided to see if these point values could be retrieved using zero expert judgement in favor of huge amounts of chess game data.
How to allocate point values to the chess pieces using only data? One way of doing this is to calculate the predictive values of the chess pieces. That is, given that we only know the current number of pieces in a game of chess and use that information to predict the outcome of that game, how much does each type of piece contribute to the prediction? We need a model to predict the outcome of chess games where we have the following restrictions:
- Each type of piece has a single point value that directly contributes to the predicted outcome of the game, so no interaction effects between the pieces.
- The value of a piece does not change over the course of the game.
- Use no context and nor positional information.
Now these restrictions might feel a bit restrictive, especially if we actually would want to predict the outcome of chess games as well as possible, but they come from that the original point values follow the same restrictions. As the original point values doesn’t change with context, neither should ours. Now, as my colleague Can Kabadayi (with an ELO well above 2000) remarked: “But context is everything in Chess!”. Absolutely, but I’m not trying to do anything profound here, this is just a fun exercise! 🙂
Given the restrictions there is one obvious model: Logistic regression, a vanilla statistical model that calculates the probability of a binary event, like a loss-win. To get it going I needed data and the biggest Big Data data set I could find was the Million Base 2.2 which contains over 2.2 million chess games. I had to do a fair deal of data munging to get it into a format that I could work with, but the final result was a table with a lot of rows that looked like this:
pawn_diff rook_diff knight_diff bishop_diff queen_diff white_win 1 0 1 -1 0 TRUE
Here each row is from a position in a game where a positive number means White has more of that piece. For the position above white has one more pawn and knight, but one less bishop than Black. Last in each row we get to know whether White won or lost in the end, as logistic regression assumes a binary outcome I discarded all games that ended in a draw. My résumé is unfortunately not going to look that good as I never really solved the Big Data problem well. Two million chess games are a lot of games and it took my poor old laptop over a day to process only the first 100,000 games. Then I had the classic Big Data problem that I couldn’t fit it all into working memory, so I simply threw away data until worked. Still, for the analysis I ended up using a sample of 1,000,000 chess positions from the first 100,000 games in the Million Base 2.2 . Big enough data for me.
Using the statistical language R I first fitted the following logistic model using maximum likelihood (here described by R’s formula language):
white_win ~ 1 + pawn_diff + knight_diff + bishop_diff + rook_diff + queen_diff