Cricket Moneyball, pt 2

[This article was first published on Sport Data Science, and kindly contributed to R-bloggers].

Hello readers, today we have part 2 of this cricket moneyball series. If you missed the first one, it's here:

In it I looked at calculating the Pythagorean win percentage for each team in the IPL, and then used that to calculate how many extra runs are needed to win one extra game. All team building should therefore be aimed at getting to that number: where can you get an extra 60 runs from?
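As a quick reminder of the idea, here is a minimal Python sketch of the general Pythagorean formula. The function name, the season totals and the exponent are all placeholders for illustration; the exponent that suits the IPL is not restated in this post, so it is left as an explicit argument.

```python
def pythagorean_win_pct(runs_scored, runs_conceded, exponent):
    """Expected win share: scored^x / (scored^x + conceded^x)."""
    scored = runs_scored ** exponent
    conceded = runs_conceded ** exponent
    return scored / (scored + conceded)

# Made-up season totals and an arbitrary exponent, for illustration only
print(pythagorean_win_pct(2300, 2200, exponent=2))
```

A team that scores exactly as many runs as it concedes comes out at 0.5, whatever exponent is used.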

The question I am looking to answer is how to predict how many runs a batsman may score. If I were to build a model, I think by far the most predictive element of how many runs a batsman will score is how many balls they face.
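One way to check how strongly balls faced tracks runs is the r squared of a straight-line fit of runs on balls. The sketch below is Python using only the standard library; the innings in it are made up for illustration and are not the IPL data. For a single predictor, r squared is just the squared correlation between the two columns.

```python
from statistics import mean

def r_squared(x, y):
    """R squared of a simple least-squares line of y on x (squared correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Made-up innings (balls faced, runs scored), for illustration only
balls = [3, 8, 12, 20, 25, 31, 40, 52]
runs = [1, 10, 14, 22, 38, 35, 61, 70]
print(round(r_squared(balls, runs), 3))
```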

Plotting runs against balls faced gives an r squared of 0.8627, so 86% of the variance in runs is explained by how many balls a batsman faces. The next question is how I can come up with a good value for how many balls a batsman will face. It will depend on the bowler and on when in the innings that batsman is facing. These are areas where the model could be further refined going forward. However, to start with, I look at the distributions of innings lengths.

Overall the distributions of innings lengths are similar, as you would expect. The distributions peak at a low number of balls, reflecting that most innings in the IPL are relatively short, but they have a long tail, showing that there are also a lot of innings of substantial length.

To simulate how many runs a batsman might score, I first need to simulate how many balls they face. For that I am going to use the beta distribution and randomly draw from it.

The beta distribution with shape parameters of 1.25 and 6 over 900 draws gives a shape broadly similar to the historical shape of all IPL innings. There are 14 games in the group stage of the IPL, which is a relatively small sample size, and batsmen vary widely in quality. The beta distribution takes two arguments (shape1 and shape2) which dictate its overall shape. The idea is to use a batsman’s historical average balls faced to adjust the shape2 parameter. This will then give a more reasonable draw of balls faced for each batsman.
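A rough sketch of that idea in Python (standard library only) is below. Two things in it are my assumptions rather than anything stated above: the beta draw, which lies between 0 and 1, is scaled by a maximum of 120 balls (the length of a T20 innings), and shape2 is chosen so that the mean of the scaled distribution matches the batsman's historical average. The function names are made up for the example.

```python
import random

MAX_BALLS = 120  # assumption: scale draws by the 120 legal balls of a T20 innings
SHAPE1 = 1.25    # first shape parameter quoted in the post

def shape2_for_average(avg_balls, shape1=SHAPE1, max_balls=MAX_BALLS):
    """Choose shape2 so the scaled beta mean equals avg_balls.

    The mean of Beta(a, b) is a / (a + b). Setting a / (a + b) = avg / max
    and solving for b gives b = a * (max / avg - 1).
    """
    return shape1 * (max_balls / avg_balls - 1)

def simulate_balls_faced(avg_balls, n_innings, rng):
    """Draw n_innings innings lengths (in balls) for one batsman."""
    shape2 = shape2_for_average(avg_balls)
    return [round(rng.betavariate(SHAPE1, shape2) * MAX_BALLS)
            for _ in range(n_innings)]

rng = random.Random(42)  # seeded so the example is repeatable
# One simulated 14-game group stage for a batsman averaging 20 balls an innings
print(simulate_balls_faced(avg_balls=20, n_innings=14, rng=rng))
```

Under this scaling a batsman who averages 20 balls gets a shape2 of 6.25, which is close to the 6 used for the all-innings shape above.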

Reviewing the average number of balls per innings against the total number of balls faced:

It looks like over a fairly decent career size it's pretty difficult to average more than 30 balls per innings. However, there are a few batsmen who, over a relatively short career, average significantly more. To stop these short careers distorting the model, if a batsman has an average innings length of more than 30 balls and has faced fewer than 1000 balls, I am going to set their average balls to 30. This will stop the model overweighting small-sample-size batsmen. This methodology can be further refined in the future.
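The rule is simple enough to write down directly. This is a Python sketch with a made-up function name; the 30-ball cap and the 1000-ball threshold are the values given above.

```python
def adjusted_average_balls(avg_balls, career_balls, cap=30, min_career_balls=1000):
    """Cap the average innings length for batsmen with a short career."""
    if avg_balls > cap and career_balls < min_career_balls:
        return cap
    return avg_balls

print(adjusted_average_balls(avg_balls=38.5, career_balls=420))   # short career: capped
print(adjusted_average_balls(avg_balls=31.2, career_balls=3500))  # long career: kept
```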

We now have the output which simulates how many balls each batsman might face in an innings. The next step is to turn that into a number of runs. I am going to use a linear model to predict the number of runs. This model will be further refined in the future and I will talk about it another time. These are the predictors I will be using:


  • No. Balls Faced – Drawn from the beta model previously
  • Dot Percentage – percentage of balls faced which end in dot balls
  • Non boundary strike rate – does the batsman rotate the strike or just stand there hitting sixes?
  • Six Percentage and Four Percentage – what percentage of their balls do they hit for six and for four
  • Strike Rate – on average the overall strike rate for the batsman
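The rate-style predictors in the list above could be derived from ball-by-ball data along these lines. This is a Python sketch under my own assumptions: the input is the runs scored off the bat for each ball faced, any ball scored as 4 or 6 counts as a boundary, and non boundary strike rate is runs per 100 balls on the remaining deliveries. The function and key names are made up for the example.

```python
def batting_features(runs_per_ball):
    """Summarise one batsman's deliveries into the rate predictors."""
    balls = len(runs_per_ball)
    if balls == 0:
        raise ValueError("need at least one ball faced")
    # Assumption: every 4 or 6 is a boundary (all-run fours are ignored)
    non_boundary = [r for r in runs_per_ball if r not in (4, 6)]
    return {
        "dot_pct": runs_per_ball.count(0) / balls,
        "four_pct": runs_per_ball.count(4) / balls,
        "six_pct": runs_per_ball.count(6) / balls,
        "strike_rate": 100 * sum(runs_per_ball) / balls,
        "non_boundary_strike_rate": (
            100 * sum(non_boundary) / len(non_boundary) if non_boundary else 0.0
        ),
    }

# Made-up sequence of ten deliveries, for illustration only
print(batting_features([0, 1, 4, 0, 6, 2, 1, 0, 1, 4]))
```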

These features can be used to predict how many runs a batsman would be expected to score in the IPL. For now I am just using a simple linear model. This could be improved by using a more powerful model, and probably more powerful predictors, but this is a first version of the model so I am keeping it simple for now.

Now on to evaluating the model performance. The model is only of any use if it produces values close to what you would expect, so it's important to test the performance. The first test: in the 2019 season, for what percentage of batsmen did the model get within +/- 50 runs?

As you can see, over 10,000 runs of the model around 55% of batsmen were within 50 runs, which I am relatively pleased with. Improving model performance is an iterative process and this looks to be a good baseline to start with.
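For a single run of the model, the check amounts to counting how many batsmen land inside the tolerance. Below is a small Python sketch; the function name, the batsman labels and the run totals are all made up for illustration and are not the 2019 results.

```python
def share_within(actual, predicted, tolerance=50):
    """Fraction of batsmen whose predicted season runs are within +/- tolerance."""
    hits = sum(
        1 for name, runs in actual.items()
        if abs(predicted[name] - runs) <= tolerance
    )
    return hits / len(actual)

# Made-up season totals, for illustration only
actual = {"Batsman A": 410, "Batsman B": 250, "Batsman C": 120, "Batsman D": 530}
predicted = {"Batsman A": 380, "Batsman B": 330, "Batsman C": 150, "Batsman D": 600}
print(share_within(actual, predicted))
```

Repeating this over many simulated seasons and averaging the result gives a single headline percentage.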

Also, when the actual runs are plotted against the predicted runs, most batsmen follow a similar line. The furthest point away seems to be KL Rahul, who scored a lot fewer runs than predicted. That's the model for today. In the next blog I’m going to look at individual player performances and compare the 2020 IPL squads. Who bought the best players? The code for the model will be available on GitHub.
