Mirror, mirror … the best batsman of them all?

[This article was first published on Giga thoughts ... » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

“Full many a gem of purest serene
The dark oceans of cave bear.”
Thomas Gray – Elegy in country churchyard

In this post I do a fine grained analysis of the batting performances of cricketing icons from India and also from the international scene to determine how they stack up against each other.  I perform 2 separate analyses 1) Between Indian legends (Sunil Gavaskar, Sachin Tendulkar & Rahul Dravid) and another 2) Between contemporary cricketing stars (Brian Lara, Sachin Tendulkar, Ricky Ponting and A B De Villiers)

In the world and more so in India, Tendulkar is probably placed on a higher pedestal than all other cricketers. I was curious to know how much of this adulation is justified. In “Zen and the art of motorcycle maintenance” Robert Pirsig mentions that while we cannot define Quality (in a book, music or painting) we usually know it when we see it. So do the people see an ineffable quality in Tendulkar or are they intuiting his greatness based on overall averages?

In this context, we need to keep in mind the warning that Daniel Kahnemann highlights in his book, ‘Thinking fast and slow’. Kahnemann suggests that we should regard “statistical intuition with proper suspicion and replace impression formation by computation wherever possible”. This is because our minds usually detects patterns and associations  even when none actually exist.

So this analysis tries to look deeper into these aspects by performing a detailed statistical analysis.

The data for all the batsman has been taken from ESPN Cricinfo. The data is then cleaned to remove ‘DNB’ (did not bat), ‘TDNB’ (Team did not bat) etc before generating the graphs.

The code, data and the plots can be cloned,forked from Github at the following link bestBatsman. You should be able to use the code as-is for any other batsman you choose to.

Feel free to agree, disagree, dispute or argue with my analysis.

The batting performances of the each of the cricketers is described in 3 plots a) Combined boxplot & histogram b) Runs frequency vs Runs plot c) Mean Strike Rate vs Runs plot

A) Batting performance of Sachin Tendulkar

a) Combined Boxplot and histogram of runs scored
srt-boxhist1

The above graph is combined boxplot and a histogram. The boxplot at the top shows the 1st quantile (25th percentile) which is the left side of the green rectangle, the 3rd quantile( 75% percentile) right side of the green rectangle and the mean and the median. These values are also shown in the histogram below. The histogram gives the frequency of Runs scored in the given range for e.g (0-10, 11-20, 21-30 etc) for Tendulkar

b) Batting performance – Runs frequency vs Runs
srt-perf

The graph above plots the  best fitting curve for Runs scored in the frequency ranges.

c) Mean Strike Rate vs Runs
srt-sr

This plot computes the Mean Strike Rate for each interval for e.g if between the ranges 11-21 the Strike Rates were 40.5, 48.5, 32.7, 56.8 then the average of these values is computed for the range 11-21 = (40.5 + 48.5 + 32.7 + 56.8)/4. This is done for all ranges and the Mean Strike Rate in each range is plotted and the loess curve is fitted for this data.

B) Batting performance of Rahul Dravid
a) Combined Boxplot and histogram of runs scored
dravid-boxhist1

The mean, median, the 25th and 75 th percentiles for the runs scored by Rahul Dravid are shown above

b) Batting performance – Runs frequency vs Runs
dravid-perf

c) Mean Strike Rate vs Runs
dravid-sr

C) Batting performance of Sunil Gavaskar
a) Combined Boxplot and histogram of runs scored
gavaskar-boxhist1

The mean, median, the 25th and 75 th percentiles for the runs scored by Sunil Gavaskar are shown above
b) Batting performance – Runs frequency vs Runs
gavaskar-perf

c) Mean Strike Rate vs Runs
gavaskar-sr
D) Relative performances of Tendulkar, Dravid and Gavaskar
relative-perf1

The above plot computes the percentage of the total career runs scored in a given range for each of the batsman.
For e.g if Dravid scored the runs 23, 22, 28, 21, 25 in the range 21-30 then the
Range 21 – 20 => percentageRuns = ( 23 + 22 + 28 + 21 + 25)/ Total runs in career * 100
The above plot shows that Rahul Dravid’s has a higher contribution in the range 20-70 while Tendulkar has a larger percentahe in the range 150-230

E) Relative Strike Rates of Tendulkar, Dravid and Gavaskar
relative-SR

With respect to the Mean Strike Rate Tendulkar is clearly superior to both Gavaskar & Dravid

F) Analysis of Tendulkar, Dravid and Gavaskar
rel-perf1

The above table captures the the career details of each of the batsman
The following points can be noted
1) The ‘number of innings’ is the data you get after removing rows with DNB, TDNB etc
2) Tendulkar has the higher average 48.39 > Gavaskar (47.3) > Dravid (46.46)
3) The skew of  Dravid (1.67) is greater which implies that there the runs scored are more skewed to right (greater runs) in comparison to mean

G) Batting performance of Brian Lara
a) Combined Boxplot and histogram of runs scored
lara-boxhist1
The mean, median, 1st and 3rd quartile are shown above

b) Batting performance – Runs frequency vs Runs
lara-perf

c) Mean Strike Rate vs Runs
lara-sr

H) Batting performance of Ricky Ponting
a) Combined Boxplot and histogram of runs scored
ponting-boxhist1

b) Batting performance – Runs frequency vs Runs
ponting-perf

c) Mean Strike Rate vs Runs
ponting-SR

I) Batting performance of AB De Villiers
a) Combined Boxplot and histogram of runs scored
devilliers-boxhist1

b) Batting performance – Runs frequency vs Runs
devillier-perf

c) Mean Strike Rate vs Runs
devilliers-SR

J) Relative performances of Tendulkar, Lara, Ponting and De Villiers
relative-perf-intl1

Clearly De Villiers is ahead in the percentage Runs scores in the range 30-80. Tendulkar is better in the range between 80-120. Lara’s career has a long tail.

K) Relative Strike Rates of Tendulkar, Lara, Ponting and De Villiers
relative-SR-intl

The Mean Strike Rate of Lara is ahead of the lot, followed by De Villiers, Ponting and then Tendulkar
L) Analysis of Tendulkar, Lara, Ponting and De Villiers
rel-perf-intl1
The following can be observed from the above table
1) Brian Lara has the highest average (51.52) > Sachin Tendulkar (48.39 > Ricky Ponting (46.61) > AB De Villiers (46.55)
2) Brian Lara also the highest skew which means that the data is more skewed to the right of the mean than the others

You can clone the code rom Github at the following link bestBatsman. You should be able to use the code as-is for any other batsman you choose to.

Also see
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Analyzing cricket’s batting legends – Through the mirage with R
4. Masters of spin – Unraveling the web with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4.  Bend it like Bluemix, MongoDB with autoscaling – Part 2


To leave a comment for the author, please follow the link and comment on their blog: Giga thoughts ... » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)