(This article was first published on **Giga thoughts ... » R**, and kindly contributed to R-bloggers)

*“Full many a gem of purest ray serene,*

*The dark unfathom’d caves of ocean bear.”*

Thomas Gray – Elegy Written in a Country Churchyard

In this post I do a fine-grained analysis of the batting performances of cricketing icons from India and from the international scene to determine how they stack up against each other. I perform two separate analyses: 1) between Indian legends (Sunil Gavaskar, Sachin Tendulkar and Rahul Dravid), and 2) between contemporary cricketing stars (Brian Lara, Sachin Tendulkar, Ricky Ponting and AB De Villiers).

In the world, and more so in India, Tendulkar is probably placed on a higher pedestal than all other cricketers. I was curious to know how much of this adulation is justified. In “Zen and the Art of Motorcycle Maintenance”, Robert Pirsig mentions that while we cannot define Quality (in a book, music or painting), we usually know it when we see it. So do people see an ineffable quality in Tendulkar, or are they intuiting his greatness based on overall averages?

In this context, we need to keep in mind the warning that Daniel Kahneman highlights in his book, ‘Thinking, Fast and Slow’. Kahneman suggests that we should regard “statistical intuition with proper suspicion and replace impression formation by computation wherever possible”. This is because our minds tend to detect patterns and associations even when none actually exist.

So this analysis tries to look deeper into these aspects by performing a detailed statistical analysis.

The data for all the batsmen has been taken from ESPN Cricinfo. The data is then cleaned to remove rows marked ‘DNB’ (did not bat), ‘TDNB’ (team did not bat) etc. before generating the graphs.

The code, data and plots can be cloned or forked from Github at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.
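The cleaning step can be sketched as follows. This is a minimal Python illustration and not the R code from the Github repository; the sample values are made up, and the assumption that not-out scores carry a trailing `*` reflects how scorecards are usually written rather than anything stated in this post.

```python
# Markers for rows where the player did not bat. The post names DNB and TDNB
# and then says "etc"; extend this set for whatever other markers the data has.
NON_BATTING = {"DNB", "TDNB"}


def clean_runs(raw_runs):
    """Drop non-batting rows and convert the remaining scores to integers."""
    cleaned = []
    for value in raw_runs:
        value = value.strip()
        if value in NON_BATTING:
            continue
        # Assumption: a not-out score is written with a trailing "*", e.g. "103*".
        cleaned.append(int(value.rstrip("*")))
    return cleaned


# Hypothetical raw "Runs" column, for illustration only.
raw = ["15", "DNB", "103*", "0", "TDNB", "41"]
runs = clean_runs(raw)
print(runs)  # [15, 103, 0, 41]
```

The length of the cleaned list is what the tables later in the post call the ‘number of innings’.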

Feel free to agree, disagree, dispute or argue with my analysis.

The batting performance of each of the cricketers is described in 3 plots: a) Combined boxplot & histogram b) Runs frequency vs Runs plot c) Mean Strike Rate vs Runs plot

**A) Batting performance of Sachin Tendulkar**

a) Combined Boxplot and histogram of runs scored

The above graph is a combined boxplot and histogram. The boxplot at the top shows the 1st quartile (25th percentile), which is the left side of the green rectangle, the 3rd quartile (75th percentile), which is the right side of the green rectangle, and the mean and the median. These values are also shown in the histogram below. The histogram gives the frequency of runs scored in each range (e.g. 0-10, 11-20, 21-30 etc.) for Tendulkar.
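The numbers behind such a plot can be computed as in the sketch below. It is a Python illustration with made-up scores, not the post's R code; the quartile convention (`method="inclusive"`) is an assumption, and other tools may interpolate quartiles slightly differently.

```python
import statistics
from collections import Counter


def summarize(runs):
    """Return the quartiles, median and mean marked on the boxplot."""
    q1, median, q3 = statistics.quantiles(runs, n=4, method="inclusive")
    return {"q1": q1, "median": median, "mean": statistics.mean(runs), "q3": q3}


def histogram(runs, width=10):
    """Count innings per range: 0-10, 11-20, 21-30, ..."""
    counts = Counter()
    for r in runs:
        index = max(0, (r - 1) // width)  # 0..10 -> 0, 11..20 -> 1, ...
        low = 0 if index == 0 else index * width + 1
        counts[(low, (index + 1) * width)] += 1
    return dict(sorted(counts.items()))


# Made-up scores, for illustration only.
runs = [0, 4, 10, 11, 20, 35, 52, 100]
stats = summarize(runs)
bins = histogram(runs)
print(stats)
print(bins)
```

Each key of `bins` is a `(low, high)` range and each value is the number of innings that fell in it, which is the height of the corresponding histogram bar.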

b) Batting performance – Runs frequency vs Runs

The graph above plots the best-fitting curve of the frequency of innings against the runs scored in each range.

c) Mean Strike Rate vs Runs

This plot computes the Mean Strike Rate for each interval. For example, if the Strike Rates for innings in the range 11-20 were 40.5, 48.5, 32.7 and 56.8, then the Mean Strike Rate for the range 11-20 is (40.5 + 48.5 + 32.7 + 56.8)/4 = 44.625. This is done for all ranges, the Mean Strike Rate in each range is plotted, and a loess curve is fitted to this data.
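The per-range averaging described above can be sketched as follows. This is an illustrative Python version, not the post's R code: the `(runs, strike_rate)` pairs are made up, the ranges are taken as 0-10, 11-20, 21-30 and so on as in the histogram, and the loess fit is left out.

```python
from collections import defaultdict
from statistics import mean


def mean_strike_rate_by_range(innings, width=10):
    """Average the strike rate of all innings whose runs fall in each range.

    innings: iterable of (runs, strike_rate) pairs.
    """
    buckets = defaultdict(list)
    for runs, strike_rate in innings:
        index = max(0, (runs - 1) // width)  # 0..10 -> 0, 11..20 -> 1, ...
        buckets[index].append(strike_rate)

    result = {}
    for index in sorted(buckets):
        low = 0 if index == 0 else index * width + 1
        result[(low, (index + 1) * width)] = mean(buckets[index])
    return result


# Made-up innings: four scores in the range 11-20 with the strike rates from
# the worked example, plus one innings in another range.
innings = [(12, 40.5), (15, 48.5), (18, 32.7), (20, 56.8), (45, 61.0)]
rates = mean_strike_rate_by_range(innings)
print(rates[(11, 20)])  # approximately 44.625
```

The resulting dictionary holds one point per range; those points are what a smoothing curve would then be fitted through.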

**B) Batting performance of Rahul Dravid**

a) Combined Boxplot and histogram of runs scored

The mean, median, and the 25th and 75th percentiles for the runs scored by Rahul Dravid are shown above.

b) Batting performance – Runs frequency vs Runs

**C) Batting performance of Sunil Gavaskar**

a) Combined Boxplot and histogram of runs scored

The mean, median, and the 25th and 75th percentiles for the runs scored by Sunil Gavaskar are shown above.

b) Batting performance – Runs frequency vs Runs

c) Mean Strike Rate vs Runs

**D) Relative performances of Tendulkar, Dravid and Gavaskar**

The above plot computes the percentage of total career runs scored in a given range for each batsman.

For example, if Dravid scored 23, 22, 28, 21 and 25 in the range 21-30, then

Range 21 – 30 => percentageRuns = (23 + 22 + 28 + 21 + 25) / Total runs in career * 100
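The same calculation can be written as a short function. This is a Python sketch with a made-up career, not the post's R code; the function name and the range width of 10 are assumptions chosen to match the example.

```python
def percentage_runs_by_range(runs, width=10):
    """Share of total career runs (in percent) contributed by each range."""
    total = sum(runs)
    if total == 0:
        raise ValueError("career total is zero; percentages are undefined")

    totals = {}
    for r in runs:
        index = max(0, (r - 1) // width)  # 0..10 -> 0, 11..20 -> 1, ...
        low = 0 if index == 0 else index * width + 1
        key = (low, (index + 1) * width)
        totals[key] = totals.get(key, 0) + r
    return {key: 100 * s / total for key, s in sorted(totals.items())}


# Made-up career: the five scores from the example plus two other innings,
# giving a career total of 200 runs.
career = [23, 22, 28, 21, 25, 0, 81]
pct = percentage_runs_by_range(career)
print(pct[(21, 30)])  # 59.5, i.e. 119 of 200 runs
```

Because every range is divided by the same career total, the percentages across all ranges sum to 100, which makes batsmen with different career lengths comparable.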

The above plot shows that Rahul Dravid has a higher contribution in the range 20-70, while Tendulkar has a larger percentage in the range 150-230.

**E) Relative Strike Rates of Tendulkar, Dravid and Gavaskar**

With respect to the Mean Strike Rate, Tendulkar is clearly superior to both Gavaskar & Dravid.

**F) Analysis of Tendulkar, Dravid and Gavaskar**

The above table captures the career details of each of the batsmen.

The following points can be noted

1) The ‘number of innings’ is the number of rows left after removing rows with DNB, TDNB etc.

2) Tendulkar has the highest average (48.39) > Gavaskar (47.3) > Dravid (46.46)

3) Dravid has the greatest skew (1.67), which implies that his runs scored are more skewed to the right (towards greater runs) relative to the mean than those of the other two
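For reference, skew can be computed as in the sketch below. The post does not say which skewness formula was used, so this Python example assumes the plain moment coefficient (third central moment divided by the 1.5th power of the second); other definitions apply a sample-size correction and give slightly different values. The data is made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5  # undefined if all values are equal (m2 == 0)


# Made-up data: mostly low scores with one very large innings.
right_tailed = [0, 5, 10, 12, 20, 25, 150]
# Made-up data: evenly spread around the mean.
symmetric = [10, 20, 30, 40, 50]

print(skewness(right_tailed) > 0)  # True: long tail of large scores
print(skewness(symmetric))
```

A positive value indicates a long right tail: most innings fall below the mean and a few large scores pull it upwards.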

**G) Batting performance of Brian Lara**

a) Combined Boxplot and histogram of runs scored

The mean, median, and 1st and 3rd quartiles are shown above.

b) Batting performance – Runs frequency vs Runs

**H) Batting performance of Ricky Ponting**

a) Combined Boxplot and histogram of runs scored

b) Batting performance – Runs frequency vs Runs

**I) Batting performance of AB De Villiers**

a) Combined Boxplot and histogram of runs scored

b) Batting performance – Runs frequency vs Runs

**J) Relative performances of Tendulkar, Lara, Ponting and De Villiers**

Clearly De Villiers is ahead in the percentage of runs scored in the range 30-80. Tendulkar is better in the range 80-120. Lara’s career has a long tail.

**K) Relative Strike Rates of Tendulkar, Lara, Ponting and De Villiers**

The Mean Strike Rate of Lara is ahead of the lot, followed by De Villiers, Ponting and then Tendulkar.

**L) Analysis of Tendulkar, Lara, Ponting and De Villiers**

The following can be observed from the above table

1) Brian Lara has the highest average (51.52) > Sachin Tendulkar (48.39) > Ricky Ponting (46.61) > AB De Villiers (46.55)

2) Brian Lara also has the highest skew, which means that his data is more skewed to the right of the mean than the others

You can clone the code from Github at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.

Also see

1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

3. Analyzing cricket’s batting legends – Through the mirage with R

4. Masters of spin – Unraveling the web with R

You may also like

1. A peek into literacy in India:Statistical learning with R

2. A crime map of India in R: Crimes against women

3. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1

4. Bend it like Bluemix, MongoDB with autoscaling – Part 2
