Cricket data analysis

September 4, 2010

(This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers)

Cricket World Cup 2011 is approaching, and I'm interested in analyzing One Day International (ODI) cricket data to predict some results and share interesting information about the game.

For the analysis I need cricket data, and I tried several ways to get it:
  • Personal research: I explored the web but couldn't find aggregated cricket data anywhere. There are many websites devoted to cricket statistics, but none of them were useful (except Cricinfo, my favorite cricket website).
  • Reaching out to my network: I asked my friends for advice last month and received many emails with information and offers to help compile the data.
  • Reaching out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.
So I decided to collect the data myself by scraping cricket scorecards from the web. I first tried R libraries for web scraping but found them lacking, so I switched to Ruby, which has a great library for web scraping: Hpricot (thanks to Satty for getting me started and to Amit and Thomas for solving my newbie issues).
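
To give a feel for what scraping a scorecard involves, here is a minimal sketch. It is not the actual script: it uses plain regular expressions from Ruby's standard library instead of Hpricot so that it runs without extra gems, and the HTML fragment is a made-up simplification (real scorecard pages have far messier markup). The name extract_rows is hypothetical.

```ruby
# Hypothetical, simplified scorecard fragment (an assumption for illustration;
# real scorecard pages use different and more complicated markup).
HTML = <<~PAGE
  <table class="batting">
    <tr><th>Player</th><th>Out</th><th>Runs</th></tr>
    <tr><td>V Sehwag</td><td>lbw b Kulasekara</td><td>12</td></tr>
    <tr><td>RG Sharma</td><td>lbw b Mathews</td><td>11</td></tr>
  </table>
PAGE

# Return one array of cell strings per table row.
# Rows without <td> cells (such as the header row) are dropped.
def extract_rows(html)
  rows = html.scan(%r{<tr[^>]*>(.*?)</tr>}m).map do |(row)|
    cells = row.scan(%r{<td[^>]*>(.*?)</td>}m).flatten
    # Strip any nested tags and surrounding whitespace from each cell.
    cells.map { |cell| cell.gsub(/<[^>]+>/, "").strip }
  end
  rows.reject(&:empty?)
end

extract_rows(HTML).each do |player, out, runs|
  puts [player, out, runs].join("\t")
end
```

Regular expressions are fragile on real-world HTML, which is why a proper parser such as Hpricot is the better tool once the pages get complicated.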

I'm happy to report that I now have a robust Ruby script that can download all One Day International cricket data (3000+ matches) into 3 handy files:

1) Win-Loss data:

Match_ID | Team1 | Team2 | Winner | Margin | First.Innings.Total | Second.Innings.Total | Ground | Matchdate | Ground_Country | Ground_Latitude | Ground_Longitude | Series
ODI no. 1 | Sri Lanka | New Zealand | no result | | 203 | | Dambulla | Aug 19, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series
ODI no. 2 | Sri Lanka | India | Sri Lanka | 8 wickets | 103 | 104 | Dambulla | Aug 22, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series
ODI no. 3 | India | New Zealand | India | 105 runs | 223 | 118 | Dambulla | Aug 25, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series

2) Batting data:

Match_ID | Inning | Player | Country | Out | Runs | Minutes | Balls | Fours | Sixes | Scorerate
ODI no. 1 | 1 | V Sehwag | India | lbw b Kulasekara | 12 | | | | |
ODI no. 1 | 1 | RG Sharma | India | lbw b Mathews | 11 | | | | |
ODI no. 1 | 1 | Yuvraj Singh | India | lbw b Malinga | 38 | | | | |
ODI no. 1 | 1 | SK Raina | India | c Sangakkara b Perera | 8 | | | | |
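
The Out column packs several facts into one string (how the batsman got out, who caught the ball, who bowled it). A rough sketch of how such strings could be split for analysis follows. The function name parse_dismissal and the hash keys are my own assumptions, and only the few dismissal forms listed in the code are handled; everything else (run outs, stumpings and so on) is returned as "other" with the raw text kept.

```ruby
# Split a scorecard dismissal string into the mode of dismissal,
# the fielder (if any) and the bowler credited with the wicket.
# Only a handful of common forms are recognised in this sketch.
def parse_dismissal(text)
  case text.strip
  when /\Albw b (.+)\z/
    { how: "lbw", fielder: nil, bowler: $1 }
  when /\Ac & b (.+)\z/
    # Caught and bowled: the bowler is also the fielder.
    { how: "caught", fielder: $1, bowler: $1 }
  when /\Ac (.+) b (.+)\z/
    { how: "caught", fielder: $1, bowler: $2 }
  when /\Ab (.+)\z/
    { how: "bowled", fielder: nil, bowler: $1 }
  when /\Anot out\z/
    { how: "not out", fielder: nil, bowler: nil }
  else
    { how: "other", fielder: nil, bowler: nil, raw: text }
  end
end

["lbw b Kulasekara", "c Sangakkara b Perera"].each do |out|
  puts "#{out} => #{parse_dismissal(out)}"
end
```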

3) Bowling data:

Match_ID | Inning | Player | Country | Overs | Maidens | Runs | Wickets | Economy
ODI no. 1 | 1 | SL Malinga | Sri Lanka | 9 | 1 | 21 | 2 | 2.33
ODI no. 1 | 1 | KMDN Kulasekara | Sri Lanka | 9 | 2 | 31 | 2 | 3.44
ODI no. 1 | 1 | AD Mathews | Sri Lanka | 8 | 3 | 20 | 1 | 2.5
ODI no. 1 | 1 | NLTC Perera | Sri Lanka | 7.4 | 1 | 28 | 5 | 3.65
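
One detail to watch when working with the bowling data: overs are written in cricket notation, so 7.4 means 7 overs and 4 balls, not 7.4 decimal overs. The sketch below shows how the Economy column (runs conceded per over) can be recomputed from Runs and Overs, assuming six-ball overs. The function names balls_bowled and economy are hypothetical, not taken from the original script.

```ruby
# Convert cricket overs notation to a number of balls, assuming 6-ball overs.
# "7.4" means 7 overs and 4 balls, not 7.4 decimal overs.
def balls_bowled(overs)
  whole, part = overs.to_s.split(".")
  whole.to_i * 6 + part.to_i
end

# Runs conceded per over, rounded to two decimals; nil if nothing was bowled.
def economy(runs, overs)
  balls = balls_bowled(overs)
  return nil if balls.zero?
  (runs * 6.0 / balls).round(2)
end

# Figures taken from the sample bowling rows above: [player, overs, runs].
[
  ["SL Malinga",      "9",   21],
  ["KMDN Kulasekara", "9",   31],
  ["AD Mathews",      "8",   20],
  ["NLTC Perera",     "7.4", 28]
].each do |player, overs, runs|
  puts "#{player}: #{economy(runs, overs)}"
end
```

For the four sample rows this reproduces the Economy values shown in the table (2.33, 3.44, 2.5 and 3.65). Treating 7.4 as a decimal would give 3.78 for Perera instead of 3.65.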

The Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune. Most of that time went into dealing with the typical data issues that come with web scraping and into making the script generic enough to handle Test cricket and T20 cricket scorecards as well.
