September 4, 2010

The Cricket World Cup 2011 is approaching, and I’m interested in analyzing One Day International cricket data to predict some results and share interesting information about cricket.

For the analysis, I needed cricket data, and I tried several things to get it:

  • Personal research: I explored the web but couldn’t find aggregated cricket data anywhere. There are many cricket statistics websites, but none of them were useful (except Cricinfo, my favorite cricket website).
  • Reached out to my network: I asked my friends for advice last month and received many emails with information and offers to help compile the data.
  • Reached out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.

So I decided to collect this data myself by web scraping cricket scorecards. I first tried to use R libraries for the scraping but found them lacking. So I switched to Ruby, which has a great library for web scraping: Hpricot (thanks to Satty for getting me started and to Amit and Thomas for solving my newbie issues).

I’m happy to report that now I have a robust Ruby script that can download all One Day International Cricket data (3000+ matches) in 3 handy files:

1) Win-Loss data:

Match_ID   Team1      Team2        Winner     Margin     First.Innings.Total  Second.Innings.Total  Ground    Matchdate     Ground_Country  Ground_Latitude  Ground_Longitude  Series
ODI no. 1  Sri Lanka  New Zealand  no result             203                                        Dambulla  Aug 19, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 2  Sri Lanka  India        Sri Lanka  8 wickets  103                  104                   Dambulla  Aug 22, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 3  India      New Zealand  India      105 runs   223                  118                   Dambulla  Aug 25, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series

2) Batting data:

Match_ID   Inning  Player        Country  Out                    Runs  Minutes  Balls  Fours  Sixes  Scorerate
ODI no. 1  1       V Sehwag      India    lbw b Kulasekara       12             12     2      0      100
ODI no. 1  1       RG Sharma     India    lbw b Mathews          11             21     2      0      52.38
ODI no. 1  1       Yuvraj Singh  India    lbw b Malinga          38             64     5      1      59.37
ODI no. 1  1       SK Raina      India    c Sangakkara b Perera  8              16     1      0      50

3) Bowling data:

Match_ID   Inning  Player           Country    Overs  Maidens  Runs  Wickets  Economy
ODI no. 1  1       SL Malinga       Sri Lanka  9      1        21    2        2.33
ODI no. 1  1       KMDN Kulasekara  Sri Lanka  9      2        31    2        3.44
ODI no. 1  1       AD Mathews       Sri Lanka  8      3        20    1        2.5
ODI no. 1  1       NLTC Perera      Sri Lanka  7.4    1        28    5        3.65

This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune the script. Most of that time was spent dealing with the typical data issues associated with web scraping and making the script generic enough to handle Test cricket and T20 cricket scorecards as well.
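
For anyone who wants to sanity-check the derived columns, here is a minimal sketch of how the Scorerate and Economy values in the samples above can be recomputed from the raw columns. The method names are hypothetical and this is not taken from the scraping script. It assumes values are truncated (not rounded) to two decimals, which is only inferred from the sample row where 38 runs off 64 balls is listed as 59.37, and it assumes the usual overs notation in which 7.4 means 7 overs and 4 balls.

```ruby
# Hypothetical helpers (not part of the original scraping script) that
# recompute the derived columns shown in the batting and bowling samples.

# Strike rate: runs scored per 100 balls faced.
# Truncating to two decimals is an assumption inferred from the sample data.
def score_rate(runs, balls)
  return nil if balls.zero?
  Rational(runs * 100, balls).truncate(2).to_f
end

# Overs notation: "7.4" means 7 complete overs plus 4 balls,
# so convert it to a ball count before dividing.
def balls_bowled(overs)
  whole, part = overs.to_s.split(".")
  whole.to_i * 6 + part.to_i
end

# Economy: runs conceded per six-ball over, truncated to two decimals.
def economy(runs, overs)
  balls = balls_bowled(overs)
  return nil if balls.zero?
  Rational(runs * 6, balls).truncate(2).to_f
end

puts score_rate(11, 21)  # RG Sharma row
puts economy(28, "7.4")  # NLTC Perera row
```

Treating 7.4 overs as 46 balls matters here: dividing 28 runs by 7.4 directly would give a different figure than the 3.65 listed in the bowling sample.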

