The Cricket World Cup 2011 is approaching, and I’m interested in analyzing One Day International (ODI) cricket data to predict some results and share interesting findings about cricket.
For the analysis I need cricket data, and I tried several things to get it:
- Personal research: I explored the web but couldn’t find aggregated cricket data anywhere. There are many cricket statistics websites, but none of them were useful (except Cricinfo, my favorite cricket website).
- Reached out to my network: I asked my friends for advice last month and received many emails with information and offers to help compile the data.
- Reached out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.
So I decided to collect this data myself by web scraping cricket scorecards. I first tried to use R libraries for the scraping but found them lacking. So I switched to Ruby, which has a great library for web scraping, Hpricot (thanks Satty for getting me started and Amit/Thomas for solving my newbie issues).
I’m happy to report that I now have a robust Ruby script that can download all One Day International cricket data (3000+ matches) into 3 handy files:
1) Win-Loss data:
|ODI no. 1||Sri Lanka||New Zealand||no result||203||Dambulla||Aug 19, 2010||Sri Lanka||7.8566667||80.6491667||Sri Lanka Triangular Series|
|ODI no. 2||Sri Lanka||India||Sri Lanka||8 wickets||103||104||Dambulla||Aug 22, 2010||Sri Lanka||7.8566667||80.6491667||Sri Lanka Triangular Series|
|ODI no. 3||India||New Zealand||India||105 runs||223||118||Dambulla||Aug 25, 2010||Sri Lanka||7.8566667||80.6491667||Sri Lanka Triangular Series|
2) Batting data:
|Match||Innings||Batsman||Team||Dismissal||Runs||Balls||4s||6s||Strike rate|
|ODI no. 1||1||V Sehwag||India||lbw b Kulasekara||12||12||2||0||100|
|ODI no. 1||1||RG Sharma||India||lbw b Mathews||11||21||2||0||52.38|
|ODI no. 1||1||Yuvraj Singh||India||lbw b Malinga||38||64||5||1||59.37|
|ODI no. 1||1||SK Raina||India||c Sangakkara b Perera||8||16||1||0||50|
3) Bowling data:
|Match||Innings||Bowler||Team||Overs||Maidens||Runs||Wickets||Economy|
|ODI no. 1||1||SL Malinga||Sri Lanka||9||1||21||2||2.33|
|ODI no. 1||1||KMDN Kulasekara||Sri Lanka||9||2||31||2||3.44|
|ODI no. 1||1||AD Mathews||Sri Lanka||8||3||20||1||2.5|
|ODI no. 1||1||NLTC Perera||Sri Lanka||7.4||1||28||5||3.65|
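As a small illustration of working with these files, here is a minimal Ruby sketch that splits one sample row into named fields and recomputes the last column as a sanity check. It is not part of my scraping script. The helper names (`split_row`, `parse_row`, `overs_to_balls`) and the column names are assumptions inferred from the sample rows above, and I'm assuming overs are written in the usual cricket notation where 7.4 means 7 overs and 4 balls.

```ruby
# Column names are assumptions inferred from the sample rows; the files
# shown above do not necessarily carry a header line.
BATTING_COLUMNS = %i[match innings batsman team dismissal runs balls fours sixes strike_rate].freeze
BOWLING_COLUMNS = %i[match innings bowler team overs maidens runs wickets economy].freeze

# Split a row such as "|a||b||c|" into ["a", "b", "c"].
def split_row(line)
  line.strip.sub(/\A\|/, "").sub(/\|\z/, "").split("||")
end

# Pair each field with its (assumed) column name.
def parse_row(line, columns)
  columns.zip(split_row(line)).to_h
end

# Cricket overs notation: "7.4" means 7 overs and 4 balls, i.e. 46 balls.
def overs_to_balls(overs)
  whole, part = overs.split(".")
  whole.to_i * 6 + part.to_i
end

# Runs scored per 100 balls faced.
def strike_rate(runs, balls)
  (100.0 * runs / balls).round(2)
end

# Runs conceded per six-ball over.
def economy(runs, overs)
  (6.0 * runs / overs_to_balls(overs)).round(2)
end

batting = parse_row("|ODI no. 1||1||RG Sharma||India||lbw b Mathews||11||21||2||0||52.38|", BATTING_COLUMNS)
bowling = parse_row("|ODI no. 1||1||NLTC Perera||Sri Lanka||7.4||1||28||5||3.65|", BOWLING_COLUMNS)

puts strike_rate(batting[:runs].to_i, batting[:balls].to_i)  # 52.38
puts economy(bowling[:runs].to_i, bowling[:overs])           # 3.65
```

Both recomputed values match the last column of the sample rows. Other rows may differ in the final digit if the source truncates rather than rounds, so a check like this is best used with a small tolerance.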
This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune the script. Most of that time was spent dealing with the typical data issues associated with web scraping and making the script generic enough to handle Test cricket and T20 cricket scorecards as well.