Analysing IPL matches using Cricsheet data – Part 1

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers.]

In this series of articles, I analyse Indian Premier League (IPL) cricket matches in the R programming language, using data from Cricsheet. Cricsheet is an excellent website which provides ball-by-ball data for a large number of cricket matches. The IPL is a professional Twenty20 cricket league in India. I chose the IPL because complete data is available for all seasons.

The data used is the ipl.zip file as downloaded in December 2017. The data is provided in YAML format and requires some processing before it can be analysed. I prefer to convert it into multiple tables, which makes it easier to query and summarise the data at various levels. In this article, I only read the match-level information; a subsequent article covers the ball-by-ball information.

The format of the data is described on this page. YAML data has a tree-like structure. The R package yaml loads a YAML file and converts it into a deeply nested list. The user-defined function cricsheet_ipl_load_meta then converts that list into a set of tables. The following tables, all linked via a common match id, are created:

  • metadata – match id, file version, revision and date of creation
  • match_info – captures the city, match date, player of the match, venue and a flag to indicate if the venue was neutral
  • match_teams – captures the two teams which played the match
  • match_toss – captures the team which won the toss and the decision to bat or field
  • match_umpires – captures the two on-field umpires for the match
  • match_outcome – captures the result of the match and the number of runs or wickets by which the match was won; if the match was decided via a super-over, then the result is recorded as a tie and a flag is set to indicate that an eliminator over was used
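
To make the nested list structure concrete, here is a minimal sketch that parses a small YAML string with yaml.load. The snippet is a hand-written, heavily trimmed stand-in for a Cricsheet file, so the field values are illustrative only and real files contain more fields.

```r
library(yaml)

# Hypothetical, heavily trimmed match file; values are for illustration only
sample_yaml <- "
meta:
  data_version: 0.9
  created: '2017-12-01'
  revision: 1
info:
  city: Bangalore
  dates:
    - '2008-04-18'
  teams:
    - Royal Challengers Bangalore
    - Kolkata Knight Riders
  toss:
    winner: Royal Challengers Bangalore
    decision: field
  outcome:
    by:
      runs: 140
    winner: Kolkata Knight Riders
"

# yaml.load parses a YAML string; yaml.load_file does the same for a file
sample <- yaml.load(sample_yaml)

# Each YAML mapping becomes a named list, so fields are reached with $
names(sample)
sample$info$toss$decision
sample$info$outcome$by$runs
```

Each level of the YAML tree becomes one level of list nesting, which is why the loader function reaches fields with chains such as info$toss$winner.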

I assume that an RStudio project has been set up and that the input files are all stored in a folder called data inside the project. The code starts by loading the required packages and defining the function described above.

library(tidyverse)  # attaches tibble, stringr, readr, dplyr and purrr
library(yaml)       # yaml.load_file
library(lubridate)  # ymd

cricsheet_ipl_load_meta <- function(input_file) {
  
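  # Assign the match id based on the file name; basename() drops the directory
  # part so that digits elsewhere in the path are not picked up
  match_id <- str_extract(basename(input_file), "[0-9]+")
  match_id <- parse_integer(match_id)
  writeLines(as.character(match_id))  # progress message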
  
  # Load the input file
  input_data <- yaml.load_file(input_file)
  
  # Metadata table
  meta_version <- input_data$meta$data_version
  meta_created <- ymd(input_data$meta$created)
  meta_revision <- input_data$meta$revision
  metadata <- tibble(
    id = match_id,
    version = meta_version,
    created = meta_created,
    revision = meta_revision
  )
  
  # Match information table
  info <- input_data$info
  info_city <- ifelse("city" %in% names(info), info$city, NA)
  info_date <- ymd(info$dates[[1]]) # Assume an IPL match is played on a single day; keep the first date
  info_player_of_match <- ifelse("player_of_match" %in% names(info),
                                 info$player_of_match, NA)
  info_venue <- ifelse("venue" %in% names(info), info$venue, NA)
  info_neutral_venue <- ifelse("neutral_venue" %in% names(info),
                               info$neutral_venue, 0)
  # Ignore competition, gender, overs
  match_info <- tibble(
    id = match_id,
    city = info_city,
    date = info_date,
    player_of_match = info_player_of_match,
    venue = info_venue,
    neutral_venue = info_neutral_venue
  )
  
  # Match teams table
  info_teams <- info$teams
  match_teams <- tibble(
    id = rep(match_id, 2),
    teams = info_teams
  )
  
  # Match toss table
  info_toss_winner <- info$toss$winner
  info_toss_decision <- info$toss$decision
  match_toss <- tibble(
    id = match_id,
    winner = info_toss_winner,
    decision = info_toss_decision
  )
  
  # Match umpires
  info_umpires <- info$umpires
  match_umpires <- tibble(
    id = rep(match_id, 2),
    umpires = info_umpires
  )
  
  # Match outcomes
  info_outcome <- input_data$info$outcome
  info_winner <- NA
  info_result <- NA
  info_result_margin <- NA
  info_eliminator <- NA
  if ("winner" %in% names(info_outcome)) {
    info_winner <- info_outcome$winner
    info_eliminator <- "N"
    info_result <- ifelse("runs" %in% names(info_outcome$by),
                          "runs", "wickets")
    info_result_margin <- ifelse("runs" %in% names(info_outcome$by),
                                 info_outcome$by$runs,
                                 info_outcome$by$wickets)
  } else if ("eliminator" %in% names(info_outcome)) {
    info_winner <- info_outcome$eliminator
    info_eliminator <- "Y"
    info_result <- info_outcome$result
  }
  info_method <- ifelse("method" %in% names(info_outcome),
                        info_outcome$method, NA)
  match_outcome <- tibble(
    id = match_id,
    winner = info_winner,
    result = info_result,
    result_margin = info_result_margin,
    eliminator = info_eliminator,
    method = info_method
  )
  
  # Return a list of tables
  retlist <- list(metadata = metadata, match_info = match_info,
                  match_teams = match_teams, match_toss = match_toss,
                  match_umpires = match_umpires, match_outcome = match_outcome)
  return(retlist)
}
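
The function repeats the same check-then-read pattern for every optional field. As a sketch only, that pattern could be factored into a small helper; field_or is a hypothetical name and is not part of the original code.

```r
# Hypothetical helper: return the first value of a named field,
# or a default when the field is absent from the list
field_or <- function(x, field, default = NA) {
  if (field %in% names(x)) x[[field]][[1]] else default
}

# Toy info list with invented values; city and neutral_venue are missing
info <- list(
  venue = "Eden Gardens",
  player_of_match = c("A Player", "B Player")
)

field_or(info, "venue")             # field present
field_or(info, "city")              # field absent, default NA
field_or(info, "neutral_venue", 0)  # field absent, explicit default
field_or(info, "player_of_match")   # several values, first one kept
```

Like the scalar ifelse calls in the function, the helper keeps only the first value when a field holds several.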

Once the above function is defined, it is simply a matter of mapping it over all the file names.

# Read all the IPL data; pattern is a regular expression, not a glob
filenames <- list.files("data", pattern = "\\.yaml$", full.names = TRUE)
ipl_data <- map(filenames, cricsheet_ipl_load_meta)
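
A plain map stops at the first file that fails to load. One possible safeguard, shown here as a sketch rather than as part of the original workflow, is to wrap the loader in purrr's safely so that failures are collected instead of aborting the run. The toy_loader function and the file paths are made up for the illustration.

```r
library(purrr)

# Stand-in loader (hypothetical): fails for one input to mimic a malformed file
toy_loader <- function(path) {
  if (grepl("bad", path)) stop("could not parse ", path)
  list(id = basename(path))
}

paths <- c("data/1.yaml", "data/bad.yaml", "data/2.yaml")

# safely() returns a function that yields list(result = , error = )
results <- map(paths, safely(toy_loader))

# Separate the successful results from the inputs that failed
failed <- map_lgl(results, ~ !is.null(.x$error))
loaded <- map(results[!failed], "result")

paths[failed]
length(loaded)
```

The same wrapper could be applied to cricsheet_ipl_load_meta, at the cost of one extra step to unpack the result element of each entry.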

The call to map returns a large list, each element of which stores the six tables described above for one match. The following code combines them into six individual tables which hold the complete information.

# Store all the data as individual data frames
# map() with a name extracts that element from the result for each match
metadata <- bind_rows(map(ipl_data, "metadata"))
match_info <- bind_rows(map(ipl_data, "match_info"))
match_teams <- bind_rows(map(ipl_data, "match_teams"))
match_toss <- bind_rows(map(ipl_data, "match_toss"))
match_umpires <- bind_rows(map(ipl_data, "match_umpires"))
match_outcome <- bind_rows(map(ipl_data, "match_outcome"))

# Clean up
rm(ipl_data, filenames)
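
Because every table carries the match id, the tables can be combined with a join whenever a question spans more than one of them. The sketch below uses two tiny made-up tables with the same column names as match_toss and match_outcome; the ids, teams and margins are invented.

```r
library(dplyr)

# Toy stand-ins for two of the tables; all values are invented
match_toss <- tibble(
  id = c(1L, 2L),
  winner = c("Team A", "Team B"),
  decision = c("field", "bat")
)
match_outcome <- tibble(
  id = c(1L, 2L),
  winner = c("Team A", "Team A"),
  result = c("runs", "wickets"),
  result_margin = c(10L, 5L)
)

# Join on the match id; suffix disambiguates the two winner columns
toss_vs_result <- match_toss %>%
  inner_join(match_outcome, by = "id", suffix = c("_toss", "_match")) %>%
  mutate(toss_winner_won = winner_toss == winner_match)

toss_vs_result
```

Keeping the tables separate and joining on demand avoids repeating match-level fields in tables that have more than one row per match, such as match_teams and match_umpires.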

Continue to the second part. The complete code is available on GitHub.
