Analysing IPL matches using Cricsheet data – Part 2

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers].

This is the second article in a series analysing IPL cricket matches using data from Cricsheet. The first article in the series can be found here.

The first article showed how to load match-level information into five main tables. This article focuses on loading the ball-by-ball information. Since we are restricted to IPL matches, we are free to make certain assumptions. For example, the data holds at most two innings per match. Also, there are no matches in which penalty runs were awarded, and a bowler was replaced in the middle of an over too rarely to matter. The code therefore does not try to parse and load this information.

The format of the data is described on this page. YAML data has a tree-like structure. The R package yaml loads a YAML file and converts it into a deeply nested list. The user-defined function cricsheet_ipl_load_innings then converts this list into a set of tables, using a helper function process_delivery to extract all the information for a single delivery. The following tables, all linked via a common match id, are created:

  • match_innings – captures the innings number and the team playing that innings
  • match_deliveries – captures the innings number, over, ball, batsman, non-striker, bowler, runs attributed to the batsman, extras and total respectively, a flag to indicate if it was a non-boundary (so the total runs may be 4 but the non-boundary flag may be 1 indicating that the batsmen ran 4 runs), a flag to indicate if a wicket was taken in that delivery, the kind of wicket, the player who got out and the fielders who were involved, and finally the type of extras (in case there are extras in that delivery)
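To make the nested structure concrete, here is a minimal sketch of what one parsed delivery might look like. The list is built by hand rather than loaded from a file, and the player names and values are made up for illustration; the field names follow those used by the parsing code in this article.

```r
# Hand-built stand-in for one element of an innings' deliveries list.
# Names and values are illustrative, not taken from a real match.
delivery <- list(
  "0.1" = list(
    batsman = "Batsman A",
    bowler = "Bowler X",
    non_striker = "Batsman B",
    runs = list(batsman = 0L, extras = 1L, total = 1L),
    extras = list(wides = 1L)
  )
)

# The only name at the top level is the over.ball key
delivery_name <- names(delivery)
print(delivery_name)

# The fields sit one level below that key
details <- delivery[[delivery_name]]
print(details$bowler)
print(details$runs$total)

# Optional blocks such as wicket or extras are detected by name
has_wicket <- "wicket" %in% names(details)
extras_type <- if ("extras" %in% names(details)) {
  names(details$extras)
} else {
  NA_character_
}
print(has_wicket)
print(extras_type)
```

Because optional blocks are simply absent when they do not apply, checking names() with %in% is a safe way to decide whether a wicket or an extra occurred on a delivery.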

I have also assumed that an RStudio project has been set up and the input files are all stored in a folder called data inside the project. The code starts by loading the required packages and defining the functions described above.


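# yaml provides yaml.load_file; tidyverse provides map, tibble, bind_rows,
# bind_cols, str_extract and parse_integer
library(yaml)
library(tidyverse)

process_delivery <- function(delivery) {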
  delivery_name <- names(delivery)
  delivery_double <- as.double(delivery_name)
  delivery_over <- trunc(delivery_double) + 1
  # round() removes floating-point error left by the subtraction
  delivery_ball <- round((delivery_double - trunc(delivery_double)) * 10)
  delivery_batsman <- delivery[[delivery_name]]$batsman
  delivery_non_striker <- delivery[[delivery_name]]$non_striker
  delivery_bowler <- delivery[[delivery_name]]$bowler
  delivery_runs_batsman <- as.integer(delivery[[delivery_name]]$runs$batsman)
  delivery_runs_extras <- as.integer(delivery[[delivery_name]]$runs$extras)
  delivery_runs_total <- as.integer(delivery[[delivery_name]]$runs$total)
  delivery_runs_non_boundary <- ifelse("non_boundary" %in%
                                         names(delivery[[delivery_name]]$runs),
                                       1, 0)
  delivery_wicket <- ifelse("wicket" %in% names(delivery[[delivery_name]]),
                            1, 0)
  if (delivery_wicket == 1) {
    delivery_wicket_kind <- delivery[[delivery_name]]$wicket$kind
    delivery_wicket_player_out <- delivery[[delivery_name]]$wicket$player_out
    delivery_wicket_fielders <- ifelse("fielders" %in%
                                         names(delivery[[delivery_name]]$wicket),
                                       paste(delivery[[delivery_name]]$wicket$fielders,
                                             collapse = ","),
                                       NA)
  } else {
    delivery_wicket_kind <- NA
    delivery_wicket_player_out <- NA
    delivery_wicket_fielders <- NA
  }
  delivery_extras_type <- ifelse("extras" %in% names(delivery[[delivery_name]]),
                                 names(delivery[[delivery_name]]$extras),
                                 NA)
  return(list(delivery_over = delivery_over,
              delivery_ball = delivery_ball,
              delivery_batsman = delivery_batsman,
              delivery_non_striker = delivery_non_striker,
              delivery_bowler = delivery_bowler,
              delivery_runs_batsman = delivery_runs_batsman,
              delivery_runs_extras = delivery_runs_extras,
              delivery_runs_total = delivery_runs_total,
              delivery_runs_non_boundary = delivery_runs_non_boundary,
              delivery_wicket = delivery_wicket,
              delivery_wicket_kind = delivery_wicket_kind,
              delivery_wicket_player_out = delivery_wicket_player_out,
              delivery_wicket_fielders = delivery_wicket_fielders,
              delivery_extras_type = delivery_extras_type))
}
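
The over and ball arithmetic at the top of process_delivery can be checked in isolation. The sketch below uses a hypothetical helper, split_delivery_key, which is not part of the original code. It also wraps the ball calculation in round(), an addition made here on the assumption that whole ball numbers are wanted, since the subtraction can leave a small floating-point error.

```r
# Split an over.ball key such as "19.6" into a 1-based over and a ball number.
# split_delivery_key is a hypothetical name used only for this sketch.
split_delivery_key <- function(key) {
  value <- as.double(key)
  over <- trunc(value) + 1
  # round() is an addition for this sketch: it strips floating-point error
  ball <- round((value - trunc(value)) * 10)
  list(over = over, ball = ball)
}

first_ball <- split_delivery_key("0.1")
last_ball <- split_delivery_key("19.6")
print(first_ball)
print(last_ball)
```

The key counts overs from zero, so adding 1 gives the over number as it is normally quoted in cricket.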

cricsheet_ipl_load_innings <- function(input_file) {
  # Assign the match id based on the file name
  match_id <- str_extract(input_file, "[0-9]+")
  match_id <- parse_integer(match_id)
  # Load the input file
  input_data <- yaml.load_file(input_file)
  # Innings table
  innings <- input_data$innings
  number_of_innings <- length(input_data$innings)
  # Ignore absent_hurt, penalty_runs, declared
  i1 <- innings[[1]]$`1st innings`
  i2 <- NULL
  if (number_of_innings > 1) {
    i2 <- innings[[2]]$`2nd innings`
    teams <- c(i1$team, i2$team)
  } else {
    teams <- c(i1$team, NA)
  }
  match_innings <- tibble(
    id = rep(match_id, 2),
    innings_num = as.integer(c(1, 2)),
    innings_team = teams
  )
  # Deliveries table
  # Ignore replacements
  i1_deliveries <- i1$deliveries
  i1_delivery_list <- map(i1_deliveries, process_delivery)
  i1_deliveries <- bind_rows(i1_delivery_list)
  temp <- tibble(id = rep(match_id, nrow(i1_deliveries)),
                 innings_num = rep(1, nrow(i1_deliveries)))
  i1_deliveries <- bind_cols(temp, i1_deliveries)
  if (number_of_innings > 1) {
    i2_deliveries <- i2$deliveries
    i2_delivery_list <- map(i2_deliveries, process_delivery)
    i2_deliveries <- bind_rows(i2_delivery_list)
    temp <- tibble(id = rep(match_id, nrow(i2_deliveries)),
                   innings_num = rep(2, nrow(i2_deliveries)))
    i2_deliveries <- bind_cols(temp, i2_deliveries)
    match_deliveries <- bind_rows(i1_deliveries, i2_deliveries)
  } else {
    match_deliveries <- i1_deliveries
  }
  # Return a list of tables
  retlist <- list(match_innings = match_innings,
                  match_deliveries = match_deliveries)
  return(retlist)
}

Once the functions above are defined, loading the data is a matter of mapping cricsheet_ipl_load_innings over all the file names.

# Read all the IPL data
# pattern is a regular expression, not a glob
filenames <- list.files("data", pattern = "\\.yaml$", full.names = TRUE)
ipl_data <- map(filenames, cricsheet_ipl_load_innings)

The call to map returns a large list with one element per match, each of which stores the two tables described above. The following code combines them into two tables which hold the complete information.

# Store all the data as individual data frames
ret_table <- function(x, table) {
  # Extract one named table from a single match's list
  x[[table]]
}
temp <- map(ipl_data, ret_table, "match_innings")
match_innings <- bind_rows(temp)
temp <- map(ipl_data, ret_table, "match_deliveries")
match_deliveries <- bind_rows(temp)
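
The extract-then-stack pattern can be tried on a toy list without any input files. This is only a sketch: toy_data and pick_table are hypothetical names with made-up values, and base R's lapply and do.call(rbind, ...) stand in for map and bind_rows so that it runs without extra packages.

```r
# Toy stand-in for ipl_data: one list per match, each holding two tables.
# All values are made up for illustration.
toy_data <- list(
  list(match_innings = data.frame(id = 1L, innings_num = 1:2),
       match_deliveries = data.frame(id = 1L, runs_total = c(1L, 4L, 0L))),
  list(match_innings = data.frame(id = 2L, innings_num = 1:2),
       match_deliveries = data.frame(id = 2L, runs_total = c(6L, 2L)))
)

# Pull one named table out of every match, then stack the pieces
pick_table <- function(x, table) x[[table]]
innings_parts <- lapply(toy_data, pick_table, "match_innings")
all_innings <- do.call(rbind, innings_parts)
deliveries_parts <- lapply(toy_data, pick_table, "match_deliveries")
all_deliveries <- do.call(rbind, deliveries_parts)

print(nrow(all_innings))
print(nrow(all_deliveries))
```

Because every table carries the match id, the stacked tables can still be joined back to each other or to the match-level tables from the first article.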

# Clean up
rm(temp)

The complete code is available on GitHub.
