How to Analyze Ball-by-Ball Cricket Data in R (cricketdata)

[This article was first published on Blog - R Programming Books, and kindly contributed to R-bloggers].

Cricket analytics is no longer limited to season averages and simple leaderboards. With modern ball-by-ball datasets, we can quantify tempo, isolate phase-specific skills, evaluate matchups, and model outcomes under uncertainty. R is a strong environment for this work because it combines data wrangling, visualization, statistical modeling, and reproducible reporting in one place.

What you’ll learn in this post:
  • How cricket data is typically structured (match, innings, ball-by-ball)
  • How to engineer metrics for batting and bowling that respect cricket context
  • How to perform phase analysis (Powerplay / Middle / Death) and matchup analysis
  • How to build a baseline win probability model in R
  • How to extend the workflow for IPL insights and role-based evaluation
  • How to keep your analysis reproducible using Quarto/R Markdown

1) A Practical Workflow for Cricket Analytics in R

A professional cricket analytics workflow is easiest to maintain when you separate the work into layers: (1) data, (2) context, (3) features, (4) metrics, (5) models, and (6) communication. This structure reduces confusion and keeps analyses reproducible across tournaments and seasons.

| Layer | What you do | Typical outputs |
|---|---|---|
| Data | Load ball-by-ball + match metadata; standardize columns | Cleaned tables with stable IDs |
| Context | Add format, venue, innings state, chase information | Phase labels, required run rate, wickets in hand |
| Features | Create derived variables at ball and player level | Dots, boundaries, pressure flags, matchup summaries |
| Metrics | Aggregate in ways that reflect roles and phases | Role-aware leaderboards, split tables |
| Models | Predict outcomes or estimate player value | Win probability, outcome prediction, uncertainty |
| Communication | Publish results as charts, tables, dashboards, reports | Quarto/Markdown reports and consistent outputs |

2) Data Structures and Ingestion

Cricket data typically appears at three levels:

  • Match-level: teams, venue, toss, winner, margin, date
  • Innings-level: runs, wickets, overs, target, result context
  • Ball-by-ball: batter, bowler, runs, extras, wickets, over/ball index

Ball-by-ball data is the most valuable layer because it captures the decisions and the state transitions that drive outcomes. If you want phase metrics, win probability, or matchup analysis, ball-by-ball is the foundation.

2.1 Install and load packages

install.packages(c(
  "cricketdata", "dplyr", "tidyr", "stringr", "lubridate",
  "purrr", "ggplot2", "slider", "broom"
))

library(cricketdata)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(purrr)
library(ggplot2)
library(slider)
library(broom)
    

2.2 Keep your data model explicit

It helps to define (and document) the expected schema for your ball-by-ball table. At minimum, you want: match_id, innings, over, ball_in_over, batter, bowler, batter_runs, extras_runs, total_runs, and a wicket indicator such as is_wicket.

Practical rule: treat your ball-by-ball table as the single source of truth. Build everything else (tables, charts, model datasets) from it, not from hand-edited exports.
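
One way to keep the schema explicit is a small check that runs before any analysis. The sketch below is only an illustration: the helper name check_ball_schema and the toy rows are hypothetical, and the column names are the ones listed above.

```r
# Columns the ball-by-ball table is expected to carry (from the list above)
required_cols <- c(
  "match_id", "innings", "over", "ball_in_over", "batter", "bowler",
  "batter_runs", "extras_runs", "total_runs", "is_wicket"
)

# Hypothetical helper: stop early if any expected column is missing
check_ball_schema <- function(df, required = required_cols) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(df)
}

# Toy rows, invented purely for illustration
toy_balls <- data.frame(
  match_id = 1L, innings = 1L, over = 0L, ball_in_over = 1:3,
  batter = "A", bowler = "X",
  batter_runs = c(0L, 4L, 1L), extras_runs = 0L,
  total_runs = c(0L, 4L, 1L), is_wicket = 0L
)

check_ball_schema(toy_balls)   # passes silently when every column is present
```

Running a check like this at the top of every script makes a renamed or dropped column fail loudly instead of silently producing wrong metrics.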

3) Cleaning and Cricket-Specific Preprocessing

Cricket data cleaning is rarely about generic missingness. Most issues are cricket-specific: inconsistent player names, extras affecting “balls faced” and “balls bowled”, run-out attribution, and multiple encodings of dismissal types.

3.1 Standardize names and IDs

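# Normalise apostrophe variants and stray whitespace in player/team names
clean_name <- function(x){
  x %>%
    str_replace_all("[’`]", "'") %>%   # curly apostrophes and backticks become '
    str_squish()                        # trims both ends and collapses repeated spaces
}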

# Example usage:
# balls <- balls %>%
#   mutate(
#     batter = clean_name(batter),
#     bowler = clean_name(bowler),
#     non_striker = clean_name(non_striker),
#     team_batting = clean_name(team_batting),
#     team_bowling = clean_name(team_bowling)
#   )
    

3.2 Legal balls vs extras

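A common mistake is counting every row as a “ball” when computing batting strike rate or bowling strike rate. Wides and no-balls are not legal deliveries: neither counts toward the six balls of an over, so neither should reduce the balls remaining or count as a ball bowled. The one asymmetry is that a no-ball counts as a ball faced by the batter, while a wide does not. Datasets label extras differently, so a robust approach is to create an explicit legal_ball flag. The templates in this post use legal_ball throughout; build a separate batter-specific flag if you need official balls-faced counts.

# Template: adjust column names and extras labels to your dataset
# balls <- balls %>%
#   mutate(
#     total_runs  = batter_runs + extras_runs,
#     # wides and no-balls do not count toward the over
#     legal_ball  = if_else(extras_type %in% c("wides", "noballs"), 0L, 1L)
#   )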
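
To see why the flag matters, here is a minimal sketch on a toy over. The run values and the "wides"/"noballs" labels are assumptions for illustration, not taken from a real match.

```r
# Toy over containing one wide; values invented for illustration
toy <- data.frame(
  batter_runs = c(1, 0, 4, 0, 0, 2, 6),
  extras_type = c(NA, "wides", NA, NA, NA, NA, NA)
)

# NA %in% ... is FALSE, so ordinary deliveries get legal_ball = 1
toy$legal_ball <- ifelse(toy$extras_type %in% c("wides", "noballs"), 0L, 1L)

# Counting every row as a ball understates strike rate
sr_naive <- 100 * sum(toy$batter_runs) / nrow(toy)
sr_legal <- 100 * sum(toy$batter_runs) / sum(toy$legal_ball)

c(naive = sr_naive, legal = sr_legal)
```

The same 13 runs come off 7 rows but only 6 legal balls, so the naive figure is lower than the legal-ball figure.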
    

3.3 Wickets: be explicit about what counts

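Many analyses treat “bowler wickets” differently from total wickets. Run outs, for example, are not credited to the bowler, and neither are dismissals such as retired out or obstructing the field. You can create separate fields:

  • is_wicket: any wicket fell on the ball
  • is_bowler_wicket: wicket credited to the bowler (excludes run outs and other non-bowler dismissals)

# Labels below are examples; match them to your dataset's dismissal encoding
# non_bowler_kinds <- c("run out", "retired out", "obstructing the field")
# balls <- balls %>%
#   mutate(
#     is_wicket = as.integer(!is.na(dismissal_kind)),
#     is_bowler_wicket = as.integer(is_wicket == 1 & !dismissal_kind %in% non_bowler_kinds)
#   )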
    

4) Phase Labeling and Innings Context

“Phase-aware” analysis is one of the biggest upgrades you can make in limited-overs cricket. A batter who dominates the powerplay may not be a strong death-overs hitter; likewise, a death specialist bowler should not be judged by powerplay economy alone.

4.1 Phase labeling (T20 example)

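label_phase_t20 <- function(over){
  dplyr::case_when(
    over >= 0  & over < 6   ~ "Powerplay",   # overs are zero-based here (0-5)
    over >= 6  & over < 16  ~ "Middle",
    over >= 16 & over < 20  ~ "Death",
    TRUE                    ~ NA_character_
  )
}

# balls <- balls %>% mutate(phase = label_phase_t20(over))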
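
Phase boundaries are easy to get wrong by one over, so it is worth checking them on the edge cases. The sketch below uses a hypothetical base-R helper, phase_from_over, and assumes a zero-based over index (0 to 19); if your data numbers overs from 1, subtract 1 first.

```r
# Hypothetical helper: label T20 phases from a zero-based over index
phase_from_over <- function(over) {
  as.character(cut(
    over,
    breaks = c(0, 6, 16, 20),
    labels = c("Powerplay", "Middle", "Death"),
    right  = FALSE            # intervals are [0, 6), [6, 16), [16, 20)
  ))
}

# Check the overs on either side of each boundary
edge_overs  <- c(0, 5, 6, 15, 16, 19)
edge_phases <- phase_from_over(edge_overs)
data.frame(over = edge_overs, phase = edge_phases)

# Anything outside the innings is left as NA rather than mislabelled
phase_from_over(20)
```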

4.2 Chase context (innings 2 example)

To model win probability during a chase, you need game state features. A baseline set includes: runs needed, balls left, and wickets in hand. From these, you can compute required run rate.

# NOTE: adjust "max_balls" to format (e.g., 120 for T20, 300 for ODI)
# max_balls <- 120

# chase <- balls %>%
#   filter(innings == 2) %>%
#   group_by(match_id, innings) %>%
#   arrange(over, ball_in_over, .by_group = TRUE) %>%
#   mutate(
#     cum_runs = cumsum(total_runs),
#     cum_wkts = cumsum(is_wicket),
#     legal_balls = cumsum(legal_ball),
#     balls_left = pmax(max_balls - legal_balls, 0),
#     wkts_in_hand = 10 - cum_wkts,
#     runs_needed = pmax(target - cum_runs, 0),
#     req_rr = if_else(balls_left > 0, 6 * runs_needed / balls_left, NA_real_)
#   ) %>%
#   ungroup()
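
Because the template above needs real data, here is a minimal runnable sketch of the same chase-state logic on four invented deliveries. The target of 8 and every row value are assumptions chosen to keep the arithmetic easy to follow.

```r
library(dplyr)

max_balls <- 120   # T20
target    <- 8     # toy target, invented for illustration

toy_chase <- tibble(
  match_id = 1L, innings = 2L, over = 0L,
  ball_in_over = 1:4,
  total_runs = c(1, 1, 0, 4),   # the second row is a wide
  is_wicket  = c(0, 0, 1, 0),
  legal_ball = c(1, 0, 1, 1)
) %>%
  group_by(match_id, innings) %>%
  arrange(over, ball_in_over, .by_group = TRUE) %>%
  mutate(
    cum_runs     = cumsum(total_runs),
    balls_left   = pmax(max_balls - cumsum(legal_ball), 0),
    wkts_in_hand = 10 - cumsum(is_wicket),
    runs_needed  = pmax(target - cum_runs, 0),
    req_rr       = if_else(balls_left > 0, 6 * runs_needed / balls_left, NA_real_)
  ) %>%
  ungroup()

toy_chase
```

Note that the wide adds a run without reducing balls_left, which is exactly why the legal-ball flag has to feed the chase state.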
    

5) Batting Analytics: Metrics That Explain Style

Batting analysis becomes more informative when you separate “output” from “method.” Totals (runs) are output. Style shows up in dots, boundaries, rotation, and risk. Below are metrics that are both interpretable and useful.

5.1 Core batting metrics (phase-aware)

  • Strike rate (SR): runs per 100 legal balls faced
  • Dot-ball %: dots per legal balls faced
  • Boundary %: (4s + 6s) per legal balls faced
  • Singles/rotation rate: % balls with 1 run off the bat
  • Dismissal rate: outs per 100 legal balls faced
# batting_phase <- balls %>%
#   group_by(batter, phase) %>%
#   summarise(
#     balls_faced = sum(legal_ball),
#     runs = sum(batter_runs),
#     dots = sum(legal_ball == 1 & batter_runs == 0),
#     ones = sum(legal_ball == 1 & batter_runs == 1),
#     fours = sum(batter_runs == 4),
#     sixes = sum(batter_runs == 6),
#     outs = sum(is_wicket == 1 & player_dismissed == batter),
#     sr = 100 * runs / pmax(balls_faced, 1),
#     dot_pct = 100 * dots / pmax(balls_faced, 1),
#     boundary_pct = 100 * (fours + sixes) / pmax(balls_faced, 1),
#     rotation_pct = 100 * ones / pmax(balls_faced, 1),
#     out_rate = 100 * outs / pmax(balls_faced, 1),
#     .groups = "drop"
#   )
    

5.2 Intent vs risk (simple but powerful)

A practical comparison for T20 batters is a two-dimensional view: strike rate versus dismissal rate. You can do this by phase, and optionally add minimum sample thresholds (e.g., at least 100 legal balls in that phase).

# batting_filtered <- batting_phase %>% filter(balls_faced >= 100)

# ggplot(batting_filtered, aes(x = out_rate, y = sr)) +
#   geom_point() +
#   facet_wrap(~phase) +
#   labs(
#     x = "Dismissals per 100 balls",
#     y = "Strike rate",
#     title = "Intent vs Risk by Phase"
#   )
    
Interpretation tip: a player with high SR and low out rate is rare and typically elite. Players cluster by role: powerplay aggressors, middle-over stabilizers, and death-over finishers.

5.3 A “pressure” proxy you can compute quickly

Pressure is hard to define perfectly, but you can build useful proxies using innings state. One simple approach in a chase: treat pressure as higher when req_rr exceeds a threshold.

# chase <- chase %>%
#   mutate(pressure = as.integer(req_rr >= 10))  # example threshold
    

6) Bowling Analytics: Economy, Wickets, and Pressure

Bowling value is multi-dimensional. Economy tells you how well runs were contained, but wickets create discontinuities in the innings. Modern analysis usually studies both together, often by phase.

6.1 Core bowling metrics (phase-aware)

  • Economy: runs conceded per over (the template uses total runs; strictly, byes and leg byes are not charged to the bowler)
  • Bowling strike rate: legal balls per wicket (exclude run outs)
  • Dot-ball %: dot deliveries per legal balls
  • Boundary conceded %: % balls conceding 4 or 6

# bowling_phase <- balls %>%
#   group_by(bowler, phase) %>%
#   summarise(
#     balls = sum(legal_ball),
#     overs = balls / 6,
#     runs_conceded = sum(total_runs),
#     wickets = sum(is_bowler_wicket),
#     dots = sum(legal_ball == 1 & total_runs == 0),
#     boundaries = sum(batter_runs %in% c(4,6)),
#     econ = if_else(balls > 0, 6 * runs_conceded / balls, NA_real_),
#     bowl_sr = if_else(wickets > 0, balls / wickets, NA_real_),  # undefined without a wicket
#     dot_pct = 100 * dots / pmax(balls, 1),
#     boundary_pct = 100 * boundaries / pmax(balls, 1),
#     .groups="drop"
#   )
    

6.2 Death bowling: separating skill from exposure

Death overs are higher variance by nature: batters swing harder and boundaries are more frequent. To evaluate death bowlers fairly, compare them to phase baselines (league/season averages for the death phase). That helps you see whether a bowler is genuinely strong in the death or simply facing harsher conditions.

# phase_baseline <- balls %>%
#   group_by(phase) %>%
#   summarise(
#     baseline_econ = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
#     .groups="drop"
#   )

# bowling_adj <- bowling_phase %>%
#   left_join(phase_baseline, by = "phase") %>%
#   mutate(econ_above_baseline = econ - baseline_econ)
    

7) Matchups: Bowler vs Batter (and How Not to Overfit)

Matchups are popular because they feel actionable: “Does bowler A match up well against batter B?” The risk is that many matchups are based on small samples. The solution is to: (1) enforce minimum balls, (2) report uncertainty, and (3) consider shrinkage if you operationalize results.

7.1 Simple matchup table

# matchups <- balls %>%
#   group_by(bowler, batter) %>%
#   summarise(
#     balls = sum(legal_ball),
#     runs = sum(batter_runs),
#     outs = sum(is_wicket == 1 & player_dismissed == batter),
#     sr = 100 * runs / pmax(balls, 1),
#     out_rate = 100 * outs / pmax(balls, 1),
#     .groups="drop"
#   ) %>%
#   filter(balls >= 30) %>%
#   arrange(desc(out_rate))
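
The shrinkage idea mentioned at the start of this section can be sketched in a few lines. This is a minimal illustration, not a tuned method: prior_rate and prior_balls are assumed values that you would estimate from league data, and shrink_rate is a hypothetical helper. It is the posterior mean under a Beta prior with alpha = prior_rate * prior_balls and beta = (1 - prior_rate) * prior_balls.

```r
# Shrink a matchup dismissal rate toward a league-wide prior
shrink_rate <- function(outs, balls, prior_rate, prior_balls) {
  (outs + prior_rate * prior_balls) / (balls + prior_balls)
}

prior_rate  <- 0.05   # assumed league dismissal rate per ball
prior_balls <- 60     # assumed prior strength, in pseudo-balls

# Two matchups with the same raw rate (0.25) but very different sample sizes
small_sample <- shrink_rate(outs = 3,  balls = 12,  prior_rate, prior_balls)
large_sample <- shrink_rate(outs = 30, balls = 120, prior_rate, prior_balls)

c(small = small_sample, large = large_sample)
```

The 12-ball matchup is pulled most of the way back to the league rate, while the 120-ball matchup keeps more of its own signal.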
    

7.2 Add confidence intervals (quick approximation)

As a lightweight option, treat out events as binomial and compute approximate intervals for out rate. This is not perfect, but it is better than treating a 2-out sample the same as a 20-out sample.

# matchups_ci <- matchups %>%
#   mutate(
#     p = outs / pmax(balls, 1),
#     se = sqrt(p * (1 - p) / pmax(balls, 1)),
#     lo = pmax(p - 1.96 * se, 0),
#     hi = pmin(p + 1.96 * se, 1),
#     out_rate_lo = 100 * lo,
#     out_rate_hi = 100 * hi
#   )
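
One weakness of the normal approximation above is that it collapses to a zero-width interval when a matchup has no dismissals (p = 0 gives se = 0). A Wilson score interval is a common alternative that behaves better for small counts. The helper name wilson_ci is hypothetical; the formula is the standard Wilson interval.

```r
# Wilson score interval for a binomial proportion
wilson_ci <- function(x, n, z = 1.96) {
  p      <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lo = centre - half, hi = centre + half)
}

ci_zero <- wilson_ci(0, 30)   # no dismissals in 30 balls
ci_two  <- wilson_ci(2, 30)   # two dismissals in 30 balls

100 * rbind(zero_outs = ci_zero, two_outs = ci_two)   # per 100 balls
```

With zero dismissals the upper bound stays well above zero, which is a more honest summary of 30 balls than an interval of exactly [0, 0].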
    

8) Visualizations That Make Sense to Cricket Fans

The best cricket charts are those that map directly to the mental model of the game. Here are a few workhorses:

  • Run rate by over to reveal acceleration and collapse patterns
  • Worm charts (cumulative runs) to compare innings trajectories
  • Wicket timeline to explain how innings shape changes
  • Phase leaderboards to compare roles (powerplay vs death)

8.1 Run rate by over

# over_summary <- balls %>%
#   group_by(match_id, innings, over) %>%
#   summarise(
#     runs = sum(total_runs),
#     legal_balls = sum(legal_ball),
#     .groups="drop"
#   ) %>%
#   mutate(rr = 6 * runs / pmax(legal_balls, 1))

# ggplot(over_summary, aes(x = over, y = rr, group = match_id)) +
#   geom_line(alpha = 0.4) +   # one line per match; without the group, lines zigzag across matches
#   facet_wrap(~innings) +
#   labs(x="Over", y="Run rate", title="Run Rate by Over")
    

8.2 Worm chart (cumulative runs)

# worm <- balls %>%
#   group_by(match_id, innings) %>%
#   arrange(over, ball_in_over, .by_group = TRUE) %>%
#   mutate(cum_runs = cumsum(total_runs),
#          legal_balls = cumsum(legal_ball)) %>%
#   ungroup()

# ggplot(worm, aes(x = legal_balls, y = cum_runs, group = innings)) +
#   geom_line() +
#   facet_wrap(~match_id) +
#   labs(x="Legal balls", y="Cumulative runs", title="Worm Chart")
    

9) Win Probability Modeling in R (Baseline + Upgrades)

Win probability models answer a common fan and analyst question: “Given the current state, how likely is the chasing team to win?” A simple and surprisingly effective baseline uses a logistic regression on chase state variables.

9.1 Baseline logistic regression

# wp_data <- chase %>%
#   filter(balls_left > 0) %>%
#   mutate(
#     win = as.integer(chasing_team_won)  # adapt to your encoding
#   ) %>%
#   select(win, runs_needed, balls_left, wkts_in_hand, req_rr) %>%
#   filter(is.finite(req_rr))

# wp_model <- glm(
#   win ~ runs_needed + balls_left + wkts_in_hand + req_rr,
#   data = wp_data,
#   family = binomial()
# )

# wp_data$win_prob <- predict(wp_model, newdata = wp_data, type = "response")
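
To try the baseline without real data, the sketch below fits the same kind of model on simulated chase states. Everything here is an assumption for illustration: the coefficients used to generate outcomes are invented, and the simulated rows are independent, whereas real ball-by-ball rows from the same match all share one result.

```r
set.seed(1)

# Simulated chase states; generating coefficients are invented for illustration
n <- 2000
sim <- data.frame(
  balls_left   = sample(1:120, n, replace = TRUE),
  wkts_in_hand = sample(1:10,  n, replace = TRUE)
)
sim$runs_needed <- pmax(round(sim$balls_left * runif(n, 0.8, 2.2)), 1)
sim$req_rr      <- 6 * sim$runs_needed / sim$balls_left

true_logit <- 2.5 - 0.45 * sim$req_rr + 0.35 * sim$wkts_in_hand
sim$win    <- rbinom(n, 1, plogis(true_logit))

# Same model family as the template, with a reduced feature set
fit <- glm(win ~ req_rr + wkts_in_hand, data = sim, family = binomial())

# Predicted win probability for two hypothetical states
new_states <- data.frame(req_rr = c(7, 12), wkts_in_hand = c(8, 3))
win_prob   <- predict(fit, newdata = new_states, type = "response")
win_prob
```

A comfortable chase (required rate 7 with 8 wickets in hand) should come out far more likely to succeed than a steep one (required rate 12 with 3 wickets in hand); if your fitted model disagrees on real data, check the outcome encoding first.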
    

9.2 Evaluate your model (don’t skip this)

If you publish probabilities, calibration matters. At minimum, track:

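  • Log loss (probability quality; punishes confident misses)
  • Brier score (mean squared error of the predicted probabilities)
  • Time-based splits (train on earlier seasons, test on later)

# In-sample Brier score shown for brevity; compute it on held-out matches in practice
# brier <- mean((wp_data$win_prob - wp_data$win)^2, na.rm = TRUE)
# brier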
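
Log loss is listed above but not shown, so here is a minimal sketch of both metrics as small functions, checked on five invented predictions. The function names are hypothetical, and the season column in the closing comment is an assumption about your data.

```r
# Log loss for binary outcomes; eps guards against log(0)
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Brier score: mean squared error of the predicted probabilities
brier_score <- function(y, p) mean((p - y)^2)

# Five invented outcomes with two sets of predictions
y       <- c(1, 0, 1, 1, 0)
p_sharp <- c(0.9, 0.1, 0.8, 0.7, 0.2)   # confident and mostly right
p_flat  <- rep(0.5, 5)                  # always says 50/50

rbind(
  sharp = c(log_loss = log_loss(y, p_sharp), brier = brier_score(y, p_sharp)),
  flat  = c(log_loss = log_loss(y, p_flat),  brier = brier_score(y, p_flat))
)

# Time-based split sketch (season is a hypothetical column):
# train <- wp_data[wp_data$season <  2023, ]
# test  <- wp_data[wp_data$season >= 2023, ]
```

A model that always predicts 0.5 scores log(2) on log loss and 0.25 on Brier, so those two numbers are handy reference points for any win probability model.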
    

9.3 Practical upgrades

  • Non-linearity: gradient boosting or splines for runs_needed × balls_left effects
  • Venue priors: include ground scoring tendencies
  • Team strength: add pre-match estimates as a prior
  • Calibration: apply isotonic regression or Platt scaling
Recommendation: start with the baseline, validate it, then upgrade one dimension at a time. Most real improvements come from better features and better evaluation, not from a fancier algorithm.
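
As one example of a single-dimension upgrade, Platt scaling can be sketched with nothing more than glm. The data below is simulated and the "overconfident model" is an assumption built for the demonstration; in practice you would fit the calibration step on held-out matches rather than on the training data.

```r
set.seed(7)

# Simulated truth and outcomes
n      <- 5000
p_true <- runif(n, 0.05, 0.95)
y      <- rbinom(n, 1, p_true)

# An overconfident model: doubles the log-odds, pushing predictions toward 0 or 1
p_raw <- plogis(2 * qlogis(p_true))

# Platt scaling: logistic regression of the outcome on the raw log-odds
platt <- glm(y ~ qlogis(p_raw), family = binomial())
p_cal <- fitted(platt)

brier <- function(y, p) mean((p - y)^2)
c(raw = brier(y, p_raw), calibrated = brier(y, p_cal))

# The fitted slope should land near 0.5, undoing the doubling
coef(platt)
```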

10) IPL Insights: Roles, Venues, and Player Value

The IPL is ideal for analytics because it combines diverse conditions with frequent high-pressure situations and specialized roles. Instead of asking “Who scored the most runs?”, a more IPL-relevant question is “Who performs a role efficiently?”

10.1 Role-based leaderboards

One of the most useful patterns is to build phase-based leaderboards for: powerplay aggressors, middle-over stabilizers, and death-over finishers. The same idea applies to bowlers (powerplay specialists, middle controllers, death defenders).

# Finishers (Death overs, minimum sample)
# finishers <- batting_phase %>%
#   filter(phase == "Death", balls_faced >= 100) %>%
#   arrange(desc(sr))

# Powerplay bowlers (Powerplay, minimum sample)
# pp_bowlers <- bowling_phase %>%
#   filter(phase == "Powerplay", balls >= 120) %>%
#   arrange(econ)
    

10.2 Venue adjustment (separating skill from conditions)

Some venues inflate scoring; others suppress it. A simple adjustment is to compute a venue baseline run rate and then measure player performance relative to that baseline.

# venue_rr <- balls %>%
#   group_by(venue) %>%
#   summarise(
#     venue_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
#     .groups="drop"
#   )

# batter_venue <- balls %>%
#   group_by(batter, venue) %>%
#   summarise(
#     batter_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
#     balls = sum(legal_ball),
#     .groups="drop"
#   ) %>%
#   left_join(venue_rr, by="venue") %>%
#   mutate(adj_run_rate = batter_run_rate - venue_run_rate)
    

10.3 Player “value” as expected contribution

If you want to move toward value modeling, a practical approach is to estimate expected runs per ball (batting) and expected runs conceded per ball (bowling) in context (phase, venue, matchup). You can then compare players under similar conditions.
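
A minimal sketch of the batting side of this idea follows. It uses phase as the only context variable and six invented deliveries, so treat it as the shape of the calculation rather than a usable rating; a real version would add venue, matchup, and far more data.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  batter      = c("A", "A", "A", "B", "B", "B"),
  phase       = c("Death", "Death", "Middle", "Death", "Middle", "Middle"),
  batter_runs = c(6, 4, 1, 1, 1, 0)
)

# Expected runs per ball in each context (here, phase only)
context_avg <- toy %>%
  group_by(phase) %>%
  summarise(exp_runs = mean(batter_runs), .groups = "drop")

# Runs above expectation: actual runs minus the context average, summed per batter
value <- toy %>%
  left_join(context_avg, by = "phase") %>%
  group_by(batter) %>%
  summarise(
    balls          = n(),
    runs_above_exp = sum(batter_runs - exp_runs),
    .groups = "drop"
  )

value
```

Because the expectation is computed from the same balls, the values sum to zero across batters; a positive number means the batter scored faster than the average player facing the same mix of phases.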

11) Reproducible Reporting in R (Quarto / R Markdown)

Reproducibility is a competitive advantage in analytics. It ensures your results can be refreshed with new matches, audited, and reused. Quarto (or R Markdown) lets you publish analysis as a single document that includes narrative, code, and output.

11.1 A clean project structure

cricket-analytics/
  data/
    raw/
    cleaned/
  R/
    cleaning.R
    phases.R
    metrics.R
    plots.R
  reports/
    weekly-report.qmd
    match-preview.qmd
  models/
  output/
  README.md
    

11.2 A minimal Quarto report pattern

# weekly-report.qmd
# ---
# title: "Weekly Cricket Analytics Report"
# format: html
# ---

# ```{r}
# source("R/metrics.R")
# balls <- readRDS("data/cleaned/balls.rds")
# leaderboard <- make_batting_leaderboard(balls)
# leaderboard
# ```
    
Best practice: keep reusable logic in R/ functions and call them from reports. That prevents copy-paste drift and keeps updates consistent.
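
The report pattern above calls make_batting_leaderboard() without showing it. One possible shape for that function in R/metrics.R is sketched below; the body, the min_balls argument, and the toy data are all assumptions, since the original only names the function.

```r
# R/metrics.R (sketch)
library(dplyr)

# Hypothetical implementation of the helper called in the report
make_batting_leaderboard <- function(balls, min_balls = 100) {
  balls %>%
    group_by(batter) %>%
    summarise(
      balls_faced = sum(legal_ball),
      runs        = sum(batter_runs),
      .groups = "drop"
    ) %>%
    filter(balls_faced >= min_balls) %>%       # minimum sample threshold
    mutate(sr = 100 * runs / balls_faced) %>%
    arrange(desc(sr))
}

# Toy usage with a tiny threshold, invented for illustration
toy <- tibble(
  batter      = c("A", "A", "B", "B", "C"),
  batter_runs = c(4, 6, 1, 0, 6),
  legal_ball  = 1L
)
leaderboard <- make_batting_leaderboard(toy, min_balls = 2)
leaderboard
```

Keeping the threshold as an argument lets the weekly report and a one-off match preview share the same function with different sample requirements.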

12) FAQ

Is R a good choice for cricket analytics?

Yes. R is strong for cricket analytics because it supports tidy data workflows, fast iteration, high-quality visualization, and a wide range of statistical and machine learning models. It’s especially effective when you need reproducible reporting.

What’s more important: match summaries or ball-by-ball data?

Match summaries help with quick comparisons, but ball-by-ball data enables deeper questions: phase analysis, matchup evaluation, pressure modeling, and win probability estimation.

How do I avoid misleading player comparisons?

Use minimum sample thresholds, compare within roles/phases, and consider uncertainty. Adjust for venue and era effects when comparing across seasons. When you turn results into decisions, consider shrinkage or Bayesian approaches.

How should I start if I’m new?

Start with one format (often T20), build a clean ball-by-ball table, compute phase leaderboards for batting and bowling, then add one modeling task (like chase win probability). Keep everything reproducible from the beginning.

13) Next Steps

If you want a complete, structured, hands-on learning path, with end-to-end workflows, practical examples, and a clear progression from data ingestion to modeling, check out Cricket Analytics with R.

The post How to Analyze Ball-by-Ball Cricket Data in R (cricketdata) appeared first on R Programming Books.
