How to Analyze Ball-by-Ball Cricket Data in R (cricketdata)
Cricket analytics is no longer limited to season averages and simple leaderboards. With modern ball-by-ball datasets, we can quantify tempo, isolate phase-specific skills, evaluate matchups, and model outcomes under uncertainty. R is a strong environment for this work because it combines data wrangling, visualization, statistical modeling, and reproducible reporting in one place.
This guide covers:
- How cricket data is typically structured (match, innings, ball-by-ball)
- How to engineer metrics for batting and bowling that respect cricket context
- How to perform phase analysis (Powerplay / Middle / Death) and matchup analysis
- How to build a baseline win probability model in R
- How to extend the workflow for IPL insights and role-based evaluation
- How to keep your analysis reproducible using Quarto/R Markdown
1) A Practical Workflow for Cricket Analytics in R
A professional cricket analytics workflow is easiest to maintain when you separate the work into layers: (1) data, (2) context, (3) features, (4) metrics, (5) models, and (6) communication. This structure reduces confusion and keeps analyses reproducible across tournaments and seasons.
| Layer | What you do | Typical outputs |
|---|---|---|
| Data | Load ball-by-ball + match metadata; standardize columns | Cleaned tables with stable IDs |
| Context | Add format, venue, innings state, chase information | Phase labels, required run rate, wickets in hand |
| Features | Create derived variables at ball and player level | Dots, boundaries, pressure flags, matchup summaries |
| Metrics | Aggregate in ways that reflect roles and phases | Role-aware leaderboards, split tables |
| Models | Predict outcomes or estimate player value | Win probability, outcome prediction, uncertainty |
| Communication | Publish results as charts, tables, dashboards, reports | Quarto/Markdown reports and consistent outputs |
2) Data Structures and Ingestion
Cricket data typically appears at three levels:
- Match-level: teams, venue, toss, winner, margin, date
- Innings-level: runs, wickets, overs, target, result context
- Ball-by-ball: batter, bowler, runs, extras, wickets, over/ball index
Ball-by-ball data is the most valuable layer because it captures the decisions and the state transitions that drive outcomes. If you want phase metrics, win probability, or matchup analysis, ball-by-ball is the foundation.
2.1 Install and load packages
install.packages(c(
"cricketdata", "dplyr", "tidyr", "stringr", "lubridate",
"purrr", "ggplot2", "slider", "broom"
))
library(cricketdata)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(purrr)
library(ggplot2)
library(slider)
library(broom)
2.2 Keep your data model explicit
It helps to define (and document) the expected schema for your ball-by-ball table. At minimum, you want:
match_id, innings, over, ball_in_over,
batter, bowler, batter_runs, extras_runs,
total_runs, and a wicket indicator such as is_wicket.
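One way to keep that schema explicit is a small check that fails early when a column is missing. The sketch below uses base R only; the toy values and the helper name `check_ball_schema` are illustrative assumptions, not part of cricketdata.

```r
# Toy ball-by-ball table with the minimum columns listed above (values are made up).
balls_toy <- data.frame(
  match_id     = c(1L, 1L, 1L, 1L),
  innings      = c(1L, 1L, 1L, 1L),
  over         = c(0L, 0L, 0L, 0L),
  ball_in_over = c(1L, 2L, 3L, 4L),
  batter       = c("A", "A", "B", "B"),
  bowler       = c("X", "X", "X", "X"),
  batter_runs  = c(0L, 4L, 1L, 0L),
  extras_runs  = c(0L, 0L, 0L, 1L),
  total_runs   = c(0L, 4L, 1L, 1L),
  is_wicket    = c(0L, 0L, 0L, 0L)
)

required_cols <- c(
  "match_id", "innings", "over", "ball_in_over",
  "batter", "bowler", "batter_runs", "extras_runs",
  "total_runs", "is_wicket"
)

# Hypothetical helper: stop if the table does not match the expected schema.
check_ball_schema <- function(df, required = required_cols) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (any(df$total_runs != df$batter_runs + df$extras_runs)) {
    stop("total_runs does not equal batter_runs + extras_runs on some rows")
  }
  invisible(df)
}

check_ball_schema(balls_toy)
```

Running the check at the top of every script means a renamed or dropped column surfaces immediately rather than as a confusing error deep inside a metric calculation.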
3) Cleaning and Cricket-Specific Preprocessing
Cricket data cleaning is rarely about generic missingness. Most issues are cricket-specific: inconsistent player names, extras affecting “balls faced” and “balls bowled”, run-out attribution, and multiple encodings of dismissal types.
3.1 Standardize names and IDs
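clean_name <- function(x){
  x %>%
    # normalize curly apostrophes and backticks to a plain apostrophe
    str_replace_all("[’`]", "'") %>%
    # str_squish() collapses internal whitespace and trims both ends
    str_squish()
}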
# Example usage:
# balls <- balls %>%
# mutate(
# batter = clean_name(batter),
# bowler = clean_name(bowler),
# non_striker = clean_name(non_striker),
# team_batting = clean_name(team_batting),
# team_bowling = clean_name(team_bowling)
# )
3.2 Legal balls vs extras
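A common mistake is to treat every row as a “ball” when computing strike rate or bowling strike rate. Wides and no-balls are not legal deliveries: neither counts toward the six balls of the over, although a no-ball is conventionally counted as a ball faced by the batter while a wide is not. Encodings also differ between datasets, so a robust approach is to create an explicit legal_ball flag (and, if you need exact balls faced, a separate flag that excludes only wides).
# Template: adjust column names and extras_type labels to your dataset
# balls <- balls %>%
# mutate(
# total_runs = batter_runs + extras_runs,
# # wides and no-balls do not count toward the over
# legal_ball = if_else(extras_type %in% c("wides", "noballs"), 0L, 1L)
# )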
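As a worked example of why the flag matters, here is a single toy over that contains one wide and one no-ball. The `extras_type` labels ("wides", "noballs") and the extra `ball_faced` column are assumptions for illustration; check how your own source encodes them.

```r
library(dplyr)

# One over with a wide and a no-ball: 8 rows, but only 6 legal deliveries.
over_toy <- tibble(
  ball_seq    = 1:8,
  batter_runs = c(1L, 0L, 0L, 4L, 0L, 2L, 0L, 1L),
  extras_runs = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L),
  extras_type = c(NA, "wides", NA, "noballs", NA, NA, NA, NA)
)

over_toy <- over_toy %>%
  mutate(
    total_runs = batter_runs + extras_runs,
    # legal_ball: counts toward the over (excludes wides and no-balls)
    legal_ball = if_else(extras_type %in% c("wides", "noballs"), 0L, 1L),
    # ball_faced: counts for the batter (excludes wides only)
    ball_faced = if_else(extras_type %in% "wides", 0L, 1L)
  )

over_summary <- over_toy %>%
  summarise(
    rows        = n(),
    legal_balls = sum(legal_ball),
    balls_faced = sum(ball_faced),
    runs        = sum(total_runs)
  )

over_summary
```

Counting rows would give 8 "balls" for this over, which would understate both the run rate and the batter's strike rate.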
3.3 Wickets: be explicit about what counts
Many analyses treat “bowler wickets” differently than total wickets. For example, run outs are not credited to the bowler. You can create separate fields:
- is_wicket: any wicket fell on the ball
- is_bowler_wicket: wicket credited to bowler (exclude run outs)
# balls <- balls %>%
# mutate(
# is_wicket = as.integer(!is.na(dismissal_kind)),
# is_bowler_wicket = as.integer(is_wicket == 1 & dismissal_kind != "run out")
# )
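A small toy example makes the difference between the two wicket fields visible. The dismissal labels below, including the list of kinds not credited to the bowler, are assumed encodings; adapt them to your dataset.

```r
library(dplyr)

wk_toy <- tibble(
  bowler         = c("X", "X", "X", "Y", "Y"),
  dismissal_kind = c(NA, "bowled", "run out", "caught", NA)
)

# Assumed labels for dismissals that are not credited to the bowler.
non_bowler_kinds <- c("run out", "retired hurt", "obstructing the field")

wk_toy <- wk_toy %>%
  mutate(
    is_wicket        = as.integer(!is.na(dismissal_kind)),
    is_bowler_wicket = as.integer(is_wicket == 1 & !dismissal_kind %in% non_bowler_kinds)
  )

wk_summary <- wk_toy %>%
  group_by(bowler) %>%
  summarise(
    wickets_fell   = sum(is_wicket),
    bowler_wickets = sum(is_bowler_wicket),
    .groups = "drop"
  )

wk_summary
```

Bowler X sees two wickets fall but is credited with only one, because the run out is excluded.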
4) Phase Labeling and Innings Context
“Phase-aware” analysis is one of the biggest upgrades you can make in limited-overs cricket. A batter who dominates the powerplay may not be a strong death-overs hitter; likewise, a death specialist bowler should not be judged by powerplay economy alone.
4.1 Phase labeling (T20 example)
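label_phase_t20 <- function(over){
  # Assumes 0-indexed overs (0-19), matching the 0-based bounds below
  dplyr::case_when(
    over >= 0 & over < 6 ~ "Powerplay",
    over >= 6 & over < 16 ~ "Middle",
    over >= 16 & over < 20 ~ "Death",
    TRUE ~ NA_character_
  )
}
# Example usage:
# balls <- balls %>% mutate(phase = label_phase_t20(over))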
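If you work across formats, the same idea can be sketched as a single helper with configurable boundaries. The function name `label_phase`, the 0-indexed over convention, and the ODI split shown here are assumptions; use whatever phase definitions your analysis calls for.

```r
# Hypothetical helper: label phases from 0-indexed over numbers.
# `breaks` are left-closed boundaries; the defaults mirror a T20 split
# of overs 0-5 (Powerplay), 6-15 (Middle), and 16-19 (Death).
label_phase <- function(over,
                        breaks = c(0, 6, 16, 20),
                        labels = c("Powerplay", "Middle", "Death")) {
  as.character(cut(over, breaks = breaks, labels = labels, right = FALSE))
}

t20_phases <- label_phase(0:19)
table(t20_phases)

# Assumed ODI split (overs 0-9, 10-39, 40-49); adjust to your own definition.
odi_phases <- label_phase(0:49, breaks = c(0, 10, 40, 50))
table(odi_phases)
```

Overs outside the supplied boundaries come back as `NA`, which makes indexing mistakes (for example 1-indexed overs) easy to spot.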
4.2 Chase context (innings 2 example)
To model win probability during a chase, you need game state features. A baseline set includes: runs needed, balls left, and wickets in hand. From these, you can compute required run rate.
# NOTE: adjust "max_balls" to format (e.g., 120 for T20, 300 for ODI)
# max_balls <- 120
# chase <- balls %>%
# filter(innings == 2) %>%
# group_by(match_id, innings) %>%
# arrange(over, ball_in_over, .by_group = TRUE) %>%
# mutate(
# cum_runs = cumsum(total_runs),
# cum_wkts = cumsum(is_wicket),
# legal_balls = cumsum(legal_ball),
# balls_left = pmax(max_balls - legal_balls, 0),
# wkts_in_hand = 10 - cum_wkts,
# runs_needed = pmax(target - cum_runs, 0),
# req_rr = if_else(balls_left > 0, 6 * runs_needed / balls_left, NA_real_)
# ) %>%
# ungroup()
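To see the chase-state arithmetic on concrete numbers, here is a runnable toy version of the template above: four deliveries of a T20 chase with an assumed target of 160, where the second row is a wide.

```r
library(dplyr)

max_balls <- 120  # T20

chase_toy <- tibble(
  match_id     = 1L,
  innings      = 2L,
  target       = 160L,            # assumed target for illustration
  over         = c(0L, 0L, 0L, 0L),
  ball_in_over = 1:4,
  total_runs   = c(1L, 1L, 4L, 0L),
  legal_ball   = c(1L, 0L, 1L, 1L),  # second row is a wide
  is_wicket    = c(0L, 0L, 0L, 1L)
)

chase_state <- chase_toy %>%
  group_by(match_id, innings) %>%
  arrange(over, ball_in_over, .by_group = TRUE) %>%
  mutate(
    cum_runs     = cumsum(total_runs),
    cum_wkts     = cumsum(is_wicket),
    legal_balls  = cumsum(legal_ball),
    balls_left   = pmax(max_balls - legal_balls, 0),
    wkts_in_hand = 10 - cum_wkts,
    runs_needed  = pmax(target - cum_runs, 0),
    req_rr       = if_else(balls_left > 0, 6 * runs_needed / balls_left, NA_real_)
  ) %>%
  ungroup()

chase_state %>% select(cum_runs, balls_left, wkts_in_hand, runs_needed, req_rr)
```

After the fourth row the chasing side needs 154 from 117 balls with 9 wickets in hand; note that the wide adds a run without using up a ball.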
5) Batting Analytics: Metrics That Explain Style
Batting analysis becomes more informative when you separate “output” from “method.” Totals (runs) are output. Style shows up in dots, boundaries, rotation, and risk. Below are metrics that are both interpretable and useful.
5.1 Core batting metrics (phase-aware)
- Strike rate (SR): runs per 100 legal balls faced
- Dot-ball %: dots per legal balls faced
- Boundary %: (4s + 6s) per legal balls faced
- Singles/rotation rate: % balls with 1 run off the bat
- Dismissal rate: outs per 100 legal balls faced
# batting_phase <- balls %>%
# group_by(batter, phase) %>%
# summarise(
# balls_faced = sum(legal_ball),
# runs = sum(batter_runs),
# dots = sum(legal_ball == 1 & batter_runs == 0),
# ones = sum(legal_ball == 1 & batter_runs == 1),
# fours = sum(batter_runs == 4),
# sixes = sum(batter_runs == 6),
# outs = sum(is_wicket == 1 & player_dismissed == batter),
# sr = 100 * runs / pmax(balls_faced, 1),
# dot_pct = 100 * dots / pmax(balls_faced, 1),
# boundary_pct = 100 * (fours + sixes) / pmax(balls_faced, 1),
# rotation_pct = 100 * ones / pmax(balls_faced, 1),
# out_rate = 100 * outs / pmax(balls_faced, 1),
# .groups = "drop"
# )
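The same summary can be run end to end on a tiny made-up sample, which is a handy way to check the arithmetic before pointing it at real data. The players, phase, and values below are invented for illustration.

```r
library(dplyr)

bat_toy <- tibble(
  batter           = c(rep("A", 6), rep("B", 4)),
  phase            = "Death",
  batter_runs      = c(0L, 4L, 1L, 6L, 0L, 1L, 0L, 0L, 1L, 0L),
  legal_ball       = c(1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L),  # A's 5th row is a wide
  is_wicket        = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L),
  player_dismissed = c(rep(NA_character_, 9), "B")
)

bat_summary <- bat_toy %>%
  group_by(batter, phase) %>%
  summarise(
    balls_faced = sum(legal_ball),
    runs        = sum(batter_runs),
    dots        = sum(legal_ball == 1 & batter_runs == 0),
    outs        = sum(is_wicket == 1 & player_dismissed == batter),
    sr          = 100 * runs / pmax(balls_faced, 1),
    dot_pct     = 100 * dots / pmax(balls_faced, 1),
    .groups = "drop"
  )

bat_summary
```

Batter A scores 12 from 5 legal balls (strike rate 240, one dot), while batter B scores 1 from 4 with three dots and a dismissal; the wide is excluded from both balls faced and dots.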
5.2 Intent vs risk (simple but powerful)
A practical comparison for T20 batters is a two-dimensional view: strike rate versus dismissal rate. You can do this by phase, and optionally add minimum sample thresholds (e.g., at least 100 legal balls in that phase).
# batting_filtered <- batting_phase %>% filter(balls_faced >= 100)
# ggplot(batting_filtered, aes(x = out_rate, y = sr)) +
# geom_point() +
# facet_wrap(~phase) +
# labs(
# x = "Dismissals per 100 balls",
# y = "Strike rate",
# title = "Intent vs Risk by Phase"
# )
5.3 A “pressure” proxy you can compute quickly
Pressure is hard to define perfectly, but you can build useful proxies using innings state.
One simple approach in a chase: treat pressure as higher when req_rr exceeds a threshold.
# chase <- chase %>%
# mutate(pressure = as.integer(req_rr >= 10)) # example threshold
6) Bowling Analytics: Economy, Wickets, and Pressure
Bowling value is multi-dimensional. Economy tells you how well runs were contained, but wickets create discontinuities in the innings. Modern analysis usually studies both together, often by phase.
6.1 Core bowling metrics (phase-aware)
- Economy: runs conceded per over (total runs are used here for simplicity; official bowling figures exclude byes and leg byes)
- Bowling strike rate: legal balls per wicket (exclude run outs)
- Dot-ball %: dot deliveries per legal balls
- Boundary conceded %: % balls conceding 4 or 6
# bowling_phase <- balls %>%
# group_by(bowler, phase) %>%
# summarise(
# balls = sum(legal_ball),
# overs = balls / 6,
# runs_conceded = sum(total_runs),
# wickets = sum(is_bowler_wicket),
# dots = sum(legal_ball == 1 & total_runs == 0),
# boundaries = sum(batter_runs %in% c(4,6)),
# econ = runs_conceded / pmax(overs, 0.1),
# bowl_sr = if_else(wickets > 0, balls / wickets, NA_real_), # undefined without a wicket
# dot_pct = 100 * dots / pmax(balls, 1),
# boundary_pct = 100 * boundaries / pmax(balls, 1),
# .groups="drop"
# )
6.2 Death bowling: separating skill from exposure
Death overs are higher variance by nature: batters swing harder and boundaries are more frequent. To evaluate death bowlers fairly, compare them to phase baselines (league/season averages for the death phase). That helps you see whether a bowler is genuinely strong in the death or simply facing harsher conditions.
# phase_baseline <- balls %>%
# group_by(phase) %>%
# summarise(
# baseline_econ = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
# .groups="drop"
# )
# bowling_adj <- bowling_phase %>%
# left_join(phase_baseline, by = "phase") %>%
# mutate(econ_above_baseline = econ - baseline_econ)
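Here is the baseline comparison on a toy sample with two bowlers who each bowl one death over. The data are invented, and economy is computed from total runs as in the template above.

```r
library(dplyr)

bowl_toy <- tibble(
  bowler     = rep(c("X", "Y"), each = 6),
  phase      = "Death",
  total_runs = c(1L, 0L, 4L, 1L, 0L, 0L, 6L, 1L, 4L, 0L, 1L, 0L),
  legal_ball = 1L
)

bowling_phase_toy <- bowl_toy %>%
  group_by(bowler, phase) %>%
  summarise(
    balls = sum(legal_ball),
    econ  = 6 * sum(total_runs) / pmax(balls, 1),
    .groups = "drop"
  )

phase_baseline_toy <- bowl_toy %>%
  group_by(phase) %>%
  summarise(
    baseline_econ = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
    .groups = "drop"
  )

bowling_adj_toy <- bowling_phase_toy %>%
  left_join(phase_baseline_toy, by = "phase") %>%
  mutate(econ_above_baseline = econ - baseline_econ)

bowling_adj_toy
```

With a phase baseline of 9 runs per over, bowler X (economy 6) is 3 runs better than baseline and bowler Y (economy 12) is 3 runs worse. Negative values are good for the bowler.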
7) Matchups: Bowler vs Batter (and How Not to Overfit)
Matchups are popular because they feel actionable: “Does bowler A match up well against batter B?” The risk is that many matchups are based on small samples. The solution is to: (1) enforce minimum balls, (2) report uncertainty, and (3) consider shrinkage if you operationalize results.
7.1 Simple matchup table
# matchups <- balls %>%
# group_by(bowler, batter) %>%
# summarise(
# balls = sum(legal_ball),
# runs = sum(batter_runs),
# outs = sum(is_wicket == 1 & player_dismissed == batter),
# sr = 100 * runs / pmax(balls, 1),
# out_rate = 100 * outs / pmax(balls, 1),
# .groups="drop"
# ) %>%
# filter(balls >= 30) %>%
# arrange(desc(out_rate))
7.2 Add confidence intervals (quick approximation)
As a lightweight option, treat out events as binomial and compute approximate intervals for out rate. This is not perfect, but it is better than treating a 2-out sample the same as a 20-out sample.
# matchups_ci <- matchups %>%
# mutate(
# p = outs / pmax(balls, 1),
# se = sqrt(p * (1 - p) / pmax(balls, 1)),
# lo = pmax(p - 1.96 * se, 0),
# hi = pmin(p + 1.96 * se, 1),
# out_rate_lo = 100 * lo,
# out_rate_hi = 100 * hi
# )
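The shrinkage idea mentioned at the start of this section can be sketched with a simple Beta-binomial style estimate: pull each matchup's out rate toward a league-wide rate, with less pull as the sample grows. The league rate (0.05) and prior strength (60 balls) below are assumed values for illustration, not estimates from real data.

```r
library(dplyr)

matchups_toy <- tibble(
  bowler = c("X", "Y", "Z"),
  batter = c("A", "A", "A"),
  balls  = c(30L, 120L, 300L),
  outs   = c(3L, 6L, 12L)
)

league_out_rate <- 0.05  # assumed league-wide dismissal probability per ball
prior_balls     <- 60    # assumed prior strength, in "pseudo-balls"

matchups_shrunk <- matchups_toy %>%
  mutate(
    raw_rate    = outs / balls,
    # posterior mean under a Beta prior centred on the league rate
    shrunk_rate = (outs + league_out_rate * prior_balls) / (balls + prior_balls)
  )

matchups_shrunk
```

The 30-ball matchup drops from a raw 10% out rate to about 6.7%, while the 300-ball matchup barely moves (4.0% to about 4.2%). In practice the prior strength is something you would tune or estimate rather than fix by hand.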
8) Visualizations That Make Sense to Cricket Fans
The best cricket charts are those that map directly to the mental model of the game. Here are a few workhorses:
- Run rate by over to reveal acceleration and collapse patterns
- Worm charts (cumulative runs) to compare innings trajectories
- Wicket timeline to explain how innings shape changes
- Phase leaderboards to compare roles (powerplay vs death)
8.1 Run rate by over
# over_summary <- balls %>%
# group_by(match_id, innings, over) %>%
# summarise(
# runs = sum(total_runs),
# legal_balls = sum(legal_ball),
# .groups="drop"
# ) %>%
# mutate(rr = 6 * runs / pmax(legal_balls, 1))
# ggplot(over_summary, aes(x = over, y = rr, group = match_id)) + # one line per match
# geom_line() +
# facet_wrap(~innings) +
# labs(x="Over", y="Run rate", title="Run Rate by Over")
8.2 Worm chart (cumulative runs)
# worm <- balls %>%
# group_by(match_id, innings) %>%
# arrange(over, ball_in_over, .by_group = TRUE) %>%
# mutate(cum_runs = cumsum(total_runs),
# legal_balls = cumsum(legal_ball)) %>%
# ungroup()
# ggplot(worm, aes(x = legal_balls, y = cum_runs, group = innings)) +
# geom_line() +
# facet_wrap(~match_id) +
# labs(x="Legal balls", y="Cumulative runs", title="Worm Chart")
9) Win Probability Modeling in R (Baseline + Upgrades)
Win probability models answer a common fan and analyst question: “Given the current state, how likely is the chasing team to win?” A simple and surprisingly effective baseline uses a logistic regression on chase state variables.
9.1 Baseline logistic regression
# wp_data <- chase %>%
# filter(balls_left > 0) %>%
# mutate(
# win = as.integer(chasing_team_won) # adapt to your encoding
# ) %>%
# select(win, runs_needed, balls_left, wkts_in_hand, req_rr) %>%
# filter(is.finite(req_rr))
# wp_model <- glm(
# win ~ runs_needed + balls_left + wkts_in_hand + req_rr,
# data = wp_data,
# family = binomial()
# )
# wp_data$win_prob <- predict(wp_model, newdata = wp_data, type = "response")
9.2 Evaluate your model (don’t skip this)
If you publish probabilities, calibration matters. At minimum, track:
- Log loss (probability quality)
- Brier score (mean squared error of the predicted probabilities; reflects both calibration and sharpness)
- Time-based splits (train on earlier seasons, test on later)
# brier <- mean((wp_data$win_prob - wp_data$win)^2, na.rm = TRUE)
# brier
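The three checks above can be combined in a few lines of base R. This sketch runs on simulated chase states, so the data-generating process, coefficients, and season range are all assumptions made purely so the example is runnable; swap in your own `wp_data` with a season column.

```r
set.seed(42)
n <- 2000

# Simulated chase states (illustration only, not real match data).
sim <- data.frame(
  season       = sample(2019:2023, n, replace = TRUE),
  runs_needed  = sample(1:150, n, replace = TRUE),
  balls_left   = sample(30:120, n, replace = TRUE),
  wkts_in_hand = sample(1:10, n, replace = TRUE)
)
sim$req_rr <- 6 * sim$runs_needed / sim$balls_left

# Assumed relationship between state and outcome.
lin <- 2 - 0.35 * sim$req_rr + 0.25 * sim$wkts_in_hand
sim$win <- rbinom(n, 1, plogis(lin))

# Time-based split: train on earlier seasons, test on later ones.
train <- subset(sim, season <= 2021)
test  <- subset(sim, season >= 2022)

fit <- glm(win ~ req_rr + wkts_in_hand, data = train, family = binomial())
p   <- predict(fit, newdata = test, type = "response")

# Log loss, with clipping to avoid log(0).
eps      <- 1e-15
p_clip   <- pmin(pmax(p, eps), 1 - eps)
log_loss <- -mean(test$win * log(p_clip) + (1 - test$win) * log(1 - p_clip))

# Brier score, plus a naive reference that always predicts the training base rate.
brier      <- mean((p - test$win)^2)
brier_base <- mean((mean(train$win) - test$win)^2)

c(log_loss = log_loss, brier = brier, brier_base = brier_base)
```

Comparing against the base-rate reference is a useful sanity check: a model that cannot beat a constant prediction out of sample is not yet worth publishing.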
9.3 Practical upgrades
- Non-linearity: gradient boosting or splines for runs_needed × balls_left effects
- Venue priors: include ground scoring tendencies
- Team strength: add pre-match estimates as a prior
- Calibration: apply isotonic regression or Platt scaling
10) IPL Insights: Roles, Venues, and Player Value
The IPL is ideal for analytics because it combines diverse conditions with frequent high-pressure situations and specialized roles. Instead of asking “Who scored the most runs?”, a more IPL-relevant question is “Who performs a role efficiently?”
10.1 Role-based leaderboards
One of the most useful patterns is to build phase-based leaderboards for: powerplay aggressors, middle-over stabilizers, and death-over finishers. The same idea applies to bowlers (powerplay specialists, middle controllers, death defenders).
# Finishers (Death overs, minimum sample)
# finishers <- batting_phase %>%
# filter(phase == "Death", balls_faced >= 100) %>%
# arrange(desc(sr))
# Powerplay bowlers (Powerplay, minimum sample)
# pp_bowlers <- bowling_phase %>%
# filter(phase == "Powerplay", balls >= 120) %>%
# arrange(econ)
10.2 Venue adjustment (separating skill from conditions)
Some venues inflate scoring; others suppress it. A simple adjustment is to compute a venue baseline run rate and then measure player performance relative to that baseline.
# venue_rr <- balls %>%
# group_by(venue) %>%
# summarise(
# venue_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
# .groups="drop"
# )
# batter_venue <- balls %>%
# group_by(batter, venue) %>%
# summarise(
# batter_run_rate = 6 * sum(total_runs) / pmax(sum(legal_ball), 1),
# balls = sum(legal_ball),
# .groups="drop"
# ) %>%
# left_join(venue_rr, by="venue") %>%
# mutate(adj_run_rate = batter_run_rate - venue_run_rate)
10.3 Player “value” as expected contribution
If you want to move toward value modeling, a practical approach is to estimate expected runs per ball (batting) and expected runs conceded per ball (bowling) in context (phase, venue, matchup). You can then compare players under similar conditions.
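A minimal sketch of that idea, using only phase as the context: compute expected runs per ball as the phase average, then credit each batter with runs above expectation. The toy data and column names such as `exp_runs_per_ball` are assumptions; a fuller version would condition on venue and matchup as well.

```r
library(dplyr)

val_toy <- tibble(
  batter      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  phase       = c("Powerplay", "Powerplay", "Death", "Death",
                  "Powerplay", "Powerplay", "Death", "Death"),
  batter_runs = c(1L, 4L, 6L, 4L, 0L, 1L, 1L, 1L)
)

# Expected runs per ball in each context (here: the phase average).
context_exp <- val_toy %>%
  group_by(phase) %>%
  summarise(exp_runs_per_ball = mean(batter_runs), .groups = "drop")

# Runs above expectation, summed over the balls each batter actually faced.
player_value <- val_toy %>%
  left_join(context_exp, by = "phase") %>%
  group_by(batter) %>%
  summarise(
    balls          = n(),
    runs           = sum(batter_runs),
    expected_runs  = sum(exp_runs_per_ball),
    runs_above_exp = runs - expected_runs,
    .groups = "drop"
  )

player_value
```

Both batters face the same mix of phases, so each has 9 expected runs; batter A finishes 6 above expectation and batter B 6 below. On real data the expectation should be estimated from a much larger sample than the players being evaluated.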
11) Reproducible Reporting in R (Quarto / R Markdown)
Reproducibility is a competitive advantage in analytics. It ensures your results can be refreshed with new matches, audited, and reused. Quarto (or R Markdown) lets you publish analysis as a single document that includes narrative, code, and output.
11.1 A clean project structure
cricket-analytics/
  data/
    raw/
    cleaned/
  R/
    cleaning.R
    phases.R
    metrics.R
    plots.R
  reports/
    weekly-report.qmd
    match-preview.qmd
  models/
  output/
  README.md
11.2 A minimal Quarto report pattern
# weekly-report.qmd
# ---
# title: "Weekly Cricket Analytics Report"
# format: html
# ---
# ```{r}
# source("R/metrics.R")
# balls <- readRDS("data/cleaned/balls.rds")
# leaderboard <- make_batting_leaderboard(balls)
# leaderboard
# ```
Keep reusable logic in R/ functions and call them from reports.
That prevents copy-paste drift and keeps updates consistent.
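The report pattern above calls `make_batting_leaderboard()`, which is not defined in this post. One possible shape for it in R/metrics.R is sketched below; the signature, the `min_balls` argument, and the column names are assumptions.

```r
library(dplyr)

# Hypothetical contents of R/metrics.R
make_batting_leaderboard <- function(balls, min_balls = 100) {
  balls %>%
    group_by(batter) %>%
    summarise(
      balls_faced = sum(legal_ball),
      runs        = sum(batter_runs),
      sr          = 100 * runs / pmax(balls_faced, 1),
      .groups = "drop"
    ) %>%
    filter(balls_faced >= min_balls) %>%  # minimum sample threshold
    arrange(desc(sr))
}

# Toy check with a low threshold so the example returns rows.
demo_balls <- tibble(
  batter      = rep(c("A", "B", "C"), times = c(4, 4, 1)),
  batter_runs = c(4L, 0L, 1L, 1L, 6L, 6L, 0L, 0L, 6L),
  legal_ball  = 1L
)

leaderboard <- make_batting_leaderboard(demo_balls, min_balls = 4)
leaderboard
```

Batter C is dropped by the sample threshold, and B (strike rate 300) ranks above A (strike rate 150). Because the function lives in one file, every report that sources it picks up a change to the threshold or the metric definition automatically.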
12) FAQ
Is R a good choice for cricket analytics?
Yes. R is strong for cricket analytics because it supports tidy data workflows, fast iteration, high-quality visualization, and a wide range of statistical and machine learning models. It’s especially effective when you need reproducible reporting.
What’s more important: match summaries or ball-by-ball data?
Match summaries help with quick comparisons, but ball-by-ball data enables deeper questions: phase analysis, matchup evaluation, pressure modeling, and win probability estimation.
How do I avoid misleading player comparisons?
Use minimum sample thresholds, compare within roles/phases, and consider uncertainty. Adjust for venue and era effects when comparing across seasons. When you turn results into decisions, consider shrinkage or Bayesian approaches.
How should I start if I’m new?
Start with one format (often T20), build a clean ball-by-ball table, compute phase leaderboards for batting and bowling, then add one modeling task (like chase win probability). Keep everything reproducible from the beginning.
13) Next Steps
If you want a complete, structured, hands-on learning path, with end-to-end workflows, practical examples, and a clear progression from data ingestion to modeling, check out Cricket Analytics with R.
The post How to Analyze Ball-by-Ball Cricket Data in R (cricketdata) appeared first on R Programming Books.