Fight Data Science in R: Proven Boxing Metrics & Models

Boxing analysis is no longer just about punch totals or “who looked busier.” Modern fight analysis is data science: repeatable pipelines, validated data, explainable models, and performance indicators that translate into strategy. This post shows how to build a professional fight data science workflow in R—from raw data to metrics, modeling, and tactical insights—using code you can adapt to your own datasets.

You’ll get: a production-style project structure, data contracts, validation checks, feature engineering patterns, round-by-round models, fatigue and momentum signals, and high-signal visualizations for coaches and analysts. The goal is to help you move from “interesting charts” to decision-grade analytics.

1) Professional setup and project structure

A “pro” analytics workflow starts with discipline: consistent folders, reproducible environments, and clear separation of raw → clean → features → models. Even if you’re solo, this structure makes your work easier to iterate and publish.

# Core libraries for fight data science
# (glmnet is the engine behind the regularized logistic models fitted later)
pkgs <- c(
  "tidyverse", "janitor", "lubridate", "glue", "cli",
  "arrow", "here", "fs",
  "duckdb", "DBI",
  "slider",
  "rsample", "recipes", "parsnip", "workflows", "tune", "dials", "glmnet",
  "yardstick", "broom",
  "ggrepel", "patchwork"
)

to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))

# Create a clean project layout (idempotent)
dirs <- c(
  "data/raw",
  "data/clean",
  "data/features",
  "data/models",
  "plots",
  "reports",
  "R"
)

walk(dirs, ~ fs::dir_create(here::here(.x)))

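# Thin logging helpers. cli interpolates {expressions} itself, so pass the caller's
# environment through; otherwise variables local to the calling function are not found.
log_info <- function(msg, .envir = parent.frame()) cli::cli_alert_info(msg, .envir = .envir)
log_ok   <- function(msg, .envir = parent.frame()) cli::cli_alert_success(msg, .envir = .envir)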

log_ok("Project folders ready at: {here::here()}")

Tip: keep raw files read-only and always write standardized outputs (e.g., Parquet) to data/clean/. Re-runs then start from typed, columnar files instead of re-parsing CSVs, and a bad transform can never overwrite the source data.
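One way to enforce the read-only rule is to strip write permission from everything under data/raw/. The sketch below uses base R's Sys.chmod() on a temporary folder standing in for data/raw/. The file name is hypothetical, and permission bits behave differently on Windows, so treat it as an illustration rather than a guarantee.

```r
# Temporary stand-in for data/raw/ so the example is safe to run anywhere
raw_dir <- tempfile("raw_demo_")
dir.create(raw_dir)
raw_file <- file.path(raw_dir, "round_totals.csv")  # hypothetical file name
writeLines("fight_id,round\nF1,1", raw_file)

# Remove write permission: 0444 = read-only for owner, group, and others
raw_files <- list.files(raw_dir, full.names = TRUE)
Sys.chmod(raw_files, mode = "0444", use_umask = FALSE)

# Inspect the resulting permission bits
mode_after <- format(file.info(raw_file)$mode)
print(mode_after)
```

Run it once after each new raw delivery; the cleaning code only ever needs read access to this folder.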

2) A fight data contract (schemas that prevent chaos)

The fastest way to break a fight analytics project is to let columns drift (“fighter” vs “boxer”, mixed date formats, different naming conventions for the same actions). A data contract prevents that. Below are two useful contracts:

Round totals (works with CompuBox-style aggregates)
Event-level metadata (for joining and reporting)

round_schema <- tibble::tribble(
  ~column,              ~type,       ~notes,
  "fight_id",           "character",  "Unique fight identifier",
  "event_id",           "character",  "Unique event identifier",
  "event_date",         "date",       "ISO date",
  "weight_class",       "character",  "e.g., Welterweight",
  "fighter",            "character",  "This row's fighter",
  "opponent",           "character",  "Opponent fighter",
  "corner",             "character",  "Red/Blue or A/B",
  "round",              "integer",    "Round number",
  "jabs_landed",        "integer",    "Jabs landed",
  "jabs_attempted",     "integer",    "Jabs attempted",
  "power_landed",       "integer",    "Power shots landed",
  "power_attempted",    "integer",    "Power shots attempted",
  "knockdowns",         "integer",    "Knockdowns in round",
  "stance",             "character",  "orthodox/southpaw/other",
  "result_round",       "integer",    "1 if fighter won the round, 0 if lost (or NA if unknown)"
)

event_schema <- tibble::tribble(
  ~column,         ~type,       ~notes,
  "event_id",      "character", "Unique event identifier",
  "event_name",    "character", "Event name",
  "event_date",    "date",      "ISO date",
  "location",      "character", "City/Country (optional)",
  "promotion",     "character", "Promotion/org (optional)"
)

round_schema

If you don’t have result_round, you can still do great analytics: predict round outcomes, infer momentum, and quantify “who was in control” using validated scoring proxies.
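A contract is only useful if something checks it. As a minimal sketch, the base R helper below compares a data frame against a schema table and returns one row per violation. The name check_contract() and the type-to-predicate mapping are assumptions made for illustration, and the schema here is a four-column subset of round_schema.

```r
# Minimal contract: column names and expected types (subset of round_schema)
schema <- data.frame(
  column = c("fight_id", "event_date", "round", "jabs_landed"),
  type   = c("character", "date", "integer", "integer"),
  stringsAsFactors = FALSE
)

# Map contract type labels to predicate functions
type_checks <- list(
  character = is.character,
  integer   = is.integer,
  date      = function(x) inherits(x, "Date")
)

# Return one row per violation (zero rows means the data conforms)
check_contract <- function(df, schema) {
  problems <- lapply(seq_len(nrow(schema)), function(i) {
    col  <- schema$column[i]
    type <- schema$type[i]
    if (!col %in% names(df)) {
      return(data.frame(column = col, problem = "missing", stringsAsFactors = FALSE))
    }
    if (!type_checks[[type]](df[[col]])) {
      found <- class(df[[col]])[1]
      return(data.frame(column = col,
                        problem = paste0("expected ", type, ", found ", found),
                        stringsAsFactors = FALSE))
    }
    NULL
  })
  problems <- do.call(rbind, problems)
  if (is.null(problems)) {
    data.frame(column = character(), problem = character(), stringsAsFactors = FALSE)
  } else {
    problems
  }
}

# Toy data with two deliberate violations: round is character, jabs_landed is absent
toy <- data.frame(
  fight_id   = "F1",
  event_date = as.Date("2024-01-01"),
  round      = "1",
  stringsAsFactors = FALSE
)
violations <- check_contract(toy, schema)
print(violations)
```

Returning a table instead of stopping on the first problem lets you see every drifted column in one pass.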

3) Ingestion and standardization

Here’s a robust ingestion pattern: read raw CSVs, normalize names, enforce types, standardize fighter naming, and write Parquet for speed. Adjust paths to your sources.

read_round_totals <- function(path) {
  log_info("Reading raw round totals: {path}")
  readr::read_csv(path, show_col_types = FALSE) %>%
    janitor::clean_names()
}

standardize_round_totals <- function(df) {
  df %>%
    mutate(
      event_date = as.Date(event_date),
      round = as.integer(round),
      across(
        c(jabs_landed, jabs_attempted, power_landed, power_attempted, knockdowns),
        ~ as.integer(replace_na(.x, 0))
      ),
      across(c(fight_id, event_id, fighter, opponent, weight_class, stance, corner), as.character),
      stance = tolower(stance),
      corner = toupper(corner)
    ) %>%
    # Basic name normalization (str_squish trims and collapses internal whitespace)
    mutate(
      fighter      = str_squish(fighter),
      opponent     = str_squish(opponent),
      weight_class = str_squish(weight_class)
    )
}

write_clean_parquet <- function(df, out_path) {
  fs::dir_create(fs::path_dir(out_path))
  arrow::write_parquet(df, out_path)
  log_ok("Wrote Parquet: {out_path}")
}

# Example:
# raw_path  <- here::here("data/raw/round_totals.csv")
# clean_out <- here::here("data/clean/round_totals.parquet")
# rounds_clean <- read_round_totals(raw_path) %>% standardize_round_totals()
# write_clean_parquet(rounds_clean, clean_out)

Parquet is a game-changer for analytics work: fast I/O, consistent types, and easy integration with DuckDB for SQL-style querying.
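To make the "consistent types" point concrete, here is a small round-trip sketch on a toy table (the values are illustrative). It writes the same data to Parquet and to CSV, reads both back, and compares column classes. It assumes the arrow package is installed.

```r
toy <- data.frame(
  fight_id   = c("F1", "F1"),
  event_date = as.Date(c("2024-01-01", "2024-01-01")),
  round      = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Parquet round trip: the column types are stored in the file itself
pq_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(toy, pq_path)
pq_back <- as.data.frame(arrow::read_parquet(pq_path))

# CSV round trip with base R: the types have to be re-inferred on every read
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
csv_back <- read.csv(csv_path, stringsAsFactors = FALSE)

first_class <- function(x) class(x)[1]
types_before <- vapply(toy, first_class, character(1))
types_parquet <- vapply(pq_back, first_class, character(1))
types_csv <- vapply(csv_back, first_class, character(1))

print(rbind(original = types_before, parquet = types_parquet, csv = types_csv))
```

The Parquet copy comes back with the date column still a Date, while base read.csv() returns it as text that must be parsed again.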

4) Validation, QA, and anomaly detection

Fight data is full of subtle mistakes: attempted shots < landed shots, duplicated rounds, mixed fighter/opponent rows, or “impossible” knockdown counts. Validation should be automatic.

validate_round_totals <- function(df) {
  # Required columns check (result_round is optional: unlabeled rounds are still usable)
  required <- setdiff(round_schema$column, "result_round")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(glue::glue("Missing required columns: {paste(missing_cols, collapse=', ')}"))
  }

  # Logical checks
  bad_landed <- df %>%
    filter(jabs_landed > jabs_attempted | power_landed > power_attempted)

  if (nrow(bad_landed) > 0) {
    log_info("Found {nrow(bad_landed)} rows where landed > attempted (check source or parsing).")
  }

  # Duplicate round rows (same fight_id, fighter, round)
  dupes <- df %>%
    count(fight_id, fighter, round) %>%
    filter(n > 1)

  if (nrow(dupes) > 0) {
    log_info("Found duplicates: {nrow(dupes)} fight/fighter/round combinations.")
  }

  # Suspicious extremes (simple heuristic)
  suspicious <- df %>%
    mutate(total_attempted = jabs_attempted + power_attempted) %>%
    filter(total_attempted > 120 | knockdowns > 3)

  if (nrow(suspicious) > 0) {
    log_info("Found {nrow(suspicious)} suspicious rows (very high volume or knockdowns).")
  }

  df
}

# Example:
# rounds_clean <- rounds_clean %>% validate_round_totals()

Validation gives you confidence. And confidence is what makes analytics actionable—especially when you’re presenting results to coaches, fighters, or bettors who will challenge your assumptions.
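validate_round_totals() logs its findings and passes the data through. For unattended runs you may prefer a hard stop. The sketch below is one way to do that in base R, covering the landed-versus-attempted and duplicate checks; validation_summary() and assert_clean() are hypothetical helper names, not part of the workflow above.

```r
# Count rows failing each check; column names follow the round contract
validation_summary <- function(df) {
  key <- paste(df$fight_id, df$fighter, df$round, sep = "|")
  c(
    landed_gt_attempted = sum(df$jabs_landed > df$jabs_attempted |
                                df$power_landed > df$power_attempted, na.rm = TRUE),
    duplicate_rows      = sum(duplicated(key))
  )
}

# Hypothetical strict mode: stop with a readable message if any check fails
assert_clean <- function(df) {
  s <- validation_summary(df)
  failed <- s[s > 0]
  if (length(failed) > 0) {
    stop("Validation failed: ",
         paste0(names(failed), " = ", failed, collapse = ", "),
         call. = FALSE)
  }
  invisible(df)
}

# Toy data: rows 1 and 2 are duplicates, row 3 has more jabs landed than attempted
toy <- data.frame(
  fight_id        = c("F1", "F1", "F1"),
  fighter         = c("A", "A", "B"),
  round           = c(1L, 1L, 1L),
  jabs_landed     = c(10L, 10L, 5L),
  jabs_attempted  = c(30L, 30L, 4L),
  power_landed    = c(8L, 8L, 6L),
  power_attempted = c(20L, 20L, 15L),
  stringsAsFactors = FALSE
)

summary_counts <- validation_summary(toy)
print(summary_counts)

strict_result <- tryCatch(assert_clean(toy), error = function(e) conditionMessage(e))
print(strict_result)
```

A reasonable split is to log during exploration and call the strict version in scheduled jobs, so bad source files never reach data/clean/.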

5) Feature engineering: pace, accuracy, intent, damage proxies

Fight performance is multidimensional. A clean feature set usually includes:

Pace: attempts per round, pace change across rounds
Accuracy: landed / attempted (jabs, power, total)
Intent / style: jab share vs power share
Damage proxies: power landed, knockdowns, power accuracy
Relative dominance: fighter metrics minus opponent metrics

engineer_round_features <- function(df) {
  df %>%
    mutate(
      total_landed    = jabs_landed + power_landed,
      total_attempted = jabs_attempted + power_attempted,
      acc_jab   = if_else(jabs_attempted > 0, jabs_landed / jabs_attempted, NA_real_),
      acc_power = if_else(power_attempted > 0, power_landed / power_attempted, NA_real_),
      acc_total = if_else(total_attempted > 0, total_landed / total_attempted, NA_real_),
      jab_share_attempts = if_else(total_attempted > 0, jabs_attempted / total_attempted, NA_real_),
      power_share_attempts = if_else(total_attempted > 0, power_attempted / total_attempted, NA_real_),
      # Simple damage proxy: power landed + weighted knockdowns
      # (the weight of 8 is a heuristic, not an estimated value; tune it to your data)
      damage_proxy = power_landed + 8 * knockdowns
    )
}

# Opponent-relative features (requires pairing fighter vs opponent within the same fight_id and round)
add_relative_features <- function(df) {
  df2 <- df %>%
    select(fight_id, round, fighter, opponent,
           total_landed, total_attempted, acc_total,
           power_landed, power_attempted, acc_power,
           damage_proxy, knockdowns) %>%
    rename_with(~ paste0("opp_", .x), -c(fight_id, round, fighter, opponent)) %>%
    rename(fighter_join = opponent, opponent_join = fighter)

  df %>%
    left_join(
      df2,
      by = c("fight_id" = "fight_id", "round" = "round", "fighter" = "fighter_join", "opponent" = "opponent_join")
    ) %>%
    mutate(
      rel_total_landed = total_landed - opp_total_landed,
      rel_acc_total    = acc_total - opp_acc_total,
      rel_power_landed = power_landed - opp_power_landed,
      rel_damage       = damage_proxy - opp_damage_proxy,
      rel_knockdowns   = knockdowns - opp_knockdowns
    )
}

# Example:
# rounds_feat <- rounds_clean %>% engineer_round_features() %>% add_relative_features()

Relative features are where fight analytics becomes tactical: a fighter’s pace means little without context. Dominance is “what you did” minus “what you absorbed.”
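A one-round toy example shows how the pairing works. This is a base R sketch of the same idea as add_relative_features(), using merge() instead of a dplyr join; the numbers are made up for illustration.

```r
# One round, two fighters (illustrative numbers)
rounds <- data.frame(
  fight_id     = "F1",
  round        = 1L,
  fighter      = c("A", "B"),
  opponent     = c("B", "A"),
  power_landed = c(12L, 7L),
  knockdowns   = c(0L, 1L),
  stringsAsFactors = FALSE
)
rounds$damage_proxy <- rounds$power_landed + 8 * rounds$knockdowns

# Opponent view: each row's damage, keyed by the fighter who absorbed it
opp <- data.frame(
  fight_id         = rounds$fight_id,
  round            = rounds$round,
  fighter          = rounds$opponent,
  opp_damage_proxy = rounds$damage_proxy,
  stringsAsFactors = FALSE
)

# Pair each fighter with what the opponent did in the same round
paired <- merge(rounds, opp, by = c("fight_id", "round", "fighter"))
paired$rel_damage <- paired$damage_proxy - paired$opp_damage_proxy

print(paired[, c("fighter", "damage_proxy", "opp_damage_proxy", "rel_damage")])
```

Fighter A landed more power shots, but B's knockdown flips the relative damage in B's favor. The two rel_damage values in a round always sum to zero, which makes a handy sanity check on the join.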

6) Round-by-round modeling (probability of winning a round)

If you have labeled rounds (result_round = 1/0), you can model round outcomes using interpretable classifiers. Even if you don’t, you can label from trusted sources or use proxy labels (with caution).

Below is an end-to-end workflow using tidymodels: split, recipe, logistic regression with regularization, tuning, and calibration-friendly evaluation.

# Assume rounds_feat has result_round (1/0) for at least some fights.
# parsnip classification models need a factor outcome, so convert the 1/0 label.
rounds_labeled <- rounds_feat %>%
  filter(!is.na(result_round)) %>%
  mutate(result_round = factor(result_round, levels = c(0, 1)))

# Note: both fighters' rows from the same round are in this table, so a purely random
# split can put mirrored rows in train and test. Consider splitting by fight_id
# (e.g., rsample::group_initial_split()) if you need a stricter holdout.
set.seed(123)
spl <- rsample::initial_split(rounds_labeled, prop = 0.8, strata = result_round)
train <- rsample::training(spl)
test  <- rsample::testing(spl)

rec <- recipes::recipe(result_round ~ rel_total_landed + rel_acc_total + rel_power_landed + rel_damage +
                        total_attempted + acc_total + jab_share_attempts + damage_proxy +
                        stance,
                      data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Lasso-penalized logistic regression (mixture = 1); penalty is tuned below
mod <- parsnip::logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflows::workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

# Penalty grid on the log10 scale: 1e-6 to 1
grid <- dials::grid_regular(dials::penalty(range = c(-6, 0)), levels = 30)

set.seed(123)
folds <- rsample::vfold_cv(train, v = 5, strata = result_round)

metrics <- yardstick::metric_set(yardstick::roc_auc, yardstick::pr_auc, yardstick::accuracy, yardstick::mn_log_loss)

tuned <- tune::tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metrics,
  # "1" (round won) is the second factor level, so tell yardstick which level is the event
  control = tune::control_grid(event_level = "second")
)

best <- tune::select_best(tuned, metric = "mn_log_loss")
final_wf <- tune::finalize_workflow(wf, best)

final_fit <- final_wf %>% fit(train)

# Evaluate on holdout test set
test_pred <- predict(final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(result_round))

yardstick::roc_auc(test_pred, truth = result_round, .pred_1, event_level = "second")
yardstick::mn_log_loss(test_pred, truth = result_round, .pred_1, event_level = "second")

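Why log loss? Because in fight analytics, calibrated probabilities matter. Log loss penalizes confident wrong predictions heavily, so it rewards a model whose “0.55” rounds are actually won about 55% of the time, not just one that lands on the right side of 0.5.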
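A tiny worked example makes the difference visible. The sketch below implements binary log loss in base R and compares two hypothetical sets of predictions that have the same accuracy; the probabilities are invented for illustration.

```r
# Binary log loss: mean negative log-likelihood of the observed outcomes
log_loss <- function(truth, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)  # avoid log(0)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

# Four rounds: 1 = fighter won the round, 0 = lost
truth <- c(1, 1, 0, 0)

# Both models miss only the second round, so accuracy is identical
cautious      <- c(0.60, 0.40, 0.40, 0.40)
overconfident <- c(0.99, 0.01, 0.01, 0.01)

accuracy <- function(truth, prob) mean((prob > 0.5) == (truth == 1))
acc_cautious      <- accuracy(truth, cautious)
acc_overconfident <- accuracy(truth, overconfident)

ll_cautious      <- log_loss(truth, cautious)
ll_overconfident <- log_loss(truth, overconfident)

print(c(acc_cautious = acc_cautious, acc_overconfident = acc_overconfident))
print(c(ll_cautious = ll_cautious, ll_overconfident = ll_overconfident))
```

Accuracy cannot separate the two, but the overconfident model pays a large penalty for assigning 1% to a round the fighter actually won.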

7) Fight outcome modeling (interpretable + calibrated)

Fight outcomes can be modeled from aggregated round features: average dominance, variance (consistency), late-round fade, and knockdown impacts. First, summarize per fight and fighter.

summarize_fight_features <- function(df) {
  df %>%
    group_by(fight_id, event_id, event_date, weight_class, fighter, opponent) %>%
    summarise(
      rounds = n(),
      avg_rel_damage = mean(rel_damage, na.rm = TRUE),
      avg_rel_power_landed = mean(rel_power_landed, na.rm = TRUE),
      avg_rel_total_landed = mean(rel_total_landed, na.rm = TRUE),
      avg_rel_acc_total = mean(rel_acc_total, na.rm = TRUE),
      # Volatility/consistency
      sd_rel_damage = sd(rel_damage, na.rm = TRUE),
      # Pace markers
      avg_total_attempted = mean(total_attempted, na.rm = TRUE),
      # Knockdown signal
      total_knockdowns = sum(knockdowns, na.rm = TRUE),
      .groups = "drop"
    )
}

fight_level <- summarize_fight_features(rounds_feat)

# If you have fight outcome label for fighter perspective (win=1/lose=0):
# fight_level <- fight_level %>% left_join(outcomes, by = c("fight_id","fighter"))

Then model fight wins with an interpretable learner. Logistic regression is often a strong baseline; boosted trees can add performance if you keep explainability via feature importance and partial dependence (where appropriate).

# Example: fight_level carries a win label (1/0) joined from your outcomes table.
# Convert it to a factor, since parsnip classification models need a factor outcome.
fight_labeled <- fight_level %>%
  filter(!is.na(win)) %>%
  mutate(win = factor(win, levels = c(0, 1)))

set.seed(42)
spl2 <- rsample::initial_split(fight_labeled, prop = 0.8, strata = win)
tr2 <- training(spl2)
te2 <- testing(spl2)

rec2 <- recipe(win ~ avg_rel_damage + avg_rel_power_landed + avg_rel_total_landed +
                avg_rel_acc_total + sd_rel_damage + avg_total_attempted + total_knockdowns,
              data = tr2) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

mod2 <- logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")

wf2 <- workflow() %>% add_recipe(rec2) %>% add_model(mod2)

grid2 <- grid_regular(penalty(range = c(-7, 0)), levels = 40)

set.seed(42)
folds2 <- vfold_cv(tr2, v = 5, strata = win)

# Reuses the `metrics` set defined for the round model; "1" (win) is the second factor level
tuned2 <- tune_grid(wf2, resamples = folds2, grid = grid2, metrics = metrics,
                    control = control_grid(event_level = "second"))

# Recent versions of tune require the metric argument to be named
best2 <- select_best(tuned2, metric = "mn_log_loss")
final2 <- finalize_workflow(wf2, best2) %>% fit(tr2)

pred2 <- predict(final2, te2, type = "prob") %>% bind_cols(te2 %>% select(win))
roc_auc(pred2, truth = win, .pred_1, event_level = "second")
mn_log_loss(pred2, truth = win, .pred_1, event_level = "second")

# Inspect coefficients (interpretability). Predictors are standardized, so a positive
# estimate raises the log-odds of a win per one-SD increase in that feature.
final2 %>%
  extract_fit_parsnip() %>%
  broom::tidy() %>%
  arrange(desc(abs(estimate))) %>%
  head(20)

A practical coaching readout might be: “Your average relative damage was +4 per round, but volatility was high. You won the peaks and lost the valleys—work on maintaining output in the middle rounds.”
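To turn coefficients into that kind of readout, it helps to convert log-odds into odds multipliers and probability shifts. The sketch below assumes the predictors were standardized, as in the recipe above, so each coefficient is a change in log-odds per one standard deviation. The coefficient values are hypothetical placeholders, not results from any fitted model.

```r
# Hypothetical coefficients, shaped like the term/estimate columns of a tidy() table
coefs <- data.frame(
  term     = c("avg_rel_damage", "sd_rel_damage", "total_knockdowns"),
  estimate = c(0.80, -0.25, 0.40),
  stringsAsFactors = FALSE
)

# exp(beta) is the multiplicative change in the odds of winning per one-SD increase
coefs$odds_multiplier <- exp(coefs$estimate)

# Win probability after a one-SD increase, starting from a chosen baseline probability
shift_probability <- function(p0, beta) plogis(qlogis(p0) + beta)
coefs$p_from_even <- shift_probability(0.5, coefs$estimate)

print(coefs)
```

Reading from an even fight (baseline 0.5) keeps the numbers intuitive for coaches. The same coefficient moves the probability less when the baseline is already close to 0 or 1.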

8) Fatigue, momentum, and tactical shifts

Fatigue often shows up as a drop in attempt rate, a decline in power accuracy, or a shift toward safer output (more jabs, fewer exchanges). Momentum often appears as multi-round streaks in relative dominance.

Below are two useful constructs:

Fatigue Index: compare late rounds vs early rounds on pace and accuracy
Momentum Signal: rolling mean of relative damage/dominance

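fatigue_index <- function(df) {
  df %>%
    group_by(fight_id, fighter) %>%
    mutate(
      # Early = first three rounds; late = last three rounds actually fought.
      # In fights of five rounds or fewer the two windows overlap, which pulls the score toward zero.
      early = round <= 3,
      late  = round >= max(round, na.rm = TRUE) - 2
    ) %>%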
    summarise(
      early_pace = mean(total_attempted[early], na.rm = TRUE),
      late_pace  = mean(total_attempted[late], na.rm = TRUE),
      early_acc  = mean(acc_total[early], na.rm = TRUE),
      late_acc   = mean(acc_total[late], na.rm = TRUE),
      fatigue_pace_drop = (late_pace - early_pace) / pmax(early_pace, 1),
      fatigue_acc_drop  = (late_acc - early_acc) / pmax(early_acc, 1e-6),
      .groups = "drop"
    ) %>%
    mutate(
      fatigue_score = 0.7 * fatigue_pace_drop + 0.3 * fatigue_acc_drop
    )
}

momentum_signal <- function(df, window = 3) {
  df %>%
    arrange(fight_id, fighter, round) %>%
    group_by(fight_id, fighter) %>%
    mutate(
      rel_damage_roll = slider::slide_dbl(rel_damage, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE),
      rel_landed_roll = slider::slide_dbl(rel_total_landed, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE)
    ) %>%
    ungroup()
}

fatigue_tbl <- fatigue_index(rounds_feat)
rounds_mom <- momentum_signal(rounds_feat, window = 3)
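To see what fatigue_index() computes, here is the same arithmetic worked by hand in base R for one fighter over a toy eight-round fight. The numbers are illustrative, and the 0.7/0.3 weights are the ones used in the function above.

```r
# Toy 8-round fight for one fighter (illustrative numbers, not real data)
round_no        <- 1:8
total_attempted <- c(60, 58, 62, 55, 50, 48, 45, 42)
acc_total       <- c(0.35, 0.33, 0.34, 0.32, 0.30, 0.30, 0.28, 0.27)

early <- round_no <= 3                  # rounds 1-3
late  <- round_no >= max(round_no) - 2  # rounds 6-8

early_pace <- mean(total_attempted[early])
late_pace  <- mean(total_attempted[late])
early_acc  <- mean(acc_total[early])
late_acc   <- mean(acc_total[late])

# Relative change from early to late; negative means a decline
pace_change <- (late_pace - early_pace) / max(early_pace, 1)
acc_change  <- (late_acc - early_acc) / max(early_acc, 1e-6)

fatigue_score <- 0.7 * pace_change + 0.3 * acc_change

print(c(pace_change = pace_change, acc_change = acc_change, fatigue_score = fatigue_score))
```

Pace falls from 60 to 45 attempts per round (a 25% drop) and accuracy falls by about a sixth, so the weighted score is clearly negative.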

Interpretation tips:

Fatigue score negative → late output/accuracy declined (common)
Fatigue score near zero → stable performance (valuable at elite levels)
Rolling dominance crossing zero → tactical turning point (corner adjustments matter most here)
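The zero-crossing rule in the last tip is easy to automate. The sketch below is a minimal base R version that flags rounds where the rolling signal changes sign relative to the previous round. find_turning_points() is a hypothetical helper, and the input values are illustrative.

```r
# Rolling relative damage by round for one fighter (illustrative numbers)
rel_damage_roll <- c(2.0, 1.5, 0.5, -0.5, -1.0, 0.3, 1.2)

# A turning point is a round whose sign differs from the previous round's sign.
# Exact zeros are not counted as crossings, and NA values never produce a flag.
find_turning_points <- function(x) {
  s <- sign(x)
  prev <- s[-length(s)]
  curr <- s[-1]
  changed <- c(FALSE, curr != prev & curr != 0 & prev != 0)
  which(changed)
}

turning_rounds <- find_turning_points(rel_damage_roll)
print(turning_rounds)
```

Because the input is a rolling mean, a flagged round marks where the recent trend flipped, which can lag the round in which the opponent actually changed tactics.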

9) Visual analytics for strategy

The “best” plots are the ones that change decisions. Two strategy-grade visuals:

Dominance timeline (relative damage rolling mean)
Style map (jab share vs power accuracy)

plot_dominance_timeline <- function(df, fight_id_pick, fighter_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick, fighter == fighter_pick) %>%
    arrange(round)

  ggplot(d, aes(x = round, y = rel_damage_roll)) +
    geom_hline(yintercept = 0, linewidth = 0.6) +
    geom_line(linewidth = 1) +
    geom_point(size = 2) +
    labs(
      x = "Round",
      y = "Rolling Relative Damage (windowed mean)",
      title = "Dominance Timeline",
      subtitle = glue::glue("Fight {fight_id_pick} — {fighter_pick}")
    ) +
    theme_minimal(base_size = 12)
}

plot_style_map <- function(df, fight_id_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick) %>%
    group_by(fighter) %>%
    summarise(
      jab_share = mean(jab_share_attempts, na.rm = TRUE),
      power_acc = mean(acc_power, na.rm = TRUE),
      pace = mean(total_attempted, na.rm = TRUE),
      .groups = "drop"
    )

  ggplot(d, aes(x = jab_share, y = power_acc, label = fighter, size = pace)) +
    geom_point(alpha = 0.7) +
    ggrepel::geom_text_repel(max.overlaps = 50) +
    labs(
      x = "Jab Share (Attempts)",
      y = "Power Accuracy",
      title = "Style Map (per fight)",
      subtitle = "Higher pace = larger point"
    ) +
    theme_minimal(base_size = 12)
}

# Example:
# p1 <- plot_dominance_timeline(rounds_mom, fight_id_pick = "F123", fighter_pick = "Fighter A")
# p2 <- plot_style_map(rounds_feat, fight_id_pick = "F123")
# p1 + p2

How coaches use these:

If dominance drops after round 4, check conditioning or defensive adjustments from the opponent.
If jab share is high but power accuracy is low, the jab may be “busy” but not creating openings.
If pace is high and accuracy stable, that’s often a winning profile—especially across long fights.
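Rules of thumb like these can be turned into simple flags for a report. The sketch below encodes the "busy jab" pattern from the second point in base R. Both thresholds are hypothetical placeholders that you would calibrate on your own data, and the profile values are illustrative.

```r
# Per-fighter summary, shaped like the style map inputs (illustrative numbers)
profile <- data.frame(
  fighter   = c("A", "B"),
  jab_share = c(0.70, 0.45),
  power_acc = c(0.22, 0.41),
  stringsAsFactors = FALSE
)

# Hypothetical thresholds; calibrate against your own dataset
jab_share_high <- 0.60
power_acc_low  <- 0.30

# Flag a jab that is busy but not setting up accurate power shots
profile$busy_jab_flag <- profile$jab_share > jab_share_high &
  profile$power_acc < power_acc_low

print(profile)
```

Flags like this work best as prompts for film review, not as conclusions on their own.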

10) Scalable pipelines: Parquet, DuckDB, reproducibility

Once your data grows (multiple events, seasons, amateur + pro, different sources), SQL-style analysis becomes extremely useful. DuckDB lets you query Parquet directly with zero database admin.

# Connect to DuckDB (file-backed here; omit dbdir for an in-memory database)
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = here::here("data/fight_analytics.duckdb"))

# Point DuckDB at a Parquet file (or a folder of Parquet files)
parquet_path <- here::here("data/clean/round_totals.parquet")

DBI::dbExecute(con, glue::glue("
  CREATE OR REPLACE VIEW rounds AS
  SELECT * FROM read_parquet('{parquet_path}')
"))

# Example: top rounds by volume (attempted punches)
top_volume <- DBI::dbGetQuery(con, "
  SELECT fighter, fight_id, round,
         (jabs_attempted + power_attempted) AS total_attempted
  FROM rounds
  ORDER BY total_attempted DESC
  LIMIT 25
")

top_volume %>% as_tibble()

# Close when done
DBI::dbDisconnect(con, shutdown = TRUE)

This makes it easy to build reliable reporting: “Highest pace fights,” “Largest late-round fades,” “Most consistent dominance,” and “Knockdown-driven wins.”
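As a sketch of one such report, the query below ranks fights by average attempts per fighter-round. It runs against a toy table written to an in-memory DuckDB database so it is self-contained; in the pipeline above you would query the rounds view instead. The data values are illustrative.

```r
# In-memory DuckDB database, so nothing is written to disk
con <- DBI::dbConnect(duckdb::duckdb())

# Toy round totals for two fights (illustrative numbers)
toy <- data.frame(
  fight_id        = c("F1", "F1", "F1", "F1", "F2", "F2", "F2", "F2"),
  fighter         = c("A", "A", "B", "B", "C", "C", "D", "D"),
  round           = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  jabs_attempted  = c(30L, 28L, 20L, 22L, 40L, 42L, 35L, 33L),
  power_attempted = c(20L, 18L, 25L, 24L, 30L, 28L, 20L, 22L),
  stringsAsFactors = FALSE
)
DBI::dbWriteTable(con, "rounds", toy)

# "Highest pace fights": average attempts per fighter-round, highest first
highest_pace <- DBI::dbGetQuery(con, "
  SELECT fight_id,
         AVG(jabs_attempted + power_attempted) AS avg_attempts
  FROM rounds
  GROUP BY fight_id
  ORDER BY avg_attempts DESC
")

print(highest_pace)

DBI::dbDisconnect(con, shutdown = TRUE)
```

The other reports follow the same pattern: compute the per-round quantity in the SELECT, group by fight or fighter, and order by the aggregate.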

11) Wrap-up and next steps

A fight data science workflow in R becomes powerful when you combine:

Clean contracts so your data doesn’t drift
Validation so your results are trustworthy
Relative features so metrics become tactical
Probability models so conclusions are calibrated
Fatigue/momentum so strategy reflects real turning points

If you want a more structured, end-to-end path with deeper modeling, richer case studies, and a full workflow designed specifically for boxing, you may like this resource: a complete hands-on book focused on boxing data science and fight performance strategy in R.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers
