R Programming Books (https://rprogrammingbooks.com/)
Digital books on R programming, sports analytics, and data science.

Bayesian Linear Regression in R: A Step-by-Step Tutorial
Sat, 14 Mar 2026 18:27:36 +0000
Bayesian linear regression is one of the best ways to learn Bayesian modeling in R because it combines familiar regression ideas with a more realistic treatment of uncertainty. Instead of estimating a single fixed value for each coefficient, Bayesian methods estimate full probability distributions. That means we can talk about uncertainty, prior beliefs, posterior updates, and credible intervals in a way that is often more intuitive than classical statistics.

In this tutorial, you will learn how to fit a Bayesian linear regression model in R step by step. We will start with the theory, build a dataset, choose priors, fit a model with brms, inspect posterior distributions, evaluate diagnostics, perform posterior predictive checks, and generate predictions for new observations. We will also look at several R packages that belong to a practical Bayesian workflow.

What you will learn in this tutorial:
  • How Bayesian linear regression works
  • How priors and posteriors differ from classical estimates
  • How to fit a model in R using brms
  • How to inspect convergence and model quality
  • How to use related packages such as tidybayes, bayesplot, and rstanarm

Why Bayesian Linear Regression Matters

In classical linear regression, the model estimates coefficients such as the intercept and slope as fixed unknown values. In Bayesian regression, those same coefficients are modeled as random variables with prior distributions. Once data is observed, those priors are updated into posterior distributions.

This gives us several advantages:

  • We can incorporate prior knowledge into the model
  • We get full uncertainty distributions, not just point estimates
  • Predictions naturally include uncertainty
  • Bayesian methods scale well into multilevel and hierarchical models
  • The interpretation of intervals is often more direct

If you are working in predictive analytics, experimental analysis, or sports modeling, this framework is especially useful because it lets you update beliefs as new data arrives.

The Bayesian Formula Behind Linear Regression

A simple linear regression can be written as:

y = β0 + β1x + ε

Where:

  • y is the response variable
  • x is the predictor
  • β0 is the intercept
  • β1 is the slope
  • ε is the error term, typically assumed to be normally distributed

In the Bayesian version, we add priors:

β0 ~ Normal(0, 10)
β1 ~ Normal(0, 5)
σ  ~ Student_t(3, 0, 2.5)

After seeing the data, we compute:

Posterior ∝ Likelihood × Prior

That one line is the core of Bayesian inference.
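To make that proportionality concrete, here is a small numerical sketch using a toy dataset (not the tutorial's example): estimating a single mean with known noise, where the posterior is computed on a grid by multiplying prior and likelihood point by point.

```r
# Toy illustration of Posterior ∝ Likelihood × Prior
set.seed(1)
y <- rnorm(20, mean = 3, sd = 2)          # observed data with known sd

theta <- seq(-5, 10, length.out = 2000)   # candidate values for the mean
prior <- dnorm(theta, mean = 0, sd = 10)  # broad prior belief
lik   <- sapply(theta, function(m) prod(dnorm(y, mean = m, sd = 2)))

post <- prior * lik                        # unnormalized posterior
post <- post / sum(post * diff(theta)[1])  # normalize into a proper density

theta[which.max(post)]                     # posterior mode sits close to mean(y)
```

With such a weak prior, the posterior is dominated by the likelihood; tightening the prior pulls the posterior mode toward the prior mean, which is exactly the updating behavior the formula describes.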

R Packages You Should Know for Bayesian Regression

Before fitting models, it helps to understand the ecosystem. Bayesian modeling in R is not just about one package. It is usually a workflow involving model fitting, posterior extraction, diagnostics, and visualization.

brms

High-level Bayesian regression modeling with formula syntax similar to lm() and glm().

rstanarm

Bayesian applied regression with an interface that feels familiar to many R users.

tidybayes

Extracts and visualizes posterior draws in a tidy format for easy analysis and plotting.

bayesplot

Useful for trace plots, posterior predictive checks, and MCMC diagnostics.

posterior

Helpful for working with draws, summaries, and posterior diagnostics in a standardized way.

cmdstanr

R interface to CmdStan, useful for users who want more direct Stan workflows and model control.

loo

Widely used for approximate leave-one-out cross-validation and model comparison.

ggplot2

Still essential for custom data exploration and clean visualization of posterior summaries.

Installing the Required Packages

For this tutorial, we will focus on brms for model fitting, while also using a few companion packages for diagnostics and visualization.

install.packages(c(
  "brms",
  "tidyverse",
  "tidybayes",
  "bayesplot",
  "posterior",
  "loo",
  "rstanarm"
))

Then load the packages:

library(brms)
library(tidyverse)
library(tidybayes)
library(bayesplot)
library(posterior)
library(loo)
library(rstanarm)

Tip: If you want a lower-level interface to Stan, you can also explore cmdstanr. For most readers, however, brms is a better starting point because it keeps the syntax concise while still giving access to advanced Bayesian models.

Creating a Simple Dataset

To make the tutorial reproducible, we will simulate a small dataset where one predictor explains a continuous response.

set.seed(123)

n <- 120

advertising_spend <- rnorm(n, mean = 15, sd = 4)

sales <- 20 + 3.5 * advertising_spend + rnorm(n, mean = 0, sd = 8)

df <- data.frame(
  advertising_spend = advertising_spend,
  sales = sales
)

head(df)

In this synthetic example, higher advertising spend tends to increase sales. The true slope used in the simulation is 3.5, but in a real modeling situation we would not know that value ahead of time.

Exploring the Data First

It is always a good idea to inspect the data visually before fitting any Bayesian model.

summary(df)

ggplot(df, aes(x = advertising_spend, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

This scatterplot helps confirm that the relationship is roughly linear. Even when using Bayesian methods, the basic logic of exploratory data analysis still applies.

A Quick Classical Baseline with lm()

Before fitting the Bayesian model, it is useful to compare it with a standard linear regression.

lm_model <- lm(sales ~ advertising_spend, data = df)
summary(lm_model)

This gives us a baseline estimate for the intercept and slope. Later, we will compare it to Bayesian posterior summaries.

Choosing Priors

Priors are a defining part of Bayesian modeling. A prior reflects what we believe about a parameter before seeing the current data. Priors can be weakly informative, informative, or strongly regularizing depending on the context.

In many practical applications, weakly informative priors are a good default.

priors <- c(
  prior(normal(0, 20), class = "Intercept"),
  prior(normal(0, 10), class = "b"),
  prior(student_t(3, 0, 10), class = "sigma")
)

priors

This prior specification says:

  • The intercept is centered around 0 with broad uncertainty
  • The slope is also centered around 0 with a wide standard deviation
  • The residual standard deviation is positive and given a weakly informative prior

In real projects, priors should reflect domain knowledge whenever possible. For example, in marketing, finance, or sports analytics, prior expectations often come from previous seasons, experiments, or historical model performance.
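If you are unsure which parameters accept priors for a given formula, brms can list the expected classes and defaults before you commit to a specification:

```r
# List the parameter classes brms expects priors for in this model
get_prior(sales ~ advertising_spend, data = df, family = gaussian())
```

Checking this table first helps avoid specifying a prior for a class that does not exist in the model.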

Fitting the Bayesian Linear Regression Model

Now we can fit the Bayesian model using brm().

bayes_model <- brm(
  formula = sales ~ advertising_spend,
  data = df,
  prior = priors,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  warmup = 2000,
  seed = 123
)

Here is what the most important arguments mean:

  • chains = 4: run four independent Markov chains
  • iter = 4000: total iterations per chain
  • warmup = 2000: burn-in samples used for adaptation
  • family = gaussian(): assume a normal likelihood for the response

Reading the Model Summary

summary(bayes_model)

The summary output typically reports:

  • Posterior mean and standard error for each parameter
  • Credible intervals
  • R-hat values for convergence
  • Effective sample sizes

A good sign is when R-hat values are close to 1.00. That suggests the MCMC chains mixed well and converged.
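If you prefer a programmatic check over reading the printed summary, brms exposes the R-hat values directly (assuming the fitted bayes_model from above):

```r
# Extract R-hat for every parameter in the fitted model
rhat_vals <- rhat(bayes_model)
rhat_vals

# A common rule of thumb is to investigate anything above 1.01
any(rhat_vals > 1.01)
```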

Understanding the Posterior Output

Suppose the slope posterior is centered near 3.4 with a 95% credible interval from 3.0 to 3.8. In Bayesian terms, that means the model assigns high posterior probability to the slope being in that interval. This is one reason many analysts find Bayesian intervals easier to interpret.

In practical language, we could say:

Based on the model and the observed data, higher advertising spend is strongly associated with higher sales, and the posterior distribution indicates that the effect is likely positive and substantial.

Extracting Posterior Draws

One of the strengths of Bayesian modeling is that you can work directly with posterior draws.

draws <- as_draws_df(bayes_model)
head(draws)

This lets you explore parameter distributions, uncertainty, and custom probabilities.

mean(draws$b_advertising_spend > 0)

The code above estimates the posterior probability that the slope is greater than zero. That is a very natural Bayesian quantity.
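The same draws support other custom summaries, such as an equal-tailed 95% credible interval, or the probability that the slope exceeds a threshold (the value 3 below is chosen purely for illustration):

```r
# Equal-tailed 95% credible interval for the slope
quantile(draws$b_advertising_spend, probs = c(0.025, 0.975))

# Posterior probability that the effect exceeds 3 units of sales per unit of spend
mean(draws$b_advertising_spend > 3)
```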

Visualizing Posterior Distributions

plot(bayes_model)

The default plot gives a quick view of posterior densities and chain behavior. You can also visualize intervals more explicitly:

mcmc_areas(
  as.array(bayes_model),
  pars = c("b_Intercept", "b_advertising_spend", "sigma")
)

This is where bayesplot becomes especially useful.

Checking Convergence with Trace Plots

Trace plots help determine whether the MCMC chains have mixed properly.

mcmc_trace(
  as.array(bayes_model),
  pars = c("b_Intercept", "b_advertising_spend", "sigma")
)

Healthy trace plots should look like fuzzy horizontal bands rather than trending lines or stuck sequences.

Posterior Predictive Checks

Posterior predictive checks are one of the most important parts of a Bayesian workflow. They compare the observed data to data simulated from the fitted model.

pp_check(bayes_model)

If the simulated data looks broadly similar to the observed data, that is a sign the model captures the main structure reasonably well.

You can also try more specific checks:

pp_check(bayes_model, type = "dens_overlay")
pp_check(bayes_model, type = "hist")
pp_check(bayes_model, type = "scatter_avg")

Using tidybayes for Tidy Posterior Workflows

The tidybayes package is extremely useful when you want to extract posterior draws into tidy data frames and build custom visualizations with ggplot2.

tidy_draws <- bayes_model %>%
  spread_draws(b_Intercept, b_advertising_spend, sigma)

head(tidy_draws)

For example, you can visualize the slope distribution:

tidy_draws %>%
  ggplot(aes(x = b_advertising_spend)) +
  geom_density(fill = "steelblue", alpha = 0.4) +
  theme_minimal()

This makes posterior analysis much more flexible than relying only on built-in summary output.

Generating Predictions for New Data

One of the biggest reasons to use regression is prediction. Bayesian models make this especially valuable because predictions come with uncertainty intervals.

new_customers <- data.frame(
  advertising_spend = c(10, 15, 20, 25)
)

predict(bayes_model, newdata = new_customers)

You can also generate expected values without residual noise:

fitted(bayes_model, newdata = new_customers)

The difference is important:

  • predict() includes outcome uncertainty
  • fitted() focuses on the expected mean response
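One way to see that difference in practice is to compare interval widths. Predictive intervals include residual noise, so they should come out wider than intervals for the mean (column names follow the standard brms summary output):

```r
pred_int <- predict(bayes_model, newdata = new_customers)  # outcome-level
mean_int <- fitted(bayes_model, newdata = new_customers)   # mean-level

# Width of the 95% intervals: predict() should be consistently wider
pred_int[, "Q97.5"] - pred_int[, "Q2.5"]
mean_int[, "Q97.5"] - mean_int[, "Q2.5"]
```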

Visualizing the Regression Line with Uncertainty

conditional_effects(bayes_model)

This is a quick way to visualize the fitted relationship and credible intervals. It is particularly useful when presenting results to readers who are new to Bayesian modeling.

Comparing the Classical and Bayesian Models

Aspect              | Classical lm()         | Bayesian model
--------------------|------------------------|-----------------------------
Parameter estimates | Single point estimate  | Full posterior distribution
Intervals           | Confidence intervals   | Credible intervals
Prior knowledge     | Not included directly  | Included through priors
Predictions         | Often point-centered   | Naturally uncertainty-aware
Interpretation      | Frequentist framework  | Probability-based framework

Alternative Approach with rstanarm

If you want a very approachable alternative to brms, you can fit a similar model with rstanarm.

rstanarm_model <- stan_glm(
  sales ~ advertising_spend,
  data = df,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  seed = 123
)

print(rstanarm_model)

This package is especially attractive for users who want Bayesian estimation with minimal syntax changes from familiar regression workflows.

Model Comparison with loo

For more advanced workflows, model comparison is often done with approximate leave-one-out cross-validation.

loo_result <- loo(bayes_model)
print(loo_result)

This becomes particularly useful when comparing multiple Bayesian models with different predictors or structures.
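As a sketch of that workflow, you could fit an intercept-only baseline (a hypothetical second model, not part of the tutorial above) and compare the two fits with loo_compare():

```r
# Hypothetical baseline: intercept-only model on the same data
null_model <- brm(
  sales ~ 1,
  data = df,
  family = gaussian(),
  chains = 4,
  iter = 4000,
  warmup = 2000,
  seed = 123
)

# Higher elpd is better; the model with the predictor should come out on top here
loo_compare(loo(bayes_model), loo(null_model))
```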

Common Beginner Mistakes in Bayesian Regression

  • Using priors without thinking about the scale of the data
  • Ignoring convergence diagnostics such as R-hat and trace plots
  • Skipping posterior predictive checks
  • Confusing credible intervals with classical confidence intervals
  • Treating Bayesian modeling as only a different fitting function rather than a full workflow

When Bayesian Linear Regression Is a Great Choice

Bayesian linear regression is especially useful when:

  • You want to express uncertainty directly
  • You have prior knowledge from previous studies or historical data
  • Your sample size is not huge and regularization helps
  • You plan to expand into hierarchical or multilevel models later
  • You need probabilistic predictions rather than just fitted coefficients

From Linear Regression to Real-World Prediction

Once you understand Bayesian linear regression, you can move into more realistic applications such as multilevel models, logistic regression, time series forecasting, and domain-specific predictive systems. In practice, many analysts first learn Bayesian methods through regression, then extend them into richer workflows for decision-making and forecasting.

If your interest goes beyond introductory examples and into real prediction workflows, Bayesian methods are especially valuable in sports modeling, where uncertainty, updating, and probabilistic forecasts matter a lot.

Those kinds of projects often build on the same foundations covered here: priors, posterior updating, uncertainty-aware prediction, and iterative model improvement.

Conclusion

Bayesian linear regression in R is one of the best entry points into Bayesian statistics because it combines familiar regression ideas with a much richer treatment of uncertainty. Instead of asking only for a coefficient estimate, you ask for a distribution of plausible values. Instead of pretending uncertainty is secondary, you put it at the center of the analysis.

In this tutorial, we covered the full process:

  • Building a dataset
  • Understanding priors
  • Fitting a model with brms
  • Inspecting posterior summaries
  • Checking convergence and fit
  • Generating predictions
  • Using additional packages from the Bayesian R ecosystem

Once you are comfortable with these steps, the next natural move is to explore Bayesian logistic regression, hierarchical models, and domain-specific forecasting systems.

Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance
Mon, 09 Mar 2026 20:32:37 +0000
Formula 1 is one of the most compelling areas for data analysis in R because it combines structured results, lap-by-lap timing, pit strategy, and driver performance into one of the richest datasets in sport. For anyone building authority in technical R content, this is an excellent niche: it is specific enough to stand out, but broad enough to support tutorials, visualizations, predictive models, and long-form analytical writing.

One of the biggest advantages of working in this space is that f1dataR gives R users access to both historical Formula 1 data and richer session-level workflows linked to the wider Ergast/Jolpica and FastF1 ecosystem. That makes it possible to move from simple race results into much more interesting questions: Who had the strongest race pace? Which driver managed tyre degradation best? Did a pit stop strategy actually work? Can we build a basic model to estimate race outcomes?

This is where Formula 1 becomes much more than a sports topic. It becomes a practical case study in data wrangling, time-series thinking, feature engineering, visualization, and prediction. And because the R blog space has relatively little deep Formula 1 content compared with more general analytics topics, a strong tutorial here can help position your site as a serious source of expertise.

Why Formula 1 analysis in R is such a strong niche

Most R tutorials on the web focus on standard examples: sales dashboards, housing prices, or generic machine learning datasets. Formula 1 is different. The data has context, drama, and a built-in audience. Every race gives you new material to analyze, and every session contains multiple layers of information: qualifying pace, stint length, tyre compounds, safety car timing, sector performance, overtakes, and pit strategy.

That is part of what makes this topic attractive for long-form content. You are not just teaching code. You are showing how code helps explain real competitive decisions. A lap time is not just a number. It is evidence of tyre wear, traffic, fuel load, track evolution, and driver execution.

For readers who want to go deeper into this kind of workflow, resources such as Racing with Data: Formula 1 and NASCAR Analytics with R are useful because they reinforce the idea that racing analytics in R can go well beyond basic charts and into serious, code-driven analysis.

Installing the packages

The first step is to set up a workflow that is both reproducible and flexible. For most Formula 1 analysis projects in R, you will want f1dataR plus a small set of packages for data cleaning, plotting, reporting, and modeling.

install.packages(c(
  "f1dataR",
  "tidyverse",
  "lubridate",
  "janitor",
  "scales",
  "slider",
  "broom",
  "tidymodels",
  "gt",
  "patchwork"
))

library(f1dataR)
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
library(slider)
library(broom)
library(tidymodels)
library(gt)
library(patchwork)

If you want to work with official session-level timing data, it is also a good idea to configure FastF1 support and define a local cache.

setup_fastf1()

options(f1dataR.cache = "f1_cache")
dir.create("f1_cache", showWarnings = FALSE)

That may look like a small detail, but caching matters when you are building serious analytical content. It makes your workflow faster, cleaner, and much easier to reproduce when updating notebooks, reports, or blog posts later.

Start with race results

Before diving into laps and strategy, start with historical race results. They provide the backbone for season summaries, driver comparisons, constructor trends, and predictive features.

# load_results() returns one round at a time, so collect every round of the season
rounds_2024 <- load_schedule(season = 2024)$round

results_2024 <- purrr::map_dfr(
  rounds_2024,
  ~ load_results(season = 2024, round = .x)
)

results_2024 %>%
  clean_names() %>%
  select(round, race_name, driver, constructor, grid, position, points, status) %>%
  glimpse()

Once the results are loaded, you can build a season summary table that gives readers an immediate overview of the competitive picture.

season_table <- results_2024 %>%
  clean_names() %>%
  group_by(driver, constructor) %>%
  summarise(
    races = n(),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position <= 3, na.rm = TRUE),
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(points), avg_finish)

season_table

You can also convert that summary into a cleaner publication table for a blog or report.

season_table %>%
  mutate(
    avg_finish = round(avg_finish, 2),
    avg_grid = round(avg_grid, 2)
  ) %>%
  gt() %>%
  tab_header(
    title = "2024 Driver Season Summary",
    subtitle = "Wins, podiums, average finish, and points"
  )

This type of summary is useful, but by itself it does not explain much about how results were achieved. That is why the next step matters.

Looking beyond the finishing position

One of the easiest ways to improve an F1 analysis is to move beyond final classification. A driver finishing sixth may have delivered an excellent performance in a midfield car, while a podium in a dominant car may tell a much simpler story. A stronger framework compares results to starting position, teammate performance, and race pace.

A good place to begin is position gain.

position_gain_table <- results_2024 %>%
  clean_names() %>%
  mutate(
    position_gain = grid - position
  ) %>%
  group_by(driver, constructor) %>%
  summarise(
    mean_gain = mean(position_gain, na.rm = TRUE),
    median_gain = median(position_gain, na.rm = TRUE),
    total_gain = sum(position_gain, na.rm = TRUE),
    races = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_gain))

position_gain_table

This metric is simple, but it is still valuable because it gives a first signal of race execution. Of course, it has limits. Front-runners have less room to gain places, and midfield races are often influenced by strategy variance, incidents, and reliability. Still, that nuance is exactly what makes the discussion interesting.

Add race and circuit context

Formula 1 performance is always track-dependent. Some cars are stronger on high-speed circuits, some drivers thrive on street tracks, and some teams handle tyre-sensitive venues better than others. Joining race results with schedule data allows you to frame those questions more clearly.

schedule_2024 <- load_schedule(season = 2024) %>%
  clean_names()

results_with_schedule <- results_2024 %>%
  clean_names() %>%
  left_join(
    schedule_2024 %>%
      select(round, race_name, circuit_name, locality, country, race_date),
    by = c("round", "race_name")
  )

results_with_schedule %>%
  select(round, race_name, circuit_name, country, driver, constructor, grid, position) %>%
  slice_head(n = 10)

Even at this stage, you already have enough structure to write multiple types of posts: best performing drivers by circuit type, constructor consistency across the season, teammate gaps by venue, or overperformance relative to starting position.
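As one quick illustration of those angles, constructor consistency can be approximated by the spread of finishing positions across the season (a sketch that assumes the column names used above):

```r
results_with_schedule %>%
  group_by(constructor) %>%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    sd_finish  = sd(position, na.rm = TRUE),   # smaller = more consistent
    races      = n(),
    .groups    = "drop"
  ) %>%
  arrange(sd_finish)
```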

Lap times: where the analysis gets serious

Race results tell you what happened. Lap times tell you how it happened. This is where Formula 1 analysis becomes much more valuable, because you can begin to evaluate race pace, traffic effects, tyre degradation, and the shape of a driver’s performance over the full event.

It is usually best to focus on one race session first, especially if your goal is to explain the process clearly.

session_laps <- load_session_laps(
  season = 2024,
  round = 10,
  session = "R"
) %>%
  clean_names()

session_laps %>%
  select(driver, lap_number, lap_time, compound, tyre_life, stint, pit_out_time, pit_in_time) %>%
  glimpse()

Lap time fields often need cleaning before they are suitable for visualization or modeling. Converting them into seconds is usually the most practical approach.

laps_clean <- session_laps %>%
  mutate(
    lap_time_seconds = as.numeric(lap_time),
    sector1_seconds = as.numeric(sector_1_time),
    sector2_seconds = as.numeric(sector_2_time),
    sector3_seconds = as.numeric(sector_3_time)
  ) %>%
  filter(!is.na(lap_time_seconds)) %>%
  filter(lap_time_seconds > 50, lap_time_seconds < 200)

summary(laps_clean$lap_time_seconds)

Comparing race pace by driver

Once the lap data is cleaned, you can compare selected drivers and visualize how their pace evolves through the race.

selected_drivers <- c("VER", "NOR", "LEC", "HAM")

laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  ggplot(aes(x = lap_number, y = lap_time_seconds, color = driver)) +
  geom_line(alpha = 0.8, linewidth = 0.8) +
  geom_point(size = 1.2, alpha = 0.7) +
  scale_y_continuous(labels = label_number(accuracy = 0.1)) +
  labs(
    title = "Race pace by lap",
    subtitle = "Raw lap times across the Grand Prix",
    x = "Lap",
    y = "Lap time (seconds)",
    color = "Driver"
  ) +
  theme_minimal(base_size = 13)

Raw lap time plots are useful, but they are often noisy because pit laps, out-laps, and unusual traffic can distort the pattern. A stronger analysis filters some of that noise and focuses on green-flag pace.

green_flag_laps <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  group_by(driver) %>%
  mutate(
    median_lap = median(lap_time_seconds, na.rm = TRUE),
    lap_delta = lap_time_seconds - median_lap
  ) %>%
  ungroup() %>%
  filter(abs(lap_delta) < 5)

green_flag_laps %>%
  ggplot(aes(lap_number, lap_time_seconds, color = driver)) +
  geom_line(linewidth = 0.9) +
  geom_smooth(se = FALSE, method = "loess", span = 0.25, linewidth = 1.1) +
  labs(
    title = "Green-flag race pace",
    subtitle = "Smoothed lap-time profile after removing pit laps and large outliers",
    x = "Lap",
    y = "Lap time (seconds)"
  ) +
  theme_minimal(base_size = 13)

This kind of chart is one of the most useful in F1 analytics because it shows whether a driver was genuinely fast, merely benefiting from track position, or fading late in the race.

Tyre degradation and stint analysis

One of the best ways to add real authority to an F1 post is to quantify degradation. Instead of simply saying a driver “managed tyres well,” you can estimate how lap time changed as tyre life increased during a stint.

stint_degradation <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  filter(!is.na(stint), !is.na(tyre_life), !is.na(compound)) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  group_by(driver, stint, compound) %>%
  filter(n() >= 8) %>%
  nest() %>%
  mutate(
    model = map(data, ~ lm(lap_time_seconds ~ tyre_life, data = .x)),
    tidied = map(model, broom::tidy)
  ) %>%
  unnest(tidied) %>%
  filter(term == "tyre_life") %>%
  transmute(
    driver,
    stint,
    compound,
    degradation_per_lap = estimate,
    p_value = p.value
  ) %>%
  arrange(degradation_per_lap)

stint_degradation

A positive slope generally means pace is dropping as the stint gets older. A smaller slope suggests better tyre preservation or more stable pace. The interpretation is not always simple, because race context matters, but the method is very effective for turning race discussion into evidence.

laps_clean %>%
  filter(driver %in% selected_drivers, !is.na(stint), !is.na(tyre_life)) %>%
  filter(is.na(pit_in_time), is.na(pit_out_time)) %>%
  ggplot(aes(tyre_life, lap_time_seconds, color = driver)) +
  geom_point(alpha = 0.5, size = 1.6) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ compound, scales = "free_x") +
  labs(
    title = "Tyre degradation by compound",
    subtitle = "Linear approximation of pace loss as the stint ages",
    x = "Tyre life (laps)",
    y = "Lap time (seconds)"
  ) +
  theme_minimal(base_size = 13)

This is exactly the kind of analysis that makes a technical article memorable, because it moves from “who won?” to “why did the performance pattern look the way it did?”

Pit stops and strategy

Pit strategy is one of the clearest examples of how Formula 1 combines data and decision-making. A stop is not just an event; it is a trade-off between track position, tyre life, race pace, and the behaviour of nearby competitors.

pit_summary <- session_laps %>%
  clean_names() %>%
  mutate(
    had_pit_event = !is.na(pit_out_time) | !is.na(pit_in_time)
  ) %>%
  group_by(driver) %>%
  summarise(
    total_laps = n(),
    pit_events = sum(had_pit_event, na.rm = TRUE),
    stints = n_distinct(stint, na.rm = TRUE),
    first_compound = first(na.omit(compound)),
    last_compound = last(na.omit(compound)),
    .groups = "drop"
  ) %>%
  arrange(desc(pit_events))

pit_summary

A better way to explain strategy is to reconstruct the stints directly.

strategy_table <- session_laps %>%
  clean_names() %>%
  arrange(driver, lap_number) %>%
  group_by(driver, stint) %>%
  summarise(
    start_lap = min(lap_number, na.rm = TRUE),
    end_lap = max(lap_number, na.rm = TRUE),
    laps_in_stint = n(),
    compound = first(na.omit(compound)),
    avg_lap = mean(as.numeric(lap_time), na.rm = TRUE),
    median_lap = median(as.numeric(lap_time), na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(driver, stint)

strategy_table

strategy_table %>%
  ggplot(aes(x = start_lap, xend = end_lap, y = driver, yend = driver, color = compound)) +
  geom_segment(linewidth = 6, lineend = "round") +
  labs(
    title = "Race strategy by driver",
    subtitle = "Stint map reconstructed from lap-level data",
    x = "Lap window",
    y = "Driver",
    color = "Compound"
  ) +
  theme_minimal(base_size = 13)

Once you have stint maps, your analysis immediately becomes more strategic. You can discuss undercuts, overcuts, long first stints, aggressive early stops, and whether a team actually converted tyre freshness into meaningful gains.

Measuring post-stop pace

A useful extension is to examine whether a driver actually benefitted from fresh tyres after a stop. That is one of the simplest ways to move from descriptive pit analysis into strategic interpretation.

post_stop_pace <- session_laps %>%
  clean_names() %>%
  arrange(driver, lap_number) %>%
  group_by(driver) %>%
  mutate(
    pit_out_lap = !is.na(pit_out_time),
    # cumulative count of pit exits: a running stint index per driver
    stint_number = cumsum(lag(pit_out_lap, default = FALSE))
  ) %>%
  ungroup() %>%
  filter(!is.na(lap_time)) %>%
  group_by(driver, stint_number) %>%
  summarise(
    first_laps_avg = mean(as.numeric(lap_time)[1:min(3, n())], na.rm = TRUE),
    stint_avg = mean(as.numeric(lap_time), na.rm = TRUE),
    .groups = "drop"
  )

post_stop_pace

This kind of table helps answer a much better question than “when did they pit?” It asks: “Did the stop create usable pace, and was that pace strong enough to influence the race?”

Teammate comparison as the best benchmark

In Formula 1, teammate comparison is often more informative than full-grid comparison because the car is the closest thing to a controlled environment. If one driver consistently beats the other in grid position, race finish, or pace consistency, that tells you something much more precise than the overall championship table.

teammate_table <- results_2024 %>%
  clean_names() %>%
  group_by(constructor, round, race_name) %>%
  mutate(
    teammate_finish_rank = min_rank(position),
    teammate_grid_rank = min_rank(grid)
  ) %>%
  ungroup() %>%
  group_by(driver, constructor) %>%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    teammate_beating_rate_finish = mean(teammate_finish_rank == 1, na.rm = TRUE),
    teammate_beating_rate_grid = mean(teammate_grid_rank == 1, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(teammate_beating_rate_finish), desc(points))

teammate_table

That kind of comparison is especially strong in a technical post because it gives readers a benchmark they already understand intuitively, while still grounding the discussion in data.

Sector analysis

If lap times tell you the overall pace story, sectors can help reveal where that pace is being gained or lost. Even without diving into full telemetry, sector splits can expose whether a driver is strong in traction zones, high-speed sections, or braking-heavy parts of the circuit.

sector_summary <- laps_clean %>%
  filter(driver %in% selected_drivers) %>%
  group_by(driver) %>%
  summarise(
    s1 = mean(sector1_seconds, na.rm = TRUE),
    s2 = mean(sector2_seconds, na.rm = TRUE),
    s3 = mean(sector3_seconds, na.rm = TRUE),
    total = mean(lap_time_seconds, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(cols = c(s1, s2, s3), names_to = "sector", values_to = "seconds")

sector_summary %>%
  ggplot(aes(sector, seconds, fill = driver)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average sector times by driver",
    subtitle = "A simple way to localize pace differences",
    x = "Sector",
    y = "Average time (seconds)",
    fill = "Driver"
  ) +
  theme_minimal(base_size = 13)

This type of breakdown is useful because it adds shape to the analysis. Instead of saying a driver was faster overall, you can show where the time was coming from.

From description to prediction

One of the strongest editorial angles for an article like this is to end with a predictive modeling section. A title such as Formula 1 Data Science in R: Predicting Race Results works well because it combines clear intent, technical interest, and a topic with built-in audience appeal.

The key is to be realistic. The purpose is not to promise perfect forecasts. It is to show how descriptive Formula 1 data can be converted into features for a baseline model.

model_data <- results_2024 %>%
  clean_names() %>%
  arrange(driver, round) %>%
  group_by(driver) %>%
  mutate(
    # Lag the rolling summaries so each race's features come only from
    # earlier rounds; otherwise position leaks into its own target
    rolling_avg_finish_3 = lag(slide_dbl(position, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    rolling_avg_grid_3 = lag(slide_dbl(grid, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    rolling_points_3 = lag(slide_dbl(points, mean, .before = 2, .complete = FALSE, na.rm = TRUE)),
    prev_finish = lag(position),
    prev_grid = lag(grid)
  ) %>%
  ungroup() %>%
  mutate(
    target_top10 = if_else(position <= 10, 1, 0),
    target_podium = if_else(position <= 3, 1, 0)
  ) %>%
  select(
    round, race_name, driver, constructor, grid, points, position,
    rolling_avg_finish_3, rolling_avg_grid_3, rolling_points_3,
    prev_finish, prev_grid, target_top10, target_podium
  ) %>%
  drop_na()

glimpse(model_data)

This dataset is intentionally simple, but that is a strength in a tutorial. It makes the logic visible and gives readers something they can actually reproduce and extend.

Predicting a top-10 finish

set.seed(42)

# parsnip's logistic_reg() expects a factor outcome, so convert the
# 0/1 target before splitting (and stratify on the factor)
model_data <- model_data %>%
  mutate(target_top10 = factor(target_top10, levels = c(0, 1)))

split_obj <- initial_split(model_data, prop = 0.8, strata = target_top10)
train_data <- training(split_obj)
test_data <- testing(split_obj)

log_recipe <- recipe(
  target_top10 ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_spec <- logistic_reg() %>%
  set_engine("glm")

log_workflow <- workflow() %>%
  add_recipe(log_recipe) %>%
  add_model(log_spec)

log_fit <- fit(log_workflow, data = train_data)

top10_predictions <- predict(log_fit, new_data = test_data, type = "prob") %>%
  bind_cols(predict(log_fit, new_data = test_data)) %>%
  bind_cols(test_data %>% select(target_top10))

top10_predictions

# "1" is the second factor level, so mark it as the event for yardstick
top10_predictions %>%
  roc_auc(truth = factor(target_top10), .pred_1, event_level = "second")

top10_predictions %>%
  accuracy(truth = factor(target_top10), estimate = .pred_class)

Predicting finishing position

finish_recipe <- recipe(
  position ~ grid + rolling_avg_finish_3 + rolling_avg_grid_3 +
    rolling_points_3 + prev_finish + prev_grid,
  data = train_data
) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_workflow <- workflow() %>%
  add_recipe(finish_recipe) %>%
  add_model(lm_spec)

lm_fit <- fit(lm_workflow, data = train_data)

finish_predictions <- predict(lm_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(position, driver, constructor, race_name, grid))

metrics(finish_predictions, truth = position, estimate = .pred)

finish_predictions %>%
  ggplot(aes(position, .pred)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Predicted vs actual finishing position",
    subtitle = "Baseline linear model",
    x = "Actual finish",
    y = "Predicted finish"
  ) +
  theme_minimal(base_size = 13)

A baseline model like this is not meant to be a perfect forecasting system. Its real value is educational. It shows how to move from results tables to feature engineering, then from features into a reproducible predictive workflow.

A simple custom driver rating

If you want the article to feel more original, one strong option is to create a custom driver score. Composite metrics work well in Formula 1 writing because they combine multiple dimensions of performance into one interpretable ranking.

driver_rating <- results_2024 %>%
  clean_names() %>%
  group_by(driver, constructor) %>%
  summarise(
    avg_finish = mean(position, na.rm = TRUE),
    avg_grid = mean(grid, na.rm = TRUE),
    points = sum(points, na.rm = TRUE),
    wins = sum(position == 1, na.rm = TRUE),
    podiums = sum(position <= 3, na.rm = TRUE),
    gain = mean(grid - position, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    finish_score = rescale(-avg_finish, to = c(0, 100)),
    grid_score = rescale(-avg_grid, to = c(0, 100)),
    points_score = rescale(points, to = c(0, 100)),
    gain_score = rescale(gain, to = c(0, 100)),
    win_score = rescale(wins, to = c(0, 100)),
    rating = 0.30 * finish_score +
             0.20 * grid_score +
             0.25 * points_score +
             0.15 * gain_score +
             0.10 * win_score
  ) %>%
  arrange(desc(rating))

driver_rating

The important thing here is transparency. Readers do not need to agree with every weight in the formula. What matters is that the method is explicit, interpretable, and easy to critique or improve.
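One cheap way to demonstrate that transparency is a sensitivity check: recompute the rating under a different, equally defensible weighting and see how much the ordering moves. The alternative weights below are purely illustrative, not a recommendation.

```r
# Hypothetical alternative weighting (still sums to 1) to test stability
rating_sensitivity <- driver_rating %>%
  mutate(
    rating_alt = 0.40 * finish_score +
                 0.30 * points_score +
                 0.15 * grid_score +
                 0.10 * gain_score +
                 0.05 * win_score
  ) %>%
  arrange(desc(rating_alt)) %>%
  select(driver, rating, rating_alt)

rating_sensitivity
```

If the top of the table survives reasonable re-weightings, the ranking is much easier to defend.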

Final thoughts

Formula 1 analysis in R is an unusually strong content niche because it combines technical rigor with a naturally engaged audience. With f1dataR, you can begin with historical race results, move into lap-time and stint analysis, explore pit strategy and driver benchmarking, and then build baseline predictive models that make the workflow feel complete.

That range is exactly what makes this such a good topic for an authority-building article. It is practical, it is reproducible, and it opens the door to an entire cluster of follow-up posts on telemetry, qualifying, tyre degradation, teammate comparisons, and race prediction.

If your goal is to publish technical content that demonstrates real expertise rather than just covering surface-level examples, Formula 1 data science in R is one of the best domains you can choose.

The post Formula 1 Analysis in R with f1dataR: Lap Times, Pit Stops, and Driver Performance appeared first on R Programming Books.

eSports Analytics in R: Predicting Dota 2 Matches https://rprogrammingbooks.com/esports-analytics-in-r-predicting-dota-2-matches/?utm_source=rss&utm_medium=rss&utm_campaign=esports-analytics-in-r-predicting-dota-2-matches https://rprogrammingbooks.com/esports-analytics-in-r-predicting-dota-2-matches/#respond Fri, 06 Mar 2026 18:51:30 +0000 https://rprogrammingbooks.com/?p=2499 eSports analytics is still an underexplored area in the R ecosystem, which makes it a great niche for practical, original work. While football, basketball, and betting models already have strong communities, competitive games such as Dota 2 and Counter-Strike offer rich event data, fast feedback loops, and interesting prediction problems. In this post, I will […]


eSports analytics is still an underexplored area in the R ecosystem, which makes it a great niche for practical, original work. While football, basketball, and betting models already have strong communities, competitive games such as Dota 2 and Counter-Strike offer rich event data, fast feedback loops, and interesting prediction problems. In this post, I will show how R can be used to extract match data, engineer useful features, classify players or teams, and build models to predict match outcomes.

The main idea is simple: treat eSports matches like any other structured competition dataset. We can collect historical match information, transform it into team-level or player-level predictors, and then train machine learning models that estimate the probability of victory. For Dota 2, the OpenDota ecosystem is especially useful because it exposes public match and player data through an API that can be accessed from R.

Why eSports analytics is a strong fit for R

R is particularly well suited for eSports analytics because it combines data collection, cleaning, visualization, modeling, and reporting in a single workflow. With packages from the tidyverse, tidymodels, and API tools such as httr and jsonlite, it becomes straightforward to move from raw match endpoints to a predictive pipeline.

This is also one of the reasons the topic stands out. Compared with mainstream sports, eSports still has much less mature R coverage, so a post focused on predicting Dota 2 matches in R feels fresh. It is practical, technically interesting, and relevant to analysts who want to work on non-traditional sports datasets.

Typical analytics questions in Dota 2 or CS-style games

Once match data is available, several interesting problems appear naturally:

  • Which team features are most associated with winning?
  • Can we predict the outcome of a match before it starts?
  • Which players outperform their role or bracket expectations?
  • Do certain heroes, maps, or compositions create measurable edges?
  • How stable are team ratings over time?

Some of these are classification tasks, others are ranking or regression problems, and several can benefit from time-aware modeling. If you enjoy probabilistic approaches, a Bayesian sports analytics book in R can be a useful complement when you want to move from point predictions to uncertainty-aware forecasts.

Data collection in R with OpenDota

A practical starting point is Dota 2 match data from the OpenDota API. In R, you can work either with a dedicated wrapper such as ROpenDota when available in your environment, or call the API directly with httr2, httr, and jsonlite. I often prefer direct API calls because they make the data flow more transparent and easier to debug.

The example below shows a simple way to retrieve recent professional matches and convert them into a tidy tibble.

library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tibble)

base_url <- "https://api.opendota.com/api/proMatches"

resp <- request(base_url) |>
  req_perform()

pro_matches <- resp |>
  resp_body_string() |>
  fromJSON(flatten = TRUE) |>
  as_tibble()

glimpse(pro_matches)

At this stage, the key goal is not modeling yet. It is understanding what the dataset contains. You want to inspect variables such as match identifiers, start times, radiant and dire team names, duration, league information, and the final winner. Once the structure is clear, the next step is to collect richer match-level or player-level details.
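A sketch of that inspection step might look like this. The column names follow the public proMatches payload, but verify them against your own pull, since API fields can change.

```r
# Peek at the fields most relevant for modeling; start_time is a Unix epoch
pro_matches |>
  select(match_id, start_time, radiant_name, dire_name,
         duration, league_name, radiant_win) |>
  mutate(start_time = as.POSIXct(start_time, origin = "1970-01-01", tz = "UTC")) |>
  slice_head(n = 5)
```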

Downloading detailed match records

Predictive models usually need more than a top-level match result. We often want per-match detail: kills, deaths, assists, gold per minute, experience per minute, hero picks, bans, lobby type, patch information, and team-level aggregates. A common workflow is to fetch a set of match IDs and then loop through the detailed endpoint for each match.

library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

get_match_details <- function(match_id) {
  url <- paste0("https://api.opendota.com/api/matches/", match_id)

  # Pause briefly between calls to respect the free OpenDota rate limit
  Sys.sleep(1)

  tryCatch({
    request(url) |>
      req_perform() |>
      resp_body_string() |>
      fromJSON(flatten = TRUE)
  }, error = function(e) {
    NULL
  })
}

sample_ids <- pro_matches |>
  slice_head(n = 50) |>
  pull(match_id)

match_details_raw <- map(sample_ids, get_match_details)
match_details_raw <- compact(match_details_raw)

This gives us a list of match records. From there, we can create a team-level modeling table. For predictive work, that usually means one row per team per match, along with a target variable indicating whether that team won.

Feature engineering for match prediction

Feature engineering is where most of the value is created. A model rarely becomes useful because of the algorithm alone; it becomes useful because the input variables capture something meaningful about team quality, momentum, or composition.

Some strong candidate features include:

  • Recent win rate over the last 5 or 10 matches
  • Average team KDA from recent games
  • Average gold per minute and experience per minute
  • Hero-pool diversity
  • Patch-specific performance
  • Opponent strength proxies
  • Side indicator such as Radiant vs Dire
  • Time since the team last played

A basic team-level engineering pipeline in R might look like this:

library(dplyr)
library(purrr)
library(tidyr)
library(stringr)

team_rows <- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  players <- as_tibble(m$players)

  players <- players |>
    mutate(
      side = if_else(player_slot < 128, "radiant", "dire")
    )

  team_summary <- players |>
    group_by(side) |>
    summarise(
      team_kills = sum(kills, na.rm = TRUE),
      team_deaths = sum(deaths, na.rm = TRUE),
      team_assists = sum(assists, na.rm = TRUE),
      avg_gpm = mean(gold_per_min, na.rm = TRUE),
      avg_xpm = mean(xp_per_min, na.rm = TRUE),
      # Within one match a team always fields 5 distinct heroes, so this
      # is only informative once aggregated across a team's matches
      hero_diversity = n_distinct(hero_id),
      .groups = "drop"
    ) |>
    mutate(
      match_id = m$match_id,
      duration = m$duration,
      radiant_win = m$radiant_win,
      win = if_else(
        (side == "radiant" & radiant_win) | (side == "dire" & !radiant_win),
        1, 0
      )
    )

  team_summary
})

team_rows

This table is already enough for a first classification model. It is not perfect, and it does not yet include pre-match only features, but it is ideal for prototyping. In real forecasting, we should be careful not to leak post-match information into the prediction target. For example, final kills and average GPM are fine for explanatory analysis but not for true pre-match forecasting.

Building a proper pre-match dataset

If the goal is to predict the winner before the game begins, then every feature must be available before the first second of the match. That means historical rolling summaries are usually better than in-match totals. A cleaner setup is:

  1. Sort matches chronologically
  2. Create one row per team per match
  3. Compute rolling features from previous matches only
  4. Join the two competing teams into a head-to-head row
  5. Train a binary classifier on the winner

Here is a simplified example of rolling team form:

library(dplyr)
library(slider)

team_history <- team_rows |>
  arrange(match_id) |>
  group_by(side) |>
  mutate(
    # Lag the windows so the current match is excluded: pre-match form
    # must be built from earlier matches only
    recent_win_rate = lag(slide_dbl(win, mean, .before = 4, .complete = FALSE)),
    recent_avg_kills = lag(slide_dbl(team_kills, mean, .before = 4, .complete = FALSE)),
    recent_avg_deaths = lag(slide_dbl(team_deaths, mean, .before = 4, .complete = FALSE))
  ) |>
  ungroup()

In a more complete dataset, you would calculate these rolling statistics by actual team identity rather than by side alone. That produces a much more realistic team-strength signal.
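Step 4 from the list above, joining the two competing teams into a single head-to-head row, can be sketched as follows. This uses the radiant/dire split from team_history; in a full pipeline you would join on real team identifiers instead.

```r
# One row per match: radiant-minus-dire differences of pre-match form
head_to_head <- team_history |>
  filter(side == "radiant") |>
  select(match_id, r_win = win, r_form = recent_win_rate, r_kills = recent_avg_kills) |>
  inner_join(
    team_history |>
      filter(side == "dire") |>
      select(match_id, d_form = recent_win_rate, d_kills = recent_avg_kills),
    by = "match_id"
  ) |>
  mutate(
    form_diff   = r_form - d_form,
    kills_diff  = r_kills - d_kills,
    radiant_win = factor(r_win, levels = c(0, 1))
  )
```

Difference features like form_diff are a common way to encode "how much stronger is one side than the other" in a single predictor.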

Predicting match outcomes with tidymodels

Once a clean modeling table is ready, tidymodels provides an elegant framework for splitting data, preprocessing predictors, training models, and evaluating performance. Logistic regression is a strong baseline because it is interpretable and fast. After that, tree-based methods such as random forests or gradient boosting can be tested.

library(tidymodels)

model_data <- team_rows |>
  select(win, team_kills, team_deaths, team_assists, avg_gpm, avg_xpm, hero_diversity, duration) |>
  mutate(win = factor(win, levels = c(0, 1)))

set.seed(123)

split_obj <- initial_split(model_data, prop = 0.8, strata = win)
train_data <- training(split_obj)
test_data  <- testing(split_obj)

rec <- recipe(win ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

log_spec <- logistic_reg() |>
  set_engine("glm")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(log_spec)

fit_log <- fit(wf, data = train_data)

preds <- predict(fit_log, test_data, type = "prob") |>
  bind_cols(predict(fit_log, test_data)) |>
  bind_cols(test_data)

# "1" is the second level of win, so mark it as the event
roc_auc(preds, truth = win, .pred_1, event_level = "second")
accuracy(preds, truth = win, .pred_class)

The first model is rarely the final model, but it gives us a baseline. If performance is weak, that usually means the issue is in the feature set rather than the modeling syntax. Better historical variables, better team identifiers, and better patch-aware data often matter more than switching algorithms immediately.

Moving beyond logistic regression

After a baseline, several improvements are possible. Random forests can capture nonlinear relationships. Gradient boosting often performs well when feature interactions matter. Bayesian models can be especially attractive when sample sizes are uneven or when you want probability distributions instead of single-point estimates. For readers interested in probabilistic thinking and predictive uncertainty, a resource on Bayesian sports betting with R can help connect model outputs with practical decision-making.

rf_spec <- rand_forest(
  trees = 500,
  min_n = 5
) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec)

fit_rf <- fit(rf_wf, data = train_data)

rf_preds <- predict(fit_rf, test_data, type = "prob") |>
  bind_cols(predict(fit_rf, test_data)) |>
  bind_cols(test_data)

roc_auc(rf_preds, truth = win, .pred_1, event_level = "second")
accuracy(rf_preds, truth = win, .pred_class)

A good post does not need to claim perfect predictive power. In fact, readers usually trust the analysis more when you clearly explain the constraints. Team rosters change, patches alter the meta, public data can be incomplete, and many matches are influenced by contextual factors that are difficult to encode numerically.

Player classification and rating ideas

Match prediction is only one angle. Another strong direction is player classification. For example, we can cluster players based on aggression, farming style, support contribution, and efficiency. This is particularly interesting because eSports roles are both strategic and behavioral.

A simple unsupervised workflow could include:

  • K-means clustering on player performance metrics
  • PCA for dimensionality reduction and visualization
  • Role classification using labeled examples
  • Elo-style or Glicko-style rating systems for evolving skill estimates
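The last bullet deserves a concrete illustration. A minimal Elo-style update needs only two lines of arithmetic; the K-factor of 24 below is an arbitrary illustrative choice.

```r
# Elo update: move a team's rating toward its observed result,
# scaled by how surprising that result was
elo_update <- function(r_a, r_b, score_a, k = 24) {
  exp_a <- 1 / (1 + 10^((r_b - r_a) / 400))  # expected score for team A
  r_a + k * (score_a - exp_a)
}

# A 1500-rated team upsetting a 1600-rated team gains about 15 points
elo_update(1500, 1600, score_a = 1)
```

Iterating this update over matches in chronological order yields an evolving team-strength estimate that can feed directly into the pre-match feature set.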

library(dplyr)
library(ggplot2)

player_data <- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  as_tibble(m$players) |>
    transmute(
      match_id = m$match_id,
      account_id = account_id,
      hero_id = hero_id,
      kills = kills,
      deaths = deaths,
      assists = assists,
      gpm = gold_per_min,
      xpm = xp_per_min,
      last_hits = last_hits
    )
}) |>
  filter(!is.na(account_id))

player_summary <- player_data |>
  group_by(account_id) |>
  summarise(
    avg_kills = mean(kills, na.rm = TRUE),
    avg_deaths = mean(deaths, na.rm = TRUE),
    avg_assists = mean(assists, na.rm = TRUE),
    avg_gpm = mean(gpm, na.rm = TRUE),
    avg_xpm = mean(xpm, na.rm = TRUE),
    avg_last_hits = mean(last_hits, na.rm = TRUE),
    matches = n(),
    .groups = "drop"
  ) |>
  filter(matches >= 10)

From there, clustering or supervised classification becomes straightforward. This is the kind of section that makes an eSports article feel broader than a simple API tutorial.
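A minimal version of the K-means plus PCA workflow mentioned above could look like this. The choice of k = 4 is an arbitrary starting point; in practice you would compare several values.

```r
# Scale the per-player averages, cluster, and project onto the first
# two principal components for plotting
feat <- player_summary |>
  select(avg_kills, avg_deaths, avg_assists, avg_gpm, avg_xpm, avg_last_hits) |>
  scale()

set.seed(99)
km  <- kmeans(feat, centers = 4, nstart = 25)
pca <- prcomp(feat)

player_summary |>
  mutate(
    cluster = factor(km$cluster),
    pc1 = pca$x[, 1],
    pc2 = pca$x[, 2]
  ) |>
  ggplot(aes(pc1, pc2, color = cluster)) +
  geom_point(alpha = 0.7) +
  labs(title = "Player style clusters in PCA space")
```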

Visualization ideas that make the post stronger

Visuals can turn a technical post into a memorable one. In eSports, a few plots are especially effective:

  • Win probability calibration plots
  • Rolling team form charts
  • Hero usage and win-rate heatmaps
  • Player cluster scatterplots from PCA
  • Feature importance plots for tree models

For example, here is a simple variable importance chart after fitting a random forest:

library(vip)

fit_rf |>
  extract_fit_parsnip() |>
  vip()

The purpose of these plots is not just decoration. They help answer the analytical question visually: what actually drives team success, and which signals seem stable across matches?
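Among those, the calibration plot is worth sketching explicitly, because it directly tests whether predicted probabilities mean what they claim. This assumes the preds table from the logistic model above.

```r
# Bin predicted win probabilities and compare with observed win rates;
# a well-calibrated model sits near the dashed identity line
preds |>
  mutate(prob_bin = cut(.pred_1, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) |>
  group_by(prob_bin) |>
  summarise(
    mean_pred = mean(.pred_1),
    obs_rate  = mean(win == "1"),
    n = n(),
    .groups = "drop"
  ) |>
  ggplot(aes(mean_pred, obs_rate)) +
  geom_abline(linetype = "dashed") +
  geom_point(aes(size = n)) +
  labs(x = "Mean predicted probability", y = "Observed win rate")
```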

What about Counter-Strike or other eSports titles?

The same workflow generalizes well. Even if package support is less standardized than in Dota 2, the modeling logic remains the same:

  • Collect historical match data
  • Build team and player features
  • Use rolling windows to represent recent form
  • Train classification or rating models
  • Evaluate probabilities, not just hard predictions

In Counter-Strike style datasets, likely features include map win rates, side-specific strength, recent kill differential, roster stability, and head-to-head history. In that sense, the sport changes, but the R workflow does not.

Why this kind of post can stand out

A post on eSports analytics in R stands out because it sits at the intersection of data science novelty and practical modeling. It is specific enough to be useful, but unusual enough to attract readers who are tired of the same repeated examples from traditional sports. A title built around predicting Dota 2 matches is especially effective because it immediately communicates a concrete deliverable.

It also fits naturally into a broader sports analytics learning path. Readers who discover this topic through eSports may later want to explore work in football, soccer, or multi-sport modeling, where books such as Football Analytics with R, Mastering Sports Analytics with R: Soccer, or Sports Analytics with R across multiple sports can expand the same analytical mindset into other domains.

Final thoughts

eSports analytics deserves more attention in R, and Dota 2 is one of the best places to start. With API access, tidy data workflows, and flexible modeling tools, it is possible to go from raw public match records to meaningful predictive systems entirely in R. Even a simple first version can teach a lot about data engineering, feature design, classification, and evaluation.

The real opportunity is not only to predict winners, but to build a reproducible framework for understanding team performance, player styles, and competitive dynamics in games that are becoming more important every year. That combination of novelty, data richness, and analytical depth is exactly what makes eSports such a compelling subject for an R post.

The post eSports Analytics in R: Predicting Dota 2 Matches appeared first on R Programming Books.

How to Fit Hierarchical Bayesian Models in R with brms: Partial Pooling Explained https://rprogrammingbooks.com/hierarchical-bayesian-models-in-r-brms-partial-pooling/?utm_source=rss&utm_medium=rss&utm_campaign=hierarchical-bayesian-models-in-r-brms-partial-pooling https://rprogrammingbooks.com/hierarchical-bayesian-models-in-r-brms-partial-pooling/#respond Sun, 01 Mar 2026 19:25:47 +0000 https://rprogrammingbooks.com/?p=2489 Hierarchical Bayesian modeling (also called multilevel modeling) is one of the most reliable ways to build predictive and inferential models when your data has natural grouping—teams, players, seasons, leagues, referees, venues, or even game states. In sports analytics, that grouping is unavoidable. In R, hierarchical Bayesian models are commonly implemented via brms (Stan), rstanarm, or […]

The post How to Fit Hierarchical Bayesian Models in R with brms: Partial Pooling Explained appeared first on R Programming Books.


Hierarchical Bayesian modeling (also called multilevel modeling) is one of the most reliable ways to build predictive and inferential models when your data has natural grouping—teams, players, seasons, leagues, referees, venues, or even game states. In sports analytics, that grouping is unavoidable. In R, hierarchical Bayesian models are commonly implemented via brms (Stan), rstanarm, or cmdstanr.

This tutorial focuses on partial pooling (a.k.a. shrinkage) and why it’s the default choice for academic, production-grade modeling: it reduces overfitting, improves out-of-sample performance, and produces honest uncertainty quantification. We will use a sports dataset as a concrete example, but the modeling principles generalize to many domains (education, marketing, medicine, A/B testing, and more).

Core keywords covered: hierarchical Bayesian models in R, partial pooling, multilevel modeling, Bayesian shrinkage, Bayesian inference in Stan, brms tutorial, posterior predictive checks, priors, model calibration, sports analytics in R.


Pooling Strategies: No Pooling, Complete Pooling, Partial Pooling

Suppose we want to estimate team strength. There are three classic approaches:

  • No pooling: estimate a separate parameter for each team using only that team’s data. This can be high-variance (overfits small samples).
  • Complete pooling: ignore teams and estimate one global parameter. This is low-variance but high-bias (misses real differences).
  • Partial pooling: estimate team parameters while sharing information via a population-level distribution. This is the hierarchical Bayesian compromise that tends to dominate in practice.

In partial pooling, each group effect is “pulled” toward the global mean in proportion to how much information that group has. Teams with few matches shrink more; teams with many matches shrink less. This is not a trick—it’s what the posterior implies under a reasonable hierarchical prior structure.
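In the simplest normal-normal case this pull has a closed form: the partially pooled estimate is a precision-weighted average of the group's own mean and the global mean. The numbers below are purely illustrative.

```r
# w is the weight on the group's own data; it grows with sample size n_j
# and with the plausible between-group spread tau2
partial_pool <- function(ybar_j, n_j, mu, sigma2, tau2) {
  w <- (n_j / sigma2) / (n_j / sigma2 + 1 / tau2)
  w * ybar_j + (1 - w) * mu
}

# Same observed mean of 2.0, but the 4-match team is pulled much
# closer to the global mean of 1.2 than the 40-match team
partial_pool(ybar_j = 2.0, n_j = 4,  mu = 1.2, sigma2 = 1, tau2 = 0.1)
partial_pool(ybar_j = 2.0, n_j = 40, mu = 1.2, sigma2 = 1, tau2 = 0.1)
```

This is exactly the behavior the hierarchical models below exhibit, except that the weights emerge from the posterior rather than being computed by hand.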


R Setup

We will fit models using brms, which compiles Bayesian models to Stan and provides a high-level formula interface. The workflow below emphasizes reproducibility and good Bayesian practice: priors, diagnostics, posterior predictive checks, and principled comparison against simpler baselines.

# Core modeling stack
install.packages(c("brms", "tidyverse", "posterior", "bayesplot", "tidybayes"))
# Optional but recommended for speed (CmdStan backend)
install.packages("cmdstanr")

library(brms)
library(tidyverse)
library(posterior)
library(bayesplot)
library(tidybayes)

# If using cmdstanr (recommended), set backend once:
# cmdstanr::install_cmdstan()
options(brms.backend = "cmdstanr")

Example Data: Match Outcomes with Team Effects

To keep this tutorial self-contained, we’ll simulate a dataset that resembles soccer-style goal scoring. The same structure appears in hockey, handball, futsal, and other low-to-moderate scoring sports. The key is that each match has two teams and outcomes depend on latent team strength.

set.seed(123)

n_teams <- 20
teams   <- paste0("Team_", seq_len(n_teams))

# True latent parameters (unknown in real life)
true_attack  <- rnorm(n_teams, 0, 0.35)
true_defense <- rnorm(n_teams, 0, 0.35)
true_home_adv <- 0.20

# Schedule: random pairings
n_matches <- 600
home_id <- sample(seq_len(n_teams), n_matches, replace = TRUE)
away_id <- sample(seq_len(n_teams), n_matches, replace = TRUE)
# Avoid self-matches
same <- home_id == away_id
while (any(same)) {
  away_id[same] <- sample(seq_len(n_teams), sum(same), replace = TRUE)
  same <- home_id == away_id
}

home_team <- teams[home_id]
away_team <- teams[away_id]

# Poisson rates for goals
log_lambda_home <- true_home_adv + true_attack[home_id] - true_defense[away_id]
log_lambda_away <-               true_attack[away_id] - true_defense[home_id]

lambda_home <- exp(log_lambda_home)
lambda_away <- exp(log_lambda_away)

home_goals <- rpois(n_matches, lambda_home)
away_goals <- rpois(n_matches, lambda_away)

matches <- tibble(
  match_id   = seq_len(n_matches),
  home_team  = factor(home_team),
  away_team  = factor(away_team),
  home_goals = home_goals,
  away_goals = away_goals
)

matches %>% glimpse()

In a real sports analytics pipeline, you would replace simulation with data ingestion (CSV/APIs), feature engineering (rest days, injuries, xG proxies, travel, Elo, etc.), and train/test splits. The hierarchical core remains: group-level effects with partial pooling.


Baseline: Complete Pooling (No Team Effects)

As a baseline, model home goals with a global intercept only. Because the response here is always the home side's goals, the average home advantage is absorbed into that intercept. This is intentionally too simple: it cannot learn that some teams are stronger than others.

m_pool <- brm(
  home_goals ~ 1,
  data   = matches,
  family = poisson(),
  prior  = c(
    prior(normal(0, 1.0), class = "Intercept")
  ),
  chains = 4, cores = 4, iter = 2000, seed = 1
)

summary(m_pool)

Complete pooling often underfits: it produces well-behaved uncertainty but misses systematic structure. Next we move to no pooling and then partial pooling.


No Pooling: Separate Team Parameters (High Variance)

No pooling estimates a fixed effect for each team without sharing strength across teams. This can be unstable when some teams have fewer observations or unbalanced schedules.

m_nopool <- brm(
  home_goals ~ 1 + home_team + away_team,
  data   = matches,
  family = poisson(),
  prior  = c(
    prior(normal(0, 1.0), class = "Intercept"),
    prior(normal(0, 0.5), class = "b")  # regularization, but still no pooling
  ),
  chains = 4, cores = 4, iter = 2000, seed = 2
)

summary(m_nopool)

Even with regularizing priors, no pooling is typically noisier than hierarchical partial pooling, especially in small samples. The multilevel model handles this in a more principled way.


Partial Pooling: Hierarchical (Multilevel) Team Effects

A standard approach is to model goals with random effects for home and away team. In brms syntax, (1 | home_team) means each home team gets its own intercept drawn from a common distribution. The distribution’s standard deviation is learned from data and controls the amount of shrinkage.

Below, we fit a model for home goals with hierarchical intercepts for both home and away teams. This captures: home attacking propensity (home team effect) and away defensive vulnerability (away team effect), albeit in a simplified way.

priors_hier <- c(
  prior(normal(0, 1.0), class = "Intercept"),
  # SD priors control how much team-to-team variation is plausible
  prior(exponential(1.0), class = "sd")
)

m_hier <- brm(
  home_goals ~ 1 + (1 | home_team) + (1 | away_team),
  data   = matches,
  family = poisson(),
  prior  = priors_hier,
  chains = 4, cores = 4, iter = 2500, seed = 3
)

summary(m_hier)

Interpretation: the model estimates a global goal rate (intercept) plus deviations for each team, but those deviations are partially pooled. This is the practical meaning of partial pooling: group-level parameters borrow statistical strength.
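The "borrowing strength" idea can be made concrete with a tiny base-R sketch. This is a normal-normal approximation purely for intuition (the Poisson model's shrinkage is analogous but not identical); the sigma and tau values here are made up:

```r
# Toy illustration of partial pooling: each team's pooled estimate is a
# precision-weighted blend of its own mean and the league mean.
# sigma = within-team noise SD, tau = between-team SD (made-up values).
shrink <- function(team_mean, n, league_mean, sigma = 1, tau = 0.3) {
  w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)  # weight on the team's own data
  w * team_mean + (1 - w) * league_mean
}

shrink(team_mean = 2.5, n = 3,  league_mean = 1.4)  # few games: pulled strongly toward 1.4
shrink(team_mean = 2.5, n = 40, league_mean = 1.4)  # many games: shrinks much less
```

The weight w grows with sample size, which is exactly why early-season estimates shrink hard while well-observed teams keep most of their own signal.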


Modeling Both Scores Jointly

A more realistic sports modeling setup estimates both home and away goals and separates attack/defense structure. brms supports multivariate models. Here’s a simple bivariate Poisson-like approach by fitting two Poisson responses with correlated random effects (conceptually similar to attack/defense components).

bf_home <- bf(home_goals ~ 1 + (1 | home_team) + (1 | away_team))
bf_away <- bf(away_goals ~ 1 + (1 | away_team) + (1 | home_team))

m_mv <- brm(
  bf_home + bf_away + set_rescor(FALSE),
  data   = matches,
  family = poisson(),
  prior  = c(
    prior(normal(0, 1.0), class = "Intercept", resp = "homegoals"),
    prior(normal(0, 1.0), class = "Intercept", resp = "awaygoals"),
    prior(exponential(1.0), class = "sd")
  ),
  chains = 4, cores = 4, iter = 3000, seed = 4
)

summary(m_mv)

This multivariate model is already a meaningful step toward academic-grade sports inference: it produces posterior distributions for team effects with coherent uncertainty, and it avoids overreacting to noisy short-term form.


Diagnostics and Posterior Predictive Checks (Academic Essentials)

A Bayesian model is only as credible as its diagnostics. At minimum, check: R-hat (convergence), effective sample size, and posterior predictive fit.

# Convergence diagnostics
m_hier %>% summary()
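If you prefer scalar checks to reading summary() output, brms also exposes the bayesplot generics rhat() and neff_ratio() (assuming a reasonably recent brms version); the thresholds below are common rules of thumb, not hard laws:

```r
# Numeric convergence checks
rhats <- rhat(m_hier)
any(rhats > 1.01, na.rm = TRUE)   # should be FALSE (all Rhat close to 1)

ratios <- neff_ratio(m_hier)
sort(ratios)[1:5]                 # inspect the worst-mixing parameters
```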

# Posterior predictive checks: do simulated goals resemble observed?
pp_check(m_hier, type = "hist", ndraws = 100)

# Another useful view: check the goal distribution by team.
# Note: the plot data returned by pp_check() does not contain home_team,
# so faceting the histogram directly fails; use a grouped PPC type instead
# (covers all teams; subset newdata if the grid gets crowded):
pp_check(m_hier, type = "bars_grouped", group = "home_team", ndraws = 50)

Posterior predictive checks help detect mis-specification: too many zeros, heavy tails, or systematic under/over-dispersion. If your sport has extra dispersion, consider a negative binomial likelihood: family = negbinomial().

m_hier_nb <- brm(
  home_goals ~ 1 + (1 | home_team) + (1 | away_team),
  data   = matches,
  family = negbinomial(),
  prior  = c(
    prior(normal(0, 1.0), class = "Intercept"),
    prior(exponential(1.0), class = "sd"),
    prior(exponential(1.0), class = "shape")  # NB dispersion
  ),
  chains = 4, cores = 4, iter = 2500, seed = 5
)

pp_check(m_hier_nb, type = "hist", ndraws = 100)

Visualizing Partial Pooling (Shrinkage) in R

The cleanest way to “see” partial pooling is to compare group estimates under: (a) no pooling and (b) hierarchical pooling. Teams with limited data will shrink toward the global mean in the hierarchical model.

# Extract team effects from both models
re_hier <- ranef(m_hier)$home_team[, , "Intercept"] %>%
  as_tibble(.name_repair = "minimal") %>%
  setNames(c("estimate", "est_error", "q2.5", "q97.5")) %>%
  mutate(team = rownames(ranef(m_hier)$home_team[, , "Intercept"]))

# No pooling fixed effects: home_team coefficients (approx comparison)
fix_nopool <- fixef(m_nopool) %>% as.data.frame() %>% rownames_to_column("term") %>%
  filter(str_starts(term, "home_team")) %>%
  mutate(team = str_remove(term, "home_team")) %>%
  transmute(team, estimate = Estimate, q2.5 = Q2.5, q97.5 = Q97.5)

# Join and compare
comp <- re_hier %>%
  left_join(fix_nopool, by = "team", suffix = c("_hier", "_nopool"))

comp %>%
  ggplot(aes(x = estimate_nopool, y = estimate_hier)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(
    x = "No pooling estimate (fixed effects)",
    y = "Partial pooling estimate (hierarchical)",
    title = "Shrinkage: hierarchical estimates pull extreme values toward the mean"
  )

In applied work, this shrinkage is a feature, not a bug: it protects you from chasing noise, especially early in a season or when some teams have faced unusually strong/weak opponents.


Posterior Predictions: From Parameters to Predictive Distributions

Hierarchical Bayesian models shine when you need uncertainty-aware predictions. Instead of a single number, you get a full posterior predictive distribution—useful for forecast intervals, simulations, and decision analysis.

# Create a small set of future fixtures (example)
new_matches <- tibble(
  home_team = factor(c("Team_1", "Team_2", "Team_3"), levels = levels(matches$home_team)),
  away_team = factor(c("Team_4", "Team_5", "Team_6"), levels = levels(matches$away_team))
)

# Posterior expected goals (lambda) for home_goals model
epred <- posterior_epred(m_hier, newdata = new_matches, ndraws = 2000)
# epred is draws x rows. Summarize mean and interval per match:
pred_summary <- apply(epred, 2, function(x) {
  c(mean = mean(x),
    q10  = quantile(x, 0.10, names = FALSE),  # names = FALSE keeps clean column names
    q90  = quantile(x, 0.90, names = FALSE))
}) %>% t() %>% as_tibble()

bind_cols(new_matches, pred_summary)

If you need simulated goal counts (not just expected value), use posterior_predict().

yrep <- posterior_predict(m_hier, newdata = new_matches, ndraws = 2000)

sim_summary <- apply(yrep, 2, function(x) {
  c(mean = mean(x),
    q10  = quantile(x, 0.10, names = FALSE),
    q90  = quantile(x, 0.90, names = FALSE))
}) %>% t() %>% as_tibble()

bind_cols(new_matches, sim_summary)

Model Comparison: Why Hierarchical Often Wins

A common academic question: does partial pooling actually improve predictive accuracy? One principled approach is approximate leave-one-out cross-validation (LOO) using loo().

# Requires: install.packages("loo")
library(loo)

loo_pool   <- loo(m_pool)
loo_nopool <- loo(m_nopool)
loo_hier   <- loo(m_hier)

loo_compare(loo_pool, loo_nopool, loo_hier)

In many real datasets, hierarchical models dominate no-pooling models because they reduce variance without imposing the high bias of complete pooling. When the sample is large and balanced, no-pooling can catch up, but it remains more fragile under distribution shift (new season, roster changes, schedule imbalance).


Priors for Multilevel Sports Models: Practical Guidance

Priors are not “optional decoration”—they formalize regularization and encode plausible scales. For sports scoring, consider:

  • Intercept prior: reflects typical goal rate (log scale for Poisson).
  • Group SD priors: control how much teams can vary from the league mean.
  • Likelihood choice: Poisson vs negative binomial for overdispersion.
# Example: informative-ish intercept prior based on typical goals per match
# If average home goals ~ 1.4, then log(1.4) ~ 0.336
priors_sporty <- c(
  prior(normal(log(1.4), 0.5), class = "Intercept"),
  prior(exponential(1.0), class = "sd")
)

m_hier2 <- brm(
  home_goals ~ 1 + (1 | home_team) + (1 | away_team),
  data   = matches,
  family = poisson(),
  prior  = priors_sporty,
  chains = 4, cores = 4, iter = 2500, seed = 6
)

If you publish academic-style modeling content, a short prior sensitivity check is a strong credibility signal: refit with slightly wider SD priors and confirm conclusions are stable.
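A minimal sensitivity sketch might look like the following. It assumes update() refits with the new priors (recompiling if needed), and uses exponential(0.5), which has mean 2 and is therefore twice as permissive as exponential(1):

```r
# Prior sensitivity check: widen the group-SD prior and compare group SDs.
priors_wide <- c(
  prior(normal(log(1.4), 0.5), class = "Intercept"),
  prior(exponential(0.5), class = "sd")   # more permissive team-to-team variation
)

m_hier2_wide <- update(m_hier2, prior = priors_wide, seed = 6)

# Conclusions should be stable across the two fits:
VarCorr(m_hier2)
VarCorr(m_hier2_wide)
```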


Practical Notes for Real Sports Data

  • Unbalanced schedules: hierarchical structure helps stabilize estimates when teams face different opponent quality.
  • Small samples: early season or new leagues are where partial pooling is most valuable.
  • Covariates: add rest days, travel, injuries, Elo, or rolling form as fixed effects, but keep team effects hierarchical.
  • Time dynamics: for long seasons, consider random walks or season-by-season hierarchical layers.
# Add a covariate example (simulated here):
matches2 <- matches %>%
  mutate(rest_diff = rnorm(n(), 0, 1))  # placeholder for real engineered feature

m_cov <- brm(
  home_goals ~ 1 + rest_diff + (1 | home_team) + (1 | away_team),
  data   = matches2,
  family = poisson(),
  prior  = c(
    prior(normal(0, 1.0), class = "Intercept"),
    prior(normal(0, 0.3), class = "b"),   # effect size prior for covariate
    prior(exponential(1.0), class = "sd")
  ),
  chains = 4, cores = 4, iter = 2500, seed = 7
)

summary(m_cov)
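The "time dynamics" bullet above can be sketched in brms with a nested team-within-season intercept. The code is commented out because it assumes matches2 carries a season column, which the simulated data used here may not include:

```r
# Sketch: season-by-season hierarchy (team effects allowed to drift per season)
# m_time <- brm(
#   home_goals ~ 1 + (1 | home_team) + (1 | home_team:season) + (1 | away_team),
#   data   = matches2,   # assumes a season column exists
#   family = poisson(),
#   prior  = c(
#     prior(normal(0, 1.0), class = "Intercept"),
#     prior(exponential(1.0), class = "sd")
#   ),
#   chains = 4, cores = 4, iter = 2500
# )
```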

Further Reading and Deeper Projects (Optional)

If you want to extend this tutorial into a full sports analytics workflow—data acquisition, feature engineering, predictive evaluation, and model deployment—longer-form sports analytics references can be useful as project companions.

Treat such references as optional deep-dives; the core multilevel concepts in this post stand on their own and transfer cleanly to non-sports hierarchical modeling problems.


Conclusion

Partial pooling is the practical heart of hierarchical Bayesian modeling. In R, multilevel models with brms/Stan give you: (1) stable group estimates, (2) principled regularization, and (3) posterior predictive uncertainty that supports simulation, forecasting, and evaluation. If your data has groups—and most real-world data does—hierarchical Bayes is often the most defensible baseline you can set.

The post How to Fit Hierarchical Bayesian Models in R with brms: Partial Pooling Explained appeared first on R Programming Books.

]]>
https://rprogrammingbooks.com/hierarchical-bayesian-models-in-r-brms-partial-pooling/feed/ 0
Football Betting Model in R (Step-by-Step Guide 2026) https://rprogrammingbooks.com/football-betting-model-r-guide-2026/?utm_source=rss&utm_medium=rss&utm_campaign=football-betting-model-r-guide-2026 https://rprogrammingbooks.com/football-betting-model-r-guide-2026/#respond Sun, 22 Feb 2026 22:06:52 +0000 https://rprogrammingbooks.com/?p=2472 Related (on this site): Install & Use worldfootballR worldfootballR Guide Sports Analytics with R NFL Analytics with R Tennis Analytics with R Boxing Analytics with R Bayesian Sports Analytics (Book/Product) Contents Setup Get match data Feature engineering Model 1: Poisson goals (baseline) Model 2: Dixon–Coles adjustment (improves low scores) From scorelines to 1X2 probabilities Odds, […]

The post Football Betting Model in R (Step-by-Step Guide 2026) appeared first on R Programming Books.

]]>

Setup

This is a fully reproducible, code-heavy walkthrough for building a football betting model in R: data → features → model → probabilities → value bets → bankroll rules → backtest. If you’re new to worldfootballR, start here: Install & Use worldfootballR and keep the worldfootballR Guide open as reference.

# Core packages
install.packages(c(
  "dplyr","tidyr","purrr","stringr","lubridate",
  "readr","ggplot2","tibble","janitor","glue"
))

# Modeling + evaluation
install.packages(c("broom","rsample","yardstick","scoringRules","pROC"))

# Optional (for speed / nicer tables)
install.packages(c("data.table","DT"))

# Football data
# worldfootballR is usually installed from GitHub
# See: https://rprogrammingbooks.com/install-use-worldfootballr/
# If needed:
# install.packages("remotes")
# remotes::install_github("JaseZiv/worldfootballR")

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(lubridate)
library(readr)
library(ggplot2)
library(janitor)
library(glue)

# worldfootballR (uncomment after install)
# library(worldfootballR)

set.seed(2026)
Modeling note: The baseline approach below is the classic Poisson goals model (team attack/defense + home advantage). It’s simple, explainable, and a great foundation. You can extend it later to xG models, Bayesian hierarchical models, or time-varying strength. If you like Bayesian approaches, see: Bayesian Sports Analytics (Book/Product) .

Get match data

The easiest path is to pull historical match results from public sources. With worldfootballR, you can often scrape league seasons from sources like FBref. The specific function names can vary by package version/source, so below you’ll see:

  • Option A: Use worldfootballR directly (recommended).
  • Option B: Use your own CSV export if you already have data.

Option A — Pull league season data with worldfootballR

If you need help with installation and authentication quirks, see: Install & Use worldfootballR.

# --- Option A (worldfootballR) ---
# The exact workflow depends on the data source (FBref / other).
# Typical pattern:
# 1) Get competition URLs
# 2) Pull match results for seasons
#
# PSEUDOCODE (adjust per your worldfootballR version):
#
# comp_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2026)
# epl_url <- comp_urls %>% filter(str_detect(Competition_Name, "Premier League")) %>% pull(Seasons_Urls) %>% .[1]
#
# matches <- fb_match_results(season_url = epl_url) %>%
#   janitor::clean_names()
#
# head(matches)

Option B — Use a CSV export (works everywhere)

Your data needs (minimum): date, home_team, away_team, home_goals, away_goals. Save it as matches.csv.

# --- Option B (CSV) ---
matches <- readr::read_csv("matches.csv") %>%
  janitor::clean_names() %>%
  mutate(date = as.Date(date)) %>%
  filter(!is.na(home_goals), !is.na(away_goals)) %>%
  arrange(date)

dplyr::glimpse(matches)

Standardize columns

# Make sure we have standardized column names
matches <- matches %>%
  transmute(
    date = as.Date(date),
    season = if_else(month(date) >= 7, year(date) + 1L, year(date)), # football season heuristic
    home = as.character(home_team),
    away = as.character(away_team),
    hg = as.integer(home_goals),
    ag = as.integer(away_goals)
  ) %>%
  filter(!is.na(date), !is.na(home), !is.na(away), !is.na(hg), !is.na(ag)) %>%
  arrange(date)

# Basic sanity checks
stopifnot(all(matches$hg >= 0), all(matches$ag >= 0))
matches %>% count(season) %>% arrange(desc(season)) %>% print(n = 50)

Feature engineering

For a baseline Poisson model, we’ll estimate team strength through parameters: attack and defense, plus a home advantage. We’ll also build rolling form features as optional enhancements.

Long format for modeling

# Note: after pivot_longer(), the original home/away columns no longer exist,
# so computing opp from them would fail. Building the long table with
# bind_rows() keeps both the team and its opponent available:
matches_id <- matches %>% mutate(match_id = row_number())

long <- bind_rows(
  matches_id %>% transmute(match_id, date, season, team = home, opp = away,
                           is_home = 1L, goals = hg, conceded = ag),
  matches_id %>% transmute(match_id, date, season, team = away, opp = home,
                           is_home = 0L, goals = ag, conceded = hg)
) %>%
  arrange(match_id, desc(is_home))

head(long)

Optional: rolling “form” features

These can help a bit, but don’t overfit. Keep them simple and always validate out-of-sample.

# Rolling averages for goals scored/conceded over the previous n matches (per team).
# lag() shifts the rolling mean by one match so the feature only uses results
# known BEFORE kickoff (avoids target leakage).
add_form_features <- function(df, n = 5) {
  df %>%
    arrange(team, date, match_id) %>%
    group_by(team) %>%
    mutate(
      gf_roll = dplyr::lag(zoo::rollapplyr(goals, width = n, FUN = mean, fill = NA, partial = TRUE)),
      ga_roll = dplyr::lag(zoo::rollapplyr(conceded, width = n, FUN = mean, fill = NA, partial = TRUE))
    ) %>%
    ungroup()
}

# install.packages("zoo") if needed
# library(zoo)
# long <- add_form_features(long, n = 5)
Link building tip: If you cover multiple sports, create a “methods hub” page and link out to each sport’s analytics guide: Sports Analytics with R, NFL, Tennis, Boxing. This strengthens topical authority and internal PageRank flow.

Model 1: Poisson goals (baseline)

We’ll fit two Poisson regressions: one for home goals and one for away goals, with team attack/defense effects and a home advantage term. A standard approach is:

  • home_goals ~ home_adv + attack(home) + defense(away)
  • away_goals ~ attack(away) + defense(home)

To avoid identifiability issues, we’ll set a baseline team as reference via factor levels.

Train/test split by time (realistic for betting)

# Time-based split (e.g., last 20% of matches as test)
n_total <- nrow(matches)
cut_idx <- floor(n_total * 0.80)

train <- matches %>% slice(1:cut_idx)
test  <- matches %>% slice((cut_idx + 1):n_total)

# Ensure consistent factor levels
teams <- sort(unique(c(matches$home, matches$away)))
train <- train %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
test  <- test  %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))

# Fit models
home_mod <- glm(hg ~ 1 + home + away, data = train, family = poisson())
away_mod <- glm(ag ~ 1 + away + home, data = train, family = poisson())

summary(home_mod)
summary(away_mod)
Interpretation: The simplest version above uses team fixed effects as factors. It works, but it mixes attack/defense. Next, we’ll fit the more interpretable attack/defense parameterization.

Attack/Defense parameterization (more interpretable)

# Build a modeling frame in the classic attack/defense form
# We model home goals:
# log(lambda_home) = home_adv + attack_home - defense_away
# And away goals:
# log(lambda_away) = attack_away - defense_home

# Create team factors
train2 <- train %>%
  mutate(
    home = factor(home, levels = teams),
    away = factor(away, levels = teams)
  )

# We'll encode attack and defense as separate factors by prefixing labels
mk_attack <- function(team) factor(paste0("att_", team), levels = paste0("att_", teams))
mk_def    <- function(team) factor(paste0("def_", team), levels = paste0("def_", teams))

train_home <- train2 %>%
  transmute(
    goals = hg,
    is_home = 1L,
    att = mk_attack(home),
    def = mk_def(away)
  )

train_away <- train2 %>%
  transmute(
    goals = ag,
    is_home = 0L,
    att = mk_attack(away),
    def = mk_def(home)
  )

train_long <- bind_rows(train_home, train_away)

# Fit a single Poisson model with:
# goals ~ is_home + att + def
# Note: to reflect "- defense" we can include def and allow coefficients to learn direction;
# For stricter structure you can re-code defense sign, but this works well in practice.
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())

summary(ad_mod)

Predict expected goals (lambda) for each match

predict_lambdas <- function(df, model, teams) {
  df2 <- df %>%
    mutate(
      home = factor(home, levels = teams),
      away = factor(away, levels = teams),
      att_home = factor(paste0("att_", home), levels = paste0("att_", teams)),
      def_away = factor(paste0("def_", away), levels = paste0("def_", teams)),
      att_away = factor(paste0("att_", away), levels = paste0("att_", teams)),
      def_home = factor(paste0("def_", home), levels = paste0("def_", teams))
    )

  # home lambda
  new_home <- df2 %>%
    transmute(is_home = 1L, att = att_home, def = def_away)

  # away lambda
  new_away <- df2 %>%
    transmute(is_home = 0L, att = att_away, def = def_home)

  lam_home <- predict(model, newdata = new_home, type = "response")
  lam_away <- predict(model, newdata = new_away, type = "response")

  df2 %>%
    mutate(lambda_home = lam_home, lambda_away = lam_away)
}

test_pred <- predict_lambdas(test, ad_mod, teams)
head(test_pred)

Model 2: Dixon–Coles adjustment (improves low scores)

The independent Poisson assumption can under/overestimate probabilities for low scores (0–0, 1–0, 0–1, 1–1). Dixon–Coles introduces a small correction factor. Below is a clean implementation.

# Dixon-Coles tau adjustment for low-score dependence
tau_dc <- function(x, y, lam_x, lam_y, rho) {
  # x = home goals, y = away goals
  # rho is the dependence parameter
  if (x == 0 && y == 0) return(1 - (lam_x * lam_y * rho))
  if (x == 0 && y == 1) return(1 + (lam_x * rho))
  if (x == 1 && y == 0) return(1 + (lam_y * rho))
  if (x == 1 && y == 1) return(1 - rho)
  return(1)
}

# Scoreline probability matrix up to max_goals
score_matrix <- function(lam_h, lam_a, rho = 0, max_goals = 10) {
  xs <- 0:max_goals
  ys <- 0:max_goals

  ph <- dpois(xs, lam_h)
  pa <- dpois(ys, lam_a)

  # outer product for independent probabilities
  P <- outer(ph, pa)

  # apply DC tau correction
  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      P[i, j] <- P[i, j] * tau_dc(xs[i], ys[j], lam_h, lam_a, rho)
    }
  }

  # renormalize
  P / sum(P)
}

# Example
P_ex <- score_matrix(lam_h = 1.4, lam_a = 1.1, rho = 0.05, max_goals = 8)
round(P_ex[1:5,1:5], 4)

How do we choose rho? You can estimate it by maximizing likelihood on training data. Here’s a lightweight optimizer:

# Estimate rho by maximizing log-likelihood on train set given lambdas
train_pred <- predict_lambdas(train, ad_mod, teams)

dc_loglik <- function(rho, df, max_goals = 10) {
  # clamp rho to a reasonable range to avoid numerical issues
  rho <- max(min(rho, 0.3), -0.3)

  ll <- 0
  for (k in seq_len(nrow(df))) {
    lam_h <- df$lambda_home[k]
    lam_a <- df$lambda_away[k]
    hg <- df$hg[k]
    ag <- df$ag[k]

    P <- score_matrix(lam_h, lam_a, rho = rho, max_goals = max_goals)

    # if score exceeds max_goals, treat as tiny prob (or increase max_goals)
    if (hg > max_goals || ag > max_goals) {
      ll <- ll + log(1e-12)
    } else {
      ll <- ll + log(P[hg + 1, ag + 1] + 1e-15)
    }
  }
  ll
}

opt <- optimize(
  f = function(r) -dc_loglik(r, train_pred, max_goals = 10),
  interval = c(-0.2, 0.2)
)

rho_hat <- opt$minimum
rho_hat

From scorelines to 1X2 probabilities

Once you have a scoreline probability matrix P, compute:

  • Home win: sum of probabilities where home goals > away goals
  • Draw: sum of diagonal
  • Away win: sum where home goals < away goals
p1x2_from_matrix <- function(P) {
  max_g <- nrow(P) - 1
  xs <- 0:max_g
  ys <- 0:max_g

  p_home <- 0
  p_draw <- 0
  p_away <- 0

  for (i in seq_along(xs)) {
    for (j in seq_along(ys)) {
      if (xs[i] > ys[j]) p_home <- p_home + P[i, j]
      if (xs[i] == ys[j]) p_draw <- p_draw + P[i, j]
      if (xs[i] < ys[j]) p_away <- p_away + P[i, j]
    }
  }

  tibble(p_home = p_home, p_draw = p_draw, p_away = p_away)
}

predict_1x2 <- function(df, rho = 0, max_goals = 10) {
  out <- vector("list", nrow(df))
  for (k in seq_len(nrow(df))) {
    P <- score_matrix(df$lambda_home[k], df$lambda_away[k], rho = rho, max_goals = max_goals)
    out[[k]] <- p1x2_from_matrix(P)
  }
  bind_rows(out)
}

test_1x2 <- bind_cols(
  test_pred,
  predict_1x2(test_pred, rho = rho_hat, max_goals = 10)
)

test_1x2 %>%
  select(date, home, away, hg, ag, lambda_home, lambda_away, p_home, p_draw, p_away) %>%
  head(10)

Odds, implied probabilities & value

Betting decisions should be driven by expected value (EV). If the market offers decimal odds O and your model probability is p, then:

  • Expected value per unit stake: EV = p*(O-1) - (1-p)
  • Value condition: p > 1/O
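These two formulas are easy to sanity-check with a couple of lines. ev_example here is a hypothetical helper with the same formula as the ev() function defined later in this section:

```r
# Worked example: decimal odds O = 2.20, model probability p = 0.50
ev_example <- function(p, o) p * (o - 1) - (1 - p)
ev_example(0.50, 2.20)  # 0.50 * 1.20 - 0.50 = 0.10, i.e. +10% per unit staked
1 / 2.20                # breakeven probability: value exists only if p exceeds this
```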

You’ll typically have an odds feed. Below we assume you have a file odds.csv: date, home, away, odds_home, odds_draw, odds_away.

odds <- readr::read_csv("odds.csv") %>%
  janitor::clean_names() %>%
  mutate(date = as.Date(date)) %>%
  transmute(
    date,
    home = as.character(home),
    away = as.character(away),
    o_home = as.numeric(odds_home),
    o_draw = as.numeric(odds_draw),
    o_away = as.numeric(odds_away)
  )

df <- test_1x2 %>%
  mutate(home = as.character(home), away = as.character(away)) %>%
  left_join(odds, by = c("date","home","away"))

# Implied probs (no vig removal yet)
df <- df %>%
  mutate(
    imp_home = 1 / o_home,
    imp_draw = 1 / o_draw,
    imp_away = 1 / o_away,
    overround = imp_home + imp_draw + imp_away
  )

# Simple vig removal by normalization
df <- df %>%
  mutate(
    mkt_home = imp_home / overround,
    mkt_draw = imp_draw / overround,
    mkt_away = imp_away / overround
  )

# EV per 1 unit stake
ev <- function(p, o) p*(o - 1) - (1 - p)

df <- df %>%
  mutate(
    ev_home = ev(p_home, o_home),
    ev_draw = ev(p_draw, o_draw),
    ev_away = ev(p_away, o_away)
  )

df %>%
  select(date, home, away, p_home, p_draw, p_away, o_home, o_draw, o_away, ev_home, ev_draw, ev_away) %>%
  head(10)

Pick bets with thresholds (avoid noise)

# Practical filters: require at least some edge and avoid tiny probabilities
EDGE_MIN <- 0.02   # 2% EV edge
P_MIN    <- 0.05   # avoid extreme longshots unless you model them well

df_bets <- df %>%
  mutate(
    pick = case_when(
      ev_home == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "H",
      ev_draw == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "D",
      TRUE ~ "A"
    ),
    p_pick = case_when(pick == "H" ~ p_home, pick == "D" ~ p_draw, TRUE ~ p_away),
    o_pick = case_when(pick == "H" ~ o_home, pick == "D" ~ o_draw, TRUE ~ o_away),
    ev_pick = case_when(pick == "H" ~ ev_home, pick == "D" ~ ev_draw, TRUE ~ ev_away)
  ) %>%
  filter(!is.na(o_pick)) %>%
  filter(p_pick >= P_MIN, ev_pick >= EDGE_MIN)

df_bets %>% count(pick)

Backtest: flat stake vs Kelly

Backtesting is where most people fool themselves. Use a time-based split, realistic bet selection rules, and conservative bankroll sizing.

Compute bet results

# Outcome label
df_bets <- df_bets %>%
  mutate(
    result = case_when(
      hg > ag ~ "H",
      hg == ag ~ "D",
      TRUE ~ "A"
    ),
    win = as.integer(pick == result)
  )

# Profit per 1 unit stake
df_bets <- df_bets %>%
  mutate(
    profit_flat = if_else(win == 1, o_pick - 1, -1)
  )

df_bets %>% summarise(
  n_bets = n(),
  hit_rate = mean(win),
  avg_odds = mean(o_pick),
  roi_flat = mean(profit_flat)
)

Kelly staking (fractional Kelly recommended)

Full Kelly is often too aggressive. Use fractional Kelly (e.g., 0.25 Kelly). If you want a deeper treatment of Kelly and Bayesian uncertainty, see: Bayesian Sports Analytics (Book/Product) .

kelly_fraction <- function(p, o) {
  # Decimal odds o. Net odds b = o - 1
  b <- o - 1
  q <- 1 - p
  f <- (b*p - q) / b
  pmax(0, f)
}

KELLY_MULT <- 0.25  # fractional Kelly

df_bets <- df_bets %>%
  mutate(
    f_kelly = kelly_fraction(p_pick, o_pick),
    stake_kelly = KELLY_MULT * f_kelly
  )

# Simulate bankroll
simulate_bankroll <- function(df, start_bankroll = 100, stake_col = "stake_kelly") {
  br <- start_bankroll
  path <- numeric(nrow(df))

  for (i in seq_len(nrow(df))) {
    stake <- df[[stake_col]][i]
    stake_amt <- br * stake

    # Profit = stake_amt*(o-1) if win else -stake_amt
    prof <- if (df$win[i] == 1) stake_amt*(df$o_pick[i] - 1) else -stake_amt
    br <- br + prof
    path[i] <- br
  }

  path
}

df_bets <- df_bets %>% arrange(date)

# Flat staking: e.g., 1% bankroll per bet
df_bets <- df_bets %>% mutate(stake_flat = 0.01)

df_bets$br_flat  <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_flat")
df_bets$br_kelly <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_kelly")

tail(df_bets %>% select(date, home, away, pick, o_pick, win, br_flat, br_kelly), 10)

Plot bankroll curves

plot_df <- df_bets %>%
  select(date, br_flat, br_kelly) %>%
  pivot_longer(cols = c(br_flat, br_kelly), names_to = "strategy", values_to = "bankroll")

ggplot(plot_df, aes(x = date, y = bankroll, group = strategy)) +
  geom_line() +
  labs(x = NULL, y = "Bankroll", title = "Backtest Bankroll: Flat vs Fractional Kelly")

Calibration diagnostics

Profit is noisy. Calibration tells you if probabilities are sensible. A great proper scoring rule for 1X2 is the multi-class log loss. We’ll compute log loss for your 1X2 probabilities on the test set.

# Create a probability matrix and truth labels
df_eval <- df %>%
  filter(!is.na(p_home), !is.na(p_draw), !is.na(p_away)) %>%
  mutate(
    truth = case_when(hg > ag ~ "H", hg == ag ~ "D", TRUE ~ "A"),
    truth = factor(truth, levels = c("H","D","A"))
  )

# Log loss (manual)
log_loss_1x2 <- function(pH, pD, pA, y) {
  eps <- 1e-15
  p <- ifelse(y=="H", pH, ifelse(y=="D", pD, pA))
  -mean(log(pmax(p, eps)))
}

ll <- log_loss_1x2(df_eval$p_home, df_eval$p_draw, df_eval$p_away, df_eval$truth)
ll

Reliability plot (binning)

# Example: calibration for HOME-win probability
calib_home <- df_eval %>%
  mutate(bin = ntile(p_home, 10)) %>%
  group_by(bin) %>%
  summarise(
    p_mean = mean(p_home),
    freq = mean(truth == "H"),
    n = n(),
    .groups = "drop"
  )

ggplot(calib_home, aes(x = p_mean, y = freq)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "Predicted P(Home win)", y = "Observed frequency", title = "Calibration (Home win)")
Common upgrade path: Add xG features (if you have them) and/or time decay (recent matches matter more). Then validate calibration again. Don’t chase ROI without checking probability quality.
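The "time decay" upgrade mentioned above can be sketched with exponential weights passed to glm(). It is commented out because it assumes a date column carried into train_long, which the training frames built earlier do not include:

```r
# Exponential time-decay weights with a chosen half-life (recent matches count more)
# half_life <- 180  # days (assumption; tune via validation)
# decay     <- log(2) / half_life
# train_long$w <- exp(-decay * as.numeric(max(train_long$date) - train_long$date))
# ad_mod_w <- glm(goals ~ is_home + att + def,
#                 data = train_long, family = poisson(), weights = w)
```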

Production: weekly pipeline

Here’s a practical skeleton you can run weekly: fetch latest matches → refit/refresh → generate next fixtures probabilities → compare with odds.

# 1) Load historical matches (from worldfootballR or your CSV)
matches <- readr::read_csv("matches.csv") %>% janitor::clean_names()

# 2) Train up to cutoff date (e.g., yesterday)
cutoff_date <- Sys.Date() - 1

hist <- matches %>%
  mutate(date = as.Date(date)) %>%
  filter(date <= cutoff_date) %>%
  transmute(date, home = home_team, away = away_team, hg = home_goals, ag = away_goals) %>%
  arrange(date)

teams <- sort(unique(c(hist$home, hist$away)))

# 3) Fit attack/defense model
train_long <- bind_rows(
  hist %>%
    transmute(goals = hg, is_home = 1L,
              att = factor(paste0("att_", home), levels = paste0("att_", teams)),
              def = factor(paste0("def_", away), levels = paste0("def_", teams))),
  hist %>%
    transmute(goals = ag, is_home = 0L,
              att = factor(paste0("att_", away), levels = paste0("att_", teams)),
              def = factor(paste0("def_", home), levels = paste0("def_", teams)))
)

ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())

# 4) Predict for upcoming fixtures (you need a fixtures table)
fixtures <- readr::read_csv("fixtures.csv") %>%
  janitor::clean_names() %>%
  mutate(date = as.Date(date)) %>%
  transmute(date, home = home_team, away = away_team)

# Compute lambdas
fixtures2 <- fixtures %>%
  mutate(
    home = factor(home, levels = teams),
    away = factor(away, levels = teams)
  )

fixtures_pred <- predict_lambdas(
  df = fixtures2 %>% mutate(hg = 0L, ag = 0L), # placeholders
  model = ad_mod,
  teams = teams
)

# 5) Estimate rho (optional) on recent history only (faster)
hist_pred <- hist %>% mutate(home = factor(home, levels=teams), away=factor(away, levels=teams))
hist_pred <- predict_lambdas(hist_pred, ad_mod, teams)

opt <- optimize(
  f = function(r) -dc_loglik(r, hist_pred, max_goals = 10),
  interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum

# 6) Convert to 1X2
fixtures_1x2 <- bind_cols(
  fixtures_pred,
  predict_1x2(fixtures_pred, rho = rho_hat, max_goals = 10)
) %>%
  select(date, home, away, lambda_home, lambda_away, p_home, p_draw, p_away)

write_csv(fixtures_1x2, "model_probs.csv")

At this point you have model_probs.csv ready to merge with bookmaker odds and produce a bet shortlist. In your content strategy, link this post to your broader methods pages: Sports Analytics with R and the relevant sport hubs (NFL, tennis, boxing) to strengthen internal linking.

FAQ

Is a Poisson model “good enough” for football betting?

It’s a strong baseline. It captures team strength and home advantage with minimal complexity. Many upgrades (xG, time decay, Bayesian partial pooling) improve robustness, but the baseline can already be useful.

How do I avoid overfitting?

Use time-based validation, keep features simple, and prioritize calibration and log loss. Don’t tune thresholds using the same data you evaluate on.

What’s the simplest value-bet rule?

Bet only when p_model > p_implied and you have a buffer (e.g. EV > 2%), then stake conservatively (flat 0.5%–1% bankroll or fractional Kelly).

Where do I learn more advanced Bayesian sports models in R?

If you want Bayesian approaches, uncertainty-aware staking, and a deeper treatment of the Kelly criterion, see: Bayesian Sports Analytics (Book/Product) .


Next reads on rprogrammingbooks.com

The post Football Betting Model in R (Step-by-Step Guide 2026) appeared first on R Programming Books.

]]>
https://rprogrammingbooks.com/football-betting-model-r-guide-2026/feed/ 0
Quantitative Horse Racing with R: Calibration, Backtesting, and Deployment https://rprogrammingbooks.com/quantitative-horse-racing-r-calibration-backtesting-deployment/?utm_source=rss&utm_medium=rss&utm_campaign=quantitative-horse-racing-r-calibration-backtesting-deployment https://rprogrammingbooks.com/quantitative-horse-racing-r-calibration-backtesting-deployment/#respond Thu, 19 Feb 2026 19:19:08 +0000 https://rprogrammingbooks.com/?p=2460 R DuckDB Parquet Calibration Ranking Bayesian Odds TS Backtesting Racing analytics as an inference-and-decision system Thoroughbred flat racing is not a binary classification problem. It is a multi-competitor outcome process with hierarchy (horse / trainer / jockey / track), time dependence (form cycles, market moves), and decision layers (how you act on probabilities). This macro […]

The post Quantitative Horse Racing with R: Calibration, Backtesting, and Deployment appeared first on R Programming Books.

]]>
R
DuckDB
Parquet
Calibration
Ranking
Bayesian
Odds TS
Backtesting

Racing analytics as an inference-and-decision system

Thoroughbred flat racing is not a binary classification problem. It is a multi-competitor outcome process with hierarchy (horse / trainer / jockey / track), time dependence (form cycles, market moves), and decision layers (how you act on probabilities).

This macro post is deliberately code-heavy. The goal is to show a canonical, reproducible R workflow you can adapt: from a data contract to modeling to evaluation to deployment.

Design principle: treat predictive skill and wagering decisions as separate layers.
  • Model layer: calibrated probabilities and uncertainty.
  • Decision layer: explicit assumptions, risk controls, stress tests.

If you do only one thing, do this: never judge models by ROI alone. Use proper scoring rules and calibration first.
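For reference, the two scores mentioned here are easy to write by hand (toy vectors, not model output):

```r
# y: observed 0/1 outcomes; p: predicted win probabilities
y <- c(1, 0, 0, 1)
p <- c(0.60, 0.20, 0.40, 0.75)

# Log loss: penalizes confident wrong predictions heavily
log_loss <- -mean(y * log(p) + (1 - y) * log(1 - p))

# Brier score: mean squared error on the probability scale
brier <- mean((p - y)^2)

c(log_loss = log_loss, brier = brier)
```

Both are negatively oriented (lower is better) and, unlike ROI, reward honest probabilities directly rather than lucky outcomes.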

1) Canonical data contract: race → runner → odds snapshots

The fastest way to avoid analytics chaos is to standardize your storage model. A minimal schema lets every chapter, notebook, or experiment start from the same tables.

DuckDB schema (SQL)

-- sql/schema.sql
CREATE TABLE IF NOT EXISTS races (
  race_id     VARCHAR PRIMARY KEY,
  race_date   DATE,
  region      VARCHAR,
  track       VARCHAR,
  surface     VARCHAR,
  going       VARCHAR,
  distance_m  INTEGER,
  race_class  VARCHAR,
  purse       DOUBLE,
  field_size  INTEGER
);

CREATE TABLE IF NOT EXISTS runners (
  race_id        VARCHAR,
  runner_id      VARCHAR,
  horse_id       VARCHAR,
  jockey_id      VARCHAR,
  trainer_id     VARCHAR,
  draw           INTEGER,
  weight_kg      DOUBLE,
  age            INTEGER,
  sex            VARCHAR,
  odds_decimal   DOUBLE,
  finish_pos     INTEGER,
  finish_time_s  DOUBLE,
  win            INTEGER,
  PRIMARY KEY (race_id, runner_id)
);

CREATE TABLE IF NOT EXISTS odds_snapshots (
  race_id      VARCHAR,
  runner_id    VARCHAR,
  ts           TIMESTAMP,
  odds_decimal DOUBLE,
  traded_vol   DOUBLE,
  PRIMARY KEY (race_id, runner_id, ts)
);
Why DuckDB?
It queries Parquet locally with high performance (predicate pushdown, parallel scans) and avoids running a server database for most use cases.
Why Parquet?
Columnar storage + compression + fast scans. Perfect for large runner-level tables and time-series snapshots.

2) Generate a reproducible simulated dataset

High-quality racing data is often license-restricted. A robust approach is to make every lab runnable on a simulator that matches your canonical schema, then provide adapters to ingest real data only when access is lawful.

Race simulator (R)

# R/simulate.R
library(dplyr)
library(tidyr)

simulate_races <- function(
  n_races = 200, n_horses = 600, n_jockeys = 200, n_trainers = 150,
  min_field = 6, max_field = 14, seed = 1
) {
  set.seed(seed)

  horses  <- tibble(horse_id  = sprintf("H%04d", 1:n_horses),
                    ability   = rnorm(n_horses, 0, 1))
  jockeys <- tibble(jockey_id = sprintf("J%03d", 1:n_jockeys),
                    skill     = rnorm(n_jockeys, 0, 0.4))
  trainers<- tibble(trainer_id= sprintf("T%03d", 1:n_trainers),
                    skill     = rnorm(n_trainers, 0, 0.5))

  surfaces <- c("Turf","Dirt","AllWeather")
  goings   <- c("Firm","Good","Gd-Fm","Gd-Sft","Soft","Heavy","Standard","Fast","Sloppy")

  races <- tibble(
    race_id    = sprintf("R%05d", 1:n_races),
    race_date  = as.Date("2025-01-01") + sample(0:365, n_races, TRUE),
    region     = sample(c("GB","IE","US","AU","FR"), n_races, TRUE),
    track      = sample(c("TRACK_A","TRACK_B","TRACK_C"), n_races, TRUE),
    surface    = sample(surfaces, n_races, TRUE, prob = c(0.45,0.35,0.20)),
    going      = sample(goings, n_races, TRUE),
    distance_m = sample(c(1000,1200,1400,1600,1800,2000,2400,2800), n_races, TRUE),
    race_class = sample(c("G1","G2","G3","HCP","ALW","CLM"), n_races, TRUE),
    purse      = round(exp(rnorm(n_races, log(25000), 0.7))),
    field_size = sample(min_field:max_field, n_races, TRUE)
  )

  runners <- races |>
    rowwise() |>
    do({
      r <- .
      k <- r$field_size

      df <- tibble(
        race_id    = r$race_id,
        runner_id  = paste0(r$race_id, "_", seq_len(k)),
        horse_id   = sample(horses$horse_id, k, FALSE),
        jockey_id  = sample(jockeys$jockey_id, k, TRUE),
        trainer_id = sample(trainers$trainer_id, k, TRUE),
        draw       = sample(1:k, k, FALSE),
        weight_kg  = pmax(45, rnorm(k, 55, 3)),
        age        = sample(2:7, k, TRUE),
        sex        = sample(c("C","F","G","H"), k, TRUE, prob = c(0.2,0.35,0.35,0.1))
      ) |>
        left_join(horses, by="horse_id") |>
        left_join(jockeys, by="jockey_id", suffix=c("","_j")) |>
        left_join(trainers, by="trainer_id", suffix=c("","_t")) |>
        mutate(
          draw_effect    = ifelse(k >= 12, (k + 1 - draw)/k, 0),
          surface_effect = case_when(r$surface=="Turf" ~ 0.15,
                                     r$surface=="AllWeather" ~ 0.05,
                                     TRUE ~ 0),
          weight_penalty = -0.03*(weight_kg - 55),
          utility        = ability + skill + skill_t + 0.12*draw_effect +
                           surface_effect + weight_penalty + rnorm(k, 0, 0.25)
        )

      # Sequential proportional-to-exp(utility) finish ordering
      remaining <- seq_len(k)
      w <- exp(df$utility - max(df$utility))
      order_idx <- integer(0)
      for (pos in seq_len(k)) {
        probs <- w[remaining]/sum(w[remaining])
        pick <- sample(remaining, 1, prob = probs)
        order_idx <- c(order_idx, pick)
        remaining <- setdiff(remaining, pick)
      }
      df$finish_pos <- match(seq_len(k), order_idx)
      df$win <- as.integer(df$finish_pos == 1)

      base_time <- r$distance_m/16
      df$finish_time_s <- base_time - 0.8*df$ability + rnorm(k, 0, 1.2)

      # Market-like odds with an overround
      p <- exp(df$utility - max(df$utility)); p <- p/sum(p)
      overround <- 1.18
      df$odds_decimal <- pmin(200, pmax(1.01, 1/(overround*p) * exp(rnorm(k, 0, 0.06))))

      df |>
        select(race_id, runner_id, horse_id, jockey_id, trainer_id,
               draw, weight_kg, age, sex, odds_decimal,
               finish_pos, finish_time_s, win)
    }) |>
    ungroup()

  list(races = races, runners = runners)
}

sim <- simulate_races(seed = 42)

Quick EDA sanity checks

library(ggplot2)

ggplot(sim$races, aes(field_size)) +
  geom_histogram(bins = 12) +
  labs(x = "Runners per race", y = "Count")

ggplot(sim$runners, aes(odds_decimal)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "Decimal odds (log scale)", y = "Count")

3) Persist to Parquet and query with DuckDB

A clean pattern is: write Parquet (Arrow) → create tables in DuckDB from read_parquet(). This yields a fast, local warehouse without heavyweight infrastructure.

library(arrow)
library(DBI)
library(duckdb)

dir.create("data/derived",   recursive = TRUE, showWarnings = FALSE)
dir.create("data/warehouse", recursive = TRUE, showWarnings = FALSE)  # for the DuckDB file
write_parquet(sim$races,   "data/derived/races.parquet")
write_parquet(sim$runners, "data/derived/runners.parquet")

con <- dbConnect(duckdb(), "data/warehouse/racing.duckdb")
dbExecute(con, "INSTALL parquet; LOAD parquet;")

dbExecute(con, "
  CREATE OR REPLACE TABLE races AS
  SELECT * FROM read_parquet('data/derived/races.parquet');
")

dbExecute(con, "
  CREATE OR REPLACE TABLE runners AS
  SELECT * FROM read_parquet('data/derived/runners.parquet');
")

dbGetQuery(con, "SELECT COUNT(*) AS n_races FROM races;")
dbGetQuery(con, "SELECT COUNT(*) AS n_runners FROM runners;")

dbDisconnect(con, shutdown = TRUE)

Query patterns you will use constantly

-- Average odds and win rate by surface
SELECT
  surface,
  AVG(odds_decimal) AS avg_odds,
  AVG(win)          AS win_rate,
  COUNT(*)          AS n
FROM runners r
JOIN races   x USING (race_id)
GROUP BY surface
ORDER BY n DESC;

-- Within-race normalization of implied probability
WITH base AS (
  SELECT
    race_id,
    runner_id,
    odds_decimal,
    1.0 / odds_decimal AS implied_raw
  FROM runners
)
SELECT
  race_id,
  runner_id,
  implied_raw / SUM(implied_raw) OVER (PARTITION BY race_id) AS implied_p_norm
FROM base;

4) Baselines: probabilistic win modeling + calibration

In racing analytics, your model output is typically a probability. That means evaluation should prioritize proper scoring rules and calibration, not just AUC.

Feature store (leakage-safe)

library(dplyr)

features <- sim$runners |>
  left_join(sim$races, by = "race_id") |>
  transmute(
    race_id, runner_id,
    win,
    # pre-race covariates only
    draw, weight_kg, age, sex,
    surface, going, distance_m, race_class, purse,
    odds_decimal,
    log_odds = log(odds_decimal),
    implied_p_raw = 1/odds_decimal
  )

stopifnot(all(is.finite(features$log_odds)))
stopifnot(all(features$odds_decimal >= 1.01))

tidymodels GLM baseline + log loss / Brier / AUC

library(tidymodels)

set.seed(123)

df <- features |>
  group_by(race_id) |>
  mutate(implied_p = implied_p_raw / sum(implied_p_raw)) |>
  ungroup() |>
  mutate(win = factor(win, levels = c(0,1), labels = c("no","yes")))

spl   <- initial_split(df, strata = win)
train <- training(spl)
test  <- testing(spl)

rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
                distance_m + odds_decimal, data = train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

mod <- logistic_reg() |> set_engine("glm")
wf  <- workflow() |> add_recipe(rec) |> add_model(mod)

fit  <- fit(wf, train)
pred <- predict(fit, test, type = "prob") |>
  bind_cols(test |> select(win))

metric_set(mn_log_loss, brier_class, roc_auc)(
  pred, truth = win, .pred_yes,
  event_level = "second"  # "yes" is the second factor level
)

Calibration plot

library(probably)

# Binned calibration plot
cal_plot_breaks(pred, truth = win, estimate = .pred_yes, num_breaks = 10)
Calibration mindset: a model can be “accurate” in ranking yet systematically overconfident. In betting or pricing contexts, miscalibration is usually more dangerous than a small drop in AUC.

5) Multi-runner outcome modeling: rankings, not just winners

Binary win models throw away a lot of structure. Racing naturally produces a ranking: finish positions across multiple runners. Ranking models let you target that structure directly.

Plackett–Luce (direct ranking likelihood)

library(PlackettLuce)

set.seed(99)
items  <- paste0("H", 1:12)
n_races <- 40
ability <- rnorm(length(items)); names(ability) <- items

Rmat <- matrix(0, nrow = n_races, ncol = length(items))
colnames(Rmat) <- items

for (r in 1:n_races) {
  field <- sample(items, size = sample(6:10, 1), replace = FALSE)
  util  <- ability[field] + rnorm(length(field), 0, 0.3)
  ord   <- field[order(util, decreasing = TRUE)]
  Rmat[r, ord] <- seq_along(ord) # 1 = best; 0 = absent
}

rankings <- as.rankings(Rmat)
pl <- PlackettLuce(rankings)

summary(pl)
coef(pl)  # worth parameters

Bradley–Terry via pairwise decomposition

library(dplyr)
library(BradleyTerry2)

pairwise <- tibble()
for (r in 1:n_races) {
  in_race <- names(which(Rmat[r, ] > 0))
  ranks   <- Rmat[r, in_race]
  for (i in 1:(length(in_race) - 1)) {
    for (j in (i + 1):length(in_race)) {
      h1 <- in_race[i]; h2 <- in_race[j]
      win1 <- as.integer(ranks[h1] < ranks[h2])
      pairwise <- bind_rows(pairwise, tibble(h1=h1, h2=h2, win1=win1, win2=1-win1))
    }
  }
}

agg <- pairwise |>
  group_by(h1, h2) |>
  summarise(win1 = sum(win1), win2 = sum(win2), .groups = "drop") |>
  mutate(
    h1 = factor(h1, levels = items),  # BTm expects factors with shared levels
    h2 = factor(h2, levels = items)
  )

bt <- BTm(cbind(win1, win2), h1, h2, data = agg, id = "horse")
BTabilities(bt)
Interpretation
Both families estimate “ability-like” latent parameters, but they do it with different data representations: full rankings vs aggregated pairwise outcomes.
When rankings matter
If you care about place markets, exactas/quinellas, or modeling finish times/positions beyond “win,” ranking-aware methods often align better with the objective.

6) Hierarchy: GLMMs and Bayesian multilevel models

Racing is fundamentally hierarchical: repeated horses, trainers, jockeys, tracks, regions. Fixed effects can overfit; multilevel models provide shrinkage and better uncertainty accounting.

GLMM with trainer/jockey random effects (lme4)

library(lme4)
library(dplyr)

sim2 <- simulate_races(n_races=500, n_horses=800, n_jockeys=120, n_trainers=100, seed=202)
df2  <- sim2$runners |> left_join(sim2$races, by="race_id")

m_glmm <- glmer(
  win ~ scale(weight_kg) + scale(draw) + surface + going +
    (1 | trainer_id) + (1 | jockey_id),
  data   = df2,
  family = binomial()
)

summary(m_glmm)

Bayesian multilevel model (brms)

library(brms)

priors <- c(
  prior(normal(0, 1), class = "b"),
  prior(exponential(1), class = "sd")
)

fit_bayes <- brm(
  win ~ scale(weight_kg) + scale(draw) + surface + going +
    (1 | trainer_id) + (1 | jockey_id),
  data   = df2,
  family = bernoulli(),
  prior  = priors,
  chains = 2, iter = 1000, cores = 2, seed = 202
)

summary(fit_bayes)
pp_check(fit_bayes)
Why Bayesian here? You get a principled way to encode priors (e.g., smaller effects are more likely than huge ones), propagate uncertainty through derived probabilities, and do posterior predictive checks.

7) Time-to-event: survival and frailty modeling

Some questions in racing are naturally time-to-event: time to first win, time to peak performance, time to retirement, time between races. Survival analysis gives you the right likelihood and handles censoring.

Cox proportional hazards

library(dplyr)
library(survival)

set.seed(1)
n <- 1000

horses <- tibble(
  horse_id  = sprintf("H%04d", 1:n),
  talent    = rnorm(n),
  trainer_id= sample(sprintf("T%03d", 1:80), n, TRUE),
  debut_age = pmax(2, rnorm(n, 3.0, 0.6))
)

base_rate <- 0.002
linpred   <- 0.65 * horses$talent - 0.25 * (horses$debut_age - 3)
rate      <- base_rate * exp(linpred)

time_days   <- rexp(n, rate = rate)
censor_days <- runif(n, 60, 700)

status   <- as.integer(time_days <= censor_days)
time_obs <- pmin(time_days, censor_days)

df_surv <- horses |> mutate(time_days = time_obs, status = status)

cox <- coxph(Surv(time_days, status) ~ scale(talent) + scale(debut_age), data = df_surv)
summary(cox)
cox.zph(cox)

Frailty term (trainer-level heterogeneity)

cox_f <- coxph(
  Surv(time_days, status) ~ scale(talent) + scale(debut_age) + frailty(trainer_id),
  data = df_surv
)

summary(cox_f)

8) Modern ML: XGBoost in tidymodels and mlr3

Gradient boosting can capture non-linearities and interactions without manual feature engineering. But if you don’t evaluate probabilistically (log loss, calibration), you can accidentally ship a model that “ranks well” yet prices poorly.

tidymodels: tuned XGBoost

library(tidymodels)
library(dplyr)

set.seed(314)
sim3 <- simulate_races(n_races = 500, n_horses = 1200, seed = 314)

df3 <- sim3$runners |>
  left_join(sim3$races, by="race_id") |>
  mutate(win = factor(win, levels=c(0,1), labels=c("no","yes")))

spl   <- initial_split(df3, strata = win)
train <- training(spl)
test  <- testing(spl)

rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
               distance_m + odds_decimal, data=train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

folds <- vfold_cv(train, v = 5, strata = win)

xgb_spec <- boost_tree(
  trees = 800,
  tree_depth     = tune(),
  learn_rate     = tune(),
  loss_reduction = tune(),
  sample_size    = tune(),
  mtry           = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

wf <- workflow() |> add_recipe(rec) |> add_model(xgb_spec)

grid <- grid_space_filling(
  tree_depth(),
  learn_rate(),
  loss_reduction(),
  sample_size = sample_prop(),
  mtry(range = c(2, 15)),  # keep within the number of predictors after dummying
  size = 20
)

res <- tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metric_set(mn_log_loss, roc_auc)
)

best <- select_best(res, metric = "mn_log_loss")
final_fit <- finalize_workflow(wf, best) |> fit(train)

pred <- predict(final_fit, test, type="prob") |> bind_cols(test |> select(win))
metric_set(mn_log_loss, brier_class, roc_auc)(
  pred, truth = win, .pred_yes,
  event_level = "second"
)

mlr3: comparable workflow

library(data.table)
library(mlr3)
library(mlr3learners)

set.seed(99)
sim4 <- simulate_races(n_races = 400, n_horses = 900, seed = 99)

df4 <- as.data.table(merge(sim4$runners, sim4$races, by = "race_id"))
df4[, win := factor(ifelse(win == 1, "yes", "no"))]

task <- as_task_classif(df4, target = "win", id = "win_task")
task$set_col_roles(
  c("race_id","runner_id","horse_id","jockey_id","trainer_id"),
  roles = "name"
)

rsmp <- rsmp("cv", folds = 5)

lrn_glm <- lrn("classif.log_reg", predict_type = "prob")
rr_glm  <- resample(task, lrn_glm, rsmp)
rr_glm$aggregate(msr("classif.logloss"))
rr_glm$aggregate(msr("classif.auc"))

lrn_xgb <- lrn("classif.xgboost", predict_type = "prob",
               nrounds = 300, eta = 0.05, max_depth = 4)
rr_xgb <- resample(task, lrn_xgb, rsmp)
rr_xgb$aggregate(msr("classif.logloss"))
rr_xgb$aggregate(msr("classif.auc"))

9) Odds time series: from snapshots to features

Exchange markets produce time-stamped odds and volumes. Treat them as time series and engineer features like drift, volatility, late steam, and liquidity. Keep it honest: market microstructure is noisy, and backtests require strong assumptions.

Simulated snapshots as a tsibble + feature extraction

library(dplyr)
library(tsibble)
library(feasts)
library(ggplot2)

set.seed(1)
snap <- tibble(
  runner_id = "R00001_01",
  ts = as.POSIXct("2025-06-01 14:00:00", tz = "UTC") + seq(0, 60*60, by = 60),
  odds_decimal = pmax(1.01, 6 + cumsum(rnorm(61, 0, 0.05))),
  traded_vol = cumsum(pmax(0, rnorm(61, 10, 3)))
)

tsbl <- snap |> as_tsibble(index = ts, key = runner_id)

feat <- tsbl |>
  features(odds_decimal, list(
    mean  = ~ mean(.),
    sd    = ~ sd(.),
    min   = ~ min(.),
    max   = ~ max(.),
    slope = ~ coef(lm(. ~ seq_along(.)))[2]
  ))

feat

ggplot(tsbl, aes(ts, odds_decimal)) +
  geom_line() +
  labs(x = "Time", y = "Decimal odds")

Schema normalization: snapshots table

# Example transformation skeleton for real exchange data:
# 1) parse raw feed messages
# 2) map to normalized columns: race_id, runner_id, ts, odds_decimal, traded_vol
# 3) write Parquet and load into DuckDB

# write_parquet(odds_snapshots, "data/derived/odds_snapshots.parquet")
# dbExecute(con, "CREATE OR REPLACE TABLE odds_snapshots AS
#                 SELECT * FROM read_parquet('data/derived/odds_snapshots.parquet');")

10) Backtesting as a decision layer (fractional Kelly + guardrails)

Once you have calibrated probabilities, you can add a decision policy. One common framework is fractional Kelly sizing, but it must be capped and stress tested. The same probabilities can produce very different outcomes depending on execution constraints, market impact, limits, and selection filters.

Simulated fractional Kelly example with risk caps

library(dplyr)
library(ggplot2)

sim5 <- simulate_races(n_races = 180, n_horses = 500, seed = 9)
runners <- sim5$runners |>
  group_by(race_id) |>
  mutate(
    p_market = (1/odds_decimal) / sum(1/odds_decimal),
    # toy "model": market + noise (for demonstration)
    p_model  = pmin(0.99, pmax(0.001, p_market + rnorm(n(), 0, 0.02))),
    edge     = p_model - p_market
  ) |>
  ungroup()

fractional_kelly <- function(p, odds_dec, fraction = 0.25, cap = 0.05) {
  b <- odds_dec - 1
  f_star <- (b*p - (1 - p)) / b
  f <- pmax(0, f_star) * fraction
  pmin(f, cap)
}

bankroll <- 1
max_drawdown <- 0.25
peak <- bankroll
hist <- tibble()

for (rid in unique(runners$race_id)) {
  if (bankroll < (1 - max_drawdown) * peak) break  # guardrail: stop

  race <- runners |> filter(race_id == rid)
  pick <- race |> arrange(desc(edge)) |> slice(1)

  if (pick$edge > 0.01 && pick$odds_decimal <= 30) {
    f <- fractional_kelly(pick$p_model, pick$odds_decimal, fraction = 0.25, cap = 0.05)
    stake <- bankroll * f
    pnl <- if (pick$win == 1) stake * (pick$odds_decimal - 1) else -stake
    bankroll <- bankroll + pnl
    peak <- max(peak, bankroll)
  }

  hist <- bind_rows(hist, tibble(race_id = rid, bankroll = bankroll, peak = peak))
}

hist <- hist |> mutate(drawdown = 1 - bankroll/peak)

ggplot(hist, aes(seq_along(bankroll), bankroll)) + geom_line() +
  labs(x = "Race index", y = "Bankroll")

ggplot(hist, aes(seq_along(drawdown), drawdown)) + geom_line() +
  labs(x = "Race index", y = "Drawdown")
Practical warning: this example is intentionally simplified. Real backtests require: commissions, slippage, bet limits, late price changes, liquidity constraints, and selection bias checks.
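As one concrete example of those adjustments, exchange commission can be folded into a settlement function (the 5% rate is illustrative):

```r
# Net P&L for a back bet on an exchange: commission applies to net winnings only
settle_back_bet <- function(stake, odds_decimal, won, commission = 0.05) {
  if (won) stake * (odds_decimal - 1) * (1 - commission) else -stake
}

settle_back_bet(stake = 10, odds_decimal = 3.0, won = TRUE)
settle_back_bet(stake = 10, odds_decimal = 3.0, won = FALSE)
```

Swapping the raw `pnl` line in the loop above for a settlement function like this makes commission and slippage scenarios much easier to stress test later.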

11) Deployment: versioned models + HTTP prediction API

If you want your work to be reusable, build a deployable artifact. A clean R-native pattern is: pins for versioning + vetiver for model packaging + plumber for the API server.

library(tidymodels)
library(pins)
library(vetiver)
library(plumber)
library(dplyr)

sim6 <- simulate_races(n_races = 250, n_horses = 700, seed = 101)

df6 <- sim6$runners |>
  left_join(sim6$races, by="race_id") |>
  mutate(win = factor(win, levels=c(0,1), labels=c("no","yes")))

rec <- recipe(win ~ draw + weight_kg + age + sex + surface + going +
                distance_m + odds_decimal, data = df6) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

fit <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  fit(df6)

board <- board_folder("pins", versioned = TRUE)
v <- vetiver_model(fit, model_name = "win_prob_glm")

vetiver_pin_write(board, v)

api <- pr() |> vetiver_api(v)

# Run locally:
# api |> pr_run(host = "0.0.0.0", port = 8080)

api

Minimal request payload shape (JSON)

{
  "draw": 7,
  "weight_kg": 55.5,
  "age": 4,
  "sex": "G",
  "surface": "Turf",
  "going": "Good",
  "distance_m": 1600,
  "odds_decimal": 6.2
}

12) Reproducibility mechanics: renv + deterministic builds

Reproducibility is not vibes. Lock dependencies, record versions, and standardize builds. This is especially important when your analytics becomes a book, a report, or a product artifact.

renv workflow

# Initialize (once)
# renv::init()

# Snapshot current library into renv.lock
renv::snapshot()

# Restore exact package versions from renv.lock
renv::restore()

Dockerfile for deterministic PDF builds (Quarto)

FROM rocker/verse:4.4.2
WORKDIR /book

COPY renv.lock renv.lock
COPY renv/ renv/
COPY .Rprofile .Rprofile

RUN R -q -e "install.packages('renv'); renv::restore(prompt = FALSE)"

COPY . .
CMD ["quarto", "render", "book", "--to", "pdf"]

CI skeleton (GitHub Actions)

name: Render PDF
on:
  push:
    branches: [ main ]
  workflow_dispatch:

jobs:
  pdf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2
      - name: Render Quarto book
        run: quarto render book --to pdf

13) An “engineering checklist” you can actually follow

Data contract
  • Canonical schema
  • Parquet + DuckDB
  • Adapters, not redistribution
Modeling
  • GLM/GLMM baselines
  • Ranking likelihoods
  • Bayesian uncertainty
Evaluation
  • Log loss / Brier
  • Calibration curves
  • Stress-tested backtests
Time dependence
  • Odds snapshots
  • Drift/volatility features
  • Liquidity constraints
Decision layer
  • Explicit assumptions
  • Fractional Kelly caps
  • Drawdown guardrails
Deployment
  • pins versioning
  • vetiver packaging
  • plumber API

One more code snippet: within-race probability normalization

Many racing tables store “odds” per runner, but for inference you often want a normalized probability within a field. Here’s a standard pattern that also highlights overround issues.

library(dplyr)

normalize_field_probs <- function(df, odds_col = odds_decimal) {
  df |>
    group_by(race_id) |>
    mutate(
      implied_raw = 1 / {{ odds_col }},
      overround   = sum(implied_raw),
      p_market    = implied_raw / overround
    ) |>
    ungroup()
}

df_norm <- normalize_field_probs(sim$runners)

df_norm |>
  summarise(
    avg_overround = mean(overround),
    min_overround = min(overround),
    max_overround = max(overround)
  )
Interpretation: overround > 1 implies the market’s implied probabilities sum above 1 (built-in margin). Separating “market implied probability” from “model probability” is essential if you want honest evaluation.
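Proportional normalization, as above, spreads the margin evenly across runners. A common alternative is the "power" method, which pushes more of the margin onto longshots — a minimal sketch (the odds vector is illustrative):

```r
# Power method: p_i = (1/odds_i)^k, with k solved so the probabilities sum to 1
power_method_probs <- function(odds) {
  implied <- 1 / odds
  k <- uniroot(function(k) sum(implied^k) - 1,
               interval = c(0.5, 3), tol = 1e-10)$root
  implied^k
}

odds  <- c(2.0, 3.5, 6.0, 11.0)  # field with an overround
p_pow <- power_method_probs(odds)

sum(1 / odds)  # raw implied probabilities sum above 1
sum(p_pow)     # normalized probabilities sum to 1
```

Comparing `p_pow` against the proportional `p_market` on real fields is a quick way to see how sensitive your edge estimates are to the margin-removal assumption.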

The post Quantitative Horse Racing with R: Calibration, Backtesting, and Deployment appeared first on R Programming Books.

]]>
https://rprogrammingbooks.com/quantitative-horse-racing-r-calibration-backtesting-deployment/feed/ 0
Machine Learning for Sports Analytics in R: A Complete Professional Guide https://rprogrammingbooks.com/machine-learning-sports-analytics-r/?utm_source=rss&utm_medium=rss&utm_campaign=machine-learning-sports-analytics-r https://rprogrammingbooks.com/machine-learning-sports-analytics-r/#respond Sun, 15 Feb 2026 19:12:15 +0000 https://rprogrammingbooks.com/?p=2448 Table of Contents 1. Introduction to Machine Learning in Sports Analytics Machine Learning has transformed modern sports analytics. What was once limited to box scores and descriptive statistics has evolved into predictive modeling, simulation systems, optimization engines, and automated scouting pipelines. Today, teams, analysts, researchers, and performance departments rely on machine learning to gain measurable […]

The post Machine Learning for Sports Analytics in R: A Complete Professional Guide appeared first on R Programming Books.

]]>
Table of Contents
  1. Introduction to Machine Learning in Sports Analytics
  2. Why Use R for Sports Machine Learning?
  3. End-to-End Machine Learning Workflow
  4. Sports Data Collection and Sources
  5. Feature Engineering for Sports Models
  6. Train/Test Split
  7. Baseline Model: Logistic Regression
  8. Random Forest Model
  9. Gradient Boosting with XGBoost
  10. Model Evaluation Metrics
  11. Time-Aware Modeling
  12. Advanced Topics
  13. Deployment
  14. Conclusion

1. Introduction to Machine Learning in Sports Analytics

Machine Learning has transformed modern sports analytics. What was once limited to box scores and descriptive statistics has evolved into predictive modeling, simulation systems, optimization engines, and automated scouting pipelines. Today, teams, analysts, researchers, and performance departments rely on machine learning to gain measurable competitive advantages.

In sports environments, machine learning models are commonly used to:

  • Predict match outcomes and win probabilities
  • Estimate player performance trajectories
  • Model scoring or serve probabilities
  • Quantify tactical efficiency
  • Detect undervalued players in recruitment markets
  • Simulate season scenarios and tournament paths

This guide provides a complete professional workflow in R, covering the entire machine learning lifecycle from data preprocessing to advanced ensemble modeling and evaluation.

2. Why Use R for Sports Machine Learning?

R remains one of the strongest ecosystems for statistical computing and sports analytics research. Its advantages include:

  • Deep statistical foundations
  • Reproducible research workflows
  • Powerful visualization capabilities
  • Comprehensive modeling libraries
  • Strong adoption in academic sports science

install.packages(c(
  "tidyverse",
  "caret",
  "tidymodels",
  "randomForest",
  "xgboost",
  "pROC",
  "yardstick",
  "vip",
  "glmnet",
  "zoo"
))

library(tidyverse)
library(caret)
library(tidymodels)
library(randomForest)
library(xgboost)
library(pROC)
library(yardstick)
library(vip)
library(glmnet)
library(zoo)

3. End-to-End Machine Learning Workflow

A robust sports ML workflow includes:

  1. Data acquisition
  2. Cleaning and preprocessing
  3. Feature engineering
  4. Train/test splitting
  5. Baseline modeling
  6. Advanced ensemble modeling
  7. Evaluation and validation
  8. Interpretability
  9. Deployment

4. Sports Data Collection and Sources

Sports datasets may include match-level data, play-by-play event data, tracking coordinates, physiological metrics, and contextual features.


set.seed(123)

n <- 6000

sports_data <- tibble(
  home_rating = rnorm(n, 1500, 120),
  away_rating = rnorm(n, 1500, 120),
  home_form = rnorm(n, 0.5, 0.1),
  away_form = rnorm(n, 0.5, 0.1),
  home_shots = rpois(n, 14),
  away_shots = rpois(n, 11),
  home_possession = rnorm(n, 0.55, 0.05),
  away_possession = rnorm(n, 0.45, 0.05)
) %>%
  mutate(
    rating_diff = home_rating - away_rating,
    form_diff = home_form - away_form,
    shot_diff = home_shots - away_shots,
    possession_diff = home_possession - away_possession,
    home_win = ifelse(
      0.004 * rating_diff +
      2.5 * form_diff +
      0.08 * shot_diff +
      2 * possession_diff +
      rnorm(n, 0, 1) > 0,
      1, 0
    )
  )

sports_data$home_win <- as.factor(sports_data$home_win)

5. Feature Engineering for Sports Models

In sports analytics, relative metrics often outperform raw metrics. Differences between teams or players are typically more informative.


sports_data <- sports_data %>%
  mutate(
    momentum_index = 0.6 * form_diff + 0.4 * shot_diff,
    dominance_score = rating_diff * 0.5 + possession_diff * 100
  )

6. Train/Test Split


set.seed(42)

train_index <- createDataPartition(
  sports_data$home_win,
  p = 0.8,
  list = FALSE
)

train_data <- sports_data[train_index, ]
test_data  <- sports_data[-train_index, ]

7. Baseline Model: Logistic Regression


log_model <- glm(
  home_win ~ rating_diff + form_diff +
             shot_diff + possession_diff +
             momentum_index,
  data = train_data,
  family = binomial
)

summary(log_model)

log_probs <- predict(log_model, test_data, type = "response")
log_preds <- ifelse(log_probs > 0.5, 1, 0)

confusionMatrix(
  as.factor(log_preds),
  test_data$home_win,
  positive = "1"  # report sensitivity/specificity for home wins
)

8. Random Forest Model


rf_model <- randomForest(
  home_win ~ rating_diff + form_diff +
             shot_diff + possession_diff +
             momentum_index + dominance_score,
  data = train_data,
  ntree = 600,
  mtry = 3,
  importance = TRUE
)

rf_preds <- predict(rf_model, test_data)

confusionMatrix(rf_preds, test_data$home_win, positive = "1")

varImpPlot(rf_model)

9. Gradient Boosting with XGBoost


train_matrix <- model.matrix(
  home_win ~ rating_diff + form_diff +
              shot_diff + possession_diff +
              momentum_index + dominance_score,
  train_data
)[, -1]

test_matrix <- model.matrix(
  home_win ~ rating_diff + form_diff +
              shot_diff + possession_diff +
              momentum_index + dominance_score,
  test_data
)[, -1]

dtrain <- xgb.DMatrix(
  data = train_matrix,
  label = as.numeric(train_data$home_win) - 1
)

dtest <- xgb.DMatrix(
  data = test_matrix,
  label = as.numeric(test_data$home_win) - 1
)

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 5,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8
)

xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 350,
  verbose = 0
)

xgb_preds <- predict(xgb_model, dtest)

roc_obj <- roc(as.numeric(test_data$home_win), xgb_preds)
auc(roc_obj)

10. Model Evaluation Metrics

Choosing appropriate metrics is essential in sports modeling. Accuracy alone is rarely sufficient.


xgb_class <- factor(ifelse(xgb_preds > 0.5, 1, 0), levels = c(0, 1))

eval_tbl <- tibble(
  truth    = test_data$home_win,
  estimate = xgb_class
)

cls_metrics <- metric_set(accuracy, precision, recall, f_meas)
cls_metrics(eval_tbl, truth = truth, estimate = estimate, event_level = "second")

11. Time-Aware Modeling


# With real data, order rows by match date before computing rolling features.
# The simulated data has no date column, so row order stands in for time here.
sports_data <- sports_data %>%
  mutate(
    rolling_form = rollmean(form_diff, k = 5, fill = NA, align = "right")
  )

12. Advanced Topics

  • Neural Networks with keras
  • Player clustering
  • Expected goals modeling
  • Bayesian hierarchical models
  • Simulation-based forecasting

13. Deployment

Models can be deployed as Shiny dashboards, automated pipelines, or plumber APIs for real-time prediction systems.
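As an illustration of the plumber route, here is a sketch of an endpoint file. The file name, route, feature set, and saved-model path are all illustrative; it assumes a classification forest was saved beforehand with `saveRDS(rf_model, "rf_model.rds")` and, for brevity, that it uses just two of the features:

```r
# plumber.R — sketch of a real-time prediction endpoint (names are illustrative)
library(plumber)
library(randomForest)

model <- readRDS("rf_model.rds")  # hypothetical saved model

#* Predict the home-win probability for one match
#* @param rating_diff:numeric
#* @param form_diff:numeric
#* @post /predict
function(rating_diff, form_diff) {
  newdata <- data.frame(
    rating_diff = as.numeric(rating_diff),
    form_diff   = as.numeric(form_diff)
  )
  # return the probability of class "1" (home win)
  list(p_home_win = unname(predict(model, newdata, type = "prob")[, "1"]))
}

# Serve with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
```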

14. Conclusion

Machine Learning in R offers a rigorous and flexible framework for sports analytics applications. By combining strong statistical foundations with modern ensemble methods, analysts can generate reliable predictive systems adaptable to multiple sports contexts.

If you want to go deeper into structured sports analytics modeling in R, including advanced case studies, simulation frameworks, and sport-specific implementations, you can explore specialized resources below.

Explore Sports Analytics Programming Books in R

The post Machine Learning for Sports Analytics in R: A Complete Professional Guide appeared first on R Programming Books.

How to Predict Sports in R: Elo, Monte Carlo, and Real Simulations https://rprogrammingbooks.com/predict-sports-in-r-elo-monte-carlo-simulations/?utm_source=rss&utm_medium=rss&utm_campaign=predict-sports-in-r-elo-monte-carlo-simulations https://rprogrammingbooks.com/predict-sports-in-r-elo-monte-carlo-simulations/#respond Fri, 06 Feb 2026 20:20:13 +0000 https://rprogrammingbooks.com/?p=2431 R • Sports Analytics • Ratings • Monte Carlo • Forecasting Sports are noisy. Teams change. Injuries happen. Upsets happen. But uncertainty is not the enemy—it’s the input. In this hands-on guide you’ll build a practical sports prediction workflow in R using tidyverse, PlayerRatings, and NFLSimulatoR, then connect ratings to Monte Carlo simulations and forecasting […]

The post How to Predict Sports in R: Elo, Monte Carlo, and Real Simulations appeared first on R Programming Books.


R • Sports Analytics • Ratings • Monte Carlo • Forecasting

Sports are noisy. Teams change. Injuries happen. Upsets happen. But uncertainty is not the enemy—it’s the input. In this hands-on guide you’ll build a practical sports prediction workflow in R using tidyverse, PlayerRatings, and NFLSimulatoR, then connect ratings to Monte Carlo simulations and forecasting models (logistic regression & Poisson).

  • Goal: probabilities, not “picks”
  • Core tools: ratings + models + simulation
  • Output: reproducible code & scenarios

Everything below is meant to be copied into an R script and run. Swap the toy data for your league’s data (football/soccer, basketball, American football, tennis, baseball, esports—you name it).

1) Setup: a reproducible R project

Prediction work gets messy fast if you can’t reproduce it. Start every project with: (1) a clean package setup, (2) a fixed random seed, (3) tidy data conventions.

# Install (run once)
install.packages(c("tidyverse", "lubridate", "PlayerRatings"))

# If you plan to use NFLSimulatoR (CRAN/GitHub availability varies):
# install.packages("NFLSimulatoR") 
# or, if needed:
# install.packages("remotes"); remotes::install_github(".../NFLSimulatoR")

library(tidyverse)
library(lubridate)
library(PlayerRatings)

set.seed(2026)  # reproducibility: simulation + model splits will match

Rule of thumb: if your analysis can’t be rerun end-to-end from a fresh session, it’s not ready for decisions (betting, scouting, or strategy).

2) A tiny dataset (so we can focus on methods)

We’ll build a compact match table with team names, dates, and outcomes. You can replace this with real data later (e.g., football leagues, NBA games, tennis matches).

# Create toy match results
teams <- c("Lions", "Tigers", "Bears", "Wolves", "Hawks", "Sharks")

n_games <- 120

matches <- tibble(
  date  = sample(seq.Date(as.Date("2025-08-01"), as.Date("2026-01-31"), by = "day"), n_games, replace = TRUE),
  home  = sample(teams, n_games, replace = TRUE),
  away  = sample(teams, n_games, replace = TRUE)
) %>%
  filter(home != away) %>%
  arrange(date) %>%
  mutate(
    # "true" (hidden) team strengths to generate outcomes (toy world)
    home_strength = recode(home,
      "Lions"=0.6,"Tigers"=0.5,"Bears"=0.4,"Wolves"=0.2,"Hawks"=0.1,"Sharks"=0.0),
    away_strength = recode(away,
      "Lions"=0.6,"Tigers"=0.5,"Bears"=0.4,"Wolves"=0.2,"Hawks"=0.1,"Sharks"=0.0),

    # Convert strength difference into a win probability
    p_home = plogis(0.4 + 2.0*(home_strength - away_strength)),  # home advantage + skill gap
    home_win = rbinom(n(), size = 1, prob = p_home)
  ) %>%
  select(date, home, away, home_win, p_home)

matches %>% slice(1:8)

This is intentionally minimal: just enough to demonstrate Elo/Glicko, win probabilities, simulation, and modeling.

3) Elo ratings with PlayerRatings

Elo is a rating system that updates after each game. It’s not “magic”—it’s a controlled way to: (1) estimate team strength from history, and (2) turn rating gaps into expected win probabilities.
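Before handing things to the package, it helps to see the update rule itself. A minimal from-scratch version of the standard Elo update (the package automates this over a whole match history):

```r
# One-game Elo update: expected score from the base-10 logistic curve,
# then move each rating toward the observed result at speed K.
elo_update <- function(r_a, r_b, score_a, k = 20, scale = 400) {
  e_a <- 1 / (1 + 10^((r_b - r_a) / scale))  # expected score for player A
  c(a = r_a + k * (score_a - e_a),
    b = r_b + k * ((1 - score_a) - (1 - e_a)))
}

elo_update(1500, 1500, score_a = 1)  # equal ratings: winner gains k/2 = 10 points
```

Note that rating points are conserved: whatever A gains, B loses, so the system's total rating never drifts.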

3.1 Prepare the input format

# PlayerRatings expects a four-column data frame: a numeric time period,
# player one, player two, and the result (1 = player1 win, 0.5 = draw, 0 = player2 win).
# We'll represent teams as "players" and each game as a head-to-head match.

elo_input <- matches %>%
  transmute(
    Date = as.numeric(date),  # PlayerRatings wants a numeric time index, not a Date
    Player1 = home,
    Player2 = away,
    Result = if_else(home_win == 1, 1, 0)
  )

head(elo_input)

3.2 Fit Elo

# Run Elo with a K-factor (update speed). Tune K by sport and data volume.
elo_fit <- elo(elo_input, kfac = 20)  # the K-factor argument is `kfac`

# Ratings after processing all games
elo_ratings <- as_tibble(elo_fit$ratings) %>%   # $ratings has Player and Rating columns
  select(team = Player, elo = Rating) %>%
  arrange(desc(elo))

elo_ratings

Why Elo is useful: it’s lightweight, interpretable, and often surprisingly competitive as a baseline. Use it to sanity-check more complex models.

4) Glicko ratings (when uncertainty matters)

Elo tells you “strength.” Glicko tells you “strength + uncertainty.” Early season? New team? Few games played? Glicko’s rating deviation helps you avoid false confidence.

# glicko() takes the same four-column input as elo(); its Deviation column
# quantifies rating uncertainty (new or rarely seen teams get wide deviations).

glicko_input <- elo_input  # same structure works for many head-to-head methods

glicko_fit <- glicko(glicko_input)

glicko_ratings <- as_tibble(glicko_fit$ratings) %>%
  select(team = Player, glicko = Rating, deviation = Deviation) %>%
  arrange(desc(glicko))

glicko_ratings

Heads-up: package APIs can change. If your local PlayerRatings version doesn’t expose a direct glicko() helper, you can still do uncertainty-aware prediction by combining ratings with a probabilistic model (logistic regression/Bayesian) and then simulating forward.

5) Turning ratings into win probabilities

Ratings become actionable once you map them to probabilities. The classic Elo expectation uses a logistic curve. You can also learn this mapping empirically (recommended) by fitting a model on past games.

5.1 Join Elo back into match rows

matches_w_elo <- matches %>%
  left_join(elo_ratings %>% rename(home_elo = elo), by = c("home" = "team")) %>%
  left_join(elo_ratings %>% rename(away_elo = elo), by = c("away" = "team")) %>%
  mutate(elo_diff = home_elo - away_elo)

matches_w_elo %>% select(date, home, away, home_win, home_elo, away_elo, elo_diff) %>% slice(1:10)

5.2 A simple Elo-to-probability mapping

# Classic Elo expected score mapping (base-10 logistic).
# "scale" controls steepness. For chess Elo it's 400 by convention.
elo_prob <- function(diff, scale = 400) {
  1 / (1 + 10^(-diff / scale))
}

matches_w_elo <- matches_w_elo %>%
  mutate(p_home_elo = elo_prob(elo_diff, scale = 250))  # scale tuned for our toy world

matches_w_elo %>%
  summarize(
    avg_p = mean(p_home_elo),
    brier = mean((home_win - p_home_elo)^2)
  )

Elo gives a baseline probability. Next we’ll simulate seasons, then we’ll improve the mapping using regression models.

6) Monte Carlo: simulate a season & playoff odds

Monte Carlo simulation is how you turn game probabilities into season-level questions: “What are the chances we finish top-2?” “How often do we make playoffs?” “What is the distribution of wins?”

6.1 Build a future schedule

# Create a "future" schedule (e.g., the next 60 games)
n_future <- 60
future <- tibble(
  date = seq.Date(as.Date("2026-02-01"), by = "day", length.out = n_future),
  home = sample(teams, n_future, replace = TRUE),
  away = sample(teams, n_future, replace = TRUE)
) %>%
  filter(home != away) %>%
  left_join(elo_ratings %>% rename(home_elo = elo), by = c("home" = "team")) %>%
  left_join(elo_ratings %>% rename(away_elo = elo), by = c("away" = "team")) %>%
  mutate(
    elo_diff = home_elo - away_elo,
    p_home = elo_prob(elo_diff, scale = 250)
  )

future %>% slice(1:8)

6.2 Simulate the remaining season many times

simulate_season <- function(future_games, n_sims = 5000) {

  # We'll simulate all games n_sims times, then compute win totals.
  sims <- replicate(n_sims, {
    future_games %>%
      mutate(home_win = rbinom(n(), 1, p_home)) %>%
      transmute(
        home, away,
        home_w = home_win,
        away_w = 1 - home_win
      ) %>%
      pivot_longer(cols = c(home, away), names_to = "side", values_to = "team") %>%
      mutate(win = if_else(side == "home", home_w, away_w)) %>%
      group_by(team) %>%
      summarize(wins = sum(win), .groups = "drop")
  }, simplify = FALSE)

  bind_rows(sims, .id = "sim") %>%
    mutate(sim = as.integer(sim))
}

season_sims <- simulate_season(future, n_sims = 3000)

# Win distribution by team
win_dist <- season_sims %>%
  group_by(sim) %>%
  mutate(pos = rank(-wins, ties.method = "random")) %>%  # rank teams within each simulated season
  group_by(team) %>%
  summarize(
    avg_wins = mean(wins),
    p_top2 = mean(pos <= 2),
    p_top3 = mean(pos <= 3),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_wins))

win_dist

Interpretation tip: simulation answers “distribution” questions (ranges, tails, chances). That’s exactly what decision-makers care about.
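Those distribution questions reduce to quantiles and tail means over the simulated draws. A self-contained illustration with a stand-in vector of win totals (in practice you would use one team's `wins` column from `season_sims`):

```r
# Stand-in for one team's simulated season win totals (toy numbers)
set.seed(99)
sim_wins <- rbinom(3000, size = 20, prob = 0.55)

quantile(sim_wins, c(0.05, 0.5, 0.95))  # likely range of season wins
mean(sim_wins >= 14)                    # tail question: chance of 14+ wins
```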

7) “What-if” analysis: injuries, transfers, roster changes

Counterfactuals are where modeling becomes practical: you modify team strength assumptions and see how outcomes shift. For Elo-based pipelines, this is as easy as adding/subtracting rating points.

# Example: star injury reduces Lions by 60 Elo points for the rest of the season
adjusted_ratings <- elo_ratings %>%
  mutate(elo_adj = if_else(team == "Lions", elo - 60, elo))

future_counterfactual <- future %>%
  select(date, home, away) %>%
  left_join(adjusted_ratings %>% select(team, home_elo = elo_adj), by = c("home" = "team")) %>%
  left_join(adjusted_ratings %>% select(team, away_elo = elo_adj), by = c("away" = "team")) %>%
  mutate(
    elo_diff = home_elo - away_elo,
    p_home = elo_prob(elo_diff, scale = 250)
  )

baseline <- simulate_season(future, n_sims = 2000) %>% mutate(scenario = "baseline")
injury   <- simulate_season(future_counterfactual, n_sims = 2000) %>% mutate(scenario = "lions_injury")

compare <- bind_rows(baseline, injury) %>%
  group_by(scenario, team) %>%
  summarize(avg_wins = mean(wins), .groups = "drop") %>%
  pivot_wider(names_from = scenario, values_from = avg_wins) %>%
  mutate(delta = lions_injury - baseline) %>%
  arrange(delta)

compare

The same idea works for transfers, coaching changes, or “rest days”—anything you can reasonably encode as a strength shift.

8) Poisson models for scorelines

When you predict scores (not just win/loss), Poisson models are a strong baseline—especially in low-scoring sports. The classic setup models goals scored as Poisson with attack/defense effects and home advantage.

8.1 Generate toy score data

# Create toy goals (home_goals, away_goals) driven by hidden strengths.
# This mimics soccer/hockey-like scoring.

scores <- matches %>%
  mutate(
    # re-create the hidden strengths (they were dropped from `matches` earlier)
    home_strength = recode(home,
      "Lions"=0.6,"Tigers"=0.5,"Bears"=0.4,"Wolves"=0.2,"Hawks"=0.1,"Sharks"=0.0),
    away_strength = recode(away,
      "Lions"=0.6,"Tigers"=0.5,"Bears"=0.4,"Wolves"=0.2,"Hawks"=0.1,"Sharks"=0.0),
    lambda_home = exp(-0.2 + 1.2*home_strength - 0.9*away_strength + 0.25),  # +0.25 = home advantage
    lambda_away = exp(-0.2 + 1.2*away_strength - 0.9*home_strength),
    home_goals = rpois(n(), lambda_home),
    away_goals = rpois(n(), lambda_away),
    home_win = as.integer(home_goals > away_goals),
    draw = as.integer(home_goals == away_goals)
  ) %>%
  select(date, home, away, home_goals, away_goals, home_win, draw)

scores %>% slice(1:8)

8.2 Fit Poisson GLMs

# Two-model approach: predict home_goals and away_goals separately.
# Features: team identifiers (as factors) can capture attack/defense proxies in simple form.

po_home <- glm(home_goals ~ home + away, data = scores, family = poisson())
po_away <- glm(away_goals ~ home + away, data = scores, family = poisson())

# Predict expected goals for upcoming games
future_xg <- future %>%
  select(date, home, away) %>%
  mutate(
    xg_home = predict(po_home, newdata = ., type = "response"),
    xg_away = predict(po_away, newdata = ., type = "response")
  )

future_xg %>% slice(1:8)

8.3 Convert expected goals into win probabilities

# Approximate match outcome probabilities by simulating scorelines from the Poisson rates.
scoreline_probs <- function(lambda_home, lambda_away, n = 20000) {
  hg <- rpois(n, lambda_home)
  ag <- rpois(n, lambda_away)
  tibble(
    p_home_win = mean(hg > ag),
    p_draw     = mean(hg == ag),
    p_away_win = mean(hg < ag)
  )
}

example_probs <- scoreline_probs(future_xg$xg_home[1], future_xg$xg_away[1], n = 30000)
example_probs

Why this works: if you can forecast scoring rates, you can forecast any downstream event: win, draw, total goals, both-teams-to-score, etc.
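Because the two Poisson rates are treated as independent, you don't actually need to simulate: the outcome probabilities can be computed exactly by summing `dpois()` products over a score grid (truncated at a maximum goal count). A sketch of that alternative to `scoreline_probs()`:

```r
# Exact win/draw/loss probabilities from independent Poisson scoring rates.
exact_outcome_probs <- function(lambda_home, lambda_away, max_goals = 15) {
  # grid[i, j] = P(home scores i-1) * P(away scores j-1)
  grid <- outer(dpois(0:max_goals, lambda_home),
                dpois(0:max_goals, lambda_away))
  hg <- row(grid) - 1
  ag <- col(grid) - 1
  c(p_home_win = sum(grid[hg > ag]),
    p_draw     = sum(grid[hg == ag]),
    p_away_win = sum(grid[hg < ag]))
}

exact_outcome_probs(1.6, 1.1)  # the three probabilities sum to ~1 (up to truncation)
```

For typical football-like rates, the mass beyond 15 goals is negligible, so the truncation error is tiny; it also removes the Monte Carlo noise of the sampling version.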

9) Logistic regression for match outcomes

A clean upgrade from “raw Elo mapping” is to fit a logistic regression that learns how rating differences translate to win probability—optionally adding context features (rest, travel, injuries, lineup quality).

# Train on historical matches: outcome ~ elo_diff
train <- matches_w_elo %>% drop_na(home_elo, away_elo)

logit <- glm(home_win ~ elo_diff, data = train, family = binomial())
summary(logit)

# Predict probabilities for future games using the fitted mapping
future_logit <- future %>%
  left_join(elo_ratings %>% rename(home_elo = elo), by = c("home" = "team")) %>%
  left_join(elo_ratings %>% rename(away_elo = elo), by = c("away" = "team")) %>%
  mutate(
    elo_diff = home_elo - away_elo,
    p_home = predict(logit, newdata = ., type = "response")
  )

future_logit %>% slice(1:8)

Logistic regression also makes it easy to incorporate additional predictors beyond ratings.

10) Using NFLSimulatoR as a simulator backbone

If you’re working on American football specifically, a purpose-built simulator can accelerate your pipeline: schedule generation, game simulation, season simulation, and standings logic.

Note: The exact function names can differ depending on the NFLSimulatoR version you have. The pattern below shows a practical structure you can adapt: prepare team strengths → simulate games → aggregate standings → repeat.

# library(NFLSimulatoR)

# 1) Define (or estimate) team strength inputs
team_strengths <- elo_ratings %>% select(team, elo)

# 2) Build or import a schedule (week, home, away)
# nfl_schedule <- NFLSimulatoR::load_schedule(season = 2026)
# or your own:
nfl_schedule <- tibble(
  week = rep(1:10, each = 3),
  home = sample(teams, 30, replace = TRUE),
  away = sample(teams, 30, replace = TRUE)
) %>% filter(home != away)

# 3) Map strengths to win probs (Elo-based or model-based)
nfl_games <- nfl_schedule %>%
  left_join(team_strengths %>% rename(home_elo = elo), by = c("home" = "team")) %>%
  left_join(team_strengths %>% rename(away_elo = elo), by = c("away" = "team")) %>%
  mutate(
    elo_diff = home_elo - away_elo,
    p_home = elo_prob(elo_diff, scale = 250)
  )

# 4) Simulate a season many times
simulate_nfl_like <- function(games, n_sims = 5000) {
  sims <- replicate(n_sims, {
    games %>%
      mutate(home_win = rbinom(n(), 1, p_home)) %>%
      transmute(home, away, home_w = home_win, away_w = 1 - home_win) %>%
      pivot_longer(cols = c(home, away), names_to = "side", values_to = "team") %>%
      mutate(win = if_else(side == "home", home_w, away_w)) %>%
      group_by(team) %>%
      summarize(wins = sum(win), .groups = "drop")
  }, simplify = FALSE)

  bind_rows(sims, .id = "sim") %>%
    mutate(sim = as.integer(sim))
}

nfl_sims <- simulate_nfl_like(nfl_games, n_sims = 3000)

nfl_summary <- nfl_sims %>%
  group_by(team) %>%
  summarize(
    avg_wins = mean(wins),
    p_10plus = mean(wins >= 10),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_wins))

nfl_summary

Key idea: you don’t need a “perfect” game model to get useful season-level probabilities. A reasonable win-probability engine + Monte Carlo is already powerful.

11) Evaluation: calibration + proper scoring

In prediction, accuracy alone is not enough. You want probabilities that are well-calibrated and score well under proper scoring rules (Brier score, log loss).

11.1 Brier score

brier_score <- function(y, p) mean((y - p)^2)

# Compare Elo mapping vs logistic regression mapping (on the same historical set)
eval_tbl <- train %>%
  mutate(
    p_elo   = elo_prob(elo_diff, scale = 250),
    p_logit = predict(logit, newdata = ., type = "response")
  ) %>%
  summarize(
    brier_elo = brier_score(home_win, p_elo),
    brier_logit = brier_score(home_win, p_logit)
  )

eval_tbl
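Log loss is the other common proper scoring rule mentioned above; it punishes confident wrong predictions much more harshly than the Brier score. A small helper (the `eps` clamp guards against `log(0)`):

```r
# Log loss (negative mean log-likelihood of the observed outcomes)
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clamp probabilities away from 0 and 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

log_loss(c(1, 0, 1), c(0.9, 0.2, 0.6))  # ≈ 0.28
```

You can drop it into the `eval_tbl` comparison alongside the Brier score; models that are well-ranked but overconfident will show it in log loss first.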

11.2 Calibration table (quick check)

calibration_table <- function(y, p, bins = 10) {
  tibble(y = y, p = p) %>%
    mutate(bin = ntile(p, bins)) %>%
    group_by(bin) %>%
    summarize(
      p_avg = mean(p),
      y_avg = mean(y),
      n = n(),
      .groups = "drop"
    )
}

calib <- train %>%
  mutate(p_logit = predict(logit, newdata = ., type = "response")) %>%
  calibration_table(y = home_win, p = p_logit, bins = 10)

calib

Interpretation: In each probability bin, the average predicted probability should roughly match the observed win rate. If not, your model is miscalibrated (common—and fixable).

12) Wrap-up: a practical workflow you can reuse

Here’s the repeatable blueprint you just built:

  1. Ingest results into a tidy match table (one row per game).
  2. Estimate strengths (Elo/Glicko or similar).
  3. Map strengths → probabilities (Elo curve or a fitted logistic model).
  4. Simulate forward with Monte Carlo for season and playoff probabilities.
  5. Run counterfactuals by adjusting strengths or model features.
  6. Evaluate with proper scoring and calibration checks.

If you liked this approach and want a compact, hands-on walkthrough that stitches these pieces into a single practical learning path—ratings (Elo/Glicko), Monte Carlo seasons/tournaments, and forecasting models with runnable R code— you can find a focused guide here: Sports Prediction and Simulation with R.

It’s built to be “open it, run the code, understand the method”—so you can apply the same pipeline to football, basketball, American football, tennis, baseball, esports, and beyond.

Disclaimer: This article is educational and focuses on modeling and simulation techniques. Real-world performance depends on data quality, league structure, and how well your features capture the game.

Designing Sports Betting Systems in R: Bayesian Probabilities, Expected Value, and Kelly Logic https://rprogrammingbooks.com/designing-sports-betting-systems-in-r/?utm_source=rss&utm_medium=rss&utm_campaign=designing-sports-betting-systems-in-r https://rprogrammingbooks.com/designing-sports-betting-systems-in-r/#respond Sun, 01 Feb 2026 20:49:04 +0000 https://rprogrammingbooks.com/?p=2421 A good sports betting system is not a “pick-winners” machine. It’s an uncertainty engine: it turns data into probabilities, probabilities into expected value, and expected value into position sizes that survive variance. If you can do those three steps consistently, you can build a robust process— even if individual bets lose often. This post is […]

The post Designing Sports Betting Systems in R: Bayesian Probabilities, Expected Value, and Kelly Logic appeared first on R Programming Books.


A good sports betting system is not a “pick-winners” machine. It’s an uncertainty engine: it turns data into probabilities, probabilities into expected value, and expected value into position sizes that survive variance. If you can do those three steps consistently, you can build a robust process— even if individual bets lose often.

This post is a code-heavy blueprint for designing sports betting systems in R using Bayesian probabilities, market odds, EV, the Kelly Criterion, and practical betting strategies. You’ll also get a backtesting and Monte Carlo framework to evaluate drawdowns, ruin risk, and performance stability.


1. Setup: Packages, Data Schemas, and Conventions

We’ll work with an event-level dataset (one row per match) and optionally a bet-level dataset (one row per wager). The minimum fields you want for match win modeling:

  • date
  • team, opponent
  • home (1 home, 0 away)
  • result (1 win, 0 loss)
  • odds_decimal (decimal odds for the bet you want to evaluate)
# Core data + modeling stack
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(lubridate)
library(ggplot2)
library(scales)

# Bayesian regression (optional, heavier but powerful)
library(rstanarm)
library(posterior)

set.seed(42)

# --- Odds helpers ---
american_to_decimal <- function(american) {
  ifelse(american > 0, 1 + american/100, 1 + 100/abs(american))
}

decimal_to_american <- function(decimal) {
  # returns American odds as numeric (positive/negative)
  stopifnot(all(decimal > 1))
  out <- ifelse(decimal >= 2, (decimal - 1) * 100, -100/(decimal - 1))
  out
}

decimal_to_implied <- function(decimal) 1 / decimal

# Remove vig from a two-outcome market (simple proportional normalization)
remove_vig_two_way <- function(p1_raw, p2_raw) {
  s <- p1_raw + p2_raw
  c(p1 = p1_raw/s, p2 = p2_raw/s, overround = s)
}

# --- EV helpers ---
ev_decimal <- function(p, d) {
  # expected profit per unit stake for decimal odds d
  # win: profit = d - 1; lose: -1
  p*(d - 1) - (1 - p)
}

breakeven_p_decimal <- function(d) 1 / d

# --- Kelly helpers ---
kelly_fraction_decimal <- function(p, d) {
  # Kelly fraction for decimal odds d, win prob p
  # b = d - 1, q = 1-p, f* = (bp - q)/b
  b <- d - 1
  q <- 1 - p
  f <- (b*p - q) / b
  f
}

# Practical caps
clip <- function(x, lo, hi) pmax(lo, pmin(hi, x))
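The Kelly helper above can be exercised on a concrete quote. With win probability p and decimal odds d (so net odds b = d − 1), the full-Kelly stake is f* = (bp − q)/b with q = 1 − p; many bettors then scale it down (half-Kelly) to cut variance. The function is re-defined here so the snippet runs standalone:

```r
# Kelly fraction for a binary bet at decimal odds d with win probability p
kelly_fraction_decimal <- function(p, d) {
  b <- d - 1                 # net odds per unit staked
  (b * p - (1 - p)) / b      # f* = (bp - q)/b
}

f_full <- kelly_fraction_decimal(p = 0.55, d = 2.00)  # b = 1, so f* = 2p - 1
c(full_kelly = f_full, half_kelly = 0.5 * f_full)     # stake 10% / 5% of bankroll
```

A useful sanity check: at the break-even probability p = 1/d the Kelly fraction is exactly zero, so a correctly priced market implies no bet.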

1.1 A toy dataset generator (so you can run everything)

In real life, you’ll load your match history and odds. For a self-contained tutorial, we simulate: latent team strengths → win probabilities → outcomes → bookmaker odds with a built-in margin.

simulate_league <- function(
  n_teams = 12,
  n_games = 2000,
  home_adv = 0.25,
  sigma_strength = 0.9,
  bookmaker_margin = 0.05
) {
  teams <- paste0("T", sprintf("%02d", 1:n_teams))
  strength <- rnorm(n_teams, 0, sigma_strength)
  names(strength) <- teams

  # schedule
  team <- sample(teams, n_games, replace = TRUE)
  opponent <- sample(teams, n_games, replace = TRUE)
  while(any(team == opponent)) {
    idx <- which(team == opponent)
    opponent[idx] <- sample(teams, length(idx), replace = TRUE)
  }

  home <- rbinom(n_games, 1, 0.5)
  date <- as.Date("2022-01-01") + sample(0:900, n_games, replace = TRUE)

  # true probability via logistic link
  lin <- (strength[team] - strength[opponent]) + home_adv * home
  p_true <- plogis(lin)

  # outcomes
  result <- rbinom(n_games, 1, p_true)

  # bookmaker sets odds from "market probability" with margin
  # market_prob ~ p_true with noise, then add margin by inflating implied probs
  p_mkt <- clip(p_true + rnorm(n_games, 0, 0.03), 0.02, 0.98)
  # apply margin by scaling implied probabilities upward (approx.)
  p_raw <- clip(p_mkt * (1 + bookmaker_margin), 0.02, 0.995)

  odds_decimal <- 1 / p_raw

  tibble(
    date = date,
    team = team,
    opponent = opponent,
    home = home,
    result = result,
    p_true = p_true,
    odds_decimal = odds_decimal
  ) %>%
    arrange(date)
}

matches <- simulate_league()
glimpse(matches)
summary(matches$odds_decimal)

2. Odds and Probability: Conversions, Vig (Overround), and Fair Prices

Odds imply probabilities, but sportsbook odds typically include a margin (vig/overround). If you compare your model’s probability to raw implied probability, you might misjudge edges. The right comparison is: your probability vs the market’s vig-free probability.

2.1 Basic conversions

# Example: decimal odds
d <- 1.91
p_implied_raw <- decimal_to_implied(d)
p_implied_raw
# Example: American odds
am <- -110
d2 <- american_to_decimal(am)
c(decimal = d2, implied = decimal_to_implied(d2))

2.2 Two-way vig removal example

For a two-outcome market (like moneyline without draws), you can normalize implied probabilities. Suppose the book offers:

d_home <- 1.80
d_away <- 2.10

p_home_raw <- 1/d_home
p_away_raw <- 1/d_away

vigfree <- remove_vig_two_way(p_home_raw, p_away_raw)
vigfree
# Market "fair" probabilities
p_home_fair <- vigfree["p1"]
p_away_fair <- vigfree["p2"]

p_home_fair + p_away_fair  # should be 1

In practice, you might not always have both sides’ odds, especially if you store only “the odds you bet”. But for sharp evaluation and backtesting, capturing the full market quote (both sides) is ideal.


3. Expected Value: Break-even Probability, Edge, and Sensitivity

Expected value (EV) is the bridge between probability and betting decisions. With decimal odds d and win probability p, the EV per 1 unit stake is EV = p·(d − 1) − (1 − p): you win d − 1 with probability p and lose the unit stake otherwise.

# EV for a single bet
p_hat <- 0.56
d <- 1.95

ev <- ev_decimal(p_hat, d)
ev
# Break-even p (the book's implied probability)
p_be <- breakeven_p_decimal(d)
c(breakeven_p = p_be, edge = p_hat - p_be)

3.1 EV sensitivity curve

ev_curve <- function(d, p_grid = seq(0.01, 0.99, by = 0.01)) {
  tibble(p = p_grid, ev = ev_decimal(p_grid, d))
}

df_ev <- ev_curve(1.95)

ggplot(df_ev, aes(p, ev)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    title = "EV per Unit Stake vs Win Probability",
    x = "Win probability p",
    y = "Expected value (profit per 1 unit stake)"
  )

Notice how small probability errors can flip EV from positive to negative when odds are tight. This is why calibration and uncertainty-aware betting matter.


4. Bayesian Foundations for Betting: Priors, Posteriors, and Uncertainty

Bayesian thinking fits betting naturally: you start with a prior belief about a team/player/strategy, update with evidence, and end up with a posterior distribution over the probability you care about. Crucially, you get uncertainty, not just a point estimate.

4.1 A simple Bayesian coin: Beta prior + Binomial likelihood

# Prior: Beta(a, b)
a0 <- 10
b0 <- 10

# Observed: k wins out of n
k <- 12
n <- 20

# Posterior: Beta(a0 + k, b0 + n - k)
a1 <- a0 + k
b1 <- b0 + (n - k)

# Posterior mean and interval
post_mean <- a1 / (a1 + b1)
post_ci <- qbeta(c(0.05, 0.95), a1, b1)

c(mean = post_mean, lo90 = post_ci[1], hi90 = post_ci[2])
# Visualize prior vs posterior
grid <- seq(0, 1, length.out = 400)

df_beta <- bind_rows(
  tibble(p = grid, density = dbeta(grid, a0, b0), dist = "prior"),
  tibble(p = grid, density = dbeta(grid, a1, b1), dist = "posterior")
)

ggplot(df_beta, aes(p, density, color = dist)) +
  geom_line() +
  labs(
    title = "Beta Prior vs Posterior",
    x = "Win probability p",
    y = "Density"
  )

This is the simplest but most important idea: if your sample is small, your posterior won’t be overly confident. That alone can prevent overbetting.


5. Beta-Binomial Models: Bayesian Win Rates (Conjugate Bayes)

The Beta-Binomial is a workhorse for modeling win rates, conversion rates, “cover rates”, etc. You can use it for:

  • Team win rate vs certain opponents
  • Over/Under hit rate for a model-defined filter
  • Any binary market outcome you can label as 0/1

5.1 Team-level posteriors

# Build per-team counts
team_stats <- matches %>%
  group_by(team) %>%
  summarise(
    n = n(),
    k = sum(result),
    .groups = "drop"
  )

# Choose a prior: mildly skeptical
a0 <- 8
b0 <- 8

team_post <- team_stats %>%
  mutate(
    a = a0 + k,
    b = b0 + (n - k),
    post_mean = a/(a+b),
    lo90 = qbeta(0.05, a, b),
    hi90 = qbeta(0.95, a, b)
  ) %>%
  arrange(desc(post_mean))

team_post %>% slice(1:10)
ggplot(team_post, aes(reorder(team, post_mean), post_mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = lo90, ymax = hi90), width = 0.2) +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    title = "Team Win Probability Posteriors (Beta-Binomial)",
    x = "Team",
    y = "Posterior mean (90% interval)"
  )

5.2 Betting with uncertainty (posterior sampling)

Instead of using a single probability, sample from the posterior and compute EV distribution. This helps you avoid betting when your edge is too uncertain.

# Suppose you want to bet Team X at odds d
team_x <- team_post$team[1]
d <- 1.95

a <- team_post$a[team_post$team == team_x]
b <- team_post$b[team_post$team == team_x]

p_draws <- rbeta(20000, a, b)
ev_draws <- ev_decimal(p_draws, d)

quantile(ev_draws, c(0.05, 0.5, 0.95))
# Probability that EV is positive (posterior probability of an edge)
mean(ev_draws > 0)

A simple rule: only bet if P(EV > 0) exceeds a threshold (e.g., 0.6 or 0.7). That’s a Bayesian risk control.
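That rule is only a few lines of code. A self-contained sketch with toy posterior parameters (the Beta counts, odds, and 0.7 threshold are illustrative):

```r
# Bet only when the posterior probability of a positive edge is convincing.
should_bet <- function(a, b, d, threshold = 0.7, n_draws = 20000) {
  p_draws  <- rbeta(n_draws, a, b)               # posterior draws of win prob
  ev_draws <- p_draws * (d - 1) - (1 - p_draws)  # EV per unit stake for each draw
  mean(ev_draws > 0) >= threshold                # P(EV > 0) above the threshold?
}

set.seed(1)
should_bet(a = 60, b = 40, d = 1.95)  # posterior well above breakeven 1/1.95 -> TRUE
should_bet(a = 12, b = 12, d = 1.95)  # small sample, no clear edge -> FALSE
```

Note that P(EV > 0) is the same as P(p > 1/d), so the rule simply asks how much posterior mass sits above the break-even probability.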


6. Sequential Updating: Learning Over Time (Online Bayesian Updating)

Betting systems are online systems. Your beliefs should update as games happen. Beta-Binomial makes this trivial.

update_beta <- function(a, b, y) {
  # y is 1 for win, 0 for loss
  c(a = a + y, b = b + (1 - y))
}

sequential_posterior <- function(y, a0 = 8, b0 = 8) {
  a <- a0; b <- b0
  out <- vector("list", length(y))
  for (i in seq_along(y)) {
    ab <- update_beta(a, b, y[i])
    # drop the names carried by ab["a"]/ab["b"], otherwise the columns
    # below come out as "a.a", "b.b", "mean.a" and break the ggplot call
    a <- unname(ab["a"]); b <- unname(ab["b"])
    out[[i]] <- tibble(
      i = i,
      a = a, b = b,
      mean = a/(a+b),
      lo90 = qbeta(0.05, a, b),
      hi90 = qbeta(0.95, a, b)
    )
  }
  bind_rows(out)
}

# Pick one team and track posterior over time
team_pick <- sample(unique(matches$team), 1)
y_seq <- matches %>% filter(team == team_pick) %>% arrange(date) %>% pull(result)

seq_df <- sequential_posterior(y_seq, a0 = 8, b0 = 8) %>%
  mutate(team = team_pick)

ggplot(seq_df, aes(i, mean)) +
  geom_line() +
  geom_ribbon(aes(ymin = lo90, ymax = hi90), alpha = 0.2) +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    title = paste("Sequential Posterior for", team_pick),
    x = "Game index",
    y = "Posterior mean (90% band)"
  )

7. Shrinkage and Partial Pooling: Stabilizing Team/Player Estimates

Raw win rates are noisy. Fully separate Beta posteriors help, but you can go further: estimate the prior from the league (Empirical Bayes), then use it as a shrinkage prior for each team.

7.1 Empirical Bayes prior fit (method-of-moments)

fit_beta_mom <- function(p_hat, w = NULL) {
  # method of moments for Beta parameters
  # p_hat: observed proportions, w optional weights (e.g., n games)
  if (is.null(w)) w <- rep(1, length(p_hat))

  m <- weighted.mean(p_hat, w)
  v <- weighted.mean((p_hat - m)^2, w)

  # clamp variance to avoid weirdness
  v <- max(v, 1e-6)

  # Beta variance: m(1-m)/(a+b+1)
  t <- m*(1-m)/v - 1
  a <- max(0.5, m*t)
  b <- max(0.5, (1-m)*t)
  c(a = a, b = b)
}

p_hat <- team_stats$k / team_stats$n
prior_ab <- fit_beta_mom(p_hat, w = team_stats$n)
prior_ab
a0 <- prior_ab["a"]
b0 <- prior_ab["b"]

team_post_eb <- team_stats %>%
  mutate(
    a = a0 + k,
    b = b0 + (n - k),
    post_mean = a/(a+b),
    lo90 = qbeta(0.05, a, b),
    hi90 = qbeta(0.95, a, b)
  ) %>%
  arrange(desc(post_mean))

# Compare naive sample rates vs EB posterior means
compare_df <- team_stats %>%
  mutate(sample_rate = k/n) %>%
  left_join(team_post_eb %>% select(team, post_mean), by = "team") %>%
  pivot_longer(cols = c(sample_rate, post_mean), names_to = "type", values_to = "p")

ggplot(compare_df, aes(p, fill = type)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(
    title = "Shrinkage Effect: Sample Rates vs EB Posterior Means",
    x = "Probability",
    y = "Count"
  )

Shrinkage reduces overreaction to small samples—one of the most common reasons betting systems look great in backtests and fail live.
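The mechanics are easy to see in closed form: the posterior mean (a0 + k) / (a0 + b0 + n) is a weighted average of the sample rate k/n and the prior mean, with weight w = n / (n + a0 + b0) on the data. A minimal sketch with made-up counts:

```r
# Posterior mean as a weighted blend of data and prior
a0 <- 8; b0 <- 8  # prior "pseudo-games": 8 wins, 8 losses
shrunk_mean <- function(k, n) (a0 + k) / (a0 + b0 + n)

shrunk_mean(k = 3, n = 4)     # 4 games at 75%: pulled hard toward 0.5 -> 0.55
shrunk_mean(k = 60, n = 100)  # 100 games at 60%: barely moves -> ~0.586
```

Small samples are dominated by the prior; large samples overwhelm it. That is exactly the behavior you want from partial pooling.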


8. Bayesian Logistic Models: Match Win Probabilities from Features

Team strength depends on context: home advantage, rest, injuries, travel, matchups, and more. Logistic regression is a natural model for win probabilities. Bayesian logistic regression adds regularization and uncertainty.

We’ll create a few simple features from our toy data. In real projects you’d engineer better predictors.

# Simple feature engineering (toy)
# Use rolling win rate as a proxy strength feature (in real data: Elo, xG, etc.)
matches_feat <- matches %>%
  arrange(date) %>%
  group_by(team) %>%
  mutate(
    games_played = row_number() - 1,
    # running mean of a team's previous results; NA for its first game
    # (note: cummean(lag(result)) would propagate NA from the first lagged value)
    roll_winrate = lag(cummean(result))
  ) %>%
  ungroup() %>%
  mutate(
    roll_winrate = ifelse(is.na(roll_winrate), mean(result), roll_winrate),
    # center features
    roll_winrate_c = roll_winrate - mean(roll_winrate),
    home_c = home - mean(home)
  )

# Train-test split by time (avoid leakage)
cut_date <- quantile(matches_feat$date, 0.8, type = 1)  # type = 1 picks an observed date (no interpolation)
train <- matches_feat %>% filter(date <= cut_date)
test  <- matches_feat %>% filter(date >  cut_date)

nrow(train); nrow(test)
# Bayesian logistic regression
# Note: MCMC can be slow. For a big dataset, adjust iter/chains or use the fast
# variational approximation via stan_glm(..., algorithm = "meanfield").
fit_bayes_logit <- stan_glm(
  result ~ home_c + roll_winrate_c,
  data = train,
  family = binomial(link = "logit"),
  prior = normal(location = 0, scale = 1, autoscale = TRUE),
  prior_intercept = normal(0, 2.5),
  chains = 2,
  iter = 800,
  refresh = 0
)

print(fit_bayes_logit)
# Posterior predictive probabilities on test
p_test <- posterior_epred(fit_bayes_logit, newdata = test)
# p_test is draws x observations on the probability scale; take the posterior mean per observation
p_hat <- colMeans(p_test)

test_pred <- test %>%
  mutate(p_model = p_hat)

test_pred %>%
  summarise(
    mean_p = mean(p_model),
    mean_y = mean(result)
  )

9. Ratings as Priors: Elo with Uncertainty and Bayesian Flavor

Elo is a practical rating system that translates team strength differences into win probabilities. It’s not “Bayesian” in a strict sense, but it behaves like a sequential updating scheme and can be combined with Bayesian ideas.

9.1 Simple Elo implementation in R

elo_update <- function(Ra, Rb, y, k = 20) {
  # y: 1 if A wins, 0 if B wins
  Ea <- 1 / (1 + 10^((Rb - Ra)/400))
  Ra_new <- Ra + k*(y - Ea)
  Rb_new <- Rb + k*((1 - y) - (1 - Ea))
  c(Ra = Ra_new, Rb = Rb_new, Ea = Ea)
}

run_elo <- function(df, init = 1500, k = 20) {
  teams <- sort(unique(c(df$team, df$opponent)))
  R <- setNames(rep(init, length(teams)), teams)

  out <- vector("list", nrow(df))
  for (i in seq_len(nrow(df))) {
    a <- df$team[i]
    b <- df$opponent[i]
    y <- df$result[i]

    upd <- elo_update(R[a], R[b], y, k = k)
    R[a] <- upd["Ra"]
    R[b] <- upd["Rb"]

    out[[i]] <- tibble(
      date = df$date[i],
      team = a,
      opponent = b,
      result = y,
      elo_team = R[a],
      elo_opp = R[b],
      p_elo = upd["Ea"]
    )
  }
  bind_rows(out)
}

elo_df <- run_elo(matches %>% arrange(date), k = 18)
elo_df %>% slice(1:5)
# Join Elo features back to matches
matches_elo <- matches %>%
  arrange(date) %>%
  mutate(row_id = row_number()) %>%
  left_join(
    elo_df %>% mutate(row_id = row_number()) %>% select(row_id, p_elo),
    by = "row_id"
  )

summary(matches_elo$p_elo)

Elo probabilities can serve as baseline priors or features inside a Bayesian model. The key is to evaluate calibration and stability.
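A quick first-pass calibration check is to bin the Elo probabilities and compare each bin's mean to the observed win rate (Section 11 below does this more formally with reliability curves). This assumes the matches_elo table built above:

```r
# Do Elo probabilities track observed win rates?
matches_elo %>%
  mutate(bin = ntile(p_elo, 8)) %>%
  group_by(bin) %>%
  summarise(
    p_elo_mean = mean(p_elo),
    win_rate   = mean(result),
    n          = n(),
    .groups    = "drop"
  )
```

If bin means and win rates diverge systematically, tune k or shrink the probabilities before using them as priors.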


10. Score Models: Poisson and Skellam for Totals and Spreads

If you bet totals (Over/Under) or spreads, it’s helpful to model the score distribution. A common starting point is Poisson scoring for each team: HomeGoals ~ Poisson(lambda_home), AwayGoals ~ Poisson(lambda_away). The goal difference follows a Skellam distribution (difference of two Poissons).

We’ll simulate goals here to demonstrate the workflow. Replace this with real score data.

# Add synthetic goals to the toy dataset (for demonstration)
add_goals <- function(df) {
  # tie lambda to latent p_true to create plausible relationship
  # this is just for tutorial purposes
  base <- 1.2
  spread <- (df$p_true - 0.5) * 2.0

  lambda_home <- pmax(0.2, base + 0.4 + spread)
  lambda_away <- pmax(0.2, base - 0.1 - spread)

  df %>%
    mutate(
      goals_team = rpois(n(), lambda_home),
      goals_opp  = rpois(n(), lambda_away),
      total_goals = goals_team + goals_opp,
      goal_diff = goals_team - goals_opp
    )
}

matches_goals <- add_goals(matches)
summary(matches_goals$total_goals)

10.1 Fit simple Poisson regression for scoring rates

# Poisson model for "team goals" using simple predictors
# In real data: team/opponent effects, home advantage, pace, etc.
train_g <- matches_goals %>% filter(date <= cut_date)
test_g  <- matches_goals %>% filter(date >  cut_date)

fit_pois <- glm(
  goals_team ~ home + p_true,
  data = train_g,
  family = poisson()
)

summary(fit_pois)
# Predict lambdas on test
lambda_hat <- predict(fit_pois, newdata = test_g, type = "response")

test_g2 <- test_g %>% mutate(lambda_team = lambda_hat)

summary(test_g2$lambda_team)

10.2 Convert score distribution into market probabilities (totals/spreads)

To price a total (e.g., Over 2.5), you need P(Total > 2.5). If we model both sides as independent Poisson variables, the total is Poisson with rate lambda_total = lambda_team + lambda_opp. Here we only have a team-goals model, so we fit a companion model for the opponent’s goals for the demo.

# Companion opponent lambda (toy)
fit_pois_opp <- glm(
  goals_opp ~ home + p_true,
  data = train_g,
  family = poisson()
)

test_g2 <- test_g2 %>%
  mutate(lambda_opp = predict(fit_pois_opp, newdata = test_g2, type = "response"),
         lambda_total = lambda_team + lambda_opp)

# Price Over 2.5
p_over_25 <- 1 - ppois(2, lambda = test_g2$lambda_total)
summary(p_over_25)
# Example: bet Over 2.5 at decimal odds 1.95
d_over <- 1.95
ev_over <- ev_decimal(p_over_25, d_over)
summary(ev_over)
mean(ev_over > 0)
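For spreads, the Skellam distribution mentioned above can be implemented directly with base R's besselI: if X ~ Poisson(lambda1) and Y ~ Poisson(lambda2) are independent, then P(X - Y = k) = exp(-(lambda1 + lambda2)) * (lambda1/lambda2)^(k/2) * I_|k|(2*sqrt(lambda1*lambda2)). A sketch for pricing a -1.5 spread:

```r
# Skellam pmf for the goal difference D = X - Y
dskellam <- function(k, lambda1, lambda2) {
  exp(-(lambda1 + lambda2)) * (lambda1 / lambda2)^(k / 2) *
    besselI(2 * sqrt(lambda1 * lambda2), nu = abs(k))
}

# P(team covers -1.5) = P(D >= 2); truncate the infinite sum at k_max
p_cover_minus_1_5 <- function(lambda1, lambda2, k_max = 30) {
  sum(dskellam(2:k_max, lambda1, lambda2))
}

p_cover_minus_1_5(1.8, 1.1)
sum(dskellam(-30:30, 1.8, 1.1))  # sanity check: pmf mass should be ~1
```

Plug in the lambda_team and lambda_opp predictions from the Poisson models above to price spread markets the same way the totals were priced.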

11. Probability Calibration: Brier, Log Loss, and Reliability Curves

A betting system needs probabilities that mean what they say. If you claim 60% and you win 60% over time, that’s calibration. Calibration is one of the best predictors that your edge is real rather than backtest noise.

11.1 Brier score and log loss

brier_score <- function(y, p) mean((y - p)^2)

log_loss <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y*log(p) + (1-y)*log(1-p))
}

brier <- brier_score(test_pred$result, test_pred$p_model)
ll <- log_loss(test_pred$result, test_pred$p_model)
c(brier = brier, logloss = ll)

11.2 Reliability curve (calibration plot)

reliability_curve <- function(y, p, bins = 10) {
  tibble(y = y, p = p) %>%
    mutate(bin = ntile(p, bins)) %>%
    group_by(bin) %>%
    summarise(
      p_mean = mean(p),
      y_mean = mean(y),
      n = n(),
      .groups = "drop"
    )
}

rc <- reliability_curve(test_pred$result, test_pred$p_model, bins = 12)

ggplot(rc, aes(p_mean, y_mean, size = n)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Reliability Curve (Calibration)",
    x = "Predicted probability (bin mean)",
    y = "Observed frequency",
    size = "Count"
  )

If your points sit above the diagonal, observed frequency exceeds predicted probability (underconfidence); below it, you’re overconfident. Both can destroy Kelly sizing.
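If miscalibration is systematic, a simple repair is a Platt-style recalibration layer: regress outcomes on the logit of the model probability, then map future probabilities through the fitted curve. This is a sketch; calib below is a hypothetical held-out calibration split — in practice, never fit the recalibrator on the data you evaluate on.

```r
# Platt-style recalibration: logistic regression on the model's logit
recalibrate <- function(y, p, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)
  fit <- glm(y ~ qlogis(p), family = binomial())
  function(p_new) {
    p_new <- pmin(pmax(p_new, eps), 1 - eps)
    as.numeric(predict(fit, newdata = data.frame(p = p_new), type = "response"))
  }
}

# recal   <- recalibrate(calib$result, calib$p_model)  # fit on a calibration split
# p_fixed <- recal(test_pred$p_model)                  # apply to new predictions
```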


12. Kelly Logic: Optimal Growth, Fractional Kelly, and Practical Constraints

Kelly sizing maximizes long-run logarithmic bankroll growth if your probabilities are correct. In real life, probabilities are noisy—so full Kelly can be too aggressive. Fractional Kelly is often the sweet spot.

12.1 Kelly fraction for decimal odds

p <- 0.55
d <- 2.05

f_k <- kelly_fraction_decimal(p, d)
f_k
# Fractional Kelly (e.g., half Kelly)
fractional_kelly <- function(f, frac = 0.5) frac * f

f_half <- fractional_kelly(f_k, 0.5)

# Apply caps (never bet negative Kelly; cap max stake)
f_bet <- clip(f_half, 0, 0.05)
f_bet
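For reference, the closed form behind the helper (the same expression kelly_draws_decimal uses below): with net odds b = d - 1 and q = 1 - p, the Kelly fraction is f* = (b*p - q)/b, which simplifies to (d*p - 1)/(d - 1).

```r
# Kelly closed form for decimal odds
p <- 0.55; d <- 2.05
b <- d - 1; q <- 1 - p
f_star <- (b * p - q) / b
f_star  # identical to (d * p - 1) / (d - 1), roughly 0.121
```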

12.2 Kelly with uncertainty (posterior integration)

A Bayesian twist: rather than plug in a point estimate p, integrate over uncertainty by sampling from the posterior. This yields a distribution for Kelly fraction and lets you bet conservatively (e.g., on a lower quantile).

kelly_draws_decimal <- function(p_draws, d) {
  b <- d - 1
  q <- 1 - p_draws
  (b*p_draws - q)/b
}

# Example using the earlier Beta posterior (pick some a,b)
a <- 25; b <- 20
p_draws <- rbeta(50000, a, b)
d <- 1.95

f_draws <- kelly_draws_decimal(p_draws, d)

quantile(f_draws, c(0.05, 0.25, 0.5, 0.75, 0.95))
# Conservative staking: use 25th percentile of Kelly, then apply fraction + cap
f_cons <- unname(quantile(f_draws, 0.25))
f_bet <- clip(0.5 * f_cons, 0, 0.03)
f_bet

This is one of the most practical Bayesian risk controls: uncertainty shrinks position sizes automatically.


13. Portfolio Betting: Correlation, Market Clustering, and Exposure Controls

Betting is rarely a sequence of independent wagers. Same league, same teams, same injury news—your bets can be correlated. Correlation increases drawdowns and makes naive Kelly sizing too aggressive.

13.1 A simple exposure constraint system

# Imagine we have candidate bets with:
# date, market, team, p_model, odds_decimal, kelly_f

make_candidates <- function(df, p_col = "p_model", odds_col = "odds_decimal") {
  df %>%
    transmute(
      date,
      team,
      opponent,
      p = .data[[p_col]],
      d = .data[[odds_col]],
      ev = ev_decimal(p, d),
      f_kelly = kelly_fraction_decimal(p, d)
    ) %>%
    mutate(
      f_half = 0.5 * f_kelly,
      f_bet = clip(f_half, 0, 0.03)
    )
}

# For demo, use test_pred probabilities against the odds in matches
candidates <- test_pred %>% make_candidates()

summary(candidates$f_bet)
summary(candidates$ev)
# Portfolio constraints:
# - Max stake per day
# - Max stake per team
# - Only bet when EV > 0 and Kelly > 0

apply_constraints <- function(df, max_daily = 0.08, max_team = 0.05) {
  df %>%
    filter(ev > 0, f_bet > 0) %>%
    arrange(desc(ev)) %>%
    group_by(date) %>%
    mutate(
      daily_cum = cumsum(f_bet),
      f_bet_day = ifelse(daily_cum <= max_daily, f_bet, 0)
    ) %>%
    ungroup() %>%
    group_by(team) %>%
    mutate(
      team_cum = cumsum(f_bet_day),
      f_bet_final = ifelse(team_cum <= max_team, f_bet_day, 0)
    ) %>%
    ungroup()
}

bets_planned <- apply_constraints(candidates, max_daily = 0.08, max_team = 0.05)

bets_planned %>%
  summarise(
    n_candidates = n(),
    n_bets = sum(f_bet_final > 0),
    avg_stake = mean(f_bet_final[f_bet_final > 0]),
    total_stake = sum(f_bet_final)
  )

This looks simple, but constraints are often the difference between a backtest hero and a live survivor.


14. Concrete Betting Strategies: From Simple Rules to Model-Driven Systems

Below are practical strategies you can implement. None is “magic.” The point is to define rules, measure them, and iterate.

14.1 Strategy A: Simple value betting (probability edge threshold)

strategy_value_threshold <- function(df, edge_min = 0.02) {
  df %>%
    mutate(
      p_be = 1/d,
      edge = p - p_be
    ) %>%
    filter(edge >= edge_min) %>%
    mutate(
      f_bet = clip(0.5 * kelly_fraction_decimal(p, d), 0, 0.03)
    )
}

bets_A <- candidates %>% strategy_value_threshold(edge_min = 0.02)
summary(bets_A$f_bet)

14.2 Strategy B: Bayesian “probability of positive EV” gate

If you can produce a posterior distribution for p, you can compute P(EV > 0). We’ll demonstrate using a Beta posterior for each team’s win rate as a crude uncertainty model.

# Build EB Beta posteriors from the training window
team_train <- matches %>% filter(date <= cut_date) %>%
  group_by(team) %>%
  summarise(n = n(), k = sum(result), .groups = "drop")

p_hat_train <- team_train$k / team_train$n
prior_ab <- fit_beta_mom(p_hat_train, w = team_train$n)
a0 <- prior_ab["a"]; b0 <- prior_ab["b"]

team_params <- team_train %>%
  mutate(a = a0 + k, b = b0 + (n - k)) %>%
  select(team, a, b)

# Attach team posterior params to candidates
cand_B <- candidates %>%
  left_join(team_params, by = "team")

prob_ev_positive_beta <- function(a, b, d, n_draws = 10000) {
  p_draws <- rbeta(n_draws, a, b)
  mean(ev_decimal(p_draws, d) > 0)
}

cand_B2 <- cand_B %>%
  mutate(
    p_ev_pos = pmap_dbl(list(a, b, d), ~prob_ev_positive_beta(..1, ..2, ..3, n_draws = 6000)),
    f_bet = clip(0.5 * kelly_fraction_decimal(p, d), 0, 0.03)
  )

bets_B <- cand_B2 %>% filter(p_ev_pos >= 0.65, ev > 0, f_bet > 0)

bets_B %>%
  summarise(n_bets = n(), mean_p_ev_pos = mean(p_ev_pos), mean_ev = mean(ev), mean_f = mean(f_bet))

14.3 Strategy C: Calibrated model + fractional Kelly + caps

# If you have a calibrated model probability p_model,
# apply fractional Kelly and cap. Also: require minimal EV.
strategy_fractional_kelly <- function(df, ev_min = 0.01, frac = 0.5, cap = 0.03) {
  df %>%
    mutate(
      f_k = kelly_fraction_decimal(p, d),
      f_bet = clip(frac * f_k, 0, cap)
    ) %>%
    filter(ev >= ev_min, f_bet > 0)
}

bets_C <- candidates %>% strategy_fractional_kelly(ev_min = 0.01, frac = 0.35, cap = 0.025)
bets_C %>% summarise(n_bets = n(), mean_ev = mean(ev), mean_f = mean(f_bet))

15. Backtesting in R: Walk-forward, Leakage Prevention, and Metrics

Most betting “edges” vanish when you backtest properly. Your backtest must be time-respecting (walk-forward), must avoid target leakage, and must include realistic constraints (limits, stake caps, minimum odds, etc.).

15.1 A minimal bet settlement engine

settle_bets <- function(df, bankroll0 = 1.0) {
  # df must have: result (1/0), d (decimal odds), f_bet (fraction of bankroll)
  bankroll <- bankroll0
  out <- vector("list", nrow(df))

  for (i in seq_len(nrow(df))) {
    stake <- bankroll * df$f_bet[i]
    if (stake <= 0) {
      out[[i]] <- tibble(i = i, bankroll = bankroll, stake = 0, pnl = 0)
      next
    }

    pnl <- if (df$result[i] == 1) stake * (df$d[i] - 1) else -stake
    bankroll <- bankroll + pnl

    out[[i]] <- tibble(i = i, bankroll = bankroll, stake = stake, pnl = pnl)
  }

  bind_cols(df, bind_rows(out))
}

# Example: settle Strategy C bets, time-ordered
bt_C <- bets_C %>%
  left_join(test_pred %>% select(date, team, opponent, result),
            by = c("date", "team", "opponent")) %>%
  arrange(date) %>%
  select(date, team, opponent, result, d, f_bet)

bt_C2 <- settle_bets(bt_C, bankroll0 = 1.0)

tail(bt_C2)

15.2 Performance metrics

compute_metrics <- function(df) {
  df %>%
    summarise(
      n_bets = sum(stake > 0),
      final_bankroll = last(bankroll),
      total_pnl = sum(pnl),
      roi_on_staked = ifelse(sum(stake) > 0, sum(pnl)/sum(stake), NA_real_),
      hit_rate = mean(result[stake > 0] == 1),
      avg_odds = mean(d[stake > 0]),
      max_drawdown = {
        peak <- cummax(bankroll)
        dd <- bankroll/peak - 1
        min(dd)
      }
    )
}

compute_metrics(bt_C2)
# Plot bankroll curve
ggplot(bt_C2, aes(date, bankroll)) +
  geom_line() +
  labs(
    title = "Backtest Bankroll Curve (Strategy C)",
    x = "Date",
    y = "Bankroll"
  )

15.3 Walk-forward template

Below is a simple walk-forward skeleton: refit model on an expanding window, predict the next chunk, generate bets, settle, repeat. Replace the “model” section with your real pipeline.

walk_forward <- function(df, initial_frac = 0.6, step_frac = 0.1) {
  df <- df %>% arrange(date)
  n <- nrow(df)

  idx0 <- floor(initial_frac * n)
  step <- floor(step_frac * n)

  start <- idx0
  bankroll <- 1.0
  all_bets <- list()

  while (start < n) {
    train_idx <- 1:start
    test_idx <- (start + 1):min(n, start + step)

    train <- df[train_idx, ]
    test  <- df[test_idx, ]

    # --- Model (placeholder) ---
    # simple baseline: probability = historical mean adjusted by home
    p0 <- mean(train$result)
    p_hat <- clip(p0 + 0.03*(test$home - mean(train$home)), 0.02, 0.98)

    cand <- test %>%
      transmute(
        date, team, opponent,
        result,
        p = p_hat,
        d = odds_decimal
      ) %>%
      mutate(
        ev = ev_decimal(p, d),
        f_bet = clip(0.35 * kelly_fraction_decimal(p, d), 0, 0.02)
      ) %>%
      filter(ev > 0.01, f_bet > 0)

    # Settle on this chunk
    if (nrow(cand) > 0) {
      settled <- settle_bets(cand, bankroll0 = bankroll)
      bankroll <- last(settled$bankroll)
      all_bets[[length(all_bets) + 1]] <- settled
    }

    start <- start + step
  }

  bind_rows(all_bets)
}

wf_bt <- walk_forward(matches)
compute_metrics(wf_bt)

16. Monte Carlo Stress Tests: Drawdowns, Ruin, and Regret

Backtests are one path. Stress tests are another. Monte Carlo simulation answers: “If my estimated edge is real but noisy, what drawdowns should I expect?”

16.1 Simulate a season of bets with uncertainty around p

simulate_betting_path <- function(n_bets = 500, p = 0.54, d = 1.95, f = 0.01, bankroll0 = 1) {
  bankroll <- bankroll0
  for (i in 1:n_bets) {
    stake <- bankroll * f
    win <- rbinom(1, 1, p)
    pnl <- if (win == 1) stake*(d - 1) else -stake
    bankroll <- bankroll + pnl
  }
  bankroll
}

# Many simulations
sim_paths <- replicate(5000, simulate_betting_path(n_bets = 400, p = 0.53, d = 1.95, f = 0.012))
quantile(sim_paths, c(0.05, 0.25, 0.5, 0.75, 0.95))

16.2 Incorporate probability error (model risk)

Your p_hat is not the true p. Model risk is often larger than market risk. We simulate true p drifting around your estimate.

simulate_with_model_error <- function(n_bets = 500, p_hat = 0.54, p_sd = 0.03, d = 1.95, f = 0.01, bankroll0 = 1) {
  bankroll <- bankroll0
  for (i in 1:n_bets) {
    p_true <- clip(rnorm(1, p_hat, p_sd), 0.01, 0.99)
    stake <- bankroll * f
    win <- rbinom(1, 1, p_true)
    pnl <- if (win == 1) stake*(d - 1) else -stake
    bankroll <- bankroll + pnl
  }
  bankroll
}

sim2 <- replicate(5000, simulate_with_model_error(n_bets = 400, p_hat = 0.53, p_sd = 0.05, d = 1.95, f = 0.012))
quantile(sim2, c(0.05, 0.25, 0.5, 0.75, 0.95))
# Compare distributions
df_sim <- tibble(
  bankroll = c(sim_paths, sim2),
  scenario = rep(c("no_model_error", "with_model_error"), each = length(sim_paths))
)

ggplot(df_sim, aes(bankroll)) +
  geom_histogram(bins = 50) +
  facet_wrap(~scenario, scales = "free_y") +
  labs(
    title = "Monte Carlo Bankroll Distribution",
    x = "Final bankroll",
    y = "Count"
  )

If model error collapses your distribution, reduce Kelly fraction, add EV thresholds, and improve calibration.
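The simulations above keep only the final bankroll; drawdown and ruin estimates need the whole path. A sketch that tracks each path and summarizes max drawdown and the probability of ever losing half the bankroll (the 0.5 ruin_level is an arbitrary demo threshold):

```r
simulate_path_dd <- function(n_bets = 400, p = 0.53, d = 1.95, f = 0.012,
                             bankroll0 = 1, ruin_level = 0.5) {
  bankroll <- numeric(n_bets + 1)
  bankroll[1] <- bankroll0
  for (i in 1:n_bets) {
    stake <- bankroll[i] * f
    win <- rbinom(1, 1, p)
    bankroll[i + 1] <- bankroll[i] + if (win == 1) stake * (d - 1) else -stake
  }
  peak <- cummax(bankroll)
  c(final  = bankroll[n_bets + 1],
    max_dd = min(bankroll / peak - 1),          # worst peak-to-trough drawdown
    ruined = as.numeric(min(bankroll) < ruin_level * bankroll0))
}

paths <- replicate(2000, simulate_path_dd())
rowMeans(paths)                                  # mean final, mean max drawdown, P(ruin)
quantile(paths["max_dd", ], c(0.05, 0.5, 0.95))  # drawdown distribution
```

Drawdown quantiles, not just expected growth, should drive your choice of Kelly fraction.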


17. A Go-Live Checklist: Operational and Statistical Guardrails

  • Time split always: train only on past, predict future.
  • Log everything: inputs, odds, timestamps, model version, decisions, stakes.
  • Calibrate probabilities: reliability curves, Brier, log loss.
  • Respect uncertainty: use posterior bands or conservative probability adjustments.
  • Size bets defensively: fractional Kelly + caps + daily exposure limits.
  • Account for correlation: cluster exposures by league/team/news.
  • Stress-test: Monte Carlo with model error, not just “true p”.
  • Beware selection bias: strategies discovered by brute force often overfit.
  • Prefer stability: smaller edge + stable behavior beats fragile high edge.

If you’d like a more structured, code-heavy continuation of these ideas (Bayesian probabilities, EV, and Kelly-based sizing), you can find it here: Bayesian Sports Betting with R.

The post Designing Sports Betting Systems in R: Bayesian Probabilities, Expected Value, and Kelly Logic appeared first on R Programming Books.

Fight Data Science in R: Proven Boxing Metrics & Models https://rprogrammingbooks.com/fight-data-science-in-r-boxing/ Sun, 25 Jan 2026 18:56:18 +0000


Boxing analysis is no longer just about punch totals or “who looked busier.” Modern fight analysis is data science: repeatable pipelines, validated data, explainable models, and performance indicators that translate into strategy. This post shows how to build a professional fight data science workflow in R—from raw data to metrics, modeling, and tactical insights—using code you can adapt to your own datasets.

You’ll get: a production-style project structure, data contracts, validation checks, feature engineering patterns, round-by-round models, fatigue and momentum signals, and high-signal visualizations for coaches and analysts. The goal is to help you move from “interesting charts” to decision-grade analytics.



1) Professional setup and project structure

A “pro” analytics workflow starts with discipline: consistent folders, reproducible environments, and clear separation of raw → clean → features → models. Even if you’re solo, this structure makes your work easier to iterate and publish.

# Core libraries for fight data science
pkgs <- c(
  "tidyverse", "janitor", "lubridate", "glue", "cli",
  "arrow", "here", "fs",
  "duckdb", "DBI",
  "slider",
  "rsample", "recipes", "parsnip", "workflows", "tune", "dials",
  "yardstick", "broom",
  "ggrepel", "patchwork"
)

to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))

# Create a clean project layout (idempotent)
dirs <- c(
  "data/raw",
  "data/clean",
  "data/features",
  "data/models",
  "plots",
  "reports",
  "R"
)

walk(dirs, ~ fs::dir_create(here::here(.x)))

log_info <- function(...) cli::cli_alert_info(glue::glue(...))
log_ok   <- function(...) cli::cli_alert_success(glue::glue(...))

log_ok("Project folders ready at: {here::here()}")

Tip: store raw files read-only, and always write standardized outputs (e.g., Parquet) to data/clean/. You’ll instantly speed up your workflow and reduce “mystery bugs.”


2) A fight data contract (schemas that prevent chaos)

The fastest way to break a fight analytics project is to let columns drift (“fighter” vs “boxer”, mixed date formats, different naming conventions for the same actions). A data contract prevents that. Below are two useful contracts:

  • Round totals (works with CompuBox-style aggregates)
  • Event-level metadata (for joining and reporting)
round_schema <- tibble::tribble(
  ~column,              ~type,       ~notes,
  "fight_id",           "character",  "Unique fight identifier",
  "event_id",           "character",  "Unique event identifier",
  "event_date",         "date",       "ISO date",
  "weight_class",       "character",  "e.g., Welterweight",
  "fighter",            "character",  "This row's fighter",
  "opponent",           "character",  "Opponent fighter",
  "corner",             "character",  "Red/Blue or A/B",
  "round",              "integer",    "Round number",
  "jabs_landed",        "integer",    "Jabs landed",
  "jabs_attempted",     "integer",    "Jabs attempted",
  "power_landed",       "integer",    "Power shots landed",
  "power_attempted",    "integer",    "Power shots attempted",
  "knockdowns",         "integer",    "Knockdowns in round",
  "stance",             "character",  "orthodox/southpaw/other",
  "result_round",       "integer",    "1 if fighter won the round, 0 if lost (or NA if unknown)"
)

event_schema <- tibble::tribble(
  ~column,         ~type,       ~notes,
  "event_id",      "character", "Unique event identifier",
  "event_name",    "character", "Event name",
  "event_date",    "date",      "ISO date",
  "location",      "character", "City/Country (optional)",
  "promotion",     "character", "Promotion/org (optional)"
)

round_schema

If you don’t have result_round, you can still do great analytics: predict round outcomes, infer momentum, and quantify “who was in control” using validated scoring proxies.
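One crude but useful proxy, built from the rel_damage feature engineered in Section 5 below: award the round to whichever fighter won the damage exchange, leaving even rounds as NA. The result_round_proxy name is ours; treat proxy labels as stated assumptions, not scorecards.

```r
# Hypothetical proxy label from the relative damage feature (Section 5)
label_round_proxy <- function(df) {
  df %>%
    mutate(result_round_proxy = dplyr::case_when(
      rel_damage > 0 ~ 1L,            # out-damaged the opponent
      rel_damage < 0 ~ 0L,
      TRUE           ~ NA_integer_    # even round: no proxy label
    ))
}

# rounds_feat %>% label_round_proxy() %>% count(result_round_proxy)
```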


3) Ingestion and standardization

Here’s a robust ingestion pattern: read raw CSVs, normalize names, enforce types, standardize fighter naming, and write Parquet for speed. Adjust paths to your sources.

read_round_totals <- function(path) {
  log_info("Reading raw round totals: {path}")
  readr::read_csv(path, show_col_types = FALSE) %>%
    janitor::clean_names()
}

standardize_round_totals <- function(df) {
  df %>%
    mutate(
      event_date = as.Date(event_date),
      round = as.integer(round),
      across(
        c(jabs_landed, jabs_attempted, power_landed, power_attempted, knockdowns),
        ~ as.integer(replace_na(.x, 0))
      ),
      across(c(fight_id, event_id, fighter, opponent, weight_class, stance, corner), as.character),
      stance = tolower(stance),
      corner = toupper(corner)
    ) %>%
    # Basic name normalization
    mutate(
      fighter  = str_squish(str_replace_all(fighter, "\\s+", " ")),
      opponent = str_squish(str_replace_all(opponent, "\\s+", " ")),
      weight_class = str_squish(weight_class)
    )
}

write_clean_parquet <- function(df, out_path) {
  fs::dir_create(fs::path_dir(out_path))
  arrow::write_parquet(df, out_path)
  log_ok("Wrote Parquet: {out_path}")
}

# Example:
# raw_path  <- here::here("data/raw/round_totals.csv")
# clean_out <- here::here("data/clean/round_totals.parquet")
# rounds_clean <- read_round_totals(raw_path) %>% standardize_round_totals()
# write_clean_parquet(rounds_clean, clean_out)

Parquet is a game-changer for analytics work: fast I/O, consistent types, and easy integration with DuckDB for SQL-style querying.
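Once the clean tables live in Parquet, DuckDB (already in the package list above) can query them in place with SQL, without loading everything into R. The file path and column names below assume the round-totals Parquet written earlier:

```r
# SQL over Parquet with DuckDB, no full load into memory
con <- DBI::dbConnect(duckdb::duckdb())

DBI::dbGetQuery(con, "
  SELECT fighter,
         AVG(power_landed) AS avg_power_landed,
         COUNT(*)          AS rounds
  FROM 'data/clean/round_totals.parquet'
  GROUP BY fighter
  ORDER BY avg_power_landed DESC
  LIMIT 10
")

DBI::dbDisconnect(con, shutdown = TRUE)
```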


4) Validation, QA, and anomaly detection

Fight data is full of subtle mistakes: attempted shots < landed shots, duplicated rounds, mixed fighter/opponent rows, or “impossible” knockdown counts. Validation should be automatic.

validate_round_totals <- function(df) {
  # Required columns check
  required <- round_schema$column
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(glue::glue("Missing required columns: {paste(missing_cols, collapse=', ')}"))
  }

  # Logical checks
  bad_landed <- df %>%
    filter(jabs_landed > jabs_attempted | power_landed > power_attempted)

  if (nrow(bad_landed) > 0) {
    log_info("Found {nrow(bad_landed)} rows where landed > attempted (check source or parsing).")
  }

  # Duplicate round rows (same fight_id, fighter, round)
  dupes <- df %>%
    count(fight_id, fighter, round) %>%
    filter(n > 1)

  if (nrow(dupes) > 0) {
    log_info("Found duplicates: {nrow(dupes)} fight/fighter/round combinations.")
  }

  # Suspicious extremes (simple heuristic)
  suspicious <- df %>%
    mutate(total_attempted = jabs_attempted + power_attempted) %>%
    filter(total_attempted > 120 | knockdowns > 3)

  if (nrow(suspicious) > 0) {
    log_info("Found {nrow(suspicious)} suspicious rows (very high volume or knockdowns).")
  }

  df
}

# Example:
# rounds_clean <- rounds_clean %>% validate_round_totals()

Validation gives you confidence. And confidence is what makes analytics actionable—especially when you’re presenting results to coaches, fighters, or bettors who will challenge your assumptions.


5) Feature engineering: pace, accuracy, intent, damage proxies

Fight performance is multidimensional. A clean feature set usually includes:

  • Pace: attempts per round, pace change across rounds
  • Accuracy: landed / attempted (jabs, power, total)
  • Intent / style: jab share vs power share
  • Damage proxies: power landed, knockdowns, power accuracy
  • Relative dominance: fighter metrics minus opponent metrics
engineer_round_features <- function(df) {
  df %>%
    mutate(
      total_landed    = jabs_landed + power_landed,
      total_attempted = jabs_attempted + power_attempted,
      acc_jab   = if_else(jabs_attempted > 0, jabs_landed / jabs_attempted, NA_real_),
      acc_power = if_else(power_attempted > 0, power_landed / power_attempted, NA_real_),
      acc_total = if_else(total_attempted > 0, total_landed / total_attempted, NA_real_),
      jab_share_attempts = if_else(total_attempted > 0, jabs_attempted / total_attempted, NA_real_),
      power_share_attempts = if_else(total_attempted > 0, power_attempted / total_attempted, NA_real_),
      # Simple damage proxy: power landed + weighted knockdowns
      damage_proxy = power_landed + 8 * knockdowns
    )
}

# Opponent-relative features (requires pairing fighter vs opponent within the same fight_id and round)
add_relative_features <- function(df) {
  df2 <- df %>%
    select(fight_id, round, fighter, opponent,
           total_landed, total_attempted, acc_total,
           power_landed, power_attempted, acc_power,
           damage_proxy, knockdowns) %>%
    rename_with(~ paste0("opp_", .x), -c(fight_id, round, fighter, opponent)) %>%
    rename(fighter_join = opponent, opponent_join = fighter)

  df %>%
    left_join(
      df2,
      by = c("fight_id" = "fight_id", "round" = "round", "fighter" = "fighter_join", "opponent" = "opponent_join")
    ) %>%
    mutate(
      rel_total_landed = total_landed - opp_total_landed,
      rel_acc_total    = acc_total - opp_acc_total,
      rel_power_landed = power_landed - opp_power_landed,
      rel_damage       = damage_proxy - opp_damage_proxy,
      rel_knockdowns   = knockdowns - opp_knockdowns
    )
}

# Example:
# rounds_feat <- rounds_clean %>% engineer_round_features() %>% add_relative_features()

Relative features are where fight analytics becomes tactical: a fighter’s pace means little without context. Dominance is “what you did” minus “what you absorbed.”


6) Round-by-round modeling (probability of winning a round)

If you have labeled rounds (result_round = 1/0), you can model round outcomes using interpretable classifiers. Even if you don’t, you can label from trusted sources or use proxy labels (with caution).

Below is an end-to-end workflow using tidymodels: split, recipe, logistic regression with regularization, tuning, and calibration-friendly evaluation.

# Assume you have rounds_feat with result_round (1/0) for some fights.
# tidymodels classification needs a factor outcome; putting the event level ("1")
# first keeps yardstick's default event_level = "first" consistent with .pred_1.
rounds_lab <- rounds_feat %>%
  filter(!is.na(result_round)) %>%
  mutate(result_round = factor(result_round, levels = c("1", "0")))

set.seed(123)
spl <- rsample::initial_split(rounds_lab, prop = 0.8, strata = result_round)
train <- rsample::training(spl)
test  <- rsample::testing(spl)

rec <- recipes::recipe(result_round ~ rel_total_landed + rel_acc_total + rel_power_landed + rel_damage +
                        total_attempted + acc_total + jab_share_attempts + damage_proxy +
                        stance,
                      data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

mod <- parsnip::logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflows::workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

grid <- dials::grid_regular(dials::penalty(range = c(-6, 0)), levels = 30)

set.seed(123)
folds <- rsample::vfold_cv(train, v = 5, strata = result_round)

metrics <- yardstick::metric_set(yardstick::roc_auc, yardstick::pr_auc, yardstick::accuracy, yardstick::mn_log_loss)

tuned <- tune::tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metrics
)

best <- tune::select_best(tuned, metric = "mn_log_loss")
final_wf <- tune::finalize_workflow(wf, best)

final_fit <- final_wf %>% fit(train)

# Evaluate on holdout test set
test_pred <- predict(final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(result_round))

# With 0/1 factor levels, "1" is the second level, so tell yardstick
# which level counts as the event:
yardstick::roc_auc(test_pred, truth = result_round, .pred_1, event_level = "second")
yardstick::mn_log_loss(test_pred, truth = result_round, .pred_1, event_level = "second")

Why log loss? Because in fight analytics, calibrated probabilities matter. When a model says 0.55, that outcome should occur about 55% of the time, not merely land on the right side of the classification threshold.
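
A quick way to check calibration directly is a reliability table: bin the predicted probabilities and compare each bin's mean prediction to its observed win rate. A sketch assuming `test_pred` carries a `.pred_1` column and a 0/1 (or factor) `result_round`:

```r
library(dplyr)

# Reliability table: per probability bin, compare mean predicted probability
# to the observed event rate. Well-calibrated models show mean_pred close
# to obs_rate in every bin.
calibration_table <- function(pred, prob_col = ".pred_1",
                              truth_col = "result_round", bins = 10) {
  pred %>%
    mutate(
      p   = .data[[prob_col]],
      y   = as.numeric(as.character(.data[[truth_col]])),  # handles factor 0/1
      bin = cut(p, breaks = seq(0, 1, length.out = bins + 1),
                include.lowest = TRUE)
    ) %>%
    group_by(bin) %>%
    summarise(n = n(), mean_pred = mean(p), obs_rate = mean(y),
              .groups = "drop")
}

# Toy check with an obviously calibrated model:
toy <- tibble(.pred_1 = c(0.1, 0.1, 0.9, 0.9), result_round = c(0, 0, 1, 1))
calibration_table(toy, bins = 2)
```

Large gaps between `mean_pred` and `obs_rate` suggest recalibration (e.g., Platt scaling or isotonic regression) before using the probabilities for decisions.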


7) Fight outcome modeling (interpretable + calibrated)

Fight outcomes can be modeled from aggregated round features: average dominance, variance (consistency), late-round fade, and knockdown impacts. First, summarize per fight and fighter.

summarize_fight_features <- function(df) {
  df %>%
    group_by(fight_id, event_id, event_date, weight_class, fighter, opponent) %>%
    summarise(
      rounds = n(),
      avg_rel_damage = mean(rel_damage, na.rm = TRUE),
      avg_rel_power_landed = mean(rel_power_landed, na.rm = TRUE),
      avg_rel_total_landed = mean(rel_total_landed, na.rm = TRUE),
      avg_rel_acc_total = mean(rel_acc_total, na.rm = TRUE),
      # Volatility/consistency
      sd_rel_damage = sd(rel_damage, na.rm = TRUE),
      # Pace markers
      avg_total_attempted = mean(total_attempted, na.rm = TRUE),
      # Knockdown signal
      total_knockdowns = sum(knockdowns, na.rm = TRUE),
      .groups = "drop"
    )
}

fight_level <- summarize_fight_features(rounds_feat)

# If you have fight outcome label for fighter perspective (win=1/lose=0):
# fight_level <- fight_level %>% left_join(outcomes, by = c("fight_id","fighter"))

Then model fight wins with an interpretable learner. Logistic regression is often a strong baseline; boosted trees can add performance if you keep explainability via feature importance and partial dependence (where appropriate).

# Example: win label in fight_level as win (1/0)
set.seed(42)
fight_labeled <- fight_level %>%
  filter(!is.na(win)) %>%
  # factor outcome for classification, matching the round-level workflow
  mutate(win = factor(win, levels = c("0", "1")))

spl2 <- rsample::initial_split(fight_labeled, prop = 0.8, strata = win)
tr2 <- training(spl2)
te2 <- testing(spl2)

rec2 <- recipe(win ~ avg_rel_damage + avg_rel_power_landed + avg_rel_total_landed +
                avg_rel_acc_total + sd_rel_damage + avg_total_attempted + total_knockdowns,
              data = tr2) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

mod2 <- logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")

wf2 <- workflow() %>% add_recipe(rec2) %>% add_model(mod2)

grid2 <- grid_regular(penalty(range = c(-7, 0)), levels = 40)

set.seed(42)
folds2 <- vfold_cv(tr2, v = 5, strata = win)

tuned2 <- tune_grid(wf2, resamples = folds2, grid = grid2, metrics = metrics)

best2 <- select_best(tuned2, metric = "mn_log_loss")
final2 <- finalize_workflow(wf2, best2) %>% fit(tr2)

pred2 <- predict(final2, te2, type = "prob") %>% bind_cols(te2 %>% select(win))
roc_auc(pred2, truth = win, .pred_1, event_level = "second")
mn_log_loss(pred2, truth = win, .pred_1, event_level = "second")

# Inspect coefficients (interpretability)
final2 %>%
  extract_fit_parsnip() %>%
  broom::tidy() %>%
  arrange(desc(abs(estimate))) %>%
  head(20)
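
The boosted-tree option mentioned above drops into the same workflow by swapping the model spec. A sketch assuming the xgboost engine is installed; the rest of the tuning pipeline is unchanged:

```r
library(parsnip)

# Boosted-tree spec as a drop-in replacement for the glmnet model.
# Keep explainability via feature importance (e.g., vip::vip) after fitting.
mod2_bt <- boost_tree(
  trees = 500,
  tree_depth = tune::tune(),
  learn_rate = tune::tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Swap it into the existing workflow, then tune exactly as before:
# wf2_bt   <- wf2 %>% workflows::update_model(mod2_bt)
# tuned_bt <- tune::tune_grid(wf2_bt, resamples = folds2,
#                             grid = 25, metrics = metrics)
```

Passing an integer to `grid` lets tune generate a space-filling grid over both tuning parameters automatically.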

A practical coaching readout might be: “Your average relative damage was +4 per round, but volatility was high. You won the peaks and lost the valleys—work on maintaining output in the middle rounds.”


8) Fatigue, momentum, and tactical shifts

Fatigue often shows up as a drop in attempt rate, a decline in power accuracy, or a shift toward safer output (more jabs, fewer exchanges). Momentum often appears as multi-round streaks in relative dominance.

Below are two useful constructs:

  • Fatigue Index: compare late rounds vs early rounds on pace and accuracy
  • Momentum Signal: rolling mean of relative damage/dominance

fatigue_index <- function(df) {
  df %>%
    group_by(fight_id, fighter) %>%
    mutate(
      early = round <= 3,
      late  = round >= max(round, na.rm = TRUE) - 2
    ) %>%
    summarise(
      early_pace = mean(total_attempted[early], na.rm = TRUE),
      late_pace  = mean(total_attempted[late], na.rm = TRUE),
      early_acc  = mean(acc_total[early], na.rm = TRUE),
      late_acc   = mean(acc_total[late], na.rm = TRUE),
      fatigue_pace_drop = (late_pace - early_pace) / pmax(early_pace, 1),
      fatigue_acc_drop  = (late_acc - early_acc) / pmax(early_acc, 1e-6),
      .groups = "drop"
    ) %>%
    mutate(
      fatigue_score = 0.7 * fatigue_pace_drop + 0.3 * fatigue_acc_drop
    )
}

momentum_signal <- function(df, window = 3) {
  df %>%
    arrange(fight_id, fighter, round) %>%
    group_by(fight_id, fighter) %>%
    mutate(
      rel_damage_roll = slider::slide_dbl(rel_damage, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE),
      rel_landed_roll = slider::slide_dbl(rel_total_landed, mean, .before = window - 1, .complete = FALSE, na.rm = TRUE)
    ) %>%
    ungroup()
}

fatigue_tbl <- fatigue_index(rounds_feat)
rounds_mom <- momentum_signal(rounds_feat, window = 3)

Interpretation tips:

  • Fatigue score negative → late output/accuracy declined (common)
  • Fatigue score near zero → stable performance (valuable at elite levels)
  • Rolling dominance crossing zero → tactical turning point (corner adjustments matter most here)
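
The last tip can be operationalized with a small base-R helper (a sketch, assuming a rolling dominance vector like rel_damage_roll) that returns the rounds where the signal changes sign:

```r
# Rounds where a rolling dominance signal crosses zero, i.e., candidate
# tactical turning points. Rounds where the signal is exactly zero are ignored.
crossing_rounds <- function(rounds, signal) {
  s <- sign(signal)
  prev <- s[-length(s)]
  curr <- s[-1]
  idx  <- which(curr != prev & curr != 0 & prev != 0)
  rounds[idx + 1]  # the round where the new sign first appears
}

crossing_rounds(1:6, c(2, 1.5, 0.4, -0.8, -1.2, 0.3))
# → 4 6
```

Apply it per fighter, e.g. `rounds_mom %>% group_by(fight_id, fighter) %>% summarise(turns = list(crossing_rounds(round, rel_damage_roll)))`.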

9) Visual analytics for strategy

The “best” plots are the ones that change decisions. Two strategy-grade visuals:

  • Dominance timeline (relative damage rolling mean)
  • Style map (jab share vs power accuracy)

plot_dominance_timeline <- function(df, fight_id_pick, fighter_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick, fighter == fighter_pick) %>%
    arrange(round)

  ggplot(d, aes(x = round, y = rel_damage_roll)) +
    geom_hline(yintercept = 0, linewidth = 0.6) +
    geom_line(linewidth = 1) +
    geom_point(size = 2) +
    labs(
      x = "Round",
      y = "Rolling Relative Damage (windowed mean)",
      title = "Dominance Timeline",
      subtitle = glue::glue("Fight {fight_id_pick} — {fighter_pick}")
    ) +
    theme_minimal(base_size = 12)
}

plot_style_map <- function(df, fight_id_pick) {
  d <- df %>%
    filter(fight_id == fight_id_pick) %>%
    group_by(fighter) %>%
    summarise(
      jab_share = mean(jab_share_attempts, na.rm = TRUE),
      power_acc = mean(acc_power, na.rm = TRUE),
      pace = mean(total_attempted, na.rm = TRUE),
      .groups = "drop"
    )

  ggplot(d, aes(x = jab_share, y = power_acc, label = fighter, size = pace)) +
    geom_point(alpha = 0.7) +
    ggrepel::geom_text_repel(max.overlaps = 50) +
    labs(
      x = "Jab Share (Attempts)",
      y = "Power Accuracy",
      title = "Style Map (per fight)",
      subtitle = "Higher pace = larger point"
    ) +
    theme_minimal(base_size = 12)
}

# Example:
# p1 <- plot_dominance_timeline(rounds_mom, fight_id_pick = "F123", fighter_pick = "Fighter A")
# p2 <- plot_style_map(rounds_feat, fight_id_pick = "F123")
# p1 + p2

How coaches use these:

  • If dominance drops after round 4, check conditioning or defensive adjustments from the opponent.
  • If jab share is high but power accuracy is low, the jab may be “busy” but not creating openings.
  • If pace is high and accuracy stable, that’s often a winning profile—especially across long fights.
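
The last bullet can be turned into a screening heuristic. The thresholds below are illustrative assumptions, not established benchmarks:

```r
library(dplyr)

# Flag fighters whose per-fight profile matches "high pace + stable accuracy":
# pace in the top quartile of the data AND round-to-round accuracy sd under a cap.
flag_winning_profile <- function(df, pace_q = 0.75, acc_sd_max = 0.05) {
  df %>%
    group_by(fight_id, fighter) %>%
    summarise(
      pace   = mean(total_attempted, na.rm = TRUE),
      acc_sd = sd(acc_total, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(
      winning_profile = pace >= quantile(pace, pace_q, na.rm = TRUE) &
        acc_sd <= acc_sd_max
    )
}

# Toy check: A is busy and steady, B is slower:
toy <- tibble(
  fight_id = "F1",
  fighter = rep(c("A", "B"), each = 3),
  total_attempted = c(80, 82, 81, 40, 41, 39),
  acc_total = c(0.30, 0.31, 0.30, 0.30, 0.50, 0.20)
)
flag_winning_profile(toy)
```

Tune `pace_q` and `acc_sd_max` against your own labeled outcomes rather than taking these defaults at face value.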

10) Scalable pipelines: Parquet, DuckDB, reproducibility

Once your data grows (multiple events, seasons, amateur + pro, different sources), SQL-style analysis becomes extremely useful. DuckDB lets you query Parquet directly with zero database admin.

# Connect to DuckDB (in-memory or file-backed)
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = here::here("data/fight_analytics.duckdb"))

# Point DuckDB at a Parquet file (or a folder of Parquet files)
parquet_path <- here::here("data/clean/round_totals.parquet")

DBI::dbExecute(con, glue::glue("
  CREATE OR REPLACE VIEW rounds AS
  SELECT * FROM read_parquet('{parquet_path}')
"))

# Example: top rounds by volume (attempted punches)
top_volume <- DBI::dbGetQuery(con, "
  SELECT fighter, fight_id, round,
         (jabs_attempted + power_attempted) AS total_attempted
  FROM rounds
  ORDER BY total_attempted DESC
  LIMIT 25
")

top_volume %>% as_tibble()

# Close when done
DBI::dbDisconnect(con, shutdown = TRUE)

This makes it easy to build reliable reporting: “Highest pace fights,” “Largest late-round fades,” “Most consistent dominance,” and “Knockdown-driven wins.”
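
One of those reports, "largest late-round fades," can be sketched as a single DuckDB query against the `rounds` view created above (the round cutoffs are assumptions; adjust them to your bout lengths):

```r
# Largest late-round fades: biggest drop in attempted punches from rounds 1-3
# to rounds 7+. Run against the `rounds` view before disconnecting.
fade_sql <- "
  WITH paced AS (
    SELECT fighter, fight_id,
           AVG(CASE WHEN round <= 3 THEN jabs_attempted + power_attempted END) AS early_pace,
           AVG(CASE WHEN round >= 7 THEN jabs_attempted + power_attempted END) AS late_pace
    FROM rounds
    GROUP BY fighter, fight_id
  )
  SELECT fighter, fight_id, early_pace, late_pace,
         late_pace - early_pace AS pace_change
  FROM paced
  WHERE early_pace IS NOT NULL AND late_pace IS NOT NULL
  ORDER BY pace_change ASC
  LIMIT 25
"
# fades <- DBI::dbGetQuery(con, fade_sql)
```

The `CASE WHEN ... END` without an `ELSE` yields NULLs that `AVG` ignores, so each fighter's early and late pace are averaged only over the rounds that qualify.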


11) Wrap-up and next steps

A fight data science workflow in R becomes powerful when you combine:

  • Clean contracts so your data doesn’t drift
  • Validation so your results are trustworthy
  • Relative features so metrics become tactical
  • Probability models so conclusions are calibrated
  • Fatigue/momentum so strategy reflects real turning points

If you want a more structured, end-to-end path with deeper modeling, richer case studies, and a full workflow designed specifically for boxing, you may like this resource: a complete hands-on book focused on boxing data science and fight performance strategy in R.

The post Fight Data Science in R: Proven Boxing Metrics & Models appeared first on R Programming Books.

]]>
https://rprogrammingbooks.com/fight-data-science-in-r-boxing/feed/ 0