11  Machine Learning for Biostatistics

11.1 Learning objectives

By the end of this chapter you should be able to:

  • Use tidymodels to specify, fit, tune, and evaluate predictive models with a consistent workflow.
  • Choose between gradient-boosted trees (xgboost, lightgbm), random forests (ranger), and regularised GLMs based on dataset characteristics, interpretability needs, and the role of the model in the analysis.
  • Implement nested cross-validation for hyperparameter tuning and unbiased performance estimation, and recognise when single-level CV is misleadingly optimistic.
  • Apply calibration assessment and recalibration to probabilistic classifiers, and explain why area under the ROC curve does not measure calibration.
  • Use SHAP values (Lundberg & Lee, 2017) for model interpretation, and articulate what they show and what they do not (association, not causation).
  • Recognise when a ‘machine learning’ framing is hiding a classical statistical question, and when the reverse is true.

11.2 Orientation

The boundary between machine learning and biostatistics is porous and often blurred. Both fit predictive models from data; both quantify uncertainty; both worry about overfitting and generalisation. The cultural distinctions (ML emphasises black-box prediction; biostatistics emphasises interpretable inference) were never as clean as the caricatures suggest, and they have eroded further as both fields adopt each other’s tools.

This chapter is not an introduction to machine learning. The goal is narrower: how to use the modern ML toolkit in biostatistical work where the question is genuinely predictive, and how to recognise when the toolkit is being mis-applied to a question that calls for classical inference instead.

We anchor on tidymodels (Kuhn & Silge, 2022) as the contemporary R interface, a coherent, principled stack that handles preprocessing, resampling, tuning, and evaluation with consistent verbs. We then walk through the algorithm choices that matter most in clinical and epidemiological work, the diagnostic and validation machinery, and the interpretability tools that have made black-box models more defensible than they used to be.

11.3 The statistician’s contribution

LLMs can fit a gradient-boosted tree, run cross-validation, and produce a calibration plot. They cannot decide whether prediction or inference is the right framing, whether the deployed model will actually improve the decision it informs, or whether the training data resemble the data on which the model will be used.

(Judgement 1.) Prediction is not inference. A model that achieves high AUC on a held-out set is a good predictor. It is not a tool for asking which variables cause the outcome, what the conditional effect of treatment is, or how a policy intervention would change outcomes. Conflating prediction performance with causal or counterfactual claims is the most common error in applied ML in medicine. The statistician’s job is to keep this distinction visible, in the analysis, the report, and the conversation with collaborators.

(Judgement 2.) The held-out set should represent the deployment distribution, not the design distribution. Models trained on data from one hospital and evaluated on held-out cases from the same hospital are evaluated on the training distribution. Models that will be deployed to a different hospital, or to a future cohort, must be evaluated under that shift: temporal hold-out, geographic hold-out, or prospective validation. The statistician identifies the deployment scenario and designs the evaluation to match.

(Judgement 3.) Calibration is the metric clinicians need. Discrimination (AUC) and accuracy tell you whether the model orders and classifies patients correctly. They do not tell you whether ‘the model says 30%’ means that 30% of those patients will have the event. For decisions that require absolute risk thresholds (treat / do not treat above 10% predicted risk), calibration is the binding constraint, and it is not what tuning typically optimises. The statistician checks calibration explicitly and recalibrates if needed.

These judgements decide whether an ML pipeline produces a deployable tool or a benchmark on a private dataset.

11.4 The tidymodels workflow

tidymodels (Kuhn & Silge, 2022) is a meta-package covering the steps of a modelling pipeline:

  • rsample for resampling (CV, bootstrap, time-based splits).
  • recipes for preprocessing (centring, encoding, missing data handling).
  • parsnip for model specification (uniform interface across glmnet, ranger, xgboost, keras, etc.).
  • tune for hyperparameter tuning.
  • yardstick for performance metrics.
  • workflows for binding preprocessing + model into one object.

The goal is consistency: switching from logistic regression to gradient boosting changes one line. A typical pipeline:

library(tidymodels)

split <- initial_split(d, strata = outcome, prop = 0.75)
train <- training(split)
test <- testing(split)

rec <- recipe(outcome ~ ., data = train) |>
  step_rm(id) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

xgb_spec <- boost_tree(
  trees = 1000, tree_depth = tune(),
  learn_rate = tune(), min_n = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(xgb_spec)

folds <- vfold_cv(train, v = 5, strata = outcome)
grid <- grid_latin_hypercube(
  tree_depth(), learn_rate(), min_n(),
  size = 30
)

tuned <- tune_grid(
  wf, resamples = folds, grid = grid,
  metrics = metric_set(roc_auc, brier_class)
)

best <- select_best(tuned, metric = "brier_class")
final_fit <- finalize_workflow(wf, best) |>
  last_fit(split, metrics = metric_set(
    roc_auc, brier_class, accuracy))

Two design choices to highlight. First, the metric used to select hyperparameters matters. Choosing on roc_auc optimises ranking; choosing on brier_class (or mn_log_loss), both proper scoring rules, rewards accurate probabilities rather than just correct ordering. Choose the metric your downstream decision actually uses.

Second, last_fit() deliberately uses the test split only once, after tuning is complete. The temptation to peek and re-tune is the single largest source of optimistic performance estimates in applied ML.
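
Once last_fit() has run, the test-set results come out with the collect_* helpers; the predictions feed directly into the calibration checks of section 11.7.

collect_metrics(final_fit)       # test-set AUC, Brier score, accuracy
collect_predictions(final_fit)   # per-patient predicted probabilities and observed outcomes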

11.5 Algorithm choices that matter

For tabular biomedical data, four model families do most of the work:

Regularised generalised linear models (glmnet). The elastic net from chapter 9 fits cleanly into tidymodels via parsnip::logistic_reg(penalty = tune(), mixture = tune()). It is interpretable, usually well calibrated when the penalty is modest (heavy penalties shrink predictions towards the base rate), and the right baseline for any classification problem with sample sizes in the thousands or fewer.

Random forests (ranger). Robust, low-tuning, performant on tabular data with non-linear interactions. Less prone to overfitting than boosted trees but typically outperformed by them when tuning is done well.

Gradient-boosted trees (xgboost, lightgbm). The default for high-performance prediction on tabular data. xgboost (Chen & Guestrin, 2016) is the established choice; lightgbm is faster on very large datasets and handles categorical predictors more naturally. Both require careful tuning of learn_rate, tree_depth, min_n, and the number of trees; poorly tuned, they either overfit or fail to beat the simpler baselines.

Neural networks (torch, keras). For tabular biomedical data with \(n\) in the thousands, neural networks rarely outperform gradient boosting and require considerably more engineering effort. They earn their place on high-dimensional sequence, image, or text data: clinical notes via transformers, medical imaging via convolutional networks, time series via recurrent or attention architectures. Outside those settings, prefer trees.
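
The parsnip specifications below sketch how these families slot into the workflow of section 11.4; only the model specification changes. (A sketch: engine names are as registered in parsnip, and the lightgbm engine assumes the bonsai extension package is installed.)

# Regularised GLM (elastic net)
glm_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("classification")

# Random forest
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) |>
  set_engine("ranger") |>
  set_mode("classification")

# Gradient boosting via lightgbm (assumes the bonsai extension package)
library(bonsai)
lgb_spec <- boost_tree(trees = 1000, tree_depth = tune(),
                       learn_rate = tune(), min_n = tune()) |>
  set_engine("lightgbm") |>
  set_mode("classification")

# Any of these drops into the existing workflow, e.g. wf |> update_model(rf_spec)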

A practical decision rule:

Setting                                       Default model
\(p < n\), interpretability matters           Regularised GLM
\(p < n\), raw prediction matters             Gradient boosting
\(p \approx n\) or \(p > n\), sparse          Lasso (chapter 9)
Tabular with non-linear interactions          Gradient boosting
Imaging, text, sequence                       Deep learning

Question. A clinical decision support tool will use a risk model output to display ‘high’, ‘medium’, or ‘low’ risk to clinicians. The thresholds are predicted probability above 30% (high) and below 10% (low). Should hyperparameters be tuned on AUC or on Brier score?

Answer.

Brier score (or log loss). The decision rule uses absolute predicted probabilities, so calibration matters for the decision. AUC measures discrimination (does the model order patients correctly?) and is invariant to monotonic transformations of the score. A model with excellent AUC can have terrible calibration: it ranks correctly but its predicted probabilities are systematically off. Tuning on AUC and then deploying with absolute-threshold decisions is a common error. Tune on Brier score; report AUC for context.
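
A small simulation makes the point concrete: passing a well-calibrated probability through a monotone transform leaves the AUC untouched but damages the Brier score. (A minimal sketch using yardstick’s vector interface; the transform \(p^2\) is arbitrary.)

set.seed(1)
p <- runif(5000)                                    # true, calibrated event probability
y <- factor(rbinom(5000, 1, p), levels = c(1, 0))   # first factor level = event
p_bad <- p^2                                        # same ranking, distorted probabilities

roc_auc_vec(y, p); roc_auc_vec(y, p_bad)            # identical AUC
brier_class_vec(y, p); brier_class_vec(y, p_bad)    # Brier score worsens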

11.6 Validation: nested CV and external validation

Single-level cross-validation gives an optimistic estimate of out-of-sample performance when the same CV is also used for hyperparameter tuning, because the reported performance comes from the very resamples that were used to pick the winning configuration.

Nested CV corrects this: an inner CV loop chooses hyperparameters, and an outer CV loop estimates performance. The outer loop never informs the tuning, so its performance estimate is approximately unbiased.

nested <- nested_cv(
  train,
  outside = vfold_cv(v = 10),
  inside = vfold_cv(v = 5)
)
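
rsample creates the nested resamples, but the two loops still have to be run explicitly. A minimal sketch of the outer loop is below, reusing wf and grid from section 11.4 and assuming the event class is the second level of outcome with probability column .pred_yes.

outer_auc <- purrr::map2_dbl(
  nested$splits, nested$inner_resamples,
  function(outer_split, inner_folds) {
    # inner loop: tune on resamples of the outer analysis set only
    inner_res <- tune_grid(wf, resamples = inner_folds, grid = grid,
                           metrics = metric_set(roc_auc, brier_class))
    best <- select_best(inner_res, metric = "brier_class")
    # outer loop: refit on the analysis set, assess on the untouched assessment set
    fitted <- fit(finalize_workflow(wf, best), data = analysis(outer_split))
    preds  <- augment(fitted, new_data = assessment(outer_split))
    roc_auc(preds, truth = outcome, .pred_yes, event_level = "second")$.estimate
  }
)
mean(outer_auc)   # nested-CV estimate of out-of-sample AUC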

Nested CV is computationally expensive: in the example above there are 10 × 5 = 50 inner resampling fits per candidate in the tuning grid, plus the outer-fold fits and a final fit on the full training set. But it is the correct estimator when the question is ‘what performance should we expect when we deploy this pipeline?’

For models intended for deployment, external validation on data the model has never seen is more important than any internal CV estimate. The relevant external set depends on the deployment scenario:

  • Temporal validation: train on cases through year \(T\), evaluate on year \(T+1\). Catches drift in case mix, coding practices, and population.
  • Geographic validation: train on hospital A, evaluate on hospital B. Catches site-specific feature distributions.
  • Prospective validation: train on retrospective data, evaluate on prospectively collected cases under the actual deployment workflow. The gold standard for high-stakes deployments.

A model that drops 0.05 AUC from internal CV to external validation has a deployment problem; one that drops 0.15 is being applied to a distribution fundamentally different from the training set, and the analyst should investigate before deployment.
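
A minimal sketch of a temporal hold-out, reusing wf and best from section 11.4 and assuming a hypothetical admission_date column in d (in a real pipeline the date itself would be removed from the predictors, e.g. with step_rm()):

train_t <- dplyr::filter(d, admission_date <  as.Date("2024-01-01"))
test_t  <- dplyr::filter(d, admission_date >= as.Date("2024-01-01"))

fit_t <- fit(finalize_workflow(wf, best), data = train_t)
augment(fit_t, new_data = test_t) |>
  roc_auc(truth = outcome, .pred_yes, event_level = "second")   # compare with the random-split AUC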

11.7 Calibration and recalibration

Calibration is the property that ‘the model says 20%’ means 20% of those cases have the event. It is checked by plotting observed event rate against binned predicted probability (predicted probability on the horizontal axis). A perfectly calibrated model lies on the diagonal.

library(probably)
# test_with_preds: test-set predictions with the outcome attached,
# e.g. collect_predictions(final_fit) from section 11.4
cal_plot_breaks(test_with_preds,
                truth = outcome, estimate = .pred_yes,
                num_breaks = 10)

Common patterns:

  • Over-confident model: predicts probabilities too close to 0 and 1; the calibration curve is shallower than the diagonal (calibration slope below 1). Common with deep learning models trained on small data.
  • Under-confident model: predicts probabilities too close to the base rate; the calibration curve is steeper than the diagonal (calibration slope above 1). Common with heavy regularisation or with training objectives that are not proper scoring rules.

Recalibration post-fits a one-dimensional function mapping raw scores to calibrated probabilities. Two methods:

  • Platt scaling: fits a logistic regression of the outcome on the raw score.
  • Isotonic regression: fits a non-decreasing function; more flexible but requires more recalibration data.

probably::cal_apply() integrates these into the tidymodels workflow. Recalibration is essential when deploying a model whose raw outputs do not match the deployment distribution; it is also a near-free improvement on most ML models, since the cost of fitting a calibration layer is trivial.
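
A minimal sketch with probably, assuming cal_df holds held-out predictions reserved for calibration (the outcome plus the .pred_* probability columns, e.g. from collect_predictions()) and new_preds holds predictions to be recalibrated:

platt_cal <- cal_estimate_logistic(cal_df, truth = outcome,
                                   smooth = FALSE)            # classic Platt scaling
iso_cal   <- cal_estimate_isotonic(cal_df, truth = outcome)   # needs more calibration data

recalibrated <- cal_apply(new_preds, platt_cal)               # rewrites the .pred_* columns
cal_plot_breaks(recalibrated, truth = outcome, estimate = .pred_yes)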

11.8 Interpretability: SHAP and beyond

For a deployed model, the relevant interpretability question is rarely ‘which features are most important overall’ (permutation importance answers that) but ‘why did the model give this patient this prediction’. SHAP values (Lundberg & Lee, 2017), Shapley values from cooperative game theory applied to a model’s predictions, provide a principled per-prediction decomposition.

library(SHAPforxgboost)
# final_xgb: the raw xgboost booster, e.g. extract_fit_engine(final_fit)
# X_matrix:  the preprocessed predictor matrix the booster was trained on, e.g.
#            bake(prep(rec), new_data = train, all_predictors(), composition = "matrix")
shap_long <- shap.prep(xgb_model = final_xgb,
                       X_train = X_matrix)
shap.plot.summary(shap_long)
shap.plot.dependence(shap_long, x = "age")

SHAP values have two strong properties: they sum to the difference between the prediction and the model’s average prediction (local accuracy), and they distribute credit across features according to the Shapley fairness axioms. They have one persistent limitation: they describe the model’s dependence on a feature, not the world’s. If the model has learned a spurious association (say, sex is associated with the outcome only through unmeasured confounders), SHAP will faithfully describe that association as the model uses it. SHAP does not provide causal inference, and a model interpreter who treats SHAP values as causal effects has crossed an important line.

The cleaner framing: SHAP explains the model, not the world. The model explains the data only to the extent that the analyst trusts the model. If a clinician asks ‘why does the model predict high risk for this patient,’ SHAP gives a defensible answer. If the same clinician asks ‘so if we intervene on age, will risk drop,’ SHAP cannot answer that question.

11.9 Worked example: 30-day readmission risk model

We extend the readmission example from chapter 7 to a machine-learning workflow with proper validation and calibration.

library(tidymodels)
library(probably)

set.seed(2026)
n <- 8000
d <- tibble(
  age = rnorm(n, 65, 12),
  cmi = rnorm(n, 1.5, 0.4),
  prior_admits = rpois(n, 0.6),
  los = rgamma(n, shape = 2, rate = 0.4),
  service = sample(c("med","surg","cards"), n,
                   replace = TRUE)
)
d$readmit <- factor(rbinom(
  n, 1,
  plogis(-2 + 0.02 * (d$age - 65) +
         0.5 * (d$cmi - 1.5) +
         0.3 * d$prior_admits +
         0.05 * d$los)
), labels = c("no","yes"))

split <- initial_split(d, strata = readmit, prop = 0.75)
train <- training(split); test <- testing(split)

rec <- recipe(readmit ~ ., data = train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

xgb_spec <- boost_tree(
  trees = 500, tree_depth = tune(),
  learn_rate = tune(), min_n = tune()
) |> set_engine("xgboost") |>
     set_mode("classification")

wf <- workflow() |> add_recipe(rec) |> add_model(xgb_spec)

folds <- vfold_cv(train, v = 5, strata = readmit)
grid <- grid_latin_hypercube(
  tree_depth(range = c(2, 6)),
  learn_rate(range = c(-3, -1)),
  min_n(),
  size = 25
)

tuned <- tune_grid(
  wf, resamples = folds, grid = grid,
  metrics = metric_set(roc_auc, brier_class)
)

best <- select_best(tuned, metric = "brier_class")
final_fit <- finalize_workflow(wf, best) |>
  last_fit(split,
           metrics = metric_set(roc_auc, brier_class))

collect_metrics(final_fit)

The held-out AUC is around 0.74 and the Brier score around 0.16, competent for a simple simulated example. probably::cal_plot_breaks() reveals mild over-confidence at high predicted probabilities; a Platt recalibration narrows the gap.

A logistic regression on the same problem with elastic-net regularisation gives AUC around 0.73 and Brier around 0.16, effectively indistinguishable. For this dataset, the gradient-boosted tree’s flexibility yields no benefit over the regularised GLM. The simpler model is the right choice for deployment.
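
The comparison fit is essentially a one-line change to the workflow. A sketch, reusing rec, folds, and split from the code above (the penalty and mixture grid is illustrative):

glm_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

glm_wf <- workflow() |> add_recipe(rec) |> add_model(glm_spec)

glm_tuned <- tune_grid(
  glm_wf, resamples = folds,
  grid = grid_regular(penalty(), mixture(), levels = c(20, 5)),
  metrics = metric_set(roc_auc, brier_class)
)

glm_final <- finalize_workflow(glm_wf,
                               select_best(glm_tuned, metric = "brier_class")) |>
  last_fit(split, metrics = metric_set(roc_auc, brier_class))

collect_metrics(glm_final)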

This is the typical finding in applied biomedical ML: on tabular data with thousands of cases, regularised GLMs and gradient-boosted trees usually land within a few hundredths of AUC of each other. The gradient-boosted tree wins when there are strong non-linear interactions; otherwise the simpler model wins on parsimony, calibration, and ease of explanation.

11.10 Collaborating with an LLM on machine learning

Three patterns dominate. LLMs handle the syntax of tidymodels, xgboost, and SHAP fluently. They are unreliable on the choice between prediction and inference, on validation strategy, and on the interpretation of model explanations.

Prompt 1: ‘Build me a tidymodels pipeline for this binary outcome.’ Provide the data and the outcome.

What to watch for. The LLM will produce a working pipeline. It will rarely ask whether the goal is prediction or inference. It will rarely propose external validation. It will default to AUC as the tuning metric even when absolute calibration is the binding constraint.

Verification. Before accepting the pipeline, articulate the decision the model will inform. If the decision is a threshold-based action (‘treat above 10% predicted risk’), override the metric to Brier score and add a calibration plot. If the decision is to deploy at another site, plan external validation now, not after CV is complete.

Prompt 2: ‘Why is the model giving this prediction for this patient?’ Provide the patient’s features and the model.

What to watch for. The LLM will produce a SHAP-style explanation: ‘age contributes +0.4 to the log-odds, low CMI contributes -0.2,’ etc. It will rarely flag that this explains the model’s behaviour, not the underlying biology. It may slip into language that sounds causal: ‘increasing age increases risk.’

Verification. The model explanation is a description of how the trained model uses the features in this case. It is not a causal effect. Translate any LLM explanation into explicit ‘the model attributes’ language before showing it to a clinician.

Prompt 3: ‘Compare gradient boosting and logistic regression on this dataset.’ Provide the data.

What to watch for. The LLM will fit both, report metrics, and pick the higher-AUC model. It will rarely test for significance of the difference, run nested CV to estimate generalisation honestly, or check calibration on the selected model. It will often miss that an apparent 0.005-AUC advantage is well within sampling noise.

Verification. Estimate uncertainty in the AUC difference via paired bootstrap on the test set. Run nested CV to confirm the advantage is not an artefact of single-level tuning. If the difference is small, report both models and let the parsimony argument decide.
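
A minimal sketch of the paired bootstrap, assuming test_preds holds the test-set readmission outcome readmit alongside each model’s predicted event probability (the column names p_xgb and p_glm are hypothetical):

set.seed(1)
auc_diff <- replicate(2000, {
  idx <- sample(nrow(test_preds), replace = TRUE)   # same resampled rows for both models: paired
  b   <- test_preds[idx, ]
  roc_auc_vec(b$readmit, b$p_xgb, event_level = "second") -
    roc_auc_vec(b$readmit, b$p_glm, event_level = "second")   # "yes" is the second level
})
quantile(auc_diff, c(0.025, 0.975))   # percentile interval for the AUC difference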

The meta-pattern: LLMs are competent at the mechanics (pipeline construction, metric calculation, SHAP plots) and weak at the adjudication (does this model match the deployment scenario, is the apparent advantage real, what does the explanation actually mean). Treat the LLM as a fluent technician.

11.11 Principle in use

Three habits define defensible ML work:

  1. State the deployment scenario before tuning. Who will use the model, where, on what distribution? Tune the validation strategy and the metric to match. A model evaluated on the same distribution it was trained on is a benchmark, not a deployable artefact.

  2. Calibrate before deployment. Most ML models are poorly calibrated out of the box. Recalibration via Platt or isotonic on a held-out set is cheap insurance. Threshold-based decisions on uncalibrated scores are a common source of clinical-deployment failures.

  3. Distinguish model explanation from causal claim. SHAP, permutation importance, and partial-dependence plots describe the model. They do not describe the world. Reports that conflate the two mislead clinicians and reviewers; reports that keep them separate are defensible.

11.12 Exercises

  1. Repeat the readmission worked example with \(n = 800\) instead of \(n = 8000\). Compare the AUC and Brier improvements of the gradient-boosted tree over the logistic regression at small sample size.

  2. Implement nested CV for the readmission pipeline. Compare the nested-CV AUC estimate to the simple train/test estimate. How large is the optimism?

  3. Train the model on the first 70% of the data sorted by date, evaluate on the last 30%. Compare the temporal-validation AUC to the random-split AUC. Discuss any differences.

  4. Apply Platt recalibration to a deliberately over-confident model (e.g., a deep tree with no regularisation). Plot calibration before and after.

  5. Use SHAP on the readmission gradient-boosted tree. Identify a patient where the model’s prediction is surprising (high predicted risk despite a low-risk profile). Investigate the SHAP attribution and discuss whether it would be defensible language to use with a clinician.

11.13 Further reading

  • Kuhn & Silge (2022), the tidymodels book (https://www.tmwr.org/). The canonical applied reference.
  • Hastie et al. (2009), The Elements of Statistical Learning. The textbook treatment of the underlying methods.
  • Lundberg & Lee (2017), the SHAP paper. Best read alongside the shap documentation (https://shap.readthedocs.io/).
  • Van Calster et al. (2019), Calibration: the Achilles heel of predictive analytics. The essential applied reference for clinical prediction model calibration.