7 MCMC in Depth
7.1 Learning objectives
By the end of this chapter you should be able to:
- Describe Hamiltonian Monte Carlo (HMC) as a Metropolis proposal that uses gradient information to construct long, low-rejection trajectories.
- Identify the No-U-Turn Sampler (NUTS) as the adaptive variant of HMC that automates trajectory length, and explain why it is the default for modern probabilistic programming systems.
- Compute and interpret modern MCMC convergence diagnostics: \(\hat R\) (rank-normalised), bulk and tail effective sample size, divergent transitions, E-BFMI, and the energy diagnostic.
- Recognise the reparameterisation patterns that fix hierarchical-prior pathologies: the non-centred parameterisation, the change-of-variable Jacobian, and the standardisation of input variables.
- Run multiple chains in parallel correctly, with reproducible RNG and proper diagnostics across chains.
- Diagnose failures: chains stuck on different modes, funnel pathologies, ill-conditioned posteriors, and what each looks like in bayesplot::mcmc_pairs().
7.2 Orientation
The introductory volume’s Bayesian chapter (?sec-bayesian) introduced MCMC at the Metropolis-Hastings and Gibbs-sampler level. That treatment shows the mathematical structure but produces samplers that are slow on realistic posteriors. Modern applied Bayesian work uses Hamiltonian Monte Carlo (HMC), or its adaptive descendant the No-U-Turn Sampler (NUTS), via Stan, PyMC, NumPyro, or the R interfaces rstan, cmdstanr, rstanarm, and brms.
This chapter treats HMC and NUTS in depth: enough mechanics to understand why they work and what their diagnostics mean, plus the practical bookkeeping that distinguishes a converged chain from one that looks converged. The next chapter (Chapter 8) covers the application layer (Stan, posterior workflows, LOO/WAIC); this chapter covers the algorithmic and diagnostic layer.
7.3 The statistician’s contribution
Modern MCMC tools handle the algorithmic mechanics automatically. The judgements at the centre of this chapter are about reading diagnostics correctly, recognising pathological posterior shapes from their diagnostic signatures, and making the parameterisation choices that determine whether the sampler can converge at all.
\(\hat R\) near 1 is necessary, not sufficient. A chain with \(\hat R = 1.001\) may still have low effective sample size, divergences, or trapped behaviour on a multimodal posterior. The full diagnostic suite (divergences, E-BFMI, ESS bulk and tail, trace plots, pair plots) is the verification; \(\hat R\) alone is insufficient.
Reparameterisation is a modelling choice. A hierarchical model with a centred parameterisation (\(\mu_j \sim N(\mu, \sigma)\)) has a different posterior geometry from the same model with a non-centred parameterisation (\(\mu_j = \mu + \sigma z_j, z_j \sim N(0, 1)\)). The mathematics is the same; the sampler’s behaviour is not. Knowing which parameterisation matches your data’s information level is the difference between a fast, clean fit and an MCMC chain that looks converged but never explores the funnel of the variance parameter.
Divergences are not a tuning issue. When NUTS reports divergent transitions, the standard advice is ‘increase adapt_delta’. That treats the symptom. The underlying cause is geometry: the posterior has a region the sampler cannot integrate accurately. Increasing adapt_delta may hide the divergences but not the geometry; the chain still mis-samples the difficult region. The cure is reparameterisation, prior tightening, or model simplification.
Multiple chains, multiple starting points. Convergence of one chain to a fixed point does not imply convergence to the posterior. Multiple chains starting from dispersed values, with \(\hat R\) across chains as the consistency check, is the protection against multimodality and non-stationarity.
These judgements are what distinguish a converged Bayesian analysis from one that mis-reports its uncertainty.
7.4 Hamiltonian Monte Carlo
HMC is a Metropolis-Hastings sampler whose proposal augments the parameter \(\boldsymbol\theta\) with an auxiliary momentum variable \(\mathbf{p}\) and uses Hamiltonian dynamics on the joint \((\boldsymbol\theta, \mathbf{p})\) space to propose long, low-rejection moves.
The Hamiltonian for a target \(p(\boldsymbol\theta)\) is
\[ H(\boldsymbol\theta, \mathbf{p}) = -\log p(\boldsymbol\theta) + \frac{1}{2} \mathbf{p}^T M^{-1} \mathbf{p} \]
where \(M\) is a (positive-definite) mass matrix, typically adapted so that \(M^{-1}\) roughly matches the posterior covariance. The two halves of \(H\) are the potential energy (the negative log target) and the kinetic energy (a quadratic in \(\mathbf{p}\)).
Hamilton’s equations describe how \((\boldsymbol\theta, \mathbf{p})\) evolves under \(H\):
\[ \frac{d\boldsymbol\theta}{dt} = \frac{\partial H}{\partial \mathbf{p}} = M^{-1} \mathbf{p}, \quad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \boldsymbol\theta} = \nabla \log p(\boldsymbol\theta). \]
The continuous-time evolution preserves \(H\) exactly. In practice, we discretise via the leapfrog integrator:
\[ \begin{aligned} \mathbf{p}^{(t + \epsilon/2)} &= \mathbf{p}^{(t)} + \frac{\epsilon}{2} \nabla \log p(\boldsymbol\theta^{(t)}) \\ \boldsymbol\theta^{(t + \epsilon)} &= \boldsymbol\theta^{(t)} + \epsilon M^{-1} \mathbf{p}^{(t + \epsilon/2)} \\ \mathbf{p}^{(t + \epsilon)} &= \mathbf{p}^{(t + \epsilon/2)} + \frac{\epsilon}{2} \nabla \log p(\boldsymbol\theta^{(t + \epsilon)}). \end{aligned} \]
Take \(L\) leapfrog steps; propose \((\boldsymbol\theta^{(t+L\epsilon)}, -\mathbf{p}^{(t+L\epsilon)})\) as the Metropolis candidate; accept with probability \(\min(1, \exp(H_{\text{old}} - H_{\text{new}}))\). The leapfrog discretisation error is small but nonzero, so the acceptance probability falls slightly below 1; the Metropolis accept/reject step corrects for this error exactly, which is what keeps the sampler valid.
The two tuning knobs are:
- Step size \(\epsilon\). Smaller produces more accurate trajectories and higher acceptance, but needs more steps to cover the same distance.
- Trajectory length \(L\). Longer trajectories produce proposals farther from the current state, reducing autocorrelation, but cost more gradient evaluations.
Tuning these by hand is delicate; NUTS automates it.
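To make the mechanics concrete, here is a minimal fixed-\(\epsilon\), fixed-\(L\) HMC transition in R with an identity mass matrix. This is a sketch, not any package's implementation: log_p and grad_log_p are assumed user-supplied functions, and the 2D Gaussian target is purely illustrative.
hmc_step <- function(theta, log_p, grad_log_p, eps, L) {
  p <- rnorm(length(theta))                     # momentum ~ N(0, I)
  H_old <- -log_p(theta) + sum(p^2) / 2
  theta_new <- theta
  p <- p + (eps / 2) * grad_log_p(theta_new)    # leapfrog: initial half step
  for (l in seq_len(L)) {
    theta_new <- theta_new + eps * p            # full position step
    if (l < L) p <- p + eps * grad_log_p(theta_new)
  }
  p <- p + (eps / 2) * grad_log_p(theta_new)    # final half step
  H_new <- -log_p(theta_new) + sum(p^2) / 2
  # Metropolis correction: accept with probability min(1, exp(H_old - H_new))
  if (log(runif(1)) < H_old - H_new) theta_new else theta
}
# Illustrative target: standard bivariate normal
log_p <- function(th) -sum(th^2) / 2
grad_log_p <- function(th) -th
samples <- matrix(0, nrow = 1000, ncol = 2)
for (i in 2:1000)
  samples[i, ] <- hmc_step(samples[i - 1, ], log_p, grad_log_p, eps = 0.2, L = 20)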
7.4.1 The No-U-Turn Sampler (NUTS)
NUTS (Hoffman and Gelman, 2014) extends HMC with two adaptations:
- Automatic trajectory length. NUTS extends the trajectory in random directions until it begins to double back on itself (forms a U-turn) or reaches a maximum tree depth (typically 10, giving up to \(2^{10} - 1 = 1023\) leapfrog steps). The trajectory automatically stops just before the proposal would loop back.
- Automatic step-size adaptation. During warmup, NUTS adjusts \(\epsilon\) to achieve a target acceptance rate (default 0.8 in Stan).
The result is a sampler with no user-tunable parameters except the target acceptance rate (adapt_delta in Stan). For most well-formulated posteriors, NUTS produces effective samples at hundreds to thousands per second, often hundreds of times faster than random-walk Metropolis.
NUTS is the default in Stan, PyMC, NumPyro, and via the R wrappers rstanarm, brms, rstan, and cmdstanr. For routine Bayesian analyses, you do not implement NUTS yourself; you use one of these.
7.5 Modern convergence diagnostics
7.5.1 \(\hat R\) (rank-normalised, split, with folding)
The classical Gelman-Rubin diagnostic compares within-chain to between-chain variance. The 2021 update by Vehtari et al. introduces three improvements:
- Rank normalisation. Replace samples by their ranks, then transform via \(\Phi^{-1}\) to a normal scale. Robust to heavy tails and asymmetric posteriors.
- Split. Each chain is split into two halves; \(\hat R\) is computed across the resulting \(2C\) half-chains rather than the original \(C\) chains. Detects within-chain non-stationarity.
- Folding. Compute \(\hat R\) on \(|x - \text{median}(x)|\) in addition to \(x\) itself. Detects scale (variance) non-stationarity.
The threshold is \(\hat R < 1.01\) for both the regular and folded versions. The convention \(\hat R < 1.1\) from older literature is too lax for modern applied Bayesian work; use 1.01.
library(posterior)
draws <- as_draws_array(stan_fit)
summarise_draws(draws, default_convergence_measures())
#> # A tibble: 5 × 4
#> variable rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl>
#> 1 b_intercept 1.00 2840 2120
#> 2 b_x 1.00 2950 2470
#> 3 sigma 1.00 1980 2350
#> 4 ...
7.5.2 Effective sample size
ESS is the number of independent samples that would give the same standard error as the autocorrelated MCMC samples. ESS bulk measures the effective sample size for typical-value summaries (mean, median); ESS tail measures it for tail summaries (95% credible intervals). The convention: at least 400 in each is required for reliable summaries. Below that, the posterior interval bounds are noisy.
For most posteriors, ESS bulk and tail are similar. When they differ substantially, the tail of the posterior is slow-mixing; tail-quantile reporting is unreliable.
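The posterior package also exposes these per variable. A short sketch, reusing the draws object from the summarise_draws example above:
ess_bulk(extract_variable_matrix(draws, "sigma"))  # ESS for mean/median-type summaries
ess_tail(extract_variable_matrix(draws, "sigma"))  # ESS for the 5% and 95% quantiles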
7.5.3 Divergent transitions
A divergence in NUTS occurs when the leapfrog integrator fails to track the true Hamiltonian dynamics, typically because the integrator step size is too large for a high-curvature region. Divergences indicate that NUTS is not exploring some part of the posterior accurately; the resulting samples are biased away from that region.
Stan reports the divergence count as part of its default warning output. Any positive count is concerning; double digits are a serious problem.
The textbook fix: increase adapt_delta from 0.8 to 0.95 or 0.99. This reduces the step size, which often fixes the immediate problem. But this treats the symptom; the underlying cause is usually a posterior geometry that requires a different parameterisation (see below).
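For reference, the knob is set at sampling time. A sketch, assuming a compiled cmdstanr model mod and stan_data as elsewhere in this chapter (the rstan object name model is illustrative):
# cmdstanr: adapt_delta is an argument to $sample()
fit <- mod$sample(data = stan_data, adapt_delta = 0.99)
# rstan: passed via the control list
fit <- rstan::sampling(model, data = stan_data,
                       control = list(adapt_delta = 0.99))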
7.5.4 E-BFMI
The Energy-based Bayesian Fraction of Missing Information diagnoses whether the chain is exploring the energy distribution efficiently. Values below 0.3 indicate problems; the canonical example is the funnel pathology, where the chain repeatedly fails to sample the bottom of the funnel.
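Concretely, with the Hamiltonian energies \(E_1, \dots, E_N\) recorded at each iteration of a chain, the standard estimator is
\[ \widehat{\text{E-BFMI}} = \frac{\sum_{n=2}^{N} (E_n - E_{n-1})^2}{\sum_{n=1}^{N} (E_n - \bar E)^2}. \]
Low values mean the momentum resampling changes the energy too little per iteration for the chain to traverse the full energy distribution.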
# Stan reports E-BFMI per chain
rstan::check_energy(fit) # warning if any chain has E-BFMI < 0.2
7.5.5 Pair plots
For diagnosing geometric pathologies, the most useful visual is a pair plot coloured by divergence:
library(bayesplot)
mcmc_pairs(stan_fit, pars = c("mu", "sigma", "tau"),
np = nuts_params(stan_fit))
Divergent transitions cluster in regions of the posterior the sampler cannot integrate accurately. In hierarchical models, divergences typically cluster in the bottom of the funnel (small \(\tau\), group means squeezed toward \(\mu\)).
7.5.6 Trace plots
The classical trace plot shows the sampled value of each parameter against iteration. A converged chain looks like white noise around a stable mean. Chains that drift, chains that get stuck, and chains that explore different regions on different runs are visually obvious in a trace plot.
mcmc_trace(stan_fit, pars = c("mu", "sigma"))
7.6 Reparameterisation: centred vs. non-centred
The single most common reason a hierarchical Bayesian model produces divergences is a parameterisation mismatch with the data’s information level. The canonical example: a hierarchical normal with group means \(\mu_j \sim N(\mu, \tau)\) and observation-level likelihood \(y_{ij} \sim N(\mu_j, \sigma)\).
Centred parameterisation (CP):
parameters {
real mu;
real<lower=0> tau;
real<lower=0> sigma;
vector[J] mu_j;
}
model {
mu_j ~ normal(mu, tau);
for (j in 1:J)
y[j] ~ normal(mu_j[j], sigma);
}
Non-centred parameterisation (NCP):
parameters {
real mu;
real<lower=0> tau;
real<lower=0> sigma;
vector[J] z_j; // standard normal
}
transformed parameters {
vector[J] mu_j = mu + tau * z_j;
}
model {
z_j ~ normal(0, 1);
for (j in 1:J)
y[j] ~ normal(mu_j[j], sigma);
}
The two parameterise the same model. The posterior on \((\mu, \tau, \mu_j)\) is identical in distribution. But the posterior geometry differs:
- Under CP, the posterior on \((\tau, \mu_j)\) has a funnel shape (visualised just below): when \(\tau\) is small, \(\mu_j\) is squeezed close to \(\mu\), producing high curvature in the joint posterior.
- Under NCP, the posterior on \((\tau, z_j)\) is roughly spherical: \(z_j\) stays around its prior \(N(0, 1)\) regardless of \(\tau\).
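The funnel under CP is visible without fitting anything: draw \((\tau, \mu_j)\) pairs directly from the hierarchical prior. A minimal R sketch (the half-normal prior scale of 3 is illustrative, and \(\mu\) is fixed at 0):
tau <- abs(rnorm(5000, 0, 3))   # half-normal prior on tau
mu_j <- rnorm(5000, 0, tau)     # centred: mu_j | tau ~ N(mu = 0, tau)
plot(mu_j, log(tau), pch = ".",
     xlab = "mu_j", ylab = "log(tau)")  # characteristic funnel shape
At small \(\tau\) the conditional distribution of \(\mu_j\) is extremely narrow; that neck is the high-curvature region where the leapfrog integrator diverges.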
When CP works: when the data are informative about each \(\mu_j\) (many observations per group). The posterior on \(\mu_j\) is then dominated by the data, not the prior, and the funnel is mostly absent.
When NCP works: when the data are weakly informative about each \(\mu_j\) (few observations per group). The posterior is dominated by the prior; the funnel is present in CP but absent in NCP.
The general rule: start with NCP for any hierarchical-prior parameter that is not strongly identified by the data. Switch to CP only when both work and CP gives more efficient sampling.
7.7 Other reparameterisations
Beyond centred / non-centred, several other reparameterisations recur:
Standardise predictors. A predictor measured in millions has a coefficient on the order of \(10^{-6}\); one measured in tenths has a coefficient of order 1. Mixing both in one regression produces a posterior on \(\boldsymbol\beta\) with very different scales per coefficient, which the sampler’s mass matrix has trouble adapting to. Standardising every continuous predictor to mean 0, SD 1 puts coefficients on comparable scales and aids sampling.
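A minimal sketch, assuming a data frame df with continuous predictors x1 and x2 (names hypothetical):
df_std <- df
df_std[c("x1", "x2")] <- scale(df[c("x1", "x2")])  # centre to mean 0, rescale to SD 1
Remember to apply the same centring and scaling to any new data used for prediction.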
Log-transform positive variables. A posterior on \(\sigma > 0\) is more naturally sampled in the unconstrained \(\log \sigma\) space. Stan automatically applies the transform for <lower=0> declarations and includes the Jacobian; PyMC and similar do the same. Hand-rolled MCMC must include the change-of-variable Jacobian explicitly.
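Concretely, for \(\phi = \log \sigma\) the change of variables contributes the log-Jacobian \(\log |d\sigma / d\phi| = \phi\):
\[ \log p(\phi) = \log p\!\left(\sigma = e^{\phi}\right) + \phi. \]
Omitting the \(+\phi\) term is the classic hand-rolled-MCMC bug: the chain then targets a different distribution from the one intended.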
Logit-transform probabilities. Probabilities live in \((0, 1)\); sampling on the logit scale removes the boundary effects.
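In R, qlogis() and plogis() implement the transform and its inverse:
phi <- qlogis(0.9)  # logit: log(p / (1 - p))
plogis(phi)         # inverse logit recovers 0.9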
Cholesky parameterisation of covariance matrices. For multivariate-normal random effects with covariance \(\Sigma\), parameterise via the lower-triangular Cholesky factor \(L\) such that \(\Sigma = L L^T\). Stan’s lkj_corr_cholesky prior works on the Cholesky factor of the correlation matrix. This avoids the positive-definiteness constraint and produces substantially better sampling.
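The point in R terms: any lower-triangular \(L\) with positive diagonal reconstructs a valid covariance, with no explicit positive-definiteness check needed during sampling. A small sketch (values illustrative):
L <- matrix(c(1.0, 0.0,
              0.5, 0.8), nrow = 2, byrow = TRUE)  # lower triangular, positive diagonal
Sigma <- L %*% t(L)                               # symmetric positive definite by construction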
7.8 Parallel chains
Modern Bayesian software runs multiple chains in parallel by default. The standard configuration:
- 4 chains (Stan default), each on its own CPU core.
- 1000 warmup + 1000 sampling iterations per chain (Stan default).
- Random initialisation dispersed in the parameter space (Stan default: uniform on \([-2, 2]\) in the unconstrained space).
The four chains’ worth of samples are pooled for inference; their consistency provides the \(\hat R\) diagnostic.
For long-running models, cmdstanr and rstan run chains in parallel. To use the available parallelism, set parallel_chains in cmdstanr (or cores in rstan) to the number of chains, up to the number of physical cores you have.
library(cmdstanr)
fit <- stan_model$sample(
data = stan_data,
chains = 4,
parallel_chains = 4,
iter_warmup = 1000,
iter_sampling = 1000,
seed = 47
)
For RNG reproducibility, set seed. Each chain receives a derived seed; running with the same seed produces identical samples. Without setting seed, results are not reproducible.
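A quick check of this, reusing stan_model and stan_data from above (the duplicated fit is purely for demonstration):
fit1 <- stan_model$sample(data = stan_data, chains = 4,
                          parallel_chains = 4, seed = 47)
fit2 <- stan_model$sample(data = stan_data, chains = 4,
                          parallel_chains = 4, seed = 47)
identical(fit1$draws(), fit2$draws())  # TRUE: same seed, identical samples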
7.9 Worked example: Eight Schools
The Eight Schools dataset (Rubin, 1981; Gelman et al., 2013) is the canonical hierarchical-Bayesian example. The data: estimated treatment effects of an SAT-prep coaching intervention at eight schools, with their standard errors:
schools_data <- list(
J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
A hierarchical model with non-centred parameterisation:
data {
int<lower=0> J;
vector[J] y;
vector<lower=0>[J] sigma;
}
parameters {
real mu;
real<lower=0> tau;
vector[J] eta;
}
transformed parameters {
vector[J] theta = mu + tau * eta;
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
eta ~ normal(0, 1);
y ~ normal(theta, sigma);
}
Fit and check diagnostics:
library(cmdstanr)
mod <- cmdstan_model('eight-schools-ncp.stan')
fit <- mod$sample(data = schools_data, chains = 4,
parallel_chains = 4, seed = 47, refresh = 0)
fit$summary() # tibble of point estimates, ESS, R-hat
fit$diagnostic_summary() # divergences, E-BFMI, max tree depth
Under NCP, this model samples cleanly with zero divergences, \(\hat R\) near 1, and ESS in the thousands in seconds. Re-running with the centred parameterisation typically produces dozens of divergences.
7.10 Collaborating with an LLM on MCMC
LLMs handle Stan and PyMC code competently, including both centred and non-centred parameterisations. They are less reliable on the diagnostic interpretation: specifically, on translating diagnostic output into either ‘this is fine, publish’ or ‘this is broken, fix the parameterisation’.
Prompt 1: write Stan code with appropriate parameterisation. Describe the model and the data’s information level (how many observations per group). Ask: ‘write Stan code for this hierarchical model. Choose between centred and non-centred parameterisation based on the information level.’
What to watch for. Few observations per group (under roughly 5–10) calls for NCP. Many observations per group (50+) is the regime where CP works. The LLM should make this choice and explain it. If it picks one without explanation, ask why.
Verification. Fit the model with the chosen parameterisation; check for divergences. If divergences appear with the chosen parameterisation, try the alternative. The right one usually has zero divergences.
Prompt 2: interpret diagnostic output. Paste the output of fit$diagnostic_summary() (divergences, E-BFMI, max tree depth) and fit$summary() (R-hat, ESS). Ask: ‘is this chain converged? What should I worry about?’
What to watch for. The LLM should check \(\hat R < 1.01\) on every parameter, ESS bulk and tail above 400 on every parameter, divergences = 0, E-BFMI > 0.3, and max tree depth not saturated. If any fail, the diagnosis depends on which: divergences point to parameterisation or adapt_delta; low ESS points to chain length; saturated tree depth points to step size.
Verification. Run with adjusted settings (more iterations, NCP, higher adapt_delta) and confirm the diagnostic improves.
Prompt 3: diagnose a specific pathology. Paste a pair plot description (or the actual mcmc_pairs output) showing divergences clustered in a particular region. Ask: ‘what kind of geometric pathology produces this pattern, and what reparameterisation would fix it?’
What to watch for. Divergences clustered at small \(\tau\) in a hierarchical model: funnel pathology, fix with NCP. Divergences clustered at the boundary of a constrained parameter (e.g., \(\rho \to 1\) in a correlation): scale issue, consider tighter prior. Divergences along a thin diagonal: posterior is ill-conditioned, need to reparameterise to break the correlation.
Verification. Apply the recommended reparameterisation; refit; verify divergences are gone.
The meta-pattern: modern MCMC’s diagnostics are informative if you read them; LLMs help interpret them faster than you would alone. The judgement is whether to trust the LLM’s diagnosis enough to publish, which requires checking it against the actual posterior behaviour.
7.11 Principle in use
Three habits define defensible MCMC use:
- Read the full diagnostic suite. \(\hat R\), ESS (bulk and tail), divergences, E-BFMI. Any failure is a reason to investigate before reporting.
- Reparameterise before retuning. When divergences appear, the first move is non-centred parameterisation (for hierarchical models) or input standardisation (for regression). Tightening adapt_delta is the second move, not the first.
- Multiple chains, dispersed starts. Four chains with random initialisation are the protection against multimodality and against chains that look converged in isolation but disagree across chains.
7.12 Exercises
- Fit the Eight Schools model with both centred and non-centred parameterisations. Compare divergences, ESS, and posterior summaries. Confirm the reparameterisation reproduces the same posterior in distribution but with cleaner sampling.
- Construct a hierarchical model where the data are highly informative (many observations per group) and one where they are weakly informative (few). For each, compare CP and NCP. The advantage of NCP should diminish in the data-rich regime.
- Implement the leapfrog integrator and a basic HMC sampler in R. Test on a 2D Gaussian target. Compare to random-walk Metropolis at fixed computational budget.
- Use bayesplot::mcmc_pairs with divergence colouring on a hierarchical model that produces divergences. Visually identify the funnel; describe its location in \((\tau, \mu_j)\)-space.
- Run a Stan model with seed = 47 four separate times. Confirm that all four runs produce identical samples (same fit$draws()). Then run without a seed and observe non-reproducibility.
7.13 Further reading
- (Hoffman & Gelman, 2014), Hoffman and Gelman, the NUTS paper.
- (Betancourt, 2017), Betancourt, ‘A conceptual introduction to Hamiltonian Monte Carlo’. The clearest geometric treatment.
- (Vehtari et al., 2021), Vehtari et al., the modern \(\hat R\) paper.
- The Stan User’s Guide and the
bayesplot,posterior, andcmdstanrpackage documentation.