2  Numerical Stability and Conditioning

2.1 Learning objectives

By the end of this chapter you should be able to:

  • Describe IEEE 754 double-precision floating-point representation and identify the boundary cases (zero, subnormals, infinity, NaN) that affect statistical computation.
  • Compute and interpret machine epsilon, and predict when ordinary arithmetic fails to satisfy the laws you remember from high-school algebra (associativity, distributivity).
  • Recognise catastrophic cancellation when subtracting two nearly-equal floating-point numbers, and rewrite the affected expression to avoid it.
  • Compute the condition number of a matrix or function and interpret it as ‘how many digits will I lose on this operation’.
  • Distinguish forward error from backward error and explain why backward stability is the right objective for a numerical algorithm.
  • Diagnose numerical fragility in R using kappa(), .Machine, bench::mark(), and a small set of red-flag patterns.
  • Replace several classic statistical anti-patterns (computing variance via the textbook formula, computing \((X^T X)^{-1}\) by explicit inverse, computing \(\log \sum \exp\) naively) with their numerically stable counterparts.

2.2 Orientation

Computer arithmetic is not arithmetic. Every floating-point number is a real number rounded to one of a finite set, spaced unevenly across the real line. Every operation on floating-point numbers introduces a small relative error, on the order of \(10^{-16}\) for double precision. Most of the time you will not notice; the errors are tiny and they do not accumulate to anything visible. Some of the time you will notice catastrophically: a regression that worked on simulated data fails on a real dataset; a likelihood that should be 0.999 returns 1.001; an MCMC chain produces NaN after a million iterations and you have no idea why.

The introductory volume taught the algorithms; this chapter teaches when those algorithms break and how to tell. The material is foundational for every numerical chapter that follows. It is also the kind of material a working biostatistician spends a career rediscovering one bug at a time. Understanding it once, in one place, saves the career-long version.

2.3 The statistician’s contribution

An LLM can recite IEEE 754, the definition of a condition number, and the textbook anti-patterns. What it cannot do is recognise when your analysis is in numerical trouble. The judgements at the centre of this chapter are diagnostic: knowing what to suspect when something looks slightly off, and what to test for confirmation.

Suspect numerical fragility before suspecting bugs. When a regression returns coefficients that contradict prior work, or a likelihood maximises at a value that disagrees with a closed-form check, the first hypothesis should not be ‘I have a bug’. It should be ‘something here is ill-conditioned’. The diagnostic is fast: compute the condition number, compute the gradient by finite differences and compare to the analytic gradient, perturb the input slightly and watch the output. If the output moves by more than the perturbation should produce, the problem is not the code; the problem is the floating-point behaviour of the algorithm on this data.

Statistical formulas in textbooks are not numerically stable formulas. The shortcut variance form \(\frac{1}{n-1}\bigl(\sum x_i^2 - n\bar x^2\bigr)\) is mathematically correct but numerically fragile: it subtracts two near-equal large numbers when \(\bar x\) is large relative to \(\sigma\), and the catastrophic cancellation can produce a negative variance estimate. The centred-deviation form \(\frac{1}{n-1}\sum (x_i - \bar x)^2\) is numerically robust and is what R’s var() actually computes. The textbook often presents the shortcut form without warning. The stable formulations exist in libraries; the ones that you type in from memory will be the textbook ones.

Match precision to need; do not over-promise. A floating-point answer with 16 digits of nominal precision may have only 10 actual digits after a moderately ill-conditioned operation, and only 4 digits after a badly ill-conditioned one. A function that returns a 16-digit number does not advertise this; you advertise it (or fail to). Regression-coefficient point estimates with three digits of precision and a 95% CI are honest. The same estimate with eight digits of precision suggests a level of numerical certainty that the underlying computation cannot supply.

Test the boundary, not just the middle. Synthetic data drawn from \(N(0, 1)\) rarely exposes numerical problems. Real biomedical data has near-zero variances, near-collinear covariates, near-saturated probabilities, and observations that span many orders of magnitude. A correctness test on the centre of the data distribution is necessary; a robustness test on the boundary is necessary too.

These judgements are what distinguish numerical code that works on the analyst’s laptop from numerical code that works on the next analyst’s harder dataset.

2.4 Floating-point arithmetic in one page

A double-precision floating-point number has 64 bits: 1 sign bit, 11 exponent bits, 52 mantissa bits. The number represented is

\[ (-1)^s \cdot (1 + m / 2^{52}) \cdot 2^{e - 1023} \]

with special cases for zero, subnormals (when \(e = 0\)), infinity (when \(e = 2047\) and \(m = 0\)), and NaN (when \(e = 2047\) and \(m \ne 0\)).
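
The boundary cases are easy to produce directly:

c(1 / 0, -1 / 0, 0 / 0)
#> [1]  Inf -Inf  NaN

.Machine$double.xmin / 2^52           # the smallest positive subnormal, 2^-1074
#> [1] 4.940656e-324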

The relative gap between consecutive representable numbers is roughly \(2^{-52} \approx 2.22 \times 10^{-16}\). This is machine epsilon:

.Machine$double.eps
#> [1] 2.220446e-16

.Machine$double.xmax            # largest finite double
#> [1] 1.797693e+308

.Machine$double.xmin            # smallest positive normalised
#> [1] 2.225074e-308

Every integer up to \(2^{53} = 9{,}007{,}199{,}254{,}740{,}992\) is exactly representable as a double, and \(2^{53}\) is the last point at which this holds. Beyond it, consecutive integers are no longer all representable.

2^53 == 2^53 + 1
#> [1] TRUE

This is one reason patient identifiers and sequence reads are stored as integers or strings, not doubles.

2.4.1 The arithmetic laws change

Floating-point operations satisfy most of the laws of real arithmetic, with three important exceptions:

Addition is not associative. \((a + b) + c\) may differ from \(a + (b + c)\) because the rounding step between operations depends on operand magnitudes.

(1 + 1e100) - 1e100
#> [1] 0              # the 1 is absorbed by the first addition
1 + (1e100 - 1e100)
#> [1] 1              # the same three terms, regrouped

The lost 1 is absorption, not cancellation proper: a small number added to a much larger one falls below the larger number’s rounding unit and vanishes entirely. Catastrophic cancellation, the subtraction-side failure, is the subject of Section 2.5.

Multiplication is not associative either, though the deviations are typically smaller and harder to construct.

The distributive law is approximate. \(a \cdot (b + c) \ne a \cdot b + a \cdot c\) in general, because the parenthesised sum rounds before the multiplication.
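
A one-line check, using decimal fractions that are not exactly representable in binary:

u <- 0.1; v <- 0.2; w <- 0.3
u * (v + w) == u * v + u * w
#> [1] FALSE          # the two groupings differ by about one ulp near 0.05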

For algorithms whose correctness depends on the exact arithmetic laws, every operation is a place where a small error enters. The error analysis question is whether the errors stay bounded or grow.

2.5 Catastrophic cancellation

When you subtract two nearly-equal numbers, the leading significant digits agree and cancel; what remains is a small number whose precision is set by the trailing digits of the inputs. Those trailing digits are the rounding error. The result has lost most of its significant figures.

A canonical example: computing a variance.

\[ \widehat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 \]

This is the definition, and computed as written (mean first, then squared deviations) it is numerically robust. But textbooks often present an algebraically equivalent rearrangement, the ‘shortcut formula’:

\[ \widehat{\sigma}^2 = \frac{1}{n-1} \left( \sum x_i^2 - n \bar x^2 \right) \]

This form subtracts two large nearly-equal numbers (the sum of squares and \(n \bar x^2\)) and is catastrophically unstable for data far from zero relative to their spread (a vector of HbA1c values around 6.5%, say).

x <- 1e9 + rnorm(1000, sd = 1)        # mean ~1e9, sd ~1

# stable: two-pass
mean_x <- mean(x)
var_stable <- sum((x - mean_x)^2) / (length(x) - 1)
var_stable
#> [1] 1.014       # close to true variance ~1

# unstable: shortcut formula
var_naive <- (sum(x^2) - length(x) * mean_x^2) / (length(x) - 1)
var_naive
#> [1] -23.5       # negative; clearly wrong

# what R's var() actually does (two-pass centred-deviation, stable)
var(x)
#> [1] 1.014

The shortcut formula gives a negative variance estimate. The naive code does not crash; it produces a plausible-looking number that is wrong in both sign and magnitude.

The cure is one of:

  1. Two-pass. Compute the mean first; then sum the squared deviations. Doubles the data passes; numerically robust. This is what R’s var() and sd() use internally (the C source calls a centred-deviation sum after a separate mean pass).
  2. Welford’s online algorithm. Single-pass; updates a running mean and sum of squared deviations in a numerically stable way. The standard choice for streaming data where two passes are not available; a sketch follows this list.
  3. Centre the data first. Subtract any reasonable approximation of the mean from every observation; compute on the centred data. Useful when you need to express the variance via a formula that the shortcut form would otherwise tempt you into.
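
A minimal sketch of Welford’s update (the loop is written for clarity, not speed):

welford_var <- function(x) {
  n <- 0; m <- 0; M2 <- 0
  for (xi in x) {
    n     <- n + 1
    delta <- xi - m
    m     <- m + delta / n              # running mean
    M2    <- M2 + delta * (xi - m)      # running sum of squared deviations
  }
  M2 / (n - 1)
}

welford_var(1e9 + rnorm(1000, sd = 1))
#> close to 1, matching var() on the same input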

In a regression setting, the same issue appears as \(X^T X\) becoming numerically near-singular when \(X\) has columns of very different magnitudes. The cure is to centre and scale columns before forming \(X^T X\), or to work directly with the QR decomposition of \(X\) (which never forms \(X^T X\)).

Question. You implement the shortcut variance formula on a vector of patient ages (range 18 to 95, mean about 55). Will it give you the wrong answer?

Answer.

Probably not, but it is on the edge of trouble. Patient ages are of order 50, with variance of order 200. The sum of squares is \(\sim n \cdot 50^2\) and \(n \bar x^2\) is also \(\sim n \cdot 50^2\); their difference is the variance times \(n - 1\), so the relative cancellation is about \((n-1) \sigma^2 / (n \bar x^2) \approx \sigma^2 / \bar x^2 \approx 200 / 2500 = 0.08\). You lose about \(\log_{10}(1/0.08) \approx 1\) digit of precision. Not catastrophic for \(n\) in the thousands, but already worse than the two-pass formulation, which loses no digits to cancellation. The cancellation grows with \(\bar x^2 / \sigma^2\), so for biomarkers measured in units far from zero (HbA1c around 6, eGFR around 90, serum creatinine around 0.9), the shortcut formula loses more digits and eventually gives plainly wrong answers. The right rule is never to use the shortcut formula in new code; var() costs nothing extra and behaves correctly on every input.
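
A quick check of the estimate (no seed is fixed, so exact figures vary by draw):

age <- runif(1e4, 18, 95)
shortcut <- (sum(age^2) - length(age) * mean(age)^2) / (length(age) - 1)
c(shortcut = shortcut, two_pass = var(age))
#> the two agree to roughly 15 digits: about one digit lost, as predicted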

2.6 The condition number

The condition number of a numerical problem measures how much the solution changes when the input is perturbed. For solving \(A \mathbf{x} = \mathbf{b}\):

\[ \kappa(A) = \|A\| \cdot \|A^{-1}\| \]

(in any consistent norm). For the 2-norm, \(\kappa_2(A)\) is the ratio of the largest to the smallest singular value.

The interpretation: a relative perturbation of size \(\epsilon\) in the right-hand side \(\mathbf{b}\) produces a relative perturbation of size up to \(\kappa(A) \epsilon\) in the solution \(\mathbf{x}\). With double-precision input, a matrix with \(\kappa(A) \approx 10^{12}\) permits roughly \(\log_{10}(10^{16} / 10^{12}) = 4\) digits of meaningful precision in the answer.

In R:

A <- matrix(c(1, 1, 1, 1.0001), 2, 2)
kappa(A, exact = TRUE)
#> [1] 40004
log10(kappa(A, exact = TRUE))
#> [1] 4.602        # about 11 digits of precision out of 16

A few rules of thumb worth memorising:

  • \(\kappa(A) \le 100\): well-conditioned. Most or all 16 digits are meaningful.
  • \(\kappa(A) \approx 10^6\): moderately ill-conditioned. About 10 digits meaningful.
  • \(\kappa(A) \approx 10^{12}\): severely ill-conditioned. About 4 digits meaningful.
  • \(\kappa(A) \ge 10^{16}\): numerically singular at double precision. The computed solution is dominated by rounding error.

Common sources of large condition number in biostatistical work:

  • Collinear or near-collinear covariates. Two columns of \(X\) that are nearly proportional inflate \(\kappa(X)\); forming \(X^T X\) then squares it to \(\kappa(X)^2\).
  • Wildly different scales. A design matrix with one column in millions (genome positions) and another in tenths (proportions) has \(\kappa(X^T X)\) much larger than necessary. Centre and scale before fitting.
  • Vandermonde-like structure. Polynomial regression with degree above 5 or so produces matrices whose condition numbers explode. Use orthogonal polynomials (poly() instead of cbind(x, x^2, x^3, ...)).
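
The poly() point is easy to see directly (the grid here is arbitrary; exact condition numbers vary with it):

x <- seq(0, 1, length.out = 50)
kappa(outer(x, 0:6, `^`), exact = TRUE)      # raw powers: tens of thousands or worse
kappa(cbind(1, poly(x, 6)), exact = TRUE)    # orthogonal basis: about sqrt(50)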

For the regression problem \(X^T X \boldsymbol\beta = X^T y\), the condition number of the linear system is \(\kappa(X^T X) = \kappa(X)^2\). Working with the QR decomposition of \(X\) avoids forming \(X^T X\) at all and keeps the conditioning at \(\kappa(X)\), not \(\kappa(X)^2\). This is why lm() uses QR.
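
The squaring is easy to verify numerically (the near-collinear column here is constructed for illustration):

set.seed(1)
x1 <- rnorm(100)
X  <- cbind(1, x1, x1 + rnorm(100, sd = 1e-4))   # third column nearly equal to the second
kappa(X, exact = TRUE)^2                         # kappa(X) squared ...
kappa(crossprod(X), exact = TRUE)                # ... equals kappa(X^T X)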

2.7 Forward and backward stability

A numerical algorithm computes \(\hat y = f^*(x)\) rather than the exact \(y = f(x)\). There are two ways to think about the error.

Forward error. \(\|\hat y - y\|\), the distance between the computed and exact answers. This is what you would naturally want to bound, but it depends jointly on the algorithm and the conditioning of the problem.

Backward error. The smallest perturbation \(\Delta x\) such that \(\hat y = f(x + \Delta x)\). That is, the algorithm produces the exact answer to a slightly different problem.

Backward error analysis (Wilkinson, 1960s) is the dominant analytical framework in modern numerical analysis. An algorithm is backward stable if \(\|\Delta x\| \le C \cdot \epsilon_\text{mach}\) for some modest constant \(C\). The forward error is then bounded by \(\kappa(f) \cdot \|\Delta x\|\), so:

\[ \frac{\|\hat y - y\|}{\|y\|} \lesssim \kappa(f) \cdot \epsilon_\text{mach} \]

This factorisation separates algorithm quality (whether the algorithm is backward stable) from problem difficulty (the condition number). A backward stable algorithm produces forward errors as small as the problem permits; an ill-conditioned problem cannot be solved accurately by any algorithm operating in finite precision.

The practical consequence: when you observe a large forward error, ask which factor is responsible. If the condition number is huge, no algorithm can save you; either accept the limited precision, regularise the problem, or use higher-precision arithmetic. If the condition number is small but the answer is still wrong, the algorithm is unstable; replace it.
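
A sketch of the separation at work, using an ill-conditioned Vandermonde system; solve() is backward stable, so the residual is tiny even though the solution has lost many digits (exact magnitudes vary by platform):

x <- seq(0, 1, length.out = 12)
V <- outer(x, 0:11, `^`)             # 12 x 12 Vandermonde: badly ill-conditioned
beta_true <- rep(1, 12)
b <- V %*% beta_true

beta_hat <- solve(V, b)
max(abs(V %*% beta_hat - b))         # residual: near machine precision (backward stability)
max(abs(beta_hat - beta_true))       # forward error: many digits lost (conditioning)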

2.8 Diagnostic patterns

A small set of patterns surfaces most numerical fragility in statistical code.

Pattern 1: compare to a closed-form benchmark. When a numerical answer disagrees with a closed-form check, the question is whether the disagreement is at the rounding level (acceptable) or many orders of magnitude larger (a bug or instability).

# closed form for OLS on synthetic data
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))
beta_true <- c(2, 1, -0.5, 0.3)
y <- X %*% beta_true + rnorm(n)

# fit two ways; compare
beta_lm  <- coef(lm(y ~ X[, -1]))
beta_qr  <- qr.coef(qr(X), y)

max(abs(beta_lm - beta_qr))
#> [1] 1.776e-15      # at machine precision; agreement

If this agreement breaks (becomes \(10^{-6}\) or larger), investigate.

Pattern 2: perturb and watch. Add a small perturbation to the input; recompute. The output should move by an amount proportional to the perturbation.

A <- matrix(c(1, 1, 1, 1.0001), 2, 2)
b <- c(2, 2.0001)

x_orig <- solve(A, b)
x_pert <- solve(A, b + c(0, 1e-10))   # tiny perturbation
max(abs(x_pert - x_orig))
#> [1] 1e-06          # output moves 10000x more than input

The output moved \(10^4\) times more than the input. That is the condition number \((\approx 4 \times 10^4)\) at work. For research code, automate this check on real data and flag any operation where the output sensitivity is larger than expected.

Pattern 3: compute the gradient two ways and compare. For an analytic gradient, compare to a finite-difference gradient. Disagreement above \(10^{-6}\) or so is a bug, typically in the analytic version.

library(numDeriv)

f  <- function(x) sum((x - 0.3)^2)
gf <- function(x) 2 * (x - 0.3)

x  <- runif(5)
max(abs(gf(x) - grad(f, x)))
#> [1] 4.5e-11        # agreement at numerical-derivative precision

Pattern 4: monitor the magnitude of intermediate quantities. When debugging an MCMC chain that produces NaN, log every intermediate variance estimate, every covariance matrix’s condition number, every likelihood contribution. The first to overflow, underflow, or become singular is the immediate cause; the algorithmic root cause is one or two steps upstream.
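
A minimal checkpoint helper for this pattern (update_sigma2 below is a hypothetical stand-in for whatever update you are tracing):

check_finite <- function(x, label) {
  if (!all(is.finite(x))) stop("first non-finite value at: ", label, call. = FALSE)
  invisible(x)
}
# usage inside an iterative update, wrapping each intermediate quantity:
# sigma2 <- check_finite(update_sigma2(state), "sigma2 update")   # update_sigma2 is hypothetical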

2.9 Stable rewrites of unstable formulas

Three classic anti-patterns and their stable replacements.

Antipattern: \(\log \sum_i \exp(z_i)\) computed naively.

z <- c(1000, 1001)
log(sum(exp(z)))
#> [1] Inf            # overflow

The fix is the log-sum-exp trick: factor out the maximum before the exponential.

logsumexp <- function(z) {
  M <- max(z)
  M + log(sum(exp(z - M)))
}
logsumexp(z)
#> [1] 1001.313

This appears whenever you normalise a probability across many categories: softmax, mixture-model responsibilities, log-likelihood contributions in models with many random effects. The matrixStats package’s logSumExp() provides a robust implementation.
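
For example, mixture responsibilities from log-weights too negative for exp() on its own:

library(matrixStats)
logw <- c(-1050, -1047, -1049)          # unnormalised log-weights; exp(logw) underflows to 0
post <- exp(logw - logSumExp(logw))     # responsibilities, computed without underflow
round(post, 4)
#> [1] 0.0420 0.8438 0.1142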

Antipattern: \(1 - \exp(z)\) for \(z\) near zero.

z <- 1e-10
1 - exp(z)
#> [1] -1e-10         # prints fine at 7 digits; only ~6 of 16 digits are actually correct

R provides expm1(z) which computes \(\exp(z) - 1\) accurately for small \(z\):

-expm1(z)
#> [1] -1e-10         # same printed value, now correct to full precision

Likewise log1p(z) computes \(\log(1 + z)\) accurately for small \(z\). These appear in CDFs for distributions that are close to 1 in the upper tail, in the logistic-regression log-likelihood for predictions near 0 or 1, and in survival analysis.
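
The same cancellation bites when an upper-tail probability is computed by subtraction from 1:

1 - pnorm(10)
#> [1] 0               # the tail mass cancels away entirely
pnorm(10, lower.tail = FALSE)
#> [1] 7.619853e-24    # the actual upper-tail probability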

Antipattern: solving \(X^T X \boldsymbol\beta = X^T y\) via explicit inverse.

beta <- solve(t(X) %*% X) %*% t(X) %*% y           # bad
beta <- solve(crossprod(X), crossprod(X, y))       # better
beta <- qr.coef(qr(X), y)                          # best

The first form forms \(X^T X\) and inverts it explicitly, inflating the condition number from \(\kappa(X)\) to \(\kappa(X)^2\) and then computing an inverse, which is both numerically worse and computationally wasteful. The QR-based form avoids forming \(X^T X\) entirely and preserves the original conditioning.

Question. Why does the log-sum-exp trick avoid overflow when the naive form does not?

Answer.

The naive form computes \(\exp(z_i)\) for every \(z_i\). For any \(z_i\) above about 709, this overflows to infinity at double precision. The log-sum-exp trick factors out the maximum: \(\log \sum \exp(z_i) = M + \log \sum \exp(z_i - M)\) where \(M = \max_i z_i\). Now the exponentials inside the sum have arguments \(z_i - M \le 0\), so each \(\exp(\cdot)\) is in \((0, 1]\) and cannot overflow. The sum and log are applied to a small finite number, then \(M\) is added back on the log scale. The mathematics is identical; the floating-point behaviour is dramatically different. This is the canonical example of a numerically stable rewrite: same answer, different computation, no overflow.

2.10 Worked example: a numerically robust logistic likelihood

The logistic-regression log-likelihood is

\[ \ell(\boldsymbol\beta) = \sum_i \left[ y_i \mathbf{x}_i^T \boldsymbol\beta - \log(1 + \exp(\mathbf{x}_i^T \boldsymbol\beta)) \right]. \]

The naive computation of log(1 + exp(eta)) overflows to Inf when \(\eta\) is large positive, and for large negative \(\eta\) returns exactly 0 where the true value is \(\approx \exp(\eta)\): small, but not zero. The robust form uses log1p and a max trick:

loglik_logistic <- function(beta, X, y) {
  eta <- as.numeric(X %*% beta)
  # log(1 + exp(eta)) without overflow:
  # for eta >= 0: eta + log(1 + exp(-eta))
  # for eta <  0: log(1 + exp(eta))
  # combined via max trick:
  m   <- pmax(eta, 0)
  log1pe <- m + log1p(exp(-abs(eta)))
  sum(y * eta - log1pe)
}

Compare to the naive form on a difficult input:

set.seed(1)
n <- 200
X <- cbind(1, scale(matrix(rnorm(n * 3), n, 3)))
beta <- c(0.5, 50, -50, 0)              # extreme coefs to force trouble
y <- rbinom(n, 1, plogis(X %*% beta))

beta_test <- c(0, 300, -300, 0)         # large enough that some eta exceed ~709
naive <- function(beta, X, y) {
  eta <- as.numeric(X %*% beta)
  sum(y * eta - log(1 + exp(eta)))
}

naive(beta_test, X, y)
#> [1] -Inf            # exp(eta) overflows to Inf once any eta exceeds ~709
is.finite(loglik_logistic(beta_test, X, y))
#> [1] TRUE            # finite and meaningful

The naive version returns -Inf, which an optimiser treats as an impassable barrier, stopping the search; the robust version returns a finite value the optimiser can work with.

2.11 Collaborating with an LLM on numerical stability

LLMs are reasonable at recalling the textbook anti-patterns in this chapter and at writing the stable replacements. They are less reliable at identifying which of your specific formulas are unstable, because that requires working through the magnitudes that arise in your data.

Prompt 1: review for instability. Paste a short function (a likelihood, a moment computation, a closed-form estimator) and ask: ‘identify any numerical-stability risks in this code, and propose stable rewrites where relevant.’

What to watch for. The LLM should flag: subtractions of nearly-equal quantities, naive log(1 + exp(x)) and log(sum(exp(x))), explicit matrix inverses, computations of variance via the shortcut formula, and accumulation of many small additions to a large running total. If it flags none on a function that contains any of these, push back; if it flags more, evaluate each.

Verification. Construct an extreme input (mean far from zero, near-collinear design, near-zero variance) and run both versions. The stable rewrite should produce a finite sensible number; the unstable version should produce overflow, underflow, NaN, or wildly wrong values.

Prompt 2: explain a numerical bug. Paste the failing output and the surrounding code, ask: ‘this returns NaN on a small subset of inputs. Trace the failure: what is the first quantity that becomes NaN, and why?’

What to watch for. The LLM is good at the standard failure cascade (zero variance → divide by zero → NaN; a covariance matrix with a tiny negative eigenvalue from rounding → Cholesky fails). It is less reliable for domain-specific cascades (a hierarchical-prior scale collapsing to zero in a Bayesian model). If the answer sounds vague, ask the LLM to state the failure in terms of specific quantities and where they were last finite.

Verification. Add stopifnot() or cat(...) checkpoints at each stage of the computation. The first checkpoint that triggers identifies the precise failure point; compare to what the LLM predicted.

Prompt 3: produce a numerically robust implementation. Describe the mathematical formula and the input ranges, ask: ‘implement this in R, prioritising numerical robustness over brevity. Use log1p, expm1, logSumExp, crossprod, qr.coef, and similar where applicable. Comment on each numerical-stability decision in the code.’

What to watch for. The LLM should produce code with explicit stability rationale in comments. If the comments are absent or generic, the implementation may be a straight transcription of the formula without thinking about magnitudes. If the code matches a textbook anti-pattern, ask for a specific rewrite.

Verification. Test on three input regimes: typical (centred near zero, modest variance), extreme (mean far from zero), and pathological (near-collinear design or saturated probabilities). The robust implementation should work on all three; the textbook implementation will fail on the second or third.

The meta-lesson: numerical robustness is not a correctness property of the formula; it is a property of how the formula is computed. LLMs assist with the computation but cannot recognise instability without you naming the specific risk.

2.12 Principle in use

Three habits define defensible numerical practice:

  1. Compute condition numbers. Whenever you solve a linear system, fit a regression, or invert a matrix, compute and look at the condition number. It is one line of code and surfaces the numerical risk before it becomes a wrong answer.
  2. Use stable rewrites by default. log1p, expm1, logSumExp, crossprod, qr.coef for OLS, two-pass variance. Make the stable form your habit; reach for the unstable form only when you have specifically verified its safety.
  3. Test the boundary. Synthetic data centred near zero rarely exposes problems. Run on data with means far from zero, near-collinear covariates, and probabilities near 0 or 1. The boundary cases are where production data lives.

2.13 Exercises

  1. Compute the variance of 1e9 + rnorm(1e5) two ways: the shortcut formula and var(). Document the difference. Now repeat with 1e15 + rnorm(1e5). At what magnitude of mean does the shortcut formula become useless?
  2. Construct a \(5 \times 5\) Hilbert matrix \(H_{ij} = 1/(i + j - 1)\). Compute its condition number. Solve \(H \mathbf{x} = \mathbf{b}\) for \(\mathbf{b}\) chosen so the exact answer is \(\mathbf{x} = (1, 1, 1, 1, 1)\). Report the relative error. Repeat for \(n = 10\) and \(n = 12\).
  3. Implement a numerically robust softmax function: \(\sigma_i(\mathbf{z}) = \exp(z_i) / \sum_j \exp(z_j)\). Verify on inputs containing values larger than 700, where naive softmax overflows.
  4. The logistic CDF \(\Phi(z) = (1 + e^{-z})^{-1}\) has a naive implementation that returns exactly 1 for large positive \(z\) and exactly 0 for large negative \(z\); both are correct to double precision, but their logarithms are useless. Implement log(plogis(z)) accurately for both tails. Hint: this is what R’s plogis(z, log.p = TRUE) does; verify your implementation against it.
  5. Take a regression of yours from a previous chapter or a real analysis. Compute the condition number of \(X\) and of \(X^T X\). Note the ratio. Compute the OLS coefficients via the explicit-inverse formula and via qr.coef. Compare the two answers in their last few digits.

2.14 Further reading

  • (Goldberg, 1991), ‘What every computer scientist should know about floating-point arithmetic’. The canonical introduction. Free at dl.acm.org.
  • (Higham, 2002), Accuracy and Stability of Numerical Algorithms, 2nd ed. The reference. Dense.
  • (Trefethen & Bau III, 1997), Numerical Linear Algebra. The treatment of conditioning and stability in chapters 12–15 is excellent.
  • The R documentation at ?.Machine and the matrixStats package vignettes for stable implementations of common reductions.