
Gaussian Processes

Machine Learning · Ch.7 of 12

A GP is a distribution over functions. Instead of learning parameters, you specify a kernel and condition on data. The posterior is also a GP, so prediction comes with uncertainty for free.

Figure: GP posterior mean curve with confidence bands

RBF kernel matrix

The kernel function defines the covariance between any two function values. For inputs x_1, ..., x_n, the kernel matrix K has K[i,j] = k(x_i, x_j). The RBF kernel k(x,y) = exp(-||x-y||^2 / (2l^2)) produces smooth functions; the lengthscale l controls how quickly correlations decay.

Scheme
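A minimal sketch of what this section's code computes, assuming portable R7RS Scheme and the `rbf` and `kernel-matrix` names from the notation table; representing the matrix as a list of rows is an illustrative choice, not from the original.

```scheme
(import (scheme base) (scheme inexact))

;; RBF kernel for scalar inputs: exp(-(x - y)^2 / (2 l^2)).
(define (rbf x y l)
  (exp (- (/ (square (- x y)) (* 2 (square l))))))

;; Gram matrix as a list of rows: K[i][j] = (rbf x_i x_j l).
(define (kernel-matrix xs l)
  (map (lambda (xi)
         (map (lambda (xj) (rbf xi xj l)) xs))
       xs))
```

Every diagonal entry is exp(0) = 1, and off-diagonal entries decay toward 0 as |x_i - x_j| grows relative to l.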

GP prior samples

A GP prior with zero mean and kernel k defines a distribution over functions. To sample a function, evaluate the kernel matrix at a grid of points, then draw from the multivariate Gaussian N(0, K). Each sample is a smooth curve whose shape depends on the kernel.

Scheme
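A sketch of the sampling step, again assuming R7RS Scheme; the matrix helpers, the diagonal jitter, and the LCG-based normal generator are illustrative choices, not from the original. To draw f ~ N(0, K), factor K = L Lᵀ by Cholesky and compute f = L z with z a vector of standard normals.

```scheme
(import (scheme base) (scheme inexact))

;; --- tiny matrix helpers: a matrix is a vector of row vectors ---
(define (make-matrix n fill)
  (let ((m (make-vector n)))
    (do ((i 0 (+ i 1))) ((= i n) m)
      (vector-set! m i (make-vector n fill)))))
(define (mref m i j) (vector-ref (vector-ref m i) j))
(define (mset! m i j v) (vector-set! (vector-ref m i) j v))

;; RBF Gram matrix over a grid of points, with a small jitter on the
;; diagonal so the Cholesky factorization stays numerically stable.
(define (gram xs l)
  (let* ((n (vector-length xs)) (k (make-matrix n 0.0)))
    (do ((i 0 (+ i 1))) ((= i n) k)
      (do ((j 0 (+ j 1))) ((= j n))
        (mset! k i j
               (+ (exp (- (/ (square (- (vector-ref xs i) (vector-ref xs j)))
                             (* 2.0 (square l)))))
                  (if (= i j) 1e-9 0.0)))))))

;; Lower-triangular L with L L^T = K.
(define (cholesky k n)
  (let ((l (make-matrix n 0.0)))
    (do ((i 0 (+ i 1))) ((= i n) l)
      (do ((j 0 (+ j 1))) ((> j i))
        (let ((s (do ((t 0 (+ t 1))
                      (acc 0.0 (+ acc (* (mref l i t) (mref l j t)))))
                     ((= t j) acc))))
          (mset! l i j
                 (if (= i j)
                     (sqrt (- (mref k i i) s))
                     (/ (- (mref k i j) s) (mref l j j)))))))))

;; Standard normals via Box-Muller on a simple LCG (illustration only).
(define *seed* 123456789)
(define (uniform)
  (set! *seed* (modulo (+ (* 1103515245 *seed*) 12345) 2147483648))
  (/ (+ *seed* 0.5) 2147483648.0))
(define (std-normal)
  (* (sqrt (* -2.0 (log (uniform))))
     (cos (* 2.0 3.141592653589793 (uniform)))))

;; One draw from N(0, K): f = L z with z_i ~ N(0, 1).
(define (sample-prior xs l)
  (let* ((n  (vector-length xs))
         (lo (cholesky (gram xs l) n))
         (z  (let ((v (make-vector n)))
               (do ((i 0 (+ i 1))) ((= i n) v)
                 (vector-set! v i (std-normal))))))
    (let ((f (make-vector n 0.0)))
      (do ((i 0 (+ i 1))) ((= i n) f)
        (do ((j 0 (+ j 1))) ((> j i))
          (vector-set! f i (+ (vector-ref f i)
                              (* (mref lo i j) (vector-ref z j)))))))))
```

For example, `(sample-prior (vector 0.0 0.5 1.0 1.5 2.0) 1.0)` returns a length-5 vector; neighboring entries are strongly correlated because the grid spacing is small relative to the lengthscale.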

GP posterior prediction

Given observed data (X, y), the GP posterior at a new point x* has mean k(x*, X) K^{-1} y and variance k(x*, x*) - k(x*, X) K^{-1} k(X, x*). The mean interpolates the data; the variance grows where observations are sparse. No parameters were fit: the kernel does all the work.

Scheme
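The prediction formulas can be sketched as follows (R7RS Scheme assumed; `gram`, `solve`, and the other helper names are illustrative, not from the original). Rather than forming K^{-1} explicitly, the sketch solves the linear systems K a = y and K b = k* directly.

```scheme
(import (scheme base) (scheme inexact))

(define (mref m i j) (vector-ref (vector-ref m i) j))
(define (mset! m i j v) (vector-set! (vector-ref m i) j v))

(define (rbf x y l)
  (exp (- (/ (square (- x y)) (* 2.0 (square l))))))

;; Gram matrix K[i][j] = k(x_i, x_j), plus a small diagonal jitter.
(define (gram xs l)
  (let* ((n (vector-length xs)) (k (make-vector n)))
    (do ((i 0 (+ i 1))) ((= i n) k)
      (vector-set! k i (make-vector n 0.0))
      (do ((j 0 (+ j 1))) ((= j n))
        (mset! k i j (+ (rbf (vector-ref xs i) (vector-ref xs j) l)
                        (if (= i j) 1e-9 0.0)))))))

;; Solve K x = b by Gaussian elimination; no pivoting is needed because
;; K is symmetric positive definite. Mutates its arguments.
(define (solve a b n)
  (do ((col 0 (+ col 1))) ((= col n))
    (do ((r (+ col 1) (+ r 1))) ((= r n))
      (let ((f (/ (mref a r col) (mref a col col))))
        (do ((c col (+ c 1))) ((= c n))
          (mset! a r c (- (mref a r c) (* f (mref a col c)))))
        (vector-set! b r (- (vector-ref b r) (* f (vector-ref b col)))))))
  (let ((x (make-vector n 0.0)))
    (do ((r (- n 1) (- r 1))) ((< r 0) x)
      (let ((s (do ((c (+ r 1) (+ c 1))
                    (acc 0.0 (+ acc (* (mref a r c) (vector-ref x c)))))
                   ((= c n) acc))))
        (vector-set! x r (/ (- (vector-ref b r) s) (mref a r r)))))))

(define (dot u v n)
  (do ((i 0 (+ i 1)) (acc 0.0 (+ acc (* (vector-ref u i) (vector-ref v i)))))
      ((= i n) acc)))

;; Posterior (mean . variance) at a single test input x*.
(define (posterior xs ys xstar l)
  (let* ((n  (vector-length xs))
         (ks (let ((v (make-vector n)))               ; k(x*, X)
               (do ((i 0 (+ i 1))) ((= i n) v)
                 (vector-set! v i (rbf xstar (vector-ref xs i) l)))))
         (alpha (solve (gram xs l) (vector-copy ys) n))   ; K^{-1} y
         (beta  (solve (gram xs l) (vector-copy ks) n)))  ; K^{-1} k*
    (cons (dot ks alpha n)                                ; post-mean
          (- (rbf xstar xstar l) (dot ks beta n)))))      ; post-var
```

At a training input the mean reproduces the observed y and the variance is near zero (exactly zero up to the jitter), matching the text: the mean interpolates and uncertainty collapses where there is data.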

Notation reference

Math                          Scheme                   Meaning
k(x,y)                        (rbf x y l)              Kernel (covariance function)
K                             (kernel-matrix xs l)     Gram matrix of training points
μ* = k* K^{-1} y              post-mean                Posterior mean at test point
σ*² = k** - k* K^{-1} k*^T    post-var                 Posterior variance (uncertainty)

Translation notes

A GP is nonparametric: instead of fitting a fixed number of weights, it uses the kernel to define a prior over the entire function space. Conditioning on data is Bayes' rule applied in function space. The kernel matrix encodes a similarity structure on the input space, connecting GPs to Leinster's magnitude theory, where the magnitude of a metric space measures its "effective size."
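To make the magnitude connection concrete, here is a hypothetical sketch (not from the original, R7RS Scheme assumed): following Leinster, build the similarity matrix Z[i][j] = exp(-d(x_i, x_j)), solve Z w = 1 for the weighting w, and sum its entries to get the magnitude.

```scheme
(import (scheme base) (scheme inexact))

(define (mref m i j) (vector-ref (vector-ref m i) j))
(define (mset! m i j v) (vector-set! (vector-ref m i) j v))

;; Similarity matrix of a finite metric space on the line:
;; Z[i][j] = exp(-|x_i - x_j|).
(define (similarity xs)
  (let* ((n (vector-length xs)) (z (make-vector n)))
    (do ((i 0 (+ i 1))) ((= i n) z)
      (vector-set! z i (make-vector n 0.0))
      (do ((j 0 (+ j 1))) ((= j n))
        (mset! z i j
               (exp (- (abs (- (vector-ref xs i) (vector-ref xs j))))))))))

;; Solve Z x = b by Gaussian elimination; Z is symmetric positive
;; definite for distinct real points, so no pivoting is needed.
(define (solve a b n)
  (do ((col 0 (+ col 1))) ((= col n))
    (do ((r (+ col 1) (+ r 1))) ((= r n))
      (let ((f (/ (mref a r col) (mref a col col))))
        (do ((c col (+ c 1))) ((= c n))
          (mset! a r c (- (mref a r c) (* f (mref a col c)))))
        (vector-set! b r (- (vector-ref b r) (* f (vector-ref b col)))))))
  (let ((x (make-vector n 0.0)))
    (do ((r (- n 1) (- r 1))) ((< r 0) x)
      (let ((s (do ((c (+ r 1) (+ c 1))
                    (acc 0.0 (+ acc (* (mref a r c) (vector-ref x c)))))
                   ((= c n) acc))))
        (vector-set! x r (/ (- (vector-ref b r) s) (mref a r r)))))))

;; Magnitude = sum of the weighting w, where Z w = (1 1 ... 1).
(define (magnitude xs)
  (let* ((n (vector-length xs))
         (w (solve (similarity xs) (make-vector n 1.0) n)))
    (do ((i 0 (+ i 1)) (acc 0.0 (+ acc (vector-ref w i))))
        ((= i n) acc))))
```

As the points spread apart, Z approaches the identity and the magnitude approaches n (the points become effectively distinct); as they collapse onto one point, it approaches 1.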

Neighbors
  • Probability Ch.6, expected value: the posterior mean is a conditional expectation
  • Leinster 2021, magnitude of metric spaces: the kernel matrix as a similarity structure

Ready for the real thing?

This chapter covers prediction with a fixed kernel. For hyperparameter optimization (marginal likelihood), sparse approximations, and deep GPs, see Rasmussen & Williams's Gaussian Processes for Machine Learning (free online).