A GP is a distribution over functions. Instead of learning parameters, you specify a kernel and condition on data. The posterior is also a GP -- prediction comes with uncertainty for free.
RBF kernel matrix
The kernel function defines the covariance between any two function values. For inputs x_1, ..., x_n, the kernel matrix K has K[i,j] = k(x_i, x_j). The RBF kernel k(x,y) = exp(-||x-y||² / (2l²)) produces smooth functions; the lengthscale l controls how quickly correlations decay.
import math

def rbf(x, y, l=1.0):
    return math.exp(-(x-y)**2 / (2*l**2))

def kernel_matrix(xs, l=1.0):
    return [[rbf(xi, xj, l) for xj in xs] for xi in xs]

xs = [0.0, 1.0, 2.0, 3.0]
K = kernel_matrix(xs)
print("Kernel matrix (l=1.0):")
for row in K:
    print("  [" + " ".join("{:.3f}".format(v) for v in row) + "]")
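To see the lengthscale's effect directly, here is a small self-contained sketch (it redefines rbf so it runs on its own) printing how quickly the correlation with x = 0 decays for a short and a long lengthscale:

```python
import math

def rbf(x, y, l=1.0):
    # same RBF kernel as above
    return math.exp(-(x - y)**2 / (2 * l**2))

xs = [0.0, 1.0, 2.0, 3.0]
for l in (0.5, 2.0):
    # correlation of x=0 with each grid point
    row = [rbf(0.0, x, l) for x in xs]
    print("l = {}: ".format(l) + " ".join("{:.3f}".format(v) for v in row))
```

With l = 0.5 the correlation is nearly gone by x = 2; with l = 2.0 even x = 3 remains noticeably correlated with x = 0.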
GP prior samples
A GP prior with zero mean and kernel k defines a distribution over functions. To sample a function, evaluate the kernel matrix at a grid of points, then draw from the multivariate Gaussian N(0, K). Each sample is a smooth curve whose shape depends on the kernel.
Scheme
; Sample from GP prior using Cholesky decomposition
; For simplicity, we generate one sample at 4 points
; Simple random normal (Box-Muller, seeded)
(define seed 42)
(define (next-seed s) (modulo (+ (* 1103515245 s) 12345) (expt 2 31)))
(define (uniform s) (/ (modulo s 10000) 10000.0))
(define (randn s)
  (let* ((s1 (next-seed s))
         (s2 (next-seed s1))
         (u1 (+ 0.0001 (uniform s1)))
         (u2 (uniform s2))
         (z (* (sqrt (* -2 (log u1)))
               (cos (* 2 3.14159 u2)))))
    (list z s2)))
; K = [[1.0,   0.607, 0.135, 0.011],
;      [0.607, 1.0,   0.607, 0.135],
;      [0.135, 0.607, 1.0,   0.607],
;      [0.011, 0.135, 0.607, 1.0]]
; One "sample": just multiply Cholesky factor by z
; Simplified: independent draws scaled by sqrt(eigenvalues)
(define (gp-sample s)
  (let* ((r1 (randn s))
         (r2 (randn (cadr r1)))
         (r3 (randn (cadr r2)))
         (r4 (randn (cadr r3))))
    (list (car r1) (car r2) (car r3) (car r4))))
(define sample (gp-sample seed))
(display "GP prior sample at x=[0,1,2,3]:") (newline)
(display " f = [")
(for-each (lambda (v)
            (display (/ (round (* v 100)) 100))
            (display " "))
          sample)
(display "]") (newline)
(display "(simplified: independent draws, not yet correlated -- the Python version multiplies by the Cholesky factor)")
Python
import math, random

random.seed(42)

def rbf(x, y, l=1.0):
    return math.exp(-(x-y)**2 / (2*l**2))
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
n = len(xs)
# Build kernel matrix + small diagonal for stability
K = [[rbf(xi, xj) + (1e-6 if i == j else 0)
      for j, xj in enumerate(xs)]
     for i, xi in enumerate(xs)]
# Cholesky decomposition (simple)
L = [[0.0]*n for _ in range(n)]
for i in range(n):
    for j in range(i+1):
        s = sum(L[i][k]*L[j][k] for k in range(j))
        if i == j:
            L[i][j] = math.sqrt(K[i][i] - s)
        else:
            L[i][j] = (K[i][j] - s) / L[j][j]
# Sample: f = L @ z where z ~ N(0,I)
z = [random.gauss(0,1) for _ in range(n)]
f = [sum(L[i][j]*z[j] for j in range(n)) for i in range(n)]
print("GP prior sample at x={}:".format(xs))
print(" f = [" + " ".join("{:.2f}".format(v) for v in f) + "]")
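A useful sanity check: the sample f = Lz should have covariance K. The sketch below (self-contained, with a smaller grid for speed) estimates the covariance empirically by averaging f fᵀ over many draws; the estimate should approach the kernel matrix.

```python
import math, random

random.seed(0)

def rbf(x, y, l=1.0):
    return math.exp(-(x - y)**2 / (2 * l**2))

xs = [0.0, 1.0, 2.0]
n = len(xs)
K = [[rbf(xi, xj) + (1e-6 if i == j else 0)
      for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]

# Cholesky factor, same algorithm as above
L = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1):
        s = sum(L[i][k] * L[j][k] for k in range(j))
        L[i][j] = math.sqrt(K[i][i] - s) if i == j else (K[i][j] - s) / L[j][j]

# Monte Carlo estimate of E[f f^T]
trials = 20000
C = [[0.0] * n for _ in range(n)]
for _ in range(trials):
    z = [random.gauss(0, 1) for _ in range(n)]
    f = [sum(L[i][j] * z[j] for j in range(n)) for i in range(n)]
    for i in range(n):
        for j in range(n):
            C[i][j] += f[i] * f[j] / trials

print("K[0][1] = {:.3f}, empirical = {:.3f}".format(K[0][1], C[0][1]))
```

The empirical entries land within Monte Carlo error of the kernel values, confirming that the Cholesky trick produces correlated draws.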
GP posterior prediction
Given observed data (X, y), the GP posterior at a new point x* has mean k(x*, X) K⁻¹ y and variance k(x*, x*) - k(x*, X) K⁻¹ k(X, x*). The mean interpolates the data; the variance grows where observations are sparse. No parameters were fit -- the kernel does all the work.
import math

def rbf(x, y, l=1.0):
    return math.exp(-(x-y)**2 / (2*l**2))
# Training data
x_train = [0.0, 2.0]
y_train = [1.0, -1.0]
# Kernel matrix and its inverse (2x2)
K = [[rbf(xi, xj) for xj in x_train] for xi in x_train]
det_K = K[0][0]*K[1][1] - K[0][1]*K[1][0]
K_inv = [[K[1][1]/det_K, -K[0][1]/det_K],
[-K[1][0]/det_K, K[0][0]/det_K]]
# Predict at x* = 1.0
xs = 1.0
ks = [rbf(xs, xi) for xi in x_train]
kss = rbf(xs, xs)
# Posterior mean and variance
Kinv_y = [sum(K_inv[i][j]*y_train[j] for j in range(2)) for i in range(2)]
mean = sum(ks[i]*Kinv_y[i] for i in range(2))
Kinv_ks = [sum(K_inv[i][j]*ks[j] for j in range(2)) for i in range(2)]
var = kss - sum(ks[i]*Kinv_ks[i] for i in range(2))
print("Predict at x*=1.0:")
print(" mean = {:.3f}".format(mean))
print(" var = {:.3f}".format(var))
print(" std = {:.3f}".format(math.sqrt(var)))
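Real observations usually carry noise. The standard extension replaces K with K + σₙ²I in both posterior formulas; the sketch below assumes a noise standard deviation of 0.1 (a hypothetical value chosen for illustration, not from the text):

```python
import math

def rbf(x, y, l=1.0):
    return math.exp(-(x - y)**2 / (2 * l**2))

x_train = [0.0, 2.0]
y_train = [1.0, -1.0]
noise = 0.1  # assumed observation-noise std dev

# K + sigma_n^2 I: noise variance added to the diagonal
K = [[rbf(xi, xj) + (noise**2 if i == j else 0)
      for j, xj in enumerate(x_train)] for i, xi in enumerate(x_train)]
det_K = K[0][0]*K[1][1] - K[0][1]*K[1][0]
K_inv = [[K[1][1]/det_K, -K[0][1]/det_K],
         [-K[1][0]/det_K, K[0][0]/det_K]]

xs = 1.0
ks = [rbf(xs, xi) for xi in x_train]
Kinv_y = [sum(K_inv[i][j]*y_train[j] for j in range(2)) for i in range(2)]
mean = sum(ks[i]*Kinv_y[i] for i in range(2))
Kinv_ks = [sum(K_inv[i][j]*ks[j] for j in range(2)) for i in range(2)]
var = rbf(xs, xs) - sum(ks[i]*Kinv_ks[i] for i in range(2))
print("noisy GP at x*=1.0: mean={:.3f}, var={:.3f}".format(mean, var))
```

Because y_train is antisymmetric about x* = 1.0, the posterior mean is exactly zero; the noise term slightly inflates the posterior variance relative to the noise-free case.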
Notation reference
Math
Scheme
Meaning
k(x,y)
(rbf x y l)
Kernel (covariance function)
K
(kernel-matrix xs l)
Gram matrix of training points
μ* = k* K⁻¹ y
post-mean
Posterior mean at test point
σ*² = k** - k* K⁻¹ k*ᵀ
post-var
Posterior variance (uncertainty)
Translation notes
A GP is nonparametric: instead of fitting a fixed number of weights, it uses the kernel to define a prior over the entire function space. Conditioning on data is Bayes' rule applied to function space. The kernel matrix encodes a similarity structure on the input space -- connecting GPs to Leinster's magnitude theory, where the magnitude of a metric space measures its "effective size."
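That connection can be made concrete. For a finite metric space, Leinster's magnitude is the sum of the entries of the inverse similarity matrix Z, where Z[i][j] = exp(-d(x_i, x_j)) -- the same shape of matrix as a kernel matrix. A sketch for 1-D point sets:

```python
import math

def magnitude(points):
    """Magnitude of a finite metric space of 1-D points:
    sum of all entries of Z^{-1}, where Z[i][j] = exp(-|x_i - x_j|)."""
    n = len(points)
    Z = [[math.exp(-abs(a - b)) for b in points] for a in points]
    # Gauss-Jordan inversion (fine for small n)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(Z)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    Zinv = [row[n:] for row in aug]
    return sum(sum(row) for row in Zinv)

# Two distant points count as ~2 points; two nearly coincident points as ~1
print(magnitude([0.0, 10.0]))
print(magnitude([0.0, 0.01]))
```

The "effective size" reading: widely separated points each contribute fully, while points the kernel sees as similar are partially merged.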
Neighbors
Probability Ch.6 -- expected value: the posterior mean is a conditional expectation
Leinster 2021 -- magnitude of metric spaces: the kernel matrix as a similarity structure
Ready for the real thing?
This chapter covers prediction with a fixed kernel. For hyperparameter optimization (marginal likelihood), sparse approximations, and deep GPs, see Rasmussen & Williams's Gaussian Processes for Machine Learning (free online).