
Large Language Models

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 12 of 12

Scale the transformer, pretrain on next-token prediction over the internet, then align with human feedback. In-context learning emerges: the model learns new tasks from examples in the prompt without weight updates. The LLM is a foundation model — a general-purpose function approximator that specializes via prompting.

[Diagram: large corpus → next-token loss → base model → RLHF on human preferences → aligned model; prompt context → response. Pretrain on prediction. Align with preferences. Specialize via prompting.]

Tokenization (BPE concept)

Before a model sees text, it's split into tokens. Byte Pair Encoding starts with individual characters and repeatedly merges the most frequent adjacent pair. The result: common words are single tokens, rare words are split into subwords. This gives a fixed vocabulary that can represent any text.
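The merge loop can be sketched in a few lines. This is a toy illustration of the BPE training idea, not a production tokenizer: the corpus, word frequencies, and number of merges are all invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a tuple of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(4):  # learn 4 merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
# First learned merges: ('e','r') then ('w','er') -- common suffixes become single tokens.
```

Running the loop longer turns frequent whole words into single tokens while rare words stay split into subwords, which is exactly the fixed-vocabulary property described above.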


Next-token sampling (temperature / top-k)

The model outputs a probability distribution over the vocabulary. Temperature controls sharpness: T→0 makes it greedy (always pick the most likely), T→∞ makes it uniform (random). Top-k restricts sampling to the k most probable tokens. Together they control the creativity/coherence tradeoff.
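A minimal sketch of that sampling step, assuming the model has already produced a vector of logits (the function name and toy logits are invented for illustration):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from logits with temperature scaling and top-k truncation."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if top_k is not None:
        # Mask everything outside the k most probable tokens.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # Temperature: T -> 0 approaches argmax (greedy), large T approaches uniform.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1, -1.0]
sample_next(logits, temperature=1e-6)        # near-greedy: returns 0
sample_next(logits, temperature=5.0, top_k=1)  # top-1 mask: also returns 0
```

With a high temperature and no top-k, the same call wanders across the whole vocabulary; tightening either knob trades that creativity for coherence.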


Prompt template

In-context learning works by formatting examples into the prompt. The model doesn't update its weights — it uses the pattern in the context to infer the task. This is why prompt engineering matters: the format of the examples is the program. Fitting more signal into fewer tokens is the problem context density addresses.
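A few-shot template can be built with plain string formatting. The task (sentiment classification), the labels, and the `Input:`/`Output:` markers are all assumptions chosen for the example; any consistent format works, because the format itself is what the model reads as the task specification.

```python
def few_shot_prompt(examples, query, instruction="Classify the sentiment as positive or negative."):
    """Format labeled examples plus a query into one few-shot prompt string."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The query reuses the exact same pattern, ending where the model continues.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("I loved this movie!", "positive"), ("Total waste of time.", "negative")],
    "The plot surprised me in the best way.",
)
```

The prompt ends at `Output:` so the model's most likely continuation is the label — no weight update, just pattern completion.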


Notation reference

Math              Scheme              Meaning
P(x_t | x_<t)     softmax(logits)     Next-token distribution
T (temperature)   temperature         Sharpness of sampling
top-k             k most probable     Restrict sampling to k tokens
RLHF              reward model + PPO  Align model to human preferences

Translation notes

The jump from transformer to LLM is primarily one of scale: billions of parameters, trillions of tokens, thousands of GPUs. Emergent abilities (chain-of-thought reasoning, code generation, tool use) appear at scale thresholds that we can't predict from smaller models. Whether this is true emergence or smooth scaling is debated. What's not debated: the practical capabilities are real. Whether scale alone produces general intelligence is a separate question — one that depends on whether the system has a closed consolidation loop, not just a larger forward pass.

Neighbors
  • 📡 Shannon Ch.1 — next-token prediction is minimizing surprise (cross-entropy)
  • 🧠 Lovelace Ch.8 — cognitive architectures and the question of understanding
  • 📖 SICP Ch.14 — the metacircular evaluator: programs that interpret programs, self-reference all the way down

Ready for the real thing?

Dive into Deep Learning (d2l.ai) covers everything here with runnable PyTorch/JAX code, full math, and exercises. It's free, it's excellent, and it's where these chapters point when the scaffolding comes off.