Scale the transformer, pretrain on next-token prediction over the internet, then align with human feedback. In-context learning emerges: the model learns new tasks from examples in the prompt without weight updates. The LLM is a foundation model — a general-purpose function approximator that specializes via prompting.
Tokenization (BPE concept)
Before a model sees text, it's split into tokens. Byte Pair Encoding starts with individual characters and repeatedly merges the most frequent adjacent pair. The result: common words are single tokens, rare words are split into subwords. This gives a fixed vocabulary that can represent any text.
# Simplified BPE tokenization
from collections import Counter

def count_pairs(tokens):
    return Counter(zip(tokens[:-1], tokens[1:]))

def merge_pair(tokens, pair):
    merged = pair[0] + pair[1]
    result = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            result.append(merged)
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

tokens = list("low low lower")
print("Start: ", tokens)
for step in range(3):
    pairs = count_pairs(tokens)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    tokens = merge_pair(tokens, best)
    print("Merge {}: {} -> {}".format(step+1, best, tokens))
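Training produces an ordered merge list; tokenizing new text means replaying those merges in the order they were learned. A minimal sketch of that second half (the `encode` helper and the two-merge list are illustrative, not a real tokenizer's API):

```python
# Replaying learned BPE merges on unseen text (illustrative sketch)
def encode(text, merges):
    tokens = list(text)
    for pair in merges:  # merges must be applied in the order they were learned
        merged = pair[0] + pair[1]
        result, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                result.append(merged)
                i += 2
            else:
                result.append(tokens[i])
                i += 1
        tokens = result
    return tokens

merges = [("l", "o"), ("lo", "w")]  # hypothetical learned merge list
print(encode("lowest", merges))  # -> ['low', 'e', 's', 't']
```

The common stem fuses into one token while the rare suffix stays split into characters. Real tokenizers work on bytes and store tens of thousands of merges, but the replay loop is the same idea.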
Next-token sampling (temperature / top-k)
The model outputs a probability distribution over the vocabulary. Temperature controls sharpness: T→0 makes it greedy (always pick the most likely), T→∞ makes it uniform (random). Top-k restricts sampling to the k most probable tokens. Together they control the creativity/coherence tradeoff.
import math

# Next-token sampling with temperature
def softmax_temp(logits, temperature):
    scaled = [x / temperature for x in logits]
    mx = max(scaled)
    exps = [math.exp(x - mx) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.5, 0.5, 0.3, 0.1]
for temp, label in [(1.0, "normal"), (0.3, "sharp"), (3.0, "flat")]:
    probs = softmax_temp(logits, temp)
    print("T={} ({}):".format(temp, label))
    for word, p in zip(vocab, probs):
        print("  {}: {:.3f}".format(word, p))
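The code above only varies temperature; top-k can be sketched in the same style. A minimal standalone version (the `top_k_probs` and `sample` helpers, and the probability values, are illustrative):

```python
import random

# Top-k: keep only the k most probable tokens, renormalize, then sample
def top_k_probs(probs, k):
    cutoff = sorted(probs, reverse=True)[k - 1]  # k-th largest probability
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def sample(vocab, probs):
    return random.choices(vocab, weights=probs, k=1)[0]

vocab = ["the", "cat", "sat", "on", "mat"]
probs = [0.45, 0.30, 0.12, 0.08, 0.05]  # pretend model output
pruned = top_k_probs(probs, 2)  # only "the" and "cat" survive
for word, p in zip(vocab, pruned):
    print("{}: {:.3f}".format(word, p))
print("sampled:", sample(vocab, pruned))
```

With k=2, the tail tokens get probability zero and the survivors are rescaled to sum to 1 — the long tail of barely-plausible tokens can never be sampled, which is the coherence half of the tradeoff.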
Prompt template
In-context learning works by formatting examples into the prompt. The model doesn't update its weights — it uses the pattern in the context to infer the task. This is why prompt engineering matters: the format of the examples is the program. Fitting more signal into fewer tokens is the problem context density addresses.
Scheme
; Prompt template: few-shot in-context learning
; The model sees examples, then completes the pattern
(define (make-prompt examples query)
  (let ((formatted
         (map (lambda (ex)
                (string-append "Input: " (car ex) "\nOutput: " (cdr ex)))
              examples)))
    (string-append
     (apply string-append
            (map (lambda (s) (string-append s "\n")) formatted))
     "Input: " query "\nOutput:")))
(define examples
'(("happy" . "sad")
("hot" . "cold")
("big" . "small")))
(display (make-prompt examples "fast")) (newline)
(newline)
(display "---") (newline)
(display "The model sees the pattern (antonyms) and completes:") (newline)
(display "Expected completion: slow") (newline)
(newline)
; Token count matters: more examples = better pattern recognition
; but you're limited by the context window
(display "Examples: ") (display (length examples)) (newline)
(display "Context window: ~4k to ~1M tokens")
Python
# Prompt template for few-shot in-context learning
def make_prompt(examples, query):
    formatted = [
        "Input: {}\nOutput: {}".format(inp, out)
        for inp, out in examples
    ]
    return "\n".join(formatted) + "\nInput: {}\nOutput:".format(query)
examples = [
("happy", "sad"),
("hot", "cold"),
("big", "small"),
]
print(make_prompt(examples, "fast"))
print()
print("---")
print("The model sees the pattern (antonyms) and completes:")
print("Expected completion: slow")
print()
print("Examples:", len(examples))
print("Context window: ~4k to ~1M tokens")
Notation reference
Math               Scheme               Meaning
P(x_t | x_<t)      (softmax logits)     Next-token distribution
T (temperature)    temperature          Sharpness of sampling
top-k              k most probable      Restrict sampling to k tokens
RLHF               reward model + PPO   Align model to human preferences
Translation notes
The jump from transformer to LLM is primarily one of scale: billions of parameters, trillions of tokens, thousands of GPUs. Emergent abilities (chain-of-thought reasoning, code generation, tool use) appear at scale thresholds that we can't predict from smaller models. Whether this is true emergence or smooth scaling is debated. What's not debated: the practical capabilities are real. Whether scale alone produces general intelligence is a separate question — one that depends on whether the system has a closed consolidation loop, not just a larger forward pass.
Neighbors
📡 Shannon Ch.1 — next-token prediction is minimizing surprise (cross-entropy)
🧠 Lovelace Ch.8 — cognitive architectures and the question of understanding
📖 SICP Ch.14 — the metacircular evaluator: programs that interpret programs, self-reference all the way down
Ready for the real thing?
Dive into Deep Learning (d2l.ai) covers everything here with runnable PyTorch/JAX code, full math, and exercises. It's free, it's excellent, and it's where these chapters point when the scaffolding comes off.