Sequence Models
MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 10 of 12
RNNs process sequences by maintaining hidden state. LSTMs add gates to control what to remember and forget. The hidden state is a
lossy compression of the past — the bottleneck that attention solves.
Simple RNN forward pass
At each time step, the RNN combines the current input with the previous hidden state through a linear transformation plus a nonlinearity. The same weights W are shared across all time steps. This weight sharing is what lets the network handle variable-length sequences.
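A minimal scalar sketch in Scheme, matching the `(rnn-step h-prev x)` form from the notation table; the weight values `w-hh` and `w-xh` are hypothetical, and `tanh` is built from `exp` since base Scheme does not provide it:

```scheme
;; tanh via exp, since base Scheme lacks it
(define (tanh x)
  (let ((e2x (exp (* 2 x))))
    (/ (- e2x 1) (+ e2x 1))))

;; Hypothetical scalar weights, fixed for illustration
(define w-hh 0.5)   ; hidden-to-hidden
(define w-xh 0.3)   ; input-to-hidden

;; One step: h_t = tanh(w-hh * h_{t-1} + w-xh * x_t)
(define (rnn-step h-prev x)
  (tanh (+ (* w-hh h-prev) (* w-xh x))))

;; Run over a whole sequence: the same weights are reused at every
;; step, which is what lets the loop accept any sequence length.
(define (rnn-run xs)
  (let loop ((h 0.0) (rest xs))
    (if (null? rest)
        h
        (loop (rnn-step h (car rest)) (cdr rest)))))
```

`(rnn-run '(1 0 1))` folds the whole sequence into a single scalar hidden state, which is exactly the lossy compression described above.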
LSTM gate computation
An LSTM replaces the simple RNN cell with three gates, forget (what to erase), input (what to write), and output (what to expose), plus a candidate cell update (the new values to write). The cell state acts as a conveyor belt: gradients flow through it largely unchanged, mitigating the vanishing-gradient problem that cripples vanilla RNNs on long sequences.
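A scalar sketch of the cell, again with hypothetical weights; `gate` bundles the affine form w·x + u·h + b shared by every gate, and the two returned values are the new cell state and hidden state:

```scheme
(define (sigmoid x) (/ 1.0 (+ 1.0 (exp (- x)))))
(define (tanh x)
  (let ((e2x (exp (* 2 x))))
    (/ (- e2x 1) (+ e2x 1))))

;; Affine form shared by every gate: w*x + u*h + b
(define (gate w u b x h) (+ (* w x) (* u h) b))

;; One LSTM step; the per-gate weights are hypothetical.
(define (lstm-step c-prev h-prev x)
  (let* ((f     (sigmoid (gate 0.4 0.4 0.1 x h-prev)))  ; forget: what to erase
         (i     (sigmoid (gate 0.3 0.3 0.1 x h-prev)))  ; input: what to write
         (c-hat (tanh    (gate 0.5 0.5 0.0 x h-prev)))  ; candidate values
         (o     (sigmoid (gate 0.6 0.6 0.1 x h-prev)))  ; output: what to expose
         (c     (+ (* f c-prev) (* i c-hat)))   ; conveyor-belt cell state
         (h     (* o (tanh c))))                ; exposed hidden state
    (values c h)))
```

The update `(+ (* f c-prev) (* i c-hat))` is the conveyor belt: when f is near 1 and i near 0, the cell state passes through almost unchanged, and so do its gradients.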
Sequence prediction (next character)
Given a sequence of characters, the model predicts the next one. Each character is encoded as a number, fed through the RNN, and the output is compared to the true next character. This is the simplest form of language modeling — the same objective that scales up to GPT.
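A sketch of that objective with the scalar RNN (redefined here so the block stands alone); `char->num`, the readout weight, and the squared-error loss are all illustrative stand-ins, where a real language model would use one-hot inputs and cross-entropy:

```scheme
(define (tanh x)
  (let ((e2x (exp (* 2 x))))
    (/ (- e2x 1) (+ e2x 1))))

(define (rnn-step h-prev x)               ; scalar RNN, hypothetical weights
  (tanh (+ (* 0.5 h-prev) (* 0.3 x))))

(define (char->num ch)                    ; #\a -> 0, #\b -> 1, ...
  (- (char->integer ch) (char->integer #\a)))

;; Fold the character codes of a string into a hidden state, then
;; read out a raw score through a hypothetical output weight.
(define (predict-next str)
  (let loop ((h 0.0) (cs (map char->num (string->list str))))
    (if (null? cs)
        (* 10.0 h)                        ; linear readout
        (loop (rnn-step h (car cs)) (cdr cs)))))

;; Squared error against the true next character, standing in for
;; the cross-entropy loss a real language model minimizes.
(define (next-char-loss str next-ch)
  (expt (- (predict-next str) (char->num next-ch)) 2))
```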
Notation reference
| Math | Scheme | Meaning |
|---|---|---|
| h_t = tanh(W_hh h_{t-1} + W_xh x_t) | (rnn-step h-prev x) | RNN update |
| f_t = σ(W_f x_t + U_f h_{t-1} + b_f) | (sigmoid (+ ...)) | LSTM forget gate |
| c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t | (+ (* f c-prev) (* i c-hat)) | Cell state update |
| σ(x) = 1/(1 + e^(-x)) | (sigmoid x) | Sigmoid activation |
Translation notes
The scalar RNN above is pedagogical. Real RNNs use matrix multiplications: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b), where W_hh is (hidden_dim × hidden_dim) and W_xh is (hidden_dim × input_dim). GRUs are a simpler alternative to LSTMs, with two gates instead of three.
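The matrix form can be sketched with plain lists; `matvec` is an explicit row-by-row dot product, and the weight values (hidden_dim = 2, input_dim = 1) are hypothetical:

```scheme
(define (tanh x)
  (let ((e2x (exp (* 2 x))))
    (/ (- e2x 1) (+ e2x 1))))

(define (dot v w) (apply + (map * v w)))
(define (matvec M v) (map (lambda (row) (dot row v)) M))

;; Hypothetical weights: hidden_dim = 2, input_dim = 1
(define W-hh '((0.5 0.0) (0.0 0.5)))   ; hidden_dim x hidden_dim
(define W-xh '((0.3) (0.1)))           ; hidden_dim x input_dim
(define b    '(0.0 0.0))

;; h_t = tanh(W_hh h_{t-1} + W_xh x_t + b), with elementwise tanh
(define (rnn-step h-prev x)
  (map tanh (map + (matvec W-hh h-prev) (matvec W-xh x) b)))
```

Here `rnn-step` takes and returns a length-2 hidden vector, so stacking steps over a sequence works exactly as in the scalar case.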
Neighbors
- 📖 SICP Ch.13 — streams as lazy sequences, the functional view of time
- 🧠 Lovelace Ch.7 — language models and cognition