Sequence Models

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 10 of 12

RNNs process sequences by maintaining a hidden state. LSTMs add gates to control what to remember and what to forget. The hidden state is a lossy compression of the past — the bottleneck that attention solves.

Figure: unrolled RNN. Inputs x₁, x₂, x₃ feed hidden states h₁, h₂, h₃, which produce outputs y₁, y₂, y₃; the hidden state carries information forward through time.

Simple RNN forward pass

At each time step, the RNN combines the current input with the previous hidden state through a linear transformation plus a nonlinearity. The same weights W are shared across all time steps. This weight sharing is what lets the network handle variable-length sequences.

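The step above can be sketched as a pure function, following the (rnn-step h-prev x) signature from the notation reference below. The scalar weights and the tanh helper are illustrative assumptions, not the page's original listing:

```scheme
; Scalar RNN step: h_t = tanh(w-hh * h_{t-1} + w-xh * x_t).

(define (tanh x)                      ; R7RS-small lacks tanh; build it from exp
  (let ((e2 (exp (* 2 x))))
    (/ (- e2 1) (+ e2 1))))

(define w-hh 0.5)                     ; hidden-to-hidden weight (made up)
(define w-xh 1.0)                     ; input-to-hidden weight (made up)

(define (rnn-step h-prev x)
  (tanh (+ (* w-hh h-prev) (* w-xh x))))

; Weight sharing in action: the same rnn-step folds over a sequence
; of any length, carrying the hidden state forward.
(define (rnn-forward xs h0)
  (if (null? xs)
      h0
      (rnn-forward (cdr xs) (rnn-step h0 (car xs)))))

(rnn-forward '(0.1 -0.2 0.3) 0.0)     ; final hidden state after three steps
```

Note that rnn-forward is just a left fold: the hidden state is the accumulator, which is why the same function handles sequences of any length.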

LSTM gate computation

An LSTM replaces the simple RNN cell with three gates — forget (what to erase), input (what to write), and output (what to expose) — plus a candidate cell update (the new values to consider writing). The cell state acts as a conveyor belt: gradients flow through it without vanishing, solving the problem that kills vanilla RNNs on long sequences.

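A scalar sketch of one LSTM step, following the gate equations in the notation reference below. All weight and bias values are made-up assumptions; only the structure (gates, candidate, cell update) is the point:

```scheme
; Scalar LSTM cell: c_t = f*c_{t-1} + i*c-hat, h_t = o * tanh(c_t).

(define (sigmoid x) (/ 1 (+ 1 (exp (- x)))))
(define (tanh x)
  (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))

; One input weight, one hidden weight, one bias per gate — all made up.
(define (gate wx uh b x h) (sigmoid (+ (* wx x) (* uh h) b)))

(define (lstm-step x h-prev c-prev)
  (let* ((f     (gate 0.8 0.4 0.1 x h-prev))          ; forget: what to erase
         (i     (gate 0.6 0.3 0.0 x h-prev))          ; input: what to write
         (o     (gate 0.7 0.2 0.0 x h-prev))          ; output: what to expose
         (c-hat (tanh (+ (* 0.9 x) (* 0.5 h-prev))))  ; candidate values
         (c     (+ (* f c-prev) (* i c-hat)))         ; conveyor-belt update
         (h     (* o (tanh c))))                      ; exposed hidden state
    (list h c)))                                      ; carry both states forward

(lstm-step 0.5 0.0 0.0)                               ; => (h_1 c_1)
```

The additive form of the cell update is the key design choice: because c-prev enters through a sum rather than a squashed matrix product, its gradient survives many time steps.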

Sequence prediction (next character)

Given a sequence of characters, the model predicts the next one. Each character is encoded as a number, fed through the RNN, and the output is compared to the true next character. This is the simplest form of language modeling — the same objective that scales up to GPT.

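A minimal sketch of this objective, reusing a scalar rnn-step (redefined here so the block is self-contained). The character encoding, the fixed weights, and the squared-error loss are simplifying assumptions; a real language model would use a softmax over the vocabulary:

```scheme
; Next-character objective: predict each character from the hidden
; state, then score the prediction against the true next character.

(define (tanh x) (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))
(define (rnn-step h-prev x) (tanh (+ (* 0.5 h-prev) (* 1.0 x))))

(define (encode ch)                       ; crude scalar encoding of a character
  (/ (char->integer ch) 128.0))

(define (sequence-loss str)
  (let loop ((cs (string->list str)) (h 0.0) (loss 0.0))
    (if (or (null? cs) (null? (cdr cs)))
        loss
        (let* ((x      (encode (car cs)))
               (target (encode (cadr cs)))     ; true next character
               (h-next (rnn-step h x))         ; "prediction" from hidden state
               (err    (- h-next target)))
          (loop (cdr cs) h-next (+ loss (* err err)))))))

(sequence-loss "hello")                   ; total squared error over 4 predictions
```

Training would adjust the weights to drive this loss down; the loop structure — encode, step, compare, carry the state — is the same one that scales up to GPT.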

Notation reference

Math | Scheme | Meaning
hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ) | (rnn-step h-prev x) | RNN update
fₜ = σ(W_f xₜ + U_f hₜ₋₁ + b_f) | (sigmoid (+ ...)) | LSTM forget gate
cₜ = fₜ · cₜ₋₁ + iₜ · ĉₜ | (+ (* f c-prev) (* i c-hat)) | Cell state update
σ(x) = 1 / (1 + e⁻ˣ) | (sigmoid x) | Sigmoid activation

Translation notes

The scalar RNN above is pedagogical. Real RNNs use matrix multiplications: hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ + b), where W_hh is (hidden_dim × hidden_dim) and W_xh is (hidden_dim × input_dim). GRUs are a simpler alternative to LSTMs, with two gates instead of three.
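The matrix form can be sketched with plain lists as vectors and lists of rows as matrices. The dimensions (hidden_dim = 2, input_dim = 3) and all weight values are illustrative assumptions:

```scheme
; Matrix-form RNN update: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b).

(define (tanh x) (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))

(define (dot u v)                               ; inner product of two lists
  (apply + (map * u v)))

(define (mat-vec m v)                           ; matrix (list of rows) times vector
  (map (lambda (row) (dot row v)) m))

(define w-hh '((0.5 -0.1) (0.2 0.4)))           ; hidden_dim x hidden_dim = 2x2
(define w-xh '((1.0 0.0 0.3) (0.0 1.0 -0.2)))   ; hidden_dim x input_dim  = 2x3
(define b    '(0.0 0.1))

(define (rnn-step h-prev x)                     ; elementwise tanh over the sum
  (map (lambda (hh xh bi) (tanh (+ hh xh bi)))
       (mat-vec w-hh h-prev)
       (mat-vec w-xh x)
       b))

(rnn-step '(0.0 0.0) '(0.2 -0.1 0.5))           ; next hidden state (length 2)
```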
