Sequence Models

MML (CC BY 4.0) · D2L (CC BY-SA 4.0) · 10 of 12

RNNs process sequences by maintaining a hidden state. LSTMs add gates to control what to remember and what to forget. The hidden state is a lossy compression of the past — the bottleneck that attention solves.

Figure: unrolled RNN. Inputs x₁, x₂, x₃ feed hidden states h₁, h₂, h₃, which produce outputs y₁, y₂, y₃; the hidden state carries information forward through time.

Simple RNN forward pass

At each time step, the RNN combines the current input with the previous hidden state through a linear transformation plus a nonlinearity. The same weights W are shared across all time steps. This weight sharing is what lets the network handle variable-length sequences.

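The step above can be sketched as a pure function, following the (rnn-step h-prev x) signature from the notation reference below. The scalar weights and the tanh helper are illustrative assumptions, not the page's original listing:

```scheme
; Scalar RNN step: h_t = tanh(w-hh * h_{t-1} + w-xh * x_t).

(define (tanh x)                      ; R7RS-small lacks tanh; build it from exp
  (let ((e2 (exp (* 2 x))))
    (/ (- e2 1) (+ e2 1))))

(define w-hh 0.5)                     ; hidden-to-hidden weight (made up)
(define w-xh 1.0)                     ; input-to-hidden weight (made up)

(define (rnn-step h-prev x)
  (tanh (+ (* w-hh h-prev) (* w-xh x))))

; Weight sharing in action: the same rnn-step folds over a sequence
; of any length, carrying the hidden state forward.
(define (rnn-forward xs h0)
  (if (null? xs)
      h0
      (rnn-forward (cdr xs) (rnn-step h0 (car xs)))))

(rnn-forward '(0.1 -0.2 0.3) 0.0)     ; final hidden state after three steps
```

Note that rnn-forward is just a left fold: the hidden state is the accumulator, which is why the same function handles sequences of any length.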

LSTM gate computation

An LSTM replaces the simple RNN cell with three gates — forget (what to erase), input (what to write), and output (what to expose) — plus a candidate cell update (the new values to consider writing). The cell state acts as a conveyor belt: gradients flow through it without vanishing, solving the problem that kills vanilla RNNs on long sequences.

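A scalar sketch of one LSTM step, following the gate equations in the notation reference below. All weight and bias values are made-up assumptions; only the structure (gates, candidate, cell update) is the point:

```scheme
; Scalar LSTM cell: c_t = f*c_{t-1} + i*c-hat, h_t = o * tanh(c_t).

(define (sigmoid x) (/ 1 (+ 1 (exp (- x)))))
(define (tanh x)
  (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))

; One input weight, one hidden weight, one bias per gate — all made up.
(define (gate wx uh b x h) (sigmoid (+ (* wx x) (* uh h) b)))

(define (lstm-step x h-prev c-prev)
  (let* ((f     (gate 0.8 0.4 0.1 x h-prev))          ; forget: what to erase
         (i     (gate 0.6 0.3 0.0 x h-prev))          ; input: what to write
         (o     (gate 0.7 0.2 0.0 x h-prev))          ; output: what to expose
         (c-hat (tanh (+ (* 0.9 x) (* 0.5 h-prev))))  ; candidate values
         (c     (+ (* f c-prev) (* i c-hat)))         ; conveyor-belt update
         (h     (* o (tanh c))))                      ; exposed hidden state
    (list h c)))                                      ; carry both states forward

(lstm-step 0.5 0.0 0.0)                               ; => (h_1 c_1)
```

The additive form of the cell update is the key design choice: because c-prev enters through a sum rather than a squashed matrix product, its gradient survives many time steps.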

Sequence prediction (next character)

Given a sequence of characters, the model predicts the next one. Each character is encoded as a number, fed through the RNN, and the output is compared to the true next character. This is the simplest form of language modeling — the same objective that scales up to GPT.

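A minimal sketch of this objective, reusing a scalar rnn-step (redefined here so the block is self-contained). The character encoding, the fixed weights, and the squared-error loss are simplifying assumptions; a real language model would use a softmax over the vocabulary:

```scheme
; Next-character objective: predict each character from the hidden
; state, then score the prediction against the true next character.

(define (tanh x) (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))
(define (rnn-step h-prev x) (tanh (+ (* 0.5 h-prev) (* 1.0 x))))

(define (encode ch)                       ; crude scalar encoding of a character
  (/ (char->integer ch) 128.0))

(define (sequence-loss str)
  (let loop ((cs (string->list str)) (h 0.0) (loss 0.0))
    (if (or (null? cs) (null? (cdr cs)))
        loss
        (let* ((x      (encode (car cs)))
               (target (encode (cadr cs)))     ; true next character
               (h-next (rnn-step h x))         ; "prediction" from hidden state
               (err    (- h-next target)))
          (loop (cdr cs) h-next (+ loss (* err err)))))))

(sequence-loss "hello")                   ; total squared error over 4 predictions
```

Training would adjust the weights to drive this loss down; the loop structure — encode, step, compare, carry the state — is the same one that scales up to GPT.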

Notation reference

Math | Scheme | Meaning
hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ) | (rnn-step h-prev x) | RNN update
fₜ = σ(W_f xₜ + U_f hₜ₋₁ + b_f) | (sigmoid (+ ...)) | LSTM forget gate
cₜ = fₜ · cₜ₋₁ + iₜ · ĉₜ | (+ (* f c-prev) (* i c-hat)) | Cell state update
σ(x) = 1 / (1 + e⁻ˣ) | (sigmoid x) | Sigmoid activation

Translation notes

The scalar RNN above is pedagogical. Real RNNs use matrix multiplications: hₜ = tanh(W_hh · hₜ₋₁ + W_xh · xₜ + b), where W_hh is (hidden_dim × hidden_dim) and W_xh is (hidden_dim × input_dim). GRUs are a simpler alternative to LSTMs, with two gates instead of three.
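The matrix form can be sketched with plain lists as vectors and lists of rows as matrices. The dimensions (hidden_dim = 2, input_dim = 3) and all weight values are illustrative assumptions:

```scheme
; Matrix-form RNN update: h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b).

(define (tanh x) (let ((e2 (exp (* 2 x)))) (/ (- e2 1) (+ e2 1))))

(define (dot u v)                               ; inner product of two lists
  (apply + (map * u v)))

(define (mat-vec m v)                           ; matrix (list of rows) times vector
  (map (lambda (row) (dot row v)) m))

(define w-hh '((0.5 -0.1) (0.2 0.4)))           ; hidden_dim x hidden_dim = 2x2
(define w-xh '((1.0 0.0 0.3) (0.0 1.0 -0.2)))   ; hidden_dim x input_dim  = 2x3
(define b    '(0.0 0.1))

(define (rnn-step h-prev x)                     ; elementwise tanh over the sum
  (map (lambda (hh xh bi) (tanh (+ hh xh bi)))
       (mat-vec w-hh h-prev)
       (mat-vec w-xh x)
       b))

(rnn-step '(0.0 0.0) '(0.2 -0.1 0.5))           ; next hidden state (length 2)
```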
