An agent selects actions, receives rewards, and updates a policy that maps states to actions. The core tension: exploit what you know works, or explore to find something better. Temporal difference learning bridges this by updating value estimates from the difference between consecutive predictions. Q-learning makes this model-free: the agent learns action values without knowing the environment's dynamics.
The agent-environment loop
At each time step, the agent observes a state, chooses an action, receives a reward, and transitions to a new state. The goal is to maximize cumulative reward over time. This loop is the universal interface between learner and world. When the loop closes all the way through consolidation, you get general intelligence.
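The loop itself is a few lines of code. A minimal sketch (the names `run_loop`, `env_step`, and `policy` are illustrative, not from any particular library):

Python

```python
# The agent-environment loop: observe state, choose action, receive
# reward, transition to next state, accumulate reward.
def run_loop(env_step, policy, start_state, steps):
    """env_step(state, action) -> (reward, next_state); policy(state) -> action."""
    state, total = start_state, 0.0
    for _ in range(steps):
        action = policy(state)
        reward, state = env_step(state, action)
        total += reward
    return total

# Toy environment: action 1 in state 0 pays 1 and stays put; anything else pays 0.
def env_step(state, action):
    return (1.0, 0) if action == 1 else (0.0, 0)

print(run_loop(env_step, lambda s: 1, 0, 10))  # 10 steps of reward 1 -> 10.0
```

Everything that follows — bandits, TD, Q-learning — is a different strategy for the `policy` slot of this loop.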
Scheme
; Simple bandit problem: two slot machines with unknown payoffs.
; Explore vs exploit with epsilon-greedy strategy.
; True payoff probabilities (unknown to agent)
(define true-p-a 0.3)
(define true-p-b 0.7)
; Agent's estimated values (start at 0)
(define (make-bandit) (list 0.0 0 0.0 0)) ; (q-a count-a q-b count-b)
; Simulate a pull: returns 1 with probability p, else 0
(define (pull p seed)
  (if (< (/ (modulo (* seed 7919) 10000) 10000.0) p) 1 0))
; Update running average
(define (update-q old-q count reward)
  (let ((new-count (+ count 1)))
    (+ old-q (/ (- reward old-q) new-count))))
; Run 20 rounds of epsilon-greedy (epsilon=0.3)
(define (run-bandit qa na qb nb step seed)
  (if (> step 20)
      (begin
        (display "Final Q(A)=") (display qa)
        (display " Q(B)=") (display qb))
      (let* ((explore? (< (modulo (* seed 13) 100) 30))
             (choose-a (if explore?
                           (< (modulo (* seed 17) 100) 50)
                           (> qa qb)))
             (reward (if choose-a
                         (pull true-p-a (+ seed step))
                         (pull true-p-b (+ seed step 1000)))))
        (if choose-a
            (run-bandit (update-q qa na reward) (+ na 1) qb nb (+ step 1) (+ seed 37))
            (run-bandit qa na (update-q qb nb reward) (+ nb 1) (+ step 1) (+ seed 37))))))
(run-bandit 0.0 0 0.0 0 1 42)
; Q(B) should converge higher than Q(A)
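The same bandit translates directly to Python with a real random source; the round count (1000) and seed are arbitrary choices for this sketch, not from the Scheme version:

Python

```python
# Epsilon-greedy bandit: two arms with true payoff probabilities 0.3 and 0.7.
import random
random.seed(42)

true_p = [0.3, 0.7]
q = [0.0, 0.0]      # estimated values
n = [0, 0]          # pull counts
epsilon = 0.3

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(2)         # explore: random arm
    else:
        arm = 0 if q[0] > q[1] else 1     # exploit: current best estimate
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]  # incremental running average

print("Q(A)=%.3f Q(B)=%.3f" % (q[0], q[1]))
# Q(B) should settle near 0.7, well above Q(A)
```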
Instead of waiting for the final outcome, TD learning updates value estimates after each step using the difference between consecutive predictions. The update rule: V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s)). The term in parentheses is the TD error: the gap between what you predicted and what you actually got (the immediate reward plus what you now predict from the next state).
Scheme
; TD learning: a 5-state random walk
; States: 0 (left terminal) 1 2 3 4 5 (right terminal)
; Reward: +1 at state 5, 0 elsewhere. Discount = 1.
; True values: V(s) = s/5
(define alpha 0.1)
; Initialize values to 0.5 (except terminals)
(define v (list 0.0 0.5 0.5 0.5 0.5 0.0))
(define (get-v state) (list-ref v state))
(define (set-v state new-val)
  (set! v (let loop ((i 0) (acc '()))
            (if (= i 6) (reverse acc)
                (loop (+ i 1) (cons (if (= i state) new-val (list-ref v i)) acc))))))
; One episode: start at state 3, walk randomly
(define (td-step state seed)
  (if (or (= state 0) (= state 5))
      (begin (display "Values: ") (display v))
      (let* ((go-right (< (modulo (* seed 7919) 100) 50))
             (next (if go-right (+ state 1) (- state 1)))
             (reward (if (= next 5) 1.0 0.0))
             (td-error (- (+ reward (get-v next)) (get-v state)))
             (new-v (+ (get-v state) (* alpha td-error))))
        (set-v state new-v)
        (td-step next (+ seed 13)))))
(td-step 3 42)
; Values should start approaching [0, 0.2, 0.4, 0.6, 0.8, 0]
Python
# TD learning: 5-state random walk
import random
random.seed(42)
alpha = 0.1
v = [0.0, 0.5, 0.5, 0.5, 0.5, 0.0]
for episode in range(100):
    state = 3
    while 0 < state < 5:
        next_state = state + (1 if random.random() < 0.5 else -1)
        reward = 1.0 if next_state == 5 else 0.0
        v[state] += alpha * (reward + v[next_state] - v[state])
        state = next_state
print("Values: " + str([round(x, 3) for x in v]))
# Should approach [0, 0.2, 0.4, 0.6, 0.8, 0]
Q-learning
Q-learning extends TD to action selection. Q(s, a) estimates the value of taking action a in state s. The update: Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a)). The max makes it off-policy: it learns the optimal policy even while exploring suboptimally.
Scheme
; Q-learning concept: the update rule
; Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max Q(s',a') - Q(s,a))
(define alpha 0.1)
(define gamma 0.9)
; A 2-state, 2-action MDP
; State 0: action 'stay' -> reward 1, stay. action 'go' -> reward 0, move to state 1.
; State 1: action 'stay' -> reward 2, stay. action 'go' -> reward 0, move to state 0.
; Q-table: ((q00 q01) (q10 q11)) = ((stay go) (stay go))
(define q '((0.0 0.0) (0.0 0.0)))
(define (get-q s a) (list-ref (list-ref q s) a))
(define (max-q s) (max (get-q s 0) (get-q s 1)))
(define (update-q s a reward next-s)
  (let* ((old (get-q s a))
         (td-error (- (+ reward (* gamma (max-q next-s))) old))
         (new-val (+ old (* alpha td-error))))
    (display "Q(") (display s) (display ",") (display a)
    (display ") = ") (display new-val) (newline)
    new-val))
; Simulate a few updates
(display "After action 'stay' in state 1 (reward=2):") (newline)
(update-q 1 0 2.0 1) (newline)
(display "After action 'go' in state 0 (reward=0, arrive at state 1):") (newline)
(update-q 0 1 0.0 1)
; Q(0, go) picks up value from state 1's potential
Python
# Q-learning: 2-state, 2-action MDP
alpha, gamma = 0.1, 0.9
q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]

def update_q(s, a, reward, next_s):
    old = q[s][a]
    td_error = reward + gamma * max(q[next_s]) - old
    q[s][a] = old + alpha * td_error
    print("Q(" + str(s) + "," + str(a) + ") = " + "{:.4f}".format(q[s][a]))
print("After action 'stay' in state 1 (reward=2):")
update_q(1, 0, 2.0, 1)
print()
print("After action 'go' in state 0 (reward=0, arrive at state 1):")
update_q(0, 1, 0.0, 1)
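Two updates only hint at where the values are headed. Running the same 2-state MDP to convergence makes the off-policy property concrete: even with random actions mixed in, Q approaches the optimal values. The optimal policy here is 'stay' in state 1 forever, so Q*(1, stay) = 2 / (1 - 0.9) = 20 and Q*(0, go) = 0 + 0.9 * 20 = 18. A sketch, with the step count, seed, and epsilon chosen arbitrarily (the `model` dict encodes the transitions described in the comments above):

Python

```python
# Q-learning to convergence on the 2-state, 2-action MDP.
import random
random.seed(0)

alpha, gamma, epsilon = 0.1, 0.9, 0.2
q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]

# (reward, next_state) for each (state, action); action 0 = 'stay', 1 = 'go'.
model = {(0, 0): (1.0, 0), (0, 1): (0.0, 1),
         (1, 0): (2.0, 1), (1, 1): (0.0, 0)}

state = 0
for _ in range(20000):
    if random.random() < epsilon:
        action = random.randrange(2)                      # explore
    else:
        action = 0 if q[state][0] >= q[state][1] else 1   # greedy
    reward, next_state = model[(state, action)]
    q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
    state = next_state

print([[round(x, 1) for x in row] for row in q])
# Q(1, stay) should approach 20, Q(0, go) should approach 18
```

Note that the behavior policy spends most of its time in state 1, yet Q(0, go) still converges: the max in the update backs up state 1's value regardless of which action the agent actually takes next.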
Notation reference
| Symbol | Meaning |
| --- | --- |
| V(s) | Value of state s (expected cumulative reward) |
| Q(s, a) | Value of taking action a in state s |
| alpha | Learning rate |
| gamma | Discount factor (how much future rewards are worth) |
| epsilon | Exploration rate (probability of random action) |
| TD error | reward + gamma * V(s') - V(s) |
Neighbors
Lovelace Ch.6 — decision making and utility theory
Hedges 2018 — open games formalize the same agent-environment structure categorically
Control Ch.8 — optimal control is reinforcement learning with a known model
Algorithms Ch.9 — dynamic programming is the mathematical backbone of RL
Soar — reinforcement learning formalized in a cognitive architecture
Diagnosis of Soar — where the RL cell fits and which cells remain empty
Translation notes
The Lovelace textbook connects RL to dopamine reward prediction errors in the brain, which is one of computational cognitive science's strongest success stories. This page focuses on the algorithms: bandits, TD learning, and Q-learning. The textbook also covers policy gradient methods and model-based RL, which connect to the planning and decision-making material in Chapter 6.