An agent selects actions, receives rewards, and updates a policy that maps states to actions. The core tension: exploit what you know works, or explore to find something better. Temporal difference learning bridges this by updating value estimates from the difference between consecutive predictions. Q-learning makes this model-free: the agent learns action values without knowing the environment's dynamics.
The agent-environment loop
At each time step, the agent observes a state, chooses an action, receives a reward, and transitions to a new state. The goal is to maximize cumulative reward over time. This loop is the universal interface between learner and world. When the loop closes all the way through consolidation, you get general intelligence.
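The loop itself is a few lines of code. A minimal sketch (the names `run_loop`, `env_step`, and `policy` are illustrative, not from any particular library):

Python

```python
# The agent-environment loop: observe state, choose action, receive
# reward, transition to next state, accumulate reward.
def run_loop(env_step, policy, start_state, steps):
    """env_step(state, action) -> (reward, next_state); policy(state) -> action."""
    state, total = start_state, 0.0
    for _ in range(steps):
        action = policy(state)
        reward, state = env_step(state, action)
        total += reward
    return total

# Toy environment: action 1 in state 0 pays 1 and stays put; anything else pays 0.
def env_step(state, action):
    return (1.0, 0) if action == 1 else (0.0, 0)

print(run_loop(env_step, lambda s: 1, 0, 10))  # 10 steps of reward 1 -> 10.0
```

Everything that follows — bandits, TD, Q-learning — is a different strategy for the `policy` slot of this loop.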
Scheme
; Simple bandit problem: two slot machines with unknown payoffs.
; Explore vs exploit with epsilon-greedy strategy.
; True payoff probabilities (unknown to agent)
(define true-p-a 0.3)
(define true-p-b 0.7)
; Agent's estimated values (start at 0)
(define (make-bandit) (list 0.0 0 0.0 0)) ; (q-a count-a q-b count-b)
; Simulate a pull: returns 1 with probability p, else 0
(define (pull p seed)
  (if (< (/ (modulo (* seed 7919) 10000) 10000.0) p) 1 0))
; Update running average
(define (update-q old-q count reward)
  (let ((new-count (+ count 1)))
    (+ old-q (/ (- reward old-q) new-count))))
; Run 20 rounds of epsilon-greedy (epsilon=0.3)
(define (run-bandit qa na qb nb step seed)
  (if (> step 20)
      (begin
        (display "Final Q(A)=") (display qa)
        (display " Q(B)=") (display qb))
      (let* ((explore? (< (modulo (* seed 13) 100) 30))
             (choose-a (if explore?
                           (< (modulo (* seed 17) 100) 50)
                           (> qa qb)))
             (reward (if choose-a
                         (pull true-p-a (+ seed step))
                         (pull true-p-b (+ seed step 1000)))))
        (if choose-a
            (run-bandit (update-q qa na reward) (+ na 1) qb nb (+ step 1) (+ seed 37))
            (run-bandit qa na (update-q qb nb reward) (+ nb 1) (+ step 1) (+ seed 37))))))
(run-bandit 0.0 0 0.0 0 1 42)
; Q(B) should converge higher than Q(A)
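The same bandit translates directly to Python with a real random source; the round count (1000) and seed are arbitrary choices for this sketch, not from the Scheme version:

Python

```python
# Epsilon-greedy bandit: two arms with true payoff probabilities 0.3 and 0.7.
import random
random.seed(42)

true_p = [0.3, 0.7]
q = [0.0, 0.0]      # estimated values
n = [0, 0]          # pull counts
epsilon = 0.3

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(2)         # explore: random arm
    else:
        arm = 0 if q[0] > q[1] else 1     # exploit: current best estimate
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]  # incremental running average

print("Q(A)=%.3f Q(B)=%.3f" % (q[0], q[1]))
# Q(B) should settle near 0.7, well above Q(A)
```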
Instead of waiting for the final outcome, TD learning updates value estimates after each step using the difference between consecutive predictions. The update rule: V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s)). The term in parentheses is the TD error: the gap between what you predicted and what you actually got (the immediate reward plus what you now predict from the next state).
Scheme
; TD learning: a 5-state random walk
; States: 0 (left terminal) 1 2 3 4 5 (right terminal)
; Reward: +1 at state 5, 0 elsewhere. Discount = 1.
; True values: V(s) = s/5
(define alpha 0.1)
; Initialize values to 0.5 (except terminals)
(define v (list 0.0 0.5 0.5 0.5 0.5 0.0))
(define (get-v state) (list-ref v state))
(define (set-v state new-val)
  (set! v (let loop ((i 0) (acc '()))
            (if (= i 6) (reverse acc)
                (loop (+ i 1) (cons (if (= i state) new-val (list-ref v i)) acc))))))
; One episode: start at state 3, walk randomly
(define (td-step state seed)
  (if (or (= state 0) (= state 5))
      (begin (display "Values: ") (display v))
      (let* ((go-right (< (modulo (* seed 7919) 100) 50))
             (next (if go-right (+ state 1) (- state 1)))
             (reward (if (= next 5) 1.0 0.0))
             (td-error (- (+ reward (get-v next)) (get-v state)))
             (new-v (+ (get-v state) (* alpha td-error))))
        (set-v state new-v)
        (td-step next (+ seed 13)))))
(td-step 3 42)
; Values should start approaching [0, 0.2, 0.4, 0.6, 0.8, 0]
Python
# TD learning: 5-state random walk
import random
random.seed(42)
alpha = 0.1
v = [0.0, 0.5, 0.5, 0.5, 0.5, 0.0]
for episode in range(100):
    state = 3
    while 0 < state < 5:
        next_state = state + (1 if random.random() < 0.5 else -1)
        reward = 1.0 if next_state == 5 else 0.0
        v[state] += alpha * (reward + v[next_state] - v[state])
        state = next_state
print("Values: " + str([round(x, 3) for x in v]))
# Should approach [0, 0.2, 0.4, 0.6, 0.8, 0]
Q-learning
Q-learning extends TD to action selection. Q(s, a) estimates the value of taking action a in state s. The update: Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a)). The max makes it off-policy: it learns the optimal policy even while exploring suboptimally.
Scheme
; Q-learning concept: the update rule
; Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max Q(s',a') - Q(s,a))
(define alpha 0.1)
(define gamma 0.9)
; A 2-state, 2-action MDP
; State 0: action 'stay' -> reward 1, stay. action 'go' -> reward 0, move to state 1.
; State 1: action 'stay' -> reward 2, stay. action 'go' -> reward 0, move to state 0.
; Q-table: ((q00 q01) (q10 q11)) = ((stay go) (stay go))
(define q '((0.0 0.0) (0.0 0.0)))
(define (get-q s a) (list-ref (list-ref q s) a))
(define (max-q s) (max (get-q s 0) (get-q s 1)))
(define (update-q s a reward next-s)
  (let* ((old (get-q s a))
         (td-error (- (+ reward (* gamma (max-q next-s))) old))
         (new-val (+ old (* alpha td-error))))
    (display "Q(") (display s) (display ",") (display a)
    (display ") = ") (display new-val) (newline)
    new-val))
; Simulate a few updates
(display "After action 'stay' in state 1 (reward=2):") (newline)
(update-q 1 0 2.0 1) (newline)
(display "After action 'go' in state 0 (reward=0, arrive at state 1):") (newline)
(update-q 0 1 0.0 1)
; Q(0, go) picks up value from state 1's potential
Python
# Q-learning: 2-state, 2-action MDP
alpha, gamma = 0.1, 0.9
q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]

def update_q(s, a, reward, next_s):
    old = q[s][a]
    td_error = reward + gamma * max(q[next_s]) - old
    q[s][a] = old + alpha * td_error
    print("Q(" + str(s) + "," + str(a) + ") = " + "{:.4f}".format(q[s][a]))
print("After action 'stay' in state 1 (reward=2):")
update_q(1, 0, 2.0, 1)
print()
print("After action 'go' in state 0 (reward=0, arrive at state 1):")
update_q(0, 1, 0.0, 1)
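Two updates only hint at where the values are headed. Running the same 2-state MDP to convergence makes the off-policy property concrete: even with random actions mixed in, Q approaches the optimal values. The optimal policy here is 'stay' in state 1 forever, so Q*(1, stay) = 2 / (1 - 0.9) = 20 and Q*(0, go) = 0 + 0.9 * 20 = 18. A sketch, with the step count, seed, and epsilon chosen arbitrarily (the `model` dict encodes the transitions described in the comments above):

Python

```python
# Q-learning to convergence on the 2-state, 2-action MDP.
import random
random.seed(0)

alpha, gamma, epsilon = 0.1, 0.9, 0.2
q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]

# (reward, next_state) for each (state, action); action 0 = 'stay', 1 = 'go'.
model = {(0, 0): (1.0, 0), (0, 1): (0.0, 1),
         (1, 0): (2.0, 1), (1, 1): (0.0, 0)}

state = 0
for _ in range(20000):
    if random.random() < epsilon:
        action = random.randrange(2)                      # explore
    else:
        action = 0 if q[state][0] >= q[state][1] else 1   # greedy
    reward, next_state = model[(state, action)]
    q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
    state = next_state

print([[round(x, 1) for x in row] for row in q])
# Q(1, stay) should approach 20, Q(0, go) should approach 18
```

Note that the behavior policy spends most of its time in state 1, yet Q(0, go) still converges: the max in the update backs up state 1's value regardless of which action the agent actually takes next.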
Notation reference
| Symbol | Meaning |
| --- | --- |
| V(s) | Value of state s (expected cumulative reward) |
| Q(s, a) | Value of taking action a in state s |
| alpha | Learning rate |
| gamma | Discount factor (how much future rewards are worth) |
| epsilon | Exploration rate (probability of random action) |
| TD error | reward + gamma * V(s') - V(s) |
Neighbors
Lovelace Ch.6 — decision making and utility theory
Hedges 2018 — open games formalize the same agent-environment structure categorically
Control Ch.8 — optimal control is reinforcement learning with a known model
Algorithms Ch.9 — dynamic programming is the mathematical backbone of RL
Soar — reinforcement learning formalized in a cognitive architecture
Diagnosis of Soar — where the RL cell fits and which cells remain empty
Translation notes
The Lovelace textbook connects RL to dopamine reward prediction errors in the brain, which is one of computational cognitive science's strongest success stories. This page focuses on the algorithms: bandits, TD learning, and Q-learning. The textbook also covers policy gradient methods and model-based RL, which connect to the planning and decision-making material in Chapter 6.