
Reinforcement Learning

Lovelace textbook · CC BY-SA 4.0 · computationalcognitivescience.github.io/lovelace/home

An agent selects actions, receives rewards, and updates a policy that maps states to actions. The core tension: exploit what you know works, or explore to find something better. Temporal difference learning avoids waiting for final outcomes by updating value estimates from the difference between consecutive predictions. Q-learning makes this model-free: the agent learns action values without knowing the environment's dynamics.

The agent-environment loop

At each time step, the agent observes a state, chooses an action, receives a reward, and transitions to a new state. The goal is to maximize cumulative reward over time. This loop is the universal interface between learner and world.
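The loop above can be sketched in a few lines. This is a minimal illustration, not code from the textbook: the two-state environment and its `step` function are hypothetical, standing in for any environment that returns a next state, a reward, and a done flag.

```python
import random

# Hypothetical toy environment: two states (0, 1), two actions (0, 1).
# step() returns (next_state, reward, done); the dynamics are illustrative.
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0, False   # moving to state 1 yields a reward
    if state == 1 and action == 1:
        return 1, 0.0, True    # episode terminates
    return 0, 0.0, False       # otherwise, back to state 0, no reward

# The agent-environment loop: observe, act, receive reward, transition.
state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([0, 1])        # placeholder policy: act randomly
    state, reward, done = step(state, action)
    total_reward += reward
```

A real agent would replace the random choice with a learned policy; everything else in the loop stays the same.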

[Diagram: the agent-environment loop — the agent sends an action; the environment returns a state and a reward]

Temporal difference learning

Instead of waiting for the final outcome, TD learning updates value estimates after each step using the difference between consecutive predictions. The update rule: V(s) = V(s) + alpha * (reward + gamma * V(s') - V(s)). The term in parentheses is the TD error: the gap between the old prediction V(s) and the new target, reward + gamma * V(s').
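The update rule can be written directly as code. A minimal sketch, assuming a small set of integer states and illustrative values for alpha and gamma:

```python
# TD(0) value update; alpha and gamma values are illustrative.
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0}   # value estimates, initialized to zero

def td_update(V, s, reward, s_next):
    # TD error: target (reward + gamma * V(s')) minus old prediction V(s)
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error    # nudge the estimate toward the target
    return td_error

# One transition: state 0 -> state 1 with reward 1.0.
err = td_update(V, 0, 1.0, 1)
# err == 1.0 (fully surprising), V[0] == 0.1 (one step toward the target)
```

Repeated over many transitions, these small corrections propagate reward information backward through the state space.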


Q-learning

Q-learning extends TD to action selection. Q(s, a) estimates the value of taking action a in state s. The update: Q(s, a) = Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a)). The max makes it off-policy: it learns the optimal policy even while exploring suboptimally.
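A sketch of the Q-learning update, paired with epsilon-greedy action selection (the standard way to balance the explore/exploit tension from the introduction). The hyperparameter values and the state/action encoding are illustrative, not from the textbook:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.9, 0.1   # illustrative hyperparameters
actions = [0, 1]
Q = defaultdict(float)                  # Q[(state, action)], defaults to 0.0

def q_update(s, a, reward, s_next):
    # Off-policy target: max over next actions, regardless of what we'll do
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def epsilon_greedy(s):
    if random.random() < epsilon:                       # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])        # exploit

# One update: took action 1 in state 0, got reward 1.0, landed in state 1.
q_update(0, 1, 1.0, 1)
# Q[(0, 1)] == 0.5: halfway (alpha = 0.5) from 0 toward the target of 1.0
```

The max in the target is what makes this off-policy: even though epsilon_greedy sometimes acts randomly, the update always bootstraps from the best next action.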


Notation reference

Symbol     Meaning
V(s)       Value of state s (expected cumulative reward)
Q(s, a)    Value of taking action a in state s
alpha      Learning rate
gamma      Discount factor (how much future rewards are worth)
epsilon    Exploration rate (probability of random action)
TD error   reward + gamma * V(s') - V(s)

Translation notes

The Lovelace textbook connects RL to dopamine reward prediction errors in the brain, which is one of computational cognitive science's strongest success stories. This page focuses on the algorithms: the agent-environment loop, TD learning, and Q-learning. The textbook also covers policy gradient methods and model-based RL, which connect to the planning and decision-making material in Chapter 6.

Read the original: Lovelace, Chapter 5.