Neural Networks
Lovelace textbook · CC BY-SA 4.0 · computationalcognitivescience.github.io/lovelace/home
A neural network is a function built from layers of simple units. Each unit computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function. Stacking layers lets the network learn hierarchical representations. Backpropagation adjusts the weights by propagating error gradients backward through the layers. With enough hidden units, even a single hidden layer can approximate any continuous function on a compact domain.
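The forward computation can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the layer sizes (3-4-2) and ReLU activation are arbitrary choices for the example.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z) elementwise."""
    return np.maximum(0.0, z)

def forward(x, layers):
    """Forward pass: each layer is a (weights, bias) pair.
    Each layer computes activation(W @ a + b) on the previous output."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)
    return a

rng = np.random.default_rng(0)
# A 3-4-2 network with random weights (sizes chosen for illustration).
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(np.array([1.0, -0.5, 2.0]), layers)
print(y.shape)  # (2,)
```

Each layer is just the weighted-sum-plus-bias-then-nonlinearity described above, applied to the previous layer's output.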
The perceptron
The simplest neural network: one unit with n inputs, n weights, and one bias. Output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias). A single perceptron can learn any linearly separable function (AND, OR) but not XOR, because no single line separates XOR's positive and negative cases. This limitation motivated multilayer networks.
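The classic perceptron learning rule makes this concrete: nudge each weight by lr * (target - output) * input after every example. A minimal sketch, trained on AND (the learning rate and epoch count are arbitrary choices):

```python
import numpy as np

def step(z):
    """Threshold activation: fire (1) if the weighted sum is positive."""
    return 1 if z > 0 else 0

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: w += lr * (target - output) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            out = step(w @ x + b)
            w += lr * (t - out) * x
            b += lr * (t - out)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y_and)
print([step(w @ x + b) for x in X])  # [0, 0, 0, 1]
```

Swap in XOR targets ([0, 1, 1, 0]) and the same loop never converges, no matter the learning rate: that is the linear-separability limit.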
Backpropagation
Backpropagation computes the gradient of the loss with respect to each weight by applying the chain rule layer by layer, from output back to input. Each weight is then nudged in the direction that reduces the loss. The learning rate controls the step size. This is gradient descent applied to a compositional function. The same backward-pass structure appears in cognitive architectures as
consolidation: a process that reads from experience and writes parameter changes back to the substrate.
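The chain-rule computation can be verified directly: compute the analytic gradient with a backward pass, then compare it against a finite-difference estimate. This sketch uses a 3-4-1 sigmoid network with squared-error loss; the sizes and the specific weight being checked are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, t, W1, b1, W2, b2):
    """Forward pass, then backprop: the chain rule applied layer by layer."""
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)                     # hidden activations
    z2 = W2 @ a1 + b2
    y = sigmoid(z2)                      # network output
    L = 0.5 * np.sum((y - t) ** 2)       # squared-error loss
    # Backward pass: delta = dL/dz at each layer, propagated inward.
    dz2 = (y - t) * y * (1 - y)          # output-layer delta
    dW2 = np.outer(dz2, a1)
    db2 = dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)   # hidden-layer delta via chain rule
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return L, dW1, db1, dW2, db2

rng = np.random.default_rng(1)
x, t = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

L, dW1, *_ = loss_and_grads(x, t, W1, b1, W2, b2)

# Numerical check: perturb one weight and compare the slope of the loss.
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Lp, *_ = loss_and_grads(x, t, Wp, b1, W2, b2)
print(abs((Lp - L) / eps - dW1[0, 0]) < 1e-4)  # True
```

Gradient descent is then just `W1 -= lr * dW1` (and likewise for the other parameters), with `lr` controlling the step size as described above.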
Universal approximation
A feedforward network with a single hidden layer and enough units can approximate any continuous function on a compact domain to arbitrary precision. This is the universal approximation theorem. It guarantees expressiveness but says nothing about learnability: finding the right weights is a separate problem, and that is where depth, architecture, and training dynamics matter.
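The expressiveness/learnability split can be illustrated without training the hidden layer at all: fix random hidden weights, and solve only the output layer by least squares. This is a random-features sketch, not the theorem's construction; the unit count, weight range, and target function sin(x) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 200
# Random hidden layer, never trained: tanh(w*x + b) per unit.
w = rng.uniform(-3, 3, size=n_hidden)
b = rng.uniform(-3, 3, size=n_hidden)

x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# Hidden features: one column per unit, evaluated at every input point.
H = np.tanh(np.outer(x, w) + b)
# Fit only the output-layer weights by least squares.
c, *_ = np.linalg.lstsq(H, target, rcond=None)
approx = H @ c
print(np.max(np.abs(approx - target)))  # small worst-case error
```

A wide-enough bank of shifted, scaled tanh units spans the target closely (expressiveness); finding hidden weights by gradient descent rather than fixing them randomly is the separate, harder problem (learnability).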
Notation reference
| Symbol | Meaning |
|---|---|
| w, b | Weights and biases |
| sigma(z) | Activation function (e.g., sigmoid, ReLU) |
| dL/dw | Gradient of loss with respect to weight |
| lr | Learning rate (step size for gradient descent) |
| UAT | Universal approximation theorem |
Neighbors
- Lovelace Ch.5 — reinforcement learning uses neural networks as function approximators
- Lovelace Ch.8 — cognitive architectures that integrate neural-network-style learning
- 🍞 Capucci 2021 — backpropagation as lens composition: the categorical structure of gradient descent
- ∫ Calculus Ch.5 Chain Rule — the chain rule that makes backpropagation work
Translation notes
The Lovelace textbook covers the history of connectionism, from Rosenblatt's perceptron through the PDP group to modern deep learning. This page focuses on the computational mechanics: what a network computes, how backpropagation trains it, and why universal approximation matters. The textbook also discusses representation learning and distributed representations, which are central to the connectionist program in cognitive science.