Language is a noisy channel between minds. The speaker encodes a meaning, the channel (speech, text) introduces noise, and the listener decodes. Surprisal measures how unexpected a word is: high-surprisal words slow reading. Pragmatic inference goes beyond literal meaning: listeners reason about what the speaker chose to say and what they could have said instead.
The communication channel
Shannon's model: a sender encodes a message, transmits it through a noisy channel, and a receiver decodes it. The fundamental theorem of information theory says that reliable communication is possible at any rate below the channel capacity. Language production and comprehension fit this framework: the speaker compresses meaning into words, the listener decompresses. The same perceive-to-attend pipeline appears in non-linguistic cognition, where a salience layer filters the stream before attention allocates resources.
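The capacity claim can be made concrete with the simplest textbook case, the binary symmetric channel, which flips each bit with probability eps; its capacity is 1 - H(eps) bits per use. This sketch (including the example flip probabilities) is an illustration added here, not part of Shannon's general statement.

```python
# Capacity of a binary symmetric channel (illustrative sketch).
# The flip probabilities below are made-up example values.
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps):
    """C = 1 - H(eps) bits per channel use."""
    return 1 - binary_entropy(eps)

for eps in [0.0, 0.1, 0.5]:
    print(f"flip prob {eps}: capacity = {bsc_capacity(eps):.3f} bits/use")
```

A noiseless channel (eps = 0) carries one full bit per use; a coin-flip channel (eps = 0.5) carries nothing, because the output is independent of the input.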
Scheme
; Information content and surprisal
; Surprisal = -log2(P(word))
; High surprisal = unexpected word = more information
(define (log2 x) (/ (log x) (log 2)))
(define (surprisal p) (- (log2 p)))
; Word probabilities in context "The cat sat on the ___"
(define p-mat 0.15)
(define p-floor 0.10)
(define p-elephant 0.001)
(define p-the 0.30)
(display "Surprisal of 'mat': ")
(display (surprisal p-mat)) (display " bits") (newline)
(display "Surprisal of 'floor': ")
(display (surprisal p-floor)) (display " bits") (newline)
(display "Surprisal of 'elephant': ")
(display (surprisal p-elephant)) (display " bits") (newline)
(display "Surprisal of 'the': ")
(display (surprisal p-the)) (display " bits") (newline)
(newline)
(display "Higher surprisal = slower reading time (empirically)")
Python
# Surprisal in Python
import math

def surprisal(p):
    return -math.log2(p)

words = {"mat": 0.15, "floor": 0.10, "elephant": 0.001, "the": 0.30}
for word, p in words.items():
    print(f"Surprisal('{word}'): {surprisal(p):.2f} bits")
Probabilistic language models
A language model assigns probabilities to sequences of words. An n-gram model conditions on the previous n-1 words. Better models compress language more efficiently. The cross-entropy between the model's predictions and the true distribution measures how well the model captures the statistical structure of language.
# Bigram language model
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i in range(len(sent) - 1):
        counts[sent[i]][sent[i + 1]] += 1

def bigram_prob(w1, w2):
    total = sum(counts[w1].values())
    if total == 0:
        return 0.0
    return counts[w1][w2] / total

for w1, w2 in [("the", "cat"), ("the", "dog"), ("cat", "sat"), ("cat", "ran")]:
    print(f"P({w2} | {w1}) = {bigram_prob(w1, w2):.4f}")
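The cross-entropy claim can be checked numerically: a model's average surprisal on held-out text is its cross-entropy, and text the model predicts well scores lower. The sketch below repeats the toy corpus and adds add-one smoothing so unseen bigrams do not get zero probability; the smoothing choice and test sentences are illustrative assumptions, not part of the text.

```python
# Cross-entropy of a smoothed bigram model on held-out sentences (sketch).
# Add-one smoothing and the test sentences are illustrative assumptions.
import math
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
vocab = {w for sent in corpus for w in sent}

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        counts[w1][w2] += 1

def smoothed_prob(w1, w2):
    # Add-one smoothing: every bigram gets a pseudo-count of 1
    total = sum(counts[w1].values())
    return (counts[w1][w2] + 1) / (total + len(vocab))

def cross_entropy(sentence):
    """Average surprisal in bits per word transition under the model."""
    logps = [math.log2(smoothed_prob(w1, w2))
             for w1, w2 in zip(sentence, sentence[1:])]
    return -sum(logps) / len(logps)

print(f"cross-entropy of 'the cat sat': {cross_entropy(['the', 'cat', 'sat']):.3f} bits")
print(f"cross-entropy of 'the dog ran': {cross_entropy(['the', 'dog', 'ran']):.3f} bits")
```

The sentence seen in training scores lower than the novel one, which is exactly the "better models compress more efficiently" point in the paragraph above.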
Pragmatic inference
Literal meaning is just the starting point. A pragmatic listener reasons about the speaker's choice: if the speaker said "some students passed," they probably mean "not all," because a cooperative speaker would have said "all" if that were true. The Rational Speech Act (RSA) framework models this as nested Bayesian inference: the listener infers meaning by reasoning about a speaker who reasons about a literal listener.
Scheme
; Scalar implicature: "some" implies "not all"
; RSA model (simplified)
;
; Literal listener: P(state | utterance) proportional to truth value
; Speaker: P(utterance | state) proportional to informativity
; Pragmatic listener: P(state | utterance) via Bayes over speaker
; States: "all passed" or "some passed"
; Utterances: "all" or "some"
; Literal semantics: is the utterance true in the state?
(define (literal utt state)
  (cond ((and (equal? utt "all") (equal? state "all")) 1)
        ((and (equal? utt "all") (equal? state "some")) 0)
        ((and (equal? utt "some") (equal? state "all")) 1) ; "some" is true when all
        ((and (equal? utt "some") (equal? state "some")) 1)
        (else 0)))
; Pragmatic speaker: prefers informative utterances
(define (speaker state)
  (let ((p-all (literal "all" state))
        (p-some (literal "some" state)))
    (let ((total (+ p-all p-some)))
      (list (/ p-all total) (/ p-some total)))))
; When state = "all", speaker says "all" or "some" equally
(display "Speaker given 'all': P(say all)=")
(display (car (speaker "all")))
(display " P(say some)=")
(display (cadr (speaker "all"))) (newline)
; When state = "some", speaker must say "some"
(display "Speaker given 'some': P(say all)=")
(display (car (speaker "some")))
(display " P(say some)=")
(display (cadr (speaker "some"))) (newline)
; Pragmatic listener hearing "some":
; P(all|"some") is lower because speaker would have said "all"
(display "Result: hearing 'some' implies 'not all'")
Python
# Scalar implicature via RSA (simplified)

# Literal semantics: is utterance true in state?
def literal(utt, state):
    if utt == "all":
        return 1 if state == "all" else 0
    if utt == "some":
        return 1  # "some" is true in both states

# Speaker: uniform over true utterances
def speaker(state):
    scores = {"all": literal("all", state), "some": literal("some", state)}
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}

for state in ["all", "some"]:
    probs = speaker(state)
    print("Speaker given '" + state + "': " +
          "P(say all)=" + str(probs["all"]) +
          " P(say some)=" + str(probs["some"]))
print("Result: hearing 'some' implies 'not all'")
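The final Bayes step, the pragmatic listener, can be sketched explicitly on top of this speaker. The definitions of literal and speaker are repeated so the sketch is self-contained; the uniform prior over states is a modeling assumption, not something fixed by the text.

```python
# Pragmatic listener: P(state | utterance) via Bayes over the speaker.
# Assumes a uniform prior over states (a modeling choice in this sketch).

def literal(utt, state):
    # "all" is true only in the "all" state; "some" is true in both
    if utt == "all":
        return 1 if state == "all" else 0
    return 1

def speaker(state):
    # Uniform over utterances that are literally true in the state
    scores = {u: literal(u, state) for u in ["all", "some"]}
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}

def pragmatic_listener(utt, prior={"all": 0.5, "some": 0.5}):
    # Bayes' rule: P(state | utt) proportional to P(state) * P(utt | state)
    scores = {s: prior[s] * speaker(s)[utt] for s in prior}
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}

probs = pragmatic_listener("some")
print(f"P(all | 'some')  = {probs['all']:.3f}")   # pulled below the 0.5 prior
print(f"P(some | 'some') = {probs['some']:.3f}")  # the scalar implicature
```

Hearing "some" drives P(all) below the prior, because a speaker in the "all" state had the stronger utterance "all" available; this is the implicature computed quantitatively.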
Notation reference
Term              Meaning
Surprisal         -log2(P(word)); information content in bits
n-gram            P(word | previous n-1 words)
Cross-entropy     Expected surprisal under true distribution
RSA               Rational Speech Acts: nested Bayesian pragmatics
Channel capacity  Maximum rate of reliable communication
Neighbors
Shannon Ch.7 – channel capacity and the noisy channel coding theorem
Lovelace Ch.2 – the Bayesian inference that RSA builds on
The Lovelace textbook covers language acquisition, syntactic parsing, and semantic composition in addition to the information-theoretic and pragmatic perspectives presented here. This page focuses on the three ideas that connect most directly to the computational toolkit: surprisal as a linking hypothesis between models and reading data, n-gram models as the simplest language model, and RSA as Bayesian inference applied to communication.