A context-free grammar (CFG) generates strings through production rules. Each rule rewrites a single variable into a sequence of variables and terminals. CFGs describe more languages than regular expressions: they can handle nesting, matching brackets, and recursive structure.
Productions and derivations
A CFG is a 4-tuple (V, Σ, R, S):
V — finite set of variables (nonterminals)
Σ — finite set of terminals (the alphabet)
R — production rules of the form A → w
S — start variable (S ∈ V)
A derivation is a sequence of rule applications starting from S that produces a string of terminals.
Parse trees
A parse tree shows the structure of a derivation. The root is the start variable. Internal nodes are variables. Leaves are terminals. Reading the leaves left-to-right gives the derived string.
Scheme
; A simple CFG for matched parentheses.; S -> (S) | SS | epsilon; Generate random derivations:
(define (generate depth)
(if (> depth 5) ""; limit depth
(let ((r (random-integer 3)))
(cond
((= r 0) (string-append "(" (generate (+ depth 1)) ")"))
((= r 1) (string-append (generate (+ depth 1))
(generate (+ depth 1))))
(else"")))))
; Check if parentheses are balanced
(define (balanced? s)
(define (check i count)
(cond
((< count 0) #f)
((= i (string-length s)) (= count 0))
((equal? (substring s i (+ i 1)) "(")
(check (+ i 1) (+ count 1)))
(else (check (+ i 1) (- count 1)))))
(check 00))
(display "Generated strings from S -> (S) | SS | eps:") (newline)
(let loop ((n 0))
(when (< n 6)
(let ((s (generate 0)))
(display " "") (display s) (display "" balanced? ")
(display (balanced? s)) (newline)
(loop (+ n 1)))))
Python
# CFG for balanced parentheses: S -> (S) | SS | epsilonimport random
def generate(depth=0):
if depth > 5: return""
r = random.randint(0, 2)
if r == 0: return"(" + generate(depth+1) + ")"if r == 1: return generate(depth+1) + generate(depth+1)
return""def balanced(s):
count = 0for ch in s:
count += 1if ch == '('else -1if count < 0: returnFalsereturn count == 0for _ inrange(6):
s = generate()
print(f' "{s}" balanced? {balanced(s)}')
Ambiguity
A grammar is ambiguous if some string has two different parse trees. The classic example: E → E+E | E*E | n gives two parse trees for "1+2*3". Ambiguity is a property of the grammar, not the language. Sometimes you can find an unambiguous grammar for the same language. Sometimes you cannot: some context-free languages are inherently ambiguous.
Chomsky normal form
Every CFG can be converted to Chomsky normal form (CNF): every production is either A → BC (two variables) or A → a (one terminal). CNF makes proofs easier and enables the CYK parsing algorithm (O(n3) time).
Scheme
; CYK algorithm for CFG in Chomsky Normal Form.; Grammar: S -> AB, A -> a, B -> b; This recognizes exactly the string "ab".; In CNF, we use dynamic programming on substrings.; table[i][j] = set of variables that derive s[i..j]
(define (cyk s)
(let* ((n (string-length s))
(table (make-vector n)))
; Initialize: length-1 substrings
(let init ((i 0))
(when (< i n)
(vector-set! table i (make-vector n (list)))
(let ((ch (substring s i (+ i 1))))
(cond
((equal? ch "a")
(vector-set! (vector-ref table i) i (list "A")))
((equal? ch "b")
(vector-set! (vector-ref table i) i (list "B")))))
(init (+ i 1))))
; Fill table for longer substrings
(let len-loop ((len 2))
(when (<= len n)
(let i-loop ((i 0))
(when (<= (+ i len) n)
(let ((j (+ i len -1)))
(let k-loop ((k i))
(when (< k j)
(let ((left (vector-ref (vector-ref table i) k))
(right (vector-ref (vector-ref table (+ k 1)) j)))
(when (and (member "A" left) (member "B" right))
(let ((cur (vector-ref (vector-ref table i) j)))
(unless (member "S" cur)
(vector-set! (vector-ref table i) j
(cons "S" cur))))))
(k-loop (+ k 1)))))
(i-loop (+ i 1))))
(len-loop (+ len 1))))
; Check if S derives entire string
(member "S" (vector-ref (vector-ref table 0) (- n 1)))))
(display "ab: ") (display (if (cyk "ab") "accept""reject")) (newline)
(display "a: ") (display (if (cyk "a") "accept""reject")) (newline)
(display "ba: ") (display (if (cyk "ba") "accept""reject"))
Neighbors
🪄 SICP Ch.4 — parsing and interpreters use context-free grammars
🪄 SICP Ch.1 — Scheme syntax is a context-free grammar; the reader is a CFG parser
🤖 ML Ch.12 — LLMs model language distributions but lack the inductive bias of CFGs
🔑 Logic Ch.6 — predicate logic has a context-free grammar for its well-formed formulas