Context-Free Grammars

Maheshwari & Smid · Ch. 4 · Introduction to Theory of Computation

A context-free grammar (CFG) generates strings through production rules. Each rule rewrites a single variable into a sequence of variables and terminals. CFGs describe more languages than regular expressions: they can handle nesting, matching brackets, and recursive structure.

Productions and derivations

A CFG is a 4-tuple (V, Σ, R, S):

V — finite set of variables (nonterminals)
Σ — finite set of terminals (the alphabet)
R — production rules of the form A → w
S — start variable (S ∈ V)

A derivation is a sequence of rule applications starting from S that produces a string of terminals.

Parse trees

A parse tree shows the structure of a derivation. The root is the start variable. Internal nodes are variables. Leaves are terminals. Reading the leaves left-to-right gives the derived string.

Scheme

; A simple CFG for matched parentheses.
; S -> (S) | SS | epsilon
; Generate random derivations:

(define (generate depth)
  (if (> depth 5) ""  ; limit depth
      (let ((r (random-integer 3)))
        (cond
          ((= r 0) (string-append "(" (generate (+ depth 1)) ")"))
          ((= r 1) (string-append (generate (+ depth 1))
                                  (generate (+ depth 1))))
          (else "")))))

; Check if parentheses are balanced
(define (balanced? s)
  (define (check i count)
    (cond
      ((< count 0) #f)
      ((= i (string-length s)) (= count 0))
      ((equal? (substring s i (+ i 1)) "(")
       (check (+ i 1) (+ count 1)))
      (else (check (+ i 1) (- count 1)))))
  (check 0 0))

(display "Generated strings from S -> (S) | SS | eps:") (newline)
(let loop ((n 0))
  (when (< n 6)
    (let ((s (generate 0)))
      (display "  "") (display s) (display "" balanced? ")
      (display (balanced? s)) (newline)
      (loop (+ n 1)))))

Ambiguity

A grammar is ambiguous if some string has two different parse trees. The classic example: E → E+E | E*E | n gives two parse trees for "1+2*3". Ambiguity is a property of the grammar, not the language. Sometimes you can find an unambiguous grammar for the same language. Sometimes you cannot: some context-free languages are inherently ambiguous.

Chomsky normal form

Every CFG can be converted to Chomsky normal form (CNF): every production is either A → BC (two variables) or A → a (one terminal). CNF makes proofs easier and enables the CYK parsing algorithm (O(n³) time).

Scheme

; CYK algorithm for CFG in Chomsky Normal Form.
; Grammar: S -> AB, A -> a, B -> b
; This recognizes exactly the string "ab".

; In CNF, we use dynamic programming on substrings.
; table[i][j] = set of variables that derive s[i..j]

(define (cyk s)
  (let* ((n (string-length s))
         (table (make-vector n)))
    ; Initialize: length-1 substrings
    (let init ((i 0))
      (when (< i n)
        (vector-set! table i (make-vector n (list)))
        (let ((ch (substring s i (+ i 1))))
          (cond
            ((equal? ch "a")
             (vector-set! (vector-ref table i) i (list "A")))
            ((equal? ch "b")
             (vector-set! (vector-ref table i) i (list "B")))))
        (init (+ i 1))))
    ; Fill table for longer substrings
    (let len-loop ((len 2))
      (when (<= len n)
        (let i-loop ((i 0))
          (when (<= (+ i len) n)
            (let ((j (+ i len -1)))
              (let k-loop ((k i))
                (when (< k j)
                  (let ((left (vector-ref (vector-ref table i) k))
                        (right (vector-ref (vector-ref table (+ k 1)) j)))
                    (when (and (member "A" left) (member "B" right))
                      (let ((cur (vector-ref (vector-ref table i) j)))
                        (unless (member "S" cur)
                          (vector-set! (vector-ref table i) j
                                       (cons "S" cur))))))
                  (k-loop (+ k 1)))))
            (i-loop (+ i 1))))
        (len-loop (+ len 1))))
    ; Check if S derives entire string
    (member "S" (vector-ref (vector-ref table 0) (- n 1)))))

(display "ab:  ") (display (if (cyk "ab") "accept" "reject")) (newline)
(display "a:   ") (display (if (cyk "a") "accept" "reject")) (newline)
(display "ba:  ") (display (if (cyk "ba") "accept" "reject"))

; CYK algorithm for CFG in Chomsky Normal Form.
; Grammar: S -> AB, A -> a, B -> b
; This recognizes exactly the string "ab".

; In CNF, we use dynamic programming on substrings.
; table[i][j] = set of variables that derive s[i..j]

(define (cyk s)
  (let* ((n (string-length s))
         (table (make-vector n)))
    ; Initialize: length-1 substrings
    (let init ((i 0))
      (when (< i n)
        (vector-set! table i (make-vector n (list)))
        (let ((ch (substring s i (+ i 1))))
          (cond
            ((equal? ch "a")
             (vector-set! (vector-ref table i) i (list "A")))
            ((equal? ch "b")
             (vector-set! (vector-ref table i) i (list "B")))))
        (init (+ i 1))))
    ; Fill table for longer substrings
    (let len-loop ((len 2))
      (when (<= len n)
        (let i-loop ((i 0))
          (when (<= (+ i len) n)
            (let ((j (+ i len -1)))
              (let k-loop ((k i))
                (when (< k j)
                  (let ((left (vector-ref (vector-ref table i) k))
                        (right (vector-ref (vector-ref table (+ k 1)) j)))
                    (when (and (member "A" left) (member "B" right))
                      (let ((cur (vector-ref (vector-ref table i) j)))
                        (unless (member "S" cur)
                          (vector-set! (vector-ref table i) j
                                       (cons "S" cur))))))
                  (k-loop (+ k 1)))))
            (i-loop (+ i 1))))
        (len-loop (+ len 1))))
    ; Check if S derives entire string
    (member "S" (vector-ref (vector-ref table 0) (- n 1)))))

(display "ab:  ") (display (if (cyk "ab") "accept" "reject")) (newline)
(display "a:   ") (display (if (cyk "a") "accept" "reject")) (newline)
(display "ba:  ") (display (if (cyk "ba") "accept" "reject"))

Neighbors

🪄 SICP Ch.4 — parsing and interpreters use context-free grammars
🪄 SICP Ch.1 — Scheme syntax is a context-free grammar; the reader is a CFG parser
🤖 ML Ch.12 — LLMs model language distributions but lack the inductive bias of CFGs
🔑 Logic Ch.6 — predicate logic has a context-free grammar for its well-formed formulas

← Regular Expressions by june.kim Pushdown Automata →