Regular expressions describe exactly the regular languages: union, concatenation, and Kleene star. They are equivalent to finite automata. The pumping lemma proves that some languages are beyond this level.
Syntax
A regular expression over alphabet Σ is built from three operations:
Union: R1 | R2 matches strings in either language
Concatenation: R1R2 matches a string from R1 followed by one from R2
Kleene star: R* matches zero or more repetitions of R
Base cases: each symbol a ∈ Σ is a regex, ε matches the empty string, and ∅ matches nothing.
Scheme
; Regex matching via recursive descent.; We match (a|b)*c by hand.
(define (match-star-ab s i)
; Match zero or more a's or b's starting at position i.; Returns list of all possible end positions.
(if (>= i (string-length s))
(list i)
(let ((ch (substring s i (+ i 1))))
(if (or (equal? ch "a") (equal? ch "b"))
(cons i (match-star-ab s (+ i 1)))
(list i)))))
(define (match-regex s)
; Match (a|b)*c against entire string s
(let ((positions (match-star-ab s 0)))
(define (try-c pos)
(and (< pos (string-length s))
(equal? (substring s pos (+ pos 1)) "c")
(= (+ pos 1) (string-length s))))
(not (null? (filter try-c positions)))))
(display "abc: ") (display (match-regex "abc")) (newline)
(display "c: ") (display (match-regex "c")) (newline)
(display "aabbc: ") (display (match-regex "aabbc")) (newline)
(display "ab: ") (display (match-regex "ab")) (newline)
(display "cc: ") (display (match-regex "cc"))
Python
# Regex matching: (a|b)*cimport re
pattern = re.compile(r'^[ab]*c$')
for s in ["abc", "c", "aabbc", "ab", "cc"]:
print(s + ": " + str(bool(pattern.match(s))))
Equivalence with finite automata
Two directions: every regex can be converted to an NFA (Thompson's construction), and every DFA can be converted to a regex (state elimination). The regular languages are exactly those describable by regular expressions.
The pumping lemma
If a language is regular, then every sufficiently long string in it can be "pumped": a middle section can be repeated any number of times and the result stays in the language. Contrapositive: if you can find strings that resist pumping, the language is not regular.
Scheme
; Pumping lemma example: L = a^n b^n is NOT regular.; Proof by contradiction: assume it's regular with; pumping length p. Take s = a^p b^p.; Any split s = xyz with |xy| <= p means y = a^k.; Pumping: xy^2z = a^(p+k) b^p, which is NOT in L.; Demonstrate: a^n b^n is easy to check but not regular.
(define (make-anbn n)
(string-append (make-string n #a) (make-string n #)))
(define (in-anbn? s)
(let ((len (string-length s)))
(if (odd? len) #f
(let ((half (/ len 2)))
(and (equal? (substring s 0 half)
(make-string half #a))
(equal? (substring s half len)
(make-string half #)))))))
(display "aabb: ") (display (in-anbn? "aabb")) (newline)
(display "aaabbb: ") (display (in-anbn? "aaabbb")) (newline)
(display "aab: ") (display (in-anbn? "aab")) (newline)
(display "abab: ") (display (in-anbn? "abab")) (newline)
(newline)
(display "This language needs counting.") (newline)
(display "No finite automaton can count unboundedly.")
Python
# a^n b^n: easy to check but not regulardef in_anbn(s):
n = len(s)
if n % 2 != 0:
returnFalse
half = n // 2return s[:half] == 'a' * half and s[half:] == 'b' * half
for s in ["aabb", "aaabbb", "aab", "abab", ""]:
print(f"{repr(s)}: {in_anbn(s)}")
print()
print("This language needs counting.")
print("No finite automaton can count unboundedly.")
Neighbors
🪄 SICP Ch.9 — pattern matching in Scheme: regular expressions as recognition procedures
📡 Information Theory Ch.1 — regular languages have finite entropy rate; their recognizers are minimal automata
💾 Databases Ch.4 — SQL LIKE patterns and database query languages descend from regular expression theory