Regular Expressions

Maheshwari & Smid · Ch. 3 · Introduction to Theory of Computation

Regular expressions describe exactly the regular languages: union, concatenation, and Kleene star. They are equivalent to finite automata. The pumping lemma proves that some languages are beyond this level.

Syntax

A regular expression over alphabet Σ is built from three operations:

Union: R₁ | R₂ matches strings in either language
Concatenation: R₁R₂ matches a string from R₁ followed by one from R₂
Kleene star: R* matches zero or more repetitions of R

Base cases: each symbol a ∈ Σ is a regex, ε matches the empty string, and ∅ matches nothing.

Scheme

; Regex matching via recursive descent.
; We match (a|b)*c by hand.

(define (match-star-ab s i)
  ; Match zero or more a's or b's starting at position i.
  ; Returns list of all possible end positions.
  (if (>= i (string-length s))
      (list i)
      (let ((ch (substring s i (+ i 1))))
        (if (or (equal? ch "a") (equal? ch "b"))
            (cons i (match-star-ab s (+ i 1)))
            (list i)))))

(define (match-regex s)
  ; Match (a|b)*c against entire string s
  (let ((positions (match-star-ab s 0)))
    (define (try-c pos)
      (and (< pos (string-length s))
           (equal? (substring s pos (+ pos 1)) "c")
           (= (+ pos 1) (string-length s))))
    (not (null? (filter try-c positions)))))

(display "abc:   ") (display (match-regex "abc")) (newline)
(display "c:     ") (display (match-regex "c")) (newline)
(display "aabbc: ") (display (match-regex "aabbc")) (newline)
(display "ab:    ") (display (match-regex "ab")) (newline)
(display "cc:    ") (display (match-regex "cc"))

; Regex matching via recursive descent.
; We match (a|b)*c by hand.

(define (match-star-ab s i)
  ; Match zero or more a's or b's starting at position i.
  ; Returns list of all possible end positions.
  (if (>= i (string-length s))
      (list i)
      (let ((ch (substring s i (+ i 1))))
        (if (or (equal? ch "a") (equal? ch "b"))
            (cons i (match-star-ab s (+ i 1)))
            (list i)))))

(define (match-regex s)
  ; Match (a|b)*c against entire string s
  (let ((positions (match-star-ab s 0)))
    (define (try-c pos)
      (and (< pos (string-length s))
           (equal? (substring s pos (+ pos 1)) "c")
           (= (+ pos 1) (string-length s))))
    (not (null? (filter try-c positions)))))

(display "abc:   ") (display (match-regex "abc")) (newline)
(display "c:     ") (display (match-regex "c")) (newline)
(display "aabbc: ") (display (match-regex "aabbc")) (newline)
(display "ab:    ") (display (match-regex "ab")) (newline)
(display "cc:    ") (display (match-regex "cc"))

Python

# Regex matching: (a|b)*c
import re

pattern = re.compile(r'^[ab]*c$')

for s in ["abc", "c", "aabbc", "ab", "cc"]:
    print(s + ": " + str(bool(pattern.match(s))))

Equivalence with finite automata

Two directions: every regex can be converted to an NFA (Thompson's construction), and every DFA can be converted to a regex (state elimination). The regular languages are exactly those describable by regular expressions.

The pumping lemma

If a language is regular, then every sufficiently long string in it can be "pumped": a middle section can be repeated any number of times and the result stays in the language. Contrapositive: if you can find strings that resist pumping, the language is not regular.

Scheme

; Pumping lemma example: L = a^n b^n is NOT regular.
; Proof by contradiction: assume it's regular with
; pumping length p. Take s = a^p b^p.
; Any split s = xyz with |xy| <= p means y = a^k.
; Pumping: xy^2z = a^(p+k) b^p, which is NOT in L.

; Demonstrate: a^n b^n is easy to check but not regular.
(define (make-anbn n)
  (string-append (make-string n #a) (make-string n #)))

(define (in-anbn? s)
  (let ((len (string-length s)))
    (if (odd? len) #f
        (let ((half (/ len 2)))
          (and (equal? (substring s 0 half)
                       (make-string half #a))
               (equal? (substring s half len)
                       (make-string half #)))))))

(display "aabb:   ") (display (in-anbn? "aabb")) (newline)
(display "aaabbb: ") (display (in-anbn? "aaabbb")) (newline)
(display "aab:    ") (display (in-anbn? "aab")) (newline)
(display "abab:   ") (display (in-anbn? "abab")) (newline)
(newline)
(display "This language needs counting.") (newline)
(display "No finite automaton can count unboundedly.")

; Pumping lemma example: L = a^n b^n is NOT regular.
; Proof by contradiction: assume it's regular with
; pumping length p. Take s = a^p b^p.
; Any split s = xyz with |xy| <= p means y = a^k.
; Pumping: xy^2z = a^(p+k) b^p, which is NOT in L.

; Demonstrate: a^n b^n is easy to check but not regular.
(define (make-anbn n)
  (string-append (make-string n #a) (make-string n #)))

(define (in-anbn? s)
  (let ((len (string-length s)))
    (if (odd? len) #f
        (let ((half (/ len 2)))
          (and (equal? (substring s 0 half)
                       (make-string half #a))
               (equal? (substring s half len)
                       (make-string half #)))))))

(display "aabb:   ") (display (in-anbn? "aabb")) (newline)
(display "aaabbb: ") (display (in-anbn? "aaabbb")) (newline)
(display "aab:    ") (display (in-anbn? "aab")) (newline)
(display "abab:   ") (display (in-anbn? "abab")) (newline)
(newline)
(display "This language needs counting.") (newline)
(display "No finite automaton can count unboundedly.")

Python

# a^n b^n: easy to check but not regular
def in_anbn(s):
    n = len(s)
    if n % 2 != 0:
        return False
    half = n // 2
    return s[:half] == 'a' * half and s[half:] == 'b' * half

for s in ["aabb", "aaabbb", "aab", "abab", ""]:
    print(f"{repr(s)}: {in_anbn(s)}")

print()
print("This language needs counting.")
print("No finite automaton can count unboundedly.")

Neighbors

🪄 SICP Ch.9 — pattern matching in Scheme: regular expressions as recognition procedures
📡 Information Theory Ch.1 — regular languages have finite entropy rate; their recognizers are minimal automata
💾 Databases Ch.4 — SQL LIKE patterns and database query languages descend from regular expression theory

← Nondeterminism by june.kim Context-Free Grammars →