Query processing translates a SQL query into an execution plan. The optimizer considers multiple candidate plans (different join orders, index choices, and join algorithms) and picks the one with the lowest estimated cost. The three main join algorithms are nested-loop, sort-merge, and hash join.
Nested-loop join
The simplest join algorithm: for each row in the outer table, scan the entire inner table for matches. Cost: O(n * m). Good when one table is small, or when the inner table has an index on the join key (so each inner "scan" becomes a lookup).
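A minimal sketch of the nested-loop idea, using plain dicts as rows (the function and field names here are illustrative, not from any particular database):

```python
# Nested-loop join sketch: for each outer row, scan the whole inner table.
def nested_loop_join(outer, inner, key_o, key_i):
    result = []
    for o in outer:          # n outer rows
        for r in inner:      # scan all m inner rows each time -> O(n * m)
            if o[key_o] == r[key_i]:
                result.append({**o, **r})
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for row in nested_loop_join(users, orders, "id", "uid"):
    print(row["name"], "->", row["item"])
```

With an index on the inner table's join key, the inner scan collapses to an index lookup, which is what makes indexed nested-loop joins competitive.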
Sort-merge join
Sort both tables on the join key, then merge them in one pass. Cost: O(n log n + m log m) for the sorts plus O(n + m) for the merge. Best when both tables are large and already sorted (or nearly sorted) on the join key, so the sort phase is cheap or unnecessary.
Python
# Sort-merge join
def sort_merge_join(left, right, key_l, key_r):
    left_sorted = sorted(left, key=lambda r: r[key_l])
    right_sorted = sorted(right, key=lambda r: r[key_r])
    result = []
    i, j = 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_sorted[i][key_l]
        rk = right_sorted[j][key_r]
        if lk == rk:
            # Collect all matching right rows for this left row
            j2 = j
            while j2 < len(right_sorted) and right_sorted[j2][key_r] == lk:
                result.append({**left_sorted[i], **right_sorted[j2]})
                j2 += 1
            i += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]
orders = [{"uid": 3, "item": "desk"}, {"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for r in sort_merge_join(users, orders, "id", "uid"):
    print(" " + r["name"] + " -> " + r["item"])
Hash join
Build a hash table on the smaller (build) relation, then probe it with each row of the larger (probe) relation. Cost: O(n + m) on average. The default choice in most modern query optimizers when no useful index exists.
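The build/probe structure can be sketched in a few lines, in the same dict-per-row style as the sort-merge example above (illustrative only; real systems spill partitions to disk when the build side does not fit in memory):

```python
from collections import defaultdict

# Hash join sketch: build a hash table on the smaller relation,
# then probe it with each row of the larger relation.
def hash_join(build, probe, key_b, key_p):
    table = defaultdict(list)
    for b in build:                      # build phase: O(n)
        table[b[key_b]].append(b)
    result = []
    for p in probe:                      # probe phase: O(m)
        for b in table[p[key_p]]:        # defaultdict: missing key -> []
            result.append({**b, **p})
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]
orders = [{"uid": 3, "item": "desk"}, {"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for row in hash_join(users, orders, "id", "uid"):
    print(row["name"], "->", row["item"])
```

Note the asymmetry: only the build side has to fit in memory, which is why the optimizer puts the smaller relation on that side.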
The optimizer estimates the cost of each candidate plan using statistics: table sizes, index selectivity, and data distribution. It picks the plan with the lowest estimated I/O and CPU cost. Those estimates are not always accurate, which is why the optimizer sometimes picks a suboptimal plan.
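As a toy illustration of cost-based choice (the row counts and formulas below are made up and far simpler than a real optimizer's I/O-based model), comparing rough operation counts already shows why hash join tends to win on large inputs with no useful index or sort order:

```python
import math

# Toy per-algorithm cost formulas in "row operations" (not a real
# optimizer's model) for joining relations of n and m rows.
def nested_loop_cost(n, m):
    return n * m                                         # scan inner per outer row

def sort_merge_cost(n, m):
    return n * math.log2(n) + m * math.log2(m) + n + m   # two sorts + one merge

def hash_join_cost(n, m):
    return n + m                                         # build + probe

n, m = 1_000_000, 500_000
costs = {
    "nested-loop": nested_loop_cost(n, m),
    "sort-merge": sort_merge_cost(n, m),
    "hash": hash_join_cost(n, m),
}
print(min(costs, key=costs.get))
```

Change the inputs and the winner changes too, e.g. with a tiny outer table and an indexed inner table, nested-loop becomes the cheapest plan; this sensitivity to the estimated sizes is exactly where bad statistics lead to bad plans.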