Query processing translates a SQL query into an execution plan. The optimizer considers multiple candidate plans (different join orders, index choices, and join algorithms) and picks the one with the lowest estimated cost. The three main join algorithms are nested-loop, sort-merge, and hash join.
Nested-loop join
The simplest join algorithm: for each row in the outer table, scan the entire inner table for matches. Cost: O(n * m). Good when one table is small, or when the inner table has an index on the join key (so each inner "scan" becomes a lookup).
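A minimal sketch of the nested-loop idea, using plain dicts as rows (the function and field names here are illustrative, not from any particular database):

```python
# Nested-loop join sketch: for each outer row, scan the whole inner table.
def nested_loop_join(outer, inner, key_o, key_i):
    result = []
    for o in outer:          # n outer rows
        for r in inner:      # scan all m inner rows each time -> O(n * m)
            if o[key_o] == r[key_i]:
                result.append({**o, **r})
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for row in nested_loop_join(users, orders, "id", "uid"):
    print(row["name"], "->", row["item"])
```

With an index on the inner table's join key, the inner scan collapses to an index lookup, which is what makes indexed nested-loop joins competitive.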
Sort-merge join
Sort both tables on the join key, then merge them in one pass. Cost: O(n log n + m log m) for the sorts plus O(n + m) for the merge. Best when both tables are large and already sorted (or nearly sorted) on the join key, so the sort phase is cheap or unnecessary.
Python
# Sort-merge join
def sort_merge_join(left, right, key_l, key_r):
    left_sorted = sorted(left, key=lambda r: r[key_l])
    right_sorted = sorted(right, key=lambda r: r[key_r])
    result = []
    i, j = 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_sorted[i][key_l]
        rk = right_sorted[j][key_r]
        if lk == rk:
            # Collect all matching right rows for this left row
            j2 = j
            while j2 < len(right_sorted) and right_sorted[j2][key_r] == lk:
                result.append({**left_sorted[i], **right_sorted[j2]})
                j2 += 1
            i += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]
orders = [{"uid": 3, "item": "desk"}, {"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for r in sort_merge_join(users, orders, "id", "uid"):
    print(" " + r["name"] + " -> " + r["item"])
Hash join
Build a hash table on the smaller (build) relation, then probe it with each row of the larger (probe) relation. Cost: O(n + m) on average. The default choice in most modern query optimizers when no useful index exists.
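The build/probe structure can be sketched in a few lines, in the same dict-per-row style as the sort-merge example above (illustrative only; real systems spill partitions to disk when the build side does not fit in memory):

```python
from collections import defaultdict

# Hash join sketch: build a hash table on the smaller relation,
# then probe it with each row of the larger relation.
def hash_join(build, probe, key_b, key_p):
    table = defaultdict(list)
    for b in build:                      # build phase: O(n)
        table[b[key_b]].append(b)
    result = []
    for p in probe:                      # probe phase: O(m)
        for b in table[p[key_p]]:        # defaultdict: missing key -> []
            result.append({**b, **p})
    return result

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]
orders = [{"uid": 3, "item": "desk"}, {"uid": 1, "item": "book"}, {"uid": 1, "item": "lamp"}]
for row in hash_join(users, orders, "id", "uid"):
    print(row["name"], "->", row["item"])
```

Note the asymmetry: only the build side has to fit in memory, which is why the optimizer puts the smaller relation on that side.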
The optimizer estimates the cost of each candidate plan using statistics: table sizes, index selectivity, and data distribution. It picks the plan with the lowest estimated I/O and CPU cost. Those estimates are not always accurate, which is why the optimizer sometimes picks a suboptimal plan.
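As a toy illustration of cost-based choice (the row counts and formulas below are made up and far simpler than a real optimizer's I/O-based model), comparing rough operation counts already shows why hash join tends to win on large inputs with no useful index or sort order:

```python
import math

# Toy per-algorithm cost formulas in "row operations" (not a real
# optimizer's model) for joining relations of n and m rows.
def nested_loop_cost(n, m):
    return n * m                                         # scan inner per outer row

def sort_merge_cost(n, m):
    return n * math.log2(n) + m * math.log2(m) + n + m   # two sorts + one merge

def hash_join_cost(n, m):
    return n + m                                         # build + probe

n, m = 1_000_000, 500_000
costs = {
    "nested-loop": nested_loop_cost(n, m),
    "sort-merge": sort_merge_cost(n, m),
    "hash": hash_join_cost(n, m),
}
print(min(costs, key=costs.get))
```

Change the inputs and the winner changes too, e.g. with a tiny outer table and an indexed inner table, nested-loop becomes the cheapest plan; this sensitivity to the estimated sizes is exactly where bad statistics lead to bad plans.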