What Makes It Hard

Wikipedia · Cambridge Distributed Systems · Distributed computing

A distributed system is one where a message between nodes can be lost, delayed, reordered, or duplicated. Nodes can fail independently. There is no shared clock. These four facts make everything else hard.

Network partitions

A network partition splits nodes into groups that cannot communicate with each other. The network is working fine on each side. The problem is that neither side knows whether the other has crashed or is simply unreachable. You cannot distinguish "slow" from "dead."

Scheme

; Simulating a network partition.
; Node A can reach B, but not C.
; Is C crashed or just unreachable? We cannot tell.

(define nodes (list "A" "B" "C"))

; adjacency: who can reach whom
(define (can-reach? from to)
  (cond
    ((and (equal? from "A") (equal? to "B")) #t)
    ((and (equal? from "B") (equal? to "A")) #t)
    ; B-C link is down (partition)
    ((and (equal? from "B") (equal? to "C")) #f)
    ((and (equal? from "C") (equal? to "B")) #f)
    ((and (equal? from "A") (equal? to "C")) #f)
    ((and (equal? from "C") (equal? to "A")) #f)
    (else #f)))

(display "A -> B: ") (display (can-reach? "A" "B")) (newline)
(display "A -> C: ") (display (can-reach? "A" "C")) (newline)
(display "B -> C: ") (display (can-reach? "B" "C")) (newline)
(display "Is C crashed or partitioned? Unknown.")

Partial failure

In a single machine, either the whole thing works or the whole thing crashes. In a distributed system, some nodes can fail while others keep running. The system is in an indeterminate state: the request was sent, but did the other side process it before crashing? Maybe. There is no way to know without further communication, and communication might be broken.

Scheme

; Partial failure: send a request, get no response.
; Did the server crash before processing?
; Or did it process but the response was lost?

(define (send-request server)
  ; simulate: server might crash at any point
  (let ((crash-point (modulo (* server 7) 3)))
    (cond
      ((= crash-point 0) "crashed before processing")
      ((= crash-point 1) "processed, but response lost")
      ((= crash-point 2) "success"))))

(display "Attempt 1: ") (display (send-request 1)) (newline)
(display "Attempt 2: ") (display (send-request 2)) (newline)
(display "Attempt 3: ") (display (send-request 3)) (newline)
; From the client side, all three look the same: timeout.
(display "Client sees: timeout, timeout, timeout")

No global clock

Each node has its own clock. Clocks drift. Even with NTP, you get milliseconds of skew. In special relativity, simultaneity is frame-dependent: two events that are "simultaneous" for one observer are ordered differently for another. Distributed systems have the same problem: there is no physical fact about which event happened first across nodes.

Scheme

; Two nodes, two clocks, no agreement on "now."
; Node A says event happened at t=100.
; Node B says event happened at t=101.
; Which was first? Depends on clock skew.

(define clock-A 100)
(define clock-B 101)
(define skew 3)  ; B's clock is 3 ahead of real time

(define real-time-A clock-A)           ; A is accurate
(define real-time-B (- clock-B skew))  ; B reads 101 but real = 98

(display "A's clock: ") (display clock-A) (newline)
(display "B's clock: ") (display clock-B) (newline)
(display "A's real time: ") (display real-time-A) (newline)
(display "B's real time: ") (display real-time-B) (newline)
(display "Clock order: B after A") (newline)
(display "Real order:  B before A") (newline)
(display "Clocks lie.")

Byzantine faults

A Byzantine fault is when a node does not just crash but sends incorrect or contradictory messages. It might be compromised, buggy, or malicious. Crash faults are a special case: a crashed node sends no messages at all. Byzantine faults are strictly harder because the faulty node actively misleads.

Scheme

; Byzantine fault: node sends different values to different peers.
; A asks B for the value. B says 42.
; C asks B for the value. B says 99.
; B is Byzantine (lying / compromised).

(define (byzantine-node requester)
  (cond
    ((equal? requester "A") 42)
    ((equal? requester "C") 99)
    (else 0)))

(display "B tells A: ") (display (byzantine-node "A")) (newline)
(display "B tells C: ") (display (byzantine-node "C")) (newline)
(display "A and C disagree about B's value.") (newline)
(display "Neither knows B is lying.")

Neighbors

Cross-references

⚡ Physics Ch.10 — relativity: no simultaneity, the physical root of "no global clock"

by june.kim Time and Clocks →