← back to distributed systems

What Makes It Hard

Wikipedia · Cambridge Distributed Systems · wpDistributed computing

A distributed system is one where a message between nodes can be lost, delayed, reordered, or duplicated. Nodes can fail independently. There is no shared clock. These four facts make everything else hard.

Network partitions

A network partition splits nodes into groups that cannot communicate with each other. The network is working fine on each side. The problem is that neither side knows whether the other has crashed or is simply unreachable. You cannot distinguish "slow" from "dead."

A B C connected partitioned
Scheme

Partial failure

In a single machine, either the whole thing works or the whole thing crashes. In a distributed system, some nodes can fail while others keep running. The system is in an indeterminate state: the request was sent, but did the other side process it before crashing? Maybe. There is no way to know without further communication, and communication might be broken.

Scheme

No global clock

Each node has its own clock. Clocks drift. Even with NTP, you get milliseconds of skew. In special relativity, simultaneity is frame-dependent: two events that are "simultaneous" for one observer are ordered differently for another. Distributed systems have the same problem: there is no physical fact about which event happened first across nodes.

Scheme

Byzantine faults

A Byzantine fault is when a node does not just crash but sends incorrect or contradictory messages. It might be compromised, buggy, or malicious. Crash faults are a special case: a crashed node sends no messages at all. Byzantine faults are strictly harder because the faulty node actively misleads.

Scheme
Neighbors

Cross-references

  • ⚡ Physics Ch.10 — relativity: no simultaneity, the physical root of "no global clock"