How Consensus Works — Getting Nodes to Agree

2026-03-23

Five nodes in a cluster need to agree on who is the leader. Three of them can communicate. Two are unreachable (network partition). How do the three remaining nodes make a decision that won't conflict with whatever the two unreachable nodes decide?

This is the consensus problem: getting a group of nodes to agree on a value, even when some nodes fail or can't communicate. It's the foundation of strong consistency, leader election, and distributed coordination.

Why Is Consensus Hard?

On a single machine, agreement is trivial — one processor, one decision. In a distributed system:

  • Nodes crash — a node might fail after voting but before the decision is announced.
  • Messages are delayed — a vote might arrive late, after the decision was already made.
  • Network partitions — some nodes can't communicate with others. Each partition might try to make independent decisions.
  • Clocks disagree — there's no global clock. "Which message came first?" has no definitive answer.

The FLP impossibility theorem (Fischer, Lynch, Paterson, 1985) proves that no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one node can fail. Real systems work around this by using timeouts (making the system partially synchronous) and randomization.

How Does Raft Work?

Raft (2014, Diego Ongaro and John Ousterhout) is designed to be understandable. It decomposes consensus into three subproblems:

1. Leader Election

One node is elected leader. All client requests go to the leader. The leader decides the order of operations.

  • Each node starts as a follower.
  • If a follower doesn't hear from a leader within a timeout (randomized, 150-300ms), it becomes a candidate and requests votes.
  • A candidate needs votes from a majority (e.g., 3 of 5 nodes) to become leader.
  • Once elected, the leader sends heartbeats to maintain authority.

The randomized timeout is critical: if two nodes time out simultaneously and both become candidates, they split the vote. The randomization ensures one usually times out first, getting a head start.
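The election rules above can be sketched in a few lines. This is a simplified illustration, not a full Raft implementation; the function names (`election_timeout`, `run_election`) are my own, and the timeout range follows the 150-300ms suggestion mentioned above.

```python
import random

def election_timeout(low_ms=150, high_ms=300):
    """Randomized election timeout: staggers candidates so one
    usually starts (and wins) an election before the others time out."""
    return random.uniform(low_ms, high_ms)

def run_election(candidate, peers, votes_granted):
    """A candidate wins if it holds a majority of the full cluster.
    votes_granted[i] is whether peers[i] granted its vote."""
    cluster_size = len(peers) + 1        # peers plus the candidate itself
    majority = cluster_size // 2 + 1
    votes = 1 + sum(votes_granted)       # a candidate always votes for itself
    return votes >= majority

# 5-node cluster: N1 is the candidate, N5 is down, N2 and N3 grant votes.
won = run_election("N1", ["N2", "N3", "N4", "N5"], [True, True, False, False])
print(won)  # True: 3 of 5 votes is a majority
```

Note that N5 being unreachable doesn't block the election: the majority is computed over the full cluster size, so 3 of 5 is enough even with one node down.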

Leader election example (5 nodes, majority = 3): N1 becomes a candidate while N2, N3, and N4 remain followers and N5 is down. N2 and N3 grant their votes, so N1 has 3 votes (self + N2 + N3), a majority, and becomes leader.

2. Log Replication

The leader maintains a replicated log. Each client operation becomes a log entry:

  1. Client sends a write to the leader.
  2. Leader appends the entry to its log.
  3. Leader sends the entry to all followers.
  4. When a majority of followers confirm (append to their logs), the entry is committed.
  5. The leader applies the entry to its state machine and responds to the client.

The committed entry is now durable — it exists on a majority of nodes. Even if the leader crashes, the new leader will have the entry: a candidate can only win an election with a log at least as up-to-date as a majority of voters, and any such majority overlaps with the majority that stored the entry.
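The commit rule in steps 3-4 can be sketched as follows. This is a minimal illustration of the majority-ack logic only (the `Leader` class and `follower_acks` parameter are my own names, and real Raft also tracks per-follower progress and term checks):

```python
def committed(cluster_size, ack_count):
    """An entry is committed once it is stored on a majority of nodes.
    ack_count includes the leader's own copy."""
    return ack_count >= cluster_size // 2 + 1

class Leader:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []             # list of (term, command)
        self.commit_index = -1    # index of the last committed entry

    def replicate(self, term, command, follower_acks):
        """Append an entry, then commit it if a majority has stored it."""
        self.log.append((term, command))
        acks = 1 + follower_acks  # the leader counts its own copy
        if committed(self.cluster_size, acks):
            self.commit_index = len(self.log) - 1
        return self.commit_index

leader = Leader(cluster_size=5)
leader.replicate(term=1, command="SET x=1", follower_acks=2)
print(leader.commit_index)  # 0: 3 of 5 copies -> committed
```

With only one follower ack (2 of 5 copies) the entry would stay uncommitted, which is exactly why a minority partition can accept writes into its log but never commit them.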

3. Safety

Raft guarantees that once a log entry is committed, it will be present in the logs of all future leaders. A candidate can only win an election if its log is at least as up-to-date as a majority of nodes.

This ensures: committed entries are never lost, and all nodes eventually have the same log in the same order.
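The "at least as up-to-date" comparison has a precise definition in Raft: compare the last log entries by term first, then by index. A hedged sketch (the function name is mine):

```python
def log_up_to_date(candidate_last_term, candidate_last_index,
                   voter_last_term, voter_last_index):
    """Raft's voting rule: a voter grants its vote only if the candidate's
    log is at least as up-to-date as its own. A later last term wins;
    on equal terms, the longer log (higher last index) wins."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

print(log_up_to_date(3, 5, 2, 9))  # True: a higher last term beats a longer log
print(log_up_to_date(3, 4, 3, 7))  # False: same term, but a shorter log
```

The first case is the subtle one: a log with a higher last term is more up-to-date even if it is shorter, because entries from old terms on the longer log may never have been committed.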

How Does Paxos Compare?

Paxos (Leslie Lamport, 1989) was the first consensus algorithm. It's provably correct but notoriously hard to understand and implement. Lamport's original paper used a Greek parliament metaphor that confused more than it clarified.

Paxos works in two phases:

  1. Prepare — the proposer asks a majority of nodes to promise not to accept older proposals.
  2. Accept — the proposer sends the value to the majority. If they all accept, consensus is reached.
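The two phases can be illustrated from the acceptor's side. This is a minimal single-decree sketch (class and method names are my own); real Paxos also needs the proposer to adopt any previously accepted value it learns about in phase 1.

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor: remembers the highest
    proposal number it has promised, and the last value it accepted."""
    def __init__(self):
        self.promised = -1
        self.accepted = None   # (proposal_number, value) or None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n,
        reporting any value already accepted."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept the value unless a higher-numbered
        promise has been made in the meantime."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

# A proposer that gets promises from a majority can send accept requests;
# consensus is reached once a majority accepts.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1)[0] for a in acceptors]
assert sum(promises) >= 2  # majority of 3
accepts = [a.accept(1, "leader=N1") for a in acceptors]
print(sum(accepts))  # 3: all acceptors accepted proposal 1
```

The promise in phase 1 is what prevents two proposers from getting conflicting values accepted: an acceptor that has promised proposal 2 will reject a late-arriving accept request for proposal 1.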

Raft is equivalent to Paxos in what it achieves (consensus with crash fault tolerance) but much easier to implement correctly. Most new systems choose Raft.

|                   | Paxos                             | Raft                            |
|-------------------|-----------------------------------|---------------------------------|
| Year              | 1989                              | 2014                            |
| Understandability | Notoriously complex               | Designed for clarity            |
| Leader            | Optional (multi-decree Paxos has one) | Always has a leader         |
| Log               | Not built-in                      | Central concept                 |
| Implementations   | ZooKeeper (ZAB, Paxos variant)    | etcd, CockroachDB, TiKV, Consul |

What Can Consensus Do?

Leader election — the most common use. A cluster of nodes agrees on which one is the leader. If the leader fails, they agree on a new one. etcd, ZooKeeper, and Consul provide this as a service.

Replicated state machines — every node starts with the same state and applies the same log of operations in the same order. They all converge to the same state. This is how databases like CockroachDB replicate data with strong consistency.
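The replicated state machine idea is worth seeing concretely: as long as the apply function is deterministic, identical logs produce identical states. A sketch with a toy key-value state machine (the `apply_log` function and operation format are illustrative, not any real database's API):

```python
def apply_log(log):
    """Deterministically apply the same operations in the same order;
    every replica running this over the same log converges to the
    same state."""
    state = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

# The log is what consensus agrees on; the state is derived from it.
log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
print(apply_log(log))  # {'y': 2} on every replica
```

This is why consensus protocols replicate a log rather than the state itself: agreeing on an ordered sequence of operations is enough to keep arbitrary deterministic state machines in sync.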

Distributed locks — a consensus group can agree on who holds a lock. Used for leader election, job scheduling, and preventing split-brain.

Configuration management — Kubernetes stores all cluster state in etcd (Raft). Every API server reads from the same source of truth.

How Many Nodes Do You Need?

Consensus requires a majority (quorum) to make decisions. With N nodes:

| Nodes | Majority | Tolerated failures |
|-------|----------|--------------------|
| 3     | 2        | 1                  |
| 5     | 3        | 2                  |
| 7     | 4        | 3                  |

3 nodes is the minimum for fault tolerance (tolerates 1 failure). 5 nodes is common in production (tolerates 2 failures). More nodes increase fault tolerance but add latency (more nodes must confirm each operation).

An even number of nodes (4, 6) is wasteful: 4 nodes tolerate only 1 failure (the majority is 3), the same as a 3-node cluster, while adding replication cost. Always use odd numbers.
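The quorum arithmetic behind the table is two one-line formulas (function names are mine):

```python
def majority(n):
    """Smallest quorum size for an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority stays reachable."""
    return n - majority(n)

for n in (3, 4, 5, 7):
    print(n, majority(n), tolerated_failures(n))
# 4 nodes tolerate only 1 failure, the same as 3 -- hence odd cluster sizes
```

Running this for n = 4 makes the even-number waste visible: the majority jumps to 3, so the extra node buys no additional fault tolerance.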

What Consensus Cannot Do

Byzantine fault tolerance — Raft and Paxos assume nodes are honest but may crash. If a node lies (sends conflicting messages to different peers), the protocol breaks. Blockchains solve Byzantine consensus but at enormous cost (proof of work/stake). Most internal systems trust their own nodes and don't need Byzantine tolerance.

Availability during partitions — consensus requires a majority. If a partition isolates the minority, the minority can't make progress. This is the CP side of the CAP theorem — strong consistency sacrifices availability during partitions.

Speed — every write requires a network round trip to a majority of nodes. Cross-datacenter consensus adds 50-200ms of latency per write. This is why consensus is used for metadata and coordination, not for high-throughput data paths.

Next Steps