What is Leader Election
Leader election is the process by which a group of nodes selects exactly one node to serve as the leader. The leader coordinates all writes, orders operations, and replicates changes to followers. When the leader fails, the remaining nodes must elect a new one quickly to restore availability. Leader election is a specific instance of the consensus problem: the value the nodes must agree on is the identity of the leader.
How it works
In Raft, each node starts as a follower. Followers expect periodic heartbeats from the leader. If a follower receives no heartbeat within its randomized election timeout (e.g., 150-300 ms), it assumes the leader has failed, increments its current term, transitions to candidate state, votes for itself, and requests votes from all other nodes. A candidate wins if it receives votes from a majority of the cluster, counting its own. Because each node draws its timeout at random, candidates tend to start elections at slightly different times, which makes a split vote unlikely. If a split does occur, the candidates time out and retry in a new term.
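The follower-to-candidate transition can be sketched in a few lines of Python. This is a minimal illustration, not a Raft implementation: the `Node` class, its field names, and the helper functions are hypothetical, and networking, timers, and persistence are left out.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical in-memory view of one node's election state."""
    node_id: int
    current_term: int = 0
    state: str = "follower"
    voted_for: Optional[int] = None


def random_election_timeout_ms(rng: random.Random,
                               low: float = 150.0,
                               high: float = 300.0) -> float:
    """Draw a timeout from the example range given in the text."""
    return rng.uniform(low, high)


def start_election(node: Node) -> dict:
    """Called when no heartbeat arrived within the election timeout."""
    node.current_term += 1          # move to a new term
    node.state = "candidate"
    node.voted_for = node.node_id   # a candidate votes for itself
    # The returned dict stands in for the vote request sent to peers.
    return {"term": node.current_term, "candidate_id": node.node_id}


def has_won(votes_received: int, cluster_size: int) -> bool:
    """A candidate needs a strict majority of the whole cluster."""
    return votes_received > cluster_size // 2
```

In a five-node cluster, `has_won(3, 5)` is true and `has_won(2, 5)` is false, so a candidate needs its own vote plus two others.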
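A node grants its vote only if it has not already voted for another candidate in the current term and its own log is not more up-to-date than the candidate's. This ensures the new leader has all committed entries: a committed entry is stored on a majority of nodes, and a winner needs votes from a majority, so at least one voter holds every committed entry and will refuse a candidate that is missing one.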
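The voting rule can be written as a small predicate. The sketch below assumes the comparison used in the Raft paper, where "more up-to-date" means a higher term on the last log entry, or the same term and a longer log. The function and parameter names are made up for illustration.

```python
def candidate_log_is_current(cand_last_term: int, cand_last_index: int,
                             own_last_term: int, own_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != own_last_term:
        return cand_last_term > own_last_term
    return cand_last_index >= own_last_index


def should_grant_vote(own_term, voted_for, own_last_term, own_last_index,
                      req_term, candidate_id, cand_last_term, cand_last_index):
    """Decide whether a voter grants its vote to a candidate (simplified)."""
    if req_term < own_term:
        return False              # request comes from a stale term
    if req_term > own_term:
        voted_for = None          # new term: any earlier vote no longer counts
    if voted_for is not None and voted_for != candidate_id:
        return False              # already voted for someone else this term
    return candidate_log_is_current(cand_last_term, cand_last_index,
                                    own_last_term, own_last_index)
```

A real node would also persist its updated term and vote before replying. That step is omitted here.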
ZooKeeper uses a different approach. Nodes establish TCP connections to each other and exchange votes as part of the ZAB (ZooKeeper Atomic Broadcast) protocol. Each node prefers the candidate with the highest transaction ID (zxid), and ties are broken by the highest node ID. The election completes when a quorum of nodes recognizes the same leader.
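One way to picture the ZooKeeper rule is as an ordering on votes: compare transaction IDs first and fall back to node IDs only on a tie. The Python sketch below is a simplified model under that assumption. The `Vote` type and both functions are hypothetical, and the real protocol tracks more state (for example, election rounds) than is shown here.

```python
from collections import Counter
from typing import Dict, Optional, NamedTuple


class Vote(NamedTuple):
    """A proposed leader. Field order matters: tuples compare left to right."""
    zxid: int        # last transaction ID seen by the proposed leader
    server_id: int   # node ID, used only to break zxid ties


def preferred(current: Vote, received: Vote) -> Vote:
    """Keep whichever vote ranks higher under (zxid, server_id) ordering."""
    return max(current, received)


def elected_leader(votes_by_node: Dict[int, Vote],
                   cluster_size: int) -> Optional[int]:
    """Return the leader's ID once a quorum backs the same node, else None."""
    counts = Counter(v.server_id for v in votes_by_node.values())
    leader, supporters = counts.most_common(1)[0]
    return leader if supporters > cluster_size // 2 else None
```

For example, `preferred(Vote(7, 2), Vote(7, 3))` keeps `Vote(7, 3)`: the transaction IDs tie, so the higher node ID wins.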
The critical safety property is that at most one leader exists per term (Raft) or epoch (ZAB). If two nodes act as leader at the same time and both accept writes, the system experiences split brain, which can cause data corruption. Quorum-based voting prevents this: each node votes at most once per term, and any two majorities of the same cluster share at least one node, so two candidates cannot both collect a majority in the same term.
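The overlap claim is easy to check by brute force for small clusters. The following Python sketch enumerates every minimum-size majority and confirms that each pair shares a node. The function names are illustrative only.

```python
from itertools import combinations


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def majorities_always_overlap(cluster_size: int) -> bool:
    """Check that every pair of minimum-size majorities shares a node.

    Larger majorities are supersets of minimum-size ones, so checking
    the minimum size is sufficient.
    """
    size = quorum_size(cluster_size)
    quorums = [set(c) for c in combinations(range(cluster_size), size)]
    return all(a & b for a in quorums for b in quorums)
```

The check passes because two groups of more than half the nodes together contain more members than the cluster has, so at least one node must belong to both.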
Why it matters
Leader election determines your system's recovery time after a failure. A fast election (Raft typically completes in under a second) means a brief blip. A slow or stuck election means extended downtime. Understanding the election mechanism of your distributed system is essential for tuning timeouts and diagnosing availability incidents.
See How Consensus Works for the full walkthrough of leader election, log replication, and term management.