What is Split Brain

Split brain is a failure mode where a network partition divides a cluster into two or more groups, and each group independently elects its own leader. Both leaders accept writes, and the data diverges. When the partition heals, the system has two conflicting histories that cannot be automatically reconciled without data loss. Split brain is one of the most dangerous failures in distributed systems.

How it works

Consider a 2-node cluster with automatic failover. The primary and replica lose connectivity, but both keep running. The replica cannot distinguish a crashed primary from an unreachable one, so when its heartbeat timer expires, it promotes itself to primary. Now both nodes accept writes. Client A writes to the original primary. Client B writes to the promoted replica. When connectivity is restored, the system has two divergent datasets and no general way to merge them without discarding one side's writes.
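The scenario can be reproduced in a few lines. This is a minimal sketch, not a real failover implementation: the `Node` class, its `on_heartbeat_timeout` method, and the sample writes are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    is_primary: bool
    log: list = field(default_factory=list)

    def write(self, value: str) -> bool:
        # Only a node that believes it is primary accepts writes.
        if not self.is_primary:
            return False
        self.log.append(value)
        return True

    def on_heartbeat_timeout(self) -> None:
        # Naive failover (assumption for this sketch): a missed heartbeat is
        # treated as proof that the primary is dead, although a network
        # partition looks exactly the same from the replica's side.
        self.is_primary = True


primary = Node("node-1", is_primary=True)
replica = Node("node-2", is_primary=False)

# The link between the nodes drops and the replica's timer expires.
replica.on_heartbeat_timeout()

# Each client can reach only one side of the partition.
primary.write("client A: balance=100")
replica.write("client B: balance=250")

both_primary = primary.is_primary and replica.is_primary
diverged = primary.log != replica.log
print(both_primary, diverged)
```

Both nodes end up as primary with different logs, which is the divergence described above.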

This is why quorum-based systems require a majority: any two majorities of the same cluster share at least one node, so two sides of a partition can never both hold one. In a 3-node cluster split by a network partition, one side has 2 nodes (a majority) and the other has 1. Only the majority side can elect a leader. The minority side stops accepting writes because it cannot form a quorum. This prevents split brain at the cost of availability on the minority side -- a trade-off described by the CAP theorem.
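The majority rule is simple arithmetic. The sketch below assumes a fixed, known cluster size; the function names `quorum_size` and `can_elect_leader` are illustrative and not taken from any particular system.

```python
def quorum_size(cluster_size: int) -> int:
    # Smallest group that is strictly more than half of the cluster.
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    # A side of the partition may elect a leader only if it holds a quorum.
    return reachable_nodes >= quorum_size(cluster_size)


def sides_with_quorum(partition: list[int]) -> list[bool]:
    # partition lists how many nodes ended up on each side of the split.
    cluster_size = sum(partition)
    return [can_elect_leader(side, cluster_size) for side in partition]


print(sides_with_quorum([2, 1]))     # 3 nodes split 2/1 -> [True, False]
print(sides_with_quorum([1, 1]))     # 2 nodes split 1/1 -> [False, False]
print(sides_with_quorum([2, 2, 1]))  # 5 nodes split 2/2/1 -> [False, False, False]
```

However the nodes are divided, at most one side reports `True`. The 2-node case also shows why a pair cannot fail over safely under a majority rule: neither side ever holds a quorum alone.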

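Consensus algorithms like Raft use term numbers as an additional safeguard. Each election starts a new, higher term. A stale leader on the minority side may still receive client requests, but it cannot commit them without acknowledgment from a majority, and followers reject entries from a leader whose term is lower than their own. When the stale leader sees a message carrying a higher term, it steps down.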
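The term check can be sketched as follows. This is a heavily simplified illustration, not Raft itself: it omits the log-consistency check, commit tracking, and elections, and the `Follower` and `Leader` classes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    current_term: int = 0
    log: list = field(default_factory=list)

    def append_entries(self, leader_term: int, entries: list) -> tuple[int, bool]:
        # Reject entries from a leader whose term is older than ours.
        if leader_term < self.current_term:
            return self.current_term, False
        # Otherwise adopt the leader's term and accept the entries.
        self.current_term = leader_term
        self.log.extend(entries)
        return self.current_term, True


@dataclass
class Leader:
    term: int
    active: bool = True

    def replicate(self, follower: Follower, entries: list) -> bool:
        reply_term, accepted = follower.append_entries(self.term, entries)
        if reply_term > self.term:
            # A higher term exists, so this leader is stale and steps down.
            self.term = reply_term
            self.active = False
        return accepted


follower = Follower()
stale_leader = Leader(term=1)
new_leader = Leader(term=2)  # elected while stale_leader was cut off

first = new_leader.replicate(follower, ["x=1"])
second = stale_leader.replicate(follower, ["x=9"])
print(first, second, stale_leader.active, follower.log)
```

The follower keeps only the entry from the term-2 leader, and the term-1 leader deactivates itself after seeing the higher term in the reply.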

Systems without quorum-based leader election often rely on fencing to prevent split brain. A fencing token is a monotonically increasing number issued with each lease. The storage system remembers the highest token it has seen and rejects any write carrying a lower one, so even if two nodes believe they are the leader, the node holding the older lease is shut out once the newer token has been used.
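A minimal sketch of the fencing check, assuming a single lease service and a single storage node. `LeaseService`, `FencedStore`, and `StaleTokenError` are hypothetical names for illustration, not the API of any real lock service or database.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LeaseService:
    def __init__(self) -> None:
        self._last_token = 0

    def grant(self) -> int:
        # Every new lease gets a strictly larger token than the previous one.
        self._last_token += 1
        return self._last_token


class FencedStore:
    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        # Refuse writes from any holder of an older lease.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


leases = LeaseService()
store = FencedStore()

old_token = leases.grant()  # first leader, later paused or partitioned
new_token = leases.grant()  # lease expired, a second leader takes over

store.write(new_token, "config", "from new leader")

try:
    # The first leader wakes up still believing it holds the lease.
    store.write(old_token, "config", "from old leader")
    rejected = False
except StaleTokenError:
    rejected = True

print(rejected, store.data["config"])
```

The check lives in the storage layer on purpose: the old leader cannot be trusted to notice that its lease has expired, so the component that applies the write has to enforce the ordering.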

Why it matters

Split brain causes silent data corruption -- the most dangerous kind of failure because it may go undetected until an audit or a customer reports missing data. Understanding how your system prevents split brain -- quorums, fencing tokens, or consensus protocols -- is essential for evaluating whether it is truly safe under network partitions.

See How Replication Works for the full walkthrough of replication topologies and partition handling.