What is Failover

Failover is the process of promoting a replica to take over when the primary node becomes unavailable. The goal is to restore read and write availability with minimal downtime and, ideally, no data loss. Failover can be automatic (the system detects the failure and promotes a replica without human intervention) or manual (an operator triggers the promotion).

How it works

Failover follows a sequence of steps:

  1. Failure detection -- the system determines the primary is unreachable. This is typically done through heartbeat timeouts. The challenge is distinguishing a genuinely failed node from a slow one or a network partition. False positives cause unnecessary failovers, which are disruptive.
  2. Replica selection -- the system chooses which replica to promote. The ideal candidate has the most up-to-date data (the smallest replication lag). In systems using consensus-based replication, the leader election mechanism handles this automatically.
  3. Promotion -- the chosen replica is reconfigured to accept writes. Other replicas are redirected to replicate from the new primary.
  4. Client redirection -- clients must discover the new primary. This can happen through DNS updates, a proxy layer (e.g., HAProxy, PgBouncer), or a service discovery system (e.g., etcd, Consul).
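
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements failover: the names `HeartbeatMonitor`, `Replica`, `applied_position`, `select_replica`, and `failover` are all hypothetical, the routing dictionary merely stands in for a DNS, proxy, or service discovery update, and time is passed in explicitly so the behaviour is easy to follow. Requiring several consecutive missed heartbeats is one simple way to reduce false positives from a slow node.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    name: str
    applied_position: int  # hypothetical: last replication log position applied


class HeartbeatMonitor:
    """Treats the primary as failed after several consecutive late checks."""

    def __init__(self, timeout_s: float, max_missed: int) -> None:
        self.timeout_s = timeout_s
        self.max_missed = max_missed
        self.last_seen = 0.0
        self.missed = 0

    def record_heartbeat(self, now: float) -> None:
        # A heartbeat arrived: reset the failure counter.
        self.last_seen = now
        self.missed = 0

    def check(self, now: float) -> bool:
        # Called once per check interval. One late heartbeat is not enough
        # to declare failure; max_missed late checks in a row are.
        if now - self.last_seen > self.timeout_s:
            self.missed += 1
        else:
            self.missed = 0
        return self.missed >= self.max_missed


def select_replica(replicas: List[Replica]) -> Replica:
    # The most up-to-date replica has the highest applied position,
    # which is the same as the smallest replication lag.
    return max(replicas, key=lambda r: r.applied_position)


def failover(replicas: List[Replica], routing: Dict[str, object]) -> Replica:
    new_primary = select_replica(replicas)
    # Promotion and client redirection, reduced to a routing-table update.
    routing["primary"] = new_primary.name
    # The remaining replicas now follow the new primary.
    routing["replicas"] = [r.name for r in replicas if r is not new_primary]
    return new_primary
```

With a 2-second timeout and `max_missed=3`, a monitor that last saw a heartbeat at t=0 reports failure on the third late check, at which point `failover` promotes the replica with the highest `applied_position`.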

The biggest risk during failover is data loss from replication lag. With asynchronous replication, if the primary crashes before a replica has received its latest writes, those writes are lost even though clients were told they succeeded. Synchronous replication prevents this by requiring the replica to acknowledge each write before the primary confirms it to the client, but it adds latency to every write.
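
The difference between the two modes can be shown with a toy model. The classes `ReplicaNode` and `Primary` and the helper `writes_lost_on_crash` are hypothetical and model only the acknowledgement order, not networking or durability.

```python
class ReplicaNode:
    def __init__(self):
        self.log = []

    def apply(self, entry):
        self.log.append(entry)
        return True  # acknowledgement back to the primary


class Primary:
    def __init__(self, replica, synchronous):
        self.replica = replica
        self.synchronous = synchronous
        self.log = []
        self.pending = []  # async mode: writes not yet shipped to the replica

    def write(self, entry):
        """Returns True once the write is confirmed to the client."""
        self.log.append(entry)
        if self.synchronous:
            # Confirm only after the replica has acknowledged the write.
            return self.replica.apply(entry)
        # Asynchronous: confirm immediately, replicate later.
        self.pending.append(entry)
        return True

    def ship_pending(self):
        # Background replication catching up (async mode).
        for entry in self.pending:
            self.replica.apply(entry)
        self.pending.clear()


def writes_lost_on_crash(primary, confirmed):
    # Confirmed writes the replica would be missing if the primary
    # crashed right now and the replica were promoted.
    return [e for e in confirmed if e not in primary.replica.log]
```

In this model, an asynchronous primary that crashes before `ship_pending` runs loses every confirmed write still in `pending`, while a synchronous primary never has a confirmed write missing from the replica.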

The second risk is split brain: the old primary recovers and both nodes accept writes, so the two copies of the data diverge. Fencing mechanisms such as STONITH ("Shoot The Other Node In The Head") forcibly shut down the old primary to prevent this.
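
STONITH works at the hardware or hypervisor level and is hard to show in a few lines. A related software technique, different from STONITH, is to fence by epoch number (sometimes called a fencing token): each promotion increments an epoch, and shared storage rejects writes that carry an older one. The sketch below assumes a hypothetical `FencedStore` class and is meant only to illustrate the idea.

```python
class FencedStore:
    """Hypothetical shared store that rejects writes from a stale epoch."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            # The writer was primary in an earlier epoch: fence it off.
            return False
        # Remember the newest epoch seen so older primaries are rejected.
        self.highest_epoch = epoch
        self.data[key] = value
        return True
```

Once the new primary has written with epoch 2, a recovered old primary still using epoch 1 has its writes refused, so only one node's writes are accepted.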

Why it matters

Failover is what turns replication from a backup mechanism into a high-availability mechanism. Without failover, a replica is only a copy of the data: it cannot take over when the primary fails. With failover, your system can often recover from a primary failure in seconds or minutes instead of hours. Understanding the trade-offs between automatic and manual failover -- and the risks of split brain and data loss -- is essential for operating any replicated system.

See How Replication Works for the full walkthrough of replication topologies and failover strategies.