How Load Balancing Works — Distributing Traffic Across Servers

2026-03-24

One server handles 1,000 requests per second. Your application gets 5,000. You have two options: get a bigger server (vertical scaling) or add more servers (horizontal scaling). Vertical scaling has a ceiling. Horizontal scaling needs a way to distribute traffic. That is what a load balancer does.

A load balancer sits between clients and a pool of servers (backends). Every incoming request goes to the load balancer. The load balancer picks a server and forwards the request. The client does not know which server handled its request. If a server fails, the load balancer stops sending traffic to it.

Layer 4 vs Layer 7

Load balancers operate at different network layers:

Layer 4 (transport) — operates on TCP connections. The load balancer sees source IP, destination IP, and port numbers. It forwards the entire TCP connection to a backend. It does not inspect HTTP headers, URLs, or request bodies. Fast and simple.

Layer 7 (application) — operates on HTTP requests. The load balancer can inspect URLs, headers, cookies, and request methods. It can route /api/* to one backend pool and /static/* to another. It can add headers, rewrite URLs, and terminate TLS. More flexible, slightly more overhead.

|           | Layer 4                              | Layer 7                                   |
|-----------|--------------------------------------|-------------------------------------------|
| Sees      | IP addresses, ports, TCP flags       | HTTP headers, URLs, cookies, request body |
| Routes by | Connection                           | Request                                   |
| TLS       | Passes through (backend terminates)  | Terminates at load balancer               |
| Use case  | High throughput, protocol-agnostic   | Content-based routing, HTTP features      |
| Examples  | AWS NLB, HAProxy (TCP mode)          | nginx, HAProxy (HTTP mode), AWS ALB, Envoy |

Most web applications use Layer 7 because they need content-based routing, TLS termination, and HTTP-aware features like sticky sessions and health checks on HTTP endpoints.
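The content-based routing described above can be sketched in a few lines. This is a hedged illustration, not any particular proxy's implementation; the pool names and prefixes are hypothetical:

```python
# Layer 7 prefix routing sketch: map a request path to a backend pool.
# Pool names and prefixes are illustrative, not from any real config.
ROUTES = [
    ("/api/", "api-pool"),
    ("/static/", "static-pool"),
]
DEFAULT_POOL = "web-pool"

def route(path: str) -> str:
    """Return the backend pool for a request path (longest prefix wins)."""
    for prefix, pool in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

Real Layer 7 proxies express the same idea declaratively (nginx `location` blocks, ALB listener rules) rather than in application code.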

Algorithms

The load balancer must decide which server gets each request. The algorithm determines how:

Round-robin — requests go to servers in order: A, B, C, A, B, C. Simple and fair when all servers have equal capacity and all requests are equally expensive. The default for most load balancers.
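Round-robin is a one-liner in most languages. A minimal Python sketch:

```python
import itertools

servers = ["A", "B", "C"]
rr = itertools.cycle(servers)  # endless A, B, C, A, B, C, ...

def pick() -> str:
    """Return the next server in rotation."""
    return next(rr)
```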

Weighted round-robin — like round-robin, but servers have weights. Server A (weight 3) gets three requests for every one that server B (weight 1) gets. Useful when servers have different capacities.
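The simplest way to picture weights is to expand each server into the rotation as many times as its weight. A naive sketch (real implementations such as nginx use a "smooth" variant that interleaves servers instead of clustering them):

```python
def schedule(weights: dict) -> list:
    """Naive weighted round-robin: one cycle repeats each server
    `weight` times. Smooth WRR would interleave: A, A, B, A."""
    order = []
    for server, w in weights.items():
        order.extend([server] * w)
    return order
```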

Least connections — the request goes to the server with the fewest active connections. Better than round-robin when request processing times vary — a server handling a slow query naturally gets fewer new requests because its connection count is higher.
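Least connections reduces to a `min` over the pool's active-connection counts. A sketch, assuming the balancer already tracks those counts:

```python
def pick_least_connections(active: dict) -> str:
    """Pick the server with the fewest active connections.
    `active` maps server name -> current open connection count."""
    return min(active, key=active.get)
```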

Least response time — the request goes to the server with the lowest average response time. Adapts to real-time server performance. Requires the load balancer to track response times.

IP hash — a hash of the client's IP address determines the server. The same client always hits the same server. Useful for basic session affinity, but breaks when clients share IPs (corporate NAT, mobile carriers).

Consistent hashing — similar to IP hash, but adding or removing a server only redistributes a fraction of requests rather than reshuffling all of them. Used in distributed caches and systems where cache locality matters. See How Partitioning Works.
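A consistent hash ring can be sketched with a sorted list and binary search. Each server is hashed onto the ring many times (virtual nodes) to even out the distribution; a key maps to the first server clockwise from its hash. This is an illustrative sketch, not a production implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes. Adding or removing a server
    only remaps roughly 1/N of the keys."""

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server)
        for s in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{s}#{i}"), s))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, client_ip: str) -> str:
        """Walk clockwise from the key's hash to the next server."""
        h = self._hash(client_ip)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]
```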

Random — pick a server at random. Surprisingly effective with many servers. "Power of two random choices" — pick two servers at random and choose the one with fewer connections — provides near-optimal distribution with minimal overhead.
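Power of two random choices is barely more code than plain random, which is part of its appeal. A sketch, again assuming the balancer tracks connection counts:

```python
import random

def pick_p2c(active: dict) -> str:
    """Power of two random choices: sample two servers uniformly,
    send the request to the less loaded of the pair."""
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b
```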

[Diagram: load balancer distributing traffic to servers. Clients A, B, and C send requests to a load balancer (round-robin / least connections), which forwards them to Server 1 and Server 2 (healthy). Server 3 is unhealthy and removed from the pool; unhealthy servers receive no traffic.]

Health Checks

A load balancer must know which servers are healthy. Sending traffic to a crashed server returns errors. Health checks detect failures and remove bad servers from the pool.

Active health checks — the load balancer periodically sends a request to each server (e.g., GET /health every 10 seconds). If a server fails three consecutive checks, it is removed from the pool. When it passes again, it is re-added. nginx, HAProxy, and cloud load balancers all support this.
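The bookkeeping behind active checks (three consecutive failures removes a server, a passing check re-adds it) is simple state per server. A sketch with illustrative thresholds; real load balancers make the probe interval, timeout, and thresholds configurable:

```python
class HealthTracker:
    """Track consecutive probe failures per server.
    A server leaves the pool after `fail_threshold` consecutive
    failures and rejoins on the next passing check."""

    def __init__(self, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.failures = {}  # server -> consecutive failure count

    def record(self, server, ok: bool) -> bool:
        """Record one probe result. Returns True if the server
        should stay in (or rejoin) the pool."""
        self.failures[server] = 0 if ok else self.failures.get(server, 0) + 1
        return self.failures[server] < self.fail_threshold
```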

Passive health checks — the load balancer monitors real traffic. If a server returns too many errors (5xx responses, connection timeouts), it is marked unhealthy. No extra probe traffic, but slower to detect failures because it depends on real requests.

Most production setups use both. Active checks detect completely crashed servers. Passive checks detect servers that are running but misbehaving.

The health endpoint should check real dependencies: can the server reach the database? Does it have free disk space? A /health endpoint that always returns 200 is useless.
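A dependency-aware health handler might look like the following sketch. The `check_database` callable is a hypothetical stand-in for a real connectivity check; the 5% disk threshold is illustrative:

```python
import shutil

def disk_free_fraction(path="/") -> float:
    """Fraction of the filesystem that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def health(check_database) -> tuple:
    """Return (status_code, body). 200 only when the server's
    real dependencies are usable, so the load balancer pulls a
    degraded server out of the pool."""
    if not check_database():
        return 503, "database unreachable"
    if disk_free_fraction() < 0.05:  # illustrative threshold
        return 503, "disk almost full"
    return 200, "ok"
```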

Session Affinity

Some applications store session state in server memory. If a user's first request goes to Server A and the second request goes to Server B, Server B has no session data. The user might be logged out or lose their shopping cart.

Sticky sessions (session affinity) solve this by routing all requests from the same user to the same server. The load balancer sets a cookie or uses IP hashing to maintain the mapping.

This works but creates problems:

  • Uneven load — one server accumulates more sessions than others.
  • Failover — when the sticky server dies, all its sessions are lost.
  • Scaling down — you cannot remove a server without disrupting its sessions.

The better solution is externalized session storage. Store sessions in Redis or a database. Any server can handle any request because the session is not in server memory. This makes servers truly stateless and interchangeable.
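The key property of externalized sessions is that every server reads and writes the same shared store. A sketch where a plain dict stands in for Redis (a `redis.Redis` client has an analogous get/set shape):

```python
import uuid

class SessionStore:
    """Sessions live in a shared backend, not in server memory,
    so any server can handle any request."""

    def __init__(self, backend):
        self.backend = backend  # stand-in for Redis in this sketch

    def create(self, data) -> str:
        sid = uuid.uuid4().hex
        self.backend[sid] = data
        return sid

    def load(self, sid):
        return self.backend.get(sid)
```

Two "servers" sharing one backend illustrates the point: a session created through one is visible through the other, so sticky routing is unnecessary.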

Implementations

nginx — HTTP and TCP load balancing. Reverse proxy, TLS termination, caching. Configuration-driven. The most deployed web server and load balancer. Open source, with a commercial version (nginx Plus) adding active health checks and dynamic reconfiguration.

HAProxy — high-performance TCP and HTTP load balancer. Known for reliability and throughput. Extensive health checking, connection draining, and rate limiting. Preferred for high-traffic environments.

Cloud load balancers — AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer. Managed, auto-scaling, integrated with cloud networking. ALB (Application Load Balancer) is Layer 7. NLB (Network Load Balancer) is Layer 4.

DNS round-robin — the DNS server returns multiple IP addresses for a domain, and the client picks one. Simple but crude — no health checks, no connection awareness, TTL caching means slow failover. Often used as a first layer in front of dedicated load balancers.

Envoy — a modern Layer 7 proxy designed for microservices. Supports gRPC, HTTP/2, observability, circuit breaking, and dynamic configuration. The data plane proxy for service meshes like Istio.

Connection Draining

When you remove a server (for deployment, scaling down, or maintenance), you don't want to drop active requests. Connection draining (or graceful shutdown) tells the load balancer to stop sending new requests to the server while allowing in-flight requests to complete.

The server finishes processing current requests (up to a timeout), then shuts down cleanly. Without connection draining, users see dropped connections and 502 errors during deployments.
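The server side of draining is a small state machine: flip a flag so new requests are refused, then wait for the in-flight count to reach zero (up to a timeout). A sketch with an illustrative 30-second default:

```python
import threading

class Drainer:
    """Graceful shutdown: refuse new requests, wait for in-flight ones."""

    def __init__(self):
        self.draining = False
        self.in_flight = 0
        self.cond = threading.Condition()

    def try_start(self) -> bool:
        """Called at the start of each request. False means refuse
        (the load balancer should route elsewhere)."""
        with self.cond:
            if self.draining:
                return False
            self.in_flight += 1
            return True

    def finish(self):
        """Called when a request completes."""
        with self.cond:
            self.in_flight -= 1
            self.cond.notify_all()

    def drain(self, timeout=30.0) -> bool:
        """Stop accepting work; wait up to `timeout` seconds for
        in-flight requests. True if everything completed."""
        with self.cond:
            self.draining = True
            self.cond.wait_for(lambda: self.in_flight == 0, timeout=timeout)
            return self.in_flight == 0
```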

Next Steps

  • How CQRS Works — separating read and write paths, often behind different scaling strategies.
  • How Partitioning Works — distributing data across servers, analogous to distributing traffic.
  • How DNS Works — DNS-based load balancing and how DNS caching affects failover.