How Rate Limiting Works — Protecting APIs from Overload

2026-03-24

Rate limiting controls how many requests a client can make to an API in a given time window. Without it, a single client — whether malicious, buggy, or just enthusiastic — can overwhelm the server, degrade performance for everyone, and run up infrastructure costs.

Every major API enforces rate limits. GitHub allows 5,000 requests per hour per authenticated user. Stripe allows 100 requests per second in live mode. Twitter (X) allows 900 reads per 15-minute window. Hit the limit, and you get a 429 Too Many Requests response.

Why Rate Limiting Exists

Fairness. Thousands of clients share the same API. One client making 10,000 requests per second shouldn't degrade response times for everyone else. Rate limits give each client a fair share of server capacity.

Protection. Bugs cause runaway loops. Scrapers crawl aggressively. Attackers brute-force endpoints. Rate limiting is the first line of defense against all of these.

Cost. Every request costs compute, memory, bandwidth, and database capacity. Public APIs expose expensive internal resources. Rate limiting caps the cost per client.

Stability. Under load, services degrade gradually — response times increase, error rates rise. Rate limiting sheds excess traffic before the system reaches the tipping point, keeping it responsive for requests that do get through.

Token Bucket Algorithm

The token bucket is the most widely used rate limiting algorithm. The idea: each client has a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected.

Parameters:

  • Bucket size (burst capacity) — maximum tokens the bucket can hold. A bucket of 100 allows bursts of up to 100 requests.
  • Refill rate — tokens added per second. 10 tokens/second = 10 sustained requests/second.

Behavior:

  • A client that sends requests at the refill rate is always allowed through.
  • A client that was idle accumulates tokens (up to the bucket size) and can burst.
  • A client that exceeds the sustained rate drains the bucket and gets throttled.

Token bucket naturally allows bursts while enforcing a long-term average rate — which matches how most real clients behave.
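The mechanics above can be sketched in a few lines of Python. This is a single-process illustration, not production code; the class name and parameter choices are for this example only:

```python
import time

class TokenBucket:
    """Token bucket limiter: bursts up to `capacity`, sustains
    `refill_rate` requests per second on average."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity          # maximum tokens (burst size)
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full: an idle client can burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazily add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # each request consumes one token
            return True
        return False                      # bucket empty: throttle

bucket = TokenBucket(capacity=10, refill_rate=2)
results = [bucket.allow() for _ in range(12)]
# The first 10 burst requests pass; the 11th and 12th are throttled,
# since almost no time has elapsed for the bucket to refill.
```

Refilling lazily on each check (rather than with a background timer) keeps the limiter stateless between requests apart from two numbers per client.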

Token bucket: filling and draining

[Diagram: a bucket (capacity 10) refills at 2 tokens/sec; each request consumes one token. While tokens remain, requests get 200 OK; when the bucket is empty, they get 429 Too Many Requests. Shown with 7 tokens remaining: 3 empty slots, burst still available.]

Fixed Window Algorithm

The simplest approach: count requests in a fixed time window (e.g., one minute). If the count exceeds the limit, reject. At the start of the next window, reset the counter.

Problem: boundary bursts. A client can send 100 requests at 11:59:59 and 100 more at 12:00:01 — 200 requests in 2 seconds while technically staying within the "100 per minute" limit. The window boundary creates a vulnerability.
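The boundary burst is easy to reproduce with a small counter that takes an injected clock, so the timestamps are explicit. A sketch; the names are illustrative:

```python
class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window
        self.current_window = None   # index of the window we are counting in
        self.count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window)   # which window `now` falls in
        if window_index != self.current_window:
            self.current_window = window_index   # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# Boundary burst: 100 requests just before the minute mark,
# 100 more just after -- all 200 succeed within 0.2 seconds of each other.
limiter = FixedWindowLimiter(limit=100, window=60)
before = sum(limiter.allow(59.9) for _ in range(100))   # 100 allowed
after = sum(limiter.allow(60.1) for _ in range(100))    # 100 allowed again
```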

Sliding Window Algorithm

The sliding window fixes the boundary problem. Instead of resetting at fixed intervals, it looks at the last N seconds relative to the current request.

Sliding window log: store the timestamp of every request. To check the limit, count timestamps in the last N seconds. Accurate but memory-intensive (one entry per request).

Sliding window counter: a weighted blend of the current and previous fixed windows. If the current window is 40% elapsed, the count is (previous_window_count * 0.6) + current_window_count. This approximation is accurate enough for most use cases and uses constant memory (two counters per client).
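The weighted blend can be sketched with the same injected-clock style, which makes the boundary behavior reproducible (names are illustrative):

```python
class SlidingWindowCounter:
    """Sliding-window counter: weights the previous window's count by the
    fraction of it still covered by the sliding window."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_start = 0.0
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        start = (now // self.window) * self.window
        if start != self.current_start:
            # Rolled into a new window: the old current count becomes
            # "previous" (or zero if more than one window has passed).
            self.previous_count = (self.current_count
                                   if start - self.current_start == self.window
                                   else 0)
            self.current_count = 0
            self.current_start = start
        elapsed = (now - start) / self.window   # fraction of window elapsed
        estimated = self.previous_count * (1 - elapsed) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```

Replaying the fixed-window boundary burst against this limiter (100 requests at t=59.9s, 100 more at t=60.1s) lets only a single boundary request through, because the previous window still carries nearly full weight.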

Algorithm                Burst handling             Memory           Accuracy
Token bucket             Allows controlled bursts   O(1) per client  High
Fixed window             Vulnerable at boundaries   O(1) per client  Moderate
Sliding window log       No boundary issues         O(n) per client  Exact
Sliding window counter   No boundary issues         O(1) per client  Approximate

Rate Limit Headers

APIs communicate rate limit state through response headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 67
X-RateLimit-Reset: 1711322400
  • X-RateLimit-Limit — maximum requests allowed in the window
  • X-RateLimit-Remaining — requests left before throttling
  • X-RateLimit-Reset — Unix timestamp when the window resets

When the limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711322400

The Retry-After header tells the client how many seconds to wait. Well-behaved clients respect it; poorly behaved clients that keep hammering the endpoint are often penalized with progressively longer blocks.
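On the client side, the wait-time decision can be isolated in a small helper. This is a hypothetical function, and it assumes Retry-After carries delta-seconds (the header may also carry an HTTP-date, which this sketch ignores):

```python
from typing import Optional

def retry_delay(retry_after: Optional[str], attempt: int, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after a 429 response.

    Prefers the server's Retry-After hint (assumed to be delta-seconds);
    falls back to capped exponential backoff: 1, 2, 4, 8, ... seconds.
    """
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(2.0 ** attempt, cap)

# retry_delay("30", 0)  -> 30.0   (server hint wins)
# retry_delay(None, 3)  -> 8.0    (exponential backoff)
# retry_delay(None, 10) -> 60.0   (capped)
```

Capping the delay keeps a long outage from pushing clients into multi-hour sleeps.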

There's an emerging IETF standard (the RateLimit response header fields, from the HTTP API working group) intended to unify the naming, but X-RateLimit-* remains the de facto convention.

Distributed Rate Limiting

On a single server, rate limiting is a counter in memory. In a distributed system with multiple servers behind a load balancer, each server needs a shared view of the client's request count.

Redis is the standard solution. Each request increments a Redis key scoped to the client (by API key, user ID, or IP address). Redis is fast enough that the check adds sub-millisecond overhead per request.

INCR rate_limit:user:42:window:1711322400
EXPIRE rate_limit:user:42:window:1711322400 60

For token bucket in Redis, a Lua script atomically checks the token count, subtracts one, and returns allowed/denied — all in a single round trip. The atomic execution prevents race conditions between concurrent requests.

Tradeoffs:

  • Redis adds a network hop per request (typically <1ms on a local network)
  • If Redis is down, you must decide: fail open (allow all requests) or fail closed (reject all)
  • At very high scale, even Redis can become a bottleneck — sharding by client ID distributes the load
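The fail-open/fail-closed decision is worth isolating in one place rather than scattering try/except blocks through handlers. A sketch; the `check_limit` interface here is hypothetical, standing in for whatever consults the shared store:

```python
from typing import Callable

def allow_request(check_limit: Callable[[str], bool], client_id: str,
                  fail_open: bool = True) -> bool:
    """Wrap a rate-limit check so a limiter-backend outage has a defined policy.

    `check_limit` is any callable that consults the shared store (e.g. Redis)
    and may raise ConnectionError when that store is unreachable.
    """
    try:
        return check_limit(client_id)
    except ConnectionError:
        # Fail open: keep serving traffic with degraded protection.
        # Fail closed (fail_open=False): reject everything until the store recovers.
        return fail_open
```

Fail open is the common default for public APIs (availability over strict enforcement); fail closed suits endpoints where overload is worse than downtime.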

Rate Limiting at Different Layers

Rate limits are often enforced at multiple points:

  • API gateway (Kong, Nginx, Cloudflare) — first line, based on IP or API key. Stops abusive traffic before it reaches application code.
  • Application layer — per-user, per-endpoint limits based on business logic. Free users get 100 requests/hour, paid users get 10,000.
  • Database layer — connection pools and query rate limits protect the database from runaway application code.

Each layer catches different threats. Gateway limits stop DDoS. Application limits enforce business rules. Database limits prevent cascade failures.

Next Steps