Rate Limiting at Scale

Rate limiting protects availability and cost. Design it with burst behavior, identity strategy, and distributed enforcement in mind.

What Rate Limiting Protects

Rate limiting protects availability, cost, and downstream dependencies. It is not only an anti-abuse tool; it is a stability mechanism. Without limits, a small spike or a single bad actor can saturate a bottleneck and drive up tail latency for every other client.

Decide What You Rate Limit By

IP: simple but noisy (NAT, mobile carriers)
User or API key: best for authenticated APIs
Tenant: necessary for multi-tenant fairness
Route: protect expensive endpoints separately
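
The identities above usually end up as string keys that a limiter counts against. The following is a minimal sketch of one way to derive those keys; the RequestContext fields, the key format, and the rate_limit_keys helper are all hypothetical names chosen for illustration, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical request metadata; field names are illustrative."""
    client_ip: str
    route: str
    api_key: Optional[str] = None
    tenant_id: Optional[str] = None


def rate_limit_keys(ctx: RequestContext) -> list[str]:
    """Return the limiter keys to check for this request, most specific first."""
    keys = []
    if ctx.api_key is not None:
        # Authenticated caller: limit per key, and per key per route so an
        # expensive endpoint can carry its own, tighter budget.
        keys.append(f"key:{ctx.api_key}:route:{ctx.route}")
        keys.append(f"key:{ctx.api_key}")
    else:
        # Anonymous caller: fall back to IP, which is noisy behind NAT.
        keys.append(f"ip:{ctx.client_ip}:route:{ctx.route}")
        keys.append(f"ip:{ctx.client_ip}")
    if ctx.tenant_id is not None:
        # A tenant-wide budget keeps one tenant from starving the others.
        keys.append(f"tenant:{ctx.tenant_id}")
    return keys
```

A request would then be admitted only if every returned key is under its own limit, so a per-route limit and a tenant-wide limit can coexist.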

Core Algorithms

Token bucket: allows bursts up to a limit; common for APIs.
Leaky bucket: smooths traffic; good for steady processing.
Fixed window: simple, but a client can send up to twice the limit in a short span that straddles a window boundary.
Sliding window: more accurate; slightly more complex.
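
To make the difference between fixed and sliding windows concrete, here is a small sketch that replays the same ten requests through both. The class names are illustrative, time is passed in explicitly so the run is deterministic, and the sliding window shown is the simple timestamp-log variant (one of several ways to implement it).

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window of `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # New window: the counter resets, which is what permits
            # back-to-back bursts on either side of the boundary.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLogLimiter:
    """Keeps timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.accepted = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the trailing window.
        while self.accepted and self.accepted[0] <= now - self.window:
            self.accepted.popleft()
        if len(self.accepted) < self.limit:
            self.accepted.append(now)
            return True
        return False


# Five requests just before the 60 s boundary, five just after it.
timestamps = [59.0 + i * 0.1 for i in range(5)] + [60.0 + i * 0.1 for i in range(5)]

fixed = FixedWindowLimiter(limit=5, window=60.0)
sliding = SlidingWindowLogLimiter(limit=5, window=60.0)

fixed_allowed = sum(fixed.allow(t) for t in timestamps)      # 10: both bursts pass
sliding_allowed = sum(sliding.allow(t) for t in timestamps)  # 5: second burst rejected
```

The log variant stores one timestamp per accepted request, which is the extra cost behind "slightly more complex".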

Burst Behavior Matters

Production traffic is bursty. If you limit strictly without burst allowance, you may reject legitimate spikes (page refresh storms, deploy-triggered retries). Token bucket is often preferred because it supports controlled bursts.
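
A minimal token bucket sketch shows how the burst allowance works: capacity bounds the burst size and rate sets the sustained throughput. The class and parameter names are assumptions made for this example, and the caller supplies the clock so the behavior is easy to reason about and test.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if now > self.updated_at:
            # Credit tokens for the elapsed time, never exceeding capacity.
            elapsed = now - self.updated_at
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=5.0)

# Six requests at the same instant: the first five drain the bucket.
burst = [bucket.allow(now=0.0) for _ in range(6)]

# Two seconds later, two tokens have been refilled.
later = [bucket.allow(now=2.0) for _ in range(3)]
```

Here burst is five accepted requests followed by one rejection, and later is two accepted requests followed by one rejection. A larger capacity tolerates bigger spikes without changing the long-run rate.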

Distributed Enforcement

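With multiple instances, each instance sees only part of a client's traffic, so purely local counters can let a client exceed the intended limit by up to a factor of the instance count. You can enforce limits at the edge (gateway/load balancer), or centrally using a shared store (Redis). Edge enforcement is fast; centralized enforcement is more accurate per identity.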
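
The sketch below illustrates why a shared counter matters. InMemorySharedStore is a hypothetical stand-in for a shared store such as Redis, not a real client: it offers only an atomic increment, roughly what an atomic counter command would provide. The limiter is a simple fixed-window counter, chosen for brevity rather than accuracy.

```python
import threading


class InMemorySharedStore:
    """Hypothetical stand-in for a shared store; exposes an atomic increment."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters = {}

    def incr(self, key: str) -> int:
        with self._lock:
            self._counters[key] = self._counters.get(key, 0) + 1
            return self._counters[key]


class StoreBackedLimiter:
    """Fixed-window limiter whose counters live in the given store."""

    def __init__(self, store: InMemorySharedStore, limit: int, window: float) -> None:
        self.store = store
        self.limit = limit
        self.window = window

    def allow(self, identity: str, now: float) -> bool:
        # One counter per identity per window. A real store would also need
        # these keys to expire; this stand-in never cleans them up.
        window_id = int(now // self.window)
        return self.store.incr(f"rl:{identity}:{window_id}") <= self.limit


def run(instances: list) -> int:
    """Send six requests for one user, alternating between instances."""
    return sum(instances[i % len(instances)].allow("user-1", now=10.0) for i in range(6))


# Two app instances sharing one store enforce a single global limit of 3.
shared = InMemorySharedStore()
shared_allowed = run([StoreBackedLimiter(shared, 3, 60.0) for _ in range(2)])

# Two instances with private stores each allow 3, so the client gets 6.
local_allowed = run([StoreBackedLimiter(InMemorySharedStore(), 3, 60.0) for _ in range(2)])
```

With the shared store, three of the six requests are admitted; with private stores, all six are. The cost of the shared approach is a network round trip per check, which is the accuracy versus speed trade-off described above.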

What to Return

Rate-limited requests should return HTTP 429 with a stable error code and, when possible, retry guidance.

HTTP 429 example response:
{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Too many requests",
    "retryAfterMs": 1000
  }
}
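
One way to produce that response is sketched below. The build_rate_limited_response helper is a hypothetical name, and the sketch assumes the server also sends the standard Retry-After header, which carries whole seconds, alongside the millisecond value in the body.

```python
import json
import math


def build_rate_limited_response(retry_after_ms: int) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body) for a rate-limited request."""
    # Retry-After is expressed in whole seconds; round up so that clients
    # honoring the header do not retry before the limit has reset.
    retry_after_s = max(0, math.ceil(retry_after_ms / 1000))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_s),
    }
    body = json.dumps({
        "error": {
            "code": "RATE_LIMITED",  # stable, machine-readable code
            "message": "Too many requests",
            "retryAfterMs": retry_after_ms,
        }
    })
    return 429, headers, body


status, headers, body = build_rate_limited_response(retry_after_ms=1500)
```

Keeping the error code stable lets clients branch on it without parsing the human-readable message.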

Production-First Takeaway

Rate limiting is an availability feature. Define the identity (user/tenant), choose an algorithm with the right burst behavior, enforce consistently across instances, and make the response predictable for clients.
