Rate Limiting at Scale
What Rate Limiting Protects
Rate limiting protects availability, cost, and downstream dependencies. It is not only an anti-abuse tool; it is a stability mechanism. Without limits, a small spike or a single bad actor can saturate a bottleneck and drive up tail latency for every other client.
Decide What You Rate Limit By
- IP: simple but noisy, because many users can share one address (NAT, mobile carriers)
- User or API key: best for authenticated APIs
- Tenant: necessary for multi-tenant fairness
- Route: protect expensive endpoints separately
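The dimensions above can be combined, so that one request is counted against several budgets at once. The following is a minimal sketch of that idea, not a prescribed scheme: `RequestInfo`, its fields, and the key formats are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RequestInfo:
    """Hypothetical request summary; field names are illustrative only."""
    ip: str
    route: str
    api_key: Optional[str] = None
    tenant_id: Optional[str] = None


def limit_keys(req: RequestInfo) -> List[str]:
    """Return every rate-limit key this request should be counted against."""
    keys = []
    # Prefer the authenticated identity; fall back to IP for anonymous traffic.
    if req.api_key is not None:
        identity = f"key:{req.api_key}"
    else:
        identity = f"ip:{req.ip}"
    keys.append(identity)
    # Tenant-wide budget for multi-tenant fairness.
    if req.tenant_id is not None:
        keys.append(f"tenant:{req.tenant_id}")
    # Per-route budget so an expensive endpoint gets its own limit.
    keys.append(f"route:{req.route}:{identity}")
    return keys
```

A request would then be allowed only if every returned key is still within its own limit.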
Core Algorithms
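- Token bucket: tokens refill at a fixed rate up to a capacity, so bursts up to that capacity are allowed; common for APIs.
- Leaky bucket: drains requests at a constant rate, which smooths traffic; good for steady processing.
- Fixed window: simple, but a client can send up to twice the limit in a short span that straddles a window boundary.
- Sliding window: more accurate at boundaries; slightly more complex and needs more state per key.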
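To make the boundary difference between fixed and sliding windows concrete, here is a small sketch of both. The class names `FixedWindow` and `SlidingWindowLog` are assumptions for illustration, and timestamps are passed in explicitly so the behavior is deterministic. The sliding variant shown is the "log" form, which stores one timestamp per accepted request.

```python
from collections import deque


class FixedWindow:
    """Counts requests per aligned window of window_s seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            # A new window started: the counter resets abruptly.
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Counts requests in the trailing window_s seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the trailing window.
        while self.stamps and self.stamps[0] <= now - self.window_s:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


# Ten requests clustered around the t=10s boundary, limit 5 per 10s.
times = [9.0, 9.1, 9.2, 9.3, 9.4, 10.0, 10.1, 10.2, 10.3, 10.4]
fixed = FixedWindow(limit=5, window_s=10)
sliding = SlidingWindowLog(limit=5, window_s=10)
fixed_allowed = sum(fixed.allow(t) for t in times)      # 10: double the limit
sliding_allowed = sum(sliding.allow(t) for t in times)  # 5
```

The fixed window admits all ten requests within 1.4 seconds because the counter resets at t=10, while the sliding log admits only the first five.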
Burst Behavior Matters
Production traffic is bursty. If you limit strictly without burst allowance, you may reject legitimate spikes (page refresh storms, deploy-triggered retries). Token bucket is often preferred because it supports controlled bursts.
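A controlled burst can be sketched with a small token bucket. This is an illustrative single-process version under stated assumptions: the `TokenBucket` class and its parameter names are hypothetical, and time is passed in explicitly rather than read from a clock.

```python
class TokenBucket:
    """Refills at rate_per_s tokens per second, holding at most capacity tokens."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=1.0, capacity=5.0)
burst = [bucket.allow(now=0.0) for _ in range(7)]
# The first 5 requests pass as a burst; the last 2 are rejected.
later = [bucket.allow(now=2.0) for _ in range(3)]
# Two seconds later, 2 tokens have refilled: 2 pass, 1 is rejected.
```

Here `capacity` sets how large a legitimate spike can be, and `rate_per_s` sets the sustained rate; the two can be tuned independently.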
Distributed Enforcement
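With multiple instances, rate limiting must be consistent: if each instance counts only the traffic it sees, a client spread across N instances can get roughly N times the intended limit. You can enforce limits at the edge (gateway/load balancer), or centrally using a shared store (Redis). Edge enforcement is fast; centralized enforcement is more accurate per identity, at the cost of a network round trip to the store.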
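One way centralized enforcement might look is a fixed-window counter kept in a shared store, so every instance increments the same key. The sketch below is an assumption-laden illustration, not a production design: `CounterStore`, `InMemoryStore`, `allow`, and the `rl:` key format are hypothetical. The store interface only needs `incr` and `expire`, which the redis-py client also provides; the in-memory class stands in for it so the example runs without a server.

```python
import time
from typing import Dict, Optional, Protocol


class CounterStore(Protocol):
    """Minimal shared-store interface (hypothetical)."""

    def incr(self, name: str) -> int: ...
    def expire(self, name: str, seconds: int) -> object: ...


class InMemoryStore:
    """Single-process stand-in for the shared store, for illustration only."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def incr(self, name: str) -> int:
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.counts[name]

    def expire(self, name: str, seconds: int) -> bool:
        return True  # expiry is not simulated in this stand-in


def allow(store: CounterStore, identity: str, limit: int,
          window_s: int, now: Optional[float] = None) -> bool:
    """Count the request in the shared store and report whether it is within limit."""
    now = time.time() if now is None else now
    # Putting the window id in the key means old windows simply age out.
    key = f"rl:{identity}:{int(now // window_s)}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: schedule cleanup of the key.
        store.expire(key, window_s * 2)
    return count <= limit
```

Because the increment and the expiry are separate calls, a crash between them could leave a key without a TTL; a real deployment would likely make the two steps atomic.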
What to Return
Rate-limited requests should return HTTP 429 (Too Many Requests) with a stable error code and, when possible, retry guidance such as a Retry-After header or a retry delay in the body.
HTTP 429 example response:
{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Too many requests",
    "retryAfterMs": 1000
  }
}
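A response like the one above could be assembled by a small helper. This is a framework-agnostic sketch: the function name `rate_limited_response` and the (status, headers, body) return shape are assumptions for illustration. It also sets the standard Retry-After header, which is expressed in whole seconds.

```python
import json
import math
from typing import Dict, Tuple


def rate_limited_response(retry_after_ms: int) -> Tuple[int, Dict[str, str], str]:
    """Build a predictable 429 response as (status, headers, JSON body)."""
    body = {
        "error": {
            "code": "RATE_LIMITED",
            "message": "Too many requests",
            "retryAfterMs": retry_after_ms,
        }
    }
    headers = {
        "Content-Type": "application/json",
        # Round up to whole seconds so clients never retry too early.
        "Retry-After": str(math.ceil(retry_after_ms / 1000)),
    }
    return 429, headers, json.dumps(body)
```

Keeping the error code stable lets clients branch on `RATE_LIMITED` without parsing the human-readable message.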
Production-First Takeaway
Rate limiting is an availability feature. Define the identity (user/tenant), choose an algorithm with the right burst behavior, enforce consistently across instances, and make the response predictable for clients.