Distributed Rate Limiting (Token Buckets Across Nodes)
Distributed Rate Limiting: Controlling Demand Across a Cluster
Rate limiting restricts the number of requests a client, user, or system can make within a defined time window. In a system running across multiple nodes, the limit must hold across the entire cluster, not just per instance.
Without distributed coordination, rate limits become ineffective and exploitable.
The Core Problem
In a horizontally scaled system:
- Each instance may enforce limits independently.
- A client may distribute requests across instances.
- Per-node limits fail to enforce global constraints.
Example:
Limit: 100 requests per minute
Cluster: 5 nodes
Client sends 100 requests to each node
Total: 500 requests in one minute
Each node sees only 100 requests and admits all of them, so the client gets five times the intended cluster-wide limit.
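The arithmetic in the example can be reproduced with a small simulation. This is a minimal sketch, assuming a single time window and a hypothetical `PerNodeLimiter` class; neither the class nor the client id comes from a real library.

```python
from collections import defaultdict


class PerNodeLimiter:
    """Fixed-window counter that only knows about its own node's traffic."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts = defaultdict(int)  # client id -> requests seen in this window

    def allow(self, client: str) -> bool:
        if self.counts[client] >= self.limit:
            return False
        self.counts[client] += 1
        return True


def admitted_across_cluster(nodes: int, limit: int, requests_per_node: int) -> int:
    """Count how many requests one client gets through in a single window."""
    cluster = [PerNodeLimiter(limit) for _ in range(nodes)]
    return sum(
        node.allow("client-a")  # hypothetical client id
        for node in cluster
        for _ in range(requests_per_node)
    )


print(admitted_across_cluster(nodes=5, limit=100, requests_per_node=100))  # 500
```

Every node stays within its own limit, yet the cluster admits 500 requests against an intended limit of 100.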
Common Rate Limiting Algorithms
1) Token Bucket
- Tokens added at fixed rate.
- Requests consume tokens.
- Allows short bursts.
2) Leaky Bucket
- Requests processed at constant rate.
- Excess queued or rejected.
- Smooths traffic spikes.
3) Fixed Window Counter
- Counts requests per time window.
- Simple, but a client can send up to twice the limit in a short span that straddles a window boundary.
4) Sliding Window
- More accurate rolling time window enforcement.
- Higher computational cost.
Algorithm choice depends on burst tolerance and precision needs.
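As an illustration of the first algorithm, here is a minimal single-process token bucket. It is a sketch under assumptions: the `TokenBucket` name, its parameters, and the injectable `clock` are illustrative choices, and the bucket refills lazily on each call rather than on a timer.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket with lazy refill."""

    def __init__(
        self,
        rate: float,       # tokens added per second
        capacity: float,   # maximum burst size
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=5.0, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(6)]   # five allowed, sixth rejected
now[0] += 1.0                                # one second later: 2 tokens refilled
after = [bucket.allow() for _ in range(3)]   # two allowed, third rejected
print(burst.count(True), after.count(True))  # 5 2
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput; those are the two knobs to tune for burst tolerance.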
Centralized vs Distributed Enforcement
Centralized Rate Limiting
- Single shared store tracks counters.
- Ensures global consistency.
- May add latency to every request and become a single bottleneck.
Distributed (Decentralized) Rate Limiting
- Each node enforces partial quota.
- Uses sharding or consistent hashing.
- More scalable but less precise.
Hybrid approaches are common.
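Two decentralized building blocks can be sketched briefly: routing each key to one owning node, and splitting a global limit into equal per-node shares. The function names are hypothetical, and the routing below uses plain modulo hashing as a simplification; it is not consistent hashing, so changing the node count remaps most keys.

```python
import hashlib
from typing import List


def owner_node(key: str, nodes: List[str]) -> str:
    """Pick the node that owns the counter for this key (simple hash sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def local_share(global_limit: int, node_count: int) -> int:
    """Static split: each node enforces an equal slice of the global limit."""
    return global_limit // node_count


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical node names
print(owner_node("tenant-42", nodes))   # always the same node for this key
print(local_share(100, len(nodes)))     # 20
```

The static split illustrates the precision cost: a client whose traffic lands mostly on one node is throttled at 20 requests even though the cluster-wide limit is 100.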
Production Scenario: Abuse During Traffic Spike
Symptom
One tenant sends excessive API calls, degrading service for others.
Root Cause
Per-node rate limiting implemented without shared quota tracking.
Diagnosis
- Uneven request distribution across nodes.
- Tenant exceeding intended quota.
- No cluster-level aggregation.
Resolution
- Introduce distributed token bucket backed by shared store.
- Apply per-tenant quotas.
- Monitor rate limit rejections.
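The resolution can be sketched as follows. An in-memory class stands in for the shared store so the example runs on its own; in production this role would be played by a networked datastore with an atomic read-modify-write primitive. `InMemorySharedStore`, `Node`, and the `tenant:` key prefix are assumptions made for illustration.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class InMemorySharedStore:
    """Stand-in for a shared datastore; the lock makes each take() atomic."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._lock = threading.Lock()
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last_refill)
        self._clock = clock

    def take(self, key: str, rate: float, capacity: float, cost: float = 1.0) -> bool:
        with self._lock:
            now = self._clock()
            tokens, last = self._buckets.get(key, (capacity, now))
            tokens = min(capacity, tokens + (now - last) * rate)
            allowed = tokens >= cost
            if allowed:
                tokens -= cost
            self._buckets[key] = (tokens, now)
            return allowed


class Node:
    """An application node that defers every quota decision to the shared store."""

    def __init__(self, store: InMemorySharedStore, rate: float, capacity: float) -> None:
        self.store = store
        self.rate = rate
        self.capacity = capacity

    def handle(self, tenant: str) -> bool:
        # One bucket per tenant, shared by all nodes.
        return self.store.take(f"tenant:{tenant}", self.rate, self.capacity)


# Frozen clock: no refill happens, so only the initial 100 tokens are available.
store = InMemorySharedStore(clock=lambda: 0.0)
cluster = [Node(store, rate=100 / 60, capacity=100) for _ in range(5)]
admitted = sum(node.handle("noisy") for node in cluster for _ in range(100))
print(admitted)                      # 100, not 500
print(cluster[0].handle("quiet"))    # True: other tenants keep their own quota
```

Because every node draws from the same per-tenant bucket, spreading requests across nodes no longer multiplies the quota.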
Consistency and Latency Tradeoff
Strong global enforcement requires:
- Centralized counter store.
- Atomic increment operations.
- Low-latency shared datastore.
Under high load, counter contention may increase latency.
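The need for atomicity, and the contention it implies, can be seen in a small concurrent sketch. The check and the increment happen under one lock, so concurrent callers never admit more than the limit, but every caller also queues on that same lock. `AtomicWindowCounter` and `hammer` are hypothetical names, and a local lock stands in for the store's atomic operation.

```python
import threading
from typing import List


class AtomicWindowCounter:
    """Check-and-increment as one atomic step for a single time window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._count = 0
        self._lock = threading.Lock()

    def try_increment(self) -> bool:
        with self._lock:  # every caller serializes here: the contention point
            if self._count >= self.limit:
                return False
            self._count += 1
            return True


def hammer(counter: AtomicWindowCounter, threads: int = 8, attempts: int = 100) -> int:
    """Fire concurrent attempts and return how many were admitted in total."""
    admitted: List[int] = []

    def worker() -> None:
        admitted.append(sum(counter.try_increment() for _ in range(attempts)))

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return sum(admitted)


print(hammer(AtomicWindowCounter(limit=250)))  # 250 out of 800 attempts
```

If the check and the increment were separate operations, two callers could both read a count of 249 and both be admitted; making them one step is what buys exactness at the cost of serialization.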
Edge vs Application-Level Limiting
- Edge (API gateway): protects cluster early.
- Application-level: finer-grained business logic enforcement.
Defense in depth is recommended.
Retry Interaction
Rate limiting must consider retry behavior:
- Retries should count toward quota.
- Backoff required to prevent retry storms.
- Rate-limited responses (typically HTTP 429) should include a Retry-After hint.
Improper integration can cause thundering herd amplification.
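On the client side, the points above might look like the following sketch: exponential backoff with full jitter, never retrying earlier than the server's retry-after hint. The function name, defaults, and the choice to add jitter on top of the hint are assumptions, not a prescribed policy.

```python
import random
from typing import Optional


def retry_delay(
    attempt: int,                         # 0 for the first retry
    retry_after: Optional[float] = None,  # server hint in seconds, if any
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return how long to wait before the next retry, in seconds."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)  # full jitter spreads clients apart
    if retry_after is not None:
        # Honor the hint, then add jitter so clients that were rejected
        # together do not all come back at the same instant.
        return retry_after + jitter
    return jitter


rng = random.Random(0)
for attempt in range(4):
    print(attempt, round(retry_delay(attempt, rng=rng), 3))
print(retry_delay(0, retry_after=10.0, rng=rng) >= 10.0)  # True
```

Without the jitter term, every client that received the same hint would retry simultaneously, recreating the spike the limiter just rejected.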
Observability Requirements
- Rate limit rejection count.
- Per-tenant usage distribution.
- Token consumption rate.
- Counter store latency.
- Quota exhaustion alerts.
Rate limiting effectiveness must be measurable.
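A minimal in-process sketch of these signals is shown below. A real deployment would export them to a metrics system; the `RateLimitMetrics` class, its method names, and the alert threshold are hypothetical.

```python
from collections import Counter
from typing import List


class RateLimitMetrics:
    """Collects per-tenant decisions and shared-store latency samples."""

    def __init__(self) -> None:
        self.allowed: Counter = Counter()    # tenant -> admitted requests
        self.rejected: Counter = Counter()   # tenant -> rejected requests
        self.store_latency_ms: List[float] = []

    def record_decision(self, tenant: str, allowed: bool) -> None:
        (self.allowed if allowed else self.rejected)[tenant] += 1

    def record_store_latency(self, millis: float) -> None:
        self.store_latency_ms.append(millis)

    def rejection_ratio(self, tenant: str) -> float:
        total = self.allowed[tenant] + self.rejected[tenant]
        return self.rejected[tenant] / total if total else 0.0

    def tenants_over(self, threshold: float) -> List[str]:
        """Tenants whose rejection ratio exceeds the threshold (alert candidates)."""
        tenants = set(self.allowed) | set(self.rejected)
        return sorted(t for t in tenants if self.rejection_ratio(t) > threshold)


metrics = RateLimitMetrics()
for ok in (True, True, True, False):
    metrics.record_decision("tenant-a", ok)
metrics.record_decision("tenant-b", True)
metrics.record_store_latency(1.8)

print(metrics.rejection_ratio("tenant-a"))  # 0.25
print(metrics.tenants_over(0.2))            # ['tenant-a']
```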
Failure Injection Test
# Distributed rate limit validation
1) Simulate high traffic from single tenant
2) Verify cluster-wide enforcement
3) Attempt burst traffic across multiple nodes
4) Confirm token bucket limits respected
5) Inject shared store latency
6) Validate system stability under degraded enforcement
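Steps 5 and 6 can be exercised in miniature with a store whose failures are injected on demand. The sketch assumes one particular degraded-mode policy: when the store times out, each node falls back to a fixed local share of the limit. That policy, the single-window simplification, and all class names are assumptions for illustration.

```python
class FlakyStore:
    """Shared counter for one window, with an injectable timeout."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0
        self.fail = False  # set True to inject a shared-store timeout

    def try_increment(self) -> bool:
        if self.fail:
            raise TimeoutError("injected shared-store timeout")
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class NodeLimiter:
    """Uses the shared store when healthy, a local share when it is not."""

    def __init__(self, store: FlakyStore, local_limit: int) -> None:
        self.store = store
        self.local_limit = local_limit
        self.local_count = 0

    def allow(self) -> bool:
        try:
            return self.store.try_increment()
        except TimeoutError:
            # Degraded mode: enforce only this node's slice of the quota.
            if self.local_count >= self.local_limit:
                return False
            self.local_count += 1
            return True


def run(fail: bool, nodes: int = 5, limit: int = 100, per_node: int = 30) -> int:
    """Send per_node requests to every node and count admissions."""
    store = FlakyStore(limit)
    store.fail = fail
    cluster = [NodeLimiter(store, local_limit=limit // nodes) for _ in range(nodes)]
    return sum(node.allow() for node in cluster for _ in range(per_node))


print(run(fail=False))  # 100: shared store enforces the global limit
print(run(fail=True))   # 100: five nodes each admit their local share of 20
```

With this fallback the cluster stays bounded during the outage, at the price of throttling unevenly distributed traffic earlier than the global limit would.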
Common Anti-Patterns
- Per-node rate limiting only.
- No differentiation between tenants.
- No retry-after header.
- No monitoring of quota usage.
- Rate limiting applied too late in request path.
Rate limiting must be consistent and early.
Operational Checklist
- Is rate limiting enforced cluster-wide?
- Are quotas defined per tenant or API key?
- Is retry behavior aligned with limits?
- Is rate limit usage observable?
- Is enforcement positioned at system edge?
Key Takeaways
- Distributed systems require cluster-wide rate enforcement.
- Token bucket is common and burst-friendly.
- Strong consistency may introduce latency tradeoffs.
- Rate limiting protects stability and fairness.
- Monitoring quota usage prevents abuse and overload.
Distributed rate limiting is a foundational stability mechanism. In production-grade systems, it safeguards shared resources, prevents abuse, and protects overall reliability under unpredictable load conditions.