Distributed Rate Limiting (Token Buckets Across Nodes)

Distributed rate limiting controls request volume across multiple nodes to protect system stability. This lesson explains token bucket and leaky bucket algorithms, centralized vs decentralized enforcement, and production pitfalls.


Distributed Rate Limiting: Controlling Demand Across a Cluster

Rate limiting restricts the number of requests a client, user, or system can make within a defined time window. In distributed systems running across multiple nodes, rate limiting must operate consistently across the entire cluster, not just per instance.

Without distributed coordination, rate limits become ineffective and exploitable.

The Core Problem

In a horizontally scaled system:

Each instance may enforce limits independently.
A client may distribute requests across instances.
Per-node limits fail to enforce global constraints.

Example:

Limit: 100 requests per minute
Cluster: 5 nodes
Client sends 100 requests to each node
Total: 500 requests in one minute

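The client gets five times the intended limit, because no node sees the requests handled by the others. Per-instance limiting fails to enforce cluster-wide policy.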
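
The arithmetic in this example can be reproduced with a small simulation. This is a minimal sketch rather than a real limiter: the function names and the in-memory counters are assumptions made for illustration.

```python
from collections import Counter

LIMIT_PER_MINUTE = 100  # intended cluster-wide limit
NODES = 5

def admitted_with_per_node_limits(requests_per_node: int) -> int:
    """Each node counts only its own traffic and applies the full limit."""
    per_node = Counter()
    admitted = 0
    for node in range(NODES):
        for _ in range(requests_per_node):
            if per_node[node] < LIMIT_PER_MINUTE:
                per_node[node] += 1
                admitted += 1
    return admitted

def admitted_with_shared_counter(requests_per_node: int) -> int:
    """All nodes check and increment one shared counter."""
    shared = 0
    admitted = 0
    for _node in range(NODES):
        for _ in range(requests_per_node):
            if shared < LIMIT_PER_MINUTE:
                shared += 1
                admitted += 1
    return admitted
```

With 100 requests sent to each of the 5 nodes, the per-node version admits 500 requests and the shared-counter version admits 100.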

Common Rate Limiting Algorithms

1) Token Bucket

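Tokens are added at a fixed rate, up to a maximum bucket capacity.
Each request consumes a token; requests are rejected or delayed when the bucket is empty.
Allows short bursts, bounded by the bucket capacity.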
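
A single-node token bucket can be sketched as follows. The class name, parameters, and the injectable clock are assumptions for illustration; the refill is computed lazily on each call instead of by a background timer, which is one common way to implement it.

```python
import time

class TokenBucket:
    """Single-process token bucket with lazy refill."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the sketch can be tested
        self.tokens = capacity    # start full
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with capacity 3 and rate 1 token per second admits a burst of three requests, rejects the fourth, and admits two more after two seconds have passed.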

2) Leaky Bucket

Requests are processed (drained) at a constant rate.
Excess requests are queued or rejected.
Smooths traffic spikes.
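
One way to sketch a leaky bucket is as a meter: the bucket level drains at a constant rate, each request raises the level by one, and a request that would overflow the bucket is rejected. This is the rejecting variant, not the queueing one, and the names below are illustrative assumptions.

```python
import time

class LeakyBucket:
    """Leaky bucket used as a meter: excess requests are rejected, not queued."""

    def __init__(self, leak_rate: float, capacity: float, clock=time.monotonic):
        self.leak_rate = leak_rate  # units drained per second
        self.capacity = capacity    # maximum level before overflow
        self.clock = clock
        self.level = 0.0
        self.last = clock()

    def offer(self) -> bool:
        now = self.clock()
        # Drain the bucket for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```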

3) Fixed Window Counter

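Counts requests per fixed time window and resets the count at each window boundary.
Simple, but prone to boundary spikes: a burst at the end of one window followed by a burst at the start of the next can briefly admit up to twice the limit.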
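
The boundary problem is easy to see in a minimal sketch. The class below is an illustrative assumption (timestamps are passed in explicitly to keep it deterministic), not a production implementation.

```python
class FixedWindowCounter:
    """Counts requests per window; the count resets when the window changes."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new window started: forget everything from the old one.
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 100 per 60 seconds, 100 requests at t = 59.9 and another 100 at t = 60.1 are all admitted, so 200 requests pass within 0.2 seconds.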

4) Sliding Window

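Enforces the limit over a rolling time window, which avoids the boundary spikes of fixed windows.
Higher computational and memory cost, because recent request timestamps (or per-sub-window counts) must be tracked.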
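
A sliding window can be sketched as a log of recent timestamps. This is one variant (the "sliding log"); the class name is an assumption, and the per-client memory cost grows with the limit, which is the extra cost mentioned above.

```python
from collections import deque

class SlidingWindowLog:
    """Keeps the timestamps of admitted requests within the last window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Under the same traffic that defeats a fixed window (100 requests at t = 59.9, then 100 at t = 60.1), the second burst is rejected because the first 100 requests are still inside the rolling 60-second window.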

Algorithm choice depends on burst tolerance and precision needs.

Centralized vs Distributed Enforcement

Centralized Rate Limiting

Single shared store tracks counters.
Ensures global consistency.
May add latency to every request and become a bottleneck or single point of failure.

Distributed (Decentralized) Rate Limiting

Each node enforces partial quota.
Uses sharding or consistent hashing.
More scalable but less precise.

Hybrid approaches are common.
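
The sharding idea in the decentralized approach can be sketched with a small consistent hash ring that maps each rate-limit key (for example a tenant ID) to one owning node, so that a given key's counter lives in exactly one place. The class, the node names, and the number of virtual nodes are assumptions for illustration.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Stable hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Because only the keys owned by a removed node are reassigned, losing one node does not move the counters of tenants owned by the remaining nodes.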

Production Scenario: Abuse During Traffic Spike

Symptom

One tenant sends excessive API calls, degrading service for others.

Root Cause

Per-node rate limiting implemented without shared quota tracking.

Diagnosis

Uneven request distribution across nodes.
Tenant exceeding intended quota.
No cluster-level aggregation.

Resolution

Introduce distributed token bucket backed by shared store.
Apply per-tenant quotas.
Monitor rate limit rejections.
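
The resolution can be sketched as a per-tenant token bucket whose state lives in a shared store. The store below is a hypothetical in-memory stand-in with an atomic update method; in a real deployment this role would be played by a shared datastore that offers atomic read-modify-write, and the class and key names are assumptions.

```python
import threading
import time

class InMemorySharedStore:
    """Hypothetical stand-in for a shared datastore with atomic updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def atomic_update(self, key, fn):
        # fn receives the current value (or None) and returns (new_value, result).
        with self._lock:
            new_value, result = fn(self._data.get(key))
            self._data[key] = new_value
            return result

class SharedTokenBucket:
    """Per-tenant token bucket; every node reads and writes the same state."""

    def __init__(self, store, rate, capacity, clock=time.time):
        # Assumes node clocks are roughly comparable; this is a simplification.
        self.store, self.rate, self.capacity, self.clock = store, rate, capacity, clock

    def allow(self, tenant: str) -> bool:
        now = self.clock()

        def step(state):
            tokens, last = state if state is not None else (self.capacity, now)
            # max() guards against small clock differences between nodes.
            tokens = min(self.capacity, tokens + max(0.0, now - last) * self.rate)
            if tokens >= 1:
                return (tokens - 1, max(now, last)), True
            return (tokens, max(now, last)), False

        return self.store.atomic_update(f"bucket:{tenant}", step)
```

Five node-local SharedTokenBucket instances pointing at the same store admit 100 requests in total for one tenant, no matter how that tenant spreads its traffic, while a second tenant keeps its own independent quota.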

Consistency and Latency Tradeoff

Strong global enforcement requires:

Centralized counter store.
Atomic increment operations.
Low-latency shared datastore.

Under high load, counter contention may increase latency.
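
One commonly used way to trade precision for less contention is to let each node lease tokens from the central store in batches and spend them locally. The sketch below is an assumption-laden illustration of that idea (class names, the batch size, and the in-process "central" quota are all hypothetical), not a description of any specific product.

```python
import threading

class CentralQuota:
    """Hypothetical central counter; each lease() models one store round trip."""

    def __init__(self, total: int):
        self.remaining = total
        self.round_trips = 0
        self._lock = threading.Lock()

    def lease(self, n: int) -> int:
        with self._lock:
            self.round_trips += 1
            granted = min(n, self.remaining)
            self.remaining -= granted
            return granted

class BatchingNode:
    """Spends locally leased tokens and contacts the store only when empty."""

    def __init__(self, central: CentralQuota, batch_size: int):
        self.central = central
        self.batch_size = batch_size
        self.local = 0

    def allow(self) -> bool:
        if self.local == 0:
            self.local = self.central.lease(self.batch_size)
        if self.local > 0:
            self.local -= 1
            return True
        return False
```

With a quota of 100 and a batch size of 10, two nodes handling 50 requests each need 10 store round trips instead of 100. The cost is precision: tokens leased by a node that stops receiving traffic sit unused until the quota is reset.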

Edge vs Application-Level Limiting

Edge (API gateway): protects cluster early.
Application-level: finer-grained business logic enforcement.

Defense in depth is recommended.

Retry Interaction

Rate limiting must consider retry behavior:

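Retries should count toward quota.
Clients need backoff, ideally with jitter, to prevent retry storms.
Rejected responses should include a retry-after hint (for HTTP, typically a 429 status with a Retry-After header).

If clients retry rejected requests immediately and in lockstep, the retries themselves add load and can amplify into a thundering herd.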
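
Both sides of this interaction can be sketched briefly: the server derives a retry-after hint from the token bucket state, and the client combines that hint with jittered exponential backoff so that rejected clients do not all return at the same instant. Function names and default values are assumptions for illustration.

```python
import math
import random

def retry_after_seconds(tokens: float, rate: float) -> int:
    """Server side: whole seconds until one token is available (rate > 0)."""
    return max(0, math.ceil((1.0 - tokens) / rate))

def client_delay(attempt: int, retry_after=None, base: float = 0.5,
                 cap: float = 30.0, rng=random.random) -> float:
    """Client side: wait at least the server hint, plus random jitter."""
    backoff = min(cap, base * 2 ** attempt)  # exponential, capped
    jitter = rng() * backoff                 # spreads clients apart
    return (retry_after or 0.0) + jitter
```

Treating the server hint as a floor and adding jitter on top is a design choice in this sketch; other policies, such as taking the larger of the two values, are also reasonable.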

Observability Requirements

Rate limit rejection count.
Per-tenant usage distribution.
Token consumption rate.
Counter store latency.
Quota exhaustion alerts.
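
A few of these signals can be sketched with simple in-process counters. This is only an illustration of what to record per decision; the class and method names are assumptions, and a real system would export such values to its metrics pipeline.

```python
from collections import Counter

class LimiterMetrics:
    """Records allow/reject decisions per tenant."""

    def __init__(self):
        self.allowed = Counter()
        self.rejected = Counter()

    def record(self, tenant: str, was_allowed: bool) -> None:
        (self.allowed if was_allowed else self.rejected)[tenant] += 1

    def rejection_ratio(self, tenant: str) -> float:
        total = self.allowed[tenant] + self.rejected[tenant]
        return self.rejected[tenant] / total if total else 0.0

    def tenants_over(self, ratio: float) -> list:
        """Tenants whose rejection ratio exceeds the given threshold."""
        tenants = set(self.allowed) | set(self.rejected)
        return sorted(t for t in tenants if self.rejection_ratio(t) > ratio)
```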

Rate limiting effectiveness must be measurable.

Failure Injection Test

# Distributed rate limit validation
1) Simulate high traffic from single tenant
2) Verify cluster-wide enforcement
3) Attempt burst traffic across multiple nodes
4) Confirm token bucket limits respected
5) Inject shared store latency
6) Validate system stability under degraded enforcement
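
Steps 5 and 6 can be rehearsed in miniature. In the sketch below, a hypothetical store raises TimeoutError when latency is injected, and each node falls back to a small local allowance so that degraded enforcement stays bounded. The names, the fallback policy, and the limits are assumptions; the fallback counter is never reset here, whereas a real limiter would reset it per window.

```python
class InjectableStore:
    """Hypothetical shared counter with a switch that simulates timeouts."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self.degraded = False

    def try_acquire(self) -> bool:
        if self.degraded:
            raise TimeoutError("injected shared-store latency")
        if self.count < self.limit:
            self.count += 1
            return True
        return False

class Node:
    """Uses the shared store, with a bounded local fallback on timeout."""

    def __init__(self, store: InjectableStore, fallback_limit: int):
        self.store = store
        self.fallback_limit = fallback_limit
        self.fallback_used = 0

    def allow(self) -> bool:
        try:
            return self.store.try_acquire()
        except TimeoutError:
            # Degraded mode: admit only a small per-node share.
            if self.fallback_used < self.fallback_limit:
                self.fallback_used += 1
                return True
            return False
```

With a shared limit of 100, five nodes, and a fallback allowance of 20 per node, the cluster admits 100 requests while the store is healthy and at most 100 more while it is degraded, rather than failing fully open.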

Common Anti-Patterns

Per-node rate limiting only.
No differentiation between tenants.
No Retry-After header on rejected responses.
No monitoring of quota usage.
Rate limiting applied too late in request path.

Rate limiting must be consistent and early.

Operational Checklist

Is rate limiting enforced cluster-wide?
Are quotas defined per tenant or API key?
Is retry behavior aligned with limits?
Is rate limit usage observable?
Is enforcement positioned at system edge?

Key Takeaways

Distributed systems require cluster-wide rate enforcement.
Token bucket is common and burst-friendly.
Strong consistency may introduce latency tradeoffs.
Rate limiting protects stability and fairness.
Monitoring quota usage prevents abuse and overload.

Distributed rate limiting is a foundational stability mechanism. In production-grade systems, it safeguards shared resources, prevents abuse, and protects overall reliability under unpredictable load conditions.
