Sharding Strategies (Range, Hash, Directory)
Data Sharding Strategies: Scaling by Partitioning Data
Replication improves availability by keeping copies of the same data. Sharding (partitioning) improves scalability by distributing different subsets of data across nodes. Without sharding, a single primary or single storage node becomes the bottleneck. With sharding, capacity grows with the number of shards.
Sharding is deceptively hard. The partitioning strategy determines performance, hotspot behavior, operational complexity, and the blast radius of failures.
What Is a Shard?
A shard is a partition of the dataset. Each shard is responsible for a subset of keys or entities. In most production systems, each shard is itself replicated (for availability). This creates a common architecture:
- Dataset is split into shards (partitioning)
- Each shard has replicas (replication)
- Routing layer maps keys to shard owners
Correctness and performance depend on predictable routing and safe rebalancing.
Strategy 1: Range-Based Sharding
Range sharding assigns contiguous key ranges to shards. Example: users 1–1M on shard A, 1M–2M on shard B.
Advantages
- Efficient range scans
- Good locality for ordered queries
Disadvantages
- Hotspot risk if writes concentrate in a range (e.g., newest timestamps)
- Rebalancing requires splitting ranges carefully
Production Hotspot Example
If keys are based on creation time, the newest range receives most writes, overloading a single shard.
Strategy 2: Hash-Based Sharding
Hash sharding computes a hash of the key and maps it to a shard. Example:
shard = hash(user_id) % N
Advantages
- Even distribution for random keys
- Reduced hotspot probability
Disadvantages
- Range queries become expensive
- Changing N causes massive reshuffling without careful design
Strategy 3: Consistent Hashing
Consistent hashing reduces reshuffling when nodes are added or removed. Keys map onto a ring; nodes own segments of the ring.
Why It Matters
- Adding one node moves only a fraction of keys
- Removing a node moves only its segment
Consistent hashing is widely used in caches and leaderless systems.
Reference: Consistent hashing
Strategy 4: Directory-Based Sharding
A central directory maps keys (or tenants) to shards. This supports flexible placement (useful for multi-tenant systems).
Advantages
- Fine-grained control
- Tenant-based placement and isolation
Disadvantages
- Directory becomes a critical dependency
- Requires high availability and caching
Production Scenario: Shard Hotspot Meltdown
Symptom
Overall cluster CPU is moderate, but one shard experiences 100% CPU and latency spikes. Error rates rise for a subset of users.
Root Cause
Shard key distribution is skewed. A high-traffic tenant or “celebrity key” concentrates requests on one shard.
Diagnosis
- Per-shard QPS distribution highly imbalanced.
- Hot shard has deeper queues and higher tail latency.
- Key-level metrics show top keys dominating traffic.
Resolution
- Introduce “sub-sharding” for hot keys (salting).
- Repartition tenant data across multiple shards.
- Add caching and request coalescing for hot reads.
Rebalancing and Shard Migration Realities
Shards are not static. You will add nodes, scale capacity, and rebalance. Migration is where many production incidents occur.
Key concerns:
- Maintaining correctness during movement
- Minimizing impact on latency
- Ensuring replication catches up before cutover
Safe Shard Migration Pattern
A common safe approach is dual-write + cutover:
- Start copying shard data to new owner.
- Enable dual-writes (old + new) for keys in migration range.
- Catch up replication / change stream to remove lag.
- Flip routing to new shard owner.
- Stop dual-writes and decommission old shard copy.
This pattern trades temporary write amplification for correctness.
Observability Requirements
- Per-shard QPS, latency, error rates
- Key distribution skew metrics
- Migration progress and lag
- Routing cache hit rate
- Hot key detection counters
Without per-shard metrics, sharding problems appear as “random slowness.”
Operational Checklist
- Is shard key selection aligned with access patterns?
- Do you have hot key detection and mitigation?
- Is routing deterministic and cached safely?
- Do you have a tested shard migration playbook?
- Are rebalancing operations rate-limited?
Failure Injection Test
# Sharding resilience test 1) Identify hottest shard 2) Introduce synthetic load on top keys 3) Validate hotspot detection triggers 4) Rebalance shard ownership in staging 5) Verify no missing reads or lost writes during cutover
Key Takeaways
- Sharding scales throughput by partitioning the dataset.
- Range sharding supports scans but risks hotspots.
- Hash sharding balances load but hurts range queries.
- Consistent hashing reduces key movement during scaling.
- Migration and hotspot mitigation are the real production challenges.
Sharding is an operational discipline as much as a data model decision. The strategy you choose will determine whether scaling is routine or a recurring incident generator.