DISTRIBUTED-SYSTEMS-ENGINEERING Contents

Sharding Strategies (Range, Hash, Directory)

Sharding distributes data across multiple nodes to scale throughput and storage. This lesson covers key sharding strategies, rebalancing realities, hotspot risks, and production-safe shard migration patterns.

On this page

Data Sharding Strategies: Scaling by Partitioning Data

Replication improves availability by keeping copies of the same data. Sharding (partitioning) improves scalability by distributing different subsets of data across nodes. Without sharding, a single primary or single storage node becomes the bottleneck. With sharding, capacity grows with the number of shards.

Sharding is deceptively hard. The partitioning strategy determines performance, hotspot behavior, operational complexity, and the blast radius of failures.

What Is a Shard?

A shard is a partition of the dataset. Each shard is responsible for a subset of keys or entities. In most production systems, each shard is itself replicated (for availability). This creates a common architecture:

  • Dataset is split into shards (partitioning)
  • Each shard has replicas (replication)
  • Routing layer maps keys to shard owners

Correctness and performance depend on predictable routing and safe rebalancing.

Strategy 1: Range-Based Sharding

Range sharding assigns contiguous key ranges to shards. Example: users 1–1M on shard A, 1M–2M on shard B.

Advantages

  • Efficient range scans
  • Good locality for ordered queries

Disadvantages

  • Hotspot risk if writes concentrate in a range (e.g., newest timestamps)
  • Rebalancing requires splitting ranges carefully

Production Hotspot Example

If keys are based on creation time, the newest range receives most writes, overloading a single shard.

Strategy 2: Hash-Based Sharding

Hash sharding computes a hash of the key and maps it to a shard. Example:

shard = hash(user_id) % N

Advantages

  • Even distribution for random keys
  • Reduced hotspot probability

Disadvantages

  • Range queries become expensive
  • Changing N causes massive reshuffling without careful design

Strategy 3: Consistent Hashing

Consistent hashing reduces reshuffling when nodes are added or removed. Keys map onto a ring; nodes own segments of the ring.

Why It Matters

  • Adding one node moves only a fraction of keys
  • Removing a node moves only its segment

Consistent hashing is widely used in caches and leaderless systems.

Reference: Consistent hashing

Strategy 4: Directory-Based Sharding

A central directory maps keys (or tenants) to shards. This supports flexible placement (useful for multi-tenant systems).

Advantages

  • Fine-grained control
  • Tenant-based placement and isolation

Disadvantages

  • Directory becomes a critical dependency
  • Requires high availability and caching

Production Scenario: Shard Hotspot Meltdown

Symptom

Overall cluster CPU is moderate, but one shard experiences 100% CPU and latency spikes. Error rates rise for a subset of users.

Root Cause

Shard key distribution is skewed. A high-traffic tenant or “celebrity key” concentrates requests on one shard.

Diagnosis

  • Per-shard QPS distribution highly imbalanced.
  • Hot shard has deeper queues and higher tail latency.
  • Key-level metrics show top keys dominating traffic.

Resolution

  • Introduce “sub-sharding” for hot keys (salting).
  • Repartition tenant data across multiple shards.
  • Add caching and request coalescing for hot reads.

Rebalancing and Shard Migration Realities

Shards are not static. You will add nodes, scale capacity, and rebalance. Migration is where many production incidents occur.

Key concerns:

  • Maintaining correctness during movement
  • Minimizing impact on latency
  • Ensuring replication catches up before cutover

Safe Shard Migration Pattern

A common safe approach is dual-write + cutover:

  1. Start copying shard data to new owner.
  2. Enable dual-writes (old + new) for keys in migration range.
  3. Catch up replication / change stream to remove lag.
  4. Flip routing to new shard owner.
  5. Stop dual-writes and decommission old shard copy.

This pattern trades temporary write amplification for correctness.

Observability Requirements

  • Per-shard QPS, latency, error rates
  • Key distribution skew metrics
  • Migration progress and lag
  • Routing cache hit rate
  • Hot key detection counters

Without per-shard metrics, sharding problems appear as “random slowness.”

Operational Checklist

  • Is shard key selection aligned with access patterns?
  • Do you have hot key detection and mitigation?
  • Is routing deterministic and cached safely?
  • Do you have a tested shard migration playbook?
  • Are rebalancing operations rate-limited?

Failure Injection Test

# Sharding resilience test
1) Identify hottest shard
2) Introduce synthetic load on top keys
3) Validate hotspot detection triggers
4) Rebalance shard ownership in staging
5) Verify no missing reads or lost writes during cutover

Key Takeaways

  • Sharding scales throughput by partitioning the dataset.
  • Range sharding supports scans but risks hotspots.
  • Hash sharding balances load but hurts range queries.
  • Consistent hashing reduces key movement during scaling.
  • Migration and hotspot mitigation are the real production challenges.

Sharding is an operational discipline as much as a data model decision. The strategy you choose will determine whether scaling is routine or a recurring incident generator.