Shard Key Design
Shard Key Design: The Most Expensive Mistake You Can Make
In a sharded system, the shard key decides where data lives and how requests are routed. If you choose the wrong shard key, you will eventually be forced into painful migrations, cross-shard queries, and correctness compromises.
A good shard key minimizes cross-shard operations and distributes load evenly under real traffic.
What a Shard Key Must Achieve
- Distribution: avoid hotspots; spread reads and writes
- Locality: keep related data together to avoid cross-shard joins
- Predictable routing: application can compute shard quickly
- Stability: shard key should not change (immutability)
Cardinality: High Enough to Distribute
Shard keys with low cardinality create hotspots.
- Bad: status (ACTIVE/INACTIVE)
- Bad: country (few values, skewed)
- Bad: plan_type (FREE/PRO)
Even if you hash these values, skew remains because traffic is not uniform.
Skew: Real Traffic Is Not Uniform
Distribution is not about number of distinct values only. It is about how frequently each value is accessed.
Example: if 60 percent of traffic is for one tenant, tenant_id is a hotspot key unless you do special handling for large tenants.
Locality: Keep What You Join Together
If your application frequently joins or queries related rows together, try to keep them on the same shard.
Common locality-friendly keys:
- tenant_id for SaaS multi-tenant systems
- user_id for user-centric systems
But locality can conflict with distribution. A few massive tenants can break tenant_id sharding.
Immutability: Keys That Change Will Break You
A shard key should be stable. If it changes, the row must move shards, which is expensive and correctness-sensitive.
- Bad key example: user_email (can change)
- Bad key example: region (user moves region, geo routing changes)
If a key must change, design a stable surrogate key for sharding (e.g., user_id) and treat mutable attributes as normal columns.
Query Pattern Alignment
The shard key should match your highest-QPS access patterns.
Ask:
- What is the hottest endpoint?
- What parameters does it use to fetch data?
- Can those parameters route to a single shard?
If your hottest endpoint becomes scatter-gather across shards, your system will fail under scale.
Hotspot Keys and “Big Tenant” Strategy
In multi-tenant systems, tenant_id is common. But a few large tenants often dominate traffic.
Mitigation strategies:
- Isolate large tenants into dedicated shards
- Use sub-sharding within a tenant (tenant_id + hash(user_id))
- Split by functional domains (billing vs events)
Do not pretend all tenants are equal; they are not.
Range vs Hash Considerations
Choosing the key is one step; choosing range vs hash is another.
- If you need ordered queries by key, range may help but risks monotonic hotspots.
- If you need uniform distribution, hash helps but breaks range queries.
A common compromise is stable key + hashing + secondary indexing per shard.
Global Uniqueness Constraints
Sharding breaks global uniqueness unless your key strategy embeds uniqueness.
Examples:
- Auto-increment ids cannot be global without coordination.
- Unique usernames across shards require central registry or coordination.
Production patterns:
- Use UUID/ULID-like ids
- Use a dedicated “registry” service/table for globally unique claims
- Namespace uniqueness by tenant when possible
Failure Modes in Production
- Hot shard: shard key correlates with traffic spikes.
- Scatter-gather overload: hottest queries require querying all shards.
- Resharding necessity: key chosen cannot support growth.
- Mutable key migration: frequent shard moves cause correctness bugs.
- Big tenant collapse: one tenant saturates a shard.
- Global uniqueness bugs: duplicates appear across shards.
Operational Checklist
- List top read and write endpoints by QPS and latency budget.
- Choose a shard key that routes hot endpoints to a single shard.
- Verify shard key has high cardinality and low traffic skew.
- Test key distribution with realistic production-like traffic.
- Ensure shard key is immutable; do not shard by mutable attributes.
- Plan strategy for big tenants/hot keys (dedicated shard or sub-sharding).
- Decide how you will handle global uniqueness and id generation.
- Decide how analytics and cross-shard aggregates will be computed (offline pipeline).
- Build routing observability: log shard id per request.
- Document the key decision and the expected failure modes.
Summary
Shard key choice determines routing, locality, hotspots, and future migration cost. A good key is stable, aligns with hottest query patterns, distributes load under skewed traffic, and supports your uniqueness requirements. Treat shard key design as a top-level architecture decision, not an implementation detail.