Cache Observability Checklist
Cache Observability: Measure the Cache as a System, Not a Feature
In production, caches fail in ways that look like database incidents: latency spikes, timeouts, and cascading retries. Without cache observability, teams debug the database while the cache is the real root cause.
Observability goals:
- Know whether the cache is helping (hit rate alone is not enough)
- Detect correctness risks (staleness, invalidation failures)
- Detect availability risks (eviction storms, stampedes, hot keys)
- Protect the database (fallback visibility and controls)
Core Metrics You Must Track
At minimum, instrument these metrics per service and per key namespace.
1) Hit Rate and Miss Rate (But With Context)
- Overall hit rate
- Hit rate by namespace (user:, feed:, product:)
- Hit rate by endpoint (because not all endpoints are equal)
Failure mode: overall hit rate is “good” while one critical namespace collapses.
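A minimal sketch of this failure mode, assuming keys follow the `namespace:id` convention used in the examples above. The `NamespaceHitStats` class is a hypothetical helper for illustration, not part of any cache client.

```python
from collections import defaultdict


class NamespaceHitStats:
    """Counts cache hits and misses per key namespace (hypothetical helper)."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, key, hit):
        # Assumes keys look like "user:42"; the namespace is the part before ":".
        namespace = key.split(":", 1)[0]
        if hit:
            self.hits[namespace] += 1
        else:
            self.misses[namespace] += 1

    def hit_rate(self, namespace=None):
        # With no namespace given, return the overall hit rate.
        if namespace is None:
            hits, misses = sum(self.hits.values()), sum(self.misses.values())
        else:
            hits, misses = self.hits[namespace], self.misses[namespace]
        total = hits + misses
        return hits / total if total else 0.0


stats = NamespaceHitStats()
for i in range(1000):
    stats.record(f"product:{i}", hit=i < 950)  # 950 hits, 50 misses
for i in range(50):
    stats.record(f"user:{i}", hit=i < 5)       # 5 hits, 45 misses

overall = stats.hit_rate()          # 955 / 1050, roughly 0.91
user_rate = stats.hit_rate("user")  # 5 / 50
```

Here the overall rate stays near 91% while the `user` namespace sits at 10%, which is the collapse an aggregate graph hides.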
2) Cache Latency (p95/p99)
Cache latency spikes can destroy tail latency even if hit rate is high.
- GET latency p95/p99
- SET/DEL latency p95/p99
Production rule: treat cache latency as part of your user request latency budget.
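One way to check cache latency against a budget, sketched with a nearest-rank percentile over raw samples. The function names and the 5 ms budget are assumptions for illustration; a real metrics system would usually compute percentiles from histograms instead.

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cache_budget_report(get_latencies_ms, budget_ms):
    """Summarize GET latency and flag a p99 that exceeds the budget."""
    p95 = percentile(get_latencies_ms, 95)
    p99 = percentile(get_latencies_ms, 99)
    return {"p95_ms": p95, "p99_ms": p99, "over_budget": p99 > budget_ms}


# 97 fast GETs and 3 slow ones: p95 looks healthy, p99 does not.
latencies = [1.0] * 97 + [40.0] * 3
report = cache_budget_report(latencies, budget_ms=5.0)
```

With these samples the p95 is 1 ms but the p99 is 40 ms, so only the tail percentile shows the budget violation.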
3) DB Fallback QPS
The most important metric for preventing outages:
- How many requests fall back to DB due to cache misses or errors?
Track fallback separately by cause:
- Fallback due to miss
- Fallback due to cache errors or timeouts
Separating the two matters because cache errors can suddenly create a load spike on the DB even when key TTLs are stable.
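A sketch of counting fallbacks by cause at the read path. It assumes the cache client returns None on a miss and raises on errors or timeouts; `get_with_fallback` and the stand-in callables are hypothetical.

```python
from collections import Counter

fallback_counts = Counter()


def get_with_fallback(key, cache_get, db_load, counts=fallback_counts):
    """Read through the cache and record why each DB fallback happened."""
    try:
        value = cache_get(key)
    except Exception:
        # Assumption: the cache client raises on errors and timeouts.
        counts["fallback_error"] += 1
        return db_load(key)
    if value is None:
        # Assumption: the cache client returns None on a miss.
        counts["fallback_miss"] += 1
        return db_load(key)
    counts["hit"] += 1
    return value


# Stand-in callables, for illustration only.
def broken_cache(key):
    raise TimeoutError("cache timed out")


def empty_cache(key):
    return None


def load_from_db(key):
    return f"row-for-{key}"


get_with_fallback("user:1", empty_cache, load_from_db)
get_with_fallback("user:2", broken_cache, load_from_db)
get_with_fallback("user:3", broken_cache, load_from_db)
```

Exporting `fallback_miss` and `fallback_error` as separate series makes it possible to tell an expiry wave from a failing cache node.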
4) Evictions and Memory Pressure
Eviction means the cache removes keys before their TTL expires, typically because memory is full.
- Eviction rate
- Memory used vs max
- Fragmentation indicators (if available)
Eviction storms usually show up as:
- Hit rate drops
- Fallback QPS rises
- DB latency spikes
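Eviction rate is usually derived from a cumulative counter sampled at intervals (Redis, for example, reports `evicted_keys` in its INFO output). The sketch below assumes such a counter plus used and maximum memory figures; the function names are hypothetical.

```python
def eviction_rate(prev_count, curr_count, interval_s):
    """Evictions per second between two samples of a cumulative counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_count < prev_count:
        # Counter went backwards: assume the cache restarted and reset it.
        delta = curr_count
    else:
        delta = curr_count - prev_count
    return delta / interval_s


def memory_utilization(used_bytes, max_bytes):
    """Fraction of the configured memory limit in use (0.0 when unlimited)."""
    return used_bytes / max_bytes if max_bytes > 0 else 0.0


rate = eviction_rate(prev_count=1_000, curr_count=4_000, interval_s=60)
util = memory_utilization(used_bytes=3_900_000_000, max_bytes=4_000_000_000)
```

A nonzero rate together with utilization near 1.0 points at memory pressure rather than TTL expiry as the cause of falling hit rate.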
5) Hot Keys and Big Keys
Hot keys create single-node bottlenecks. Big keys create latency spikes and memory pressure.
- Top keys by QPS
- Top keys by value size
- Top keys by total memory footprint
Production rule: build dashboards that make hot keys visible, not only aggregate graphs.
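A hot-key report can be sketched from a sample of accessed keys. The source of `sampled` (an access log, a client-side hook) is an assumption, and `hot_key_report` is a hypothetical helper.

```python
from collections import Counter


def hot_key_report(accessed_keys, top_n=3):
    """Return (key, count, share_of_traffic) for the most-accessed keys."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    return [(key, n, n / total) for key, n in counts.most_common(top_n)]


# Assumption: `sampled` is a sample of keys taken from the request path.
sampled = ["feed:home"] * 80 + ["user:1"] * 15 + ["user:2"] * 5
report = hot_key_report(sampled, top_n=2)
```

Full keys are acceptable in a sampled report like this one; they should still stay out of metric labels.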
6) Stampede Signals
Detect stampedes early using:
- Sudden spike in misses for a namespace
- Sudden spike in fallback QPS
- Increase in concurrent single-flight waiters (if implemented)
- Cache restart events correlated with hit-rate collapse
7) TTL Health and Expiry Patterns
Poorly chosen TTLs cause synchronized expiry. Track:
- Distribution of TTL remaining for hot namespaces
- Percentage of keys expiring per minute
If many keys expire in the same window, expect stampede risk.
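A sketch of the expiry-distribution check, assuming the remaining TTL (in seconds) has been sampled for keys in one namespace. The 50% threshold is a hypothetical value to tune per namespace.

```python
from collections import Counter


def expiry_share_by_minute(ttl_remaining_s):
    """Map each upcoming minute to the fraction of keys expiring in it."""
    if not ttl_remaining_s:
        return {}
    buckets = Counter(int(ttl // 60) for ttl in ttl_remaining_s)
    total = len(ttl_remaining_s)
    return {minute: count / total for minute, count in sorted(buckets.items())}


def synchronized_expiry(shares, max_share=0.5):
    """List the minutes in which more than max_share of keys expire."""
    # max_share is a hypothetical threshold; tune it per namespace.
    return [minute for minute, share in shares.items() if share > max_share]


# 90 keys written together with the same TTL, plus 10 spread out.
ttls = [295] * 90 + [30, 90, 150, 210, 400, 460, 520, 580, 640, 700]
shares = expiry_share_by_minute(ttls)
risky_minutes = synchronized_expiry(shares)
```

In this sample 90% of keys expire within minute 4, which is the synchronized window that precedes a stampede.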
8) Invalidation Health
Invalidation failures are correctness bugs.
- DEL/invalidations per write (rate)
- Invalidation error rate
- Lag for async invalidation pipelines (CDC/outbox consumer lag)
Production rule: invalidation must be as observable as payment processing, not left to best-effort logging.
Tracing: Connect Cache to End-to-End Latency
Use distributed tracing to tag:
- cache.hit / cache.miss
- cache.error / cache.timeout
- db.fallback (yes/no)
- key namespace (not the full key, to avoid cardinality explosion)
This allows you to see whether p99 latency is driven by cache latency, stampede fallback, or DB contention.
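The tags above could be built by a small helper before being attached to a span. This is a sketch only: the tag names and the `cache_span_tags` function are illustrative assumptions, and the call that attaches them depends on the tracing library in use.

```python
def cache_span_tags(key, outcome, db_fallback):
    """Build low-cardinality span tags for one cache lookup."""
    allowed = {"hit", "miss", "error", "timeout"}
    if outcome not in allowed:
        raise ValueError(f"unknown outcome: {outcome}")
    return {
        # Tag the namespace only; full keys would explode tag cardinality.
        "cache.namespace": key.split(":", 1)[0],
        "cache.outcome": outcome,
        "db.fallback": db_fallback,
    }


tags = cache_span_tags("user:12345", outcome="miss", db_fallback=True)
```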
SLOs and Alerting Strategy
Alerts should be tied to impact:
- DB fallback QPS spike (primary outage predictor)
- Sustained eviction rate above 0 (memory pressure)
- Cache p99 latency exceeding budget
- Namespace hit-rate collapse for critical keys
- Invalidation errors or async invalidation lag
Avoid alerting only on hit rate; it is too coarse.
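A sketch of impact-based alert evaluation over one metrics snapshot. The `CacheSnapshot` fields and every threshold are hypothetical defaults, not recommendations, and a real rule would also require each condition to hold over a time window before firing.

```python
from dataclasses import dataclass


@dataclass
class CacheSnapshot:
    fallback_qps: float
    baseline_fallback_qps: float
    eviction_rate: float
    get_p99_ms: float
    critical_hit_rate: float
    invalidation_error_rate: float


def firing_alerts(s, p99_budget_ms=5.0, fallback_multiplier=3.0,
                  min_critical_hit_rate=0.8):
    """Return the names of alerts that fire for a snapshot."""
    alerts = []
    if s.fallback_qps > fallback_multiplier * s.baseline_fallback_qps:
        alerts.append("db_fallback_spike")
    if s.eviction_rate > 0:
        alerts.append("evictions")
    if s.get_p99_ms > p99_budget_ms:
        alerts.append("cache_p99_over_budget")
    if s.critical_hit_rate < min_critical_hit_rate:
        alerts.append("critical_namespace_hit_rate")
    if s.invalidation_error_rate > 0:
        alerts.append("invalidation_errors")
    return alerts


snapshot = CacheSnapshot(
    fallback_qps=900, baseline_fallback_qps=200, eviction_rate=0.0,
    get_p99_ms=2.0, critical_hit_rate=0.95, invalidation_error_rate=0.0,
)
alerts = firing_alerts(snapshot)
```

In this snapshot the hit rate and latency look healthy, yet the fallback spike still fires, which is why hit rate alone is a poor alert signal.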
Common Observability Mistakes
- Only overall hit rate: misses the critical namespace collapse.
- No fallback visibility: DB overload appears “mysterious.”
- No hot key reporting: problems appear as random latency spikes.
- High-cardinality metrics: full key labels explode metrics costs.
- No correlation: cache graphs are not linked to DB and request latency graphs.
Incident Response Checklist
- Check cache error rate and p99 latency.
- Check DB fallback QPS and compare to normal baseline.
- Check eviction rate and memory usage.
- Identify hot keys: is one key dominating traffic?
- Check for cache restart/flush events or recent deploys that changed cache keys.
- Check TTL expiry distribution for synchronized expiry patterns.
- If async invalidation exists, check consumer lag and error rate.
- Apply mitigations: rate-limit fallback, enable soft TTL/stale serve, pre-warm keys.
- After stabilization, write a postmortem: root cause, missing signals, preventive controls.
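The soft TTL/stale-serve mitigation named in the checklist can be sketched as an in-memory cache with two expiry thresholds. `SoftTTLCache` is a hypothetical illustration with an injectable clock, not a production implementation or any library's API.

```python
import time


class SoftTTLCache:
    """In-memory sketch: serve stale values between soft and hard expiry."""

    def __init__(self, soft_ttl_s, hard_ttl_s, clock=time.monotonic):
        self.soft_ttl_s = soft_ttl_s
        self.hard_ttl_s = hard_ttl_s
        self.clock = clock
        self._data = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._data[key] = (value, self.clock())

    def get(self, key):
        """Return (value, state) where state is 'fresh', 'stale', or 'miss'."""
        entry = self._data.get(key)
        if entry is None:
            return None, "miss"
        value, stored_at = entry
        age = self.clock() - stored_at
        if age < self.soft_ttl_s:
            return value, "fresh"
        if age < self.hard_ttl_s:
            # Stale but usable: the caller can serve it and refresh later
            # instead of sending the request to the DB right away.
            return value, "stale"
        del self._data[key]
        return None, "miss"


# A fake clock makes the behavior easy to demonstrate.
now = [0.0]
cache = SoftTTLCache(soft_ttl_s=60, hard_ttl_s=300, clock=lambda: now[0])
cache.set("feed:home", "v1")
now[0] = 120.0
value, state = cache.get("feed:home")
```

Counting `stale` responses as their own metric keeps the mitigation visible, so stale serving during an incident does not pass for a healthy hit rate.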
Summary
Cache observability must go beyond hit rate. Track latency, fallback-to-DB QPS, evictions, hot keys, TTL expiry patterns, and invalidation health. Tie cache metrics to request tracing and DB metrics to detect stampedes and prevent cascading failures. In production, the cache is an SLO-critical subsystem.