Cache Observability Checklist

Metrics, traces, and alerts for hit rate, latency, evictions, and error budgets.


Cache Observability: Measure the Cache as a System, Not a Feature

In production, caches fail in ways that look like database incidents: latency spikes, timeouts, and cascading retries. Without cache observability, teams debug the database while the cache is the real root cause.

Observability goals:

Know whether the cache is actually helping (hit rate alone is not enough)
Detect correctness risks (staleness, invalidation failures)
Detect availability risks (eviction storms, stampedes, hot keys)
Protect the database (fallback visibility and controls)

Core Metrics You Must Track

At minimum, instrument these metrics per service and per key namespace.

1) Hit Rate and Miss Rate (But With Context)

Overall hit rate
Hit rate by namespace (user:, feed:, product:)
Hit rate by endpoint (because not all endpoints are equal)

Failure mode: overall hit rate is “good” while one critical namespace collapses.
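A minimal sketch of per-namespace hit-rate tracking, to make this failure mode concrete. The `HitRateTracker` class and the sample traffic are hypothetical, and the namespace is assumed to be the key prefix before the first colon:

```python
from collections import defaultdict


def namespace_of(key: str) -> str:
    """Return the prefix before the first ':' (assumed key convention)."""
    return key.split(":", 1)[0]


class HitRateTracker:
    """Hypothetical in-process tracker; real services would export counters."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, key: str, hit: bool) -> None:
        ns = namespace_of(key)
        if hit:
            self.hits[ns] += 1
        else:
            self.misses[ns] += 1

    def hit_rate(self, ns=None) -> float:
        """Hit rate for one namespace, or overall when ns is None."""
        if ns is None:
            hits = sum(self.hits.values())
            misses = sum(self.misses.values())
        else:
            hits = self.hits[ns]
            misses = self.misses[ns]
        total = hits + misses
        return hits / total if total else 0.0


tracker = HitRateTracker()
# Illustrative traffic: feed: dominates volume and always hits.
for i in range(950):
    tracker.record(f"feed:{i}", hit=True)
# user: is low volume and has collapsed to 5 hits out of 50 requests.
for i in range(50):
    tracker.record(f"user:{i}", hit=(i < 5))

print(f"overall={tracker.hit_rate():.3f} user={tracker.hit_rate('user'):.3f}")
```

In this made-up traffic the overall hit rate stays above 95% while the user: namespace sits at 10%, which is why the per-namespace breakdown matters.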

2) Cache Latency (p95/p99)

Cache latency spikes can destroy tail latency even if hit rate is high.

GET latency p95/p99
SET/DEL latency p95/p99

Production rule: treat cache latency as part of your user request latency budget.
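As a rough illustration of why tail percentiles matter more than averages, the sketch below computes nearest-rank percentiles over a small set of made-up GET latencies. The sample values and the 5 ms budget are assumptions chosen for the example; production systems usually get percentiles from histograms in their metrics backend:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of N)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 hypothetical GET latencies in milliseconds: 97 fast, 3 slow.
get_latencies_ms = [1.0] * 97 + [40.0, 45.0, 50.0]

p50 = percentile(get_latencies_ms, 50)
p95 = percentile(get_latencies_ms, 95)
p99 = percentile(get_latencies_ms, 99)
mean_ms = sum(get_latencies_ms) / len(get_latencies_ms)

# Assumed share of the request latency budget granted to the cache.
CACHE_BUDGET_MS = 5.0
over_budget = p99 > CACHE_BUDGET_MS

print(f"mean={mean_ms:.2f} p50={p50} p95={p95} p99={p99} over={over_budget}")
```

Here the mean and even p95 look healthy, and only p99 reveals the slow requests that would eat the latency budget.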

3) DB Fallback QPS

The most important metric for preventing outages:

How many requests fall back to DB due to cache misses or errors?

Track fallback separately:

Fallback due to miss
Fallback due to cache error/timeouts

The split matters because cache errors can suddenly create a load spike on the DB even when key TTLs are stable.
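One way to get this split is to count fallbacks at the read-through call site, labeled by reason. The following is a sketch under assumptions: `get_with_fallback`, the `fallback_count` counter, and the dict-backed cache and DB are all hypothetical stand-ins for a real client and metrics library:

```python
from collections import Counter

# Hypothetical in-process counter; a real service would export this
# through its metrics library with a "reason" label.
fallback_count = Counter()


def get_with_fallback(key, cache_get, db_get):
    """Read through the cache, recording why each DB fallback happened."""
    try:
        value = cache_get(key)
    except (TimeoutError, ConnectionError):
        # The cache itself failed: this is not a miss.
        fallback_count["error"] += 1
        return db_get(key)
    if value is None:
        # The cache answered, but the key was absent.
        fallback_count["miss"] += 1
        return db_get(key)
    return value


# Stand-ins for a cache and a database.
cache = {"user:1": "alice"}
db = {"user:1": "alice", "user:2": "bob"}


def broken_cache_get(key):
    """Simulates a cache that times out on every call."""
    raise TimeoutError("cache did not answer in time")


hit = get_with_fallback("user:1", cache.get, db.get)
miss = get_with_fallback("user:2", cache.get, db.get)
failed = get_with_fallback("user:1", broken_cache_get, db.get)

print(dict(fallback_count))
```

With the two reasons counted separately, a fallback spike caused by cache timeouts is distinguishable from one caused by expiring keys.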

4) Evictions and Memory Pressure

Eviction means keys are removed before their TTL expires, usually because the cache has hit its memory limit.

Eviction rate
Memory used vs max
Fragmentation indicators (if available)

Eviction storms usually show up as:

Hit rate drops
Fallback QPS rises
DB latency spikes
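Many caches report evictions as a cumulative counter (Redis, for example, exposes `evicted_keys`, `used_memory`, and `maxmemory` through `INFO`), so the rate has to be derived from two samples. The helper names and sample numbers below are assumptions for illustration:

```python
def eviction_rate(prev_evicted: int, curr_evicted: int, interval_s: float) -> float:
    """Evictions per second between two samples of a cumulative counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_evicted - prev_evicted
    if delta < 0:
        # The counter went backwards, so assume the cache restarted
        # and the current value counts evictions since the restart.
        delta = curr_evicted
    return delta / interval_s


def memory_utilization(used_bytes: int, max_bytes: int) -> float:
    """Fraction of the memory limit in use; 0.0 when no limit is set."""
    return used_bytes / max_bytes if max_bytes > 0 else 0.0


# Two hypothetical samples taken 60 seconds apart.
rate = eviction_rate(prev_evicted=1000, curr_evicted=1600, interval_s=60)
after_restart = eviction_rate(prev_evicted=1000, curr_evicted=30, interval_s=60)
utilization = memory_utilization(used_bytes=900, max_bytes=1000)

print(rate, after_restart, utilization)
```

Handling the counter reset explicitly avoids a negative rate on the dashboard right after a cache restart, which is exactly when the graph needs to be trustworthy.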

5) Hot Keys and Big Keys

Hot keys create single-node bottlenecks. Big keys create latency spikes and memory pressure.

Top keys by QPS
Top keys by value size
Top keys by total memory footprint

Production rule: build dashboards that make hot keys visible, not only aggregate graphs.
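A small sketch of top-N key reporting follows. `KeyStats` is a hypothetical helper that counts every access exactly; at production volume this would more likely be fed by sampling a fraction of requests, which is an assumption about your setup rather than something this page prescribes:

```python
from collections import Counter


class KeyStats:
    """Hypothetical exact tracker of per-key request counts and value sizes."""

    def __init__(self):
        self.requests = Counter()
        self.value_bytes = {}

    def record(self, key: str, size_bytes: int) -> None:
        self.requests[key] += 1
        self.value_bytes[key] = size_bytes  # keep the latest observed size

    def top_by_requests(self, n: int):
        return self.requests.most_common(n)

    def top_by_size(self, n: int):
        ranked = sorted(self.value_bytes.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


stats = KeyStats()
# Illustrative traffic: one hot key and one big key.
for _ in range(500):
    stats.record("feed:home", size_bytes=2_000)
for _ in range(20):
    stats.record("user:1", size_bytes=300)
for _ in range(5):
    stats.record("product:catalog", size_bytes=5_000_000)

print(stats.top_by_requests(1), stats.top_by_size(1))
```

The hot key and the big key are different keys here, which is the usual case and the reason both rankings belong on the dashboard.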

6) Stampede Signals

Detect stampedes early using:

Sudden spike in misses for a namespace
Sudden spike in fallback QPS
Increase in concurrent single-flight waiters (if implemented)
Cache restart events correlated with hit-rate collapse
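The "sudden spike" signals above can be approximated by comparing each new sample against a rolling baseline. This is one simple heuristic, not the only option; the `SpikeDetector` class, its window, factor, and floor are all assumed values for illustration:

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags a sample that exceeds factor x the mean of recent normal samples."""

    def __init__(self, window: int = 5, factor: float = 3.0, min_value: float = 50.0):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_value = min_value  # ignore spikes on near-idle namespaces

    def observe(self, value: float) -> bool:
        is_spike = False
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            is_spike = value >= self.min_value and value > self.factor * baseline
        if not is_spike:
            # Keep spike samples out of the baseline so an ongoing
            # stampede does not become the new normal.
            self.history.append(value)
        return is_spike


# Hypothetical misses per minute for one namespace.
misses_per_minute = [100, 110, 95, 105, 90, 102, 900]
detector = SpikeDetector()
flags = [detector.observe(m) for m in misses_per_minute]

print(flags)
```

The same detector can be pointed at fallback QPS or single-flight waiter counts; only the input series changes.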

7) TTL Health and Expiry Patterns

Poorly distributed TTLs cause synchronized expiry. Track:

Distribution of TTL remaining for hot namespaces
Percentage of keys expiring per minute

If many keys expire in the same window, expect stampede risk.
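To check for clustering, bucket the remaining TTLs of sampled keys by expiry window and look at the largest bucket. The sketch below assumes you can sample remaining TTLs for a namespace; the function names and both sample sets are made up:

```python
from collections import Counter


def expiry_histogram(ttls_remaining_s, bucket_s: int = 60) -> Counter:
    """Count keys per expiry window (window index = remaining TTL // bucket_s)."""
    return Counter(int(ttl // bucket_s) for ttl in ttls_remaining_s)


def max_bucket_share(ttls_remaining_s, bucket_s: int = 60) -> float:
    """Largest fraction of sampled keys that expire in the same window."""
    if not ttls_remaining_s:
        return 0.0
    hist = expiry_histogram(ttls_remaining_s, bucket_s)
    return max(hist.values()) / len(ttls_remaining_s)


# 100 keys written together with the same TTL: all expire within 10 seconds.
synchronized = [300 + (i % 10) for i in range(100)]
# 100 keys whose expiries are spread evenly across an hour.
spread = [i * 36 for i in range(100)]

sync_share = max_bucket_share(synchronized)
spread_share = max_bucket_share(spread)

print(sync_share, spread_share)
```

A share close to 1.0 means nearly every sampled key expires in the same minute, which is the synchronized pattern to alert on.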

8) Invalidation Health

Invalidation failures are correctness bugs.

DEL/invalidations per write (rate)
Invalidation error rate
Lag for async invalidation pipelines (CDC/outbox consumer lag)

Production rule: make invalidation as observable as payment processing; best-effort logging is not enough.
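The three invalidation signals reduce to a few ratios. The sketch below uses a hypothetical `InvalidationStats` record and assumes exactly one invalidation is expected per write, which may not hold if a write touches several keys:

```python
from dataclasses import dataclass


@dataclass
class InvalidationStats:
    """Hypothetical counters collected over one reporting interval."""
    writes: int = 0
    invalidations: int = 0
    errors: int = 0

    def per_write(self) -> float:
        """Invalidations issued per write; below the expected ratio means some were skipped."""
        return self.invalidations / self.writes if self.writes else 0.0

    def error_rate(self) -> float:
        """Fraction of issued invalidations that failed."""
        return self.errors / self.invalidations if self.invalidations else 0.0


def consumer_lag_s(latest_event_ts: float, last_processed_ts: float) -> float:
    """Seconds the async invalidation consumer is behind the newest event."""
    return max(0.0, latest_event_ts - last_processed_ts)


# Illustrative interval: 10 writes never triggered an invalidation,
# and 9 of the 90 that were issued failed.
stats = InvalidationStats(writes=100, invalidations=90, errors=9)
lag = consumer_lag_s(latest_event_ts=1000.0, last_processed_ts=940.0)

print(stats.per_write(), stats.error_rate(), lag)
```

Each of these numbers bounds how long stale data can be served, so they are correctness signals rather than performance signals.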

Tracing: Connect Cache to End-to-End Latency

Use distributed tracing to tag:

cache.hit / cache.miss
cache.error / cache.timeout
db.fallback (yes/no)
key namespace (not the full key, to avoid cardinality explosion)

This allows you to see whether p99 latency is driven by cache latency, stampede fallback, or DB contention.
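A tracing-library-agnostic sketch of building these tags is shown below. The function name and the `cache.namespace` attribute name are assumptions; the resulting dictionary would be attached to the current span through whatever attribute API your tracer provides:

```python
ALLOWED_OUTCOMES = {"hit", "miss", "error", "timeout"}


def cache_span_attributes(key: str, outcome: str, fell_back: bool) -> dict:
    """Build low-cardinality span attributes for one cache lookup."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown cache outcome: {outcome!r}")
    return {
        # Only the prefix before ':' is recorded, never the full key.
        "cache.namespace": key.split(":", 1)[0],
        f"cache.{outcome}": True,
        "db.fallback": fell_back,
    }


attrs = cache_span_attributes("user:42", outcome="miss", fell_back=True)
print(attrs)
```

Because the user ID never appears in the attributes, the number of distinct tag values stays bounded by the number of namespaces.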

SLOs and Alerting Strategy

Alerts should be tied to impact:

DB fallback QPS spike (primary outage predictor)
Sustained eviction rate above zero (memory pressure)
Cache p99 latency exceeding budget
Namespace hit-rate collapse for critical keys
Invalidation errors or async invalidation lag

Avoid alerting only on hit rate; it is too coarse.
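The alert list above can be expressed as a small rule set over a metrics snapshot. Everything here is a sketch: the `CacheSnapshot` fields, the alert names, and every threshold are hypothetical and would need tuning against your own baselines, and most teams would encode these rules in their alerting system rather than application code:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheSnapshot:
    """Hypothetical metrics snapshot for one evaluation interval."""
    fallback_qps: float
    baseline_fallback_qps: float
    recent_eviction_rates: List[float]  # one value per recent interval
    cache_p99_ms: float
    critical_ns_hit_rate: float
    invalidation_error_rate: float


def evaluate_alerts(s: CacheSnapshot, fallback_factor: float = 3.0,
                    p99_budget_ms: float = 5.0, min_hit_rate: float = 0.8,
                    max_invalidation_error_rate: float = 0.01) -> List[str]:
    """Return the names of the alerts that should fire (thresholds are assumed)."""
    alerts = []
    if s.fallback_qps > fallback_factor * s.baseline_fallback_qps:
        alerts.append("db_fallback_spike")
    # "Sustained" here means every recent interval saw evictions.
    if s.recent_eviction_rates and all(r > 0 for r in s.recent_eviction_rates):
        alerts.append("sustained_evictions")
    if s.cache_p99_ms > p99_budget_ms:
        alerts.append("cache_p99_over_budget")
    if s.critical_ns_hit_rate < min_hit_rate:
        alerts.append("namespace_hit_rate_collapse")
    if s.invalidation_error_rate > max_invalidation_error_rate:
        alerts.append("invalidation_errors")
    return alerts


healthy = CacheSnapshot(120, 100, [0.0, 0.0, 2.0], 2.5, 0.97, 0.0)
incident = CacheSnapshot(900, 100, [5.0, 8.0, 12.0], 3.0, 0.40, 0.0)

print(evaluate_alerts(healthy))
print(evaluate_alerts(incident))
```

Note that the healthy snapshot contains one interval with evictions but does not alert, since a single blip is not sustained memory pressure.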

Common Observability Mistakes

Only overall hit rate: misses the critical namespace collapse.
No fallback visibility: DB overload appears “mysterious.”
No hot key reporting: problems appear as random latency spikes.
High-cardinality metrics: full key labels explode metrics costs.
No correlation: cache graphs are not linked to DB and request latency graphs.

Incident Response Checklist

Check cache error rate and p99 latency.
Check DB fallback QPS and compare to normal baseline.
Check eviction rate and memory usage.
Identify hot keys: is one key dominating traffic?
Check for cache restart/flush events or deploys that changed cache keys.
Check TTL expiry distribution for synchronized expiry patterns.
If async invalidation exists, check consumer lag and error rate.
Apply mitigations: rate-limit fallback, enable soft TTL/stale serve, pre-warm keys.
After stabilization, write a postmortem: root cause, missing signals, preventive controls.

Summary

Cache observability must go beyond hit rate. Track latency, fallback-to-DB QPS, evictions, hot keys, TTL expiry patterns, and invalidation health. Tie cache metrics to request tracing and DB metrics to detect stampedes and prevent cascading failures. In production, the cache is an SLO-critical subsystem.
