Alerting & SLO Basics (Actionable Only)
Production incident
Your on-call engineer gets paged 40 times a night: CPU spikes, pod restarts, random warnings. After a while they start ignoring pages. Then a real outage happens and nobody reacts quickly. The problem is not that alerts exist; it is that they are not tied to user impact. You need SLO-driven alerting: a page should mean “users are suffering”, not “a graph moved”.
Symptoms
- Alert fatigue: too many pages, low signal.
- Slow incident detection for real user-impacting issues.
- Debates during incidents because there is no agreed target (what is acceptable?).
Root causes
- No SLOs and no error budgets.
- Alerts on resource signals that do not map to user pain (CPU alone, memory alone).
- No separation between paging alerts and ticket/notification alerts.
- Thresholds set arbitrarily with no baseline.
Diagnosis
# Look for existing SLO/SLI definitions (docs, configs)
# In code, confirm you emit the required SLIs: request count, errors, latency histograms
grep -R "Histogram" -n .
grep -R "request_duration" -n .
grep -R "StatusCodes" -n .
Correct pattern
Define:
- SLI: what you measure (availability, latency).
- SLO: the target (99.9% availability, p95 < 300ms).
- Error budget: how much failure the SLO allows in a window (a 99.9% SLO leaves 0.1% of requests free to fail).
- Alert: when you are burning budget too fast or you already violated it.
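The relationship between an SLO and its error budget is plain arithmetic, and it can help to see it written out. The sketch below is illustrative only: the function names are hypothetical, and it assumes a request-based SLO expressed as a fraction (0.999 for 99.9%).

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without violating the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is violated."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    return 1.0 - failed_requests / budget


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

With a 99.9% target, 1,000,000 requests in the window leave a budget of about 1,000 failed requests, and a 30-day window tolerates about 43.2 minutes of full outage.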
Pragmatic SLO examples
- Availability SLO: 99.9% of requests are non-5xx over 30 days.
- Latency SLO: 95% of requests under 300ms for a critical route group.
- Dependency SLO: downstream timeout rate under 0.5% for the checkout dependency.
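As a rough illustration of how the availability and latency examples above turn into numbers, the following sketch computes both SLIs from a list of request samples. The `Request` record and the function names are assumptions made for this example; in practice these ratios would normally come from your metrics system rather than from raw request lists.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code of the response
    duration_ms: float   # server-side latency in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that completed under the latency threshold."""
    if not requests:
        raise ValueError("no requests in the window")
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target
```

Expressing the latency SLO as "fraction of requests under 300ms" rather than as a percentile keeps it in the same good/total shape as availability, so both can share one error-budget calculation.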
Alert tiers
- Page (urgent): user impact now, budget burn is fast, action required.
- Ticket (non-urgent): trend issues, slow burn, needs investigation but not waking people up.
- Info: deploy markers, capacity heads-up, non-actionable noise kept out of paging.
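One way to keep the tiers honest is to make the routing decision explicit in code or config review. This is a minimal sketch with hypothetical names (`Tier`, `classify`), assuming each alert can be labeled with two yes/no properties taken from the tier descriptions above.

```python
from enum import Enum


class Tier(Enum):
    PAGE = "page"      # wake someone up
    TICKET = "ticket"  # handle during working hours
    INFO = "info"      # record only, never notify on-call


def classify(user_impact_now: bool, action_required: bool) -> Tier:
    """Route an alert to a tier based on user impact and actionability."""
    if user_impact_now and action_required:
        return Tier.PAGE
    if action_required:
        return Tier.TICKET
    # Nothing for a human to do right now: keep it out of paging and tickets.
    return Tier.INFO
```

Under this rule a CPU spike with no user impact and no required action lands in `Tier.INFO`, which is the behavior the tiers are meant to enforce.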
Burn rate concept (engineer version)
- Fast burn: you will spend the monthly error budget in hours. Page now.
- Slow burn: you will spend it in days. Ticket now.
This avoids paging on small blips while still catching real sustained outages quickly.
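Burn rate can be sketched as the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends exactly the whole budget over the SLO window; higher values spend it proportionally faster. In the example below the function names are hypothetical, and the thresholds (14.4 for paging, 3.0 for ticketing) are illustrative assumptions to be tuned against your own traffic, not values prescribed by this page.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until the full window's budget is gone if this burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


def decide(error_ratio: float, slo_target: float,
           page_at: float = 14.4, ticket_at: float = 3.0) -> str:
    """Map a measured error ratio to 'page', 'ticket', or 'none'.

    page_at and ticket_at are assumed thresholds for illustration.
    """
    rate = burn_rate(error_ratio, slo_target)
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"
```

For a 99.9% SLO, a sustained 2% error ratio is a burn rate of roughly 20, which would exhaust a 30-day budget in about 36 hours and therefore pages. A 0.5% error ratio is a burn rate of roughly 5 (about 6 days to exhaustion) and opens a ticket instead.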
Security and performance impact
- Performance: better alerting reduces thrash and misguided mitigations during incidents (such as scaling at random).
- Security: alerting on auth failures, rate limit triggers, and anomaly patterns helps detect attacks. Tie them to impact where possible.
Operational notes
- Monitoring: track alert volume, false positive rate, MTTA, and MTTR. Alerts are products; measure them.
- Rollout: start with 1–2 critical SLOs, not 20. Expand once they are trusted.
- Rollback: if paging is noisy, demote to ticket and iterate. Do not disable alerting entirely.
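The alert-quality measures from the monitoring note can be computed from a simple log of pages. The sketch below assumes a hypothetical `PageRecord` shape and measures both MTTA and MTTR from the moment the alert fired; adapt the fields to whatever your paging tool exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class PageRecord:
    fired_at: datetime
    acked_at: datetime
    resolved_at: datetime
    actionable: bool  # False marks a false positive


def mean_delta(deltas: Sequence[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty sequence of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def alert_quality(pages: Sequence[PageRecord]) -> dict:
    """Summarize volume, false positive rate, MTTA, and MTTR for a set of pages."""
    if not pages:
        raise ValueError("no pages recorded in this period")
    return {
        "volume": len(pages),
        "false_positive_rate": sum(1 for p in pages if not p.actionable) / len(pages),
        "mtta": mean_delta([p.acked_at - p.fired_at for p in pages]),
        "mttr": mean_delta([p.resolved_at - p.fired_at for p in pages]),
    }
```

Reviewing these numbers per alert rule, rather than only in aggregate, makes it easier to see which specific pages should be demoted to tickets.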
Checklist
- SLIs exist: availability, latency histograms, traffic.
- SLOs defined for critical user journeys.
- Error budget burn alerts exist (fast and slow).
- Paging alerts map to user impact and have runbooks.
- Alert volume and false positives are tracked.