Alerting Strategy and Alert Fatigue
Alerting Goals
- Wake humans only for actionable, time-sensitive issues.
- Every alert must answer: what broke, where, impact, next step.
- Route to the right owner with correct severity.
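The "every alert must answer" rule can be checked mechanically before an alert definition ships. The sketch below is one possible shape, not a prescribed schema: the `Alert` class, its field names, and `missing_answers` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical fields, one per question an alert must answer.
    what: str       # what broke
    where: str      # environment / region / service
    impact: str     # who or what is affected
    next_step: str  # first action for the responder


def missing_answers(alert: Alert) -> list[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = ("what", "where", "impact", "next_step")
    return [name for name in required if not getattr(alert, name).strip()]
```

A check like this could run in CI against alert definitions, rejecting any alert for which `missing_answers` returns a non-empty list.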
Severity Guidelines
- SEV1: user-impacting outage, immediate response.
- SEV2: partial outage/brownout, rapid response.
- SEV3: degradation, investigate in business hours.
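The severity levels above could be encoded so that routing follows from classification rather than from ad hoc judgment. This is a minimal sketch under stated assumptions: the two boolean inputs and the channel names in `RESPONSE` ("page-immediately", "page", "ticket") are illustrative and do not come from the guidelines themselves, which only specify response urgency.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # user-impacting outage
    SEV2 = 2  # partial outage / brownout
    SEV3 = 3  # degradation


def classify(full_outage: bool, partial_outage: bool) -> Severity:
    """Classify an already-firing alert; the worst condition wins."""
    if full_outage:
        return Severity.SEV1
    if partial_outage:
        return Severity.SEV2
    return Severity.SEV3


# Assumed mapping from severity to notification channel (hypothetical names).
RESPONSE = {
    Severity.SEV1: "page-immediately",
    Severity.SEV2: "page",
    Severity.SEV3: "ticket",  # handled in business hours, no page
}
```

Keeping the mapping in one table makes it easy to review which severities are allowed to wake a human.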
Alert Message Template
Title: API p95 latency > 800ms (prod, eu-west)
Impact: ~25% traffic affected, errors rising
Signals: p95=1.4s, 5xx=2.1%, upstream timeouts
Next: check dependency "payments" latency + recent deploy
Links: dashboard, logs query, runbook, last rollout
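To keep every alert in this shape, the message can be rendered from structured fields instead of written by hand. The following is a sketch only: `render_alert` and its parameter names are hypothetical, and the values in the usage example are copied from the template above.

```python
def render_alert(title: str, impact: str, signals: list[str],
                 next_step: str, links: list[str]) -> str:
    """Render the five template fields, one per line, in a fixed order."""
    for name, value in (("title", title), ("impact", impact),
                        ("next_step", next_step)):
        if not value.strip():
            # Refuse to emit an alert that cannot be acted on.
            raise ValueError(f"alert field {name!r} must not be empty")
    return "\n".join([
        f"Title: {title}",
        f"Impact: {impact}",
        "Signals: " + ", ".join(signals),
        f"Next: {next_step}",
        "Links: " + ", ".join(links),
    ])


message = render_alert(
    title="API p95 latency > 800ms (prod, eu-west)",
    impact="~25% traffic affected, errors rising",
    signals=["p95=1.4s", "5xx=2.1%", "upstream timeouts"],
    next_step='check dependency "payments" latency + recent deploy',
    links=["dashboard", "logs query", "runbook", "last rollout"],
)
```

Raising on an empty title, impact, or next step is a design choice in this sketch: it fails loudly at definition time rather than paging someone with an alert that lacks a next action.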
Failure Modes
- Alert storms: with no dedupe/silence strategy, a single fault fans out into many pages.
- Non-actionable alerts: people ignore pages and miss real incidents.
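One way to blunt alert storms is to suppress repeats of the same alert within a time window and to honor explicit silences. The class below is a minimal in-memory sketch, not a description of any particular alerting product: `Deduplicator`, the `fingerprint` string (for example, alert name plus service plus region), and the window length are all assumptions made for illustration.

```python
class Deduplicator:
    """Suppress repeat notifications per alert fingerprint."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._last_sent: dict[str, float] = {}       # fingerprint -> last notify time
        self._silenced_until: dict[str, float] = {}  # fingerprint -> silence end time

    def silence(self, fingerprint: str, until: float) -> None:
        """Mute a fingerprint until the given timestamp (exclusive)."""
        self._silenced_until[fingerprint] = until

    def should_notify(self, fingerprint: str, now: float) -> bool:
        """Return True if a notification should go out at time `now`."""
        if now < self._silenced_until.get(fingerprint, 0.0):
            return False  # explicitly silenced
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.window:
            return False  # duplicate within the dedupe window
        self._last_sent[fingerprint] = now
        return True
```

Timestamps are passed in rather than read from the clock so the behavior is easy to test. A production setup would more likely rely on the grouping and silencing features of the alerting system already in use than on custom code like this.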