Alerting & SLO Basics (Actionable Only)

Alerting without SLOs is just noise. SLO-driven alerting tells you when users are hurt and keeps pages actionable. This is the pragmatic baseline: define SLIs, set error budgets, and wire alerts that do not cry wolf.

Production incident

Your on-call gets paged 40 times a night: CPU spikes, pod restarts, random warnings. After a while they start ignoring pages. Then a real outage happens and nobody reacts quickly. The problem is not that alerts exist. The problem is that alerts are not tied to user impact. You need SLO-driven alerting: a page should mean “users are suffering”, not “a graph moved”.

Symptoms

Alert fatigue: too many pages, low signal.
Slow incident detection for real user-impacting issues.
Debates during incidents because there is no agreed target (what is acceptable?).

Root causes

No SLOs and no error budgets.
Alerts on symptoms that do not map to user pain (CPU alone, memory alone).
No separation between paging alerts and ticket/notification alerts.
Thresholds set arbitrarily with no baseline.

Diagnosis

# Look for existing SLO/SLI definitions (docs, configs)
grep -RniwE "slo|sli|error budget" .
# In code, confirm you emit the required SLIs: request count, errors, latency histograms
grep -R "Histogram" -n .
grep -R "request_duration" -n .
grep -R "StatusCodes" -n .

Correct pattern

Define:

SLI: what you measure (availability, latency).
SLO: the target (99.9% availability, p95 < 300ms).
Error budget: how much you can fail in a window, which is 100% minus the SLO (a 99.9% target leaves 0.1% of requests in the window).
Alert: when you are burning budget too fast or have already violated the SLO.
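
To make the error budget concrete, here is a minimal sketch that turns an SLO target into numbers you can reason about. The function name, the returned keys, and the 10 million request volume are illustrative assumptions, not something prescribed above.

```python
def error_budget(slo_target: float, window_days: int, total_requests: int) -> dict:
    """Translate an SLO target into an error budget for one window.

    Hypothetical helper for illustration; names and units are assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction such as 0.999")
    budget_fraction = 1.0 - slo_target        # share of requests allowed to fail
    window_minutes = window_days * 24 * 60
    return {
        "budget_fraction": budget_fraction,
        # How many requests may fail before the SLO is violated.
        "allowed_failed_requests": round(total_requests * budget_fraction),
        # Equivalent budget expressed as minutes of total outage.
        "allowed_full_outage_minutes": window_minutes * budget_fraction,
    }


# Assumed traffic: 10 million requests in a 30-day window.
budget = error_budget(slo_target=0.999, window_days=30, total_requests=10_000_000)
print(budget)
```

Expressing the budget both as failed requests and as outage minutes helps during incidents: the first maps to what the metrics show, the second to what people intuitively argue about.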

Pragmatic SLO examples

Availability SLO: 99.9% of requests are non-5xx over 30 days.
Latency SLO: 95% of requests under 300ms for a critical route group.
Dependency SLO: downstream timeout rate under 0.5% for checkout dependency.
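
The availability and latency examples above can be checked against raw request data with a sketch like the following. It assumes each request is recorded with a status code and a duration; the `Request` type and helper names are hypothetical, and in practice these ratios usually come from your metrics backend rather than in-process lists.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    duration_ms: float   # server-side latency


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the SLO
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests faster than the threshold.

    Simplification: failed requests are counted too; many teams exclude them.
    """
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    return sli >= target


# Made-up sample: 997 healthy requests, 2 slow ones, 1 server error.
sample = (
    [Request(200, 120.0)] * 997
    + [Request(200, 450.0)] * 2
    + [Request(503, 30.0)]
)
availability_ok = meets_slo(availability_sli(sample), 0.999)
latency_ok = meets_slo(latency_sli(sample), 0.95)
print(availability_ok, latency_ok)
```

Both SLIs are plain good-over-total ratios, which is what makes them easy to compare against a target and to feed into budget math.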

Alert tiers

Page (urgent): user impact now, budget burn is fast, action required.
Ticket (non-urgent): trend issues, slow burn, needs investigation but not waking people up.
Info: deploy markers, capacity heads-up, non-actionable noise kept out of paging.

Burn rate concept (engineer version)

Fast burn: you will spend the monthly error budget in hours. Page now.
Slow burn: you will spend it in days. Ticket now.

This avoids paging on small blips while still catching real sustained outages quickly.
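
A rough sketch of the fast/slow distinction: burn rate is the observed error rate divided by the budget fraction, and the time to exhaustion follows from it. The 24-hour paging cutoff and the function names are assumptions chosen for illustration; the sketch also ignores budget already spent and assumes the current error rate continues.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget_fraction = 1.0 - slo_target
    return error_rate / budget_fraction


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def classify(error_rate: float, slo_target: float,
             window_days: int = 30, page_within_hours: float = 24.0) -> str:
    """Map an observed error rate to an alert tier.

    page_within_hours is an assumed cutoff, not a recommendation.
    """
    rate = burn_rate(error_rate, slo_target)
    if hours_to_exhaustion(rate, window_days) <= page_within_hours:
        return "page"      # fast burn: budget gone within hours
    if rate > 1.0:
        return "ticket"    # slow burn: budget gone before the window ends
    return "ok"            # within budget


print(classify(0.05, 0.999))    # 5% errors against a 0.1% budget
print(classify(0.003, 0.999))   # 0.3% errors
print(classify(0.0005, 0.999))  # 0.05% errors
```

Real setups commonly evaluate the error rate over both a short and a long window before paging, so a brief spike that has already recovered does not wake anyone up.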

Security and performance impact

Performance: alerts tied to user impact reduce on-call thrash and guesswork mitigations during incidents (like scaling at random).
Security: alerting on auth failures, rate limit triggers, and anomaly patterns helps detect attacks. Tie them to impact where possible.

Operational notes

Monitoring: track alert volume, false positive rate, mean time to acknowledge (MTTA), and mean time to recover (MTTR). Alerts are products; measure them.
Rollout: start with 1–2 critical SLOs, not 20. Expand once they are trusted.
Rollback: if paging is noisy, demote to ticket and iterate. Do not disable alerting entirely.
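
Measuring alerts as a product can start very small. The sketch below assumes you can export pages with a fired time, an acknowledged time, and a post-hoc "was this actionable" flag; the `Page` record and its field names are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Page:
    fired_at: datetime
    acknowledged_at: datetime
    actionable: bool  # judged after the fact, e.g. in a weekly on-call review


def alert_quality(pages: Sequence[Page]) -> dict:
    """Summarize page volume, false positive rate, and MTTA in minutes."""
    if not pages:
        return {"volume": 0, "false_positive_rate": 0.0, "mtta_minutes": 0.0}
    false_positives = sum(1 for p in pages if not p.actionable)
    ack_minutes = [
        (p.acknowledged_at - p.fired_at).total_seconds() / 60 for p in pages
    ]
    return {
        "volume": len(pages),
        "false_positive_rate": false_positives / len(pages),
        "mtta_minutes": mean(ack_minutes),
    }


# Made-up week: four pages, only one of them actionable.
start = datetime(2024, 1, 1, 2, 0)
pages = [
    Page(start, start + timedelta(minutes=2), actionable=True),
    Page(start, start + timedelta(minutes=4), actionable=False),
    Page(start, start + timedelta(minutes=6), actionable=False),
    Page(start, start + timedelta(minutes=8), actionable=False),
]
report = alert_quality(pages)
print(report)
```

A high false positive rate on a paging alert is the signal to demote it to a ticket and iterate, as described in the rollback note.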

Checklist

SLIs exist: availability, latency histograms, traffic.
SLOs defined for critical user journeys.
Error budget burn alerts exist (fast and slow).
Paging alerts map to user impact and have runbooks.
Alert volume and false positives are tracked.
