Alerting & SLO Basics (Actionable Only)

Alerting without SLOs is just noise. SLO-driven alerting tells you when users are hurt and keeps pages actionable. This is the pragmatic baseline: define SLIs, set error budgets, and wire alerts that do not cry wolf.

Production incident

Your on-call engineer gets paged 40 times a night: CPU spikes, pod restarts, random warnings. After a while they start ignoring pages. Then a real outage happens and nobody reacts quickly. The problem is not that alerts exist. The problem is that alerts are not tied to user impact. You need SLO-driven alerting: a page should mean “users are suffering”, not “a graph moved”.

Symptoms

  • Alert fatigue: too many pages, low signal.
  • Slow incident detection for real user-impacting issues.
  • Debates during incidents because there is no agreed target (what is acceptable?).

Root causes

  • No SLOs and no error budgets.
  • Alerts on symptoms that do not map to user pain (CPU alone, memory alone).
  • No separation between paging alerts and ticket/notification alerts.
  • Thresholds set arbitrarily with no baseline.

Diagnosis

# Look for existing SLO/SLI definitions (docs, configs)
# In code, confirm you emit the required SLIs: request count, errors, latency histograms
grep -R "Histogram" -n .
grep -R "request_duration" -n .
grep -R "StatusCodes" -n .

Correct pattern

Define:

  • SLI: what you measure (availability, latency).
  • SLO: the target (99.9% availability, p95 < 300ms).
  • Error budget: the failure allowance the SLO leaves you over a window (a 99.9% SLO leaves 0.1% of requests).
  • Alert: when you are burning budget too fast or you already violated it.
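
To make the error budget concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and are not taken from any library; the 99.9% target and 30-day window are the example values used on this page.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without violating the SLO."""
    return (1.0 - slo_target) * total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """The same budget expressed as minutes of full outage in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO is violated."""
    budget = error_budget_requests(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("no error budget: target is 100% or there was no traffic")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # 99.9% over 1,000,000 requests leaves roughly 1,000 failures to spend.
    print(error_budget_requests(0.999, 1_000_000))
    # 99.9% over 30 days is roughly 43 minutes of full outage.
    print(allowed_downtime_minutes(0.999, 30))
    # 250 failures so far leaves about three quarters of the budget.
    print(budget_remaining(0.999, 1_000_000, 250))
```

Expressing the budget both in requests and in minutes helps in incident debates: the first suits partial failures, the second suits full outages.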

Pragmatic SLO examples

  • Availability SLO: 99.9% of requests are non-5xx over 30 days.
  • Latency SLO: 95% of requests under 300ms for a critical route group.
  • Dependency SLO: downstream timeout rate under 0.5% for the checkout dependency.
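
The availability and latency examples above can be checked from plain counters. The sketch below assumes you can already read three numbers for the window from your metrics backend: total requests, 5xx responses, and requests that finished under the latency threshold. The WindowCounts type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    total: int            # all requests in the window
    server_errors: int    # responses with a 5xx status
    under_threshold: int  # requests faster than the latency threshold (e.g. 300ms)


def availability_sli(c: WindowCounts) -> float:
    """Share of requests that were not 5xx. No traffic counts as fully available."""
    if c.total == 0:
        return 1.0
    return (c.total - c.server_errors) / c.total


def latency_sli(c: WindowCounts) -> float:
    """Share of requests that finished under the latency threshold."""
    if c.total == 0:
        return 1.0
    return c.under_threshold / c.total


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


if __name__ == "__main__":
    counts = WindowCounts(total=200_000, server_errors=150, under_threshold=191_000)
    print(meets_slo(availability_sli(counts), 0.999))  # availability target
    print(meets_slo(latency_sli(counts), 0.95))        # latency target
```

Note that the latency SLO is phrased as a share of good requests rather than as a percentile value, which makes it behave like the availability SLO and lets both feed the same error budget math.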

Alert tiers

  • Page (urgent): user impact now, budget burn is fast, action required.
  • Ticket (non-urgent): trend issues, slow burn, needs investigation but not waking people up.
  • Info: deploy markers, capacity heads-up, non-actionable noise kept out of paging.

Burn rate concept (engineer version)

  • Fast burn: at the current error rate you will spend the monthly error budget in hours. Page now.
  • Slow burn: at the current error rate you will spend it in days. Ticket now.

This avoids paging on small blips while still catching real sustained outages quickly.
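
As a rough sketch of the idea: burn rate is the observed error rate divided by the error rate the SLO allows, and a burn rate of 1 spends the budget exactly at the end of the window. The 24-hour paging threshold below is an illustrative assumption, not a recommendation from this page, and the function names are hypothetical. The sketch evaluates a single measurement and does not cover combining several lookback windows.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return error_rate / allowed


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def classify(error_rate: float, slo_target: float,
             page_within_hours: float = 24.0, window_days: int = 30) -> str:
    """Return 'page', 'ticket', or 'ok' for the current error rate.

    page_within_hours is an assumed threshold; tune it to your own SLO.
    """
    rate = burn_rate(error_rate, slo_target)
    if hours_to_exhaustion(rate, window_days) <= page_within_hours:
        return "page"    # fast burn: budget gone within hours
    if rate > 1.0:
        return "ticket"  # slow burn: budget gone before the window ends
    return "ok"          # within budget


if __name__ == "__main__":
    for observed in (0.05, 0.003, 0.0005):
        print(observed, classify(observed, slo_target=0.999))
```

With a 99.9% target, a 5% error rate burns the budget about 50 times too fast and exhausts it in roughly 14 hours, so it pages; a 0.3% error rate takes about 10 days and becomes a ticket.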

Security and performance impact

  • Performance: better alerting reduces thrash and misguided incident mitigations (like scaling at random).
  • Security: alerting on auth failures, rate limit triggers, and anomaly patterns helps detect attacks. Tie them to impact where possible.

Operational notes

  • Monitoring: track alert volume, false positive rate, MTTA, and MTTR. Alerts are products; measure them.
  • Rollout: start with 1–2 critical SLOs, not 20. Expand once they are trusted.
  • Rollback: if paging is noisy, demote to ticket and iterate. Do not disable alerting entirely.
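
One way to measure alerts as products is to record, for each page, when it fired, when it was acknowledged, and whether it required action. The sketch below assumes such records exist; the AlertRecord type and its fields are hypothetical, and here "false positive" simply means a page that needed no action.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertRecord:
    fired_at: datetime
    acknowledged_at: Optional[datetime]  # None if nobody acknowledged it
    actionable: bool                     # did a human have to do something?


def false_positive_rate(alerts: List[AlertRecord]) -> float:
    """Share of alerts that required no action."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if not a.actionable) / len(alerts)


def mean_time_to_ack(alerts: List[AlertRecord]) -> Optional[timedelta]:
    """Average delay between firing and acknowledgement (MTTA), ignoring unacknowledged alerts."""
    delays = [a.acknowledged_at - a.fired_at for a in alerts if a.acknowledged_at is not None]
    if not delays:
        return None
    return sum(delays, timedelta()) / len(delays)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 2, 0)
    history = [
        AlertRecord(start, start + timedelta(minutes=2), actionable=True),
        AlertRecord(start, start + timedelta(minutes=4), actionable=False),
        AlertRecord(start, None, actionable=False),
    ]
    print(false_positive_rate(history))
    print(mean_time_to_ack(history))
```

A rising false positive rate is the signal to demote an alert from page to ticket, as described in the rollback note above.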

Checklist

  • SLIs exist: availability, latency histograms, traffic.
  • SLOs defined for critical user journeys.
  • Error budget burn alerts exist (fast and slow).
  • Paging alerts map to user impact and have runbooks.
  • Alert volume and false positives are tracked.