Runbooks and Operational Checklists
On this page
Runbooks That Work
- Written for the 3am operator, not for documentation.
- One page first: prerequisites, steps, verification, rollback.
- Link to dashboards, logs queries, and known failure patterns.
Runbook Structure
- Symptom: what triggers this runbook?
- Impact: what users experience.
- Prereqs: access, tools, safe modes, contacts.
- Decision tree: quick checks, then deeper checks.
- Mitigations: reversible first (feature flag, rate limit).
- Verification: how you prove it is fixed.
Example: High 5xx Runbook Snippet
1) Confirm impact: error_rate_5xx > 2% for 5m 2) Check last deploy marker + diffs 3) If deploy correlated: rollback 4) If dependency correlated: bypass feature / increase timeout temporarily 5) Verify: 5xx < 0.2% and p95 latency back to baseline
Failure Modes
- Runbook without verification: "we did steps" but service still broken.
- Runbook with unsafe steps: destructive commands without guardrails.