Runbooks and Operational Checklists

Write runbooks that operators actually use: prerequisites, decision trees, rollback steps, and validation checks.

Runbooks That Work

  • Write for the operator paged at 3am, not to satisfy a documentation requirement.
  • Start with one page: prerequisites, steps, verification, rollback.
  • Link to dashboards, log queries, and known failure patterns.

Runbook Structure

  1. Symptom: the alert or observation that triggers this runbook.
  2. Impact: what users experience.
  3. Prereqs: access, tools, safe modes, contacts.
  4. Decision tree: quick checks first, then deeper checks.
  5. Mitigations: reversible actions first (feature flag, rate limit).
  6. Verification: how you prove the problem is fixed.
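
One way to keep runbooks from shipping with a section missing is to treat the structure above as data and check it. The following is a minimal sketch, not a standard format: the `Runbook` class, its field names, and the `missing_sections` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical container mirroring the six sections listed above."""
    symptom: str = ""
    impact: str = ""
    prereqs: list[str] = field(default_factory=list)
    decision_tree: list[str] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]


# A draft that names the symptom and a mitigation but nothing else.
draft = Runbook(
    symptom="error_rate_5xx > 2% for 5m",
    impact="Users see failed requests",
    mitigations=["Roll back the last deploy"],
)
print(missing_sections(draft))  # ['prereqs', 'decision_tree', 'verification']
```

A check like this could run in review or CI so that a runbook with no verification section is caught before an incident, though where it runs is a team decision.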

Example: High 5xx Runbook Snippet

1) Confirm impact: error_rate_5xx > 2% for 5m
2) Check the last deploy marker and its diff
3) If the errors correlate with a deploy: roll back
4) If the errors correlate with a dependency: bypass the feature or temporarily increase the timeout
5) Verify: 5xx < 0.2% and p95 latency back to baseline
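
The snippet above is effectively a small decision tree, and writing it as code can expose branches the prose leaves open. This sketch uses the thresholds from the snippet (2% for 5 minutes, 0.2% to verify). The `Signals` fields, the "escalate" fallback when nothing correlates, and the 10% tolerance used to interpret "back to baseline" are assumptions added for illustration, not part of the original runbook.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical inputs an operator would read off dashboards."""
    error_rate_5xx: float          # fraction of requests, e.g. 0.03 for 3%
    minutes_above_threshold: int   # how long the rate has stayed above 2%
    deploy_correlated: bool
    dependency_correlated: bool


def next_action(s: Signals) -> str:
    """Walk steps 1 to 4 of the snippet and return the suggested action."""
    # Step 1: impact is confirmed only if 5xx > 2% has held for 5 minutes.
    if s.error_rate_5xx <= 0.02 or s.minutes_above_threshold < 5:
        return "no confirmed impact: keep watching"
    # Step 3: a correlated deploy is checked first.
    if s.deploy_correlated:
        return "rollback"
    # Step 4: otherwise look at dependencies.
    if s.dependency_correlated:
        return "bypass feature or increase timeout temporarily"
    # Assumption: the snippet does not say what to do here.
    return "escalate: no correlation found"


def is_verified(error_rate_5xx: float, p95_ms: float,
                baseline_p95_ms: float, tolerance: float = 0.10) -> bool:
    """Step 5: 5xx below 0.2% and p95 within an assumed tolerance of baseline."""
    return error_rate_5xx < 0.002 and p95_ms <= baseline_p95_ms * (1 + tolerance)


print(next_action(Signals(0.03, 6, True, False)))   # rollback
print(is_verified(0.001, 205.0, 200.0))             # True
```

Even if the code never runs in production, the fall-through case is a useful prompt: a real runbook should say what the operator does when neither a deploy nor a dependency explains the errors.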

Failure Modes

  • Runbook without verification: "we ran the steps," but the service is still broken.
  • Runbook with unsafe steps: destructive commands with no guardrails.
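
Both failure modes can be guarded against mechanically when runbook steps are automated. The sketch below is one possible shape, assuming each step is expressed as an action plus a check. `run_step`, its parameters, and the status strings are hypothetical names for illustration, not an existing tool's API.

```python
from typing import Callable


def run_step(
    description: str,
    action: Callable[[], None],
    verify: Callable[[], bool],
    destructive: bool = False,
    confirmed: bool = False,
) -> str:
    """Run one runbook step and report its outcome as a short status string."""
    if destructive and not confirmed:
        # Guardrail: refuse destructive actions without explicit confirmation.
        return f"BLOCKED: {description} (destructive, needs confirmation)"
    action()
    if not verify():
        # A step only counts as done once its check passes.
        return f"UNVERIFIED: {description}"
    return f"OK: {description}"


# Toy example: an in-memory flag stands in for real system state.
state = {"cache_cleared": False}


def clear_cache() -> None:
    state["cache_cleared"] = True


def cache_is_cleared() -> bool:
    return state["cache_cleared"]


print(run_step("clear cache", clear_cache, cache_is_cleared, destructive=True))
# BLOCKED: clear cache (destructive, needs confirmation)
print(run_step("clear cache", clear_cache, cache_is_cleared,
               destructive=True, confirmed=True))
# OK: clear cache
```

The point of the design is that "we ran the step" and "the step worked" are reported as different outcomes, and a destructive step cannot run by default.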