Incident Lifecycle and Roles

Run incidents end-to-end with clear roles, timelines, and comms so you stabilize fast and learn reliably.

Incident Lifecycle (Ops View)

  1. Detect: alert or user report confirms impact.
  2. Triage: scope + severity + suspected component.
  3. Stabilize: stop the bleeding (rollback, disable feature, shed load).
  4. Recover: restore full service + verify.
  5. Communicate: internal + external updates on a cadence (starts at declaration and runs alongside the other steps).
  6. Learn: postmortem + action items + follow-up.
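The six steps above can be sketched as an ordered enumeration, which is handy if incident tooling needs to record which phase an incident is in. This is a minimal illustration only: the names Phase and next_phase are hypothetical, and treating the steps as a strict sequence is a simplification, since in practice steps can overlap.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Lifecycle phases, numbered as in the list above (hypothetical model)."""
    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    RECOVER = 4
    COMMUNICATE = 5
    LEARN = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase listed after `current`, or None after LEARN."""
    if current is Phase.LEARN:
        return None
    return Phase(current + 1)
```

For example, next_phase(Phase.TRIAGE) returns Phase.STABILIZE, and next_phase(Phase.LEARN) returns None.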

Roles

  • Incident Commander (IC): coordinates the response, makes the final decisions.
  • Ops/Comms (scribe): posts updates to the incident channel and status page, keeps the timestamped timeline.
  • Tech Lead: directs debugging, assigns tasks.
  • Subject Matter Experts: service owners, SRE, DB, network.

Minimal Incident Checklist

  1. Declare incident + open an incident channel.
  2. Assign IC + Ops/Comms (scribe) + tech lead.
  3. State the symptom in one sentence (what users see).
  4. Set a 10–15 min update cadence.
  5. Start stabilization work immediately (rollback/mitigation).
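The update cadence in step 4 is easy to let slip during active debugging. As a rough sketch, a small helper can tell whoever handles comms when the next update is due. The function names and the 15-minute default are assumptions for illustration (the default is simply the upper end of the 10–15 min range above), not part of any existing tool.

```python
from datetime import datetime, timedelta, timezone


def next_update_due(last_update: datetime, cadence_min: int = 15) -> datetime:
    """Return the time by which the next status update should be posted."""
    if cadence_min <= 0:
        raise ValueError("cadence_min must be positive")
    return last_update + timedelta(minutes=cadence_min)


def is_overdue(last_update: datetime, now: datetime, cadence_min: int = 15) -> bool:
    """True if more than `cadence_min` minutes have passed since the last update."""
    return now > next_update_due(last_update, cadence_min)


if __name__ == "__main__":
    # Arbitrary example timestamps, in UTC to avoid timezone confusion.
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 12, 20, tzinfo=timezone.utc)
    print(next_update_due(last, cadence_min=10).isoformat())
    print(is_overdue(last, now, cadence_min=10))
```

Passing timestamps in explicitly, rather than reading the clock inside the functions, keeps the check easy to test and to reuse in a chat bot or cron job.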

Timestamping Template

T+00: Detected elevated 5xx in prod (region=eu)
T+03: IC assigned, incident channel created
T+06: Suspected dependency: payments latency spike
T+10: Mitigation: disable feature flag "new-checkout"
T+15: Error rate down, monitoring recovery
T+30: Service stable, entering monitoring period
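Entries in this format can be generated rather than typed by hand. The sketch below assumes that T+NN means whole minutes elapsed since detection. The function name format_entry and the example date are hypothetical.

```python
from datetime import datetime, timezone


def format_entry(start: datetime, at: datetime, note: str) -> str:
    """Format one timeline line as 'T+MM: note'.

    MM is the number of whole minutes between `start` (detection time)
    and `at` (when the event happened), zero-padded to two digits.
    """
    minutes = int((at - start).total_seconds() // 60)
    if minutes < 0:
        raise ValueError("entry predates incident start")
    return f"T+{minutes:02d}: {note}"


if __name__ == "__main__":
    # Arbitrary example date; only the offsets matter.
    start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    assigned = datetime(2024, 1, 1, 9, 3, tzinfo=timezone.utc)
    print(format_entry(start, assigned, "IC assigned, incident channel created"))
    # prints "T+03: IC assigned, incident channel created"
```

Storing absolute timestamps and deriving the T+ offset at display time keeps the timeline correct even if the detection time is revised later in the postmortem.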