Incident Lifecycle and Roles

Run incidents end-to-end with clear roles, timelines, and comms so you stabilize fast and learn reliably.

Incident Lifecycle (Ops View)

Detect: alert or user report confirms impact.
Triage: scope + severity + suspected component.
Stabilize: stop the bleeding (rollback, disable feature, shed load).
Recover: restore full service + verify.
Communicate: internal + external updates on a fixed cadence (e.g. every 10–15 min).
Learn: postmortem + action items + follow-up.
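
The lifecycle above can be sketched as an ordered set of phases. This is a minimal illustration, not a prescribed tool: the names `Phase` and `next_phase` are hypothetical, and the strictly linear ordering simply mirrors the list order here (in practice, communication often runs alongside the other phases).

```python
from enum import Enum


class Phase(Enum):
    """Incident phases, in the order listed on this page."""
    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    RECOVER = 4
    COMMUNICATE = 5
    LEARN = 6


def next_phase(current: Phase) -> Phase:
    """Return the phase that follows `current`; LEARN is treated as terminal."""
    if current is Phase.LEARN:
        return current
    return Phase(current.value + 1)


if __name__ == "__main__":
    phase = Phase.DETECT
    while phase is not Phase.LEARN:
        print(phase.name)
        phase = next_phase(phase)
    print(phase.name)
```

Tracking the current phase explicitly makes it easier to state, in each update, where the incident stands.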

Roles

Incident Commander (IC): coordinates, makes calls.
Ops/Comms (scribe): updates the incident channel and status page, keeps the timestamped timeline.
Tech Lead: directs debugging, assigns tasks.
Subject Matter Experts: service owners, SRE, DB, network.

Minimal Incident Checklist

Declare incident + open an incident channel.
Assign IC + Ops/Comms (scribe) + tech lead.
State the symptom in one sentence (what users see).
Set a 10–15 min update cadence.
Start stabilization work immediately (rollback/mitigation).
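
The update cadence in the checklist can be enforced with a simple overdue check. The sketch below assumes a hypothetical helper `update_overdue` and defaults to the upper end of the 10–15 min range; how the last update time is recorded is left to whatever tooling you use.

```python
from datetime import datetime, timedelta


def update_overdue(last_update: datetime, now: datetime,
                   cadence_minutes: int = 15) -> bool:
    """Return True if more than `cadence_minutes` have passed since the last update."""
    return now - last_update > timedelta(minutes=cadence_minutes)


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)   # illustrative timestamp
    now = datetime(2024, 1, 1, 12, 20)
    if update_overdue(last, now):
        print("Update overdue: post status to the incident channel")
```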

Timestamping Template

T+00: Detected elevated 5xx in prod (region=eu)
T+03: IC assigned, incident channel created
T+06: Suspected dependency: payments latency spike
T+10: Mitigation: disable feature flag "new-checkout"
T+15: Error rate down, monitoring recovery
T+30: Service stable, entering monitoring period
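
Entries in the template's `T+MM` format can be generated from wall-clock times rather than computed by hand. This is a minimal sketch; `format_entry` is a hypothetical helper, and it assumes offsets are measured in whole minutes from the moment of detection.

```python
from datetime import datetime, timedelta


def format_entry(start: datetime, at: datetime, note: str) -> str:
    """Format a timeline entry as 'T+MM: note', with MM = whole minutes since `start`."""
    if at < start:
        raise ValueError("entry time precedes incident start")
    minutes = int((at - start).total_seconds() // 60)
    return f"T+{minutes:02d}: {note}"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)  # illustrative detection time
    print(format_entry(start, start, "Detected elevated 5xx in prod (region=eu)"))
    print(format_entry(start, start + timedelta(minutes=3),
                       "IC assigned, incident channel created"))
```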