Incident Lifecycle and Roles
Incident Lifecycle (Ops View)
- Detect: alert or user report confirms impact.
- Triage: scope + severity + suspected component.
- Stabilize: stop the bleeding (rollback, disable feature, shed load).
- Recover: restore full service + verify.
- Communicate: internal + external updates on a cadence.
- Learn: postmortem + action items + follow-up.
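One way to make the lifecycle explicit in tooling is a small phase enum. The sketch below is illustrative only: the names `Phase` and `next_phase` are hypothetical, and it assumes the phases run in the order listed, with Communicate treated as a continuous activity alongside every phase rather than as a state of its own.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    STABILIZE = "stabilize"
    RECOVER = "recover"
    LEARN = "learn"


# Assumption: phases advance in the order listed. Communicate is not a state
# here because updates are expected to continue throughout the incident.
ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after Learn."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


print(next_phase(Phase.TRIAGE))
```

Real incidents often loop back (for example, from Recover to Stabilize if a fix regresses), so a strict linear order is a simplification.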
Roles
- Incident Commander (IC): coordinates, makes calls.
- Ops/Comms (scribe): updates the incident channel and status page, records timestamps.
- Tech Lead: directs debugging, assigns tasks.
- Subject Matter Experts: service owners, SRE, DB, network.
Minimal Incident Checklist
- Declare incident + open an incident channel.
- Assign IC + scribe + tech lead.
- State the symptom in one sentence (what users see).
- Set a 10–15 min update cadence.
- Start stabilization work immediately (rollback/mitigation).
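The checklist above could be tracked by a small helper that reports what is still missing. This is a minimal sketch, not a prescribed tool: `IncidentKickoff`, `missing_items`, and the role keys are hypothetical names chosen for illustration, and the 10 to 15 minute bounds are taken directly from the checklist.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical role keys mirroring "IC + scribe + tech lead".
REQUIRED_ROLES = ("ic", "scribe", "tech_lead")


@dataclass
class IncidentKickoff:
    channel: Optional[str] = None      # incident channel name, once opened
    roles: Dict[str, str] = field(default_factory=dict)  # role -> person
    symptom: str = ""                  # one sentence: what users see
    cadence_min: Optional[int] = None  # minutes between updates
    mitigation_started: bool = False


def missing_items(k: IncidentKickoff) -> List[str]:
    """List the checklist items that have not been completed yet."""
    missing = []
    if not k.channel:
        missing.append("open an incident channel")
    for role in REQUIRED_ROLES:
        if not k.roles.get(role):
            missing.append(f"assign {role}")
    if not k.symptom.strip():
        missing.append("state the symptom")
    if k.cadence_min is None or not 10 <= k.cadence_min <= 15:
        missing.append("set a 10-15 min update cadence")
    if not k.mitigation_started:
        missing.append("start stabilization work")
    return missing


kickoff = IncidentKickoff(
    channel="#inc-example",
    roles={"ic": "alice", "scribe": "bob"},
    symptom="Checkout requests fail with 5xx errors.",
    cadence_min=10,
    mitigation_started=True,
)
print(missing_items(kickoff))
```

In this usage example only the tech lead is unassigned, so that is the single item reported.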
Timestamping Template
T+00: Detected elevated 5xx in prod (region=eu)
T+03: IC assigned, incident channel created
T+06: Suspected dependency: payments latency spike
T+10: Mitigation: disable feature flag "new-checkout"
T+15: Error rate down, monitoring recovery
T+30: Service stable, entering monitoring period
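If the scribe records wall-clock times, the `T+MM` prefixes can be derived rather than computed by hand. The following is a rough sketch under that assumption; `format_entry` and the example detection time are hypothetical and not part of any existing tool.

```python
from datetime import datetime


def format_entry(start: datetime, at: datetime, note: str) -> str:
    """Render one timeline line as 'T+MM: note', in whole minutes since start."""
    minutes = int((at - start).total_seconds() // 60)
    return f"T+{minutes:02d}: {note}"


# Hypothetical detection time used only for illustration.
start = datetime(2024, 1, 1, 12, 0)
line = format_entry(start, datetime(2024, 1, 1, 12, 3),
                    "IC assigned, incident channel created")
print(line)  # T+03: IC assigned, incident channel created
```

Storing absolute timestamps and deriving the offsets keeps the timeline reusable in the postmortem, where both views are often useful.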