Platform SLIs, SLOs, and Reliability
Choose SLIs That Matter
- Deployment success rate
- Time to provision namespace
- Pipeline duration / queue time
- Platform API availability
- Mean time to recover from platform incidents
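As a rough illustration of how two of these SLIs could be computed from raw events, here is a minimal sketch. The `Deployment` record shape and its field names are assumptions made for this example, not the schema of any particular pipeline tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Deployment:
    """One deployment attempt (hypothetical record shape)."""
    succeeded: bool
    queue_seconds: float


def success_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployment attempts that succeeded, in [0, 1]."""
    if not deployments:
        raise ValueError("no deployments in window")
    good = sum(1 for d in deployments if d.succeeded)
    return good / len(deployments)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Reporting queue time as a percentile rather than a mean keeps a few very slow runs visible instead of averaging them away.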
Example: SLO Statement
SLO: 99.9% successful GitOps syncs per month
Error budget: 0.1%
Alert: burn rate over 2h and 6h windows
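The arithmetic behind this statement can be sketched as follows. Burn rate here means the observed failure ratio divided by the error budget, so a value of 1.0 would spend exactly the monthly budget over a month. The alert threshold is a hypothetical parameter; the statement above names the windows but not a threshold, so pick one that fits your paging policy.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure ratio, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def allowed_failures(total_syncs: int, slo_target: float) -> float:
    """How many failed syncs the budget tolerates for a given volume."""
    return total_syncs * error_budget(slo_target)


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed failure ratio divided by the error budget for one window."""
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / error_budget(slo_target)


def should_alert(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Fire only when both windows (e.g. 2h and 6h) exceed the threshold.

    The threshold value is an assumption supplied by the caller.
    """
    return short_burn >= threshold and long_burn >= threshold
```

With a 99.9% target, 100,000 syncs in a month leave room for roughly 100 failures. Requiring both windows to exceed the threshold means a short spike that has already ended does not alert.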
Runbook: SLO Burn
- Check if failure is systemic (cluster API, GitOps controller)
- Disable non-critical rollouts
- Scale platform components
- Communicate status + ETA + workaround
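The first runbook step, deciding whether a failure is systemic, could be approximated by checking whether one component accounts for most recent failures. This is only a sketch: the component labels and the 50% share threshold are assumptions for illustration, and real triage would also look at cluster and controller health directly.

```python
from collections import Counter
from typing import Optional


def dominant_component(failures: list[str], share_threshold: float = 0.5) -> Optional[str]:
    """Return the component behind more than share_threshold of failures, else None.

    `failures` holds one component label per failed sync (hypothetical labels,
    e.g. "cluster-api" or "gitops-controller").
    """
    if not failures:
        return None
    component, count = Counter(failures).most_common(1)[0]
    if count / len(failures) > share_threshold:
        return component
    return None
```

A non-`None` result suggests a systemic cause worth escalating, while `None` points toward scattered, application-level failures.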
Failure Modes
- Vanity SLOs → pick measures tied to developer outcomes.
- Alert fatigue → tune alerts using burn-rate and paging policy.