Platform SLIs, SLOs, and Reliability

Define platform SLIs/SLOs around developer experience and runtime reliability. Use error budgets to prioritize reliability work.

On this page

Choose SLIs That Matter

Deployment success rate
Time to provision namespace
Pipeline duration / queue time
Platform API availability
Mean time to recover platform incidents

Example: SLO Statement

SLO: 99.9% successful GitOps syncs per month
Error budget: 0.1%
Alert: burn rate 2h / 6h windows

Runbook: SLO Burn

Check if failure is systemic (cluster API, GitOps controller)
Disable non-critical rollouts
Scale platform components
Communicate status + ETA + workaround

Failure Modes

Vanity SLOs → pick measures tied to developer outcomes.
Alert fatigue → tune alerts using burn-rate and paging policy.

← Service Templates and Scaffolding

Multi-Tenancy and Isolation Models →