INFRA-DEVOPS Contents

Platform SLIs, SLOs, and Reliability

Define platform SLIs/SLOs around developer experience and runtime reliability. Use error budgets to prioritize reliability work.

On this page

Choose SLIs That Matter

  • Deployment success rate
  • Time to provision namespace
  • Pipeline duration / queue time
  • Platform API availability
  • Mean time to recover platform incidents

Example: SLO Statement

SLO: 99.9% successful GitOps syncs per month
Error budget: 0.1%
Alert: burn rate 2h / 6h windows

Runbook: SLO Burn

  • Check if failure is systemic (cluster API, GitOps controller)
  • Disable non-critical rollouts
  • Scale platform components
  • Communicate status + ETA + workaround

Failure Modes

  • Vanity SLOs → pick measures tied to developer outcomes.
  • Alert fatigue → tune alerts using burn-rate and paging policy.