Resource Metrics Primer (what to alert on)
Metrics That Predict Outages
- CPU: utilization + run queue + throttling.
- Memory: working set, swap in/out, OOM kills.
- Disk: latency, %util, queue depth, free space, inode usage.
- Network: retransmits, drops, RTT spikes.
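As a rough illustration of the CPU saturation signal, the sketch below divides the 1-minute load average by the CPU count. The helper name `load_per_core` is hypothetical, and the live check assumes a Linux host with `/proc/loadavg` and `nproc`. Note that Linux load average also counts tasks in uninterruptible sleep, so it is only an approximation of run-queue length.

```shell
#!/bin/sh
# load_per_core: hypothetical helper, not a standard tool.
# $1 = load average, $2 = number of CPUs. Prints the ratio with two decimals.
load_per_core() {
    awk -v l1="$1" -v cpus="$2" 'BEGIN {
        if (cpus <= 0) { print "n/a"; exit }
        printf "%.2f\n", l1 / cpus
    }'
}

# Live check (assumes Linux: /proc/loadavg and nproc are present).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _rest < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

A sustained value well above 1 suggests more runnable (or blocked) tasks than cores; treat it as a cause signal to read alongside latency, not as a page on its own.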
On-Host Quick Checks
uptime
vmstat 1 5
free -h
df -h
df -i
iostat -xz 1 5 2>/dev/null || true
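The `df` checks above can be turned into a quick pass/fail filter. This is a minimal sketch: `over_threshold` is a hypothetical helper, the 90% threshold is an arbitrary example, and the parsing assumes the POSIX `df -P` column layout with mount points that contain no spaces.

```shell
#!/bin/sh
# over_threshold: hypothetical helper. Reads `df -P` style output on stdin
# and prints "<mount point> <use%>" for rows at or above the threshold.
# $1 = threshold percentage (example value: 90).
over_threshold() {
    awk -v t="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= t) print $6, $5
    }'
}

# Disk space on the local host; prints nothing when every mount is below 90%.
df -P 2>/dev/null | over_threshold 90
```

With GNU `df`, piping `df -Pi` through the same filter should cover inode usage as well, since the percentage lands in the same column; other `df` implementations may lay out `-i` output differently.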
Alerting Rules of Thumb
- Alert on symptoms (latency, errors) plus cause signals (saturation).
- Avoid paging on noisy metrics; prefer multi-signal alerts.
- Use burn-rate alerts for SLOs where possible.
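For the burn-rate rule, the arithmetic is: burn rate = observed error ratio / error budget, where the error budget is 1 minus the SLO target. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it proportionally faster. The sketch below assumes you already have failed and total request counts for some lookback window; `burn_rate` is a hypothetical helper and the numbers are made up.

```shell
#!/bin/sh
# burn_rate: hypothetical helper.
# $1 = failed requests, $2 = total requests, $3 = SLO target as a fraction.
burn_rate() {
    awk -v bad="$1" -v total="$2" -v slo="$3" 'BEGIN {
        if (total <= 0 || slo >= 1) { print "n/a"; exit }
        # error ratio divided by error budget
        printf "%.1f\n", (bad / total) / (1 - slo)
    }'
}

# Example with made-up counts: 144 failures out of 10000 requests, 99.9% SLO.
burn_rate 144 10000 0.999
```

In practice the counts would come from your metrics system, and an alert would typically require the burn rate to exceed a threshold over both a long and a short window to cut down on noise.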
Common Misreads
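- High CPU can be fine if latency is stable and nothing is queueing.
- Low 'free' memory is normal because the kernel uses spare RAM for page cache; judge memory pressure by 'available' memory, swap activity, and OOM kills instead.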
- High throughput can coexist with terrible latency.
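To avoid the 'free' memory misread, one option is to look at `MemAvailable` rather than `MemFree`. The sketch below assumes the Linux `/proc/meminfo` format (the `MemAvailable` field exists on kernels 3.14 and later); `mem_available_pct` is a hypothetical helper.

```shell
#!/bin/sh
# mem_available_pct: hypothetical helper. Reads /proc/meminfo style text on
# stdin and prints MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", 100 * avail / total }'
}

# Live check (assumes Linux with /proc mounted).
if [ -r /proc/meminfo ]; then
    echo "memory available: $(mem_available_pct < /proc/meminfo)%"
fi
```

A low percentage here is a more meaningful warning than a low 'free' figure, because it accounts for page cache the kernel can reclaim.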
Checklist
# CPU: utilization + run queue + throttling
# Mem: swap + OOM + major faults
# Disk: await/%util + free space + inodes
# Net: retransmits/drops + RTT