Resource Metrics Primer (what to alert on)

Know what to alert on: CPU, memory, disk, and network signals that predict outages early.

Metrics That Predict Outages

CPU: utilization, run-queue length, and throttling (e.g. a cgroup CPU quota).
Memory: working set, swap-in/swap-out rate, OOM kills.
Disk: I/O latency (await), %util, queue depth, free space, inode usage.
Network: TCP retransmits, packet drops, RTT spikes.
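
Two of the CPU signals above reduce to simple ratios. The following is a minimal sketch, not a definitive implementation: run_queue_per_cpu and throttled_pct are hypothetical helper names, and the input format assumed for throttled_pct is the nr_periods / nr_throttled lines that a cgroup cpu.stat file exposes when a CPU quota is set.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from any standard tool.

# run_queue_per_cpu RUNNABLE NCPU
# Prints runnable tasks per CPU. Values well above 1.00 suggest CPU queueing.
run_queue_per_cpu() {
  LC_ALL=C awk -v r="$1" -v n="$2" 'BEGIN { printf "%.2f\n", r / n }'
}

# throttled_pct: reads cgroup cpu.stat text on stdin and prints the
# percentage of enforcement periods in which the group was throttled.
# Prints 0.0 when no periods are reported (e.g. no quota configured).
throttled_pct() {
  LC_ALL=C awk '
    $1 == "nr_periods"   { p = $2 }
    $1 == "nr_throttled" { t = $2 }
    END { if (p > 0) printf "%.1f\n", 100 * t / p; else print "0.0" }
  '
}
```

On a Linux host these might be fed with run_queue_per_cpu "$(vmstat 1 2 | tail -1 | awk '{print $1}')" "$(nproc)" and throttled_pct < /sys/fs/cgroup/cpu.stat. The cpu.stat path is an assumption (cgroup v2, read from inside the container or for a specific group) and will differ on cgroup v1 hosts.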

On-Host Quick Checks

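uptime                               # load averages over 1, 5, and 15 minutes
vmstat 1 5                           # r = run queue, si/so = swap in/out, wa = I/O wait
free -h                              # read the "available" column, not "free"
df -h                                # free space per filesystem
df -i                                # inode usage per filesystem
iostat -xz 1 5 2>/dev/null || true   # per-device await, queue size, %util (needs sysstat)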
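
The quick checks above do not touch the network signals. As a rough sketch, the TCP retransmit ratio can be derived from the kernel's counters. tcp_retrans_pct is a hypothetical helper name, and the code assumes the Linux /proc/net/snmp layout, where a "Tcp:" header line naming the fields is followed by a "Tcp:" line holding the values.

```shell
#!/bin/sh
# tcp_retrans_pct: hypothetical helper. Reads /proc/net/snmp-style text on
# stdin and prints retransmitted segments as a percentage of segments sent.
# Both counters are cumulative since boot, so this is a lifetime average.
tcp_retrans_pct() {
  LC_ALL=C awk '
    # First Tcp: line is the header; remember each field name'"'"'s column.
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    # Second Tcp: line carries the values.
    $1 == "Tcp:"          { out = $(col["OutSegs"]); re = $(col["RetransSegs"]) }
    END { if (out > 0) printf "%.2f\n", 100 * re / out }
  '
}
```

Typical use would be tcp_retrans_pct < /proc/net/snmp. Because the counters only ever grow, an alert would normally compare two samples taken some seconds apart rather than the since-boot ratio shown here.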

Alerting Rules of Thumb

Alert on symptoms (latency, errors) plus cause signals (saturation).
Avoid paging on noisy metrics; prefer multi-signal alerts.
Use burn-rate alerts for SLOs where possible.
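
For the burn-rate rule above, the arithmetic is small enough to show directly: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO window; higher values exhaust it proportionally sooner. The helper name burn_rate and its argument order are assumptions made for this sketch.

```shell
#!/bin/sh
# burn_rate ERROR_RATIO SLO_TARGET
# Hypothetical helper: prints how many times faster than "exactly on budget"
# the error budget is being spent. Both arguments are fractions (0.999 = 99.9%).
burn_rate() {
  LC_ALL=C awk -v err="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}
```

For example, burn_rate 0.0144 0.999 prints 14.4: with a 99.9% target, a 1.44% error ratio consumes budget 14.4 times faster than the SLO allows. Which burn rate should page, and over what window, is a policy choice this sketch does not make.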

Common Misreads

High CPU can be fine if latency is stable and nothing is queueing.
Low 'free' memory is normal because Linux fills spare RAM with page cache; judge memory pressure by 'available' memory instead.
High throughput can coexist with terrible latency.
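
The 'free' memory misread is easy to demonstrate by printing both figures side by side. This is a sketch only: mem_report is a hypothetical helper name, and it assumes /proc/meminfo-style input that includes a MemAvailable line (reported by reasonably recent Linux kernels).

```shell
#!/bin/sh
# mem_report: hypothetical helper. Reads /proc/meminfo-style text on stdin and
# prints MemFree and MemAvailable as percentages of MemTotal.
mem_report() {
  LC_ALL=C awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemFree:"      { free  = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0)
        printf "free=%.1f%% available=%.1f%%\n", 100 * free / total, 100 * avail / total
    }
  '
}
```

Run as mem_report < /proc/meminfo. A host showing a small free percentage next to a healthy available percentage is using its RAM for cache, not running out of it.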

Checklist

# CPU: utilization + run queue + throttling
# Mem: swap + OOM + major faults
# Disk: await/%util + free space + inodes
# Net: retransmits/drops + RTT
