ps/top/htop: Finding What Hurts
Why ps/top/htop Are Production Tools (Not Just Commands)
In production, you do not debug by guessing. You debug by observing. ps/top/htop are the fastest way to answer:
- Which process is hurting the system?
- Is it CPU, memory, I/O, or just “looks busy”?
- Is the problem one process, many processes, or the host itself?
Start With a Snapshot: ps
ps is your “frozen frame”. It is ideal for quick, scriptable checks.
Top CPU Consumers
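ps aux --sort=-%cpu | head -n 15
Production hint: ps reports %CPU as CPU time divided by how long the process has been running, so it is an average over the process lifetime, not the current rate. For what is busy right now, use top.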
Top Memory Consumers
ps aux --sort=-%mem | head -n 15
Production hint: %MEM is not enough by itself. Also watch RSS (resident memory, reported by ps in KiB). VSZ, which top calls VIRT, can be huge without real usage.
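A minimal sketch of that check. The helper below (rss_report is a made-up name, not a standard tool) takes lines of PID, RSS, VSZ, and command name, as procps ps can print them, and shows both sizes in MiB with the largest resident set first. It assumes ps reports RSS and VSZ in KiB and, for simplicity, keeps only the first word of the command name.

```shell
# rss_report: hypothetical helper, not a standard tool.
# Input (stdin): lines of "PID RSS_KiB VSZ_KiB COMMAND".
# Output: "PID RSS_MiB VSZ_MiB COMMAND", largest RSS first.
rss_report() {
    sort -k2,2nr |
        awk '{ printf "%s %.1f %.1f %s\n", $1, $2 / 1024, $3 / 1024, $4 }'
}

# On a live host (procps ps assumed):
#   ps -e -o pid= -o rss= -o vsz= -o comm= | rss_report | head -n 15
```

A process with a large VSZ but a small RSS in this output is usually not the one hurting the host.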
Understand VIRT vs RES vs SHR
- VIRT: virtual address space (can be large, not always a problem)
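- RES (RSS in ps): physical RAM currently resident for the process (this hurts)
- SHR: the part of RES that may be shared with other processes (shared libraries, file-backed mappings, shared memory)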
If RES grows steadily, suspect a memory leak or unbounded cache growth. If VIRT is huge but RES is stable, it may be normal (mmap, shared libs, etc.).
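To see whether RES is actually growing, sample it over time instead of eyeballing one reading. The sketch below is one possible way to do that on Linux, assuming /proc/<pid>/status exposes a VmRSS line (it does for ordinary user processes). Both function names are invented for this example.

```shell
# vmrss_kib: print the VmRSS value (KiB) from a /proc/<pid>/status style file.
vmrss_kib() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# watch_rss PID [INTERVAL_SECONDS] [COUNT]
# Prints "HH:MM:SS rss_kib" once per interval, stopping early if the PID exits.
watch_rss() {
    pid=$1
    interval=${2:-5}
    count=${3:-12}
    i=0
    while [ "$i" -lt "$count" ] && [ -r "/proc/$pid/status" ]; do
        printf '%s %s\n' "$(date +%T)" "$(vmrss_kib "/proc/$pid/status")"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, watch_rss 4242 10 30 would print thirty samples ten seconds apart for the hypothetical PID 4242. A column that only ever rises under steady traffic is worth a closer look.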
Live View: top
top is your “moving picture”. Use it when load is changing and you need trends. Start top:
top
Critical top Metrics
- load average: 1m / 5m / 15m pressure trend (on Linux it counts runnable tasks plus tasks in uninterruptible sleep)
- %Cpu: user vs system vs iowait
- Mem: used vs available
- Swap: activity is usually a bad sign in production
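Two of these numbers are easier to judge as ratios. A hedged sketch, with helper names made up for illustration: load divided by CPU count, and MemAvailable as a share of MemTotal, both taken from /proc on Linux. What counts as too high or too low depends on the workload, so no thresholds are baked in.

```shell
# load_per_core LOAD NCPU: print load average divided by CPU count.
load_per_core() {
    awk -v l="$1" -v n="$2" 'BEGIN { printf "%.2f\n", l / n }'
}

# mem_available_pct [FILE]: MemAvailable as a percentage of MemTotal.
# FILE defaults to /proc/meminfo (MemAvailable needs a reasonably recent kernel).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total }' \
        "${1:-/proc/meminfo}"
}

# On a live host:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   mem_available_pct
```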
CPU Steal (Virtual Machines)
If you run on a VM, watch for “steal” time. High steal means the hypervisor is starving your VM of CPU, even if your app is fine. In top, look for st (steal).
Symptom:
- App is slow
- CPU seems high or inconsistent
- Nothing obvious in process list
If steal is high, the fix may be infrastructure-level (bigger instance / less noisy neighbors), not code.
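When top is not convenient (a script, a cron check), steal can be estimated from two readings of the aggregate cpu line in /proc/stat. This is a sketch under the assumption that the line follows the usual Linux layout (user, nice, system, idle, iowait, irq, softirq, steal); steal_pct is a hypothetical helper name.

```shell
# steal_pct LINE1 LINE2
# LINE1 and LINE2 are two "cpu ..." lines from /proc/stat taken some seconds apart.
# Prints the share of CPU time that was stolen between the two readings.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user..steal; guest time is already counted in user.
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            s[NR] = $9
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (s[2] - s[1]) / d
            else       print "0.0"
        }'
}

# On a live host:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   steal_pct "$a" "$b"
```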
I/O Wait: The Silent Killer
If %Cpu shows high wa (iowait), CPUs are sitting idle while tasks wait for I/O to complete, most often on disk. This is NOT a “CPU problem”. Restarting apps won’t fix slow storage.
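For logging or alerting, the same wa figure can be pulled out of top in batch mode. The helper below is only a sketch: top_cpu is an invented name, and it assumes the procps-style summary line that starts with %Cpu and uses a dot as the decimal separator. The first batch frame is commonly reported to reflect averages since boot rather than the current interval, so the suggested invocation asks for two frames and the helper keeps the last one it sees.

```shell
# top_cpu KEY: read `top -b` output on stdin and print the value for KEY
# (us, sy, ni, id, wa, hi, si, st) from the last %Cpu summary line seen.
top_cpu() {
    awk -v key="$1" '
        /^%Cpu/ {
            sub(/^[^:]*:/, "")            # drop the "%Cpu(s):" label
            n = split($0, part, ",")
            for (i = 1; i <= n; i++) {
                split(part[i], kv, " ")   # kv[1] = value, kv[2] = label
                if (kv[2] == key) val = kv[1]
            }
        }
        END { if (val != "") print val }'
}

# On a live host:
#   LC_ALL=C top -b -n 2 -d 1 | top_cpu wa
```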
Quick signal:
- High load average
- Many processes in D state
- wa is high
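The D-state part of that signal is easy to check from ps. A small sketch, assuming procps ps and using a made-up filter name (d_state):

```shell
# d_state: keep only tasks whose state starts with D (uninterruptible sleep).
# Input (stdin): lines of "STAT PID COMMAND".
d_state() {
    awk '$1 ~ /^D/'
}

# On a live host:
#   ps -e -o stat= -o pid= -o comm= | d_state
```

A handful of short-lived D-state tasks is normal. The same PIDs stuck in D across repeated runs, together with high wa, points at storage.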
Readability Upgrade: htop
htop is often easier for production work because it offers:
- Tree view (parent/child)
- Interactive sorting
- Per-core CPU view
- Kill/renice actions with less typing
Install:
sudo apt install htop   # Debian / Ubuntu
sudo dnf install htop   # Fedora / RHEL family
Threads vs Processes Confusion
Some applications use many threads, so a single process can show well over 100% CPU in top, where each fully busy core counts as 100%. To see which threads are doing the work, enable thread view:
top -H
Production use case:
- Java apps
- Some Node native modules
- Databases
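For a non-interactive version of the same idea, ps can list the threads of one process. The sketch below assumes procps ps (the -L option and the tid and pcpu columns); busiest_threads is a hypothetical helper name. Note that pcpu here is a lifetime average per thread, so it shows which threads have done the most work overall, not necessarily which are busy this second.

```shell
# busiest_threads [N]: keep the N lines with the highest CPU value (default 5).
# Input (stdin): lines of "TID PCPU NAME".
busiest_threads() {
    sort -k2,2nr | head -n "${1:-5}"
}

# On a live host, with pid set to the process of interest:
#   ps -L -p "$pid" -o tid= -o pcpu= -o comm= | busiest_threads 5
```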
When ps/top Are Not Enough: pidstat
If you need per-process CPU, memory, and I/O over time, use pidstat (from sysstat).
pidstat -u 1   # CPU usage, sampled every second
pidstat -r 1   # memory usage and page faults
pidstat -d 1   # disk I/O
This helps answer: is the process consistently heavy, or only spiking?
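One way to turn that question into numbers is to summarise the %CPU column of a pidstat run. The helper below is a sketch, not part of sysstat: cpu_spread is an invented name, and it finds the column by looking for the %CPU header rather than assuming a fixed position, since the layout differs between sysstat versions.

```shell
# cpu_spread: read `pidstat -u` output on stdin and print min/avg/max of %CPU.
cpu_spread() {
    awk '
        /^Average/ { next }    # skip the trailing summary block
        {
            # Remember which column holds %CPU when a header line goes by.
            for (i = 1; i <= NF; i++) if ($i == "%CPU") { col = i; next }
        }
        col && NF >= col && $col ~ /^[0-9.]+$/ {
            v = $col + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v
            n++
        }
        END {
            if (n) printf "min=%.1f avg=%.1f max=%.1f samples=%d\n", min, sum / n, max, n
        }'
}

# On a live host, 30 one-second samples for one process:
#   LC_ALL=C pidstat -u -p "$pid" 1 30 | cpu_spread
```

A max far above the average suggests spikes; min, avg, and max close together suggest a process that is consistently heavy.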
Common Production Mistakes
- Confusing high load with high CPU (on Linux, load also counts tasks blocked on I/O)
- Chasing VIRT instead of RES
- Ignoring steal time on VMs
- Killing processes without checking why they are busy
- Restarting apps when the real bottleneck is disk/network
Mental Model
Always classify the pain before acting:
- CPU-bound: high user/system CPU, low iowait
- I/O-bound: high iowait, D-state processes, high load
- Memory pressure: low available memory, swap activity, OOM events
- Infrastructure starvation: high steal, noisy neighbor signals
Once you classify correctly, your next tool choice becomes obvious.
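The classification above can be written down as a small decision function. Everything specific here is an assumption made for illustration: the name classify_pain, the order of the checks, and above all the thresholds (10% steal, 10% available memory, 20% iowait, 80% busy), which should be tuned per system.

```shell
# classify_pain BUSY_PCT IOWAIT_PCT STEAL_PCT MEM_AVAILABLE_PCT
# All four arguments are required. Thresholds are illustrative, not recommendations.
classify_pain() {
    awk -v busy="$1" -v wa="$2" -v st="$3" -v avail="$4" 'BEGIN {
        if      (st    >= 10) print "infrastructure starvation"
        else if (avail <= 10) print "memory pressure"
        else if (wa    >= 20) print "I/O-bound"
        else if (busy  >= 80) print "CPU-bound"
        else                  print "unclassified"
    }'
}

# Example with hand-entered readings (user+system 90%, iowait 1%, steal 0%,
# 60% of memory available):
#   classify_pain 90 1 0 60
```

Steal and memory are checked first in this sketch because both can make CPU and iowait figures misleading; a different ordering may suit other environments.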
Production Checklist
- Use ps for snapshots, top/htop for trends
- Interpret load together with CPU (especially iowait)
- Track RES for real memory usage
- Watch steal time on VMs
- Use pidstat when you need time-series behavior