System Introspection Toolkit
Why Fast System Introspection Matters
In production, speed matters, but not the speed of typing random commands. You need a repeatable, reliable checklist to understand a host quickly: what it is, how it is configured, where it is failing, and what is about to fail.
The 60-Second Host Snapshot
If you can only run a few commands, run these:
uname -a
uptime
df -h
free -h
ps aux --sort=-%cpu | head
ss -lntp | head
This tells you:
- Kernel and OS basics
- Load average and uptime
- Disk pressure
- Memory pressure
- Top CPU consumers
- What is listening on ports
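One way to make the snapshot repeatable is to wrap it in a small script. The sketch below is a minimal version, assuming a POSIX shell; `section` and `snapshot` are hypothetical helper names, and the 10-line cap per command is an arbitrary choice that mirrors the `head` calls above.

```shell
#!/bin/sh
# Run one snapshot command under a labeled header.
# If the tool is not installed, say so instead of failing.
# (section is a hypothetical helper name, not a standard tool.)
section() {
    title=$1
    shift
    printf '== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        # Cap output at 10 lines so the snapshot stays readable.
        "$@" 2>&1 | head -n 10
    else
        printf '(skipped: %s not installed)\n' "$1"
    fi
}

# The same six commands as the 60-second snapshot, in the same order.
snapshot() {
    section "kernel"    uname -a
    section "load"      uptime
    section "disk"      df -h
    section "memory"    free -h
    section "top cpu"   ps aux --sort=-%cpu
    section "listening" ss -lntp
}
```

Calling `snapshot` prints each section in order. Skipping missing tools matters on minimal images, where `ss` or `free` may not be present.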
OS and Distribution
Identify distro and version clearly (useful for package decisions):
cat /etc/os-release
lsb_release -a
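For scripted package decisions, the distro ID and version can be read directly, because /etc/os-release is written as shell-compatible variable assignments. A minimal sketch follows; `os_id_version` is a hypothetical helper name, and the "unknown" fallback is an assumption for files that omit a field (VERSION_ID is optional).

```shell
# Print "ID VERSION_ID" from an os-release style file.
# (os_id_version is a hypothetical helper name.)
os_id_version() {
    file=${1:-/etc/os-release}
    # Source in a subshell so the file's variables do not leak into the caller.
    ( . "$file" && printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unknown}" )
}
```

Run with no argument, it reads /etc/os-release; a path can be passed to inspect a copy taken from another host.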
CPU, Memory, and Hardware
lscpu
nproc
free -h
vmstat 1 5
What to look for:
- CPU cores vs load average
- Memory available vs used
- Swap usage trends
- Context switching and run queue pressure
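The first two checks can be reduced to single numbers. The sketch below assumes a Linux host with /proc mounted; `load_per_core` and `mem_available_pct` are hypothetical helper names, and what counts as "too high" or "too low" is left to you.

```shell
# Load average divided by core count. A value that stays above 1.00
# suggests more runnable work than the CPUs can absorb.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# MemAvailable as a whole-number percentage of MemTotal,
# read from a meminfo-style file (default: /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

On a live host this could be called as `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` followed by `mem_available_pct`.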
Storage and Filesystem Reality
lsblk
df -h
mount | head -n 30
Production checks:
- Is /var filling up?
- Are mounts correct?
- Any unexpected tmpfs usage?
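The "is it filling up?" question can be answered mechanically by filtering `df -P` output, whose POSIX format keeps each filesystem on one line. This is a minimal sketch: `full_mounts` is a hypothetical helper name, the default threshold of 90 percent is an arbitrary assumption, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print every mount point whose
# usage is at or above the given percentage (default 90).
full_mounts() {
    threshold=${1:-90}
    awk -v max="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= max + 0) print $6, $5
    }'
}
```

Typical use would be `df -P | full_mounts 85`, which prints nothing when every filesystem is below the threshold.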
Disk Pressure and “Deleted But Still Open”
If df reports a full filesystem but you cannot find the large files, check for deleted files that are still held open:
lsof | grep deleted | head
Common production issue: a log file is deleted while a process still holds it open, so the space is not released until that process closes the file or exits.
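When lsof is not installed, the same information is visible on Linux under /proc, where each open file descriptor is a symlink and a deleted target ends in " (deleted)". The sketch below relies on that convention; `deleted_open_files` is a hypothetical helper name, and its optional argument exists only so the function can be pointed at a test directory.

```shell
# List file descriptors whose target file has been deleted.
# Argument: proc root to scan (default /proc; override is for testing).
deleted_open_files() {
    proc_root=${1:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished descriptors are skipped silently.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

The PID is the first path component after the proc root. Run it as root, since descriptors of other users' processes are otherwise unreadable.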
Networking Snapshot
ip a
ip route
resolvectl status
ss -lntp
What to confirm:
- Correct IP and routes
- DNS resolver health
- Listening services match expectation
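Checking that listeners match expectation can be scripted by comparing the ports from `ss` against a known list. A minimal sketch, assuming `ss -lntH` output (the `-H` flag suppresses the header) with the local address in the fourth column; `unexpected_ports` is a hypothetical helper name and the port numbers in the usage line are placeholders.

```shell
# Read `ss -lntH` output on stdin and print listening TCP ports
# that are not in the expected list given as arguments.
unexpected_ports() {
    expected=" $* "
    # The port is whatever follows the last colon of the local address,
    # which also works for IPv6 forms such as [::]:22.
    awk '{ n = split($4, a, ":"); print a[n] }' | sort -un |
    while read -r port; do
        case $expected in
            *" $port "*) ;;                      # expected, stay quiet
            *) printf '%s\n' "$port" ;;
        esac
    done
}
```

For example, `ss -lntH | unexpected_ports 22 443` prints any listening port other than 22 and 443, one per line.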
System Limits and Kernel Signals
ulimit -a
sysctl -a | head
dmesg | tail -n 50
dmesg is where you often see:
- OOM killer events
- Disk I/O errors
- Kernel warnings
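Rather than reading the whole kernel log, it can be filtered for these signals. The pattern list below is an illustrative assumption, not an exhaustive set of kernel message formats, and `kernel_red_flags` is a hypothetical helper name.

```shell
# Keep only kernel log lines that look like OOM kills, I/O errors,
# or kernel traces. Matching is case-insensitive.
kernel_red_flags() {
    grep -Ei 'out of memory|oom-killer|i/o error|call trace|segfault'
}
```

A typical call is `dmesg | kernel_red_flags | tail -n 20`. On hosts that restrict dmesg to privileged users, it needs to run as root.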
Service Inventory (What Is Running?)
systemctl --type=service --state=running
systemctl status myapp --no-pager
Production engineers keep an inventory mindset: what should be running vs what is running.
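The "should be running vs is running" comparison can be sketched as a small filter over systemctl output. It assumes the unit name is the first column, which holds when `--no-legend --plain` is passed; `missing_services` is a hypothetical helper name and the unit names in the usage line are placeholders.

```shell
# Read running units on stdin (unit name in column 1) and print each
# expected unit, given as an argument, that is not among them.
missing_services() {
    running=$(awk '{ print $1 }')
    for svc in "$@"; do
        # -x matches the whole line, -F treats the name as a literal string.
        printf '%s\n' "$running" | grep -qxF -e "$svc" || printf '%s\n' "$svc"
    done
}
```

For example: `systemctl list-units --type=service --state=running --no-legend --plain | missing_services ssh.service myapp.service`. Empty output means everything expected is running.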
Common Production Mistakes
- Running random commands without a mental model
- Ignoring disk and memory signals until an outage
- Not correlating load with CPU cores
- Forgetting to check dmesg during weird failures
Mental Model
You are building situational awareness. Every command you run should answer a question about one area: CPU, memory, disk, network, or services. If you cannot state what you are checking, you are debugging blindly.
Production Checklist
- 60-second snapshot commands known by heart
- Disk/memory/network inspected before deep dives
- dmesg checked for kernel-level clues
- systemctl inventory used to verify expectations