System Introspection Toolkit

Know your host fast: uname, lsb_release, df, free, lscpu, lsblk, and friends.

Why Fast System Introspection Matters

In production, speed matters, but firing off random commands is not speed. You need a repeatable, reliable checklist to understand a host quickly: what it is, how it is configured, where it is failing, and what is about to fail.

The 60-Second Host Snapshot

If you can only run a few commands, run these:

uname -a
uptime
df -h
free -h
ps aux --sort=-%cpu | head
ss -lntp | head

This tells you:

  • Kernel and OS basics
  • Load average and uptime
  • Disk pressure
  • Memory pressure
  • Top CPU consumers
  • What is listening on ports
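
The snapshot above can be wrapped in a small script so the same commands run in the same order on every host. This is a minimal sketch, not a standard tool: the function names run_section and host_snapshot are hypothetical, and the 15-line cap per section is an arbitrary choice.

```shell
# Sketch of a repeatable host snapshot. run_section and host_snapshot
# are illustrative names, not part of any standard tool.

# Print a header, then run the command if it exists on this host.
run_section() {
    local title="$1"
    shift
    printf '== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        # Cap each section so one noisy command cannot bury the rest.
        "$@" 2>&1 | head -n 15
    else
        printf '(%s not installed)\n' "$1"
    fi
    echo
}

# Same commands, same order as the list above.
host_snapshot() {
    run_section "kernel"      uname -a
    run_section "uptime/load" uptime
    run_section "disk"        df -h
    run_section "memory"      free -h
    run_section "top cpu"     ps aux --sort=-%cpu
    run_section "listeners"   ss -lntp
}
```

After sourcing the file, run host_snapshot. Without root, ss -p typically shows process names only for sockets owned by the current user.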

OS and Distribution

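Identify the distribution and version explicitly (useful for package decisions). /etc/os-release is present on most modern distributions; lsb_release may not be installed on minimal images: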

cat /etc/os-release
lsb_release -a
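
For scripts, /etc/os-release is convenient because it is written as shell-compatible variable assignments. The sketch below reads the ID and VERSION_ID fields from it; the function name os_summary is hypothetical, and the "unknown" fallback is an assumption for files that omit a field.

```shell
# os_summary [FILE]: print "ID VERSION_ID" from an os-release style file.
# Illustrative helper; FILE defaults to /etc/os-release.
os_summary() {
    local file="${1:-/etc/os-release}"
    [ -r "$file" ] || return 1
    # Source inside a subshell so the file's variables do not leak
    # into the calling shell.
    (
        ID=
        VERSION_ID=
        . "$file"
        printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unknown}"
    )
}
```

Called with no argument, os_summary reads the host's own /etc/os-release.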

CPU, Memory, and Hardware

lscpu
nproc
free -h
vmstat 1 5

What to look for:

  • Load average relative to CPU core count (a load of 8 means something very different on 4 cores than on 32)
  • Memory available vs used
  • Swap usage trends
  • Context switching and run queue pressure
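
Two of these checks reduce to simple arithmetic. The following sketch normalizes load by core count and expresses available memory as a percentage; load_per_core and mem_available_pct are hypothetical helper names, and any alert threshold you attach to them is your own assumption.

```shell
# load_per_core LOAD CORES: load average divided by core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# mem_available_pct [FILE]: MemAvailable as a whole-number percentage of
# MemTotal, read from a /proc/meminfo style file.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' \
        "${1:-/proc/meminfo}"
}
```

On a live host: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". A value that stays well above 1 suggests more runnable work than cores, though on Linux the load average also counts tasks blocked in uninterruptible I/O.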

Storage and Filesystem Reality

lsblk
df -h
mount | head -n 30

Production checks:

  • Is /var filling up?
  • Are mounts correct?
  • Any unexpected tmpfs usage?
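
The "is anything filling up?" question can be answered mechanically. This sketch filters df -P output (the POSIX format, one filesystem per line) by a usage threshold; full_filesystems is a hypothetical name, and the parsing assumes mount points without spaces.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "mountpoint use%" for every filesystem at or above THRESHOLD percent.
# Assumes mount points contain no whitespace.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # "92%" -> "92"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Usage: df -P | full_filesystems 80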

Disk Pressure and “Deleted But Still Open”

If df reports a filesystem as full but you cannot find the large files on disk, check for files that were deleted while still open:

lsof -nP | grep '(deleted)' | head

A common production issue: a log file is deleted while a process still holds it open, so the space is not released until the process closes the file or exits.
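
When lsof is not installed, the same information is visible under /proc on Linux, where the symlink for an unlinked file ends in " (deleted)". A minimal sketch, assuming a Linux /proc filesystem and permission to read the target process's fd directory; deleted_open_files is a hypothetical name.

```shell
# deleted_open_files PID: list file descriptors of PID whose target
# has been unlinked. Linux marks such links with a " (deleted)" suffix.
deleted_open_files() {
    local pid="$1" fd target
    for fd in /proc/"$pid"/fd/*; do
        # Skip entries that vanish or cannot be read.
        target="$(readlink "$fd" 2>/dev/null)" || continue
        case "$target" in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

Usage: deleted_open_files 1234, where 1234 stands in for the PID of the suspect process.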

Networking Snapshot

ip a
ip route
resolvectl status
ss -lntp

What to confirm:

  • Correct IP and routes
  • DNS resolver health
  • Listening services match expectation
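
"Listening services match expectation" is easier to verify when the expectation is written down. The sketch below extracts port numbers from ss -lnt output and reports any that are not on an allow list; both function names and the example ports are hypothetical.

```shell
# listening_ports: read `ss -lnt` output on stdin, print unique port numbers.
listening_ports() {
    # Field 4 is "Local Address:Port"; the port follows the last colon,
    # which also handles IPv6 forms such as [::]:22.
    awk 'NR > 1 { n = split($4, parts, ":"); print parts[n] }' | sort -un
}

# unexpected_ports "22 443": read ports on stdin, print those not allowed.
unexpected_ports() {
    local allowed=" $1 " port
    while read -r port; do
        case "$allowed" in
            *" $port "*) ;;          # expected, stay quiet
            *) echo "$port" ;;
        esac
    done
}
```

Usage: ss -lnt | listening_ports | unexpected_ports "22 443"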

System Limits and Kernel Signals

ulimit -a
sysctl -a | head
dmesg | tail -n 50

dmesg is where you often see:

  • OOM killer events
  • Disk I/O errors
  • Kernel warnings

Service Inventory (What Is Running?)

systemctl --type=service --state=running
systemctl status myapp --no-pager

Production engineers keep an inventory mindset: what should be running vs what is running.
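
The inventory mindset can be sketched as a diff between a list of units that should be running and what systemctl reports. The helper name missing_services and the file expected.txt (one unit name per line) are assumptions for illustration.

```shell
# missing_services EXPECTED_FILE: read running unit names on stdin (first
# column of each line) and print every unit listed in EXPECTED_FILE that
# is not among them.
missing_services() {
    local expected="$1" running
    running="$(mktemp)" || return 1
    awk '{ print $1 }' | sort -u > "$running"
    # comm -23 keeps lines that appear only in the first (expected) input.
    sort -u "$expected" | comm -23 - "$running"
    rm -f "$running"
}
```

Usage: systemctl list-units --type=service --state=running --no-legend --plain | missing_services expected.txt. Empty output means everything expected is running.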

Common Production Mistakes

  • Running random commands without a mental model
  • Ignoring disk and memory signals until an outage
  • Not correlating load with CPU cores
  • Forgetting to check dmesg during weird failures

Mental Model

You are building situational awareness. Every command you run should answer one question: CPU? Memory? Disk? Network? Services? If you cannot state what you are checking, you are debugging blindly.

Production Checklist

  • 60-second snapshot commands known by heart
  • Disk/memory/network inspected before deep dives
  • dmesg checked for kernel-level clues
  • systemctl inventory used to verify expectations