Zero-Downtime Restarts in Production

Achieve zero-downtime restarts using graceful reloads, connection draining, and process models. Avoid breaking in-flight requests during deploys and prevent restart storms under load.

Why “Restart” Causes Downtime

A hard restart kills in-flight requests. Under load, that becomes user-visible errors, retries, and cascading failure. Zero-downtime restarts require graceful behavior: keep serving existing connections while new workers come up.

Symptom

  • Short outage during deploys
  • Spikes in 502/503 errors when restarting services
  • Clients see connection resets
  • Load balancer marks instance unhealthy during restart

Root Cause

  • Hard stop/start of the process (SIGKILL in the worst case) instead of a graceful reload
  • No connection draining before taking node out of rotation
  • Health checks not aligned with startup time
  • Single-instance deployments without redundancy

Mental Model

  • Reload: keep master process, replace workers gracefully
  • Restart: stop and start, often dropping connections
  • Drain: stop new traffic, finish existing requests

Investigation

1) Identify Service Process Model

Some services support graceful reload (nginx, for example); others need a rolling restart behind a load balancer.

systemctl status nginx
ps aux | grep nginx
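
One way to tell whether a unit supports `systemctl reload` is to check whether it defines an ExecReload command. The sketch below assumes a systemd version that supports `--value` on `systemctl show`; the function name supports_reload is made up for illustration.

```shell
# Return success if the given unit defines an ExecReload command.
# supports_reload is a hypothetical helper name, not a systemd tool.
supports_reload() {
  # --value prints only the property value; it is empty when ExecReload is unset.
  [ -n "$(systemctl show "$1" -p ExecReload --value)" ]
}

# Example use:
#   if supports_reload nginx.service; then echo "reload"; else echo "rolling restart"; fi
```

Units that fail this check are candidates for a rolling restart behind a load balancer rather than an in-place reload.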

2) Check Health Check and Restart Behavior

systemctl show app.service -p Restart
systemctl show app.service -p TimeoutStopUSec

If TimeoutStopUSec is too short, systemd sends SIGKILL before the service has finished its graceful shutdown.

3) Observe In-Flight Errors During Restart Window

Correlate deploy time with access logs and error spikes.

journalctl -u app.service --since "30 minutes ago"
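
To line a deploy up against error spikes, it can help to bucket 502/503 responses per minute. This is a minimal sketch that assumes the access log uses nginx's default "combined" format (status code in the ninth whitespace-separated field); the function name and the log path in the usage comment are placeholders.

```shell
# Print "<day/month/year:hour:minute> <count>" for 502 and 503 responses.
# Assumes nginx "combined" log format: $4 is "[dd/Mon/yyyy:HH:MM:SS", $9 is the status.
gateway_errors_per_minute() {
  awk '$9 == 502 || $9 == 503 { counts[substr($4, 2, 17)]++ }
       END { for (minute in counts) print minute, counts[minute] }' "$1" | sort
}

# Example use (path is an assumption):
#   gateway_errors_per_minute /var/log/nginx/access.log
```

The sort is lexical, which is fine within a single day; a deploy window that crosses midnight would need a real timestamp sort.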

Mitigation Patterns

Pattern A: Graceful Reload (Preferred When Supported)

For nginx, validate the config, then reload:

nginx -t
systemctl reload nginx

Reload keeps the master process running: it starts new workers with the new configuration and lets the old workers finish their in-flight requests before they exit.
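
The two commands can be chained so that a failed config test never triggers a reload. A minimal sketch; the function name reload_nginx_if_valid is illustrative only.

```shell
# Reload nginx only when the configuration test passes.
reload_nginx_if_valid() {
  if nginx -t; then
    systemctl reload nginx
  else
    # Leave the running workers untouched on a bad config.
    echo "nginx config test failed; keeping current workers" >&2
    return 1
  fi
}
```

A deploy script can then stop on a non-zero return instead of continuing with a broken configuration.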

Pattern B: Rolling Restart Behind Load Balancer

  • Remove node from rotation (drain)
  • Wait for connections to finish
  • Restart service
  • Wait for readiness
  • Re-add node

Drain visibility (example):

ss -tan | grep ESTAB | wc -l

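Wait until active connections drop to near zero before restarting. Note that this count covers every established TCP connection on the host (SSH sessions and outbound connections included), so filter by the service port when you need a precise number.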
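
The drain wait can be sketched as a polling loop with a timeout. The port (8080) and the function names are assumptions for illustration, as are the lb_remove and lb_add steps mentioned in the comments, which stand in for whatever your load balancer provides.

```shell
# Hypothetical service port; override with PORT=... in the environment.
PORT="${PORT:-8080}"

# Count established TCP connections whose local port is $PORT.
count_established() {
  ss -Htn state established "( sport = :$PORT )" | wc -l
}

# Poll once per second until the count is at or below the threshold.
# Usage: wait_for_drain [max_wait_seconds] [threshold]
# Returns 1 if connections are still open after max_wait_seconds.
wait_for_drain() {
  local max_wait="${1:-60}" threshold="${2:-0}" waited=0
  while [ "$(count_established)" -gt "$threshold" ]; do
    [ "$waited" -ge "$max_wait" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Possible sequence for one node (lb_remove / lb_add are placeholders):
#   lb_remove && wait_for_drain 60 && systemctl restart app.service && lb_add
```

A timeout matters here: long-lived connections such as WebSockets may never close on their own, so the script needs a point at which it proceeds or aborts.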

Pattern C: systemd Graceful Stop + Start

Ensure the stop timeout is long enough for a graceful shutdown:

sudo systemctl edit app.service

Add:

[Service]
TimeoutStopSec=60
KillSignal=SIGTERM

Reload daemon and restart service:

sudo systemctl daemon-reload
sudo systemctl restart app.service

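SIGTERM gives the app a chance to close connections cleanly, provided the app actually handles the signal. If the process has not exited when TimeoutStopSec expires, systemd sends SIGKILL.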
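
What "handling SIGTERM" means on the application side can be illustrated with a toy worker loop. This is only a sketch in shell; handle_one_request is a stand-in for real request handling, and a real service would do the equivalent in its own language and server framework.

```shell
# Hypothetical unit of work; a real app would serve a request here.
handle_one_request() {
  sleep 0.2
}

# Worker loop that drains on SIGTERM instead of dying mid-request.
run_worker() {
  stopping=0
  # The handler only sets a flag; the request in progress is allowed to finish.
  trap 'stopping=1' TERM
  while [ "$stopping" -eq 0 ]; do
    handle_one_request
  done
  echo "worker drained, exiting cleanly"
}
```

The shell runs the trap only after the current foreground command returns, so the in-progress unit of work completes before the loop exits. TimeoutStopSec has to be longer than the slowest such unit.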

Pattern D: Socket Activation (Advanced)

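Some services can accept connections via systemd socket activation. systemd owns the listening socket, so new connections queue in the socket backlog while the service restarts instead of being refused. This is advanced and requires the service to support receiving its listening socket from systemd.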

Common Failure Modes

  • Reload used on a unit that defines no ExecReload (systemctl reload fails, so nothing changes)
  • Health checks mark node healthy before app is ready
  • Restart storms caused by aggressive Restart=always with a short RestartSec
  • TimeoutStopSec too low, causing SIGKILL

Hardening Strategy

  • Prefer reload over restart where possible
  • Use readiness gates (do not accept traffic before warmup)
  • Drain traffic before restart
  • Set graceful shutdown timeouts longer than the slowest expected request
  • Deploy one node at a time
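
A readiness gate can be as simple as polling a health endpoint before the node is put back into rotation. The sketch below assumes the app exposes an HTTP health endpoint; the URL and the function names are placeholders.

```shell
# Hypothetical health endpoint; override with HEALTH_URL=... in the environment.
HEALTH_URL="${HEALTH_URL:-http://127.0.0.1:8080/healthz}"

# Single probe: succeeds only on an HTTP success status within 2 seconds.
probe_ready() {
  curl -fsS -o /dev/null --max-time 2 "$HEALTH_URL"
}

# Retry the probe once per second, up to the given number of attempts.
# Usage: wait_for_ready [attempts]
wait_for_ready() {
  local attempts="${1:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    probe_ready && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

The gate is only as good as the endpoint behind it: if the health check answers before dependencies are connected and caches are warm, the node will still take traffic too early.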

Verification Checklist

  • No 502/503 spike during deploy
  • Active connections drain before restart
  • Service becomes ready only after dependencies are healthy
  • Restart does not drop in-flight requests

Useful commands:

nginx -t
systemctl reload nginx
ss -tan | grep ESTAB | wc -l
systemctl show app.service -p TimeoutStopUSec

Why This Matters in Real Infrastructure

Downtime during deploys is not inevitable. With graceful reloads, draining, and correct readiness checks, production systems can deploy without user-visible impact. Zero-downtime restarts reduce error spikes, prevent retry storms, and improve reliability under change.