Zero-Downtime Restarts in Production

Achieve zero-downtime restarts using graceful reloads, connection draining, and process models. Avoid breaking in-flight requests during deploys and prevent restart storms under load.

Why “Restart” Causes Downtime

A hard restart kills in-flight requests. Under load, that becomes user-visible errors, retries, and cascading failure. Zero-downtime restarts require graceful behavior: keep serving existing connections while new workers come up.

Symptom

Short outage during deploys
Spikes in 502/503 errors when restarting services
Clients see connection resets
Load balancer marks instance unhealthy during restart

Root Cause

Hard process restart (stop and start, ending in SIGKILL if shutdown overruns) instead of graceful reload
No connection draining before taking node out of rotation
Health checks not aligned with startup time
Single-instance deployments without redundancy

Mental Model

Reload: keep master process, replace workers gracefully
Restart: stop and start, often dropping connections
Drain: stop new traffic, finish existing requests

Investigation

1) Identify Service Process Model

Some services support graceful reload (nginx, for example); others need a rolling restart behind a load balancer.

systemctl status nginx
ps aux | grep '[n]ginx'

2) Check Health Check and Restart Behavior

systemctl show app.service -p Restart
systemctl show app.service -p TimeoutStopUSec

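If TimeoutStopUSec is too short, systemd sends SIGKILL before graceful shutdown completes. (TimeoutStopUSec is how systemctl show reports the unit's TimeoutStopSec= setting.)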

3) Observe In-Flight Errors During Restart Window

Correlate deploy time with access logs and error spikes.

journalctl -u app.service --since "30 minutes ago"
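
To line up a deploy with error spikes, it can help to bucket 502/503 responses per minute. The following is a minimal sketch, assuming access logs in nginx's default "combined" format (timestamp in field 4, status in field 9). The function name is illustrative and not part of any tool.

```shell
# Count 502/503 responses per minute from access-log lines on stdin.
# Assumes nginx "combined" format: field 4 looks like "[10/Oct/2024:13:55:36"
# and field 9 is the HTTP status. Adjust the field numbers for other formats.
count_gateway_errors_per_minute() {
  awk '$9 == 502 || $9 == 503 {
         minute = substr($4, 2, 17)   # e.g. "10/Oct/2024:13:55"
         errors[minute]++
       }
       END { for (m in errors) print m, errors[m] }' | sort
}
```

A possible invocation is count_gateway_errors_per_minute < /var/log/nginx/access.log, where the log path is an assumption about the local setup. The sort is lexical, so the ordering is chronological only within a single day.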

Mitigation Patterns

Pattern A: Graceful Reload (Preferred When Supported)

For nginx, validate the config, then reload:

nginx -t
systemctl reload nginx

Reload keeps the master process and replaces workers without dropping active connections.
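
In a deploy script, the validation and the reload are usually chained so that a broken config never reaches the reload step. A minimal sketch, where safe_nginx_reload is a hypothetical wrapper name:

```shell
# Reload nginx only if the new configuration parses cleanly.
# On a failed config test, return non-zero and leave the running workers alone.
safe_nginx_reload() {
  if ! nginx -t; then
    echo "nginx config test failed; not reloading" >&2
    return 1
  fi
  systemctl reload nginx
}
```

Because the function returns non-zero on a failed test, a calling script can abort the deploy instead of continuing with a stale configuration.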

Pattern B: Rolling Restart Behind Load Balancer

Remove node from rotation (drain)
Wait for connections to finish
Restart service
Wait for readiness
Re-add node

Drain visibility (example):

ss -tan | grep ESTAB | wc -l

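Wait until active connections drop to near zero before restarting. This command counts every established TCP connection on the host, including SSH sessions and outbound connections, so the count will rarely reach exactly zero.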
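
The wait can be scripted rather than watched by hand. The sketch below narrows the count to one listening port and polls until it falls to a threshold or a timeout expires. The function names, the port argument, and the default values are assumptions for illustration.

```shell
# Count established TCP connections whose local port is $1.
# -H drops the header line, -t selects TCP, -n skips name resolution.
count_established() {
  ss -Htn state established "( sport = :$1 )" | wc -l
}

# Poll until the count is at or below a threshold, or give up.
# Usage: wait_for_drain PORT [MAX_REMAINING] [TIMEOUT_SECONDS]
wait_for_drain() {
  local port="$1" max_remaining="${2:-0}" timeout="${3:-60}"
  local waited=0 active=unknown
  while [ "$waited" -lt "$timeout" ]; do
    active=$(count_established "$port")
    if [ "$active" -le "$max_remaining" ]; then
      echo "drained: $active connection(s) left on port $port"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out with $active connection(s) still open" >&2
  return 1
}
```

For example, wait_for_drain 8080 0 120 && sudo systemctl restart app.service restarts only once port 8080 has no established connections. Long-lived connections such as WebSockets may never drain on their own, which is why the timeout exists.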

Pattern C: systemd Graceful Stop + Start

Ensure stop timeout allows graceful shutdown:

sudo systemctl edit app.service

Add:

[Service]
TimeoutStopSec=60
KillSignal=SIGTERM

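Reload the daemon and restart the service (systemctl edit already reloads unit files on save, so daemon-reload matters mainly when the drop-in was edited by hand):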

sudo systemctl daemon-reload
sudo systemctl restart app.service

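SIGTERM gives the app a chance to finish in-flight requests and close connections cleanly, provided the app handles the signal. systemd escalates to SIGKILL only after TimeoutStopSec expires.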
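
What handling SIGTERM looks like depends on the application. As a minimal sketch for a shell-based worker (run_worker and handle_one_job are hypothetical names), a trap flips a flag so the loop stops taking new work and exits once the current job is done:

```shell
# Hypothetical worker loop: on SIGTERM it stops picking up new jobs,
# lets the current one finish, then exits with status 0.
run_worker() {
  shutting_down=0
  trap 'shutting_down=1' TERM
  while [ "$shutting_down" -eq 0 ]; do
    handle_one_job        # placeholder for real request or job handling
  done
  echo "in-flight work finished; exiting cleanly"
}

# Stand-in job so the sketch runs on its own.
handle_one_job() { sleep 1; }
```

The shell runs the trap only after the current foreground command returns, which is what lets the in-flight job complete. Under systemd's default KillMode=control-group, child processes also receive SIGTERM, so a real service typically does this handling inside its main process.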

Pattern D: Socket Activation (Advanced)

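With systemd socket activation, systemd owns the listening socket and passes it to the service. While the service restarts, new connections queue on that socket instead of being refused, which reduces downtime. This is advanced and requires the service to support accepting an inherited socket.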

Common Failure Modes

Reload used on a service that does not support it (the command fails or the change never takes effect)
Health checks mark node healthy before app is ready
Restart storms caused by aggressive Restart=always with little or no RestartSec delay
TimeoutStopSec too low, causing SIGKILL
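
One way to dampen restart storms is a drop-in that adds a delay between restarts and caps how many starts systemd will attempt in a window. The sketch below writes such a drop-in. The function name, the file name, and every numeric value are illustrative assumptions to be tuned per service.

```shell
# Write a drop-in that spaces out automatic restarts and caps retries.
# Usage: write_restart_backoff_dropin /etc/systemd/system/app.service.d
write_restart_backoff_dropin() {
  local dir="$1"
  mkdir -p "$dir" || return 1
  cat > "$dir/restart-backoff.conf" <<'EOF'
[Unit]
# Refuse further starts after 5 attempts within 300 seconds.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Restart only after failures, and wait 5 seconds between attempts.
Restart=on-failure
RestartSec=5
EOF
}
```

After writing the file (as root for a system unit), run sudo systemctl daemon-reload so systemd picks up the drop-in. Once the start limit is hit the unit stays failed until someone intervenes, which is the intended outcome when a service cannot start at all.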

Hardening Strategy

Prefer reload over restart where possible
Use readiness gates (do not accept traffic before warmup)
Drain traffic before restart
Increase graceful shutdown timeouts
Deploy one node at a time
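
A readiness gate can be as simple as polling a health endpoint before the node is put back into rotation. A minimal sketch, assuming the app exposes an HTTP health URL that returns a non-error status only when it is ready (the function name and defaults are hypothetical):

```shell
# Poll a health endpoint until it answers successfully, or give up.
# Usage: wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f treats HTTP >= 400 as failure; --max-time bounds a hung probe.
    if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      echo "ready after $i probe(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts probe(s): $url" >&2
  return 1
}
```

In a rolling deploy, the node would be re-added to the load balancer only if a call such as wait_for_ready http://127.0.0.1:8080/healthz succeeds. The URL here is a placeholder for whatever endpoint the service provides.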

Verification Checklist

No 502/503 spike during deploy
Active connections drain before restart
Service becomes ready only after dependencies are healthy
Restart does not drop in-flight requests

nginx -t
systemctl reload nginx
ss -tan | grep ESTAB | wc -l
systemctl show app.service -p TimeoutStopUSec

Why This Matters in Real Infrastructure

Downtime during deploys is not inevitable. With graceful reloads, draining, and correct readiness checks, production systems can deploy without user-visible impact. Zero-downtime restarts reduce error spikes, prevent retry storms, and improve reliability under change.
