Zero-Downtime Restarts in Production
Why “Restart” Causes Downtime
A hard restart kills in-flight requests. Under load, that becomes user-visible errors, retries, and cascading failure. Zero-downtime restarts require graceful behavior: keep serving existing connections while new workers come up.
Symptom
- Short outage during deploys
- Spikes in 502/503 errors when restarting services
- Clients see connection resets
- Load balancer marks instance unhealthy during restart
Root Cause
- Hard process restart (SIGKILL) instead of graceful reload
- No connection draining before taking node out of rotation
- Health checks not aligned with startup time
- Single-instance deployments without redundancy
Mental Model
- Reload: keep master process, replace workers gracefully
- Restart: stop and start, often dropping connections
- Drain: stop new traffic, finish existing requests
Investigation
1) Identify Service Process Model
Some services support graceful reload (nginx), others need rolling restart behind a load balancer.
systemctl status nginx
ps aux | grep nginx
2) Check Health Check and Restart Behavior
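systemctl show app.service -p Restart
systemctl show app.service -p TimeoutStopUSec
TimeoutStopUSec is how systemctl show reports the unit's TimeoutStopSec setting. If it is too short, graceful shutdown is cut off: once the timeout expires, systemd sends SIGKILL to any processes still running.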
3) Observe In-Flight Errors During Restart Window
Correlate deploy time with access logs and error spikes.
journalctl -u app.service --since "30 minutes ago"
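To line the deploy time up against error spikes, it can help to bucket 502/503 responses by minute. The following is a minimal sketch, assuming an access log in the nginx/Apache "combined" format (timestamp in field 4, status in field 9). The helper name count_gateway_errors is hypothetical, and so is the log path in the usage note.

```shell
#!/bin/sh
# count_gateway_errors [FILE...]: print "<minute> <count>" for each minute
# that contains 502 or 503 responses. Reads stdin when no file is given.
# Assumes the "combined" log format: field 4 looks like
# "[10/Oct/2024:13:55:36" and field 9 is the HTTP status code.
count_gateway_errors() {
    awk '$9 == 502 || $9 == 503 {
             minute = substr($4, 2, 17)   # dd/Mon/yyyy:HH:MM
             count[minute]++
         }
         END { for (m in count) print m, count[m] }' "$@" | sort
}
```

Example call, assuming the log lives at the common default path: count_gateway_errors /var/log/nginx/access.log. A minute with a cluster of errors that matches the restart timestamp points at the restart itself rather than at background noise.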
Mitigation Patterns
Pattern A: Graceful Reload (Preferred When Supported)
nginx: validate config then reload:
nginx -t
systemctl reload nginx
Reload keeps the master process running and starts new workers with the new configuration. Old workers finish their in-flight requests and then exit, so active connections are not dropped.
Pattern B: Rolling Restart Behind Load Balancer
- Remove node from rotation (drain)
- Wait for connections to finish
- Restart service
- Wait for readiness
- Re-add node
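The order of the steps above, and the rule that a failed step stops the rollout, can be sketched as a small driver. This is an illustration under stated assumptions, not a definitive implementation: the five hook functions are hypothetical placeholders that only print what they would do, and would need to be replaced with calls to your own load balancer and service tooling.

```shell
#!/bin/sh
# Hypothetical hooks: placeholders that print instead of acting.
lb_remove()    { echo "remove $1 from rotation"; }
wait_drained() { echo "wait for $1 to drain"; }
restart_svc()  { echo "restart service on $1"; }
wait_ready()   { echo "wait for $1 to become ready"; }
lb_add()       { echo "re-add $1 to rotation"; }

# restart_node NODE: run the five steps in order; stop at the first failure
# so a node that never becomes ready is not put back into rotation.
restart_node() {
    lb_remove "$1" &&
    wait_drained "$1" &&
    restart_svc "$1" &&
    wait_ready "$1" &&
    lb_add "$1"
}

# rolling_restart NODE...: one node at a time; abort the rollout on failure
# so the remaining nodes keep serving on the old version.
rolling_restart() {
    for node in "$@"; do
        restart_node "$node" || {
            echo "rollout aborted at $node" >&2
            return 1
        }
    done
}
```

Calling rolling_restart web1 web2 (node names are made up) walks web1 through all five steps before touching web2. Aborting on the first failure keeps the blast radius to a single node.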
Drain visibility (example):
ss -tan | grep ESTAB | wc -l
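This count includes every established TCP connection on the host (SSH sessions too), so treat it as a rough signal or filter by the service port. Wait until active connections drop to near zero before restarting.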
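The wait can be scripted instead of watched by hand. Below is a minimal sketch that polls until connections on one port fall to a threshold or a timeout expires. The function names, the port in the usage note, and the default threshold and timeout are all assumptions, and the -H flag needs a reasonably recent iproute2.

```shell
#!/bin/sh
# count_established PORT: number of established TCP connections whose
# local port is PORT. -H drops the header line so wc counts sockets only.
count_established() {
    ss -Htn state established "( sport = :$1 )" | wc -l
}

# wait_for_drain PORT [THRESHOLD] [TIMEOUT_SECONDS]
# Polls once per second. Returns 0 once the connection count is at or
# below THRESHOLD (default 0), or 1 if TIMEOUT_SECONDS (default 60) pass.
wait_for_drain() {
    port=$1
    threshold=${2:-0}
    timeout=${3:-60}
    elapsed=0
    active=unknown
    while [ "$elapsed" -lt "$timeout" ]; do
        active=$(count_established "$port")
        if [ "$active" -le "$threshold" ]; then
            echo "drained: $active connection(s) left on port $port"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timed out after ${timeout}s; $active connection(s) still open" >&2
    return 1
}
```

For example, wait_for_drain 8080 2 90 && sudo systemctl restart app.service restarts only after the node has drained. A timeout matters because long-lived connections (websockets, keep-alive clients) may never close on their own.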
Pattern C: systemd Graceful Stop + Start
Ensure stop timeout allows graceful shutdown:
sudo systemctl edit app.service
Add:
[Service]
TimeoutStopSec=60
KillSignal=SIGTERM
Reload daemon and restart service:
sudo systemctl daemon-reload
sudo systemctl restart app.service
SIGTERM gives the app a chance to close connections cleanly.
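That chance only helps if the application handles the signal. A real service would do this in its own language. The sketch below uses a shell loop as a stand-in to show the shape: on SIGTERM, set a flag, let the current unit of work finish, then exit. The names run_worker and handle_one_job are hypothetical.

```shell
#!/bin/sh
# handle_one_job: hypothetical stand-in for one unit of in-flight work.
handle_one_job() { sleep 1; }

# run_worker: process jobs until SIGTERM arrives. The trap only sets a
# flag; the shell runs it after the current command completes, so the
# job in progress finishes before the loop exits.
run_worker() {
    stopping=0
    trap 'stopping=1' TERM
    while [ "$stopping" -eq 0 ]; do
        handle_one_job
    done
    echo "worker: in-flight job finished, shutting down"
}
```

The job duration has to fit inside TimeoutStopSec. If a job runs longer than the stop timeout, systemd still escalates to SIGKILL.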
Pattern D: Socket Activation (Advanced)
Some services can accept connections via systemd socket activation, reducing downtime during restarts. This is advanced and requires service support.
Common Failure Modes
- Reload used on a service that does not support it (the command fails or the change is never applied)
- Health checks mark node healthy before app is ready
- Restart storms caused by aggressive Restart=always
- TimeoutStopSec too low, causing SIGKILL
Hardening Strategy
- Prefer reload over restart where possible
- Use readiness gates (do not accept traffic before warmup)
- Drain traffic before restart
- Increase graceful shutdown timeouts
- Deploy one node at a time
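A readiness gate from the list above can be as simple as polling a health endpoint before the node is returned to rotation. This is a minimal sketch, assuming the service exposes an HTTP endpoint that returns a 2xx status only once it is warmed up. The function names, the URL in the usage note, and the default timeout are hypothetical.

```shell
#!/bin/sh
# probe_ready URL: succeed only on a non-error HTTP response.
# -f fails on HTTP status >= 400, -s silences progress output,
# --max-time 2 bounds each probe so a hung app cannot stall the loop.
probe_ready() {
    curl -fs --max-time 2 -o /dev/null "$1"
}

# wait_for_ready URL [TIMEOUT_SECONDS]
# Probes once per second. Returns 0 on the first successful probe,
# or 1 if TIMEOUT_SECONDS (default 30) pass without one.
wait_for_ready() {
    url=$1
    timeout=${2:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if probe_ready "$url"; then
            echo "ready after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "not ready after ${timeout}s: $url" >&2
    return 1
}
```

For example, wait_for_ready http://127.0.0.1:8080/healthz 60 would run after the restart and before the node is re-added. The endpoint should check dependencies, not just that the process is listening.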
Verification Checklist
- No 502/503 spike during deploy
- Active connections drain before restart
- Service becomes ready only after dependencies are healthy
- Restart does not drop in-flight requests
nginx -t
systemctl reload nginx
ss -tan | grep ESTAB | wc -l
systemctl show app.service -p TimeoutStopUSec
Why This Matters in Real Infrastructure
Downtime during deploys is not inevitable. With graceful reloads, draining, and correct readiness checks, production systems can deploy without user-visible impact. Zero-downtime restarts reduce error spikes, prevent retry storms, and improve reliability under change.