Health Checks: Readiness vs Liveness
Health Checks Are Not “Is the App Running?”
In production, health checks decide two critical things:
- Should this instance receive traffic?
- Should this instance be restarted?
If you confuse readiness and liveness, you will create:
- restart loops
- cascading failures
- traffic black holes
- incident noise that hides the real issue
Health checks are a control plane. Treat them like one.
Real Production Incident
Symptoms:
- Deployment rolls out, then pods start flapping.
- Kubernetes keeps restarting pods.
- Error rate spikes because capacity keeps dropping.
- Team blames the database because logs show DB timeouts.
Root cause:
- Liveness check called a database dependency.
- Database had a brief latency spike.
- Liveness failed, pods restarted, causing reconnect storms.
- The reconnect storms amplified DB load, making the outage worse.
This is a self-inflicted wound.
Readiness vs Liveness (Burn This Into Your Brain)
Readiness: “Can I safely receive traffic right now?”
- If readiness fails, remove instance from load balancer.
- Do not restart the process.
Liveness: “Is the process fundamentally broken and needs restart?”
- If liveness fails, restart the process.
- Should be cheap and local.
Production rule: External dependencies belong in readiness, not liveness.
Symptom → Cause → Diagnosis → Fix
Symptom:
- Frequent restarts after deploy
- Instances removed from rotation constantly
- Spiky latency and error rate
Cause:
- Liveness check depends on DB/Redis/HTTP calls
- Health endpoint requires auth
- Health endpoint is slow or allocates heavily
- Wrong timeouts and thresholds
Diagnosis:
- Check Kubernetes events: probe failures and restart counts.
- Hit health endpoints directly (inside cluster).
- Measure health endpoint latency.
- Verify probes do not require auth headers.
Fix:
- Make liveness local-only.
- Make readiness reflect ability to serve traffic.
- Make both fast and non-allocating.
- Separate deep dependency checks from the main probes.
Anti-Pattern: Liveness Calls Dependencies
This will create restart loops:
builder.Services.AddHealthChecks()
    // Untagged DB check: any endpoint mapped without a Predicate will run it.
    .AddNpgSql(builder.Configuration.GetConnectionString("Main"));
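If you expose that check on the liveness endpoint (a MapHealthChecks call with no Predicate runs every registered check), Kubernetes restarts the pod whenever the DB is slow. That is insane.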
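One way to confirm this failure mode is to time the health endpoint directly, as the diagnosis steps suggest. The following is a minimal sketch, not a definitive tool: `ProbeAsync` is a hypothetical helper, and the URL (host, port, path) and the one-second timeout are placeholder assumptions that should be replaced with the pod address and the probe timeout actually in use.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: times one unauthenticated GET against a health endpoint.
// Returns a null status code when the request fails or times out.
static async Task<(int? StatusCode, TimeSpan Elapsed)> ProbeAsync(HttpClient client, string url)
{
    var sw = Stopwatch.StartNew();
    try
    {
        using var response = await client.GetAsync(url);
        return ((int)response.StatusCode, sw.Elapsed);
    }
    catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
    {
        // Connection errors and client-side timeouts both end up here.
        return (null, sw.Elapsed);
    }
}

// Assumed address and timeout (placeholders, not values from a real cluster).
using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(1) };
var (status, elapsed) = await ProbeAsync(client, "http://localhost:8080/health/live");
Console.WriteLine($"status={status?.ToString() ?? "failed"} elapsed={elapsed.TotalMilliseconds:F0} ms");
```

If the measured latency of a liveness endpoint tracks database latency, the endpoint is almost certainly running a dependency check.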
Correct Pattern: Separate Checks
Use multiple endpoints:
- /health/live for liveness (local only)
- /health/ready for readiness (can include critical dependencies)
Example:
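builder.Services.AddHealthChecks()
    // Local-only check: healthy whenever the process can answer the request.
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // Dependency check: tagged "ready" so only the readiness endpoint selects it.
    .AddNpgSql(builder.Configuration.GetConnectionString("Main"), name: "db", tags: new[] { "ready" });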
Map endpoints:
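// HealthCheckOptions lives in Microsoft.AspNetCore.Diagnostics.HealthChecks.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: run only the local "self" check.
    Predicate = r => r.Name == "self"
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: run every check tagged "ready" (here, the database check).
    Predicate = r => r.Tags.Contains("ready")
});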
Now:
- /health/live never depends on DB.
- /health/ready fails when the DB is not usable, which takes the instance out of rotation without restarting it.
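To see the separation in action without a real database, the two predicates can be run directly against `HealthCheckService`. This is a sketch under stated assumptions: the `db` check here is a stand-in delegate that always reports a failure (it is not the Npgsql check), and the program is a plain console app rather than the web host used above.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var services = new ServiceCollection();
services.AddLogging(); // the default HealthCheckService requires an ILogger
services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // Stand-in for the database check (assumption): simulates an outage.
    .AddCheck("db", () => HealthCheckResult.Unhealthy("simulated outage"), tags: new[] { "ready" });

using var provider = services.BuildServiceProvider();
var health = provider.GetRequiredService<HealthCheckService>();

// Same predicates as the /health/live and /health/ready endpoints.
var live = await health.CheckHealthAsync(r => r.Name == "self");
var ready = await health.CheckHealthAsync(r => r.Tags.Contains("ready"));

Console.WriteLine($"live={live.Status} ready={ready.Status}");
// prints "live=Healthy ready=Unhealthy"
```

With the simulated outage, liveness stays healthy, so nothing is restarted, while readiness reports unhealthy, so the instance would be pulled from rotation.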