Health Checks: Readiness vs Liveness

Health checks are a traffic and restart control plane. Learn readiness vs liveness, avoid restart loops, and design checks that prevent cascading failures in Kubernetes and reverse proxies.

Health Checks Are Not “Is the App Running?”

In production, health checks decide two critical things:
- Should this instance receive traffic?
- Should this instance be restarted?

If you confuse readiness and liveness, you will create:
- restart loops
- cascading failures
- traffic black holes
- incident noise that hides the real issue

Health checks are a control plane. Treat them like one.

Real Production Incident

Symptoms:
- Deployment rolls out, then pods start flapping.
- Kubernetes keeps restarting pods.
- Error rate spikes because capacity keeps dropping.
- Team blames the database because logs show DB timeouts.

Root cause:
- Liveness check called a database dependency.
- Database had a brief latency spike.
- Liveness failed, pods restarted, causing reconnect storms.
- The reconnect storms amplified DB load, making the outage worse.

This is a self-inflicted wound.

Readiness vs Liveness (Burn This Into Your Brain)

Readiness: “Can I safely receive traffic right now?”
- If readiness fails, remove the instance from the load balancer.
- Do not restart the process.

Liveness: “Is the process fundamentally broken and in need of a restart?”
- If liveness fails, restart the process.
- The check should be cheap and local.

Production rule: External dependencies belong in readiness, not liveness.

Symptom → Cause → Diagnosis → Fix

Symptom:
- Frequent restarts after deploy
- Instances removed from rotation constantly
- Spiky latency and error rate

Cause:
- Liveness check depends on DB/Redis/HTTP calls
- Health endpoint requires auth
- Health endpoint is slow or allocates heavily
- Probe timeouts and thresholds do not match how the service actually behaves

Diagnosis:
- Check Kubernetes events: probe failures and restart counts.
- Hit health endpoints directly (inside the cluster).
- Measure health endpoint latency.
- Verify probes do not require auth headers.

Fix:
- Make liveness local-only.
- Make readiness reflect the ability to serve traffic.
- Make both fast and allocation-light.
- Separate deep dependency checks from the main probes.

Anti-Pattern: Liveness Calls Dependencies

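This registration is harmless on its own, but it creates restart loops once it backs a liveness probe:
// AddNpgSql comes from the AspNetCore.HealthChecks.NpgSql package.
builder.Services.AddHealthChecks()
    .AddNpgSql(builder.Configuration.GetConnectionString("Main"));
If you expose that as liveness, the platform restarts your pods whenever the DB is slow. That is insane.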
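
To make the failure mode concrete, here is a minimal sketch of how the check above typically ends up behind liveness by accident. It assumes a .NET 6+ web project with implicit usings and the same "Main" connection string; nothing else is taken from a real codebase.

```csharp
// Anti-pattern sketch: do not copy this into production.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddNpgSql(builder.Configuration.GetConnectionString("Main")!);

var app = builder.Build();

// No Predicate is set, so this endpoint runs every registered check,
// including the database check. A slow database now fails liveness.
app.MapHealthChecks("/health/live");

app.Run();
```

The endpoint name says "live", but the route decides nothing. What matters is which checks the endpoint selects.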

Correct Pattern: Separate Checks

Use multiple endpoints:
- /health/live for liveness (local only)
- /health/ready for readiness (can include critical dependencies)

Example:
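// Needs: using Microsoft.Extensions.Diagnostics.HealthChecks; (HealthCheckResult)
builder.Services.AddHealthChecks()
    // Local-only check: passes whenever the process can handle a request.
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // Dependency check, tagged so that only the readiness endpoint selects it.
    .AddNpgSql(builder.Configuration.GetConnectionString("Main"), name: "db", tags: new[] { "ready" });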
Map endpoints:
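// Needs: using Microsoft.AspNetCore.Diagnostics.HealthChecks; (HealthCheckOptions)

// Liveness: run only the local "self" check.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = r => r.Name == "self"
});

// Readiness: run only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = r => r.Tags.Contains("ready")
});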
Now:
- /health/live never depends on the DB.
- /health/ready can fail to remove traffic when the DB is not usable.

Health Endpoints Must Be Unauthenticated

If you protect health endpoints with auth, you will:
- break probes
- get false negatives
- restart healthy instances

Production rule: Health endpoints must be accessible to the platform (cluster, proxy) without user auth.

If you need protection:
- restrict by network policy / IP allowlist
- run on an internal port

Do not add JWT to probes.
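
One way to apply this in ASP.NET Core is sketched below, assuming a .NET 6+ web project with implicit usings. Port 8081 is a hypothetical internal-only port chosen for illustration; the app must also be configured to listen on it (for example through ASPNETCORE_URLS).

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy());

var app = builder.Build();

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = r => r.Name == "self"
})
.AllowAnonymous()        // skip authorization, even if a fallback policy exists
.RequireHost("*:8081");  // only match requests that arrive on port 8081 (assumed)

app.Run();
```

AllowAnonymous keeps probes working when the rest of the app requires authenticated users, and RequireHost keeps the endpoint off the public port. Network policy still has to stop outside callers from reaching that port.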

Timeouts and Thresholds Matter

If your readiness probe times out too aggressively, you will eject pods during minor hiccups. If it is too lenient, you will route traffic to broken instances.

Tune based on reality:
- probe interval
- failure threshold
- timeout seconds

Never copy-paste defaults blindly.
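
The trade-off can be estimated with simple arithmetic. The sketch below uses a hypothetical ProbeBudget helper, not part of any library, whose fields mirror the Kubernetes probe settings periodSeconds, timeoutSeconds, and failureThreshold. The formulas are rough estimates that ignore jitter and initial delay.

```csharp
using System;

// Hypothetical helper for reasoning about probe settings.
public sealed record ProbeBudget(int PeriodSeconds, int TimeoutSeconds, int FailureThreshold)
{
    // Rough worst case between an instance breaking and the platform acting:
    // FailureThreshold consecutive probes must fail, one per period, and the
    // last one may run until its timeout.
    public TimeSpan WorstCaseReaction =>
        TimeSpan.FromSeconds(PeriodSeconds * FailureThreshold + TimeoutSeconds);

    // Rough lower bound on tolerance: a hiccup shorter than this cannot
    // overlap FailureThreshold consecutive probes, so it should not trip.
    public TimeSpan ToleratedHiccup =>
        TimeSpan.FromSeconds(PeriodSeconds * (FailureThreshold - 1) - TimeoutSeconds);
}
```

With the Kubernetes defaults (period 10 s, timeout 1 s, failure threshold 3), `new ProbeBudget(10, 1, 3)` gives a worst-case reaction of about 31 seconds and tolerates hiccups shorter than about 19 seconds. Compare those numbers with the latency spikes you actually see in your dependencies before changing anything.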

Deep Checks vs Traffic Readiness

You may want deep checks (DB connectivity, downstream services), but do not put everything into readiness. If you tie readiness to every dependency, one flaky downstream can remove all pods from rotation.

Strategy:
- readiness includes only dependencies required to serve core traffic safely
- deep checks exist for dashboards and alerts, not for routing decisions
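
A minimal sketch of this split, assuming a .NET 6+ web project with implicit usings and the AspNetCore.HealthChecks.NpgSql package. The "deep" tag, the /health/deep route, and the "search-api" check are hypothetical names used only for illustration.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // Core dependency: affects routing and also appears in the deep report.
    .AddNpgSql(builder.Configuration.GetConnectionString("Main")!,
        name: "db", tags: new[] { "ready", "deep" })
    // Optional downstream: the delegate is a placeholder for a real call.
    .AddCheck("search-api", () => HealthCheckResult.Healthy(), tags: new[] { "deep" });

var app = builder.Build();

app.MapHealthChecks("/health/live", new HealthCheckOptions { Predicate = r => r.Name == "self" });
app.MapHealthChecks("/health/ready", new HealthCheckOptions { Predicate = r => r.Tags.Contains("ready") });

// Intended for dashboards and alerting only; no probe should point here.
app.MapHealthChecks("/health/deep", new HealthCheckOptions { Predicate = r => r.Tags.Contains("deep") });

app.Run();
```

Because "search-api" carries only the "deep" tag, a failure there shows up in monitoring without removing any pod from rotation.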

Operational Notes

Monitoring:
- Track readiness failures as a rate.
- Track restart count and probe failure reasons.
- Alert on rising restart counts (the cause is usually misconfigured probes).

Rollout:
- Canary deploy and monitor probe failure patterns.
- Validate that new pods become Ready quickly and stay Ready under load.

Rollback:
- If a new probe configuration causes flapping, revert the probe config immediately.
- Treat probe changes as production changes, with the same rigor as code.

Risk management:
- Avoid dependency checks in liveness.
- Avoid readiness checks that are slow or lock resources.
- Keep health endpoints lightweight.

Checklist

- Liveness is local-only and fast (no network dependencies).
- Readiness represents the ability to serve traffic, not overall system health.
- Health endpoints do not require auth.
- Probe timeouts and thresholds are tuned to production reality.
- Dependency checks are tagged and separated (live vs ready).
- Deep checks exist but do not control traffic unless necessary.
- Monitoring alerts on probe failures and restart loops.
- Probe changes are versioned and can be rolled back.