Health Checks and Readiness
Why health checks are a deployment contract
Health endpoints are not just for monitoring. They are a contract between your service and the platform that runs it. Orchestrators decide when to restart your process and when to route traffic based on these endpoints. If health semantics are wrong, you get deploy instability: restart loops, traffic to broken instances, or slow rollouts that hide real problems.
Three concepts you must separate
- Liveness: answers whether the process should be restarted. It should be cheap and should not depend on external systems.
- Readiness: answers whether the instance should receive traffic. It can depend on critical dependencies and internal warmup state.
- Dependency health: detailed checks for databases, caches, and downstreams. Often used for readiness, but should remain bounded and fast.
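To make the separation concrete, here is a minimal std-only sketch of the two decisions as pure functions. The `InstanceState` struct, its field names, and the choice of 500 for a failed liveness check are illustrative assumptions, not part of the handlers shown later.

```rust
/// Hypothetical snapshot of what an instance knows about itself.
#[derive(Debug, Clone, Copy)]
pub struct InstanceState {
    /// The process hit an unrecoverable condition and should be restarted.
    pub fatal: bool,
    /// Startup work (configuration, caches) has finished.
    pub warmed_up: bool,
    /// Result of the most recent bounded dependency check.
    pub database_ok: bool,
}

/// Liveness looks only at process-local facts.
/// A database outage must not change this answer.
pub fn liveness_status(state: &InstanceState) -> u16 {
    if state.fatal { 500 } else { 200 }
}

/// Readiness combines internal warmup state with critical dependencies.
pub fn readiness_status(state: &InstanceState) -> u16 {
    if !state.fatal && state.warmed_up && state.database_ok {
        200
    } else {
        503
    }
}
```

With the database down, `liveness_status` still returns 200 while `readiness_status` returns 503, so the platform stops routing traffic without restarting a process that is otherwise fine.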
Production goals at this level
- Fast responses: probes must return quickly so they do not time out or pile up when the service is under load.
- Bounded work: dependency checks must have timeouts and never hang.
- Correct behavior on shutdown: readiness should turn off quickly during graceful shutdown.
- Clear debugging: responses and logs should make it obvious what failed.
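The "bounded work" goal can be illustrated without any async runtime. The sketch below is a std-only analogue that uses a helper thread and a channel; the name `bounded_check` is hypothetical. In an async service you would more likely wrap the check in the runtime's own timeout.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `check` on a helper thread and waits at most `limit` for its answer.
/// A timeout or a panicking check both count as a failure.
pub fn bounded_check<F>(check: F, limit: Duration) -> bool
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // After a timeout the receiver is gone, so a send error is expected
        // and safe to ignore.
        let _ = tx.send(check());
    });
    // Err means either the limit elapsed or the sender was dropped
    // because the check panicked.
    rx.recv_timeout(limit).unwrap_or(false)
}
```

One caveat of this sketch: the helper thread keeps running after a timeout, so the check itself should still have its own I/O deadline. The caller, however, always gets an answer within `limit`.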
Recommended endpoints
- /healthz: liveness, always cheap, returns OK if the process is running.
- /readyz: readiness, depends on internal state and critical dependencies.
- /livez: optional alias for liveness if your platform expects it.
Keep naming consistent across services so operations can rely on conventions.
Minimal liveness handler
Liveness should not query the database. It should reflect process health, for example that the server is running and not in a fatal state.
use axum::http::StatusCode;

// Liveness: no I/O and no dependencies. Reaching this handler is the signal.
pub async fn healthz() -> StatusCode {
    StatusCode::OK
}
Readiness: combine internal state and dependency checks
Readiness should represent whether the service can serve real traffic. At this stage, readiness commonly depends on:
- Startup completed and configuration loaded
- Database connectivity works
- Optional: migrations applied or schema version compatible
Readiness checks should be bounded: they must have timeouts and avoid heavy work.
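As a rough illustration of the optional schema condition, the sketch below treats compatibility as an inclusive version range and combines it with the other readiness inputs. `SchemaSupport`, `can_serve`, and the idea of a single integer schema version are assumptions for the example; your migration tool may track versions differently.

```rust
/// Hypothetical inclusive range of schema versions this binary can work with.
#[derive(Debug, Clone, Copy)]
pub struct SchemaSupport {
    pub min: u32,
    pub max: u32,
}

impl SchemaSupport {
    /// True if the applied schema version falls inside the supported range.
    pub fn is_compatible(&self, applied: u32) -> bool {
        (self.min..=self.max).contains(&applied)
    }
}

/// Combines the readiness inputs listed above. `applied_schema` is `None`
/// when the version could not be read, which is treated as not ready.
pub fn can_serve(
    startup_done: bool,
    database_ok: bool,
    applied_schema: Option<u32>,
    support: SchemaSupport,
) -> bool {
    startup_done
        && database_ok
        && applied_schema.map_or(false, |version| support.is_compatible(version))
}
```

Reading the applied version is itself a dependency call, so it should run under the same timeout as the connectivity check, or be cached at startup.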
State-driven readiness flag
A simple pattern is a readiness flag in shared state. This makes graceful shutdown safe: on shutdown, flip the flag and readiness returns 503 quickly.
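use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Cheap to clone: every clone shares the same underlying flag.
#[derive(Clone)]
pub struct Readiness {
    ready: Arc<AtomicBool>,
}

impl Readiness {
    // Starts in the ready state, so construct it only after startup work is
    // done, or call set_ready(false) until the service can take traffic.
    pub fn new() -> Self {
        Self { ready: Arc::new(AtomicBool::new(true)) }
    }

    // Relaxed ordering is enough: the flag does not guard any other data.
    pub fn set_ready(&self, value: bool) {
        self.ready.store(value, Ordering::Relaxed);
    }

    pub fn is_ready(&self) -> bool {
        self.ready.load(Ordering::Relaxed)
    }
}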
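If you want the flag to also cover startup, one possible variation is a small lifecycle state instead of a boolean. The sketch below is an assumption-laden alternative, not a replacement for the `Readiness` type above: `Lifecycle`, `Phase`, and the method names are hypothetical. It starts as not ready and never lets a draining instance become ready again.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Phase {
    Starting = 0,
    Ready = 1,
    Draining = 2,
}

/// Shared lifecycle state; clones observe the same phase.
#[derive(Clone)]
pub struct Lifecycle {
    phase: Arc<AtomicU8>,
}

impl Lifecycle {
    /// New instances begin in `Starting`, so readiness is off by default.
    pub fn new() -> Self {
        Self { phase: Arc::new(AtomicU8::new(Phase::Starting as u8)) }
    }

    /// Moves Starting -> Ready. Does nothing in any other phase, so a
    /// late call cannot re-enable an instance that is already draining.
    pub fn mark_ready(&self) {
        let _ = self.phase.compare_exchange(
            Phase::Starting as u8,
            Phase::Ready as u8,
            Ordering::Relaxed,
            Ordering::Relaxed,
        );
    }

    /// Called when shutdown begins; readiness turns off immediately.
    pub fn begin_drain(&self) {
        self.phase.store(Phase::Draining as u8, Ordering::Relaxed);
    }

    pub fn phase(&self) -> Phase {
        match self.phase.load(Ordering::Relaxed) {
            0 => Phase::Starting,
            1 => Phase::Ready,
            _ => Phase::Draining,
        }
    }

    pub fn is_ready(&self) -> bool {
        self.phase() == Phase::Ready
    }
}
```

The trade-off is a slightly larger API in exchange for a safe default: forgetting to call `mark_ready` keeps the instance out of rotation instead of exposing it too early.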
Dependency check: database ping with a timeout
A dependency check must be fast and bounded. Use a lightweight query like SELECT 1 and wrap it in a timeout. This prevents readiness from hanging when the database is slow.
use axum::{extract::State, http::StatusCode};
use std::time::Duration;
use tokio::time::timeout;

#[derive(Clone)]
pub struct AppState {
    pub pool: sqlx::MySqlPool,
    pub readiness: crate::Readiness,
}

// Returns true only if the ping succeeds within one second.
async fn db_ok(pool: &sqlx::MySqlPool) -> bool {
    let fut = async {
        sqlx::query("SELECT 1").execute(pool).await.is_ok()
    };
    // An elapsed timeout is treated the same as a failed query.
    timeout(Duration::from_secs(1), fut).await.unwrap_or(false)
}

pub async fn readyz(State(state): State<AppState>) -> StatusCode {
    // Check the in-process flag first so shutdown never waits on the database.
    if !state.readiness.is_ready() {
        return StatusCode::SERVICE_UNAVAILABLE;
    }
    if db_ok(&state.pool).await {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    }
}
Router wiring
Keep health endpoints stable and do not hide them behind authentication. Probes must work before the service becomes fully functional.
use axum::{routing::get, Router};

pub fn app(state: AppState) -> Router {
    Router::new()
        .route("/healthz", get(crate::healthz))
        .route("/readyz", get(crate::readyz))
        .with_state(state)
}
Readiness and graceful shutdown
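When shutdown begins, disable readiness first so the load balancer stops routing new traffic, then let in-flight requests drain via graceful shutdown. The platform only notices the change on a later probe, so leave enough time between flipping the flag and closing the listener for a failed readiness probe to be observed. Done in this order, an exiting instance stops receiving new requests before it goes away, which removes a common source of deploy-time errors.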
pub async fn shutdown_signal_and_disable_readiness(readiness: crate::Readiness) {
    crate::shutdown_signal().await;
    readiness.set_ready(false);
    tracing::info!("readiness disabled, draining");
}
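How long to wait between disabling readiness and actually stopping depends on the platform's probe settings. The sketch below estimates a rough upper bound, assuming the platform marks an instance unready after a configured number of consecutive failed probes at a fixed interval. The function names, parameters, and the formula itself are illustrative assumptions; check your orchestrator's probe documentation for the real semantics.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Rough upper bound on how long the platform may take to notice a
/// readiness failure: `failure_threshold` failed probes, one per `period`,
/// plus a margin for routing changes to propagate.
pub fn drain_delay(period: Duration, failure_threshold: u32, margin: Duration) -> Duration {
    period * failure_threshold + margin
}

/// Blocking illustration of the ordering: flip the flag, then wait.
/// In an async service, use the runtime's sleep instead of blocking a thread.
pub fn disable_then_wait(ready: &AtomicBool, delay: Duration) {
    ready.store(false, Ordering::Relaxed);
    std::thread::sleep(delay);
}
```

For example, a 5 second probe period with a failure threshold of 2 and a 3 second margin gives a 13 second wait. The process's termination grace period must be longer than this delay plus the time needed to drain in-flight requests.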
Common mistakes
- Database in liveness: causes restart loops when the database is down, making incidents worse.
- Slow readiness: long checks create probe backlogs and noisy deploys.
- No shutdown alignment: readiness stays OK during shutdown, causing new traffic to hit an instance that is exiting.
- Too much detail in responses: health endpoints are unauthenticated, so keep the response to a meaningful HTTP status and log the failure details internally.
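The last point, a meaningful status outside and details inside, might look like the following std-only sketch. `CheckResult`, `ReadinessReport`, and the log line format are hypothetical names chosen for the example. In the service above, the log lines would go through your logging setup instead of being returned as strings.

```rust
/// Outcome of one named check. `detail` is meant for logs only.
pub struct CheckResult {
    pub name: &'static str,
    pub ok: bool,
    pub detail: String,
}

pub struct ReadinessReport {
    results: Vec<CheckResult>,
}

impl ReadinessReport {
    pub fn new(results: Vec<CheckResult>) -> Self {
        Self { results }
    }

    /// 200 only if every check passed, otherwise 503.
    pub fn status_code(&self) -> u16 {
        if self.results.iter().all(|r| r.ok) { 200 } else { 503 }
    }

    /// Body that is safe to return to an unauthenticated prober.
    pub fn public_body(&self) -> &'static str {
        if self.status_code() == 200 { "ok" } else { "unavailable" }
    }

    /// One line per failed check, intended for internal logs.
    pub fn log_lines(&self) -> Vec<String> {
        self.results
            .iter()
            .filter(|r| !r.ok)
            .map(|r| format!("readiness check failed: check={} detail={}", r.name, r.detail))
            .collect()
    }
}
```

The prober sees only the status and a fixed body, while an operator reading the logs can tell which check failed and why.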
Operational verification
Verify behavior with three quick scenarios:
- Normal: /healthz returns 200, /readyz returns 200.
- DB down: /healthz returns 200, /readyz returns 503 quickly.
- Shutdown: readiness flips to 503 immediately when shutdown begins.
curl -i http://localhost:3000/healthz
curl -i http://localhost:3000/readyz
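If you would rather script these scenarios than run curl by hand, a small std-only probe client is enough to assert on status codes. The sketch below is a minimal illustration; `probe` and `parse_status_line` are hypothetical helpers, and the client speaks only enough HTTP/1.1 to read the status line.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Extracts the status code from an HTTP/1.x status line such as
/// "HTTP/1.1 503 Service Unavailable".
pub fn parse_status_line(line: &str) -> Option<u16> {
    let mut parts = line.split_whitespace();
    if !parts.next()?.starts_with("HTTP/") {
        return None;
    }
    parts.next()?.parse().ok()
}

/// Sends a bare GET for `path` to `addr` (for example "localhost:3000")
/// and returns the status code. Connect, read, and write are all bounded
/// by `limit`, which must be non-zero.
pub fn probe(addr: &str, path: &str, limit: Duration) -> io::Result<u16> {
    let socket = addr
        .to_socket_addrs()?
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "address did not resolve"))?;
    let mut stream = TcpStream::connect_timeout(&socket, limit)?;
    stream.set_read_timeout(Some(limit))?;
    stream.set_write_timeout(Some(limit))?;
    write!(stream, "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n", path, addr)?;

    // Only the first line of the response is needed for a probe.
    let mut status_line = String::new();
    BufReader::new(stream).read_line(&mut status_line)?;
    parse_status_line(&status_line)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed status line"))
}
```

A check for the "DB down" scenario could then call `probe("localhost:3000", "/healthz", Duration::from_secs(2))` and `probe("localhost:3000", "/readyz", Duration::from_secs(2))`, expecting 200 from the first and 503 from the second.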
What comes next
Next we will focus on structured logging conventions: consistent fields, stable message patterns, and how to keep logs useful without leaking sensitive data or generating noise.