Graceful Shutdown Basics
Graceful Shutdown Is a Production Contract
In production, your process will be killed. Regularly.
- Deployments
- Autoscaling
- Node drains
- OOM kills
- Spot/preemptible termination
If your service cannot shut down gracefully, you will:
- drop in-flight requests
- create partial writes
- break idempotency assumptions
- amplify outages during deploys
Graceful shutdown is not “nice to have”. It is required behavior.
Real Production Incident
Symptoms:
- During deploy, error rate spikes for 2–5 minutes.
- Clients retry aggressively, causing traffic surge.
- Database sees a burst of duplicate writes.
- Operators conclude “deployments are risky”.
Root cause:
- Pods were terminated while still serving traffic.
- Load balancer stopped routing too late, or app kept accepting requests.
- Background work was killed mid-flight.
- Requests were not cancel-aware, so shutdown exceeded the termination window and got SIGKILL.
This is not a Kubernetes problem. It is a shutdown design failure.
What “Graceful” Actually Means
A correct shutdown sequence looks like this:
1) Stop receiving new traffic (fail readiness / deregister)
2) Allow in-flight requests to complete (drain)
3) Cancel background work safely
4) Flush logs/metrics buffers if needed
5) Exit before termination timeout
Production rule:
If you cannot finish work safely within the termination window, you must design for interruption (idempotency, resumable jobs).
Symptom → Cause → Diagnosis → Fix
Symptom:
- Spike in 499/502/503 during deploys
- Increased retries and duplicates
- Requests abruptly cut off
Cause:
- App keeps accepting requests after termination begins
- No cancellation propagation
- Long-running requests without deadlines
- Background services ignoring CancellationToken
Diagnosis:
- Correlate deploy timestamp with error spikes.
- Inspect ingress/load balancer logs for client disconnects.
- Check pod termination events and termination grace period.
- Confirm readiness behavior during shutdown.
Fix:
- Implement readiness-driven draining.
- Ensure request handlers respect HttpContext.RequestAborted.
- Wire CancellationToken through all long operations.
- Make background jobs cancel-aware and resumable.
Anti-Pattern: Fire-and-Forget Work in Request Path
This is a reliability trap:
app.MapPost("/process", (RequestDto dto) =>
{
    // Anti-pattern: the task is detached from both the request and host shutdown.
    _ = Task.Run(() => DoWork(dto));
    return Results.Accepted();
});
What happens in production:
- Work can outlive the request scope.
- Work ignores shutdown cancellation.
- During deploy, tasks are killed mid-flight with no recovery.
- You get partial effects and inconsistent state.
If it matters, make it durable (queue/outbox) or finish it before responding.
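One way to picture the "make it durable" option is to record the job before responding and let a separate worker pick it up later. The sketch below is illustrative only: `IJobStore`, `InMemoryJobStore`, `ProcessHandler`, and the shape of `RequestDto` are hypothetical names, and the in-memory store is a stand-in that is not durable. A real implementation would write to a database table, outbox, or message broker.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical request shape; the article does not define RequestDto.
public sealed record RequestDto(string Id);

// Hypothetical abstraction over a durable store (outbox table, queue, broker).
public interface IJobStore
{
    Task EnqueueAsync(RequestDto job, CancellationToken ct);
    bool TryDequeue(out RequestDto? job);
}

// Stand-in for illustration only: jobs held here are lost when the process exits.
public sealed class InMemoryJobStore : IJobStore
{
    private readonly ConcurrentQueue<RequestDto> _jobs = new();

    public Task EnqueueAsync(RequestDto job, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        _jobs.Enqueue(job);
        return Task.CompletedTask;
    }

    public bool TryDequeue(out RequestDto? job) => _jobs.TryDequeue(out job);
}

public static class ProcessHandler
{
    // Completes only after the job has been recorded, so nothing is left
    // running in a detached Task.Run when the process is terminated.
    public static async Task AcceptAsync(
        RequestDto dto, IJobStore store, CancellationToken requestAborted)
    {
        await store.EnqueueAsync(dto, requestAborted);
    }
}
```

In a minimal API, the `/process` handler would await `ProcessHandler.AcceptAsync(dto, store, ctx.RequestAborted)` and only then return `Results.Accepted()`. A cancel-aware background worker would drain the store.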
Correct Pattern: Cancellation-Aware Request Handling
Use HttpContext.RequestAborted and propagate it. Minimal API example:
app.MapGet("/heavy", async (HttpContext ctx, SomeService svc) =>
{
    await svc.DoHeavyWorkAsync(ctx.RequestAborted);
    return Results.Ok();
});
Service code:
public sealed class SomeService
{
    public async Task DoHeavyWorkAsync(CancellationToken ct)
    {
        // Example: downstream call with cancellation
        await Task.Delay(TimeSpan.FromSeconds(2), ct);
    }
}
Production rule:
Every long operation must accept a CancellationToken.
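Passing the token along handles client disconnects, but a request can still run longer than the termination window if it has no deadline of its own. A minimal sketch of one way to add a deadline, assuming a hypothetical helper named `Deadlines.RunWithDeadlineAsync`, is to link a timeout to the incoming token:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Deadlines
{
    // Runs the work with a token that is cancelled when EITHER the caller's
    // token fires OR the deadline elapses. Returns false if the deadline was
    // hit; a cancellation that came from the caller is rethrown unchanged.
    public static async Task<bool> RunWithDeadlineAsync(
        Func<CancellationToken, Task> work,
        TimeSpan deadline,
        CancellationToken requestAborted)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(requestAborted);
        cts.CancelAfter(deadline);

        try
        {
            await work(cts.Token);
            return true;
        }
        catch (OperationCanceledException)
            when (cts.IsCancellationRequested && !requestAborted.IsCancellationRequested)
        {
            // Only the deadline fired; the caller is still connected.
            return false;
        }
    }
}
```

In the `/heavy` handler this could be called as `await Deadlines.RunWithDeadlineAsync(ct => svc.DoHeavyWorkAsync(ct), TimeSpan.FromSeconds(5), ctx.RequestAborted)`. The five-second value is illustrative; the right budget depends on your termination window.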
Stop Accepting Traffic Before Killing the Process
Graceful shutdown starts at the load balancer / readiness layer.
If you use Kubernetes:
- readiness probe controls routing
- terminationGracePeriodSeconds controls shutdown window
App responsibility:
- become NotReady quickly when shutdown begins
- stop accepting new work
A common approach is to fail readiness when ApplicationStopping triggers (conceptually). The exact wiring depends on your health check implementation, but the principle is fixed:
Readiness should flip to unhealthy during shutdown so the platform drains you.
Background Services Must Respect Cancellation
Anti-pattern: ignoring the stopping token.
public sealed class BadWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Anti-pattern: stoppingToken is never checked, so the loop cannot exit on shutdown.
        while (true)
        {
            await DoWorkAsync();
        }
    }
}
Correct pattern:
public sealed class GoodWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await DoWorkAsync(stoppingToken);
        }
    }

    private static async Task DoWorkAsync(CancellationToken ct)
    {
        await Task.Delay(TimeSpan.FromSeconds(1), ct);
    }
}
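Cancel-aware is half of the story; the other half is being able to resume. The sketch below shows one possible shape for a resumable batch, using a hypothetical `ResumableBatch` class whose checkpoint lives in memory purely for illustration. In a real worker the checkpoint would be persisted so that a new process can pick it up.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class ResumableBatch
{
    // Index of the next item to process. Held in memory here for
    // illustration; a real job would store this checkpoint durably.
    public int NextIndex { get; private set; }

    public async Task RunAsync(
        IReadOnlyList<string> items,
        Func<string, CancellationToken, Task> processItem,
        CancellationToken stoppingToken)
    {
        while (NextIndex < items.Count && !stoppingToken.IsCancellationRequested)
        {
            await processItem(items[NextIndex], stoppingToken);

            // Advance only after the item has completed. If processItem is
            // interrupted, the same item is retried on the next run, so the
            // processing itself must be idempotent.
            NextIndex++;
        }
    }
}
```

Inside `ExecuteAsync`, a worker could call `RunAsync` with its `stoppingToken`. After a restart, processing continues from the saved checkpoint and does not start over.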
Production rule:
Every loop must have a cancellation exit. Every wait must be cancelable.
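The same cancellation signal can drive the readiness flip described under "Stop Accepting Traffic Before Killing the Process". The following is a minimal sketch, not a definitive wiring: the check name `ShutdownReadinessCheck`, the `/health/ready` path, the `ready` tag, and the 25-second timeout are all assumptions chosen for illustration. Whether the probe endpoint is still reachable once shutdown begins depends on your server and platform configuration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Hosting;

var builder = WebApplication.CreateBuilder(args);

// Keep the host's own shutdown budget below the platform's termination
// window (25 seconds is an illustrative value, not a recommendation).
builder.Services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(25));

builder.Services.AddHealthChecks()
    .AddCheck<ShutdownReadinessCheck>("shutdown", tags: new[] { "ready" });

var app = builder.Build();

// Expose only the checks tagged "ready" on the readiness endpoint.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready")
});

app.Run();

// Reports Unhealthy as soon as the host signals that it is stopping.
public sealed class ShutdownReadinessCheck : IHealthCheck
{
    private readonly IHostApplicationLifetime _lifetime;

    public ShutdownReadinessCheck(IHostApplicationLifetime lifetime) => _lifetime = lifetime;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var result = _lifetime.ApplicationStopping.IsCancellationRequested
            ? HealthCheckResult.Unhealthy("Shutting down")
            : HealthCheckResult.Healthy();
        return Task.FromResult(result);
    }
}
```

The check reads the same `ApplicationStopping` token that stops background services, so readiness and background work react to one signal.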