HostedService Basics (The Safe Way)
Production incident
You deploy the web API with 6 replicas. A HostedService runs "nightly cleanup" and "sync partner data". Suddenly the partner rate-limits you because the job runs 6 times in parallel. During the next deploy, half the jobs are interrupted mid-flight and never resumed. Support sees stale data, and the only evidence is a few scattered log lines. The root cause: you treated HostedService like a cron runner and ignored process lifecycle, multi-instance behavior, and shutdown semantics.
Symptoms
- Duplicate executions after scaling out (job runs per replica).
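- Jobs silently stop after an unhandled exception (no restart, no alert). On .NET 6 and later the default is to stop the whole host instead, which shuts the application down rather than hiding the failure.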
- Deployments cause partial work and inconsistent state.
- CPU spikes and memory growth because background loops have no bounds.
Root causes
- Wrong mental model: IHostedService is tied to process lifetime, not a scheduler with persistence.
- Multi-instance ignorance: every replica runs the same hosted service unless you coordinate.
- No idempotency: duplicate execution causes duplicates and side effects.
- No shutdown handling: cancellation is ignored; tasks are killed mid-flight.
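The idempotency point can be illustrated with a small guard that lets a given run execute only once per key. This is a minimal sketch, not a production design: `IdempotentRunner`, `RunOnce`, and the run key format are hypothetical names, and the in-memory dictionary stands in for a shared durable store (for example a table with a unique key), since each replica has its own memory.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class IdempotentRunner
{
    // In-memory stand-in for a durable store with a unique-key constraint.
    // Assumption: across replicas this must be shared storage, not process memory.
    private readonly ConcurrentDictionary<string, byte> _claimed =
        new ConcurrentDictionary<string, byte>();

    // Runs the work only the first time a key is claimed.
    // Returns true if the work ran, false if the key was already claimed.
    public bool RunOnce(string runKey, Action work)
    {
        if (!_claimed.TryAdd(runKey, 0))
        {
            return false; // another caller already claimed this run
        }

        try
        {
            work();
            return true;
        }
        catch
        {
            // Release the claim so a later retry can run the work again.
            _claimed.TryRemove(runKey, out _);
            throw;
        }
    }
}
```

With a key such as "cleanup:2024-05-01" (a hypothetical format), a second call with the same key returns false and performs no side effects.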
Diagnosis
# Find hosted services and infinite loops
grep -R "AddHostedService" -n .
grep -R "BackgroundService" -n .
grep -R "while (true)" -n .
# Look for missing exception handling in ExecuteAsync
grep -R "ExecuteAsync" -n .
Also check runtime topology: is this running inside the web app across multiple pods/instances? If yes, assume duplicates unless you have a leader lock or partitioning strategy.
Anti-pattern
// Naive loop: no bounds, no coordination, no jitter, no exception strategy
public class CleanupService : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await DoCleanupAsync(); // no ct, no timeout
            await Task.Delay(TimeSpan.FromMinutes(5)); // ignores stoppingToken, delays shutdown
        }
    }
}
Correct pattern
Hosted services should be boring: bounded loops, cancellation-aware, exception-safe, observable, and designed for multi-instance execution.
Baseline implementation
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public sealed class CleanupService : BackgroundService
{
    private readonly ILogger<CleanupService> _log;

    public CleanupService(ILogger<CleanupService> log) => _log = log;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Jitter to avoid synchronized thundering herd after deploy
        // (Random.Shared requires .NET 6 or later)
        await Task.Delay(TimeSpan.FromSeconds(Random.Shared.Next(0, 10)), stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                using var budget = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken);
                budget.CancelAfter(TimeSpan.FromMinutes(2)); // hard stop per iteration
                await DoCleanupAsync(budget.Token);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                // Normal shutdown
                break;
            }
            catch (Exception ex)
            {
                // Also reached when the per-iteration budget expires (a timeout, not a shutdown)
                _log.LogError(ex, "CleanupService failed");
                // Backoff to avoid crash loops and hot spinning
                await Task.Delay(TimeSpan.FromSeconds(10), stoppingToken);
            }

            // Cancellation-aware wait: throws on shutdown, which ends ExecuteAsync
            await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken);
        }
    }

    private Task DoCleanupAsync(CancellationToken ct)
    {
        // Must be idempotent and safe to run concurrently if multi-instance
        return Task.CompletedTask;
    }
}
Multi-instance safety options
- Idempotent design: job can run multiple times without harm.
- Leader election / distributed lock: only one instance runs the job (DB lock, Redis lock). Must handle lock loss.
- Partitioning: each instance processes a shard (tenant range, hash partition) to avoid duplicates.
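The partitioning option can be sketched as a deterministic ownership check: each instance hashes a key (a tenant ID, say) and processes it only if the hash maps to its own index. This is one possible approach under stated assumptions: `ShardAssignment` and its members are hypothetical names, and the instance index and count are assumed to come from the deployment (for example a stable replica ordinal), which the original text does not specify. FNV-1a is used here only because it is stable across processes; `string.GetHashCode()` is randomized per process in modern .NET and would make replicas disagree.

```csharp
using System;
using System.Text;

public static class ShardAssignment
{
    // 32-bit FNV-1a over the UTF-8 bytes of the key.
    // Stable across processes and machines, unlike string.GetHashCode().
    public static uint StableHash(string key)
    {
        unchecked
        {
            uint hash = 2166136261u; // FNV offset basis
            foreach (byte b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619u; // FNV prime, wraps on overflow
            }
            return hash;
        }
    }

    // True when the instance with the given zero-based index owns the key.
    public static bool IsOwnedBy(string key, int instanceIndex, int instanceCount)
    {
        if (instanceCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(instanceCount));
        if (instanceIndex < 0 || instanceIndex >= instanceCount)
            throw new ArgumentOutOfRangeException(nameof(instanceIndex));

        return StableHash(key) % (uint)instanceCount == (uint)instanceIndex;
    }
}
```

Every key is owned by exactly one index for a given instance count. Note that ownership shifts when the count changes during scale-out or a rolling deploy, so the work itself should still be idempotent.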
Security and performance impact
- Performance: uncontrolled background work competes with request handling and causes latency spikes.
- Security: jobs often process sensitive data; missing auth boundaries and sloppy logging can leak secrets. Also, duplicate execution can violate business invariants.
Operational notes
- Monitoring: heartbeat metric, last-success timestamp, iteration duration, exception count, backlog size (if any).
- Rollout: deploy with jitter and canary. Verify that only intended instances run the job.
- Rollback: keep a kill switch config to disable the hosted service without redeploy.
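For the monitoring note, a last-success timestamp and an exception count can be tracked in a small thread-safe object that the job updates and a health endpoint or metrics exporter reads. This is a rough sketch: `JobHealth` and its members are hypothetical names, and wiring it into a specific metrics or health-check library is left out on purpose.

```csharp
using System;
using System.Threading;

public sealed class JobHealth
{
    private long _lastSuccessUtcTicks; // 0 means "never succeeded"
    private long _failureCount;

    public long FailureCount => Interlocked.Read(ref _failureCount);

    public DateTimeOffset? LastSuccessUtc
    {
        get
        {
            long ticks = Interlocked.Read(ref _lastSuccessUtcTicks);
            return ticks == 0
                ? (DateTimeOffset?)null
                : new DateTimeOffset(ticks, TimeSpan.Zero);
        }
    }

    // Call after an iteration completes successfully.
    public void RecordSuccess(DateTimeOffset now) =>
        Interlocked.Exchange(ref _lastSuccessUtcTicks, now.UtcTicks);

    // Call from the job's exception handler.
    public void RecordFailure() => Interlocked.Increment(ref _failureCount);

    // Stale when the job never succeeded or the last success is older than maxAge.
    public bool IsStale(DateTimeOffset now, TimeSpan maxAge)
    {
        DateTimeOffset? last = LastSuccessUtc;
        return last == null || now - last.Value > maxAge;
    }
}
```

An alert on `IsStale` with a threshold a few times larger than the loop interval catches both a stopped loop and a loop that keeps failing, which exception counts alone can miss.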
Checklist
- Hosted service loops are bounded and cancellation-aware.
- Exceptions are caught and surfaced via metrics/alerts.
- Per-iteration deadlines exist (CancelAfter).
- Multi-instance behavior is explicitly handled (idempotent, lock, or partition).
- Kill switch exists to disable background work fast.