HostedService Basics (The Safe Way)

IHostedService is not a free background thread. If you run background work inside your web app without lifecycle discipline, you will lose jobs on deploy, run duplicates across replicas, and ship silent failures. This is the production-safe baseline.

Production incident

You deploy the web API with 6 replicas. A HostedService runs "nightly cleanup" and "sync partner data". Suddenly the partner rate limits you because the job runs 6 times in parallel. During the next deploy, half the jobs are interrupted mid-flight and never resumed. Support sees stale data, and the only evidence is a few scattered logs. The root cause: you treated HostedService like a cron runner and ignored process lifecycle, multi-instance behavior, and shutdown semantics.

Symptoms

Duplicate executions after scaling out (job runs per replica).
Jobs silently stop after an exception (no restart, no alert).
Deployments cause partial work and inconsistent state.
CPU spikes and memory growth because background loops have no bounds.

Root causes

Wrong mental model: IHostedService is tied to the process lifetime; it is not a scheduler with persistence.
Multi-instance ignorance: every replica runs the same hosted service unless you coordinate.
No idempotency: a duplicate execution produces duplicate records and repeated side effects.
No shutdown handling: cancellation is ignored; tasks are killed mid-flight.

Diagnosis

# Find hosted services and infinite loops
grep -R "AddHostedService" -n .
grep -R "BackgroundService" -n .
grep -R "while (true)" -n .

# Look for missing exception handling in ExecuteAsync
grep -R "ExecuteAsync" -n .

Also check runtime topology: is this running inside the web app across multiple pods/instances? If yes, assume duplicates unless you have a leader lock or partitioning strategy.

Anti-pattern

// Naive loop: no bounds, no coordination, no jitter, no exception strategy
public class CleanupService : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await DoCleanupAsync(); // no ct, no timeout: a hung call blocks the loop indefinitely
            await Task.Delay(TimeSpan.FromMinutes(5)); // no ct: shutdown cannot interrupt the wait
        }
    }
}

Correct pattern

Hosted services should be boring: bounded loops, cancellation-aware, exception-safe, observable, and designed for multi-instance execution.

Baseline implementation

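using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Random.Shared (used for jitter below) requires .NET 6 or later.
public sealed class CleanupService : BackgroundService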
{
    private readonly ILogger<CleanupService> _log;

    public CleanupService(ILogger<CleanupService> log) => _log = log;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Jitter to avoid synchronized thundering herd after deploy
        await Task.Delay(TimeSpan.FromSeconds(Random.Shared.Next(0, 10)), stoppingToken);

        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                using var budget = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken);
                budget.CancelAfter(TimeSpan.FromMinutes(2)); // hard stop per iteration

                await DoCleanupAsync(budget.Token);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                // Normal shutdown
                break;
            }
            catch (Exception ex)
            {
                _log.LogError(ex, "CleanupService failed");
                // Backoff to avoid crash loops and hot spinning
                await Task.Delay(TimeSpan.FromSeconds(10), stoppingToken);
            }

            await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken);
        }
    }

    private Task DoCleanupAsync(CancellationToken ct)
    {
        // Must be idempotent and safe to run concurrently if multi-instance
        return Task.CompletedTask;
    }
}

Multi-instance safety options

Idempotent design: job can run multiple times without harm.
Leader election / distributed lock: only one instance runs the job (DB lock, Redis lock). Must handle lock loss.
Partitioning: each instance processes a shard (tenant range, hash partition) to avoid duplicates.
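
The leader-lock option can be sketched as a lease with an expiry, which is one way to make lock loss explicit. `ILeaseStore` and `InMemoryLeaseStore` are hypothetical names introduced for illustration. The in-memory store only stands in for a shared database row or Redis key and would not coordinate anything across real replicas.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction: a real implementation would sit on shared storage
// (a database row or a Redis key) so that all replicas see the same lease.
public interface ILeaseStore
{
    // Returns true if `owner` holds the lease for `job` until now + ttl.
    bool TryAcquireOrRenew(string job, string owner, TimeSpan ttl, DateTimeOffset now);
}

// In-memory stand-in, used only to illustrate the semantics inside one process.
public sealed class InMemoryLeaseStore : ILeaseStore
{
    private readonly object _gate = new();
    private readonly Dictionary<string, (string Owner, DateTimeOffset ExpiresAt)> _leases = new();

    public bool TryAcquireOrRenew(string job, string owner, TimeSpan ttl, DateTimeOffset now)
    {
        lock (_gate)
        {
            // Someone else holds an unexpired lease: this instance must not run the job.
            if (_leases.TryGetValue(job, out var lease)
                && lease.Owner != owner
                && lease.ExpiresAt > now)
            {
                return false;
            }

            // Free, expired, or already ours: take it or extend it.
            _leases[job] = (owner, now + ttl);
            return true;
        }
    }
}
```

Each loop iteration would call `TryAcquireOrRenew` before doing work and skip the iteration when it returns false. An instance that stalls past the TTL loses the lease to another replica, so the TTL should probably be longer than the per-iteration deadline, and the job should still be idempotent because a paused holder can briefly overlap with the new leader.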

Security and performance impact

Performance: uncontrolled background work competes with request handling and causes latency spikes.
Security: jobs often process sensitive data; missing auth boundaries and sloppy logging can leak secrets. Also, duplicate execution can violate business invariants.
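
One way to keep background work from competing with request handling is to cap its parallelism explicitly. The sketch below uses `Parallel.ForEachAsync` (available from .NET 6) with a small `MaxDegreeOfParallelism`. `BoundedWork` and `ProcessAllAsync` are hypothetical names, and the right limit depends on the workload; nothing here is a measured recommendation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedWork
{
    // Processes every item, but never more than maxParallel at the same time.
    public static Task ProcessAllAsync<T>(
        IEnumerable<T> items,
        int maxParallel,
        Func<T, CancellationToken, ValueTask> handler,
        CancellationToken ct)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxParallel, // upper bound on concurrent handlers
            CancellationToken = ct                // stops scheduling new items on shutdown
        };

        return Parallel.ForEachAsync(items, options, handler);
    }
}
```

Inside a hosted service, `ct` would be the per-iteration budget token, so both the deadline and host shutdown stop the batch.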

Operational notes

Monitoring: heartbeat metric, last-success timestamp, iteration duration, exception count, backlog size (if any).
Rollout: deploy with jitter and canary. Verify that only intended instances run the job.
Rollback: keep a kill switch config to disable the hosted service without redeploy.
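
The monitoring and kill-switch notes above can be sketched as a small wrapper around one iteration. `JobHeartbeat` and `GuardedIteration` are hypothetical helpers introduced for illustration; in a real service the recorded values would be exported through whatever metrics library you already use.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: tracks the signals an alert would be built on.
public sealed class JobHeartbeat
{
    private long _failures;
    private long _lastSuccessTicks; // 0 means "never succeeded"

    public long Failures => Interlocked.Read(ref _failures);

    public void RecordSuccess(DateTimeOffset now) =>
        Interlocked.Exchange(ref _lastSuccessTicks, now.UtcTicks);

    public void RecordFailure() => Interlocked.Increment(ref _failures);

    // True when the job has never succeeded or last succeeded more than maxAge ago.
    public bool IsStale(DateTimeOffset now, TimeSpan maxAge)
    {
        var last = Interlocked.Read(ref _lastSuccessTicks);
        return last == 0 || now.UtcTicks - last > maxAge.Ticks;
    }
}

public static class GuardedIteration
{
    // Runs one iteration unless the kill switch is off. Returns true only on success.
    public static async Task<bool> RunAsync(
        Func<bool> isEnabled,
        JobHeartbeat heartbeat,
        Func<CancellationToken, Task> work,
        CancellationToken stoppingToken)
    {
        if (!isEnabled()) return false; // kill switch: skip without touching the heartbeat

        try
        {
            await work(stoppingToken);
            heartbeat.RecordSuccess(DateTimeOffset.UtcNow);
            return true;
        }
        catch (Exception) when (!stoppingToken.IsCancellationRequested)
        {
            // Failures and per-iteration timeouts are counted; host shutdown is not.
            heartbeat.RecordFailure();
            return false;
        }
    }
}
```

The `isEnabled` delegate could read a flag from `IOptionsMonitor<T>.CurrentValue`, so that a configuration change disables the job without a redeploy, and `IsStale` could back a health check or alert. Because a disabled job stops recording successes, a staleness alert should take the kill switch into account.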

Checklist

Hosted service loops are bounded and cancellation-aware.
Exceptions are caught and surfaced via metrics/alerts.
Per-iteration deadlines exist (CancelAfter).
Multi-instance behavior is explicitly handled (idempotent, lock, or partition).
Kill switch exists to disable background work fast.
