Background Jobs & Scheduled Work
A lot of work should not happen inside a web request: sending email, generating reports, calling slow third parties, periodic re-screening. That work runs in background jobs and scheduled tasks, which bring their own rules. They can run twice, overlap, fail silently, and run on several instances at once. Make them idempotent, observable, and safe to retry.
Moving slow or bursty work off the request path keeps the app responsive (see Performance & Resource Use). But background work is easy to get subtly wrong. A scheduled job may fire on every instance at once. A retried job may do its work twice. A job that fails quietly leaves things half-done with nobody watching.
The same safety ideas as messaging apply: idempotency, bounded retries, and visibility. And because background jobs often touch regulated work (re-screening, report generation), failing safely and leaving an audit trail matters just as much here.
Make jobs safe to run
- Always: Make jobs idempotent and safe to retry. Running the same job twice must not double-charge, double-send, or double-apply (see Data Integrity & Transactions).
- Do: Guard scheduled jobs against overlap and against running on every instance at once (a distributed lock or a single runner), so two copies do not collide.
- Do: Process in bounded batches with checkpoints, so a large job can resume rather than restart, and does not exhaust memory or overload the database.
- Do: Use the platform's durable job or queue mechanism rather than fire-and-forget threads that vanish on restart or deploy.
- Never: Run a non-idempotent money- or state-changing job without a guard against duplicate or overlapping runs.
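One way to make a job safe to retry is to claim an idempotency key for each unit of work and skip any unit whose key is already claimed. The sketch below is a minimal illustration, not a prescribed implementation: `IdempotentRunner`, `RunOnce`, and the key format are hypothetical names, and the in-memory dictionary stands in for a durable store (ideally written in the same transaction as the effect it protects).

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper: runs an action at most once per idempotency key.
// The in-memory dictionary is an assumption standing in for a durable table
// with a unique constraint on the key.
public sealed class IdempotentRunner
{
    private readonly ConcurrentDictionary<string, bool> _claimed = new();

    // Returns true if the action ran, false if the key was already claimed.
    public bool RunOnce(string key, Action action)
    {
        // TryAdd is atomic, so only one caller can claim a given key.
        if (!_claimed.TryAdd(key, true)) return false;
        try
        {
            action();
            return true;
        }
        catch
        {
            // Release the claim so a later retry can attempt the work again.
            _claimed.TryRemove(key, out _);
            throw;
        }
    }
}

public static class IdempotencyDemo
{
    // Simulates the same job being delivered twice; returns how many sends happened.
    public static int SendReminderTwice()
    {
        var runner = new IdempotentRunner();
        var sent = 0;
        runner.RunOnce("reminder:customer-42:2024-06", () => sent++);
        runner.RunOnce("reminder:customer-42:2024-06", () => sent++); // duplicate run: skipped
        return sent;
    }
}
```

The key should identify the business effect (who, what, which period) rather than the job run, so that a retried or duplicated run maps to the same key.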
Make jobs observable and correct
- Do: Make jobs observable: log start, finish, and outcome, emit metrics, and alert when a job fails or simply does not run when it should. A silent missed run is the dangerous one.
- Do: Fail safely and record it. For regulated work (re-screening, reporting), a job failure must surface and leave an audit trail, never silently skip (see Designing for Failure, Audit Trails).
- Do: Set timeouts and resource limits so a stuck job is detected and bounded, not left hanging forever (see Cost & Scale Planning).
- Consider: Carrying a correlation id through job runs so their effects can be traced like any other flow.
- Avoid: Doing heavy or slow work inside a web request when it belongs in a background job. It ties up request threads and times out users.
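The logging, timeout, and correlation id points above can be combined in a small wrapper around the job body. This is a sketch under stated assumptions: `ObservedJob`, `JobOutcome`, and the `log` callback are hypothetical names, the log sink would be the platform's logger and metrics in practice, and the timeout is cooperative, so it only bounds jobs that honor the `CancellationToken`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum JobOutcome { Succeeded, TimedOut, Failed }

public static class ObservedJob
{
    // Runs a job under a time limit and logs start, finish, and outcome.
    // The outcome is returned so the caller can alert or write an audit record.
    public static async Task<JobOutcome> RunAsync(
        string name, string correlationId, TimeSpan timeout,
        Func<CancellationToken, Task> job, Action<string> log)
    {
        // The token is cancelled automatically once the timeout elapses.
        using var cts = new CancellationTokenSource(timeout);
        log($"{name} start correlationId={correlationId}");
        JobOutcome outcome;
        try
        {
            await job(cts.Token);
            outcome = JobOutcome.Succeeded;
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            outcome = JobOutcome.TimedOut;
        }
        catch (Exception ex)
        {
            outcome = JobOutcome.Failed;
            log($"{name} error correlationId={correlationId} {ex.GetType().Name}: {ex.Message}");
        }
        log($"{name} finish correlationId={correlationId} outcome={outcome}");
        return outcome;
    }
}
```

A caller might invoke it as `await ObservedJob.RunAsync("rescreen", correlationId, TimeSpan.FromMinutes(30), ct => RescreenAsync(ct), Console.WriteLine)`, where `RescreenAsync` is a placeholder for the real job, and raise an alert on any outcome other than `Succeeded`.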
// cron fires hourly on all 3 instances; no lock; no alerting
foreach (var c in DueForRescreen()) Rescreen(c);
Three instances run the same re-screening at once, causing duplicate work and possibly duplicate effects. If the job crashes, nobody is told, so customers silently go un-rescreened. That is an anti-money-laundering (AML) compliance gap.
using var lease = await locks.AcquireAsync("rescreen", ttl); // only one instance gets the lease
if (lease is null) return; // another instance is already running this job
var count = 0;
foreach (var batch in DueForRescreen().Chunk(500)) { Rescreen(batch); Checkpoint(); count += batch.Length; }
metrics.RecordRun("rescreen", count); // + alert if it didn't run
Only one instance runs it, work is batched and checkpointed, and a missed or failed run is visible.
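The "batched and checkpointed" part can be sketched in more detail. The example below is illustrative only: `InMemoryCheckpoint` and `CheckpointedJob` are hypothetical names, ids are assumed to arrive in ascending order, and the in-memory checkpoint stands in for a persisted one (for example a database row).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical checkpoint store; a real one would persist its value durably.
public sealed class InMemoryCheckpoint
{
    public int LastProcessedId { get; private set; }
    public void Save(int id) => LastProcessedId = id;
}

public static class CheckpointedJob
{
    // Processes ids above the checkpoint in bounded batches and saves progress
    // after each batch. Returns how many items this run processed.
    public static int Run(IEnumerable<int> orderedIds, InMemoryCheckpoint checkpoint,
                          int batchSize, Action<int[]> processBatch)
    {
        var processed = 0;
        var resumeAfter = checkpoint.LastProcessedId; // snapshot where the last run stopped
        foreach (var batch in orderedIds.Where(id => id > resumeAfter).Chunk(batchSize))
        {
            processBatch(batch);        // if this throws, the checkpoint stays at the previous batch
            checkpoint.Save(batch[^1]); // record progress only after the batch succeeded
            processed += batch.Length;
        }
        return processed;
    }
}
```

If a run crashes mid-batch, the next run repeats that whole batch, so the per-batch work still needs to be idempotent.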
Self-review checklist
- Ask: If this job runs twice or overlaps with itself, is the result still correct?
- Ask: On multiple instances, could it run many times at once when it should run once?
- Ask: If it fails or never runs, will anyone find out?
- Ask: For regulated work, does a failure surface and leave a trail rather than silently skip?
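The "never runs" question is the hardest to answer from inside the job, because a job that does not start cannot report anything. One common approach is a separate check that compares the last recorded success against the expected schedule. The sketch below assumes the last success time is recorded somewhere the check can read it; `MissedRunCheck` and its parameters are hypothetical names.

```csharp
using System;

public static class MissedRunCheck
{
    // Returns true when the job has not succeeded within its expected interval
    // plus a grace period. Intended to run from a monitor, not from the job itself.
    public static bool IsOverdue(DateTimeOffset? lastSuccess, DateTimeOffset now,
                                 TimeSpan expectedInterval, TimeSpan grace)
    {
        if (lastSuccess is null) return true; // never succeeded: treat as overdue
        return now - lastSuccess.Value > expectedInterval + grace;
    }
}
```

For an hourly job with a 15 minute grace period, a last success 80 minutes ago would be reported as overdue, while one 70 minutes ago would not.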