ADR-0006: Retry, backoff, and dead-lettering on the jobs table
Context
Handlers fail. A job system that retries naively melts a struggling downstream; one that
retries forever hides poison messages; one that drops on first failure isn't a job system.
pulse needs bounded retries with backoff and a terminal parking state — and, per ADR-0002,
it all has to live on the jobs table with no extra machinery: no retry queue, no
scheduler process for retries, no per-attempt bookkeeping tables.
The relevant columns are already on the row (db/migrations/0001_CreateJobs.go):
attempts, max_attempts (default 3, domain.MaxAttempts), last_error, and
next_run_at — the retry backoff gate.
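As a reading aid, those four columns can be pictured as fields on a Go struct. This is a sketch only: the type name JobRetryState, the field types, and the newJobRetryState helper are assumptions made for illustration, not the definitions in the migration or the domain package. Only the column names and the default of 3 come from this ADR.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxAttempts stands in for domain.MaxAttempts (3, per this ADR).
const defaultMaxAttempts = 3

// JobRetryState is a hypothetical view of the retry-related columns on a
// jobs row. Field types are guesses; only the column names come from the ADR.
type JobRetryState struct {
	Attempts    int       // attempts: runs started so far
	MaxAttempts int       // max_attempts: cap before dead-lettering
	LastError   string    // last_error: most recent failure reason
	NextRunAt   time.Time // next_run_at: earliest time the row may be claimed
}

// newJobRetryState returns the retry state of a freshly submitted job that
// is eligible to run as of now.
func newJobRetryState(now time.Time) JobRetryState {
	return JobRetryState{MaxAttempts: defaultMaxAttempts, NextRunAt: now}
}

func main() {
	s := newJobRetryState(time.Now())
	fmt.Printf("%+v\n", s)
}
```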
Decision
- Attempts are counted at claim, not at failure.
  ClaimBatch (internal/repos/postgres/jobs.go) increments attempts in the same UPDATE that marks the row RUNNING. So attempts counts started runs, which means a worker that dies without ever reporting still consumed an attempt: a crash-looping poison job cannot retry forever just because it never reports failure.
- Fail is one guarded UPDATE that branches in SQL.
  Fail() is a single statement, guarded by WHERE status = 'RUNNING' (zero rows → domain.ErrInvalidTransition):
  - attempts < max_attempts → status = 'RETRYING', next_run_at = now() + attempts² seconds (1s, 4s, 9s, …), lease and locked_by cleared, last_error recorded;
  - attempts >= max_attempts → status = 'DEAD_LETTERED', completed_at stamped, last_error kept as the post-mortem.
  One statement means the decision is atomic: no read-decide-write race between two reporters of the same job.
- The claim predicate IS the retry scheduler.
  ClaimBatch selects status IN ('PENDING','RETRYING') AND next_run_at <= now(): a RETRYING row is just a pending row whose eligibility is in the future. No retry queue, no timer wheel, no background promoter: backoff falls out of a timestamp comparison the dispatch tick was already doing, and the partial idx_jobs_claim index already covers both statuses.
- Crash recovery reuses the same branch.
  ReapExpired, the watchdog sweep (ADR-0007), is the same CASE expression applied set-based: WHERE status = 'RUNNING' AND lease_until < now(), last_error = 'worker lease expired'. A dead worker's job re-enters the identical retry path as a reported failure; max-attempts, backoff, and dead-lettering apply uniformly whether a job failed loudly or silently.
- DEAD_LETTERED is terminal parking, not deletion.
  The row stays queryable (GetJob/ListJobs) with its last_error; nothing dispatches it because the claim predicate never matches its status.
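The branch described above can be sketched in Go to make the state transitions concrete. Everything here is illustrative: failSQL is a guess at the shape of the guarded UPDATE (the id column, the $1/$2 parameters, and clearing the lease in both branches are assumptions), and decideFail and claimable are hypothetical pure-Go mirrors of the CASE expression and the claim predicate, not functions from internal/repos/postgres/jobs.go.

```go
package main

import (
	"fmt"
	"time"
)

// Status values named in this ADR.
const (
	Pending      = "PENDING"
	Retrying     = "RETRYING"
	DeadLettered = "DEAD_LETTERED"
)

// failSQL is an assumed shape for the single guarded statement. Column
// expressions in SET see the pre-update row, so attempts here is the value
// already incremented at claim time.
const failSQL = `
UPDATE jobs SET
  status       = CASE WHEN attempts < max_attempts THEN 'RETRYING' ELSE 'DEAD_LETTERED' END,
  next_run_at  = CASE WHEN attempts < max_attempts
                      THEN now() + (attempts * attempts) * interval '1 second'
                      ELSE next_run_at END,
  completed_at = CASE WHEN attempts >= max_attempts THEN now() ELSE completed_at END,
  locked_by    = NULL,
  lease_until  = NULL,
  last_error   = $2
WHERE id = $1 AND status = 'RUNNING'`

// Outcome is what a failure (reported or reaped) writes back to the row.
type Outcome struct {
	Status string
	Delay  time.Duration // added to now() for next_run_at; zero when dead-lettered
}

// decideFail mirrors the CASE branch. attempts counts started runs,
// including the one that just failed.
func decideFail(attempts, maxAttempts int) Outcome {
	if attempts < maxAttempts {
		return Outcome{Status: Retrying, Delay: time.Duration(attempts*attempts) * time.Second}
	}
	return Outcome{Status: DeadLettered}
}

// claimable mirrors the claim predicate:
// status IN ('PENDING','RETRYING') AND next_run_at <= now().
func claimable(status string, nextRunAt, now time.Time) bool {
	return (status == Pending || status == Retrying) && !nextRunAt.After(now)
}

func main() {
	now := time.Now()
	for attempts := 1; attempts <= 3; attempts++ {
		o := decideFail(attempts, 3)
		next := now.Add(o.Delay)
		fmt.Printf("attempts=%d -> %s (+%s), claimable right now: %v\n",
			attempts, o.Status, o.Delay, claimable(o.Status, next, now))
	}
}
```

Because the reaper applies the same branch, a lease expiry would go through decideFail in this sketch exactly as a reported failure does; only the last_error text differs.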
Alternatives considered
A separate retry-queue table
Move failed jobs to retries and promote them back when due. Adds a two-table state
machine, a promoter loop, and a window where a job exists in both or neither — all to
represent what one timestamp column already represents. Rejected: next_run_at on the row
is the degenerate (and sufficient) retry queue.
Exponential backoff with jitter
The textbook answer for thundering herds. Deferred: with max_attempts = 3 the schedule is
two delays (1s, then 4s; the third failure dead-letters), so jitter has nothing to smooth at this scale, and attempts²
keeps the policy readable in the SQL itself. The CASE expression is the single place to
change if per-topic policy or jitter ever earns its way in.
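To make the scale concrete, the sketch below lists the delays that actually get scheduled under the rule from the Decision section (attempts is incremented at claim, and a failure retries only while attempts < max_attempts), next to a base-2 exponential curve for comparison. The function names and the choice of 2^(n-1) as the exponential are illustrative assumptions, not code from pulse.

```go
package main

import "fmt"

// retryDelays lists the backoff, in seconds, scheduled after each failed
// attempt that still has retries left: attempts 1 .. maxAttempts-1.
func retryDelays(maxAttempts int, delay func(attempt int) int) []int {
	var out []int
	for a := 1; a < maxAttempts; a++ {
		out = append(out, delay(a))
	}
	return out
}

// quadratic is the attempts² policy chosen in this ADR.
func quadratic(attempt int) int { return attempt * attempt }

// exponential is a textbook base-2 curve, shown only for comparison.
func exponential(attempt int) int { return 1 << (attempt - 1) }

func main() {
	for _, m := range []int{3, 6} {
		fmt.Println("max_attempts =", m)
		fmt.Println("  attempts^2:", retryDelays(m, quadratic))
		fmt.Println("  2^(n-1):   ", retryDelays(m, exponential))
	}
}
```

Under these assumptions the default cap of 3 yields delays of 1s and 4s for attempts², versus 1s and 2s for the exponential curve; the 9s step only appears once max_attempts is raised to 4 or more.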
TTL-based expiry (jobs just go stale)
No terminal state; old failures age out. Rejected: a dead-lettered row with last_error is
the operator's evidence. Silent expiry converts "this handler has a bug" into "where did my
job go".
Drop on exhaust
Simplest possible. Rejected for the same reason: the entire value of a max-attempts cap is what you keep when you hit it.
Consequences
- Zero extra machinery. Retry scheduling costs no new table, loop, or index: the claim's existing next_run_at <= now() comparison does all of it.
- One retry path. Reported failures and reaped leases converge on the same CASE branch, so there is exactly one place where "retry or dead-letter" is decided.
- Only last_error survives: each failure overwrites the previous reason. No per-attempt history is the accepted trade of the single-row design; if attempt-level forensics are ever needed, that is a separate append-only concern, not columns on this row.
- Policy is hardcoded (attempts², max_attempts default 3 with a per-job column override at submit). Per-topic retry config is deferred until a real topic needs it.
- No replay yet. Un-dead-lettering (reset to PENDING, clear attempts) is an admin feature for later; today it's a manual UPDATE.