pulse
Design decisions (ADRs)

ADR-0006: Retry, backoff, and dead-lettering on the jobs table

Context

Handlers fail. A job system that retries naively melts a struggling downstream; one that retries forever hides poison messages; one that drops on first failure isn't a job system. pulse needs bounded retries with backoff and a terminal parking state — and, per ADR-0002, it all has to live on the jobs table with no extra machinery: no retry queue, no scheduler process for retries, no per-attempt bookkeeping tables.

The relevant columns are already on the row (db/migrations/0001_CreateJobs.go): attempts, max_attempts (default 3, domain.MaxAttempts), last_error, and next_run_at — the retry backoff gate.

Decision

  1. Attempts are counted at claim, not at failure. ClaimBatch (internal/repos/postgres/jobs.go) increments attempts in the same UPDATE that marks the row RUNNING. So attempts counts started runs, which means a worker that dies without ever reporting still consumed an attempt — a crash-looping poison job cannot retry forever just because it never reports failure.
  2. Fail is one guarded UPDATE that branches in SQL. Fail() is a single statement, guarded by WHERE status = 'RUNNING' (zero rows → domain.ErrInvalidTransition):
    • attempts < max_attempts → status = 'RETRYING', next_run_at = now() + attempts² seconds (1s, 4s, 9s, …), lease and locked_by cleared, last_error recorded;
    • attempts >= max_attempts → status = 'DEAD_LETTERED', completed_at stamped, last_error kept as the post-mortem.
     One statement means the decision is atomic: there is no read-decide-write race between two reporters of the same job.
  3. The claim predicate IS the retry scheduler. ClaimBatch selects status IN ('PENDING','RETRYING') AND next_run_at <= now() — a RETRYING row is just a pending row whose eligibility is in the future. No retry queue, no timer wheel, no background promoter: backoff falls out of a timestamp comparison the dispatch tick was already doing, and the partial idx_jobs_claim index already covers both statuses.
  4. Crash recovery reuses the same branch. ReapExpired — the watchdog sweep (ADR-0007) — is the same CASE expression applied set-based: WHERE status = 'RUNNING' AND lease_until < now(), last_error = 'worker lease expired'. A dead worker's job re-enters the identical retry path as a reported failure; max-attempts, backoff, and dead-lettering apply uniformly whether a job failed loudly or silently.
  5. DEAD_LETTERED is terminal parking, not deletion. The row stays queryable (GetJob/ListJobs) with its last_error; nothing dispatches it because the claim predicate never matches its status.
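
The five rules above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the code in internal/repos/postgres/jobs.go: the Job struct, the function names, and the 30-second lease are hypothetical, and the real logic lives in SQL statements rather than Go branches. completed_at and locked_by are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrInvalidTransition stands in for domain.ErrInvalidTransition.
var ErrInvalidTransition = errors.New("invalid transition")

// Job is a hypothetical in-memory stand-in for one row of the jobs table.
type Job struct {
	Status      string
	Attempts    int
	MaxAttempts int
	LastError   string
	NextRunAt   time.Time
	LeaseUntil  time.Time
}

// claim mirrors the claim predicate: status is PENDING or RETRYING and
// next_run_at <= now. Attempts is incremented here, at claim, not at failure.
func claim(j *Job, now time.Time, lease time.Duration) bool {
	eligible := j.Status == "PENDING" || j.Status == "RETRYING"
	if !eligible || j.NextRunAt.After(now) {
		return false
	}
	j.Status = "RUNNING"
	j.Attempts++
	j.LeaseUntil = now.Add(lease)
	return true
}

// retryOrDeadLetter is the single branch shared by fail and reap.
func retryOrDeadLetter(j *Job, now time.Time, reason string) {
	j.LastError = reason
	j.LeaseUntil = time.Time{}
	if j.Attempts < j.MaxAttempts {
		j.Status = "RETRYING"
		j.NextRunAt = now.Add(time.Duration(j.Attempts*j.Attempts) * time.Second)
		return
	}
	j.Status = "DEAD_LETTERED"
}

// fail is guarded on RUNNING, like the WHERE clause of the real statement.
func fail(j *Job, now time.Time, reason string) error {
	if j.Status != "RUNNING" {
		return ErrInvalidTransition
	}
	retryOrDeadLetter(j, now, reason)
	return nil
}

// reap applies the same branch to a RUNNING job whose lease has expired.
func reap(j *Job, now time.Time) bool {
	if j.Status != "RUNNING" || !j.LeaseUntil.Before(now) {
		return false
	}
	retryOrDeadLetter(j, now, "worker lease expired")
	return true
}

// simulate drives one job to its terminal state and returns a status trace.
// Attempts 1 and 3 fail loudly; attempt 2 dies silently and is reaped.
func simulate() []string {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	j := &Job{Status: "PENDING", MaxAttempts: 3, NextRunAt: now}
	var trace []string
	for claim(j, now, 30*time.Second) {
		if j.Attempts == 2 {
			now = now.Add(31 * time.Second) // lease runs out, nothing is reported
			reap(j, now)
		} else {
			_ = fail(j, now, "handler error")
		}
		trace = append(trace, fmt.Sprintf("attempt %d -> %s", j.Attempts, j.Status))
		if j.Status == "RETRYING" {
			now = j.NextRunAt // jump to the moment the row becomes eligible again
		}
	}
	return trace
}

func main() {
	for _, line := range simulate() {
		fmt.Println(line)
	}
}
```

In this model the silently crashed second attempt still consumes an attempt and lands in RETRYING through the same function as a reported failure, and the loop ends on its own once the row is DEAD_LETTERED because claim no longer matches it.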

Alternatives considered

A separate retry-queue table

Move failed jobs to retries and promote them back when due. Adds a two-table state machine, a promoter loop, and a window where a job exists in both or neither — all to represent what one timestamp column already represents. Rejected: next_run_at on the row is the degenerate (and sufficient) retry queue.

Exponential backoff with jitter

The textbook answer for thundering herds. Deferred: with max_attempts = 3 a job backs off at most twice (1s, then 4s; the third failure dead-letters instead of scheduling the 9s delay), so jitter has nothing to smooth at this scale, and attempts² keeps the policy readable in the SQL itself. The CASE expression is the single place to change if per-topic policy or jitter ever earns its way in.
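
The arithmetic can be checked with a short sketch. It assumes only the rules from the Decision section (attempts is incremented at claim, and a failure retries only while attempts < max_attempts); the function names backoff and schedule are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the delay applied when the given attempt fails:
// attempts² seconds.
func backoff(attempts int) time.Duration {
	return time.Duration(attempts*attempts) * time.Second
}

// schedule lists the delays a job can actually experience before it is
// dead-lettered. The failure of the final attempt never backs off, so the
// result has maxAttempts-1 entries.
func schedule(maxAttempts int) []time.Duration {
	var delays []time.Duration
	for attempt := 1; attempt < maxAttempts; attempt++ {
		delays = append(delays, backoff(attempt))
	}
	return delays
}

func main() {
	fmt.Println(schedule(3)) // default max_attempts
	fmt.Println(schedule(5)) // a per-job override
}
```

With the default of 3, the schedule is 1s then 4s. A per-job override of 5 extends it to 1s, 4s, 9s, 16s, which is still a short and predictable sequence.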

TTL-based expiry (jobs just go stale)

No terminal state; old failures age out. Rejected: a dead-lettered row with last_error is the operator's evidence. Silent expiry converts "this handler has a bug" into "where did my job go".

Drop on exhaust

Simplest possible. Rejected for the same reason: the entire value of a max-attempts cap is what you keep when you hit it.

Consequences

  • Zero extra machinery. Retry scheduling costs no new table, loop, or index — the claim's existing next_run_at <= now() comparison does all of it.
  • One retry path. Reported failures and reaped leases converge on the same CASE branch, so there is exactly one place where "retry or dead-letter" is decided.
  • Only last_error survives: each failure overwrites the previous reason. Losing per-attempt history is the accepted trade-off of the single-row design; if attempt-level forensics are ever needed, that is a separate append-only concern, not more columns on this row.
  • Policy is hardcoded (attempts², max_attempts default 3 with a per-job column override at submit). Per-topic retry config is deferred until a real topic needs it.
  • No replay yet. Un-dead-lettering (reset to PENDING, clear attempts) is an admin feature for later; today it's a manual UPDATE.
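
What a future replay would need to do can be sketched on an in-memory row. Everything here is hypothetical (the Job struct, requeue, ErrNotDeadLettered), and keeping last_error on replay is an assumption; the manual UPDATE used today is not shown in this ADR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Job is a hypothetical in-memory stand-in for one row of the jobs table.
type Job struct {
	Status    string
	Attempts  int
	LastError string
	NextRunAt time.Time
}

// ErrNotDeadLettered is a hypothetical guard error for this sketch.
var ErrNotDeadLettered = errors.New("job is not dead-lettered")

// requeue resets a dead-lettered job so the claim predicate matches it again.
// Attempts must be cleared: it is incremented at claim, so a job requeued
// with attempts still at max_attempts would dead-letter on its first failure.
func requeue(j *Job, now time.Time) error {
	if j.Status != "DEAD_LETTERED" {
		return ErrNotDeadLettered
	}
	j.Status = "PENDING"
	j.Attempts = 0
	j.NextRunAt = now // eligible immediately
	return nil
}

func main() {
	j := &Job{Status: "DEAD_LETTERED", Attempts: 3, LastError: "handler error"}
	fmt.Println(requeue(j, time.Now()), j.Status, j.Attempts)
}
```

The status guard plays the same role as WHERE status = 'RUNNING' in Fail: a replay aimed at a row that is not dead-lettered changes nothing.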
