ADR-0008: Scheduled and recurring jobs

Context

Three asks that keep coming up for a job queue, none of which pulse did:

Run once, later — "send this reminder at 09:00 tomorrow."
Run on an interval — "reconcile balances every 5 minutes."
Run on a cron expression — "roll up metrics at the top of every hour."

The core queue is one jobs table: a row per job carrying status, retry state, and the worker lease, claimed in batches with FOR UPDATE SKIP LOCKED (internal/repos/postgres/jobs.go ClaimBatch). The next_run_at column already gates when a row becomes claimable — the claim predicate is status IN ('PENDING','RETRYING') AND next_run_at <= now(). So "don't run before T" is a solved problem; what's missing is something that produces job rows on a timetable.

The hard requirements for that producer:

Exactly-once per occurrence. A 09:00 fire must yield one job — not zero after a crash, not two when the process restarts or a second scheduler instance runs.
Crash-safe without coordination. No leader election, no distributed lock service.
A schedule is editable. Pausing or retuning a schedule must affect future occurrences without touching anything already enqueued.

Decision

A schedule is a mutable config row; a fire is an ordinary job insert. Two tables, one loop, and idempotency does all the concurrency work.

Schedules are a config table, not jobs. db/migrations/0002_CreateSchedules.go creates schedules: kind ('once' | 'interval' | 'cron'), cron_expr / interval_ms (one populated per kind, the other NULL — see cronArg/intervalArg in internal/repos/postgres/schedules.go), paused, and next_run_at — the only place scheduling-time lives. The partial index idx_schedules_due on (next_run_at) WHERE NOT paused serves the loop's one hot query. Rows are mutable: SetPaused flips a boolean, Delete removes the generator; jobs already spawned are untouched.
One polling loop. internal/service/scheduler.go ticks every scheduleInterval (1s, cmd/pulsed/main.go), calls schedules.Due(now, 100) — a plain indexed read — and fires each due schedule independently, logging and skipping a failed one so it cannot stall the rest (tick).
Fire = deterministic id + idempotent insert. fire inserts a job whose id is uuid.NewV5(scheduleNamespace, scheduleID + "|" + occurrence) (deterministicJobID), and jobs.Insert is ON CONFLICT (id) DO NOTHING (internal/repos/postgres/jobs.go). Re-firing the same occurrence — after a crash, or from a second scheduler instance that read the same due row — re-derives the same id and the duplicate insert is a no-op. Exactly-once per occurrence is a property of the jobs table's primary key, not of any lock the scheduler holds; Due needs no claim locking for correctness.
Insert before advance; advance is a CAS. fire inserts the job first, then moves the cursor. A crash between the two steps means the next tick sees the schedule still due, re-derives the same id, no-ops the insert, and advances — nothing lost, nothing doubled. Advance (internal/repos/postgres/schedules.go) is compare-and-set on the cursor: WHERE id = @id AND next_run_at = @occurrence, so a stale writer cannot move the cursor backwards or re-open a slot that already advanced. A once schedule is deleted instead of advanced.
Catch-up policy: fire once, resync forward. nextRun (internal/service/scheduler.go) computes the next occurrence from now, not from the stale next_run_at. If the scheduler was down for an hour, a per-minute schedule fires once and resumes on its normal cadence — missed occurrences collapse into a single fire rather than replaying sixty times into the queue.
Lineage is a column. Every spawned job carries schedule_id (db/migrations/0001_CreateJobs.go, indexed by idx_jobs_schedule_id). One fire is one job, so a schedule's fire history is its jobs: ListScheduleFires (internal/transport/grpc/schedule.go) lists jobs WHERE schedule_id = $1 and reports the job's submitted_at as fired-at. No separate fire-history table to keep consistent.

Alternatives considered

Cron inside the worker process

Use an in-process cron library (robfig/cron as a runner, not just a parser) in each worker. Rejected: N worker replicas fire N times per occurrence, and a schedule dies with the process that holds it. Schedules must be durable, centrally stored state — which is exactly what a table is.

Materialize one row per future occurrence

On schedule creation, insert the next N jobs with future next_run_at values. Rejected: a recurring schedule has unbounded occurrences, so something must top the horizon up anyway — you still need the loop, now plus horizon bookkeeping. Worse, editing or pausing a schedule means finding and retracting already-inserted future rows. One cursor (next_run_at on the schedule) makes an edit a single-row UPDATE.

Advance the cursor before inserting the job

Move next_run_at first, then insert. Rejected: a crash between the two steps silently loses the occurrence — the cursor says "done", the job never existed, and nothing will retry it. Insert-before-advance fails in the opposite direction — a possible duplicate attempt — and the deterministic id makes that direction free. Choose the failure mode you can dedupe.

Self-rescheduling jobs

Each job enqueues its successor when it completes. Rejected: cadence drifts by execution time, the chain breaks permanently if one run dead-letters, and the schedule's definition ends up buried in handler code where it can't be listed, paused, or edited. A schedule is config; config belongs in a row.

Consequences

Positive

Exactly-once per occurrence with zero coordination: it falls out of a v5 UUID, a primary key, and a CAS. Multiple scheduler instances are safe by construction.
A fire is an ordinary job — it inherits retry, backoff, dead-letter, priority, and the claim path for free; the scheduler knows nothing about execution.
Fire history requires no extra storage: it is a WHERE schedule_id query.
Pausing/editing a schedule is one row touched; in-flight jobs are unaffected.

Negative / costs accepted

Polling: up to scheduleInterval (1s) of fire latency, and one indexed query per tick even when nothing is due. Accepted — the partial index keeps the idle cost trivial.
"Fire once, resync forward" means missed occurrences are not replayed. For a metrics-rollup that's correct; a schedule that must process every slot needs to derive the slot from data, not from fire count. Documented, not configurable yet.
Advance computes the next cron occurrence in Go (cron.ParseStandard per fire) — an unparseable expression is caught at fire time, after insert. The job still runs; the cursor stalls and the error is logged each tick.
A tick is bounded at 100 due schedules (scheduleBatch); a backlog larger than that drains across ticks.

ADR-0008: Scheduled and recurring jobs

On this page