ADR-0008: Scheduled and recurring jobs
Context
Three asks that keep coming up for a job queue, none of which pulse did:
- Run once, later — "send this reminder at 09:00 tomorrow."
- Run on an interval — "reconcile balances every 5 minutes."
- Run on a cron expression — "roll up metrics at the top of every hour."
The core queue is one jobs table: a row per job carrying status, retry state, and the
worker lease, claimed in batches with FOR UPDATE SKIP LOCKED
(internal/repos/postgres/jobs.go ClaimBatch). The next_run_at column already gates
when a row becomes claimable — the claim predicate is
status IN ('PENDING','RETRYING') AND next_run_at <= now(). So "don't run before T" is a
solved problem; what's missing is something that produces job rows on a timetable.
The hard requirements for that producer:
- Exactly-once per occurrence. A 09:00 fire must yield one job — not zero after a crash, not two when the process restarts or a second scheduler instance runs.
- Crash-safe without coordination. No leader election, no distributed lock service.
- A schedule is editable. Pausing or retuning a schedule must affect future occurrences without touching anything already enqueued.
Decision
A schedule is a mutable config row; a fire is an ordinary job insert. Two tables, one loop, and idempotency does all the concurrency work.
-
Schedules are a config table, not jobs.
db/migrations/0002_CreateSchedules.gocreatesschedules:kind('once' | 'interval' | 'cron'),cron_expr/interval_ms(one populated per kind, the other NULL — seecronArg/intervalArgininternal/repos/postgres/schedules.go),paused, andnext_run_at— the only place scheduling-time lives. The partial indexidx_schedules_dueon(next_run_at) WHERE NOT pausedserves the loop's one hot query. Rows are mutable:SetPausedflips a boolean,Deleteremoves the generator; jobs already spawned are untouched. -
One polling loop.
internal/service/scheduler.goticks everyscheduleInterval(1s,cmd/pulsed/main.go), callsschedules.Due(now, 100)— a plain indexed read — and fires each due schedule independently, logging and skipping a failed one so it cannot stall the rest (tick). -
Fire = deterministic id + idempotent insert.
fireinserts a job whose id isuuid.NewV5(scheduleNamespace, scheduleID + "|" + occurrence)(deterministicJobID), andjobs.InsertisON CONFLICT (id) DO NOTHING(internal/repos/postgres/jobs.go). Re-firing the same occurrence — after a crash, or from a second scheduler instance that read the same due row — re-derives the same id and the duplicate insert is a no-op. Exactly-once per occurrence is a property of the jobs table's primary key, not of any lock the scheduler holds;Dueneeds no claim locking for correctness. -
Insert before advance; advance is a CAS.
fireinserts the job first, then moves the cursor. A crash between the two steps means the next tick sees the schedule still due, re-derives the same id, no-ops the insert, and advances — nothing lost, nothing doubled.Advance(internal/repos/postgres/schedules.go) is compare-and-set on the cursor:WHERE id = @id AND next_run_at = @occurrence, so a stale writer cannot move the cursor backwards or re-open a slot that already advanced. Aonceschedule is deleted instead of advanced. -
Catch-up policy: fire once, resync forward.
nextRun(internal/service/scheduler.go) computes the next occurrence fromnow, not from the stalenext_run_at. If the scheduler was down for an hour, a per-minute schedule fires once and resumes on its normal cadence — missed occurrences collapse into a single fire rather than replaying sixty times into the queue. -
Lineage is a column. Every spawned job carries
schedule_id(db/migrations/0001_CreateJobs.go, indexed byidx_jobs_schedule_id). One fire is one job, so a schedule's fire history is its jobs:ListScheduleFires(internal/transport/grpc/schedule.go) listsjobs WHERE schedule_id = $1and reports the job'ssubmitted_atas fired-at. No separate fire-history table to keep consistent.
Alternatives considered
Cron inside the worker process
Use an in-process cron library (robfig/cron as a runner, not just a parser) in each worker. Rejected: N worker replicas fire N times per occurrence, and a schedule dies with the process that holds it. Schedules must be durable, centrally stored state — which is exactly what a table is.
Materialize one row per future occurrence
On schedule creation, insert the next N jobs with future next_run_at values.
Rejected: a recurring schedule has unbounded occurrences, so something must top the
horizon up anyway — you still need the loop, now plus horizon bookkeeping. Worse, editing
or pausing a schedule means finding and retracting already-inserted future rows. One
cursor (next_run_at on the schedule) makes an edit a single-row UPDATE.
Advance the cursor before inserting the job
Move next_run_at first, then insert.
Rejected: a crash between the two steps silently loses the occurrence — the cursor
says "done", the job never existed, and nothing will retry it. Insert-before-advance
fails in the opposite direction — a possible duplicate attempt — and the deterministic id
makes that direction free. Choose the failure mode you can dedupe.
Self-rescheduling jobs
Each job enqueues its successor when it completes. Rejected: cadence drifts by execution time, the chain breaks permanently if one run dead-letters, and the schedule's definition ends up buried in handler code where it can't be listed, paused, or edited. A schedule is config; config belongs in a row.
Consequences
Positive
- Exactly-once per occurrence with zero coordination: it falls out of a v5 UUID, a primary key, and a CAS. Multiple scheduler instances are safe by construction.
- A fire is an ordinary job — it inherits retry, backoff, dead-letter, priority, and the claim path for free; the scheduler knows nothing about execution.
- Fire history requires no extra storage: it is a
WHERE schedule_idquery. - Pausing/editing a schedule is one row touched; in-flight jobs are unaffected.
Negative / costs accepted
- Polling: up to
scheduleInterval(1s) of fire latency, and one indexed query per tick even when nothing is due. Accepted — the partial index keeps the idle cost trivial. - "Fire once, resync forward" means missed occurrences are not replayed. For a metrics-rollup that's correct; a schedule that must process every slot needs to derive the slot from data, not from fire count. Documented, not configurable yet.
Advancecomputes the next cron occurrence in Go (cron.ParseStandardper fire) — an unparseable expression is caught at fire time, after insert. The job still runs; the cursor stalls and the error is logged each tick.- A tick is bounded at 100 due schedules (
scheduleBatch); a backlog larger than that drains across ticks.