pulse
Design decisions (ADRs)

ADR-0008: Scheduled and recurring jobs

Context

Three asks that keep coming up for a job queue, none of which pulse did:

  1. Run once, later — "send this reminder at 09:00 tomorrow."
  2. Run on an interval — "reconcile balances every 5 minutes."
  3. Run on a cron expression — "roll up metrics at the top of every hour."

The core queue is one jobs table: a row per job carrying status, retry state, and the worker lease, claimed in batches with FOR UPDATE SKIP LOCKED (internal/repos/postgres/jobs.go ClaimBatch). The next_run_at column already gates when a row becomes claimable — the claim predicate is status IN ('PENDING','RETRYING') AND next_run_at <= now(). So "don't run before T" is a solved problem; what's missing is something that produces job rows on a timetable.

The hard requirements for that producer:

  • Exactly-once per occurrence. A 09:00 fire must yield one job — not zero after a crash, not two when the process restarts or a second scheduler instance runs.
  • Crash-safe without coordination. No leader election, no distributed lock service.
  • A schedule is editable. Pausing or retuning a schedule must affect future occurrences without touching anything already enqueued.

Decision

A schedule is a mutable config row; a fire is an ordinary job insert. Two tables, one loop, and idempotency does all the concurrency work.

  1. Schedules are a config table, not jobs. db/migrations/0002_CreateSchedules.go creates schedules: kind ('once' | 'interval' | 'cron'), cron_expr / interval_ms (one populated per kind, the other NULL — see cronArg/intervalArg in internal/repos/postgres/schedules.go), paused, and next_run_at — the only place scheduling-time lives. The partial index idx_schedules_due on (next_run_at) WHERE NOT paused serves the loop's one hot query. Rows are mutable: SetPaused flips a boolean, Delete removes the generator; jobs already spawned are untouched.

  2. One polling loop. internal/service/scheduler.go ticks every scheduleInterval (1s, cmd/pulsed/main.go), calls schedules.Due(now, 100) — a plain indexed read — and fires each due schedule independently, logging and skipping a failed one so it cannot stall the rest (tick).

  3. Fire = deterministic id + idempotent insert. fire inserts a job whose id is uuid.NewV5(scheduleNamespace, scheduleID + "|" + occurrence) (deterministicJobID), and jobs.Insert is ON CONFLICT (id) DO NOTHING (internal/repos/postgres/jobs.go). Re-firing the same occurrence — after a crash, or from a second scheduler instance that read the same due row — re-derives the same id and the duplicate insert is a no-op. Exactly-once per occurrence is a property of the jobs table's primary key, not of any lock the scheduler holds; Due needs no claim locking for correctness.

  4. Insert before advance; advance is a CAS. fire inserts the job first, then moves the cursor. A crash between the two steps means the next tick sees the schedule still due, re-derives the same id, no-ops the insert, and advances — nothing lost, nothing doubled. Advance (internal/repos/postgres/schedules.go) is compare-and-set on the cursor: WHERE id = @id AND next_run_at = @occurrence, so a stale writer cannot move the cursor backwards or re-open a slot that already advanced. A once schedule is deleted instead of advanced.

  5. Catch-up policy: fire once, resync forward. nextRun (internal/service/scheduler.go) computes the next occurrence from now, not from the stale next_run_at. If the scheduler was down for an hour, a per-minute schedule fires once and resumes on its normal cadence — missed occurrences collapse into a single fire rather than replaying sixty times into the queue.

  6. Lineage is a column. Every spawned job carries schedule_id (db/migrations/0001_CreateJobs.go, indexed by idx_jobs_schedule_id). One fire is one job, so a schedule's fire history is its jobs: ListScheduleFires (internal/transport/grpc/schedule.go) lists jobs WHERE schedule_id = $1 and reports the job's submitted_at as fired-at. No separate fire-history table to keep consistent.

Alternatives considered

Cron inside the worker process

Use an in-process cron library (robfig/cron as a runner, not just a parser) in each worker. Rejected: N worker replicas fire N times per occurrence, and a schedule dies with the process that holds it. Schedules must be durable, centrally stored state — which is exactly what a table is.

Materialize one row per future occurrence

On schedule creation, insert the next N jobs with future next_run_at values. Rejected: a recurring schedule has unbounded occurrences, so something must top the horizon up anyway — you still need the loop, now plus horizon bookkeeping. Worse, editing or pausing a schedule means finding and retracting already-inserted future rows. One cursor (next_run_at on the schedule) makes an edit a single-row UPDATE.

Advance the cursor before inserting the job

Move next_run_at first, then insert. Rejected: a crash between the two steps silently loses the occurrence — the cursor says "done", the job never existed, and nothing will retry it. Insert-before-advance fails in the opposite direction — a possible duplicate attempt — and the deterministic id makes that direction free. Choose the failure mode you can dedupe.

Self-rescheduling jobs

Each job enqueues its successor when it completes. Rejected: cadence drifts by execution time, the chain breaks permanently if one run dead-letters, and the schedule's definition ends up buried in handler code where it can't be listed, paused, or edited. A schedule is config; config belongs in a row.

Consequences

Positive

  • Exactly-once per occurrence with zero coordination: it falls out of a v5 UUID, a primary key, and a CAS. Multiple scheduler instances are safe by construction.
  • A fire is an ordinary job — it inherits retry, backoff, dead-letter, priority, and the claim path for free; the scheduler knows nothing about execution.
  • Fire history requires no extra storage: it is a WHERE schedule_id query.
  • Pausing/editing a schedule is one row touched; in-flight jobs are unaffected.

Negative / costs accepted

  • Polling: up to scheduleInterval (1s) of fire latency, and one indexed query per tick even when nothing is due. Accepted — the partial index keeps the idle cost trivial.
  • "Fire once, resync forward" means missed occurrences are not replayed. For a metrics-rollup that's correct; a schedule that must process every slot needs to derive the slot from data, not from fire count. Documented, not configurable yet.
  • Advance computes the next cron occurrence in Go (cron.ParseStandard per fire) — an unparseable expression is caught at fire time, after insert. The job still runs; the cursor stalls and the error is logged each tick.
  • A tick is bounded at 100 due schedules (scheduleBatch); a backlog larger than that drains across ticks.

On this page