
ADR-0010: Pause/resume dispatch

Context

An operator needs to stop the system from handing out new work — a maintenance window, a downstream outage, "I need to touch the DB without jobs starting" — without stopping the servers, dropping worker connections, or losing submissions. Pause the faucet, not the plumbing.

What "paused" must mean, precisely:

  • New assignments stop. No job moves to RUNNING while paused.
  • Everything else continues. Producers keep submitting (rows accumulate as PENDING), running jobs finish and report Complete/Fail, heartbeats keep leases alive, the watchdog keeps reaping, and the scheduler keeps firing — its spawned jobs simply queue.
  • It survives a restart. A pause set before a deploy must still hold when the process comes back.
  • It covers every instance. With several pulsed servers sharing the database, one Pause call must stop dispatch everywhere, promptly.

And it must not tax the hot path: each connected worker's stream ticks every 500ms (dispatchInterval, internal/transport/grpc/dispatcher.go); the pause check runs on every one of those ticks.

Decision

Durable truth in a singleton row; a per-process in-memory gate on the hot path; a refresh loop reconciling the two.

  1. The durable switch is one row. db/migrations/0003_CreateDispatchControl.go creates dispatch_control with paused, reason, paused_at, and a CHECK (id = 1) singleton constraint, seeded paused = false so a fresh deployment dispatches. There is exactly one unambiguous answer to "is dispatch paused?", shared by every instance.

  2. The hot path reads memory, never the DB. DispatchGate (internal/service/pause.go) is an atomic bool. The dispatcher's tick checks it first: dispatchReady returns before claiming when d.gate.Paused() (internal/transport/grpc/dispatcher.go). That single if is the only enforcement point — pause gates the claim, so no PENDING row can become RUNNING, and every other flow (submit, complete, fail, heartbeat, watchdog, scheduler) is untouched by construction rather than by N scattered checks. A nil gate reads as open, so a dispatcher wired without one behaves normally.

  3. Boot: prime, fail-safe. Prime reads the row once and sets the gate before the server accepts workers; cmd/pulsed/main.go aborts startup if it fails. At boot there is no known-good value — serving with a guessed gate could dispatch straight into the outage the operator paused for. Refusing to start is the safe failure.

  4. Runtime: refresh, fail-open. Run re-reads the row every pauseRefreshInterval (1s, cmd/pulsed/main.go) and stores the result in the gate; on a read error it logs and keeps the last known value (refresh, internal/service/pause.go). At runtime a known-good value exists, so the honest move on a transient DB blip is to keep it — failing closed here would let any hiccup in the control-plane read halt all dispatch, turning a monitoring nuisance into an outage. The boot/runtime asymmetry is deliberate: fail-safe when you know nothing, fail-open when you know something.

  5. Pause/Resume write through, then flip locally. Pause writes the row, then sets the local gate immediately, so the instance that served the RPC stops on its next tick without waiting for its own refresh. Other instances converge within one refresh interval, so cross-instance convergence is ≤1s. If the DB write fails, the gate is untouched and the error surfaces: the in-memory gate never runs ahead of the durable row.

  6. Resume clears the reason, keeps the timestamp. SetPaused (internal/repos/postgres/dispatch_control.go) overwrites reason (to '' on resume) but stamps paused_at only when pausing — CASE WHEN @paused THEN now() ELSE paused_at END — so Status can always answer "when was this last paused?" while never showing a stale reason for a running system.
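A minimal sketch of how items 2 through 5 fit together, under stated assumptions: the names DispatchGate, Prime, refresh, Pause, Paused, and SetPaused come from this ADR, while ControlStore, PauseService, IsPaused, memStore, and runScenario are hypothetical names introduced only for illustration. The real code in internal/service/pause.go may be shaped differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// ControlStore is a hypothetical stand-in for the dispatch_control repository.
type ControlStore interface {
	IsPaused(ctx context.Context) (bool, error)
	SetPaused(ctx context.Context, paused bool, reason string) error
}

// DispatchGate is the per-process switch the dispatcher reads on every tick.
type DispatchGate struct{ paused atomic.Bool }

// Paused is nil-safe: a dispatcher wired without a gate reads as open.
func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

// PauseService (hypothetical name) reconciles the durable row with the gate.
type PauseService struct {
	store ControlStore
	gate  *DispatchGate
}

// Prime loads the row once at boot; the caller should abort startup on error.
func (s *PauseService) Prime(ctx context.Context) error {
	paused, err := s.store.IsPaused(ctx)
	if err != nil {
		return fmt.Errorf("prime dispatch gate: %w", err)
	}
	s.gate.paused.Store(paused)
	return nil
}

// refresh is one iteration of the runtime loop. On a read error it keeps the
// last known value (real code would also log the error).
func (s *PauseService) refresh(ctx context.Context) {
	if paused, err := s.store.IsPaused(ctx); err == nil {
		s.gate.paused.Store(paused)
	}
}

// Pause writes the row first and flips the local gate only if the write succeeded.
func (s *PauseService) Pause(ctx context.Context, reason string) error {
	if err := s.store.SetPaused(ctx, true, reason); err != nil {
		return err
	}
	s.gate.paused.Store(true)
	return nil
}

// memStore is an in-memory fake used only to exercise the sketch.
type memStore struct {
	paused bool
	err    error
}

func (m *memStore) IsPaused(context.Context) (bool, error) { return m.paused, m.err }

func (m *memStore) SetPaused(_ context.Context, paused bool, _ string) error {
	if m.err == nil {
		m.paused = paused
	}
	return m.err
}

// runScenario walks through boot, a failed refresh, and a failed then successful Pause.
func runScenario() (primeFailed, keptOnReadError, untouchedOnWriteError, pausedAfterWrite bool) {
	ctx := context.Background()
	down := errors.New("db unreachable")

	// Boot with the DB down: Prime must surface the error.
	boot := &PauseService{store: &memStore{err: down}, gate: &DispatchGate{}}
	primeFailed = boot.Prime(ctx) != nil

	// Runtime: the gate learns paused=true, then reads start failing.
	store := &memStore{paused: true}
	svc := &PauseService{store: store, gate: &DispatchGate{}}
	svc.refresh(ctx)
	store.err, store.paused = down, false
	svc.refresh(ctx) // read fails, so the gate keeps its last known value
	keptOnReadError = svc.gate.Paused()

	// Pause with a failing write leaves an open gate open.
	store2 := &memStore{err: down}
	svc2 := &PauseService{store: store2, gate: &DispatchGate{}}
	untouchedOnWriteError = svc2.Pause(ctx, "maintenance") != nil && !svc2.gate.Paused()

	// Once the write succeeds, the local gate flips immediately.
	store2.err = nil
	pausedAfterWrite = svc2.Pause(ctx, "maintenance") == nil && svc2.gate.Paused()
	return
}

func main() {
	fmt.Println(runScenario())
}
```

All four values returned by runScenario should be true. The asymmetry shows up in the signatures: Prime returns an error for the caller to act on, while refresh swallows the error and leaves the gate alone.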

Alternatives considered

In-memory flag only

A bool on the server, flipped by the RPC. Rejected: fails both durability requirements at once — a restart silently resumes dispatch mid-maintenance, and with multiple instances the operator must find and pause each one. Pause is operational state, and state that must outlive a process belongs in the database.

Pause as a job status (PENDINGPAUSED rows)

Mark the queued jobs themselves paused. Rejected: conflates a global control-plane switch with per-job state. Pausing becomes a mass UPDATE over the whole backlog, resuming another; jobs submitted while paused need intercepting at insert; and the claim query gains a new status to reason about. One singleton row does the same with one write, and the jobs table stays a description of jobs, not of operator intent.

LISTEN/NOTIFY for instant propagation

Postgres pub/sub so instances react to Pause in milliseconds instead of ≤1s. Deferred: it's an optimization layered on the same durable row, not a replacement — notifications are lossy (a disconnected listener misses them), so the poll must exist anyway as the reconciler. Adding a second propagation path plus reconnect handling to shave under a second off a human-scale operation isn't worth it yet.

Read the row on every dispatch tick

Skip the gate; check dispatch_control inside dispatchReady. Rejected: it puts a control-plane query on the data-plane hot path, with every connected worker paying a DB round-trip twice a second, forever, to almost always learn "not paused". The gate keeps staleness bounded at ≤1s (versus ≤500ms for a per-tick read) for one query per second per instance, independent of worker count, plus an atomic load per tick.
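The load difference can be worked through with a few lines of arithmetic. The sketch below uses the two intervals named in this ADR (500ms dispatch tick, 1s refresh); the worker count of 200 and the function names are assumptions chosen only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	dispatchInterval     = 500 * time.Millisecond // per-worker tick, from this ADR
	pauseRefreshInterval = time.Second            // per-instance refresh, from this ADR
)

// perTickReadsPerSecond is the control-row query rate if every worker tick
// read dispatch_control directly.
func perTickReadsPerSecond(workers int, tick time.Duration) float64 {
	return float64(workers) * float64(time.Second) / float64(tick)
}

// gateReadsPerSecond is the query rate of one instance's refresh loop,
// which does not depend on how many workers are connected.
func gateReadsPerSecond(refresh time.Duration) float64 {
	return float64(time.Second) / float64(refresh)
}

func main() {
	const workers = 200 // hypothetical number of workers on one instance
	fmt.Printf("read per tick: %.0f queries/s\n", perTickReadsPerSecond(workers, dispatchInterval))
	fmt.Printf("gate refresh:  %.0f queries/s\n", gateReadsPerSecond(pauseRefreshInterval))
}
```

With 200 connected workers the per-tick design issues 400 queries per second against one for the gate, and the gap widens linearly with worker count.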

Consequences

Positive

  • Pause survives restarts and applies to every instance within ~1s of the RPC.
  • Negligible hot-path cost: one atomic bool load per dispatch tick, with no DB round-trip.
  • One enforcement point means the invariant is auditable at a glance: paused ⇒ no claim ⇒ no new RUNNING rows; everything downstream of a claim already made proceeds normally to completion.
  • Submits, retries, scheduler fires, and heartbeats behave identically paused or not — resume drains the accumulated backlog through the ordinary claim path, priority order intact.
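The single enforcement point can be sketched as follows. This is an illustration, not the code in internal/transport/grpc/dispatcher.go: dispatchReady, DispatchGate, and Paused are named in this ADR, while the Dispatcher struct layout, the claim callback, Set, and countClaims are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// DispatchGate is the in-memory switch; a nil gate reads as open.
type DispatchGate struct{ paused atomic.Bool }

func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

// Set is a hypothetical setter used here in place of Pause/Resume/refresh.
func (g *DispatchGate) Set(paused bool) { g.paused.Store(paused) }

// Dispatcher is a hypothetical shape; claim stands in for the query that
// moves one PENDING job to RUNNING.
type Dispatcher struct {
	gate  *DispatchGate
	claim func() bool
}

// dispatchReady runs once per tick. The gate check is the only place pause is
// enforced: when paused, it returns before any claim is attempted.
func (d *Dispatcher) dispatchReady() bool {
	if d.gate.Paused() {
		return false
	}
	return d.claim()
}

// countClaims runs the given number of ticks and reports how many claims ran.
func countClaims(gate *DispatchGate, ticks int) int {
	claims := 0
	d := &Dispatcher{gate: gate, claim: func() bool { claims++; return true }}
	for i := 0; i < ticks; i++ {
		d.dispatchReady()
	}
	return claims
}

func main() {
	gate := &DispatchGate{}
	gate.Set(true)
	fmt.Println("paused:", countClaims(gate, 3))
	gate.Set(false)
	fmt.Println("resumed:", countClaims(gate, 3))
	fmt.Println("nil gate:", countClaims(nil, 3))
}
```

Because nothing other than dispatchReady consults the gate, submit, complete, fail, heartbeat, watchdog, and scheduler paths need no pause awareness at all.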

Negative / costs accepted

  • Pause is not instantaneous everywhere: an instance other than the one that served the RPC may hand out assignments for up to one refresh interval (≤1s). Acceptable for an operator-scale action.
  • Fail-open at runtime means a partitioned instance keeps its last state: if the DB becomes unreachable after a pause was set elsewhere but before this instance refreshed, it keeps dispatching until connectivity returns. (In practice claims need the same DB, so dispatch is failing anyway.)
  • The backlog grows without bound while paused: pause does not shed load, it defers it.
  • One global switch: no per-topic pause. The dispatch_control shape (a keyed row per scope) extends to that later without changing the mechanism.
