ADR-0010: Pause/resume dispatch

Context

An operator needs to stop the system from handing out new work — a maintenance window, a downstream outage, "I need to touch the DB without jobs starting" — without stopping the servers, dropping worker connections, or losing submissions. Pause the faucet, not the plumbing.

What "paused" must mean, precisely:

New assignments stop. No job moves to RUNNING while paused.
Everything else continues. Producers keep submitting (rows accumulate as PENDING), running jobs finish and report Complete/Fail, heartbeats keep leases alive, the watchdog keeps reaping, and the scheduler keeps firing — its spawned jobs simply queue.
It survives a restart. A pause set before a deploy must still hold when the process comes back.
It covers every instance. With several pulsed servers sharing the database, one Pause call must stop dispatch everywhere, promptly.

And it must not tax the hot path: each connected worker's stream ticks every 500ms (dispatchInterval, internal/transport/grpc/dispatcher.go); the pause check runs on every one of those ticks.

Decision

Durable truth in a singleton row; a per-process in-memory gate on the hot path; a refresh loop reconciling the two.

The durable switch is one row. db/migrations/0003_CreateDispatchControl.go creates dispatch_control with paused, reason, paused_at, and a CHECK (id = 1) singleton constraint, seeded paused = false so a fresh deployment dispatches. There is exactly one unambiguous answer to "is dispatch paused?", shared by every instance.
The hot path reads memory, never the DB. DispatchGate (internal/service/pause.go) is an atomic bool. The dispatcher's tick checks it first: dispatchReady returns before claiming when d.gate.Paused() (internal/transport/grpc/dispatcher.go). That single if is the only enforcement point — pause gates the claim, so no PENDING row can become RUNNING, and every other flow (submit, complete, fail, heartbeat, watchdog, scheduler) is untouched by construction rather than by N scattered checks. A nil gate reads as open, so a dispatcher wired without one behaves normally.
Boot: prime, fail-safe. Prime reads the row once and sets the gate before the server accepts workers; cmd/pulsed/main.go aborts startup if it fails. At boot there is no known-good value — serving with a guessed gate could dispatch straight into the outage the operator paused for. Refusing to start is the safe failure.
Runtime: refresh, fail-open. Run re-reads the row every pauseRefreshInterval (1s, cmd/pulsed/main.go) and stores the result in the gate; on a read error it logs and keeps the last known value (refresh, internal/service/pause.go). "Fail-open" here means holding that last value, whichever way it points, rather than forcing the gate open. At runtime a known-good value exists, so the honest move on a transient DB blip is to keep it: failing closed here would let any hiccup in the control-plane read halt all dispatch, turning a monitoring nuisance into an outage. The boot/runtime asymmetry is deliberate: fail-safe when you know nothing, fail-open when you know something.
Pause/Resume write through, then flip locally. Pause writes the row, then sets the local gate immediately, so the instance that served the RPC stops on its next tick without waiting for its own refresh; other instances converge within one refresh interval (≤1s). If the DB write fails, the gate is untouched and the error surfaces: the in-memory gate never runs ahead of the durable row.
Resume clears the reason, keeps the timestamp. SetPaused (internal/repos/postgres/dispatch_control.go) overwrites reason (to '' on resume) but stamps paused_at only when pausing — CASE WHEN @paused THEN now() ELSE paused_at END — so Status can always answer "when was this last paused?" while never showing a stale reason for a running system.
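
The gate, the boot-time prime, the runtime refresh, and the write-through Pause described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/service/pause.go: the controlStore interface, the PauseService type, the memStore fake, and runDemo are all hypothetical names introduced for the example, and the real repository signatures may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// DispatchGate is the per-process, in-memory copy of the durable pause flag.
type DispatchGate struct{ paused atomic.Bool }

// Paused tolerates a nil receiver: a dispatcher wired without a gate reads as open.
func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

// controlStore is a hypothetical stand-in for the dispatch_control repository.
type controlStore interface {
	GetPaused(ctx context.Context) (bool, error)
	SetPaused(ctx context.Context, paused bool, reason string) error
}

// PauseService reconciles the durable row with the gate (illustrative name).
type PauseService struct { store controlStore; gate *DispatchGate }

// Prime reads the row once at boot; the caller is expected to abort on error.
func (s *PauseService) Prime(ctx context.Context) error {
	p, err := s.store.GetPaused(ctx)
	if err == nil {
		s.gate.paused.Store(p)
	}
	return err
}

// refresh re-reads the row; on a read error the gate keeps its last known value.
func (s *PauseService) refresh(ctx context.Context) {
	if p, err := s.store.GetPaused(ctx); err == nil {
		s.gate.paused.Store(p)
	} // a real implementation would log the error here
}

// Run calls refresh every interval until ctx is cancelled.
func (s *PauseService) Run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for ctx.Err() == nil {
		select {
		case <-ctx.Done():
		case <-t.C:
			s.refresh(ctx)
		}
	}
}

// Pause writes the row first and flips the gate only if the write succeeded.
func (s *PauseService) Pause(ctx context.Context, reason string) error {
	if err := s.store.SetPaused(ctx, true, reason); err != nil {
		return err
	}
	s.gate.paused.Store(true)
	return nil
}

// memStore is an in-memory fake; a non-nil err simulates an unreachable database.
type memStore struct { paused bool; err error }

func (m *memStore) GetPaused(context.Context) (bool, error) { return m.paused, m.err }

func (m *memStore) SetPaused(_ context.Context, paused bool, _ string) error {
	if m.err == nil {
		m.paused = paused
	}
	return m.err
}

// runDemo pauses, then refreshes against a failing store, and reports the gate
// state after each step.
func runDemo() (afterPause, afterFailedRefresh bool) {
	ctx := context.Background()
	st := &memStore{}
	svc := &PauseService{store: st, gate: &DispatchGate{}}
	if err := svc.Prime(ctx); err != nil {
		panic(err) // boot is fail-safe: no known-good value, so do not serve
	}
	_ = svc.Pause(ctx, "maintenance")
	afterPause = svc.gate.Paused()
	st.err = errors.New("db unreachable")
	svc.refresh(ctx) // the read fails, so the gate keeps its last known value
	return afterPause, svc.gate.Paused()
}

func main() {
	fmt.Println(runDemo()) // prints "true true"
}
```

In the demo the gate stays paused after the failed refresh, which is the runtime behaviour the decision calls for: a transient read error neither resumes nor halts dispatch, it leaves the last known answer in place.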

Alternatives considered

In-memory flag only

A bool on the server, flipped by the RPC. Rejected: fails both durability requirements at once — a restart silently resumes dispatch mid-maintenance, and with multiple instances the operator must find and pause each one. Pause is operational state, and state that must outlive a process belongs in the database.

Pause as a job status (PENDING → PAUSED rows)

Mark the queued jobs themselves paused. Rejected: conflates a global control-plane switch with per-job state. Pausing becomes a mass UPDATE over the whole backlog, resuming another; jobs submitted while paused need intercepting at insert; and the claim query gains a new status to reason about. One singleton row does the same with one write, and the jobs table stays a description of jobs, not of operator intent.

LISTEN/NOTIFY for instant propagation

Postgres pub/sub so instances react to Pause in milliseconds instead of ≤1s. Deferred: it's an optimization layered on the same durable row, not a replacement — notifications are lossy (a disconnected listener misses them), so the poll must exist anyway as the reconciler. Adding a second propagation path plus reconnect handling to shave under a second off a human-scale operation isn't worth it yet.

Read the row on every dispatch tick

Skip the gate; check dispatch_control inside dispatchReady. Rejected: puts a control-plane query on the data-plane hot path — every connected worker, twice a second, forever, paying a DB round-trip to almost always learn "not paused". The gate gives the same freshness bound (≤1s) for one query per second per instance, independent of worker count, and an atomic load per tick.
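
To make the load difference concrete, the arithmetic can be worked through in a few lines. The intervals come from this ADR; the worker and instance counts (400 and 3) are made-up numbers chosen only for illustration, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	dispatchInterval     = 500 * time.Millisecond // per-worker stream tick
	pauseRefreshInterval = time.Second            // per-instance gate refresh
)

// perTickQPS is the control-plane query rate if every worker stream read
// dispatch_control on every dispatch tick.
func perTickQPS(workers int) float64 {
	return float64(workers) * float64(time.Second) / float64(dispatchInterval)
}

// gateQPS is the query rate when each instance refreshes its gate instead.
func gateQPS(instances int) float64 {
	return float64(instances) * float64(time.Second) / float64(pauseRefreshInterval)
}

func main() {
	// Assumed deployment: 400 connected workers spread over 3 instances.
	fmt.Println(perTickQPS(400)) // 800
	fmt.Println(gateQPS(3))      // 3
}
```

The per-tick rate scales with the number of connected workers, while the gate's rate scales only with the number of instances.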

Consequences

Positive

Pause survives restarts and applies to every instance within ~1s of the RPC.
Zero hot-path cost: an atomic bool load per dispatch tick.
One enforcement point means the invariant is auditable at a glance: paused ⇒ no claim ⇒ no new RUNNING rows; everything downstream of a claim already made proceeds normally to completion.
Submits, retries, scheduler fires, and heartbeats behave identically paused or not — resume drains the accumulated backlog through the ordinary claim path, priority order intact.
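
The single enforcement point can be illustrated with a toy dispatcher. This is a sketch under stated assumptions, not the code in internal/transport/grpc/dispatcher.go: the in-memory job slice stands in for the jobs table, the COMPLETED status string and the submit, complete, and runDemo helpers are hypothetical, and priority ordering is reduced to insertion order.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// DispatchGate is the in-memory pause flag; a nil gate reads as open.
type DispatchGate struct{ paused atomic.Bool }

func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

type job struct {
	id     int
	status string
}

// dispatcher keeps jobs in a slice as a stand-in for the jobs table.
type dispatcher struct {
	gate *DispatchGate
	jobs []*job
}

// submit and complete never consult the gate: they behave the same paused or not.
func (d *dispatcher) submit(id int) { d.jobs = append(d.jobs, &job{id: id, status: "PENDING"}) }

func (d *dispatcher) complete(j *job) { j.status = "COMPLETED" }

// dispatchReady claims the first PENDING job, unless dispatch is paused.
func (d *dispatcher) dispatchReady() *job {
	if d.gate.Paused() {
		return nil // the only enforcement point: no claim, so no new RUNNING job
	}
	for _, j := range d.jobs {
		if j.status == "PENDING" {
			j.status = "RUNNING"
			return j
		}
	}
	return nil
}

// runDemo walks one job through a pause and reports what happened.
func runDemo() (blockedWhilePaused bool, firstJobStatus string, resumedID int) {
	d := &dispatcher{gate: &DispatchGate{}}
	d.submit(1)
	running := d.dispatchReady() // job 1 is claimed before the pause

	d.gate.paused.Store(true)
	d.submit(2)                                 // submissions still accumulate
	blockedWhilePaused = d.dispatchReady() == nil // no claim while paused
	d.complete(running)                         // in-flight work still finishes

	d.gate.paused.Store(false)
	resumed := d.dispatchReady() // the backlog drains through the ordinary path
	return blockedWhilePaused, running.status, resumed.id
}

func main() {
	fmt.Println(runDemo()) // prints "true COMPLETED 2"
}
```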

Negative / costs accepted

Pause is not instantaneous everywhere: an instance other than the one that served the RPC may hand out assignments for up to one refresh interval (≤1s). Acceptable for an operator-scale action.
Fail-open at runtime means a partitioned instance keeps its last state: if the DB becomes unreachable after a pause was set elsewhere but before this instance refreshed, it keeps dispatching until connectivity returns. (In practice claims need the same DB, so dispatch is failing anyway.)
The backlog grows unbounded while paused — pause does not shed load, it defers it.
One global switch: no per-topic pause. The dispatch_control design extends to that later (a keyed row per scope in place of the singleton) without changing the mechanism.
