ADR-0010: Pause/resume dispatch
Context
An operator needs to stop the system from handing out new work — a maintenance window, a downstream outage, "I need to touch the DB without jobs starting" — without stopping the servers, dropping worker connections, or losing submissions. Pause the faucet, not the plumbing.
What "paused" must mean, precisely:
- New assignments stop. No job moves to RUNNING while paused.
- Everything else continues. Producers keep submitting (rows accumulate as PENDING), running jobs finish and report Complete/Fail, heartbeats keep leases alive, the watchdog keeps reaping, and the scheduler keeps firing; its spawned jobs simply queue.
- It survives a restart. A pause set before a deploy must still hold when the process comes back.
- It covers every instance. With several pulsed servers sharing the database, one Pause call must stop dispatch everywhere, promptly.
And it must not tax the hot path: each connected worker's stream ticks every 500ms
(dispatchInterval, internal/transport/grpc/dispatcher.go); the pause check runs on
every one of those ticks.
Decision
Durable truth in a singleton row; a per-process in-memory gate on the hot path; a refresh loop reconciling the two.
- The durable switch is one row. db/migrations/0003_CreateDispatchControl.go creates dispatch_control with paused, reason, paused_at, and a CHECK (id = 1) singleton constraint, seeded paused = false so a fresh deployment dispatches. There is exactly one unambiguous answer to "is dispatch paused?", shared by every instance.
- The hot path reads memory, never the DB. DispatchGate (internal/service/pause.go) is an atomic bool. The dispatcher's tick checks it first: dispatchReady returns before claiming when d.gate.Paused() (internal/transport/grpc/dispatcher.go). That single if is the only enforcement point. Pause gates the claim, so no PENDING row can become RUNNING, and every other flow (submit, complete, fail, heartbeat, watchdog, scheduler) is untouched by construction rather than by N scattered checks. A nil gate reads as open, so a dispatcher wired without one behaves normally.
- Boot: prime, fail-safe. Prime reads the row once and sets the gate before the server accepts workers; cmd/pulsed/main.go aborts startup if it fails. At boot there is no known-good value, and serving with a guessed gate could dispatch straight into the outage the operator paused for. Refusing to start is the safe failure.
- Runtime: refresh, fail-open. Run re-reads the row every pauseRefreshInterval (1s, cmd/pulsed/main.go) and stores the result in the gate; on a read error it logs and keeps the last known value (refresh, internal/service/pause.go). At runtime a known-good value exists, so the honest move on a transient DB blip is to keep it. Failing closed here would let any hiccup in the control-plane read halt all dispatch, turning a monitoring nuisance into an outage. The boot/runtime asymmetry is deliberate: fail-safe when you know nothing, fail-open when you know something.
- Pause/Resume write through, then flip locally. Pause writes the row, then sets the local gate immediately, so the instance that served the RPC stops on its next tick without waiting for its own refresh. Other instances pick the change up on their next refresh, so cross-instance convergence is ≤1s. If the DB write fails, the gate is untouched and the error surfaces: the in-memory gate never gets ahead of the durable row.
- Resume clears the reason, keeps the timestamp. SetPaused (internal/repos/postgres/dispatch_control.go) overwrites reason (to '' on resume) but stamps paused_at only when pausing (CASE WHEN @paused THEN now() ELSE paused_at END), so Status can always answer "when was this last paused?" while never showing a stale reason for a running system.
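The boot, refresh, and write-through rules above can be sketched as follows. This is a minimal illustration, not the code in internal/service/pause.go: the ControlStore interface, the PauseService struct, the memStore fake, and the scenario helper are all hypothetical names, and atomic.Bool assumes Go 1.19 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// ControlStore is a hypothetical stand-in for the dispatch_control repository.
type ControlStore interface {
	Paused(ctx context.Context) (bool, error)
	SetPaused(ctx context.Context, paused bool, reason string) error
}

// DispatchGate is the per-process switch the hot path reads.
type DispatchGate struct{ paused atomic.Bool }

// Paused reports the gate state; a nil gate reads as open.
func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

// PauseService (hypothetical name) reconciles the durable row with the gate.
type PauseService struct {
	store ControlStore
	gate  *DispatchGate
}

// Prime is the boot path: a read error is returned so the caller can abort startup.
func (s *PauseService) Prime(ctx context.Context) error {
	paused, err := s.store.Paused(ctx)
	if err != nil {
		return fmt.Errorf("prime dispatch gate: %w", err)
	}
	s.gate.paused.Store(paused)
	return nil
}

// refresh is the runtime path: on a read error the last known value is kept.
func (s *PauseService) refresh(ctx context.Context) {
	paused, err := s.store.Paused(ctx)
	if err != nil {
		return // real code would log here; the gate is left untouched
	}
	s.gate.paused.Store(paused)
}

// Pause writes the row first and flips the local gate only if the write succeeded.
func (s *PauseService) Pause(ctx context.Context, reason string) error {
	if err := s.store.SetPaused(ctx, true, reason); err != nil {
		return err
	}
	s.gate.paused.Store(true)
	return nil
}

// memStore is an in-memory fake used only for this demonstration.
type memStore struct {
	paused bool
	fail   bool
}

func (m *memStore) Paused(context.Context) (bool, error) {
	if m.fail {
		return false, errors.New("db unreachable")
	}
	return m.paused, nil
}

func (m *memStore) SetPaused(_ context.Context, paused bool, _ string) error {
	if m.fail {
		return errors.New("db unreachable")
	}
	m.paused = paused
	return nil
}

// scenario returns the gate state after each step of a boot/pause/blip sequence.
func scenario() (primed, afterFailedPause, afterPause, afterBlip bool) {
	ctx := context.Background()
	store := &memStore{}
	svc := &PauseService{store: store, gate: &DispatchGate{}}

	_ = svc.Prime(ctx) // the seeded row says "not paused"
	primed = svc.gate.Paused()

	store.fail = true
	_ = svc.Pause(ctx, "maintenance") // write fails, so the gate must not move
	afterFailedPause = svc.gate.Paused()

	store.fail = false
	_ = svc.Pause(ctx, "maintenance") // write succeeds, gate flips at once
	afterPause = svc.gate.Paused()

	store.fail = true
	svc.refresh(ctx) // transient read error: keep the last known value
	afterBlip = svc.gate.Paused()
	return
}

func main() {
	a, b, c, d := scenario()
	fmt.Println(a, b, c, d) // prints: false false true true
}
```

The only difference between Prime and refresh in this sketch is what happens on a read error, which is the boot/runtime asymmetry stated above.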
Alternatives considered
In-memory flag only
A bool on the server, flipped by the RPC. Rejected: fails both durability requirements at once — a restart silently resumes dispatch mid-maintenance, and with multiple instances the operator must find and pause each one. Pause is operational state, and state that must outlive a process belongs in the database.
Pause as a job status (PENDING → PAUSED rows)
Mark the queued jobs themselves paused. Rejected: conflates a global control-plane switch with per-job state. Pausing becomes a mass UPDATE over the whole backlog, resuming another; jobs submitted while paused need intercepting at insert; and the claim query gains a new status to reason about. One singleton row does the same with one write, and the jobs table stays a description of jobs, not of operator intent.
LISTEN/NOTIFY for instant propagation
Postgres pub/sub so instances react to Pause in milliseconds instead of ≤1s. Deferred: it's an optimization layered on the same durable row, not a replacement — notifications are lossy (a disconnected listener misses them), so the poll must exist anyway as the reconciler. Adding a second propagation path plus reconnect handling to shave under a second off a human-scale operation isn't worth it yet.
Read the row on every dispatch tick
Skip the gate; check dispatch_control inside dispatchReady.
Rejected: this puts a control-plane query on the data-plane hot path. Every connected
worker, twice a second, forever, pays a DB round-trip to almost always learn
"not paused". The gate gives a comparable freshness bound (≤1s, against ≤500ms for a
per-tick read) for one query per second per instance, independent of worker count,
plus an atomic load per tick.
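To make the load difference concrete, here is a small worked calculation. The two intervals come from the text above; the fleet size (400 workers across 3 instances) is a made-up figure for illustration only, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	dispatchInterval     = 500 * time.Millisecond // per-worker stream tick
	pauseRefreshInterval = time.Second            // per-instance gate refresh
)

// perTickQueriesPerSecond is the control-row read rate if every worker
// stream queried dispatch_control on each of its ticks.
func perTickQueriesPerSecond(workers int) float64 {
	return float64(workers) * float64(time.Second) / float64(dispatchInterval)
}

// gateQueriesPerSecond is the read rate with one refresh loop per instance.
func gateQueriesPerSecond(instances int) float64 {
	return float64(instances) * float64(time.Second) / float64(pauseRefreshInterval)
}

func main() {
	workers, instances := 400, 3 // hypothetical fleet size
	fmt.Printf("read on every tick: %.0f queries/s\n", perTickQueriesPerSecond(workers))
	fmt.Printf("gate refresh loop:  %.0f queries/s\n", gateQueriesPerSecond(instances))
}
```

Under these assumed numbers the per-tick design issues 800 reads per second against 3 for the gate, and only the first figure grows with the number of connected workers.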
Consequences
Positive
- Pause survives restarts and applies to every instance within ~1s of the RPC.
- Negligible hot-path cost: one atomic bool load per dispatch tick, with no DB round-trip.
- One enforcement point means the invariant is auditable at a glance: paused ⇒ no claim ⇒ no new RUNNING rows; everything downstream of a claim already made proceeds normally to completion.
- Submits, retries, scheduler fires, and heartbeats behave identically paused or not; resume drains the accumulated backlog through the ordinary claim path, priority order intact.
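The single-enforcement-point invariant can be checked with a toy model. The sketch below is an assumption-laden illustration, not the dispatcher in internal/transport/grpc/dispatcher.go: the dispatcher struct, its in-memory pending/running slices, and the simulate helper are hypothetical, and the claim is modeled as plain FIFO rather than the real priority-ordered claim query.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// DispatchGate mirrors the atomic-bool gate; a nil gate reads as open.
type DispatchGate struct{ paused atomic.Bool }

func (g *DispatchGate) Paused() bool { return g != nil && g.paused.Load() }

// dispatcher is a hypothetical in-memory stand-in for one worker stream.
type dispatcher struct {
	gate    *DispatchGate
	pending []string
	running []string
}

// submit never consults the gate: producers keep submitting while paused.
func (d *dispatcher) submit(job string) { d.pending = append(d.pending, job) }

// dispatchReady runs once per tick; the gate check is the only pause logic.
func (d *dispatcher) dispatchReady() {
	if d.gate.Paused() {
		return // paused: no claim, so nothing can become RUNNING
	}
	if len(d.pending) == 0 {
		return
	}
	d.running = append(d.running, d.pending[0]) // the "claim"
	d.pending = d.pending[1:]
}

// simulate pauses, submits two jobs, ticks, then resumes and ticks again.
func simulate() (runningWhilePaused, pendingWhilePaused int, afterResume []string) {
	gate := &DispatchGate{}
	d := &dispatcher{gate: gate}

	gate.paused.Store(true)
	d.submit("a")
	d.submit("b")
	for i := 0; i < 3; i++ {
		d.dispatchReady()
	}
	runningWhilePaused, pendingWhilePaused = len(d.running), len(d.pending)

	gate.paused.Store(false)
	for i := 0; i < 3; i++ {
		d.dispatchReady()
	}
	return runningWhilePaused, pendingWhilePaused, d.running
}

func main() {
	r, p, after := simulate()
	fmt.Println(r, p, after) // prints: 0 2 [a b]

	// A dispatcher wired without a gate behaves normally.
	open := &dispatcher{}
	open.submit("c")
	open.dispatchReady()
	fmt.Println(open.running) // prints: [c]
}
```

Because submit and the post-claim paths never reference the gate, the model has nothing else to audit: the backlog accumulates while paused and drains through the same dispatchReady path on resume.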
Negative / costs accepted
- Pause is not instantaneous everywhere: an instance other than the one that served the RPC may hand out assignments for up to one refresh interval (≤1s). Acceptable for an operator-scale action.
- Fail-open at runtime means a partitioned instance keeps its last state: if the DB becomes unreachable after a pause was set elsewhere but before this instance refreshed, it keeps dispatching until connectivity returns. (In practice claims need the same DB, so dispatch is failing anyway.)
- The backlog grows unbounded while paused — pause does not shed load, it defers it.
- One global switch: no per-topic pause. The dispatch_control shape (a keyed row per scope) extends to that later without changing the mechanism.