ADR-0007: Leases, heartbeats, the watchdog, and fencing
Context
A worker can die holding a job: process crash, OOM kill, network partition. Without
recovery, that job sits RUNNING forever — the silent job loss that makes a queue
untrustworthy. The server cannot observe worker health directly (handlers run in the
developer's process, ADR-0002), so liveness has to be something the worker asserts and
the server expires.
Two hazards shape the design:
- The orphan window. If "job is claimed" and "worker is alive" are recorded separately, a crash between the two writes leaves a running job that no liveness record covers.
- The zombie. Recovery by timeout can never distinguish dead from unreachable. A partitioned worker may still be executing the job the server just gave to someone else — so the old claimant must be fenced out, and re-execution must be a stated contract, not a surprise.
Decision
- The lease lives on the job row — there is no liveness table.
lease_untilandlocked_byare columns ofjobs(db/migrations/0001_CreateJobs.go), andClaimBatch(internal/repos/postgres/jobs.go) sets both in the same UPDATE that setsstatus = 'RUNNING'. The claim is the lease grant: aRUNNINGrow without a lease cannot exist, so the orphan window is closed by atomicity, not by careful write ordering. - Heartbeats extend the lease; the SDK beats every ~10s against a 30s TTL.
The SDK runs a ticker per in-flight job (
heartbeatInterval = 10s,pulse.go); each beat is a unaryHeartbeatRPC that extendslease_untilbylivenessTTL = 30s(cmd/pulsed/main.go). Three missed beats before expiry means one dropped packet or a GC pause never triggers a false recovery. - The
locked_bypredicate is the fence.HeartbeatupdatesWHERE id = @id AND status = 'RUNNING' AND locked_by = @worker. After the watchdog reaps a lease and a new worker re-claims the job, the old claimant's beats match zero rows — a silent no-op, not an error. A zombie can keep executing its stale attempt, but it can never re-extend a claim it no longer owns, so it can never push the job back out of the new owner's hands. - The watchdog is one set-based UPDATE through the normal retry path.
WatchdogService(internal/service/watchdog.go) ticks every 10s and callsReapExpired: a single statement over the partialidx_jobs_expiredindex,WHERE status = 'RUNNING' AND lease_until < now(), applying the exact retry-or-dead-letter CASE branch from ADR-0006 withlast_error = 'worker lease expired'. Recovery is not a subsystem; it is a failure report the worker didn't live to make. Set-based means N stuck jobs cost one statement, and multiplepulsedreplicas sweeping concurrently are harmless — a row is only reaped once because the first UPDATE moves it out ofRUNNING. - The delivery contract is at-least-once; handlers must be idempotent. Timeout
recovery can re-run a partitioned worker's still-executing job — this is inherent, not a
bug, so it is stated as the contract: every
JobAssignmentcarriesjob_id+attempt(internal/transport/grpc/dispatcher.go,assignmentFrom) as the dedup key, and the fence guarantees the zombie's claim dies even when its process doesn't. The stale attempt's lateReportResulthits theWHERE status = 'RUNNING'guard and fails withFailedPreconditionif the job has since moved on.
Alternatives considered
A separate liveness table (the previous design)
A liveness row per running job, written after the claim. That is an extra table, an extra
write per claim, and — the real cost — a best-effort ordering problem: crash between claim
and liveness write and you need a second "orphan grace" recovery path for running jobs with
no liveness record. Putting the lease on the job row deletes the table, the second write,
and the second recovery path in one move. Rejected: the orphan window was a
self-inflicted wound of the two-table split.
gRPC stream liveness (dead stream = dead worker)
The StreamJobs connection already signals presence — why not use its death as the
recovery trigger? Because it has the wrong failure semantics in both directions: a stream
can drop while the worker (and its handlers) are perfectly healthy mid-job, causing false
recovery; and per-job liveness is what's needed, not per-worker — one wedged handler
shouldn't be masked by an otherwise-chatty connection. Leases make liveness explicit and
per-job.
Per-job duration timeouts (no heartbeats)
"This job may run 5 minutes" — expire it after that. Simpler wire surface, but it forces every producer to guess worst-case runtime; guess low and healthy long jobs get killed, guess high and real crashes take the whole timeout to recover. Heartbeats decouple recovery latency (≈TTL, 30s) from job duration entirely.
External lease service (etcd / Redis TTL keys)
Purpose-built lease machinery with sub-second expiry. Rejected on ADR-0002 grounds: a second infrastructure dependency, and leases in a different store than the rows they guard reintroduces cross-store ordering — the exact class of problem decision 1 exists to eliminate. Postgres timestamps are slower to expire but transactional with the claim.
Per-claim fencing: the attempt token
The locked_by fence distinguishes workers but not two claims by the same worker — the
self-zombie: worker W's claim is reaped (lease lapsed under a stuck handler), the job is
re-claimed by W again, and the still-running old goroutine's heartbeats and report would
match locked_by = W. The fix is a second fence on the attempt number: attempts is
incremented inside every claim, making it a monotonic, server-minted, per-claim token that
already rides every JobAssignment. Heartbeat, Complete, and Fail carry it back and
their guards add AND attempts = @attempt (0 skips the fence for legacy callers) — a stale
heartbeat is a silent no-op, a stale report is rejected with ErrInvalidTransition
(TestJob_AttemptFencesSelfZombie). The fence bounds state corruption, not side effects:
the zombie's handler already ran, which is why at-least-once + idempotent handlers remains
the delivery contract.
Consequences
- No silent job loss. Every
RUNNINGrow has a lease from birth; worst-case recovery latency is TTL + one watchdog tick (~40s), independent of what killed the worker. - One recovery mechanism. Claimed-but-unsent (ADR-0005), worker crash, and partition all resolve through the same lease expiry into the same retry path (ADR-0006).
- At-least-once is load-bearing. Handler idempotency is a hard requirement on SDK
users, documented at the boundary;
job_id+attemptis the key they dedup on. - Liveness costs one unary RPC per job per 10s — per running job, not per worker; an idle fleet heartbeats nothing.
- The TTL is a fleet-wide constant (30s in
cmd/pulsed/main.go). Per-topic or per-job lease tuning is deferred until a workload demonstrates the need.