ADR-0007: Leases, heartbeats, the watchdog, and fencing

Context

A worker can die holding a job: process crash, OOM kill, network partition. Without recovery, that job sits RUNNING forever — the silent job loss that makes a queue untrustworthy. The server cannot observe worker health directly (handlers run in the developer's process, ADR-0002), so liveness has to be something the worker asserts and the server expires.

Two hazards shape the design:

The orphan window. If "job is claimed" and "worker is alive" are recorded separately, a crash between the two writes leaves a running job that no liveness record covers.
The zombie. Recovery by timeout can never distinguish dead from unreachable. A partitioned worker may still be executing the job the server just gave to someone else — so the old claimant must be fenced out, and re-execution must be a stated contract, not a surprise.

Decision

The lease lives on the job row — there is no liveness table. lease_until and locked_by are columns of jobs (db/migrations/0001_CreateJobs.go), and ClaimBatch (internal/repos/postgres/jobs.go) sets both in the same UPDATE that sets status = 'RUNNING'. The claim is the lease grant: a RUNNING row without a lease cannot exist, so the orphan window is closed by atomicity, not by careful write ordering.
Heartbeats extend the lease; the SDK beats every ~10s against a 30s TTL. The SDK runs a ticker per in-flight job (heartbeatInterval = 10s, pulse.go); each beat is a unary Heartbeat RPC that extends lease_until by livenessTTL = 30s (cmd/pulsed/main.go). Three missed beats before expiry means one dropped packet or a GC pause never triggers a false recovery.
The locked_by predicate is the fence. Heartbeat updates WHERE id = @id AND status = 'RUNNING' AND locked_by = @worker. After the watchdog reaps a lease and a new worker re-claims the job, the old claimant's beats match zero rows — a silent no-op, not an error. A zombie can keep executing its stale attempt, but it can never re-extend a claim it no longer owns, so it can never push the job back out of the new owner's hands.
The watchdog is one set-based UPDATE through the normal retry path. WatchdogService (internal/service/watchdog.go) ticks every 10s and calls ReapExpired: a single statement over the partial idx_jobs_expired index, WHERE status = 'RUNNING' AND lease_until < now(), applying the exact retry-or-dead-letter CASE branch from ADR-0006 with last_error = 'worker lease expired'. Recovery is not a subsystem; it is a failure report the worker didn't live to make. Set-based means N stuck jobs cost one statement, and multiple pulsed replicas sweeping concurrently are harmless — a row is only reaped once because the first UPDATE moves it out of RUNNING.
The delivery contract is at-least-once; handlers must be idempotent. Timeout recovery can re-run a partitioned worker's still-executing job — this is inherent, not a bug, so it is stated as the contract: every JobAssignment carries job_id + attempt (internal/transport/grpc/dispatcher.go, assignmentFrom) as the dedup key, and the fence guarantees the zombie's claim dies even when its process doesn't. The stale attempt's late ReportResult hits the WHERE status = 'RUNNING' guard and fails with FailedPrecondition if the job has since moved on.

Alternatives considered

A separate liveness table (the previous design)

A liveness row per running job, written after the claim. That is an extra table, an extra write per claim, and — the real cost — a best-effort ordering problem: crash between claim and liveness write and you need a second "orphan grace" recovery path for running jobs with no liveness record. Putting the lease on the job row deletes the table, the second write, and the second recovery path in one move. Rejected: the orphan window was a self-inflicted wound of the two-table split.

gRPC stream liveness (dead stream = dead worker)

The StreamJobs connection already signals presence — why not use its death as the recovery trigger? Because it has the wrong failure semantics in both directions: a stream can drop while the worker (and its handlers) are perfectly healthy mid-job, causing false recovery; and per-job liveness is what's needed, not per-worker — one wedged handler shouldn't be masked by an otherwise-chatty connection. Leases make liveness explicit and per-job.

Per-job duration timeouts (no heartbeats)

"This job may run 5 minutes" — expire it after that. Simpler wire surface, but it forces every producer to guess worst-case runtime; guess low and healthy long jobs get killed, guess high and real crashes take the whole timeout to recover. Heartbeats decouple recovery latency (≈TTL, 30s) from job duration entirely.

External lease service (etcd / Redis TTL keys)

Purpose-built lease machinery with sub-second expiry. Rejected on ADR-0002 grounds: a second infrastructure dependency, and leases in a different store than the rows they guard reintroduces cross-store ordering — the exact class of problem decision 1 exists to eliminate. Postgres timestamps are slower to expire but transactional with the claim.

Per-claim fencing: the attempt token

The locked_by fence distinguishes workers but not two claims by the same worker — the self-zombie: worker W's claim is reaped (lease lapsed under a stuck handler), the job is re-claimed by W again, and the still-running old goroutine's heartbeats and report would match locked_by = W. The fix is a second fence on the attempt number: attempts is incremented inside every claim, making it a monotonic, server-minted, per-claim token that already rides every JobAssignment. Heartbeat, Complete, and Fail carry it back and their guards add AND attempts = @attempt (0 skips the fence for legacy callers) — a stale heartbeat is a silent no-op, a stale report is rejected with ErrInvalidTransition (TestJob_AttemptFencesSelfZombie). The fence bounds state corruption, not side effects: the zombie's handler already ran, which is why at-least-once + idempotent handlers remains the delivery contract.

Consequences

No silent job loss. Every RUNNING row has a lease from birth; worst-case recovery latency is TTL + one watchdog tick (~40s), independent of what killed the worker.
One recovery mechanism. Claimed-but-unsent (ADR-0005), worker crash, and partition all resolve through the same lease expiry into the same retry path (ADR-0006).
At-least-once is load-bearing. Handler idempotency is a hard requirement on SDK users, documented at the boundary; job_id + attempt is the key they dedup on.
Liveness costs one unary RPC per job per 10s — per running job, not per worker; an idle fleet heartbeats nothing.
The TTL is a fleet-wide constant (30s in cmd/pulsed/main.go). Per-topic or per-job lease tuning is deferred until a workload demonstrates the need.

ADR-0007: Leases, heartbeats, the watchdog, and fencing

On this page