Crash recovery — a worker dies mid-job
What happens when a worker is killed with SIGKILL while running a job: no goodbye, no
error, no connection teardown the server can trust. Recovery relies on one fact — the
worker stops proving it is alive — and one mechanism: the lease on the job row.
The run
Server: DB_HOST=… go run ./cmd/pulsed. Worker: go run ./examples — among its demo jobs
is slow-job, whose handler deliberately runs 45s (longer than the 30s lease TTL), so
completion depends on heartbeats. Constants: lease TTL 30s, SDK heartbeat ~10s, watchdog
tick 10s.
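The claim that completion depends on heartbeats comes down to simple timing, which a small Go model can make concrete. This is a sketch under stated assumptions, not pulsed or SDK code: `leaseSurvives` is a hypothetical helper, all times are whole seconds since the claim, and a heartbeat is assumed to move the deadline to "beat time + TTL".

```go
package main

import "fmt"

// leaseSurvives reports whether a handler finishes while its lease is still
// valid. Hypothetical timing model: the lease deadline starts at ttlSecs
// (set at claim time, t = 0) and every heartbeat that arrives before the
// deadline moves it to (beat time + ttlSecs). beatSecs = 0 models a worker
// that never heartbeats.
func leaseSurvives(handlerSecs, ttlSecs, beatSecs int) bool {
	leaseUntil := ttlSecs
	if beatSecs > 0 {
		for t := beatSecs; t < handlerSecs; t += beatSecs {
			if t > leaseUntil {
				return false // lease lapsed before this beat could renew it
			}
			leaseUntil = t + ttlSecs
		}
	}
	return handlerSecs <= leaseUntil
}

func main() {
	// Constants from the run: 45s handler, 30s lease TTL, ~10s heartbeat.
	fmt.Println(leaseSurvives(45, 30, 10)) // true: beats at 10..40 push the deadline to 70
	fmt.Println(leaseSurvives(45, 30, 0))  // false: the lease ends at 30, the handler at 45
}
```

With heartbeats the 45s handler finishes comfortably inside the renewed lease; without them the lease lapses at 30s, which is the situation the kill in the timeline creates.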
T+0 worker submits the job; its dispatcher claims it
T+14 SELECT status, attempts, locked_by, lease_until > now() FROM jobs
→ RUNNING | 1 | 019f2e6e-… | lease alive (heartbeats extending the lease)
T+15 kill -9 <worker pid> (SIGKILL: the process vanishes)
T+18 → RUNNING | 1 | 019f2e6e-… | lease alive (the row doesn't know yet — the lease
is a deadline, not a connection)
…no heartbeats arrive; the lease expires ~30s after the last beat…
T+68 → RETRYING | 1 | (unlocked) | worker lease expired
(the watchdog's sweep reaped it:
UPDATE … SET status='RETRYING', last_error='worker lease expired',
next_run_at = now() + attempts² s
WHERE status='RUNNING' AND lease_until < now())
T+70 a fresh worker starts (same command)
→ its next claim picks the job up (attempts := 2), runs the full 45s handler
T+130 → COMPLETED | 2 | worker lease expired | completed_at set
The recovered job finishes with attempts = 2 and keeps last_error = 'worker lease expired' as the record of what happened to attempt 1.
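The numbers in the timeline can be checked with a few lines of arithmetic. The Go sketch below is illustrative only: the helper names are hypothetical, the last heartbeat is assumed to have landed at T+10 (the final ~10s beat before the kill at T+15), the cap of 3 attempts is a made-up value, and comparing `attempts >= maxAttempts` is an assumption about how the attempt cap is applied.

```go
package main

import "fmt"

// backoffSecs mirrors the sweep's "next_run_at = now() + attempts² s".
func backoffSecs(attempts int) int {
	return attempts * attempts
}

// latestReapSecs is the latest time the watchdog reaps the row: the lease
// expires ttl seconds after the last heartbeat, and the sweep runs at most
// one tick after that.
func latestReapSecs(lastBeat, ttl, tick int) int {
	return lastBeat + ttl + tick
}

// reapStatus picks the status the sweep leaves behind.
// ASSUMPTION: the cap is reached when attempts >= maxAttempts.
func reapStatus(attempts, maxAttempts int) string {
	if attempts >= maxAttempts {
		return "DEAD_LETTERED"
	}
	return "RETRYING"
}

func main() {
	const lastBeat, ttl, tick = 10, 30, 10 // lastBeat is assumed; ttl and tick are from the run
	const attempts, maxAttempts = 1, 3     // maxAttempts = 3 is a hypothetical cap

	reap := latestReapSecs(lastBeat, ttl, tick)
	fmt.Printf("latest reap: T+%d\n", reap)                               // latest reap: T+50
	fmt.Printf("status: %s\n", reapStatus(attempts, maxAttempts))         // status: RETRYING
	fmt.Printf("retry eligible by: T+%d\n", reap+backoffSecs(attempts)) // retry eligible by: T+51
}
```

Under these assumptions the row is reaped by T+50 and eligible again one second later, which is consistent with the RETRYING row observed at T+68 and with the fresh worker claiming it shortly after T+70.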
What this demonstrates
- No crash detection exists, only absence of proof. The server never learns the worker
died; it observes that the lease deadline passed without renewal. Time to reap is
bounded by lease TTL + watchdog tick (~40s worst case after the last heartbeat),
regardless of how the worker died.
- Recovery is the ordinary failure path. The reap routes through the same
RETRYING/DEAD_LETTERED branch as a handler error: backoff applies, the attempt cap
applies. A job whose workers crash max_attempts times dead-letters with the reason
preserved, so crash loops cannot retry forever.
- The claim is the lease. lease_until is set in the same UPDATE that claims the job, so
a RUNNING row without a lease cannot exist; there is no orphan window to patch over.
- If the "dead" worker was actually partitioned and returns, its heartbeats and report
carry the stale attempt token (1) and are rejected by the fences: it cannot extend the
lease or complete attempt 2 (TestJob_AttemptFencesSelfZombie). Its handler's side
effects stand, which is why delivery is at-least-once and handlers must be idempotent.
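The attempt-token fence from the last point can be sketched in memory. This is a minimal illustration, not the server's implementation: the `Job` type, its fields, and the `Heartbeat` and `Complete` methods are hypothetical, time is a plain integer clock, and the real fence presumably lives in the WHERE clause of the corresponding UPDATE statements.

```go
package main

import "fmt"

// Job holds only the columns the fence needs (illustrative field set).
type Job struct {
	Status     string
	Attempts   int // current attempt number, used here as the fencing token
	LeaseUntil int // seconds on a logical clock
}

// Heartbeat extends the lease only when the caller's attempt token matches
// the attempt that currently owns the row.
func (j *Job) Heartbeat(attempt, now, ttl int) bool {
	if j.Status != "RUNNING" || j.Attempts != attempt {
		return false // stale token: this attempt was already reaped
	}
	j.LeaseUntil = now + ttl
	return true
}

// Complete marks the job done under the same fence.
func (j *Job) Complete(attempt int) bool {
	if j.Status != "RUNNING" || j.Attempts != attempt {
		return false
	}
	j.Status = "COMPLETED"
	return true
}

func main() {
	// State after the fresh worker has claimed the job as attempt 2.
	job := &Job{Status: "RUNNING", Attempts: 2, LeaseUntil: 100}

	fmt.Println(job.Heartbeat(1, 75, 30)) // false: the zombie still holds token 1
	fmt.Println(job.Complete(1))          // false: it cannot complete attempt 2
	fmt.Println(job.Heartbeat(2, 80, 30)) // true: the current attempt renews to 110
	fmt.Println(job.Complete(2))          // true
}
```

The fence protects the row, not the outside world: whatever the zombie's handler already did before its report was rejected still happened, which is why idempotent handlers remain a requirement.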
Reproduce
psql $PG -c "TRUNCATE jobs"
DB_HOST=$PG go run ./cmd/pulsed & # terminal 1
go run ./examples & # terminal 2 — submits jobs incl. a 45s slow-job
sleep 14 && kill -9 $(pgrep -f exe/main) # kill the worker mid-run
watch -n 5 'psql $PG -c "SELECT status, attempts, locked_by, last_error FROM jobs"'
go run ./examples # after the reap: recovery worker