Crash recovery — a worker dies mid-job

This walkthrough shows what happens when a worker is killed with SIGKILL while running a job: there is no goodbye, no error, and no connection teardown the server can trust. Recovery relies on one fact (the worker stops proving it is alive) and one mechanism: the lease on the job row.

The run

Server: DB_HOST=… go run ./cmd/pulsed. Worker: go run ./examples — among its demo jobs is slow-job, whose handler deliberately runs 45s (longer than the 30s lease TTL), so completion depends on heartbeats. Constants: lease TTL 30s, SDK heartbeat ~10s, watchdog tick 10s.

T+0    worker submits the job; its dispatcher claims it
T+14   SELECT status, attempts, locked_by, lease_until > now() FROM jobs
       → RUNNING | 1 | 019f2e6e-… | lease alive     (heartbeats extending the lease)

T+15   kill -9 <worker pid>                          (SIGKILL: the process vanishes)
T+18   → RUNNING | 1 | 019f2e6e-… | lease alive     (the row doesn't know yet — the lease
                                                      is a deadline, not a connection)

       …no heartbeats arrive; the lease expires ~30s after the last beat…

T+68   → RETRYING | 1 | (unlocked) | last_error = 'worker lease expired'
       (the watchdog's sweep reaped it:
        UPDATE … SET status='RETRYING', last_error='worker lease expired',
                     next_run_at = now() + attempts² s
        WHERE status='RUNNING' AND lease_until < now())

T+70   a fresh worker starts (same command)
       → its next claim picks the job up (attempts := 2), runs the full 45s handler

T+130  → COMPLETED | 2 | last_error = 'worker lease expired' | completed_at set

The recovered job finishes with attempts = 2 and keeps last_error = 'worker lease expired' as the record of what happened to attempt 1.

What this demonstrates

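No crash detection exists, only absence of proof. The server never learns the worker died; it observes that the lease deadline passed without renewal. The time from the worker's death to the reap is bounded by lease TTL + watchdog tick (~40s worst case with the constants above), regardless of how the worker died. The retry itself then waits for the backoff and for a worker to claim the job.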
Recovery is the ordinary failure path. The reap routes through the same RETRYING/DEAD_LETTERED branch as a handler error: backoff applies, the attempt cap applies. A job whose workers crash max_attempts times dead-letters with the reason preserved — crash loops cannot retry forever.
The claim is the lease. lease_until is set in the same UPDATE that claims the job, so a RUNNING row without a lease cannot exist; there is no orphan window to patch over.
If the "dead" worker was actually partitioned and returns, its heartbeats and report carry the stale attempt token (1) and are rejected by the fences — it cannot extend the lease or complete attempt 2 (TestJob_AttemptFencesSelfZombie). Its handler's side effects stand, which is why delivery is at-least-once and handlers must be idempotent.

Reproduce

psql $PG -c "TRUNCATE jobs"
DB_HOST=$PG go run ./cmd/pulsed &            # terminal 1
go run ./examples &                          # terminal 2 — submits jobs incl. a 45s slow-job
sleep 14 && kill -9 $(pgrep -f exe/examples) # kill the worker mid-run (go run names the binary after the package)
watch -n 5 'psql $PG -c "SELECT status, attempts, locked_by, last_error FROM jobs"'
go run ./examples                            # after the reap: recovery worker
