pulse

Crash recovery — a worker dies mid-job

What happens when a worker is killed with SIGKILL while running a job: no goodbye, no error, no connection teardown the server can trust. Recovery relies on one fact — the worker stops proving it is alive — and one mechanism: the lease on the job row.

The run

  • Server: DB_HOST=… go run ./cmd/pulsed
  • Worker: go run ./examples. Among its demo jobs is slow-job, whose handler deliberately runs for 45s (longer than the 30s lease TTL), so completion depends on heartbeats.
  • Constants: lease TTL 30s, SDK heartbeat ~10s, watchdog tick 10s.

T+0    worker submits the job; its dispatcher claims it
T+14   SELECT status, attempts, locked_by, lease_until > now() FROM jobs
       → RUNNING | 1 | 019f2e6e-… | lease alive     (heartbeats extending the lease)

T+15   kill -9 <worker pid>                          (SIGKILL: the process vanishes)
T+18   → RUNNING | 1 | 019f2e6e-… | lease alive     (the row doesn't know yet — the lease
                                                      is a deadline, not a connection)

       …no heartbeats arrive; the lease expires ~30s after the last beat…

T+68   → RETRYING | 1 | (unlocked) | last_error = 'worker lease expired'
       (the watchdog's sweep reaped it:
        UPDATE … SET status='RETRYING', last_error='worker lease expired',
                     next_run_at = now() + attempts² s
        WHERE status='RUNNING' AND lease_until < now())

T+70   a fresh worker starts (same command)
       → its next claim picks the job up (attempts := 2), runs the full 45s handler

T+130  → COMPLETED | 2 | last_error = 'worker lease expired' | completed_at set

The recovered job finishes with attempts = 2 and keeps last_error = 'worker lease expired' as the record of what happened to attempt 1.

What this demonstrates

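  • No crash detection exists, only absence of proof. The server never learns the worker died; it observes that the lease deadline passed without renewal. The time until the reap is bounded by lease TTL + watchdog tick (30s + 10s, so ~40s worst case, counted from the last successful heartbeat), regardless of how the worker died. The retry backoff is added on top of that before the job can be claimed again.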
  • Recovery is the ordinary failure path. The reap routes through the same RETRYING/DEAD_LETTERED branch as a handler error: backoff applies, the attempt cap applies. A job whose workers crash max_attempts times dead-letters with the reason preserved — crash loops cannot retry forever.
  • The claim is the lease. lease_until is set in the same UPDATE that claims the job, so a RUNNING row without a lease cannot exist; there is no orphan window to patch over.
  • If the "dead" worker was actually partitioned and returns, its heartbeats and report carry the stale attempt token (1) and are rejected by the fences — it cannot extend the lease or complete attempt 2 (TestJob_AttemptFencesSelfZombie). Its handler's side effects stand, which is why delivery is at-least-once and handlers must be idempotent.

Reproduce

psql $PG -c "TRUNCATE jobs"
DB_HOST=$PG go run ./cmd/pulsed &            # terminal 1
go run ./examples &                          # terminal 2 — submits jobs incl. a 45s slow-job
sleep 14 && kill -9 $(pgrep -f exe/main)     # kill the worker mid-run
watch -n 5 'psql $PG -c "SELECT status, attempts, locked_by, last_error FROM jobs"'
go run ./examples                            # after the reap: recovery worker
