pulse

pulse — Architecture

How the pieces fit. The README gives the pitch; this gives the map. Decisions and tradeoffs live in docs/adr/.


1. The one-paragraph model

pulse is a Postgres-backed job queue. A job is one row in the jobs table — its state, retry policy, and worker lease together. Workers connect over gRPC and claim disjoint batches of jobs (FOR UPDATE SKIP LOCKED), run handlers in their own process, and report results back. Every state transition is a guarded UPDATE whose WHERE clause encodes the invariant; the database is the only arbiter, so any number of workers and any number of pulsed replicas are safe by construction. (ADR-0002)

   YOUR APP — the SDK                              PULSE SERVER — pulsed
   ┌──────────────────────┐                       ┌─────────────────────────┐
   │ Enqueue(name, args)  │ ──────  SubmitJob ──► │ INSERT job (PENDING)    │
   │ Register(name, fn)   │ ──────  StreamJobs ─► │ claim batch → send      │
   │ Run()                │ ◄─────  assignment ── │  (SKIP LOCKED + lease)  │
   │   fn() runs LOCALLY  │                       │                         │
   │   ReportResult       │ ───  ReportResult ──► │ complete / retry / DLQ  │
   │   Heartbeat (~10s)   │ ──────  Heartbeat ──► │ extend lease (fenced)   │
   └──────────────────────┘                       └─────────────────────────┘
                                background loops inside pulsed:
                                  watchdog   → reap lapsed leases → retry path
                                  scheduler  → fire due schedules → insert jobs
                                  pause loop → converge dispatch gate

2. The data model — three tables

TablePurpose
jobsone row per job: status (PENDING → RUNNING → COMPLETED / RETRYING / DEAD_LETTERED / CANCELED), priority, attempts/max_attempts, last_error, the lease (locked_by, lease_until), the backoff gate (next_run_at), lineage (schedule_id)
schedulesmutable schedule definitions (At/Every/Cron), next_run_at cursor (ADR-0008)
dispatch_controlsingleton pause switch: paused, reason, paused_at (ADR-0010)

Two partial indexes carry the hot paths: idx_jobs_claim (topic, priority DESC, submitted_at ASC) WHERE status IN ('PENDING','RETRYING') for claims, and idx_jobs_expired (lease_until) WHERE status = 'RUNNING' for the watchdog.

3. Package layout (ports and adapters)

pulse.go, job.go, schedule.go, dispatch.go   SDK root — the only package users import
cli/                                          operator CLI — its own Go module, so SDK
                                              consumers never inherit CLI dependencies
ui/                                           web dashboard (pulseui) — its own Go module;
                                              server-rendered HTML + htmx, reads and acts
                                              exclusively through the SDK, never the DB
cmd/pulsed/                                   composition root: wiring, consts, serve, shutdown
cmd/migrate/                                  goose migration runner
cmd/loadgen/                                  benchmark load generator
proto/pulse/v1/pulse.proto → gen/pulsev1/     the wire contract → generated stubs
internal/domain/                              Job, statuses, Backoff, Schedule, invariant errors
internal/repos/                               ports: JobRepo, ScheduleRepo, DispatchControlRepo (+ mocks/)
internal/repos/postgres/                      the pgx adapters — all concurrency-critical SQL lives here
internal/service/                             application layer: jobService, watchdog, scheduler,
                                              pause (gate + control)
internal/transport/grpc/                      thin handlers, the per-worker dispatcher,
                                              error→status interceptors
db/migrations/                                goose migrations 0001–0003
examples/                                     runnable demos: worker+producer, schedules, pause

Dependency direction: transport → service → repos (ports) → domain; postgres implements the ports; cmd/pulsed wires it all. Tests: domain (pure) → service (gomock) → postgres (real DB via TEST_DB_URL) → transport (direct + bufconn over a real gRPC server).

4. Request flows

SubmitSubmitJob → insert one PENDING row (uuidv7 id, priority, default attempt cap). Immediately claimable.

Dispatch — each connected worker's StreamJobs runs one dispatcher loop: every 500ms, if the pause gate is open, ClaimBatch atomically takes up to 100 dispatchable jobs (status IN (PENDING, RETRYING) AND next_run_at <= now(), ordered priority DESC, submitted_at ASC, FOR UPDATE SKIP LOCKED), marks them RUNNING with attempts+1 and a fresh lease, and streams each assignment. Concurrent workers take disjoint batches — there is nothing to race. (ADR-0005, ADR-0009)

Report — success: UPDATE → COMPLETED WHERE status='RUNNING'. Failure: one guarded UPDATE branches — attempts remain → RETRYING with next_run_at = now() + attempts² seconds; cap reached → DEAD_LETTERED keeping the last error. The claim's next_run_at predicate is the retry scheduler. (ADR-0006)

Crash recovery — the claim opened a lease; the SDK heartbeats every ~10s to extend it (fenced: WHERE locked_by = @worker, so a zombie's beat is a no-op). The watchdog ticks every 10s and reaps every RUNNING row with a lapsed lease through the normal retry path — one set-based UPDATE. Delivery is honestly at-least-once; handlers must be idempotent (job_id + attempt ride every assignment as the dedup key). (ADR-0007)

Schedule fire — the scheduler reads due schedules rows and inserts an ordinary job with a deterministic uuidv5(schedule_id|occurrence) id; Insert is ON CONFLICT DO NOTHING and the cursor advances via CAS on next_run_at, so a re-fired occurrence dedupes — exactly-once, crash-safe, multi-instance. One fire = one job, so fire history = the schedule's jobs. (ADR-0008)

PausePauseDispatch writes dispatch_control then flips the in-memory gate; other instances converge via a ~1s refresh loop. The gate is primed synchronously at boot (fail-safe: refuse to start on error) and fail-open at runtime (a DB blip never halts dispatch). Only claiming is gated — submits, running jobs, and the scheduler continue. (ADR-0010)

5. Where the guarantees come from

GuaranteeMechanism
No double-claimFOR UPDATE SKIP LOCKED — disjoint batches by construction
Illegal transitions rejectedguarded UPDATEs; zero rows → ErrInvalidTransition
Crash recoverylease on the row + watchdog reap + locked_by fence
Bounded failure handlingattempts² backoff, max_attempts, dead-letter with reason
Exactly-once schedule firesdeterministic uuidv5 id + ON CONFLICT DO NOTHING
Priority with FIFO tiesORDER BY priority DESC, submitted_at ASC + partial index
Pause survives restartboot-prime from the singleton row
O(batch) dispatch at any depthLIMIT + partial claim index
Honest delivery semanticsat-least-once + idempotent handlers = effectively-once

6. Operations

  • Logging: structured JSON (slog) throughout, including goose migration output.
  • Shutdown on SIGTERM: gRPC GracefulStop (bounded).
  • CLI: pulse jobs list|get|cancel, pulse submit, pulse schedules list, pulse dispatch pause|resume|status; connection saved once via pulse save creds (~/.config/pulse/config.json), precedence flag > env > config file.
  • Auth: PULSE_AUTH_USERS ("user:pass,user:pass") on the server requires Basic credentials on every RPC (constant-time compare, unary + stream); values with a $2 prefix are bcrypt hashes (pulse passwd), cached after first verify; SDK: pulse.WithUserPass. Unset = open.
  • TLS: PULSE_TLS_CERT/PULSE_TLS_KEY terminate TLS at the server (both required together — half a keypair refuses to start); unset = plaintext for local dev.
  • Addresses: gRPC on :50051 (PULSE_GRPC_ADDR takes the port).

7. Known limits (named, not hidden)

  • Dispatch latency floor — up to one 500ms tick (poll, not push; a broker is the deferred answer if the floor ever matters).
  • No per-job history — current state, last_error, and timestamps only.
  • Strict priority starves under a sustained high-priority stream; aging is future work.
  • Retry policy and lease constants are hardcoded (per-topic config deferred).
  • ListBySchedule is unpaginated (ListJobs pages by limit/offset; the schedule-lineage RPCs get paging with the CLI work).

On this page