pulse — Architecture
How the pieces fit. The README gives the pitch; this gives the map. Decisions and tradeoffs
live in docs/adr/.
1. The one-paragraph model
pulse is a Postgres-backed job queue. A job is one row in the jobs table — its state,
retry policy, and worker lease together. Workers connect over gRPC and claim disjoint
batches of jobs (FOR UPDATE SKIP LOCKED), run handlers in their own process, and
report results back. Every state transition is a guarded UPDATE whose WHERE clause
encodes the invariant; the database is the only arbiter, so any number of workers and any
number of pulsed replicas are safe by construction. (ADR-0002)

    YOUR APP — the SDK                            PULSE SERVER — pulsed
    ┌──────────────────────┐                      ┌─────────────────────────┐
    │ Enqueue(name, args)  │ ────── SubmitJob ──► │ INSERT job (PENDING)    │
    │ Register(name, fn)   │ ────── StreamJobs ─► │ claim batch → send      │
    │ Run()                │ ◄───── assignment ── │ (SKIP LOCKED + lease)   │
    │ fn() runs LOCALLY    │                      │                         │
    │ ReportResult         │ ─── ReportResult ──► │ complete / retry / DLQ  │
    │ Heartbeat (~10s)     │ ────── Heartbeat ──► │ extend lease (fenced)   │
    └──────────────────────┘                      └─────────────────────────┘

    background loops inside pulsed:
      watchdog   → reap lapsed leases  → retry path
      scheduler  → fire due schedules  → insert jobs
      pause loop → converge dispatch gate

2. The data model — three tables
| Table | Purpose |
|---|---|
| jobs | one row per job: status (PENDING → RUNNING → COMPLETED / RETRYING / DEAD_LETTERED / CANCELED), priority, attempts/max_attempts, last_error, the lease (locked_by, lease_until), the backoff gate (next_run_at), lineage (schedule_id) |
| schedules | mutable schedule definitions (At/Every/Cron), next_run_at cursor (ADR-0008) |
| dispatch_control | singleton pause switch: paused, reason, paused_at (ADR-0010) |
Two partial indexes carry the hot paths: idx_jobs_claim (topic, priority DESC, submitted_at ASC) WHERE status IN ('PENDING','RETRYING') for claims, and
idx_jobs_expired (lease_until) WHERE status = 'RUNNING' for the watchdog.
3. Package layout (ports and adapters)

    pulse.go, job.go, schedule.go, dispatch.go  SDK root — the only package users import
    cli/                                        operator CLI — its own Go module, so SDK
                                                consumers never inherit CLI dependencies
    ui/                                         web dashboard (pulseui) — its own Go module;
                                                server-rendered HTML + htmx, reads and acts
                                                exclusively through the SDK, never the DB
    cmd/pulsed/                                 composition root: wiring, consts, serve, shutdown
    cmd/migrate/                                goose migration runner
    cmd/loadgen/                                benchmark load generator
    proto/pulse/v1/pulse.proto → gen/pulsev1/   the wire contract → generated stubs
    internal/domain/                            Job, statuses, Backoff, Schedule, invariant errors
    internal/repos/                             ports: JobRepo, ScheduleRepo, DispatchControlRepo (+ mocks/)
    internal/repos/postgres/                    the pgx adapters — all concurrency-critical SQL lives here
    internal/service/                           application layer: jobService, watchdog, scheduler,
                                                pause (gate + control)
    internal/transport/grpc/                    thin handlers, the per-worker dispatcher,
                                                error→status interceptors
    db/migrations/                              goose migrations 0001–0003
    examples/                                   runnable demos: worker+producer, schedules, pause

Dependency direction: transport → service → repos (ports) → domain; postgres implements
the ports; cmd/pulsed wires it all. Tests: domain (pure) → service (gomock) → postgres
(real DB via TEST_DB_URL) → transport (direct + bufconn over a real gRPC server).
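The dependency direction above can be illustrated with a minimal sketch. The method set on JobRepo, the Job fields, ErrNotFound, and the in-memory adapter are all hypothetical stand-ins chosen for illustration; pulse's real ports and pgx adapters may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a trimmed stand-in for the domain type (hypothetical fields).
type Job struct {
	ID     string
	Status string
}

// ErrNotFound is a hypothetical domain error used only in this sketch.
var ErrNotFound = errors.New("job not found")

// JobRepo is the port. The service layer depends on this interface only.
// The method set is illustrative, not pulse's actual JobRepo.
type JobRepo interface {
	Insert(j Job) error
	Get(id string) (Job, error)
}

// memRepo is an in-memory adapter standing in for the pgx adapter or a mock.
type memRepo struct{ rows map[string]Job }

func newMemRepo() *memRepo { return &memRepo{rows: map[string]Job{}} }

func (m *memRepo) Insert(j Job) error {
	m.rows[j.ID] = j
	return nil
}

func (m *memRepo) Get(id string) (Job, error) {
	j, ok := m.rows[id]
	if !ok {
		return Job{}, ErrNotFound
	}
	return j, nil
}

// JobService is the application layer. It holds a port, never an adapter type.
type JobService struct{ repo JobRepo }

// Submit inserts a PENDING job through the port.
func (s *JobService) Submit(id string) error {
	return s.repo.Insert(Job{ID: id, Status: "PENDING"})
}

func main() {
	// The composition root picks the adapter and wires it into the service.
	repo := newMemRepo()
	svc := &JobService{repo: repo}
	if err := svc.Submit("job-1"); err != nil {
		panic(err)
	}
	j, _ := repo.Get("job-1")
	fmt.Println(j.ID, j.Status)
}
```

Because the service only sees the interface, the same service code can run against a mock in unit tests and against the real database adapter in integration tests.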
4. Request flows
Submit — SubmitJob → insert one PENDING row (uuidv7 id, priority, default attempt
cap). Immediately claimable.
Dispatch — each connected worker's StreamJobs runs one dispatcher loop: every 500ms,
if the pause gate is open, ClaimBatch atomically takes up to 100 dispatchable jobs
(status IN (PENDING, RETRYING) AND next_run_at <= now(), ordered priority DESC, submitted_at ASC, FOR UPDATE SKIP LOCKED), marks them RUNNING with attempts+1 and a
fresh lease, and streams each assignment. Concurrent workers take disjoint batches —
there is nothing to race. (ADR-0005, ADR-0009)
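As a rough illustration of the claim predicate and ordering described above, here is an in-memory model. It covers only eligibility, ordering, the batch limit, and the RUNNING/attempts update; the row locking that FOR UPDATE SKIP LOCKED provides is not modelled. The Job struct and the use of integer seconds for timestamps are simplifications assumed for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a simplified row. Timestamps are plain seconds to keep the sketch small.
type Job struct {
	ID          string
	Status      string
	Priority    int
	SubmittedAt int64
	NextRunAt   int64
	Attempts    int
}

// claimBatch selects dispatchable jobs (PENDING or RETRYING with next_run_at <= now),
// orders them by priority DESC then submitted_at ASC, takes at most limit,
// and marks each one RUNNING with attempts+1.
func claimBatch(jobs []*Job, now int64, limit int) []*Job {
	var eligible []*Job
	for _, j := range jobs {
		dispatchable := j.Status == "PENDING" || j.Status == "RETRYING"
		if dispatchable && j.NextRunAt <= now {
			eligible = append(eligible, j)
		}
	}
	sort.SliceStable(eligible, func(a, b int) bool {
		if eligible[a].Priority != eligible[b].Priority {
			return eligible[a].Priority > eligible[b].Priority
		}
		return eligible[a].SubmittedAt < eligible[b].SubmittedAt
	})
	if len(eligible) > limit {
		eligible = eligible[:limit]
	}
	for _, j := range eligible {
		j.Status = "RUNNING"
		j.Attempts++
	}
	return eligible
}

func main() {
	jobs := []*Job{
		{ID: "a", Status: "PENDING", Priority: 1, SubmittedAt: 10},
		{ID: "b", Status: "RETRYING", Priority: 5, SubmittedAt: 20},
		{ID: "c", Status: "PENDING", Priority: 5, SubmittedAt: 15},
		{ID: "d", Status: "RETRYING", Priority: 9, SubmittedAt: 5, NextRunAt: 500},
	}
	for _, j := range claimBatch(jobs, 100, 100) {
		fmt.Println(j.ID, j.Status, j.Attempts)
	}
}
```

Job "d" has the highest priority but is skipped at now=100 because its backoff gate has not opened yet, which is the same mechanism the retry path relies on.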
Report — success: UPDATE → COMPLETED WHERE status='RUNNING'. Failure: one guarded
UPDATE branches — attempts remain → RETRYING with next_run_at = now() + attempts² seconds; cap reached → DEAD_LETTERED keeping the last error. The claim's next_run_at
predicate is the retry scheduler. (ADR-0006)
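The failure branch can be sketched as a small pure function. It assumes that attempts has already been incremented by the claim, so "attempts remain" is read here as attempts < max_attempts; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the retry delay: attempts squared, in seconds.
func backoff(attempts int) time.Duration {
	return time.Duration(attempts*attempts) * time.Second
}

// onFailure decides the next status for a failed attempt.
// Assumption: attempts counts the attempt that just failed.
func onFailure(attempts, maxAttempts int) (status string, delay time.Duration) {
	if attempts < maxAttempts {
		return "RETRYING", backoff(attempts)
	}
	return "DEAD_LETTERED", 0
}

func main() {
	const maxAttempts = 5
	for attempts := 1; attempts <= maxAttempts; attempts++ {
		status, delay := onFailure(attempts, maxAttempts)
		fmt.Println(attempts, status, delay)
	}
}
```

In the database the same decision is made inside one guarded UPDATE, with next_run_at set to now() plus the delay, so no separate retry timer is needed.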
Crash recovery — the claim opened a lease; the SDK heartbeats every ~10s to extend it
(fenced: WHERE locked_by = @worker, so a zombie's beat is a no-op). The watchdog ticks
every 10s and reaps every RUNNING row with a lapsed lease through the normal retry path —
one set-based UPDATE. Delivery is honestly at-least-once; handlers must be idempotent
(job_id + attempt ride every assignment as the dedup key). (ADR-0007)
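A minimal in-memory sketch of the lease fence and the watchdog reap follows. The lease length, the integer timestamps, the status check inside the heartbeat, and the choice to treat lease_until <= now as lapsed are all assumptions made for illustration; the real logic is SQL in the postgres adapter.

```go
package main

import "fmt"

// Job carries only the fields the lease logic touches. Times are plain seconds.
type Job struct {
	Status      string
	LockedBy    string
	LeaseUntil  int64
	NextRunAt   int64
	Attempts    int
	MaxAttempts int
}

// heartbeat extends the lease only when the caller still owns the row.
// It returns false for a zombie worker, mirroring a zero-row UPDATE.
func heartbeat(j *Job, worker string, now, leaseSecs int64) bool {
	if j.Status != "RUNNING" || j.LockedBy != worker {
		return false
	}
	j.LeaseUntil = now + leaseSecs
	return true
}

// reap pushes every RUNNING job with a lapsed lease through the retry path
// and returns how many rows it touched.
func reap(jobs []*Job, now int64) int {
	n := 0
	for _, j := range jobs {
		if j.Status != "RUNNING" || j.LeaseUntil > now {
			continue
		}
		j.LockedBy = ""
		if j.Attempts < j.MaxAttempts {
			j.Status = "RETRYING"
			j.NextRunAt = now + int64(j.Attempts*j.Attempts)
		} else {
			j.Status = "DEAD_LETTERED"
		}
		n++
	}
	return n
}

func main() {
	j := &Job{Status: "RUNNING", LockedBy: "w1", LeaseUntil: 10, Attempts: 1, MaxAttempts: 3}
	fmt.Println("reaped:", reap([]*Job{j}, 20), "status:", j.Status)
	fmt.Println("zombie beat applied:", heartbeat(j, "w1", 21, 30))
}
```

The fence matters after a reap: if the job is later claimed by another worker, a late heartbeat from the original worker cannot extend the new owner's lease.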
Schedule fire — the scheduler reads due schedules rows and inserts an ordinary job
with a deterministic uuidv5(schedule_id|occurrence) id; Insert is ON CONFLICT DO NOTHING and the cursor advances via CAS on next_run_at, so a re-fired occurrence dedupes
— exactly-once, crash-safe, multi-instance. One fire = one job, so fire history = the
schedule's jobs. (ADR-0008)
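The deduplication idea can be sketched with a standard-library UUIDv5 (SHA-1 over namespace plus name). The namespace bytes, the textual form of the occurrence, and the map standing in for ON CONFLICT DO NOTHING are assumptions for this example; the source only states that the id is uuidv5(schedule_id|occurrence).

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// pulseNamespace is a hypothetical namespace UUID chosen for this sketch.
var pulseNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 builds a name-based UUID: SHA-1(namespace || name), truncated to
// 16 bytes, with the version and variant bits set.
func uuidV5(ns [16]byte, name string) string {
	sum := sha1.Sum(append(ns[:], name...))
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// fireID derives the job id for one occurrence of one schedule.
func fireID(scheduleID, occurrence string) string {
	return uuidV5(pulseNamespace, scheduleID+"|"+occurrence)
}

// store mimics INSERT ... ON CONFLICT DO NOTHING keyed on the job id.
type store struct{ jobs map[string]bool }

// insert reports whether a new row was written.
func (s *store) insert(id string) bool {
	if s.jobs[id] {
		return false
	}
	s.jobs[id] = true
	return true
}

func main() {
	s := &store{jobs: map[string]bool{}}
	id := fireID("sched-1", "2024-01-01T00:00:00Z")
	fmt.Println("first fire inserted:", s.insert(id))
	// A second instance, or a restart, re-fires the same occurrence.
	fmt.Println("re-fire inserted:", s.insert(fireID("sched-1", "2024-01-01T00:00:00Z")))
}
```

Because the id is a pure function of schedule and occurrence, two instances that race on the same fire produce the same primary key and only one insert lands.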
Pause — PauseDispatch writes dispatch_control then flips the in-memory gate; other
instances converge via a ~1s refresh loop. The gate is primed synchronously at boot
(fail-safe: refuse to start on error) and fail-open at runtime (a DB blip never halts
dispatch). Only claiming is gated — submits, running jobs, and the scheduler continue.
(ADR-0010)
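One possible shape for the gate is sketched below. It interprets the runtime behavior as "keep the last known value when a refresh fails", which is one reading of fail-open; the Gate type, the loadFn signature, and errDB are hypothetical names, not pulse's API.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errDB is a stand-in for any database error.
var errDB = errors.New("db unavailable")

// loadFn reads the paused flag from the singleton row (hypothetical signature).
type loadFn func() (paused bool, err error)

// Gate is the in-memory pause switch consulted before each claim.
type Gate struct{ paused atomic.Bool }

// NewGate primes the gate synchronously. An error here means the process
// should refuse to start rather than guess the pause state.
func NewGate(load loadFn) (*Gate, error) {
	p, err := load()
	if err != nil {
		return nil, fmt.Errorf("prime pause gate: %w", err)
	}
	g := &Gate{}
	g.paused.Store(p)
	return g, nil
}

// Refresh is one tick of the convergence loop. On error the last known
// value is kept, so a transient DB failure does not change dispatch.
func (g *Gate) Refresh(load loadFn) {
	p, err := load()
	if err != nil {
		return
	}
	g.paused.Store(p)
}

// Open reports whether claiming is currently allowed.
func (g *Gate) Open() bool { return !g.paused.Load() }

func main() {
	g, err := NewGate(func() (bool, error) { return false, nil })
	if err != nil {
		panic(err)
	}
	g.Refresh(func() (bool, error) { return false, errDB })
	fmt.Println("open after failed refresh:", g.Open())
	g.Refresh(func() (bool, error) { return true, nil })
	fmt.Println("open after pause observed:", g.Open())
}
```

Only the dispatcher would consult Open(); submits, result reports, and the scheduler would not, which matches the statement that only claiming is gated.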
5. Where the guarantees come from
| Guarantee | Mechanism |
|---|---|
| No double-claim | FOR UPDATE SKIP LOCKED — disjoint batches by construction |
| Illegal transitions rejected | guarded UPDATEs; zero rows → ErrInvalidTransition |
| Crash recovery | lease on the row + watchdog reap + locked_by fence |
| Bounded failure handling | attempts² backoff, max_attempts, dead-letter with reason |
| Exactly-once schedule fires | deterministic uuidv5 id + ON CONFLICT DO NOTHING |
| Priority with FIFO ties | ORDER BY priority DESC, submitted_at ASC + partial index |
| Pause survives restart | boot-prime from the singleton row |
| O(batch) dispatch at any depth | LIMIT + partial claim index |
| Honest delivery semantics | at-least-once + idempotent handlers = effectively-once |
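The "guarded UPDATE, zero rows means rejection" mechanism in the table can be modelled in a few lines. The table type and method names are hypothetical, and a mutex-protected map stands in for the database as the single arbiter; only ErrInvalidTransition is a name taken from the table above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInvalidTransition is returned when the guard matches no row.
var ErrInvalidTransition = errors.New("invalid transition")

// table is an in-memory stand-in for the jobs table: job id to status.
type table struct {
	mu   sync.Mutex
	rows map[string]string
}

// guardedUpdate mimics
//   UPDATE jobs SET status = to WHERE id = id AND status = from
// and returns the number of rows affected (0 or 1).
func (t *table) guardedUpdate(id, from, to string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.rows[id] != from {
		return 0
	}
	t.rows[id] = to
	return 1
}

// Complete moves a RUNNING job to COMPLETED or reports an illegal transition.
func (t *table) Complete(id string) error {
	if t.guardedUpdate(id, "RUNNING", "COMPLETED") == 0 {
		return ErrInvalidTransition
	}
	return nil
}

func main() {
	tbl := &table{rows: map[string]string{"j1": "RUNNING"}}
	fmt.Println("first report:", tbl.Complete("j1"))
	// A duplicate or late report finds no RUNNING row and is rejected.
	fmt.Println("second report:", tbl.Complete("j1"))
}
```

The check and the write happen in one step, so two concurrent reports for the same job cannot both succeed.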
6. Operations
- Logging: structured JSON (slog) throughout, including goose migration output.
- Shutdown on SIGTERM: gRPC GracefulStop (bounded).
- CLI: pulse jobs list|get|cancel, pulse submit, pulse schedules list, pulse dispatch pause|resume|status; connection saved once via pulse save creds (~/.config/pulse/config.json), precedence flag > env > config file.
- Auth: PULSE_AUTH_USERS ("user:pass,user:pass") on the server requires Basic credentials on every RPC (constant-time compare, unary + stream); values with a $2 prefix are bcrypt hashes (pulse passwd), cached after first verify; SDK: pulse.WithUserPass. Unset = open.
- TLS: PULSE_TLS_CERT / PULSE_TLS_KEY terminate TLS at the server (both required together — half a keypair refuses to start); unset = plaintext for local dev.
- Addresses: gRPC on :50051 (PULSE_GRPC_ADDR takes the port).
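As an illustration of the Auth item, the sketch below parses a value in the "user:pass,user:pass" format and checks a plaintext password with a constant-time compare. The function names are hypothetical, and the bcrypt branch for values with a $2 prefix is omitted because it needs a package outside the standard library.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// parseUsers turns "user:pass,user:pass" into a lookup table.
// An empty spec yields an empty table, which check treats as "auth disabled".
func parseUsers(spec string) (map[string]string, error) {
	users := map[string]string{}
	if spec == "" {
		return users, nil
	}
	for _, pair := range strings.Split(spec, ",") {
		user, pass, ok := strings.Cut(pair, ":")
		if !ok || user == "" {
			return nil, fmt.Errorf("malformed auth entry %q", pair)
		}
		users[user] = pass
	}
	return users, nil
}

// check verifies Basic credentials against plaintext entries.
// Hashed entries (a "$2" prefix) would be verified with bcrypt instead;
// that path is not part of this sketch.
func check(users map[string]string, user, pass string) bool {
	if len(users) == 0 {
		return true // unset = open
	}
	want, ok := users[user]
	if !ok {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(want), []byte(pass)) == 1
}

func main() {
	users, err := parseUsers("alice:s3cret,bob:hunter2")
	if err != nil {
		panic(err)
	}
	fmt.Println("alice ok:", check(users, "alice", "s3cret"))
	fmt.Println("alice wrong password:", check(users, "alice", "nope"))
}
```

In a gRPC server this check would typically sit in both a unary and a stream interceptor so that every RPC passes through it.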
7. Known limits (named, not hidden)
- Dispatch latency floor — up to one 500ms tick (poll, not push; a broker is the deferred answer if the floor ever matters).
- No per-job history — current state, last_error, and timestamps only.
- Strict priority starves low-priority jobs under a sustained high-priority stream; aging is future work.
- Retry policy and lease constants are hardcoded (per-topic config deferred).
- ListBySchedule is unpaginated (ListJobs pages by limit/offset; the schedule-lineage RPCs get paging with the CLI work).