ADR-0002: Overall architecture — one server, one table, an SDK
Context
pulse needs a shape before it needs features: where does state live, where does user code run, how many deployable things are there, and what infrastructure may it assume? The guiding constraints:
- A developer should integrate with one URL and one imported package — no broker to install, no sidecar, no config sprawl.
- The server must never execute user code — handlers are the developer's, in the developer's process, with their own dependencies and secrets.
- Every architectural dollar should buy a demonstrable distributed-systems property (correct concurrency, crash recovery, honest semantics) — not deployment topology.
Decision
Four shape decisions, each with a one-line reason:
- One deployable server (pulsed) + a client SDK, not microservices. All server-side concerns (dispatch, watchdog, scheduler, pause control) are goroutines in one process (cmd/pulsed/main.go). The interesting distribution problem in a job system is server ↔ many workers, not server ↔ server; splitting the server would add network boundaries with no property gained.
- Postgres is the only infrastructure, and one jobs table is the core of it. A job's state, retry policy, and worker lease are columns on a single row (db/migrations/0001_CreateJobs.go); schedules and the pause switch are two more small tables. Coordination uses exactly two Postgres primitives: FOR UPDATE SKIP LOCKED for contention-free batch claims, and guarded UPDATEs. Every state transition is one UPDATE ... WHERE status = ..., and a guard that matches no row surfaces as domain.ErrInvalidTransition (internal/repos/postgres/jobs.go). No broker, no Redis, no coordination service until a measured limit demands one (ADR-0005 defers push/NATS on exactly these grounds).
- gRPC + a typed SDK is the only boundary. One proto surface (proto/pulse/v1/pulse.proto): SubmitJob/GetJob for producers, StreamJobs/ReportResult/Heartbeat for workers. The SDK is the root package: pulse.New, one connection, both producer and consumer roles; Register/Enqueue keep protobuf out of the developer's code. The server only ever ships data across the wire; handlers run in the developer's process, never in pulsed. Error mapping lives in interceptors (internal/transport/grpc/errors.go, ServerOptions()), so handlers just return err, and production and the bufconn test harness share identical wire behaviour.
- Ports-and-adapters layering inside the server: transport → service → repos (ports) → domain, with Postgres implementing the ports (internal/repos/postgres/). The domain holds types and invariant errors; services own orchestration and the background loops; transport translates only. Every port has a gomock (internal/service/mocks/); integration tests hit real Postgres behind TEST_DB_URL.
Alternatives considered
Broker-centric architecture (NATS at the core, from day one)
A broker gives push dispatch and fan-out — but it becomes a second source of truth beside the jobs table, owns delivery semantics we'd have to reconcile with the claim logic, and adds an operational dependency for every adopter. Deferred, not rejected: the jobs table stays authoritative, so a broker can later become a delivery optimization (ADR-0005's revisit path) without re-architecting.
Split services (API / dispatcher / scheduler as separate deployables)
Buys independent scaling and blast-radius isolation; costs inter-service contracts,
shared-DB coupling or an internal bus, and N deploy targets. Nothing in pulse's problem
needs it — the loops already scale out correctly as replicas of the whole server: claims
are disjoint by SKIP LOCKED, schedule fires are deduplicated by deterministic job ids
plus a CAS advance, and the pause switch converges through dispatch_control. N instances
of pulsed are safe by construction.
Library-only (embed pulse in the app, no server)
Simplest possible adoption (like asynq-as-a-library), but then every app process runs
watchdog/scheduler loops, config drifts per-app, and there is no single admin surface
(pause, schedules, dead-letter ops). The server is what makes pulse operable.
REST/JSON instead of gRPC
Broader reach, but hand-written clients, no streaming for dispatch, no generated types. The SDK is the product surface — codegen and streaming won (ADR-0005 details).
Consequences
- One binary + one database = one-command local run and trivial CI; every distributed property is demonstrable with docker-compose and two terminals.
- N pulsed replicas are safe by construction: every arbitration (claims, schedule fires, pause) happens in Postgres, none in process memory. Scale-out was designed in, not bolted on.
- Postgres is the ceiling: poll-based dispatch latency and per-worker query load are the costs (a claim tick is O(batch) via the partial idx_jobs_claim index, so backlog depth doesn't matter, but worker count sets the query rate). The smoke run (2000 jobs across 8 workers at 1223 jobs/s end-to-end, 0 rollbacks, ~1.6 DB transactions per job) says the ceiling is far off; the answer when it arrives is the deferred broker.
- The layering keeps the arbitration auditable: everything concurrency-critical is a handful of SQL statements in internal/repos/postgres/jobs.go, testable against a real database, mockable everywhere else.
- Handlers-in-your-process means pulse never needs sandboxing, dependency injection into the server, or a plugin story. It is the sharpest scope cut in the design.
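As a back-of-envelope reading of the smoke-run figures quoted above, the short calculation below derives the implied database load. The derived numbers are arithmetic on the rounded figures in the text, not separate measurements, and they assume the 1223 jobs/s rate held for the whole run; the derive function and its field names are made up for the example.

```go
package main

import "fmt"

// loadFigures holds numbers derived from the quoted smoke-run figures.
type loadFigures struct {
	WallSeconds      float64 // approximate duration of the run
	TotalTx          float64 // approximate DB transactions for the run
	TxPerSecond      float64 // approximate DB transaction rate
	JobsPerWorkerSec float64 // approximate per-worker throughput
}

// derive computes secondary figures from job count, worker count,
// end-to-end throughput, and transactions per job.
func derive(jobs, workers int, jobsPerSec, txPerJob float64) loadFigures {
	return loadFigures{
		WallSeconds:      float64(jobs) / jobsPerSec,
		TotalTx:          float64(jobs) * txPerJob,
		TxPerSecond:      jobsPerSec * txPerJob,
		JobsPerWorkerSec: jobsPerSec / float64(workers),
	}
}

func main() {
	f := derive(2000, 8, 1223, 1.6)
	fmt.Printf("wall ~%.2fs, tx ~%.0f, tx/s ~%.0f, jobs/s per worker ~%.0f\n",
		f.WallSeconds, f.TotalTx, f.TxPerSecond, f.JobsPerWorkerSec)
	// wall ~1.64s, tx ~3200, tx/s ~1957, jobs/s per worker ~153
}
```

On these figures the database was handling roughly two thousand transactions per second during the run.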