ADR-019: Gateway Deployment Model

Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-03 (monolith blast radius), analyst backpass contention 4

Decision

Gateways run in-process with kiseki-server (monolith per node). Client resilience is provided by multi-endpoint resolution, not per-process gateway isolation.

Rationale

This is a distributed system with no master. Every storage node runs kiseki-server (log + chunk + composition + view + gateways). Clients resolve to multiple endpoints:

Client (NFS/S3/native)
  │
  ├── DNS round-robin: kiseki-nfs.cluster.local → [node1, node2, node3, ...]
  ├── Multiple A/AAAA records
  ├── Native client: seed list → discovery → multiple endpoints
  │
  └── On node failure: client reconnects to next endpoint
      (NFS: automatic reconnect; S3: retry to different host;
       native: transport failover)

Why monolith is acceptable

Concern	Mitigation
Gateway crash = node crash	Client reconnects to another node (seconds)
All tenants on crashed node affected	Tenants are served by multiple nodes; one node loss = partial, not total
Memory leak in gateway affects log/chunk	Resource limits via cgroups; OOM killer targets the process, not the node
Bug in NFS gateway affects S3 gateway	Accept — both are in the same process. Isolation adds operational complexity disproportionate to the risk

Why NOT separate gateway processes

Additional process management per node (spawn, monitor, restart, IPC)
Performance overhead of IPC between gateway and log/chunk/view
Operational complexity (more processes to configure, monitor, upgrade)
The resilience model is client-side multi-endpoint, not server-side process isolation

Client resolution

Client type	Resolution mechanism
NFS	DNS (multiple A records), NFS mount with multiple server addresses
S3	DNS round-robin, HTTP retry to next endpoint on 5xx
Native	Seed list → fabric discovery → multiple endpoints, automatic failover

Consequences

kiseki-server remains a single-process monolith per node
Client-side resilience is the primary availability mechanism
Update failure-modes.md: F-D1 (gateway crash) → node-scoped, not protocol-scoped
Node loss tolerance depends on tenant data distribution across nodes

Keyboard shortcuts