Kiseki
Kiseki is a distributed storage system built for HPC and AI workloads. It provides a unified data plane that serves files and objects through multiple protocol gateways (S3, NFS, FUSE) while handling encryption, replication, and caching transparently.
Key Features
- **S3 and NFS gateways** – access the same data through S3-compatible HTTP, NFSv3/v4.2, or a native FUSE mount. Protocol gateways translate wire protocols into operations on the shared log-structured data model.
- **Client-side cache with staging** – a two-tier cache (L1 in-memory, L2 local NVMe) on compute nodes eliminates repeated fabric traversals. Three modes (pinned, organic, bypass) match the dominant workload patterns: epoch-reuse training, mixed inference, and streaming ingest.
- **Per-shard Raft consensus** – every shard is a single-tenant Raft group. Deltas (metadata mutations) are totally ordered within a shard and replicated to a quorum before acknowledgement.
- **Erasure coding and placement** – chunks are stored across affinity pools with configurable EC profiles. The placement engine distributes data across device classes (fast-NVMe, bulk-NVMe) and rebuilds lost chunks from parity.
- **FIPS 140-2/3 encryption** – always-on, two-layer envelope encryption. System DEKs (AES-256-GCM via aws-lc-rs) encrypt chunk data; tenant KEKs wrap the DEKs for access control. Five tenant KMS backends: Kiseki-Internal, HashiCorp Vault, KMIP 2.1, AWS KMS, PKCS#11.
- **GPU-direct and fabric transports** – the native client selects the fastest available transport: libfabric/CXI (Slingshot), RDMA verbs, or TCP+TLS. Transport selection is automatic, based on fabric discovery.
- **Multi-tenant isolation** – tenant hierarchy (organization / project / workload) with per-level quotas, compliance tags, and key isolation. Shards are single-tenant. Cross-tenant data access is out of scope by design.
- **OIDC and mTLS authentication** – Keycloak (or any OIDC provider) for identity; cluster-CA-signed mTLS certificates for data-fabric authentication. Certificate identity is carried in the SAN, so no control-plane access is needed on the hot path.
- **Workflow advisory** – a bidirectional advisory channel carries workload hints (access pattern, prefetch range, priority) inbound and telemetry feedback (backpressure, locality, staleness) outbound. The advisory path runs side-by-side with the data path – it never blocks or delays data operations.
Architecture at a Glance
Kiseki is a single-language Rust system organized as 18 crates in a Cargo workspace:
| Layer | Crates |
|---|---|
| Foundation | kiseki-common, kiseki-proto, kiseki-crypto, kiseki-transport |
| Data path | kiseki-log, kiseki-block, kiseki-chunk, kiseki-composition, kiseki-view |
| Protocol | kiseki-gateway (NFS + S3) |
| Client | kiseki-client (FUSE, FFI, Python via PyO3) |
| Infrastructure | kiseki-raft, kiseki-keymanager, kiseki-audit, kiseki-advisory, kiseki-control |
| Integration | kiseki-server, kiseki-acceptance |
The data model is log-structured: mutations are recorded as deltas appended to per-shard Raft logs. Compositions describe how content-addressed, encrypted chunks assemble into files or objects. Views are materialized projections of shard state, maintained incrementally by stream processors and served by protocol gateways.
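The log-and-view flow can be sketched in a few lines of Python. This is a toy model, not Kiseki's actual types: deltas receive a total order within their shard, and a view is simply the fold of the delta log.

```python
# Toy sketch of the log-structured data model: deltas appended to a
# per-shard log, and a view materialized by folding deltas in order.
from dataclasses import dataclass, field

@dataclass
class Delta:
    seq: int          # total order within the shard
    op: str           # "put" or "delete"
    key: str
    value: bytes = b""

@dataclass
class Shard:
    log: list = field(default_factory=list)

    def append(self, op, key, value=b""):
        # In the real system, Raft replicates the entry to a quorum
        # before acknowledging the write.
        delta = Delta(seq=len(self.log), op=op, key=key, value=value)
        self.log.append(delta)
        return delta

def materialize(log):
    """Fold the delta log into a view (key -> value)."""
    view = {}
    for d in log:
        if d.op == "put":
            view[d.key] = d.value
        elif d.op == "delete":
            view.pop(d.key, None)
    return view

shard = Shard()
shard.append("put", "a.txt", b"v1")
shard.append("put", "a.txt", b"v2")
shard.append("delete", "a.txt")
shard.append("put", "b.txt", b"data")
print(materialize(shard.log))  # {'b.txt': b'data'}
```

In the real system the fold is incremental (stream processors apply new deltas to an existing view) rather than replayed from scratch, but the semantics are the same.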
Four binaries are produced:
| Binary | Role |
|---|---|
| kiseki-server | Storage node (log + chunk + composition + view + gateways + audit + advisory) |
| kiseki-keyserver | HA system key manager (Raft-replicated) |
| kiseki-client-fuse | Compute-node FUSE mount with native client |
| kiseki-control | Control plane (tenancy, IAM, policy, federation) |
Target Workloads
| Workload | How Kiseki helps |
|---|---|
| LLM training | Tokenized datasets staged once per job, served from local NVMe cache across epochs. Pinned cache mode prevents eviction. |
| LLM inference | Model weights cold-started into cache on first load, then served locally for all replicas on the node. |
| Climate / weather simulation | Boundary conditions staged with hard deadline via Slurm prolog. Input files cached; checkpoint writes bypass the cache. |
| HPC checkpoint/restart | Checkpoint writes go straight to canonical (bypass mode). Restart reads benefit from organic caching if the same node is reused. |
Quick Links
- Getting Started – Docker Compose quickstart
- S3 API – supported operations, examples
- NFS Access – NFSv3/v4.2 mount instructions
- FUSE Mount – native client mount
- Python SDK – PyO3 bindings
- Client Cache & Staging – ADR-031 cache modes
Getting Started
This guide walks through running a single-node Kiseki stack with Docker Compose, verifying the deployment, and performing basic S3 operations.
Prerequisites
- Docker 24+ with Compose V2 (docker compose)
- curl (for health checks)
- aws-cli (optional, for S3 operations)

If building from source instead of Docker:

- Rust 1.78+ (stable)
- Protobuf compiler (protoc)
Quick Start with Docker Compose
The repository includes a docker-compose.yml that brings up a
single-node Kiseki server with supporting services:
| Service | Port | Purpose |
|---|---|---|
| kiseki-server | 9000 | S3 HTTP gateway |
| kiseki-server | 2049 | NFS (v3 + v4.2) |
| kiseki-server | 9090 | Prometheus metrics |
| kiseki-server | 9100 | Data-path gRPC |
| kiseki-server | 9101 | Advisory gRPC |
| jaeger | 16686 | Tracing UI |
| jaeger | 4317 | OTLP gRPC receiver |
| vault | 8200 | HashiCorp Vault (dev mode, tenant KMS) |
| keycloak | 8080 | Keycloak (OIDC identity provider) |
Start the stack:
docker compose up --build -d
Wait for all services to become healthy:
docker compose ps
The kiseki-server container sets KISEKI_BOOTSTRAP=true, which
creates an initial shard for immediate use.
Verify the Deployment
Health Check
The data-path gRPC port responds to TCP connections when the server is ready:
# TCP probe on the data-path port
timeout 1 bash -c 'echo > /dev/tcp/127.0.0.1/9100'
echo $? # 0 = healthy
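The same readiness probe can be done from Python with only the standard library, which is handy in orchestration scripts. The snippet below demonstrates against a throwaway local listener so it is self-contained; in practice you would probe `<node>:9100`.

```python
# TCP readiness probe in stdlib Python. A successful connect means the
# data-path port is accepting connections.
import socket

def is_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: bind an ephemeral local listener and probe it.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(is_ready("127.0.0.1", port))  # True
listener.close()
```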
Prometheus Metrics
curl -s http://localhost:9090/metrics | head -20
Jaeger Tracing
Open http://localhost:16686 in a browser to view distributed traces. The server exports traces via OTLP to Jaeger automatically.
Vault (Dev Mode)
Vault runs in dev mode with root token kiseki-e2e-token:
curl -s http://localhost:8200/v1/sys/health | python3 -m json.tool
Keycloak
Keycloak is available at http://localhost:8080
with admin credentials admin / admin.
S3 Operations
With aws-cli configured to point at the local S3 gateway:
# Configure a local profile (no real AWS credentials needed)
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1
# Create a bucket (maps to a Kiseki namespace)
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
# Upload a file
echo "hello kiseki" > /tmp/hello.txt
aws --endpoint-url http://localhost:9000 s3 cp /tmp/hello.txt s3://test-bucket/hello.txt
# Download and verify
aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/hello.txt /tmp/hello-back.txt
cat /tmp/hello-back.txt
Or with curl directly:
# List buckets
curl -s http://localhost:9000/
# PUT an object
curl -X PUT http://localhost:9000/test-bucket/greeting.txt \
-d "hello from curl"
# GET it back
curl -s http://localhost:9000/test-bucket/greeting.txt
Multi-Node Cluster
A three-node cluster configuration is also provided:
docker compose -f docker-compose.3node.yml up --build -d
This starts three kiseki-server instances that form Raft groups for
shard replication.
Building from Source
# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki
cargo build --release
# Run the server
KISEKI_BOOTSTRAP=true \
KISEKI_DATA_DIR=/tmp/kiseki-data \
KISEKI_S3_ADDR=0.0.0.0:9000 \
KISEKI_NFS_ADDR=0.0.0.0:2049 \
KISEKI_DATA_ADDR=0.0.0.0:9100 \
KISEKI_METRICS_ADDR=0.0.0.0:9090 \
./target/release/kiseki-server
Next Steps
- S3 API – full list of supported S3 operations
- NFS Access – mount via NFS
- FUSE Mount – native client mount on compute nodes
- Python SDK – use Kiseki from Python workloads
- Client Cache & Staging – pre-stage datasets for training jobs
S3 API
Kiseki exposes an S3-compatible HTTP gateway on port 9000 (configurable
via KISEKI_S3_ADDR). The gateway implements the subset of S3 API
operations needed by HPC/AI workloads (ADR-014). Unsupported operations
return 501 Not Implemented.
Endpoint
http://<node>:9000
In the Docker Compose development stack, the endpoint is
http://localhost:9000.
Authentication
Kiseki supports AWS Signature Version 4 authentication:
- Authorization header – standard SigV4 signing for aws-cli, boto3, and other SDK clients.
- Presigned URLs – planned for a future release (not yet implemented).
In development mode (Docker Compose), any access key and secret key values are accepted.
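For reference, SigV4 signing starts from a signing key derived through a fixed chain of HMAC-SHA256 operations; this is what aws-cli and boto3 do under the hood for every request. A stdlib-only sketch with placeholder credentials (not Kiseki-specific code):

```python
# SigV4 signing-key derivation: AWS4-prefixed secret -> date -> region ->
# service -> "aws4_request". Each step is HMAC-SHA256.
import hashlib
import hmac

def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    k_date = _hmac(("AWS4" + secret).encode(), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# Placeholder dev credentials, as used in the Docker Compose stack.
key = signing_key("kiseki", "20250101", "us-east-1", "s3")
print(len(key))  # 32 (an HMAC-SHA256 digest)
```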
Supported Operations
Bucket Operations
S3 buckets map to Kiseki namespaces. Creating a bucket creates a tenant-scoped namespace; deleting a bucket deletes the namespace.
| Operation | S3 API | Notes |
|---|---|---|
| Create bucket | PUT /{bucket} | Maps to namespace creation |
| Delete bucket | DELETE /{bucket} | Maps to namespace deletion |
| Head bucket | HEAD /{bucket} | Existence check |
| List buckets | GET / | Per-tenant bucket listing |
Object Operations
| Operation | S3 API | Notes |
|---|---|---|
| Put object | PUT /{bucket}/{key} | Single-part upload |
| Get object | GET /{bucket}/{key} | Including byte-range reads (Range header) |
| Head object | HEAD /{bucket}/{key} | Metadata retrieval |
| Delete object | DELETE /{bucket}/{key} | Tombstone or delete marker (versioning) |
| List objects | GET /{bucket}?list-type=2 | ListObjectsV2 with prefix, delimiter, pagination |
Multipart Upload
For objects larger than a single PUT (large datasets, model weights):
| Operation | S3 API | Notes |
|---|---|---|
| Create multipart upload | POST /{bucket}/{key}?uploads | Returns upload ID |
| Upload part | PUT /{bucket}/{key}?partNumber={n}&uploadId={id} | Upload one part |
| Complete multipart upload | POST /{bucket}/{key}?uploadId={id} | Assemble parts into final object |
| Abort multipart upload | DELETE /{bucket}/{key}?uploadId={id} | Clean up incomplete upload |
| List multipart uploads | GET /{bucket}?uploads | List in-progress uploads |
| List parts | GET /{bucket}/{key}?uploadId={id} | List parts of an in-progress upload |
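The multipart lifecycle can be illustrated with a toy in-memory assembler (a hypothetical helper, not the gateway implementation): parts may be uploaded in any order, and completion concatenates them by ascending part number.

```python
# Toy multipart-upload lifecycle: create -> upload parts -> complete/abort.
import uuid

class MultipartStore:
    def __init__(self):
        self.uploads = {}   # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> assembled bytes

    def create(self, key):
        upload_id = uuid.uuid4().hex   # "Returns upload ID"
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        self.uploads[upload_id][part_number] = data

    def complete(self, key, upload_id):
        parts = self.uploads.pop(upload_id)
        # Parts are concatenated in ascending part-number order.
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort(self, upload_id):
        self.uploads.pop(upload_id, None)  # clean up incomplete upload

store = MultipartStore()
uid = store.create("models/gpt.bin")
store.upload_part(uid, 2, b"world")      # out-of-order arrival is fine
store.upload_part(uid, 1, b"hello ")
store.complete("models/gpt.bin", uid)
print(store.objects["models/gpt.bin"])   # b'hello world'
```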
Versioning
| Operation | S3 API | Notes |
|---|---|---|
| Get object version | GET /{bucket}/{key}?versionId={v} | Specific version retrieval |
| List object versions | GET /{bucket}?versions | Version listing |
| Delete object version | DELETE /{bucket}/{key}?versionId={v} | Delete specific version |
Conditional Operations
| Header | Direction | Notes |
|---|---|---|
| If-None-Match | Write | Conditional write (create-if-not-exists) |
| If-Match | Write | Conditional write (update-if-matches) |
| If-Modified-Since | Read | Conditional read |
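The write-side semantics in the table can be sketched with a dict-backed store and MD5 ETags. This is an illustrative helper, not the gateway's code:

```python
# Conditional-write semantics: If-None-Match: * (create-if-not-exists)
# and If-Match: <etag> (update-if-matches).
import hashlib

class ConditionFailed(Exception):
    pass

def etag(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def conditional_put(store, key, data, if_none_match=None, if_match=None):
    if if_none_match == "*" and key in store:
        raise ConditionFailed("object already exists")   # create-if-not-exists
    if if_match is not None and etag(store.get(key, b"")) != if_match:
        raise ConditionFailed("ETag mismatch")           # update-if-matches
    store[key] = data
    return etag(data)

store = {}
tag = conditional_put(store, "cfg", b"v1", if_none_match="*")  # creates
try:
    conditional_put(store, "cfg", b"v2", if_none_match="*")    # rejected
except ConditionFailed as e:
    print(e)  # object already exists
conditional_put(store, "cfg", b"v2", if_match=tag)             # succeeds
```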
Examples
aws-cli
# Set up environment
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1
ENDPOINT="--endpoint-url http://localhost:9000"
# Bucket operations
aws $ENDPOINT s3 mb s3://datasets
aws $ENDPOINT s3 ls
# Upload a directory
aws $ENDPOINT s3 sync ./training-data/ s3://datasets/imagenet/
# Download a file
aws $ENDPOINT s3 cp s3://datasets/imagenet/train.tar /tmp/train.tar
# Multipart upload (automatic for large files)
aws $ENDPOINT s3 cp ./large-model.bin s3://datasets/models/gpt.bin
# List objects with prefix
aws $ENDPOINT s3 ls s3://datasets/imagenet/ --recursive
# Delete
aws $ENDPOINT s3 rm s3://datasets/imagenet/train.tar
curl
# Create a bucket
curl -X PUT http://localhost:9000/my-bucket
# PUT an object
curl -X PUT http://localhost:9000/my-bucket/config.json \
-H "Content-Type: application/json" \
-d '{"epochs": 100, "batch_size": 32}'
# GET an object
curl -s http://localhost:9000/my-bucket/config.json
# HEAD an object (metadata only)
curl -I http://localhost:9000/my-bucket/config.json
# Byte-range read (first 1024 bytes)
curl -s http://localhost:9000/my-bucket/large-file.bin \
-H "Range: bytes=0-1023"
# DELETE an object
curl -X DELETE http://localhost:9000/my-bucket/config.json
# List objects (ListObjectsV2)
curl -s "http://localhost:9000/my-bucket?list-type=2&prefix=models/"
# Delete a bucket
curl -X DELETE http://localhost:9000/my-bucket
Python (boto3)
import boto3
s3 = boto3.client(
"s3",
endpoint_url="http://localhost:9000",
aws_access_key_id="kiseki",
aws_secret_access_key="kiseki",
region_name="us-east-1",
)
# Create bucket
s3.create_bucket(Bucket="training")
# Upload
s3.put_object(Bucket="training", Key="data.csv", Body=b"col1,col2\n1,2\n")
# Download
obj = s3.get_object(Bucket="training", Key="data.csv")
print(obj["Body"].read().decode())
# List
for item in s3.list_objects_v2(Bucket="training")["Contents"]:
print(item["Key"], item["Size"])
Bucket-to-Namespace Mapping
Every S3 bucket maps 1:1 to a Kiseki namespace within the authenticated tenant’s scope. Bucket names become namespace identifiers. Buckets from different tenants are fully isolated – two tenants can have buckets with the same name without conflict.
Objects within a bucket map to Kiseki compositions. Each object version corresponds to a sequence of deltas in the shard that owns the namespace.
Encryption Handling
Kiseki always encrypts all data (invariant I-K1). S3 server-side encryption headers are handled as follows:
| Header | Behavior |
|---|---|
| SSE-S3 (x-amz-server-side-encryption: AES256) | Acknowledged, no-op. System encryption is always on. |
| SSE-KMS with matching ARN | Acknowledged if the ARN matches the tenant KMS config. |
| SSE-KMS with different ARN | Rejected. Tenants cannot specify arbitrary keys. |
| SSE-C (x-amz-server-side-encryption-customer-*) | Rejected. Kiseki manages encryption, not the client. |
Limitations
The following S3 features are not implemented:
| Feature | Reason |
|---|---|
| Lifecycle policies | Kiseki has its own tiering and retention model |
| Event notifications (SNS/SQS) | Requires message bus integration |
| Presigned URLs | Planned for future release |
| Bucket policies / IAM | Kiseki uses its own IAM and policy model |
| CORS | Not relevant for HPC/AI workloads |
| Object Lock | Covered by Kiseki’s retention hold mechanism |
| S3 Select | Out of scope |
| Replication configuration | Kiseki manages replication internally |
| Storage classes | Kiseki uses affinity pools, not S3 storage classes |
NFS Access
Kiseki exposes an NFS gateway on port 2049 (configurable via
KISEKI_NFS_ADDR) supporting both NFSv3 and NFSv4.2. The gateway
translates NFS operations into reads and writes against materialized
views and the composition log.
Protocol Support
| Protocol | Status | Notes |
|---|---|---|
| NFSv3 | Supported | Stateless, lower overhead |
| NFSv4.2 | Supported | Stateful, with lock support and extended attributes |
Mounting
Basic Mount
mount -t nfs <node>:/ /mnt/kiseki
With explicit version and options:
# NFSv4.2
mount -t nfs -o vers=4.2,proto=tcp <node>:/ /mnt/kiseki
# NFSv3
mount -t nfs -o vers=3,proto=tcp <node>:/ /mnt/kiseki
Docker Compose (Development)
When using the development Docker Compose stack, the NFS port is published to the host:
mount -t nfs -o vers=4.2,proto=tcp,port=2049 127.0.0.1:/ /mnt/kiseki
fstab Entry
<node>:/ /mnt/kiseki nfs vers=4.2,proto=tcp,hard,intr 0 0
Authentication
| Mode | Use case | Notes |
|---|---|---|
| AUTH_SYS | Development and testing | UID/GID-based, no Kerberos |
| Kerberos (RPCSEC_GSS) | Production | krb5, krb5i, or krb5p security flavors |
In development (Docker Compose), AUTH_SYS is used with no additional configuration. For production deployments, Kerberos provides authentication and optional integrity/privacy protection on the wire.
Kiseki always encrypts data at rest regardless of the NFS authentication mode. The gateway performs tenant-layer encryption: clients send plaintext over TLS to the gateway, and the gateway encrypts before writing to the log and chunk store.
Supported Operations
Full Semantics
| Operation | Notes |
|---|---|
| open, close, read, write | Standard file I/O |
| create, unlink | File creation and deletion |
| mkdir, rmdir | Directory creation and deletion |
| rename (within namespace) | Atomic within shard |
| stat, fstat, lstat | File metadata |
| chmod, chown | Permission changes (stored in delta attributes) |
| readdir, readdirplus | Directory listing from materialized view |
| symlink, readlink | Stored as inline data in delta |
| truncate, ftruncate | Composition resize |
| fsync, fdatasync | Flush to durable (delta committed to Raft quorum) |
| Extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr |
| POSIX file locks (fcntl) | Per-gateway lock state |
| O_APPEND | Atomic append via delta |
| O_CREAT, O_EXCL | Atomic create-if-not-exists |
Limited Semantics
| Operation | Limitation |
|---|---|
| rename (cross-namespace) | Returns EXDEV – cannot rename across shards |
| Hard links | Within namespace only; cross-namespace returns EXDEV |
| Sparse files | Holes tracked in composition; zero-fill on read |
| O_DIRECT | Bypasses client cache but still traverses the gateway |
| flock (advisory) | Best-effort; not guaranteed across gateway failover |
Not Supported
| Operation | Reason |
|---|---|
| Writable shared mmap | Distributed shared writable mmap requires page-level coherence that is not tractable at HPC scale. Read-only mmap is supported. The gateway returns ENOTSUP. See ADR-013. |
| POSIX ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without benefit for the target workloads. |
Namespace Mapping
The NFS root (/) lists the tenant’s namespaces as top-level
directories. Each namespace contains the compositions (files and
directories) belonging to that namespace. This is analogous to the S3
bucket mapping – the same namespace appears as a bucket via S3 and as a
top-level directory via NFS.
/mnt/kiseki/
training/ <- namespace "training"
imagenet/
train.tar
val.tar
checkpoints/ <- namespace "checkpoints"
epoch-001.pt
Performance Considerations
- **Readdir performance** – directory listings are served from materialized views, not reconstructed from the log on each request. Views are updated incrementally by stream processors.
- **Write path** – writes flow through the gateway to the composition context, which appends deltas to the shard log. An fsync ensures the delta is committed to a Raft quorum before returning.
- **Concurrent access** – multiple NFS clients can read the same files concurrently. Write contention within a shard is serialized by the Raft leader.
- **Large files** – large files are chunked using content-defined chunking (Rabin fingerprinting). Byte-range reads are served by fetching only the relevant chunks.
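To illustrate why content-defined chunking keeps byte-range reads cheap and chunk boundaries stable under insertions, here is a minimal sketch. Kiseki uses Rabin fingerprinting; the toy rolling hash, mask, and size parameters below are illustrative only, not the real algorithm or configuration.

```python
# Minimal content-defined chunking sketch: a rolling hash is tested
# against a boundary mask; a chunk ends where the masked hash is zero,
# subject to minimum and maximum chunk sizes.
import random

def cdc_chunks(data: bytes, mask=0x3F, min_size=32, max_size=4096):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy rolling hash, not Rabin
        size = i - start + 1
        if size >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])   # content-derived boundary
            start, h = i + 1, 0
        elif size >= max_size:
            chunks.append(data[start:i + 1])   # forced boundary
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)  # deterministic pseudo-random test data
data = bytes(random.getrandbits(8) for _ in range(20_000))
chunks = cdc_chunks(data)
assert b"".join(chunks) == data   # chunking is lossless
print(len(chunks), max(len(c) for c in chunks))
```

Because boundaries depend only on local content, inserting bytes early in a file shifts at most the chunks around the edit; later chunks keep their identity, which is what makes content-addressed storage and caching effective.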
Limitations Summary
- **No writable shared mmap** – applications that use writable shared memory-mapped files must use write() instead. Read-only mmap works and is useful for model loading.
- **Cross-namespace rename returns EXDEV** – renaming a file from one namespace to another requires a copy-and-delete at the application level, the same as moving files across filesystem boundaries on a traditional system.
- **No POSIX ACLs** – only standard Unix permissions (mode bits). Fine-grained access control is handled by Kiseki’s tenant IAM model, not filesystem-level ACLs.
- **Lock state is per-gateway** – POSIX file locks (fcntl) are maintained by the gateway instance. If a gateway fails over, lock state is lost. Advisory locks (flock) are best-effort.
FUSE Mount
The Kiseki native client provides a FUSE mount that exposes the distributed storage as a local filesystem on compute nodes. Unlike the NFS gateway, the FUSE client runs in the workload’s process space and performs client-side encryption – plaintext never leaves the process.
Building
The FUSE mount is feature-gated. Build the client binary with the fuse
feature:
cargo build --release --bin kiseki-client-fuse --features fuse
This requires the fuser crate, which depends on the FUSE kernel module
being available on the host:
- Linux: install fuse3 or libfuse3-dev
- macOS: install macFUSE
Mounting
kiseki-client-fuse mount /mnt/kiseki \
--data-addr <storage-node>:9100 \
--tenant <tenant-id> \
--namespace <namespace-id>
Mount Options
Options are passed with -o:
kiseki-client-fuse mount /mnt/kiseki \
-o cache_mode=organic \
-o cache_dir=/local-nvme/kiseki-cache \
-o cache_l2_max=100G \
-o meta_ttl_ms=5000
| Option | Values | Default | Description |
|---|---|---|---|
| cache_mode | pinned, organic, bypass | organic | Cache operating mode (see Client Cache) |
| cache_dir | path | /tmp/kiseki-cache | L2 NVMe cache directory |
| cache_l1_max | bytes | 256M | L1 (in-memory) cache size |
| cache_l2_max | bytes | 50G | L2 (NVMe) cache size per process |
| meta_ttl_ms | milliseconds | 5000 | Metadata cache TTL |
Environment Variables
Mount options can also be set via environment variables. Mount options take priority over environment variables.
| Variable | Equivalent option |
|---|---|
| KISEKI_CACHE_MODE | cache_mode |
| KISEKI_CACHE_DIR | cache_dir |
| KISEKI_CACHE_L1_MAX | cache_l1_max |
| KISEKI_CACHE_L2_MAX | cache_l2_max |
| KISEKI_CACHE_META_TTL_MS | meta_ttl_ms |
| KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (see staging handoff) |
Supported Operations
Read/Write
| Operation | Supported | Notes |
|---|---|---|
| read | Yes | Served from cache (L1 -> L2 -> canonical) |
| write | Yes | Writes to canonical; local metadata cache updated immediately |
| open / close | Yes | Standard file handles |
| fsync / fdatasync | Yes | Flushes delta to Raft quorum |
| truncate / ftruncate | Yes | Composition resize |
| O_APPEND | Yes | Atomic append via delta |
| O_CREAT / O_EXCL | Yes | Atomic create-if-not-exists |
| O_DIRECT | Limited | Bypasses client cache, still goes through FUSE |
Directory Operations
| Operation | Supported | Notes |
|---|---|---|
| mkdir / rmdir | Yes | Create and remove directories |
| readdir / readdirplus | Yes | Listing from materialized view |
| rename (within namespace) | Yes | Atomic within shard |
| rename (cross-namespace) | No | Returns EXDEV |
Metadata and Links
| Operation | Supported | Notes |
|---|---|---|
| stat / fstat / lstat | Yes | File metadata |
| chmod / chown | Yes | Stored in delta attributes |
| symlink / readlink | Yes | Symlink targets stored as inline data |
| Hard links (within namespace) | Yes | |
| Hard links (cross-namespace) | No | Returns EXDEV |
| xattr operations | Yes | getxattr, setxattr, listxattr, removexattr |
Nested Directories and Write-at-Offset
The FUSE filesystem supports full directory trees within a namespace. Files can be created in nested directories, and writes at arbitrary offsets within a file are supported (the composition tracks chunk references and handles sparse regions with zero-fill).
mkdir -p /mnt/kiseki/experiments/run-42/logs
echo "epoch 1 loss: 0.3" > /mnt/kiseki/experiments/run-42/logs/train.log
# Write at offset (sparse file)
dd if=/dev/zero of=/mnt/kiseki/data/sparse.bin bs=1 count=1 seek=1048576
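The hole-handling described above can be sketched in a few lines: a composition maps extents of written data to offsets, and any read that falls in a gap comes back as zeros. Illustrative only, not the real composition structure.

```python
# Zero-fill for sparse regions: reads inside holes return zeros.
def sparse_read(extents, offset, length):
    """extents: sorted, non-overlapping list of (start_offset, bytes)."""
    out = bytearray(length)  # holes read back as zeros by default
    for start, data in extents:
        lo = max(start, offset)
        hi = min(start + len(data), offset + length)
        if lo < hi:  # extent overlaps the requested range
            out[lo - offset:hi - offset] = data[lo - start:hi - start]
    return bytes(out)

# A file with a header at offset 0 and one byte written at 1 MiB.
extents = [(0, b"header"), (1048576, b"\x01")]
print(sparse_read(extents, 0, 6))    # b'header'
print(sparse_read(extents, 100, 4))  # b'\x00\x00\x00\x00' (a hole)
```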
Not Supported
| Operation | Reason |
|---|---|
| Writable shared mmap | Returns ENOTSUP. Read-only mmap works. Use write() instead. (ADR-013) |
| POSIX ACLs | Unix permissions only (uid/gid/mode) |
Cache Mode Selection
The cache mode determines how aggressively the client caches data on local storage. Choose the mode that matches your workload:
| Mode | Best for | Behavior |
|---|---|---|
| pinned | Training (epoch reuse), inference (model weights) | Chunks retained until explicit release. Populate via staging API. |
| organic | Mixed workloads, interactive use | LRU eviction with usage-weighted retention. Default. |
| bypass | Streaming ingest, checkpoint writes, one-shot scans | No caching. All reads go directly to canonical storage. |
# Training job: pin the dataset
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=pinned
# Interactive exploration
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=organic
# Checkpoint writer
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=bypass
See Client Cache & Staging for staging pre-fetch, Slurm integration, and policy configuration.
Transport Selection
The native client automatically selects the fastest available transport to reach storage nodes:
- libfabric/CXI (Slingshot) – if available on the fabric
- RDMA verbs – if InfiniBand/RoCE is available
- TCP+TLS – universal fallback
Transport selection is automatic and requires no configuration. The client discovers available transports during fabric discovery at startup (ADR-008).
Unmounting
fusermount -u /mnt/kiseki # Linux
umount /mnt/kiseki # macOS
On clean unmount, the L2 cache pool is wiped (all chunk files are
zeroized and deleted). On crash, the orphaned cache pool is cleaned up
by the next client process or by the kiseki-cache-scrub service.
Python SDK
Kiseki provides Python bindings via PyO3, exposing
the native client’s cache, staging, and workflow advisory APIs to Python
workloads. The bindings are part of the kiseki-client crate, enabled
with the python feature flag.
Building
Build and install the Python module using maturin:
pip install maturin
maturin develop --features python
This builds the native Rust code and installs the kiseki module into
the active Python environment.
For a release build:
maturin build --release --features python
pip install target/wheels/kiseki-*.whl
Quick Start
import kiseki
# Create a client with organic caching (default)
client = kiseki.Client(cache_mode="organic", cache_dir="/tmp/kiseki-cache")
# Stage a dataset into the local cache
client.stage("/training/imagenet")
# ... workload reads via FUSE or native API ...
# Check cache statistics
stats = client.cache_stats()
print(stats)
# CacheStats(l1_hits=42, l2_hits=1500, misses=200, l1_bytes=134217728, l2_bytes=5368709120, wipes=0)
# Release the staged dataset
client.release("/training/imagenet")
# Clean up
client.close()
API Reference
kiseki.Client
The main entry point. Each Client instance manages its own cache pool
(L1 in-memory + L2 NVMe) and advisory session.
Constructor
client = kiseki.Client(
cache_mode="organic", # "pinned", "organic", or "bypass"
cache_dir="/tmp/kiseki-cache", # L2 NVMe cache directory
cache_l2_max=50 * 1024**3, # L2 max bytes (default: 50 GB)
meta_ttl_ms=5000, # Metadata TTL in ms (default: 5000)
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_mode | str | "organic" | Cache mode: "pinned", "organic", or "bypass" |
| cache_dir | str | "/tmp/kiseki-cache" | Directory for L2 NVMe cache files |
| cache_l2_max | int or None | None (50 GB) | Maximum L2 cache size in bytes |
| meta_ttl_ms | int | 5000 | Metadata cache TTL in milliseconds |
stage(namespace_path: str) -> None
Pre-fetch a dataset’s chunks into the local cache with pinned retention.
The dataset is identified by its namespace path (e.g.,
"/training/imagenet"). Staging is idempotent – re-staging an
already-staged dataset is a no-op.
client.stage("/training/imagenet")
client.stage("/training/imagenet") # no-op, already staged
For directory paths, staging recursively enumerates all files up to a depth of 10 and a maximum of 100,000 files.
stage_status() -> list[str]
Return the namespace paths of all currently staged datasets.
paths = client.stage_status()
# ["/training/imagenet", "/models/gpt-3"]
release(namespace_path: str) -> None
Release a staged dataset, unpinning its chunks and making them eligible for eviction.
client.release("/training/imagenet")
release_all() -> None
Release all staged datasets.
client.release_all()
cache_stats() -> CacheStatsView
Return current cache statistics.
stats = client.cache_stats()
print(f"L1 hits: {stats.l1_hits}")
print(f"L2 hits: {stats.l2_hits}")
print(f"Misses: {stats.misses}")
print(f"L1 used: {stats.l1_bytes / 1024**2:.0f} MB")
print(f"L2 used: {stats.l2_bytes / 1024**3:.1f} GB")
print(f"Wipes: {stats.wipes}")
cache_mode() -> str
Return the current cache mode as a string.
print(client.cache_mode()) # "organic"
declare_workflow() -> int
Declare a new workflow for advisory integration. Returns a workflow ID (128-bit integer) that can be used to correlate operations with the advisory channel for telemetry feedback.
wf_id = client.declare_workflow()
# ... run training epochs ...
client.end_workflow(wf_id)
end_workflow(workflow_id: int) -> None
End a previously declared workflow.
wipe() -> None
Immediately wipe the entire cache (L1 + L2). All cached plaintext is zeroized before deletion.
close() -> None
Wipe the cache and release resources. Call this when the workload is
done. Equivalent to wipe().
kiseki.CacheStatsView
Read-only statistics object returned by cache_stats().
| Attribute | Type | Description |
|---|---|---|
| l1_hits | int | Number of L1 (memory) cache hits |
| l2_hits | int | Number of L2 (NVMe) cache hits |
| misses | int | Number of cache misses (fetched from canonical) |
| l1_bytes | int | Current L1 memory usage in bytes |
| l2_bytes | int | Current L2 disk usage in bytes |
| wipes | int | Number of full cache wipes |
Example: Training Workflow
import kiseki
def train():
# Pin the dataset for the duration of training
client = kiseki.Client(cache_mode="pinned", cache_dir="/local-nvme/cache")
# Pre-stage the dataset (ideally done in Slurm prolog)
client.stage("/training/imagenet-22k")
# Declare a workflow for advisory telemetry
wf_id = client.declare_workflow()
try:
for epoch in range(100):
# Dataset reads hit L2 cache after first epoch
# ... training loop reads from /mnt/kiseki/training/imagenet-22k/ ...
pass
stats = client.cache_stats()
hits = stats.l1_hits + stats.l2_hits
print(f"Cache hit rate: {hits / (hits + stats.misses) * 100:.1f}%")
finally:
client.end_workflow(wf_id)
client.release_all()
client.close()
if __name__ == "__main__":
train()
Example: Inference with Organic Caching
import kiseki
client = kiseki.Client(cache_mode="organic", cache_l2_max=20 * 1024**3)
# Model weights are cached on first load, then served from L2
# Prompt data is cached with LRU eviction
wf_id = client.declare_workflow()
try:
# ... inference serving loop ...
pass
finally:
client.end_workflow(wf_id)
client.close()
Example: Checkpoint Writer (No Caching)
import kiseki
# Bypass mode: checkpoint writes go straight to canonical
client = kiseki.Client(cache_mode="bypass")
# ... write checkpoints to /mnt/kiseki/checkpoints/ ...
client.close()
Environment Variable Overrides
The Python client respects the same environment variables as the FUSE mount and CLI:
| Variable | Description |
|---|---|
| KISEKI_CACHE_MODE | Override cache mode |
| KISEKI_CACHE_DIR | Override cache directory |
| KISEKI_CACHE_L1_MAX | Override L1 max bytes |
| KISEKI_CACHE_L2_MAX | Override L2 max bytes |
| KISEKI_CACHE_META_TTL_MS | Override metadata TTL |
| KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (staging handoff) |
Constructor parameters take priority over environment variables. All client-set values are clamped to the effective policy ceilings set by tenant and cluster administrators.
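The resolution order can be sketched as a small helper: explicit parameter beats environment variable beats default, and the result is clamped to the administrator's policy ceiling. The function name and shape are illustrative, not the client's actual internals.

```python
# Config precedence sketch: parameter > env var > default, then clamp
# to the effective policy ceiling (ceilings always win).
import os

def resolve_l2_max(param=None, default=50 * 1024**3, ceiling=None):
    value = param
    if value is None:
        env = os.environ.get("KISEKI_CACHE_L2_MAX")
        value = int(env) if env else default
    if ceiling is not None:
        value = min(value, ceiling)  # client-set values never exceed policy
    return value

os.environ["KISEKI_CACHE_L2_MAX"] = str(100 * 1024**3)
print(resolve_l2_max())                      # env var overrides the default
print(resolve_l2_max(param=20 * 1024**3))    # explicit parameter wins
print(resolve_l2_max(ceiling=10 * 1024**3))  # clamped to the policy ceiling
```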
Client Cache & Staging
The client-side cache (ADR-031) eliminates repeated data transfers
across the storage fabric by caching decrypted plaintext chunks on
compute-node local NVMe. It is a library-level module in
kiseki-client, shared across all access modes: FUSE, FFI, Python, and
native Rust.
Architecture
canonical (fabric) -> decrypt -> cache store (NVMe) -> serve to caller
^
cache hit path (no fabric, no decrypt)
Two-Tier Storage
| Tier | Backing | Capacity | Purpose |
|---|---|---|---|
| L1 (Hot) | In-memory HashMap | 256 MB default | Sub-microsecond hits for active working set |
| L2 (Warm) | Local NVMe files | 50 GB default | Large capacity for datasets and model weights |
Read path: L1 -> L2 (with CRC32 verification) -> canonical (decrypt + SHA-256 verify + store in L1/L2).
L2 files are organized per-process with isolated cache pools:
$KISEKI_CACHE_DIR/
<tenant_id_hex>/
<pool_id>/ <- per-process pool (128-bit CSPRNG)
chunks/
<prefix>/
<chunk_id_hex> <- plaintext + CRC32 trailer
meta/
file_chunks.db
staging/
<dataset_id>.manifest
pool.lock <- flock proves process is alive
Each client process creates its own pool directory. Multiple concurrent same-tenant processes on the same node have fully independent pools with no contention.
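The chunk-file format in the tree above (plaintext plus a CRC32 trailer, sharded across prefix directories, owner-only permissions) can be sketched with the standard library. The path helpers mirror the layout loosely; this is not the client's actual on-disk code.

```python
# L2 cache chunk file sketch: plaintext followed by a 4-byte CRC32
# trailer, verified on every read (the "L2 with CRC32 verification" step).
import os
import struct
import tempfile
import zlib

def chunk_path(cache_dir, tenant_id, pool_id, chunk_id):
    prefix = chunk_id[:2]  # shard chunk files across prefix directories
    return os.path.join(cache_dir, tenant_id, pool_id, "chunks", prefix, chunk_id)

def write_chunk(path, plaintext: bytes):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    crc = zlib.crc32(plaintext)
    with open(path, "wb") as f:
        f.write(plaintext + struct.pack("<I", crc))
    os.chmod(path, 0o600)  # owner-only, per the security model

def read_chunk(path) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    plaintext, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    if zlib.crc32(plaintext) != crc:
        raise IOError("CRC32 mismatch: corrupt cache entry")
    return plaintext

with tempfile.TemporaryDirectory() as d:
    p = chunk_path(d, "74656e", "pool01", "deadbeef")
    write_chunk(p, b"decrypted chunk bytes")
    print(read_chunk(p))  # b'decrypted chunk bytes'
```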
Security Model
The cache stores decrypted plaintext on local NVMe. This is acceptable because:
- The compute node already holds decrypted data in process memory (computation requires plaintext)
- L2 NVMe is local to the compute node, same trust domain as process memory
- L2 is ephemeral – wiped on process exit and on long disconnect
- All cached data is overwritten with zeros (`zeroize`) before deallocation or eviction
- File permissions are `0600`, owned by the process UID
- Orphaned pools from crashes are cleaned by the `kiseki-cache-scrub` service
Cache Modes
Three modes are available, selected per client instance at session establishment.
Pinned Mode
For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate simulations (boundary conditions).
- Chunks are retained against eviction until explicit `release()`
- Populated via the staging API or on first access
- Staging captures a point-in-time snapshot; canonical updates do not invalidate pinned data
- Capacity bounded by `max_cache_bytes`; staging beyond capacity returns `CacheCapacityExceeded`
Organic Mode
Default for mixed workloads. LRU with usage-weighted retention.
- Chunks cached on first read, evicted when capacity is reached
- Frequently accessed chunks promoted to L1
- L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
- Metadata cache with configurable TTL (default 5 seconds)
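The usage-weighted eviction rule ("accessed N times survives N rounds") can be modeled as a round-based weight decrement (an illustrative model, not the client's actual data structures):

```python
def eviction_round(cache, pinned):
    """One usage-weighted eviction round (sketch): walk chunks from
    least- to most-recently used, decrement each candidate's access
    weight, and evict the first chunk whose weight is exhausted.
    Pinned chunks are never candidates."""
    for chunk_id in sorted(cache, key=lambda c: cache[c]["last_access"]):
        if chunk_id in pinned:
            continue
        cache[chunk_id]["weight"] -= 1
        if cache[chunk_id]["weight"] <= 0:
            del cache[chunk_id]
            return chunk_id
    return None

# "a" was read once, "b" three times: "b" survives three rounds.
cache = {"a": {"last_access": 1, "weight": 1},
         "b": {"last_access": 2, "weight": 3}}
```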
Bypass Mode
For workloads that do not benefit from caching: streaming ingest, one-shot scans, checkpoint writes.
- All reads go directly to canonical
- No L1 or L2 storage consumed
- Zero overhead beyond mode selection
Staging API
Client-local operation for pre-populating the cache in pinned mode. Pull-based – the client fetches from canonical.
CLI
# Stage a dataset
kiseki-client stage --dataset /training/imagenet
# Stage in daemon mode (for Slurm prolog)
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
# Check staging status
kiseki-client stage --status
# Release a dataset
kiseki-client stage --release /training/imagenet
# Release all
kiseki-client stage --release-all
Rust API
let result = cache_manager.stage("/training/imagenet").await?;
let datasets = cache_manager.stage_status();
cache_manager.release("/training/imagenet");
cache_manager.release_all();
Python API
client.stage("/training/imagenet")
paths = client.stage_status()
client.release("/training/imagenet")
client.release_all()
C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs);
kiseki_stage_status(handle, &status);
kiseki_release(handle, "/training/imagenet");
Staging Flow
- Resolve `namespace_path` to compositions via canonical. For directory paths, recursively enumerate all files up to `max_staging_depth` (10) and `max_staging_files` (100,000).
- Extract the full chunk list from all resolved compositions.
- For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention.
- Write a staging manifest listing all compositions and chunk IDs.
- Report progress (chunks staged / total, bytes, elapsed).
Staging is idempotent – re-staging an already-staged dataset is a no-op. Partial staging (interrupted) can be resumed by re-running the command.
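The idempotency and resumability follow from the skip-if-cached check in step 3; a minimal model (hypothetical `stage` helper, with a dict standing in for L2):

```python
import hashlib

def stage(chunk_ids, l2, fetch_and_decrypt):
    """Staging sketch: fetch only chunks absent from L2, verify the
    SHA-256 content address, store with pinned retention. Re-running
    over an already-staged set is a no-op."""
    staged = 0
    for cid in chunk_ids:
        if cid in l2:                  # resume / idempotency check
            continue
        data = fetch_and_decrypt(cid)
        assert hashlib.sha256(data).hexdigest() == cid, "content-address mismatch"
        l2[cid] = {"data": data, "pinned": True}
        staged += 1
    return staged

table = {hashlib.sha256(p).hexdigest(): p for p in (b"x", b"y")}
fetched = []
def fetch(cid):
    fetched.append(cid)
    return table[cid]

l2 = {}
first = stage(list(table), l2, fetch)    # stages both chunks
second = stage(list(table), l2, fetch)   # no-op
```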
Slurm Integration
Staging Handoff
The staging CLI creates a cache pool and holds its pool.lock flock.
The workload process adopts the pool instead of creating a new one:
- Prolog: the staging CLI fetches chunks in daemon mode and outputs `pool_id`.
- Workload: sets `KISEKI_CACHE_POOL_ID=<pool_id>`, starts, adopts the existing pool, and takes over the flock.
- Staging daemon: detects the flock loss and exits cleanly.
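The handoff can be demonstrated with Python's fcntl in a single process (the release/re-acquire choreography below is illustrative only; the real daemon and workload coordinate across processes):

```python
import fcntl
import os
import tempfile

pool = tempfile.mkdtemp()
lock_path = os.path.join(pool, "pool.lock")

daemon_fd = open(lock_path, "w")
fcntl.flock(daemon_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)    # 1. prolog daemon stages, holds the lock

workload_fd = open(lock_path, "w")
fcntl.flock(daemon_fd, fcntl.LOCK_UN)                    # 2. daemon yields for adoption
fcntl.flock(workload_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # 3. workload adopts the pool

try:                                                     # 4. daemon observes flock loss
    fcntl.flock(daemon_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    daemon_still_owner = True
except OSError:
    daemon_still_owner = False                           # -> daemon exits cleanly
```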
Prolog Script
#!/bin/bash
# prolog.sh -- run before the job starts
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE
Epilog Script
#!/bin/bash
# epilog.sh -- run after the job completes
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID
Lattice Integration
Lattice injects KISEKI_CACHE_POOL_ID into the workload environment
after parallel staging completes across the node set. It queries
stage --status to verify readiness before launching the workload.
Policy Hierarchy
Cache policy follows the same distribution mechanism as quotas, using
the existing TenantConfig structure.
cluster default -> org override -> project override -> workload override
-> session selection
Each level narrows (never broadens) the parent’s settings.
Policy Attributes
| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache FS |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |
Policy Resolution
At session establishment, the client resolves its effective policy through multiple paths:
- Primary: `GetCachePolicy` RPC on the data-path gRPC channel to any storage node. No gateway or control plane access required.
- Secondary: the gateway’s locally cached `TenantConfig`.
- Stale tolerance: last-known policy persisted in the L2 pool directory (`policy.json`).
- Fallback: conservative defaults (organic mode, 10 GB max, 5 s TTL).
Policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment.
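The resolution chain amounts to "first source that answers wins, else conservative defaults" (a sketch; `resolve_policy` and the source callables are hypothetical):

```python
def resolve_policy(sources):
    """Try each policy source in documented priority order; fall back
    to conservative defaults if none answers."""
    for fetch in sources:
        try:
            policy = fetch()
            if policy is not None:
                return policy
        except ConnectionError:
            continue                         # try the next source
    return {"mode": "organic",               # conservative defaults
            "max_cache_bytes": 10 * 1024**3,
            "metadata_ttl_ms": 5000}

def rpc_down():
    raise ConnectionError("no storage node reachable")

# GetCachePolicy RPC down, gateway cache empty, no persisted policy:
policy = resolve_policy([rpc_down, lambda: None, rpc_down])
```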
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| KISEKI_CACHE_MODE | Cache mode | organic |
| KISEKI_CACHE_DIR | L2 cache directory | /tmp/kiseki-cache |
| KISEKI_CACHE_L1_MAX | L1 memory max bytes | 256 MB |
| KISEKI_CACHE_L2_MAX | L2 NVMe max bytes | 50 GB |
| KISEKI_CACHE_META_TTL_MS | Metadata TTL (ms) | 5000 |
| KISEKI_CACHE_POOL_ID | Adopt existing pool | (none) |
Mount Options (FUSE)
kiseki-client-fuse mount /mnt/kiseki \
-o cache_mode=pinned \
-o cache_dir=/local-nvme/kiseki \
-o cache_l2_max=100G
API (Rust)
let config = CacheConfig {
    mode: CacheMode::Pinned,
    cache_dir: PathBuf::from("/local-nvme/kiseki"),
    max_cache_bytes: 100 * 1024 * 1024 * 1024,
    metadata_ttl: Duration::from_secs(5),
    ..CacheConfig::default()
};
API (Python)
client = kiseki.Client(
cache_mode="pinned",
cache_dir="/local-nvme/kiseki",
cache_l2_max=100 * 1024**3,
)
Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings.
Cache Invalidation
Metadata
TTL-based only. No push invalidation from canonical. The metadata TTL (default 5 seconds) is the sole freshness mechanism and the upper bound on read staleness.
Write-through: when the client writes a file, the local metadata cache is updated immediately, providing read-your-writes consistency within a single process.
Crypto-Shred
When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths:
- Periodic key health check (default every 30 seconds) – primary.
- Advisory channel notification – fast path, best-effort.
- KMS error on next operation – tertiary.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30 seconds by default.
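With the defaults above, the worst-case detection latency works out as:

```python
def max_shred_detection_s(key_health_interval_s, max_disconnect_s):
    """Worst-case crypto-shred detection latency: bounded by whichever
    fires first, the periodic key health check or the disconnect wipe."""
    return min(key_health_interval_s, max_disconnect_s)

latency = max_shred_detection_s(30, 300)  # defaults: the 30 s health check wins
```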
Disconnect
If the client cannot reach any canonical endpoint for
max_disconnect_seconds (default 300 seconds), the entire cache is
wiped. Background heartbeat RPCs (every 60 seconds) maintain the
disconnect timer.
Capacity Management
| Limit | Scope | Default | Enforcement |
|---|---|---|---|
| max_memory_bytes (L1) | Per-process | 256 MB | Strict LRU eviction |
| max_cache_bytes (L2) | Per-process | 50 GB | LRU (organic), reject (pinned) |
| max_node_cache_bytes | Per-node | 80% of cache FS | Cooperative check before L2 insert |
| Disk pressure backstop | Per-node | 90% utilization | Hard backstop |
Pinned chunks are never evicted by organic LRU. Organic eviction considers only non-pinned chunks.
Crash Recovery
- On process start: the client scans for orphaned cache pools (those whose `pool.lock` has no live `flock` holder), zeroizes their contents, and deletes them.
- `kiseki-cache-scrub` service: a systemd one-shot (or cron job) that runs on node boot and every 60 seconds, covering the case where no subsequent Kiseki process starts on the node after a crash.
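The orphan check reduces to "can a stranger take the flock?" (a sketch of the scrub logic; `find_orphaned_pools` is a hypothetical helper):

```python
import fcntl
import os
import tempfile

def find_orphaned_pools(tenant_dir):
    """A pool is orphaned when nothing holds the flock on its
    pool.lock: a non-blocking acquire succeeds only if the owning
    process is gone."""
    orphans = []
    for pool_id in os.listdir(tenant_dir):
        lock_path = os.path.join(tenant_dir, pool_id, "pool.lock")
        if not os.path.exists(lock_path):
            continue
        fd = open(lock_path, "w")
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            orphans.append(pool_id)        # lock free -> no live owner
            fcntl.flock(fd, fcntl.LOCK_UN)
        except OSError:
            pass                           # held by a live process
        finally:
            fd.close()
    return orphans

tenant_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(tenant_dir, "live"))       # a live pool: lock held
live_fd = open(os.path.join(tenant_dir, "live", "pool.lock"), "w")
fcntl.flock(live_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
os.makedirs(os.path.join(tenant_dir, "crashed"))    # a crashed pool: lock free
open(os.path.join(tenant_dir, "crashed", "pool.lock"), "w").close()

orphans = find_orphaned_pools(tenant_dir)
```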
Deployment
This guide covers deploying Kiseki in development, multi-node cluster, and bare-metal production environments.
Docker Compose (development)
The single-node development stack includes Kiseki plus supporting services for tracing, KMS, and identity.
Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| kiseki-server | Dockerfile.server (local build) | 2049, 9000, 9090, 9100, 9101 | Storage node |
| jaeger | jaegertracing/all-in-one:latest | 4317, 16686 | Distributed tracing (OTLP) |
| vault | hashicorp/vault:1.19 | 8200 | Tenant KMS backend (Transit engine) |
| keycloak | quay.io/keycloak/keycloak:26.0 | 8080 | OIDC identity provider |
Starting the stack
# Build and start all services
docker compose up --build
# Run in background for e2e tests
docker compose up --build -d && pytest tests/e2e/
Port map (single-node)
| Port | Protocol | Service |
|---|---|---|
| 2049 | TCP | NFS (v3 + v4.2) |
| 9000 | HTTP | S3 gateway |
| 9090 | HTTP | Prometheus metrics + admin dashboard |
| 9100 | gRPC | Data-path (log, chunk, composition, view) |
| 9101 | gRPC | Workflow advisory |
| 4317 | gRPC | Jaeger OTLP receiver |
| 16686 | HTTP | Jaeger UI |
| 8200 | HTTP | Vault API |
| 8080 | HTTP | Keycloak admin console |
Environment (dev defaults)
The development compose file sets these environment variables on the
kiseki-server container:
KISEKI_DATA_ADDR: "0.0.0.0:9100"
KISEKI_ADVISORY_ADDR: "0.0.0.0:9101"
KISEKI_S3_ADDR: "0.0.0.0:9000"
KISEKI_NFS_ADDR: "0.0.0.0:2049"
KISEKI_METRICS_ADDR: "0.0.0.0:9090"
KISEKI_DATA_DIR: "/data"
KISEKI_BOOTSTRAP: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4317"
OTEL_SERVICE_NAME: "kiseki-server"
The KISEKI_BOOTSTRAP=true flag tells the node to create an initial
shard on first start, enabling immediate use without manual cluster
initialization.
Vault (dev mode)
Vault runs in dev mode with the root token kiseki-e2e-token. This is
suitable only for development and testing. The Transit secrets engine is
used by Kiseki as a tenant KMS backend (ADR-028 Provider 2).
# Verify Vault is ready
curl http://localhost:8200/v1/sys/health
Keycloak (dev mode)
Keycloak runs with start-dev and default admin credentials
(admin/admin). Configure OIDC realms for tenant identity provider
integration.
Docker Compose (3-node cluster)
The multi-node compose file (docker-compose.3node.yml) deploys a
3-node Raft cluster for testing consensus, replication, and failover.
Starting
docker compose -f docker-compose.3node.yml up --build -d
# Run multi-node tests
KISEKI_E2E_COMPOSE=docker-compose.3node.yml pytest tests/e2e/test_multi_node.py
Node configuration
All three nodes share the same Raft peer list and each has a unique
KISEKI_NODE_ID:
| Node | Node ID | Data gRPC | Advisory gRPC | S3 | Raft |
|---|---|---|---|---|---|
| kiseki-node1 | 1 | localhost:9100 | localhost:9101 | localhost:9000 | 9300 |
| kiseki-node2 | 2 | localhost:9110 | localhost:9111 | localhost:9010 | 9300 |
| kiseki-node3 | 3 | localhost:9120 | localhost:9121 | localhost:9020 | 9300 |
The Raft peer list is configured identically on all nodes:
KISEKI_RAFT_PEERS=1=kiseki-node1:9300,2=kiseki-node2:9300,3=kiseki-node3:9300
Node 1 is the bootstrap node. Each node has an independent data volume
(node1-data, node2-data, node3-data).
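The KISEKI_RAFT_PEERS format can be parsed as follows (an illustrative parser, not the server's code):

```python
def parse_raft_peers(spec):
    """Parse the KISEKI_RAFT_PEERS format ("id=host:port,...") into a
    dict of node id -> (host, port)."""
    peers = {}
    for entry in spec.split(","):
        node_id, addr = entry.split("=", 1)
        host, port = addr.rsplit(":", 1)
        peers[int(node_id)] = (host, int(port))
    return peers

peers = parse_raft_peers(
    "1=kiseki-node1:9300,2=kiseki-node2:9300,3=kiseki-node3:9300")
```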
Verifying cluster health
# Check node 1 is healthy (metrics port 9090; nodes 2 and 3 expose
# /health on the metrics ports mapped in docker-compose.3node.yml)
curl -s http://localhost:9090/health && echo "node1 OK"
# View cluster status via the admin dashboard
open http://localhost:9090/ui
Bare metal deployment
Build from source
Prerequisites: Rust stable toolchain, protobuf compiler (protoc),
OpenSSL development headers, pkg-config.
# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki
# Release build (all binaries)
cargo build --release
# Binaries produced:
# target/release/kiseki-server — storage node
# target/release/kiseki-keyserver — system key manager (HA)
# target/release/kiseki-client-fuse — FUSE client for compute nodes
# target/release/kiseki-control — control plane
Optional feature flags:
# Enable CXI/Slingshot transport (requires libfabric)
cargo build --release --features kiseki-transport/cxi
# Enable RDMA verbs transport
cargo build --release --features kiseki-transport/verbs
# Enable tenant opt-in compression
cargo build --release --features kiseki-chunk/compression
Disk layout
Each storage node should follow the recommended disk layout:
Server node:
System partition (RAID-1 on 2x SSD):
/var/lib/kiseki/raft/log.redb Raft log entries
/var/lib/kiseki/keys/epochs.redb Key epoch metadata
/var/lib/kiseki/chunks/meta.redb Chunk extent index
/var/lib/kiseki/small/objects.redb Small-file inline content
/var/lib/kiseki/config/ Node config, TLS certs
Data devices (JBOD, managed by Kiseki):
/dev/nvme0n1 -> pool "fast-nvme"
/dev/nvme1n1 -> pool "fast-nvme"
/dev/sda -> pool "bulk-ssd"
/dev/sdb -> pool "cold-hdd"
JBOD for data devices, RAID-1 for the system partition. Kiseki manages data durability itself via EC/replication across JBOD members, but the redb stores and the Raft log on the system partition must survive a single disk failure without the benefit of Kiseki’s own repair mechanisms, hence RAID-1.
systemd unit: kiseki-server
[Unit]
Description=Kiseki Storage Node
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=kiseki
Group=kiseki
ExecStart=/usr/local/bin/kiseki-server
Restart=on-failure
RestartSec=5
# Environment
Environment=KISEKI_DATA_ADDR=0.0.0.0:9100
Environment=KISEKI_ADVISORY_ADDR=0.0.0.0:9101
Environment=KISEKI_S3_ADDR=0.0.0.0:9000
Environment=KISEKI_NFS_ADDR=0.0.0.0:2049
Environment=KISEKI_METRICS_ADDR=0.0.0.0:9090
Environment=KISEKI_DATA_DIR=/var/lib/kiseki
Environment=KISEKI_NODE_ID=1
Environment=KISEKI_RAFT_PEERS=1=node1.example.com:9300,2=node2.example.com:9300,3=node3.example.com:9300
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9300
# TLS
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
# Observability
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.internal:4317
Environment=OTEL_SERVICE_NAME=kiseki-server
Environment=RUST_LOG=kiseki=info
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0
[Install]
WantedBy=multi-user.target
systemd unit: kiseki-keyserver
[Unit]
Description=Kiseki System Key Manager
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=kiseki-keys
Group=kiseki-keys
ExecStart=/usr/local/bin/kiseki-keyserver
Restart=on-failure
RestartSec=5
Environment=KISEKI_DATA_DIR=/var/lib/kiseki-keys
Environment=KISEKI_RAFT_PEERS=1=keysrv1:9400,2=keysrv2:9400,3=keysrv3:9400
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9400
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/keyserver.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/keyserver.key
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki-keys
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0
[Install]
WantedBy=multi-user.target
systemd unit: kiseki-client-fuse
[Unit]
Description=Kiseki FUSE Client
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/kiseki-client-fuse --mountpoint /mnt/kiseki
ExecStop=/bin/fusermount -u /mnt/kiseki
Restart=on-failure
RestartSec=5
Environment=KISEKI_DATA_ADDR=node1.example.com:9100,node2.example.com:9100,node3.example.com:9100
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/client.key
Environment=KISEKI_CACHE_MODE=organic
Environment=KISEKI_CACHE_DIR=/var/cache/kiseki
Environment=KISEKI_CACHE_L1_MAX=1073741824
Environment=KISEKI_CACHE_L2_MAX=107374182400
[Install]
WantedBy=multi-user.target
Configuration checklist
Before starting a production cluster, verify the following:
TLS certificates
- Cluster CA certificate generated and distributed to all nodes
- Per-node server certificate signed by Cluster CA
- Per-tenant client certificates signed by Cluster CA
- Key manager server certificate signed by Cluster CA
- CRL distribution point configured (if using CRL-based revocation)
- Certificate SANs include all node hostnames and IP addresses
- All certificates use ECDSA P-256 or RSA 2048+ keys
Data directories
- `KISEKI_DATA_DIR` exists and is owned by the `kiseki` user
- System partition has sufficient capacity for metadata (see Capacity Planning)
- Data devices formatted and accessible (raw block or file-backed)
- Separate RAID-1 for system partition
Bootstrap
- Exactly one node has `KISEKI_BOOTSTRAP=true` on first start
- After initial bootstrap, set `KISEKI_BOOTSTRAP=false` on the bootstrap node (or remove the variable)
- `KISEKI_RAFT_PEERS` is identical on all nodes
- `KISEKI_NODE_ID` is unique per node
- System key manager cluster is started before storage nodes
Network
- Data-fabric ports (9100, 9101) reachable between all nodes
- Raft port (9300) reachable between all nodes
- Metrics port (9090) accessible to monitoring infrastructure
- NFS port (2049) accessible to clients
- S3 port (9000) accessible to clients
- Management network separated from data fabric (recommended)
Observability
- Jaeger or OTLP-compatible collector endpoint configured
- Prometheus scrape target added for each node’s `:9090/metrics`
- `RUST_LOG` level set appropriately (production: `kiseki=info`)
Health verification
After deployment, verify the cluster is healthy:
HTTP health endpoint
# Returns "OK" when the node is ready
curl http://node1:9090/health
Prometheus metrics
# Verify metrics are being exported
curl -s http://node1:9090/metrics | head -20
Admin dashboard
Open http://node1:9090/ui in a browser. The dashboard shows:
- Cluster health (nodes healthy / total)
- Raft entries applied
- Gateway requests served
- Data written and read
- Active transport connections
Any node in the cluster serves the full cluster-wide view by scraping metrics from its peers.
Raft consensus
Verify that the Raft cluster has elected a leader:
# Check the cluster status via the admin API
curl -s http://node1:9090/ui/api/cluster | jq .
S3 connectivity
# Test S3 access (if a tenant namespace is configured)
aws --endpoint-url http://node1:9000 s3 ls
NFS connectivity
# Test NFS mount
mount -t nfs node1:/ /mnt/kiseki -o vers=4.2
FUSE client
# Mount via FUSE (on a compute node)
kiseki-client-fuse --mountpoint /mnt/kiseki
ls /mnt/kiseki
Configuration Reference
Kiseki is configured entirely through environment variables. There are no configuration files to manage. Every tunable parameter has a sensible default. Variables are grouped by function below.
Network addresses
| Variable | Default | Description |
|---|---|---|
| KISEKI_DATA_ADDR | 0.0.0.0:9100 | Listen address for data-path gRPC (log, chunk, composition, view, discovery). |
| KISEKI_ADVISORY_ADDR | 0.0.0.0:9101 | Listen address for the Workflow Advisory gRPC service. Runs on a dedicated tokio runtime, isolated from the data path (ADR-021). |
| KISEKI_S3_ADDR | 0.0.0.0:9000 | Listen address for the S3 HTTP gateway. |
| KISEKI_NFS_ADDR | 0.0.0.0:2049 | Listen address for the NFS gateway (v3 + v4.2). |
| KISEKI_METRICS_ADDR | 0.0.0.0:9090 | Listen address for Prometheus metrics (/metrics), health endpoint (/health), and admin dashboard (/ui). |
| KISEKI_RAFT_ADDR | 0.0.0.0:9300 | Listen address for Raft consensus traffic between nodes. |
All addresses accept the host:port format. Use 0.0.0.0 to bind to
all interfaces or a specific IP to restrict to one network.
Cluster membership
| Variable | Default | Description |
|---|---|---|
| KISEKI_NODE_ID | (required) | Unique integer identifier for this node within the cluster. Must be stable across restarts. |
| KISEKI_RAFT_PEERS | (required) | Comma-separated list of id=host:port pairs for all Raft voters. Example: 1=node1:9300,2=node2:9300,3=node3:9300. Must be identical on every node. |
| KISEKI_BOOTSTRAP | false | When true, the node creates an initial shard on first start. Set to true on exactly one node during initial cluster formation, then set back to false. |
Storage
| Variable | Default | Description |
|---|---|---|
| KISEKI_DATA_DIR | /var/lib/kiseki | Root directory for all persistent state. Contains Raft log (raft/log.redb), key epochs (keys/epochs.redb), chunk metadata (chunks/meta.redb), and inline small-file content (small/objects.redb). Must reside on a low-latency device (NVMe or SSD strongly recommended; HDD triggers a boot warning). |
Data directory layout
KISEKI_DATA_DIR/
  raft/log.redb        Raft log entries (bounded by snapshot policy)
  keys/epochs.redb     Key epoch metadata (<10 MB)
  chunks/meta.redb     Chunk extent index (scales with file count)
  small/objects.redb   Small-file encrypted content (capacity-managed)
TLS / mTLS
| Variable | Default | Description |
|---|---|---|
| KISEKI_CA_PATH | (none) | Path to the Cluster CA certificate (PEM). Required for production. When set, all gRPC connections require mTLS. |
| KISEKI_CERT_PATH | (none) | Path to this node’s TLS certificate (PEM), signed by the Cluster CA. |
| KISEKI_KEY_PATH | (none) | Path to this node’s TLS private key (PEM). Never logged, printed, or transmitted. |
| KISEKI_CRL_PATH | (none) | Path to a CRL file (PEM) for certificate revocation. Reloaded periodically. Optional; if not set, CRL checking is disabled. |
When KISEKI_CA_PATH is not set, the server runs without TLS. This is
acceptable for development but must not be used in production.
Client-side cache (ADR-031)
These variables configure the native client cache on compute nodes
running kiseki-client-fuse.
| Variable | Default | Description |
|---|---|---|
| KISEKI_CACHE_MODE | organic | Cache operating mode. One of: pinned (staging-driven, eviction-resistant), organic (LRU with usage-weighted retention), bypass (no caching). Mode is per session, not per file. |
| KISEKI_CACHE_DIR | $KISEKI_DATA_DIR/cache | Directory for L2 cache pools on local NVMe. Each client process creates an isolated pool with a unique pool_id. |
| KISEKI_CACHE_L1_MAX | 1073741824 (1 GB) | Maximum bytes for the in-memory L1 cache (decrypted plaintext chunks). Bounded by process memory. |
| KISEKI_CACHE_L2_MAX | 107374182400 (100 GB) | Maximum bytes for the on-disk L2 cache on local NVMe. Per-process, per-tenant isolation via pool directories. |
| KISEKI_CACHE_META_TTL_MS | 5000 (5 seconds) | Metadata TTL in milliseconds. File-to-chunk-list mappings are served from cache within this window. After expiry, mappings are re-fetched from canonical. This is the sole freshness window: chunk data itself has no TTL because chunks are immutable (I-C1). |
| KISEKI_CACHE_POOL_ID | (none) | Adopt an existing L2 cache pool instead of creating a new one. Used for staging handoff from a Slurm prolog daemon to a workload process. |
Cache behavior notes
- Pinned mode: Pre-staged datasets remain in cache until explicitly released. Best for training workloads that re-read the same data across epochs.
- Organic mode: LRU eviction with usage-weighted retention. Default for mixed workloads.
- Bypass mode: No caching at all. Best for checkpoint/restart and streaming workloads.
- On process restart, the client creates a new L2 pool (wiping orphaned pools). A `kiseki-cache-scrub` service cleans orphans on node boot.
- Disconnects longer than 300 seconds (configurable) wipe the entire cache.
- Crypto-shred events wipe all cached plaintext for the affected tenant within the key health check interval (default 30 seconds).
Metadata capacity (ADR-030)
These variables control the dynamic inline threshold for small-file placement.
| Variable | Default | Description |
|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | 50 | Normal operating ceiling for system disk metadata usage, as a percentage of system partition capacity. Exceeding this triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | 75 | Absolute maximum for system disk metadata usage. Exceeding this forces the inline threshold to the floor (128 bytes) and emits an alert via out-of-band gRPC (not Raft). |
The inline threshold determines whether a file’s encrypted content is
stored in small/objects.redb (metadata tier, NVMe) or as a chunk
extent on a raw block device (data tier). The threshold is computed
per-shard as the minimum affordable threshold across all Raft voters,
clamped between 128 bytes (floor) and 64 KB (ceiling).
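That computation can be expressed directly (a sketch; the voter inputs are hypothetical values):

```python
def effective_inline_threshold(voter_affordable, floor=128, ceiling=64 * 1024):
    """Per-shard inline threshold (sketch): the minimum affordable
    threshold across all Raft voters, clamped to [floor, ceiling]."""
    return max(floor, min(min(voter_affordable), ceiling))

# One voter under metadata pressure drags the shard threshold down:
threshold = effective_inline_threshold([64 * 1024, 16 * 1024, 512])
```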
Observability
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry OTLP gRPC endpoint for distributed traces. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name reported in traces. Set to kiseki-keyserver or kiseki-client for other binaries. |
| RUST_LOG | info | Logging filter directive for the tracing crate. Supports per-module granularity. Examples: kiseki=debug, kiseki_raft=trace,kiseki=info, warn. |
| KISEKI_LOG_FORMAT | text | Log output format. text for human-readable, json for structured JSON (one line per event). Use json in production for log aggregation. |
Tuning parameters (runtime)
The following parameters are set at runtime via the StorageAdminService
gRPC API (SetTuningParams / GetTuningParams), not via environment
variables. They are listed here for reference.
Cluster-wide tuning
| Parameter | Default | Range | Description |
|---|---|---|---|
| compaction_rate_mb_s | 100 | 10-1000 | Background compaction throughput cap (MB/s). |
| gc_interval_s | 300 | 60-3600 | Interval between GC scans for reclaimable chunks. |
| rebalance_rate_mb_s | 50 | 0-500 | Background rebalance/evacuation throughput (MB/s). |
| scrub_interval_h | 168 (7 days) | 24-720 | Interval between integrity scrub runs. |
| max_concurrent_repairs | 4 | 1-32 | Maximum parallel EC repair jobs. |
| stream_proc_poll_ms | 100 | 10-1000 | View materialization polling interval (ms). |
| inline_threshold_bytes | 4096 | 512-65536 | Default inline threshold for new shards. |
| raft_snapshot_interval | 10000 | 1000-100000 | Entries between Raft snapshots. |
Per-pool tuning
| Parameter | Default | Range | Description |
|---|---|---|---|
| ec_data_chunks | 4 (NVMe) / 8 (HDD) | 2-16 | EC data fragment count. Immutable per pool after creation (I-C6). |
| ec_parity_chunks | 2 (NVMe) / 3 (HDD) | 1-8 | EC parity fragment count. Immutable per pool after creation. |
| replication_count | 3 | 2-5 | For replication pools (non-EC). |
| warning_threshold_pct | Per device class | 50-95 | Pool capacity warning level. |
| critical_threshold_pct | Per device class | 60-98 | Pool capacity critical level. Writes rejected. |
| readonly_threshold_pct | Per device class | 70-99 | Read-only level. In-flight writes drain. |
| target_fill_pct | 70 (SSD) / 80 (HDD) | 50-90 | Rebalance target fill level. |
Default capacity thresholds by device class:
| State | NVMe/SSD | HDD |
|---|---|---|
| Healthy | 0-75% | 0-85% |
| Warning | 75-85% | 85-92% |
| Critical | 85-92% | 92-97% |
| ReadOnly | 92-97% | 97-99% |
| Full | 97-100% | 99-100% |
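The threshold table can be read as a utilization-to-state mapping (a sketch using the default NVMe/SSD and HDD bounds from the table above):

```python
def capacity_state(utilization_pct, device_class):
    """Map pool utilization to the documented capacity state using the
    default per-device-class thresholds."""
    bounds = {"nvme": (75, 85, 92, 97),   # NVMe/SSD column
              "hdd": (85, 92, 97, 99)}    # HDD column
    warn, crit, ro, full = bounds[device_class]
    if utilization_pct < warn:
        return "Healthy"
    if utilization_pct < crit:
        return "Warning"
    if utilization_pct < ro:
        return "Critical"
    if utilization_pct < full:
        return "ReadOnly"
    return "Full"
```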
All tuning parameter changes via SetTuningParams are recorded in the
cluster audit shard with parameter name, old value, new value, timestamp,
and admin identity (I-A6).
Environment variable summary
Quick reference of all environment variables:
# Network
KISEKI_DATA_ADDR=0.0.0.0:9100
KISEKI_ADVISORY_ADDR=0.0.0.0:9101
KISEKI_S3_ADDR=0.0.0.0:9000
KISEKI_NFS_ADDR=0.0.0.0:2049
KISEKI_METRICS_ADDR=0.0.0.0:9090
KISEKI_RAFT_ADDR=0.0.0.0:9300
# Cluster
KISEKI_NODE_ID=1
KISEKI_RAFT_PEERS=1=node1:9300,2=node2:9300,3=node3:9300
KISEKI_BOOTSTRAP=false
# Storage
KISEKI_DATA_DIR=/var/lib/kiseki
# TLS
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem
# Cache (client only)
KISEKI_CACHE_MODE=organic
KISEKI_CACHE_DIR=/var/cache/kiseki
KISEKI_CACHE_L1_MAX=1073741824
KISEKI_CACHE_L2_MAX=107374182400
KISEKI_CACHE_META_TTL_MS=5000
# Metadata capacity
KISEKI_META_SOFT_LIMIT_PCT=50
KISEKI_META_HARD_LIMIT_PCT=75
# Observability
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=kiseki-server
RUST_LOG=kiseki=info
KISEKI_LOG_FORMAT=json
Cluster Management
This guide covers day-to-day cluster operations: adding and removing nodes, managing shards and pools, maintenance mode, and schema migration.
Node management
Kiseki uses Raft consensus groups for metadata and log replication. Adding or removing nodes is done through Raft membership changes, which are zero-downtime and zero-data-loss operations.
Adding a node
1. Deploy `kiseki-server` on the new host with a unique `KISEKI_NODE_ID` and the full `KISEKI_RAFT_PEERS` list (including the new node).
2. Start the service. The node registers with the cluster and begins receiving Raft log entries as a learner.
3. Promote the node to a voter once it has caught up:
   kiseki-server node add --node-id 4
4. The node receives shard assignments and begins participating in Raft elections and commit quorums.
Catch-up requirement (I-SF3): A learner must fully catch up with the leader’s committed index before being promoted to voter. The old voter remains in membership until the new voter is promoted.
Removing a node
1. Drain the node to migrate its shard assignments to other nodes:
   kiseki-server node drain --node-id 4
2. Wait for all shards to be migrated. The drain operation uses Raft membership changes (add learner on target, promote, demote source) for each shard hosted on the node.
3. Once drained, remove the node from the cluster:
   kiseki-server node remove --node-id 4
4. Stop the `kiseki-server` process and decommission the hardware.
Safety: Removing a node without draining first triggers automatic shard repair, but this is reactive rather than proactive. Always drain first for orderly removal.
Cluster sizing
- Minimum: 3 nodes (Raft requires a majority quorum; 2-of-3 for writes).
- Recommended: 5+ nodes for production. Tolerates 2 simultaneous node failures.
- Key manager: Deploy on a dedicated 3-5 node HA cluster, separate from storage nodes. The system key manager must be at least as available as the log (I-K12).
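The sizing guidance follows from Raft's majority-quorum arithmetic:

```python
def fault_tolerance(n_nodes):
    """Raft majority-quorum arithmetic: a cluster of n voters needs
    floor(n/2) + 1 nodes to commit and tolerates the rest failing."""
    quorum = n_nodes // 2 + 1
    return quorum, n_nodes - quorum

quorum3, tol3 = fault_tolerance(3)   # 2-of-3, tolerates 1 failure
quorum5, tol5 = fault_tolerance(5)   # 3-of-5, tolerates 2 failures
```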
Shard management
Shards are the smallest unit of totally-ordered deltas, backed by one Raft group. They split automatically when size or throughput thresholds are exceeded (I-L6).
Viewing shard status
# List all shards
kiseki-server shard list
# Get details for a specific shard
kiseki-server shard info --shard-id shard-0001
# Check shard health
kiseki-server shard health --shard-id shard-0001
Automatic shard split
Shards have a hard ceiling triggering mandatory split (I-L6). The ceiling is configurable across three dimensions:
- Delta count: Maximum number of deltas in a shard.
- Byte size: Maximum total size of shard data.
- Write throughput: Maximum sustained write rate.
Any dimension exceeding its ceiling forces a split. The split operation:
- Selects a split boundary (key range partition).
- Creates a new shard for the upper range.
- Continues accepting writes during the split (I-O1).
- Notifies the control plane, views, and clients of the new shard topology.
Manual shard split
kiseki-server shard split --shard-id shard-0001 --boundary "..."
Shard maintenance mode
Set a shard to read-only for maintenance operations:
# Enable maintenance mode (writes rejected with retriable error)
kiseki-server shard maintenance --shard-id shard-0001 --enabled
During maintenance mode (I-O6):
- Write commands are rejected with a retriable error.
- Read operations continue normally.
- In-progress compaction and GC continue but no new triggers fire from write pressure.
- Shard splits do not initiate.
Cross-shard operations
Cross-shard rename returns EXDEV (I-L8). Shards are independent
consensus domains with no two-phase commit. Applications must handle
cross-shard moves via copy + delete.
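On a FUSE mount, the copy + delete fallback can be sketched as below; `move_across` is a hypothetical helper, not part of the Kiseki client:

```shell
# Hypothetical helper: move a path that may cross a shard boundary on
# a FUSE mount. rename(2) fails with EXDEV across shards, so fall
# back to copy + delete as the section above requires.
move_across() {
  src=$1; dst=$2
  if mv "$src" "$dst" 2>/dev/null; then
    return 0                             # same shard: plain rename worked
  fi
  cp -a "$src" "$dst" && rm -rf "$src"   # cross-shard: copy, then delete
}
```

Note the fallback path is not atomic: readers can observe both copies mid-move, and a crash between `cp` and `rm` leaves both behind.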
Pool management
Affinity pools are groups of storage devices sharing a device class. Pools are the unit of capacity management and durability policy.
Viewing pools
# List all pools
kiseki-server pool list
# Get pool details including capacity and health
kiseki-server pool status --pool-id fast-nvme
Creating a pool
kiseki-server pool create --pool-id fast-nvme --device-class NvmeU2 \
--ec-data 4 --ec-parity 2
Important: EC parameters (ec_data_chunks, ec_parity_chunks) are
immutable per pool after creation (I-C6). Changing them requires
creating a new pool and migrating data via ReencodePool.
Setting pool durability
# Switch pool durability strategy (applies to new chunks only)
kiseki-server pool set-durability --pool-id fast-nvme \
--ec-data 4 --ec-parity 2
Existing chunks retain their original EC config. Re-encoding requires
an explicit ReencodePool RPC.
Rebalancing a pool
Rebalance distributes data evenly across devices in a pool:
# Start rebalance
kiseki-server pool rebalance --pool-id fast-nvme
# Cancel a running rebalance
kiseki-server pool cancel-rebalance --pool-id fast-nvme
Rebalance runs at the configured rebalance_rate_mb_s (default
50 MB/s) to limit impact on production traffic.
Device evacuation
When a device shows signs of failure (SMART wear > 90% for SSD, > 100 bad sectors for HDD), automatic evacuation is triggered (I-D3). Evacuation can also be initiated manually:
# Start evacuation
kiseki-server device evacuate --device-id nvme-0001
# Cancel evacuation
kiseki-server device cancel-evacuation --device-id nvme-0001
Evacuation migrates all chunks from the device to other devices in the
same pool. Device removal (RemoveDevice) is rejected unless the device
state is Removed (post-evacuation) (I-D5).
Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed.
All transitions are recorded in the audit log (I-D2).
Pool capacity thresholds
Pool writes are rejected when the pool reaches the Critical threshold (I-C5). Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels:
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Pool redirection stays within the same device class only. ENOSPC is returned when the pool is Full.
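The threshold table can be read as a simple classifier over fill percentage. A sketch, assuming integer percentages and assigning boundary values to the more severe state (the table does not specify boundary handling):

```shell
# Classify pool state from fill percentage, mirroring the threshold
# table above. Boundary values go to the more severe state -- an
# assumption, since the table leaves boundary handling unspecified.
pool_state() {
  class=$1; pct=$2            # class: nvme | hdd; pct: integer 0-100
  case $class in
    nvme) w=75; c=85; r=92; f=97 ;;
    hdd)  w=85; c=92; r=97; f=99 ;;
    *) echo unknown; return 1 ;;
  esac
  if   [ "$pct" -ge "$f" ]; then echo Full
  elif [ "$pct" -ge "$r" ]; then echo ReadOnly
  elif [ "$pct" -ge "$c" ]; then echo Critical
  elif [ "$pct" -ge "$w" ]; then echo Warning
  else echo Healthy; fi
}
```

The lower NVMe/SSD thresholds reflect the GC-pressure rationale stated above: flash devices degrade earlier at high fill levels.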
Maintenance mode
Cluster-wide or per-shard maintenance mode sets the cluster (or specific shards) to read-only (I-O6).
Enabling cluster-wide maintenance
# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
# Via the kiseki-server CLI
kiseki-server maintenance on
Maintenance mode behavior
- All write commands are rejected with a retriable error code (MaintenanceMode). Clients can retry after maintenance ends.
- Read operations continue normally.
- In-progress compaction and GC complete their current run.
- New shard splits, compaction triggers, and GC triggers from write pressure are suppressed.
- Maintenance mode is the prerequisite for:
  - Schema migration on upgrade
  - Inline threshold increase (optional migration of small chunked files back to inline)
  - Full cluster re-encryption
Disabling maintenance
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": false}'
Writes resume immediately. Clients that were retrying will succeed on their next attempt.
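The retry behavior described above can be sketched as a client-side wrapper; `retry_write` is a hypothetical helper, not part of any Kiseki tool:

```shell
# Hypothetical client-side wrapper for writes rejected with the
# retriable MaintenanceMode error: retry with linear backoff until the
# wrapped command succeeds or attempts run out.
retry_write() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    sleep "$i"             # back off a little longer each attempt
    i=$((i + 1))
  done
  return 1
}
```

Because the rejection is a distinct, retriable error code, clients can loop like this without treating maintenance as a hard failure.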
Schema migration on upgrade
Kiseki uses versioned on-disk formats. Upgrades that change the schema follow this procedure:
- Read the release notes for migration requirements. Not every release requires migration.
- Enable maintenance mode on the cluster to prevent writes during migration.
- Stop all nodes in the cluster.
- Upgrade the binaries on all nodes (kiseki-server, kiseki-keyserver, kiseki-client-fuse).
- Start nodes one at a time. On startup, each node detects the old schema version (via the superblock on each data device and the redb metadata version) and applies migration automatically.
- Verify migration by checking the admin dashboard and node logs.
- Disable maintenance mode to resume normal operations.
Rolling upgrades
For minor releases that do not change the on-disk format, rolling upgrades are supported:
- Drain a node (DrainNode).
- Stop the node.
- Upgrade the binary.
- Start the node.
- Wait for it to rejoin and catch up.
- Repeat for the next node.
The superblock on each data device carries a format version (ADR-029). Format version mismatches are detected at device open and handled by the migration path.
Admin Dashboard
Kiseki includes a built-in web dashboard for cluster monitoring and basic operations. The dashboard is served by every storage node on the metrics HTTP port.
Access
http://<node>:9090/ui
Any node in the cluster serves the full cluster-wide view. The dashboard scrapes metrics from peer nodes in the background and aggregates them locally. There is no dedicated dashboard server; connect to whichever node is most convenient.
The metrics HTTP server also serves:
| Path | Purpose |
|---|---|
| /health | Health probe (returns 200 OK). Used by load balancers. |
| /metrics | Prometheus text exposition format. |
| /ui | Admin dashboard (HTML + HTMX + Chart.js). |
| /ui/logo | Kiseki logo image. |
Technology
The dashboard is a single-page HTML application using:
- HTMX for live updates via HTML fragment polling.
- Chart.js for time-series and per-node comparison charts.
- No build step, no JavaScript framework, no node_modules.
The dashboard HTML is embedded in the kiseki-server binary at compile
time (include_str!). No external files to deploy or manage.
Overview tab
The main view shows six metric cards at the top, a time-series chart in the middle, and a node table at the bottom. All data refreshes automatically via HTMX polling.
Metric cards
| Card | Source metric | Description |
|---|---|---|
| Cluster Health | Node liveness | N/M nodes healthy with color coding: green (all healthy), yellow (degraded), red (all down). |
| Raft Entries | kiseki_raft_entries_total | Total Raft entries applied across the cluster. |
| Gateway Requests | kiseki_gateway_requests_total | Total S3 and NFS requests served. |
| Data Written | kiseki_chunk_write_bytes_total | Aggregate chunk bytes written. |
| Data Read | kiseki_chunk_read_bytes_total | Aggregate chunk bytes read. |
| Connections | kiseki_transport_connections_active | Active transport connections. |
Numbers are formatted with SI suffixes (K, M, B) and byte units (KB, MB, GB, TB) for readability.
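A sketch of the SI-suffix formatting the cards use (K, M, B); the dashboard's exact rounding is not specified, so this version truncates via integer division:

```shell
# Sketch of SI-suffix formatting for the metric cards (K, M, B).
# Truncating integer division -- the dashboard's rounding behavior
# is an implementation detail not documented here.
format_count() {
  n=$1
  if   [ "$n" -ge 1000000000 ]; then echo "$(( n / 1000000000 ))B"
  elif [ "$n" -ge 1000000 ];    then echo "$(( n / 1000000 ))M"
  elif [ "$n" -ge 1000 ];       then echo "$(( n / 1000 ))K"
  else echo "$n"; fi
}
```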
Time-series charts
The dashboard stores up to 3 hours of metric history (configurable) in memory. Time-series charts show:
- Raft entries over time
- Gateway request rate
- Chunk write/read throughput
- Connection count
Historical data is available via the API:
# Get 3 hours of history (default)
curl http://node1:9090/ui/api/history
# Get 1 hour of history
curl http://node1:9090/ui/api/history?hours=1
Node table
A table listing every node in the cluster with per-node metrics:
| Column | Description |
|---|---|
| Node | Node address (hostname:port) |
| Status | Health badge: green “Healthy” or red “Unreachable” |
| Raft | Raft entries applied by this node |
| Requests | Gateway requests served by this node |
| Written | Chunk bytes written by this node |
| Read | Chunk bytes read by this node |
| Conns | Active transport connections on this node |
Click a node row to drill down to the node detail view.
Performance tab
The performance tab shows per-node comparison charts for identifying hotspots and imbalances:
- Write throughput by node: Bar chart comparing chunk bytes written per node.
- Read throughput by node: Bar chart comparing chunk bytes read per node.
- Request count by node: Bar chart comparing gateway requests per node.
Chart data is sourced from the chart-data API:
curl http://node1:9090/ui/fragment/chart-data
# Returns: {"labels": [...], "writes": [...], "reads": [...], "requests": [...]}
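Since the chart-data shape is documented (parallel `labels`/`writes`/`reads`/`requests` arrays), hotspot checks are easy to script. A sketch using jq against an illustrative sample payload (the node names and numbers below are made up):

```shell
# Find the node with the highest write volume from the documented
# chart-data shape. Sample payload is illustrative only.
chart=$(mktemp)
cat <<'EOF' > "$chart"
{"labels": ["node1:9100", "node2:9100", "node3:9100"],
 "writes": [1048576, 5242880, 2097152],
 "reads":  [4096, 8192, 1024],
 "requests": [120, 340, 95]}
EOF

# Pair each label with its write count, then pick the maximum.
jq -r '[.labels, .writes] | transpose | max_by(.[1]) | .[0]' "$chart"
```

Against a live node, replace the sample file with `curl -s http://node1:9090/ui/fragment/chart-data`.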
Alerts tab
The alerts tab shows health status and capacity warnings. Each alert is a row with a colored dot (green, yellow, red, blue), a message, and a timestamp.
Alert types
| Dot | Meaning | Example |
|---|---|---|
| Green | All clear | “All 3 nodes healthy” |
| Red | Critical | “Node node2:9100 unreachable” |
| Blue | Informational | “Capacity monitoring active (3 nodes reporting)” |
| Green | Activity | “node1:9100: 1.2K gateway requests served” |
Alerts are generated by comparing the current cluster state against expected conditions. The alert endpoint returns HTML fragments for HTMX polling:
curl http://node1:9090/ui/fragment/alerts
Operations tab
The operations tab provides buttons for common administrative actions. Each action calls a REST endpoint and records an event in the diagnostic event store.
Available operations
| Operation | Endpoint | Method | Description |
|---|---|---|---|
| Maintenance Mode | /ui/api/ops/maintenance | POST | Enable or disable cluster-wide maintenance mode. Body: {"enabled": true} or {"enabled": false}. |
| Backup | /ui/api/ops/backup | POST | Initiate a background backup. |
| Scrub | /ui/api/ops/scrub | POST | Initiate a background integrity scrub. |
Example:
# Enable maintenance mode
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
# Trigger a scrub
curl -X POST http://node1:9090/ui/api/ops/scrub
All operations return {"status": "ok", "message": "..."} on success.
Node drill-down
Click a node in the node table to see its detailed view. The drill-down shows:
- Node-specific metric history (time-series)
- Device health for devices attached to that node
- Shard assignments on that node
- Raft role (leader/follower/learner) per shard
API endpoints
All dashboard data is available via JSON APIs for scripting and integration:
| Endpoint | Method | Description |
|---|---|---|
| /ui/api/cluster | GET | Cluster summary: healthy nodes, total nodes, aggregate metrics. |
| /ui/api/nodes | GET | List of all nodes with per-node metrics and health status. |
| /ui/api/history | GET | Time-series metric history. Query: ?hours=3 (default). |
| /ui/api/events | GET | Diagnostic event log. Query parameters below. |
Event log query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| severity | string | (all) | Filter by severity: info, warning, error, critical. |
| category | string | (all) | Filter by category: node, shard, device, tenant, security, admin, gateway, raft. |
| hours | float | 3 | Hours to look back. |
| limit | integer | 100 | Maximum events to return. |
Example:
# Get last 50 error events in the past hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1&limit=50'
Response format:
{
"count": 2,
"events": [
{
"timestamp": "2026-04-23T14:30:00Z",
"severity": "error",
"category": "device",
"source": "nvme-0001",
"message": "Device SMART wear exceeds 90%"
}
]
}
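The response format lends itself to client-side filtering with jq. A sketch against a sample payload in the documented shape (the second and third events below are illustrative):

```shell
# Filter the documented events response client-side, e.g. pulling
# only device-category events. Sample payload follows the response
# format shown above; extra events are illustrative.
events=$(mktemp)
cat <<'EOF' > "$events"
{"count": 3,
 "events": [
   {"timestamp": "2026-04-23T14:30:00Z", "severity": "error",
    "category": "device", "source": "nvme-0001",
    "message": "Device SMART wear exceeds 90%"},
   {"timestamp": "2026-04-23T14:31:00Z", "severity": "info",
    "category": "node", "source": "node2:9100",
    "message": "Node rejoined"},
   {"timestamp": "2026-04-23T14:32:00Z", "severity": "critical",
    "category": "device", "source": "nvme-0002",
    "message": "Device failed"}
 ]}
EOF

jq -r '.events[] | select(.category == "device") | "\(.severity): \(.source)"' "$events"
```

Against a live node, pipe `curl -s 'http://node1:9090/ui/api/events?category=device'` into the same jq filter.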
Cluster-wide view architecture
Every node in the cluster runs the same dashboard. The cluster-wide
view is assembled by scraping /metrics from peer nodes:
- Each node knows its peers from KISEKI_RAFT_PEERS.
- A background task scrapes each peer’s /metrics endpoint at a configurable interval (default 10 seconds).
- Scraped metrics are cached locally in a MetricsAggregator.
- Dashboard requests aggregate local + cached peer metrics.
This means:
- No single point of failure. Any node serves the dashboard.
- Stale data tolerance. If a peer is unreachable, the dashboard shows the last known state and marks the node as “Unreachable.”
- No additional infrastructure. No dedicated monitoring server is needed for basic cluster visibility.
For production monitoring with alerting and long-term retention, use Prometheus and Grafana (see Monitoring).
Backup & Recovery
Kiseki’s primary disaster recovery mechanism is federation (async replication to a secondary site). External backup is additive and optional, providing defense-in-depth for deployments that require it.
Architecture overview
Federation as primary DR
Federated-async replication to a secondary site is the recommended DR strategy (ADR-016). Properties:
- RPO: Bounded by async replication lag (seconds to minutes).
- RTO: Secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration.
- Data replication: Ciphertext-only. No key material in the replication stream.
What is replicated
| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |
External backup
Cluster admins can configure external backup targets (S3-compatible object store). Backup data is encrypted with the system key at rest.
Backup operations
Creating a backup
# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/backup
# Via the kiseki-server CLI
kiseki-server backup create
Backup contents
Each backup snapshot contains:
- Per-shard metadata: Raft log snapshots for each shard, capturing the delta history up to the snapshot point.
- Chunk extent manifests: The chunks/meta.redb index mapping chunk IDs to device extents.
- Inline content: The small/objects.redb database (small-file data below the inline threshold).
- Control plane state: Tenant configuration, namespace mappings, quotas, compliance tags, federation peer registry.
- Key epoch metadata: Key epoch records from keys/epochs.redb (key material itself is NOT included in backups; it is managed by the system key manager and tenant KMS independently).
All backup data is encrypted. No plaintext chunk data appears in backup output. Backups reference chunk ciphertext on data devices by extent coordinates, not by copying the raw ciphertext (which would require reading and re-encrypting terabytes of data).
Listing backups
kiseki-server backup list
Deleting a backup
kiseki-server backup delete --backup-id backup-20260423-001
Retention policy
Backup retention is configurable per cluster. Defaults:
| Setting | Default | Description |
|---|---|---|
| Retention period | 7 days | Backups older than this are automatically deleted. |
| Maximum backups | 10 | Maximum number of retained backup snapshots. |
| Backup frequency | Daily | How often automatic backups are created (if enabled). |
Retention is enforced by a background task that runs on the Raft leader. Deletion of expired backups is recorded in the cluster audit log.
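The two default limits compose as an OR: a snapshot is expired once it is older than the retention period or falls outside the maximum count. A sketch, assuming the limits apply independently (the document does not state how they interact):

```shell
# Sketch of the default retention rule. A backup is expired when it is
# older than 7 days OR ranked outside the 10 newest snapshots.
# Assumption: the two limits apply independently, whichever prunes more.
is_expired() {
  age_days=$1; rank=$2         # rank: 1 = newest snapshot
  [ "$age_days" -gt 7 ] || [ "$rank" -gt 10 ]
}
```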
Recovery procedures
Single node failure
Recovery path: Raft re-election + EC repair.
- The Raft group detects the failed node and elects a new leader (if the failed node was leader).
- EC repair automatically rebuilds chunk fragments that were on the failed node’s devices.
- RPO: 0 (committed data is on a majority of replicas). RTO: seconds to minutes.
No manual intervention required. Monitor the repair progress via:
kiseki-server repair list
Multiple node failure (quorum maintained)
Recovery path: Raft reconfiguration + EC repair.
If the cluster still has a Raft majority (e.g., 2 of 3 nodes alive), recovery is automatic:
- Raft continues operating with the surviving majority.
- EC repair rebuilds lost chunk fragments.
- Deploy replacement nodes and add them to the cluster.
Multiple node failure (quorum lost)
Recovery path: Manual Raft reconfiguration.
If the majority is lost (e.g., 2 of 3 nodes down), Raft cannot make progress. Recovery requires manual intervention:
- Identify the surviving node(s) with the most recent committed state.
- Force a new Raft configuration with the surviving node(s) as the initial voter set.
- Deploy replacement nodes and add them as learners.
- Promote learners to voters once they catch up.
Data loss risk: Deltas committed on the failed majority but not yet replicated to the surviving minority may be lost.
Full site failure (with federation)
Recovery path: Failover to federated peer.
- Redirect clients to the secondary site (DNS, load balancer, or manual reconfiguration).
- The secondary site has replicated chunk data, log deltas, and control plane config.
- Tenant KMS must be reachable from the secondary site (same KMS serves both sites).
- The secondary site’s system key manager has its own master keys, but tenant data is accessible because tenant KEKs come from the shared tenant KMS.
RPO: Replication lag. RTO: Minutes to hours (depends on control plane reconfiguration speed).
Full site failure (without federation)
Recovery path: Restore from external backup.
- Deploy a new cluster.
- Restore the backup snapshot to the new cluster.
- The system key manager on the new cluster generates new system master keys.
- Tenant KMS must be reconfigured to point to the new cluster.
- Re-wrap all envelopes with new system master keys.
RPO: Time since last backup. RTO: Hours (depends on data volume).
Tenant KMS loss
Unrecoverable (I-K11). If the tenant loses their KMS and has no backup of their KEK material, all data encrypted under those keys is permanently unreadable. Kiseki documents this requirement but provides no system-side escrow. The tenant controls and is responsible for their keys.
Recovery summary
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds-minutes |
| Multiple node loss (quorum held) | Raft reconfiguration + EC repair | 0 | Minutes |
| Multiple node loss (quorum lost) | Manual Raft reconfig | Possible delta loss | Minutes-hours |
| Full site loss (with federation) | Failover to peer | Replication lag | Minutes-hours |
| Full site loss (no federation) | Restore from backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable | N/A | N/A |
Limitations
- No point-in-time restore. Backups are snapshots, not continuous journals. Recovery restores the cluster to the state at the snapshot time. Deltas committed after the snapshot are lost unless federation has replicated them.
- Backup does not include key material. System master keys and tenant KEKs are managed by their respective key managers. Backup and recovery of key material is the responsibility of the key manager operator (cluster admin for system keys, tenant admin for tenant KEKs).
- Chunk ciphertext is referenced, not copied. Backup manifests reference chunk extents on data devices. If data devices are destroyed, the chunk ciphertext is lost. Federation replicates the actual ciphertext to a secondary site, which is why it is the primary DR mechanism.
- Cross-site backup requires federation. There is no built-in mechanism to ship backup snapshots to a remote site outside of the federation framework. For cross-site backup without federation, operators must arrange their own transport of backup snapshots.
Monitoring & Observability
Kiseki provides three observability pillars: metrics (Prometheus), structured logging (tracing), and distributed traces (OpenTelemetry). All three are tenant-aware, respecting the zero-trust boundary between cluster admin and tenant admin (ADR-015).
Prometheus metrics
Every kiseki-server node exposes Prometheus metrics in text exposition
format on the metrics HTTP port.
Endpoint
GET http://<node>:9090/metrics
Registered metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
| kiseki_raft_commit_latency_seconds | Histogram | shard | Raft commit latency per shard. Buckets: 100us to 1s. |
| kiseki_raft_entries_total | Counter | (none) | Total Raft entries applied on this node. |
| kiseki_chunk_write_bytes_total | Counter | (none) | Total chunk bytes written. |
| kiseki_chunk_read_bytes_total | Counter | (none) | Total chunk bytes read. |
| kiseki_chunk_ec_encode_seconds | Histogram | strategy | EC encode latency. Buckets: 100us to 50ms. |
| kiseki_gateway_requests_total | Counter | method, status | Gateway request count by method (GET, PUT, DELETE, etc.) and HTTP status. |
| kiseki_gateway_request_duration_seconds | Histogram | method | Gateway request duration. Buckets: 1ms to 5s. |
| kiseki_pool_capacity_total_bytes | Gauge | pool | Total capacity per pool in bytes. |
| kiseki_pool_capacity_used_bytes | Gauge | pool | Used capacity per pool in bytes. |
| kiseki_transport_connections_active | Gauge | (none) | Active transport connections. |
| kiseki_transport_connections_idle | Gauge | (none) | Idle transport connections. |
| kiseki_shard_delta_count | Gauge | shard | Current delta count per shard. |
| kiseki_key_rotation_total | Counter | (none) | Key rotations performed (system + tenant). |
| kiseki_crypto_shred_total | Counter | (none) | Crypto-shred operations performed. |
Metric scoping (zero-trust)
Per ADR-015, metric scoping respects the zero-trust boundary:
- Cluster admin sees: Aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless the cluster admin has approved access for that tenant.
- Tenant admin sees: Their own tenant’s metrics via the tenant audit export.
- No metric exposes: File names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).
Metric cardinality
Metric cardinality is bounded by design. Label values are drawn from fixed sets (shard IDs, pool names, HTTP methods, strategy names). There are no unbounded label values such as file paths, tenant IDs, or user identifiers in metrics labels.
Structured logging
Kiseki uses the tracing crate for structured logging. Every log event
is a structured record with typed fields.
Configuration
| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Filter directive. Supports per-module granularity. |
| KISEKI_LOG_FORMAT | text | Output format: text (human-readable) or json (structured). |
Filter examples
# Default: info-level for all Kiseki modules
RUST_LOG=kiseki=info
# Debug for the Raft subsystem, info for everything else
RUST_LOG=kiseki_raft=debug,kiseki=info
# Trace-level for the chunk subsystem (very verbose)
RUST_LOG=kiseki_chunk=trace,kiseki=info
# Warnings only (quiet)
RUST_LOG=warn
JSON output format
In production, set KISEKI_LOG_FORMAT=json for structured log
aggregation (ELK, Loki, Datadog, etc.):
{
"timestamp": "2026-04-23T14:30:00.123Z",
"level": "INFO",
"target": "kiseki_raft",
"message": "Raft leader elected",
"shard": "shard-0001",
"node_id": 1,
"term": 42
}
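One-record-per-line JSON logs in this format can be sliced directly with jq. A sketch against sample lines in the shape shown above (the WARN and ERROR records are illustrative):

```shell
# Pull WARN-and-above records out of a JSON log stream. Sample lines
# follow the format shown above; the WARN/ERROR records are made up.
log=$(mktemp)
cat <<'EOF' > "$log"
{"timestamp":"2026-04-23T14:30:00.123Z","level":"INFO","target":"kiseki_raft","message":"Raft leader elected","shard":"shard-0001"}
{"timestamp":"2026-04-23T14:30:05.456Z","level":"WARN","target":"kiseki_chunk","message":"Pool capacity warning","pool":"fast-nvme"}
{"timestamp":"2026-04-23T14:30:09.789Z","level":"ERROR","target":"kiseki_raft","message":"Commit timeout","shard":"shard-0002"}
EOF

jq -r 'select(.level == "WARN" or .level == "ERROR") | "\(.level) \(.target): \(.message)"' "$log"
```

In production the same filter would typically live in the log aggregation pipeline (Loki, ELK) rather than a shell.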
Log levels
| Level | Usage |
|---|---|
| ERROR | Unrecoverable failures, invariant violations, data loss events. |
| WARN | Recoverable issues, degraded state, approaching capacity limits. |
| INFO | Significant state changes: leader election, key rotation, shard split, node join/leave. |
| DEBUG | Detailed operational events: individual RPCs, cache hits/misses, EC operations. |
| TRACE | Wire-level detail: Raft message contents, HKDF inputs, bitmap operations. |
Security in logs
- Tenant-identifying fields (tenant_id, namespace) are present for correlation.
- Content fields (file names, chunk plaintext, key material) are never logged (I-K8).
- Logs ship to the same audit/observability pipeline.
Distributed tracing (OpenTelemetry)
Kiseki uses OpenTelemetry for distributed tracing across the full write/read path.
Configuration
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP gRPC endpoint. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name in traces. |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (1.0 = 100%, 0.1 = 10%). Reduce in production for high-throughput workloads. |
Trace propagation
Every write/read path carries a trace ID via OpenTelemetry context propagation. Traces span:
client -> gateway -> composition -> log -> chunk -> view
For the native client path:
client (FUSE) -> transport -> composition -> log -> chunk
Jaeger integration
The development Docker Compose stack includes Jaeger for trace visualization:
- Jaeger UI: http://localhost:16686
- OTLP gRPC receiver: localhost:4317
Trace scoping
Traces respect the zero-trust boundary:
- Tenant-scoped traces are visible only to the tenant admin (via tenant audit export).
- Cluster admin sees system-level spans. No tenant content appears in span attributes visible to the cluster admin.
- Trace overhead is approximately 1-2% on the data path (acceptable for production).
Event store
The admin dashboard maintains an in-memory event store for diagnostic events. Events are categorized and severity-tagged.
Event categories
| Category | Events |
|---|---|
| node | Node join, node leave, node unreachable, node recovered. |
| shard | Shard created, shard split, shard maintenance entered/exited. |
| device | Device added, device failed, SMART warning, evacuation started/completed. |
| tenant | Tenant created, tenant deleted, quota changed. |
| security | Auth failure, cert revocation, crypto-shred. |
| admin | Maintenance mode toggle, backup requested, scrub requested, tuning parameter change. |
| gateway | Protocol errors, connection surge, rate limiting. |
| raft | Leader election, membership change, snapshot transfer. |
Event severities
| Severity | Description |
|---|---|
| info | Normal operations. |
| warning | Attention needed, but the system is operating. |
| error | Failure requiring investigation. |
| critical | Immediate action required (data at risk, quorum lost). |
Event API
# All events from the last 3 hours
curl http://node1:9090/ui/api/events
# Errors from the last hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1'
# Device events, last 50
curl 'http://node1:9090/ui/api/events?category=device&limit=50'
# Security events from the last 24 hours
curl 'http://node1:9090/ui/api/events?category=security&hours=24'
Historical metrics API
# Metric snapshots from the last 3 hours
curl http://node1:9090/ui/api/history
# Last 6 hours
curl 'http://node1:9090/ui/api/history?hours=6'
The history endpoint returns time-series data points suitable for charting. The default retention is 3 hours in memory. For longer retention, use Prometheus.
Grafana integration
For production monitoring with alerting and long-term storage, configure Prometheus to scrape Kiseki metrics and visualize with Grafana.
Prometheus scrape configuration
scrape_configs:
- job_name: 'kiseki'
scrape_interval: 15s
static_configs:
- targets:
- 'node1:9090'
- 'node2:9090'
- 'node3:9090'
metrics_path: '/metrics'
Recommended Grafana dashboards
Cluster overview dashboard:
- Cluster health (up/down per node)
- Total Raft entries/sec (rate of kiseki_raft_entries_total)
- Gateway request rate (rate of kiseki_gateway_requests_total)
- Gateway latency p50/p99 (kiseki_gateway_request_duration_seconds)
- Pool utilization (kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes)
Per-node dashboard:
- Raft commit latency histogram (kiseki_raft_commit_latency_seconds)
- Chunk read/write throughput
kiseki_raft_commit_latency_seconds) - Chunk read/write throughput
- Transport connection count
- Shard delta count per shard
Capacity dashboard:
- Pool fill percentage over time
- Pool capacity trend (linear projection for capacity planning)
- Delta count growth rate (shard split prediction)
Key management dashboard:
- Key rotation count over time (kiseki_key_rotation_total)
- Crypto-shred count (kiseki_crypto_shred_total)
Alerting rules
Recommended Prometheus alerting rules:
groups:
- name: kiseki
rules:
- alert: KisekiNodeDown
expr: up{job="kiseki"} == 0
for: 1m
labels:
severity: critical
- alert: KisekiPoolCapacityWarning
expr: >
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.85
for: 5m
labels:
severity: warning
- alert: KisekiPoolCapacityCritical
expr: >
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.92
for: 1m
labels:
severity: critical
- alert: KisekiGatewayLatencyHigh
expr: >
histogram_quantile(0.99, rate(kiseki_gateway_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
- alert: KisekiRaftCommitLatencyHigh
expr: >
histogram_quantile(0.99, rate(kiseki_raft_commit_latency_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
Key Management
Kiseki uses a two-layer encryption model where system-level encryption protects data at rest and tenant-level key wrapping controls access. This page covers operational aspects of key management: rotation, re-encryption, crypto-shred, and external KMS integration.
Encryption model
Kiseki implements Model (C) from ADR-002: single data encryption pass at the system layer, with tenant access via key wrapping. No double encryption.
Plaintext chunk
|
v
System DEK (AES-256-GCM) --> Ciphertext (stored on disk)
|
v
System KEK (wraps DEK derivation material)
|
v
Tenant KEK (wraps system DEK derivation parameters per tenant)
System keys
- System DEK: Per-chunk symmetric key derived locally on each storage node via HKDF-SHA256 (ADR-003). Never stored, never transmitted. Derivation: HKDF(master_key[epoch], chunk_id, "kiseki-chunk-dek-v1").
- System master key: Per-epoch master key stored in the system key manager (kiseki-keyserver). Storage nodes fetch it at startup and on epoch rotation, then derive per-chunk DEKs locally. The key manager never sees individual chunk IDs.
- System KEK: Wraps system master keys. Managed by the cluster admin.
Tenant keys
- Tenant KEK: Key wrapping key managed by the tenant’s chosen KMS backend. Wraps access to system DEK derivation parameters (epoch + chunk_id). Destroying the tenant KEK = crypto-shred (data becomes unreadable).
- No Tenant DEK: Model (C) does not double-encrypt. The tenant layer is key-wrapping, not data-encryption.
Invariants
- I-K1: No plaintext chunk is ever persisted to storage.
- I-K2: No plaintext payload is ever sent on the wire.
- I-K7: Authenticated encryption (AES-256-GCM) everywhere.
- I-K8: Keys are never logged, printed, transmitted in the clear, or stored in configuration files.
System key manager
The system key manager (kiseki-keyserver) is a dedicated HA service
backed by its own Raft consensus group.
Deployment
Deploy on 3-5 dedicated nodes, separate from storage nodes. The system key manager must be at least as available as the log (I-K12) because its unavailability blocks all chunk writes cluster-wide.
Key distribution
kiseki-keyserver:
Stores: master_key per epoch (Raft-replicated)
Serves: master_key to authenticated kiseki-server processes (mTLS)
Never sees: individual chunk_ids or per-chunk operations
kiseki-server:
Caches: master_key (mlock'd, MADV_DONTDUMP, seccomp)
Derives: per-chunk DEK = HKDF(master_key, chunk_id) -- locally
Never sends: chunk_ids to the key manager
This design prevents the key manager from building an index of all chunk IDs, which would leak per-tenant access patterns.
Key rotation
System key rotation
System key rotation creates a new epoch with a new master key. The rotation process:
- Cluster admin initiates rotation via RotateSystemKey().
- The key manager generates a new master key and assigns a new epoch.
- Storage nodes are notified and fetch the new master key.
- New writes use the new epoch. Old data retains its epoch.
- Two epochs coexist during the rotation window (I-K6).
Old master keys are retained until all data encrypted under them has been re-encrypted or deleted. Full re-encryption is available as an explicit admin action.
Tenant key rotation
Tenant key rotation creates a new epoch for the tenant’s KEK:
- Tenant admin initiates rotation via RotateTenantKey(tenant).
- The tenant KMS generates or rotates the key (provider-specific).
- New envelope wrappings use the new epoch.
- Old wrapped material remains valid until background re-wrapping completes.
Background re-encryption
A background monitor detects envelopes wrapped under old epochs and schedules re-wrapping. The rewrap worker:
- Reads envelopes with old-epoch tenant wrapping.
- Unwraps with old KEK.
- Re-wraps with current KEK.
- Writes the updated envelope.
For providers that support server-side rewrap (e.g., Vault Transit), the rewrap operation never exposes plaintext derivation material to the storage node.
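The worker's inner step can be sketched as below. The `TenantKms` trait and the toy XOR scheme are invented so the sketch is runnable; they stand in for the real TenantKmsProvider backends and real key wrapping.

```rust
// Minimal sketch of one rewrap pass; names and scheme are illustrative.
trait TenantKms {
    fn unwrap(&self, epoch: u64, wrapped: &[u8]) -> Vec<u8>;
    fn wrap(&self, epoch: u64, material: &[u8]) -> Vec<u8>;
}

/// Toy provider: "wrapping" is an epoch-keyed XOR. Not cryptography.
struct XorKms;

impl TenantKms for XorKms {
    fn unwrap(&self, epoch: u64, wrapped: &[u8]) -> Vec<u8> {
        wrapped.iter().map(|&b| b ^ (epoch as u8)).collect()
    }
    fn wrap(&self, epoch: u64, material: &[u8]) -> Vec<u8> {
        material.iter().map(|&b| b ^ (epoch as u8)).collect()
    }
}

struct Envelope {
    tenant_key_epoch: u64,
    wrapped_material: Vec<u8>,
}

/// Unwrap with the old KEK epoch, re-wrap with the current one, and update
/// the envelope. Returns false when the envelope is already current.
fn rewrap(env: &mut Envelope, kms: &dyn TenantKms, current_epoch: u64) -> bool {
    if env.tenant_key_epoch == current_epoch {
        return false;
    }
    let material = kms.unwrap(env.tenant_key_epoch, &env.wrapped_material);
    env.wrapped_material = kms.wrap(current_epoch, &material);
    env.tenant_key_epoch = current_epoch;
    true
}
```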
Crypto-shred
Crypto-shred is the authoritative deletion mechanism in Kiseki. Destroying the tenant KEK renders all tenant data unreadable.
Process
- Tenant admin initiates via CryptoShred(tenant).
- The tenant KMS destroys the KEK (provider-specific: Vault key deletion, AWS KMS key scheduling, PKCS#11 key destruction).
- All cached key material for the tenant is invalidated across the cluster.
- Native clients detect the shred via key health checks (default every 30 seconds) and wipe their caches (I-CC12).
What happens after crypto-shred
- Data is semantically deleted: No component can decrypt the tenant’s data because the KEK is destroyed.
- Ciphertext remains on disk: Physical GC runs separately when chunk refcount = 0 AND no retention hold is active (I-C2b).
- Audit trail preserved: Crypto-shred events are recorded in the audit log.
Ordering requirement
If retention holds are needed, they must be set before crypto-shred:
Set retention hold -> Crypto-shred -> Hold expires -> GC eligible
This prevents a race between crypto-shred and GC (I-C2b).
Detection latency
Crypto-shred detection is bounded by:
min(key_health_interval, max_disconnect_seconds).
Default key health check interval: 30 seconds, configurable per tenant within [5s, 300s] (I-K15).
External KMS providers (ADR-028)
Kiseki supports five tenant KMS backends via the TenantKmsProvider
trait. The provider is selected per-tenant at onboarding.
Provider comparison
| # | Backend | Transport | Material model | Key material location |
|---|---|---|---|---|
| 1 | Kiseki Internal | In-process | Local | Separate Raft group in Kiseki |
| 2 | HashiCorp Vault | HTTPS | Local (cached) | Vault Transit engine |
| 3 | KMIP 2.1 | mTLS (TTLV) | Remote or local | KMIP server / HSM |
| 4 | AWS KMS | HTTPS | Remote only | AWS KMS |
| 5 | PKCS#11 v3.0 | Local (FFI) | Remote only (HSM) | Hardware Security Module |
Provider invariants
- I-K16: Provider abstraction is opaque to callers. No correctness decision depends on which backend is selected.
- I-K17: Wrap/unwrap operations include AAD (chunk_id) binding. A wrapped blob cannot be spliced from one envelope to another.
- I-K18: Provider is validated on configuration: connectivity test, wrap/unwrap round-trip, certificate chain. Validation failure prevents tenant activation.
- I-K19: Internal provider stores tenant KEKs in a separate Raft group from system master keys.
- I-K20: Provider migration (e.g., Internal to Vault) requires re-wrapping all existing envelopes. Migration is background, audited, and preserves data availability throughout.
Provider 1: Kiseki Internal (default)
Zero-configuration default. Kiseki manages tenant KEKs internally in a Raft group separate from system master keys. Suitable for single-operator deployments.
Security trade-off: Internal mode does not provide the full two-layer security guarantee, because the system key store and the tenant key store live in the same cluster trust domain; compromising that cluster yields full access. Compliance-sensitive tenants should use an external provider.
Provider 2: HashiCorp Vault
Uses Vault’s Transit secrets engine for encryption-as-a-service:
| Kiseki operation | Vault API |
|---|---|
| wrap | POST /transit/encrypt/:name (with context = AAD) |
| unwrap | POST /transit/decrypt/:name (with context = AAD) |
| rotate | POST /transit/keys/:name/rotate |
| rewrap | POST /transit/rewrap/:name (server-side, no plaintext exposure) |
| destroy | DELETE /transit/keys/:name |
Provider 3: KMIP 2.1
Standards-based integration with enterprise KMS and HSM appliances. Uses mTLS over TTLV binary protocol.
Provider 4: AWS KMS
Cloud-native KMS integration. Key material never leaves AWS. All wrap/unwrap operations are remote HTTPS calls. Suitable for hybrid cloud deployments.
Provider 5: PKCS#11 v3.0
Direct HSM integration via the PKCS#11 C API (FFI). Key material stays in the HSM. This is the highest security level, but it requires HSM hardware on, or network-accessible from, storage nodes.
OIDC integration
Tenant identity providers can be integrated for second-stage authentication (I-Auth2). This is optional and orthogonal to the KMS provider choice.
When configured, workload-level identity is validated against the tenant admin’s authorization via OIDC/JWT tokens, providing “authorized by my tenant admin” on top of the mTLS-based “belongs to this cluster” identity.
Keycloak is included in the development Docker Compose stack for OIDC testing.
Operational checklist
Key rotation schedule
| Key type | Recommended interval | Enforcement |
|---|---|---|
| System master key | Quarterly | Manual (cluster admin) |
| Tenant KEK | Per tenant policy | Manual or automated via KMS |
| TLS certificates | Annual | Cluster CA renewal |
Monitoring key health
```bash
# Check key manager health
kiseki-server keymanager health

# Check tenant KMS connectivity
kiseki-server keymanager check-kms

# Monitor key rotation metrics
curl -s http://node1:9090/metrics | grep kiseki_key_rotation_total

# Monitor crypto-shred events
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total
```
Key material security
- Master keys are mlock’d in memory on storage nodes (prevent swapping).
- Core dumps are disabled (LimitCORE=0 in systemd, MADV_DONTDUMP).
- seccomp filters restrict system calls on key-handling threads.
- Runtime integrity monitor detects ptrace, /proc/pid/mem access, and debugger attachment (I-O7).
- Keys are zeroized on deallocation (Zeroizing<Vec<u8>>).
System Overview
Kiseki is a distributed storage system designed for HPC and AI workloads. It provides a unified data fabric with POSIX (FUSE), NFS, and S3 access paths, two-layer encryption with tenant-controlled crypto-shred, and pluggable HPC transports (CXI/Slingshot, InfiniBand, RoCEv2).
Workspace structure
The codebase is a single Rust workspace with 18 crates:
| Crate | Purpose |
|---|---|
| kiseki-common | Shared types, HLC, identifiers, errors |
| kiseki-proto | Generated protobuf/gRPC code |
| kiseki-crypto | FIPS AEAD (AES-256-GCM), envelope encryption, tenant KMS providers |
| kiseki-raft | Shared Raft config, redb log store, TCP transport |
| kiseki-transport | Transport abstraction: TCP+TLS, RDMA verbs, CXI/libfabric |
| kiseki-log | Log context: delta ordering, shard lifecycle, Raft consensus |
| kiseki-block | Raw block device I/O, bitmap allocator, superblock (ADR-029) |
| kiseki-chunk | Chunk storage: placement, erasure coding, GC, device management |
| kiseki-composition | Composition context: namespace, refcount, multipart |
| kiseki-view | View materialization: stream processors, MVCC pins |
| kiseki-gateway | Protocol gateway: NFS and S3 translation |
| kiseki-client | Native client: FUSE, transport selection, client-side cache |
| kiseki-keymanager | System key manager with Raft HA |
| kiseki-audit | Append-only audit log with per-tenant shards |
| kiseki-advisory | Workflow advisory: hints, telemetry, budgets (ADR-020/021) |
| kiseki-control | Control plane: tenancy, IAM, policy, federation |
| kiseki-server | Storage node binary (composes all server-side crates) |
| kiseki-acceptance | BDD acceptance tests (cucumber-rs) |
Bounded contexts
The domain is organized into eight bounded contexts, each with a distinct responsibility, failure domain, and scaling concern:
- Log – Delta ordering, Raft consensus, shard lifecycle
- Chunk Storage – Encrypted chunk persistence, placement, EC, GC
- Composition – Tenant-scoped metadata assembly, namespace management
- View Materialization – Protocol-shaped materialized projections
- Protocol Gateway – NFS and S3 wire protocol translation
- Control Plane – Tenancy, IAM, quota, policy, federation
- Key Management – System DEK/KEK, tenant KMS providers, crypto-shred
- Workflow Advisory – Client hints, telemetry feedback (cross-cutting)
Additionally, Native Client runs on compute nodes as a separate trust boundary and Block I/O handles raw device management underneath chunk storage.
Data path
Client (plaintext) ──encrypt──► Gateway / Native Client
│
▼
Composition
(assemble chunks, record delta)
│
┌────────┴────────┐
▼ ▼
Log (Raft) Chunk Storage
(commit delta, (write encrypted
replicate) chunk to device)
Write path: The client (native or protocol) encrypts data with the tenant KEK wrapping a system DEK. The composition layer assembles chunk references and records a delta. The delta is committed through Raft on the owning shard. Chunks are written to affinity pools with erasure coding.
Read path: The client issues a view lookup (materialized from log deltas). The view resolves chunk references. Chunks are read from devices, decrypted, and returned to the client.
Control path
Admin ──► Control Plane (gRPC)
│
├── Tenant / Namespace / Quota / Policy
├── Flavor management
├── Federation (async cross-site)
└── Advisory policy (hint budgets, profiles)
The control plane manages tenant lifecycle, IAM, quotas, compliance tags,
placement policy, and federation. It communicates with storage nodes via
gRPC on the management network. The control plane depends only on
kiseki-common and kiseki-proto (crate-graph firewall, ADR-027).
Advisory path (ADR-020)
Client ──hints──► Advisory Runtime ──telemetry──► Client
│
├── Route hints to Chunk / View / Composition
├── Emit caller-scoped telemetry feedback
└── Audit advisory events
The workflow advisory system is a cross-cutting concern (not a bounded context). It carries two flows over a bidirectional gRPC channel per declared workflow:
- Hints (client to storage): advisory steering signals for prefetch, affinity, priority, and phase-adaptive tuning. Never authoritative (I-WA1).
- Telemetry feedback (storage to client): caller-scoped signals about backpressure, locality, materialization lag, and QoS headroom (I-WA5).
The advisory runtime runs on a dedicated tokio runtime, isolated from the data path. Advisory failures never block data-path operations (I-WA2).
Network ports
| Port | Purpose |
|---|---|
| 9100 | Data-path gRPC (Log, Chunk, Composition, View, Discovery) |
| 9101 | Advisory gRPC (WorkflowAdvisoryService) |
| 9000 | S3 HTTP gateway |
| 2049 | NFS server |
| 9090 | Prometheus metrics + health + admin UI |
Binaries
| Binary | Contents | Deployment |
|---|---|---|
| kiseki-server | Log, Chunk, Composition, View, Gateway, Audit, Advisory | Every storage node |
| kiseki-client-fuse | Native client with FUSE | Compute nodes |
| kiseki-control | Control plane | Management network (3+ instances) |
| kiseki-keyserver | System key manager (Raft HA) | Dedicated cluster (3-5 nodes) |
Bounded Contexts
Eight bounded contexts form the core domain model. Each has a distinct responsibility, failure domain, and scaling concern. This page describes each context’s purpose, implementing crate, key types, and governing invariants.
1. Log
Crate: kiseki-log
Purpose: Accept deltas, assign them a total order within a shard, replicate via Raft, persist durably, and support range reads for view materialization and replay.
Key types: Delta, DeltaEnvelope, Shard, ShardConfig, ShardInfo
Key invariants:
| ID | Rule |
|---|---|
| I-L1 | Within a shard, deltas have a total order |
| I-L2 | A committed delta is durable on a majority of Raft replicas before ack |
| I-L3 | A delta is immutable once committed |
| I-L4 | Delta GC requires ALL consumers (views + audit) to have advanced past the delta |
| I-L5 | A composition is not visible until all referenced chunks are durable |
| I-L6 | Shards have a hard ceiling triggering mandatory split (delta count, byte size, or throughput) |
| I-L7 | Delta envelope has separated system-visible header and tenant-encrypted payload |
| I-L8 | Cross-shard rename returns EXDEV (no 2PC across shards) |
| I-L9 | A delta’s inlined payload is immutable after write; threshold changes apply prospectively |
Failure domain: Per-shard. Leader loss causes transient latency (election). Quorum loss makes the shard unavailable.
2. Chunk Storage
Crate: kiseki-chunk (with kiseki-block for device I/O)
Purpose: Store and retrieve opaque encrypted chunks. Manage placement across affinity pools. Handle erasure coding and replication. Run GC based on refcounts and retention holds.
Key types: Chunk, ChunkId, Envelope, AffinityPool, DeviceBackend
Key invariants:
| ID | Rule |
|---|---|
| I-C1 | Chunks are immutable; new versions are new chunks |
| I-C2 | A chunk is not GC’d while any composition references it (refcount > 0) |
| I-C2b | A chunk is not GC’d while a retention hold is active |
| I-C3 | Chunks are placed according to affinity policy from the referencing view descriptor |
| I-C4 | Durability strategy is per affinity pool (EC default, N-copy replication available) |
| I-C5 | Pool writes rejected at Critical threshold (SSD 85%, HDD 92%); ENOSPC at Full |
| I-C6 | EC parameters are immutable per pool; SetPoolDurability applies to new chunks only |
| I-C7 | All chunk data writes are aligned to device physical block size (ADR-029) |
| I-C8 | Allocation bitmap is ground truth; free-list is a derived cache rebuilt on startup |
Failure domain: Per-chunk or per-device. Chunk loss recoverable via EC parity or replicas.
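Invariant I-C8 above can be sketched as follows; the function name is invented, but it shows the direction of derivation: the free-list is computed from the bitmap at startup, never persisted as authoritative state.

```rust
// Sketch of I-C8: the allocation bitmap is ground truth; the free-list is
// a derived cache rebuilt from it on startup.
fn rebuild_free_list(allocation_bitmap: &[bool]) -> Vec<usize> {
    allocation_bitmap
        .iter()
        .enumerate()
        .filter(|&(_, &allocated)| !allocated) // keep unallocated blocks
        .map(|(index, _)| index)
        .collect()
}
```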
3. Composition
Crate: kiseki-composition
Purpose: Maintain tenant-scoped metadata structures describing how chunks assemble into data units (files, objects). Manage namespaces. Record mutations as deltas in the log.
Key types: Composition, Namespace, CompositionMutation
Key invariants:
| ID | Rule |
|---|---|
| I-X1 | A composition belongs to exactly one tenant |
| I-X2 | A composition’s chunks respect the tenant’s dedup policy (global hash or per-tenant HMAC) |
| I-X3 | A composition’s mutation history is fully reconstructible from its shard’s deltas |
Failure domain: Coupled to Log. If a shard fails, its compositions are affected.
4. View Materialization
Crate: kiseki-view
Purpose: Consume deltas from shards and maintain materialized views per view descriptor. Handle view lifecycle (create, discard, rebuild) and MVCC read pins.
Key types: View, ViewDescriptor, StreamProcessor, MvccPin
Key invariants:
| ID | Rule |
|---|---|
| I-V1 | A view is derivable from its source shard(s) alone (rebuildable-from-log) |
| I-V2 | A view’s observed state is a consistent prefix of its source log(s) up to a watermark |
| I-V3 | Cross-view consistency governed by the reading protocol’s declared consistency model |
| I-V4 | MVCC read pins have bounded lifetime; pin expiration revokes the snapshot guarantee |
Failure domain: Per-view. A fallen-behind view serves stale data. A lost view can be rebuilt from the log.
5. Protocol Gateway
Crate: kiseki-gateway
Purpose: Translate wire protocol requests (NFS, S3) into operations against views and the log. Serve reads from views. Route writes as deltas to the log via composition. Perform tenant-layer encryption for protocol-path clients.
Key types: Protocol gateway instance, protocol plugin
Trust boundary: NFS/S3 clients send plaintext over TLS to the gateway. The gateway encrypts before writing to log/chunks. Plaintext exists in gateway memory only ephemerally.
Failure domain: Per-gateway. Crash disconnects affected clients. Restart and client reconnect recovers.
6. Control Plane
Crate: kiseki-control
Purpose: Declarative API for tenancy, IAM, policy, placement, discovery, compliance tagging, and federation. Manages cluster-level and tenant-level configuration.
Key types: Organization, Project, Workload, Flavor,
ComplianceRegime, RetentionHold, FederationPeer
Key invariants:
| ID | Rule |
|---|---|
| I-T1 | Tenants are fully isolated; no cross-tenant data access |
| I-T2 | Tenant resource consumption bounded by quotas at org and workload levels |
| I-T3 | Tenant keys not accessible to other tenants or shared processes |
| I-T4 | Cluster admin cannot access tenant data without tenant admin approval |
| I-T4c | Cluster admin modifications to pools with tenant data are audit-logged to tenant |
Failure domain: Control plane unavailability prevents new tenant creation and policy changes, but the existing data path continues with last-known configuration.
7. Key Management
Crates: kiseki-keymanager, kiseki-crypto
Purpose: Custody, rotation, escrow, and issuance of all key material. Two layers: system keys (cluster admin) and tenant key wrapping (tenant admin via tenant KMS). Orchestrate crypto-shred.
Key types: SystemDek, SystemKek, TenantKek, KeyEpoch,
Envelope, TenantKmsProvider
Tenant KMS providers (ADR-028): Five pluggable backends implementing
the TenantKmsProvider trait – Kiseki-Internal, HashiCorp Vault, KMIP 2.1,
AWS KMS, and PKCS#11.
Key invariants:
| ID | Rule |
|---|---|
| I-K1 | No plaintext chunk is ever persisted to storage |
| I-K2 | No plaintext payload is ever sent on the wire |
| I-K4 | System can enforce access without reading plaintext |
| I-K5 | Crypto-shred renders data unreadable within bounded time |
| I-K6 | Key rotation does not lose access to old data until explicit cutover |
| I-K7 | Authenticated encryption everywhere |
| I-K8 | Keys are never logged, printed, transmitted in the clear, or in config files |
| I-K16 | Provider abstraction is opaque to callers |
| I-K17 | Wrap/unwrap operations include AAD (chunk_id) binding |
Failure domain: KMS unavailability blocks new encrypt/decrypt operations. This context’s availability is as critical as the Log’s.
8. Workflow Advisory (cross-cutting)
Crate: kiseki-advisory
Purpose: Carry workflow hints from clients to storage and telemetry feedback from storage back to clients. Route advisory signals to the bounded context best able to act on them.
Key types: WorkflowRef, OperationAdvisory, PoolHandle,
PoolDescriptor, HintBudget
Key invariants:
| ID | Rule |
|---|---|
| I-WA1 | Hints are advisory only; no correctness decision depends on a hint |
| I-WA2 | Advisory subsystem is isolated from the data path; failures do not block data-path operations |
| I-WA3 | A workflow belongs to exactly one workload; authorization is per-operation |
| I-WA5 | Telemetry feedback is scoped to the caller’s authorization |
| I-WA6 | Advisory requests are not existence or content oracles |
| I-WA7 | Hint budgets enforced per workload within parent ceilings |
| I-WA14 | Hints do not extend tenant capabilities |
Runtime isolation: The advisory runtime runs on a dedicated tokio
runtime separate from the data-path runtime (ADR-021). No data-path crate
depends on kiseki-advisory.
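A hedged sketch of how per-workload budget enforcement within a parent ceiling (I-WA7) might look; all type and field names here are invented for illustration:

```rust
// Hypothetical sketch of I-WA7: a workload consumes hints from its own
// budget, capped by the parent (project/org) ceiling.
struct HintBudget {
    used: u32,
    workload_limit: u32,
    parent_ceiling: u32,
}

impl HintBudget {
    /// Accept `n` hints if they fit under both the workload limit and the
    /// parent ceiling. Hints over budget are simply dropped, never queued,
    /// because they are advisory only (I-WA1).
    fn try_consume(&mut self, n: u32) -> bool {
        let effective = self.workload_limit.min(self.parent_ceiling);
        if self.used.saturating_add(n) <= effective {
            self.used += n;
            true
        } else {
            false
        }
    }
}
```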
Cross-context relationships
| Producer | Consumer | What flows |
|---|---|---|
| Control Plane | All contexts | Policy, placement, tenant config, compliance tags |
| Log | Composition, View | Deltas (ordered, durable) |
| Composition | Chunk Storage | Chunk references (refcounts) |
| Key Management | Chunk Storage | System DEKs |
| Key Management | Gateway, Native Client | Tenant KEK (wrapping) |
| View Materialization | Gateway, Native Client | Materialized view state |
| Chunk Storage | View, Native Client | Chunk data (encrypted) |
Data Flow
This page describes the write, read, inline, and cross-node data paths through the Kiseki system.
Write path
┌──────────┐ plaintext ┌──────────────────┐
│ Client │ ──────────────► │ Gateway / │
│ │ (over TLS) │ Native Client │
└──────────┘ └────────┬──────────┘
│ 1. Encrypt with tenant KEK
│ wrapping system DEK
│ 2. Content-defined chunking
│ (Rabin fingerprinting)
▼
┌──────────────────┐
│ Composition │
│ │
│ 3. Record chunk │
│ references │
│ 4. Build delta │
└───────┬──────────┘
│
┌────────┴────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Log (Raft) │ │ Chunk Storage │
│ │ │ │
│ 5. Commit │ │ 6. Write │
│ delta via │ │ encrypted │
│ Raft │ │ chunk to │
│ 7. Replicate │ │ device │
│ to │ │ 8. EC encode │
│ majority │ │ across │
│ │ │ pool │
└──────────────┘ └──────────────┘
Step-by-step
- Client encrypt: The native client encrypts data before it leaves the process. Protocol-path clients (NFS/S3) send plaintext over TLS to the gateway, which encrypts on their behalf.
- Content-defined chunking: Data is split into variable-size chunks using Rabin fingerprinting. Each chunk gets a content-addressed ID (SHA-256 hash of plaintext, or HMAC when tenant opts out of cross-tenant dedup).
- Compose: The composition layer records chunk references and constructs a delta describing the mutation (create, update, delete).
- Raft commit: The delta is appended to the owning shard’s Raft log. The leader replicates to a majority of voters before acknowledging.
- Chunk write: Encrypted chunks are written to affinity pool devices with erasure coding (or N-copy replication, per pool policy).
- Ack: The write is acknowledged to the client only after the delta is committed (I-L2) and all referenced chunks are durable (I-L5).
Read path
┌──────────┐ ┌──────────────────┐
│ Client │ ◄────────────── │ Gateway / │
│ │ plaintext │ Native Client │
└──────────┘ (over TLS) └────────┬──────────┘
▲ 5. Decrypt
│
┌────────┴──────────┐
│ View Lookup │
│ │
│ 1. Resolve path │
│ to composition │
│ 2. Get chunk list │
└────────┬──────────┘
│
▼
┌──────────────────┐
│ Chunk Storage │
│ │
│ 3. Read chunks │
│ from device │
│ 4. EC decode if │
│ degraded │
└──────────────────┘
Step-by-step
- View lookup: The client or gateway queries a materialized view to resolve a path (POSIX) or key (S3) to a composition and its chunk list.
- Chunk read: Encrypted chunks are read from the storage devices. If a device is degraded, EC parity reconstructs the missing data.
- Decrypt: The client (native path) or gateway (protocol path) unwraps the system DEK using the tenant KEK, then decrypts the chunk data with AES-256-GCM.
- Return: Plaintext is returned to the client.
Inline path (ADR-030)
Small files below the configurable inline threshold bypass chunk storage entirely:
Client ──► Composition ──► Log (Raft)
│
▼
Delta with inline payload
│
▼
Raft replication to voters
│
▼
State machine apply:
store in small/objects.redb
Threshold computation: The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set:
clamp(min(voter_budgets) / file_count_estimate, INLINE_FLOOR, INLINE_CEILING)
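As a runnable sketch of the formula above — the floor and ceiling values here are invented placeholders, not Kiseki's actual defaults:

```rust
const INLINE_FLOOR: u64 = 4 * 1024; // placeholder bounds for illustration,
const INLINE_CEILING: u64 = 128 * 1024; // not the real defaults

/// Inline threshold for a shard: the minimum affordable threshold across
/// the voter set, clamped to [INLINE_FLOOR, INLINE_CEILING].
fn inline_threshold(voter_budgets: &[u64], file_count_estimate: u64) -> u64 {
    let min_budget = voter_budgets.iter().copied().min().unwrap_or(0);
    (min_budget / file_count_estimate.max(1)).clamp(INLINE_FLOOR, INLINE_CEILING)
}
```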
Key invariants:
- I-L9: Inlined payloads are immutable after write; threshold changes apply prospectively only
- I-SF5: Inline content is offloaded to small/objects.redb on state machine apply; snapshots include inline content from redb
- I-SF7: Per-shard Raft inline throughput capped at KISEKI_RAFT_INLINE_MBPS (default 10 MB/s)
Cross-node data paths
Raft replication
Each shard runs an independent Raft group (ADR-026). The leader replicates log entries (deltas) to followers via the Raft RPC transport. Replication uses mTLS on the data fabric.
Leader ──► Follower 1 (AppendEntries)
──► Follower 2 (AppendEntries)
──► Follower 3 (AppendEntries)
Committed entries are persisted in RedbRaftLogStore on each voter.
Snapshot transfer
When a follower is too far behind or a new voter joins, the leader sends a full snapshot. Snapshots are transferred as length-prefixed JSON over the Raft transport connection.
For shards with inline data, the snapshot includes all entries from small/objects.redb (I-SF5).
Chunk replication and EC
Chunks are placed across distinct physical devices within a pool using deterministic hashing (CRUSH-like). No two EC fragments of the same chunk reside on the same device (I-D4).
Device failure triggers automatic repair from EC parity or replicas (I-D1).
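The distinct-device property (I-D4) can be illustrated with a toy stand-in for the CRUSH-like placement hash; the mixing constant and retry scheme below are invented for this sketch and are not the real algorithm.

```rust
// Toy deterministic placement: same chunk_id always yields the same device
// set, and no two fragments share a device (I-D4).
fn place_fragments(chunk_id: u64, device_count: usize, fragment_count: usize) -> Vec<usize> {
    assert!(fragment_count <= device_count, "cannot place k+m fragments on fewer devices");
    let mut chosen = Vec::with_capacity(fragment_count);
    let mut attempt: u64 = 0;
    while chosen.len() < fragment_count {
        // Deterministic hash of (chunk_id, attempt); retry on collision.
        let h = (chunk_id ^ attempt).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        let device = (h % device_count as u64) as usize;
        if !chosen.contains(&device) {
            chosen.push(device);
        }
        attempt += 1;
    }
    chosen
}
```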
Federation
Federated sites replicate data asynchronously. Only ciphertext is replicated – no key material in the replication stream (I-CS3). All federated sites for a tenant connect to the same tenant KMS.
Encryption Model
Kiseki uses a two-layer encryption architecture (ADR-002, model C) that separates data encryption from access control. One encryption pass protects data; key wrapping controls who can read it.
Two-layer architecture
┌─────────────────────────────────────────────────┐
│ Tenant Layer (access) │
│ │
│ Tenant KEK (controlled by tenant admin) │
│ wraps the system DEK for tenant-scoped access │
│ │
│ Destroying the tenant KEK = crypto-shred │
│ (all tenant data rendered unreadable) │
├─────────────────────────────────────────────────┤
│ System Layer (data) │
│ │
│ System DEK encrypts chunk data (AES-256-GCM) │
│ System KEK wraps system DEKs │
│ Always on -- no unencrypted chunks │
└─────────────────────────────────────────────────┘
System layer: The system DEK encrypts every chunk using AES-256-GCM. System DEKs are derived per-chunk using HKDF-SHA256 from a master key (ADR-003). The system KEK wraps system DEKs and is managed by the cluster admin via the system key manager.
Tenant layer: The tenant KEK wraps the system DEK for tenant-scoped access control. There is no double encryption – one data encryption pass, with key wrapping for access control. The tenant admin controls the tenant KEK via the tenant KMS.
Envelope structure
Each chunk is stored as an envelope containing:
┌──────────────────────────────────────────┐
│ Envelope │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Ciphertext (AES-256-GCM) │ │
│ │ (encrypted chunk data) │ │
│ └──────────────────────────────────┘ │
│ │
│ auth_tag (16 bytes, GCM tag) │
│ nonce (12 bytes, unique per chunk) │
│ system_key_epoch (current epoch) │
│ tenant_key_epoch (current epoch) │
│ chunk_id (content-addressed) │
│ algorithm_id (for crypto-agility) │
│ │
│ System wrapping metadata │
│ Tenant wrapping metadata │
└──────────────────────────────────────────┘
The envelope carries algorithm identifiers for crypto-agility (I-K7). All metadata is authenticated – unauthenticated encryption is never acceptable.
Key derivation
System DEKs are derived locally on each storage node using HKDF-SHA256 (ADR-003). No DEK-per-chunk RPC is required:
```text
system_dek = HKDF-SHA256(
    ikm  = master_key[epoch],
    salt = chunk_id,
    info = "kiseki-chunk-dek-v1"
)
```
The master key is fetched from the system key manager at startup and on rotation events. DEK derivation is deterministic – the same chunk ID and epoch always produce the same DEK.
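The derivation shape can be sketched as follows. The real system uses HKDF-SHA256 via aws-lc-rs; the toy keyed mix here is a non-cryptographic stand-in purely to keep the example dependency-free while showing the deterministic, local, per-chunk structure.

```rust
// NOT real key derivation: an FNV-style keyed mix standing in for
// HKDF-SHA256 so the sketch runs without crypto dependencies.
fn toy_prf(key: &[u8], data: &[u8]) -> [u8; 32] {
    let mut state: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in key.iter().chain(data.iter()) {
        state ^= b as u64;
        state = state.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        state ^= i as u64;
        state = state.wrapping_mul(0x0000_0100_0000_01b3);
        *byte = (state >> 24) as u8;
    }
    out
}

/// Local, deterministic per-chunk DEK derivation: no RPC, and the same
/// (epoch master key, chunk_id) pair always yields the same DEK.
fn derive_chunk_dek(master_key: &[u8; 32], chunk_id: &[u8]) -> [u8; 32] {
    // The info string versions the derivation (crypto-agility).
    let mut input = chunk_id.to_vec();
    input.extend_from_slice(b"kiseki-chunk-dek-v1");
    toy_prf(master_key, &input)
}
```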
Key rotation
Key rotation is epoch-based (I-K6):
- The admin triggers rotation (system or tenant level)
- A new epoch is created with fresh key material
- New data is encrypted with the current epoch’s keys
- Old data retains its epoch until background re-encryption migrates it
- Two epochs coexist during the rotation window
- Full re-encryption available as an explicit admin action for key-compromise incidents
Crypto-shred
Destroying the tenant KEK renders all tenant data unreadable (I-K5):
1. Set retention hold (if compliance requires)
2. Destroy tenant KEK at tenant KMS
3. All wrapped system DEKs for this tenant become unwrappable
4. Chunk ciphertext remains on disk (system-encrypted) until GC
5. Physical GC runs separately when refcount = 0 AND no retention hold
The ordering contract (I-C2b): set hold before crypto-shred to prevent race with GC.
Client-side detection: periodic key health check (default 30s) detects
KEK_DESTROYED and triggers immediate cache wipe (I-CC12). Maximum
detection latency: min(key_health_interval, max_disconnect_seconds).
Chunk ID derivation
| Mode | Algorithm | Cross-tenant dedup |
|---|---|---|
| Default | SHA-256(plaintext) | Yes |
| Opted-out | HMAC-SHA256(plaintext, tenant_key) | No (zero co-occurrence leak) |
When a tenant opts out of cross-tenant dedup (I-X2, I-K10), chunk IDs are derived using HMAC with a tenant-specific key, making it impossible to determine whether two tenants store the same data.
Tenant KMS providers (ADR-028)
Five pluggable backends implement the TenantKmsProvider trait:
| Provider | Key model | Key location |
|---|---|---|
| Kiseki-Internal | Raft-replicated | On-cluster |
| HashiCorp Vault | Transit secrets engine | External |
| KMIP 2.1 | Standard key management protocol | External |
| AWS KMS | Cloud-managed keys | External |
| PKCS#11 | Hardware security modules | External |
Provider selection is per-tenant at onboarding. The trait fully encapsulates protocol differences – callers never branch on provider type (I-K16). Wrap/unwrap operations include AAD (chunk_id) binding to prevent envelope splicing (I-K17).
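A hedged sketch of the AAD-binding property (I-K17): the trait methods and the mock's prefix scheme below are illustrative assumptions, not the real TenantKmsProvider API, and the mock's "wrapping" is a plain AAD-prefix check standing in for real AEAD.

```rust
// Sketch only: shows that a blob wrapped for one chunk_id cannot be
// unwrapped under another (no envelope splicing).
trait TenantKmsProvider {
    fn wrap(&self, material: &[u8], aad_chunk_id: &[u8]) -> Vec<u8>;
    fn unwrap(&self, wrapped: &[u8], aad_chunk_id: &[u8]) -> Option<Vec<u8>>;
}

struct MockProvider;

impl TenantKmsProvider for MockProvider {
    fn wrap(&self, material: &[u8], aad_chunk_id: &[u8]) -> Vec<u8> {
        // Layout: [aad_len][aad][material] — illustration, not cryptography.
        let mut out = vec![aad_chunk_id.len() as u8];
        out.extend_from_slice(aad_chunk_id);
        out.extend_from_slice(material);
        out
    }

    fn unwrap(&self, wrapped: &[u8], aad_chunk_id: &[u8]) -> Option<Vec<u8>> {
        let aad_len = *wrapped.first()? as usize;
        let (aad, material) = wrapped[1..].split_at(aad_len);
        // A blob spliced into another chunk's envelope fails this check.
        (aad == aad_chunk_id).then(|| material.to_vec())
    }
}
```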
FIPS compliance
Kiseki uses aws-lc-rs with the FIPS feature flag for FIPS 140-2/3
validated cryptographic operations. The kiseki-crypto crate provides:
- AES-256-GCM authenticated encryption
- HKDF-SHA256 key derivation
- SHA-256 hashing
- HMAC-SHA256 for opted-out chunk ID derivation
- zeroize integration for all key material in memory
Delta encryption
Log delta payloads (filenames, attributes, inline data) are encrypted with the system DEK, wrapped with the tenant KEK (I-K3). The delta envelope has structurally separated:
- System-visible header (cleartext or system-encrypted): sequence number, shard ID, hashed_key, operation type, timestamp
- Tenant-encrypted payload: the actual mutation data
Compaction operates on headers only and never decrypts tenant-encrypted payloads (I-O2).
Raft Consensus
Kiseki uses Raft for ordering and replicating deltas within each shard. The implementation is based on openraft 0.10 with a custom TCP transport and redb-backed persistent storage.
Per-shard Raft groups
Each shard runs an independent Raft group (ADR-026, Strategy A). This provides:
- Independent scaling: shard count grows with data volume and throughput
- Isolated failure domains: quorum loss in one shard does not affect others
- No cross-shard coordination: cross-shard rename returns EXDEV (I-L8)
The system key manager also runs its own Raft group for high availability (ADR-007), as do audit log shards (ADR-009).
openraft integration
The kiseki-raft crate defines KisekiTypeConfig used by all Raft groups:
- Node identity: u64 node IDs
- Async runtime: tokio
- Log store: RedbRaftLogStore (persistent) or MemLogStore (testing)
- Entry format: customized per context (log deltas, key manager ops, audit events)
Each context (log, key manager, audit) defines its own request (D) and
response (R) types while sharing the node identity, entry format, and
async runtime configuration.
Persistent log: RedbRaftLogStore
Raft log entries are persisted using redb (ADR-022), a pure-Rust
embedded key-value store. The RedbRaftLogStore provides:
- Durable append and truncation of log entries
- Vote persistence (current term, voted-for)
- Snapshot metadata storage
- Crash-safe operations (redb uses write-ahead logging internally)
For shards with inline data (ADR-030), the state machine offloads inline
content to small/objects.redb on apply. The in-memory state machine does
not hold inline content after apply (I-SF5).
Snapshot transfer
When a follower falls behind or a new voter joins the group, the leader sends a full snapshot:
- Leader serializes the current state machine as length-prefixed JSON
- For shards with inline data, the snapshot includes all entries from small/objects.redb
- The snapshot is streamed over the Raft transport connection
- The follower installs the snapshot and resumes normal replication
Transport and security
Raft RPCs use a custom TCP transport with mTLS:
- All Raft communication is authenticated via per-node mTLS certificates signed by the Cluster CA (I-Auth1)
- The transport runs on the data fabric (not the management network)
- Connection pooling and keepalive are managed by the transport layer
The Raft transport address is configured via KISEKI_RAFT_ADDR.
Dynamic membership changes
Raft membership changes follow the standard joint-consensus protocol:
- Add voter: new node starts as learner, catches up to committed index, then promoted to voter
- Remove voter: validated that removal does not break quorum (safety check via can_remove_safely)
- Shard migration: target node must fully catch up (learner state matches leader’s committed index) before old voter is removed (I-SF3)
Membership changes are validated by validate_membership_change in
kiseki-raft, which checks quorum preservation and prevents unsafe
removal.
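The quorum-preservation check behind can_remove_safely can be sketched as: removal is allowed only if the voters that remain reachable after the change can still form a majority of the shrunken group. The signature below is an illustrative reconstruction, not the actual kiseki-raft API:

```rust
use std::collections::HashSet;

/// Removal is safe only if the remaining healthy voters can still form
/// a majority of the post-removal voter set.
fn can_remove_safely(voters: &HashSet<u64>, unreachable: &HashSet<u64>, to_remove: u64) -> bool {
    if !voters.contains(&to_remove) {
        return false; // not a voter in this group
    }
    let remaining = voters.len() - 1;
    if remaining == 0 {
        return false; // never remove the last voter
    }
    let quorum = remaining / 2 + 1;
    // Voters that would still be alive and in the group after removal.
    let healthy = voters
        .iter()
        .filter(|n| **n != to_remove && !unreachable.contains(n))
        .count();
    healthy >= quorum
}

fn main() {
    let voters: HashSet<u64> = [1, 2, 3].into();
    let none: HashSet<u64> = HashSet::new();
    // 3 -> 2 voters, both healthy: a quorum of 2 is still reachable.
    assert!(can_remove_safely(&voters, &none, 3));
    // If node 2 is already down, removing node 3 leaves only node 1
    // healthy in a 2-voter group: quorum would be lost.
    let down: HashSet<u64> = [2].into();
    assert!(!can_remove_safely(&voters, &down, 3));
}
```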
Shard lifecycle
| Event | Description |
|---|---|
| Create | New shard created when a namespace is created |
| Split | Mandatory split when shard exceeds ceiling (I-L6): delta count, byte size, or throughput |
| Maintenance | Shard set to read-only; writes rejected with retriable error (I-O6) |
| Compaction | Header-only merge; tenant-encrypted payloads carried opaquely (I-O2) |
| GC | Delta garbage collection after all consumers advance past the delta (I-L4) |
Shard splits do not block writes to the existing shard during the split operation (I-O1).
Consistency guarantees
| Scope | Guarantee | Mechanism |
|---|---|---|
| Intra-shard | Total order | Raft sequence numbers |
| Cross-shard | Causal ordering | HLC (Hybrid Logical Clock) |
| Cross-site | Eventual consistency | Async replication via federation |
| Writes | CP (no split-brain) | Raft majority commit (I-CS1) |
| Reads | Bounded staleness | Per view descriptor, subject to compliance floor (I-CS2) |
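The HLC row deserves a worked sketch: a Hybrid Logical Clock combines a physical component with a logical counter so causally related events compare in order even when wall clocks are skewed across shards. This is the textbook HLC algorithm, not Kiseki's exact implementation:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical: u64, // e.g. milliseconds since epoch
    logical: u32,  // tie-breaker within one physical tick
}

struct Clock {
    last: Hlc,
}

impl Clock {
    fn new() -> Self {
        Clock { last: Hlc { physical: 0, logical: 0 } }
    }

    /// Local event or send: advance past both wall clock and last issued.
    fn tick(&mut self, wall: u64) -> Hlc {
        if wall > self.last.physical {
            self.last = Hlc { physical: wall, logical: 0 };
        } else {
            self.last.logical += 1;
        }
        self.last
    }

    /// Receive: merge the remote timestamp to preserve causality.
    fn observe(&mut self, wall: u64, remote: Hlc) -> Hlc {
        let max_phys = wall.max(self.last.physical).max(remote.physical);
        let logical = if max_phys == self.last.physical && max_phys == remote.physical {
            self.last.logical.max(remote.logical) + 1
        } else if max_phys == self.last.physical {
            self.last.logical + 1
        } else if max_phys == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.last = Hlc { physical: max_phys, logical };
        self.last
    }
}

fn main() {
    let mut a = Clock::new();
    let mut b = Clock::new();
    // Shard A stamps a write at wall time 100, then shard B (whose wall
    // clock lags at 90) observes it. B's timestamp still orders after A's.
    let ta = a.tick(100);
    let tb = b.observe(90, ta);
    assert!(tb > ta);
}
```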
Transport Layer
The kiseki-transport crate provides a pluggable transport abstraction
for bidirectional byte-stream connections. It ships with a TCP+TLS
reference implementation and feature-flagged support for HPC fabric
transports.
Transport trait
The Transport trait is the core abstraction:
pub trait Transport: Send + Sync + 'static {
    type Connection: Connection;

    async fn connect(&self, addr: SocketAddr) -> Result<Self::Connection>;
    async fn listen(&self, addr: SocketAddr) -> Result<Listener>;
}

pub trait Connection: AsyncRead + AsyncWrite + Send + Unpin + 'static {
    fn peer_identity(&self) -> Option<&PeerIdentity>;
}
All components (client, server, Raft) use this trait, enabling transport selection without code changes.
TCP+TLS (reference implementation)
The TcpTlsTransport is always available and serves as the universal
fallback:
- mTLS: Cluster CA validation with per-tenant certificates (I-Auth1, I-K13)
- SPIFFE: SAN-based SVID validation for workload identity (I-Auth3)
- CRL: Optional certificate revocation list support via KISEKI_CRL_PATH
- Connection pooling: Configurable pool size per peer
- Keepalive: TCP keepalive for connection health
- Timeouts: Configurable connect, read, and write timeouts
Configuration: TlsConfig with CA cert, node cert, node key, and
optional CRL path.
RDMA verbs (feature: verbs)
Native InfiniBand and RoCEv2 support for low-latency HPC fabrics:
- InfiniBand: Direct RDMA over InfiniBand fabric (VerbsIb)
- RoCEv2: RDMA over Converged Ethernet (VerbsRoce)
- Device selection: Auto-detects the first available IB device, or uses the device named in KISEKI_IB_DEVICE
- Zero-copy: RDMA read/write for chunk data transfer
The verbs module uses unsafe code for FFI calls to libibverbs.
Each unsafe block has a per-block SAFETY comment.
CXI/libfabric (feature: cxi)
HPE Slingshot fabric support via libfabric:
- CXI provider: Lowest-latency transport on Slingshot-equipped systems
- libfabric: Uses the libfabric API (fi_* calls) for fabric operations
- Feature-flagged: Only compiled when the cxi feature is enabled
The CXI module uses unsafe code for FFI calls to libfabric.
FabricSelector
The FabricSelector provides priority-based transport selection with
automatic failover:
Priority 0: CXI (Slingshot, lowest latency)
Priority 1: VerbsIb (InfiniBand)
Priority 2: VerbsRoce (RoCEv2)
Priority 3: TcpTls (always available, universal fallback)
At boot, the selector probes for available transports (hardware presence check). On connection, it selects the highest-priority available transport. On failure, it falls back to the next-best transport.
The TransportHealthTracker monitors transport health and marks transports
as unhealthy after repeated failures, temporarily removing them from
selection until they recover.
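The selection logic described above amounts to "highest priority, skipping unhealthy." A minimal sketch, with the transport names from this section but an illustrative (assumed) health-tracking structure:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TransportKind {
    Cxi,       // priority 0: Slingshot, lowest latency
    VerbsIb,   // priority 1: InfiniBand
    VerbsRoce, // priority 2: RoCEv2
    TcpTls,    // priority 3: universal fallback
}

struct FabricSelector {
    /// Probed at boot, kept in priority order (hardware presence check).
    available: Vec<TransportKind>,
    /// Transports temporarily removed after repeated failures.
    unhealthy: Vec<TransportKind>,
}

impl FabricSelector {
    /// Pick the highest-priority transport that is currently healthy.
    fn select(&self) -> Option<TransportKind> {
        self.available
            .iter()
            .copied()
            .find(|t| !self.unhealthy.contains(t))
    }

    fn mark_unhealthy(&mut self, t: TransportKind) {
        if !self.unhealthy.contains(&t) {
            self.unhealthy.push(t);
        }
    }
}

fn main() {
    // A Slingshot node with TCP fallback.
    let mut sel = FabricSelector {
        available: vec![TransportKind::Cxi, TransportKind::TcpTls],
        unhealthy: vec![],
    };
    assert_eq!(sel.select(), Some(TransportKind::Cxi));

    // After repeated CXI failures, the selector falls back to TCP+TLS.
    sel.mark_unhealthy(TransportKind::Cxi);
    assert_eq!(sel.select(), Some(TransportKind::TcpTls));
}
```

Because TcpTls is always probed as available, `select` only returns `None` if every transport, including the fallback, has been marked unhealthy.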
GPU-direct (planned)
Future support for direct GPU memory access:
- NVIDIA cuFile (feature: gpu-cuda): GPUDirect Storage for direct NVMe-to-GPU data transfer
- AMD ROCm (feature: gpu-rocm): ROCm-based GPU direct access
These features bypass CPU memory for chunk data, reducing latency for AI training workloads.
NUMA-aware thread pinning
The NumaTopology module provides NUMA-aware thread pinning for optimal
memory locality:
- Auto-detects the NUMA topology on Linux; threads are pinned via sched_setaffinity
- Pins I/O threads to the NUMA node closest to the network device
- Reduces cross-NUMA memory access latency for high-throughput workloads
Metrics and health
The transport layer exports Prometheus metrics via TransportMetrics:
- Connection count per transport type
- Bytes sent/received per transport
- Connection errors and failover events
- Latency histograms per transport
Health tracking (TransportHealthTracker) provides per-transport health
status for the selector’s failover decisions.
Invariant mapping
| Invariant | How the transport layer enforces it |
|---|---|
| I-K2 | All data on the wire is TLS-encrypted (or pre-encrypted chunks over CXI) |
| I-K13 | mTLS with Cluster CA validation on every data-fabric connection |
| I-Auth1 | Client certificate required on data fabric |
| I-Auth3 | SPIFFE SVID validation via SAN matching |
Client-Side Cache (ADR-031)
The native client (kiseki-client) includes a two-tier read-only cache
of decrypted plaintext chunks. The cache is a performance feature, not a
correctness mechanism – it is ephemeral and wiped on process restart or
extended disconnect.
Architecture
┌────────────────────────────────────────────┐
│ kiseki-client process │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ L1: In-memory cache │ │
│ │ Zeroizing<Vec<u8>> entries │ │
│ │ Content-addressed by ChunkId │ │
│ └──────────────┬───────────────────────┘ │
│ │ miss │
│ ┌──────────────▼───────────────────────┐ │
│ │ L2: Local NVMe cache pool │ │
│ │ CRC32 integrity per entry │ │
│ │ Per-process, per-tenant isolation │ │
│ └──────────────┬───────────────────────┘ │
│ │ miss │
│ ▼ │
│ Fetch from canonical │
│ (verify by ChunkId SHA-256) │
└────────────────────────────────────────────┘
L1 (in-memory): Fast access to recently-used chunks. Entries use
Zeroizing<Vec<u8>> so plaintext is overwritten with zeros on eviction
or deallocation (I-CC2).
L2 (local NVMe): Larger cache on local storage. Each entry has a CRC32 checksum trailer computed at insert time (I-CC13). On read, the CRC32 is verified before serving; mismatch triggers bypass to canonical and entry deletion (I-CC7).
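The L2 entry layout and verify-on-read path can be sketched as follows. The little-endian 4-byte trailer encoding is an assumption; the bitwise CRC-32 (IEEE polynomial) below is the standard algorithm:

```rust
/// Standard CRC-32 (IEEE polynomial), bitwise form.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Insert path: append the checksum trailer computed at insert time.
fn encode_entry(chunk: &[u8]) -> Vec<u8> {
    let mut entry = chunk.to_vec();
    entry.extend_from_slice(&crc32(chunk).to_le_bytes());
    entry
}

/// Read path: verify before serving. None means "bypass to canonical
/// and delete this entry" (I-CC7).
fn decode_entry(entry: &[u8]) -> Option<&[u8]> {
    if entry.len() < 4 {
        return None;
    }
    let (chunk, trailer) = entry.split_at(entry.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().unwrap());
    (crc32(chunk) == stored).then_some(chunk)
}

fn main() {
    let entry = encode_entry(b"chunk-plaintext");
    assert_eq!(decode_entry(&entry), Some(&b"chunk-plaintext"[..]));

    // A single flipped bit is caught and the entry is rejected.
    let mut corrupt = entry.clone();
    corrupt[3] ^= 0x01;
    assert_eq!(decode_entry(&corrupt), None);
}
```

Note that CRC32 detects accidental corruption only; end-to-end integrity against substitution still comes from verifying the ChunkId SHA-256 on fetch from canonical.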
Cache modes
Three modes are available per client session (selected at session establishment):
| Mode | Behavior | Use case |
|---|---|---|
| Pinned | Staging-driven, eviction-resistant; for declared datasets | HPC pre-staging (Slurm prolog) |
| Organic | LRU with usage-weighted retention | Mixed workloads (default) |
| Bypass | No caching | Streaming, checkpoint workloads |
Mode is per session, not per file. The admin controls which modes are available for each workload.
Staging API
Staging pre-fetches a dataset’s chunks into the L2 cache with pinned retention:
kiseki-client stage --dataset /path/to/data
- Takes a namespace path and recursively enumerates compositions
- Fetches and verifies all chunks from canonical (SHA-256 match)
- Stores chunks in L2 with pinned retention
- Produces a manifest file listing staged compositions and chunk IDs
Staging is idempotent and resumable. Limits: max_staging_depth (10),
max_staging_files (100,000).
Pool handoff
The staging daemon and workload process can be different processes (e.g., Slurm prolog stages, then the workload runs):
- Staging daemon holds the L2 pool via flock on pool.lock
- Workload process adopts the pool via the KISEKI_CACHE_POOL_ID env var
- Workload takes over the flock
Each cache pool is identified by a 128-bit CSPRNG pool_id, isolated
per process and per tenant.
Freshness and staleness
Metadata TTL (default 5s): File-to-chunk-list mappings are cached with a configurable TTL. Within the TTL, cached metadata is authoritative and may serve data for files that have since been modified (I-CC3, I-CC5).
Chunk data: No TTL needed. Chunks are immutable (I-C1), so a verified chunk remains correct indefinitely absent crypto-shred.
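A minimal sketch of the TTL check, with illustrative field names: within the TTL the cached mapping is served as authoritative; after it expires the client must re-resolve against canonical.

```rust
use std::time::{Duration, Instant};

struct CachedMapping {
    chunk_ids: Vec<[u8; 32]>, // SHA-256 chunk IDs for the file
    fetched_at: Instant,
}

struct MetadataCache {
    ttl: Duration, // default 5s (I-CC3)
}

impl MetadataCache {
    /// Within the TTL the cached mapping is authoritative; the TTL is
    /// therefore the upper bound on read staleness (I-CC5).
    fn lookup<'a>(&self, entry: &'a CachedMapping, now: Instant) -> Option<&'a [[u8; 32]]> {
        (now.duration_since(entry.fetched_at) < self.ttl).then(|| entry.chunk_ids.as_slice())
    }
}

fn main() {
    let cache = MetadataCache { ttl: Duration::from_secs(5) };
    let t0 = Instant::now();
    let entry = CachedMapping { chunk_ids: vec![[0u8; 32]], fetched_at: t0 };

    // Fresh: served from cache.
    assert!(cache.lookup(&entry, t0 + Duration::from_secs(3)).is_some());
    // Expired: caller must re-fetch the mapping from canonical.
    assert!(cache.lookup(&entry, t0 + Duration::from_secs(6)).is_none());
}
```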
Crypto-shred detection
On crypto-shred (tenant KEK destruction), all cached plaintext must be wiped (I-CC12):
Detection mechanisms (in priority order):
- Advisory channel notification (if active)
- KMS error on next operation
- Periodic key health check (default every 30s)
Response: Immediate wipe of L1 and L2 with zeroize.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds).
Disconnect handling
If the client loses connectivity to all canonical endpoints for longer
than max_disconnect_seconds (default 300s), the entire cache (L1 + L2)
is wiped (I-CC6).
A background heartbeat RPC (every 60s) maintains the last_successful_rpc
timestamp for disconnect detection.
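The wipe decision reduces to a single comparison against the heartbeat-maintained timestamp. A sketch with illustrative names:

```rust
use std::time::{Duration, Instant};

/// Wipe L1 + L2 once the gap since the last successful RPC exceeds
/// max_disconnect_seconds (I-CC6).
fn must_wipe(last_successful_rpc: Instant, now: Instant, max_disconnect: Duration) -> bool {
    now.duration_since(last_successful_rpc) > max_disconnect
}

fn main() {
    let max_disconnect = Duration::from_secs(300); // default 300s
    let last = Instant::now();

    // 2 minutes offline: keep serving verified chunks from cache.
    assert!(!must_wipe(last, last + Duration::from_secs(120), max_disconnect));
    // 6 minutes offline: wipe the entire cache.
    assert!(must_wipe(last, last + Duration::from_secs(360), max_disconnect));
}
```

With the 60s heartbeat, the gap grows in roughly one-minute steps, so the wipe fires within one heartbeat interval of crossing the threshold.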
Error handling
Any local cache error bypasses to canonical unconditionally (I-CC7):
- L2 I/O failure: bypass and flag pool for scrub
- CRC32 mismatch: bypass, delete corrupt entry
- Metadata lookup failure: bypass to canonical
Invariants
| ID | Rule |
|---|---|
| I-CC1 | A cached chunk is served only if content-address verified and no crypto-shred detected |
| I-CC2 | Cached plaintext is zeroized before deallocation, eviction, or cache wipe |
| I-CC3 | File-to-chunk metadata served from cache only within TTL (default 5s) |
| I-CC5 | Metadata TTL is the upper bound on read staleness |
| I-CC6 | Disconnect beyond threshold triggers full cache wipe |
| I-CC7 | Any cache error bypasses to canonical unconditionally |
| I-CC8 | Cache is ephemeral; wiped on process start (or adopted via pool handoff) |
| I-CC9 | Unreachable cache policy falls back to conservative defaults |
| I-CC10 | Cache policy changes apply to new sessions only |
| I-CC11 | Staged chunks are a point-in-time snapshot; re-stage to pick up updates |
| I-CC12 | Crypto-shred triggers immediate cache wipe with zeroize |
| I-CC13 | L2 entries protected by CRC32 checksum trailer |
Environment variables
| Variable | Default | Description |
|---|---|---|
| KISEKI_CACHE_MODE | organic | Cache mode: organic, pinned, or bypass |
| KISEKI_CACHE_DIR | /tmp/kiseki-cache | L2 pool directory on local NVMe |
| KISEKI_CACHE_L2_MAX | 50 GB | Maximum L2 cache size in bytes |
| KISEKI_CACHE_POOL_ID | (generated) | Adopt an existing pool (for staging handoff) |
Security Model
Kiseki is designed with security as a foundational constraint, not a bolted-on feature. The system enforces strong tenant isolation, mandatory encryption, and a zero-trust boundary between infrastructure operators and tenants.
Zero-trust boundary
Kiseki enforces a strict separation between two administrative domains:
Cluster admin (infrastructure operator)
- Manages nodes, global policy, system keys, pools, devices.
- Cannot access tenant config, logs, or data without explicit tenant admin approval (I-T4).
- Sees operational metrics in tenant-anonymous or aggregated form.
- Modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c).
Tenant admin (data owner)
- Controls tenant keys, projects, workload authorization, compliance tags, user access.
- Grants or denies cluster admin access requests.
- Receives tenant-scoped audit exports sufficient for independent compliance demonstration.
- Can crypto-shred to render all tenant data unreadable.
Access request flow
When a cluster admin needs access to tenant resources (for debugging, migration, etc.):
- Cluster admin submits an access request via the control plane.
- The request is recorded in the audit log.
- Tenant admin reviews and approves or denies.
- If approved, access is time-bounded and scoped.
- All access is audit-logged to the tenant’s shard.
Encryption at rest
Every chunk stored on disk is encrypted. There are no exceptions.
- Algorithm: AES-256-GCM (authenticated encryption with associated data).
- Key derivation: HKDF-SHA256 derives per-chunk DEKs from a system master key and the chunk ID (ADR-003).
- Envelope: Each chunk carries an envelope containing ciphertext, system-layer wrapping metadata, tenant-layer wrapping metadata, and authenticated metadata (chunk ID, algorithm identifiers, key epoch).
What is encrypted
| Data | Encryption | Location |
|---|---|---|
| Chunk data on disk | System DEK (AES-256-GCM) | Data devices |
| Inline small-file content | System DEK | small/objects.redb |
| Delta payloads (filenames, attributes) | System DEK, wrapped with tenant KEK | Raft log / redb |
| Delta headers (sequence, shard, operation type, timestamp) | Cleartext or system-encrypted | Raft log / redb |
| Backup data | System-encrypted | External backup target |
| Federation replication | Ciphertext-only | Replication stream |
What is NOT encrypted
- Delta headers: Compaction operates on headers only (I-O2). Headers contain no tenant-attributable content.
- Prometheus metrics: Aggregated counters and histograms. No tenant-attributable data in metric labels.
- Health/liveness probes: bare 200 OK responses.
Encryption in transit
All data-fabric communication uses mTLS. No plaintext data crosses the network.
- Data path: mTLS with per-tenant certificates signed by the Cluster CA (I-K2).
- Raft consensus: mTLS between Raft peers.
- Key manager: mTLS between storage nodes and the key manager.
- Client to gateway: TLS (clients send plaintext over TLS; the gateway encrypts before writing).
- Native client: Client-side encryption (plaintext never leaves the workload process).
Protocol gateway encryption
Protocol gateway clients (NFS, S3) send plaintext over TLS to the gateway. The gateway performs tenant-layer encryption before writing to the storage layer. This means plaintext exists in gateway process memory but never on the wire in cleartext and never at rest.
Native client encryption
Native clients (FUSE, FFI, Python) perform tenant-layer encryption themselves. Plaintext never leaves the workload process and never traverses the data fabric.
FIPS 140-2/3 compliance
Kiseki uses aws-lc-rs as its cryptographic backend, which provides
a FIPS 140-2/3 validated implementation of:
- AES-256-GCM (authenticated encryption)
- HKDF-SHA256 (key derivation)
- SHA-256 (content-addressed chunk IDs)
- HMAC-SHA256 (per-tenant chunk IDs for opted-out tenants)
The FIPS feature is controlled by the kiseki-crypto/fips feature
flag at compile time.
Crypto-agility
Envelope metadata carries algorithm identifiers for crypto-agility. If a new algorithm is needed (e.g., post-quantum), envelopes can carry the new algorithm identifier alongside the existing one during a transition period.
No plaintext past gateway boundary (I-K1, I-K2)
This is the fundamental security invariant. Kiseki guarantees:
- No plaintext chunk is ever persisted to storage (I-K1).
- No plaintext payload is ever sent on the wire between any components (I-K2).
- The system can enforce access to ciphertext without being able to read plaintext without tenant key material (I-K4).
Where plaintext exists
Plaintext exists only in:
- Client process memory: For native clients that perform client-side encryption.
- Gateway process memory: Transiently, while the gateway encrypts protocol-path data.
- Stream processor memory: Stream processors cache tenant key material and are in the tenant trust domain (I-O3).
- Client cache (L1): In-memory cache of decrypted chunks (zeroized on eviction or deallocation, I-CC2).
- Client cache (L2): On-disk cache of decrypted chunks on local NVMe (zeroized before unlink, I-CC2).
Content-addressed chunk IDs
Chunk identity is derived from content, serving both dedup and integrity:
- Default: chunk_id = SHA-256(plaintext). Enables cross-tenant dedup.
- Opted-out tenants: chunk_id = HMAC-SHA256(plaintext, tenant_key). Cross-tenant dedup is impossible. Zero co-occurrence leak (I-K10).
Tenants that opt out of cross-tenant dedup pay a storage overhead (identical data stored separately per tenant) but gain the guarantee that no metadata (chunk IDs, refcounts) leaks information about data similarity across tenants.
Audit trail
All security-relevant events are recorded in an append-only, immutable audit log with the same durability guarantees as the data log (I-A1).
Audit events include:
- Data access (read/write by tenant, workload, client)
- Key lifecycle (rotation, crypto-shred, KMS health)
- Admin actions (pool changes, device management, tuning parameters)
- Policy changes (quotas, compliance tags, advisory policy)
- Authentication events (mTLS success/failure, cert revocation)
Audit scoping
- Tenant audit export: Filtered to the tenant’s own events plus relevant system events. Delivered on the tenant’s VLAN (I-A2). Sufficient for independent compliance demonstration (HIPAA Section 164.312 audit controls).
- Cluster admin audit view: System-level events only. Tenant-anonymous or aggregated (I-A3).
Runtime integrity
An optional runtime integrity monitor detects attempts to access Kiseki process memory (I-O7):
- ptrace detection
- /proc/pid/mem access monitoring
- Debugger attachment detection
- Core dump attempt detection
On detection, the monitor alerts both cluster admin and tenant admin. Optional auto-rotation of keys can be configured as a response.
STRIDE Threat Analysis
Systematic analysis of Kiseki’s attack surfaces using the STRIDE framework.
Spoofing (identity)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Rogue node joins cluster | Raft peer handshake | mTLS with Cluster CA — only certs signed by the cluster CA are accepted. Raft RPC server rejects plaintext when TLS is configured. | I-Auth1, I-K13 |
| Client impersonates tenant | Data fabric connection | mTLS required. OrgId extracted from cert OU or SPIFFE SAN. Fallback: UUID v5 from cert fingerprint (no anonymous access). | I-Auth1, I-Auth3 |
| Forged S3 request | S3 gateway | SigV4 signature validation with HMAC-SHA256 (constant-time comparison). x-amz-date required, host must be signed. | SigV4 auth |
| Forged JWT token | OIDC second-stage | alg=none rejected unconditionally. HS256 verified via HMAC. RS256/ES256 verified via JWKS with key ID matching. | I-Auth2 |
| NFS UID spoofing | NFS gateway | AUTH_SYS trusts client-asserted UID (known limitation). Mitigated by: network segmentation, Kerberos for production, per-export allowed method list. | NFS auth |
| Replay of captured request | S3 gateway | Timestamp validation (TODO: ±15min window). Captured Raft RPCs are harmless (Raft rejects stale term/log index). | SigV4 |
Residual risk: NFS AUTH_SYS is inherently spoofable. Production deployments MUST use Kerberos or restrict NFS to trusted networks.
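The "constant-time comparison" in the SigV4 row means the comparison time must not depend on where the first mismatching byte occurs. A hedged sketch of the technique; production code would use a vetted primitive (e.g. from aws-lc-rs) rather than hand-rolling:

```rust
/// Compare two byte strings without early exit: XOR-accumulate every
/// byte so timing does not reveal the position of the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are public for fixed-size signatures
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}

fn main() {
    // e.g. comparing a computed HMAC-SHA256 signature (hex) against the
    // one presented in the Authorization header.
    assert!(constant_time_eq(b"deadbeef", b"deadbeef"));
    assert!(!constant_time_eq(b"deadbeef", b"deadbeee"));
    assert!(!constant_time_eq(b"deadbeef", b"dead")); // length mismatch
}
```

A naive `==` on byte slices may short-circuit at the first difference, which lets an attacker recover a valid signature byte-by-byte by measuring response latency.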
Tampering (data integrity)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Modify chunk on disk | Block device | CRC32C on every extent read. Mismatch → EC repair from parity. Periodic scrub with configurable sample rate. | I-C7, I-C8 |
| Modify chunk in transit | Fabric | TLS 1.3 (authenticated encryption). RDMA paths use pre-encrypted chunks. | I-K2, I-Auth1 |
| Modify Raft log entry | Raft replication | Raft consensus — committed entries are immutable (I-L3). Log entries validated by majority before commit. WAL journal for crash-safe bitmap. | I-L2, I-L3 |
| Tamper with envelope | Crypto layer | AES-256-GCM authenticated encryption. Tampered ciphertext, auth tag, or nonce → decryption failure. AAD binding to chunk_id prevents envelope splicing (I-K17). | I-K7, I-K17 |
| Modify L2 cache file | Client NVMe | CRC32 trailer on every L2 read. Mismatch → bypass to canonical + delete corrupt entry. | I-CC7, I-CC13 |
| Corrupt staging manifest | Client cache | Invalid JSON silently skipped during manifest load. No data served from unverifiable source. | I-CC7 |
Repudiation (deniability)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Admin denies action | Control plane | All admin operations (maintenance, quota, compliance, key rotation) recorded in cluster audit shard with timestamp, identity, and parameters. | I-A1, I-A6 |
| Tenant denies access | Data path | All data access operations auditable. Tenant audit export provides filtered, coherent trail for compliance (HIPAA §164.312). | I-A2 |
| Advisory abuse denied | Workflow advisory | Advisory lifecycle events (declare, end, phase-advance, budget-exceeded) logged per-occurrence. High-volume events sampled with per-second-per-workflow counts. | I-WA8 |
| Device state change denied | Storage | Device state transitions (Healthy→Degraded→Evacuating→Failed→Removed) recorded with timestamp, reason, admin identity. | I-D2 |
| Crypto-shred denied | Key management | Shred event logged in tenant audit shard. Key health check provides detection confirmation. Cache wipe events counted. | I-K5, I-CC12 |
Information disclosure (confidentiality)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Plaintext leak on wire | All RPCs | TLS mandatory on all data fabric connections. No plaintext payloads transmitted. | I-K1, I-K2 |
| Plaintext on disk (server) | Chunk storage | All chunks encrypted at rest with system DEK (AES-256-GCM). No plaintext persisted on storage nodes. Compaction operates on headers only — never decrypts payloads. | I-K1, I-O2 |
| Plaintext on disk (client) | L2 cache | Cached plaintext on compute-node NVMe (same trust domain as process memory). File permissions 0600. Zeroize on eviction/wipe. Crash scrubber for orphaned pools. FTL residual risk documented. | I-CC2, I-CC8 |
| Cross-tenant data leak | Multi-tenant | Full tenant isolation (I-T1). Per-tenant encryption keys. Cluster admin cannot access tenant data without approval (I-T4). HMAC-keyed chunk IDs for dedup-opted-out tenants prevent co-occurrence analysis. | I-T1, I-T3, I-K10 |
| Telemetry leaks tenant info | Advisory | Telemetry scoped to caller’s authorization. k-anonymity (k≥5) over neighbour workloads. Response shape unchanged under low-k conditions. Timing and size bucketed to prevent covert channels. | I-WA5, I-WA6, I-WA15 |
| Error messages leak state | All APIs | AuthError returns generic failures. KmsError uses enum variants not freeform strings. Advisory requests for unauthorized targets return same shape as absent targets. | I-WA6 |
| Core dump exposes keys | Server/client | Key material wrapped in Zeroizing<Vec<u8>>. Runtime integrity monitor detects debugger/ptrace. | I-K8, I-O7 |
| Log messages leak data | Structured logging | Structured tracing with typed fields. No plaintext in log events. Tenant-scoped identifiers hashed in cluster-admin views. | I-A3, I-K8 |
Denial of service (availability)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Raft leader flooding | Raft consensus | MAX_RAFT_RPC_SIZE (128MB) rejects oversized messages. Per-shard throughput guard (I-SF7) limits inline write rate. | ADV-S1, I-SF7 |
| Advisory hint flooding | Workflow advisory | Per-workload hint budget (hints/sec, concurrent workflows). Budget exceeded → local degradation only. Advisory isolated from data path (I-WA2). | I-WA7, I-WA16, I-WA17 |
| Connection pool exhaustion | Transport | max_per_endpoint connection cap. Circuit breaker trips after threshold failures. FabricSelector falls back to TCP. | Transport health |
| Disk exhaustion (metadata) | System NVMe | ADR-030 dynamic inline threshold. Soft limit → threshold reduction. Hard limit → threshold floor + alert via out-of-band gRPC. | I-SF1, I-SF2 |
| Disk exhaustion (data) | Device pools | Per-pool capacity thresholds (Warning/Critical/Full). Writes rejected at Critical. Pool rebalancing. | I-C5 |
| Cache exhaustion (client) | Client NVMe | Per-process max_cache_bytes. Per-node max_node_cache_bytes (80% of filesystem). Disk-pressure backstop at 90%. | ADR-031 §8 |
| Audit log backpressure | Audit | Safety valve: if audit export stalls >24h, data GC proceeds with documented gap. Per-tenant configurable backpressure mode. | I-A5 |
| Shard split storm | Log | Exponential backoff per shard (2h floor, 24h cap). Cluster-wide concurrent migrations bounded by max(1, num_nodes/10). | I-SF4 |
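The shard-split-storm row combines two limits: per-shard exponential backoff (2h floor, 24h cap) and a cluster-wide migration bound of max(1, num_nodes/10). A sketch of both, assuming doubling per attempt (the doubling base is an assumption; the floor and cap come from the table above):

```rust
/// Per-shard retry delay in seconds: doubles each attempt, clamped to
/// the 2h floor and 24h cap (I-SF4).
fn split_backoff_secs(attempt: u32) -> u64 {
    const FLOOR: u64 = 2 * 3600; // 2h
    const CAP: u64 = 24 * 3600;  // 24h
    let delay = FLOOR.saturating_mul(1u64 << attempt.min(20));
    delay.clamp(FLOOR, CAP)
}

/// Cluster-wide bound on concurrent shard migrations.
fn max_concurrent_migrations(num_nodes: u64) -> u64 {
    (num_nodes / 10).max(1)
}

fn main() {
    assert_eq!(split_backoff_secs(0), 2 * 3600);   // first retry: floor
    assert_eq!(split_backoff_secs(1), 4 * 3600);
    assert_eq!(split_backoff_secs(3), 16 * 3600);
    assert_eq!(split_backoff_secs(4), 24 * 3600);  // 32h clamped to cap
    assert_eq!(split_backoff_secs(60), 24 * 3600); // stays at cap

    assert_eq!(max_concurrent_migrations(4), 1);   // small cluster: 1
    assert_eq!(max_concurrent_migrations(30), 3);
}
```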
Elevation of privilege (authorization)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Cluster admin accesses tenant data | Control plane | Zero-trust boundary. Access requires explicit tenant admin approval, time-bounded, scope-limited, audit-logged. | I-T4, I-T4c |
| Tenant escapes namespace | Data path | Namespace isolation per tenant. Cross-shard operations return EXDEV (I-L8). Compositions belong to exactly one tenant (I-X1). | I-T1, I-X1, I-L8 |
| Hint escalates priority | Advisory | Hints cannot extend capability. Cannot cause operation success that would otherwise be rejected. Cannot cross namespace/tenant boundary. Cannot bypass retention hold. | I-WA14 |
| Client escalates cache policy | Client cache | Client selections bounded by admin-set ceilings. Policy narrowing only (child ≤ parent). cache_enabled=false at any level → disabled for all children. | I-CC10, I-WA7 |
| KMS provider escalation | Key management | Provider abstraction opaque to callers (I-K16). No access-control decision depends on provider type. Provider migration requires 100% re-wrap before atomic switch. | I-K16, I-K20 |
| gRPC method escalation | Control plane | Per-method authorization. 9 admin-only methods gated by require_admin(). Unknown role → rejected. | gRPC authz |
Summary
| STRIDE Category | Threats identified | Mitigated | Residual risk |
|---|---|---|---|
| Spoofing | 6 | 5 | NFS AUTH_SYS UID spoofing (use Kerberos in prod) |
| Tampering | 6 | 6 | None — all paths have integrity verification |
| Repudiation | 5 | 5 | None — comprehensive audit trail |
| Information disclosure | 8 | 7 | Client L2 NVMe FTL residual (use OPAL/SED) |
| Denial of service | 8 | 8 | None — all paths have rate limiting/backpressure |
| Elevation of privilege | 6 | 6 | None — defense in depth at every boundary |
| Total | 39 | 37 | 2 documented residual risks |
Both residual risks have documented mitigations:
- NFS AUTH_SYS → deploy Kerberos or restrict to trusted networks
- NVMe FTL data remanence → deploy OPAL/SED with per-boot key rotation
Authentication
Kiseki uses a layered authentication model. The primary mechanism is mTLS with certificates signed by a Cluster CA. Optional second-stage authentication via tenant identity providers adds workload-level authorization.
mTLS with Cluster CA (I-Auth1)
The Cluster CA is the trust root for all data-fabric authentication. Every participant in the data fabric (storage nodes, gateways, clients, stream processors) presents a certificate signed by the Cluster CA.
Certificate hierarchy
Cluster CA (managed by cluster admin)
|
+-- Server certificates (per storage node)
| SAN: node hostname, IP address
| OU: kiseki-server
|
+-- Key manager certificates (per key server)
| SAN: keyserver hostname, IP address
| OU: kiseki-keyserver
|
+-- Admin certificates (cluster admin)
| OU: kiseki-admin
|
+-- Tenant certificates (per tenant)
SAN: tenant identifier
OU: tenant-{org_id}
Properties
- No real-time auth server on data path (I-Auth1). Certificates are local credentials. Authentication is a TLS handshake, not an RPC to a central authority. This eliminates a latency-sensitive dependency on the data path.
- Per-tenant certificates: Each tenant’s clients and gateways present certificates that identify the tenant. The storage layer validates the certificate chain and extracts the tenant identity.
- Certificate revocation: Supported via CRL (KISEKI_CRL_PATH). The CRL is reloaded periodically. Revoked certificates are rejected at the TLS handshake.
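Identity extraction from the certificate subject follows the OU conventions in the hierarchy above. A sketch that classifies a pre-extracted OU string; pulling the OU from a parsed X.509 subject is elided, and the enum is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Identity {
    Server,
    KeyServer,
    Admin,
    Tenant(String), // carries the org_id
}

/// Map an OU value to a data-fabric identity. kiseki-* OUs identify
/// infrastructure roles; "tenant-{org_id}" identifies a tenant.
fn identity_from_ou(ou: &str) -> Option<Identity> {
    match ou {
        "kiseki-server" => Some(Identity::Server),
        "kiseki-keyserver" => Some(Identity::KeyServer),
        "kiseki-admin" => Some(Identity::Admin),
        other => other
            .strip_prefix("tenant-")
            .filter(|org| !org.is_empty())
            .map(|org| Identity::Tenant(org.to_string())),
    }
}

fn main() {
    assert_eq!(
        identity_from_ou("tenant-acme-corp"),
        Some(Identity::Tenant("acme-corp".into()))
    );
    assert_eq!(identity_from_ou("kiseki-admin"), Some(Identity::Admin));
    // Unknown OUs map to no identity: the connection is rejected.
    assert_eq!(identity_from_ou("unrelated"), None);
}
```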
Configuration
# On storage nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem # optional
# On client nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/client.key
Certificate generation example
# Generate Cluster CA (do this once)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout ca.key -out ca.crt -days 3650 -nodes \
-subj "/CN=Kiseki Cluster CA"
# Generate server certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout server.key -out server.csr -nodes \
-subj "/CN=node1.example.com/OU=kiseki-server"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 365 \
-extfile <(echo "subjectAltName=DNS:node1.example.com,IP:10.0.0.1")
# Generate tenant client certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout tenant.key -out tenant.csr -nodes \
-subj "/CN=workload-1/OU=tenant-acme-corp"
openssl x509 -req -in tenant.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out tenant.crt -days 365
SPIFFE SVID (I-Auth3)
SPIFFE (Secure Production Identity Framework for Everyone) is available as an alternative to raw mTLS certificate management.
SPIFFE ID structure
spiffe://kiseki.example.com/tenant/{org_id}/workload/{workload_id}
spiffe://kiseki.example.com/tenant/{org_id}/project/{project_id}/workload/{workload_id}
The SPIFFE ID maps directly to the tenant hierarchy (organization/project/workload).
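A sketch of that mapping, accepting both path forms shown above. The trust domain check and error handling are simplified, and the struct is illustrative:

```rust
#[derive(Debug, PartialEq)]
struct WorkloadIdentity {
    org_id: String,
    project_id: Option<String>,
    workload_id: String,
}

/// Parse a SPIFFE ID into the tenant hierarchy, with or without the
/// optional project segment.
fn parse_spiffe_id(id: &str) -> Option<WorkloadIdentity> {
    let path = id.strip_prefix("spiffe://kiseki.example.com/")?;
    let parts: Vec<&str> = path.split('/').collect();
    match parts.as_slice() {
        ["tenant", org, "workload", wl] => Some(WorkloadIdentity {
            org_id: org.to_string(),
            project_id: None,
            workload_id: wl.to_string(),
        }),
        ["tenant", org, "project", proj, "workload", wl] => Some(WorkloadIdentity {
            org_id: org.to_string(),
            project_id: Some(proj.to_string()),
            workload_id: wl.to_string(),
        }),
        _ => None,
    }
}

fn main() {
    let id = parse_spiffe_id("spiffe://kiseki.example.com/tenant/acme/project/ml/workload/train-1")
        .expect("valid SVID path");
    assert_eq!(id.org_id, "acme");
    assert_eq!(id.project_id.as_deref(), Some("ml"));
    assert_eq!(id.workload_id, "train-1");
    // Wrong trust domain: rejected.
    assert!(parse_spiffe_id("spiffe://other.domain/tenant/x/workload/y").is_none());
}
```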
SPIRE integration
SPIRE (the SPIFFE Runtime Environment) handles certificate issuance and rotation automatically:
- SPIRE Server acts as the Cluster CA (or delegates to it).
- SPIRE Agent runs on each node (storage and compute).
- Workloads receive SVIDs via the Workload API.
- Certificates rotate automatically (no manual renewal).
Benefits over raw mTLS
- Automatic certificate rotation (no manual renewal ceremonies).
- Workload attestation (verify the workload binary, not just the certificate).
- Short-lived certificates reduce the window of compromise.
S3 SigV4 authentication
The S3 gateway supports AWS Signature Version 4 authentication for S3 API clients.
How it works
- The S3 client signs each request with an access key and secret key.
- The gateway validates the signature.
- The access key is mapped to a tenant identity via the control plane.
- Subsequent authorization is based on the tenant identity.
Configuration
Access keys are provisioned via the control plane:
kiseki-server s3-credentials create --tenant-id acme-corp --workload-id training-job-1
Compatibility
The SigV4 implementation supports standard S3 clients:
# AWS CLI
aws --endpoint-url http://node1:9000 s3 ls
# boto3
import boto3
s3 = boto3.client('s3', endpoint_url='http://node1:9000',
aws_access_key_id='...', aws_secret_access_key='...')
NFS authentication
The NFS gateway supports two authentication mechanisms:
Kerberos (recommended for production)
NFSv4.2 with Kerberos provides strong authentication:
- krb5: Authentication only.
- krb5i: Authentication + integrity.
- krb5p: Authentication + integrity + privacy (encrypted).
The Kerberos principal maps to a tenant identity.
AUTH_SYS (development only)
AUTH_SYS (traditional UNIX UID/GID authentication) is supported for development and testing. It provides no real security and should not be used in production. When AUTH_SYS is used, the NFS gateway maps the export path to a tenant identity.
OIDC/JWT second-stage authentication (I-Auth2)
Optional second-stage authentication validates workload identity against the tenant admin’s authorization. This provides an additional layer beyond the mTLS “belongs to this cluster” identity.
Architecture
Workload
|
v
mTLS (Cluster CA) --> "This workload belongs to tenant X"
|
v
OIDC/JWT (Tenant IdP) --> "This workload is authorized by tenant X's admin"
Integration
- Tenant admin configures their identity provider (Keycloak, Okta, Azure AD, etc.) in the control plane.
- Workloads obtain JWT tokens from the tenant IdP.
- On connection, the workload presents both:
- mTLS certificate (Cluster CA trust chain)
- JWT token (tenant IdP authorization)
- The storage node validates both independently.
Token validation
- JWT signature verification against the tenant IdP’s JWKS endpoint.
- Token expiry and audience validation.
- Claims mapping to tenant hierarchy (org, project, workload).
- No real-time IdP dependency on the data path: JWKS keys are cached and refreshed periodically.
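The "no real-time IdP dependency" property comes from caching the JWKS document and refreshing it on a timer. A minimal Python sketch of that cache (names like `fetch_jwks` and the 300-second TTL are illustrative assumptions; actual signature verification would use a JOSE library against the cached key):

```python
import time

class JwksCache:
    """Cache a tenant IdP's JWKS keys; refresh periodically instead of per-request.

    `fetch_jwks` stands in for an HTTPS GET of the IdP's JWKS endpoint and
    returns a list of JWK dicts, each carrying a "kid" (key ID)."""

    def __init__(self, fetch_jwks, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._fetch = fetch_jwks
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict = {}      # kid -> JWK
        self._fetched_at = None

    def key_for(self, kid: str):
        """Return the JWK for a token's kid, refreshing the cache if stale.
        Returns None for an unknown kid, in which case the token is rejected."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._keys = {k["kid"]: k for k in self._fetch()}
            self._fetched_at = now
        return self._keys.get(kid)
```

A transient IdP outage therefore degrades only key rotation, not in-flight token validation.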
gRPC role-based authorization
After authentication (mTLS + optional OIDC), gRPC services enforce role-based authorization:
Roles
| Role | Authentication | Access |
|---|---|---|
| Cluster admin | Admin certificate (OU: kiseki-admin) | StorageAdminService, ControlService (full) |
| SRE (read-only) | SRE certificate | StorageAdminService (read-only: List*, Get*, Status) |
| Tenant admin | Tenant certificate + OIDC (optional) | ControlService (tenant-scoped), AuditExportService |
| Workload | Tenant certificate + OIDC (optional) | Data-path services, WorkflowAdvisoryService |
Authorization enforcement
- StorageAdminService: Cluster admin only (mTLS cert with admin OU). SRE read-only role for monitoring.
- ControlService: Cluster admin for system operations, tenant admin for tenant-scoped operations.
- Data-path services (LogService, ChunkOps, CompositionOps, ViewOps): Any authenticated tenant workload, scoped to the tenant’s own data.
- WorkflowAdvisoryService: Any authenticated tenant workload. Per-operation authorization (I-WA3): every request re-validates the caller’s mTLS identity against the workflow’s owning workload.
Cluster admin isolation (I-T4)
The cluster admin certificate grants access to infrastructure management but explicitly does NOT grant access to:
- Tenant configuration
- Tenant audit logs
- Tenant data (read or write)
- Tenant key material
Access to tenant resources requires an explicit access request approved by the tenant admin.
Client identity
Client ID (native client)
Each native client process generates a stable identifier at startup:
- 128-bit CSPRNG value.
- Bound to the workload’s mTLS certificate at first use.
- Scoped within (org, project, workload).
- Never reused across processes (I-WA4).
The client ID ties an operation stream to a single process instance. It is not a user identity and not a session token.
Workflow reference
For advisory-enabled workloads, a workflow reference is attached to
data-path RPCs as a gRPC binary metadata entry
(x-kiseki-workflow-ref-bin). This is a 16-byte opaque handle,
generated with 128+ bits of entropy, never reused, and verified
against the caller’s mTLS identity on every request (I-WA3, I-WA10).
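The handle and metadata shapes above can be sketched in a few lines of Python (a sketch only; gRPC binary metadata keys must end in `-bin`, which is why the header is named that way):

```python
import secrets

METADATA_KEY = "x-kiseki-workflow-ref-bin"  # gRPC binary metadata: key must end in "-bin"

def new_workflow_ref() -> bytes:
    """Generate a 16-byte opaque workflow handle with full CSPRNG entropy."""
    return secrets.token_bytes(16)

def attach_workflow_ref(metadata: list, ref: bytes) -> list:
    """Append the workflow reference to an outgoing RPC's metadata entries."""
    assert len(ref) == 16, "workflow refs are exactly 16 bytes"
    return metadata + [(METADATA_KEY, ref)]
```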
Tenant Isolation
Tenant isolation is a foundational invariant of Kiseki. Tenants are fully isolated with no cross-tenant data access, no delegation tokens, and no cross-tenant key sharing (I-T1).
Isolation model
Kiseki implements hierarchical tenancy with strict isolation boundaries:
Organization (billing, admin, master key authority)
|
+-- Project (optional: resource grouping, key delegation)
| |
| +-- Workload (runtime isolation unit)
| +-- Workload
|
+-- Workload (directly under org, if no projects)
Isolation guarantees
| Property | Guarantee | Invariant |
|---|---|---|
| Data access | No cross-tenant data access | I-T1 |
| Key material | Per-tenant encryption keys, never shared | I-T3, I-K3 |
| Resource consumption | Bounded by quotas at org and workload levels | I-T2 |
| Audit visibility | Tenant sees only their own events | I-A2 |
| Metrics | Tenant-anonymous for cluster admin | ADR-015 |
| Admin access | Zero-trust: cluster admin cannot access tenant data without approval | I-T4 |
Per-tenant encryption keys
Each tenant has their own KEK (Key Encryption Key) managed by their chosen KMS backend (ADR-028). The tenant KEK wraps access to system DEK derivation parameters for that tenant’s data.
Key isolation
- System DEKs are derived per-chunk and are the same for identical chunks across tenants (enabling cross-tenant dedup by default).
- Tenant KEKs are unique per tenant. Even if two tenants store the same data, each tenant wraps access to the DEK derivation parameters independently.
- Tenant keys are not accessible to other tenants or to shared system processes (I-T3).
Key storage isolation
When using the internal KMS provider (default), tenant KEKs are stored in a separate Raft group from system master keys (I-K19). Compromise of one group does not expose the other.
When using external KMS providers (Vault, KMIP, AWS KMS, PKCS#11), tenant key material is managed entirely outside of Kiseki’s storage, under the tenant’s own operational control.
HMAC-keyed chunk IDs for opted-out tenants
By default, chunk IDs are derived from plaintext content:
chunk_id = SHA-256(plaintext). This enables cross-tenant
deduplication: identical data stored by different tenants produces the
same chunk ID and shares storage.
Tenants that require stronger isolation can opt out of cross-tenant dedup (I-X2, I-K10):
Default: chunk_id = SHA-256(plaintext)
Opted-out: chunk_id = HMAC-SHA256(plaintext, tenant_key)
What opt-out provides
- No cross-tenant dedup: Identical data from different tenants produces different chunk IDs. Each tenant’s data is stored independently.
- Zero co-occurrence leak: An observer cannot determine whether two tenants store the same data by comparing chunk IDs.
- Storage overhead: Duplicate data across tenants consumes additional storage.
When to opt out
Opt-out is recommended for tenants with:
- Regulatory requirements prohibiting any form of cross-tenant data correlation (even at the metadata level).
- High-sensitivity data where the existence of shared content is itself sensitive information.
- Compliance regimes (HIPAA, ITAR) where data co-location with other tenants must be minimized.
Audit log scoping
The audit log is append-only, immutable, and system-wide (I-A1). Audit visibility is strictly scoped:
Tenant audit export (I-A2)
Each tenant receives a filtered projection of the audit log:
- All events originating from the tenant’s own operations.
- Relevant system events sufficient for a coherent, complete audit trail (e.g., a cluster admin modifying a pool that contains the tenant’s data).
- Delivered on the tenant’s VLAN.
- Sufficient for independent compliance demonstration (e.g., HIPAA Section 164.312 audit controls).
The tenant admin consumes this export. No events from other tenants appear in the export.
Cluster admin audit view (I-A3)
The cluster admin sees:
- System-level events (node joins, pool changes, key rotations).
- Tenant-anonymous or aggregated metrics.
- No tenant-attributable content.
Cluster admin modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c), so the tenant can review.
Advisory audit scoping (I-WA8)
Workflow advisory events (declare, end, phase-advance, hint accept/reject, etc.) are written to the tenant’s audit shard.
- Semantic phase tags and workflow IDs are tenant-scoped.
- Cluster-admin views see opaque hashes only (consistent with I-A3).
- High-volume events (hint-accepted, hint-throttled) may be batched or sampled, but at least one event per unique (workflow_id, rejection_reason) tuple is written per second.
Cache isolation (ADR-031)
The client-side cache maintains strict per-tenant isolation:
L1 (in-memory) isolation
- The L1 cache operates within a single client process.
- A client process is authenticated as a specific tenant via mTLS.
- L1 entries are decrypted plaintext chunks, keyed by chunk ID.
- On process termination, L1 entries are zeroized (I-CC2).
L2 (on-disk) isolation
- Each client process creates its own L2 cache pool on local NVMe.
- Pool isolation is enforced by:
- Unique pool ID: 128-bit CSPRNG value per process.
- flock: Ownership proven by file lock on pool.lock.
- Per-process directory: No cross-process sharing.
- Concurrent same-tenant processes have independent pools. There is no cross-process cache sharing.
- Orphaned pools (no live flock holder) are scavenged on startup or by kiseki-cache-scrub.
- On eviction or cache wipe, L2 entries are overwritten with zeros before unlink (I-CC2).
Crypto-shred cache wipe (I-CC12)
When a crypto-shred event is detected for a tenant:
- All cached plaintext for that tenant is wiped from L1 and L2.
- L1 entries: Zeroizing&lt;Vec&lt;u8&gt;&gt; ensures memory-level erasure.
- L2 entries: File contents overwritten with zeros before unlink.
- Detection mechanisms:
- Periodic key health check (default 30 seconds).
- Advisory channel notification.
- KMS error on next operation.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds).
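The L2 wipe and the detection-latency bound can be sketched as follows (a sketch only; real code would also handle partial writes and I/O errors, and logical overwrite has the physical-media caveat noted below):

```python
import os

def wipe_l2_entry(path: str, block: int = 1 << 20) -> None:
    """Overwrite an L2 cache file with zeros and fsync before unlink, so no
    plaintext survives at the filesystem level (I-CC2)."""
    remaining = os.path.getsize(path)
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(block, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    os.unlink(path)

def max_detection_latency(key_health_interval: float, max_disconnect_seconds: float) -> float:
    """A crypto-shred is noticed by whichever detection channel fires first."""
    return min(key_health_interval, max_disconnect_seconds)
```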
Physical-level erasure note
Logical-level erasure (zeroize before deallocation) provides strong protection against software-level attacks. For protection against physical-level attacks on flash storage (e.g., reading NAND cells after logical deletion), hardware encryption (OPAL/SED) on the compute node’s local NVMe is required. This is outside Kiseki’s control but should be part of the compute node security policy.
Network isolation
Data fabric
All data-fabric traffic is mTLS-encrypted. Tenant identity is extracted from the client certificate and validated on every RPC.
Management network
The management network (control plane, admin API) is separate from the data fabric. Cluster admin access requires admin-OU certificates.
Tenant VLAN
Tenant audit exports are delivered on the tenant’s VLAN, providing network-level isolation of audit data.
Advisory isolation (I-WA1, I-WA2, I-WA5, I-WA6)
The workflow advisory subsystem enforces strict tenant isolation:
- No existence oracles (I-WA6): A client cannot determine the existence of resources it is not authorized to observe. Unauthorized and absent targets return identical responses (same error code, payload size, and latency distribution).
- No content oracles (I-WA11): Advisory fields never include cluster-internal identifiers (shard IDs, chunk IDs, node IDs, device IDs, rack labels).
- Telemetry scoping (I-WA5): Every telemetry value is computed over resources the caller is authorized to read. Aggregate metrics use k-anonymous bucketing (minimum k=5).
- Covert-channel hardening (I-WA15): Response timing and size do not vary with neighbor-workload state.
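The k-anonymous bucketing rule (I-WA5) can be illustrated with a small Python sketch (function and parameter names are illustrative; the point is that any bucket with fewer than k contributing members is suppressed, never reported with a small count):

```python
def k_anonymous_buckets(values_by_member: dict, bucket_of, k: int = 5) -> dict:
    """Aggregate per-member telemetry values into buckets and drop any bucket
    with fewer than k contributing members (minimum k=5)."""
    buckets: dict = {}
    for member, value in values_by_member.items():
        buckets.setdefault(bucket_of(value), set()).add(member)
    # Report only bucket -> member count for buckets meeting the k threshold.
    return {b: len(members) for b, members in buckets.items() if len(members) >= k}
```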
Pool handle isolation (I-WA19)
Affinity pools are referenced via opaque pool handles, not cluster-internal pool IDs:
- Handles are valid for one workflow’s lifetime only.
- Never reused across workflows.
- Never equal or leak the cluster-internal pool identity.
- Multiple tenants can see the same opaque label attached to different internal pools; correlation across tenants is impossible because handles differ.
Compliance support
Kiseki’s tenant isolation model supports the following compliance regimes:
| Regime | Relevant guarantees |
|---|---|
| HIPAA | Per-tenant encryption, audit export for Section 164.312, crypto-shred, bounded staleness (2s floor). |
| SOC 2 | Audit log immutability, access control separation, key management lifecycle. |
| GDPR | Crypto-shred as right-to-erasure mechanism, data isolation by design. |
| ITAR | HMAC-keyed chunk IDs (no cross-tenant correlation), dedicated tenant KMS. |
Compliance tags attach at any level of the tenant hierarchy (organization, project, workload) and inherit downward. Tags may impose additional constraints:
- Prohibit compression (HIPAA namespaces, I-K14).
- Set staleness floor (minimum 2 seconds for HIPAA).
- Require external KMS provider (no internal mode).
- Restrict pool placement (data residency).
Troubleshooting
This guide covers common issues, diagnostic tools, and resolution procedures for Kiseki clusters.
Diagnostic tools
Health endpoint
# Quick liveness check (returns "OK" or connection refused)
curl http://node1:9090/health
Event log
The event log captures categorized diagnostic events in memory. Query via the admin API:
# All events from the last 3 hours
curl http://node1:9090/ui/api/events
# Error events only
curl 'http://node1:9090/ui/api/events?severity=error'
# Critical events from the last 24 hours
curl 'http://node1:9090/ui/api/events?severity=critical&hours=24'
# Device-related events
curl 'http://node1:9090/ui/api/events?category=device'
# Raft events (elections, membership changes)
curl 'http://node1:9090/ui/api/events?category=raft'
Node status
# Per-node metrics and health
curl http://node1:9090/ui/api/nodes
# Cluster summary
curl http://node1:9090/ui/api/cluster
Structured logs
# Tail logs for errors (systemd)
journalctl -u kiseki-server -f --priority=err
# Search for specific errors in JSON logs
journalctl -u kiseki-server --output=json | jq 'select(.level == "ERROR")'
# Raft-specific logs
journalctl -u kiseki-server | grep kiseki_raft
Common issues
Connection refused on data-path port (9100)
Symptoms: Clients cannot connect. curl http://node:9090/health
returns OK but gRPC connections to port 9100 fail.
Diagnosis:
- Verify the port is listening: ss -tlnp | grep 9100
- Check firewall rules: iptables -L -n | grep 9100
- Check the server logs for bind errors: journalctl -u kiseki-server | grep "bind\|listen\|9100"
Common causes:
- Port conflict: Another process is using port 9100.
- Bind address: KISEKI_DATA_ADDR is set to 127.0.0.1:9100 instead of 0.0.0.0:9100.
- Firewall: Port 9100 is not open between nodes or to clients.
mTLS authentication failures
Symptoms: AuthenticationFailed errors in logs. Clients receive
gRPC UNAUTHENTICATED (16) status.
Diagnosis:
# Verify certificate validity
openssl x509 -in /etc/kiseki/tls/server.crt -noout -dates -subject -issuer
# Verify certificate chain
openssl verify -CAfile /etc/kiseki/tls/ca.crt /etc/kiseki/tls/server.crt
# Test TLS handshake
openssl s_client -connect node1:9100 \
-cert /etc/kiseki/tls/client.crt \
-key /etc/kiseki/tls/client.key \
-CAfile /etc/kiseki/tls/ca.crt
Common causes:
- Certificate expired: Renew the certificate.
- CA mismatch: Client and server certificates signed by different CAs.
- Missing SAN: Server certificate does not include the hostname or IP the client is connecting to.
- CRL revocation: Certificate revoked via KISEKI_CRL_PATH. Check the CRL: openssl crl -in /etc/kiseki/tls/crl.pem -text -noout
- Wrong OU: Tenant certificate has the wrong OU, or the admin certificate does not have the kiseki-admin OU.
Capacity full (ENOSPC)
Symptoms: Write operations return PoolFull errors. S3 PutObject
returns HTTP 507. NFS writes return EIO or ENOSPC.
Diagnosis:
# Check pool capacity
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity
# Check system disk usage
df -h /var/lib/kiseki
Resolution:
- Add devices to the pool to increase capacity.
- Rebalance to distribute data more evenly: kiseki-server pool rebalance --pool-id fast-nvme
- Evacuate devices from an over-full pool to a different pool (within the same device class).
- Delete data: Remove compositions/objects to free space. GC runs periodically (default every 300 seconds).
- Adjust thresholds if the defaults are too conservative for your deployment: kiseki-server pool set-thresholds --pool-id fast-nvme --warning-pct 80 --critical-pct 90
Metadata disk full (system partition)
Symptoms: Inline threshold drops to floor (128 bytes). Alert: “system disk metadata usage exceeds hard limit.” Raft may stall if the system disk is completely full.
Diagnosis:
# Check system partition usage
df -h /var/lib/kiseki
# Check individual redb sizes
du -sh /var/lib/kiseki/raft/log.redb
du -sh /var/lib/kiseki/chunks/meta.redb
du -sh /var/lib/kiseki/small/objects.redb
Resolution:
- The system automatically reduces the inline threshold to the floor (128 bytes) when the hard limit is exceeded (I-SF2).
- Trigger Raft log compaction to reduce raft/log.redb size: kiseki-server compact
- Run GC to clean up orphaned entries in small/objects.redb (I-SF6).
- Consider migrating shards to nodes with larger system disks.
- If the system partition is persistently undersized, upgrade to larger NVMe for the system RAID-1.
Raft diagnostics
Leader election issues
Symptoms: ShardUnavailable errors. Writes fail intermittently.
Diagnosis:
# Check shard health
kiseki-server shard health --shard-id shard-0001
# Check Raft events
curl 'http://node1:9090/ui/api/events?category=raft'
# Check election metrics
curl -s http://node1:9090/metrics | grep kiseki_raft
Common causes:
- Network partition: Raft peers cannot communicate. Check connectivity on port 9300 between all nodes.
- Clock skew: Large clock differences can cause election timeouts. Verify NTP synchronization. Nodes with Unsync clock quality are flagged (I-T6).
- Disk latency: HDD system disks cause 5-10 ms fsync latency per Raft commit. Use NVMe or SSD for the system partition.
Quorum loss
Symptoms: All writes fail. Reads may succeed (depending on consistency model).
Diagnosis:
# Check how many nodes are reachable
for node in node1 node2 node3; do
echo -n "$node: "
curl -s http://$node:9090/health && echo "OK" || echo "DOWN"
done
Resolution:
- If one node is down (3-node cluster): The remaining 2 nodes form a majority. Raft continues. Repair or replace the failed node.
- If two nodes are down: Quorum is lost. See Backup & Recovery for recovery procedures.
Shard split stalls
Symptoms: Shard reports high delta count or throughput but split does not complete.
Diagnosis:
kiseki-server shard info --shard-id shard-0001
Resolution:
- Verify the shard is not in maintenance mode (I-O6).
- Check if the cluster-wide concurrent migration limit is reached (I-SF4): max(1, num_nodes / 10).
- Check the exponential backoff timer (I-SF4): Minimum 2 hours between placement changes per shard.
- Manually trigger a split if auto-split is not firing:
kiseki-server shard split --shard-id shard-0001
Device issues
Integrity scrub
Trigger a manual integrity scrub to verify chunk data against EC parity:
# Scrub all devices
curl -X POST http://node1:9090/ui/api/ops/scrub
# Scrub a specific device
kiseki-server device scrub --device-id nvme-0001
The periodic scrub runs every 7 days by default (scrub_interval_h).
SMART warnings
Automatic evacuation triggers when a device reports:
- SSD: SMART wear indicator > 90%.
- HDD: > 100 bad sectors.
Check device health:
kiseki-server device info --device-id nvme-0001
Device evacuation
Monitor evacuation progress:
# List active repairs/evacuations
kiseki-server repair list
# Check device state
kiseki-server device info --device-id nvme-0001
Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed (I-D2).
A device in Evacuating state can be cancelled:
kiseki-server device cancel-evacuation --device-id nvme-0001
RemoveDevice is rejected unless the device state is Removed
(post-evacuation) (I-D5).
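The documented lifecycle and the I-D5 guard can be sketched as a small state machine (a sketch only: it models the forward chain Healthy -> Degraded -> Evacuating -> Failed -> Removed and the RemoveDevice precondition; the cancel-evacuation back-transition is omitted since its target state is not specified here):

```python
class DeviceStateError(Exception):
    pass

# The documented forward chain (I-D2).
FORWARD = ["Healthy", "Degraded", "Evacuating", "Failed", "Removed"]

class Device:
    def __init__(self):
        self.state = "Healthy"

    def advance(self, new_state: str) -> None:
        """Allow only single steps along the documented forward chain."""
        if FORWARD.index(new_state) != FORWARD.index(self.state) + 1:
            raise DeviceStateError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state

    def remove_device(self) -> None:
        """RemoveDevice is rejected unless the device is already Removed (I-D5)."""
        if self.state != "Removed":
            raise DeviceStateError("RemoveDevice rejected: state is " + self.state)
```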
Key management issues
Key manager unreachable
Symptoms: KeyManagerUnavailable errors. All chunk writes fail
cluster-wide (I-K12).
Diagnosis:
# Check key manager health
kiseki-server keymanager health
# Check connectivity from storage node
curl -s http://node1:9090/metrics | grep kms_reachability
Resolution:
- The key manager is a Raft-replicated HA service. If one node is down, the remaining majority continues serving.
- If the entire key manager cluster is unreachable, storage nodes use cached master keys (mlock’d in memory) for reads but cannot process new writes.
- Restore key manager connectivity as soon as possible.
Tenant KMS unreachable
Symptoms: TenantKmsUnreachable errors for operations involving
the affected tenant. Other tenants are unaffected.
Diagnosis:
kiseki-server keymanager check-kms --tenant-id acme-corp
Resolution:
- Check network connectivity to the tenant’s KMS endpoint.
- Check KMS credentials and certificate validity.
- The tenant admin is responsible for their KMS availability (I-K11).
Crypto-shred verification
After a crypto-shred, verify that all clients have wiped their caches:
# Check crypto-shred count
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total
# Check security events
curl 'http://node1:9090/ui/api/events?category=security'
Gateway issues
S3 errors
Common S3 error codes returned by the gateway:
| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | SigV4 authentication failure | Check access key/secret key. |
| 404 Not Found | Bucket or object does not exist | Verify namespace and key. |
| 507 Insufficient Storage | Pool full | Add capacity. See Capacity Full above. |
| 503 Service Unavailable | Raft quorum lost or maintenance mode | Wait for recovery or disable maintenance. |
NFS errors
| Error | Cause | Resolution |
|---|---|---|
| ESTALE | Shard split caused file handle invalidation | Retry the operation. |
| EIO | Internal error (chunk read failure, key manager unreachable) | Check server logs. |
| ENOSPC | Pool full | Add capacity. |
| EXDEV | Cross-shard rename (I-L8) | Use copy + delete instead. |
| ENOTSUP | Writable shared mmap (I-O8) | Use read/write instead of mmap for writes. |
Performance Tuning
Kiseki is designed for HPC and AI workloads running at 200+ Gbps per NIC. This guide covers tuning levers for maximizing throughput and minimizing latency.
Transport selection
The transport layer abstracts the network fabric. Kiseki automatically selects the best available transport, but manual override is possible.
Transport hierarchy (fastest to slowest)
| Transport | Typical bandwidth | Latency | Feature flag | Notes |
|---|---|---|---|---|
| CXI (HPE Slingshot) | 200 Gbps | <1 us | kiseki-transport/cxi | Requires libfabric with CXI provider. CSCS/Alps native. |
| InfiniBand verbs | 100-400 Gbps | 1-2 us | kiseki-transport/verbs | Requires RDMA-capable NICs and verbs libraries. |
| RoCE v2 | 25-100 Gbps | 2-5 us | kiseki-transport/verbs | RDMA over Converged Ethernet. Requires lossless fabric (PFC/ECN). |
| TCP | 10-100 Gbps | 50-200 us | (always available) | Fallback. Uses kernel TCP with TLS. |
Enabling high-performance transports
# Build with CXI support (requires libfabric development headers)
cargo build --release --features kiseki-transport/cxi
# Build with RDMA verbs support (requires rdma-core)
cargo build --release --features kiseki-transport/verbs
The client automatically detects available transports and selects the fastest one. Override with:
# Force TCP transport (e.g., for debugging)
KISEKI_TRANSPORT=tcp kiseki-client-fuse --mountpoint /mnt/kiseki
Transport tuning
- Connection pooling: The transport layer maintains a pool of connections per peer. Pool size adapts to workload.
- Keepalive: Connections are kept alive to avoid handshake overhead. Configure via KISEKI_TRANSPORT_KEEPALIVE_MS.
- Zero-copy: CXI and verbs transports use zero-copy DMA where possible.
NUMA pinning
For multi-socket servers, NUMA-aware placement is critical for avoiding cross-socket memory traffic.
Recommendations
- Pin kiseki-server to the NUMA node closest to the NIC: numactl --cpunodebind=0 --membind=0 kiseki-server
- Pin NVMe interrupts to the same NUMA node: echo 0 > /proc/irq/&lt;irq&gt;/smp_affinity_list
- Pin data devices to the NUMA node closest to their PCIe root complex.
systemd integration
[Service]
# Pin to NUMA node 0
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/local/bin/kiseki-server
Verification
# Check NUMA topology
numactl --hardware
# Check NIC NUMA node
cat /sys/class/net/eth0/device/numa_node
# Check NVMe NUMA node
cat /sys/block/nvme0n1/device/numa_node
Erasure coding parameters
EC parameters control the trade-off between storage overhead, repair bandwidth, and read performance.
Common configurations
| Config | Data | Parity | Overhead | Fault tolerance | Use case |
|---|---|---|---|---|---|
| 4+2 | 4 | 2 | 50% | 2 device failures | Default for NVMe. Good balance. |
| 8+3 | 8 | 3 | 37.5% | 3 device failures | Large HDD pools. Lower overhead. |
| 4+1 | 4 | 1 | 25% | 1 device failure | Low-criticality data. Minimum overhead. |
| 2+2 | 2 | 2 | 100% | 2 device failures | Small pools (<6 devices). High redundancy. |
Performance implications
- Read amplification: Reading a chunk requires reading data_chunks fragments. More data chunks = more read I/O.
- Write amplification: Writing a chunk requires writing data_chunks + parity_chunks fragments.
- Repair bandwidth: Repairing a lost fragment requires reading data_chunks fragments and writing 1. Higher data_chunks = more repair bandwidth.
- Minimum pool size: The pool must have at least data_chunks + parity_chunks devices.
EC parameters are immutable per pool after creation (I-C6). Choose carefully. Changing requires creating a new pool and migrating data.
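Since the parameters are immutable, it is worth computing the trade-offs before creating the pool. A small Python helper that derives the figures in the table above from a (data, parity) pair (illustrative; not a Kiseki API):

```python
def ec_profile(data_chunks: int, parity_chunks: int) -> dict:
    """Derive storage overhead, fault tolerance, and I/O figures for an
    erasure-coding configuration of data_chunks + parity_chunks."""
    return {
        "overhead_pct": 100.0 * parity_chunks / data_chunks,   # parity relative to data
        "fault_tolerance": parity_chunks,                       # device failures survived
        "min_pool_devices": data_chunks + parity_chunks,
        "write_fragments": data_chunks + parity_chunks,         # write amplification
        "repair_reads": data_chunks,                            # reads per lost fragment
    }
```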
Inline threshold (ADR-030)
The inline threshold determines whether small files are stored in the metadata tier (NVMe, redb) or the data tier (block device extents).
Tuning the threshold
The system automatically adjusts the threshold per-shard based on system disk capacity (I-SF1, I-SF2). Manual adjustment:
# Set cluster-wide default for new shards
kiseki-server tuning set --inline-threshold-bytes 8192
Trade-offs
| Threshold | Metadata tier impact | Data tier impact | Latency |
|---|---|---|---|
| 128 B (floor) | Minimal metadata growth | All files in chunks | Higher for tiny files |
| 4 KB (default) | Moderate growth | Small files inline | Lower for small files |
| 64 KB (ceiling) | Large growth | More inline data | Lowest for small files |
Monitoring
# Check system disk usage
df -h /var/lib/kiseki
# Check per-store sizes
du -sh /var/lib/kiseki/small/objects.redb
du -sh /var/lib/kiseki/raft/log.redb
The Raft inline throughput guard (I-SF7) automatically reduces the
threshold to the floor if inline write rate exceeds
KISEKI_RAFT_INLINE_MBPS (default 10 MB/s per shard). This prevents
inline data from starving metadata-only Raft operations during write
storms.
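The guard behaves like a per-shard rate limiter over a sliding window. A minimal Python sketch (the 1-second window, defaults, and class name are assumptions for illustration; only the documented floor, 4 KB default, and 10 MB/s budget come from the text):

```python
import time

class InlineThroughputGuard:
    """Drop the inline threshold to the floor when the inline write rate in the
    current window exceeds the per-shard budget (sketch of I-SF7)."""

    def __init__(self, limit_mbps: float = 10.0, floor: int = 128,
                 threshold: int = 4096, window: float = 1.0, clock=time.monotonic):
        self.limit_bytes = limit_mbps * 1024 * 1024   # budget per second
        self.floor = floor
        self.threshold = threshold
        self._window = window
        self._clock = clock
        self._window_start = clock()
        self._bytes = 0

    def record_inline_write(self, nbytes: int) -> int:
        """Account an inline write; return the threshold to use next."""
        now = self._clock()
        if now - self._window_start >= self._window:
            self._window_start, self._bytes = now, 0   # start a fresh window
        self._bytes += nbytes
        if self._bytes > self.limit_bytes * self._window:
            self.threshold = self.floor                # shed inline load
        return self.threshold
```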
Cache tuning (ADR-031)
L1 cache (in-memory)
The L1 cache holds decrypted plaintext chunks in process memory.
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L1_MAX | 1 GB | Set to 10-25% of available process memory. AI training with large datasets: increase. Memory-constrained compute: decrease. |
L2 cache (local NVMe)
The L2 cache uses local NVMe on compute nodes.
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L2_MAX | 100 GB | Set based on available NVMe capacity. Training datasets: size to fit the working set. Inference: size to fit model weights. |
Metadata TTL
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_META_TTL_MS | 5000 (5s) | Read-heavy workloads: increase for fewer metadata fetches. Low-latency requirements: decrease for fresher data. POSIX close-to-open consistency: 0 (no caching). |
Cache mode selection
| Workload | Recommended mode | Rationale |
|---|---|---|
| AI training (epoch reuse) | pinned | Dataset is re-read every epoch. Pin to avoid refetching. |
| AI inference | organic | Model weights are hot, prompts rotate. LRU works well. |
| HPC checkpoint/restart | bypass | Checkpoints are write-heavy. Caching checkpoints wastes NVMe. |
| Climate/weather staging | pinned | Boundary conditions staged once, read many times. |
| Interactive analysis | organic | Mixed access patterns. LRU adapts. |
Staging for training workloads
Pre-stage datasets before training begins to avoid cold-start latency:
# Slurm prolog script
kiseki-client-fuse --stage /datasets/imagenet --mountpoint /mnt/kiseki
export KISEKI_CACHE_POOL_ID=$(cat /var/cache/kiseki/pool_id)
# Workload picks up the staged cache via KISEKI_CACHE_POOL_ID
srun --export=ALL python train.py
Raft tuning
Snapshot interval
kiseki-server tuning set --raft-snapshot-interval 10000
- Lower values (1000-5000): More frequent snapshots. Faster catch-up for new nodes. More I/O.
- Higher values (50000-100000): Less snapshot overhead. Slower catch-up.
Compaction rate
kiseki-server tuning set --compaction-rate-mb-s 200
Higher compaction rate reduces Raft log size faster but consumes more I/O bandwidth.
View materialization poll interval
kiseki-server tuning set --stream-proc-poll-ms 50
Lower poll interval reduces view staleness but increases CPU usage.
Benchmark harness
Kiseki includes a transport benchmark for measuring raw fabric throughput:
# Run transport benchmarks (if available)
tests/hw/run_transport_bench.sh
What it measures
- Bandwidth: Sequential read/write throughput per transport.
- Latency: Round-trip latency (p50, p99, p999) per transport.
- IOPS: Random read/write IOPS per transport.
- Concurrency: Throughput scaling with connection count.
Interpreting results
| Metric | Good (CXI) | Good (TCP) | Action if below |
|---|---|---|---|
| Bandwidth | >150 Gbps | >50 Gbps | Check NIC config, MTU, NUMA pinning |
| Latency p99 | <10 us | <500 us | Check CPU frequency, interrupt coalescing |
| IOPS (4K random) | >1M | >100K | Check NVMe config, queue depth |
System tuning checklist
Kernel parameters
# Increase maximum open files
echo "fs.file-max = 1048576" >> /etc/sysctl.conf
# Increase socket buffer sizes for high-bandwidth transports
echo "net.core.rmem_max = 67108864" >> /etc/sysctl.conf
echo "net.core.wmem_max = 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 4096 87380 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 4096 65536 67108864" >> /etc/sysctl.conf
# Disable transparent hugepages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
NVMe tuning
# Set I/O scheduler to none (best for NVMe)
echo none > /sys/block/nvme0n1/queue/scheduler
# Increase queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
Process limits
# /etc/security/limits.d/kiseki.conf
kiseki soft nofile 1048576
kiseki hard nofile 1048576
kiseki soft memlock unlimited
kiseki hard memlock unlimited
Performance Tests
Benchmark results for kiseki on GCP infrastructure.
Test Environment
| Component | Spec |
|---|---|
| HDD nodes (3) | n2-standard-16, 3 x PD-Standard 200GB each |
| Fast nodes (2) | n2-standard-16, 2 x local NVMe + 2 x PD-SSD 375GB |
| Client nodes (3) | n2-standard-8, 100GB SSD cache |
| Ctrl node (1) | e2-standard-4, orchestrator |
| Network | GCP VPC, single subnet 10.0.0.0/24 |
| Region | europe-west6-c (Zurich) |
| Raft | Single group, 5 nodes, node 1 bootstrap |
| Release | v2026.1.352 (async GatewayOps, ADR-032) |
Results (2026-04-24)
Network Bandwidth
| Path | Throughput |
|---|---|
| Client → Leader (n2-standard-8 → n2-standard-16) | 15.2 - 15.3 Gbps |
| HDD → Fast cross-tier (n2-standard-16 → n2-standard-16) | 18.3 - 20.4 Gbps |
S3 Gateway
All S3 tests run from client nodes (n2-standard-8) with 8-way parallelism.
Write Throughput (single client → leader)
| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 1,624 ms | 123.2 MB/s |
| 4 MB | 50 | 8 | 239 ms | 836.8 MB/s |
| 16 MB | 25 | 8 | 363 ms | 1,101.9 MB/s |
Read Throughput
| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 176 ms | 1,136.4 MB/s |
PUT Latency (1 KB objects, sequential)
| Percentile | Latency |
|---|---|
| p50 | 7.6 ms |
| p99 | 8.6 ms |
| avg | 7.7 ms |
| max | 9.7 ms |
Aggregate Write (3 clients, parallel)
| Workload | Time | Aggregate Throughput |
|---|---|---|
| 3 x 100 x 1 MB (8 concurrent/client) | 2,205 ms | 136.1 MB/s |
NFS / pNFS / FUSE
Not yet tested on GCP. NFS mount from client nodes requires SSH key distribution from the ctrl node (OS Login configuration pending). FUSE requires the kiseki-client binary installed on client nodes.
Local testing (3-node cluster on localhost) confirms all protocols functional via unit and integration tests.
Prometheus Metrics
Gateway request counters showed 0 during the test. The
requests_total atomic counter in InMemoryGateway is not wired
to the Prometheus metrics exporter yet.
Local Test Results (same binary, localhost)
For comparison, local 3-node cluster results (loopback network, no disk I/O latency, 32-way parallelism):
| Test | Result |
|---|---|
| S3 Write 1 MB x 200 (32 parallel) | 380.2 MB/s |
| S3 Write 4 MB x 50 (32 parallel) | 349.7 MB/s |
| S3 Write 16 MB x 25 (32 parallel) | 340.7 MB/s |
| S3 Read 1 MB x 200 (32 parallel) | 913.2 MB/s |
| 32 concurrent PUTs | 50 ms (no deadlock) |
Observations
- Small object writes improved 9.6x after ADR-032 (async GatewayOps + lock-free composition writes). The composition lock is no longer held during Raft consensus, allowing concurrent writes to proceed in parallel.
- Read throughput exceeds write. Reads bypass Raft consensus (served from the local composition + chunk store) and hit 1.1 GB/s even for 1 MB objects.
- GCP outperforms localhost for large objects. The GCP network (15+ Gbps) and n2-standard-16 nodes have more bandwidth than localhost loopback under contention. 16 MB writes: 1,102 MB/s (GCP) vs 341 MB/s (local).
- Latency is network-bound. p50 latency on GCP (7.6 ms) includes network RTT + Raft consensus (5-node quorum). Local latency is dominated by CPU contention on the shared machine.
- Single Raft group is the write bottleneck. All writes go through one leader. A multi-shard deployment would distribute leaders across nodes, scaling write throughput linearly.
Known Issues
- Concurrent write deadlock (fixed in ADR-032). The sync→async bridge (`run_on_raft`) caused thread starvation under concurrent load. Fixed by making GatewayOps and LogOps fully async and moving log emission out of the composition lock scope. Result: 1 MB writes improved from 39.5 to 380.2 MB/s (9.6x).
- NFS mount on GCP. Requires SSH key distribution from ctrl to client nodes. The ctrl service account needs the `osAdminLogin` role and OS Login key registration.
- Prometheus counters. `gateway_requests_total` is not exported to the `/metrics` endpoint.
Running the Benchmark
# Local 3-node test
cargo build --release --bin kiseki-server
# Start 3 nodes (see examples/cluster-3node.env.node{1,2,3})
# Run: bash infra/gcp/benchmarks/perf-suite.sh
# GCP deployment
cd infra/gcp
terraform apply -var="project_id=PROJECT" -var="zone=ZONE" \
-var="release_tag=v2026.1.332"
# Deploy perf-suite.sh to ctrl node and run
See infra/gcp/benchmarks/perf-suite.sh for the full benchmark
script and infra/gcp/benchmarks/run-perf.sh for the local
deployment wrapper.
Comparison with Ceph and Lustre
Single-Leader Kiseki vs Typical Deployments (similar hardware scale)
| Metric | Kiseki (1 leader) | Ceph RGW (S3) | Lustre |
|---|---|---|---|
| Large object write | 1.1 GB/s (16 MB) | 0.5-2 GB/s | 1-2 GB/s per OST |
| Small object write | 122 MB/s (1 MB) | 50-200 MB/s | 200-500 MB/s |
| Read throughput | 1.1 GB/s | 1-3 GB/s | 2-10 GB/s |
| PUT latency | p50: 7.6 ms | p50: 2-5 ms | p50: <1 ms (POSIX) |
| Aggregate 3-client | 133 MB/s | 300-800 MB/s | 1-5 GB/s |
| Encryption | Always (AES-256-GCM) | Optional (rarely on) | No |
Why aggregate throughput is lower
All writes go through a single Raft leader (single Raft group). Ceph distributes across PGs/OSDs; Lustre stripes across OSTs. They parallelize writes across all nodes, while Kiseki serializes them through one leader. This is a deployment constraint, not an architectural limit.
Where Kiseki is strong
- Per-leader throughput is excellent. 1.1 GB/s per leader with full AES-256-GCM encryption is comparable to Ceph RGW without encryption. The crypto overhead is nearly invisible (aws-lc-rs with AES-NI).
- Read throughput matches. Reads bypass Raft consensus entirely and serve from the local composition + chunk store. Multi-node reads scale linearly since any node can serve.
- Latency is reasonable. 7.6 ms includes Raft consensus over the network plus encryption. Ceph’s 2-5 ms S3 latency is lower but typically without encryption. Lustre’s sub-ms is POSIX (kernel bypass), not comparable to HTTP/S3.
Bottleneck analysis
- Not bottlenecked by crypto – AES-256-GCM at 1.1 GB/s means the CPU encrypts faster than the network/Raft can deliver.
- Not bottlenecked by network – 15 Gbps available, using <10 Gbps.
- Bottlenecked by Raft consensus – 7.6 ms per round-trip for small objects, amortized for large ones.
- Multi-shard is the path to parity – linear scaling with shard count, same model as Ceph PGs and Lustre OSTs.
Projected multi-shard performance
| Shards | 1 MB Write | 16 MB Write | Read |
|---|---|---|---|
| 1 | 122 MB/s | 1.1 GB/s | 1.1 GB/s |
| 3 | ~366 MB/s | ~3.4 GB/s | ~3.4 GB/s |
| 5 | ~610 MB/s | ~5.7 GB/s | ~5.7 GB/s |
At 5 shards on the same hardware, Kiseki reaches parity with Ceph and approaches Lustre – while encrypting all data at rest and in transit, on commodity GCP VMs with network-attached storage (not local NVMe or InfiniBand).
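The projected table is a linear model: each added shard contributes an independent Raft leader. A one-line sketch (the `efficiency` knob is hypothetical, for modeling cross-shard overhead; the measured baseline is the 1-shard row):

```python
def project(single_shard_mb_s, shards, efficiency=1.0):
    """Linear multi-shard projection: throughput scales with leader count.
    efficiency < 1.0 would model cross-shard coordination overhead."""
    return single_shard_mb_s * shards * efficiency

# 1 MB write column, from the 122 MB/s single-shard baseline.
for n in (1, 3, 5):
    print(f"{n} shards: ~{project(122, n):.0f} MB/s")
# 1 shards: ~122 MB/s
# 3 shards: ~366 MB/s
# 5 shards: ~610 MB/s
```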
Capacity Planning
Kiseki separates metadata and data onto different storage tiers. Proper sizing of both tiers is critical for stable operation at scale.
Storage tiers
Each storage node has two distinct storage tiers:
System disk (metadata tier)
The system partition hosts:
- Raft log (`raft/log.redb`): Bounded by the snapshot interval. Grows with write rate, compacted periodically.
- Key epochs (`keys/epochs.redb`): Tiny (<10 MB). One entry per key epoch.
- Chunk metadata (`chunks/meta.redb`): Scales linearly with file count. Approximately 80 bytes per file.
- Inline content (`small/objects.redb`): Variable. Controlled by the dynamic inline threshold (ADR-030).
Requirements:
- NVMe or SSD strongly recommended. HDD system disks trigger a boot warning because Raft fsync latency will be 5-10 ms per commit.
- RAID-1 on 2x SSD for redundancy (the system disk is not protected by Kiseki’s EC; it uses traditional RAID).
- Size based on expected file count and inline content.
Data devices (data tier)
Data devices are JBOD-managed by Kiseki. They store chunk ciphertext as extents on raw block devices (ADR-029).
Requirements:
- NVMe, SSD, or HDD depending on the pool’s device class.
- Multiple devices per node for EC placement (I-D4: no two EC fragments on the same device).
- JBOD (no RAID): Kiseki manages durability via EC or replication.
Metadata capacity sizing
Per-file metadata footprint
| Component | Per file | Notes |
|---|---|---|
| Delta log entry | ~200 bytes | Raft log entry with header fields |
| Chunk metadata | ~80 bytes | Extent index entry in chunks/meta.redb |
| Subtotal (no inline) | ~280 bytes | Fixed per file |
| Inline content | 0 to 64 KB | Only if file is below inline threshold |
Capacity examples
10 billion files, 50-node cluster, RF=3, no inline:
| Component | Cluster total | Per node |
|---|---|---|
| Delta log (metadata only) | ~2 TB | ~120 GB |
| Chunk metadata index | ~0.8 TB | ~48 GB |
| Total metadata | ~2.8 TB | ~168 GB |
At 168 GB per node, 256 GB NVMe system disks are tight. Larger system disks (512 GB or 1 TB) provide comfortable headroom.
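The per-node figures fall out of file count × per-file footprint × replication factor, divided across nodes. A sketch reproducing the table above (function name is illustrative):

```python
def metadata_per_node_gb(files, bytes_per_file, replication_factor, nodes):
    """Replicated metadata footprint per node, in decimal GB (as in the tables)."""
    total_bytes = files * bytes_per_file * replication_factor
    return total_bytes / nodes / 1e9

FILES, RF, NODES = 10_000_000_000, 3, 50
delta_log = metadata_per_node_gb(FILES, 200, RF, NODES)  # delta log entries
chunk_idx = metadata_per_node_gb(FILES, 80, RF, NODES)   # chunk metadata index
print(f"delta log: {delta_log:.0f} GB/node, chunk index: {chunk_idx:.0f} GB/node, "
      f"total: {delta_log + chunk_idx:.0f} GB/node")
# delta log: 120 GB/node, chunk index: 48 GB/node, total: 168 GB/node
```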
10 billion files, 50-node cluster, RF=3, with inline (4 KB threshold):
| Component | Cluster total | Per node |
|---|---|---|
| Metadata (as above) | ~2.8 TB | ~168 GB |
| Inline content (10% of files < 4 KB, avg 2 KB) | ~2 TB | ~120 GB |
| Total | ~4.8 TB | ~288 GB |
This exceeds 256 GB system disks. The dynamic inline threshold (ADR-030) prevents this by automatically reducing the threshold when system disk usage approaches the soft limit.
Capacity monitoring
The system automatically monitors metadata disk usage and adjusts:
| Usage level | Response |
|---|---|
| Below `KISEKI_META_SOFT_LIMIT_PCT` (50%) | Normal operation |
| Above soft limit | Inline threshold reduced |
| Above `KISEKI_META_HARD_LIMIT_PCT` (75%) | Threshold forced to floor (128 B), alert emitted |
Alerts use out-of-band gRPC health reports (not Raft) so that a full-disk node can signal without writing Raft entries (I-SF2).
Dynamic inline threshold (ADR-030)
The inline threshold is computed per-shard as the minimum affordable threshold across all Raft voters:
available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)
Where:
| Parameter | Value |
|---|---|
| `INLINE_FLOOR` | 128 bytes (hard lower bound) |
| `INLINE_CEILING` | 64 KB (system-wide maximum) |
| `KISEKI_META_SOFT_LIMIT_PCT` | 50% (default) |
| `KISEKI_META_HARD_LIMIT_PCT` | 75% (default) |
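The formula above transcribes directly into runnable Python (function name is illustrative):

```python
INLINE_FLOOR = 128           # bytes, hard lower bound
INLINE_CEILING = 64 * 1024   # bytes, system-wide maximum

def shard_inline_threshold(voter_budgets_bytes, projected_files):
    """Per-shard inline threshold: the minimum affordable threshold across
    all Raft voters, clamped to [INLINE_FLOOR, INLINE_CEILING] (ADR-030)."""
    available = min(voter_budgets_bytes)
    raw = available // max(projected_files, 1)
    return max(INLINE_FLOOR, min(raw, INLINE_CEILING))

# Ample budget (100+ GiB per voter, 1M files): threshold rides the ceiling.
print(shard_inline_threshold([100 * 2**30, 120 * 2**30], 1_000_000))   # 65536
# Tight budget (1 GiB, 100M projected files): clamps to the floor.
print(shard_inline_threshold([2**30, 2**30], 100_000_000))             # 128
```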
Threshold behavior
- Decrease: Automatic and safe. New files use the chunk path. Existing inline data is not retroactively migrated (I-L9).
- Increase: Requires cluster admin decision. May trigger optional background migration of small chunked files back to inline.
- Emergency: If any voter reports hard-limit breach, the leader commits a threshold reduction via Raft (2/3 majority; the full-disk node’s vote is not required).
Raft throughput guard (I-SF7)
The effective inline threshold is further clamped by a per-shard Raft
log throughput budget (KISEKI_RAFT_INLINE_MBPS, default 10 MB/s).
If the shard’s inline write rate exceeds this budget, the threshold
temporarily drops to the floor until the rate subsides. This prevents
inline data from starving metadata-only Raft operations during write
storms.
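In effect the guard is a second clamp on top of the capacity-driven threshold; a minimal sketch (function name hypothetical, behavior per I-SF7 as described above):

```python
INLINE_FLOOR = 128
KISEKI_RAFT_INLINE_MBPS = 10  # default per-shard inline budget, MB/s

def effective_inline_threshold(capacity_threshold, inline_rate_mbps,
                               budget_mbps=KISEKI_RAFT_INLINE_MBPS):
    """I-SF7 guard: if inline writes exceed the Raft log budget,
    force the threshold to the floor until the rate subsides."""
    if inline_rate_mbps > budget_mbps:
        return INLINE_FLOOR
    return capacity_threshold

print(effective_inline_threshold(4096, 6.0))   # under budget: 4096
print(effective_inline_threshold(4096, 25.0))  # write storm: 128
```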
Pool capacity thresholds
Data-tier capacity is managed per pool. Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels (ADR-024):
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Why NVMe/SSD thresholds are lower
NVMe and SSD devices experience write amplification from garbage collection at high fill levels. Above ~80% fill, GC pressure increases sharply, causing:
- Increased write latency (10-100x during GC storms).
- Reduced effective write bandwidth.
- Accelerated wear.
Enterprise storage arrays (VAST, Pure) operate at 95%+ because they have global wear leveling across all flash in the system. JBOD devices do not have this capability, so Kiseki’s thresholds are more conservative.
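The threshold table above can be expressed as a simple classifier keyed on device class (a sketch; state and class names follow the table, the function itself is illustrative):

```python
# (upper-bound fill %, state) pairs per device class, from the table above.
THRESHOLDS = {
    "nvme": [(75, "Healthy"), (85, "Warning"), (92, "Critical"),
             (97, "ReadOnly"), (100, "Full")],
    "hdd":  [(85, "Healthy"), (92, "Warning"), (97, "Critical"),
             (99, "ReadOnly"), (100, "Full")],
}

def pool_state(device_class, fill_pct):
    """Map a pool's fill level to its capacity state (ADR-024)."""
    for upper, state in THRESHOLDS[device_class]:
        if fill_pct <= upper:
            return state
    return "Full"

print(pool_state("nvme", 80))  # Warning: NVMe GC pressure starts earlier
print(pool_state("hdd", 80))   # Healthy: HDD tolerates higher fill
```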
Growth estimation
File count growth
Monitor kiseki_shard_delta_count to track delta (file) accumulation:
# Current delta count per shard
curl -s http://node1:9090/metrics | grep kiseki_shard_delta_count
Use the rate of delta count increase to project when the metadata tier will reach capacity.
Data volume growth
Monitor pool capacity metrics:
# Current pool utilization
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity
Projection formula
days_until_full = (capacity_total - capacity_used) / daily_write_rate
For metadata:
metadata_per_file = 280 bytes (no inline) or 280 + avg_inline_size (with inline)
days_until_full = (system_disk_capacity * soft_limit_pct - current_used) /
(new_files_per_day * metadata_per_file * replication_factor)
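The metadata projection formula, as runnable Python with illustrative numbers (a 512 GB system disk, 100 GB already used, one million new files per day; the function name is hypothetical):

```python
def metadata_days_until_full(disk_bytes, soft_limit_pct, used_bytes,
                             new_files_per_day, metadata_per_file=280,
                             replication_factor=3):
    """Days until the metadata tier reaches the soft limit, per the formula above."""
    budget = disk_bytes * soft_limit_pct / 100 - used_bytes
    daily_bytes = new_files_per_day * metadata_per_file * replication_factor
    return budget / daily_bytes

days = metadata_days_until_full(512e9, 50, 100e9, 1_000_000)
print(f"~{days:.0f} days until soft limit")
# ~186 days until soft limit
```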
Sizing recommendations
Small deployment (development/testing)
| Component | Recommendation |
|---|---|
| Nodes | 3 (minimum for Raft) |
| System disk | 256 GB NVMe each (RAID-1 on 2x SSD) |
| Data devices | 2x 1 TB NVMe per node |
| Key manager | Co-located with storage nodes (internal KMS) |
| File count | Up to 100 million |
Medium deployment (departmental HPC)
| Component | Recommendation |
|---|---|
| Nodes | 5-10 |
| System disk | 512 GB NVMe each (RAID-1) |
| Data devices | 4-8 NVMe per node (2-8 TB each) |
| Key manager | 3 dedicated nodes |
| File count | Up to 1 billion |
Large deployment (institutional HPC/AI)
| Component | Recommendation |
|---|---|
| Nodes | 50-200 |
| System disk | 1 TB NVMe each (RAID-1) |
| Data devices | 8-24 devices per node, mixed tiers (NVMe + SSD + HDD) |
| Key manager | 5 dedicated nodes |
| File count | Up to 10 billion |
| Total capacity | 100 PB+ |
Rules of thumb
- System disk: Size at 2x the expected metadata footprint for comfortable headroom. Include inline content estimates.
- Data devices: At least `ec_data_chunks + ec_parity_chunks` devices per pool (for EC placement across distinct devices, I-D4).
- Network: CXI or InfiniBand for clusters where storage bandwidth is critical. TCP is acceptable for cold-tier pools.
- Memory: At least 64 GB per storage node for Raft state, chunk metadata caching, and stream processor buffers.
Capacity alerts
Configuring alerts
Use Prometheus alerting rules (see Monitoring) to detect capacity issues before they become critical:
- alert: KisekiSystemDiskWarning
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.50
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "System disk above 50% on {{ $labels.instance }}"
- alert: KisekiSystemDiskCritical
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.25
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "System disk above 75% on {{ $labels.instance }}"
When to add capacity
- System disk above 50% (soft limit): Plan for capacity expansion. Inline threshold will start decreasing.
- System disk above 75% (hard limit): Urgent. Inline threshold is at floor. Add nodes or upgrade system disks.
- Pool above Warning threshold: Monitor growth. Plan for device additions.
- Pool above Critical threshold: Writes are being rejected. Add devices immediately or evacuate data to another pool.
gRPC Services
Kiseki exposes several gRPC services across two network ports. Data-path services run on port 9100. The advisory service runs on a separate listener at port 9101 (isolated runtime, ADR-021).
LogService
Port: 9100 (data fabric)
Provider: kiseki-log (via kiseki-server)
Consumers: Composition, View stream processors, Gateway, Client
| RPC | Type | Description |
|---|---|---|
| `AppendDelta` | Unary | Append a delta to a shard. Returns the assigned sequence number. Commits via Raft majority before ack (I-L2). |
| `ReadDeltas` | Server streaming | Read a range of deltas from a shard. Used by view stream processors for materialization. |
| `TruncateLog` | Unary | Trigger delta GC up to the minimum consumer watermark. Returns the new GC boundary. |
| `ShardHealth` | Unary | Query shard health, Raft state, and replication status. |
| `SplitShard` | Unary | Trigger mandatory shard split at a given boundary. |
| `SetMaintenance` | Unary | Enable or disable maintenance mode on a shard (I-O6). |
| `CompactShard` | Unary | Trigger compaction (header-only merge, I-O2). |
KeyManagerService
Port: Internal network (dedicated key manager cluster)
Provider: kiseki-keymanager (via kiseki-keyserver)
Consumers: Storage nodes (chunk encryption), Gateway, Client
| RPC | Type | Description |
|---|---|---|
| `FetchMasterKey` | Unary | Fetch the master key for a given epoch. Used at node startup and rotation. |
| `RotateKey` | Unary | Rotate system or tenant keys. Creates a new epoch. |
| `CryptoShred` | Unary | Destroy tenant KEK, rendering all tenant data unreadable. |
| `FullReEncrypt` | Unary | Trigger full re-encryption of a tenant’s data under new keys. |
| `FetchTenantKek` | Unary | Fetch tenant KEK for wrapping/unwrapping operations. |
| `CheckKmsHealth` | Unary | Check tenant KMS provider connectivity. |
| `KeyManagerHealth` | Unary | Query key manager cluster health and Raft state. |
System DEK derivation is local (HKDF, no RPC). Only master key fetch and tenant KEK operations require network calls (ADR-003).
ControlService
Port: Management network
Provider: kiseki-control
Consumers: Admin CLI, storage nodes, advisory runtime
Tenant management
| RPC | Description |
|---|---|
| `CreateOrg` | Create a new organization (top-level tenant) |
| `CreateProject` | Create a project within an organization |
| `CreateWorkload` | Create a workload within an org or project |
| `DeleteOrg` / `DeleteProject` / `DeleteWorkload` | Remove tenant hierarchy nodes |
Namespace and policy
| RPC | Description |
|---|---|
| `CreateNamespace` | Create a tenant-scoped namespace |
| `SetComplianceTags` | Set compliance regime tags (inherit downward) |
| `SetQuota` | Set resource quotas at org/project/workload level |
| `SetRetentionHold` | Create a retention hold on a namespace or composition |
| `ReleaseRetentionHold` | Release an active retention hold |
IAM
| RPC | Description |
|---|---|
| `RequestAccess` | Cluster admin requests access to tenant data |
| `ApproveAccess` | Tenant admin approves access request |
| `DenyAccess` | Tenant admin denies access request |
Operations
| RPC | Description |
|---|---|
| `SetMaintenanceMode` | Enable/disable cluster-wide maintenance mode |
| `ListFlavors` / `MatchFlavor` | Query and match deployment flavors |
Federation
| RPC | Description |
|---|---|
| `RegisterFederationPeer` | Register a remote Kiseki cluster for async replication |
Advisory policy
| RPC | Description |
|---|---|
| `SetAdvisoryPolicy` | Configure profiles, budgets, and state per scope |
| `TransitionAdvisoryState` | Transition advisory state (enabled/draining/disabled) |
| `GetEffectiveAdvisoryPolicy` | Compute effective policy for a workload (min across hierarchy) |
WorkflowAdvisoryService
Port: 9101 (data fabric, separate listener)
Provider: kiseki-advisory (via kiseki-server, isolated tokio runtime)
Consumers: Native client, any authorized tenant caller
| RPC | Type | Description |
|---|---|---|
| `DeclareWorkflow` | Unary | Declare a new workflow with profile, initial phase, and TTL. Returns a WorkflowRef handle and authorized pool handles. |
| `EndWorkflow` | Unary | End a declared workflow. Triggers audit summary and GC of workflow state. |
| `PhaseAdvance` | Unary | Advance to the next phase. Phase order is monotonic (I-WA13). |
| `GetWorkflowStatus` | Unary | Query current workflow state, phase, and budget usage. |
| `AdvisoryStream` | Bidirectional streaming | Multiplexed channel: hints in (client to storage), telemetry out (storage to client). |
| `SubscribeTelemetry` | Server streaming | Subscribe to specific telemetry channels for a workflow. |
Advisory stream message types
Inbound hints (client to storage):
- Access pattern declaration
- Prefetch range (up to 4096 tuples per hint, I-WA16)
- Affinity pool preference (via opaque pool handles, I-WA19)
- Priority class (within policy-allowed maximum)
- Retention intent
- Dedup intent
- Collective checkpoint announcement
- Deadline hint
Outbound telemetry (storage to client):
- Backpressure signal (ok / soft / hard severity with retry-after)
- Placement locality class (local-node / local-rack / same-pool / remote / degraded)
- Materialization lag
- Prefetch effectiveness
- QoS headroom
- Hotspot detection (caller-owned compositions only)
StorageAdminService (ADR-025)
Port: Management network
Provider: kiseki-server
Consumers: Cluster admin, SRE (read-only role)
| RPC | Type | Description |
|---|---|---|
| `ClusterStatus` | Unary | Cluster-wide status summary |
| `ListDevices` / `GetDevice` | Unary | Query storage devices |
| `AddDevice` / `RemoveDevice` | Unary | Add or remove a device (removal requires Removed state) |
| `EvacuateDevice` / `CancelEvacuation` | Unary | Trigger or cancel device evacuation |
| `ListPools` / `GetPool` / `PoolStatus` | Unary | Query affinity pools |
| `CreatePool` / `SetPoolDurability` / `SetPoolThresholds` | Unary | Manage pool configuration |
| `RebalancePool` / `CancelRebalance` | Unary | Trigger or cancel pool rebalance |
| `ListShards` / `GetShard` / `GetShardHealth` | Unary | Query shard state |
| `SplitShard` / `SetShardMaintenance` | Unary | Shard management |
| `SetTuningParams` / `GetTuningParams` | Unary | Runtime tuning parameters |
| `DrainNode` | Unary | Drain all shards and chunks from a node |
| `TriggerScrub` / `RepairChunk` / `ListRepairs` | Unary | Data integrity operations |
| `DeviceHealth` | Server streaming | Live device health events |
| `IOStats` | Server streaming | Live I/O statistics |
| `DeviceIOStats` | Server streaming | Per-device I/O statistics |
DiscoveryService
Port: 9100 (data fabric)
Provider: kiseki-server
Consumers: Native client
Used by the native client to discover shards, views, and gateways from the data fabric without requiring direct control plane access (I-O4, ADR-008).
Protocol binding
- Protobuf definitions: `proto/kiseki/v1/*.proto`
- Generated code: `kiseki-proto` crate
- Workflow ref header: `x-kiseki-workflow-ref-bin` (16 raw bytes as gRPC binary metadata, not a proto field, per ADR-021)
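Because the key ends in `-bin`, gRPC libraries treat the value as raw bytes and base64-encode it on the wire automatically. A hedged sketch of how a Python caller might build the per-call metadata entry (the helper function is illustrative, not part of any Kiseki client API; the actual channel call is omitted):

```python
import os

WORKFLOW_REF_KEY = "x-kiseki-workflow-ref-bin"

def workflow_ref_metadata(ref: bytes):
    """Build the gRPC metadata entry carrying a 16-byte workflow ref."""
    if len(ref) != 16:
        raise ValueError("workflow ref must be exactly 16 bytes")
    # gRPC metadata is a sequence of (key, value) pairs; "-bin" keys take bytes.
    return [(WORKFLOW_REF_KEY, ref)]

meta = workflow_ref_metadata(os.urandom(16))
print(meta[0][0])  # x-kiseki-workflow-ref-bin
```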
REST & Admin API
The kiseki-server binary exposes an HTTP server (default port 9090) for
health checks, Prometheus metrics, and an admin dashboard. All endpoints
are served via axum.
Health and metrics
GET /health
Liveness probe for load balancers and orchestrators.
Response: 200 OK with body ok when the server is running.
GET /metrics
Prometheus text-format metrics endpoint.
Response: 200 OK with text/plain body containing all registered
Prometheus metrics including:
- Raft state per shard (leader, follower, candidate)
- Chunk operations (reads, writes, dedup hits)
- Transport metrics (connections, bytes, errors per transport type)
- Pool utilization (capacity, used, free per pool)
- View materialization lag
- Advisory budget usage
Admin dashboard
GET /ui
HTML admin dashboard with HTMX live polling. Provides a visual overview of cluster health, node status, and operational metrics.
The dashboard polls the JSON API endpoints below for live updates.
JSON API endpoints
GET /ui/api/cluster
Cluster-wide summary with aggregated metrics from all nodes.
Response: JSON object with node count, total capacity, total used, shard count, and aggregated health status.
GET /ui/api/nodes
List of all known nodes with per-node metrics.
Response: JSON array of node objects, each with node ID, address, status, device count, shard count, and key metrics.
GET /ui/api/history
Metric time series for charting.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `hours` | float | 3 | Number of hours of history to retrieve |
Response: JSON object with hours and points array containing
timestamped metric snapshots.
GET /ui/api/events
Filtered event log for diagnostics and alerting.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `severity` | string | (all) | Filter by severity: info, warning, error, critical |
| `category` | string | (all) | Filter by category: node, shard, device, tenant, security, admin |
| `hours` | float | 3 | Hours to look back |
Response: JSON array of event objects with timestamp, severity, category, message, and source.
Operations endpoints
These endpoints trigger operational actions and require cluster admin authentication.
POST /ui/api/ops/maintenance
Toggle maintenance mode for the cluster or specific shards.
Request body: JSON with enabled (boolean) and optional shard_id.
Effect: Sets shards to read-only. Write commands are rejected with a retriable error (I-O6). Shard splits, compaction, and GC for in-progress operations continue.
POST /ui/api/ops/backup
Trigger a backup operation.
Request body: JSON with backup configuration parameters.
Effect: Initiates backup per ADR-016. Returns a job ID for status tracking.
POST /ui/api/ops/scrub
Trigger a data integrity scrub.
Request body: JSON with optional scope (pool, device, or cluster-wide).
Effect: Verifies chunk integrity via EC checksums. Reports corrupt or missing chunks. Triggers automatic repair for recoverable issues.
HTMX fragment endpoints
These endpoints return HTML fragments for the admin dashboard’s live polling:
| Endpoint | Description |
|---|---|
| `GET /ui/fragment/cluster-cards` | Cluster status summary cards |
| `GET /ui/fragment/node-table` | Node list table rows |
| `GET /ui/fragment/chart-data` | Chart data for metrics graphs |
| `GET /ui/fragment/alerts` | Active alerts and warnings |
CLI Reference
Kiseki provides two binaries with CLI interfaces: kiseki-server (which
doubles as the admin CLI) and kiseki-client (native client with staging
and cache commands).
All admin operations use these CLIs. The underlying gRPC API is also available for programmatic access (see gRPC), but the CLI is the primary admin interface.
kiseki-server
The server binary starts the storage node when invoked without arguments. When invoked with a subcommand, it acts as an admin CLI that connects to the local node’s gRPC endpoint.
Server mode
kiseki-server
Starts the storage node. Configuration is via environment variables (see Environment Variables).
status
kiseki-server status
Display cluster status summary: node count, shard count, device health, Raft leadership, and pool utilization.
Node management
kiseki-server node add --node-id <id>
kiseki-server node drain --node-id <id>
kiseki-server node remove --node-id <id>
Add, drain, or remove a node from the cluster. Drain migrates shard assignments before removal. See Cluster Management.
Shard management
kiseki-server shard list
kiseki-server shard info --shard-id <id>
kiseki-server shard health --shard-id <id>
kiseki-server shard split --shard-id <id> [--boundary <key>]
kiseki-server shard maintenance --shard-id <id> --enabled
kiseki-server shard maintenance --shard-id <id> --disabled
List shards, inspect details, check health, trigger manual splits, and toggle per-shard maintenance mode (I-O6).
Pool management
kiseki-server pool list
kiseki-server pool status --pool-id <id>
kiseki-server pool create --pool-id <id> --device-class <class> --ec-data <n> --ec-parity <n>
kiseki-server pool set-durability --pool-id <id> --ec-data <n> --ec-parity <n>
kiseki-server pool rebalance --pool-id <id>
kiseki-server pool cancel-rebalance --pool-id <id>
kiseki-server pool set-thresholds --pool-id <id> --warning-pct <n> --critical-pct <n>
Manage affinity pools: create, inspect capacity, set EC parameters, rebalance data, and adjust capacity thresholds (I-C5, I-C6).
Device management
kiseki-server device list
kiseki-server device info --device-id <id>
kiseki-server device evacuate --device-id <id>
kiseki-server device cancel-evacuation --device-id <id>
kiseki-server device scrub --device-id <id>
List devices, check health and SMART status, trigger evacuation or integrity scrub, and cancel in-progress evacuations (I-D2, I-D3, I-D5).
Maintenance mode
kiseki-server maintenance on
kiseki-server maintenance off
Enable or disable cluster-wide maintenance mode. Sets all shards to read-only. Write commands are rejected with a retriable error. Shard splits, compaction, and GC for in-progress operations continue but no new triggers fire from write pressure (I-O6).
Backup and recovery
kiseki-server backup create
kiseki-server backup list
kiseki-server backup delete --backup-id <id>
kiseki-server repair list
kiseki-server compact
Create, list, and delete backup snapshots. List active repairs and evacuations. Trigger Raft log compaction.
Key management
kiseki-server keymanager health
kiseki-server keymanager check-kms
kiseki-server keymanager check-kms --tenant-id <id>
Check system key manager health and tenant KMS connectivity.
S3 credentials
kiseki-server s3-credentials create --tenant-id <id> --workload-id <id>
Provision S3-compatible access keys for a tenant workload via the control plane.
Tuning parameters
kiseki-server tuning set --inline-threshold-bytes <n>
kiseki-server tuning set --raft-snapshot-interval <n>
kiseki-server tuning set --compaction-rate-mb-s <n>
kiseki-server tuning set --stream-proc-poll-ms <n>
Adjust cluster-wide tuning parameters. See Performance Tuning for guidance.
kiseki-client
The native client binary provides dataset staging and cache management commands for compute nodes.
stage --dataset
kiseki-client stage --dataset <path> [--timeout <seconds>]
Pre-fetch a dataset’s chunks into the L2 cache with pinned retention. Recursively enumerates compositions under the given namespace path, fetches all chunks from canonical, verifies by content-address (SHA-256), and stores in the L2 cache pool.
Staging is idempotent and resumable. Produces a manifest file listing staged compositions and chunk IDs.
Limits: max_staging_depth (10 levels), max_staging_files (100,000).
stage --status
kiseki-client stage --status
Show the status of the current staging operation: progress, number of chunks fetched, total size, and any errors.
stage --release
kiseki-client stage --release <path>
Release a staged dataset. Unpins cached chunks, making them eligible for LRU eviction. To pick up updates from canonical, release and re-stage.
stage --release-all
kiseki-client stage --release-all
Release all staged datasets.
cache --stats
kiseki-client cache --stats
Print cache statistics: mode, L1/L2 bytes used, hit/miss counts, errors, metadata cache stats, and wipe count.
cache --wipe
kiseki-client cache --wipe
Wipe all cached data (L1 + L2 + metadata). Zeroizes data before deletion (I-CC2).
version
kiseki-client version
Print the client version.
Environment variables (kiseki-client)
| Variable | Default | Description |
|---|---|---|
| `KISEKI_CACHE_DIR` | /tmp/kiseki-cache | Cache directory |
| `KISEKI_CACHE_MODE` | organic | Cache mode: pinned, organic, bypass |
| `KISEKI_CACHE_L1_MAX` | 268435456 (256 MB) | L1 max bytes |
| `KISEKI_CACHE_L2_MAX` | 53687091200 (50 GB) | L2 max bytes |
kiseki-admin
Standalone remote administration CLI. Runs from an admin workstation and connects to any Kiseki node via the REST API (port 9090). No server dependencies are needed on the workstation.
Default endpoint: KISEKI_ENDPOINT env var, or http://localhost:9090.
status
kiseki-admin --endpoint http://storage-node:9090 status
Cluster status summary: node count, Raft entries, gateway requests, data written/read, and active connections.
Example output:
Cluster Status
══════════════
Nodes: 3/3 healthy
Raft: 42,567 entries
Requests: 1,234 served
Written: 12.5 GB
Read: 8.2 GB
Connections: 15 active
nodes
kiseki-admin nodes
Node list with health badges and per-node metrics.
Example output:
NODE STATUS RAFT REQUESTS WRITTEN READ CONNS
10.0.0.1:9090 healthy 14,189 411 4.2 GB 2.7 GB 5
10.0.0.2:9090 healthy 14,189 412 4.2 GB 2.8 GB 5
10.0.0.3:9090 healthy 14,189 411 4.1 GB 2.7 GB 5
events
kiseki-admin events [--severity error] [--hours 1]
Filtered event log. Optional --severity (info, warning, error,
critical) and --hours (default: 3).
Example output:
TIME SEVERITY CATEGORY SOURCE MESSAGE
12:34:56 ERROR node node-3 unreachable
12:35:12 ERROR device nvme0n1 CRC mismatch detected
history
kiseki-admin history [--hours 3]
Metric history time series for the specified number of hours (default: 3).
maintenance
kiseki-admin maintenance on
kiseki-admin maintenance off
Toggle cluster-wide maintenance mode. Enables read-only on all shards. Write commands return a retriable error (I-O6).
backup
kiseki-admin backup
Trigger a background backup operation (ADR-016).
scrub
kiseki-admin scrub
Trigger a background data integrity scrub.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | Connection failure (server unreachable) |
| 4 | Authentication failure (mTLS) |
Environment Variables
All Kiseki configuration is done via environment variables. No configuration files are used for runtime settings (I-K8: keys are never stored in configuration files).
Server configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_DATA_ADDR` | SocketAddr | 0.0.0.0:9100 | Data-path gRPC listener address |
| `KISEKI_ADVISORY_ADDR` | SocketAddr | 0.0.0.0:9101 | Advisory gRPC listener address (isolated runtime) |
| `KISEKI_S3_ADDR` | SocketAddr | 0.0.0.0:9000 | S3 HTTP gateway listener address |
| `KISEKI_NFS_ADDR` | SocketAddr | 0.0.0.0:2049 | NFS server listener address |
| `KISEKI_METRICS_ADDR` | SocketAddr | 0.0.0.0:9090 | Prometheus metrics and admin UI listener address |
| `KISEKI_DATA_DIR` | PathBuf | (none) | Persistent storage directory for redb databases. If unset, runs in-memory only. |
| `KISEKI_NODE_ID` | u64 | 0 | Raft node ID. 0 = single-node mode. |
| `KISEKI_BOOTSTRAP` | bool | false | Create a well-known bootstrap shard on startup. Set to true or 1 for development/testing. |
TLS configuration
TLS is enabled when all three path variables are set. Otherwise the server runs in plaintext mode (development only, logged as a warning).
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_CA_PATH` | PathBuf | (none) | Cluster CA certificate PEM file |
| `KISEKI_CERT_PATH` | PathBuf | (none) | Node certificate chain PEM file |
| `KISEKI_KEY_PATH` | PathBuf | (none) | Node private key PEM file |
| `KISEKI_CRL_PATH` | PathBuf | (none) | Optional CRL PEM file for certificate revocation |
Raft configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_RAFT_ADDR` | SocketAddr | (none) | Raft RPC listen address. Required for multi-node clusters. |
| `KISEKI_RAFT_PEERS` | String | (empty) | Comma-separated peer list in `id=addr` format, e.g. `1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200` |
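The `id=addr` peer-list format is simple to parse; a minimal sketch of the expected parsing behavior (function name illustrative, not the server's actual implementation, which is in Rust):

```python
def parse_raft_peers(spec: str) -> dict[int, str]:
    """Parse a KISEKI_RAFT_PEERS value ("id=addr,id=addr,...") into {id: addr}."""
    peers = {}
    for entry in filter(None, spec.split(",")):
        node_id, _, addr = entry.partition("=")
        peers[int(node_id)] = addr
    return peers

print(parse_raft_peers("1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200"))
# {1: '10.0.0.1:9200', 2: '10.0.0.2:9200', 3: '10.0.0.3:9200'}
```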
Metadata capacity (ADR-030)
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | u8 | 50 | Soft limit percentage for system disk metadata usage. Exceeding triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | u8 | 75 | Hard limit percentage for system disk metadata usage. Exceeding forces inline threshold to INLINE_FLOOR and emits alert (I-SF2). |
| KISEKI_RAFT_INLINE_MBPS | u32 | 10 | Per-shard Raft inline throughput cap in MB/s. Prevents inline data from starving metadata-only Raft operations (I-SF7). |
Client cache configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_CACHE_MODE | String | organic | Cache mode: organic (LRU), pinned (staging-driven), or bypass (no caching) |
| KISEKI_CACHE_DIR | PathBuf | /tmp/kiseki-cache | L2 cache pool directory on local NVMe |
| KISEKI_CACHE_L2_MAX | u64 | 53687091200 (50 GiB) | Maximum L2 cache size in bytes |
| KISEKI_CACHE_POOL_ID | String | (generated) | Adopt an existing L2 pool (128-bit hex). Used for staging handoff between processes. |
Transport configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_IB_DEVICE | String | (auto-detect) | InfiniBand device name for RDMA verbs transport. If unset, auto-detects the first available device. |
Observability
Standard Rust/tokio observability variables:
| Variable | Type | Default | Description |
|---|---|---|---|
| RUST_LOG | String | info | Log filter directive (e.g., kiseki_log=debug,kiseki_raft=trace) |
| OTEL_EXPORTER_OTLP_ENDPOINT | String | (none) | OpenTelemetry collector endpoint for distributed tracing |
Example: single-node development
export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_BOOTSTRAP=true
kiseki-server
Example: three-node cluster
# Node 1
export KISEKI_NODE_ID=1
export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_RAFT_ADDR=10.0.0.1:9200
export KISEKI_RAFT_PEERS=1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200
export KISEKI_CA_PATH=/etc/kiseki/ca.pem
export KISEKI_CERT_PATH=/etc/kiseki/node1.pem
export KISEKI_KEY_PATH=/etc/kiseki/node1-key.pem
export KISEKI_BOOTSTRAP=true
kiseki-server
Architecture Decision Records
All architectural decisions are recorded as ADRs in
specs/architecture/adr/.
ADR index
| ADR | Title | Status |
|---|---|---|
| ADR-001 | Pure Rust, No Mochi Dependency | Accepted |
| ADR-002 | Two-Layer Encryption Model (C) | Accepted |
| ADR-003 | System DEK Derivation (Not Storage) | Accepted |
| ADR-004 | Schema Versioning and Rolling Upgrades | Accepted |
| ADR-005 | Erasure Coding and Chunk Durability | Accepted |
| ADR-006 | Inline Data Threshold | Accepted |
| ADR-007 | System Key Manager HA via Raft | Accepted |
| ADR-008 | Native Client Fabric Discovery | Accepted |
| ADR-009 | Audit Log Sharding and GC | Accepted |
| ADR-010 | Retention Hold Enforcement Before Crypto-Shred | Accepted |
| ADR-011 | Crypto-Shred Cache Invalidation and TTL | Accepted |
| ADR-012 | Stream Processor Tenant Isolation | Accepted |
| ADR-013 | POSIX Semantics Scope | Accepted |
| ADR-014 | S3 API Compatibility Scope | Accepted |
| ADR-015 | Observability Contract | Accepted |
| ADR-016 | Backup and Disaster Recovery | Accepted |
| ADR-017 | Dedup Refcount Metadata Access Control | Accepted |
| ADR-018 | Runtime Integrity Monitor | Accepted |
| ADR-019 | Gateway Deployment Model | Accepted |
| ADR-020 | Workflow Advisory & Client Telemetry | Accepted |
| ADR-021 | Workflow Advisory Architecture | Accepted |
| ADR-022 | Storage Backend – redb (Pure Rust) | Accepted |
| ADR-023 | Protocol RFC Compliance Scope | Accepted |
| ADR-024 | Device Management, Storage Tiers, and Capacity Thresholds | Accepted |
| ADR-025 | Storage Administration API | Accepted |
| ADR-026 | Raft Topology – Per-Shard on Fabric (Strategy A) | Accepted |
| ADR-027 | Single-Language Implementation – Rust Only | Accepted |
| ADR-028 | External Tenant KMS Providers | Accepted |
| ADR-029 | Raw Block Device Allocator | Accepted |
| ADR-030 | Dynamic Small-File Placement and Metadata Capacity Management | Accepted |
| ADR-031 | Client-Side Cache | Accepted |
ADR template
New ADRs follow this structure:
# ADR-NNN: Title
**Status**: Proposed | Accepted | Superseded by ADR-XXX
**Date**: YYYY-MM-DD
**Context**: Why this decision is needed.
## Decision
What was decided and why.
## Consequences
What changes as a result. Trade-offs accepted.
## Alternatives considered
What else was evaluated and why it was rejected.
Key decisions by topic
Language and architecture
- ADR-001: Pure Rust (no Mochi dependency)
- ADR-027: Single-language Rust (Go control plane replaced)
- ADR-022: redb as storage backend (pure Rust, no RocksDB)
Encryption
- ADR-002: Two-layer encryption model (system DEK + tenant KEK)
- ADR-003: HKDF-based DEK derivation (not per-chunk storage)
- ADR-011: Crypto-shred cache invalidation TTL
- ADR-028: External tenant KMS providers (Vault, KMIP, AWS KMS, PKCS#11)
Consensus and replication
- ADR-007: System key manager HA via Raft
- ADR-026: Per-shard Raft groups on fabric (Strategy A)
- ADR-009: Audit log sharding and GC
Storage
- ADR-005: Erasure coding and chunk durability
- ADR-006: Inline data threshold
- ADR-029: Raw block device allocator
- ADR-030: Dynamic small-file placement
Protocols and access
- ADR-008: Native client fabric discovery
- ADR-013: POSIX semantics scope
- ADR-014: S3 API compatibility scope
- ADR-019: Gateway deployment model
- ADR-023: Protocol RFC compliance scope
Operations
- ADR-015: Observability contract
- ADR-016: Backup and disaster recovery
- ADR-024: Device management and capacity thresholds
- ADR-025: Storage administration API
Advisory
- ADR-020: Workflow advisory and client telemetry
- ADR-021: Workflow advisory architecture
Client
- ADR-031: Client-side cache
ADR-001: Pure Rust, No Mochi Dependency
Status: Accepted Date: 2026-04-17 Context: Q-E3, A-E3
Decision
Build all core components in Rust. Do not depend on Mochi (Mercury/Bake/SDSKV). Learn from Mochi’s design patterns (transport abstraction, composable services).
Rationale
- Mochi has never been deployed in regulated environments (HIPAA/GDPR)
- C/C++ FFI creates a FIPS compliance surface across two languages
- Single-language FIPS module boundary is cleaner for certification
- Rust ecosystem has the building blocks (aws-lc-rs for FIPS, tokio, tonic, openraft)
- Weakest link is libfabric/CXI Rust binding — bounded scope, solvable
Consequences
- Must build transport abstraction in Rust (kiseki-transport)
- Must build chunk storage engine in Rust (kiseki-chunk)
- Must build KV backend for log storage in Rust (RocksDB via rust-rocksdb, or sled)
- libfabric-sys crate needed for Slingshot support (immature, may need contribution)
ADR-002: Two-Layer Encryption Model (C)
Status: Accepted Date: 2026-04-17 Context: Q-K-arch1, I-K1 through I-K14
Decision
Single data encryption pass at the system layer. Tenant access via key wrapping. No double encryption.
- System DEK encrypts chunk data (AES-256-GCM via FIPS module)
- Tenant KEK wraps access to system DEK derivation material
- System key manager derives per-chunk DEKs via HKDF (see ADR-003)
Rationale
- Single encryption pass at HPC line rates (200+ Gbps per NIC)
- Double encryption doubles CPU cost for no additional security benefit given that both layers use authenticated encryption
- Key wrapping is O(32 bytes) per operation vs O(data_size) for encryption
- Cross-tenant dedup works: same plaintext → same chunk_id → one ciphertext, multiple tenant KEK wrappings
Consequences
- Crypto-shred destroys tenant KEK → data unreadable but not physically deleted
- System key compromise exposes system-layer ciphertext; combined with tenant KEK = full access. System key manager must be highly protected (ADR-007).
- Envelope must carry both system and tenant wrapping metadata
ADR-003: System DEK Derivation (Not Storage)
Status: Accepted Date: 2026-04-17 Context: B-ADV-3 (system DEK count at scale), escalation point 3
Decision
System DEKs are derived locally on storage nodes via HKDF, not stored individually and not derived via RPC to the key manager.
system_dek = HKDF-SHA256(
key = system_master_key[epoch],
salt = chunk_id,
info = "kiseki-chunk-dek-v1"
)
Key distribution model (revised per ADV-ARCH-01)
The system key manager (kiseki-keyserver) stores and replicates master keys. Storage nodes (kiseki-server) fetch the current master key at startup and on epoch rotation. DEK derivation happens locally on the storage node — the key manager never sees individual chunk_ids.
kiseki-keyserver:
Stores: master_key per epoch
Serves: master_key to authenticated kiseki-server processes
Never sees: individual chunk_ids or per-chunk operations
kiseki-server:
Caches: master_key (mlock'd, refreshed on rotation)
Derives: per-chunk DEK = HKDF(master_key, chunk_id) — locally
Never sends: chunk_ids to the key manager
This prevents the key manager from building an index of all chunk_ids (ADV-ARCH-01: HKDF leak), which would reconstruct the per-tenant refcount data we explicitly decided not to store (ADR-017).
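The derivation above is standard RFC 5869 HKDF-SHA256. A minimal sketch of local derivation and its determinism, using a placeholder master key (the real implementation uses the FIPS module in aws-lc-rs, not Python):

```python
import hashlib
import hmac

def hkdf_sha256(key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF: extract with salt, then expand with info."""
    prk = hmac.new(salt, key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

master_key = bytes(32)  # placeholder epoch master key
dek = hkdf_sha256(master_key, salt=b"chunk-00000001", info=b"kiseki-chunk-dek-v1")
# Same (master_key, chunk_id) always yields the same DEK, so nothing is stored;
# a different chunk_id yields an unrelated DEK.
assert dek == hkdf_sha256(master_key, salt=b"chunk-00000001", info=b"kiseki-chunk-dek-v1")
assert dek != hkdf_sha256(master_key, salt=b"chunk-00000002", info=b"kiseki-chunk-dek-v1")
```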
Rationale
- At petabyte scale with ~1MB average chunks: billions of chunks
- Storing billions of 32-byte DEKs = tens of GB in the key manager
- HKDF derivation is deterministic: same (master_key, chunk_id) → same DEK
- The key manager stores only master keys (one per epoch) — trivial storage
- HKDF is fast (~microseconds) and FIPS-approved
- Local derivation eliminates per-chunk RPC to key manager (performance)
- Key rotation: new epoch = new master key. Old master keys retained during migration window. Derivation still works for old-epoch chunks.
- Key manager never sees chunk-level operations (ADV-ARCH-01 fix)
Consequences
- System key manager is simpler (stores epochs, not individual DEKs)
- Master key is cached in kiseki-server memory — this is the highest-value target on a storage node (ADV-ARCH-04, accepted risk with mitigations: mlock, MADV_DONTDUMP, seccomp, core dumps disabled)
- Master key compromise exposes ALL system DEKs for that epoch
- Chunk ID is used as salt — chunk ID must not change after creation
- Tenant KEK wraps the HKDF derivation parameters (epoch + chunk_id), not the DEK itself — unwrapping + HKDF derives the DEK
ADR-004: Schema Versioning and Rolling Upgrades
Status: Accepted Date: 2026-04-17 Context: A-ADV-2 (upgrade and schema evolution)
Decision
All persistent formats carry a version field. Rolling upgrades are supported with N-1/N version compatibility.
Delta envelope versioning
- DeltaHeader.format_version: u16 is the first field, at a fixed offset
- Readers that encounter an unknown version fail open (skip the delta, log a warning) rather than crash
- Writers always produce the current version
- Compaction preserves original format version (does not upgrade)
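The fail-open rule can be sketched as follows (the version number and field layout here are illustrative, not the actual wire format):

```python
import struct

CURRENT_VERSION = 1  # illustrative current delta format version

def read_delta(buf: bytes):
    """Read format_version first; skip (fail open) when the version is unknown."""
    (version,) = struct.unpack_from("<H", buf, 0)  # u16, first field, fixed offset
    if version > CURRENT_VERSION:
        # In the real system: log a warning and skip rather than crash.
        return None
    return {"version": version, "body": buf[2:]}

assert read_delta(struct.pack("<H", 1) + b"meta") is not None
assert read_delta(struct.pack("<H", 9) + b"future-format") is None  # fail open
```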
Chunk envelope versioning
- EnvelopeMeta.format_version: u16
- Algorithm ID already provides crypto-agility
- New envelope fields are additive (protobuf-style: unknown fields preserved)
Wire protocol versioning (gRPC)
- Protobuf with reserved fields and additive evolution
- gRPC service versioning: /kiseki.v1.LogService, /kiseki.v2.LogService
- Native client negotiates version on connect
View materialization
- Stream processors declare which delta format versions they support
- Upgrade sequence: deploy new stream processors first (can read old+new), then upgrade writers (produce new format)
Rolling upgrade sequence
- Deploy new kiseki-server binaries (can read old + new formats)
- Rolling restart of storage nodes (one at a time, Raft quorum maintained)
- Deploy new kiseki-control (Go, independent restart)
- Deploy new kiseki-client-fuse to compute nodes
- After all nodes are upgraded: optional compaction to upgrade old deltas
Consequences
- All format changes must be backward-compatible for at least one version
- Breaking changes require a two-phase rollout (add new, migrate, remove old)
- Format version is the first field read on every deserialization path
ADR-005: Erasure Coding and Chunk Durability
Status: Accepted Date: 2026-04-17 Context: I-C4, escalation point 10
Decision
EC parameters are per affinity pool, configured by cluster admin.
Default profiles
| Pool type | Strategy | Rationale |
|---|---|---|
| fast-nvme (metadata, hot data) | EC 4+2 | Balance of space efficiency and rebuild speed |
| bulk-nvme (cold data, checkpoints) | EC 8+3 | Higher space efficiency for bulk data |
| meta-nvme (log SSTables, key manager) | Replication-3 | Lowest latency for consensus-critical data |
Chunk-RDMA alignment (C-ADV-3)
Content-defined chunking produces variable-size chunks. For RDMA:
- Chunks are stored with 4KB-aligned padding on disk
- RDMA scatter-gather lists map logical chunk boundaries to aligned physical blocks
- One-sided RDMA transfers use pre-registered memory regions at 4KB alignment
- Padding overhead is bounded: max 4KB per chunk, typically <1% for chunks >256KB
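The padding bound is easy to verify numerically. A quick sketch of the 4 KB round-up and the resulting overhead:

```python
ALIGN = 4096  # 4 KB physical alignment for RDMA-registered regions

def pad_to_align(length: int, align: int = ALIGN) -> int:
    """Round a chunk length up to the next alignment boundary (ceiling division)."""
    return -(-length // align) * align

for length in (1_000, 100_000, 262_144, 1_048_576):
    padded = pad_to_align(length)
    print(f"{length:>9} B -> {padded:>9} B  ({(padded - length) / length:.2%} padding)")
# Overhead is at most 4 KB per chunk regardless of size, so it shrinks toward
# zero as chunks grow; for chunks well above 256 KB it is typically under 1%.
```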
Consequences
- Pool-level EC config means all chunks in a pool share the same protection level
- Changing EC parameters requires re-encoding existing chunks (background process)
- RDMA alignment adds trivial storage overhead but enables zero-copy transfers
ADR-006: Inline Data Threshold
Status: Accepted Date: 2026-04-17 Context: Escalation point 6, analyst session
Decision
Delta payloads may carry inline data up to 4096 bytes (4KB).
Data below this threshold is encrypted and stored directly in the delta payload. No separate chunk write occurs.
Rationale
- Small files (symlinks, xattrs, tiny configs): avoid chunk overhead
- DeltaFS validated this pattern at scale (inode metadata with inline data)
- 4KB aligns with filesystem block size and NVMe sector size
- Raft replication cost per delta increases slightly but acceptably (4KB payload vs ~200 byte metadata-only delta)
- Standard practice: ext4, Btrfs, XFS all support inline data
Threshold selection
| Threshold | Raft cost | Use cases captured | Chunk overhead saved |
|---|---|---|---|
| 1KB | Minimal | Symlinks, xattrs | Low |
| 4KB | Acceptable | Small files, metadata, configs | Moderate |
| 8KB | Noticeable | More files inline | Higher but Raft fan-out increases |
| 64KB | Significant | Too much data in the log | Raft becomes bottleneck |
4KB is the sweet spot: captures the majority of metadata-only operations without overloading Raft replication.
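The placement decision itself reduces to a single comparison; a sketch with hypothetical names:

```python
INLINE_THRESHOLD = 4096  # bytes, cluster-level setting (ADR-006)

def placement(data: bytes) -> str:
    """Payloads at or below the threshold ride inside the delta (encrypted);
    anything larger goes through the normal chunk write path."""
    return "inline-in-delta" if len(data) <= INLINE_THRESHOLD else "chunk-store"

assert placement(b"target/of/symlink") == "inline-in-delta"  # symlink target
assert placement(b"\x00" * 64 * 1024) == "chunk-store"       # 64 KB file
```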
Consequences
- Configurable per cluster (system-level setting, not per-tenant)
- Compaction must handle deltas with inline data (encrypted payload may be larger than metadata-only deltas)
ADR-007: System Key Manager HA via Raft
Status: Accepted Date: 2026-04-17 Context: I-K12, escalation point 7, B-ADV-3
Decision
The system key manager is a dedicated Raft group (3 or 5 members) running as kiseki-keyserver on dedicated nodes. It stores system master keys (one per epoch) and distributes them to storage nodes, which derive per-chunk DEKs locally via HKDF (ADR-003).
Architecture
kiseki-keyserver (3-5 nodes, Raft)
├── Stores: system master keys (one per epoch, ~32 bytes each)
├── Serves: master keys to authenticated kiseki-server processes; DEK derivation is local to storage nodes (ADR-003)
├── Manages: epoch lifecycle (create, rotate, retain, destroy)
└── Audits: all key events to audit log
Rationale
- System key manager is the highest-severity SPOF (P0 if unavailable)
- Must be at least as available as the Log
- Raft provides consensus + replication + leader election
- Separate from shard Raft groups (independent failure domain)
- Dedicated nodes: key material never co-located with tenant data
- Master key storage is trivial (epochs × 32 bytes)
- DEK derivation is stateless and fast (HKDF, ~microseconds)
Deployment
- 3 nodes for standard deployments, 5 for high-criticality
- Dedicated hardware (or at minimum, dedicated processes on control-plane nodes)
- Key material in memory only (mlock’d, guard pages)
- On-disk: Raft log + snapshot of epoch state (encrypted with node-local key)
Consequences
- Adds a deployment component (kiseki-keyserver)
- Key manager must be deployed and healthy before any data operations
- Cross-site: each site has its own system key manager (federation doesn’t share system keys — only tenant keys cross sites via tenant KMS)
ADR-008: Native Client Fabric Discovery
Status: Accepted Date: 2026-04-17 Context: Escalation point 8, A-ADV-1, I-O4
Decision
Native clients discover shards, views, and gateways via a lightweight discovery service running on every storage node, accessible on the data fabric. No control plane access required.
Mechanism
- Bootstrap: client is configured with a list of seed endpoints (storage node addresses on the data fabric). Seed list can be provided via environment variable, config file, or DHCP option.
- Discovery query: client sends a discovery request to any seed. The storage node responds with:
  - List of active shards (shard_id, leader node, key range)
  - List of materialized views (view_id, protocol, endpoint)
  - List of gateway endpoints (protocol, transport)
  - Tenant authentication requirements
- Authentication: client presents mTLS certificate (Cluster CA signed, per-tenant). Optional second-stage auth via tenant IdP.
- Cache: discovery results cached with TTL. Periodic refresh. Shard split/merge events invalidate relevant cache entries.
- Transport negotiation: client probes available transports (CXI → verbs → TCP) and selects the highest-performance option.
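The transport negotiation step is a preference scan over whatever the fabric actually offers. A sketch (transport names follow the document; the helper itself is illustrative):

```python
TRANSPORT_PREFERENCE = ("cxi", "verbs", "tcp")  # CXI, then RDMA verbs, then TCP fallback

def negotiate_transport(available: set) -> str:
    """Pick the highest-performance transport among those the probe discovered."""
    for transport in TRANSPORT_PREFERENCE:
        if transport in available:
            return transport
    raise RuntimeError("no usable transport discovered")

assert negotiate_transport({"tcp", "verbs"}) == "verbs"  # no CXI on this fabric
assert negotiate_transport({"tcp"}) == "tcp"             # plain Ethernet fallback
```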
Why not DNS-SD or multicast
- Slingshot fabric may not support multicast reliably
- DNS-SD requires DNS infrastructure on the data fabric
- Seed-based discovery is simple, deterministic, and works with any transport
Consequences
- Every storage node runs a discovery responder (lightweight, part of kiseki-server)
- Seed list is the only bootstrap configuration for compute nodes
- Discovery responder must not expose tenant-sensitive information (shard/view metadata is operational, not tenant content)
ADR-009: Audit Log Sharding and GC
Status: Accepted Date: 2026-04-17 Context: B-ADV-1 (audit log scalability)
Decision
The audit log is sharded per tenant with its own archival lifecycle.
Architecture
Audit subsystem:
├── Per-tenant audit shard (append-only, Raft-replicated)
│ └── Contains: tenant events + relevant system events
│ └── GC: events archived to cold storage after retention period
│ └── Retention period: set by compliance tags (e.g., HIPAA = 6 years)
│
├── System audit shard (cluster-wide operational events)
│ └── Contains: node events, maintenance, non-tenant-scoped events
│ └── GC: configurable retention (default 1 year)
│
└── Export pipeline
└── Tenant export: filtered stream to tenant VLAN
└── System export: to cluster admin's SIEM
GC interaction with delta GC (I-L4)
- Each tenant audit shard tracks its own watermark per data shard
- Delta GC checks the relevant tenant audit shard’s watermark
- A stalled tenant audit shard blocks delta GC only for that tenant’s data shards (not cluster-wide)
Rationale
- Single global audit log is a cluster-wide GC bottleneck (B-ADV-1)
- Per-tenant sharding: stalled export for one tenant doesn’t block others
- Audit retention aligns with compliance (HIPAA 6yr, GDPR varies)
- Archived events move to cold storage (bulk-nvme pool) after active retention
GC safety valve and backpressure (analyst backpass contention 2)
Default behavior (safety valve): if a tenant’s audit export stalls for longer than a configurable threshold (default 24 hours), data shard GC proceeds anyway. The audit gap is logged and the compliance team is notified. Storage exhaustion is worse than an auditable gap.
Per-tenant configurable: tenants can enable audit backpressure mode. When enabled, if the audit export falls behind, write throughput for that tenant is throttled (reducing GC pressure at the source). This preserves audit completeness at the cost of write performance.
| Mode | GC behavior | Write impact | Use case |
|---|---|---|---|
| Safety valve (default) | GC proceeds after timeout | None | Most tenants |
| Backpressure (opt-in) | GC waits; writes throttled | Slower writes | Strict compliance |
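The two modes reduce to a small decision function. A sketch under assumed names (watermarks as monotonically increasing delta sequence numbers):

```python
SAFETY_VALVE_SECS = 24 * 3600  # default stall threshold before GC proceeds anyway

def gc_may_proceed(delta_seq: int, audit_watermark: int,
                   stall_secs: float, mode: str = "safety-valve") -> bool:
    """Delta GC waits for the tenant audit watermark, with a per-mode escape hatch."""
    if delta_seq <= audit_watermark:
        return True                             # audit export has caught up
    if mode == "safety-valve":
        return stall_secs > SAFETY_VALVE_SECS   # proceed anyway; log the auditable gap
    return False                                # backpressure: GC waits, writes throttle

assert gc_may_proceed(100, 150, stall_secs=0)                            # exported
assert not gc_may_proceed(200, 150, stall_secs=60)                       # brief stall
assert gc_may_proceed(200, 150, stall_secs=25 * 3600)                    # valve opens
assert not gc_may_proceed(200, 150, stall_secs=25 * 3600, mode="backpressure")
```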
Consequences
- More audit shards to manage (one per tenant + one system)
- Audit Raft groups are lightweight (small append-only logs)
- Archival pipeline is a background process
- Safety valve prevents storage exhaustion from stalled audit export
- Backpressure mode available for tenants with strict audit requirements
ADR-010: Retention Hold Enforcement Before Crypto-Shred
Status: Accepted Date: 2026-04-17 Context: B-ADV-4 (retention hold ordering race)
Decision
Compliance tags that imply retention requirements automatically create retention holds when data is written. Crypto-shred checks for active holds before proceeding.
Mechanism
- When a namespace has compliance tags (HIPAA, GDPR, etc.), the control plane derives retention requirements from the tag.
- A default retention hold is automatically created for the namespace with the TTL mandated by the compliance regime.
- Crypto-shred for a tenant checks all namespaces for active holds:
- If holds exist: crypto-shred proceeds (KEK destroyed, data unreadable) but physical GC is blocked (correct behavior).
- If no holds exist AND compliance tags imply retention: crypto-shred is blocked with an error requiring explicit override.
- Override requires force_without_hold_check: true plus an audit log entry documenting the override and the reason.
Compliance tag → retention mapping (configurable)
| Tag | Default retention | Source |
|---|---|---|
| HIPAA | 6 years | 45 CFR §164.530(j) |
| GDPR | Per DPA agreement | No fixed minimum |
| revFADP | Per data controller policy | Swiss FDPA Art. 6 |
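The pre-shred check described above can be sketched as follows (function and field names are hypothetical):

```python
def check_crypto_shred(active_holds: list, compliance_tags: list,
                       force_without_hold_check: bool = False) -> str:
    """Gate crypto-shred on retention holds derived from compliance tags."""
    if active_holds:
        # KEK destruction proceeds; physical GC stays blocked by the holds.
        return "shred-ok-physical-gc-blocked"
    if compliance_tags and not force_without_hold_check:
        raise PermissionError(
            "compliance tags imply retention but no hold exists; "
            "explicit force_without_hold_check override required")
    return "shred-ok"  # an override, if used, is written to the audit log

assert check_crypto_shred(["hipaa-default-hold"], ["HIPAA"]) == "shred-ok-physical-gc-blocked"
assert check_crypto_shred([], ["HIPAA"], force_without_hold_check=True) == "shred-ok"
```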
Consequences
- Retention holds are created automatically, reducing risk of human error
- Crypto-shred with override is audited (compliance team can review)
- Tenant admin can extend holds but not shorten below compliance minimum
ADR-011: Crypto-Shred Cache Invalidation and TTL
Status: Accepted Date: 2026-04-17 Context: B-ADV-5 (crypto-shred propagation)
Decision
Maximum tenant KEK cache TTL is 60 seconds. Crypto-shred triggers an active invalidation broadcast in addition to TTL expiry.
Mechanism
- Default cache TTL: 60 seconds (configurable per tenant, cannot exceed max)
- On crypto-shred:
  a. KEK destroyed in tenant KMS
  b. Invalidation broadcast to all known gateways, stream processors, and native clients for that tenant
  c. Components receiving the invalidation immediately purge the cached KEK
  d. Components unreachable during the broadcast expire naturally at TTL
- Crypto-shred operation returns success after KEK destruction + broadcast (does not wait for all acknowledgments)
- Maximum residual window: 60 seconds (cache TTL for unreachable components)
TTL configuration (analyst backpass contention 3)
The 60-second TTL is the default, not a fixed value. TTL is configurable per tenant within bounds:
| Parameter | Value | Rationale |
|---|---|---|
| Minimum TTL | 5 seconds | Below this, KMS load becomes problematic (key fetch every 5s per component) |
| Default TTL | 60 seconds | Reasonable for most deployments |
| Maximum TTL | 300 seconds (5 min) | Beyond this, the crypto-shred window is unreasonable |
Tenants under stricter regulation can request shorter TTL (e.g., 10s). The trade-off is higher KMS load (more frequent key fetches). The control plane validates that the requested TTL is within [min, max] and warns if KMS capacity may be insufficient.
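The bounds check the control plane performs can be sketched (the validator name is hypothetical; the bounds are from the table above):

```python
MIN_TTL, DEFAULT_TTL, MAX_TTL = 5, 60, 300  # seconds, per the TTL bounds table

def validate_kek_ttl(requested=None) -> int:
    """Validate a tenant's requested KEK cache TTL against the [min, max] bounds."""
    if requested is None:
        return DEFAULT_TTL
    if not MIN_TTL <= requested <= MAX_TTL:
        raise ValueError(f"TTL must be within [{MIN_TTL}, {MAX_TTL}] seconds")
    return requested  # shorter TTLs mean more frequent KMS key fetches

assert validate_kek_ttl() == 60
assert validate_kek_ttl(10) == 10  # stricter tenant accepts higher KMS load
```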
HIPAA/GDPR acceptability
- GDPR Art. 17 requires erasure “without undue delay” — even 300 seconds is within reasonable interpretation for a distributed system
- HIPAA does not specify a time bound for deletion
- The audit log records exact times: KEK destroyed, broadcast sent, cache TTL expiry — providing compliance evidence
- Configurable TTL allows compliance-sensitive tenants to reduce the window
Consequences
- Default 60-second window where data is technically readable after shred
- Configurable per tenant within [5s, 300s] bounds
- Components must handle invalidation broadcast (new message type)
- Native clients on unreachable compute nodes: data readable until their process exits or TTL expires (whichever comes first)
- Shorter TTLs increase KMS load (more frequent key fetches)
- TTL bounds are performance parameters that may conflict with compliance — the minimum (5s) is a hard engineering limit, not a policy choice
ADR-012: Stream Processor Tenant Isolation
Status: Accepted Date: 2026-04-17 Context: B-ADV-6 (stream processor isolation)
Decision
Stream processors for different tenants run in separate OS processes
on storage nodes. Key material is protected with mlock and guard pages.
Isolation model
| Mechanism | Purpose |
|---|---|
| Separate processes | OS-level memory isolation between tenants |
| mlock on key pages | Prevent key material from swapping to disk |
| Guard pages | Detect buffer overflows near key material |
| seccomp (Linux) | Restrict syscalls to minimum needed |
| Separate cgroups | Resource isolation (CPU, memory) per tenant |
Co-location policy
- Small tenants: multiple stream processors per node (process isolation)
- Large/sensitive tenants: dedicated nodes (configurable via placement policy)
- Compliance tags can mandate dedicated nodes (e.g., HIPAA with strict isolation)
Hardware isolation (future)
- AMD SEV-SNP / Intel TDX confidential VMs: out of scope for initial build
- Envelope format and key wrapping are compatible with confidential compute (keys are already protected end-to-end; adding a TEE is additive, not architectural change)
Consequences
- More processes per storage node (one per tenant per view)
- Process management in kiseki-server (spawn, monitor, restart)
- Memory overhead per process (Rust process ~10-20MB base)
- Key material never in shared memory between tenants
ADR-013: POSIX Semantics Scope
Status: Accepted Date: 2026-04-17 Context: A-ADV-4 (POSIX semantics depth)
Decision
POSIX support via FUSE with explicit compatibility matrix.
Supported (full semantics)
| Operation | Notes |
|---|---|
| open, close, read, write | Standard file I/O |
| create, unlink, mkdir, rmdir | Directory operations |
| rename (within namespace) | Atomic within shard |
| stat, fstat, lstat | File metadata |
| chmod, chown | Permission changes (stored in delta attributes) |
| readdir, readdirplus | Directory listing from view |
| symlink, readlink | Stored as inline data in delta |
| truncate, ftruncate | Composition resize |
| fsync, fdatasync | Flush to durable (delta committed) |
| extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr |
| POSIX file locks (fcntl) | Per-gateway lock state |
| O_APPEND | Atomic append via delta |
| O_CREAT, O_EXCL | Atomic create-if-not-exists |
Supported (limited semantics)
| Operation | Limitation |
|---|---|
| rename (cross-namespace) | Returns EXDEV (ADR: I-L8) |
| hard links | Within namespace only; cross-namespace returns EXDEV |
| sparse files | Holes tracked in composition; zero-fill on read |
| O_DIRECT | Bypasses client cache but still goes through FUSE |
| flock (advisory) | Best-effort; not guaranteed across gateway failover |
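Cross-namespace operations surface to applications as EXDEV, the same errno a local filesystem returns for a cross-device rename, so standard tools already handle it. A sketch:

```python
import errno

def rename(src_namespace: str, dst_namespace: str) -> None:
    """Rename is atomic within a namespace; crossing namespaces returns EXDEV."""
    if src_namespace != dst_namespace:
        raise OSError(errno.EXDEV, "cross-namespace rename not supported")
    # ...append an atomic rename delta within the shard...

rename("proj-a", "proj-a")  # same namespace: allowed
try:
    rename("proj-a", "proj-b")
except OSError as e:
    # Tools like mv see EXDEV and fall back to copy + unlink.
    assert e.errno == errno.EXDEV
```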
Not supported
| Operation | Reason |
|---|---|
| mmap (shared, writable) | Distributed shared writable mmap requires page-level coherence — not tractable for a distributed system at HPC scale. Read-only mmap is supported. The FUSE client returns ENOTSUP with a log message: “writable shared mmap not supported; use write() instead.” |
| ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without significant benefit for the target workload. Revisit if needed. |
| chroot, pivot_root | Filesystem-level operations, not meaningful for FUSE mount |
Consequences
- mmap restriction documented prominently (HPC users expect it)
- Read-only mmap works (useful for model loading)
- Writable mmap requires application changes (use write() instead)
- No POSIX ACLs simplifies the permission model
ADR-014: S3 API Compatibility Scope
Status: Accepted Date: 2026-04-17 Context: A-ADV-5 (S3 API compatibility scope)
Decision
Implement a subset of S3 API covering the operations needed by HPC/AI workloads. Not a complete S3 implementation.
Supported (full)
| API | Notes |
|---|---|
| PutObject | Single-part upload |
| GetObject | Including byte-range reads |
| HeadObject | Metadata retrieval |
| DeleteObject | Tombstone or delete marker (versioning) |
| ListObjectsV2 | Prefix, delimiter, pagination |
| CreateMultipartUpload | |
| UploadPart | |
| CompleteMultipartUpload | |
| AbortMultipartUpload | |
| ListMultipartUploads | |
| ListParts | |
| CreateBucket | Maps to namespace creation |
| DeleteBucket | Maps to namespace deletion |
| HeadBucket | Existence check |
| ListBuckets | Per-tenant bucket listing |
Supported (versioning)
| API | Notes |
|---|---|
| GetObjectVersion | Specific version retrieval |
| ListObjectVersions | Version listing |
| DeleteObjectVersion | Delete specific version |
Supported (conditional)
| API | Notes |
|---|---|
| If-None-Match, If-Match | Conditional writes |
| If-Modified-Since | Conditional reads |
Not supported (initial build)
| API | Reason | Future? |
|---|---|---|
| Lifecycle policies | Complex; competes with Kiseki’s own tiering | Maybe |
| Event notifications | Requires message bus integration | Maybe |
| SSE-S3, SSE-KMS, SSE-C | Kiseki’s encryption is always-on; S3 SSE headers are acknowledged but don’t change behavior | N/A |
| Presigned URLs | Useful; add after core is stable | Yes |
| Bucket policies | Kiseki uses its own IAM/policy model | No |
| CORS | Not relevant for HPC/AI workloads | No |
| Object Lock | Covered by Kiseki’s retention holds | Mapping possible |
| Select (S3 Select) | Out of scope | No |
SSE header handling
S3 clients may send SSE headers. Kiseki always encrypts (I-K1).
- SSE-S3 headers: acknowledged, no-op (system encryption is always on)
- SSE-KMS headers with key ARN: if ARN matches tenant KMS config, acknowledged. If different: error (tenant can’t specify arbitrary keys)
- SSE-C headers: rejected (Kiseki manages encryption, not the client)
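A gateway-side sketch of this header policy (the x-amz-* names are the standard S3 request headers; the function itself is illustrative):

```python
def handle_sse_headers(headers: dict, tenant_kms_arn=None) -> str:
    """Kiseki always encrypts; SSE headers are gate-kept but never change behavior."""
    if "x-amz-server-side-encryption-customer-algorithm" in headers:
        # SSE-C: client-supplied keys conflict with Kiseki-managed encryption.
        raise ValueError("SSE-C rejected: Kiseki manages encryption, not the client")
    kms_key = headers.get("x-amz-server-side-encryption-aws-kms-key-id")
    if kms_key is not None and kms_key != tenant_kms_arn:
        raise ValueError("SSE-KMS key does not match the tenant KMS configuration")
    return "no-op"  # system encryption is always on (I-K1)

assert handle_sse_headers({"x-amz-server-side-encryption": "AES256"}) == "no-op"
```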
Consequences
- S3-compatible tooling (aws cli, boto3, rclone) works for supported operations
- Unsupported operations return 501 Not Implemented
- SSE headers are handled gracefully without breaking encryption model
ADR-015: Observability Contract
Status: Accepted Date: 2026-04-17 Context: A-ADV-7 (observability)
Decision
OpenTelemetry-native observability with tenant-aware metric scoping.
Metrics (Prometheus-compatible, via OpenTelemetry)
| Context | Key metrics |
|---|---|
| Log | delta_append_latency, raft_commit_latency, shard_count, shard_size, compaction_duration, election_count |
| Chunk | write_latency, read_latency, dedup_hit_rate, gc_chunks_collected, repair_count, pool_utilization |
| Composition | create_latency, delete_count, multipart_in_progress, refcount_operations |
| View | materialization_lag_ms, staleness_violation_count, rebuild_progress, pin_count |
| Gateway | request_latency (p50/p99/p999), requests_per_sec, error_rate, active_connections |
| Client | fuse_latency, transport_type, cache_hit_rate, prefetch_effectiveness |
| Key Mgr | derive_latency, rotation_in_progress, kms_reachability, cache_hit_rate |
| Control | tenant_count, namespace_count, quota_utilization, federation_sync_lag |
Zero-trust metric scoping
- Cluster admin sees: aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless cluster admin has approved access for that tenant.
- Tenant admin sees: their own tenant’s metrics via tenant audit export.
- No metric exposes: file names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).
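The anonymization step can be sketched as a label rewrite applied before export (helper name and hash truncation are illustrative):

```python
import hashlib

def scope_labels(labels: dict, approved: frozenset = frozenset()) -> dict:
    """Replace tenant_id with an opaque hash unless access to that tenant
    has been explicitly approved by the cluster admin."""
    scoped = dict(labels)
    tenant = scoped.get("tenant_id")
    if tenant is not None and tenant not in approved:
        scoped["tenant_id"] = hashlib.sha256(tenant.encode()).hexdigest()[:16]
    return scoped

raw = {"tenant_id": "acme-research", "shard_count": 12}
assert scope_labels(raw)["tenant_id"] != "acme-research"                    # anonymized
assert scope_labels(raw, approved=frozenset({"acme-research"})) == raw      # approved
```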
Distributed tracing
- Every write/read path carries a trace ID (OpenTelemetry context propagation)
- Traces span: client → gateway → composition → log → chunk → view
- Tenant-scoped traces are visible only to the tenant admin
- Cluster admin sees system-level spans (no tenant content in span attributes)
Structured logging
- JSON structured logs, one line per event
- Log levels: ERROR, WARN, INFO, DEBUG, TRACE
- Tenant-identifying fields are present but content fields are encrypted
- Logs ship to the same audit/observability pipeline
Consequences
- OpenTelemetry SDK in both Rust and Go codebases
- Metric cardinality must be bounded (no unbounded label values)
- Tracing overhead ~1-2% on data path (acceptable for production)
ADR-016: Backup and Disaster Recovery
Status: Accepted Date: 2026-04-17 Context: A-ADV-8 (backup and DR)
Decision
Federation is the primary DR mechanism. External backup is additive and optional.
Site-level DR via federation
- Federated-async replication to a secondary site is the primary DR story
- RPO: bounded by async replication lag (seconds to minutes)
- RTO: secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration
- Data replication is ciphertext-only (no key material in replication stream)
What is replicated
| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |
External backup (optional, additive)
- Cluster admin can configure external backup targets (S3-compatible store)
- Backup contains: encrypted chunk data + log snapshots + control plane state
- Backup is encrypted with the system key (at rest) — no plaintext in backup
- HIPAA requirement met: backup is encrypted
- Backup frequency: configurable (hourly/daily snapshots of control plane, continuous for chunk data)
Recovery scenarios
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds-minutes |
| Multiple node loss | Raft reconfiguration + EC repair | 0 | Minutes |
| Full site loss | Failover to federated peer | Replication lag | Minutes-hours |
| Site loss, no federation | Restore from external backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable (I-K11) | N/A | N/A |
Consequences
- Federation is the recommended (and primary) DR strategy
- External backup is for defense-in-depth, not primary recovery
- RTO for site failover depends on control plane reconfiguration speed
- System key manager is per-site — site failover requires the secondary site’s own system key manager (different master keys, but tenants’ data is accessible because tenant KMS is shared cross-site)
ADR-017: Dedup Refcount Metadata Access Control
Status: Accepted Date: 2026-04-17 Context: B-ADV-2 (cross-tenant dedup refcount metadata)
Decision
Chunk refcount metadata stores total refcount only, without per-tenant attribution. Tenant-to-chunk mapping is derived from composition metadata (which is tenant-encrypted).
Mechanism
ChunkMeta:
chunk_id: abc123
total_refcount: 3 ← visible to system
per_tenant_refs: N/A ← NOT stored
Tenant attribution is in the composition deltas:
org-pharma/composition-X references chunk abc123 ← encrypted in delta payload
org-biotech/composition-Y references chunk abc123 ← encrypted in delta payload
Access control
- Cluster admin can see: chunk_id, total_refcount, pool, EC status
- Cluster admin CANNOT see: which tenants reference which chunks (this information is in tenant-encrypted delta payloads)
- System dedup process: compares chunk_ids (in the clear for dedup), but does not record which tenant triggered the dedup match
Residual risk
- Total refcount > 1 reveals that SOME dedup occurred, but not who
- Timing side channel: a dedup hit is faster than a full write. An observer who can measure write latency precisely could infer dedup. Mitigation: add random delay to normalize write timing (optional, configurable per tenant).
Consequences
- No per-tenant refcount tracking in chunk metadata
- Refcount decrement on crypto-shred: the crypto-shred process walks the tenant’s compositions (decrypted with tenant key during shred) to identify which chunks to decrement
- This is slower than a per-tenant refcount lookup but only happens during crypto-shred (rare operation)
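The crypto-shred walk described above can be sketched as follows. Names and shapes are hypothetical: given the shredded tenant's decrypted compositions, decrement each referenced chunk's total refcount and report chunks that reach zero (now garbage-collectable). No per-tenant refcount is ever consulted, matching the decision.

```rust
use std::collections::HashMap;

// Illustrative sketch (hypothetical names): decrement total refcounts by
// walking the tenant's decrypted compositions; chunks reaching zero are
// returned for garbage collection.
fn shred_decrement(
    refcounts: &mut HashMap<&'static str, u64>,
    tenant_compositions: &[Vec<&'static str>],
) -> Vec<&'static str> {
    let mut collectable = Vec::new();
    for composition in tenant_compositions {
        for &chunk_id in composition {
            if let Some(rc) = refcounts.get_mut(chunk_id) {
                *rc -= 1;
                if *rc == 0 {
                    collectable.push(chunk_id);
                }
            }
        }
    }
    collectable
}

fn main() {
    // abc123 is shared with another tenant (refcount 2); def456 is not.
    let mut refcounts = HashMap::from([("abc123", 2u64), ("def456", 1u64)]);
    let tenant = vec![vec!["abc123", "def456"]];
    let gc = shred_decrement(&mut refcounts, &tenant);
    assert_eq!(gc, vec!["def456"]); // only the unshared chunk becomes collectable
    assert_eq!(refcounts["abc123"], 1); // the other tenant's reference survives
}
```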
ADR-018: Runtime Integrity Monitor
Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-04 (master key in memory), analyst backpass contention 1
Decision
A runtime integrity monitor runs as a side process on every storage node, detecting signs of key material extraction attempts.
Detection signals
| Signal | Detection method | Severity |
|---|---|---|
| ptrace attachment to kiseki processes | Monitor /proc/pid/status TracerPid | Critical |
| /proc/pid/mem reads on kiseki processes | inotify/audit on /proc/pid/mem | Critical |
| Debugger presence (gdb, lldb, strace) | Process enumeration | High |
| Core dump generation attempt | Monitor core_pattern, catch SIGABRT | Critical |
| Unexpected LD_PRELOAD on kiseki processes | Check /proc/pid/environ at startup | High |
| Process memory mapping changes | Monitor /proc/pid/maps periodically | Medium |
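The first detection signal in the table (ptrace attachment via TracerPid) can be sketched with a small /proc parser. This is a minimal, Linux-only illustration under stated assumptions: the function names are hypothetical, and the real monitor would poll every kiseki process ID rather than its own.

```rust
use std::fs;

// Parse the TracerPid field from /proc/<pid>/status text.
// A nonzero TracerPid means some process is currently tracing the target.
fn tracer_pid_from_status(status_text: &str) -> Option<u32> {
    status_text
        .lines()
        .find(|l| l.starts_with("TracerPid:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn is_being_traced(pid: u32) -> std::io::Result<bool> {
    let text = fs::read_to_string(format!("/proc/{pid}/status"))?;
    Ok(tracer_pid_from_status(&text).unwrap_or(0) != 0)
}

fn main() {
    // Self-check on our own process; normally prints "traced: false"
    // unless a debugger such as gdb or strace is attached.
    match is_being_traced(std::process::id()) {
        Ok(t) => println!("traced: {t}"),
        Err(e) => eprintln!("cannot read /proc (non-Linux?): {e}"),
    }
}
```

A periodic sweep of this check across kiseki PIDs is consistent with the stated 1-5 second polling interval.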
Response
- Alert: cluster admin + affected tenant admin(s) immediately
- Log: audit event with full context (pid, signal, timestamp)
- Optional auto-response (configurable):
- Rotate system master key (new epoch, invalidate cached key)
- Evict cached tenant KEKs (force re-fetch from KMS)
- Kill the suspect process
- Do NOT: shut down the storage node (availability over prevention — the attacker may already have the key; shutting down just causes an outage)
Performance impact
Negligible. The monitor checks /proc periodically (every 1-5 seconds), not on every crypto operation. Crypto operations themselves are not a performance concern:
- HKDF derivation: ~1μs per call, ~25,000 calls/sec at line rate = ~25ms CPU/sec
- AES-256-GCM (the actual encryption): with AES-NI, ~5-10% of one core at 200 Gbps
- The bottleneck is the AEAD data encryption, not key derivation or monitoring
Consequences
- Additional process per storage node (lightweight)
- Linux-specific (/proc-based detection); needs platform abstraction for other OS
- Not a prevention mechanism — it’s detection and response
- False positives possible (legitimate debugging during development); monitor should be disableable in dev/test mode
ADR-019: Gateway Deployment Model
Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-03 (monolith blast radius), analyst backpass contention 4
Decision
Gateways run in-process with kiseki-server (monolith per node). Client resilience is provided by multi-endpoint resolution, not per-process gateway isolation.
Rationale
This is a distributed system with no master. Every storage node runs kiseki-server (log + chunk + composition + view + gateways). Clients resolve to multiple endpoints:
Client (NFS/S3/native)
│
├── DNS round-robin: kiseki-nfs.cluster.local → [node1, node2, node3, ...]
├── Multiple A/AAAA records
├── Native client: seed list → discovery → multiple endpoints
│
└── On node failure: client reconnects to next endpoint
(NFS: automatic reconnect; S3: retry to different host;
native: transport failover)
Why monolith is acceptable
| Concern | Mitigation |
|---|---|
| Gateway crash = node crash | Client reconnects to another node (seconds) |
| All tenants on crashed node affected | Tenants are served by multiple nodes; one node loss = partial, not total |
| Memory leak in gateway affects log/chunk | Resource limits via cgroups; OOM killer targets the process, not the node |
| Bug in NFS gateway affects S3 gateway | Accept — both are in the same process. Isolation adds operational complexity disproportionate to the risk |
Why NOT separate gateway processes
- Additional process management per node (spawn, monitor, restart, IPC)
- Performance overhead of IPC between gateway and log/chunk/view
- Operational complexity (more processes to configure, monitor, upgrade)
- The resilience model is client-side multi-endpoint, not server-side process isolation
Client resolution
| Client type | Resolution mechanism |
|---|---|
| NFS | DNS (multiple A records), NFS mount with multiple server addresses |
| S3 | DNS round-robin, HTTP retry to next endpoint on 5xx |
| Native | Seed list → fabric discovery → multiple endpoints, automatic failover |
Consequences
- kiseki-server remains a single-process monolith per node
- Client-side resilience is the primary availability mechanism
- Update failure-modes.md: F-D1 (gateway crash) → node-scoped, not protocol-scoped
- Node loss tolerance depends on tenant data distribution across nodes
ADR-020: Workflow Advisory & Client Telemetry
Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: new capability — HPC/AI workloads need to steer storage (prefetch, affinity, priority, phase-adaptive tuning) and consume caller-scoped feedback (backpressure, locality, materialization lag, QoS headroom). ADR-015 covers operator-facing observability; this ADR covers the orthogonal client-facing advisory/telemetry surface.
Decision (analyst-level; architect will refine interfaces)
Introduce a Workflow Advisory cross-cutting concern carrying two flows over one bidirectional advisory channel per declared workflow:
- Hints (client → storage) — advisory, never authoritative (I-WA1).
- Telemetry feedback (storage → client) — caller-scoped only (I-WA5, I-WA6).
Workflow is not a bounded context. It is a correlation + steering construct owned entirely by the client, with a stateless routing layer on the server side and bounded per-workflow state.
Correlation identity
Every data-path operation issued while a workflow is active carries:
(org_id, project_id?, workload_id, client_id, workflow_id, phase_id)
- `client_id` pinned per native-client process (I-WA4).
- `workflow_id` ≥128-bit opaque, unique within workload (I-WA10).
- `phase_id` monotonic within workflow, bounded phase history (I-WA13).
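The identity rules can be sketched as below. Assumptions are hedged: the real client presumably uses a proper CSPRNG crate, while this example reads /dev/urandom to stay dependency-free (and Linux-specific, like the rest of the system); the `Workflow` struct and method names are illustrative only.

```rust
use std::fs::File;
use std::io::Read;

// Generate a 128-bit opaque handle (I-WA10). /dev/urandom stands in for a
// real CSPRNG here.
fn new_opaque_id() -> std::io::Result<[u8; 16]> {
    let mut id = [0u8; 16];
    File::open("/dev/urandom")?.read_exact(&mut id)?;
    Ok(id)
}

struct Workflow {
    workflow_id: [u8; 16],
    phase_id: u64, // monotonic within the workflow (I-WA13)
}

impl Workflow {
    fn advance_phase(&mut self) -> u64 {
        self.phase_id += 1; // only moves forward; history is bounded elsewhere
        self.phase_id
    }
}

fn main() -> std::io::Result<()> {
    let mut wf = Workflow { workflow_id: new_opaque_id()?, phase_id: 0 };
    assert_ne!(wf.workflow_id, new_opaque_id()?); // collision odds negligible
    assert_eq!(wf.advance_phase(), 1);
    assert_eq!(wf.advance_phase(), 2);
    Ok(())
}
```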
Advisory channel
- One bidi gRPC stream per active workflow, on the data fabric, under the same mTLS tenant certificate as the data path (I-Auth1, I-WA3).
- Authorization is per-operation on the stream, not only at establishment (I-WA3). Certificate revocation tears down the stream.
- Side-by-side with the data path — not in-band. Data-path requests may be annotated with a short `workflow_ref` header that the data-path code passes through; server-side, the annotation is routed to the advisory subsystem asynchronously (I-WA2). Annotation is strictly best-effort:
  - malformed `workflow_ref` → ignored, no data-path impact
  - `workflow_ref` for an expired workflow → dropped silently on the advisory side with a `hint-rejected: workflow_unknown` audit event
  - advisory subsystem overloaded or unavailable → annotation enqueued with bounded buffer; on overflow dropped with a rate-limited `annotation_dropped` audit event. Data-path operation outcome is never affected (I-WA2).
- Closure of the advisory stream without `End` auto-expires the workflow on TTL. Process restart produces a fresh `client_id`; the old workflow expires on TTL and the new process must redeclare. No reattach protocol is defined in this ADR — it may be revisited as a follow-up feature with its own spec + adversary review.
Hint taxonomy (must-have)
| Category | Example values | Acted on by |
|---|---|---|
| Workload profile | ai-training, ai-inference, hpc-checkpoint, batch-etl, interactive | Control Plane policy gate; tunes other hint defaults |
| Phase marker | stage-in, compute, checkpoint, stage-out, epoch-N (opaque semantic tag) | View (cache policy), Composition (write-absorb), Chunk (placement hot-set) |
| Access pattern | sequential / random / strided / broadcast | Native Client (prefetch), View (materialization priority) |
| Prefetch range | list of (composition_id, offset, length) | View, Chunk (opportunistic warm) |
| Priority class | interactive / batch / bulk within policy-allowed max | Gateway / Client QoS scheduler |
| Affinity preference | pool / rack / node preference within policy | Chunk placement engine |
| Retention intent | temp / working / final | Composition GC urgency, Chunk EC policy selection |
| Dedup intent | shared-ensemble / per-rank | Chunk dedup path (still bounded by I-X2) |
| Collective announcement | {ranks, bytes_per_rank, deadline} | Chunk write-absorb provisioning |
Hint taxonomy (nice-to-have, deferred)
Co-access grouping, deadline, transient markers (discardable after epoch N),
NUMA/GPU topology, peer-rank state. Architect may add these in a follow-up.
Telemetry feedback (must-have)
| Signal | Shape | Scoping |
|---|---|---|
| Backpressure | severity enum + retry_after_ms | Caller’s own resources only |
| Materialization lag | ms | Caller’s views only |
| Locality class | bucketed enum (local-node, local-rack, same-pool, remote, degraded) | Caller-owned chunks only |
| Prefetch effectiveness | bucketed hit-rate | Caller’s declared prefetches only |
| QoS headroom | bucketed fraction | Caller’s workflow/workload |
| Own-hotspot | composition_id + coarse level | Caller’s own compositions |
Tenant-hierarchy scoping
- Policy chain: cluster → org → project → workload. Each level narrows (never broadens) its parent’s ceilings (I-WA7). Profile allow-lists inherit the same way.
- Workflow lives strictly within one workload (I-WA3).
- Disable switch at any level (I-WA12) — data path unaffected when advisory is disabled.
Security posture
- Hints cannot extend capability (I-WA14).
- Telemetry is not an existence oracle (I-WA6) — unauthorized target → same shape as absent target, including timing distribution.
- Telemetry aggregation uses k-anonymity over neighbour workloads, k ≥ 5 (I-WA5).
- Covert-channel hardening: rejection latency and telemetry response size are bucketed (I-WA15).
- All advisory decisions audited on tenant shard; cluster-admin view sees opaque hashes (I-WA8, consistent with I-A3 / ADR-015).
Isolation from data path
- Advisory channel on a separate gRPC service and (ideally) a separate server-side tokio runtime / goroutine pool from the data path.
- Hint handling is best-effort with bounded buffering; on overload the handler drops-and-audits rather than queuing.
- Data-path code never awaits advisory responses. At most it emits fire-and-forget annotations.
Alternatives considered
- Attach hints as headers on existing data-path RPCs, no separate channel. Rejected: couples hint handling to data-path latency, violates I-WA2 isolation, and makes bidirectional telemetry awkward.
- Model workflow as a new bounded context with durable state. Rejected: workflows are ephemeral correlation handles. Persisting them invites a new shared-state problem and gives little value beyond what the audit log already provides.
- Expose ADR-015 observability directly to clients. Rejected: ADR-015 is operator-facing with aggregate/anonymized scope. Clients need caller-scoped, near-real-time feedback with a different privacy boundary (I-WA5/6).
- Server-authoritative hints (storage can infer and inject its own). Rejected: inferring client intent from data-path patterns is already done internally; the point of this ADR is to let clients supply authoritative-to-themselves hints. Server-side inference remains available as a fallback when hints are absent.
Consequences
- New crate `kiseki-advisory` (Rust) — hint validation, routing, rate limiting, telemetry emission, audit emission. Side-by-side with `kiseki-server`, not inside the data-path crates.
- New protobuf service `WorkflowAdvisory` with `DeclareWorkflow`, `EndWorkflow`, `PhaseAdvance`, a bidi `AdvisoryStream`, and `SubscribeTelemetry` (may be a stream within `AdvisoryStream`).
- Control Plane extensions: profile allow-lists, hint budgets, opt-out switches — inherited org → project → workload.
- Native Client extensions: `WorkflowSession` handle; existing data-path methods accept an optional `&WorkflowSession` for automatic correlation annotation.
- Audit additions: new event types per I-WA8. Tenant audit export (I-A2) includes them; cluster-admin export (I-A3) hashes the tenant-scoped identifiers.
- Metric additions (ADR-015 operator view): `advisory_hints_accepted`, `_rejected`, `_throttled`, `active_workflows`, `advisory_channel_latency`, tenant-anonymized.
- Performance: hint handling overhead target < 5µs p99 per accepted hint; telemetry emission frequency capped per subscription.
- Failure mode `F-ADV-1`: advisory-subsystem outage → data path unaffected; clients observe `advisory_unavailable` until restoration. To be added to `specs/failure-modes.md` (severity P2, blast radius: steering quality only).
Changes from adversary gate-0 review
- I-WA6 extended to cover hint rejection (previously telemetry-only).
- I-WA3 tightened to per-operation authorization.
- I-WA5 defines explicit low-k behaviour (fixed-sentinel neighbour component, unchanged response shape).
- New invariants I-WA16 (hint payload size bound), I-WA17 (declare rate bound), I-WA18 (prospective policy application).
- I-WA11 tightened to enumerate forbidden advisory target field types.
- I-WA12 defines three-state opt-out with draining transition.
- I-WA13 specifies CAS serialization for PhaseAdvance.
- Reattach protocol explicitly dropped; TTL-only recovery.
- `client_id` construction simplified to CSPRNG (≥128 bits), pinning enforced by registrar.
- A-ADV-1..A-ADV-4 added to assumptions.md.
Follow-ups (architect’s scope)
- gRPC service definitions and message schemas.
- Exact integration surface between `kiseki-advisory` and each of Chunk, View, Composition, Gateway.
- Concrete k-anonymity bucketing algorithm and parameters.
- Exact latency-bucketing and size-bucketing schemes for I-WA15.
- Phase-history compaction format and retention per workload.
- Reattach protocol for process-restart scenarios (I-WA4 scenario).
Follow-ups (adversary’s scope — gate 0 before architect)
- Threat-model the covert-channel surface (timing, size, error-code).
- Validate that the inherent side-channels from backpressure signals are truly k-anonymised under worst-case neighbour composition.
- Probe the reattach protocol once drafted.
ADR-021: Workflow Advisory Architecture
Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: ADR-020 analyst-level decision; this ADR commits the architecture (crate shape, runtime isolation, advisory-to-data-path coupling, protobuf + intra-Rust boundaries).
Decision
Three structural commitments that, together, make the analyst-level invariants in ADR-020 enforceable at compile time and at runtime.
1. Advisory is a separate crate with an isolated runtime
- New Rust crate `kiseki-advisory`, located at `crates/kiseki-advisory/`.
- Compiled into `kiseki-server` but runs on a dedicated tokio runtime with its own thread pool, separate from the data-path runtime. Configured via `kiseki-server` at process start.
- All advisory ingress (`AdvisoryStream`, `DeclareWorkflow`, `PhaseAdvance`, telemetry subscriptions) is accepted on a separate gRPC listener from the data-path gRPC listeners.
- Advisory-audit emission uses `kiseki-audit`'s existing tenant-shard path but with its own bounded queue and drop-and-record-on-overflow policy (no awaits out of the advisory runtime into the data path).
- Structural enforcement of I-WA2: data-path crates do not depend on `kiseki-advisory` in their Cargo manifests. The only way an advisory event can affect data-path behaviour is through well-typed domain-level preferences (see §3), which the data path treats as advisory hints — never as preconditions.
2. Shared domain types live in kiseki-common
A small set of enums and structs representing “the advisory context
of one operation” is declared in kiseki-common (already a dependency
of every context). This lets data-path crates accept an
Option<&OperationAdvisory> on their operations without pulling in
the advisory runtime.
kiseki-common (domain types: WorkflowRef, OperationAdvisory, enums)
↑
kiseki-{log,chunk,composition,view,gateway-*,client}
(accept Option<&OperationAdvisory>, use for preferences only)
kiseki-advisory (runtime, router, budget, audit emitter)
├── depends on kiseki-common
├── depends on kiseki-audit
└── depends on kiseki-proto (for WorkflowAdvisoryService)
↑
kiseki-server (wires advisory runtime to each context)
Cycle-free: no data-path crate depends on kiseki-advisory; the
runtime wiring happens only in the kiseki-server binary.
3. Pull-based advisory lookup (not push into the data path)
When a data-path request arrives carrying a workflow_ref header:
3.a Header mechanism
The workflow_ref is carried as a gRPC metadata entry, not as a
protobuf field on any data-path message. Concrete binding:
- Metadata key: `x-kiseki-workflow-ref-bin` (binary metadata, per gRPC convention for raw-bytes values)
- Metadata value: the raw 16-byte `WorkflowRef` handle
- All data-path protos remain unchanged — this is the structural payoff that makes I-WA2 tractable (data-path code stays advisory-unaware).
- A gRPC interceptor in `kiseki-server` lifts the header into a request-scoped context at ingress. The context is accessed by each data-path handler through a small `kiseki-common` helper (`CurrentAdvisory::from_request_context()`), which returns an `Option<OperationAdvisory>` by calling `AdvisoryLookup::lookup_fast`.
- For intra-Rust calls (e.g., the native client's native API path), the same helper reads from a task-local set by the caller. The native client's `WorkflowSession` handle scopes this automatically.
- For external protocols (NFS, S3) the HTTP-level header is `x-kiseki-workflow-ref` (plain, hex-encoded), translated by the protocol gateway into the gRPC binary metadata entry `x-kiseki-workflow-ref-bin` before forwarding to any internal gRPC service. This keeps external clients unaware of gRPC conventions.
- No data-path proto file contains `workflow_ref`. Any future attempt to add it is rejected at architecture review.
- The `kiseki-server` gRPC interceptor extracts `workflow_ref` and stores it in the request context.
- The data-path operation (e.g., `WriteChunk`) optionally consults `CurrentAdvisory::from_request_context()` to obtain an `Option<OperationAdvisory>`.
- The data-path code may, synchronously and fallibly, call `AdvisoryLookup::lookup_fast(workflow_ref) -> Option<OperationAdvisory>` with a strict bounded deadline (≤ 500 µs, configurable, default 200 µs). The method name carries the contract: implementations MUST NOT block, allocate on the happy path, or call non-O(1) functions.
- On timeout, unavailability, or cache miss the lookup returns `None`. The data-path code proceeds exactly as it would for an operation without any `workflow_ref`.
- There is no blocking wait, no retry, and no propagated error. The lookup is a hot-path cache read (see §4 below).
This guarantees I-WA2 structurally: the data path cannot be stalled or corrupted by the advisory subsystem. At worst, advisory context is unavailable and steering quality degrades.
4. Advisory state shape and hot path
kiseki-advisory maintains three bounded in-memory caches keyed by
workflow:
| Cache | Contents | Size bound | Eviction |
|---|---|---|---|
| Workflow table | (workflow_id) → { mTLS-identity, profile, current_phase, budgets, TTL } | policy-bounded max concurrent workflows per workload × total workloads | TTL + End |
| Effective-hints table | (workflow_id) → OperationAdvisory (latest accepted hints, merged across phase) | 1 row per active workflow | replaced on new accept |
| Prefetch ring | per-workflow ring buffer of accepted prefetch tuples | max_prefetch_tuples_per_hint × in-flight phases | FIFO on cap |
Reads from the data path hit the effective-hints table (O(1)).
Writes into these caches happen on the advisory runtime only.
Cross-thread access uses arc-swap (snapshot-read, copy-on-write)
so the data-path read never takes a lock held by the advisory
runtime.
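The snapshot-read pattern can be approximated with the standard library alone. This is a hypothetical stand-in: the production code uses the arc-swap crate for lock-free pointer swaps, whereas here an `RwLock` guards only the snapshot pointer, so readers briefly lock to clone the `Arc` and then read without holding anything shared with advisory writes.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Debug, PartialEq)]
pub struct OperationAdvisory {
    pub access_pattern: String,
}

// Approximation of the effective-hints table (names illustrative).
pub struct EffectiveHints {
    snapshot: RwLock<Arc<HashMap<u128, OperationAdvisory>>>,
}

impl EffectiveHints {
    pub fn new() -> Self {
        Self { snapshot: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Advisory-runtime side: copy-on-write replacement of the whole snapshot.
    pub fn accept_hint(&self, workflow_id: u128, adv: OperationAdvisory) {
        let mut guard = self.snapshot.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(workflow_id, adv);
        *guard = Arc::new(next);
    }

    /// Data-path side: clone the snapshot pointer, then read without holding
    /// any lock during the actual lookup. Miss cost is a `None` return.
    pub fn lookup_fast(&self, workflow_id: u128) -> Option<OperationAdvisory> {
        let snap = {
            let guard = self.snapshot.read().unwrap();
            Arc::clone(&*guard)
        };
        snap.get(&workflow_id).cloned()
    }
}

fn main() {
    let hints = EffectiveHints::new();
    assert!(hints.lookup_fast(7).is_none());
    hints.accept_hint(7, OperationAdvisory { access_pattern: "sequential".into() });
    assert_eq!(hints.lookup_fast(7).unwrap().access_pattern, "sequential");
}
```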
5. gRPC service shape
One new service, WorkflowAdvisoryService, on its own gRPC listener.
Unary: DeclareWorkflow, EndWorkflow, PhaseAdvance,
GetWorkflowStatus (for admin/debug within caller’s own scope).
Bidi streaming: AdvisoryStream (hints in, telemetry out over the
same stream, multiplexed). Server streaming: SubscribeTelemetry (a
receive-only variant for callers who don't want to send hints).
Full schema in specs/architecture/proto/kiseki/v1/advisory.proto.
6. Control-plane integration
New Go package control/pkg/advisory:
- Policy CRUD for profile allow-lists, budgets, opt-out state per org/project/workload. Inheritance computed server-side; effective policy returned to `kiseki-advisory` via the existing `ControlService`.
- Opt-out state transitions (`enabled`/`draining`/`disabled`) are Raft-backed in the existing control-plane state store.
- Federation does NOT replicate workflow state (ephemeral, local). It DOES replicate policy (existing async config replication path).
7. k-anonymity bucketing: concrete algorithm
For pool/shard saturation signals that incorporate cross-workload aggregate:
- Compute aggregate metric `A` over all contributing workloads on the pool/shard.
- Count distinct contributing workloads `k`.
- If `k ≥ 5` (policy-configurable minimum): return `severity = bucket(A)`; retry-after = `bucket(compute_retry(A))`.
- If `k < 5`: return `severity = bucket(A_caller_only)`; retry-after = `bucket(compute_retry(A_caller_only))`. The response shape is identical to the `k ≥ 5` case; only the value of the neighbour-derived component is replaced by a sentinel bucket (`ok`, regardless of true aggregate) chosen to minimize caller utility of detecting the substitution.
Bucket function: fixed set {ok, soft, hard} for severity,
{<50ms, 50-250ms, 250-1000ms, 1-10s, >10s} for retry-after.
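The bucket functions and the k-gate can be sketched as follows. The bucket sets come from the text; the load thresholds, function names, and the `compute_retry` stand-in are illustrative assumptions, not from the spec.

```rust
// Severity buckets {ok, soft, hard}; thresholds here are illustrative.
fn bucket_severity(load: f64) -> &'static str {
    match load {
        l if l < 0.7 => "ok",
        l if l < 0.9 => "soft",
        _ => "hard",
    }
}

// Retry-after buckets, fixed set from the text.
fn bucket_retry_after(ms: u64) -> &'static str {
    match ms {
        0..=49 => "<50ms",
        50..=249 => "50-250ms",
        250..=999 => "250-1000ms",
        1000..=9999 => "1-10s",
        _ => ">10s",
    }
}

// k-gate: the response shape is identical on both branches; only the value
// source changes (caller-only substitutes when k < k_min).
fn backpressure_signal(
    aggregate_load: f64,
    caller_only_load: f64,
    contributing_workloads: usize,
    k_min: usize,
) -> (&'static str, &'static str) {
    let a = if contributing_workloads >= k_min { aggregate_load } else { caller_only_load };
    let retry_ms = (a * 1000.0) as u64; // stand-in for compute_retry(A)
    (bucket_severity(a), bucket_retry_after(retry_ms))
}

fn main() {
    // k >= 5: the neighbour-derived aggregate is used.
    assert_eq!(backpressure_signal(0.95, 0.10, 7, 5).0, "hard");
    // k < 5: the caller-only value substitutes; the shape stays identical.
    assert_eq!(backpressure_signal(0.95, 0.10, 3, 5).0, "ok");
}
```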
8. Covert-channel hardening: concrete widths
- Rejection response timing: every advisory rejection path (hint, subscription, declare, phase) pads response emission to the next 100-µs boundary after a fixed minimum of 300 µs. Enforced by a common `emit_bucketed_response` helper in `kiseki-advisory`.
- Telemetry message sizes: protobuf messages padded to one of {128, 256, 512, 1024} bytes with a `reserved bytes padding` field repeated to the target size. Selection uses the nearest bucket ≥ actual size.
- Error codes: every rejection caused by authorization or scope violation returns the `SCOPE_NOT_FOUND` code with the same message payload, regardless of whether the cause was "unauthorized" or "absent". Internal audit records carry the true reason.
- gRPC status code: `WorkflowAdvisoryService` MUST return gRPC status `NOT_FOUND` (code 5) for every `SCOPE_NOT_FOUND` case. Using `PERMISSION_DENIED` (code 7) or `UNAUTHENTICATED` (code 16) on authorization failures would leak the distinction via the gRPC trailers, defeating the canonicalization above. All gRPC clients and middleware expose the status code, so this is not a "docs-only" rule — it is enforced by an integration test at Phase 11.5 exit that compares status-code distributions across authorized-absent and unauthorized-existing cases.
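The timing rule can be sketched as a small pure function (the helper name is hypothetical): pad emission to the next 100-µs boundary, with a 300-µs floor.

```rust
use std::time::Duration;

// Compute the padded emission time for a rejection response:
// take the 300 µs floor, then round up to the next 100 µs boundary
// (a value already on a boundary stays put).
fn padded_emit_delay(elapsed_us: u64) -> Duration {
    const FLOOR_US: u64 = 300;
    const BUCKET_US: u64 = 100;
    let raw = elapsed_us.max(FLOOR_US);
    let target = (raw + BUCKET_US - 1) / BUCKET_US * BUCKET_US;
    Duration::from_micros(target)
}

fn main() {
    assert_eq!(padded_emit_delay(120), Duration::from_micros(300)); // floor applies
    assert_eq!(padded_emit_delay(301), Duration::from_micros(400)); // next boundary
    assert_eq!(padded_emit_delay(400), Duration::from_micros(400)); // already aligned
}
```

An observer thus sees only a coarse grid of response times, masking the fast-rejection vs. slow-rejection distinction.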
9. Phase-history compaction format
Per workflow, keep the last 64 phase records in the workflow table
(ring buffer of PhaseRecord { phase_id, tag_hash, entered_at, hints_accepted_count, hints_rejected_count }). On eviction, the
evicted record is rolled up into a per-workflow
PhaseSummary { from_phase_id, to_phase_id, total_hints_accepted, total_hints_rejected, duration_ms } audit event emitted to the
tenant audit shard. The summary replaces all evicted individual
records in audit history.
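The compaction scheme above can be sketched with a bounded ring plus a running rollup. The field set is trimmed from the text (`tag_hash` and `entered_at` omitted for brevity) and the rollup-into-summary logic is illustrative, not the exact audit-event emission path.

```rust
use std::collections::VecDeque;

const PHASE_HISTORY_CAP: usize = 64;

#[derive(Clone)]
struct PhaseRecord {
    phase_id: u64,
    hints_accepted_count: u64,
    hints_rejected_count: u64,
}

#[derive(Default)]
struct PhaseSummary {
    from_phase_id: u64,
    to_phase_id: u64,
    total_hints_accepted: u64,
    total_hints_rejected: u64,
}

struct PhaseHistory {
    ring: VecDeque<PhaseRecord>,
    summary: PhaseSummary, // destined for the tenant audit shard on emit
}

impl PhaseHistory {
    fn new() -> Self {
        Self { ring: VecDeque::with_capacity(PHASE_HISTORY_CAP), summary: PhaseSummary::default() }
    }

    fn push(&mut self, rec: PhaseRecord) {
        if self.ring.len() == PHASE_HISTORY_CAP {
            // Evict the oldest record and fold it into the rollup summary.
            let evicted = self.ring.pop_front().unwrap();
            if self.summary.from_phase_id == 0 {
                self.summary.from_phase_id = evicted.phase_id;
            }
            self.summary.to_phase_id = evicted.phase_id;
            self.summary.total_hints_accepted += evicted.hints_accepted_count;
            self.summary.total_hints_rejected += evicted.hints_rejected_count;
        }
        self.ring.push_back(rec);
    }
}

fn main() {
    let mut h = PhaseHistory::new();
    for phase_id in 1..=70 {
        h.push(PhaseRecord { phase_id, hints_accepted_count: 2, hints_rejected_count: 1 });
    }
    assert_eq!(h.ring.len(), 64);           // bounded history (I-WA13)
    assert_eq!(h.summary.from_phase_id, 1); // phases 1..=6 rolled up
    assert_eq!(h.summary.to_phase_id, 6);
    assert_eq!(h.summary.total_hints_accepted, 12);
}
```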
Alternatives considered
- Put advisory code inside each data-path crate behind a feature flag. Rejected: tight coupling; impossible to guarantee I-WA2 (hot data-path code lives in the advisory lifecycle), and per-crate feature flags multiply the combinatorics of build variants.
- Separate OS process for advisory runtime, IPC'd from kiseki-server. Rejected: IPC adds serialization cost on the hot-path lookup (§3) and complicates deployment (another process per node). The isolated-tokio-runtime pattern gives enough blast-radius reduction at much lower overhead.
- Define advisory traits in a new tiny crate `kiseki-advisory-api` separate from `kiseki-common`. Considered. Rejected for now: the advisory domain types (`OperationAdvisory`, enums) are small, stable, and already conceptually part of the shared vocabulary (Workflow, Phase, AccessPattern appear in ubiquitous-language.md). Adding a one-concept crate adds build-graph overhead without payoff. Can be split out later if the type set grows.
- Push hints directly into each context via per-context channels (no `OperationAdvisory` aggregation). Rejected: spreads fan-out logic across every context and makes I-WA11 (target-field restriction) and I-WA16 (size cap) harder to enforce. Centralizing in `kiseki-advisory` and passing an already-validated bundle simplifies data-path code.
10. Schema versioning
advisory.proto ships as kiseki.v1. Forward-evolution rules:
- Additions (new fields, new oneof variants, new enum values) stay within `v1`. Unknown fields are preserved by gRPC clients.
- Deprecations mark fields with `reserved` after one minor release; old clients continue to work.
- Breaking changes (semantic change of a field, required removal) move to `v2` with a deprecation window ≥ 2 releases in which both versions are served.
- Advisory-policy changes in the control plane (profile allow-list additions, budget changes) are config, not schema — no version bump needed.
11. Padding to bucket size
AdvisoryError.padding, AdvisoryServerMessage.padding,
TelemetryEvent.padding, WorkflowStatus.padding, and
AdvisoryAuditBody.padding carry the variable bytes needed to hit
one of the bucket sizes {128, 256, 512, 1024, 2048 for audit bodies}.
Computation at emit time:
serialized_size = serialize(rest_of_message).len();
target_bucket = smallest bucket >= serialized_size + padding_overhead;
padding_len = target_bucket - serialized_size - varint_overhead(target_bucket);
varint_overhead(N) accounts for the two-byte (tag + length-varint)
prefix of the padding field; standard protobuf wire format.
Implementations MUST use the kiseki-advisory::emit_bucketed_response
helper. Property test at Phase 11.5 exit: every response on
WorkflowAdvisoryService is exactly one of the bucket sizes.
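The computation above can be sketched as follows. The bucket set comes from the text; the overhead model is a simplified stand-in for `varint_overhead(N)` (one tag byte plus a one- or two-byte length varint), and the function names are illustrative.

```rust
// Telemetry-size buckets from §8/§11 (audit bodies add a 2048 bucket).
const BUCKETS: [usize; 4] = [128, 256, 512, 1024];

// Simplified stand-in for varint_overhead(N): tag byte + length varint
// for the padding field.
fn padding_field_overhead(target: usize) -> usize {
    if target < 128 { 2 } else { 3 }
}

// Choose the smallest bucket that fits the message plus its padding-field
// overhead; return the padding length needed to land exactly on it, or
// None if even the largest bucket cannot fit the message.
fn padding_len(serialized_size: usize) -> Option<usize> {
    let target = *BUCKETS
        .iter()
        .find(|&&b| b >= serialized_size + padding_field_overhead(b))?;
    Some(target - serialized_size - padding_field_overhead(target))
}

fn main() {
    // A 100-byte message lands exactly on the 128-byte bucket.
    let pad = padding_len(100).unwrap();
    assert_eq!(100 + padding_field_overhead(128) + pad, 128);
    // A message too large for every bucket yields no padding plan.
    assert!(padding_len(1023).is_none());
}
```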
Consequences
- Adds one Rust crate (`kiseki-advisory`), one Go package (`control/pkg/advisory`), one proto file (`proto/kiseki/v1/advisory.proto`), one data-model stub (`data-models/advisory.rs`).
- Adds a new phase to the build sequence (see `build-phases.md`).
- Every data-path `*Op` trait in `api-contracts.md` gains an optional `advisory: Option<&OperationAdvisory>` parameter on its methods. Callers that don't care pass `None`.
- Isolation requires `kiseki-server` to instantiate two tokio runtimes. Accepted cost.
- The `arc-swap` hot-path read is the only cross-runtime coupling. Property-test and benchmark-verified at Phase 11 exit.
Open items (escalated to adversary gate-1)
- Validate that §3 (pull-based lookup) cannot itself become a DoS surface: a malicious client pummelling `workflow_ref` headers causes lookups. Mitigation: lookup cache is per-node, bounded, and miss cost is a `None` return (no upstream RPC).
- Validate §4 (`arc-swap` snapshot) meets latency targets on the actual data-path hot code (FUSE read/write, chunk write, view read).
- Validate §8 covert-channel widths are large enough to mask actual work variance under realistic load.
- Confirm §9 audit summary compaction does not itself become an existence oracle (size of summary varies with workflow activity).
ADR-022: Storage Backend — redb (Pure Rust)
Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.
Context
The system needs persistent storage for:
- Raft log entries — append-heavy, sequential reads for replay
- State machine snapshots — periodic full-state serialization
- Chunk metadata index — key-value mapping (chunk_id → placement, refcount)
- View watermark checkpoints — small, frequently updated
The spec references "RocksDB or equivalent" (build-phases.md Phase 3) but does not commit to a specific engine. RocksDB is C++ and brings a ~200 MB build-dependency chain via cmake/clang/librocksdb.
Decision
Use redb v2 for all structured persistent storage.
What redb handles
| Data | redb Table | Key | Value |
|---|---|---|---|
| Raft log entries | raft_log | u64 (log index) | bincode-serialized entry |
| Raft vote/term | raft_meta | &str (“vote”, “term”) | u64 |
| State machine snapshot | sm_snapshot | "latest" | bincode-serialized state |
| Chunk metadata | chunk_meta | [u8; 32] (chunk_id) | bincode ChunkMeta |
| Device allocation | device_alloc | (DeviceId, u64) (device + offset) | [u8; 32] (chunk_id) — reverse index |
| View watermarks | view_wm | [u8; 16] (view_id) | u64 (sequence) |
What redb does NOT handle
Chunk ciphertext data is written directly to raw block devices
(or file-backed fallback for VMs/CI) via the DeviceBackend trait
in kiseki-block (ADR-029). redb stores metadata only; chunk
ciphertext never passes through redb.
```
$KISEKI_DATA_DIR/
  devices/
    /dev/nvme0n1          # raw block device (default, ADR-029)
    /dev/nvme1n1          # raw block device
    /tmp/kiseki-dev0.img  # file-backed fallback (VMs/CI)
  raft/
    db.redb               # redb database file (metadata only)
```
redb tracks chunk placement: chunk_meta table maps
chunk_id → (device_id, offset, size, fragment_index).
The device_alloc table provides a reverse index
(device_id, offset) → chunk_id for bitmap rebuild and scrub.
Bitmap allocation updates are journaled in redb before application
to the on-device bitmap (ADR-029).
Why pool files, not per-chunk files:
- At 100TB / 64KB avg = 1.6B chunks → filesystem inode exhaustion
- Pool files support O_DIRECT and RDMA pre-registration (single mmap region)
- Chunks are 4KB-aligned within the pool file for NVMe block alignment
- Pool file is sparse: only allocated regions consume disk space
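The arithmetic behind the first and third bullets is easy to sanity-check. A small sketch with hypothetical helper names (the 100 TB / 64 KB and 4 KB figures come from the list above):

```rust
/// Estimated chunk count for a given capacity and average chunk size.
/// At ~100 TB with 64 KB average chunks this exceeds 1.6 billion entries,
/// far beyond what a per-chunk-file layout can sustain on most filesystems.
pub fn estimated_chunk_count(capacity_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    capacity_bytes / avg_chunk_bytes
}

/// Round an offset up to the next alignment boundary (4 KB for NVMe blocks).
pub fn align_up(offset: u64, alignment: u64) -> u64 {
    (offset + alignment - 1) / alignment * alignment
}
```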
EC fragment placement (CRUSH-like)
Fragments are placed across devices via deterministic hashing:

```rust
fn place_fragment(chunk_id: ChunkId, frag_idx: usize, pool_devices: &[DeviceId]) -> DeviceId {
    // Ensure no two fragments land on the same device: exclude the devices
    // deterministically chosen for lower fragment indices, then hash-select.
    let mut candidates = pool_devices.to_vec();
    for prior in 0..frag_idx {
        let placed = candidates[hash(chunk_id, prior) as usize % candidates.len()];
        candidates.retain(|d| *d != placed);
    }
    candidates[hash(chunk_id, frag_idx) as usize % candidates.len()]
}
```
Deterministic — can recalculate placement without storing it.
Reverse index (device_id, chunk_id) → fragment_index in redb
enables efficient repair on device failure.
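A self-contained version of this scheme (with a stand-in FNV-1a hash; the production hash function is not specified here) demonstrates both claims: recalculation is deterministic, and no two fragments share a device:

```rust
// Stand-in hash; the real hash function is an assumption of this sketch.
fn frag_hash(chunk_id: u64, frag_idx: usize) -> u64 {
    let mut h = 0xcbf29ce484222325u64; // FNV-1a offset basis
    for b in chunk_id.to_le_bytes().iter().chain(&(frag_idx as u64).to_le_bytes()) {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
    h
}

/// Deterministically place one EC fragment, excluding devices already
/// chosen for lower fragment indices (no two fragments on one device).
pub fn place_fragment(chunk_id: u64, frag_idx: usize, pool_devices: &[u32]) -> u32 {
    let mut candidates = pool_devices.to_vec();
    for prior in 0..frag_idx {
        let placed = candidates[(frag_hash(chunk_id, prior) % candidates.len() as u64) as usize];
        candidates.retain(|d| *d != placed);
    }
    candidates[(frag_hash(chunk_id, frag_idx) % candidates.len() as u64) as usize]
}
```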
Raft snapshots
- Trigger: every 10,000 log entries
- Format: `bincode::serialize(&state_machine_inner)`
- Storage: redb `sm_snapshot` table, key = `"latest"`
- Restore: deserialize snapshot → replay log entries after the snapshot index
- Log cleanup: truncate entries before the snapshot index after each snapshot
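The restore path can be sketched with a toy state machine (plain structs stand in for the real bincode-serialized state; the `applied_index` bookkeeping is an assumption of this sketch):

```rust
/// Toy state machine: applies u64 increments. Snapshot = a clone of the
/// state; restore = snapshot + replay of log entries after its index.
#[derive(Clone, Default, Debug, PartialEq)]
pub struct StateMachine {
    pub applied_index: u64,
    pub sum: u64,
}

impl StateMachine {
    pub fn apply(&mut self, index: u64, delta: u64) {
        self.applied_index = index;
        self.sum += delta;
    }
}

/// Restore: start from the latest snapshot, then replay only the log
/// entries with an index greater than the snapshot's applied index.
pub fn restore(snapshot: &StateMachine, log: &[(u64, u64)]) -> StateMachine {
    let mut sm = snapshot.clone();
    for &(index, delta) in log.iter().filter(|e| e.0 > snapshot.applied_index) {
        sm.apply(index, delta);
    }
    sm
}
```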
Rationale
| Criterion | redb | RocksDB | fjall | Custom files |
|---|---|---|---|---|
| Pure Rust | Yes | No (C++) | Yes | Yes |
| Build deps | None | cmake, clang, librocksdb | None | None |
| Binary size | ~50KB | ~5MB | ~100KB | 0 |
| ACID | Yes (COW) | Yes (WAL) | Yes (WAL) | Manual (fsync) |
| Crash recovery | Automatic | Automatic | Automatic | Manual replay |
| Compaction | None needed (B-tree) | Required (LSM) | Required (LSM) | None |
| Maturity | 1.0, used by Firefox | Very mature | Newer | N/A |
| Write amplification | Low (COW) | High (LSM) | High (LSM) | Low |
redb wins on simplicity, zero deps, and sufficient performance for Raft log append + metadata lookup.
Consequences
- No LSM-tree compaction complexity
- No C++ build toolchain required
- Chunk ciphertext kept outside redb (raw block device, or file-backed fallback per ADR-029): simple, inspectable, RDMA-compatible
- redb’s COW B-tree has higher read amplification than LSM for range scans — acceptable for our workload (point lookups + append)
- If redb proves insufficient for high-throughput Raft log append, migrate to fjall (LSM, same API pattern)
References
- redb: https://github.com/cberner/redb
- RFC 1813 §3: NFS3 procedure semantics
- build-phases.md Phase 3: “SSTable” storage (now redb B-tree)
- ADR-029: Raw Block Device Allocator (chunk data I/O)
ADR-023: Protocol RFC Compliance Scope
Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.
Context
Kiseki exposes three protocol interfaces: S3 HTTP, NFSv3, NFSv4.2. ADR-013 (POSIX semantics) and ADR-014 (S3 API scope) define the functional subset but don’t reference specific RFC sections or define wire-format compliance testing.
Now that wire protocol implementations exist, we need to codify which RFC requirements are met and how compliance is verified.
Decision
Protocol scope
| Protocol | Standard | Implemented Subset | Total in Standard |
|---|---|---|---|
| NFSv3 | RFC 1813 | 7 of 22 procedures | 22 procedures |
| NFSv4.2 | RFC 7862 | 10 of ~60 operations | ~60 operations |
| S3 | AWS S3 API | 5 of 40+ operations | 40+ operations |
NFSv3 (RFC 1813) — implemented procedures
| # | Procedure | Status | Notes |
|---|---|---|---|
| 0 | NULL | Implemented | Ping/health check |
| 1 | GETATTR | Implemented | File/directory attributes |
| 3 | LOOKUP | Implemented | Name → file handle resolution |
| 6 | READ | Implemented | Byte-range file read |
| 7 | WRITE | Implemented | File data write |
| 8 | CREATE | Implemented | Create new file + directory index entry |
| 16 | READDIR | Implemented | Directory listing with real filenames |
Not implemented: SETATTR, ACCESS, READLINK, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, READDIRPLUS, FSSTAT, FSINFO, PATHCONF, COMMIT.
NFSv4.2 (RFC 7862) — implemented COMPOUND operations
| Op | Name | Status | Notes |
|---|---|---|---|
| 9 | GETATTR | Implemented | Bitmap-selected attributes |
| 10 | GETFH | Implemented | Return current file handle |
| 15 | LOOKUP | Stub (delegates to directory index) | |
| 24 | PUTROOTFH | Implemented | Set root file handle |
| 25 | READ | Implemented | Via stateid + offset + count |
| 38 | WRITE | Implemented | Via stateid + offset + stable |
| 42 | EXCHANGE_ID | Implemented | Random client IDs (C-ADV-7) |
| 43 | CREATE_SESSION | Implemented | Random session IDs (C-ADV-2) |
| 44 | DESTROY_SESSION | Implemented | Session teardown |
| 53 | SEQUENCE | Implemented | Per-request sequencing |
| 63 | IO_ADVISE | Implemented | Accepted (advisory integration pending) |
S3 API — implemented operations
| Operation | HTTP Method | Status |
|---|---|---|
| PutObject | PUT /:bucket/:key | Implemented |
| GetObject | GET /:bucket/:key | Implemented |
| HeadObject | HEAD /:bucket/:key | Implemented |
| DeleteObject | DELETE /:bucket/:key | Stub (returns 204) |
| ListObjectsV2 | GET /:bucket | Not yet |
Compliance testing approach
- BDD feature files map to RFC sections:
  - `specs/features/nfs3-rfc1813.feature` (14 scenarios)
  - `specs/features/nfs4-rfc7862.feature` (20 scenarios)
  - `specs/features/s3-api.feature` (10 scenarios)
- Wire-format validation via Python e2e tests:
  - NFS: raw TCP with `struct.pack` for ONC RPC framing
  - S3: `requests` library for HTTP
- Real client interop (future):
  - NFS: `mount -t nfs -o nfsvers=3,tcp` in Docker
  - S3: `boto3` / `aws-cli`
Consequences
- Clear documentation of what’s implemented vs what’s not
- BDD scenarios serve as living compliance spec
- Real client interop deferred until wire format proven via raw tests
- Expanding the subset (e.g., adding REMOVE, RENAME) requires: new BDD scenario → new step definition → implementation → test green
References
- RFC 1813: NFS Version 3 Protocol Specification
- RFC 7862: NFS Version 4.2 Protocol
- RFC 5531: ONC RPC Version 2
- RFC 4506: XDR: External Data Representation Standard
- AWS S3 API Reference
- ADR-013: POSIX Semantics Scope
- ADR-014: S3 API Scope
ADR-024: Device Management, Storage Tiers, and Capacity Thresholds
Status: Accepted (19/19 device-management BDD scenarios pass). Date: 2026-04-20. Deciders: Architect + domain expert.
Context
The current design (ADR-005) defines three NVMe device classes but does not address:
- HDD / spinning disk tiers (common in cost-optimized HPC clusters)
- System partition vs data partition separation
- Capacity thresholds and degradation behavior
- Device health monitoring and proactive replacement
- Memory-attached storage (CXL, persistent memory)
- Mixed-tier deployments (SSD+HDD, fast-SSD+cheap-SSD)
Real HPC deployments often have:
- System partition: RAID-1 (or RAID-1+0) on 2 SSDs for OS + Kiseki binaries + redb
- Data partitions: JBOD — each NVMe/SSD/HDD is an independent pool member
- Tiering: Hot data on fast NVMe, warm on cheap SSD, cold on HDD
Decision
Device classification
Extend DeviceClass to cover the full storage hierarchy:
| Class | Medium | Use case | Typical capacity |
|---|---|---|---|
| `NvmeU2` | NVMe U.2 TLC/MLC | Metadata, hot data, Raft log | 1-8 TB |
| `NvmeQlc` | NVMe QLC | Checkpoints, warm data | 4-30 TB |
| `NvmePersistentMemory` | Intel Optane / CXL | Cache, ultra-hot metadata | 128 GB - 1 TB |
| `SsdSata` | SATA SSD | Budget fast storage | 1-8 TB |
| `HddEnterprise` | SAS/SATA HDD 10k/15k | Cold data, archive | 4-20 TB |
| `HddBulk` | SATA HDD 7.2k | Deep archive, bulk cold | 10-20 TB |
| `Custom(String)` | User-defined | Vendor-specific | Varies |
Server disk layout
Server node:
```
├── System partition (RAID-1 on 2× SSD)
│   ├── /boot, /root, OS
│   ├── /var/lib/kiseki/redb/    ← Raft log, metadata index
│   └── /var/lib/kiseki/config/  ← Node config, certs
│
├── Data devices (JBOD, managed by Kiseki)
│   ├── /dev/nvme0n1 → pool "fast-nvme" (device member)
│   ├── /dev/nvme1n1 → pool "fast-nvme" (device member)
│   ├── /dev/sda     → pool "bulk-ssd"  (device member)
│   ├── /dev/sdb     → pool "cold-hdd"  (device member)
│   └── ...
│
└── Optional: CXL memory → pool "pmem" (hot cache tier)
```
JBOD for data, RAID-1 for system. Kiseki manages data durability via EC/replication across JBOD members. The system partition uses traditional RAID-1 because redb and Raft log must survive single-disk failure without Kiseki’s own repair mechanism.
Pool capacity management
Per-device-class capacity thresholds
Thresholds vary by device type because NVMe/SSD suffer GC-induced write amplification at high fill levels, while HDD does not. Enterprise arrays (VAST, Pure) can operate at 95%+ because they have global wear leveling — JBOD does not have that luxury.
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Rationale: NVMe/SSD GC pressure increases sharply above ~80% fill. QLC is worse than TLC. The SSD Warning threshold (75%) gives the placement engine time to redirect before the GC cliff. HDD has no such cliff — outer-track vs inner-track difference is ~20%, not a performance wall.
Implementation:
```rust
pub enum PoolHealth {
    Healthy,
    Warning { used_percent: u8 },
    Critical { used_percent: u8 },
    ReadOnly { used_percent: u8 },
    Full,
}

pub struct CapacityThresholds {
    pub warning_pct: u8,
    pub critical_pct: u8,
    pub readonly_pct: u8,
    pub full_pct: u8,
}

impl CapacityThresholds {
    pub fn for_device_class(class: &DeviceClass) -> Self {
        match class {
            DeviceClass::NvmeU2
            | DeviceClass::NvmeQlc
            | DeviceClass::NvmePersistentMemory
            | DeviceClass::SsdSata => Self {
                warning_pct: 75,
                critical_pct: 85,
                readonly_pct: 92,
                full_pct: 97,
            },
            DeviceClass::HddEnterprise | DeviceClass::HddBulk => Self {
                warning_pct: 85,
                critical_pct: 92,
                readonly_pct: 97,
                full_pct: 99,
            },
            DeviceClass::Custom(_) => Self {
                warning_pct: 80,
                critical_pct: 90,
                readonly_pct: 95,
                full_pct: 99,
            },
        }
    }
}

impl AffinityPool {
    pub fn health(&self) -> PoolHealth {
        let pct = ((self.used_bytes * 100) / self.capacity_bytes) as u8;
        // Thresholds depend on the pool's device class (see table above);
        // assumes the pool carries its device class.
        let t = CapacityThresholds::for_device_class(&self.device_class);
        match pct {
            p if p < t.warning_pct => PoolHealth::Healthy,
            p if p < t.critical_pct => PoolHealth::Warning { used_percent: p },
            p if p < t.readonly_pct => PoolHealth::Critical { used_percent: p },
            p if p < t.full_pct => PoolHealth::ReadOnly { used_percent: p },
            _ => PoolHealth::Full,
        }
    }
}
```
Placement engine behavior:
- Healthy: Place chunks according to affinity policy
- Warning: Continue placing but emit telemetry; cluster admin should add capacity
- Critical: Reject new placements; redirect to same device-class sibling only
- ReadOnly: In-flight writes complete; new writes fail with retriable error
- Full: ENOSPC — client gets permanent error
Pool redirection policy: When a pool is Critical, the placement engine redirects to another pool of the same device class only. Never cross device-class boundaries (e.g., never NVMe → HDD). If no same-class sibling has capacity, return ENOSPC to client. This preserves performance SLAs and compliance tag enforcement.
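A minimal sketch of the redirection rule, assuming illustrative pool and health types rather than the real data model:

```rust
pub enum Health {
    Healthy,
    Warning,
    Critical,
    ReadOnly,
    Full,
}

pub struct Pool<'a> {
    pub id: &'a str,
    pub class: &'a str,
    pub health: Health,
}

/// On Critical, redirect only to a same-device-class sibling that still
/// accepts writes; never cross device classes. No sibling => None (ENOSPC).
pub fn redirect<'a>(from: &Pool<'a>, pools: &'a [Pool<'a>]) -> Option<&'a str> {
    pools
        .iter()
        .filter(|p| p.id != from.id && p.class == from.class)
        .find(|p| matches!(p.health, Health::Healthy | Health::Warning))
        .map(|p| p.id)
}
```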
System partition
OS-managed RAID-1 on 2× SSD. Kiseki does not manage the RAID.
Kiseki monitors system partition health:
- On startup: check `/proc/mdstat` for RAID health
- If degraded → log WARNING, continue operating
- If both drives failed → log CRITICAL, refuse to start
- Periodic check every 60 seconds
Admin is responsible for replacing failed system drives and rebuilding the RAID. Kiseki trusts the OS for system partition durability.
Device health monitoring
Each device reports SMART/health metrics:
| Metric | Threshold | Action |
|---|---|---|
| Temperature | >70°C | Warning; throttle if >80°C |
| Wear level (SSD) | >90% life used | Warning; proactive replacement window |
| Bad sectors (HDD) | >0 reallocated | Warning at 1; evacuate at >100 |
| Latency | >10× baseline | Mark degraded; reduce placement priority |
| Errors | Uncorrectable read | Mark suspect; verify EC/replicas for affected chunks |
Device states:
```
Healthy → Degraded → Failed → Removed
               ↘        ↗
              Evacuating → Removed
```
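The SMART thresholds in the table can be folded into a single most-severe-wins evaluation; the combining rule and type names below are assumptions of this sketch:

```rust
pub struct Smart {
    pub temp_c: u32,
    pub wear_pct: u8,              // SSD life used
    pub reallocated_sectors: u32,  // HDD bad sectors
    pub latency_x_baseline: f64,
    pub uncorrectable_reads: u64,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    None,
    Warn,
    Throttle,
    Degrade,
    Evacuate,
}

/// Map SMART metrics to the most severe action implied by the table above.
/// Cutoffs follow the table; the severity ordering is a sketch assumption.
pub fn evaluate(s: &Smart) -> Action {
    if s.uncorrectable_reads > 0 || s.reallocated_sectors > 100 {
        Action::Evacuate // uncorrectable read / cascading bad sectors
    } else if s.latency_x_baseline > 10.0 {
        Action::Degrade // >10x baseline latency
    } else if s.temp_c > 80 {
        Action::Throttle
    } else if s.temp_c > 70 || s.wear_pct > 90 || s.reallocated_sectors > 0 {
        Action::Warn
    } else {
        Action::None
    }
}
```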
Eviction and evacuation policy
Key principle: Unhealthy devices are evacuated proactively, not waited on until failure. Full devices are write-blocked, not evicted (data is still readable).
| Trigger | Action | Automatic? | Priority |
|---|---|---|---|
| SMART wear >90% (SSD) | Evacuate — migrate chunks to other pool members | Yes (background) | Normal |
| Bad sectors >100 (HDD) | Evacuate — migrate before cascading failure | Yes (background) | High |
| Uncorrectable read error | Evacuate + EC repair for affected chunks | Yes (immediate) | Critical |
| Temperature >80°C | Throttle I/O, alert admin | Yes | High |
| Device unresponsive | Mark Failed — trigger EC repair from survivors | Yes (immediate) | Critical |
| Pool at Critical threshold | Block writes — redirect to sibling pools | Yes | Normal |
| Pool at ReadOnly threshold | Drain writes — no new data, existing completes | Yes | Normal |
| Admin-initiated | Evacuate — controlled migration before physical removal | Manual | Normal |
Evacuation process:
1. Mark the device `Evacuating`
2. For each chunk on the device: read its fragment, write it to another healthy device in the pool
3. Update chunk metadata (redb) with the new placement
4. When all chunks are migrated: mark the device `Removed`
5. Admin can physically pull the device
Evacuation speed: Bounded by network and destination device throughput. At 1 GB/s NVMe write speed, a 4TB device evacuates in ~67 minutes. EC repair (from parity) is faster since only the missing fragments need reconstruction.
Invariant: A device in Evacuating state accepts no new writes
but serves reads for chunks not yet migrated.
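The evacuation-time estimate generalizes to a one-liner (decimal units, as in the 4 TB at 1 GB/s example above):

```rust
/// Rough evacuation-time estimate: bytes to move divided by the sustained
/// write rate of the destination. 4 TB at 1 GB/s comes to ~67 minutes.
pub fn evacuation_minutes(used_bytes: f64, write_bytes_per_sec: f64) -> f64 {
    used_bytes / write_bytes_per_sec / 60.0
}
```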
Storage backend per JBOD device
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Raw block (ADR-029) | Zero FS overhead, direct I/O, aligned writes, bitmap allocator with redb journal | Custom allocator in kiseki-block | Default — recommended for production |
| File-backed (ADR-029) | Same DeviceBackend trait, works in VMs/CI without raw devices | Slight overhead from host FS | VMs and CI environments |
| xfs | Scales to 100M+ files, good NVMe support | Extra FS overhead, inode pressure at scale | Legacy / deprecated |
Default: Raw block device I/O via kiseki-block (DeviceBackend
trait with auto-detection of device characteristics). File-backed
fallback for VMs and CI. XFS is deprecated as a chunk storage backend;
existing XFS deployments can migrate via background evacuation.
Device discovery
Manual configuration (MVP):
- Admin provides the device list in node config (`kiseki-server.toml`)
- Each device: path, class, pool assignment

Future auto-discovery:
- Scan `/sys/block/` for NVMe/SSD/HDD devices
- Classify by transport (NVMe, SATA, SAS) and media (rotational flag)
- Present to admin for pool-assignment confirmation

Device lifecycle states:
- Healthy: normal I/O
- Degraded: elevated errors or latency; reduce write priority
- Evacuating: admin-initiated; migrate chunks to other devices, then remove
- Failed: I/O errors; trigger EC repair for all chunks
- Removed: device physically absent; metadata cleaned up
Tiering and data movement
Static placement (MVP): Admin assigns pools to device classes. Chunk placement is determined at write time by the composition’s view descriptor affinity policy. No automatic migration.
Future: Reactive tiering (per assumption A8):
- Compositions with high read frequency auto-promote from cold → hot
- Compositions with no reads for >N days auto-demote from hot → cold
- Promotion/demotion as background job (copy chunk, update metadata, delete old)
- Bounded by pool capacity thresholds (don’t overfill hot tier)
Data model changes
```rust
pub enum DeviceClass {
    NvmeU2,
    NvmeQlc,
    NvmePersistentMemory,
    SsdSata,
    HddEnterprise,
    HddBulk,
    Custom(String),
}

pub struct DeviceInfo {
    pub id: DeviceId,
    pub class: DeviceClass,
    pub path: String, // /dev/nvme0n1 or mount point
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub state: DeviceState,
    pub pool_id: Option<String>,
}

pub enum DeviceState {
    Healthy,
    Degraded { reason: String },
    Evacuating { progress_percent: u8 },
    Failed { since: u64 },
    Removed,
}
```
Consequences
- Device diversity now first-class (HDD, SSD, NVMe, PMem)
- Capacity management is explicit with defined thresholds
- System partition (RAID-1) separated from data (JBOD)
- Device health monitoring enables proactive replacement
- Tiering is future work; static placement for MVP
- Cluster admin must provision devices and assign to pools at setup time
References
- ADR-005: EC and chunk durability (per pool)
- ADR-022: Storage backend (redb on system partition)
- Assumption A4: ClusterStor hardware
- Assumption A8: Reactive tiering
- Failure mode F-I2: Storage node failure
- Failure mode F-I4: Disk/device failure
ADR-025: Storage Administration API
Status: Proposed. Date: 2026-04-20. Deciders: Architect + domain expert.
Context
Storage administrators need to performance-tune the system similar to
Ceph (ceph osd pool set), VAST (management UI), or Lustre (lctl).
The current control plane API handles tenant lifecycle but has no
storage admin surface — no pool management, device management,
performance tuning, or cluster-wide observability.
API-first principle: All admin interactions go through gRPC APIs.
CLI (kiseki-cli), Web UI, and job orchestrators (Ansible, Terraform)
are wrappers around these APIs. No SSH-and-edit-config path.
Decision
Admin API surface (new gRPC service)
```proto
service StorageAdminService {
  // === Device management ===
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);
  rpc GetDevice(GetDeviceRequest) returns (DeviceInfo);
  rpc AddDevice(AddDeviceRequest) returns (AddDeviceResponse);
  rpc RemoveDevice(RemoveDeviceRequest) returns (RemoveDeviceResponse);
  rpc EvacuateDevice(EvacuateDeviceRequest) returns (EvacuateDeviceResponse);
  rpc CancelEvacuation(CancelEvacuationRequest) returns (CancelEvacuationResponse);

  // === Pool management ===
  rpc ListPools(ListPoolsRequest) returns (ListPoolsResponse);
  rpc GetPool(GetPoolRequest) returns (PoolInfo);
  rpc CreatePool(CreatePoolRequest) returns (CreatePoolResponse);
  rpc SetPoolDurability(SetPoolDurabilityRequest) returns (SetPoolDurabilityResponse);
  rpc SetPoolThresholds(SetPoolThresholdsRequest) returns (SetPoolThresholdsResponse);
  rpc RebalancePool(RebalancePoolRequest) returns (RebalancePoolResponse);

  // === Performance tuning ===
  rpc GetTuningParams(GetTuningParamsRequest) returns (TuningParams);
  rpc SetTuningParams(SetTuningParamsRequest) returns (SetTuningParamsResponse);

  // === Cluster observability ===
  rpc ClusterStatus(ClusterStatusRequest) returns (ClusterStatus);
  rpc PoolStatus(PoolStatusRequest) returns (PoolStatus);
  rpc DeviceHealth(DeviceHealthRequest) returns (stream DeviceHealthEvent);
  rpc IOStats(IOStatsRequest) returns (stream IOStatsEvent);

  // === Shard management ===
  rpc ListShards(ListShardsRequest) returns (ListShardsResponse);
  rpc GetShard(GetShardRequest) returns (ShardInfo);
  rpc SplitShard(SplitShardRequest) returns (SplitShardResponse);
  rpc SetShardMaintenance(SetShardMaintenanceRequest) returns (SetShardMaintenanceResponse);

  // === Repair and scrub ===
  rpc TriggerScrub(TriggerScrubRequest) returns (TriggerScrubResponse);
  rpc RepairChunk(RepairChunkRequest) returns (RepairChunkResponse);
  rpc ListRepairs(ListRepairsRequest) returns (ListRepairsResponse);
}
```
Tuning parameters
Storage admins tune at four levels: cluster → pool → tenant → workload. Lower levels inherit from higher levels and can only narrow settings, not broaden them.
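The narrow-only inheritance rule can be sketched as a range intersection (illustrative types, not the real config model):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Limit {
    pub min: u64,
    pub max: u64,
}

/// A child level may only narrow its parent's range, never broaden it:
/// the effective limit is the child clamped to the parent's bounds.
pub fn narrow(parent: Limit, child: Limit) -> Limit {
    Limit {
        min: child.min.max(parent.min),
        max: child.max.min(parent.max),
    }
}
```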
Cluster-wide tuning
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| `compaction_rate_mb_s` | 100 | 10-1000 | Background compaction throughput cap |
| `gc_interval_s` | 300 | 60-3600 | How often GC scans for reclaimable chunks |
| `rebalance_rate_mb_s` | 50 | 0-500 | Background rebalance/evacuation throughput |
| `scrub_interval_h` | 168 (7d) | 24-720 | How often the integrity scrub runs |
| `max_concurrent_repairs` | 4 | 1-32 | Parallel EC repair jobs |
| `stream_proc_poll_ms` | 100 | 10-1000 | View materialization poll interval |
| `inline_threshold_bytes` | 4096 | 512-65536 | Below this, data is inlined in the delta |
| `raft_snapshot_interval` | 10000 | 1000-100000 | Entries between Raft snapshots |
Per-pool tuning
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| `ec_data_chunks` | 4 (NVMe) / 8 (HDD) | 2-16 | EC data fragment count |
| `ec_parity_chunks` | 2 (NVMe) / 3 (HDD) | 1-8 | EC parity fragment count |
| `replication_count` | 3 | 2-5 | For replication pools (not EC) |
| `warning_threshold_pct` | per ADR-024 | 50-95 | Capacity warning level |
| `critical_threshold_pct` | per ADR-024 | 60-98 | Capacity critical level |
| `readonly_threshold_pct` | per ADR-024 | 70-99 | Read-only level |
| `target_fill_pct` | 70 (SSD) / 80 (HDD) | 50-90 | Rebalance target fill level |
| `chunk_alignment_bytes` | 4096 | 512-65536 | On-disk alignment (RDMA/NVMe) |
| `prefer_sequential_alloc` | true | bool | Allocate sequentially in the pool file |
Per-tenant tuning (via ControlService, existing)
| Parameter | Existing API | What it controls |
|---|---|---|
| `quota.capacity_bytes` | SetQuota | Tenant capacity ceiling |
| `quota.iops` | SetQuota | IOPS limit |
| `quota.metadata_ops_per_sec` | SetQuota | Metadata op rate limit |
| `dedup_policy` | CreateOrganization | Cross-tenant vs isolated dedup |
| `compliance_tags` | SetComplianceTags | Regulatory constraints |
Per-workload tuning (via ControlService + Advisory)
| Parameter | API | What it controls |
|---|---|---|
| `workload.quota` | CreateWorkload | Workload-level capacity/IOPS |
| `advisory.hints_per_sec` | Advisory ceilings | Hint submission rate |
| `advisory.prefetch_bytes_max` | Advisory ceilings | Prefetch budget |
| `advisory.profile` | Advisory profiles | Allowed hint profiles |
Observability API
ClusterStatus response
```proto
message ClusterStatus {
  uint32 node_count = 1;
  uint32 healthy_nodes = 2;
  uint64 total_capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 pool_count = 5;
  uint32 shard_count = 6;
  uint32 active_repairs = 7;
  uint32 evacuating_devices = 8;
  repeated PoolSummary pools = 9;
}
```
PoolStatus response
```proto
message PoolStatus {
  string pool_id = 1;
  PoolHealth health = 2;
  uint64 capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 device_count = 5;
  uint32 healthy_devices = 6;
  uint32 chunk_count = 7;
  // Performance metrics (rolling 60s window)
  double read_iops = 8;
  double write_iops = 9;
  double read_throughput_mb_s = 10;
  double write_throughput_mb_s = 11;
  double avg_read_latency_ms = 12;
  double avg_write_latency_ms = 13;
  double p99_read_latency_ms = 14;
  double p99_write_latency_ms = 15;
}
```
Streaming events
```proto
message DeviceHealthEvent {
  DeviceId device_id = 1;
  DeviceState old_state = 2;
  DeviceState new_state = 3;
  string reason = 4;
  uint64 timestamp_ms = 5;
}

message IOStatsEvent {
  string pool_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_throughput_mb_s = 4;
  double write_throughput_mb_s = 5;
  uint64 timestamp_ms = 6;
}
```
Admin personas and API mapping
| Persona | Typical actions | APIs used |
|---|---|---|
| Cluster admin | Add/remove nodes, set cluster params, view health | StorageAdminService (all), ClusterStatus |
| Storage admin | Create pools, tune EC, set thresholds, rebalance | Pool*, SetTuningParams, PoolStatus |
| Tenant admin | Set quotas, compliance, retention, advisory | ControlService (existing) |
| Workload admin | Tune advisory, prefetch, dedup hints | Advisory (existing) + workload quota |
| On-call/SRE | View health, trigger repair, check alerts | ClusterStatus, DeviceHealth stream, TriggerScrub |
CLI mapping (kiseki-cli)
```
kiseki cluster status                → ClusterStatus
kiseki pool list                     → ListPools
kiseki pool status fast-nvme         → PoolStatus
kiseki pool create --name bulk-hdd --class HddBulk --ec 8+3
kiseki pool tune fast-nvme --warning-pct 75 --target-fill 70
kiseki device list                   → ListDevices
kiseki device add /dev/nvme2n1 --pool fast-nvme
kiseki device evacuate dev-uuid      → EvacuateDevice
kiseki device health --watch         → DeviceHealth stream
kiseki tune set --compaction-rate 200 --gc-interval 120
kiseki shard list                    → ListShards
kiseki shard split shard-uuid        → SplitShard
kiseki repair scrub --pool fast-nvme
kiseki iostat --pool fast-nvme       → IOStats stream
```
Authorization model
| API | Who can call | Auth |
|---|---|---|
| StorageAdminService (all) | Cluster admin only | mTLS cert with admin OU |
| ControlService (tenant ops) | Tenant admin | mTLS cert with tenant OU |
| Advisory (workload ops) | Workload identity | mTLS cert + workflow token |
| Read-only observability | Cluster admin, SRE | mTLS cert with admin/sre OU |
Tenant admins cannot access StorageAdminService. They see their own quotas and compliance tags, not pool health or device state. This preserves the zero-trust boundary (I-T4).
Consequences
- Full API-first admin surface — no SSH-and-edit needed
- CLI, UI, automation all use the same gRPC APIs
- Performance tuning at four levels with inheritance
- Streaming observability for real-time monitoring
- Clear authorization boundary between cluster admin and tenant admin
- Significantly expands the gRPC surface (20+ new RPCs)
References
- ADR-024: Device management and capacity thresholds
- ADR-005: EC and chunk durability
- ADR-020: Workflow advisory (workload-level tuning)
- Ceph: `ceph osd pool set` command reference
- Lustre: `lctl set_param` tunables
- I-T4: Zero-trust infra/tenant boundary
Addendum: Adversarial Review Resolutions (2026-04-20)
C1: Per-tenant resource usage → ControlService, not StorageAdminService
Per-tenant resource usage (capacity, IOPS attribution) is exposed via ControlService with tenant-admin authorization, NOT via StorageAdminService. Cluster admin sees pool-level aggregates only. Tenant admin sees their own usage. This preserves I-T4.
```proto
// In ControlService (not StorageAdminService):
rpc GetTenantUsage(GetTenantUsageRequest) returns (TenantUsage);
// Requires tenant admin cert (mTLS OU = tenant ID)
```
C2: Per-device I/O stats added
```proto
rpc DeviceIOStats(DeviceIOStatsRequest) returns (stream DeviceIOStatsEvent);

message DeviceIOStatsEvent {
  string device_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_latency_p50_ms = 4;
  double read_latency_p99_ms = 5;
  double errors_per_sec = 6;
  uint64 timestamp_ms = 7;
}
```
C3: Shard health observability added
```proto
rpc GetShardHealth(GetShardHealthRequest) returns (ShardHealthInfo);

message ShardHealthInfo {
  string shard_id = 1;
  uint64 leader_node_id = 2;
  uint32 replica_count = 3;
  uint32 reachable_count = 4;
  uint32 recent_elections = 5;
  uint64 commit_lag_entries = 6;
}
```
C4: EC parameters immutable per pool
New invariant I-C6: EC parameters (data_chunks, parity_chunks) are
immutable per pool. SetPoolDurability applies only to NEW chunks.
Existing chunks retain their original EC configuration. Explicit
re-encoding via ReencodePool RPC (long-running, cancellable).
C5: Compaction rate validation
Protobuf-level validation: compaction_rate_mb_s ∈ [10, 1000].
API rejects values outside range. Audit event on every change.
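The range check is trivial but worth pinning down; a sketch of the server-side validation (function name and error shape are illustrative):

```rust
/// Validate compaction_rate_mb_s against its documented range [10, 1000];
/// out-of-range values are rejected at the API boundary before any apply.
pub fn validate_compaction_rate(mb_s: u32) -> Result<u32, String> {
    if (10..=1000).contains(&mb_s) {
        Ok(mb_s)
    } else {
        Err(format!("compaction_rate_mb_s {} outside [10, 1000]", mb_s))
    }
}
```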
C6: Inline threshold is prospective
New invariant I-L9: A delta’s inlined payload is immutable after
write. inline_threshold_bytes changes do NOT retroactively affect
existing deltas. Old and new thresholds coexist in the log.
C7: RemoveDevice requires evacuated state
New invariant I-D5: RemoveDevice rejects if device state is not
Removed (post-evacuation). Precondition: EvacuateDevice must
complete first. Error code: DEVICE_NOT_EVACUATED.
C8: Pool modifications audited to affected tenants
New invariant I-T4c: Cluster admin modifications to pools containing tenant data (SetPoolDurability, EvacuateDevice) are audit-logged to the affected tenant’s audit shard. Tenant admin can review.
C9: Tuning change audit trail
New invariant I-A6: All tuning parameter changes via SetTuningParams are recorded in the cluster audit shard with parameter name, old value, new value, timestamp, and admin identity.
H5: SRE roles defined
| Role | Access |
|---|---|
| `cluster-admin` | Full StorageAdminService (read + write) |
| `sre-on-call` | Read-only: List*, Get*, Status, Health streams |
| `sre-incident-response` | SRE + TriggerScrub, RepairChunk |
Enforced via mTLS certificate OU field.
M4: DrainNode added
rpc DrainNode(DrainNodeRequest) returns (stream DrainNodeProgress);
Internally evacuates all devices on the node, then removes them. Idempotent, safe to retry.
ADR-026: Raft Topology — Per-Shard on Fabric (Strategy A)
Status: Accepted. Date: 2026-04-20. Deciders: Architect + domain expert.
Context
Kiseki needs multi-node Raft for durability (I-L2) and failover. The cluster operates on a shared Slingshot fabric (200 Gbps per node) where control messages (Raft) and data (chunk I/O) share bandwidth.
Three strategies were evaluated:
- A: Raft per shard, all traffic on fabric
- B: Raft for metadata only, primary-copy for data (Ceph-like)
- C: Multi-Raft with batched transport (TiKV-like)
Decision
Strategy A: Raft per shard, on the data fabric.
Start with A, add C’s batching optimization when monitoring shows it’s needed (>1000 connections per node).
Why this works
Raft traffic is negligible compared to data fabric capacity:
| Scale | Shards | Groups/node | Heartbeat/node | Replication/node | % of 200 Gbps |
|---|---|---|---|---|---|
| 10 nodes | 100 | 30 | 78 KB/s | 3 MB/s | <0.001% |
| 100 nodes | 1000 | 30 | 78 KB/s | 3 MB/s | <0.001% |
| 1000 nodes | 10,000 | 30 | 78 KB/s | 3 MB/s | <0.001% |
Groups-per-node stays constant at ~30 because shard count scales with node count (each node hosts ~30 shard replicas regardless of cluster size).
Key insight: Raft only for metadata
Chunk data does NOT go through Raft. The write path:
```
Large write:
  Client → Gateway → encrypt → chunk to NVMe (EC direct) → delta to Raft (1 KB metadata)

Small write (<4 KB):
  Client → Gateway → encrypt → inline in delta → Raft only
```
Raft replicates delta metadata (~1KB per operation). Chunk ciphertext (64KB-64MB) is written directly to NVMe devices via EC. This means:
- Write throughput limited by NVMe/network, NOT by Raft
- Raft consensus adds ~30-60µs (RDMA) or ~75-250µs (TCP) per metadata op
- 50-100k metadata ops/sec per shard, shards in parallel
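The size-based fork in the write path reduces to one comparison against `inline_threshold_bytes` (default 4096 per ADR-025); the enum names below are illustrative, not the real data model:

```rust
#[derive(Debug, PartialEq)]
pub enum WritePath {
    /// Ciphertext rides inside the delta; Raft is the only persistence step.
    InlineInDelta { payload_len: usize },
    /// Ciphertext is EC-written to NVMe; only ~1 KB of delta metadata
    /// goes through Raft.
    ChunkThenDelta { payload_len: usize },
}

pub fn choose_write_path(payload_len: usize, inline_threshold: usize) -> WritePath {
    if payload_len < inline_threshold {
        WritePath::InlineInDelta { payload_len }
    } else {
        WritePath::ChunkThenDelta { payload_len }
    }
}
```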
Projected performance vs competition
| Metric | Kiseki (projected) | Lustre | Ceph | GPFS |
|---|---|---|---|---|
| Write GB/s /node | 25-40 | 5-12 | 1-3 | 5-15 |
| Read GB/s /node | 40-50 | 10-20 | 3-8 | 10-30 |
| Write latency | 30-250µs | 100-500µs | 500-2000µs | 100-300µs |
| Metadata IOPS /node | 1.5-3M | 50-100k | 10-50k | 200k |
Raft group configuration
| Raft group | Members | Where |
|---|---|---|
| Key manager | 3-5 | Dedicated keyserver nodes |
| Log shard (per shard) | 3 | Spread across storage nodes |
| Audit shard (per tenant) | 3 | Spread across storage nodes |
Placement rule: no two members of the same group on the same node (or same rack if rack-aware placement is configured).
Transport
| Phase | Transport | Optimization |
|---|---|---|
| Phase 1 (now) | TCP + TLS | Direct connections, one per Raft peer |
| Phase 2 (10+ nodes) | TCP + TLS + connection pooling | Reuse connections across groups |
| Phase 3 (100+ nodes) | Batched transport (Strategy C) | Coalesce heartbeats per node pair |
| Future | Slingshot CXI / RDMA | Sub-10µs Raft RTT |
Election storm mitigation
Correlated failure (rack power loss) causes simultaneous elections for all Raft groups on affected nodes (~30 groups per node × N nodes).
Mitigations:
- Randomized election timeouts: openraft already does this (150-300ms jitter)
- Staggered group startup: on node restart, groups start elections over a 5-second window (not all at once)
- Leader sticky: prefer re-electing the same leader if it recovers within the election timeout (avoids unnecessary leader changes)
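The staggered-startup mitigation can be sketched as a deterministic per-group offset into the 5-second window; hashing the shard id is one illustrative jitter source, not necessarily what the implementation uses:

```rust
// Sketch of staggered group startup: on node restart, each Raft group's
// election timer starts at an offset inside a 5-second window, so ~30
// groups on one node don't all start elections simultaneously.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

const STARTUP_WINDOW_MS: u64 = 5_000;

fn startup_delay(shard_id: u64) -> Duration {
    let mut h = DefaultHasher::new();
    shard_id.hash(&mut h);
    // Deterministic per-shard offset, uniform-ish over the window.
    Duration::from_millis(h.finish() % STARTUP_WINDOW_MS)
}

fn main() {
    for shard in 0..30u64 {
        let d = startup_delay(shard);
        assert!(d < Duration::from_millis(STARTUP_WINDOW_MS));
    }
}
```

This stacks with openraft's own 150-300ms randomized election timeout: the window spreads the start, the jitter spreads the retries.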
Network requirements
| Network | Purpose | Kiseki traffic |
|---|---|---|
| Data fabric (Slingshot/ethernet) | Chunk I/O + Raft | 99.99% data, 0.01% Raft |
| Management network (if available) | ControlService, monitoring | Optional: route Raft here to fully isolate |
Management network is NOT required. Raft on the fabric is fine because the overhead is <0.001% of capacity. If a management network exists (common in HPC), Raft CAN be routed there for belt-and-suspenders isolation, but it’s not necessary.
Consequences
- Simplest implementation: use openraft’s built-in TCP transport
- No separate management network required (but can use one)
- Scales to ~10k shards / 1000 nodes without transport optimization
- Add batching (Strategy C) as a pure transport optimization later
- Election storms during correlated failure are bounded by randomized timeouts
- Raft adds ~30-250µs to metadata write latency (acceptable for HPC)
Migration path
If Strategy A proves insufficient at extreme scale:
- Add batched transport (C) — pure transport change, no protocol change
- If even C is insufficient, partition shards into metadata-Raft and data-EC groups (B) — larger refactor but data model already supports it
References
- ADR-005: EC and chunk durability
- ADR-022: Storage backend (redb)
- ADR-024: Device management
- TiKV Multi-Raft: https://tikv.org/deep-dive/scalability/multi-raft/
- openraft: https://datafuselabs.github.io/openraft/
- Slingshot fabric: ~5-10µs RTT, 200 Gbps per endpoint
ADR-027: Single-Language Implementation — Rust Only
Status: Accepted (implemented 2026-04-21, Go code removed)
Date: 2026-04-20 (proposed), 2026-04-21 (accepted + migrated)
Context: Supersedes the implicit language split in docs/analysis/design-conversation.md §2.13. No prior ADR recorded the Rust/Go decision.
Context
Kiseki’s original design split the implementation across two languages:
- Rust for the core (log, chunks, views, native client, hot paths)
- Go for the control plane (tenancy, IAM, policy, flavor, federation, audit export, CLI) and one half each of two cross-cutting contexts (`kiseki-audit` + `control/pkg/audit`; `kiseki-advisory` + `control/pkg/advisory`)
- gRPC/protobuf as the boundary
The split was recorded in docs/analysis/design-conversation.md §2.13 but never promoted to an ADR. It surfaces in specs/architecture/module-graph.md (Go modules section), .claude/coding/go.md, and in two contexts that are currently split across both languages. The split pre-dates ADR-001 (pure-Rust, no Mochi/FFI), which already identified “FIPS compliance surface across two languages” as a cost.
At proposal time, 1,490 lines of Go business logic existed with 32/32 BDD
scenarios passing (godog, Strict:true). The migration ported all 32 scenarios
to cucumber-rs backed by a new kiseki-control Rust crate (~650 lines,
10 modules) before deleting the Go code. See
specs/implementation/adr027-go-to-rust-migration.md for the migration plan
and specs/findings/adr027-adversarial.md for the gate-1 review.
Decision
Implement Kiseki in Rust only. Retire the Go control plane, the Go CLI, and the Go halves of audit and advisory. Keep gRPC/protobuf as the wire boundary between the control plane and data plane so that a future non-Rust control plane remains possible.
Concretely:
- New Rust crates replace the Go packages one-for-one:
  - `kiseki-control` — control plane daemon (tenancy, IAM, policy, flavor, federation, audit export, discovery)
  - `kiseki-cli` — admin CLI
  - The `control/pkg/audit` half is absorbed into `kiseki-audit`
  - The `control/pkg/advisory` half is absorbed into `kiseki-advisory`
- gRPC/protobuf stays as the wire boundary. `kiseki-control` serves `ControlService`, `AuditExportService`, and policy endpoints over gRPC. `kiseki-server` consumes them as a client. No in-process shortcut across the boundary, even though both sides are now Rust.
- Architectural firewall is enforced by crate dependencies, not by language. `kiseki-control` and `kiseki-cli` depend only on `kiseki-common` and `kiseki-proto`. They MUST NOT depend on any data-path crate (`kiseki-log`, `kiseki-chunk`, `kiseki-composition`, `kiseki-view`, `kiseki-gateway-*`, `kiseki-client`, `kiseki-keymanager`). Enforced by a `cargo-deny` or workspace-level architectural lint at CI.
- Control plane binaries live alongside data-plane binaries in `crates/bin/`:
  - `bin/kiseki-control/` (new)
  - `bin/kiseki-cli/` (new)
- gRPC server framework: `tonic` (already the Rust-side choice). Config: `figment` or `config-rs` for layered YAML/env overrides (parity with Go's viper pattern).
- Federation / state machine: `kiseki-control` uses `openraft` (already the project's Raft choice per ADR-026) for replicated control-plane state (policy, opt-out state, tenant topology). This also eliminates the second Raft vendor that a Go control plane would have required (etcd client or dragonboat).
Rationale
One domain model
specs/ubiquitous-language.md defines Tenant, Org, Project, Workload, RetentionHold, Policy, Flavor, WorkflowRef, OperationAdvisory. Every one of these would otherwise need two implementations (Rust enums/structs + Go types). Two implementations drift: field renames, validation subtly different, invariant enforcement on one side only. Consolidating removes the class of bug where control-plane Go says a name is valid but data-path Rust rejects it (or vice versa).
One error taxonomy
specs/architecture/error-taxonomy.md enumerates retriable / permanent / security error categories. A Go implementation mirrors the Rust taxonomy as Go types + gRPC status mappings. One language means one thiserror-derived enum hierarchy and one mapping to tonic::Status.
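A minimal sketch of what a single taxonomy-to-status mapping looks like. The error variants are illustrative, and numeric gRPC status codes stand in for `tonic::Status` to keep the example dependency-free (in the real crate this would be a thiserror-derived hierarchy):

```rust
// One error taxonomy, one mapping to gRPC status codes — sketch only.
#[derive(Debug)]
enum KisekiError {
    ShardUnavailable,  // retriable
    InvalidTenantName, // permanent
    KekAccessDenied,   // security
}

/// Numeric gRPC status codes per the gRPC specification.
fn grpc_code(err: &KisekiError) -> u32 {
    match err {
        KisekiError::ShardUnavailable => 14, // UNAVAILABLE — client may retry
        KisekiError::InvalidTenantName => 3, // INVALID_ARGUMENT — permanent
        KisekiError::KekAccessDenied => 7,   // PERMISSION_DENIED — security
    }
}

fn main() {
    assert_eq!(grpc_code(&KisekiError::ShardUnavailable), 14);
    assert_eq!(grpc_code(&KisekiError::InvalidTenantName), 3);
    assert_eq!(grpc_code(&KisekiError::KekAccessDenied), 7);
}
```

With two languages this mapping exists twice and can disagree; with one, the match arm above is the only place a category-to-code decision lives.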
Smaller FIPS surface
ADR-001 already cited “FIPS compliance surface across two languages” as a reason to reject C/C++ FFI. The same cost applies to Go: either BoringCrypto (Go’s FIPS module) is part of the certification boundary, or the control plane sits outside the FIPS module boundary and the certification scope has to be carefully drawn. Rust-only gives one aws-lc-rs FIPS module boundary for the whole system.
Cross-context crates stop being split
kiseki-audit and kiseki-advisory are currently split across Rust and Go. That means two queue implementations, two filter implementations, two sets of integration tests, two ways that tenant-scope validation can drift. In Rust-only, each is one crate with one set of invariants.
Eliminated toolchain duplication
Today’s per-commit gate has to run: cargo fmt, clippy, cargo-deny, cargo test and go fmt, go vet, golangci-lint, go test -race. Rust-only halves the CI configuration, halves the local developer setup, and removes one supply-chain audit surface (Go module proxy + checksum DB alongside crates.io).
Reuse of kiseki-common and kiseki-proto
The CLI and control plane can import the real domain types rather than regenerated protobuf Go structs. Validation logic written once in kiseki-common (e.g., tenant-id parsing, flavor matching, policy inheritance) is reused verbatim in the control plane and the CLI.
Build-phase cost is low now
Phase 0 has not started. Adding two Rust crates (kiseki-control, kiseki-cli) is cheaper than maintaining a separate control/ Go module, its go.mod, its generated proto outputs, and its CI lane. The cost rises monotonically with every phase that ships Go code.
Hiring and cognitive load
Contributors need one language, one async runtime (tokio), one tracing stack, one error model. Code review crosses fewer idiom boundaries. Onboarding doc shrinks.
Alternatives considered
-
Keep Go as specified.
- Pro: Go’s ecosystem for control planes (cobra, viper, operator-sdk, client-go patterns) is the golden path; k8s, etcd, Consul all use it. GC is fine on cold paths. Operators extending the system are more likely to know Go.
- Pro: the language wall is the architectural wall — the Go control plane physically cannot reach into data-plane memory or internals.
- Con: every benefit above comes with the duplication, drift, and FIPS-surface costs enumerated in “Rationale”. With no code written, the ecosystem-maturity argument is weaker than at a later stage.
-
Port only the CLI to Rust, keep the Go control-plane daemon.
- Pro: preserves Go for the longer-lived daemon code where operator-sdk patterns matter most. Low churn.
- Con: doesn’t remove duplication for the split contexts (`audit`, `advisory`). Doesn’t shrink the FIPS surface. Doesn’t remove the second toolchain from CI. Half-measure.
-
Rewrite the core in Go (single-language Go).
- Rejected immediately: Go GC and lack of precise control over allocation and layout disqualify it from the hot data path at 200 Gbps per NIC. This inverts the original rationale for Rust in the core.
-
Separate Rust crate per Go package, but share no runtime (same-language boundary still isolated by process).
- Considered. Rejected: unnecessary. The isolation value of “separate OS process” is already provided by `kiseki-control` being a distinct binary. Running two daemons is orthogonal to the language question.
-
Defer the decision until after Phase 3.
- Rejected: the decision is cheapest to reverse now. Each build phase that ships Go code raises the cost of consolidation and lets duplication set in. The analyst already flagged the split without recording a decision; formalizing now is overdue.
Consequences
Positive
- Single toolchain: `cargo fmt`, `clippy`, `cargo-deny`, `cargo test`, `cargo audit`. Lefthook configuration shrinks.
- Single FIPS module boundary (aws-lc-rs).
- Domain types (`Tenant`, `Policy`, `RetentionHold`, `Flavor`, `WorkflowRef`, `OperationAdvisory`) exist once in `kiseki-common`.
- `kiseki-audit` and `kiseki-advisory` become whole crates rather than split halves. Their invariants (I-A1..I-A3, I-WA1..I-WA16) are enforced in one place.
- `kiseki-control` can reuse `openraft` (ADR-026) for its replicated state rather than requiring a second Raft implementation (etcd/dragonboat).
- No generated Go protobuf stubs to keep in sync; one generated tree under `crates/kiseki-proto/`.
- CI matrix shrinks; no `go test -race` lane.
Negative
- Loses the “language wall as architectural wall” property. Must be replaced with crate-graph enforcement (see “Enforcement” below). This is a discipline cost and must be tooled, not trusted.
- Rust’s CLI/operator ecosystem (`clap`, `tonic`, `figment`) is less mature than Go’s (cobra, viper, operator-sdk). Some patterns (admission webhooks, CRD controllers) will require more bespoke code if we ever grow a k8s operator.
- Contributors with Go-only platform experience face a higher barrier to writing control-plane extensions.
- `kiseki-control` uses `tokio` for async I/O and is exposed to async-Rust complexity on request handlers (cancellation safety, `'static` bounds) that Go handlers would not have had.
- One-time rewrite cost for the control-plane spec surface (`api-contracts.md`, `module-graph.md`, `.claude/coding/go.md` → remove or archive; `build-phases.md` may need to re-sequence control-plane phases).
Enforcement (replacing the language wall)
The split previously enforced “control plane never reaches into data plane” structurally. In Rust-only, this is enforced by:
- Crate-graph rule. `kiseki-control` and `kiseki-cli` depend only on `kiseki-common` and `kiseki-proto`. This is asserted by a CI check that greps Cargo manifests, or by `cargo-deny`’s `bans` section, or by a custom workspace lint.
- No re-export shortcut. `kiseki-common` MUST NOT re-export internal types from data-path crates. This is already the case; restated here as a rule.
- gRPC boundary preserved. Even though both sides are now Rust, control-plane-to-data-plane traffic still goes through `tonic` over gRPC, not through in-process trait calls. This keeps the wire contract as the source of truth and preserves the option of a non-Rust control plane later.
- Runtime separation. `kiseki-control` runs as its own binary (`bin/kiseki-control/`), not as a library linked into `kiseki-server`. The isolation that process separation provides is preserved.
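The crate-graph rule can be checked mechanically. A sketch of the manifest-grep variant of the CI check; the paths in the usage comment are assumptions, and the banned-crate list is taken from the Decision section:

```shell
#!/bin/sh
# Sketch of the manifest-grep check: fail CI if a control-plane
# Cargo.toml names any data-path crate. Illustrative, not the real check.
set -eu

BANNED='kiseki-log|kiseki-chunk|kiseki-composition|kiseki-view|kiseki-gateway-|kiseki-client|kiseki-keymanager'

check_manifest() {
    manifest="$1"
    if grep -Eq "$BANNED" "$manifest"; then
        echo "FORBIDDEN: $manifest depends on a data-path crate" >&2
        return 1
    fi
    echo "ok: $manifest"
}

# Usage in CI (paths assumed):
#   check_manifest crates/bin/kiseki-control/Cargo.toml
#   check_manifest crates/bin/kiseki-cli/Cargo.toml
```

A grep cannot see transitive dependencies, which is exactly the gap the open item below escalates to `cargo-deny` or a custom workspace lint.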
Migration
At proposal time no code existed, so migration was a spec update (the code migration itself is covered in specs/implementation/adr027-go-to-rust-migration.md):
- `docs/analysis/design-conversation.md` §2.13: annotate with a pointer to this ADR.
- `specs/architecture/module-graph.md`: delete the “Go modules (control plane)” section; add the new Rust crates (`kiseki-control`, `kiseki-cli`) and update the “Bounded context → module mapping” table to say Rust for every row.
- `specs/architecture/build-phases.md`: review phase sequencing — the Go control-plane phase collapses into a Rust phase; audit/advisory phases no longer have a “Go side” task.
- `.claude/CLAUDE.md` and `.claude/guidelines/go.md`: remove Go from the workflow router; keep `.claude/coding/go.md` archived (move to `specs/archive/` or delete) as a historical record.
- `.claude/coding/rust.md`: add a “control plane” section describing `kiseki-control`/`kiseki-cli` conventions (config with `figment`, CLI with `clap`, server with `tonic` + `axum` for any REST admin surface).
- `Makefile` (when it exists): drop Go lanes.
- `specs/features/control-plane.feature`: BDD scenarios remain; the step definitions move from `godog` to `cucumber-rs`.
Open items (escalated to adversary gate-1)
- Verify the crate-graph rule (control plane depends only on `kiseki-common`/`kiseki-proto`) is enforceable with `cargo-deny` alone, or whether a custom workspace lint is needed.
- Confirm `cucumber-rs` covers the Gherkin features that `godog` was planned to run, without step-definition regressions.
- Confirm FIPS posture: aws-lc-rs covers the control-plane’s TLS needs (mTLS CA, admin endpoints) as well as the data-plane’s. No Go BoringCrypto equivalent is needed.
- Verify that removing the Go language wall does not create a realistic path by which a control-plane code change accidentally links data-path crates. Propose a pre-merge check if manifest-grep is insufficient.
- Decide the fate of `control/pkg/discovery`: if fabric discovery uses libfabric/CXI, it was already going to need a Rust FFI layer; confirm the Rust-only home for it is `kiseki-control` (or a new `kiseki-discovery` crate).
References
- ADR-001: Pure Rust, No Mochi Dependency (FIPS surface precedent).
- ADR-021: Workflow Advisory Architecture (defines the Rust+Go split for advisory that this ADR collapses).
- ADR-026: Raft Topology — openraft is the Rust-side Raft; now also the control plane’s Raft.
- `docs/analysis/design-conversation.md` §2.13 — original (now superseded) language-split rationale.
- `specs/architecture/module-graph.md` — current two-language module layout (to be rewritten).
- `.claude/coding/go.md` — Go coding standards (to be archived on acceptance).
ADR-028: External Tenant KMS Providers
Status: Accepted
Date: 2026-04-22
Context: I-K11, ADR-002, ADR-003, ADR-007
Adversarial review: 2026-04-22 (8 findings: 2H 5M 1L, all resolved)
Problem
ADR-002 defines a two-layer encryption model where tenant KEKs wrap
access to system DEK derivation material. The current implementation
hardcodes tenant KEK as a locally-managed [u8; 32] — there is no
mechanism for tenants to bring their own key management infrastructure.
HPC and enterprise tenants require integration with their existing KMS:
- Regulatory compliance (FIPS 140-2/3, Common Criteria, SOC 2)
- Centralized key lifecycle management
- Hardware-backed key storage (HSMs)
- Audit trails in their own systems
- Key escrow and disaster recovery under their own policies
Decision
Introduce a TenantKmsProvider trait with five backend
implementations. Tenant KEK sourcing becomes pluggable per-tenant
via control-plane configuration. The system key manager (ADR-007)
remains unchanged — only the tenant KEK layer is externalized.
Provider Backends
| # | Backend | Type | Standard | Transport | Material model |
|---|---|---|---|---|---|
| 1 | Kiseki Internal | Built-in | — | In-process | Local |
| 2 | HashiCorp Vault | Open source | Proprietary (Transit) | HTTPS | Local (cached) |
| 3 | KMIP 2.1 | Standard | OASIS KMIP SP 800-57 | mTLS (TTLV) | Remote or local |
| 4 | AWS KMS | Cloud | AWS Sig V4 | HTTPS | Remote only |
| 5 | PKCS#11 v3.0 | HSM | OASIS PKCS#11 | Local (FFI) | Remote only (HSM) |
Material model: “Local” = KEK material cached in Kiseki process memory. “Remote” = material never leaves the provider; all wrap/unwrap operations are remote calls. The trait fully encapsulates this distinction — callers never branch on provider type.
Provider 1: Kiseki Internal (default)
The existing behavior. Kiseki manages tenant KEKs internally. Suitable for deployments where tenants trust the operator or where external KMS is unavailable.
- Tenant KEK generated internally on tenant creation
- Stored in a separate Raft group from system master keys (independent compromise domain — see Security Considerations §6)
- Rotation managed by Kiseki’s epoch mechanism
- No external dependency
This is the zero-configuration default. Existing tenants and single-operator deployments use this without change.
Security trade-off: Internal mode does not provide the full two-layer security guarantee of ADR-002. A compromise of both the system key manager and the tenant key store (even though they are separate Raft groups) yields full access. Compliance-sensitive tenants should use an external provider where the tenant KEK is under the tenant’s own operational control.
Provider 2: HashiCorp Vault (Transit secrets engine)
Vault’s Transit engine provides encryption-as-a-service with key versioning that maps cleanly to Kiseki’s epoch model.
Operations mapping:
| Kiseki operation | Vault API |
|---|---|
| `wrap` | `POST /transit/encrypt/:name` (with context = AAD) |
| `unwrap` | `POST /transit/decrypt/:name` (with context = AAD) |
| `rotate` | `POST /transit/keys/:name/rotate` |
| `rewrap` | `POST /transit/rewrap/:name` (server-side, no plaintext exposure) |
| `destroy` | `DELETE /transit/keys/:name` (after enabling deletion) |
Authentication methods (tenant-configurable):
- TLS certificate — maps to Kiseki’s SPIFFE/mTLS identity
- AppRole — role_id + secret_id for service authentication
- Kubernetes — ServiceAccount JWT (for k8s-deployed Kiseki)
- OIDC/JWT — external IdP token
Vault namespaces: Multi-tenant Vault deployments use namespaces to isolate tenant key material. The tenant’s Vault namespace is configured at onboarding.
Caching: Vault provider may optionally cache KEK material locally
(fetched via POST /transit/datakey/plaintext/:name). When caching is
disabled, all wrap/unwrap calls go through Vault directly. Caching mode
is configurable per tenant.
Rust crate: vaultrs (maintained, async, supports Transit engine).
Provider 3: KMIP 2.1 (OASIS standard)
KMIP is the interoperability standard for enterprise key management. A single KMIP client covers: Thales CipherTrust Manager, IBM Security Guardium Key Lifecycle Manager, Fortanix SDKMS, Entrust KeyControl, NetApp StorageGRID KMS, Dell PowerProtect, and any KMIP-compliant HSM.
Relevant OASIS specifications:
- KMIP Specification v2.1 (2019) — protocol and operations
- KMIP Profiles v2.1 — conformance levels
- KMIP Usage Guide v2.1 — implementation guidance
Operations mapping:
| Kiseki operation | KMIP operation |
|---|---|
| `wrap` | Encrypt with Correlation Value (AAD) |
| `unwrap` | Decrypt with Correlation Value (AAD) |
| `rotate` | ReKey or Create + Activate + Revoke old |
| `destroy` (crypto-shred) | Destroy (state → Destroyed, irrecoverable) |
Transport: TTLV (Tag-Type-Length-Value) binary encoding over mTLS. The KMIP spec mandates mutual TLS with X.509 certificates.
Key object attributes: KMIP keys carry rich metadata —
Cryptographic Algorithm, Cryptographic Length, State
(Pre-Active/Active/Deactivated/Compromised/Destroyed),
Activation Date, Deactivation Date. These map to Kiseki’s
EpochInfo (is_current, migration_complete).
Material model: Depends on KMIP server configuration. Some servers
allow Get to extract key material (local caching). Others enforce
non-extractable keys (remote-only wrap/unwrap). The provider detects
this via CKA_EXTRACTABLE equivalent attribute and adapts.
Rust implementation: No mature KMIP crate exists. Implement a minimal KMIP client covering the Symmetric Key Foundry Client profile (KMIP Profiles v2.1 §4.1). The wire format (TTLV) is straightforward — ~1500 lines for the operations Kiseki needs.
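To make the size estimate concrete: a TTLV item is a 3-byte tag, a 1-byte item type, a 4-byte big-endian length, and a value padded to an 8-byte boundary. A sketch of an encoder for the Integer item type (the tag value in the usage is illustrative, not a claim about any specific KMIP attribute):

```rust
// Minimal sketch of KMIP TTLV encoding — not a conformant client.
// Layout: tag (3 bytes) | type (1 byte) | length (4 bytes BE) | padded value.

fn ttlv_integer(tag: u32, value: i32) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&tag.to_be_bytes()[1..4]); // low 3 bytes of the tag
    out.push(0x02);                                  // item type: Integer
    out.extend_from_slice(&4u32.to_be_bytes());      // length of the raw value
    out.extend_from_slice(&value.to_be_bytes());     // 4-byte value
    out.extend_from_slice(&[0u8; 4]);                // pad value to 8-byte boundary
    out
}

fn main() {
    let item = ttlv_integer(0x42002A, 256); // example tag
    assert_eq!(item.len(), 16);             // 8-byte header + 8-byte padded value
    assert_eq!(&item[..4], &[0x42, 0x00, 0x2A, 0x02]);
    assert_eq!(&item[4..8], &[0, 0, 0, 4]);
}
```

Each of the operations Kiseki needs (Encrypt, Decrypt, ReKey, Destroy) is a small structure of such items over mTLS, which is why ~1500 lines is a plausible budget.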
Provider 4: AWS KMS (cloud KMS exemplar)
AWS KMS as the reference cloud implementation. Azure Key Vault and GCP Cloud KMS follow the same adapter pattern.
Operations mapping:
| Kiseki operation | AWS KMS API |
|---|---|
| `wrap` | Encrypt (with EncryptionContext = AAD) |
| `unwrap` | Decrypt (with EncryptionContext = AAD) |
| `rotate` | CreateKey + CreateAlias (manual) or EnableKeyRotation (automatic annual) |
| `rewrap` | ReEncrypt (server-side, no plaintext exposure) |
Key difference: With cloud KMS, the KEK material never leaves the cloud provider. Kiseki sends the derivation parameters (epoch + chunk_id) to KMS for wrapping/unwrapping. This is strictly more secure than local caching but adds network latency per operation.
Caching strategy: Kiseki caches the unwrapped derivation
parameters (not the KEK itself, which never leaves KMS). The
existing KeyCache TTL mechanism applies — after TTL expiry, a
new Decrypt call to KMS is required.
Auth: IAM role assumption via STS, instance metadata, or environment credentials. For Azure: AAD/Managed Identity. For GCP: service account key or Workload Identity.
Rust crates: aws-sdk-kms, azure_security_keyvault,
google-cloud-kms (all maintained, async).
Provider 5: PKCS#11 v3.0 (HSM direct)
For tenants with on-premises HSMs (Thales Luna, Utimaco, nCipher, YubiHSM). PKCS#11 is the standard C API for cryptographic tokens.
Relevant standards:
- OASIS PKCS#11 v3.0 (2020) — Cryptographic Token Interface
- PKCS#11 Profiles v3.0 — baseline/extended profiles
Operations mapping:
| Kiseki operation | PKCS#11 function |
|---|---|
| `wrap` | `C_WrapKey` (AES-KWP per RFC 5649, with `pParameter` = AAD) |
| `unwrap` | `C_UnwrapKey` |
| `rotate` | `C_GenerateKey` + `C_DestroyObject` (old, after migration) |
| `destroy` | `C_DestroyObject` |
Material model: Remote only. HSM keys are CKA_SENSITIVE and
CKA_EXTRACTABLE=FALSE by default — material never leaves the HSM.
All wrap/unwrap operations execute on the HSM hardware. Kiseki caches
unwrapped derivation parameters (same as cloud KMS model).
Transport: Local — PKCS#11 is a C shared library (.so/.dylib)
loaded via FFI. The HSM may be network-attached (e.g., Luna Network
HSM), but the PKCS#11 interface is local to the host.
Rust crate: cryptoki (maintained, wraps PKCS#11 C API).
Trait Interface
```rust
/// Provider for tenant key encryption keys (KEKs).
///
/// Each tenant configures exactly one provider. The provider handles
/// authentication, key lifecycle, and wrapping/unwrapping operations.
/// The trait fully encapsulates the provider's material model — callers
/// never need to know whether wrapping happens locally or remotely.
///
/// Providers that cache KEK material locally (Internal, Vault) manage
/// their own cache internally. Providers where material never leaves
/// the backend (AWS KMS, PKCS#11) perform remote wrap/unwrap calls.
/// The caller's code path is identical in both cases.
#[async_trait]
pub trait TenantKmsProvider: Send + Sync {
    /// Wrap DEK derivation parameters (epoch + chunk_id) with the
    /// tenant KEK. The `aad` binds the wrapped ciphertext to its
    /// envelope context (typically chunk_id), preventing splice attacks.
    /// Returns opaque ciphertext stored in the envelope.
    async fn wrap(
        &self,
        tenant: &OrgId,
        plaintext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Unwrap DEK derivation parameters from envelope ciphertext.
    /// The `aad` must match the value used during wrapping.
    async fn unwrap(
        &self,
        tenant: &OrgId,
        ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Zeroizing<Vec<u8>>, KmsProviderError>;

    /// Rotate the tenant KEK to a new version/epoch.
    /// Returns the new provider-specific epoch identifier.
    async fn rotate(
        &self,
        tenant: &OrgId,
    ) -> Result<KmsEpochId, KmsProviderError>;

    /// Re-wrap ciphertext from old key version to current version
    /// without exposing plaintext (server-side re-wrap where supported).
    /// Falls back to unwrap + wrap if the provider doesn't support
    /// server-side re-wrap. The `aad` is preserved across the re-wrap.
    async fn rewrap(
        &self,
        tenant: &OrgId,
        old_ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Destroy the tenant KEK (crypto-shred). Irrecoverable.
    /// Also purges any locally cached material for this tenant.
    async fn destroy(
        &self,
        tenant: &OrgId,
    ) -> Result<(), KmsProviderError>;

    /// Check provider health and connectivity.
    async fn health(&self) -> KmsHealthStatus;

    /// Provider name for logging and diagnostics (never includes
    /// credentials or key material).
    fn provider_name(&self) -> &'static str;
}
```
AAD usage: Callers pass `chunk_id.as_bytes()` as `aad` for per-chunk envelope wrapping. Each provider maps `aad` to its native authenticated-context mechanism:
| Provider | AAD mechanism |
|---|---|
| Internal | AES-256-GCM additional data (existing "kiseki-tenant-wrap-v1" prefix + aad) |
| Vault | Transit context parameter (base64-encoded) |
| KMIP | Correlation Value attribute on Encrypt/Decrypt |
| AWS KMS | EncryptionContext key-value map ({"chunk_id": "<hex>"}) |
| PKCS#11 | pParameter field in mechanism struct |
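The splice-prevention role of `aad` can be seen in a toy model: a blob wrapped under one chunk_id must fail to unwrap under another. The "MAC" below is std's `DefaultHasher`, which is NOT cryptography; it only illustrates the binding that AES-GCM additional data, EncryptionContext, or Correlation Value provides for real:

```rust
// Toy demonstration of AAD binding — illustrative only, not crypto.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn tag(key: &[u8], aad: &[u8], pt: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    (key, aad, pt).hash(&mut h);
    h.finish()
}

// Toy wrap: "ciphertext" is plaintext plus an aad-bound tag.
fn wrap(key: &[u8], pt: &[u8], aad: &[u8]) -> Vec<u8> {
    let mut ct = pt.to_vec();
    ct.extend_from_slice(&tag(key, aad, pt).to_be_bytes());
    ct
}

// Unwrap succeeds only if the same aad is presented.
fn unwrap(key: &[u8], ct: &[u8], aad: &[u8]) -> Option<Vec<u8>> {
    let (pt, t) = ct.split_at(ct.len() - 8);
    (tag(key, aad, pt).to_be_bytes().as_slice() == t).then(|| pt.to_vec())
}

fn main() {
    let kek = b"tenant-kek";
    let params = b"epoch=7";
    let wrapped = wrap(kek, params, b"chunk-001");
    // Same AAD: unwrap succeeds.
    assert_eq!(unwrap(kek, &wrapped, b"chunk-001").as_deref(), Some(&params[..]));
    // Spliced onto another chunk's envelope: unwrap fails.
    assert!(unwrap(kek, &wrapped, b"chunk-999").is_none());
}
```

This is why the trait requires the same `aad` on `unwrap` and preserves it across `rewrap`: the wrapped blob is useless outside its original envelope.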
Tenant Configuration
Stored in the control plane (kiseki-control) per-tenant:
```rust
pub struct TenantKmsConfig {
    /// Provider type.
    pub provider: KmsProviderType,
    /// Provider-specific endpoint (URL, socket path, or "internal").
    pub endpoint: String,
    /// Authentication configuration. All secret fields use Zeroizing
    /// wrappers and implement Debug redaction (I-K8 extended).
    pub auth: KmsAuthConfig,
    /// Key identifier within the provider.
    pub key_name: String,
    /// Provider namespace (Vault namespace, KMIP group, KMS alias prefix).
    pub namespace: Option<String>,
    /// Cache TTL override (bounded by I-K15: 5s-300s).
    pub cache_ttl_secs: Option<u64>,
}

pub enum KmsProviderType {
    Internal,
    Vault,
    Kmip,
    AwsKms,
    AzureKeyVault,
    GcpCloudKms,
    Pkcs11,
}

/// Authentication configuration for external KMS providers.
///
/// All secret fields use `Zeroizing<String>` for automatic memory
/// clearing on drop. The `Debug` impl prints variant names only —
/// never credential contents (I-K8 extended to provider credentials).
pub enum KmsAuthConfig {
    /// Internal provider — no external auth needed.
    None,
    /// mTLS client certificate (KMIP, Vault TLS auth).
    TlsCert {
        cert_pem: String,
        key_pem: Zeroizing<String>,
    },
    /// Vault AppRole.
    AppRole {
        role_id: String,
        secret_id: Zeroizing<String>,
    },
    /// OIDC/JWT token (Vault, cloud providers).
    Oidc {
        token_endpoint: String,
        client_id: String,
    },
    /// AWS IAM role assumption.
    AwsIamRole {
        role_arn: String,
        region: String,
    },
    /// Azure Managed Identity or Service Principal.
    AzureIdentity {
        tenant_id: String,
        client_id: String,
    },
    /// GCP Service Account.
    GcpServiceAccount {
        credentials_json: Zeroizing<String>,
    },
    /// PKCS#11 library path + slot/pin.
    Pkcs11 {
        library_path: String,
        slot_id: u64,
        pin: Zeroizing<String>,
    },
}
```
I-K8 extended: KmsAuthConfig implements Debug with redaction:
```rust
impl fmt::Debug for KmsAuthConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::None => write!(f, "KmsAuthConfig::None"),
            Self::TlsCert { .. } => write!(f, "KmsAuthConfig::TlsCert(***)"),
            Self::AppRole { role_id, .. } => write!(f, "KmsAuthConfig::AppRole({})", role_id),
            // ... all variants redact secret fields
        }
    }
}
```
Caching and Fallback
The existing KeyCache (cache.rs) is reused for providers with local
material. Remote-only providers (AWS KMS, PKCS#11) cache unwrapped
derivation parameters instead.
| Provider | What is cached | Cache miss action |
|---|---|---|
| Internal | KEK material (32 bytes) | Fetch from tenant key Raft store |
| Vault | KEK material or nothing (configurable) | POST /transit/decrypt |
| KMIP | KEK material or nothing (depends on server) | Encrypt/Decrypt operation |
| AWS KMS | Unwrapped derivation params | Decrypt API call |
| PKCS#11 | Unwrapped derivation params | C_UnwrapKey |
I-K15 applies: Cache TTL bounded to [5s, 300s] regardless of provider. This ensures crypto-shred takes effect within the TTL window even if the external KMS is ahead of Kiseki’s cache.
Provider unavailability:
- Within TTL window: cached material serves reads (degraded mode)
- Beyond TTL: reads fail with `TenantKekUnavailable` (retriable)
- Writes always require fresh validation (no stale-cache writes)
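The I-K15 bound can be sketched as a clamp applied to the tenant's configured `cache_ttl_secs`; the 60-second default used when no override is set is an assumption for illustration:

```rust
// Sketch of the I-K15 clamp: whatever the tenant configures, the
// effective cache TTL lands in [5s, 300s], so crypto-shred takes
// effect within a bounded window. Default value is assumed.
use std::time::Duration;

const TTL_MIN: Duration = Duration::from_secs(5);
const TTL_MAX: Duration = Duration::from_secs(300);

fn effective_ttl(configured_secs: Option<u64>) -> Duration {
    match configured_secs {
        None => Duration::from_secs(60), // assumed default
        Some(s) => Duration::from_secs(s).clamp(TTL_MIN, TTL_MAX),
    }
}

fn main() {
    assert_eq!(effective_ttl(Some(1)), Duration::from_secs(5));      // floored
    assert_eq!(effective_ttl(Some(3600)), Duration::from_secs(300)); // capped
    assert_eq!(effective_ttl(Some(120)), Duration::from_secs(120));  // in range
}
```

Clamping at configuration time rather than at cache-insert time means a misconfigured tenant can never widen the shred window, no matter which provider backs the cache.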
Resilience (adversarial finding #5):
- Circuit breaker per provider endpoint: open after 5 consecutive failures/timeouts, half-open probe every 30s
- Jittered cache TTL: actual TTL = configured TTL ± 10% (random) to prevent synchronized expiry across storage nodes
- Concurrency limit: max 10 concurrent KMS requests per tenant per storage node (backpressure, not queuing)
- Timeout bounds: 2s connect timeout, 5s operation timeout for all network-based providers
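The per-endpoint breaker above can be sketched with std alone; the thresholds mirror the stated numbers (5 consecutive failures, 30s probe), everything else is illustrative:

```rust
// Sketch of the per-provider-endpoint circuit breaker.
use std::time::{Duration, Instant};

const FAILURE_THRESHOLD: u32 = 5;
const PROBE_INTERVAL: Duration = Duration::from_secs(30);

struct Breaker {
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new() -> Self { Breaker { consecutive_failures: 0, opened_at: None } }

    /// May a request be issued right now?
    fn allow(&self, now: Instant) -> bool {
        match self.opened_at {
            None => true,                                       // closed
            Some(t) => now.duration_since(t) >= PROBE_INTERVAL, // half-open probe
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None; // close the breaker
    }

    fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= FAILURE_THRESHOLD {
            self.opened_at = Some(now); // open: stop hammering the KMS
        }
    }
}

fn main() {
    let mut b = Breaker::new();
    let t0 = Instant::now();
    for _ in 0..5 { b.record_failure(t0); }
    assert!(!b.allow(t0));                 // open: requests rejected
    assert!(b.allow(t0 + PROBE_INTERVAL)); // half-open: probe allowed
    b.record_success();
    assert!(b.allow(t0 + PROBE_INTERVAL)); // closed again
}
```

While the breaker is open, cached material keeps serving reads inside the TTL window, so a flapping KMS degrades gracefully instead of stalling every request.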
I-K11 unchanged: Kiseki provides no escrow. If the tenant loses access to their external KMS and has no backup, their data is unrecoverable. This is documented and accepted.
Provider Migration
Changing a tenant’s KMS provider (e.g., Internal → Vault) requires re-wrapping all existing envelopes (adversarial finding #3):
- Provision a new KEK in the target provider
- Configure the new provider as “pending” in the control plane
- Background re-wrap: for each envelope, `old_provider.unwrap()` → `new_provider.wrap()` with the same AAD
- Track progress (same mechanism as epoch re-wrap: `RewrapProgress`)
- Once 100% re-wrapped, atomically switch the active provider
- Decommission the old provider KEK
During migration, reads use whichever provider matches the envelope’s
tenant_epoch. The envelope carries a provider-version tag to
disambiguate.
Constraint: Provider migration is an operator-initiated, audited action. It cannot be triggered by the tenant API alone.
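The background re-wrap step can be sketched as a loop over envelopes. Providers are modeled as plain functions here rather than the async TenantKmsProvider trait, and the XOR "providers" in the usage are toys, not cryptography:

```rust
// Sketch of the background re-wrap loop: unwrap with the old provider,
// wrap with the new one, preserving each envelope's AAD, with progress
// tracked so the active-provider switch happens only at 100%.

struct RewrapProgress { done: usize, total: usize }

fn rewrap_all(
    envelopes: &mut [(Vec<u8>, Vec<u8>)],          // (ciphertext, aad) pairs
    old_unwrap: impl Fn(&[u8], &[u8]) -> Vec<u8>,
    new_wrap: impl Fn(&[u8], &[u8]) -> Vec<u8>,
) -> RewrapProgress {
    let total = envelopes.len();
    let mut done = 0;
    for (ct, aad) in envelopes.iter_mut() {
        let pt = old_unwrap(ct, aad);
        *ct = new_wrap(&pt, aad); // same AAD preserved across the re-wrap
        done += 1;
    }
    RewrapProgress { done, total }
}

fn main() {
    // Toy "providers": old XORs with 0x11, new with 0x22 (not crypto).
    let old = |ct: &[u8], _aad: &[u8]| ct.iter().map(|b| b ^ 0x11).collect::<Vec<u8>>();
    let new = |pt: &[u8], _aad: &[u8]| pt.iter().map(|b| b ^ 0x22).collect::<Vec<u8>>();

    let mut envs = vec![(vec![0x10 ^ 0x11], b"chunk-1".to_vec())];
    let p = rewrap_all(&mut envs, old, new);
    assert_eq!((p.done, p.total), (1, 1));
    assert_eq!(envs[0].0, vec![0x10 ^ 0x22]); // now wrapped by the new provider
}
```

In the real system this loop is resumable and audited, and reads during the migration dispatch on the envelope's provider-version tag rather than assuming the loop has finished.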
Crypto-Shred Interaction
Crypto-shred (tenant KEK destruction) behavior per provider:
| Provider | Crypto-shred mechanism |
|---|---|
| Internal | Delete KEK from tenant key store; purge cache |
| Vault | POST /transit/keys/:name/config with deletion_allowed=true, then DELETE /transit/keys/:name |
| KMIP | Destroy operation (state → Destroyed, irrecoverable) |
| AWS KMS | DisableKey (immediate, blocks all operations) + ScheduleKeyDeletion (permanent, 7-30 day window) |
| PKCS#11 | C_DestroyObject |
AWS KMS: DisableKey is called immediately on crypto-shred to
block all wrap/unwrap operations. ScheduleKeyDeletion follows for
permanent destruction. The 7-day AWS-enforced waiting period applies
to permanent deletion only — the key is operationally dead from the
moment DisableKey is called. The health() check reports
supports_immediate_shred: true (via DisableKey) so tenants can
verify crypto-shred SLA compliance at configuration time.
Security Considerations
-
Credential protection: KMS auth credentials stored in the control plane are encrypted at rest with the system master key. All secret fields use `Zeroizing<String>` for memory protection. `Debug` implementations redact all credential content (I-K8 extended). Credentials are excluded from core dumps via `MADV_DONTDUMP` on the containing allocation.
Network isolation: External KMS calls are made from storage nodes, not the control plane. This avoids routing tenant data through the control plane. mTLS is required for all network-based providers.
-
Provider compromise: If a tenant’s external KMS is compromised, only that tenant’s data is at risk. System master keys and other tenants are unaffected (tenant isolation, I-T3).
-
Mixed providers: Different tenants can use different providers. A single Kiseki cluster can serve tenants using Vault, AWS KMS, and internal management simultaneously.
-
FIPS compliance: The HKDF derivation and AES-256-GCM encryption remain on Kiseki’s FIPS-validated aws-lc-rs module regardless of provider. The external KMS only handles the tenant KEK wrapping layer — the system encryption layer is always FIPS.
-
Internal provider isolation: Tenant KEKs in Internal mode are stored in a separate Raft group from system master keys. This provides an independent compromise domain — system key manager compromise alone does not yield tenant KEKs, and vice versa. However, an operator with access to both stores has full access. Compliance-sensitive tenants should use an external provider where the KEK is under their own operational control.
Implementation Phases
- Phase K1: `TenantKmsProvider` trait + Internal backend (refactor current code to use the trait; no behavioral change)
- Phase K2: Vault backend (Transit engine, `vaultrs` crate)
- Phase K3: KMIP 2.1 backend (custom TTLV client, ~1500 lines)
- Phase K4: AWS KMS backend (`aws-sdk-kms` crate)
- Phase K5: PKCS#11 backend (`cryptoki` crate)
Phases K2-K5 are independent and can be built in any order.
Alternatives Considered
- BYOK (Bring Your Own Key) upload model: Tenant uploads raw key material to Kiseki. Rejected — defeats the purpose of external KMS (key material leaves the tenant’s control boundary).
- Single cloud KMS only: Support only AWS KMS. Rejected — HPC customers are frequently on-premises or multi-cloud.
- KMIP only: Use KMIP as the sole external standard. Rejected — Vault and cloud KMS are too prevalent to ignore, and KMIP client implementation cost is non-trivial.
- No internal provider: Require all tenants to configure external KMS. Rejected — creates unnecessary deployment friction for simple or single-operator clusters.
- `fetch_kek` in trait interface: The original design included `fetch_kek() -> Option<TenantKekMaterial>`, with `None` for cloud providers. Rejected after adversarial review — a leaky abstraction that forces callers to branch on the provider model. `wrap`/`unwrap` as the universal interface fully encapsulates the distinction.
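To make the rejected alternative concrete, here is a minimal sketch of a wrap/unwrap-only provider interface with a toy in-process implementation. Names and error types are illustrative (the real trait is async and provider configs are richer); the point is that callers never branch on the provider’s key model.

```rust
/// Illustrative sketch of the wrap/unwrap-only provider interface.
/// Hypothetical names; the real trait is async and returns richer errors.
pub trait TenantKmsProvider {
    /// Wrap key material under the provider-held key.
    /// `aad` binds the ciphertext to its context (finding #4).
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Result<Vec<u8>, String>;
    /// Unwrap previously wrapped material. Must be given the same AAD.
    fn unwrap(&self, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy in-process provider used only to show that wrap/unwrap fully
/// encapsulates the key model — callers never see raw provider keys.
pub struct XorProvider {
    pub key: u8,
}

impl TenantKmsProvider for XorProvider {
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Result<Vec<u8>, String> {
        let mut out: Vec<u8> = plaintext.iter().map(|b| b ^ self.key).collect();
        out.extend_from_slice(aad); // naive AAD binding, sketch only
        Ok(out)
    }

    fn unwrap(&self, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, String> {
        let split = wrapped.len().checked_sub(aad.len()).ok_or("too short")?;
        if &wrapped[split..] != aad {
            return Err("AAD mismatch".into());
        }
        Ok(wrapped[..split].iter().map(|b| b ^ self.key).collect())
    }
}
```

Whether the backend holds the KEK locally (Internal) or never releases it (cloud KMS, HSM) is invisible behind this pair of calls.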
Adversarial Review Findings (2026-04-22)
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | High | Credential fields as plaintext String | Zeroizing<String> + Debug redaction |
| 2 | High | fetch_kek leaky abstraction | Removed; wrap/unwrap are universal |
| 3 | Medium | No provider migration path | Migration protocol documented |
| 4 | Medium | No AAD in wrap/unwrap | aad: &[u8] parameter added |
| 5 | Medium | No rate limiting/circuit breaker | Circuit breaker + jitter + limits specified |
| 6 | Medium | PKCS#11 C_GetAttributeValue violates HSM model | Removed; HSM uses C_WrapKey/C_UnwrapKey only |
| 7 | Medium | Internal KEK co-located with system keys | Separate Raft group for tenant KEKs |
| 8 | Low | AWS KMS 7-day deletion window | DisableKey immediate + ScheduleKeyDeletion deferred |
Consequences
- Adds `kiseki-kms` crate (or module within `kiseki-keymanager`)
- Tenant key Raft group added (separate from system key manager)
- Control plane gains KMS configuration endpoints
- Each storage node needs network access to tenant KMS endpoints
- KMIP requires a custom protocol implementation (~1500 lines)
- PKCS#11 requires unsafe FFI (contained within the cryptoki crate)
- Testing requires mock KMS servers (Vault dev mode, LocalStack, SoftHSM for PKCS#11)
ADR-029: Raw Block Device Allocator
Status: Accepted
Date: 2026-04-22
Adversarial review: 2026-04-22 (8 findings: 2H 4M 2L, all resolved)
Context: ADR-022, ADR-024, ADR-005, I-C1 through I-C6
Problem
Chunk ciphertext needs to persist on JBOD data devices. ADR-024 specifies XFS on each device as the default, but filesystem overhead becomes the bottleneck at HPC scale:
- Double journaling: XFS journals its metadata, then redb journals ours — redundant durability cost
- Page cache pollution: OS caches data we already manage in our own cache layer, wasting DRAM
- Inode contention: Billions of chunks = billions of inodes; XFS metadata operations become the throughput ceiling
- Indirection: Every I/O traverses VFS → XFS → block layer → device; raw access removes two layers
Ceph’s migration from FileStore (XFS) to BlueStore (raw block) was driven by exactly these issues. DAOS uses SPDK for the same reason.
Decision
New crate: kiseki-block
A device I/O crate that manages raw block devices (and file-backed
fallback for VMs/CI). Separate from kiseki-chunk (domain logic).
kiseki-chunk depends on kiseki-block for storage.
Device Backend Trait
```rust
/// Abstraction over a storage device — raw block or file-backed.
/// Auto-detects device characteristics and adapts I/O strategy.
#[async_trait]
pub trait DeviceBackend: Send + Sync {
    /// Allocate a contiguous extent of at least `size` bytes.
    /// Alignment matches the device's physical block size.
    fn alloc(&self, size: u64) -> Result<Extent, AllocError>;

    /// Write data at the given extent.
    fn write(&self, extent: &Extent, data: &[u8]) -> Result<(), BlockError>;

    /// Read data from the given extent.
    fn read(&self, extent: &Extent) -> Result<Vec<u8>, BlockError>;

    /// Free an extent, returning blocks to the free pool.
    fn free(&self, extent: &Extent) -> Result<(), AllocError>;

    /// Sync all pending writes to stable storage.
    fn sync(&self) -> Result<(), BlockError>;

    /// Device capacity: (used_bytes, total_bytes).
    fn capacity(&self) -> (u64, u64);

    /// Probed device characteristics (read-only after open).
    fn characteristics(&self) -> &DeviceCharacteristics;
}
```
Auto-detection (no manual configuration)
On `DeviceManager::open(path)`, probe sysfs (Linux):

```
/sys/block/<dev>/queue/rotational          → 0 (SSD/NVMe) or 1 (HDD)
/sys/block/<dev>/queue/physical_block_size → 512 or 4096
/sys/block/<dev>/queue/optimal_io_size     → device-preferred I/O size
/sys/block/<dev>/queue/max_hw_sectors_kb   → max single I/O size
/sys/block/<dev>/device/model              → model string
/sys/block/<dev>/device/numa_node          → NUMA node (-1 if none)
/sys/block/<dev>/queue/discard_max_bytes   → TRIM support (>0 = yes)
```
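The probe logic reduces to parsing these sysfs attribute files. A sketch of the parsing helpers, written against raw file contents so they are testable without real hardware (function names are hypothetical):

```rust
/// Hypothetical parsing helpers for the sysfs attributes listed above.
/// The real probe reads these files under /sys/block/<dev>/; here the
/// raw file contents are passed in so the logic runs anywhere.

/// queue/rotational: "1\n" = spinning disk, "0\n" = SSD/NVMe.
pub fn parse_rotational(raw: &str) -> Option<bool> {
    match raw.trim() {
        "0" => Some(false),
        "1" => Some(true),
        _ => None,
    }
}

/// device/numa_node: "-1\n" means no NUMA affinity reported.
pub fn parse_numa_node(raw: &str) -> Option<u32> {
    let n: i64 = raw.trim().parse().ok()?;
    if n < 0 { None } else { Some(n as u32) }
}

/// queue/discard_max_bytes: any value > 0 means the device accepts TRIM.
pub fn parse_trim_support(raw: &str) -> bool {
    raw.trim().parse::<u64>().map(|v| v > 0).unwrap_or(false)
}
```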
Derived properties:
```rust
pub struct DeviceCharacteristics {
    pub medium: DetectedMedium,
    pub physical_block_size: u32,
    pub optimal_io_size: u32,
    pub rotational: bool,
    pub numa_node: Option<u32>,
    pub supports_trim: bool,
    pub supports_smart: bool,
    pub io_strategy: IoStrategy,
}

pub enum DetectedMedium {
    NvmeSsd, // /sys/block/nvme*/ + rotational=0
    SataSsd, // rotational=0, not NVMe
    Hdd,     // rotational=1
    Virtual, // virtio in model, no SMART
    Unknown,
}

pub enum IoStrategy {
    DirectAligned,      // O_DIRECT | O_DSYNC — NVMe, SATA SSD
    BufferedSequential, // O_SYNC — HDD (readahead benefits)
    FileBacked,         // Default flags — VM, dev, CI
}
```
For non-Linux hosts and VMs without sysfs: detect `virtio` in the model string, or the absence of block device properties, and fall back to `IoStrategy::FileBacked` with a sparse file. All three strategies implement the same `DeviceBackend` trait transparently.
On-disk format
Per data device:
Offset 0: [Superblock — 4K]
Offset 4K: [Primary Bitmap — variable size]
Offset M: [Mirror Bitmap — same size as primary]
Offset N: [Data Region — remainder of device]
Superblock (4K, first block):
```rust
pub struct Superblock {
    pub magic: [u8; 8],            // b"KISEKI\x01\x00"
    pub version: u32,              // Format version (1)
    pub device_id: [u8; 16],       // UUID
    pub block_size: u32,           // Physical block size (probed)
    pub total_blocks: u64,         // Device capacity in blocks
    pub bitmap_offset: u64,        // Byte offset of primary bitmap
    pub bitmap_mirror_offset: u64, // Byte offset of mirror bitmap
    pub bitmap_blocks: u64,        // Size of each bitmap in blocks
    pub data_offset: u64,          // Byte offset of data region
    pub generation: u64,           // Monotonic, incremented on bitmap flush
    pub checksum: [u8; 32],        // SHA-256 of superblock fields
}
```
Allocation bitmap (primary + mirror): 1 bit per block in the data region. Stored twice at different offsets for redundancy.
- At 4K blocks: 4TB device = 1 billion blocks = 128MB × 2 = 256MB
- At 512B blocks: 4TB device = 8 billion blocks = 1GB × 2 = 2GB
- Bitmap overhead: 0.006% (4K) to 0.048% (512B)
- On read: verify primary against mirror. On mismatch, use the copy consistent with the redb journal.
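The bitmap sizing above follows from one bit per data block, stored twice; a quick check of the arithmetic:

```rust
/// One allocation bit per data block, rounded up to a whole byte.
pub fn bitmap_bytes(device_bytes: u64, block_size: u64) -> u64 {
    let blocks = device_bytes / block_size;
    (blocks + 7) / 8
}

/// Overhead as a percentage of device capacity; both copies
/// (primary + mirror) count against the device.
pub fn bitmap_overhead_pct(device_bytes: u64, block_size: u64) -> f64 {
    (2 * bitmap_bytes(device_bytes, block_size)) as f64 / device_bytes as f64 * 100.0
}
```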
Per-extent CRC32: Every data extent has a 4-byte CRC32 trailer written after the payload data (within the same aligned block).
- On read: verify CRC32 before returning data.
- CRC mismatch → hardware corruption → trigger EC repair from parity fragments (not a security incident).
- AES-GCM auth_tag failure after CRC pass → actual tampering (security incident, alert + audit).
- This distinguishes hardware failure from cryptographic attack, enabling correct operational response.
Allocation algorithm
Extent-based best-fit with free-list cache (Ceph BlueStore pattern, simpler than DAOS VEA):
- In-memory: B-tree of free extents `(offset, block_count)`, sorted by offset. On alloc, scan for the smallest extent >= requested blocks. On free, insert and coalesce with neighbors.
- Concurrency: `alloc()` and `free()` are serialized per device via a Mutex on the allocator state. This is acceptable — allocation is a B-tree lookup (microseconds); I/O is the bottleneck, not allocation. Ceph BlueStore also serializes allocation per OSD.
- On-disk: The bitmap is ground truth. The free-list is rebuilt from the bitmap on startup (~100 ms for 4 TB at 4K blocks).
- Crash safety: Bitmap updates are journaled in redb (`device_alloc` table) before being applied to the device bitmap region. On crash recovery: reload the bitmap from the device, replay pending journal entries from redb, rebuild the free-list.
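A minimal sketch of the in-memory free-list (best-fit plus neighbor coalescing); the real allocator additionally journals every change to redb and mirrors it into the on-disk bitmap:

```rust
use std::collections::BTreeMap;

/// In-memory free-list: offset → block_count, kept sorted by offset
/// so neighbors can be found and coalesced cheaply.
pub struct FreeList {
    free: BTreeMap<u64, u64>,
}

impl FreeList {
    pub fn new(total_blocks: u64) -> Self {
        let mut free = BTreeMap::new();
        free.insert(0, total_blocks); // one extent covering everything
        FreeList { free }
    }

    /// Best-fit: the smallest free extent with at least `want` blocks.
    pub fn alloc(&mut self, want: u64) -> Option<(u64, u64)> {
        let mut best: Option<(u64, u64)> = None;
        for (&off, &len) in &self.free {
            if len >= want && best.map_or(true, |(_, blen)| len < blen) {
                best = Some((off, len));
            }
        }
        let (off, len) = best?;
        self.free.remove(&off);
        if len > want {
            // Split: the tail stays on the free-list.
            self.free.insert(off + want, len - want);
        }
        Some((off, want))
    }

    /// Return an extent and coalesce with adjacent free extents.
    pub fn free(&mut self, mut off: u64, mut len: u64) {
        // Merge with the predecessor if it ends exactly at `off`.
        let pred = self.free.range(..off).next_back().map(|(&o, &l)| (o, l));
        if let Some((poff, plen)) = pred {
            if poff + plen == off {
                self.free.remove(&poff);
                off = poff;
                len += plen;
            }
        }
        // Merge with the successor if it starts at `off + len`.
        if let Some(slen) = self.free.get(&(off + len)).copied() {
            self.free.remove(&(off + len));
            len += slen;
        }
        self.free.insert(off, len);
    }

    pub fn free_extent_count(&self) -> usize {
        self.free.len()
    }
}
```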
Allocation flow (WAL-ordered for crash safety):
1. Round up the requested size to a `block_size` boundary
2. Search the free-list for a best-fit extent
3. Split the extent if larger than needed
4. Journal the intent in redb (`device_alloc` table: alloc intent)
5. Mark bits in the bitmap (pwrite to the bitmap region)
6. Return `Extent { offset, length }`
7. Caller writes data to the extent, then commits `chunk_meta` to redb
8. Clear the intent from the `device_alloc` journal (write complete)
On crash recovery: scan device_alloc for pending intents. If
the corresponding chunk_meta exists → write completed, clear
intent. If no chunk_meta → write was interrupted, free the
extent (clear bitmap bits, remove intent). This is the standard
WAL pattern — Ceph BlueStore uses the same approach.
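The recovery rule is a pure function of whether the `chunk_meta` commit made it to redb; a sketch with simplified stand-ins for the redb tables:

```rust
use std::collections::HashSet;

/// Outcome of resolving one pending alloc intent on crash recovery.
#[derive(Debug, PartialEq)]
pub enum Recovery {
    ClearIntent,          // chunk_meta exists → write completed
    FreeExtent(u64, u64), // no chunk_meta → roll back (offset, length)
}

/// `chunk_meta` is a stand-in for the committed chunk-id set in redb.
pub fn resolve_intent(
    chunk_meta: &HashSet<[u8; 32]>,
    chunk_id: [u8; 32],
    extent: (u64, u64),
) -> Recovery {
    if chunk_meta.contains(&chunk_id) {
        Recovery::ClearIntent
    } else {
        Recovery::FreeExtent(extent.0, extent.1)
    }
}
```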
Free flow:
1. Journal the deallocation intent in redb
2. Clear bits in the bitmap
3. Insert the freed extent into the free-list, coalesce neighbors
4. If `supports_trim`: add to the TRIM batch queue (see below)
5. Clear the dealloc intent from the journal
TRIM batching: Freed extents accumulate in a TRIM queue per
device. A batched BLKDISCARD ioctl is issued periodically
(every 60 seconds or when queue exceeds 1GB). This avoids
write amplification from many small TRIM commands.
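The flush trigger is a simple either/or condition on the queue (constants match the defaults stated above):

```rust
/// Flush the TRIM queue when it exceeds 1 GiB or when 60 s have
/// elapsed since the last flush, whichever comes first.
pub fn should_flush_trim(queued_bytes: u64, secs_since_flush: u64) -> bool {
    const MAX_BYTES: u64 = 1 << 30; // 1 GiB
    const MAX_SECS: u64 = 60;
    queued_bytes > MAX_BYTES || secs_since_flush >= MAX_SECS
}
```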
Maximum extent size: 16 MB. Allocations larger than 16 MB are split into multiple extents. `chunk_meta` already supports multiple extents per chunk via `Vec<FragmentLocation>`.
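Splitting an oversized allocation into at-most-16 MiB pieces is straightforward; a sketch:

```rust
/// Split a requested size into extent lengths, each at most 16 MiB.
pub fn split_extents(size: u64) -> Vec<u64> {
    const MAX_EXTENT: u64 = 16 << 20; // 16 MiB
    let mut out = Vec::new();
    let mut remaining = size;
    while remaining > 0 {
        let piece = remaining.min(MAX_EXTENT);
        out.push(piece);
        remaining -= piece;
    }
    out
}
```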
I/O strategy per device type
| Strategy | Open flags | Alignment | Sync | Use case |
|---|---|---|---|---|
| DirectAligned | `O_DIRECT \| O_DSYNC` | physical_block_size | Implicit (O_DSYNC) | NVMe, SATA SSD |
| BufferedSequential | `O_SYNC` | 512 B | `fdatasync()` | HDD |
| FileBacked | default flags | 4K (simulated) | `fsync()` | VM, dev, CI |
FileBacked alignment: FileBackedDevice enforces the same 4K
alignment as RawBlockDevice to ensure tests faithfully reproduce
raw block behavior. Code that passes CI will not fail on real
hardware due to alignment issues.
- Write buffers are aligned via `std::alloc::Layout::from_size_align` for O_DIRECT compatibility
- NUMA-aware: pin the allocator thread to `numa_node` if detected
- TRIM/UNMAP on free if `supports_trim` (SSD wear management)
- `optimal_io_size` is used for write batching (coalesce small writes up to this size before issuing I/O)
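A sketch of the aligned-buffer allocation via `Layout::from_size_align` (the helper name is hypothetical); both the address and the padded length honor the device alignment, as O_DIRECT requires:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Run `f` with a zeroed buffer whose address and length are both
/// aligned to `align` (must be a power of two, e.g. the device's
/// physical block size). Helper name is illustrative.
pub fn with_aligned_buffer<R>(len: usize, align: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    // O_DIRECT requires address, length, and file offset all aligned,
    // so round the length up too.
    let padded = (len + align - 1) / align * align;
    let layout = Layout::from_size_align(padded, align).expect("invalid alignment");
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null(), "allocation failed");
        debug_assert_eq!(ptr as usize % align, 0); // address is aligned
        let buf = std::slice::from_raw_parts_mut(ptr, padded);
        let result = f(buf);
        dealloc(ptr, layout);
        result
    }
}
```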
Metadata in redb (system partition)
ADR-022’s redb on the RAID-1 system partition stores chunk metadata:
```
Table: chunk_meta
Key:   [u8; 32] (chunk_id)
Value: bincode-serialized ChunkMeta {
    refcount: u64,
    retention_holds: Vec<String>,
    pool_name: String,
    stored_bytes: u64,
    fragments: Vec<FragmentLocation {
        device_id: [u8; 16],
        offset: u64,
        length: u64,
    }>,
    envelope_meta: EnvelopeMeta {
        nonce: [u8; 12],
        auth_tag: [u8; 16],
        system_epoch: u64,
        tenant_epoch: Option<u64>,
        tenant_wrapped_material: Option<Vec<u8>>,
    },
}

Table: device_alloc (bitmap journal for crash safety)
Key:   (device_id: [u8; 16], generation: u64)
Value: bincode-serialized Vec<AllocJournalEntry {
    offset: u64,
    length: u64,
    is_alloc: bool, // true = allocate, false = free
}>
```
Separation of concerns
The allocator does NOT know about device subclasses (NvmeU2 vs
NvmeQlc, HddEnterprise vs HddBulk). Those are pool/placement
concerns in kiseki-chunk and kiseki-control (ADR-024).
| Layer | Cares about | Doesn’t care about |
|---|---|---|
| kiseki-block | physical_block_size, rotational, O_DIRECT | TLC vs QLC, RPM, pool policy |
| kiseki-chunk | pool thresholds, EC config, placement | block alignment, I/O flags |
| kiseki-control | device class, pool assignment, tiering | how bytes reach the device |

The `DeviceClass` enum (ADR-024) stays in `kiseki-chunk`/`kiseki-control`. `DeviceCharacteristics` (auto-probed) stays in `kiseki-block`.
Integration with existing code
- `ChunkOps` trait (ADR-005) unchanged — callers are unaware of the backend
- New `PersistentChunkStore` in `kiseki-chunk` implements `ChunkOps`:
  - `write_chunk()`: EC encode → alloc extents per device via `DeviceBackend` → write fragments → update redb `chunk_meta`
  - `read_chunk()`: look up redb `chunk_meta` → `DeviceBackend::read` per fragment → EC decode if needed → return Envelope
  - `gc()`: free extents via `DeviceBackend::free` → update bitmap → remove from redb
- `DeviceManager` in `kiseki-block` opens devices at startup, probes characteristics, creates the appropriate `DeviceBackend` per device
- Server runtime (`kiseki-server`) wires `DeviceManager` → pools → `PersistentChunkStore` when `KISEKI_DATA_DIR` is set
Crate structure
```
kiseki-block/
├── Cargo.toml
└── src/
    ├── lib.rs
    ├── backend.rs    # DeviceBackend trait
    ├── raw.rs        # RawBlockDevice (O_DIRECT)
    ├── file.rs       # FileBackedDevice (sparse file)
    ├── probe.rs      # Sysfs device probing
    ├── superblock.rs # On-disk superblock format
    ├── bitmap.rs     # Allocation bitmap
    ├── allocator.rs  # Extent allocator (free-list + bitmap)
    ├── extent.rs     # Extent type
    ├── manager.rs    # DeviceManager
    └── error.rs      # BlockError, AllocError
```
Rationale
- Raw block over XFS: Eliminates FS overhead (journaling, inode, page cache) that becomes the bottleneck at NVMe line rate. Ceph BlueStore validated this approach at scale.
- Auto-detection over manual config: Reduces deployment friction. Admin provides device paths; Kiseki probes characteristics. Works correctly on bare metal, VMs, and CI without config changes.
- Bitmap over B-tree free-list on disk: Simpler crash recovery (fixed-size, position-indexed). Free-list is derived in-memory. DAOS VEA uses B-tree on persistent memory, but we don’t require PMEM — bitmap on block device with redb journal is sufficient.
- File-backed fallback: Same trait, different backend. Tests and CI don’t need raw devices. VMs work without device passthrough.
- Separate crate: `kiseki-block` has no domain knowledge (chunks, EC, pools). Clean dependency boundary. Testable in isolation.
Alternatives Considered
- XFS on each JBOD device (ADR-024 original default): Rejected for production — FS overhead at NVMe line rate is unacceptable. Still available as the `FileBacked` strategy for dev/VM.
- SPDK userspace I/O (DAOS model): Rejected — requires dedicated devices (no kernel access), complicates deployment, needs custom memory management (DMA buffers). Future optimization path if kernel I/O overhead is measured as the bottleneck.
- Pool files (one large file per device): Rejected — still has FS overhead (XFS metadata for the pool file itself). Raw block eliminates the FS entirely.
- redb for chunk data: Rejected — a B-tree is not designed for multi-GB blob storage. Acceptable for metadata only.
Consequences
- Adds `kiseki-block` crate to the workspace (~2000 lines estimated)
- Data devices must be provisioned as raw (no filesystem). The operator provides device paths in config; Kiseki writes the superblock on init.
- VMs and CI use file-backed mode transparently (no raw devices needed)
- Crash recovery depends on redb journal + device bitmap consistency
- Device initialization is a destructive operation (writes superblock, bitmap — existing data on the device is lost). Safety checks before init: (1) check for an existing Kiseki superblock magic — require `--force` if found; (2) check for known FS signatures (XFS, ext4, NTFS magic) — refuse with a clear error; (3) audit-log the init
- TRIM/UNMAP support improves SSD endurance but is optional
- Future: an SPDK backend can implement the `DeviceBackend` trait for userspace I/O without changing upper layers
Adversarial Review Findings (2026-04-22)
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | High | Write ordering — data before metadata creates phantom chunks on crash | WAL intent journal: alloc → journal intent → write data → commit chunk_meta → clear intent. Recovery replays intents. |
| 2 | High | No per-extent checksum — silent corruption indistinguishable from tampering | CRC32 trailer per extent. CRC fail = hardware corruption (EC repair). Auth tag fail after CRC pass = tampering (security alert). |
| 3 | Medium | Bitmap single point of failure per device | Primary + mirror bitmap at different offsets. On mismatch, use copy consistent with redb journal. |
| 4 | Medium | No device init safety — accidental overwrite of existing data | Safety checks: existing Kiseki magic → require `--force`. Known FS signatures → refuse. Audit log init. |
| 5 | Medium | File-backed mode doesn’t enforce alignment — CI misses bugs | FileBacked enforces same 4K alignment as RawBlockDevice. |
| 6 | Medium | Concurrent alloc race on shared free-list | Mutex per device on allocator state. Allocation is microseconds; I/O is the bottleneck. |
| 7 | Low | Immediate TRIM on free causes write amplification | Batch TRIM queue: accumulate, issue BLKDISCARD every 60s or at 1GB threshold. |
| 8 | Low | No max extent size — unbounded alloc fragments bitmap scan | Max extent 16MB. Larger chunks split into multiple extents. |
References
- Ceph BlueStore: Architecture
- DAOS VOS/VEA: Storage Model
- ADR-022: Storage backend (redb for metadata)
- ADR-024: Device management and capacity thresholds
- ADR-005: EC and chunk durability
ADR-030: Dynamic Small-File Placement and Metadata Capacity Management
Status: Accepted
Date: 2026-04-22
Deciders: Architect + domain expert
Adversarial review: 2026-04-22 (6 findings: 1C 2H 2M 1L, all resolved)
Context: ADR-024 (device management), ADR-029 (raw block allocator), I-L9 (inline threshold), I-C5 (capacity thresholds), I-C8 (bitmap ground truth)
Problem
At scale (10B+ files, 100PB+), the metadata tier (redb on system NVMe) becomes a sizing bottleneck. The per-file metadata footprint (~280 bytes) is unavoidable, but small-file content inlined into deltas causes the metadata tier to scale with data volume, not just file count.
Current state:
- `inline_threshold_bytes` is specified (I-L9) but not implemented
- No dynamic adjustment mechanism exists
- No awareness of system disk capacity or media type
- No workload-driven shard placement across heterogeneous nodes
Capacity example
10B files, 100PB total, 50-node cluster, RF=3, 256GB NVMe root disks:
| Component | Per file | Cluster total | Per node |
|---|---|---|---|
| Delta log (no inline) | ~200 B | ~2 TB | ~120 GB |
| Chunk metadata | ~80 B | ~0.8 TB | ~48 GB |
| Subtotal (metadata only) | ~280 B | ~2.8 TB | ~168 GB |
| Small-file content (if inlined) | variable | 3-200 TB | blows budget |
Metadata alone consumes 168 GB/node at 50 nodes. Adding inline content makes 256 GB root disks insufficient.
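The per-node numbers in the table can be reproduced with a one-line calculation (RF=3 means each shard’s metadata is held in full by three replicas):

```rust
/// Per-node metadata bytes for `files` files at `per_file` bytes each,
/// replicated `rf` times across `nodes` nodes.
pub fn per_node_metadata_bytes(files: u64, per_file: u64, rf: u64, nodes: u64) -> u64 {
    files * per_file * rf / nodes
}
```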
Decision
1. System disk auto-detection and budget calculation
At server boot, detect the system partition’s capacity and media type. Compute a metadata budget with configurable soft and hard limits.
```
KISEKI_DATA_DIR → stat()                  → total_bytes, fs_type
/sys/block/{dev}/queue/rotational         → 0 = SSD/NVMe, 1 = HDD
/sys/block/{dev}/device/model             → device identification
```
Defaults (configurable via env or config file):
| Parameter | Default | Description |
|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | 50% | Normal operating ceiling |
| KISEKI_META_HARD_LIMIT_PCT | 75% | Absolute maximum, triggers emergency |
| KISEKI_META_INLINE_FLOOR | 128 B | Hard lower bound for inline (metadata-like payloads only) |
Warning: If the system disk is rotational (HDD), emit a persistent warning at boot and in health reports:
WARNING: system disk is rotational (HDD). Raft fsync latency will
be 5-10ms per commit. Production deployments require NVMe or SSD
for the metadata partition. See ADR-030.
Reported to cluster (via gRPC health reports, not Raft — see SF-ADV-4 resolution):
```rust
struct NodeMetadataCapacity {
    total_bytes: u64,
    used_bytes: u64,
    soft_limit_bytes: u64,
    hard_limit_bytes: u64,
    media_type: MediaType,        // Nvme, Ssd, Hdd
    small_file_budget_bytes: u64, // derived: soft_limit - reserved - metadata
}
```
2. Two-tier redb layout on system disk
Separate metadata (Raft log, chunk index) from small-file content:
```
KISEKI_DATA_DIR/
├── raft/log.redb      ← Raft log entries (bounded by snapshot policy)
├── keys/epochs.redb   ← Key epoch metadata (tiny, <10 MB)
├── chunks/meta.redb   ← Chunk extent index (scales with file count)
└── small/objects.redb ← Small-file encrypted content (capacity-managed)
```
The first three are structural metadata — required regardless of the inline threshold. The fourth (`small/objects.redb`) is a data-tier extension — its size is controlled by the inline threshold.
This separation enables:
- Independent monitoring of each tier’s growth
- Emergency response: disable inline (threshold → floor) without touching structural metadata
- Backup/restore of structural metadata without bulk data
GC contract (SF-ADV-6): When `truncate_log` or `compact_shard` removes a delta that references an inline object, the corresponding `small/objects.redb` entry is also deleted. The GC path must cover both stores — orphan entries in `small/objects.redb` are a capacity leak. The `chunk_id` key is shared between `small/objects.redb` and the block device extent mapping, so deletion is keyed identically.
3. Per-shard dynamic inline threshold
The inline threshold determines whether a file’s encrypted content is
stored in small/objects.redb (metadata tier) or as a chunk extent on
a raw block device (data tier).
Threshold is per-shard, not per-node, because all Raft replicas of a shard must agree on whether content is inline or chunked (state machine determinism).
Computation: The shard leader computes the threshold from the minimum small-file budget across all nodes hosting that shard:
```
available       = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate   (from delta count heuristic)
raw_threshold   = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)
```
Where INLINE_CEILING is a system-wide maximum (e.g., 64 KB) to
prevent pathological cases.
Raft log throughput guard (SF-ADV-1): The threshold is further
clamped by a per-shard Raft log throughput budget
(KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline
write rate (measured over a sliding 10-second window) would exceed
this budget at the current threshold, the effective threshold is
temporarily reduced to floor until the rate drops. This prevents
inline data from starving metadata-only Raft operations (large-file
chunk_ref deltas, maintenance commands, watermark advances) during
write storms.
```
effective_threshold = if shard.inline_write_rate_mbps > RAFT_INLINE_MBPS:
                          INLINE_FLOOR
                      else:
                          shard_threshold
```
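Both clamps can be sketched as pure functions (constants match the defaults in this section; the sliding-window rate measurement is abstracted to a single input):

```rust
const INLINE_FLOOR: u64 = 128;          // bytes (KISEKI_META_INLINE_FLOOR)
const INLINE_CEILING: u64 = 64 * 1024;  // 64 KiB system-wide maximum
const RAFT_INLINE_MBPS: f64 = 10.0;     // KISEKI_RAFT_INLINE_MBPS default

/// Budget-based per-shard threshold: min voter budget divided by the
/// projected file count, clamped to [floor, ceiling].
pub fn shard_threshold(min_budget_bytes: u64, projected_files: u64) -> u64 {
    let raw = min_budget_bytes / projected_files.max(1);
    raw.clamp(INLINE_FLOOR, INLINE_CEILING)
}

/// Raft throughput guard: drop to the floor while the measured inline
/// write rate exceeds the per-shard budget.
pub fn effective_threshold(shard_threshold: u64, inline_write_rate_mbps: f64) -> u64 {
    if inline_write_rate_mbps > RAFT_INLINE_MBPS {
        INLINE_FLOOR // write storm: protect metadata-only Raft traffic
    } else {
        shard_threshold
    }
}
```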
Threshold adjustment rules (I-L9 compatibility):
- Threshold can decrease dynamically (safe — new files use chunks)
- Threshold changes are prospective only — existing inline data is not retroactively migrated
- Threshold increase requires cluster admin decision and may trigger background migration of small chunked files back to inline (optional, maintenance-mode operation)
- Threshold is stored in `ShardConfig` and replicated via Raft
Read latency note (SF-ADV-3): After a threshold decrease, existing
inline files remain in small/objects.redb (fast, NVMe reads) while
new files of the same size go to block device extents (potentially
slower, especially on HDD). This bimodal latency for same-sized files
is expected behavior. Administrators can normalize it via the
maintenance-mode migration path (move old inline content to chunks),
but this is optional and not automatic.
Emergency override (SF-ADV-4): Capacity alerts use out-of-band
gRPC health reports, not Raft. Each node periodically reports its
NodeMetadataCapacity to the shard leader (or control plane) via the
data-path gRPC channel. If any voter reports hard-limit breach, the
leader commits a threshold reduction via Raft. This works because:
- The full-disk node doesn’t need to write Raft entries for the signal
- The leader commits the threshold change with 2/3 majority (the full-disk node’s vote is not required)
- The full-disk node receives the committed threshold change via Raft replication (read-only, no disk write needed until next apply)
4. Small-file data path
Inline content flows through Raft (SF-ADV-2): Inline content is
carried as payload in the Raft log entry (LogCommand::AppendDelta
with payload field). The state machine’s apply() method offloads
the payload to small/objects.redb on apply, keyed by chunk_id.
The in-memory state machine retains only the delta header (no payload).
This ensures:
- Snapshot correctness: `build_snapshot()` reads inline content from `small/objects.redb` and includes it in the serialized snapshot. `install_snapshot()` writes it back. Learners and restarted nodes receive all inline content via snapshot transfer.
- State machine determinism: all replicas apply the same log entries and write to their local `small/objects.redb` identically.
- Memory efficiency: inline payloads are not held in memory after apply — only the redb reference remains.
Below threshold (inline path):
```
client write → gateway encrypt → delta with payload →
Raft client_write (payload in log entry) →
replicated to voters →
state machine apply() → offload payload to small/objects.redb →
in-memory state: header only (no payload)
```
Above threshold (chunk path, unchanged):
```
client write → gateway encrypt → chunk alloc on DeviceBackend →
extent write (O_DIRECT) → delta with chunk_ref (no payload) →
Raft client_write → replicated (metadata only)
```
Read path: ChunkOps::get() checks small/objects.redb first
(keyed by chunk_id). If not found, reads from block device extent.
This is transparent to callers.
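The two-store lookup is a simple first-hit chain; a sketch with HashMaps standing in for `small/objects.redb` and the device extent map:

```rust
use std::collections::HashMap;

/// Check the small-file store first, fall back to the block-device
/// extent. Both stores are keyed by the same chunk_id.
pub fn get_chunk(
    small_objects: &HashMap<[u8; 32], Vec<u8>>,
    block_device: &HashMap<[u8; 32], Vec<u8>>,
    chunk_id: &[u8; 32],
) -> Option<Vec<u8>> {
    small_objects
        .get(chunk_id)
        .or_else(|| block_device.get(chunk_id))
        .cloned()
}
```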
5. Workload-driven shard placement (heterogeneous clusters)
When the cluster has mixed node types (HDD + SSD), the control plane can migrate shards to better-suited nodes using Raft membership changes.
Placement levers (ordered by preference, topology-dependent):
| Lever | When to use | Mechanism |
|---|---|---|
| Lower inline threshold | Always available | ShardConfig update via Raft |
| Split shard | Shard exceeds I-L6 ceiling | Standard shard split |
| Migrate to larger-NVMe node | Heterogeneous cluster, metadata pressure | Raft add_learner → promote → demote |
| Migrate to SSD node | Heterogeneous, small-file-heavy shard | Raft add_learner → promote → demote |
Decision tree (control plane policy):
IF shard.metadata_pressure > soft_limit:
IF can_lower_threshold(shard):
lower_threshold(shard) # cheapest, always try first
ELSE IF shard.exceeds_split_ceiling:
split_shard(shard) # distributes load
ELSE IF cluster.has_better_node(shard):
migrate_shard(shard, better_node) # needs heterogeneous cluster
ELSE:
alert("metadata tier at capacity, no placement options available")
In a homogeneous cluster, only the first two levers exist. The policy prunes itself based on what’s available.
Shard migration via Raft:
Migration is not a special operation — it’s a Raft membership change:
raft.add_learner(target_node)— target receives log/snapshot- Wait for learner to catch up (snapshot transfer, then log replay)
raft.change_membership(new_voter_set)— promote target, demote source- Old node removed from voter set, its data eventually GC’d
Properties:
- Zero downtime: reads/writes continue during migration
- Zero data loss: old node stays in membership until new node is caught up
- Reversible: if migration fails, learner is removed, no state change
6. Placement change rate limiting
Placement changes (shard migration, learner add/remove) consume snapshot transfer bandwidth. In HPC environments, workload profiles shift at job boundaries (hours to days), not continuously.
Exponential backoff per shard:
| Observation window | After N-th change |
|---|---|
| 2 hours | 1st (initial observation, minimum floor) |
| 2 hours | 2nd (backoff never resets below 2 h) |
| 4 hours | 3rd |
| 8 hours | 4th |
| … | doubles each time |
| 24 hours | cap (maximum interval) |
Reset (SF-ADV-5): The backoff resets to the minimum floor of 2 hours, not to a shorter interval. Even when the shard’s workload profile changes significantly (e.g., small-file ratio crosses a threshold boundary), the shard cannot be migrated more than once per 2 hours. This prevents oscillating workloads from causing continuous snapshot transfers. The 2-hour floor is chosen because:
- HPC job boundaries are typically hours apart
- A snapshot transfer of a large shard takes minutes, and the target node needs time to stabilize before being evaluated again
- The floor applies per-shard, so different shards can migrate concurrently within the cluster-wide rate limit
Per-cluster rate limit: at most max(1, num_nodes / 10) concurrent
shard migrations cluster-wide, to bound snapshot transfer bandwidth.
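The backoff schedule and the cluster-wide limit can be sketched directly from the numbers above (the doubling formula is an interpretation of the table):

```rust
/// Minimum hours before the n-th placement change of a shard:
/// 2 h floor for the first two changes, doubling after, capped at 24 h.
pub fn backoff_hours(nth_change: u32) -> u32 {
    if nth_change <= 2 {
        2 // minimum floor — never resets below this
    } else {
        (2u32 << (nth_change - 2)).min(24) // 4, 8, 16, then capped at 24
    }
}

/// Cluster-wide cap on concurrent shard migrations: max(1, nodes / 10).
pub fn max_concurrent_migrations(num_nodes: u32) -> u32 {
    (num_nodes / 10).max(1)
}
```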
7. SSD nodes as read accelerators (Raft learners)
For read-heavy small-file workloads, SSD nodes can serve as non-voting Raft learners:
- Learners receive the full Raft log (including small-file content)
- Learners do NOT participate in elections or commit quorum
- Learners serve read requests (state machine is up-to-date)
- Add/remove learners without disturbing the voter set
Use case: a shard has RF=3 on HDD voters (for capacity) plus 1-2 SSD learners (for read IOPS). The SSD learners handle small-file reads, HDD voters handle bulk writes.
Correction after suboptimal placement: Initial shard placement does not need to be optimal. The control plane observes shard metrics (small-file ratio, read IOPS, p99 latency) and corrects placement via Raft membership changes. Adding an SSD learner, promoting it to voter, and demoting an HDD voter is a zero-downtime, zero-data-loss operation. The cost is one snapshot transfer per migrated shard — bounded by the rate limiting in §6.
Promotion path: if workload shifts permanently, a learner can be promoted to voter (and an HDD voter demoted) via standard membership change.
Consequences
Positive
- Metadata tier sizing becomes self-managing
- Small files handled efficiently without manual tuning
- Mixed HDD/SSD clusters used optimally
- Placement corrections have zero downtime and zero data loss
- I-L9 compatibility preserved (prospective-only threshold changes)
- Snapshot transfer includes inline content (SF-ADV-2 resolved)
Negative
- Per-shard threshold adds complexity to `ShardConfig`
- `ChunkOps::get()` now checks two stores (redb + block device)
- Snapshot transfer is the bottleneck for migration speed
- Threshold computation requires cluster-wide metadata aggregation
- Inline writes under high load may be temporarily demoted to chunk path (throughput guard), causing brief latency increase for small files
Neutral
- Threshold floor (128 B) means truly tiny files are always inline
- Homogeneous clusters get simpler behavior (fewer levers)
- Migration mechanism is just Raft membership changes — no new protocol
- Bimodal read latency after threshold decrease is expected (SF-ADV-3)
Adversarial findings (resolved)
| ID | Severity | Finding | Resolution |
|---|---|---|---|
| SF-ADV-1 | High | Raft log throughput saturation from inline writes | Per-shard throughput budget (§3), temporarily lowers threshold to floor under load |
| SF-ADV-2 | Critical | Inline content missing from Raft snapshots | Inline content flows through Raft log; state machine offloads to redb on apply; snapshot reads from redb (§4) |
| SF-ADV-3 | Medium | Bimodal read latency after threshold decrease | Documented as expected; optional admin migration path to normalize (§3) |
| SF-ADV-4 | High | Emergency override fails if full-disk node can’t write Raft entries | Capacity reporting via out-of-band gRPC, not Raft; leader commits with 2/3 majority (§3) |
| SF-ADV-5 | Low | Backoff reset allows frequent migrations from oscillating workloads | Minimum 2-hour floor that never resets below (§6) |
| SF-ADV-6 | Medium | No GC path for small/objects.redb | GC contract: truncate_log and compact_shard delete corresponding redb entries (§2) |
Invariant impact
| Invariant | Impact |
|---|---|
| I-L9 | Extended: threshold is now per-shard and dynamic, but still prospective-only. Increase requires admin action. |
| I-C5 | Unchanged: capacity thresholds on data devices unaffected. |
| I-C8 | Unchanged: bitmap remains ground truth for block device allocations. |
| I-K3 | Unchanged: inline content is still encrypted with system DEK, wrapped with tenant KEK. |
New invariants
| ID | Invariant |
|---|---|
| I-SF1 | The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set. Threshold stored in ShardConfig, replicated via Raft. |
| I-SF2 | System disk metadata usage must not exceed hard_limit_pct of system partition capacity. Exceeding soft limit triggers threshold reduction; exceeding hard limit forces threshold to floor and emits alert. Alert uses out-of-band gRPC, not Raft. |
| I-SF3 | Shard migration via Raft membership change must not proceed until the target node has fully caught up (learner state matches leader’s committed index). |
| I-SF4 | Placement change rate per shard follows exponential backoff (2h floor, 24h cap). Backoff resets never go below 2h floor. Cluster-wide concurrent migrations bounded by max(1, num_nodes / 10). |
| I-SF5 | Inline content is carried in Raft log entries and offloaded to small/objects.redb on state machine apply. Snapshots include inline content read from redb. No inline content is held in the in-memory state machine after apply. |
| I-SF6 | GC (truncate_log, compact_shard) must delete corresponding entries from small/objects.redb when removing deltas that reference inline objects. Orphan redb entries are a capacity leak. |
| I-SF7 | Per-shard Raft inline throughput must not exceed KISEKI_RAFT_INLINE_MBPS (default 10 MB/s). When exceeded, effective inline threshold drops to floor until rate subsides. |
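The threshold rules in I-SF1 (minimum across the voter set, never below the 128 B floor) and I-SF7 (drop to floor while the inline throughput budget is exceeded) can be sketched as follows. This is a minimal illustration under assumed names — `effective_threshold`, `guarded_threshold`, and `THRESHOLD_FLOOR` are not the actual kiseki-server API:

```rust
// Illustrative sketch of I-SF1 and I-SF7; names are hypothetical.

const THRESHOLD_FLOOR: u64 = 128; // bytes — truly tiny files stay inline

/// I-SF1: a shard's inline threshold is the minimum affordable threshold
/// across all nodes hosting its voter set, clamped to the floor.
fn effective_threshold(voter_affordable: &[u64]) -> u64 {
    voter_affordable
        .iter()
        .copied()
        .min()
        .unwrap_or(THRESHOLD_FLOOR)
        .max(THRESHOLD_FLOOR)
}

/// I-SF7: when the per-shard inline write rate exceeds the budget
/// (default 10 MB/s), the effective threshold drops to the floor
/// until the rate subsides.
fn guarded_threshold(threshold: u64, inline_mbps: f64, budget_mbps: f64) -> u64 {
    if inline_mbps > budget_mbps {
        THRESHOLD_FLOOR
    } else {
        threshold
    }
}

fn main() {
    // One HDD voter can only afford 1 KiB inline → the shard gets 1 KiB.
    let t = effective_threshold(&[4096, 1024, 65536]);
    assert_eq!(t, 1024);
    // Throughput guard active: threshold collapses to the floor.
    assert_eq!(guarded_threshold(t, 12.0, 10.0), THRESHOLD_FLOOR);
    assert_eq!(guarded_threshold(t, 3.0, 10.0), 1024);
}
```

Because the minimum is taken over the voter set only, adding an SSD learner never changes a shard's threshold until the learner is promoted to voter.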
Spec references
- `specs/invariants.md` — I-L9, I-C5, I-C8, I-K3
- `specs/architecture/adr/024-device-management-and-capacity.md` — device classes, server disk layout
- `specs/architecture/adr/029-raw-block-device-allocator.md` — DeviceBackend trait, extent allocation
- `specs/architecture/adr/026-raft-topology.md` — Raft membership, multi-Raft pattern
- `specs/implementation/phase-7-9-assessment.md` — open design question on small files
ADR-031: Client-Side Cache
Status: Accepted Date: 2026-04-23 Deciders: Architect + domain expert Adversarial review: 2026-04-23 (14 findings: 2C 4H 4M 4L, all resolved)
Context
ADR-013 (POSIX semantics scope), ADR-019 (gateway deployment model),
ADR-020 (workflow advisory), ADR-030 (dynamic small-file placement),
control-plane.feature (policy distribution precedent),
native-client.feature (client architecture).
CSCS workload mix: LLM pretraining (epoch reuse of tokenized datasets), LLM inference (model weight cold-start), climate/weather simulation (bounded input staging with hard deadlines), HPC checkpoint/restart. Common pattern: compute nodes repeatedly pull the same encrypted chunks across the fabric.
Existing client architecture: kiseki-client crate with feature flags
(fuse, ffi, python, pure-Rust default). Performs tenant-layer
encryption — plaintext never leaves the workload process. The existing
ClientCache is an in-memory HashMap<ChunkId, Vec<u8>> with TTL and
max-entries eviction.
Problem
- Repeat reads of the same chunks cross the fabric unnecessarily. Training datasets are read epoch after epoch. Inference weights are loaded identically by multiple model replicas. Climate boundary conditions are staged identically to every simulation rank.
- The in-memory cache (the current `ClientCache`) is bounded by process memory, which is primarily needed for computation. Compute-node NVMe is available and underutilized.
- No mechanism for pre-staging datasets. Jobs start with a cold cache and pay first-access latency on every rank simultaneously, creating a thundering-herd pattern on the storage fabric.
- No cache mode differentiation. Training (pin everything), inference (pin weights, LRU prompts), and HPC checkpoint (don't cache) have fundamentally different cache needs.
Decision
1. Cache architecture
The client-side cache is a library-level module in kiseki-client,
shared across all linkage modes (FUSE, FFI, Python, native Rust). It
operates on decrypted plaintext chunks keyed by ChunkId.
canonical (fabric) → decrypt → cache store (NVMe) → serve to caller
↑
cache hit path (no fabric, no decrypt)
Two-tier storage:
| Tier | Backing | Purpose | Eviction |
|---|---|---|---|
| Hot (L1) | In-memory HashMap | Sub-microsecond hits for active working set | LRU, bounded by max_memory_bytes |
| Warm (L2) | Local NVMe file or directory | Large capacity for datasets and weights | Per-mode policy (see §2) |
L2 layout on NVMe (CC-ADV-4 resolved: per-process subdirectories):
$KISEKI_CACHE_DIR/
├── <tenant_id_hex>/
│ ├── <pool_id>/ ← per-process pool (128-bit CSPRNG)
│ │ ├── chunks/
│ │ │ ├── <prefix>/
│ │ │ │ └── <chunk_id_hex> ← plaintext + CRC32 trailer
│ │ │ └── ...
│ │ ├── meta/
│ │ │ └── file_chunks.db
│ │ ├── staging/
│ │ │ └── <dataset_id>.manifest
│ │ └── pool.lock ← flock, proves process is alive
│ └── <pool_id>/ ← another concurrent process
│ └── ...
└── ...
Each client process creates its own pool_id directory (128-bit
CSPRNG, same generation as client_id per I-WA4). The pool.lock
file holds an flock for the process lifetime. Multiple concurrent
same-tenant processes on the same node have fully independent pools
with no contention.
L2 integrity (CC-ADV-3 resolved): Each L2 chunk file stores the plaintext data followed by a 4-byte CRC32 trailer, computed at insert time. On L2 read, the CRC32 is verified before serving. Full SHA-256 content-address verification occurs only at fetch time (when the chunk is first retrieved from canonical). CRC32 catches bit-flips and filesystem corruption at ~1 GB/s throughput cost. CRC mismatch triggers bypass to canonical and L2 entry deletion (I-CC7).
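The trailer scheme can be sketched as below. To stay dependency-free the example uses a bitwise IEEE CRC32 (reflected, polynomial `0xEDB88320`); a real client would use a table-driven or SIMD implementation to hit the stated ~1 GB/s. Function names and the little-endian trailer encoding are illustrative assumptions, not the actual on-disk format:

```rust
// Sketch of the I-CC13 trailer: plaintext + 4-byte CRC32, verified on read.

fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if LSB set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compute and append the trailer at L2 insert time.
fn with_trailer(plaintext: &[u8]) -> Vec<u8> {
    let mut out = plaintext.to_vec();
    out.extend_from_slice(&crc32(plaintext).to_le_bytes());
    out
}

/// Verify the trailer on L2 read. `None` means the caller must bypass to
/// canonical and delete the L2 entry (I-CC7).
fn verify_and_strip(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (body, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(body) == stored).then_some(body)
}

fn main() {
    let stored = with_trailer(b"chunk payload");
    assert_eq!(verify_and_strip(&stored), Some(&b"chunk payload"[..]));

    let mut corrupt = stored.clone();
    corrupt[0] ^= 0xFF; // simulated bit-flip on NVMe
    assert_eq!(verify_and_strip(&corrupt), None);
}
```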
Security model (plaintext cache):
The L2 cache holds decrypted plaintext on local NVMe. This is acceptable because:
- The compute node already holds decrypted data in process memory (computation requires plaintext)
- L2 NVMe is local to the compute node, same trust domain as process memory
- L2 is ephemeral — wiped on process exit and on long disconnect
- `zeroize` on eviction/wipe: overwrite chunk data before deallocation (I-CC2)
- File permissions: `0600`, owned by process UID
- Crash recovery: startup scavenger + periodic scrubber clean orphaned pools (CC-ADV-1 resolved, see §9)
Residual risk (CC-ADV-10 acknowledged): Software zeroize on NVMe/SSD provides logical-level erasure only. The Flash Translation Layer may retain physical copies of overwritten data until internal garbage collection. For deployments requiring physical erasure guarantees, use NVMe drives with hardware encryption (OPAL/SED) and rotate the drive encryption key on node reboot. This is an operational hardening measure, not a baseline requirement.
2. Cache modes
Three modes, selectable per client instance at session establishment:
Pinned mode
For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate (boundary conditions).
- Chunks are retained against eviction until explicit release
- Populated via the staging API (§6) or on first access
- L2 is the primary tier; L1 is a hot subset
- Eviction: only on explicit `release()` or process exit
- Capacity bounded by `max_cache_bytes` (§8); staging beyond capacity returns an error, does not evict pinned chunks
Dataset versioning (CC-ADV-8 resolved): Pinned mode stages a
point-in-time snapshot of the dataset. The staged version is immutable
in the cache regardless of canonical updates. This is intentional —
training runs require a stable dataset across epochs. To pick up
dataset updates, the user must explicitly release and re-stage.
There is no automatic dataset-level version check.
Organic mode
Default for mixed workloads. LRU with usage-weighted retention.
- Chunks cached on first read, evicted on LRU when capacity is reached
- Frequently accessed chunks promoted to L1
- L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
- Metadata cache (file→chunk_list) with configurable TTL (default 5s)
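The usage-weighted retention rule ("accessed N times survives N eviction rounds") can be sketched as follows. This is a minimal model under assumed names (`OrganicL2`, `Entry`, a logical clock for last-access ordering), not the actual kiseki-client eviction code:

```rust
use std::collections::HashMap;

// Sketch of organic-mode L2 eviction: LRU order, weighted by access count.
struct Entry {
    last_access: u64, // logical clock tick of the most recent access
    weight: u32,      // remaining eviction rounds this entry survives
}

struct OrganicL2 {
    clock: u64,
    entries: HashMap<u64, Entry>, // key: chunk_id
}

impl OrganicL2 {
    fn new() -> Self {
        Self { clock: 0, entries: HashMap::new() }
    }

    /// Record an access: refresh recency and buy one extra eviction round.
    fn touch(&mut self, chunk_id: u64) {
        self.clock += 1;
        let e = self
            .entries
            .entry(chunk_id)
            .or_insert(Entry { last_access: 0, weight: 0 });
        e.last_access = self.clock;
        e.weight += 1;
    }

    /// One eviction round: the least-recently-used entry either spends a
    /// weight point and survives, or is evicted. Returns the evicted id.
    fn eviction_round(&mut self) -> Option<u64> {
        let id = *self
            .entries
            .iter()
            .min_by_key(|(_, e)| e.last_access)
            .map(|(id, _)| id)?;
        let e = self.entries.get_mut(&id).unwrap();
        if e.weight > 1 {
            e.weight -= 1;
            None
        } else {
            self.entries.remove(&id);
            Some(id)
        }
    }
}

fn main() {
    let mut l2 = OrganicL2::new();
    l2.touch(1); l2.touch(1); l2.touch(1); // hot chunk, weight 3
    l2.touch(2);                           // cold chunk, weight 1
    // Chunk 1 is older, so it is the round's candidate — but it survives
    // two rounds before being evicted on the third.
    assert_eq!(l2.eviction_round(), None);
    assert_eq!(l2.eviction_round(), None);
    assert_eq!(l2.eviction_round(), Some(1));
    assert_eq!(l2.eviction_round(), Some(2));
}
```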
Bypass mode
For workloads that don’t benefit from caching: streaming ingest, one-shot scans, checkpoint writes, compute-bound codes with no repeat reads.
- All reads go directly to canonical
- No L1 or L2 storage consumed
- Zero overhead beyond mode selection
3. Metadata cache
The cache stores file-to-chunk-list mappings with a bounded TTL:
struct MetadataEntry {
    chunk_list: Vec<ChunkId>,
    fetched_at: Instant,
    ttl: Duration,
}
I-CC3 (metadata freshness and authority): File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical before serving chunks that depend on it. Within the TTL window, the cached mapping is authoritative — it may serve data for files that have since been modified or deleted in canonical. This is an accepted consequence of the TTL window, not a correctness violation. Modifications create new compositions with new chunk_ids; the old mapping points to valid immutable chunks that were the file’s content at fetch time. Deletions remove the composition; the cached mapping continues to serve the deleted file’s data until TTL expiry.
I-CC5 (staleness bound): Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical will be visible to a caching client within at most one metadata TTL period. The default TTL (5 seconds) balances freshness against metadata lookup cost.
Write-through: When the client writes a file (creating new chunks and a new composition), the local metadata cache is updated immediately with the new chunk list. This provides read-your-writes consistency within a single client process without waiting for TTL expiry.
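The TTL gate and the write-through path can be sketched around `MetadataEntry` like this. Container and method names (`MetadataCache`, `write_through`) are illustrative assumptions, not the actual kiseki-client types:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type ChunkId = u64;

// Sketch of the I-CC3/I-CC5 freshness rule and write-through update.
struct MetadataEntry {
    chunk_list: Vec<ChunkId>,
    fetched_at: Instant,
    ttl: Duration,
}

struct MetadataCache {
    entries: HashMap<String, MetadataEntry>,
}

impl MetadataCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Serve a mapping only while fresh; an expired entry returns `None`,
    /// forcing a re-fetch from canonical before dependent chunks are served.
    fn get(&self, path: &str) -> Option<&[ChunkId]> {
        let e = self.entries.get(path)?;
        (e.fetched_at.elapsed() <= e.ttl).then_some(e.chunk_list.as_slice())
    }

    fn insert(&mut self, path: &str, chunks: Vec<ChunkId>, ttl: Duration) {
        self.entries.insert(
            path.into(),
            MetadataEntry { chunk_list: chunks, fetched_at: Instant::now(), ttl },
        );
    }

    /// Write-through: a local write installs the new chunk list immediately,
    /// giving read-your-writes within this process without waiting for TTL.
    fn write_through(&mut self, path: &str, new_chunks: Vec<ChunkId>, ttl: Duration) {
        self.insert(path, new_chunks, ttl);
    }
}

fn main() {
    let mut cache = MetadataCache::new();
    cache.insert("/training/imagenet/part-0", vec![10, 11], Duration::from_secs(5));
    assert_eq!(cache.get("/training/imagenet/part-0"), Some(&[10u64, 11][..]));

    // A zero TTL is immediately stale: the mapping must be re-fetched.
    cache.insert("/tmp/expired", vec![99], Duration::ZERO);
    std::thread::sleep(Duration::from_millis(1));
    assert_eq!(cache.get("/tmp/expired"), None);

    // Own write is visible at once.
    cache.write_through("/training/imagenet/part-0", vec![12], Duration::from_secs(5));
    assert_eq!(cache.get("/training/imagenet/part-0"), Some(&[12u64][..]));
}
```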
4. Correctness invariants
The cache’s correctness rests on a small set of stated invariants. Each case where the cache serves (rather than bypasses) is backed by one or more of these invariants. Cases not covered bypass to canonical.
I-CC1 (chunk immutability): Chunks are immutable in canonical (I-C1). A chunk fetched, verified by content-address (SHA-256 of plaintext matches chunk_id derivation), and stored in cache is correct for all future reads of that chunk_id. No TTL needed for chunk data.
I-CC2 (plaintext security): Cached plaintext is overwritten with
zeros (zeroize) before deallocation, eviction, or cache wipe.
File-level: overwrite contents before unlink. Memory-level:
Zeroizing<Vec<u8>> for L1 entries. This provides logical-level
erasure; physical-level erasure on flash storage requires hardware
encryption (see §1 residual risk).
I-CC6 (disconnect threshold): Cached entries remain authoritative
across fabric disconnects shorter than max_disconnect_seconds
(default 300s). Beyond this threshold, the entire cache (L1 + L2) is
wiped. Disconnect is defined as: no successful RPC to any canonical
endpoint (storage node or gateway) for max_disconnect_seconds
consecutive seconds. The client maintains a last_successful_rpc
timestamp updated on every successful data-path or heartbeat RPC.
Background heartbeat RPCs (every 60s, piggybacked on metadata TTL
refresh when idle) keep this timestamp current. Transient single-RPC
failures do not trigger the disconnect timer — only sustained
unreachability across all endpoints does.
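The disconnect timer reduces to a single timestamp plus a threshold check. A minimal sketch, with illustrative names (`DisconnectTracker`, `must_wipe`) and a shortened window for demonstration:

```rust
use std::time::{Duration, Instant};

// Sketch of the I-CC6 disconnect timer.
struct DisconnectTracker {
    last_successful_rpc: Instant,
    max_disconnect: Duration, // default 300s in the ADR
}

impl DisconnectTracker {
    fn new(max_disconnect: Duration) -> Self {
        Self { last_successful_rpc: Instant::now(), max_disconnect }
    }

    /// Called on every successful data-path or heartbeat RPC.
    fn record_success(&mut self) {
        self.last_successful_rpc = Instant::now();
    }

    /// True once no canonical endpoint has answered for the full window —
    /// the caller must then wipe L1 + L2.
    fn must_wipe(&self) -> bool {
        self.last_successful_rpc.elapsed() >= self.max_disconnect
    }
}

fn main() {
    let mut t = DisconnectTracker::new(Duration::from_millis(50));
    assert!(!t.must_wipe()); // just connected

    std::thread::sleep(Duration::from_millis(60));
    assert!(t.must_wipe()); // sustained unreachability → wipe

    t.record_success(); // any single successful RPC resets the timer
    assert!(!t.must_wipe());
}
```

Because a single successful RPC to any endpoint resets the timestamp, transient per-RPC failures never trip the wipe; only sustained, cluster-wide unreachability does.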
I-CC7 (error bypass): Any local cache error (L2 I/O failure, corrupt chunk detected by CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. Failed L2 reads are not retried from L2 — they go to canonical immediately.
I-CC8 (wipe on restart / crash recovery): On process start, the
client either creates a new L2 pool (wiping any prior orphaned pools)
or adopts an existing pool identified by KISEKI_CACHE_POOL_ID
environment variable (see §6 staging handoff). Orphaned pools are
detected by attempting flock on each pool.lock — if the lock
succeeds, the pool is orphaned (no live process holds it) and is
wiped (zeroized and deleted). A separate kiseki-cache-scrub service
runs on node boot and periodically (every 60s) to clean orphaned
pools across all tenants, covering crash recovery when no subsequent
kiseki process starts on that node.
I-CC13 (L2 integrity): L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer on each chunk file. On L2 read, the CRC32 is verified before serving. CRC mismatch triggers bypass to canonical and L2 entry deletion.
5. Policy authority and distribution
Cache policy follows the same distribution mechanism as quotas
(per control-plane.feature scenario “Quota enforcement during
control plane outage”).
Policy hierarchy
cluster default → org override → project override → workload override
→ session selection
Each level narrows (never broadens) the parent’s settings, consistent with ADR-020 / I-WA7.
Policy attributes
| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set{pinned, organic, bypass} | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache filesystem |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |
Narrowing rules (same as I-WA7):
- `cache_enabled = false` at any level → disabled for all children
- `allowed_modes` at child ⊆ `allowed_modes` at parent
- `max_cache_bytes` at child ≤ `max_cache_bytes` at parent
- `metadata_ttl_ms` at child ≤ `metadata_ttl_ms` at parent
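The narrowing rules are mechanical: booleans AND together, mode sets intersect, numeric ceilings take the minimum. A sketch with an assumed policy struct (the real `TenantConfig` carries more fields):

```rust
use std::collections::BTreeSet;

// Illustrative policy subset; not the actual TenantConfig layout.
#[derive(Clone, Debug, PartialEq)]
struct CachePolicy {
    cache_enabled: bool,
    allowed_modes: BTreeSet<&'static str>,
    max_cache_bytes: u64,
    metadata_ttl_ms: u64,
}

/// Child levels may only narrow the parent, never broaden it (I-WA7 pattern).
fn narrow(parent: &CachePolicy, child: &CachePolicy) -> CachePolicy {
    CachePolicy {
        cache_enabled: parent.cache_enabled && child.cache_enabled,
        allowed_modes: parent
            .allowed_modes
            .intersection(&child.allowed_modes)
            .copied()
            .collect(),
        max_cache_bytes: parent.max_cache_bytes.min(child.max_cache_bytes),
        metadata_ttl_ms: parent.metadata_ttl_ms.min(child.metadata_ttl_ms),
    }
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    let cluster = CachePolicy {
        cache_enabled: true,
        allowed_modes: ["pinned", "organic", "bypass"].into_iter().collect(),
        max_cache_bytes: 50 * GIB,
        metadata_ttl_ms: 5000,
    };
    let org = CachePolicy {
        cache_enabled: true,
        allowed_modes: ["organic", "bypass"].into_iter().collect(),
        max_cache_bytes: 100 * GIB, // broader than parent — gets clamped
        metadata_ttl_ms: 2000,
    };
    let effective = narrow(&cluster, &org);
    assert!(effective.cache_enabled);
    assert!(!effective.allowed_modes.contains("pinned")); // org removed it
    assert_eq!(effective.max_cache_bytes, 50 * GIB);      // parent ceiling wins
    assert_eq!(effective.metadata_ttl_ms, 2000);          // child narrowed
}
```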
Distribution mechanism
Cache policy is carried in the same TenantConfig structure that
carries quotas. At session establishment, the client resolves its
effective policy through multiple paths (CC-ADV-6 resolved):
- Primary: `GetCachePolicy` RPC on the data-path gRPC channel to any connected storage node. Storage nodes have `TenantConfig` (the same data they use for quota enforcement). No gateway or control plane reachability required — the client only needs the data fabric.
- Secondary: fetch from the gateway's locally-cached `TenantConfig` (if the gateway is reachable)
- Stale tolerance: last-known policy persisted in the L2 pool directory (`policy.json`). Remains effective during outages, consistent with the quota scenario in `control-plane.feature`.
- Fallback: if no policy is resolvable (first-ever session, all paths unreachable), use conservative defaults (cache enabled, organic mode, 10 GB max, 5s TTL)
- Reconciliation: on control-plane recovery, client re-fetches policy and applies prospectively (I-WA18 pattern — active sessions continue under session-start policy; new sessions use updated policy)
No parallel policy-distribution path is introduced. Cache policy is
one more field in TenantConfig, alongside quotas, compliance tags,
and advisory settings.
I-CC9 (policy fallback): When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). The cache is a performance feature; failing to resolve policy must not prevent data access.
I-CC10 (prospective policy): Cache policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment, consistent with I-WA18.
6. Staging API
Client-local operation for pre-populating the cache with a dataset’s chunks in pinned mode. Pull-based — the client fetches from canonical.
Interface
# CLI (Slurm prolog, manual use)
kiseki-client stage --dataset <namespace_path> [--timeout <seconds>]
kiseki-client stage --status [--dataset <namespace_path>]
kiseki-client stage --release <namespace_path>
kiseki-client stage --release-all
# Rust API
impl CacheManager {
async fn stage(&self, namespace_path: &str) -> Result<StageResult>;
fn stage_status(&self) -> Vec<StagedDataset>;
fn release(&self, namespace_path: &str);
fn release_all(&self);
}
# Python API (via PyO3)
client.stage(namespace_path="/training/imagenet")
client.stage_status()
client.release(namespace_path="/training/imagenet")
# C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs)
kiseki_stage_status(handle, &status)
kiseki_release(handle, "/training/imagenet")
Flow (CC-ADV-11 resolved: directory tree handling)
- Resolve `namespace_path` to composition metadata via canonical. If the path is a directory, recursively enumerate all files (compositions) up to `max_staging_depth` (default 10) and `max_staging_files` (default 100,000). If limits are exceeded, return an error with the count of files discovered.
- Extract the full chunk list from all resolved compositions
- For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention
- Write `staging/<dataset_id>.manifest` listing all compositions, their chunk_ids, and the total byte count
- Report progress (chunks staged / total, bytes, elapsed)
Staging is idempotent — re-staging an already-staged dataset is a no-op (chunks already present). Partial staging (interrupted) can be resumed by re-running the command.
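The idempotency falls out of the fetch filter: only chunks absent from L2 cross the fabric, so re-running stage on an already- or partially-staged dataset fetches exactly the missing set. A minimal sketch with illustrative names:

```rust
use std::collections::HashSet;

type ChunkId = u64;

// Sketch of the idempotent staging step: diff the manifest against L2.
fn chunks_to_fetch(manifest: &[ChunkId], l2: &HashSet<ChunkId>) -> Vec<ChunkId> {
    manifest.iter().copied().filter(|c| !l2.contains(c)).collect()
}

fn main() {
    let manifest = vec![1, 2, 3, 4];
    // A partial prior stage (e.g. interrupted) left chunks 1 and 3 in L2.
    let mut l2: HashSet<ChunkId> = [1, 3].into_iter().collect();

    // Resume: only the missing chunks are fetched from canonical.
    let missing = chunks_to_fetch(&manifest, &l2);
    assert_eq!(missing, vec![2, 4]);
    l2.extend(missing);

    // Re-staging a fully staged dataset is a no-op.
    assert!(chunks_to_fetch(&manifest, &l2).is_empty());
}
```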
Staging handoff (CC-ADV-5 resolved)
The staging CLI creates a cache pool and holds the pool.lock flock
for its lifetime. The workload process adopts the staging pool instead
of creating a new one:
- Staging CLI: `kiseki-client stage --dataset /training/imagenet`
  - Creates pool, writes `pool_id` to stdout and to `$KISEKI_CACHE_DIR/<tenant>/staging_pool_id`
  - Stages chunks, holds flock, stays alive (daemon mode)
- Workload process: sets `KISEKI_CACHE_POOL_ID=<pool_id>` (from Slurm prolog output, Lattice env injection, or the file)
  - On start, detects the existing pool with matching `pool_id`
  - Adopts the pool: takes over the flock from the staging daemon
  - Staging daemon detects the flock loss, exits cleanly
- If `KISEKI_CACHE_POOL_ID` is not set: normal fresh-pool behavior (create a new pool, wipe orphans)
Slurm integration:
# prolog.sh:
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE
# epilog.sh:
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID
Lattice integration: injects KISEKI_CACHE_POOL_ID into the
workload environment after parallel staging completes across the
node set. Queries stage --status to verify readiness before
launching the workload.
I-CC11 (staging correctness): Staged chunks are fetched from
canonical, verified by content-address, and stored with pinned
retention. The staging manifest records the compositions and chunk_ids
at staging time as a point-in-time snapshot. If the dataset is modified
in canonical after staging, the staged version remains correct for its
chunk_ids (immutable chunks) but stale relative to the current dataset
version. To pick up updates, the user must explicitly release and
re-stage.
7. Cache invalidation
The cache is primarily self-consistent due to chunk immutability (I-C1). Explicit invalidation is needed only for metadata:
Metadata invalidation: TTL-based. No push invalidation from canonical to client. The metadata TTL is the sole freshness mechanism.
Chunk invalidation: Not needed under normal operation (chunks are immutable). Two exceptional cases:
- Crypto-shred (CC-ADV-2 resolved): When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths:
  - Periodic key health check: Client pings KMS every `key_health_interval` (default 30s). If the tenant KEK is reported destroyed (`KEK_DESTROYED` error), wipe immediately.
  - Advisory channel: If connected, receives shred notification immediately (fast path, best-effort).
  - KMS error on next operation: Any key fetch that returns `KEK_DESTROYED` triggers an immediate wipe.
  - Unreachability fallback: If the KMS is unreachable for `max_disconnect_seconds`, the disconnect timer triggers a full cache wipe (I-CC6), which covers the case where the KMS is unreachable because the KEK was destroyed.

  Maximum time between a crypto-shred event and the cache wipe is bounded by `min(key_health_interval, max_disconnect_seconds)` — default 30 seconds.
- Key rotation: When the system key epoch rotates, existing cached plaintext remains valid (same content, different encryption at rest). No cache action needed — the cache holds plaintext, not ciphertext.
I-CC12 (crypto-shred wipe): On crypto-shred event, all cached
plaintext for the affected tenant is wiped from L1 and L2 with
zeroize. Detection bounded by key_health_interval (default 30s).
No cached data from a shredded tenant is served after detection.
8. Capacity management
Per-process limits:
| Parameter | Default | Source |
|---|---|---|
| max_memory_bytes (L1) | 256 MB | env KISEKI_CACHE_L1_MAX or API |
| max_cache_bytes (L2) | 50 GB | policy ceiling or env KISEKI_CACHE_L2_MAX |
Per-node limit (CC-ADV-9 resolved):
max_node_cache_bytes (default: 80% of $KISEKI_CACHE_DIR filesystem
capacity). Enforced cooperatively: before inserting into L2, each
process sums total usage across all pool directories in
$KISEKI_CACHE_DIR. If total exceeds max_node_cache_bytes, the
insert is rejected (organic: evict first; pinned: staging error).
The disk-pressure check (90% filesystem utilization) remains as a
hard backstop.
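The cooperative check can be sketched as a recursive directory sum over the cache root, run before each L2 insert. This is an illustrative sketch (`dir_bytes`, `l2_insert_allowed`, and the demo paths are assumptions); a real implementation would cache the sum and rescan periodically rather than walk the tree per insert:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum file sizes under a directory (all tenants, all pools).
fn dir_bytes(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        total += if meta.is_dir() {
            dir_bytes(&entry.path())?
        } else {
            meta.len()
        };
    }
    Ok(total)
}

/// Reject the insert when all pools together would exceed the node ceiling.
fn l2_insert_allowed(cache_root: &Path, chunk_len: u64, node_limit: u64) -> io::Result<bool> {
    Ok(dir_bytes(cache_root)? + chunk_len <= node_limit)
}

fn main() -> io::Result<()> {
    // Demo layout under a temp dir standing in for $KISEKI_CACHE_DIR.
    let root = std::env::temp_dir().join("kiseki-cache-demo");
    let pool = root.join("tenant-a").join("pool-1").join("chunks");
    fs::create_dir_all(&pool)?;
    fs::write(pool.join("chunk-0"), vec![0u8; 600])?;

    assert!(l2_insert_allowed(&root, 100, 1024)?);  // 600 + 100 <= 1024
    assert!(!l2_insert_allowed(&root, 500, 1024)?); // 600 + 500 > 1024

    fs::remove_dir_all(&root)?;
    Ok(())
}
```

Enforcement is cooperative, not atomic: two processes can race past the check simultaneously, which is why the 90% disk-pressure check remains as the hard backstop.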
Capacity enforcement:
- L1: strict LRU eviction at `max_memory_bytes`
- L2 organic mode: LRU eviction at `max_cache_bytes`
- L2 pinned mode: staging requests rejected with `CacheCapacityExceeded` when staged + proposed > `max_cache_bytes`. No eviction of pinned data.
- Combined pinned + organic: pinned chunks are never evicted by organic LRU. Organic eviction only considers non-pinned chunks.
- Node-wide: cooperative check against `max_node_cache_bytes` before any L2 insert.
9. Lifecycle
Process start (CC-ADV-1 resolved: crash recovery):
- If `KISEKI_CACHE_POOL_ID` is set: adopt the existing pool (§6 handoff)
- Otherwise: create a new pool with a CSPRNG `pool_id`
- Scavenge orphans: scan all pool directories under `$KISEKI_CACHE_DIR/<tenant_id>/`, attempt flock on each `pool.lock`. If the lock succeeds (no live holder), the pool is orphaned — zeroize all chunk files, delete the directory. This catches prior crashes.
- Resolve effective cache policy (§5)
- Initialize L1 (empty `HashMap`)
- Start background tasks: metadata TTL eviction, disk-pressure check, key health check (every `key_health_interval`), heartbeat RPC (every 60s for disconnect detection)
- Cache operational
Crash recovery service (kiseki-cache-scrub):
A systemd one-shot service (or cron job) that runs on node boot and
every 60 seconds. Scans $KISEKI_CACHE_DIR for all tenant/pool
directories, wipes any whose pool.lock has no live flock holder.
This covers the case where no subsequent kiseki process starts on the
node after a crash.
Steady state:
- Reads: L1 → L2 (CRC32 verify) → canonical (decrypt + SHA-256 verify → store in L1/L2 with CRC32 trailer)
- Writes: straight to canonical; update local metadata cache
- Background: periodic L1 expired-entry eviction, L2 disk-pressure check, key health check, heartbeat RPC
Disconnect (fabric unreachable):
- Reads from L1/L2 continue to be served (chunks are immutable)
- After `max_disconnect_seconds` with no successful RPC to any canonical endpoint: wipe the entire cache (I-CC6)
- On reconnect before the threshold: resume normal operation
Process exit (clean):
- Wipe L2 (zeroize all chunk files, delete pool directory)
- L1 freed with process memory (`Zeroizing` drop handles cleanup)
- Release flock on `pool.lock`
Process exit (crash):
- L2 chunk files remain on NVMe (no zeroize opportunity)
- Next process start or the `kiseki-cache-scrub` service detects the orphaned pool via flock check and wipes it
10. Configuration surface
| Linkage mode | Configuration mechanism |
|---|---|
| FUSE mount | Mount options: -o cache_mode=organic,cache_l2_max=50G,cache_dir=/tmp/kiseki |
| Rust API | CacheConfig struct passed to Client::new() |
| Python | kiseki.Client(cache_mode="pinned", cache_l2_max=50*1024**3) |
| C FFI | kiseki_open() with KisekiCacheConfig struct |
| Environment | KISEKI_CACHE_MODE, KISEKI_CACHE_DIR, KISEKI_CACHE_L1_MAX, KISEKI_CACHE_L2_MAX, KISEKI_CACHE_META_TTL_MS, KISEKI_CACHE_POOL_ID |
Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings (§5).
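The priority chain plus the policy clamp reduces to an `Option` fallback and a `min`. A sketch with an illustrative function name, shown for the L2 ceiling:

```rust
/// Resolve one config value: API/mount option wins over the environment,
/// which wins over the policy default; the result is always clamped to
/// the policy ceiling (§5). Name is illustrative, not the real API.
fn effective_l2_max(
    api: Option<u64>,
    env: Option<u64>,
    policy_default: u64,
    ceiling: u64,
) -> u64 {
    api.or(env).unwrap_or(policy_default).min(ceiling)
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // API value wins over env, but is clamped to the policy ceiling.
    assert_eq!(effective_l2_max(Some(100 * GIB), Some(20 * GIB), 50 * GIB, 50 * GIB), 50 * GIB);
    // No API value: the env var applies, already under the ceiling.
    assert_eq!(effective_l2_max(None, Some(20 * GIB), 50 * GIB, 50 * GIB), 20 * GIB);
    // Nothing set: policy default.
    assert_eq!(effective_l2_max(None, None, 50 * GIB, 50 * GIB), 50 * GIB);
}
```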
11. Observability
Cache metrics exposed via the client’s local metrics (not Prometheus — client runs on compute nodes, not storage nodes):
| Metric | Type | Description |
|---|---|---|
| cache_l1_hits | counter | L1 (memory) cache hits |
| cache_l2_hits | counter | L2 (NVMe) cache hits |
| cache_misses | counter | Cache misses (bypassed to canonical) |
| cache_bypasses | counter | Bypass mode reads (intentional non-cache) |
| cache_errors | counter | L2 I/O errors (bypassed to canonical per I-CC7) |
| cache_l1_bytes | gauge | Current L1 memory usage |
| cache_l2_bytes | gauge | Current L2 disk usage |
| cache_staged_datasets | gauge | Number of pinned datasets |
| cache_staged_bytes | gauge | Total bytes in pinned datasets |
| cache_meta_hits | counter | Metadata cache hits (within TTL) |
| cache_meta_misses | counter | Metadata cache misses (TTL expired or absent) |
| cache_wipes | counter | Full cache wipes (disconnect threshold, restart, crypto-shred) |
| cache_l2_read_latency_us | histogram | L2 NVMe read latency |
| cache_l2_write_latency_us | histogram | L2 NVMe write latency |
Metrics available via workflow advisory telemetry (scoped to caller)
and via local API (cache_stats()).
Consequences
Positive
- Repeat reads served from local NVMe: order-of-magnitude latency reduction for training datasets, inference weights, simulation input
- Staging API with scheduler handoff eliminates thundering-herd on job start
- Three modes match the three dominant workload patterns precisely
- Plaintext cache means cache hits avoid decryption cost entirely
- Policy model reuses the existing `TenantConfig` distribution — no new subsystem
- Content-addressed chunk immutability makes cache correctness simple (I-C1 is the foundation)
- Crash recovery via flock-based orphan detection + scrubber service
Negative
- Plaintext on local NVMe is a security surface. Mitigated by zeroize, file permissions, wipe-on-exit, crash scrubber, and ephemeral-only semantics (I-CC2, I-CC8). Residual FTL risk documented.
- Metadata TTL introduces a staleness window including for deleted files. Mitigated by short default (5s) and write-through for own writes (I-CC3, I-CC5)
- L2 NVMe cache competes with application use of local NVMe (e.g., scratch, checkpoint). Mitigated by configurable per-process ceiling, per-node ceiling, and disk-pressure backoff (§8)
- No cross-process chunk sharing within a tenant means duplicate chunks when multiple jobs for the same tenant run on the same node. Accepted trade-off: simplicity over hit-rate optimization
Neutral
- Bypass mode has zero overhead (no cache code on read path)
- Staging is idempotent and resumable
- Cache wipe on long disconnect is conservative but safe
- Policy distribution via data-path gRPC works in all deployment topologies (no gateway or control plane access required)
Adversarial findings
| ID | Severity | Section | Finding | Resolution |
|---|---|---|---|---|
| CC-ADV-1 | Critical | §1, §9 | Crash leaves plaintext on NVMe unreachable by zeroize. Process crash skips the exit wipe path. | Resolved: startup scavenger wipes orphaned pools (flock detection). kiseki-cache-scrub systemd/cron service runs on boot + every 60s for nodes where no subsequent kiseki process starts. Residual FTL risk documented. §9 updated. |
| CC-ADV-2 | Critical | §7 | Crypto-shred detection has no reliable delivery path. Advisory channel is best-effort. | Resolved: periodic key health check (default 30s) as primary detection. Advisory channel as fast path. KMS error on next operation as tertiary. Unreachability falls through to disconnect timer (I-CC6). Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30s default. §7 updated, I-CC12 revised. |
| CC-ADV-3 | High | §1 | L2 read verification unspecified. Full SHA-256 on every read too expensive for training throughput. | Resolved: CRC32 trailer on each L2 chunk file, verified on read. SHA-256 only at fetch time. CRC32 catches bit-flips at ~1 GB/s cost. CRC mismatch → bypass canonical + delete entry. New I-CC13. §1 updated. |
| CC-ADV-4 | High | §1 | cache.lock flock contradicts separate-pool semantics for concurrent same-tenant processes. | Resolved: per-process pool_id subdirectory (128-bit CSPRNG). Each process has own pool.lock. No contention between concurrent processes. L2 layout updated in §1. |
| CC-ADV-5 | High | §6 | Staging CLI is separate process — workload’s wipe-on-start destroys staged data. | Resolved: staging daemon holds flock; workload adopts pool via KISEKI_CACHE_POOL_ID env var instead of creating new pool. Handoff mechanism specified in §6. I-CC8 revised to include adoption path. |
| CC-ADV-6 | High | §5 | Policy resolution via gateway unreachable in some topologies. | Resolved: primary path is GetCachePolicy RPC on data-path gRPC channel to any storage node. No gateway or control plane access required. Fallback chain: data-path → gateway → persisted last-known → conservative defaults. §5 updated. |
| CC-ADV-7 | Medium | §3, §4 | Metadata TTL authority doesn’t explicitly cover file deletion case. | Resolved: I-CC3 text now explicitly states that serving data for a deleted file within TTL is an accepted consequence. I-CC5 updated to cover deletion. §3 updated. |
| CC-ADV-8 | Medium | §2 | Pinned mode has no mechanism to detect canonical dataset updates. | Resolved: documented as intentional. Pinned mode stages a point-in-time snapshot. Update requires explicit release + re-stage. §2 updated. |
| CC-ADV-9 | Medium | §8 | No aggregate capacity enforcement across processes on same node. | Resolved: max_node_cache_bytes policy attribute (default 80% of cache filesystem). Cooperative enforcement: each process sums all pools before inserting. Disk-pressure 90% as hard backstop. §8 updated, policy table updated. |
| CC-ADV-10 | Medium | §1 | NVMe FTL retains physical copies after software zeroize. | Resolved: acknowledged as residual risk. Recommended hardening: OPAL/SED NVMe with per-boot key rotation. §1 updated. |
| CC-ADV-11 | Medium | §6 | Staging conflates namespace path with single composition. | Resolved: staging flow now specifies recursive directory enumeration with max_staging_depth (default 10) and max_staging_files (default 100,000). §6 flow updated. |
| CC-ADV-12 | Low | §4 | I-CC3, I-CC4, I-CC5 partially overlap. | Resolved: I-CC3 and I-CC4 consolidated into single I-CC3 covering freshness, authority, and deletion case. I-CC5 retained as the externally-facing staleness guarantee. Invariant table updated. |
| CC-ADV-13 | Low | §9 | Disconnect detection mechanism unspecified. | Resolved: defined as “no successful RPC to any canonical endpoint for max_disconnect_seconds consecutive seconds.” Client maintains last_successful_rpc timestamp. Background heartbeat every 60s. I-CC6 updated with detection mechanism. |
| CC-ADV-14 | Low | §11 | Missing L2 read/write latency metrics. | Resolved: added cache_l2_read_latency_us and cache_l2_write_latency_us histograms to metrics table. §11 updated. |
Invariant impact
| Invariant | Impact |
|---|---|
| I-C1 | Foundation: chunk immutability enables the cache. No change to I-C1. |
| I-K1, I-K2 | Unchanged: plaintext never leaves the compute node. Cache stores plaintext locally, same trust domain as process memory. |
| I-WA18 | Reused: cache policy changes apply prospectively. |
| I-WA7 | Reused: scope narrowing pattern for policy hierarchy. |
New invariants
| ID | Invariant |
|---|---|
| I-CC1 | A chunk in pinned or organic mode is served from cache if and only if (a) the chunk was fetched from canonical and verified by chunk_id content-address match (SHA-256) at fetch time, and (b) no crypto-shred event has been detected for that tenant since fetch. Chunks are immutable in canonical (I-C1); therefore a verified chunk remains correct indefinitely absent crypto-shred. |
| I-CC2 | Cached plaintext is overwritten with zeros (zeroize) before deallocation, eviction, or cache wipe. File-level: overwrite contents before unlink. Memory-level: Zeroizing<Vec<u8>> for L1 entries. This provides logical-level erasure; physical-level erasure on flash storage requires hardware encryption (OPAL/SED). |
| I-CC3 | File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical. Within the TTL window, the cached mapping is authoritative: it may serve data for files that have since been modified or deleted in canonical. This is the sole freshness window in the cache design — chunk data itself has no TTL. |
| I-CC5 | Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical is visible to a caching client within at most one metadata TTL period (default 5s). |
| I-CC6 | Cached entries remain authoritative across fabric disconnects shorter than max_disconnect_seconds (default 300s). Beyond this threshold, the entire cache (L1 + L2) is wiped. Disconnect defined as: no successful RPC to any canonical endpoint for the threshold duration. Background heartbeat RPCs (every 60s) maintain the last_successful_rpc timestamp. |
| I-CC7 | Any local cache error (L2 I/O failure, CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. |
| I-CC8 | The cache is ephemeral. On process start, the client either creates a new L2 pool (wiping orphaned pools detected via flock) or adopts an existing pool via KISEKI_CACHE_POOL_ID. A kiseki-cache-scrub service runs on node boot and periodically to clean orphaned pools from crashed processes. |
| I-CC9 | When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). Policy is fetched via data-path gRPC (primary), gateway (secondary), persisted last-known (tertiary), or conservative defaults (fallback). |
| I-CC10 | Cache policy changes apply to new sessions only. Active sessions continue under session-start policy (consistent with I-WA18). |
| I-CC11 | Staged chunks are fetched from canonical, verified by content-address, and stored with pinned retention as a point-in-time snapshot. The staged version is immutable in the cache regardless of canonical updates. To pick up updates, the user must explicitly release and re-stage. Staging enumerates directory trees recursively up to max_staging_depth (10) and max_staging_files (100,000). |
| I-CC12 | On crypto-shred event, all cached plaintext for the affected tenant is wiped from L1 and L2 with zeroize. Detection via periodic key health check (default 30s), advisory channel notification, or KMS error on next operation. Maximum detection latency bounded by min(key_health_interval, max_disconnect_seconds). |
| I-CC13 | L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer. On L2 read, the CRC32 is verified before serving. Mismatch triggers bypass to canonical and L2 entry deletion. |
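As a concrete illustration of the I-CC13 check, the sketch below encodes an L2 entry with a 4-byte CRC32 trailer at insert time and verifies it on read, returning nothing on mismatch (the bypass-to-canonical path). The entry layout, function names, and little-endian trailer encoding are assumptions for illustration; only the CRC32 trailer and the mismatch-triggers-bypass behavior come from the invariant.

```rust
// Illustrative sketch of the I-CC13 read-path check. Names, entry layout,
// and little-endian trailer encoding are assumptions, not the real API.

/// CRC32 (IEEE 802.3, reflected, polynomial 0xEDB88320), bitwise.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if LSB set, else 0
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Insert time: append the 4-byte CRC trailer (little-endian assumed).
fn encode_entry(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Read time: verify the trailer before serving. `None` means
/// "bypass to canonical and delete the L2 entry" per I-CC13.
fn decode_entry(entry: &[u8]) -> Option<&[u8]> {
    if entry.len() < 4 {
        return None;
    }
    let (payload, trailer) = entry.split_at(entry.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(payload) == stored).then_some(payload)
}

fn main() {
    let entry = encode_entry(b"chunk-bytes");
    assert_eq!(decode_entry(&entry), Some(&b"chunk-bytes"[..]));

    // A single flipped bit must fail verification and trigger bypass.
    let mut corrupted = entry.clone();
    corrupted[0] ^= 0x01;
    assert_eq!(decode_entry(&corrupted), None);
}
```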
Spec references
- specs/features/native-client.feature — cache hit/invalidation/staging scenarios (extend)
- specs/features/control-plane.feature — cache policy distribution scenarios (extend)
- specs/invariants.md — add I-CC1 through I-CC13
- specs/ubiquitous-language.md — add cache-specific terms
- specs/failure-modes.md — add F-CC1 through F-CC4
- specs/assumptions.md — add A-CC1 through A-CC4
ADR-032: Async GatewayOps
Status: Accepted
Date: 2026-04-24
Traces: I-L2, I-L5, I-V3, I-WA2, I-C2, I-C5, I-L8
Context
GatewayOps is a synchronous trait used by all three protocol gateways
(S3, NFS, FUSE) to perform reads and writes through the composition and
chunk stores. When the Raft-backed log store was introduced, the sync
trait required a sync→async bridge (run_on_raft) that blocks the
calling thread while waiting for Raft consensus.
Under concurrent load (concurrent requests ≥ the Raft runtime thread count), this causes thread
starvation: all Raft threads are occupied polling client_write futures,
leaving no thread for the Raft core loop to dispatch entries. The current
mitigation (KISEKI_RAFT_THREADS = cpus/2) works but wastes resources
and imposes a concurrency ceiling equal to the thread count.
For HPC/ML workloads with hundreds to thousands of concurrent writers, the thread-per-request model is unsustainable. The gateway must not block OS threads while waiting for Raft consensus.
Decision
Make GatewayOps an async trait. All protocol gateways call async
methods directly. NFS and FUSE callers bridge async→sync via
tokio::runtime::Handle::block_on on a dedicated runtime (the reverse
of the current problem, but on threads they own — OS threads that are
explicitly meant to block).
Trait change
// Before (sync)
pub trait GatewayOps: Send + Sync {
    fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    fn list(...) -> Result<...>;
    fn delete(...) -> Result<...>;
    // ...
}

// After (async)
pub trait GatewayOps: Send + Sync {
    async fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    async fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    async fn list(...) -> Result<...>;
    async fn delete(...) -> Result<...>;
    // ...
}
Mutex strategy
Replace std::sync::Mutex with tokio::sync::Mutex for
CompositionStore and ChunkStore in InMemoryGateway. Lock guards
must NOT be held across .await points that perform disk I/O or Raft
submissions — acquire, do in-memory work, drop, then await I/O.
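The acquire/work/drop/await rule can be illustrated by scoping the guard to a block so it drops before the slow operation begins. The sketch below uses std::sync::Mutex and a synchronous stub for the awaited work to stay dependency-free; the ADR itself specifies tokio::sync::Mutex, and every name here is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Stand-in for the awaited Raft submission or disk I/O; in the real
// gateway this would be an `.await` on an async operation.
fn slow_io_or_raft_submit(_delta: &str) {}

fn write(store: &Mutex<HashMap<String, u64>>, key: &str) -> u64 {
    // 1. Acquire, do the in-memory work, and drop the guard at the end
    //    of this block, before any slow I/O or Raft submission.
    let version = {
        let mut guard = store.lock().unwrap();
        let v = guard.entry(key.to_string()).or_insert(0);
        *v += 1;
        *v
    }; // guard dropped here

    // 2. Perform the slow work with no lock held, so other writers are
    //    not serialized behind it.
    slow_io_or_raft_submit(key);
    version
}

fn main() {
    let store = Mutex::new(HashMap::new());
    assert_eq!(write(&store, "obj-a"), 1);
    assert_eq!(write(&store, "obj-a"), 2);
}
```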
Protocol gateway changes
| Protocol | Current | After |
|---|---|---|
| S3 (axum) | block_in_place(|| gateway.write()) | gateway.write().await |
| NFS (std::thread) | gateway.write() | rt.block_on(gateway.write()) on NFS thread |
| FUSE (fuser threads) | gateway.write() | rt.block_on(gateway.write()) on fuser thread |
S3 becomes fully non-blocking. NFS and FUSE threads block as before, but they block on threads they own (not tokio worker threads), so there is no starvation.
LogOps bridge
LogOps::append_delta becomes async. The run_on_raft bridge is
removed — the Raft runtime’s handle is used directly via .await from
async gateway methods. No mpsc::recv blocking, no thread starvation.
Invariant preservation
The async conversion preserves all invariants by maintaining the same
happens-before ordering via .await:
| Invariant | Guarantee |
|---|---|
| I-L2 | Gateway awaits Raft commit before returning to client |
| I-L5 | Chunk writes awaited before composition finalize delta |
| I-V3 | Read-your-writes: last_written_seq set after awaited write |
| I-C2 | Refcount ops after awaited chunk confirm |
| I-C5 | Capacity check before async write submission |
| I-L8 | Shard membership validated before async rename |
| I-WA2 | Advisory lookups remain sync + bounded (≤500 µs timeout) |
Concurrency model
With async GatewayOps, the concurrency ceiling becomes the tokio task limit (effectively unbounded) instead of the thread count. Thousands of concurrent writes share a fixed thread pool without starvation.
Migration
Big-bang conversion. All callers updated in one pass:
- Make GatewayOps async (trait + InMemoryGateway impl)
- Replace std::sync::Mutex → tokio::sync::Mutex in gateway
- Make LogOps async, remove the run_on_raft bridge
- Update S3 handlers: remove block_in_place, use .await
- Update NFS server: add rt.block_on() wrapper on NFS threads
- Update FUSE daemon: add rt.block_on() wrapper on fuser threads
- Update all tests + BDD step definitions
- Remove KISEKI_RAFT_THREADS (no longer needed)
Consequences
Benefits:
- No thread starvation under any concurrency level
- S3 handler is fully non-blocking (proper async axum)
- Removes run_on_raft, block_in_place, KISEKI_RAFT_THREADS
- Single Raft runtime (no dedicated runtime needed)
- Clean async-all-the-way data path
Costs:
- Large refactor touching all protocol gateways and tests
- NFS/FUSE need a tokio runtime handle for block_on
- tokio::sync::Mutex has slightly higher per-lock overhead than std::sync::Mutex (but eliminates thread starvation)
- Async trait requires Send + 'static bounds on futures
Risks:
- tokio::sync::Mutex held across .await can cause deadlocks if not careful. Mitigated by a code review rule: never hold the gateway mutex across a Raft submission or disk I/O.
- NFS/FUSE block_on on a non-tokio thread: works correctly, but must not be called from within a tokio context (the same issue already solved with std::thread::spawn for runtime creation).
Implementation Notes (2026-04-24)
CompositionOps reverted to sync. The initial implementation made
CompositionOps async, but holding tokio::sync::Mutex<CompositionStore>
across emit_delta().await serialized all writes behind a single Raft
round-trip — the same bottleneck as before, just without thread starvation.
Final architecture:
- GatewayOps: async (S3 handlers await directly)
- LogOps: async (Raft consensus)
- CompositionOps: sync (in-memory HashMap operations only)
Gateway write pattern (lock-free):
- Lock compositions → create() (sync, microseconds) → drop lock
- Emit delta to log (async, Raft consensus, ~8ms) — no lock held
- If emission fails, re-acquire lock and rollback (PIPE-ADV-1)
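The three steps above can be sketched as follows. All types and names are illustrative stand-ins (the real emission is an async Raft round-trip, faked here with a boolean); the point is the lock scoping and the PIPE-ADV-1 rollback when emission fails.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch of the lock-scoped gateway write pattern with the
// PIPE-ADV-1 rollback. Types and names are hypothetical, not the real API.
struct Gateway {
    compositions: Mutex<HashMap<String, String>>,
}

impl Gateway {
    fn write(&self, id: &str, body: &str, emit_ok: bool) -> Result<(), String> {
        // Step 1: lock, create in memory (microseconds), drop the lock.
        {
            let mut comps = self.compositions.lock().unwrap();
            comps.insert(id.to_string(), body.to_string());
        }

        // Step 2: emit the delta with no lock held. In the real gateway
        // this is the async Raft round-trip; a boolean stands in here.
        let emitted = emit_ok;

        // Step 3: on emission failure, re-acquire the lock and roll back.
        if !emitted {
            self.compositions.lock().unwrap().remove(id);
            return Err("delta emission failed; composition rolled back".into());
        }
        Ok(())
    }
}

fn main() {
    let gw = Gateway { compositions: Mutex::new(HashMap::new()) };

    assert!(gw.write("a", "v1", true).is_ok());
    assert!(gw.compositions.lock().unwrap().contains_key("a"));

    // A failed emission leaves no trace of the composition (rollback).
    assert!(gw.write("b", "v1", false).is_err());
    assert!(!gw.compositions.lock().unwrap().contains_key("b"));
}
```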
NFS/FUSE bridge: block_gateway() helper uses block_in_place
when on a tokio worker thread (tests), or direct block_on on OS
threads (production NFS/FUSE daemon).
Result: 1MB write throughput: 39.5 → 380.2 MB/s (9.6x improvement). 32 concurrent S3 PUTs complete in 50ms with no deadlock.