Kiseki

Kiseki is a distributed storage system built for HPC and AI workloads. It provides a unified data plane that serves files and objects through multiple protocol gateways (S3, NFS, FUSE) while handling encryption, replication, and caching transparently.

Key Features

  • S3 and NFS gateways – access the same data through S3-compatible HTTP, NFSv3/v4.2, or a native FUSE mount. Protocol gateways translate wire protocols into operations on the shared log-structured data model.

  • Client-side cache with staging – a two-tier cache (L1 in-memory, L2 local NVMe) on compute nodes eliminates repeated fabric traversals. Three modes (pinned, organic, bypass) match the dominant workload patterns: epoch-reuse training, mixed inference, and streaming ingest.

  • Per-shard Raft consensus – every shard is a single-tenant Raft group. Deltas (metadata mutations) are totally ordered within a shard and replicated to a quorum before acknowledgement.

  • Erasure coding and placement – chunks are stored across affinity pools with configurable EC profiles. The placement engine distributes data across device classes (fast-NVMe, bulk-NVMe) and rebuilds lost chunks from parity.

  • FIPS 140-2/3 encryption – always-on, two-layer envelope encryption. System DEKs (AES-256-GCM via aws-lc-rs) encrypt chunk data; tenant KEKs wrap the DEKs for access control. Five tenant KMS backends: Kiseki-Internal, HashiCorp Vault, KMIP 2.1, AWS KMS, PKCS#11.

  • GPU-direct and fabric transports – the native client selects the fastest available transport: libfabric/CXI (Slingshot), RDMA verbs, or TCP+TLS. Transport selection is automatic based on fabric discovery.

  • Multi-tenant isolation – tenant hierarchy (organization / project / workload) with per-level quotas, compliance tags, and key isolation. Shards are single-tenant. Cross-tenant data access is out of scope by design.

  • OIDC and mTLS authentication – Keycloak (or any OIDC provider) for identity; Cluster CA-signed mTLS certificates for data-fabric authentication. Certificates encode identity in the SAN, so the hot path requires no control-plane access.

  • Workflow advisory – a bidirectional advisory channel carries workload hints (access pattern, prefetch range, priority) inbound and telemetry feedback (backpressure, locality, staleness) outbound. The advisory path is side-by-side with the data path – it never blocks or delays data operations.

Architecture at a Glance

Kiseki is a single-language Rust system organized as 18 crates in a Cargo workspace:

Layer | Crates
----- | ------
Foundation | kiseki-common, kiseki-proto, kiseki-crypto, kiseki-transport
Data path | kiseki-log, kiseki-block, kiseki-chunk, kiseki-composition, kiseki-view
Protocol | kiseki-gateway (NFS + S3)
Client | kiseki-client (FUSE, FFI, Python via PyO3)
Infrastructure | kiseki-raft, kiseki-keymanager, kiseki-audit, kiseki-advisory, kiseki-control
Integration | kiseki-server, kiseki-acceptance

The data model is log-structured: mutations are recorded as deltas appended to per-shard Raft logs. Compositions describe how content-addressed, encrypted chunks assemble into files or objects. Views are materialized projections of shard state, maintained incrementally by stream processors and served by protocol gateways.
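
The log/composition/view relationship can be sketched in a few lines of Python (illustrative names only, not the Kiseki API):

```python
from dataclasses import dataclass, field

@dataclass
class Delta:
    """A metadata mutation appended to a shard's log."""
    op: str             # "put" or "delete"
    key: str
    chunks: tuple = ()  # content-addressed chunk ids for "put"

@dataclass
class ShardLog:
    """Totally ordered per-shard log; the list index stands in for the Raft log position."""
    entries: list = field(default_factory=list)

    def append(self, delta: Delta) -> int:
        self.entries.append(delta)
        return len(self.entries) - 1  # log index

def materialize_view(log: ShardLog) -> dict:
    """Replay deltas in order to build the current key -> chunks projection."""
    view = {}
    for d in log.entries:
        if d.op == "put":
            view[d.key] = d.chunks
        elif d.op == "delete":
            view.pop(d.key, None)
    return view

log = ShardLog()
log.append(Delta("put", "hello.txt", ("c1", "c2")))
log.append(Delta("put", "tmp.bin", ("c3",)))
log.append(Delta("delete", "tmp.bin"))
print(materialize_view(log))  # {'hello.txt': ('c1', 'c2')}
```

In the real system the view is maintained incrementally by stream processors rather than replayed from scratch, but the projection it computes is the same.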

Four binaries are produced:

Binary | Role
------ | ----
kiseki-server | Storage node (log + chunk + composition + view + gateways + audit + advisory)
kiseki-keyserver | HA system key manager (Raft-replicated)
kiseki-client-fuse | Compute-node FUSE mount with native client
kiseki-control | Control plane (tenancy, IAM, policy, federation)

Target Workloads

Workload | How Kiseki helps
-------- | ----------------
LLM training | Tokenized datasets staged once per job, served from local NVMe cache across epochs. Pinned cache mode prevents eviction.
LLM inference | Model weights cold-started into cache on first load, then served locally for all replicas on the node.
Climate / weather simulation | Boundary conditions staged with hard deadline via Slurm prolog. Input files cached; checkpoint writes bypass the cache.
HPC checkpoint/restart | Checkpoint writes go straight to canonical (bypass mode). Restart reads benefit from organic caching if the same node is reused.

Getting Started

This guide walks through running a single-node Kiseki stack with Docker Compose, verifying the deployment, and performing basic S3 operations.

Prerequisites

  • Docker 24+ with Compose V2 (docker compose)
  • curl (for health checks)
  • aws-cli (optional, for S3 operations)

If building from source instead of Docker:

  • Rust 1.78+ (stable)
  • Protobuf compiler (protoc)

Quick Start with Docker Compose

The repository includes a docker-compose.yml that brings up a single-node Kiseki server with supporting services:

Service | Port | Purpose
------- | ---- | -------
kiseki-server | 9000 | S3 HTTP gateway
kiseki-server | 2049 | NFS (v3 + v4.2)
kiseki-server | 9090 | Prometheus metrics
kiseki-server | 9100 | Data-path gRPC
kiseki-server | 9101 | Advisory gRPC
jaeger | 16686 | Tracing UI
jaeger | 4317 | OTLP gRPC receiver
vault | 8200 | HashiCorp Vault (dev mode, tenant KMS)
keycloak | 8080 | Keycloak (OIDC identity provider)

Start the stack:

docker compose up --build -d

Wait for all services to become healthy:

docker compose ps

The kiseki-server container sets KISEKI_BOOTSTRAP=true, which creates an initial shard for immediate use.

Verify the Deployment

Health Check

The data-path gRPC port responds to TCP connections when the server is ready:

# TCP probe on the data-path port
timeout 1 bash -c 'echo > /dev/tcp/127.0.0.1/9100'
echo $?  # 0 = healthy
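
The same readiness probe as a small Python helper (standard library only; host and port follow the compose defaults above):

```python
import socket

def wait_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds (server ready)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the compose stack up, this reports readiness of the data-path port:
# wait_ready("127.0.0.1", 9100)
```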

Prometheus Metrics

curl -s http://localhost:9090/metrics | head -20

Jaeger Tracing

Open http://localhost:16686 in a browser to view distributed traces. The server exports traces via OTLP to Jaeger automatically.

Vault (Dev Mode)

Vault runs in dev mode with root token kiseki-e2e-token:

curl -s http://localhost:8200/v1/sys/health | python3 -m json.tool

Keycloak

Keycloak is available at http://localhost:8080 with admin credentials admin / admin.

S3 Operations

With aws-cli configured to point at the local S3 gateway:

# Configure a local profile (no real AWS credentials needed)
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1

# Create a bucket (maps to a Kiseki namespace)
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket

# Upload a file
echo "hello kiseki" > /tmp/hello.txt
aws --endpoint-url http://localhost:9000 s3 cp /tmp/hello.txt s3://test-bucket/hello.txt

# Download and verify
aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/hello.txt /tmp/hello-back.txt
cat /tmp/hello-back.txt

Or with curl directly:

# List buckets
curl -s http://localhost:9000/

# PUT an object
curl -X PUT http://localhost:9000/test-bucket/greeting.txt \
     -d "hello from curl"

# GET it back
curl -s http://localhost:9000/test-bucket/greeting.txt

Multi-Node Cluster

A three-node cluster configuration is also provided:

docker compose -f docker-compose.3node.yml up --build -d

This starts three kiseki-server instances that form Raft groups for shard replication.

Building from Source

# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki
cargo build --release

# Run the server
KISEKI_BOOTSTRAP=true \
KISEKI_DATA_DIR=/tmp/kiseki-data \
KISEKI_S3_ADDR=0.0.0.0:9000 \
KISEKI_NFS_ADDR=0.0.0.0:2049 \
KISEKI_DATA_ADDR=0.0.0.0:9100 \
KISEKI_METRICS_ADDR=0.0.0.0:9090 \
  ./target/release/kiseki-server

Next Steps

S3 API

Kiseki exposes an S3-compatible HTTP gateway on port 9000 (configurable via KISEKI_S3_ADDR). The gateway implements the subset of S3 API operations needed by HPC/AI workloads (ADR-014). Unsupported operations return 501 Not Implemented.

Endpoint

http://<node>:9000

In the Docker Compose development stack, the endpoint is http://localhost:9000.

Authentication

Kiseki supports AWS Signature Version 4 authentication:

  • Authorization header – standard SigV4 signing for aws-cli, boto3, and other SDK clients.
  • Presigned URLs – planned for a future release (not yet implemented).

In development mode (Docker Compose), any access key and secret key values are accepted.

Supported Operations

Bucket Operations

S3 buckets map to Kiseki namespaces. Creating a bucket creates a tenant-scoped namespace; deleting a bucket deletes the namespace.

Operation | S3 API | Notes
--------- | ------ | -----
Create bucket | PUT /{bucket} | Maps to namespace creation
Delete bucket | DELETE /{bucket} | Maps to namespace deletion
Head bucket | HEAD /{bucket} | Existence check
List buckets | GET / | Per-tenant bucket listing

Object Operations

Operation | S3 API | Notes
--------- | ------ | -----
Put object | PUT /{bucket}/{key} | Single-part upload
Get object | GET /{bucket}/{key} | Including byte-range reads (Range header)
Head object | HEAD /{bucket}/{key} | Metadata retrieval
Delete object | DELETE /{bucket}/{key} | Tombstone or delete marker (versioning)
List objects | GET /{bucket}?list-type=2 | ListObjectsV2 with prefix, delimiter, pagination

Multipart Upload

For objects larger than a single PUT (large datasets, model weights):

Operation | S3 API | Notes
--------- | ------ | -----
Create multipart upload | POST /{bucket}/{key}?uploads | Returns upload ID
Upload part | PUT /{bucket}/{key}?partNumber={n}&uploadId={id} | Upload one part
Complete multipart upload | POST /{bucket}/{key}?uploadId={id} | Assemble parts into final object
Abort multipart upload | DELETE /{bucket}/{key}?uploadId={id} | Clean up incomplete upload
List multipart uploads | GET /{bucket}?uploads | List in-progress uploads
List parts | GET /{bucket}/{key}?uploadId={id} | List parts of an in-progress upload

Versioning

Operation | S3 API | Notes
--------- | ------ | -----
Get object version | GET /{bucket}/{key}?versionId={v} | Specific version retrieval
List object versions | GET /{bucket}?versions | Version listing
Delete object version | DELETE /{bucket}/{key}?versionId={v} | Delete specific version

Conditional Operations

Header | Direction | Notes
------ | --------- | -----
If-None-Match | Write | Conditional write (create-if-not-exists)
If-Match | Write | Conditional write (update-if-matches)
If-Modified-Since | Read | Conditional read

Examples

aws-cli

# Set up environment
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1
ENDPOINT="--endpoint-url http://localhost:9000"

# Bucket operations
aws $ENDPOINT s3 mb s3://datasets
aws $ENDPOINT s3 ls

# Upload a directory
aws $ENDPOINT s3 sync ./training-data/ s3://datasets/imagenet/

# Download a file
aws $ENDPOINT s3 cp s3://datasets/imagenet/train.tar /tmp/train.tar

# Multipart upload (automatic for large files)
aws $ENDPOINT s3 cp ./large-model.bin s3://datasets/models/gpt.bin

# List objects with prefix
aws $ENDPOINT s3 ls s3://datasets/imagenet/ --recursive

# Delete
aws $ENDPOINT s3 rm s3://datasets/imagenet/train.tar

curl

# Create a bucket
curl -X PUT http://localhost:9000/my-bucket

# PUT an object
curl -X PUT http://localhost:9000/my-bucket/config.json \
     -H "Content-Type: application/json" \
     -d '{"epochs": 100, "batch_size": 32}'

# GET an object
curl -s http://localhost:9000/my-bucket/config.json

# HEAD an object (metadata only)
curl -I http://localhost:9000/my-bucket/config.json

# Byte-range read (first 1024 bytes)
curl -s http://localhost:9000/my-bucket/large-file.bin \
     -H "Range: bytes=0-1023"

# DELETE an object
curl -X DELETE http://localhost:9000/my-bucket/config.json

# List objects (ListObjectsV2)
curl -s "http://localhost:9000/my-bucket?list-type=2&prefix=models/"

# Delete a bucket
curl -X DELETE http://localhost:9000/my-bucket

Python (boto3)

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="kiseki",
    aws_secret_access_key="kiseki",
    region_name="us-east-1",
)

# Create bucket
s3.create_bucket(Bucket="training")

# Upload
s3.put_object(Bucket="training", Key="data.csv", Body=b"col1,col2\n1,2\n")

# Download
obj = s3.get_object(Bucket="training", Key="data.csv")
print(obj["Body"].read().decode())

# List
for item in s3.list_objects_v2(Bucket="training")["Contents"]:
    print(item["Key"], item["Size"])

Bucket-to-Namespace Mapping

Every S3 bucket maps 1:1 to a Kiseki namespace within the authenticated tenant’s scope. Bucket names become namespace identifiers. Buckets from different tenants are fully isolated – two tenants can have buckets with the same name without conflict.

Objects within a bucket map to Kiseki compositions. Each object version corresponds to a sequence of deltas in the shard that owns the namespace.

Encryption Handling

Kiseki always encrypts all data (invariant I-K1). S3 server-side encryption headers are handled as follows:

Header | Behavior
------ | --------
SSE-S3 (x-amz-server-side-encryption: AES256) | Acknowledged, no-op. System encryption is always on.
SSE-KMS with matching ARN | Acknowledged if the ARN matches the tenant KMS config.
SSE-KMS with different ARN | Rejected. Tenants cannot specify arbitrary keys.
SSE-C (x-amz-server-side-encryption-customer-*) | Rejected. Kiseki manages encryption, not the client.

Limitations

The following S3 features are not implemented:

Feature | Reason
------- | ------
Lifecycle policies | Kiseki has its own tiering and retention model
Event notifications (SNS/SQS) | Requires message bus integration
Presigned URLs | Planned for future release
Bucket policies / IAM | Kiseki uses its own IAM and policy model
CORS | Not relevant for HPC/AI workloads
Object Lock | Covered by Kiseki’s retention hold mechanism
S3 Select | Out of scope
Replication configuration | Kiseki manages replication internally
Storage classes | Kiseki uses affinity pools, not S3 storage classes

NFS Access

Kiseki exposes an NFS gateway on port 2049 (configurable via KISEKI_NFS_ADDR) supporting both NFSv3 and NFSv4.2. The gateway translates NFS operations into reads and writes against materialized views and the composition log.

Protocol Support

Protocol | Status | Notes
-------- | ------ | -----
NFSv3 | Supported | Stateless, lower overhead
NFSv4.2 | Supported | Stateful, with lock support and extended attributes

Mounting

Basic Mount

mount -t nfs <node>:/ /mnt/kiseki

With explicit version and options:

# NFSv4.2
mount -t nfs -o vers=4.2,proto=tcp <node>:/ /mnt/kiseki

# NFSv3
mount -t nfs -o vers=3,proto=tcp <node>:/ /mnt/kiseki

Docker Compose (Development)

When using the development Docker Compose stack, the NFS port is published to the host:

mount -t nfs -o vers=4.2,proto=tcp,port=2049 127.0.0.1:/ /mnt/kiseki

fstab Entry

<node>:/ /mnt/kiseki nfs vers=4.2,proto=tcp,hard 0 0

Authentication

Mode | Use case | Notes
---- | -------- | -----
AUTH_SYS | Development and testing | UID/GID-based, no Kerberos
Kerberos (RPCSEC_GSS) | Production | krb5, krb5i, or krb5p security flavors

In development (Docker Compose), AUTH_SYS is used with no additional configuration. For production deployments, Kerberos provides authentication and optional integrity/privacy protection on the wire.

Kiseki always encrypts data at rest regardless of the NFS authentication mode. The gateway performs tenant-layer encryption: clients send plaintext over TLS to the gateway, and the gateway encrypts before writing to the log and chunk store.

Supported Operations

Full Semantics

Operation | Notes
--------- | -----
open, close, read, write | Standard file I/O
create, unlink | File creation and deletion
mkdir, rmdir | Directory creation and deletion
rename (within namespace) | Atomic within shard
stat, fstat, lstat | File metadata
chmod, chown | Permission changes (stored in delta attributes)
readdir, readdirplus | Directory listing from materialized view
symlink, readlink | Stored as inline data in delta
truncate, ftruncate | Composition resize
fsync, fdatasync | Flush to durable (delta committed to Raft quorum)
Extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr
POSIX file locks (fcntl) | Per-gateway lock state
O_APPEND | Atomic append via delta
O_CREAT, O_EXCL | Atomic create-if-not-exists

Limited Semantics

Operation | Limitation
--------- | ----------
rename (cross-namespace) | Returns EXDEV – cannot rename across shards
Hard links | Within namespace only; cross-namespace returns EXDEV
Sparse files | Holes tracked in composition; zero-fill on read
O_DIRECT | Bypasses client cache but still traverses the gateway
flock (advisory) | Best-effort; not guaranteed across gateway failover

Not Supported

Operation | Reason
--------- | ------
Writable shared mmap | Distributed shared writable mmap requires page-level coherence that is not tractable at HPC scale. Read-only mmap is supported. The gateway returns ENOTSUP. See ADR-013.
POSIX ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without benefit for the target workloads.

Namespace Mapping

The NFS root (/) lists the tenant’s namespaces as top-level directories. Each namespace contains the compositions (files and directories) belonging to that namespace. This is analogous to the S3 bucket mapping – the same namespace appears as a bucket via S3 and as a top-level directory via NFS.

/mnt/kiseki/
  training/          <- namespace "training"
    imagenet/
      train.tar
      val.tar
  checkpoints/       <- namespace "checkpoints"
    epoch-001.pt

Performance Considerations

  • Readdir performance – directory listings are served from materialized views, not reconstructed from the log on each request. Views are updated incrementally by stream processors.

  • Write path – writes flow through the gateway to the composition context, which appends deltas to the shard log. An fsync ensures the delta is committed to a Raft quorum before returning.

  • Concurrent access – multiple NFS clients can read the same files concurrently. Write contention within a shard is serialized by the Raft leader.

  • Large files – large files are chunked using content-defined chunking (Rabin fingerprinting). Byte-range reads are served by fetching only the relevant chunks.
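
Mapping a byte range to the chunks that must be fetched can be sketched as follows (fixed example boundaries stand in for the Rabin-derived ones):

```python
import bisect

def chunks_for_range(chunk_sizes: list[int], start: int, end: int) -> list[int]:
    """Return indices of the chunks overlapping bytes [start, end] inclusive."""
    offsets = [0]
    for size in chunk_sizes:
        offsets.append(offsets[-1] + size)   # offsets[i] = first byte of chunk i
    first = bisect.bisect_right(offsets, start) - 1
    last = bisect.bisect_right(offsets, end) - 1
    return list(range(first, last + 1))

# A file of three chunks covering bytes [0,100), [100,250), [250,400):
assert chunks_for_range([100, 150, 150], 0, 99) == [0]       # within first chunk
assert chunks_for_range([100, 150, 150], 90, 120) == [0, 1]  # straddles a boundary
assert chunks_for_range([100, 150, 150], 300, 399) == [2]    # last chunk only
```

This is why a Range GET over a multi-gigabyte object only pulls a handful of chunks from canonical storage.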

Limitations Summary

  1. No writable shared mmap – applications that use writable shared memory-mapped files must use write() instead. Read-only mmap works and is useful for model loading.

  2. Cross-namespace rename returns EXDEV – renaming a file from one namespace to another requires a copy-and-delete at the application level, same as moving files across filesystem boundaries on a traditional system.

  3. No POSIX ACLs – only standard Unix permissions (mode bits). Fine-grained access control is handled by Kiseki’s tenant IAM model, not filesystem-level ACLs.

  4. Lock state is per-gateway – POSIX file locks (fcntl) are maintained by the gateway instance. If a gateway fails over, lock state is lost. Advisory locks (flock) are best-effort.

FUSE Mount

The Kiseki native client provides a FUSE mount that exposes the distributed storage as a local filesystem on compute nodes. Unlike the NFS gateway, the FUSE client runs in the workload’s process space and performs client-side encryption – plaintext never leaves the process.

Building

The FUSE mount is feature-gated. Build the client binary with the fuse feature:

cargo build --release --bin kiseki-client-fuse --features fuse

This requires the fuser crate, which depends on the FUSE kernel module being available on the host:

  • Linux: install fuse3 or libfuse3-dev
  • macOS: install macFUSE

Mounting

kiseki-client-fuse mount /mnt/kiseki \
    --data-addr <storage-node>:9100 \
    --tenant <tenant-id> \
    --namespace <namespace-id>

Mount Options

Options are passed with -o:

kiseki-client-fuse mount /mnt/kiseki \
    -o cache_mode=organic \
    -o cache_dir=/local-nvme/kiseki-cache \
    -o cache_l2_max=100G \
    -o meta_ttl_ms=5000

Option | Values | Default | Description
------ | ------ | ------- | -----------
cache_mode | pinned, organic, bypass | organic | Cache operating mode (see Client Cache)
cache_dir | path | /tmp/kiseki-cache | L2 NVMe cache directory
cache_l1_max | bytes | 256M | L1 (in-memory) cache size
cache_l2_max | bytes | 50G | L2 (NVMe) cache size per process
meta_ttl_ms | milliseconds | 5000 | Metadata cache TTL

Environment Variables

Mount options can also be set via environment variables. Mount options take priority over environment variables.

Variable | Equivalent option
-------- | -----------------
KISEKI_CACHE_MODE | cache_mode
KISEKI_CACHE_DIR | cache_dir
KISEKI_CACHE_L1_MAX | cache_l1_max
KISEKI_CACHE_L2_MAX | cache_l2_max
KISEKI_CACHE_META_TTL_MS | meta_ttl_ms
KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (see staging handoff)

Supported Operations

Read/Write

Operation | Supported | Notes
--------- | --------- | -----
read | Yes | Served from cache (L1 -> L2 -> canonical)
write | Yes | Writes to canonical; local metadata cache updated immediately
open / close | Yes | Standard file handles
fsync / fdatasync | Yes | Flushes delta to Raft quorum
truncate / ftruncate | Yes | Composition resize
O_APPEND | Yes | Atomic append via delta
O_CREAT / O_EXCL | Yes | Atomic create-if-not-exists
O_DIRECT | Limited | Bypasses client cache, still goes through FUSE

Directory Operations

Operation | Supported | Notes
--------- | --------- | -----
mkdir / rmdir | Yes | Create and remove directories
readdir / readdirplus | Yes | Listing from materialized view
rename (within namespace) | Yes | Atomic within shard
rename (cross-namespace) | No | Returns EXDEV

Operation | Supported | Notes
--------- | --------- | -----
stat / fstat / lstat | Yes | File metadata
chmod / chown | Yes | Stored in delta attributes
symlink / readlink | Yes | Symlink targets stored as inline data
Hard links (within namespace) | Yes |
Hard links (cross-namespace) | No | Returns EXDEV
xattr operations | Yes | getxattr, setxattr, listxattr, removexattr

Nested Directories and Write-at-Offset

The FUSE filesystem supports full directory trees within a namespace. Files can be created in nested directories, and writes at arbitrary offsets within a file are supported (the composition tracks chunk references and handles sparse regions with zero-fill).

mkdir -p /mnt/kiseki/experiments/run-42/logs
echo "epoch 1 loss: 0.3" > /mnt/kiseki/experiments/run-42/logs/train.log

# Write at offset (sparse file)
dd if=/dev/zero of=/mnt/kiseki/data/sparse.bin bs=1 count=1 seek=1048576
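
Hole handling for sparse reads amounts to zero-filling any region that has no written extent. A minimal sketch (illustrative, not the composition code):

```python
def sparse_read(extents: dict[int, bytes], offset: int, length: int) -> bytes:
    """Read [offset, offset+length) from a sparse file.
    extents maps extent offset -> written data; gaps read back as zeros."""
    out = bytearray(length)                        # zero-filled by default
    for ext_off, data in extents.items():
        lo = max(offset, ext_off)
        hi = min(offset + length, ext_off + len(data))
        if lo < hi:
            out[lo - offset:hi - offset] = data[lo - ext_off:hi - ext_off]
    return bytes(out)

# Data written at offset 2; the hole before it and the tail read as zeros:
assert sparse_read({2: b"xy"}, 0, 5) == b"\x00\x00xy\x00"
```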

Not Supported

Operation | Reason
--------- | ------
Writable shared mmap | Returns ENOTSUP. Read-only mmap works. Use write() instead. (ADR-013)
POSIX ACLs | Unix permissions only (uid/gid/mode)

Cache Mode Selection

The cache mode determines how aggressively the client caches data on local storage. Choose the mode that matches your workload:

Mode | Best for | Behavior
---- | -------- | --------
pinned | Training (epoch reuse), inference (model weights) | Chunks retained until explicit release. Populate via staging API.
organic | Mixed workloads, interactive use | LRU eviction with usage-weighted retention. Default.
bypass | Streaming ingest, checkpoint writes, one-shot scans | No caching. All reads go directly to canonical storage.

# Training job: pin the dataset
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=pinned

# Interactive exploration
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=organic

# Checkpoint writer
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=bypass

See Client Cache & Staging for staging pre-fetch, Slurm integration, and policy configuration.

Transport Selection

The native client automatically selects the fastest available transport to reach storage nodes:

  1. libfabric/CXI (Slingshot) – if available on the fabric
  2. RDMA verbs – if InfiniBand/RoCE is available
  3. TCP+TLS – universal fallback

Transport selection is automatic and requires no configuration. The client discovers available transports during fabric discovery at startup (ADR-008).

Unmounting

fusermount -u /mnt/kiseki    # Linux
umount /mnt/kiseki           # macOS

On clean unmount, the L2 cache pool is wiped (all chunk files are zeroized and deleted). On crash, the orphaned cache pool is cleaned up by the next client process or by the kiseki-cache-scrub service.
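
The zeroize-before-delete behavior can be illustrated with a small helper (a sketch, not the client's implementation):

```python
import os
import tempfile

def wipe_file(path: str) -> None:
    """Overwrite a cached chunk file with zeros, sync, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.write(b"\x00" * size)   # zeroize plaintext before the unlink
        f.flush()
        os.fsync(f.fileno())
    os.remove(path)

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"secret chunk data")
    path = tf.name
wipe_file(path)
assert not os.path.exists(path)
```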

Python SDK

Kiseki provides Python bindings via PyO3, exposing the native client’s cache, staging, and workflow advisory APIs to Python workloads. The bindings are part of the kiseki-client crate, enabled with the python feature flag.

Building

Build and install the Python module using maturin:

pip install maturin
maturin develop --features python

This builds the native Rust code and installs the kiseki module into the active Python environment.

For a release build:

maturin build --release --features python
pip install target/wheels/kiseki-*.whl

Quick Start

import kiseki

# Create a client with organic caching (default)
client = kiseki.Client(cache_mode="organic", cache_dir="/tmp/kiseki-cache")

# Stage a dataset into the local cache
client.stage("/training/imagenet")

# ... workload reads via FUSE or native API ...

# Check cache statistics
stats = client.cache_stats()
print(stats)
# CacheStats(l1_hits=42, l2_hits=1500, misses=200, l1_bytes=134217728, l2_bytes=5368709120, wipes=0)

# Release the staged dataset
client.release("/training/imagenet")

# Clean up
client.close()

API Reference

kiseki.Client

The main entry point. Each Client instance manages its own cache pool (L1 in-memory + L2 NVMe) and advisory session.

Constructor

client = kiseki.Client(
    cache_mode="organic",           # "pinned", "organic", or "bypass"
    cache_dir="/tmp/kiseki-cache",  # L2 NVMe cache directory
    cache_l2_max=50 * 1024**3,      # L2 max bytes (default: 50 GB)
    meta_ttl_ms=5000,               # Metadata TTL in ms (default: 5000)
)

Parameter | Type | Default | Description
--------- | ---- | ------- | -----------
cache_mode | str | "organic" | Cache mode: "pinned", "organic", or "bypass"
cache_dir | str | "/tmp/kiseki-cache" | Directory for L2 NVMe cache files
cache_l2_max | int or None | None (50 GB) | Maximum L2 cache size in bytes
meta_ttl_ms | int | 5000 | Metadata cache TTL in milliseconds

stage(namespace_path: str) -> None

Pre-fetch a dataset’s chunks into the local cache with pinned retention. The dataset is identified by its namespace path (e.g., "/training/imagenet"). Staging is idempotent – re-staging an already-staged dataset is a no-op.

client.stage("/training/imagenet")
client.stage("/training/imagenet")  # no-op, already staged

For directory paths, staging recursively enumerates all files up to a depth of 10 and a maximum of 100,000 files.

stage_status() -> list[str]

Return the namespace paths of all currently staged datasets.

paths = client.stage_status()
# ["/training/imagenet", "/models/gpt-3"]

release(namespace_path: str) -> None

Release a staged dataset, unpinning its chunks and making them eligible for eviction.

client.release("/training/imagenet")

release_all() -> None

Release all staged datasets.

client.release_all()

cache_stats() -> CacheStatsView

Return current cache statistics.

stats = client.cache_stats()
print(f"L1 hits: {stats.l1_hits}")
print(f"L2 hits: {stats.l2_hits}")
print(f"Misses:  {stats.misses}")
print(f"L1 used: {stats.l1_bytes / 1024**2:.0f} MB")
print(f"L2 used: {stats.l2_bytes / 1024**3:.1f} GB")
print(f"Wipes:   {stats.wipes}")

cache_mode() -> str

Return the current cache mode as a string.

print(client.cache_mode())  # "organic"

declare_workflow() -> int

Declare a new workflow for advisory integration. Returns a workflow ID (128-bit integer) that can be used to correlate operations with the advisory channel for telemetry feedback.

wf_id = client.declare_workflow()
# ... run training epochs ...
client.end_workflow(wf_id)

end_workflow(workflow_id: int) -> None

End a previously declared workflow.

wipe() -> None

Immediately wipe the entire cache (L1 + L2). All cached plaintext is zeroized before deletion.

close() -> None

Wipe the cache and release resources. Call this when the workload is done. Equivalent to wipe().

kiseki.CacheStatsView

Read-only statistics object returned by cache_stats().

Attribute | Type | Description
--------- | ---- | -----------
l1_hits | int | Number of L1 (memory) cache hits
l2_hits | int | Number of L2 (NVMe) cache hits
misses | int | Number of cache misses (fetched from canonical)
l1_bytes | int | Current L1 memory usage in bytes
l2_bytes | int | Current L2 disk usage in bytes
wipes | int | Number of full cache wipes
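
A common derived metric is the overall hit rate, computed from these counters:

```python
def hit_rate(l1_hits: int, l2_hits: int, misses: int) -> float:
    """Fraction of reads served from L1 or L2 rather than canonical."""
    total = l1_hits + l2_hits + misses
    return (l1_hits + l2_hits) / total if total else 0.0

# Using the numbers from the Quick Start CacheStats example:
print(f"{hit_rate(42, 1500, 200):.1%}")  # 88.5%
```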

Example: Training Workflow

import kiseki

def train():
    # Pin the dataset for the duration of training
    client = kiseki.Client(cache_mode="pinned", cache_dir="/local-nvme/cache")

    # Pre-stage the dataset (ideally done in Slurm prolog)
    client.stage("/training/imagenet-22k")

    # Declare a workflow for advisory telemetry
    wf_id = client.declare_workflow()

    try:
        for epoch in range(100):
            # Dataset reads hit L2 cache after first epoch
            # ... training loop reads from /mnt/kiseki/training/imagenet-22k/ ...
            pass

        stats = client.cache_stats()
        hits = stats.l1_hits + stats.l2_hits
        print(f"Cache hit rate: {hits / (hits + stats.misses) * 100:.1f}%")
    finally:
        client.end_workflow(wf_id)
        client.release_all()
        client.close()

if __name__ == "__main__":
    train()

Example: Inference with Organic Caching

import kiseki

client = kiseki.Client(cache_mode="organic", cache_l2_max=20 * 1024**3)

# Model weights are cached on first load, then served from L2
# Prompt data is cached with LRU eviction

wf_id = client.declare_workflow()
try:
    # ... inference serving loop ...
    pass
finally:
    client.end_workflow(wf_id)
    client.close()

Example: Checkpoint Writer (No Caching)

import kiseki

# Bypass mode: checkpoint writes go straight to canonical
client = kiseki.Client(cache_mode="bypass")

# ... write checkpoints to /mnt/kiseki/checkpoints/ ...

client.close()

Environment Variable Overrides

The Python client respects the same environment variables as the FUSE mount and CLI:

Variable | Description
-------- | -----------
KISEKI_CACHE_MODE | Override cache mode
KISEKI_CACHE_DIR | Override cache directory
KISEKI_CACHE_L1_MAX | Override L1 max bytes
KISEKI_CACHE_L2_MAX | Override L2 max bytes
KISEKI_CACHE_META_TTL_MS | Override metadata TTL
KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (staging handoff)

Constructor parameters take priority over environment variables. All client-set values are clamped to the effective policy ceilings set by tenant and cluster administrators.
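
The resolution order can be sketched as follows (`effective_setting` is a hypothetical helper mirroring the stated precedence and clamping, not part of the SDK):

```python
import os

def effective_setting(env_name: str, ctor_value, default, policy_ceiling=None):
    """Resolve a cache setting: constructor > env var > default,
    then clamp to the administrator's policy ceiling."""
    if ctor_value is not None:
        value = ctor_value
    elif env_name in os.environ:
        value = int(os.environ[env_name])
    else:
        value = default
    if policy_ceiling is not None:
        value = min(value, policy_ceiling)  # client values never exceed policy
    return value

GB = 1024**3
os.environ["KISEKI_CACHE_L2_MAX"] = str(80 * GB)
# Constructor wins over the env var; both are clamped to a 60 GB ceiling:
assert effective_setting("KISEKI_CACHE_L2_MAX", 100 * GB, 50 * GB,
                         policy_ceiling=60 * GB) == 60 * GB
# No constructor value: the env var (80 GB) applies, clamped to 60 GB:
assert effective_setting("KISEKI_CACHE_L2_MAX", None, 50 * GB,
                         policy_ceiling=60 * GB) == 60 * GB
```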

Client Cache & Staging

The client-side cache (ADR-031) eliminates repeated data transfers across the storage fabric by caching decrypted plaintext chunks on compute-node local NVMe. It is a library-level module in kiseki-client, shared across all access modes: FUSE, FFI, Python, and native Rust.

Architecture

canonical (fabric) -> decrypt -> cache store (NVMe) -> serve to caller
                                     ^
                           cache hit path (no fabric, no decrypt)

Two-Tier Storage

Tier | Backing | Capacity | Purpose
---- | ------- | -------- | -------
L1 (Hot) | In-memory HashMap | 256 MB default | Sub-microsecond hits for active working set
L2 (Warm) | Local NVMe files | 50 GB default | Large capacity for datasets and model weights

Read path: L1 -> L2 (with CRC32 verification) -> canonical (decrypt + SHA-256 verify + store in L1/L2).

L2 files are organized per-process with isolated cache pools:

$KISEKI_CACHE_DIR/
  <tenant_id_hex>/
    <pool_id>/                  <- per-process pool (128-bit CSPRNG)
      chunks/
        <prefix>/
          <chunk_id_hex>        <- plaintext + CRC32 trailer
      meta/
        file_chunks.db
      staging/
        <dataset_id>.manifest
      pool.lock                 <- flock proves process is alive

Each client process creates its own pool directory. Multiple concurrent same-tenant processes on the same node have fully independent pools with no contention.

Security Model

The cache stores decrypted plaintext on local NVMe. This is acceptable because:

  • The compute node already holds decrypted data in process memory (computation requires plaintext)
  • L2 NVMe is local to the compute node, same trust domain as process memory
  • L2 is ephemeral – wiped on process exit and on long disconnect
  • All cached data is overwritten with zeros (zeroize) before deallocation or eviction
  • File permissions are 0600, owned by the process UID
  • Orphaned pools from crashes are cleaned by the kiseki-cache-scrub service

Cache Modes

Three modes are available, selected per client instance at session establishment.

Pinned Mode

For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate simulations (boundary conditions).

  • Chunks are retained against eviction until explicit release()
  • Populated via the staging API or on first access
  • Staging captures a point-in-time snapshot; canonical updates do not invalidate pinned data
  • Capacity bounded by max_cache_bytes; staging beyond capacity returns CacheCapacityExceeded

Organic Mode

Default for mixed workloads. LRU with usage-weighted retention.

  • Chunks cached on first read, evicted when capacity is reached
  • Frequently accessed chunks promoted to L1
  • L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
  • Metadata cache with configurable TTL (default 5 seconds)
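
The usage-weighted LRU above can be modeled compactly. This is a simplified sketch with hypothetical structure (bytes-based accounting on a dict; the real cache lives on NVMe and excludes pinned chunks from eviction entirely):

```python
from collections import OrderedDict

class OrganicL2:
    """Usage-weighted LRU sketch: a chunk accessed N times survives N
    eviction rounds before being dropped."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.chunks = OrderedDict()   # chunk_id -> (data, remaining_rounds)
        self.size = 0

    def get(self, chunk_id):
        data, rounds = self.chunks.pop(chunk_id)   # KeyError on miss (sketch)
        # Refresh recency and add one round of eviction resistance.
        self.chunks[chunk_id] = (data, rounds + 1)
        return data

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = (data, 1)
        self.size += len(data)
        while self.size > self.capacity:
            self._evict_round()

    def _evict_round(self):
        # Walk from least- to most-recently used; each survivor pays one round.
        for cid in list(self.chunks):
            data, rounds = self.chunks[cid]
            if rounds <= 1:
                del self.chunks[cid]
                self.size -= len(data)
                return
            self.chunks[cid] = (data, rounds - 1)
```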

Bypass Mode

For workloads that do not benefit from caching: streaming ingest, one-shot scans, checkpoint writes.

  • All reads go directly to canonical
  • No L1 or L2 storage consumed
  • Zero overhead beyond mode selection

Staging API

Client-local operation for pre-populating the cache in pinned mode. Pull-based – the client fetches from canonical.

CLI

# Stage a dataset
kiseki-client stage --dataset /training/imagenet

# Stage in daemon mode (for Slurm prolog)
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)

# Check staging status
kiseki-client stage --status

# Release a dataset
kiseki-client stage --release /training/imagenet

# Release all
kiseki-client stage --release-all

Rust API

let result = cache_manager.stage("/training/imagenet").await?;
let datasets = cache_manager.stage_status();
cache_manager.release("/training/imagenet");
cache_manager.release_all();

Python API

client.stage("/training/imagenet")
paths = client.stage_status()
client.release("/training/imagenet")
client.release_all()

C FFI

kiseki_stage(handle, "/training/imagenet", timeout_secs);
kiseki_stage_status(handle, &status);
kiseki_release(handle, "/training/imagenet");

Staging Flow

  1. Resolve namespace_path to compositions via canonical. For directory paths, recursively enumerate all files up to max_staging_depth (10) and max_staging_files (100,000).
  2. Extract full chunk list from all resolved compositions.
  3. For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention.
  4. Write a staging manifest listing all compositions and chunk IDs.
  5. Report progress (chunks staged / total, bytes, elapsed).

Staging is idempotent – re-staging an already-staged dataset is a no-op. Partial staging (interrupted) can be resumed by re-running the command.
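
The five steps above can be sketched as follows. Helper names like `canonical.resolve` and the returned shapes are assumptions for illustration; decrypt and SHA-256 verification are elided:

```python
def stage(namespace_path, canonical, l2, max_depth=10, max_files=100_000):
    """Sketch of the staging flow."""
    # 1. Resolve the path to compositions, enumerating directories recursively.
    compositions = canonical.resolve(namespace_path, max_depth=max_depth,
                                     max_files=max_files)
    # 2. Extract the full chunk list from all resolved compositions.
    chunk_ids = [cid for comp in compositions for cid in comp["chunks"]]
    # 3. Fetch anything not already in L2 (idempotent: re-staging is a no-op).
    staged = 0
    for cid in chunk_ids:
        if cid not in l2:
            l2[cid] = canonical.fetch_and_decrypt(cid)
            staged += 1
    # 4. Manifest of everything pinned for this dataset.
    manifest = {"path": namespace_path, "chunks": sorted(set(chunk_ids))}
    # 5. Progress report.
    return {"staged": staged, "total": len(chunk_ids), "manifest": manifest}
```

Because step 3 skips chunks already present, an interrupted run resumes naturally on re-execution.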

Slurm Integration

Staging Handoff

The staging CLI creates a cache pool and holds its pool.lock flock. The workload process adopts the pool instead of creating a new one:

  1. Prolog: staging CLI fetches chunks in daemon mode, outputs pool_id.
  2. Workload: sets KISEKI_CACHE_POOL_ID=<pool_id>, starts, adopts the existing pool, takes over the flock.
  3. Staging daemon: detects flock loss, exits cleanly.

Prolog Script

#!/bin/bash
# prolog.sh -- run before the job starts

POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE

Epilog Script

#!/bin/bash
# epilog.sh -- run after the job completes

kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID

Lattice Integration

Lattice injects KISEKI_CACHE_POOL_ID into the workload environment after parallel staging completes across the node set. It queries stage --status to verify readiness before launching the workload.

Policy Hierarchy

Cache policy follows the same distribution mechanism as quotas, using the existing TenantConfig structure.

cluster default -> org override -> project override -> workload override
                                                         -> session selection

Each level narrows (never broadens) the parent’s settings.

Policy Attributes

| Attribute | Type | Admin levels | Client selectable | Default |
| --- | --- | --- | --- | --- |
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache FS |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |

Policy Resolution

At session establishment, the client resolves its effective policy through multiple paths:

  1. Primary: GetCachePolicy RPC on the data-path gRPC channel to any storage node. No gateway or control plane access required.
  2. Secondary: gateway’s locally-cached TenantConfig.
  3. Stale tolerance: last-known policy persisted in the L2 pool directory (policy.json).
  4. Fallback: conservative defaults (organic mode, 10 GB max, 5s TTL).

Policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment.
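
The resolution order amounts to an ordered fallback. In this sketch the callables stand in for the GetCachePolicy RPC, the gateway's cached TenantConfig, and the persisted policy.json; the policy shape is illustrative:

```python
def resolve_policy(sources):
    """Try each policy source in order; fall back to conservative defaults."""
    for fetch in sources:
        try:
            policy = fetch()
            if policy is not None:
                return policy
        except ConnectionError:
            continue          # source unreachable, try the next one
    # 4. Conservative defaults: organic mode, 10 GB max, 5s TTL.
    return {"mode": "organic", "max_cache_bytes": 10 * 1024**3,
            "metadata_ttl_ms": 5000}
```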

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| KISEKI_CACHE_MODE | Cache mode | organic |
| KISEKI_CACHE_DIR | L2 cache directory | /tmp/kiseki-cache |
| KISEKI_CACHE_L1_MAX | L1 memory max bytes | 256 MB |
| KISEKI_CACHE_L2_MAX | L2 NVMe max bytes | 50 GB |
| KISEKI_CACHE_META_TTL_MS | Metadata TTL (ms) | 5000 |
| KISEKI_CACHE_POOL_ID | Adopt existing pool | (none) |

Mount Options (FUSE)

kiseki-client-fuse mount /mnt/kiseki \
    -o cache_mode=pinned \
    -o cache_dir=/local-nvme/kiseki \
    -o cache_l2_max=100G

API (Rust)

let config = CacheConfig {
    mode: CacheMode::Pinned,
    cache_dir: PathBuf::from("/local-nvme/kiseki"),
    max_cache_bytes: 100 * 1024 * 1024 * 1024,
    metadata_ttl: Duration::from_secs(5),
    ..CacheConfig::default()
};

API (Python)

client = kiseki.Client(
    cache_mode="pinned",
    cache_dir="/local-nvme/kiseki",
    cache_l2_max=100 * 1024**3,
)

Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings.
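
The precedence-and-clamp rule reduces to a one-liner per attribute (a sketch; real resolution applies this to each tunable independently):

```python
def effective_setting(api_value, env_value, policy_default, policy_ceiling):
    """Priority: API/mount option > environment variable > policy default,
    with the chosen value clamped to the policy ceiling."""
    chosen = api_value if api_value is not None else (
        env_value if env_value is not None else policy_default)
    return min(chosen, policy_ceiling)
```

For example, an API request for 200 GB of L2 under a 100 GB policy ceiling resolves to 100 GB.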

Cache Invalidation

Metadata

TTL-based only. No push invalidation from canonical. The metadata TTL (default 5 seconds) is the sole freshness mechanism and the upper bound on read staleness.

Write-through: when the client writes a file, the local metadata cache is updated immediately, providing read-your-writes consistency within a single process.
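
A minimal model of the TTL-plus-write-through behavior (shapes are hypothetical; the real cache keys file-to-chunk-list mappings by path):

```python
import time

class MetadataCache:
    """TTL-only metadata cache with write-through."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}            # path -> (chunk_list, fetched_at)

    def get(self, path, canonical):
        entry = self.entries.get(path)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]          # fresh within the TTL window
        chunks = canonical.lookup(path)   # expired or missing: refetch
        self.entries[path] = (chunks, self.clock())
        return chunks

    def write_through(self, path, chunks):
        # Local write: update immediately for in-process read-your-writes.
        self.entries[path] = (chunks, self.clock())
```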

Crypto-Shred

When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths:

  1. Periodic key health check (default every 30 seconds) – primary.
  2. Advisory channel notification – fast path, best-effort.
  3. KMS error on next operation – tertiary.

Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30 seconds by default.

Disconnect

If the client cannot reach any canonical endpoint for max_disconnect_seconds (default 300 seconds), the entire cache is wiped. Background heartbeat RPCs (every 60 seconds) maintain the disconnect timer.

Capacity Management

| Limit | Scope | Default | Enforcement |
| --- | --- | --- | --- |
| max_memory_bytes (L1) | Per-process | 256 MB | Strict LRU eviction |
| max_cache_bytes (L2) | Per-process | 50 GB | LRU (organic), reject (pinned) |
| max_node_cache_bytes | Per-node | 80% of cache FS | Cooperative check before L2 insert |
| Disk pressure backstop | Per-node | 90% utilization | Hard backstop |

Pinned chunks are never evicted by organic LRU. Organic eviction considers only non-pinned chunks.

Crash Recovery

  • On process start: the client scans for orphaned cache pools (those whose pool.lock has no live flock holder), zeroizes their contents, and deletes them.
  • kiseki-cache-scrub service: a systemd one-shot (or cron job) that runs on node boot and every 60 seconds, covering the case where no subsequent Kiseki process starts on the node after a crash.
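
Orphan detection hinges on the pool.lock flock: a pool whose lock can be acquired has no live owner. The scrub pass can be sketched as follows (error handling is simplified, and the real scrubber zeroizes chunk contents before unlinking):

```python
import errno
import fcntl
import os
import shutil

def scrub_orphaned_pools(cache_dir: str) -> list:
    """Delete cache pools whose pool.lock has no live flock holder."""
    removed = []
    for tenant in os.scandir(cache_dir):
        for pool in os.scandir(tenant.path):
            lock_path = os.path.join(pool.path, "pool.lock")
            try:
                fd = os.open(lock_path, os.O_RDWR)
            except FileNotFoundError:
                continue
            try:
                # Non-blocking flock: success means no live holder -> orphan.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as e:
                if e.errno in (errno.EAGAIN, errno.EACCES):
                    continue         # a live process holds the lock
                raise
            finally:
                os.close(fd)
            shutil.rmtree(pool.path)   # real scrubber zeroizes first
            removed.append(pool.path)
    return removed
```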

Deployment

This guide covers deploying Kiseki in development, multi-node cluster, and bare-metal production environments.


Docker Compose (development)

The single-node development stack includes Kiseki plus supporting services for tracing, KMS, and identity.

Services

| Service | Image | Ports | Purpose |
| --- | --- | --- | --- |
| kiseki-server | Dockerfile.server (local build) | 2049, 9000, 9090, 9100, 9101 | Storage node |
| jaeger | jaegertracing/all-in-one:latest | 4317, 16686 | Distributed tracing (OTLP) |
| vault | hashicorp/vault:1.19 | 8200 | Tenant KMS backend (Transit engine) |
| keycloak | quay.io/keycloak/keycloak:26.0 | 8080 | OIDC identity provider |

Starting the stack

# Build and start all services
docker compose up --build

# Run in background for e2e tests
docker compose up --build -d && pytest tests/e2e/

Port map (single-node)

| Port | Protocol | Service |
| --- | --- | --- |
| 2049 | TCP | NFS (v3 + v4.2) |
| 9000 | HTTP | S3 gateway |
| 9090 | HTTP | Prometheus metrics + admin dashboard |
| 9100 | gRPC | Data-path (log, chunk, composition, view) |
| 9101 | gRPC | Workflow advisory |
| 4317 | gRPC | Jaeger OTLP receiver |
| 16686 | HTTP | Jaeger UI |
| 8200 | HTTP | Vault API |
| 8080 | HTTP | Keycloak admin console |

Environment (dev defaults)

The development compose file sets these environment variables on the kiseki-server container:

KISEKI_DATA_ADDR: "0.0.0.0:9100"
KISEKI_ADVISORY_ADDR: "0.0.0.0:9101"
KISEKI_S3_ADDR: "0.0.0.0:9000"
KISEKI_NFS_ADDR: "0.0.0.0:2049"
KISEKI_METRICS_ADDR: "0.0.0.0:9090"
KISEKI_DATA_DIR: "/data"
KISEKI_BOOTSTRAP: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4317"
OTEL_SERVICE_NAME: "kiseki-server"

The KISEKI_BOOTSTRAP=true flag tells the node to create an initial shard on first start, enabling immediate use without manual cluster initialization.

Vault (dev mode)

Vault runs in dev mode with the root token kiseki-e2e-token. This is suitable only for development and testing. The Transit secrets engine is used by Kiseki as a tenant KMS backend (ADR-028 Provider 2).

# Verify Vault is ready
curl http://localhost:8200/v1/sys/health

Keycloak (dev mode)

Keycloak runs with start-dev and default admin credentials (admin/admin). Configure OIDC realms for tenant identity provider integration.


Docker Compose (3-node cluster)

The multi-node compose file (docker-compose.3node.yml) deploys a 3-node Raft cluster for testing consensus, replication, and failover.

Starting

docker compose -f docker-compose.3node.yml up --build -d

# Run multi-node tests
KISEKI_E2E_COMPOSE=docker-compose.3node.yml pytest tests/e2e/test_multi_node.py

Node configuration

All three nodes share the same Raft peer list and each has a unique KISEKI_NODE_ID:

| Node | Node ID | Data gRPC | Advisory gRPC | S3 | Raft |
| --- | --- | --- | --- | --- | --- |
| kiseki-node1 | 1 | localhost:9100 | localhost:9101 | localhost:9000 | 9300 |
| kiseki-node2 | 2 | localhost:9110 | localhost:9111 | localhost:9010 | 9300 |
| kiseki-node3 | 3 | localhost:9120 | localhost:9121 | localhost:9020 | 9300 |

The Raft peer list is configured identically on all nodes:

KISEKI_RAFT_PEERS=1=kiseki-node1:9300,2=kiseki-node2:9300,3=kiseki-node3:9300

Node 1 is the bootstrap node. Each node has an independent data volume (node1-data, node2-data, node3-data).

Verifying cluster health

# Check all nodes are healthy. The per-node metrics host ports are assumed
# to be 9090/9091/9092 here; adjust to match the port mappings in
# docker-compose.3node.yml.
for port in 9090 9091 9092; do
  curl -sf "http://localhost:${port}/health" && echo " :${port} OK"
done

# View cluster status via the admin dashboard
open http://localhost:9090/ui

Bare metal deployment

Build from source

Prerequisites: Rust stable toolchain, protobuf compiler (protoc), OpenSSL development headers, pkg-config.

# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki

# Release build (all binaries)
cargo build --release

# Binaries produced:
# target/release/kiseki-server       — storage node
# target/release/kiseki-keyserver    — system key manager (HA)
# target/release/kiseki-client-fuse  — FUSE client for compute nodes
# target/release/kiseki-control      — control plane

Optional feature flags:

# Enable CXI/Slingshot transport (requires libfabric)
cargo build --release --features kiseki-transport/cxi

# Enable RDMA verbs transport
cargo build --release --features kiseki-transport/verbs

# Enable tenant opt-in compression
cargo build --release --features kiseki-chunk/compression

Disk layout

Each storage node should follow the recommended disk layout:

Server node:
  System partition (RAID-1 on 2x SSD):
    /var/lib/kiseki/raft/log.redb       Raft log entries
    /var/lib/kiseki/keys/epochs.redb    Key epoch metadata
    /var/lib/kiseki/chunks/meta.redb    Chunk extent index
    /var/lib/kiseki/small/objects.redb   Small-file inline content
    /var/lib/kiseki/config/             Node config, TLS certs

  Data devices (JBOD, managed by Kiseki):
    /dev/nvme0n1 -> pool "fast-nvme"
    /dev/nvme1n1 -> pool "fast-nvme"
    /dev/sda     -> pool "bulk-ssd"
    /dev/sdb     -> pool "cold-hdd"

JBOD for data devices, RAID-1 for the system partition. Kiseki manages data durability via EC/replication across JBOD members. The system partition uses RAID-1 because the redb stores and the Raft log must survive a single disk failure without relying on Kiseki’s own repair mechanism.

systemd unit: kiseki-server

[Unit]
Description=Kiseki Storage Node
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=simple
User=kiseki
Group=kiseki
ExecStart=/usr/local/bin/kiseki-server
Restart=on-failure
RestartSec=5

# Environment
Environment=KISEKI_DATA_ADDR=0.0.0.0:9100
Environment=KISEKI_ADVISORY_ADDR=0.0.0.0:9101
Environment=KISEKI_S3_ADDR=0.0.0.0:9000
Environment=KISEKI_NFS_ADDR=0.0.0.0:2049
Environment=KISEKI_METRICS_ADDR=0.0.0.0:9090
Environment=KISEKI_DATA_DIR=/var/lib/kiseki
Environment=KISEKI_NODE_ID=1
Environment=KISEKI_RAFT_PEERS=1=node1.example.com:9300,2=node2.example.com:9300,3=node3.example.com:9300
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9300

# TLS
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/server.key

# Observability
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.internal:4317
Environment=OTEL_SERVICE_NAME=kiseki-server
Environment=RUST_LOG=kiseki=info

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0

[Install]
WantedBy=multi-user.target

systemd unit: kiseki-keyserver

[Unit]
Description=Kiseki System Key Manager
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=kiseki-keys
Group=kiseki-keys
ExecStart=/usr/local/bin/kiseki-keyserver
Restart=on-failure
RestartSec=5

Environment=KISEKI_DATA_DIR=/var/lib/kiseki-keys
Environment=KISEKI_RAFT_PEERS=1=keysrv1:9400,2=keysrv2:9400,3=keysrv3:9400
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9400
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/keyserver.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/keyserver.key

NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki-keys
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0

[Install]
WantedBy=multi-user.target

systemd unit: kiseki-client-fuse

[Unit]
Description=Kiseki FUSE Client
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/kiseki-client-fuse --mountpoint /mnt/kiseki
ExecStop=/bin/fusermount -u /mnt/kiseki
Restart=on-failure
RestartSec=5

Environment=KISEKI_DATA_ADDR=node1.example.com:9100,node2.example.com:9100,node3.example.com:9100
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/client.key
Environment=KISEKI_CACHE_MODE=organic
Environment=KISEKI_CACHE_DIR=/var/cache/kiseki
Environment=KISEKI_CACHE_L1_MAX=1073741824
Environment=KISEKI_CACHE_L2_MAX=107374182400

[Install]
WantedBy=multi-user.target

Configuration checklist

Before starting a production cluster, verify the following:

TLS certificates

  • Cluster CA certificate generated and distributed to all nodes
  • Per-node server certificate signed by Cluster CA
  • Per-tenant client certificates signed by Cluster CA
  • Key manager server certificate signed by Cluster CA
  • CRL distribution point configured (if using CRL-based revocation)
  • Certificate SANs include all node hostnames and IP addresses
  • All certificates use ECDSA P-256 or RSA 2048+ keys

Data directories

  • KISEKI_DATA_DIR exists and is owned by the kiseki user
  • System partition has sufficient capacity for metadata (see Capacity Planning)
  • Data devices formatted and accessible (raw block or file-backed)
  • Separate RAID-1 for system partition

Bootstrap

  • Exactly one node has KISEKI_BOOTSTRAP=true on first start
  • After initial bootstrap, set KISEKI_BOOTSTRAP=false on the bootstrap node (or remove the variable)
  • KISEKI_RAFT_PEERS is identical on all nodes
  • KISEKI_NODE_ID is unique per node
  • System key manager cluster is started before storage nodes

Network

  • Data-fabric ports (9100, 9101) reachable between all nodes
  • Raft port (9300) reachable between all nodes
  • Metrics port (9090) accessible to monitoring infrastructure
  • NFS port (2049) accessible to clients
  • S3 port (9000) accessible to clients
  • Management network separated from data fabric (recommended)

Observability

  • Jaeger or OTLP-compatible collector endpoint configured
  • Prometheus scrape target added for each node’s :9090/metrics
  • RUST_LOG level set appropriately (production: kiseki=info)

Health verification

After deployment, verify the cluster is healthy:

HTTP health endpoint

# Returns "OK" when the node is ready
curl http://node1:9090/health

Prometheus metrics

# Verify metrics are being exported
curl -s http://node1:9090/metrics | head -20

Admin dashboard

Open http://node1:9090/ui in a browser. The dashboard shows:

  • Cluster health (nodes healthy / total)
  • Raft entries applied
  • Gateway requests served
  • Data written and read
  • Active transport connections

Any node in the cluster serves the full cluster-wide view by scraping metrics from its peers.

Raft consensus

Verify that the Raft cluster has elected a leader:

# Check the cluster status via the admin API
curl -s http://node1:9090/ui/api/cluster | jq .

S3 connectivity

# Test S3 access (if a tenant namespace is configured)
aws --endpoint-url http://node1:9000 s3 ls

NFS connectivity

# Test NFS mount
mount -t nfs node1:/ /mnt/kiseki -o vers=4.2

FUSE client

# Mount via FUSE (on a compute node)
kiseki-client-fuse --mountpoint /mnt/kiseki
ls /mnt/kiseki

Configuration Reference

Kiseki is configured entirely through environment variables. There are no configuration files to manage. Every tunable parameter has a sensible default. Variables are grouped by function below.


Network addresses

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_DATA_ADDR | 0.0.0.0:9100 | Listen address for data-path gRPC (log, chunk, composition, view, discovery). |
| KISEKI_ADVISORY_ADDR | 0.0.0.0:9101 | Listen address for the Workflow Advisory gRPC service. Runs on a dedicated tokio runtime, isolated from the data path (ADR-021). |
| KISEKI_S3_ADDR | 0.0.0.0:9000 | Listen address for the S3 HTTP gateway. |
| KISEKI_NFS_ADDR | 0.0.0.0:2049 | Listen address for the NFS gateway (v3 + v4.2). |
| KISEKI_METRICS_ADDR | 0.0.0.0:9090 | Listen address for Prometheus metrics (/metrics), health endpoint (/health), and admin dashboard (/ui). |
| KISEKI_RAFT_ADDR | 0.0.0.0:9300 | Listen address for Raft consensus traffic between nodes. |

All addresses accept the host:port format. Use 0.0.0.0 to bind to all interfaces or a specific IP to restrict to one network.


Cluster membership

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_NODE_ID | (required) | Unique integer identifier for this node within the cluster. Must be stable across restarts. |
| KISEKI_RAFT_PEERS | (required) | Comma-separated list of id=host:port pairs for all Raft voters. Example: 1=node1:9300,2=node2:9300,3=node3:9300. Must be identical on every node. |
| KISEKI_BOOTSTRAP | false | When true, the node creates an initial shard on first start. Set to true on exactly one node during initial cluster formation, then set back to false. |

Storage

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_DATA_DIR | /var/lib/kiseki | Root directory for all persistent state. Contains Raft log (raft/log.redb), key epochs (keys/epochs.redb), chunk metadata (chunks/meta.redb), and inline small-file content (small/objects.redb). Must reside on a low-latency device (NVMe or SSD strongly recommended; HDD triggers a boot warning). |

Data directory layout

KISEKI_DATA_DIR/
  raft/log.redb            Raft log entries (bounded by snapshot policy)
  keys/epochs.redb         Key epoch metadata (<10 MB)
  chunks/meta.redb         Chunk extent index (scales with file count)
  small/objects.redb        Small-file encrypted content (capacity-managed)

TLS / mTLS

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_CA_PATH | (none) | Path to the Cluster CA certificate (PEM). Required for production. When set, all gRPC connections require mTLS. |
| KISEKI_CERT_PATH | (none) | Path to this node’s TLS certificate (PEM), signed by the Cluster CA. |
| KISEKI_KEY_PATH | (none) | Path to this node’s TLS private key (PEM). Never logged, printed, or transmitted. |
| KISEKI_CRL_PATH | (none) | Path to a CRL file (PEM) for certificate revocation. Reloaded periodically. Optional; if not set, CRL checking is disabled. |

When KISEKI_CA_PATH is not set, the server runs without TLS. This is acceptable for development but must not be used in production.


Client-side cache (ADR-031)

These variables configure the native client cache on compute nodes running kiseki-client-fuse.

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_CACHE_MODE | organic | Cache operating mode. One of: pinned (staging-driven, eviction-resistant), organic (LRU with usage-weighted retention), bypass (no caching). Mode is per session, not per file. |
| KISEKI_CACHE_DIR | /tmp/kiseki-cache | Directory for L2 cache pools on local NVMe. Each client process creates an isolated pool with a unique pool_id. |
| KISEKI_CACHE_L1_MAX | 268435456 (256 MB) | Maximum bytes for the in-memory L1 cache (decrypted plaintext chunks). Bounded by process memory. |
| KISEKI_CACHE_L2_MAX | 53687091200 (50 GB) | Maximum bytes for the on-disk L2 cache on local NVMe. Per-process, per-tenant isolation via pool directories. |
| KISEKI_CACHE_META_TTL_MS | 5000 (5 seconds) | Metadata TTL in milliseconds. File-to-chunk-list mappings are served from cache within this window. After expiry, mappings are re-fetched from canonical. This is the sole freshness window: chunk data itself has no TTL because chunks are immutable (I-C1). |
| KISEKI_CACHE_POOL_ID | (none) | Adopt an existing L2 cache pool instead of creating a new one. Used for staging handoff from a Slurm prolog daemon to a workload process. |

Cache behavior notes

  • Pinned mode: Pre-staged datasets remain in cache until explicitly released. Best for training workloads that re-read the same data across epochs.
  • Organic mode: LRU eviction with usage-weighted retention. Default for mixed workloads.
  • Bypass mode: No caching at all. Best for checkpoint/restart and streaming workloads.
  • On process restart, the client creates a new L2 pool (wiping orphaned pools). A kiseki-cache-scrub service cleans orphans on node boot.
  • Disconnects longer than 300 seconds (configurable) wipe the entire cache.
  • Crypto-shred events wipe all cached plaintext for the affected tenant within the key health check interval (default 30 seconds).

Metadata capacity (ADR-030)

These variables control the dynamic inline threshold for small-file placement.

| Variable | Default | Description |
| --- | --- | --- |
| KISEKI_META_SOFT_LIMIT_PCT | 50 | Normal operating ceiling for system disk metadata usage, as a percentage of system partition capacity. Exceeding this triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | 75 | Absolute maximum for system disk metadata usage. Exceeding this forces the inline threshold to the floor (128 bytes) and emits an alert via out-of-band gRPC (not Raft). |

The inline threshold determines whether a file’s encrypted content is stored in small/objects.redb (metadata tier, NVMe) or as a chunk extent on a raw block device (data tier). The threshold is computed per-shard as the minimum affordable threshold across all Raft voters, clamped between 128 bytes (floor) and 64 KB (ceiling).
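
The threshold computation is small enough to state directly as a sketch of the rule above:

```python
FLOOR = 128          # bytes
CEILING = 64 * 1024  # bytes

def inline_threshold(affordable_per_voter):
    """Per-shard inline threshold: the minimum affordable threshold across
    all Raft voters, clamped to [128 B, 64 KB]."""
    return max(FLOOR, min(min(affordable_per_voter), CEILING))
```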


Observability

| Variable | Default | Description |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry OTLP gRPC endpoint for distributed traces. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name reported in traces. Set to kiseki-keyserver or kiseki-client for other binaries. |
| RUST_LOG | info | Logging filter directive for the tracing crate. Supports per-module granularity. Examples: kiseki=debug, kiseki_raft=trace,kiseki=info, warn. |
| KISEKI_LOG_FORMAT | text | Log output format. text for human-readable, json for structured JSON (one line per event). Use json in production for log aggregation. |

Tuning parameters (runtime)

The following parameters are set at runtime via the StorageAdminService gRPC API (SetTuningParams / GetTuningParams), not via environment variables. They are listed here for reference.

Cluster-wide tuning

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| compaction_rate_mb_s | 100 | 10-1000 | Background compaction throughput cap (MB/s). |
| gc_interval_s | 300 | 60-3600 | Interval between GC scans for reclaimable chunks. |
| rebalance_rate_mb_s | 50 | 0-500 | Background rebalance/evacuation throughput (MB/s). |
| scrub_interval_h | 168 (7 days) | 24-720 | Interval between integrity scrub runs. |
| max_concurrent_repairs | 4 | 1-32 | Maximum parallel EC repair jobs. |
| stream_proc_poll_ms | 100 | 10-1000 | View materialization polling interval (ms). |
| inline_threshold_bytes | 4096 | 512-65536 | Default inline threshold for new shards. |
| raft_snapshot_interval | 10000 | 1000-100000 | Entries between Raft snapshots. |

Per-pool tuning

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| ec_data_chunks | 4 (NVMe) / 8 (HDD) | 2-16 | EC data fragment count. Immutable per pool after creation (I-C6). |
| ec_parity_chunks | 2 (NVMe) / 3 (HDD) | 1-8 | EC parity fragment count. Immutable per pool after creation. |
| replication_count | 3 | 2-5 | For replication pools (non-EC). |
| warning_threshold_pct | Per device class | 50-95 | Pool capacity warning level. |
| critical_threshold_pct | Per device class | 60-98 | Pool capacity critical level. Writes rejected. |
| readonly_threshold_pct | Per device class | 70-99 | Read-only level. In-flight writes drain. |
| target_fill_pct | 70 (SSD) / 80 (HDD) | 50-90 | Rebalance target fill level. |

Default capacity thresholds by device class:

| State | NVMe/SSD | HDD |
| --- | --- | --- |
| Healthy | 0-75% | 0-85% |
| Warning | 75-85% | 85-92% |
| Critical | 85-92% | 92-97% |
| ReadOnly | 92-97% | 97-99% |
| Full | 97-100% | 99-100% |

All tuning parameter changes via SetTuningParams are recorded in the cluster audit shard with parameter name, old value, new value, timestamp, and admin identity (I-A6).


Environment variable summary

Quick reference of all environment variables:

# Network
KISEKI_DATA_ADDR=0.0.0.0:9100
KISEKI_ADVISORY_ADDR=0.0.0.0:9101
KISEKI_S3_ADDR=0.0.0.0:9000
KISEKI_NFS_ADDR=0.0.0.0:2049
KISEKI_METRICS_ADDR=0.0.0.0:9090
KISEKI_RAFT_ADDR=0.0.0.0:9300

# Cluster
KISEKI_NODE_ID=1
KISEKI_RAFT_PEERS=1=node1:9300,2=node2:9300,3=node3:9300
KISEKI_BOOTSTRAP=false

# Storage
KISEKI_DATA_DIR=/var/lib/kiseki

# TLS
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem

# Cache (client only)
KISEKI_CACHE_MODE=organic
KISEKI_CACHE_DIR=/var/cache/kiseki
KISEKI_CACHE_L1_MAX=1073741824
KISEKI_CACHE_L2_MAX=107374182400
KISEKI_CACHE_META_TTL_MS=5000

# Metadata capacity
KISEKI_META_SOFT_LIMIT_PCT=50
KISEKI_META_HARD_LIMIT_PCT=75

# Observability
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=kiseki-server
RUST_LOG=kiseki=info
KISEKI_LOG_FORMAT=json

Cluster Management

This guide covers day-to-day cluster operations: adding and removing nodes, managing shards and pools, maintenance mode, and schema migration.


Node management

Kiseki uses Raft consensus groups for metadata and log replication. Adding or removing nodes is done through Raft membership changes, which are zero-downtime and zero-data-loss operations.

Adding a node

  1. Deploy kiseki-server on the new host with a unique KISEKI_NODE_ID and the full KISEKI_RAFT_PEERS list (including the new node).

  2. Start the service. The node registers with the cluster and begins receiving Raft log entries as a learner.

  3. Promote the node to a voter once it has caught up:

    kiseki-server node add --node-id 4
    
  4. The node receives shard assignments and begins participating in Raft elections and commit quorums.

Catch-up requirement (I-SF3): A learner must fully catch up with the leader’s committed index before being promoted to voter. The old voter remains in membership until the new voter is promoted.

Removing a node

  1. Drain the node to migrate its shard assignments to other nodes:

    kiseki-server node drain --node-id 4
    
  2. Wait for all shards to be migrated. The drain operation uses Raft membership changes (add learner on target, promote, demote source) for each shard hosted on the node.

  3. Once drained, remove the node from the cluster:

    kiseki-server node remove --node-id 4
    
  4. Stop the kiseki-server process and decommission the hardware.

Safety: Removing a node without draining first triggers automatic shard repair, but this is reactive rather than proactive. Always drain first for orderly removal.

Cluster sizing

  • Minimum: 3 nodes (Raft requires a majority quorum; 2-of-3 for writes).
  • Recommended: 5+ nodes for production. Tolerates 2 simultaneous node failures.
  • Key manager: Deploy on a dedicated 3-5 node HA cluster, separate from storage nodes. The system key manager must be at least as available as the log (I-K12).

Shard management

Shards are the smallest unit of totally-ordered deltas, backed by one Raft group. They split automatically when size or throughput thresholds are exceeded (I-L6).

Viewing shard status

# List all shards
kiseki-server shard list

# Get details for a specific shard
kiseki-server shard info --shard-id shard-0001

# Check shard health
kiseki-server shard health --shard-id shard-0001

Automatic shard split

Shards have hard ceilings that trigger a mandatory split (I-L6). The ceilings are configurable across three dimensions:

  • Delta count: Maximum number of deltas in a shard.
  • Byte size: Maximum total size of shard data.
  • Write throughput: Maximum sustained write rate.

Any dimension exceeding its ceiling forces a split. The split operation:

  1. Selects a split boundary (key range partition).
  2. Creates a new shard for the upper range.
  3. Continues accepting writes during the split (I-O1).
  4. Notifies the control plane, views, and clients of the new shard topology.
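
The trigger condition reduces to a disjunction over the three dimensions (field names here are illustrative, not the actual schema):

```python
def needs_split(shard: dict, ceilings: dict) -> bool:
    """A shard must split when ANY dimension exceeds its ceiling (I-L6)."""
    return (shard["delta_count"] > ceilings["delta_count"]
            or shard["bytes"] > ceilings["bytes"]
            or shard["write_rate_mb_s"] > ceilings["write_rate_mb_s"])
```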

Manual shard split

kiseki-server shard split --shard-id shard-0001 --boundary "..."

Shard maintenance mode

Set a shard to read-only for maintenance operations:

# Enable maintenance mode (writes rejected with retriable error)
kiseki-server shard maintenance --shard-id shard-0001 --enabled

During maintenance mode (I-O6):

  • Write commands are rejected with a retriable error.
  • Read operations continue normally.
  • In-progress compaction and GC continue but no new triggers fire from write pressure.
  • Shard splits do not initiate.

Cross-shard operations

Cross-shard rename returns EXDEV (I-L8). Shards are independent consensus domains with no two-phase commit. Applications must handle cross-shard moves via copy + delete.


Pool management

Affinity pools are groups of storage devices sharing a device class. Pools are the unit of capacity management and durability policy.

Viewing pools

# List all pools
kiseki-server pool list

# Get pool details including capacity and health
kiseki-server pool status --pool-id fast-nvme

Creating a pool

kiseki-server pool create --pool-id fast-nvme --device-class NvmeU2 \
  --ec-data 4 --ec-parity 2

Important: EC parameters (ec_data_chunks, ec_parity_chunks) are immutable per pool after creation (I-C6). Changing them requires creating a new pool and migrating data via ReencodePool.

Setting pool durability

# Switch pool durability strategy (applies to new chunks only)
kiseki-server pool set-durability --pool-id fast-nvme \
  --ec-data 4 --ec-parity 2

Existing chunks retain their original EC config. Re-encoding requires an explicit ReencodePool RPC.

Rebalancing a pool

Rebalance distributes data evenly across devices in a pool:

# Start rebalance
kiseki-server pool rebalance --pool-id fast-nvme

# Cancel a running rebalance
kiseki-server pool cancel-rebalance --pool-id fast-nvme

Rebalance runs at the configured rebalance_rate_mb_s (default 50 MB/s) to limit impact on production traffic.

Device evacuation

When a device shows signs of failure (SMART wear > 90% for SSD, > 100 bad sectors for HDD), automatic evacuation is triggered (I-D3). Evacuation can also be initiated manually:

# Start evacuation
kiseki-server device evacuate --device-id nvme-0001

# Cancel evacuation
kiseki-server device cancel-evacuation --device-id nvme-0001

Evacuation migrates all chunks from the device to other devices in the same pool. Device removal (RemoveDevice) is rejected unless the device state is Removed (post-evacuation) (I-D5).

Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed. All transitions are recorded in the audit log (I-D2).

Pool capacity thresholds

Pool writes are rejected when the pool reaches the Critical threshold (I-C5). Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels:

| State    | NVMe/SSD | HDD     | Behavior                              |
|----------|----------|---------|---------------------------------------|
| Healthy  | 0-75%    | 0-85%   | Normal writes                         |
| Warning  | 75-85%   | 85-92%  | Log warning, emit telemetry           |
| Critical | 85-92%   | 92-97%  | Reject new placements                 |
| ReadOnly | 92-97%   | 97-99%  | In-flight writes drain, no new writes |
| Full     | 97-100%  | 99-100% | ENOSPC to clients                     |

Pool redirection stays within the same device class only. ENOSPC is returned when the pool is Full.
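The threshold table maps directly to a fill-fraction classifier. A sketch (the exact boundary handling at the edges of each band is an assumption of this example):

```python
# Upper bounds (exclusive) per device class, from the threshold table above.
THRESHOLDS = {
    "nvme": [(0.75, "Healthy"), (0.85, "Warning"), (0.92, "Critical"),
             (0.97, "ReadOnly")],
    "hdd":  [(0.85, "Healthy"), (0.92, "Warning"), (0.97, "Critical"),
             (0.99, "ReadOnly")],
}

def pool_state(device_class, used_bytes, total_bytes):
    """Classify a pool's capacity state from its fill fraction."""
    fill = used_bytes / total_bytes
    for upper, state in THRESHOLDS[device_class]:
        if fill < upper:
            return state
    return "Full"

pool_state("nvme", 80, 100)  # "Warning"
```

Note how the same fill level lands in different states per device class: 93% is Critical on HDD but ReadOnly on NVMe, reflecting SSD GC pressure at high fill.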


Maintenance mode

Cluster-wide or per-shard maintenance mode sets the cluster (or specific shards) to read-only (I-O6).

Enabling cluster-wide maintenance

# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/maintenance \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true}'

# Via the kiseki-server CLI
kiseki-server maintenance on

Maintenance mode behavior

  • All write commands are rejected with a retriable error code (MaintenanceMode). Clients can retry after maintenance ends.
  • Read operations continue normally.
  • In-progress compaction and GC complete their current run.
  • New shard splits, compaction triggers, and GC triggers from write pressure are suppressed.
  • Maintenance mode is the prerequisite for:
    • Schema migration on upgrade
    • Inline threshold increase (optional migration of small chunked files back to inline)
    • Full cluster re-encryption

Disabling maintenance

curl -X POST http://node1:9090/ui/api/ops/maintenance \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'

Writes resume immediately. Clients that were retrying will succeed on their next attempt.
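Because `MaintenanceMode` is a retriable error, client code can simply back off and retry until maintenance ends. A sketch of that loop (`MaintenanceModeError` stands in for however the client library surfaces the error; the backoff parameters are illustrative):

```python
import time

class MaintenanceModeError(Exception):
    """Stand-in for the retriable MaintenanceMode error."""

def write_with_retry(do_write, attempts=10, base_delay=0.5, sleep=time.sleep):
    """Retry a write rejected during maintenance, with capped
    exponential backoff. Raises after the final failed attempt."""
    delay = base_delay
    for i in range(attempts):
        try:
            return do_write()
        except MaintenanceModeError:
            if i == attempts - 1:
                raise
            sleep(delay)                  # wait out the maintenance window
            delay = min(delay * 2, 30.0)  # cap the backoff
```

Injecting `sleep` keeps the loop testable; production code would just use the default.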


Schema migration on upgrade

Kiseki uses versioned on-disk formats. Upgrades that change the schema follow this procedure:

  1. Read the release notes for migration requirements. Not every release requires migration.

  2. Enable maintenance mode on the cluster to prevent writes during migration.

  3. Stop all nodes in the cluster.

  4. Upgrade the binaries on all nodes (kiseki-server, kiseki-keyserver, kiseki-client-fuse).

  5. Start nodes one at a time. On startup, each node detects the old schema version (via the superblock on each data device and the redb metadata version) and applies migration automatically.

  6. Verify migration by checking the admin dashboard and node logs.

  7. Disable maintenance mode to resume normal operations.

Rolling upgrades

For minor releases that do not change the on-disk format, rolling upgrades are supported:

  1. Drain a node (DrainNode).
  2. Stop the node.
  3. Upgrade the binary.
  4. Start the node.
  5. Wait for it to rejoin and catch up.
  6. Repeat for the next node.

The superblock on each data device carries a format version (ADR-029). Format version mismatches are detected at device open and handled by the migration path.

Admin Dashboard

Kiseki includes a built-in web dashboard for cluster monitoring and basic operations. The dashboard is served by every storage node on the metrics HTTP port.


Access

http://<node>:9090/ui

Any node in the cluster serves the full cluster-wide view. The dashboard scrapes metrics from peer nodes in the background and aggregates them locally. There is no dedicated dashboard server; connect to whichever node is most convenient.

The metrics HTTP server also serves:

| Path       | Purpose                                                 |
|------------|---------------------------------------------------------|
| `/health`  | Health probe (returns 200 OK). Used by load balancers.  |
| `/metrics` | Prometheus text exposition format.                      |
| `/ui`      | Admin dashboard (HTML + HTMX + Chart.js).               |
| `/ui/logo` | Kiseki logo image.                                      |

Technology

The dashboard is a single-page HTML application using:

  • HTMX for live updates via HTML fragment polling.
  • Chart.js for time-series and per-node comparison charts.
  • No build step, no JavaScript framework, no node_modules.

The dashboard HTML is embedded in the kiseki-server binary at compile time (include_str!). No external files to deploy or manage.


Overview tab

The main view shows six metric cards at the top, a time-series chart in the middle, and a node table at the bottom. All data refreshes automatically via HTMX polling.

Metric cards

| Card             | Source metric                         | Description                                                                                     |
|------------------|---------------------------------------|-------------------------------------------------------------------------------------------------|
| Cluster Health   | Node liveness                         | N/M nodes healthy with color coding: green (all healthy), yellow (degraded), red (all down).     |
| Raft Entries     | `kiseki_raft_entries_total`           | Total Raft entries applied across the cluster.                                                   |
| Gateway Requests | `kiseki_gateway_requests_total`       | Total S3 and NFS requests served.                                                                |
| Data Written     | `kiseki_chunk_write_bytes_total`      | Aggregate chunk bytes written.                                                                   |
| Data Read        | `kiseki_chunk_read_bytes_total`       | Aggregate chunk bytes read.                                                                      |
| Connections      | `kiseki_transport_connections_active` | Active transport connections.                                                                    |

Numbers are formatted with SI suffixes (K, M, B) and byte units (KB, MB, GB, TB) for readability.
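The formatting rule is simple to reproduce in scripts that consume the JSON APIs. A sketch (the exact rounding and suffix boundaries the dashboard uses are an assumption):

```python
def format_si(n):
    """Format a count with the K/M/B suffixes the dashboard uses."""
    for cutoff, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n >= cutoff:
            return f"{n / cutoff:.1f}{suffix}"
    return str(n)

def format_bytes(n):
    """Format a byte count with decimal byte units (KB through TB)."""
    for cutoff, suffix in ((1e12, "TB"), (1e9, "GB"), (1e6, "MB"), (1e3, "KB")):
        if n >= cutoff:
            return f"{n / cutoff:.1f}{suffix}"
    return f"{n}B"

format_si(1200)            # "1.2K"
format_bytes(2_500_000_000)  # "2.5GB"
```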

Time-series charts

The dashboard stores up to 3 hours of metric history (configurable) in memory. Time-series charts show:

  • Raft entries over time
  • Gateway request rate
  • Chunk write/read throughput
  • Connection count

Historical data is available via the API:

# Get 3 hours of history (default)
curl http://node1:9090/ui/api/history

# Get 1 hour of history
curl http://node1:9090/ui/api/history?hours=1

Node table

A table listing every node in the cluster with per-node metrics:

| Column   | Description                                   |
|----------|-----------------------------------------------|
| Node     | Node address (hostname:port)                  |
| Status   | Health badge: green “Healthy” or red “Unreachable” |
| Raft     | Raft entries applied by this node             |
| Requests | Gateway requests served by this node          |
| Written  | Chunk bytes written by this node              |
| Read     | Chunk bytes read by this node                 |
| Conns    | Active transport connections on this node     |

Click a node row to drill down to the node detail view.


Performance tab

The performance tab shows per-node comparison charts for identifying hotspots and imbalances:

  • Write throughput by node: Bar chart comparing chunk bytes written per node.
  • Read throughput by node: Bar chart comparing chunk bytes read per node.
  • Request count by node: Bar chart comparing gateway requests per node.

Chart data is sourced from the chart-data API:

curl http://node1:9090/ui/fragment/chart-data
# Returns: {"labels": [...], "writes": [...], "reads": [...], "requests": [...]}

Alerts tab

The alerts tab shows health status and capacity warnings. Each alert is a row with a colored dot (green, yellow, red, blue), a message, and a timestamp.

Alert types

| Dot   | Meaning       | Example                                           |
|-------|---------------|---------------------------------------------------|
| Green | All clear     | “All 3 nodes healthy”                             |
| Red   | Critical      | “Node node2:9100 unreachable”                     |
| Blue  | Informational | “Capacity monitoring active (3 nodes reporting)”  |
| Green | Activity      | “node1:9100: 1.2K gateway requests served”        |

Alerts are generated by comparing the current cluster state against expected conditions. The alert endpoint returns HTML fragments for HTMX polling:

curl http://node1:9090/ui/fragment/alerts

Operations tab

The operations tab provides buttons for common administrative actions. Each action calls a REST endpoint and records an event in the diagnostic event store.

Available operations

| Operation        | Endpoint                  | Method | Description                                                                                   |
|------------------|---------------------------|--------|-----------------------------------------------------------------------------------------------|
| Maintenance Mode | `/ui/api/ops/maintenance` | POST   | Enable or disable cluster-wide maintenance mode. Body: `{"enabled": true}` or `{"enabled": false}`. |
| Backup           | `/ui/api/ops/backup`      | POST   | Initiate a background backup.                                                                  |
| Scrub            | `/ui/api/ops/scrub`       | POST   | Initiate a background integrity scrub.                                                         |

Example:

# Enable maintenance mode
curl -X POST http://node1:9090/ui/api/ops/maintenance \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true}'

# Trigger a scrub
curl -X POST http://node1:9090/ui/api/ops/scrub

All operations return {"status": "ok", "message": "..."} on success.


Node drill-down

Click a node in the node table to see its detailed view. The drill-down shows:

  • Node-specific metric history (time-series)
  • Device health for devices attached to that node
  • Shard assignments on that node
  • Raft role (leader/follower/learner) per shard

API endpoints

All dashboard data is available via JSON APIs for scripting and integration:

| Endpoint          | Method | Description                                                     |
|-------------------|--------|-----------------------------------------------------------------|
| `/ui/api/cluster` | GET    | Cluster summary: healthy nodes, total nodes, aggregate metrics. |
| `/ui/api/nodes`   | GET    | List of all nodes with per-node metrics and health status.      |
| `/ui/api/history` | GET    | Time-series metric history. Query: `?hours=3` (default).        |
| `/ui/api/events`  | GET    | Diagnostic event log. Query parameters below.                   |

Event log query parameters

| Parameter  | Type    | Default | Description                                                                        |
|------------|---------|---------|------------------------------------------------------------------------------------|
| `severity` | string  | (all)   | Filter by severity: info, warning, error, critical.                                |
| `category` | string  | (all)   | Filter by category: node, shard, device, tenant, security, admin, gateway, raft.   |
| `hours`    | float   | 3       | Hours to look back.                                                                |
| `limit`    | integer | 100     | Maximum events to return.                                                          |

Example:

# Get last 50 error events in the past hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1&limit=50'

Response format:

{
  "count": 2,
  "events": [
    {
      "timestamp": "2026-04-23T14:30:00Z",
      "severity": "error",
      "category": "device",
      "source": "nvme-0001",
      "message": "Device SMART wear exceeds 90%"
    }
  ]
}
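When a script has already fetched a batch of events, the same filters the server accepts can be applied locally. A sketch over the response shape above (the second event here is invented sample data for illustration):

```python
# Toy response mirroring the documented /ui/api/events shape.
response = {
    "count": 2,
    "events": [
        {"timestamp": "2026-04-23T14:30:00Z", "severity": "error",
         "category": "device", "source": "nvme-0001",
         "message": "Device SMART wear exceeds 90%"},
        {"timestamp": "2026-04-23T14:25:00Z", "severity": "info",
         "category": "raft", "source": "shard-0001",
         "message": "Leader elected"},  # illustrative extra event
    ],
}

def filter_events(response, severity=None, category=None):
    """Client-side filter matching the server's query parameters."""
    return [e for e in response["events"]
            if (severity is None or e["severity"] == severity)
            and (category is None or e["category"] == category)]
```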

Cluster-wide view architecture

Every node in the cluster runs the same dashboard. The cluster-wide view is assembled by scraping /metrics from peer nodes:

  1. Each node knows its peers from KISEKI_RAFT_PEERS.
  2. A background task scrapes each peer’s /metrics endpoint at a configurable interval (default 10 seconds).
  3. Scraped metrics are cached locally in a MetricsAggregator.
  4. Dashboard requests aggregate local + cached peer metrics.

This means:

  • No single point of failure. Any node serves the dashboard.
  • Stale data tolerance. If a peer is unreachable, the dashboard shows the last known state and marks the node as “Unreachable.”
  • No additional infrastructure. No dedicated monitoring server is needed for basic cluster visibility.

For production monitoring with alerting and long-term retention, use Prometheus and Grafana (see Monitoring).

Backup & Recovery

Kiseki’s primary disaster recovery mechanism is federation (async replication to a secondary site). External backup is additive and optional, providing defense-in-depth for deployments that require it.


Architecture overview

Federation as primary DR

Federated-async replication to a secondary site is the recommended DR strategy (ADR-016). Properties:

  • RPO: Bounded by async replication lag (seconds to minutes).
  • RTO: Secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration.
  • Data replication: Ciphertext-only. No key material in the replication stream.

What is replicated

| Component               | Replicated? | Mechanism                             |
|-------------------------|-------------|---------------------------------------|
| Chunk data (ciphertext) | Yes         | Async replication to peer site        |
| Log deltas              | Yes         | Async replication of committed deltas |
| Control plane config    | Yes         | Federation config sync                |
| Tenant KMS config       | No          | Same tenant KMS serves both sites     |
| System master keys      | No          | Per-site system key manager           |
| Audit log               | Yes         | Per-tenant audit shard replicated     |

External backup

Cluster admins can configure external backup targets (S3-compatible object store). Backup data is encrypted with the system key at rest.


Backup operations

Creating a backup

# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/backup

# Via the kiseki-server CLI
kiseki-server backup create

Backup contents

Each backup snapshot contains:

  1. Per-shard metadata: Raft log snapshots for each shard, capturing the delta history up to the snapshot point.
  2. Chunk extent manifests: The chunks/meta.redb index mapping chunk IDs to device extents.
  3. Inline content: The small/objects.redb database (small-file data below the inline threshold).
  4. Control plane state: Tenant configuration, namespace mappings, quotas, compliance tags, federation peer registry.
  5. Key epoch metadata: Key epoch records from keys/epochs.redb (key material itself is NOT included in backups; it is managed by the system key manager and tenant KMS independently).

All backup data is encrypted. No plaintext chunk data appears in backup output. Backups reference chunk ciphertext on data devices by extent coordinates, not by copying the raw ciphertext (which would require reading and re-encrypting terabytes of data).

Listing backups

kiseki-server backup list

Deleting a backup

kiseki-server backup delete --backup-id backup-20260423-001

Retention policy

Backup retention is configurable per cluster. Defaults:

| Setting          | Default | Description                                          |
|------------------|---------|------------------------------------------------------|
| Retention period | 7 days  | Backups older than this are automatically deleted.   |
| Maximum backups  | 10      | Maximum number of retained backup snapshots.         |
| Backup frequency | Daily   | How often automatic backups are created (if enabled). |

Retention is enforced by a background task that runs on the Raft leader. Deletion of expired backups is recorded in the cluster audit log.
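The retention policy combines an age limit and a count cap. A sketch of the pruning decision (the `{"id", "created"}` record shape is an assumption of this example, not the on-disk format):

```python
from datetime import datetime, timedelta

def expired_backups(backups, now, retention_days=7, max_backups=10):
    """Return the IDs of backups to delete: anything older than
    retention_days, plus the oldest snapshots beyond max_backups."""
    newest_first = sorted(backups, key=lambda b: b["created"], reverse=True)
    keep = newest_first[:max_backups]          # count cap
    cutoff = now - timedelta(days=retention_days)
    doomed = {b["id"] for b in backups} - {b["id"] for b in keep}
    doomed |= {b["id"] for b in keep if b["created"] < cutoff}  # age limit
    return doomed

now = datetime(2026, 4, 23)
backups = [
    {"id": "b1", "created": now - timedelta(days=1)},
    {"id": "b2", "created": now - timedelta(days=8)},
    {"id": "b3", "created": now - timedelta(days=3)},
]
expired_backups(backups, now)  # {"b2"}
```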


Recovery procedures

Single node failure

Recovery path: Raft re-election + EC repair.

  1. The Raft group detects the failed node and elects a new leader (if the failed node was leader).
  2. EC repair automatically rebuilds chunk fragments that were on the failed node’s devices.
  3. RPO: 0 (committed data is on a majority of replicas). RTO: seconds to minutes.

No manual intervention required. Monitor the repair progress via:

kiseki-server repair list

Multiple node failure (quorum maintained)

Recovery path: Raft reconfiguration + EC repair.

If the cluster still has a Raft majority (e.g., 2 of 3 nodes alive), recovery is automatic:

  1. Raft continues operating with the surviving majority.
  2. EC repair rebuilds lost chunk fragments.
  3. Deploy replacement nodes and add them to the cluster.

Multiple node failure (quorum lost)

Recovery path: Manual Raft reconfiguration.

If the majority is lost (e.g., 2 of 3 nodes down), Raft cannot make progress. Recovery requires manual intervention:

  1. Identify the surviving node(s) with the most recent committed state.
  2. Force a new Raft configuration with the surviving node(s) as the initial voter set.
  3. Deploy replacement nodes and add them as learners.
  4. Promote learners to voters once they catch up.

Data loss risk: Deltas committed on the failed majority but not yet replicated to the surviving minority may be lost.

Full site failure (with federation)

Recovery path: Failover to federated peer.

  1. Redirect clients to the secondary site (DNS, load balancer, or manual reconfiguration).
  2. The secondary site has replicated chunk data, log deltas, and control plane config.
  3. Tenant KMS must be reachable from the secondary site (same KMS serves both sites).
  4. The secondary site’s system key manager has its own master keys, but tenant data is accessible because tenant KEKs come from the shared tenant KMS.

RPO: Replication lag. RTO: Minutes to hours (depends on control plane reconfiguration speed).

Full site failure (without federation)

Recovery path: Restore from external backup.

  1. Deploy a new cluster.
  2. Restore the backup snapshot to the new cluster.
  3. The system key manager on the new cluster generates new system master keys.
  4. Tenant KMS must be reconfigured to point to the new cluster.
  5. Re-wrap all envelopes with new system master keys.

RPO: Time since last backup. RTO: Hours (depends on data volume).

Tenant KMS loss

Unrecoverable (I-K11). If the tenant loses their KMS and has no backup of their KEK material, all data encrypted under those keys is permanently unreadable. Kiseki documents this requirement but provides no system-side escrow. The tenant controls and is responsible for their keys.


Recovery summary

| Scenario                          | Recovery path                    | RPO                 | RTO             |
|-----------------------------------|----------------------------------|---------------------|-----------------|
| Single node loss                  | Raft re-election + EC repair     | 0                   | Seconds-minutes |
| Multiple node loss (quorum held)  | Raft reconfiguration + EC repair | 0                   | Minutes         |
| Multiple node loss (quorum lost)  | Manual Raft reconfig             | Possible delta loss | Minutes-hours   |
| Full site loss (with federation)  | Failover to peer                 | Replication lag     | Minutes-hours   |
| Full site loss (no federation)    | Restore from backup              | Backup lag          | Hours           |
| Tenant KMS loss                   | Unrecoverable                    | N/A                 | N/A             |

Limitations

  • No point-in-time restore. Backups are snapshots, not continuous journals. Recovery restores the cluster to the state at the snapshot time. Deltas committed after the snapshot are lost unless federation has replicated them.

  • Backup does not include key material. System master keys and tenant KEKs are managed by their respective key managers. Backup and recovery of key material is the responsibility of the key manager operator (cluster admin for system keys, tenant admin for tenant KEKs).

  • Chunk ciphertext is referenced, not copied. Backup manifests reference chunk extents on data devices. If data devices are destroyed, the chunk ciphertext is lost. Federation replicates the actual ciphertext to a secondary site, which is why it is the primary DR mechanism.

  • Cross-site backup requires federation. There is no built-in mechanism to ship backup snapshots to a remote site outside of the federation framework. For cross-site backup without federation, operators must arrange their own transport of backup snapshots.

Monitoring & Observability

Kiseki provides three observability pillars: metrics (Prometheus), structured logging (tracing), and distributed traces (OpenTelemetry). All three are tenant-aware, respecting the zero-trust boundary between cluster admin and tenant admin (ADR-015).


Prometheus metrics

Every kiseki-server node exposes Prometheus metrics in text exposition format on the metrics HTTP port.

Endpoint

GET http://<node>:9090/metrics

Registered metrics

| Metric name                                | Type      | Labels           | Description                                                                 |
|--------------------------------------------|-----------|------------------|-----------------------------------------------------------------------------|
| `kiseki_raft_commit_latency_seconds`       | Histogram | `shard`          | Raft commit latency per shard. Buckets: 100us to 1s.                        |
| `kiseki_raft_entries_total`                | Counter   | (none)           | Total Raft entries applied on this node.                                    |
| `kiseki_chunk_write_bytes_total`           | Counter   | (none)           | Total chunk bytes written.                                                  |
| `kiseki_chunk_read_bytes_total`            | Counter   | (none)           | Total chunk bytes read.                                                     |
| `kiseki_chunk_ec_encode_seconds`           | Histogram | `strategy`       | EC encode latency. Buckets: 100us to 50ms.                                  |
| `kiseki_gateway_requests_total`            | Counter   | `method, status` | Gateway request count by method (GET, PUT, DELETE, etc.) and HTTP status.   |
| `kiseki_gateway_request_duration_seconds`  | Histogram | `method`         | Gateway request duration. Buckets: 1ms to 5s.                               |
| `kiseki_pool_capacity_total_bytes`         | Gauge     | `pool`           | Total capacity per pool in bytes.                                           |
| `kiseki_pool_capacity_used_bytes`          | Gauge     | `pool`           | Used capacity per pool in bytes.                                            |
| `kiseki_transport_connections_active`      | Gauge     | (none)           | Active transport connections.                                               |
| `kiseki_transport_connections_idle`        | Gauge     | (none)           | Idle transport connections.                                                 |
| `kiseki_shard_delta_count`                 | Gauge     | `shard`          | Current delta count per shard.                                              |
| `kiseki_key_rotation_total`                | Counter   | (none)           | Key rotations performed (system + tenant).                                  |
| `kiseki_crypto_shred_total`                | Counter   | (none)           | Crypto-shred operations performed.                                          |

Metric scoping (zero-trust)

Per ADR-015, metric scoping respects the zero-trust boundary:

  • Cluster admin sees: Aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless the cluster admin has approved access for that tenant.
  • Tenant admin sees: Their own tenant’s metrics via the tenant audit export.
  • No metric exposes: File names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).

Metric cardinality

Metric cardinality is bounded by design. Label values are drawn from fixed sets (shard IDs, pool names, HTTP methods, strategy names). There are no unbounded label values such as file paths, tenant IDs, or user identifiers in metrics labels.


Structured logging

Kiseki uses the tracing crate for structured logging. Every log event is a structured record with typed fields.

Configuration

| Variable            | Default | Description                                                |
|---------------------|---------|------------------------------------------------------------|
| `RUST_LOG`          | `info`  | Filter directive. Supports per-module granularity.         |
| `KISEKI_LOG_FORMAT` | `text`  | Output format: `text` (human-readable) or `json` (structured). |

Filter examples

# Default: info-level for all Kiseki modules
RUST_LOG=kiseki=info

# Debug for the Raft subsystem, info for everything else
RUST_LOG=kiseki_raft=debug,kiseki=info

# Trace-level for the chunk subsystem (very verbose)
RUST_LOG=kiseki_chunk=trace,kiseki=info

# Warnings only (quiet)
RUST_LOG=warn

JSON output format

In production, set KISEKI_LOG_FORMAT=json for structured log aggregation (ELK, Loki, Datadog, etc.):

{
  "timestamp": "2026-04-23T14:30:00.123Z",
  "level": "INFO",
  "target": "kiseki_raft",
  "message": "Raft leader elected",
  "shard": "shard-0001",
  "node_id": 1,
  "term": 42
}

Log levels

| Level | Usage                                                                          |
|-------|--------------------------------------------------------------------------------|
| ERROR | Unrecoverable failures, invariant violations, data loss events.                |
| WARN  | Recoverable issues, degraded state, approaching capacity limits.               |
| INFO  | Significant state changes: leader election, key rotation, shard split, node join/leave. |
| DEBUG | Detailed operational events: individual RPCs, cache hits/misses, EC operations. |
| TRACE | Wire-level detail: Raft message contents, HKDF inputs, bitmap operations.      |

Security in logs

  • Tenant-identifying fields (tenant_id, namespace) are present for correlation.
  • Content fields (file names, chunk plaintext, key material) are never logged (I-K8).
  • Logs ship to the same audit/observability pipeline.

Distributed tracing (OpenTelemetry)

Kiseki uses OpenTelemetry for distributed tracing across the full write/read path.

Configuration

| Variable                       | Default         | Description                                                                              |
|--------------------------------|-----------------|------------------------------------------------------------------------------------------|
| `OTEL_EXPORTER_OTLP_ENDPOINT`  | (none)          | OTLP gRPC endpoint. Example: `http://jaeger:4317`. When not set, tracing is disabled.    |
| `OTEL_SERVICE_NAME`            | `kiseki-server` | Service name in traces.                                                                  |
| `OTEL_TRACES_SAMPLER_ARG`      | `1.0`           | Sampling rate (1.0 = 100%, 0.1 = 10%). Reduce in production for high-throughput workloads. |

Trace propagation

Every write/read path carries a trace ID via OpenTelemetry context propagation. Traces span:

client -> gateway -> composition -> log -> chunk -> view

For the native client path:

client (FUSE) -> transport -> composition -> log -> chunk

Jaeger integration

The development Docker Compose stack includes Jaeger for trace visualization:

  • Jaeger UI: http://localhost:16686
  • OTLP gRPC receiver: localhost:4317

Trace scoping

Traces respect the zero-trust boundary:

  • Tenant-scoped traces are visible only to the tenant admin (via tenant audit export).
  • Cluster admin sees system-level spans. No tenant content appears in span attributes visible to the cluster admin.
  • Trace overhead is approximately 1-2% on the data path (acceptable for production).

Event store

The admin dashboard maintains an in-memory event store for diagnostic events. Events are categorized and severity-tagged.

Event categories

| Category | Events                                                                     |
|----------|----------------------------------------------------------------------------|
| node     | Node join, node leave, node unreachable, node recovered.                   |
| shard    | Shard created, shard split, shard maintenance entered/exited.              |
| device   | Device added, device failed, SMART warning, evacuation started/completed.  |
| tenant   | Tenant created, tenant deleted, quota changed.                             |
| security | Auth failure, cert revocation, crypto-shred.                               |
| admin    | Maintenance mode toggle, backup requested, scrub requested, tuning parameter change. |
| gateway  | Protocol errors, connection surge, rate limiting.                          |
| raft     | Leader election, membership change, snapshot transfer.                     |

Event severities

| Severity | Description                                               |
|----------|-----------------------------------------------------------|
| info     | Normal operations.                                        |
| warning  | Attention needed, but system is operating.                |
| error    | Failure requiring investigation.                          |
| critical | Immediate action required (data at risk, quorum lost).    |

Event API

# All events from the last 3 hours
curl http://node1:9090/ui/api/events

# Errors from the last hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1'

# Device events, last 50
curl 'http://node1:9090/ui/api/events?category=device&limit=50'

# Security events from the last 24 hours
curl 'http://node1:9090/ui/api/events?category=security&hours=24'

Historical metrics API

# Metric snapshots from the last 3 hours
curl http://node1:9090/ui/api/history

# Last 6 hours
curl 'http://node1:9090/ui/api/history?hours=6'

The history endpoint returns time-series data points suitable for charting. The default retention is 3 hours in memory. For longer retention, use Prometheus.


Grafana integration

For production monitoring with alerting and long-term storage, configure Prometheus to scrape Kiseki metrics and visualize with Grafana.

Prometheus scrape configuration

scrape_configs:
  - job_name: 'kiseki'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'node1:9090'
          - 'node2:9090'
          - 'node3:9090'
    metrics_path: '/metrics'

Cluster overview dashboard:

  • Cluster health (up/down per node)
  • Total Raft entries/sec (rate of kiseki_raft_entries_total)
  • Gateway request rate (rate of kiseki_gateway_requests_total)
  • Gateway latency p50/p99 (kiseki_gateway_request_duration_seconds)
  • Pool utilization (kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes)

Per-node dashboard:

  • Raft commit latency histogram (kiseki_raft_commit_latency_seconds)
  • Chunk read/write throughput
  • Transport connection count
  • Shard delta count per shard

Capacity dashboard:

  • Pool fill percentage over time
  • Pool capacity trend (linear projection for capacity planning)
  • Delta count growth rate (shard split prediction)

Key management dashboard:

  • Key rotation count over time (kiseki_key_rotation_total)
  • Crypto-shred count (kiseki_crypto_shred_total)

Alerting rules

Recommended Prometheus alerting rules:

groups:
  - name: kiseki
    rules:
      - alert: KisekiNodeDown
        expr: up{job="kiseki"} == 0
        for: 1m
        labels:
          severity: critical

      - alert: KisekiPoolCapacityWarning
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.85
        for: 5m
        labels:
          severity: warning

      - alert: KisekiPoolCapacityCritical
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.92
        for: 1m
        labels:
          severity: critical

      - alert: KisekiGatewayLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning

      - alert: KisekiRaftCommitLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_raft_commit_latency_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning

Key Management

Kiseki uses a two-layer encryption model where system-level encryption protects data at rest and tenant-level key wrapping controls access. This page covers operational aspects of key management: rotation, re-encryption, crypto-shred, and external KMS integration.


Encryption model

Kiseki implements Model (C) from ADR-002: single data encryption pass at the system layer, with tenant access via key wrapping. No double encryption.

Plaintext chunk
  |
  v
System DEK (AES-256-GCM)  -->  Ciphertext (stored on disk)
  |
  v
System KEK (wraps DEK derivation material)
  |
  v
Tenant KEK (wraps system DEK derivation parameters per tenant)

System keys

  • System DEK: Per-chunk symmetric key derived locally on each storage node via HKDF-SHA256 (ADR-003). Never stored, never transmitted. Derivation: HKDF(master_key[epoch], chunk_id, "kiseki-chunk-dek-v1").
  • System master key: Per-epoch master key stored in the system key manager (kiseki-keyserver). Storage nodes fetch it at startup and on epoch rotation, then derive per-chunk DEKs locally. The key manager never sees individual chunk IDs.
  • System KEK: Wraps system master keys. Managed by the cluster admin.
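The derivation scheme can be sketched with a minimal RFC 5869 HKDF-SHA256 built from the standard library. How Kiseki splits the inputs between HKDF's salt and info parameters is an assumption of this sketch; only the three inputs (epoch master key, chunk ID, the `"kiseki-chunk-dek-v1"` label) come from the text above:

```python
import hashlib
import hmac

def hkdf_sha256(key, info, length=32, salt=b""):
    """Minimal RFC 5869 HKDF-SHA256 (extract then expand)."""
    prk = hmac.new(salt or b"\x00" * 32, key, hashlib.sha256).digest()
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]

def derive_chunk_dek(master_key, chunk_id):
    """Per-chunk DEK derived locally on the storage node; the
    chunk_id never leaves the node. Input layout is an assumption."""
    return hkdf_sha256(master_key, b"kiseki-chunk-dek-v1" + chunk_id)
```

Because derivation is deterministic, a node can recompute any chunk's DEK from the epoch master key alone, which is why per-chunk keys are never stored or transmitted.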

Tenant keys

  • Tenant KEK: Key wrapping key managed by the tenant’s chosen KMS backend. Wraps access to system DEK derivation parameters (epoch + chunk_id). Destroying the tenant KEK = crypto-shred (data becomes unreadable).
  • No Tenant DEK: Model (C) does not double-encrypt. The tenant layer is key-wrapping, not data-encryption.

Invariants

  • I-K1: No plaintext chunk is ever persisted to storage.
  • I-K2: No plaintext payload is ever sent on the wire.
  • I-K7: Authenticated encryption (AES-256-GCM) everywhere.
  • I-K8: Keys are never logged, printed, transmitted in the clear, or stored in configuration files.

System key manager

The system key manager (kiseki-keyserver) is a dedicated HA service backed by its own Raft consensus group.

Deployment

Deploy on 3-5 dedicated nodes, separate from storage nodes. The system key manager must be at least as available as the log (I-K12) because its unavailability blocks all chunk writes cluster-wide.

Key distribution

kiseki-keyserver:
  Stores: master_key per epoch (Raft-replicated)
  Serves: master_key to authenticated kiseki-server processes (mTLS)
  Never sees: individual chunk_ids or per-chunk operations

kiseki-server:
  Caches: master_key (mlock'd, MADV_DONTDUMP, seccomp)
  Derives: per-chunk DEK = HKDF(master_key, chunk_id) -- locally
  Never sends: chunk_ids to the key manager

This design prevents the key manager from building an index of all chunk IDs, which would leak per-tenant access patterns.


Key rotation

System key rotation

System key rotation creates a new epoch with a new master key. The rotation process:

  1. Cluster admin initiates rotation via RotateSystemKey().
  2. The key manager generates a new master key and assigns a new epoch.
  3. Storage nodes are notified and fetch the new master key.
  4. New writes use the new epoch. Old data retains its epoch.
  5. Two epochs coexist during the rotation window (I-K6).

Old master keys are retained until all data encrypted under them has been re-encrypted or deleted. Full re-encryption is available as an explicit admin action.

Tenant key rotation

Tenant key rotation creates a new epoch for the tenant’s KEK:

  1. Tenant admin initiates rotation via RotateTenantKey(tenant).
  2. The tenant KMS generates or rotates the key (provider-specific).
  3. New envelope wrappings use the new epoch.
  4. Old wrapped material remains valid until background re-wrapping completes.

Background re-encryption

A background monitor detects envelopes wrapped under old epochs and schedules re-wrapping. The rewrap worker:

  1. Reads envelopes with old-epoch tenant wrapping.
  2. Unwraps with old KEK.
  3. Re-wraps with current KEK.
  4. Writes the updated envelope.

For providers that support server-side rewrap (e.g., Vault Transit), the rewrap operation never exposes plaintext derivation material to the storage node.


Crypto-shred

Crypto-shred is the authoritative deletion mechanism in Kiseki. Destroying the tenant KEK renders all tenant data unreadable.

Process

  1. Tenant admin initiates via CryptoShred(tenant).
  2. The tenant KMS destroys the KEK (provider-specific: Vault key deletion, AWS KMS key scheduling, PKCS#11 key destruction).
  3. All cached key material for the tenant is invalidated across the cluster.
  4. Native clients detect the shred via key health checks (default every 30 seconds) and wipe their caches (I-CC12).

What happens after crypto-shred

  • Data is semantically deleted: No component can decrypt the tenant’s data because the KEK is destroyed.
  • Ciphertext remains on disk: Physical GC runs separately when chunk refcount = 0 AND no retention hold is active (I-C2b).
  • Audit trail preserved: Crypto-shred events are recorded in the audit log.

Ordering requirement

If retention holds are needed, they must be set before crypto-shred:

Set retention hold -> Crypto-shred -> Hold expires -> GC eligible

This prevents a race between crypto-shred and GC (I-C2b).
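The GC predicate combines both conditions from I-C2b. A sketch, assuming retention holds are tracked as a list of expiry timestamps (the record shape is illustrative):

```python
def gc_eligible(refcount, hold_expiries, now):
    """Ciphertext may be physically collected only when the chunk's
    refcount is zero AND every retention hold has expired (I-C2b)."""
    return refcount == 0 and all(expiry <= now for expiry in hold_expiries)
```

Setting the hold before the shred means `gc_eligible` stays false through the shred and only flips once the hold expires, closing the race described above.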

Detection latency

Crypto-shred detection is bounded by: min(key_health_interval, max_disconnect_seconds).

The key health check interval defaults to 30 seconds and is configurable per tenant within [5s, 300s] (I-K15).


External KMS providers (ADR-028)

Kiseki supports five tenant KMS backends via the TenantKmsProvider trait. The provider is selected per-tenant at onboarding.

Provider comparison

| # | Backend         | Transport   | Material model    | Key material location         |
|---|-----------------|-------------|-------------------|-------------------------------|
| 1 | Kiseki Internal | In-process  | Local             | Separate Raft group in Kiseki |
| 2 | HashiCorp Vault | HTTPS       | Local (cached)    | Vault Transit engine          |
| 3 | KMIP 2.1        | mTLS (TTLV) | Remote or local   | KMIP server / HSM             |
| 4 | AWS KMS         | HTTPS       | Remote only       | AWS KMS                       |
| 5 | PKCS#11 v3.0    | Local (FFI) | Remote only (HSM) | Hardware Security Module      |

Provider invariants

  • I-K16: Provider abstraction is opaque to callers. No correctness decision depends on which backend is selected.
  • I-K17: Wrap/unwrap operations include AAD (chunk_id) binding. A wrapped blob cannot be spliced from one envelope to another.
  • I-K18: Provider is validated on configuration: connectivity test, wrap/unwrap round-trip, certificate chain. Validation failure prevents tenant activation.
  • I-K19: Internal provider stores tenant KEKs in a separate Raft group from system master keys.
  • I-K20: Provider migration (e.g., Internal to Vault) requires re-wrapping all existing envelopes. Migration is background, audited, and preserves data availability throughout.
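A plausible shape for the TenantKmsProvider trait, consistent with I-K16/I-K17/I-K18. The method names and signatures here are assumptions, not the actual kiseki-crypto trait, and the toy in-process implementation uses XOR mixed with an AAD byte purely so the AAD-binding property is visible; it is not real cryptography.

```rust
type WrappedKey = Vec<u8>;

#[derive(Debug, PartialEq)]
enum KmsError { Unavailable, ValidationFailed }

trait TenantKmsProvider {
    /// Wrap a DEK under the tenant KEK. `aad` carries the chunk_id binding (I-K17).
    fn wrap(&self, tenant: &str, dek: &[u8], aad: &[u8]) -> Result<WrappedKey, KmsError>;
    fn unwrap(&self, tenant: &str, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, KmsError>;
    fn rotate(&self, tenant: &str) -> Result<u32, KmsError>;   // returns the new epoch
    fn destroy(&self, tenant: &str) -> Result<(), KmsError>;   // crypto-shred
    /// Onboarding validation (I-K18): wrap/unwrap round-trip.
    fn validate(&self) -> Result<(), KmsError> {
        let wrapped = self.wrap("probe", b"dek", b"aad")?;
        if self.unwrap("probe", &wrapped, b"aad")? == b"dek".to_vec() { Ok(()) }
        else { Err(KmsError::ValidationFailed) }
    }
}

/// Toy in-process provider, for illustration only.
struct ToyProvider;

fn aad_byte(aad: &[u8]) -> u8 { aad.iter().fold(0u8, |a, b| a.wrapping_mul(31).wrapping_add(*b)) }

impl TenantKmsProvider for ToyProvider {
    fn wrap(&self, _t: &str, dek: &[u8], aad: &[u8]) -> Result<WrappedKey, KmsError> {
        Ok(dek.iter().map(|b| b ^ 0x5A ^ aad_byte(aad)).collect())
    }
    fn unwrap(&self, t: &str, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, KmsError> {
        self.wrap(t, wrapped, aad) // XOR is symmetric
    }
    fn rotate(&self, _t: &str) -> Result<u32, KmsError> { Ok(2) }
    fn destroy(&self, _t: &str) -> Result<(), KmsError> { Ok(()) }
}

fn main() {
    let p = ToyProvider;
    assert!(p.validate().is_ok());
    let w = p.wrap("acme", b"secret-dek", b"chunk-42").unwrap();
    // Splicing the wrapped blob to a different chunk's AAD does not recover the DEK (I-K17).
    assert_ne!(p.unwrap("acme", &w, b"chunk-43").unwrap(), b"secret-dek".to_vec());
    assert_eq!(p.unwrap("acme", &w, b"chunk-42").unwrap(), b"secret-dek".to_vec());
}
```

Because callers only see this trait, no correctness decision can depend on which backend is behind it (I-K16).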

Provider 1: Kiseki Internal (default)

Zero-configuration default. Kiseki manages tenant KEKs internally in a Raft group separate from system master keys. Suitable for single-operator deployments.

Security trade-off: Internal mode does not provide the full two-layer security guarantee: the system key store and the tenant key store both live inside the cluster, so compromising the cluster compromises both and yields full access. Compliance-sensitive tenants should use an external provider.

Provider 2: HashiCorp Vault

Uses Vault’s Transit secrets engine for encryption-as-a-service:

| Kiseki operation | Vault API                                                       |
|------------------|-----------------------------------------------------------------|
| wrap             | POST /transit/encrypt/:name (with context = AAD)                |
| unwrap           | POST /transit/decrypt/:name (with context = AAD)                |
| rotate           | POST /transit/keys/:name/rotate                                 |
| rewrap           | POST /transit/rewrap/:name (server-side, no plaintext exposure) |
| destroy          | DELETE /transit/keys/:name                                      |
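The wrap row maps onto Vault's Transit encrypt endpoint roughly as below. The helper is illustrative only: it builds the request URL and JSON body rather than performing the HTTPS call, and it assumes the plaintext and context are already base64-encoded, as the Vault API requires.

```rust
/// Build the URL and JSON body for a Vault Transit encrypt (wrap) call.
/// `plaintext_b64` and `context_b64` must already be base64, per the Vault API.
fn vault_encrypt_request(vault_addr: &str, key_name: &str,
                         plaintext_b64: &str, context_b64: &str) -> (String, String) {
    let url = format!("{}/v1/transit/encrypt/{}", vault_addr, key_name);
    // `context` carries the AAD binding (chunk_id), mirroring I-K17.
    let body = format!(r#"{{"plaintext":"{}","context":"{}"}}"#, plaintext_b64, context_b64);
    (url, body)
}

fn main() {
    // Hypothetical Vault address and per-tenant key name.
    let (url, body) = vault_encrypt_request("https://vault.example:8200", "tenant-acme",
                                            "cGxhaW50ZXh0", "Y2h1bmstNDI=");
    assert_eq!(url, "https://vault.example:8200/v1/transit/encrypt/tenant-acme");
    assert!(body.contains(r#""context":"Y2h1bmstNDI=""#));
    println!("POST {} {}", url, body);
}
```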

Provider 3: KMIP 2.1

Standards-based integration with enterprise KMS and HSM appliances. Uses mTLS over TTLV binary protocol.

Provider 4: AWS KMS

Cloud-native KMS integration. Key material never leaves AWS. All wrap/unwrap operations are remote HTTPS calls. Suitable for hybrid cloud deployments.

Provider 5: PKCS#11 v3.0

Direct HSM integration via the PKCS#11 C API (FFI). Key material stays in the HSM. Highest security level; requires HSM hardware on, or accessible from, the storage nodes.


OIDC integration

Tenant identity providers can be integrated for second-stage authentication (I-Auth2). This is optional and orthogonal to the KMS provider choice.

When configured, workload-level identity is validated against the tenant admin’s authorization via OIDC/JWT tokens, providing “authorized by my tenant admin” on top of the mTLS-based “belongs to this cluster” identity.

Keycloak is included in the development Docker Compose stack for OIDC testing.


Operational checklist

Key rotation schedule

| Key type          | Recommended interval | Enforcement                 |
|-------------------|----------------------|-----------------------------|
| System master key | Quarterly            | Manual (cluster admin)      |
| Tenant KEK        | Per tenant policy    | Manual or automated via KMS |
| TLS certificates  | Annual               | Cluster CA renewal          |

Monitoring key health

# Check key manager health
kiseki-server keymanager health

# Check tenant KMS connectivity
kiseki-server keymanager check-kms

# Monitor key rotation metrics
curl -s http://node1:9090/metrics | grep kiseki_key_rotation_total

# Monitor crypto-shred events
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total

Key material security

  • Master keys are mlock’d in memory on storage nodes (prevent swapping).
  • Core dumps are disabled (LimitCORE=0 in systemd, MADV_DONTDUMP).
  • seccomp filters restrict system calls on key-handling threads.
  • Runtime integrity monitor detects ptrace, /proc/pid/mem access, and debugger attachment (I-O7).
  • Keys are zeroized on deallocation (Zeroizing<Vec<u8>>).
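The Zeroizing<Vec<u8>> in the last bullet comes from the zeroize crate. A simplified, std-only illustration of the pattern (overwrite on drop) is below; the real crate additionally uses compiler fences so the wipe cannot be optimized away, which this sketch does not claim to replicate.

```rust
/// Simplified stand-in for zeroize::Zeroizing<Vec<u8>>: the buffer is
/// overwritten when the wrapper is dropped, so key material does not
/// linger on the heap after use.
struct WipedOnDrop(Vec<u8>);

impl Drop for WipedOnDrop {
    fn drop(&mut self) {
        for b in self.0.iter_mut() {
            // Volatile store so the compiler keeps the write; zeroize adds fences on top.
            unsafe { std::ptr::write_volatile(b, 0) };
        }
    }
}

fn main() {
    let key = WipedOnDrop(vec![0x42; 32]);
    assert!(key.0.iter().all(|&b| b == 0x42)); // usable while alive
    drop(key);                                 // zeroized here
    println!("key material wiped on drop");
}
```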

System Overview

Kiseki is a distributed storage system designed for HPC and AI workloads. It provides a unified data fabric with POSIX (FUSE), NFS, and S3 access paths, two-layer encryption with tenant-controlled crypto-shred, and pluggable HPC transports (CXI/Slingshot, InfiniBand, RoCEv2).

Workspace structure

The codebase is a single Rust workspace with 18 crates:

| Crate              | Purpose                                                            |
|--------------------|--------------------------------------------------------------------|
| kiseki-common      | Shared types, HLC, identifiers, errors                             |
| kiseki-proto       | Generated protobuf/gRPC code                                       |
| kiseki-crypto      | FIPS AEAD (AES-256-GCM), envelope encryption, tenant KMS providers |
| kiseki-raft        | Shared Raft config, redb log store, TCP transport                  |
| kiseki-transport   | Transport abstraction: TCP+TLS, RDMA verbs, CXI/libfabric          |
| kiseki-log         | Log context: delta ordering, shard lifecycle, Raft consensus       |
| kiseki-block       | Raw block device I/O, bitmap allocator, superblock (ADR-029)       |
| kiseki-chunk       | Chunk storage: placement, erasure coding, GC, device management    |
| kiseki-composition | Composition context: namespace, refcount, multipart                |
| kiseki-view        | View materialization: stream processors, MVCC pins                 |
| kiseki-gateway     | Protocol gateway: NFS and S3 translation                           |
| kiseki-client      | Native client: FUSE, transport selection, client-side cache        |
| kiseki-keymanager  | System key manager with Raft HA                                    |
| kiseki-audit       | Append-only audit log with per-tenant shards                       |
| kiseki-advisory    | Workflow advisory: hints, telemetry, budgets (ADR-020/021)         |
| kiseki-control     | Control plane: tenancy, IAM, policy, federation                    |
| kiseki-server      | Storage node binary (composes all server-side crates)              |
| kiseki-acceptance  | BDD acceptance tests (cucumber-rs)                                 |

Bounded contexts

The domain is organized into eight bounded contexts, each with a distinct responsibility, failure domain, and scaling concern:

  1. Log – Delta ordering, Raft consensus, shard lifecycle
  2. Chunk Storage – Encrypted chunk persistence, placement, EC, GC
  3. Composition – Tenant-scoped metadata assembly, namespace management
  4. View Materialization – Protocol-shaped materialized projections
  5. Protocol Gateway – NFS and S3 wire protocol translation
  6. Control Plane – Tenancy, IAM, quota, policy, federation
  7. Key Management – System DEK/KEK, tenant KMS providers, crypto-shred
  8. Workflow Advisory – Client hints, telemetry feedback (cross-cutting)

Additionally, Native Client runs on compute nodes as a separate trust boundary and Block I/O handles raw device management underneath chunk storage.

Data path

Client (plaintext) ──encrypt──► Gateway / Native Client
                                       │
                                       ▼
                                  Composition
                                  (assemble chunks, record delta)
                                       │
                              ┌────────┴────────┐
                              ▼                 ▼
                          Log (Raft)       Chunk Storage
                     (commit delta,      (write encrypted
                      replicate)          chunk to device)

Write path: The native client (or the gateway, for protocol-path clients) encrypts data with a system DEK that is wrapped by the tenant KEK. The composition layer assembles chunk references and records a delta. The delta is committed through Raft on the owning shard. Chunks are written to affinity pools with erasure coding.

Read path: The client issues a view lookup (materialized from log deltas). The view resolves chunk references. Chunks are read from devices, decrypted, and returned to the client.

Control path

Admin ──► Control Plane (gRPC)
              │
              ├── Tenant / Namespace / Quota / Policy
              ├── Flavor management
              ├── Federation (async cross-site)
              └── Advisory policy (hint budgets, profiles)

The control plane manages tenant lifecycle, IAM, quotas, compliance tags, placement policy, and federation. It communicates with storage nodes via gRPC on the management network. The control plane depends only on kiseki-common and kiseki-proto (crate-graph firewall, ADR-027).

Advisory path (ADR-020)

Client ──hints──► Advisory Runtime ──telemetry──► Client
                      │
                      ├── Route hints to Chunk / View / Composition
                      ├── Emit caller-scoped telemetry feedback
                      └── Audit advisory events

The workflow advisory system is a cross-cutting concern (not a bounded context). It carries two flows over a bidirectional gRPC channel per declared workflow:

  • Hints (client to storage): advisory steering signals for prefetch, affinity, priority, and phase-adaptive tuning. Never authoritative (I-WA1).
  • Telemetry feedback (storage to client): caller-scoped signals about backpressure, locality, materialization lag, and QoS headroom (I-WA5).

The advisory runtime runs on a dedicated tokio runtime, isolated from the data path. Advisory failures never block data-path operations (I-WA2).

Network ports

| Port | Purpose                                                   |
|------|-----------------------------------------------------------|
| 9100 | Data-path gRPC (Log, Chunk, Composition, View, Discovery) |
| 9101 | Advisory gRPC (WorkflowAdvisoryService)                   |
| 9000 | S3 HTTP gateway                                           |
| 2049 | NFS server                                                |
| 9090 | Prometheus metrics + health + admin UI                    |

Binaries

| Binary             | Contents                                                | Deployment                        |
|--------------------|---------------------------------------------------------|-----------------------------------|
| kiseki-server      | Log, Chunk, Composition, View, Gateway, Audit, Advisory | Every storage node                |
| kiseki-client-fuse | Native client with FUSE                                 | Compute nodes                     |
| kiseki-control     | Control plane                                           | Management network (3+ instances) |
| kiseki-keyserver   | System key manager (Raft HA)                            | Dedicated cluster (3-5 nodes)     |

Bounded Contexts

Eight bounded contexts form the core domain model. Each has a distinct responsibility, failure domain, and scaling concern. This page describes each context’s purpose, implementing crate, key types, and governing invariants.


1. Log

Crate: kiseki-log

Purpose: Accept deltas, assign them a total order within a shard, replicate via Raft, persist durably, and support range reads for view materialization and replay.

Key types: Delta, DeltaEnvelope, Shard, ShardConfig, ShardInfo

Key invariants:

| ID   | Rule                                                                                          |
|------|-----------------------------------------------------------------------------------------------|
| I-L1 | Within a shard, deltas have a total order                                                     |
| I-L2 | A committed delta is durable on a majority of Raft replicas before ack                        |
| I-L3 | A delta is immutable once committed                                                           |
| I-L4 | Delta GC requires ALL consumers (views + audit) to have advanced past the delta               |
| I-L5 | A composition is not visible until all referenced chunks are durable                          |
| I-L6 | Shards have a hard ceiling triggering mandatory split (delta count, byte size, or throughput) |
| I-L7 | Delta envelope has separated system-visible header and tenant-encrypted payload               |
| I-L8 | Cross-shard rename returns EXDEV (no 2PC across shards)                                       |
| I-L9 | A delta’s inlined payload is immutable after write; threshold changes apply prospectively     |

Failure domain: Per-shard. Leader loss causes transient latency (election). Quorum loss makes the shard unavailable.


2. Chunk Storage

Crate: kiseki-chunk (with kiseki-block for device I/O)

Purpose: Store and retrieve opaque encrypted chunks. Manage placement across affinity pools. Handle erasure coding and replication. Run GC based on refcounts and retention holds.

Key types: Chunk, ChunkId, Envelope, AffinityPool, DeviceBackend

Key invariants:

| ID    | Rule                                                                                |
|-------|-------------------------------------------------------------------------------------|
| I-C1  | Chunks are immutable; new versions are new chunks                                   |
| I-C2  | A chunk is not GC’d while any composition references it (refcount > 0)              |
| I-C2b | A chunk is not GC’d while a retention hold is active                                |
| I-C3  | Chunks are placed according to affinity policy from the referencing view descriptor |
| I-C4  | Durability strategy is per affinity pool (EC default, N-copy replication available) |
| I-C5  | Pool writes rejected at Critical threshold (SSD 85%, HDD 92%); ENOSPC at Full       |
| I-C6  | EC parameters are immutable per pool; SetPoolDurability applies to new chunks only  |
| I-C7  | All chunk data writes are aligned to device physical block size (ADR-029)           |
| I-C8  | Allocation bitmap is ground truth; free-list is a derived cache rebuilt on startup  |

Failure domain: Per-chunk or per-device. Chunk loss recoverable via EC parity or replicas.


3. Composition

Crate: kiseki-composition

Purpose: Maintain tenant-scoped metadata structures describing how chunks assemble into data units (files, objects). Manage namespaces. Record mutations as deltas in the log.

Key types: Composition, Namespace, CompositionMutation

Key invariants:

| ID   | Rule                                                                                      |
|------|-------------------------------------------------------------------------------------------|
| I-X1 | A composition belongs to exactly one tenant                                               |
| I-X2 | A composition’s chunks respect the tenant’s dedup policy (global hash or per-tenant HMAC) |
| I-X3 | A composition’s mutation history is fully reconstructible from its shard’s deltas         |

Failure domain: Coupled to Log. If a shard fails, its compositions are affected.


4. View Materialization

Crate: kiseki-view

Purpose: Consume deltas from shards and maintain materialized views per view descriptor. Handle view lifecycle (create, discard, rebuild) and MVCC read pins.

Key types: View, ViewDescriptor, StreamProcessor, MvccPin

Key invariants:

| ID   | Rule                                                                                  |
|------|---------------------------------------------------------------------------------------|
| I-V1 | A view is derivable from its source shard(s) alone (rebuildable-from-log)             |
| I-V2 | A view’s observed state is a consistent prefix of its source log(s) up to a watermark |
| I-V3 | Cross-view consistency governed by the reading protocol’s declared consistency model  |
| I-V4 | MVCC read pins have bounded lifetime; pin expiration revokes the snapshot guarantee   |

Failure domain: Per-view. A fallen-behind view serves stale data. A lost view can be rebuilt from the log.


5. Protocol Gateway

Crate: kiseki-gateway

Purpose: Translate wire protocol requests (NFS, S3) into operations against views and the log. Serve reads from views. Route writes as deltas to the log via composition. Perform tenant-layer encryption for protocol-path clients.

Key types: Protocol gateway instance, protocol plugin

Trust boundary: NFS/S3 clients send plaintext over TLS to the gateway. The gateway encrypts before writing to log/chunks. Plaintext exists in gateway memory only ephemerally.

Failure domain: Per-gateway. Crash disconnects affected clients. Restart and client reconnect recovers.


6. Control Plane

Crate: kiseki-control

Purpose: Declarative API for tenancy, IAM, policy, placement, discovery, compliance tagging, and federation. Manages cluster-level and tenant-level configuration.

Key types: Organization, Project, Workload, Flavor, ComplianceRegime, RetentionHold, FederationPeer

Key invariants:

| ID    | Rule                                                                              |
|-------|-----------------------------------------------------------------------------------|
| I-T1  | Tenants are fully isolated; no cross-tenant data access                           |
| I-T2  | Tenant resource consumption bounded by quotas at org and workload levels          |
| I-T3  | Tenant keys not accessible to other tenants or shared processes                   |
| I-T4  | Cluster admin cannot access tenant data without tenant admin approval             |
| I-T4c | Cluster admin modifications to pools with tenant data are audit-logged to tenant  |

Failure domain: Control plane unavailability prevents new tenant creation and policy changes, but the existing data path continues with last-known configuration.


7. Key Management

Crates: kiseki-keymanager, kiseki-crypto

Purpose: Custody, rotation, escrow, and issuance of all key material. Two layers: system keys (cluster admin) and tenant key wrapping (tenant admin via tenant KMS). Orchestrate crypto-shred.

Key types: SystemDek, SystemKek, TenantKek, KeyEpoch, Envelope, TenantKmsProvider

Tenant KMS providers (ADR-028): Five pluggable backends implementing the TenantKmsProvider trait – Kiseki-Internal, HashiCorp Vault, KMIP 2.1, AWS KMS, and PKCS#11.

Key invariants:

| ID    | Rule                                                                         |
|-------|------------------------------------------------------------------------------|
| I-K1  | No plaintext chunk is ever persisted to storage                              |
| I-K2  | No plaintext payload is ever sent on the wire                                |
| I-K4  | System can enforce access without reading plaintext                          |
| I-K5  | Crypto-shred renders data unreadable within bounded time                     |
| I-K6  | Key rotation does not lose access to old data until explicit cutover         |
| I-K7  | Authenticated encryption everywhere                                          |
| I-K8  | Keys are never logged, printed, transmitted in the clear, or in config files |
| I-K16 | Provider abstraction is opaque to callers                                    |
| I-K17 | Wrap/unwrap operations include AAD (chunk_id) binding                        |

Failure domain: KMS unavailability blocks new encrypt/decrypt operations. This context’s availability is as critical as the Log’s.


8. Workflow Advisory (cross-cutting)

Crate: kiseki-advisory

Purpose: Carry workflow hints from clients to storage and telemetry feedback from storage back to clients. Route advisory signals to the bounded context best able to act on them.

Key types: WorkflowRef, OperationAdvisory, PoolHandle, PoolDescriptor, HintBudget

Key invariants:

| ID     | Rule                                                                                          |
|--------|-----------------------------------------------------------------------------------------------|
| I-WA1  | Hints are advisory only; no correctness decision depends on a hint                            |
| I-WA2  | Advisory subsystem is isolated from the data path; failures do not block data-path operations |
| I-WA3  | A workflow belongs to exactly one workload; authorization is per-operation                    |
| I-WA5  | Telemetry feedback is scoped to the caller’s authorization                                    |
| I-WA6  | Advisory requests are not existence or content oracles                                        |
| I-WA7  | Hint budgets enforced per workload within parent ceilings                                     |
| I-WA14 | Hints do not extend tenant capabilities                                                       |

Runtime isolation: The advisory runtime runs on a dedicated tokio runtime separate from the data-path runtime (ADR-021). No data-path crate depends on kiseki-advisory.


Cross-context relationships

| Producer             | Consumer               | What flows                                        |
|----------------------|------------------------|---------------------------------------------------|
| Control Plane        | All contexts           | Policy, placement, tenant config, compliance tags |
| Log                  | Composition, View      | Deltas (ordered, durable)                         |
| Composition          | Chunk Storage          | Chunk references (refcounts)                      |
| Key Management       | Chunk Storage          | System DEKs                                       |
| Key Management       | Gateway, Native Client | Tenant KEK (wrapping)                             |
| View Materialization | Gateway, Native Client | Materialized view state                           |
| Chunk Storage        | View, Native Client    | Chunk data (encrypted)                            |

Data Flow

This page describes the write, read, inline, and cross-node data paths through the Kiseki system.


Write path

┌──────────┐    plaintext    ┌───────────────────┐
│  Client  │ ──────────────► │  Gateway /        │
│          │   (over TLS)    │  Native Client    │
└──────────┘                 └────────┬──────────┘
                                      │ 1. Encrypt with tenant KEK
                                      │    wrapping system DEK
                                      │ 2. Content-defined chunking
                                      │    (Rabin fingerprinting)
                                      ▼
                             ┌──────────────────┐
                             │   Composition    │
                             │                  │
                             │ 3. Record chunk  │
                             │    references    │
                             │ 4. Build delta   │
                             └───────┬──────────┘
                                     │
                            ┌────────┴────────┐
                            ▼                 ▼
                   ┌──────────────┐   ┌───────────────┐
                   │  Log (Raft)  │   │ Chunk Storage │
                   │              │   │               │
                   │ 5. Commit    │   │ 6. Write      │
                   │    delta via │   │    encrypted  │
                   │    Raft      │   │    chunk to   │
                   │ 7. Replicate │   │    device     │
                   │    to        │   │ 8. EC encode  │
                   │    majority  │   │    across     │
                   │              │   │    pool       │
                   └──────────────┘   └───────────────┘

Step-by-step

  1. Client encrypt: The native client encrypts data before it leaves the process. Protocol-path clients (NFS/S3) send plaintext over TLS to the gateway, which encrypts on their behalf.

  2. Content-defined chunking: Data is split into variable-size chunks using Rabin fingerprinting. Each chunk gets a content-addressed ID (SHA-256 hash of plaintext, or HMAC when tenant opts out of cross-tenant dedup).

  3. Compose: The composition layer records chunk references and constructs a delta describing the mutation (create, update, delete).

  4. Raft commit: The delta is appended to the owning shard’s Raft log. The leader replicates to a majority of voters before acknowledging.

  5. Chunk write: Encrypted chunks are written to affinity pool devices with erasure coding (or N-copy replication, per pool policy).

  6. Ack: The write is acknowledged to the client only after the delta is committed (I-L2) and all referenced chunks are durable (I-L5).


Read path

┌──────────┐                 ┌───────────────────┐
│  Client  │ ◄────────────── │  Gateway /        │
│          │   plaintext     │  Native Client    │
└──────────┘   (over TLS)    └────────┬──────────┘
                                      ▲ 5. Decrypt
                                      │
                             ┌────────┴──────────┐
                             │   View Lookup     │
                             │                   │
                             │ 1. Resolve path   │
                             │    to composition │
                             │ 2. Get chunk list │
                             └────────┬──────────┘
                                      │
                                      ▼
                             ┌──────────────────┐
                             │  Chunk Storage   │
                             │                  │
                             │ 3. Read chunks   │
                             │    from device   │
                             │ 4. EC decode if  │
                             │    degraded      │
                             └──────────────────┘

Step-by-step

  1. View lookup: The client or gateway queries a materialized view to resolve a path (POSIX) or key (S3) to a composition and its chunk list.

  2. Chunk read: Encrypted chunks are read from the storage devices. If a device is degraded, EC parity reconstructs the missing data.

  3. Decrypt: The client (native path) or gateway (protocol path) unwraps the system DEK using the tenant KEK, then decrypts the chunk data with AES-256-GCM.

  4. Return: Plaintext is returned to the client.


Inline path (ADR-030)

Small files below the configurable inline threshold bypass chunk storage entirely:

Client ──► Composition ──► Log (Raft)
                             │
                             ▼
                    Delta with inline payload
                             │
                             ▼
                    Raft replication to voters
                             │
                             ▼
                    State machine apply:
                    store in small/objects.redb

Threshold computation: The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set:

clamp(min(voter_budgets) / file_count_estimate, INLINE_FLOOR, INLINE_CEILING)

Key invariants:

  • I-L9: Inlined payloads are immutable after write; threshold changes apply prospectively only
  • I-SF5: Inline content is offloaded to small/objects.redb on state machine apply; snapshots include inline content from redb
  • I-SF7: Per-shard Raft inline throughput capped at KISEKI_RAFT_INLINE_MBPS (default 10 MB/s)
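The threshold formula above can be sketched directly in code. The floor and ceiling constants here are placeholder values, not Kiseki's actual configuration:

```rust
// Placeholder bounds; the real INLINE_FLOOR / INLINE_CEILING values are configuration.
const INLINE_FLOOR: u64 = 4 * 1024;     // assumed 4 KiB floor
const INLINE_CEILING: u64 = 128 * 1024; // assumed 128 KiB ceiling

/// clamp(min(voter_budgets) / file_count_estimate, INLINE_FLOOR, INLINE_CEILING)
fn inline_threshold(voter_budgets: &[u64], file_count_estimate: u64) -> u64 {
    let min_budget = voter_budgets.iter().copied().min().unwrap_or(0);
    (min_budget / file_count_estimate.max(1)).clamp(INLINE_FLOOR, INLINE_CEILING)
}

fn main() {
    // The weakest voter (1 GiB budget here) bounds the whole shard.
    let t = inline_threshold(&[8 << 30, 1 << 30, 4 << 30], 100_000);
    assert_eq!(t, (1u64 << 30) / 100_000); // ~10.5 KiB, inside [floor, ceiling]
    // A huge file-count estimate clamps the threshold up to the floor.
    assert_eq!(inline_threshold(&[1 << 30], 1_000_000_000), INLINE_FLOOR);
}
```

Taking the minimum across the voter set ensures no single node in the shard's Raft group is asked to inline more than it can afford.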

Cross-node data paths

Raft replication

Each shard runs an independent Raft group (ADR-026). The leader replicates log entries (deltas) to followers via the Raft RPC transport. Replication uses mTLS on the data fabric.

Leader ──► Follower 1 (AppendEntries)
       ──► Follower 2 (AppendEntries)
       ──► Follower 3 (AppendEntries)

Committed entries are persisted in RedbRaftLogStore on each voter.

Snapshot transfer

When a follower is too far behind or a new voter joins, the leader sends a full snapshot. Snapshots are transferred as length-prefixed JSON over the Raft transport connection.

For shards with inline data, the snapshot includes all entries from small/objects.redb (I-SF5).

Chunk replication and EC

Chunks are placed across distinct physical devices within a pool using deterministic hashing (CRUSH-like). No two EC fragments of the same chunk reside on the same device (I-D4).

Device failure triggers automatic repair from EC parity or replicas (I-D1).
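"CRUSH-like" deterministic placement can be illustrated with rendezvous (highest-random-weight) hashing, a related but distinct technique; this std-only sketch only demonstrates the two properties the text relies on, determinism and distinct devices per fragment (I-D4), and is not Kiseki's placement engine.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn score(chunk_id: u64, fragment: u32, device: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (chunk_id, fragment, device).hash(&mut h);
    h.finish()
}

/// Pick one device per EC fragment, deterministically, never reusing a device.
fn place_fragments(chunk_id: u64, fragments: usize, devices: &[&str]) -> Vec<String> {
    assert!(fragments <= devices.len(), "not enough devices for the EC profile");
    let mut remaining: Vec<&str> = devices.to_vec();
    let mut placement = Vec::with_capacity(fragments);
    for f in 0..fragments {
        // Highest score wins among devices not already holding a fragment of this chunk.
        let mut best = 0;
        for i in 1..remaining.len() {
            if score(chunk_id, f as u32, remaining[i]) > score(chunk_id, f as u32, remaining[best]) {
                best = i;
            }
        }
        placement.push(remaining.remove(best).to_string());
    }
    placement
}

fn main() {
    let devices = ["nvme0", "nvme1", "nvme2", "nvme3", "nvme4", "nvme5"];
    let p = place_fragments(42, 4, &devices);
    // Deterministic: same chunk always maps to the same devices, with no lookup table.
    assert_eq!(p, place_fragments(42, 4, &devices));
    // Distinct devices per fragment (I-D4).
    let mut q = p.clone(); q.sort(); q.dedup();
    assert_eq!(q.len(), 4);
}
```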

Federation

Federated sites replicate data asynchronously. Only ciphertext is replicated – no key material in the replication stream (I-CS3). All federated sites for a tenant connect to the same tenant KMS.

Encryption Model

Kiseki uses a two-layer encryption architecture (ADR-002, model C) that separates data encryption from access control. One encryption pass protects data; key wrapping controls who can read it.


Two-layer architecture

┌─────────────────────────────────────────────────┐
│              Tenant Layer (access)              │
│                                                 │
│  Tenant KEK (controlled by tenant admin)        │
│  wraps the system DEK for tenant-scoped access  │
│                                                 │
│  Destroying the tenant KEK = crypto-shred       │
│  (all tenant data rendered unreadable)          │
├─────────────────────────────────────────────────┤
│              System Layer (data)                │
│                                                 │
│  System DEK encrypts chunk data (AES-256-GCM)   │
│  System KEK wraps system DEKs                   │
│  Always on -- no unencrypted chunks             │
└─────────────────────────────────────────────────┘

System layer: The system DEK encrypts every chunk using AES-256-GCM. System DEKs are derived per-chunk using HKDF-SHA256 from a master key (ADR-003). The system KEK wraps system DEKs and is managed by the cluster admin via the system key manager.

Tenant layer: The tenant KEK wraps the system DEK for tenant-scoped access control. There is no double encryption – one data encryption pass, with key wrapping for access control. The tenant admin controls the tenant KEK via the tenant KMS.


Envelope structure

Each chunk is stored as an envelope containing:

┌──────────────────────────────────────────┐
│  Envelope                                │
│                                          │
│  ┌──────────────────────────────────┐    │
│  │  Ciphertext (AES-256-GCM)        │    │
│  │  (encrypted chunk data)          │    │
│  └──────────────────────────────────┘    │
│                                          │
│  auth_tag (16 bytes, GCM tag)            │
│  nonce (12 bytes, unique per chunk)      │
│  system_key_epoch (current epoch)        │
│  tenant_key_epoch (current epoch)        │
│  chunk_id (content-addressed)            │
│  algorithm_id (for crypto-agility)       │
│                                          │
│  System wrapping metadata                │
│  Tenant wrapping metadata                │
└──────────────────────────────────────────┘

The envelope carries algorithm identifiers for crypto-agility (I-K7). All metadata is authenticated – unauthenticated encryption is never acceptable.


Key derivation

System DEKs are derived locally on each storage node using HKDF-SHA256 (ADR-003). No DEK-per-chunk RPC is required:

system_dek = HKDF-SHA256(
    ikm  = master_key[epoch],
    salt = chunk_id,
    info = "kiseki-chunk-dek-v1"
)

The master key is fetched from the system key manager at startup and on rotation events. DEK derivation is deterministic – the same chunk ID and epoch always produce the same DEK.


Key rotation

Key rotation is epoch-based (I-K6):

  1. The admin triggers rotation (system or tenant level)
  2. A new epoch is created with fresh key material
  3. New data is encrypted with the current epoch’s keys
  4. Old data retains its epoch until background re-encryption migrates it
  5. Two epochs coexist during the rotation window
  6. Full re-encryption available as an explicit admin action for key-compromise incidents

Crypto-shred

Destroying the tenant KEK renders all tenant data unreadable (I-K5):

1. Set retention hold (if compliance requires)
2. Destroy tenant KEK at tenant KMS
3. All wrapped system DEKs for this tenant become unwrappable
4. Chunk ciphertext remains on disk (system-encrypted) until GC
5. Physical GC runs separately when refcount = 0 AND no retention hold

The ordering contract (I-C2b): set hold before crypto-shred to prevent race with GC.

Client-side detection: periodic key health check (default 30s) detects KEK_DESTROYED and triggers immediate cache wipe (I-CC12). Maximum detection latency: min(key_health_interval, max_disconnect_seconds).


Chunk ID derivation

| Mode      | Algorithm                          | Cross-tenant dedup           |
|-----------|------------------------------------|------------------------------|
| Default   | SHA-256(plaintext)                 | Yes                          |
| Opted-out | HMAC-SHA256(plaintext, tenant_key) | No (zero co-occurrence leak) |

When a tenant opts out of cross-tenant dedup (I-X2, I-K10), chunk IDs are derived using HMAC with a tenant-specific key, making it impossible to determine whether two tenants store the same data.
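The co-occurrence property can be demonstrated with a toy example. Note the loud stand-in: std's DefaultHasher replaces SHA-256/HMAC-SHA256 purely to keep the sketch dependency-free; it is not cryptographic and not Kiseki's actual derivation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in digest (NOT cryptographic); Kiseki uses SHA-256 / HMAC-SHA256.
fn toy_digest(parts: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts { p.hash(&mut h); }
    h.finish()
}

/// Chunk ID derivation. `tenant_key` is Some(..) only for opted-out tenants.
fn chunk_id(plaintext: &[u8], tenant_key: Option<&[u8]>) -> u64 {
    match tenant_key {
        None => toy_digest(&[plaintext]),       // global mode: hash of plaintext alone
        Some(k) => toy_digest(&[k, plaintext]), // opted-out: keyed, tenant-specific
    }
}

fn main() {
    let data = b"identical bytes stored by two tenants";
    // Global mode: IDs collide across tenants, enabling cross-tenant dedup.
    assert_eq!(chunk_id(data, None), chunk_id(data, None));
    // Opted-out mode: tenant-specific keys yield unrelated IDs (zero co-occurrence leak).
    assert_ne!(chunk_id(data, Some(b"tenant-a-key".as_slice())),
               chunk_id(data, Some(b"tenant-b-key".as_slice())));
}
```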


Tenant KMS providers (ADR-028)

Five pluggable backends implement the TenantKmsProvider trait:

| Provider        | Key model                        | Key location |
|-----------------|----------------------------------|--------------|
| Kiseki-Internal | Raft-replicated                  | On-cluster   |
| HashiCorp Vault | Transit secrets engine           | External     |
| KMIP 2.1        | Standard key management protocol | External     |
| AWS KMS         | Cloud-managed keys               | External     |
| PKCS#11         | Hardware security modules        | External     |

Provider selection is per-tenant at onboarding. The trait fully encapsulates protocol differences – callers never branch on provider type (I-K16). Wrap/unwrap operations include AAD (chunk_id) binding to prevent envelope splicing (I-K17).


FIPS compliance

Kiseki uses aws-lc-rs with the FIPS feature flag for FIPS 140-2/3 validated cryptographic operations. The kiseki-crypto crate provides:

  • AES-256-GCM authenticated encryption
  • HKDF-SHA256 key derivation
  • SHA-256 hashing
  • HMAC-SHA256 for opted-out chunk ID derivation
  • zeroize integration for all key material in memory

Delta encryption

Log delta payloads (filenames, attributes, inline data) are encrypted with the system DEK, wrapped with the tenant KEK (I-K3). The delta envelope has structurally separated:

  • System-visible header (cleartext or system-encrypted): sequence number, shard ID, hashed_key, operation type, timestamp
  • Tenant-encrypted payload: the actual mutation data

Compaction operates on headers only and never decrypts tenant-encrypted payloads (I-O2).

Raft Consensus

Kiseki uses Raft for ordering and replicating deltas within each shard. The implementation is based on openraft 0.10 with a custom TCP transport and redb-backed persistent storage.


Per-shard Raft groups

Each shard runs an independent Raft group (ADR-026, Strategy A). This provides:

  • Independent scaling: shard count grows with data volume and throughput
  • Isolated failure domains: quorum loss in one shard does not affect others
  • No cross-shard coordination: cross-shard rename returns EXDEV (I-L8)

The system key manager also runs its own Raft group for high availability (ADR-007), as do audit log shards (ADR-009).


openraft integration

The kiseki-raft crate defines KisekiTypeConfig used by all Raft groups:

  • Node identity: u64 node IDs
  • Async runtime: tokio
  • Log store: RedbRaftLogStore (persistent) or MemLogStore (testing)
  • Entry format: customized per context (log deltas, key manager ops, audit events)

Each context (log, key manager, audit) defines its own request (D) and response (R) types while sharing the node identity, entry format, and async runtime configuration.


Persistent log: RedbRaftLogStore

Raft log entries are persisted using redb (ADR-022), a pure-Rust embedded key-value store. The RedbRaftLogStore provides:

  • Durable append and truncation of log entries
  • Vote persistence (current term, voted-for)
  • Snapshot metadata storage
  • Crash-safe operations (redb uses write-ahead logging internally)

For shards with inline data (ADR-030), the state machine offloads inline content to small/objects.redb on apply. The in-memory state machine does not hold inline content after apply (I-SF5).
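A minimal in-memory stand-in for the log-store responsibilities listed above (append, truncation, vote persistence) might look like the sketch below. The real RedbRaftLogStore implements openraft's storage traits against durable redb tables, which this toy deliberately does not attempt; type and method names are illustrative.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Vote { term: u64, voted_for: Option<u64> }

/// Toy log store: redb tables replaced by BTreeMaps, with no durability.
struct MemLogStore {
    entries: BTreeMap<u64, Vec<u8>>, // index -> serialized entry
    vote: Option<Vote>,
}

impl MemLogStore {
    fn new() -> Self { Self { entries: BTreeMap::new(), vote: None } }
    fn append(&mut self, index: u64, entry: Vec<u8>) { self.entries.insert(index, entry); }
    /// Truncate the suffix from `index` on (conflict resolution after a leader change).
    fn truncate_from(&mut self, index: u64) { self.entries.split_off(&index); }
    /// Purge the prefix up to and including `index` (once a snapshot covers it).
    fn purge_to(&mut self, index: u64) { self.entries = self.entries.split_off(&(index + 1)); }
    /// Persist the current term and voted-for candidate.
    fn save_vote(&mut self, vote: Vote) { self.vote = Some(vote); }
}

fn main() {
    let mut s = MemLogStore::new();
    for i in 1..=5u64 { s.append(i, vec![i as u8]); }
    s.truncate_from(4);                 // drop entries 4..=5
    assert_eq!(s.entries.len(), 3);
    s.purge_to(1);                      // snapshot covers entry 1
    assert_eq!(s.entries.keys().copied().collect::<Vec<_>>(), vec![2, 3]);
    s.save_vote(Vote { term: 7, voted_for: Some(2) });
    assert_eq!(s.vote.as_ref().unwrap().term, 7);
}
```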


Snapshot transfer

When a follower falls behind or a new voter joins the group, the leader sends a full snapshot:

  1. Leader serializes the current state machine as length-prefixed JSON
  2. For shards with inline data, the snapshot includes all entries from small/objects.redb
  3. The snapshot is streamed over the Raft transport connection
  4. The follower installs the snapshot and resumes normal replication
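The length-prefixed framing in steps 1 and 3 can be sketched as follows; the prefix width (u32, big-endian) is an assumption, not necessarily what the transport actually uses.

```rust
/// Frame a serialized snapshot with a 4-byte big-endian length prefix.
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Split one frame off the front of a byte stream; None if the frame is incomplete.
fn deframe(stream: &[u8]) -> Option<(&[u8], &[u8])> {
    if stream.len() < 4 { return None; }
    let len = u32::from_be_bytes([stream[0], stream[1], stream[2], stream[3]]) as usize;
    if stream.len() < 4 + len { return None; }
    Some((&stream[4..4 + len], &stream[4 + len..]))
}

fn main() {
    let snapshot = br#"{"shard":7,"last_applied":42}"#; // made-up JSON shape
    let wire = frame(snapshot);
    let (payload, rest) = deframe(&wire).unwrap();
    assert_eq!(payload, snapshot);
    assert!(rest.is_empty());
    assert!(deframe(&wire[..3]).is_none()); // partial read: wait for more bytes
}
```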

Transport and security

Raft RPCs use a custom TCP transport with mTLS:

  • All Raft communication is authenticated via per-node mTLS certificates signed by the Cluster CA (I-Auth1)
  • The transport runs on the data fabric (not the management network)
  • Connection pooling and keepalive are managed by the transport layer

The Raft transport address is configured via KISEKI_RAFT_ADDR.


Dynamic membership changes

Raft membership changes follow the standard joint-consensus protocol:

  1. Add voter: new node starts as a learner, catches up to the committed index, then is promoted to voter
  2. Remove voter: validated that removal does not break quorum (safety check via can_remove_safely)
  3. Shard migration: target node must fully catch up (learner state matches leader’s committed index) before old voter is removed (I-SF3)

Membership changes are validated by validate_membership_change in kiseki-raft, which checks quorum preservation and prevents unsafe removal.
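The quorum-preservation check can be sketched as follows. Function names and the exact safety conditions are illustrative, not the real can_remove_safely implementation:

```rust
/// Sketch of a quorum-preservation check: removing a voter is allowed
/// only if the surviving live voters still form a majority of the
/// shrunken membership.
fn majority(voters: usize) -> usize {
    voters / 2 + 1
}

fn can_remove_safely(total_voters: usize, live_voters: usize, removed_is_live: bool) -> bool {
    if total_voters <= 1 {
        return false; // never remove the last voter
    }
    let remaining_live = if removed_is_live { live_voters - 1 } else { live_voters };
    remaining_live >= majority(total_voters - 1)
}

fn main() {
    // 5-voter group, all live: removing one leaves 4 live of 4 -> safe.
    assert!(can_remove_safely(5, 5, true));
    // 3-voter group with one node down: removing a live voter would
    // leave 1 live of 2, below majority(2) = 2 -> unsafe.
    assert!(!can_remove_safely(3, 2, true));
    // Removing the dead voter itself keeps 2 live of 2 -> safe.
    assert!(can_remove_safely(3, 2, false));
}
```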


Shard lifecycle

Event         Description
Create        New shard created when a namespace is created
Split         Mandatory split when shard exceeds ceiling (I-L6): delta count, byte size, or throughput
Maintenance   Shard set to read-only; writes rejected with retriable error (I-O6)
Compaction    Header-only merge; tenant-encrypted payloads carried opaquely (I-O2)
GC            Delta garbage collection after all consumers advance past the delta (I-L4)

Shard splits do not block writes to the existing shard during the split operation (I-O1).


Consistency guarantees

Scope         Guarantee              Mechanism
Intra-shard   Total order            Raft sequence numbers
Cross-shard   Causal ordering        HLC (Hybrid Logical Clock)
Cross-site    Eventual consistency   Async replication via federation
Writes        CP (no split-brain)    Raft majority commit (I-CS1)
Reads         Bounded staleness      Per view descriptor, subject to compliance floor (I-CS2)
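The HLC used for cross-shard causal ordering can be sketched as a (physical, logical) pair that is merged on message receipt, so a causally-later event always compares greater even under clock skew. This is a generic HLC sketch, not Kiseki's implementation:

```rust
/// Sketch of a Hybrid Logical Clock: timestamps are (physical, logical)
/// pairs, merged on receive so causal order survives clock skew.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical: u64, // wall-clock milliseconds
    logical: u32,  // tie-breaker within one millisecond
}

impl Hlc {
    /// Local or send event: advance past both the wall clock and our last stamp.
    fn tick(&mut self, now: u64) -> Hlc {
        if now > self.physical {
            self.physical = now;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Receive event: merge the remote stamp so we order after it.
    fn receive(&mut self, remote: Hlc, now: u64) -> Hlc {
        let max_phys = now.max(self.physical).max(remote.physical);
        self.logical = match (max_phys == self.physical, max_phys == remote.physical) {
            (true, true) => self.logical.max(remote.logical) + 1,
            (true, false) => self.logical + 1,
            (false, true) => remote.logical + 1,
            (false, false) => 0,
        };
        self.physical = max_phys;
        *self
    }
}

fn main() {
    let mut a = Hlc { physical: 0, logical: 0 };
    let mut b = Hlc { physical: 0, logical: 0 };
    let sent = a.tick(100);        // event on shard A
    let got = b.receive(sent, 90); // shard B's wall clock is behind
    assert!(got > sent);           // causal order preserved despite skew
}
```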

Transport Layer

The kiseki-transport crate provides a pluggable transport abstraction for bidirectional byte-stream connections. It ships with a TCP+TLS reference implementation and feature-flagged support for HPC fabric transports.


Transport trait

The Transport trait is the core abstraction:

pub trait Transport: Send + Sync + 'static {
    type Connection: Connection;
    async fn connect(&self, addr: SocketAddr) -> Result<Self::Connection>;
    async fn listen(&self, addr: SocketAddr) -> Result<Listener>;
}

pub trait Connection: AsyncRead + AsyncWrite + Send + Unpin + 'static {
    fn peer_identity(&self) -> Option<&PeerIdentity>;
}

All components (client, server, Raft) use this trait, enabling transport selection without code changes.


TCP+TLS (reference implementation)

The TcpTlsTransport is always available and serves as the universal fallback:

  • mTLS: Cluster CA validation with per-tenant certificates (I-Auth1, I-K13)
  • SPIFFE: SAN-based SVID validation for workload identity (I-Auth3)
  • CRL: Optional certificate revocation list support via KISEKI_CRL_PATH
  • Connection pooling: Configurable pool size per peer
  • Keepalive: TCP keepalive for connection health
  • Timeouts: Configurable connect, read, and write timeouts

Configuration: TlsConfig with CA cert, node cert, node key, and optional CRL path.


RDMA verbs (feature: verbs)

Native InfiniBand and RoCEv2 support for low-latency HPC fabrics:

  • InfiniBand: Direct RDMA over InfiniBand fabric (VerbsIb)
  • RoCEv2: RDMA over Converged Ethernet (VerbsRoce)
  • Device selection: Auto-detects the first available IB device, or uses the device named in KISEKI_IB_DEVICE
  • Zero-copy: RDMA read/write for chunk data transfer

The verbs module uses unsafe code for FFI calls to libibverbs. Each unsafe block has a per-block SAFETY comment.


CXI/libfabric (feature: cxi)

HPE Slingshot fabric support via libfabric:

  • CXI provider: Lowest-latency transport on Slingshot-equipped systems
  • libfabric: Uses the libfabric API (fi_* calls) for fabric operations
  • Feature-flagged: Only compiled when cxi feature is enabled

The CXI module uses unsafe code for FFI calls to libfabric.


FabricSelector

The FabricSelector provides priority-based transport selection with automatic failover:

Priority 0: CXI        (Slingshot, lowest latency)
Priority 1: VerbsIb    (InfiniBand)
Priority 2: VerbsRoce  (RoCEv2)
Priority 3: TcpTls     (always available, universal fallback)

At boot, the selector probes for available transports (hardware presence check). On connection, it selects the highest-priority available transport. On failure, it falls back to the next-best transport.

The TransportHealthTracker monitors transport health and marks transports as unhealthy after repeated failures, temporarily removing them from selection until they recover.
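The selection rule can be sketched as a priority-ordered scan that skips unhealthy transports. Struct and method names are illustrative, not the real FabricSelector API:

```rust
/// Sketch of priority-based transport selection with health-aware
/// fallback, mirroring the ordering described above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TransportKind { Cxi, VerbsIb, VerbsRoce, TcpTls }

struct FabricSelector {
    /// Probed at boot, in priority order (first = preferred).
    available: Vec<TransportKind>,
    /// Marked by the health tracker after repeated failures.
    unhealthy: Vec<TransportKind>,
}

impl FabricSelector {
    fn select(&self) -> Option<TransportKind> {
        self.available
            .iter()
            .copied()
            .find(|t| !self.unhealthy.contains(t))
    }
}

fn main() {
    // A Slingshot node where CXI has been marked unhealthy:
    // the selector falls back to InfiniBand verbs.
    let selector = FabricSelector {
        available: vec![
            TransportKind::Cxi,
            TransportKind::VerbsIb,
            TransportKind::TcpTls, // always present as universal fallback
        ],
        unhealthy: vec![TransportKind::Cxi],
    };
    assert_eq!(selector.select(), Some(TransportKind::VerbsIb));
}
```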


GPU-direct (planned)

Future support for direct GPU memory access:

  • NVIDIA cuFile (feature: gpu-cuda): GPUDirect Storage for direct NVMe-to-GPU data transfer
  • AMD ROCm (feature: gpu-rocm): ROCm-based GPU direct access

These features bypass CPU memory for chunk data, reducing latency for AI training workloads.


NUMA-aware thread pinning

The NumaTopology module provides NUMA-aware thread pinning for optimal memory locality:

  • Detects the NUMA topology on Linux; thread affinity is set via sched_setaffinity
  • Pins I/O threads to the NUMA node closest to the network device
  • Reduces cross-NUMA memory access latency for high-throughput workloads

Metrics and health

The transport layer exports Prometheus metrics via TransportMetrics:

  • Connection count per transport type
  • Bytes sent/received per transport
  • Connection errors and failover events
  • Latency histograms per transport

Health tracking (TransportHealthTracker) provides per-transport health status for the selector’s failover decisions.


Invariant mapping

Invariant   How the transport layer enforces it
I-K2        All data on the wire is TLS-encrypted (or pre-encrypted chunks over CXI)
I-K13       mTLS with Cluster CA validation on every data-fabric connection
I-Auth1     Client certificate required on data fabric
I-Auth3     SPIFFE SVID validation via SAN matching

Client-Side Cache (ADR-031)

The native client (kiseki-client) includes a two-tier read-only cache of decrypted plaintext chunks. The cache is a performance feature, not a correctness mechanism – it is ephemeral and wiped on process restart or extended disconnect.


Architecture

┌────────────────────────────────────────────┐
│            kiseki-client process           │
│                                            │
│  ┌──────────────────────────────────────┐  │
│  │  L1: In-memory cache                 │  │
│  │  Zeroizing<Vec<u8>> entries          │  │
│  │  Content-addressed by ChunkId        │  │
│  └──────────────┬───────────────────────┘  │
│                 │ miss                     │
│  ┌──────────────▼───────────────────────┐  │
│  │  L2: Local NVMe cache pool           │  │
│  │  CRC32 integrity per entry           │  │
│  │  Per-process, per-tenant isolation   │  │
│  └──────────────┬───────────────────────┘  │
│                 │ miss                     │
│                 ▼                          │
│         Fetch from canonical               │
│         (verify by ChunkId SHA-256)        │
└────────────────────────────────────────────┘

L1 (in-memory): Fast access to recently-used chunks. Entries use Zeroizing<Vec<u8>> so plaintext is overwritten with zeros on eviction or deallocation (I-CC2).

L2 (local NVMe): Larger cache on local storage. Each entry has a CRC32 checksum trailer computed at insert time (I-CC13). On read, the CRC32 is verified before serving; mismatch triggers bypass to canonical and entry deletion (I-CC7).
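The trailer scheme can be sketched as payload-plus-CRC32, verified on every read. The little-endian trailer encoding is an assumption for illustration; the real L2 entry format is defined by kiseki-client:

```rust
/// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), matching the
/// standard CRC-32/ISO-HDLC used by zlib.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if lsb set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// L2 entry sketch: payload followed by a 4-byte CRC32 trailer (I-CC13).
fn write_entry(payload: &[u8]) -> Vec<u8> {
    let mut entry = payload.to_vec();
    entry.extend_from_slice(&crc32(payload).to_le_bytes());
    entry
}

/// Returns the payload only if the trailer matches; a mismatch means
/// the caller bypasses to canonical and deletes the entry (I-CC7).
fn read_entry(entry: &[u8]) -> Option<&[u8]> {
    let (payload, trailer) = entry.split_at(entry.len().checked_sub(4)?);
    (crc32(payload).to_le_bytes().as_slice() == trailer).then_some(payload)
}

fn main() {
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926); // standard check value
    let entry = write_entry(b"chunk plaintext");
    assert_eq!(read_entry(&entry), Some(b"chunk plaintext".as_slice()));
    let mut corrupt = entry.clone();
    corrupt[0] ^= 0xFF;
    assert_eq!(read_entry(&corrupt), None); // bit flip detected
}
```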


Cache modes

Three modes are available per client session (selected at session establishment):

Mode      Behavior                                                    Use case
Pinned    Staging-driven, eviction-resistant; for declared datasets   HPC pre-staging (Slurm prolog)
Organic   LRU with usage-weighted retention                           Mixed workloads (default)
Bypass    No caching                                                  Streaming, checkpoint workloads

Mode is per session, not per file. The admin controls which modes are available for each workload.


Staging API

Staging pre-fetches a dataset’s chunks into the L2 cache with pinned retention:

kiseki-client stage --dataset /path/to/data
  1. Takes a namespace path and recursively enumerates compositions
  2. Fetches and verifies all chunks from canonical (SHA-256 match)
  3. Stores chunks in L2 with pinned retention
  4. Produces a manifest file listing staged compositions and chunk IDs

Staging is idempotent and resumable. Limits: max_staging_depth (10), max_staging_files (100,000).

Pool handoff

The staging daemon and workload process can be different processes (e.g., Slurm prolog stages, then the workload runs):

  1. Staging daemon holds the L2 pool via flock on pool.lock
  2. Workload process adopts the pool via KISEKI_CACHE_POOL_ID env var
  3. Workload takes over the flock

Each cache pool is identified by a 128-bit CSPRNG pool_id, isolated per process and per tenant.


Freshness and staleness

Metadata TTL (default 5s): File-to-chunk-list mappings are cached with a configurable TTL. Within the TTL, cached metadata is authoritative and may serve data for files that have since been modified (I-CC3, I-CC5).

Chunk data: No TTL needed. Chunks are immutable (I-C1), so a verified chunk remains correct indefinitely absent crypto-shred.


Crypto-shred detection

On crypto-shred (tenant KEK destruction), all cached plaintext must be wiped (I-CC12):

Detection mechanisms (in priority order):

  1. Advisory channel notification (if active)
  2. KMS error on next operation
  3. Periodic key health check (default every 30s)

Response: Immediate wipe of L1 and L2 with zeroize.

Maximum detection latency: min(key_health_interval, max_disconnect_seconds).


Disconnect handling

If the client loses connectivity to all canonical endpoints for longer than max_disconnect_seconds (default 300s), the entire cache (L1 + L2) is wiped (I-CC6).

A background heartbeat RPC (every 60s) maintains the last_successful_rpc timestamp for disconnect detection.
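The wipe decision reduces to comparing the age of the last successful RPC against the threshold. A minimal sketch, with illustrative names:

```rust
use std::time::Duration;

/// Sketch of the disconnect-wipe decision (I-CC6): the cache is wiped
/// once the last successful RPC is older than max_disconnect_seconds.
struct DisconnectPolicy {
    max_disconnect: Duration,
}

impl DisconnectPolicy {
    fn should_wipe(&self, since_last_rpc: Duration) -> bool {
        since_last_rpc > self.max_disconnect
    }
}

fn main() {
    let policy = DisconnectPolicy { max_disconnect: Duration::from_secs(300) };
    // Heartbeats every 60s keep the age low while connected.
    assert!(!policy.should_wipe(Duration::from_secs(61)));
    // Five minutes of total disconnect: wipe L1 + L2.
    assert!(policy.should_wipe(Duration::from_secs(301)));
}
```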


Error handling

Any local cache error bypasses to canonical unconditionally (I-CC7):

  • L2 I/O failure: bypass and flag pool for scrub
  • CRC32 mismatch: bypass, delete corrupt entry
  • Metadata lookup failure: bypass to canonical

Invariants

ID       Rule
I-CC1    A cached chunk is served only if content-address verified and no crypto-shred detected
I-CC2    Cached plaintext is zeroized before deallocation, eviction, or cache wipe
I-CC3    File-to-chunk metadata served from cache only within TTL (default 5s)
I-CC5    Metadata TTL is the upper bound on read staleness
I-CC6    Disconnect beyond threshold triggers full cache wipe
I-CC7    Any cache error bypasses to canonical unconditionally
I-CC8    Cache is ephemeral; wiped on process start (or adopted via pool handoff)
I-CC9    Unreachable cache policy falls back to conservative defaults
I-CC10   Cache policy changes apply to new sessions only
I-CC11   Staged chunks are a point-in-time snapshot; re-stage to pick up updates
I-CC12   Crypto-shred triggers immediate cache wipe with zeroize
I-CC13   L2 entries protected by CRC32 checksum trailer

Environment variables

Variable               Default             Description
KISEKI_CACHE_MODE      organic             Cache mode: organic, pinned, or bypass
KISEKI_CACHE_DIR       /tmp/kiseki-cache   L2 pool directory on local NVMe
KISEKI_CACHE_L2_MAX    50 GB               Maximum L2 cache size in bytes
KISEKI_CACHE_POOL_ID   (generated)         Adopt an existing pool (for staging handoff)

Security Model

Kiseki is designed with security as a foundational constraint, not a bolted-on feature. The system enforces strong tenant isolation, mandatory encryption, and a zero-trust boundary between infrastructure operators and tenants.


Zero-trust boundary

Kiseki enforces a strict separation between two administrative domains:

Cluster admin (infrastructure operator)

  • Manages nodes, global policy, system keys, pools, devices.
  • Cannot access tenant config, logs, or data without explicit tenant admin approval (I-T4).
  • Sees operational metrics in tenant-anonymous or aggregated form.
  • Modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c).

Tenant admin (data owner)

  • Controls tenant keys, projects, workload authorization, compliance tags, user access.
  • Grants or denies cluster admin access requests.
  • Receives tenant-scoped audit exports sufficient for independent compliance demonstration.
  • Can crypto-shred to render all tenant data unreadable.

Access request flow

When a cluster admin needs access to tenant resources (for debugging, migration, etc.):

  1. Cluster admin submits an access request via the control plane.
  2. The request is recorded in the audit log.
  3. Tenant admin reviews and approves or denies.
  4. If approved, access is time-bounded and scoped.
  5. All access is audit-logged to the tenant’s shard.

Encryption at rest

Every chunk stored on disk is encrypted. There are no exceptions.

  • Algorithm: AES-256-GCM (authenticated encryption with associated data).
  • Key derivation: HKDF-SHA256 derives per-chunk DEKs from a system master key and the chunk ID (ADR-003).
  • Envelope: Each chunk carries an envelope containing ciphertext, system-layer wrapping metadata, tenant-layer wrapping metadata, and authenticated metadata (chunk ID, algorithm identifiers, key epoch).
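The envelope layout can be sketched as a struct whose authenticated fields feed the GCM AAD. Field names are illustrative; the real format is defined by kiseki-crypto:

```rust
/// Sketch of the two-layer envelope layout. Tampering with any
/// AAD-bound field causes GCM decryption to fail.
struct ChunkEnvelope {
    /// AES-256-GCM ciphertext of the chunk payload.
    ciphertext: Vec<u8>,
    /// 96-bit GCM nonce.
    nonce: [u8; 12],
    /// System layer: key epoch of the HKDF-derived DEK.
    system_key_epoch: u64,
    /// Tenant layer: DEK wrapped under the tenant KEK for access control.
    wrapped_dek: Vec<u8>,
    /// Content-addressed chunk ID (SHA-256 of plaintext).
    chunk_id: [u8; 32],
    /// Algorithm identifier, carried for crypto-agility.
    algorithm: String,
}

impl ChunkEnvelope {
    /// Authenticated metadata: chunk ID, algorithm identifier, key epoch.
    fn aad(&self) -> Vec<u8> {
        let mut aad = self.chunk_id.to_vec();
        aad.extend_from_slice(self.algorithm.as_bytes());
        aad.extend_from_slice(&self.system_key_epoch.to_be_bytes());
        aad
    }
}

fn main() {
    let env = ChunkEnvelope {
        ciphertext: vec![0xAB; 16],
        nonce: [0; 12],
        system_key_epoch: 3,
        wrapped_dek: vec![0xCD; 48],
        chunk_id: [0x11; 32],
        algorithm: "AES-256-GCM".into(),
    };
    assert_eq!(env.aad().len(), 32 + "AES-256-GCM".len() + 8);
}
```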

What is encrypted

Data                                                        Encryption                            Location
Chunk data on disk                                          System DEK (AES-256-GCM)              Data devices
Inline small-file content                                   System DEK                            small/objects.redb
Delta payloads (filenames, attributes)                      System DEK, wrapped with tenant KEK   Raft log / redb
Delta headers (sequence, shard, operation type, timestamp)  Cleartext or system-encrypted         Raft log / redb
Backup data                                                 System-encrypted                      External backup target
Federation replication                                      Ciphertext-only                       Replication stream

What is NOT encrypted

  • Delta headers: Compaction operates on headers only (I-O2). Headers contain no tenant-attributable content.
  • Prometheus metrics: Aggregated counters and histograms. No tenant-attributable data in metric labels.
  • Health/liveness probes: 200 OK response.

Encryption in transit

All data-fabric communication uses mTLS. No plaintext data crosses the network.

  • Data path: mTLS with per-tenant certificates signed by the Cluster CA (I-K2).
  • Raft consensus: mTLS between Raft peers.
  • Key manager: mTLS between storage nodes and the key manager.
  • Client to gateway: TLS (clients send plaintext over TLS; the gateway encrypts before writing).
  • Native client: Client-side encryption (plaintext never leaves the workload process).

Protocol gateway encryption

Protocol gateway clients (NFS, S3) send plaintext over TLS to the gateway. The gateway performs tenant-layer encryption before writing to the storage layer. This means plaintext exists in gateway process memory but never on the wire in cleartext and never at rest.

Native client encryption

Native clients (FUSE, FFI, Python) perform tenant-layer encryption themselves. Plaintext never leaves the workload process and never traverses the data fabric.


FIPS 140-2/3 compliance

Kiseki uses aws-lc-rs as its cryptographic backend, which provides a FIPS 140-2/3 validated implementation of:

  • AES-256-GCM (authenticated encryption)
  • HKDF-SHA256 (key derivation)
  • SHA-256 (content-addressed chunk IDs)
  • HMAC-SHA256 (per-tenant chunk IDs for opted-out tenants)

The FIPS feature is controlled by the kiseki-crypto/fips feature flag at compile time.

Crypto-agility

Envelope metadata carries algorithm identifiers for crypto-agility. If a new algorithm is needed (e.g., post-quantum), envelopes can carry the new algorithm identifier alongside the existing one during a transition period.


No plaintext past gateway boundary (I-K1, I-K2)

This is the fundamental security invariant. Kiseki guarantees:

  1. No plaintext chunk is ever persisted to storage (I-K1).
  2. No plaintext payload is ever sent on the wire between any components (I-K2).
  3. The system enforces access control over ciphertext but cannot read plaintext without tenant key material (I-K4).

Where plaintext exists

Plaintext exists only in:

  • Client process memory: For native clients that perform client-side encryption.
  • Gateway process memory: Transiently, while the gateway encrypts protocol-path data.
  • Stream processor memory: Stream processors cache tenant key material and are in the tenant trust domain (I-O3).
  • Client cache (L1): In-memory cache of decrypted chunks (zeroized on eviction or deallocation, I-CC2).
  • Client cache (L2): On-disk cache of decrypted chunks on local NVMe (zeroized before unlink, I-CC2).

Content-addressed chunk IDs

Chunk identity is derived from content, serving both dedup and integrity:

  • Default: chunk_id = SHA-256(plaintext). Enables cross-tenant dedup.
  • Opted-out tenants: chunk_id = HMAC-SHA256(plaintext, tenant_key). Cross-tenant dedup is impossible. Zero co-occurrence leak (I-K10).

Tenants that opt out of cross-tenant dedup pay a storage overhead (identical data stored separately per tenant) but gain the guarantee that no metadata (chunk IDs, refcounts) leaks information about data similarity across tenants.


Audit trail

All security-relevant events are recorded in an append-only, immutable audit log with the same durability guarantees as the data log (I-A1).

Audit events include:

  • Data access (read/write by tenant, workload, client)
  • Key lifecycle (rotation, crypto-shred, KMS health)
  • Admin actions (pool changes, device management, tuning parameters)
  • Policy changes (quotas, compliance tags, advisory policy)
  • Authentication events (mTLS success/failure, cert revocation)

Audit scoping

  • Tenant audit export: Filtered to the tenant’s own events plus relevant system events. Delivered on the tenant’s VLAN (I-A2). Sufficient for independent compliance demonstration (HIPAA Section 164.312 audit controls).
  • Cluster admin audit view: System-level events only. Tenant-anonymous or aggregated (I-A3).

Runtime integrity

An optional runtime integrity monitor detects attempts to access Kiseki process memory (I-O7):

  • ptrace detection
  • /proc/pid/mem access monitoring
  • Debugger attachment detection
  • Core dump attempt detection

On detection, the monitor alerts both cluster admin and tenant admin. Optional auto-rotation of keys can be configured as a response.


STRIDE Threat Analysis

Systematic analysis of Kiseki’s attack surfaces using the STRIDE framework.

Spoofing (identity)

Threat | Attack surface | Mitigation | Invariant
Rogue node joins cluster | Raft peer handshake | mTLS with Cluster CA — only certs signed by the cluster CA are accepted. Raft RPC server rejects plaintext when TLS is configured. | I-Auth1, I-K13
Client impersonates tenant | Data fabric connection | mTLS required. OrgId extracted from cert OU or SPIFFE SAN. Fallback: UUID v5 from cert fingerprint (no anonymous access). | I-Auth1, I-Auth3
Forged S3 request | S3 gateway | SigV4 signature validation with HMAC-SHA256 (constant-time comparison). x-amz-date required, host must be signed. | SigV4 auth
Forged JWT token | OIDC second-stage | alg=none rejected unconditionally. HS256 verified via HMAC. RS256/ES256 verified via JWKS with key ID matching. | I-Auth2
NFS UID spoofing | NFS gateway | AUTH_SYS trusts client-asserted UID (known limitation). Mitigated by: network segmentation, Kerberos for production, per-export allowed method list. | NFS auth
Replay of captured request | S3 gateway | Timestamp validation (TODO: ±15min window). Captured Raft RPCs are harmless (Raft rejects stale term/log index). | SigV4

Residual risk: NFS AUTH_SYS is inherently spoofable. Production deployments MUST use Kerberos or restrict NFS to trusted networks.

Tampering (data integrity)

Threat | Attack surface | Mitigation | Invariant
Modify chunk on disk | Block device | CRC32C on every extent read. Mismatch → EC repair from parity. Periodic scrub with configurable sample rate. | I-C7, I-C8
Modify chunk in transit | Fabric | TLS 1.3 (authenticated encryption). RDMA paths use pre-encrypted chunks. | I-K2, I-Auth1
Modify Raft log entry | Raft replication | Raft consensus — committed entries are immutable (I-L3). Log entries validated by majority before commit. WAL journal for crash-safe bitmap. | I-L2, I-L3
Tamper with envelope | Crypto layer | AES-256-GCM authenticated encryption. Tampered ciphertext, auth tag, or nonce → decryption failure. AAD binding to chunk_id prevents envelope splicing (I-K17). | I-K7, I-K17
Modify L2 cache file | Client NVMe | CRC32 trailer on every L2 read. Mismatch → bypass to canonical + delete corrupt entry. | I-CC7, I-CC13
Corrupt staging manifest | Client cache | Invalid JSON silently skipped during manifest load. No data served from unverifiable source. | I-CC7

Repudiation (deniability)

Threat | Attack surface | Mitigation | Invariant
Admin denies action | Control plane | All admin operations (maintenance, quota, compliance, key rotation) recorded in cluster audit shard with timestamp, identity, and parameters. | I-A1, I-A6
Tenant denies access | Data path | All data access operations auditable. Tenant audit export provides filtered, coherent trail for compliance (HIPAA §164.312). | I-A2
Advisory abuse denied | Workflow advisory | Advisory lifecycle events (declare, end, phase-advance, budget-exceeded) logged per-occurrence. High-volume events sampled with per-second-per-workflow counts. | I-WA8
Device state change denied | Storage | Device state transitions (Healthy→Degraded→Evacuating→Failed→Removed) recorded with timestamp, reason, admin identity. | I-D2
Crypto-shred denied | Key management | Shred event logged in tenant audit shard. Key health check provides detection confirmation. Cache wipe events counted. | I-K5, I-CC12

Information disclosure (confidentiality)

Threat | Attack surface | Mitigation | Invariant
Plaintext leak on wire | All RPCs | TLS mandatory on all data fabric connections. No plaintext payloads transmitted. | I-K1, I-K2
Plaintext on disk (server) | Chunk storage | All chunks encrypted at rest with system DEK (AES-256-GCM). No plaintext persisted on storage nodes. Compaction operates on headers only — never decrypts payloads. | I-K1, I-O2
Plaintext on disk (client) | L2 cache | Cached plaintext on compute-node NVMe (same trust domain as process memory). File permissions 0600. Zeroize on eviction/wipe. Crash scrubber for orphaned pools. FTL residual risk documented. | I-CC2, I-CC8
Cross-tenant data leak | Multi-tenant | Full tenant isolation (I-T1). Per-tenant encryption keys. Cluster admin cannot access tenant data without approval (I-T4). HMAC-keyed chunk IDs for dedup-opted-out tenants prevent co-occurrence analysis. | I-T1, I-T3, I-K10
Telemetry leaks tenant info | Advisory | Telemetry scoped to caller’s authorization. k-anonymity (k≥5) over neighbour workloads. Response shape unchanged under low-k conditions. Timing and size bucketed to prevent covert channels. | I-WA5, I-WA6, I-WA15
Error messages leak state | All APIs | AuthError returns generic failures. KmsError uses enum variants not freeform strings. Advisory requests for unauthorized targets return same shape as absent targets. | I-WA6
Core dump exposes keys | Server/client | Key material wrapped in Zeroizing<Vec<u8>>. Runtime integrity monitor detects debugger/ptrace. | I-K8, I-O7
Log messages leak data | Structured logging | Structured tracing with typed fields. No plaintext in log events. Tenant-scoped identifiers hashed in cluster-admin views. | I-A3, I-K8

Denial of service (availability)

Threat | Attack surface | Mitigation | Invariant
Raft leader flooding | Raft consensus | MAX_RAFT_RPC_SIZE (128MB) rejects oversized messages. Per-shard throughput guard (I-SF7) limits inline write rate. | ADV-S1, I-SF7
Advisory hint flooding | Workflow advisory | Per-workload hint budget (hints/sec, concurrent workflows). Budget exceeded → local degradation only. Advisory isolated from data path (I-WA2). | I-WA7, I-WA16, I-WA17
Connection pool exhaustion | Transport | max_per_endpoint connection cap. Circuit breaker trips after threshold failures. FabricSelector falls back to TCP. | Transport health
Disk exhaustion (metadata) | System NVMe | ADR-030 dynamic inline threshold. Soft limit → threshold reduction. Hard limit → threshold floor + alert via out-of-band gRPC. | I-SF1, I-SF2
Disk exhaustion (data) | Device pools | Per-pool capacity thresholds (Warning/Critical/Full). Writes rejected at Critical. Pool rebalancing. | I-C5
Cache exhaustion (client) | Client NVMe | Per-process max_cache_bytes. Per-node max_node_cache_bytes (80% of filesystem). Disk-pressure backstop at 90%. | ADR-031 §8
Audit log backpressure | Audit | Safety valve: if audit export stalls >24h, data GC proceeds with documented gap. Per-tenant configurable backpressure mode. | I-A5
Shard split storm | Log | Exponential backoff per shard (2h floor, 24h cap). Cluster-wide concurrent migrations bounded by max(1, num_nodes/10). | I-SF4

Elevation of privilege (authorization)

Threat | Attack surface | Mitigation | Invariant
Cluster admin accesses tenant data | Control plane | Zero-trust boundary. Access requires explicit tenant admin approval, time-bounded, scope-limited, audit-logged. | I-T4, I-T4c
Tenant escapes namespace | Data path | Namespace isolation per tenant. Cross-shard operations return EXDEV (I-L8). Compositions belong to exactly one tenant (I-X1). | I-T1, I-X1, I-L8
Hint escalates priority | Advisory | Hints cannot extend capability. Cannot cause operation success that would otherwise be rejected. Cannot cross namespace/tenant boundary. Cannot bypass retention hold. | I-WA14
Client escalates cache policy | Client cache | Client selections bounded by admin-set ceilings. Policy narrowing only (child ≤ parent). cache_enabled=false at any level → disabled for all children. | I-CC10, I-WA7
KMS provider escalation | Key management | Provider abstraction opaque to callers (I-K16). No access-control decision depends on provider type. Provider migration requires 100% re-wrap before atomic switch. | I-K16, I-K20
gRPC method escalation | Control plane | Per-method authorization. 9 admin-only methods gated by require_admin(). Unknown role → rejected. | gRPC authz

Summary

STRIDE Category          Threats identified   Mitigated   Residual risk
Spoofing                 6                    5           NFS AUTH_SYS UID spoofing (use Kerberos in prod)
Tampering                6                    6           None — all paths have integrity verification
Repudiation              5                    5           None — comprehensive audit trail
Information disclosure   8                    8           Client L2 NVMe FTL residual (use OPAL/SED)
Denial of service        8                    8           None — all paths have rate limiting/backpressure
Elevation of privilege   6                    6           None — defense in depth at every boundary
Total                    39                   37          2 documented residual risks

Both residual risks have documented mitigations:

  1. NFS AUTH_SYS → deploy Kerberos or restrict to trusted networks
  2. NVMe FTL data remanence → deploy OPAL/SED with per-boot key rotation

Authentication

Kiseki uses a layered authentication model. The primary mechanism is mTLS with certificates signed by a Cluster CA. Optional second-stage authentication via tenant identity providers adds workload-level authorization.


mTLS with Cluster CA (I-Auth1)

The Cluster CA is the trust root for all data-fabric authentication. Every participant in the data fabric (storage nodes, gateways, clients, stream processors) presents a certificate signed by the Cluster CA.

Certificate hierarchy

Cluster CA (managed by cluster admin)
  |
  +-- Server certificates (per storage node)
  |     SAN: node hostname, IP address
  |     OU: kiseki-server
  |
  +-- Key manager certificates (per key server)
  |     SAN: keyserver hostname, IP address
  |     OU: kiseki-keyserver
  |
  +-- Admin certificates (cluster admin)
  |     OU: kiseki-admin
  |
  +-- Tenant certificates (per tenant)
        SAN: tenant identifier
        OU: tenant-{org_id}

Properties

  • No real-time auth server on data path (I-Auth1). Certificates are local credentials. Authentication is a TLS handshake, not an RPC to a central authority. This eliminates a latency-sensitive dependency on the data path.
  • Per-tenant certificates: Each tenant’s clients and gateways present certificates that identify the tenant. The storage layer validates the certificate chain and extracts the tenant identity.
  • Certificate revocation: Supported via CRL (KISEKI_CRL_PATH). The CRL is reloaded periodically. Revoked certificates are rejected at the TLS handshake.

Configuration

# On storage nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem  # optional

# On client nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/client.key

Certificate generation example

# Generate Cluster CA (do this once)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
  -keyout ca.key -out ca.crt -days 3650 -nodes \
  -subj "/CN=Kiseki Cluster CA"

# Generate server certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
  -keyout server.key -out server.csr -nodes \
  -subj "/CN=node1.example.com/OU=kiseki-server"

openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out server.crt -days 365 \
  -extfile <(echo "subjectAltName=DNS:node1.example.com,IP:10.0.0.1")

# Generate tenant client certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
  -keyout tenant.key -out tenant.csr -nodes \
  -subj "/CN=workload-1/OU=tenant-acme-corp"

openssl x509 -req -in tenant.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out tenant.crt -days 365

SPIFFE SVID (I-Auth3)

SPIFFE (Secure Production Identity Framework for Everyone) is available as an alternative to raw mTLS certificate management.

SPIFFE ID structure

spiffe://kiseki.example.com/tenant/{org_id}/workload/{workload_id}
spiffe://kiseki.example.com/tenant/{org_id}/project/{project_id}/workload/{workload_id}

The SPIFFE ID maps directly to the tenant hierarchy (organization/project/workload).
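The mapping can be sketched as a parse of the SPIFFE ID path onto the tenant hierarchy, accepting both the 2-level and 3-level forms shown above. Type and function names are illustrative:

```rust
/// Sketch of mapping a SPIFFE ID onto the tenant hierarchy.
#[derive(Debug, PartialEq)]
struct WorkloadIdentity {
    org: String,
    project: Option<String>,
    workload: String,
}

fn parse_spiffe_id(id: &str, trust_domain: &str) -> Option<WorkloadIdentity> {
    let path = id.strip_prefix(&format!("spiffe://{trust_domain}/"))?;
    let parts: Vec<&str> = path.split('/').collect();
    match parts.as_slice() {
        ["tenant", org, "workload", wl] => Some(WorkloadIdentity {
            org: org.to_string(),
            project: None,
            workload: wl.to_string(),
        }),
        ["tenant", org, "project", proj, "workload", wl] => Some(WorkloadIdentity {
            org: org.to_string(),
            project: Some(proj.to_string()),
            workload: wl.to_string(),
        }),
        _ => None, // wrong trust domain or malformed path
    }
}

fn main() {
    let id = "spiffe://kiseki.example.com/tenant/acme/project/ml/workload/train-1";
    let identity = parse_spiffe_id(id, "kiseki.example.com").unwrap();
    assert_eq!(identity.org, "acme");
    assert_eq!(identity.project.as_deref(), Some("ml"));
    assert_eq!(identity.workload, "train-1");
    // IDs from a foreign trust domain are rejected.
    assert!(parse_spiffe_id("spiffe://other.domain/tenant/x/workload/y", "kiseki.example.com").is_none());
}
```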

SPIRE integration

SPIRE (the SPIFFE Runtime Environment) handles certificate issuance and rotation automatically:

  1. SPIRE Server acts as the Cluster CA (or delegates to it).
  2. SPIRE Agent runs on each node (storage and compute).
  3. Workloads receive SVIDs via the Workload API.
  4. Certificates rotate automatically (no manual renewal).

Benefits over raw mTLS

  • Automatic certificate rotation (no manual renewal ceremonies).
  • Workload attestation (verify the workload binary, not just the certificate).
  • Short-lived certificates reduce the window of compromise.

S3 SigV4 authentication

The S3 gateway supports AWS Signature Version 4 authentication for S3 API clients.

How it works

  1. The S3 client signs each request with an access key and secret key.
  2. The gateway validates the signature.
  3. The access key is mapped to a tenant identity via the control plane.
  4. Subsequent authorization is based on the tenant identity.
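The constant-time comparison mentioned in the threat table (step 2 of validation) can be sketched as an XOR-accumulate over all bytes, so timing does not reveal the position of the first mismatch:

```rust
/// Constant-time byte-string equality: always scans every byte.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}

fn main() {
    let expected = b"f0e1d2c3-computed-sigv4-signature";
    assert!(constant_time_eq(expected, expected));
    assert!(!constant_time_eq(expected, b"f0e1d2c3-forged---sigv4-signature"));
}
```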

Configuration

Access keys are provisioned via the control plane:

kiseki-server s3-credentials create --tenant-id acme-corp --workload-id training-job-1

Compatibility

The SigV4 implementation supports standard S3 clients:

# AWS CLI
aws --endpoint-url http://node1:9000 s3 ls

# boto3
import boto3
s3 = boto3.client('s3', endpoint_url='http://node1:9000',
                  aws_access_key_id='...', aws_secret_access_key='...')

NFS authentication

The NFS gateway supports two authentication mechanisms: Kerberos and AUTH_SYS.

Kerberos

NFSv4.2 with Kerberos provides strong authentication:

  • krb5 — Authentication only.
  • krb5i — Authentication + integrity.
  • krb5p — Authentication + integrity + privacy (encrypted).

The Kerberos principal maps to a tenant identity.

AUTH_SYS (development only)

AUTH_SYS (traditional UNIX UID/GID authentication) is supported for development and testing. It provides no real security and should not be used in production. When AUTH_SYS is used, the NFS gateway maps the export path to a tenant identity.


OIDC/JWT second-stage authentication (I-Auth2)

Optional second-stage authentication validates workload identity against the tenant admin’s authorization. This provides an additional layer beyond the mTLS “belongs to this cluster” identity.

Architecture

Workload
  |
  v
mTLS (Cluster CA)  -->  "This workload belongs to tenant X"
  |
  v
OIDC/JWT (Tenant IdP)  -->  "This workload is authorized by tenant X's admin"

Integration

  1. Tenant admin configures their identity provider (Keycloak, Okta, Azure AD, etc.) in the control plane.
  2. Workloads obtain JWT tokens from the tenant IdP.
  3. On connection, the workload presents both:
    • mTLS certificate (Cluster CA trust chain)
    • JWT token (tenant IdP authorization)
  4. The storage node validates both independently.

Token validation

  • JWT signature verification against the tenant IdP’s JWKS endpoint.
  • Token expiry and audience validation.
  • Claims mapping to tenant hierarchy (org, project, workload).
  • No real-time IdP dependency on the data path: JWKS keys are cached and refreshed periodically.
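The expiry, audience, and claims-mapping checks can be sketched as follows. Signature verification against the cached JWKS is elided, and the `org`/`project`/`workload` claim names are illustrative: the actual claim-to-hierarchy mapping is configured per tenant in the control plane.

```python
import time

REQUIRED_AUDIENCE = "kiseki"  # illustrative audience value


def validate_claims(claims, now=None):
    """Check expiry and audience, then map claims onto the tenant hierarchy.

    Assumes the JWT signature has already been verified against the
    cached JWKS keys; that step is elided here.
    """
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    aud = claims.get("aud")
    auds = aud if isinstance(aud, list) else [aud]
    if REQUIRED_AUDIENCE not in auds:
        raise ValueError("audience mismatch")
    # Map IdP claims onto the (org, project, workload) hierarchy.
    return {
        "org": claims["org"],
        "project": claims.get("project"),  # optional level
        "workload": claims["workload"],
    }
```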

gRPC role-based authorization

After authentication (mTLS + optional OIDC), gRPC services enforce role-based authorization:

Roles

| Role | Authentication | Access |
|---|---|---|
| Cluster admin | Admin certificate (OU: kiseki-admin) | StorageAdminService, ControlService (full) |
| SRE (read-only) | SRE certificate | StorageAdminService (read-only: List*, Get*, Status) |
| Tenant admin | Tenant certificate + OIDC (optional) | ControlService (tenant-scoped), AuditExportService |
| Workload | Tenant certificate + OIDC (optional) | Data-path services, WorkflowAdvisoryService |

Authorization enforcement

  • StorageAdminService: Cluster admin only (mTLS cert with admin OU). SRE read-only role for monitoring.
  • ControlService: Cluster admin for system operations, tenant admin for tenant-scoped operations.
  • Data-path services (LogService, ChunkOps, CompositionOps, ViewOps): Any authenticated tenant workload, scoped to the tenant’s own data.
  • WorkflowAdvisoryService: Any authenticated tenant workload. Per-operation authorization (I-WA3): every request re-validates the caller’s mTLS identity against the workflow’s owning workload.

Cluster admin isolation (I-T4)

The cluster admin certificate grants access to infrastructure management but explicitly does NOT grant access to:

  • Tenant configuration
  • Tenant audit logs
  • Tenant data (read or write)
  • Tenant key material

Access to tenant resources requires an explicit access request approved by the tenant admin.


Client identity

Client ID (native client)

Each native client process generates a stable identifier at startup:

  • 128-bit CSPRNG value.
  • Bound to the workload’s mTLS certificate at first use.
  • Scoped within (org, project, workload).
  • Never reused across processes (I-WA4).

The client ID ties an operation stream to a single process instance. It is not a user identity and not a session token.
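A minimal sketch of the ID generation, assuming Python's `secrets` module as the CSPRNG (the native client is not Python; this only illustrates the shape of the identifier):

```python
import secrets


def generate_client_id() -> str:
    """128-bit CSPRNG client ID, generated once per process at startup.

    Bound to the workload's mTLS certificate at first use and never
    reused across processes (I-WA4).
    """
    return secrets.token_bytes(16).hex()
```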

Workflow reference

For advisory-enabled workloads, a workflow reference is attached to data-path RPCs as a gRPC binary metadata entry (x-kiseki-workflow-ref-bin). This is a 16-byte opaque handle, generated with 128+ bits of entropy, never reused, and verified against the caller’s mTLS identity on every request (I-WA3, I-WA10).

Tenant Isolation

Tenant isolation is a foundational invariant of Kiseki. Tenants are fully isolated with no cross-tenant data access, no delegation tokens, and no cross-tenant key sharing (I-T1).


Isolation model

Kiseki implements hierarchical tenancy with strict isolation boundaries:

Organization (billing, admin, master key authority)
  |
  +-- Project (optional: resource grouping, key delegation)
  |     |
  |     +-- Workload (runtime isolation unit)
  |     +-- Workload
  |
  +-- Workload (directly under org, if no projects)

Isolation guarantees

| Property | Guarantee | Invariant |
|---|---|---|
| Data access | No cross-tenant data access | I-T1 |
| Key material | Per-tenant encryption keys, never shared | I-T3, I-K3 |
| Resource consumption | Bounded by quotas at org and workload levels | I-T2 |
| Audit visibility | Tenant sees only their own events | I-A2 |
| Metrics | Tenant-anonymous for cluster admin | ADR-015 |
| Admin access | Zero-trust: cluster admin cannot access tenant data without approval | I-T4 |

Per-tenant encryption keys

Each tenant has their own KEK (Key Encryption Key) managed by their chosen KMS backend (ADR-028). The tenant KEK wraps access to system DEK derivation parameters for that tenant’s data.

Key isolation

  • System DEKs are derived per-chunk and are the same for identical chunks across tenants (enabling cross-tenant dedup by default).
  • Tenant KEKs are unique per tenant. Even if two tenants store the same data, each tenant wraps access to the DEK derivation parameters independently.
  • Tenant keys are not accessible to other tenants or to shared system processes (I-T3).

Key storage isolation

When using the internal KMS provider (default), tenant KEKs are stored in a separate Raft group from system master keys (I-K19). Compromise of one group does not expose the other.

When using external KMS providers (Vault, KMIP, AWS KMS, PKCS#11), tenant key material is managed entirely outside of Kiseki’s storage, under the tenant’s own operational control.


HMAC-keyed chunk IDs for opted-out tenants

By default, chunk IDs are derived from plaintext content: chunk_id = SHA-256(plaintext). This enables cross-tenant deduplication: identical data stored by different tenants produces the same chunk ID and shares storage.

Tenants that require stronger isolation can opt out of cross-tenant dedup (I-X2, I-K10):

Default:     chunk_id = SHA-256(plaintext)
Opted-out:   chunk_id = HMAC-SHA256(plaintext, tenant_key)
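Both derivations can be sketched directly with the standard library; identical plaintext yields a shared ID by default, but tenant-distinct IDs once keyed:

```python
import hashlib
import hmac


def chunk_id_default(plaintext: bytes) -> str:
    """Default: content-addressed ID, enabling cross-tenant dedup."""
    return hashlib.sha256(plaintext).hexdigest()


def chunk_id_opted_out(plaintext: bytes, tenant_key: bytes) -> str:
    """Opted-out: keyed ID differs per tenant even for identical data."""
    return hmac.new(tenant_key, plaintext, hashlib.sha256).hexdigest()
```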

What opt-out provides

  • No cross-tenant dedup: Identical data from different tenants produces different chunk IDs. Each tenant’s data is stored independently.
  • Zero co-occurrence leak: An observer cannot determine whether two tenants store the same data by comparing chunk IDs.
  • Storage overhead: Duplicate data across tenants consumes additional storage.

When to opt out

Opt-out is recommended for tenants with:

  • Regulatory requirements prohibiting any form of cross-tenant data correlation (even at the metadata level).
  • High-sensitivity data where the existence of shared content is itself sensitive information.
  • Compliance regimes (HIPAA, ITAR) where data co-location with other tenants must be minimized.

Audit log scoping

The audit log is append-only, immutable, and system-wide (I-A1). Audit visibility is strictly scoped:

Tenant audit export (I-A2)

Each tenant receives a filtered projection of the audit log:

  • All events originating from the tenant’s own operations.
  • Relevant system events sufficient for a coherent, complete audit trail (e.g., a cluster admin modifying a pool that contains the tenant’s data).
  • Delivered on the tenant’s VLAN.
  • Sufficient for independent compliance demonstration (e.g., HIPAA Section 164.312 audit controls).

The tenant admin consumes this export. No events from other tenants appear in the export.

Cluster admin audit view (I-A3)

The cluster admin sees:

  • System-level events (node joins, pool changes, key rotations).
  • Tenant-anonymous or aggregated metrics.
  • No tenant-attributable content.

Cluster admin modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c), so the tenant can review.

Advisory audit scoping (I-WA8)

Workflow advisory events (declare, end, phase-advance, hint accept/reject, etc.) are written to the tenant’s audit shard.

  • Semantic phase tags and workflow IDs are tenant-scoped.
  • Cluster-admin views see opaque hashes only (consistent with I-A3).
  • High-volume events (hint-accepted, hint-throttled) may be batched or sampled, but at least one event per unique (workflow_id, rejection_reason) tuple is written per second.

Cache isolation (ADR-031)

The client-side cache maintains strict per-tenant isolation:

L1 (in-memory) isolation

  • The L1 cache operates within a single client process.
  • A client process is authenticated as a specific tenant via mTLS.
  • L1 entries are decrypted plaintext chunks, keyed by chunk ID.
  • On process termination, L1 entries are zeroized (I-CC2).

L2 (on-disk) isolation

  • Each client process creates its own L2 cache pool on local NVMe.
  • Pool isolation is enforced by:
    • Unique pool ID: 128-bit CSPRNG value per process.
    • flock: Ownership proven by file lock on pool.lock.
    • Per-process directory: No cross-process sharing.
  • Concurrent same-tenant processes have independent pools. There is no cross-process cache sharing.
  • Orphaned pools (no live flock holder) are scavenged on startup or by kiseki-cache-scrub.
  • On eviction or cache wipe, L2 entries are overwritten with zeros before unlink (I-CC2).
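The flock-based ownership check can be sketched as follows. This is a Linux-only sketch using the `pool.lock` convention above; the real client's pool layout and scavenging logic are more involved:

```python
import fcntl
import os


def acquire_pool_lock(pool_dir):
    """Prove ownership of an L2 pool directory via an exclusive flock.

    Returns the open lock file on success; raises BlockingIOError if a
    live process already holds the pool. An orphan scavenger can attempt
    this same lock: success implies no live flock holder remains.
    """
    os.makedirs(pool_dir, exist_ok=True)
    lock = open(os.path.join(pool_dir, "pool.lock"), "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock.close()
        raise
    return lock
```

The lock is released automatically when the process exits, which is what makes orphaned pools detectable.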

Crypto-shred cache wipe (I-CC12)

When a crypto-shred event is detected for a tenant:

  1. All cached plaintext for that tenant is wiped from L1 and L2.
  2. L1 entries: Zeroizing<Vec<u8>> ensures memory-level erasure.
  3. L2 entries: File contents overwritten with zeros before unlink.
  4. Detection mechanisms:
    • Periodic key health check (default 30 seconds).
    • Advisory channel notification.
    • KMS error on next operation.

Maximum detection latency: min(key_health_interval, max_disconnect_seconds).

Physical-level erasure note

Logical-level erasure (zeroize before deallocation) provides strong protection against software-level attacks. For protection against physical-level attacks on flash storage (e.g., reading NAND cells after logical deletion), hardware encryption (OPAL/SED) on the compute node’s local NVMe is required. This is outside Kiseki’s control but should be part of the compute node security policy.


Network isolation

Data fabric

All data-fabric traffic is mTLS-encrypted. Tenant identity is extracted from the client certificate and validated on every RPC.

Management network

The management network (control plane, admin API) is separate from the data fabric. Cluster admin access requires admin-OU certificates.

Tenant VLAN

Tenant audit exports are delivered on the tenant’s VLAN, providing network-level isolation of audit data.


Advisory isolation (I-WA1, I-WA2, I-WA5, I-WA6)

The workflow advisory subsystem enforces strict tenant isolation:

  • No existence oracles (I-WA6): A client cannot determine the existence of resources it is not authorized to observe. Unauthorized and absent targets return identical responses (same error code, payload size, and latency distribution).
  • No content oracles (I-WA11): Advisory fields never include cluster-internal identifiers (shard IDs, chunk IDs, node IDs, device IDs, rack labels).
  • Telemetry scoping (I-WA5): Every telemetry value is computed over resources the caller is authorized to read. Aggregate metrics use k-anonymous bucketing (minimum k=5).
  • Covert-channel hardening (I-WA15): Response timing and size do not vary with neighbor-workload state.
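The k-anonymous bucketing rule (I-WA5) can be sketched as suppressing any aggregate bucket with fewer than k members. The bucketing function here is caller-supplied and illustrative; the real aggregation pipeline is internal to the advisory subsystem:

```python
from collections import Counter

K_MIN = 5  # minimum bucket population (I-WA5)


def k_anonymous_histogram(values, bucket_of, k=K_MIN):
    """Aggregate values into buckets, dropping any bucket whose
    population is below k so no small group is individually observable."""
    counts = Counter(bucket_of(v) for v in values)
    return {bucket: n for bucket, n in counts.items() if n >= k}
```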

Pool handle isolation (I-WA19)

Affinity pools are referenced via opaque pool handles, not cluster-internal pool IDs:

  • Handles are valid for one workflow’s lifetime only.
  • Never reused across workflows.
  • Never equal or leak the cluster-internal pool identity.
  • Multiple tenants can see the same opaque label attached to different internal pools; correlation across tenants is impossible because handles differ.

Compliance support

Kiseki’s tenant isolation model supports the following compliance regimes:

| Regime | Relevant guarantees |
|---|---|
| HIPAA | Per-tenant encryption, audit export for Section 164.312, crypto-shred, bounded staleness (2s floor). |
| SOC 2 | Audit log immutability, access control separation, key management lifecycle. |
| GDPR | Crypto-shred as right-to-erasure mechanism, data isolation by design. |
| ITAR | HMAC-keyed chunk IDs (no cross-tenant correlation), dedicated tenant KMS. |

Compliance tags attach at any level of the tenant hierarchy (organization, project, workload) and inherit downward. Tags may impose additional constraints:

  • Prohibit compression (HIPAA namespaces, I-K14).
  • Set staleness floor (minimum 2 seconds for HIPAA).
  • Require external KMS provider (no internal mode).
  • Restrict pool placement (data residency).

Troubleshooting

This guide covers common issues, diagnostic tools, and resolution procedures for Kiseki clusters.


Diagnostic tools

Health endpoint

# Quick liveness check (returns "OK" or connection refused)
curl http://node1:9090/health

Event log

The event log captures categorized diagnostic events in memory. Query via the admin API:

# All events from the last 3 hours
curl http://node1:9090/ui/api/events

# Error events only
curl 'http://node1:9090/ui/api/events?severity=error'

# Critical events from the last 24 hours
curl 'http://node1:9090/ui/api/events?severity=critical&hours=24'

# Device-related events
curl 'http://node1:9090/ui/api/events?category=device'

# Raft events (elections, membership changes)
curl 'http://node1:9090/ui/api/events?category=raft'

Node status

# Per-node metrics and health
curl http://node1:9090/ui/api/nodes

# Cluster summary
curl http://node1:9090/ui/api/cluster

Structured logs

# Tail logs for errors (systemd)
journalctl -u kiseki-server -f --priority=err

# Search for specific errors in JSON logs
journalctl -u kiseki-server --output=json | jq 'select(.level == "ERROR")'

# Raft-specific logs
journalctl -u kiseki-server | grep kiseki_raft

Common issues

Connection refused on data-path port (9100)

Symptoms: Clients cannot connect. curl http://node:9090/health returns OK but gRPC connections to port 9100 fail.

Diagnosis:

  1. Verify the port is listening:
    ss -tlnp | grep 9100
    
  2. Check firewall rules:
    iptables -L -n | grep 9100
    
  3. Check the server logs for bind errors:
    journalctl -u kiseki-server | grep "bind\|listen\|9100"
    

Common causes:

  • Port conflict: Another process is using port 9100.
  • Bind address: KISEKI_DATA_ADDR is set to 127.0.0.1:9100 instead of 0.0.0.0:9100.
  • Firewall: Port 9100 is not open between nodes or to clients.

mTLS authentication failures

Symptoms: AuthenticationFailed errors in logs. Clients receive gRPC UNAUTHENTICATED (16) status.

Diagnosis:

# Verify certificate validity
openssl x509 -in /etc/kiseki/tls/server.crt -noout -dates -subject -issuer

# Verify certificate chain
openssl verify -CAfile /etc/kiseki/tls/ca.crt /etc/kiseki/tls/server.crt

# Test TLS handshake
openssl s_client -connect node1:9100 \
  -cert /etc/kiseki/tls/client.crt \
  -key /etc/kiseki/tls/client.key \
  -CAfile /etc/kiseki/tls/ca.crt

Common causes:

  • Certificate expired: Renew the certificate.
  • CA mismatch: Client and server certificates signed by different CAs.
  • Missing SAN: Server certificate does not include the hostname or IP the client is connecting to.
  • CRL revocation: Certificate revoked via KISEKI_CRL_PATH. Check the CRL:
    openssl crl -in /etc/kiseki/tls/crl.pem -text -noout
    
  • Wrong OU: Tenant certificate has wrong OU, or admin certificate does not have kiseki-admin OU.

Capacity full (ENOSPC)

Symptoms: Write operations return PoolFull errors. S3 PutObject returns HTTP 507. NFS writes return EIO or ENOSPC.

Diagnosis:

# Check pool capacity
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity

# Check system disk usage
df -h /var/lib/kiseki

Resolution:

  1. Add devices to the pool to increase capacity.
  2. Rebalance to distribute data more evenly:
    kiseki-server pool rebalance --pool-id fast-nvme
    
  3. Evacuate devices from an over-full pool to a different pool (within the same device class).
  4. Delete data: Remove compositions/objects to free space. GC runs periodically (default every 300 seconds).
  5. Adjust thresholds if the defaults are too conservative for your deployment:
    kiseki-server pool set-thresholds --pool-id fast-nvme \
      --warning-pct 80 --critical-pct 90
    

Metadata disk full (system partition)

Symptoms: Inline threshold drops to floor (128 bytes). Alert: “system disk metadata usage exceeds hard limit.” Raft may stall if the system disk is completely full.

Diagnosis:

# Check system partition usage
df -h /var/lib/kiseki

# Check individual redb sizes
du -sh /var/lib/kiseki/raft/log.redb
du -sh /var/lib/kiseki/chunks/meta.redb
du -sh /var/lib/kiseki/small/objects.redb

Resolution:

  1. The system automatically reduces the inline threshold to the floor (128 bytes) when the hard limit is exceeded (I-SF2).
  2. Trigger Raft log compaction to reduce raft/log.redb size:
    kiseki-server compact
    
  3. Run GC to clean up orphaned entries in small/objects.redb (I-SF6).
  4. Consider migrating shards to nodes with larger system disks.
  5. If the system partition is persistently undersized, upgrade to larger NVMe for the system RAID-1.

Raft diagnostics

Leader election issues

Symptoms: ShardUnavailable errors. Writes fail intermittently.

Diagnosis:

# Check shard health
kiseki-server shard health --shard-id shard-0001

# Check Raft events
curl 'http://node1:9090/ui/api/events?category=raft'

# Check election metrics
curl -s http://node1:9090/metrics | grep kiseki_raft

Common causes:

  • Network partition: Raft peers cannot communicate. Check connectivity on port 9300 between all nodes.
  • Clock skew: Large clock differences can cause election timeouts. Verify NTP synchronization. Nodes with Unsync clock quality are flagged (I-T6).
  • Disk latency: HDD system disks cause 5-10ms fsync latency per Raft commit. Use NVMe or SSD for the system partition.

Quorum loss

Symptoms: All writes fail. Reads may succeed (depending on consistency model).

Diagnosis:

# Check how many nodes are reachable (-f so HTTP errors count as DOWN;
# the health body itself is discarded)
for node in node1 node2 node3; do
  echo -n "$node: "
  curl -sf "http://$node:9090/health" >/dev/null && echo "UP" || echo "DOWN"
done

Resolution:

  • If one node is down (3-node cluster): The remaining 2 nodes form a majority. Raft continues. Repair or replace the failed node.
  • If two nodes are down: Quorum is lost. See Backup & Recovery for recovery procedures.
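The majority arithmetic behind both cases is simple; a sketch (note that unreachable nodes still count toward the full membership):

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """Raft requires a strict majority of the configured membership."""
    majority = total_nodes // 2 + 1
    return reachable_nodes >= majority
```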

Shard split stalls

Symptoms: Shard reports high delta count or throughput but split does not complete.

Diagnosis:

kiseki-server shard info --shard-id shard-0001

Resolution:

  • Verify the shard is not in maintenance mode (I-O6).
  • Check if the cluster-wide concurrent migration limit is reached (I-SF4): max(1, num_nodes / 10).
  • Check the exponential backoff timer (I-SF4): Minimum 2 hours between placement changes per shard.
  • Manually trigger a split if auto-split is not firing:
    kiseki-server shard split --shard-id shard-0001
    

Device issues

Integrity scrub

Trigger a manual integrity scrub to verify chunk data against EC parity:

# Scrub all devices
curl -X POST http://node1:9090/ui/api/ops/scrub

# Scrub a specific device
kiseki-server device scrub --device-id nvme-0001

The periodic scrub runs every 7 days by default (scrub_interval_h).

SMART warnings

Automatic evacuation triggers when a device reports:

  • SSD: SMART wear indicator > 90%.
  • HDD: > 100 bad sectors.

Check device health:

kiseki-server device info --device-id nvme-0001

Device evacuation

Monitor evacuation progress:

# List active repairs/evacuations
kiseki-server repair list

# Check device state
kiseki-server device info --device-id nvme-0001

Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed (I-D2).

A device in Evacuating state can be cancelled:

kiseki-server device cancel-evacuation --device-id nvme-0001

RemoveDevice is rejected unless the device state is Removed (post-evacuation) (I-D5).


Key management issues

Key manager unreachable

Symptoms: KeyManagerUnavailable errors. All chunk writes fail cluster-wide (I-K12).

Diagnosis:

# Check key manager health
kiseki-server keymanager health

# Check connectivity from storage node
curl -s http://node1:9090/metrics | grep kms_reachability

Resolution:

  • The key manager is a Raft-replicated HA service. If one node is down, the remaining majority continues serving.
  • If the entire key manager cluster is unreachable, storage nodes use cached master keys (mlock’d in memory) for reads but cannot process new writes.
  • Restore key manager connectivity as soon as possible.

Tenant KMS unreachable

Symptoms: TenantKmsUnreachable errors for operations involving the affected tenant. Other tenants are unaffected.

Diagnosis:

kiseki-server keymanager check-kms --tenant-id acme-corp

Resolution:

  • Check network connectivity to the tenant’s KMS endpoint.
  • Check KMS credentials and certificate validity.
  • The tenant admin is responsible for their KMS availability (I-K11).

Crypto-shred verification

After a crypto-shred, verify that all clients have wiped their caches:

# Check crypto-shred count
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total

# Check security events
curl 'http://node1:9090/ui/api/events?category=security'

Gateway issues

S3 errors

Common S3 error codes returned by the gateway:

| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | SigV4 authentication failure | Check access key/secret key. |
| 404 Not Found | Bucket or object does not exist | Verify namespace and key. |
| 507 Insufficient Storage | Pool full | Add capacity. See Capacity Full above. |
| 503 Service Unavailable | Raft quorum lost or maintenance mode | Wait for recovery or disable maintenance. |

NFS errors

| Error | Cause | Resolution |
|---|---|---|
| ESTALE | Shard split caused file handle invalidation | Retry the operation. |
| EIO | Internal error (chunk read failure, key manager unreachable) | Check server logs. |
| ENOSPC | Pool full | Add capacity. |
| EXDEV | Cross-shard rename (I-L8) | Use copy + delete instead. |
| ENOTSUP | Writable shared mmap (I-O8) | Use read/write instead of mmap for writes. |

Performance Tuning

Kiseki is designed for HPC and AI workloads running at 200+ Gbps per NIC. This guide covers tuning levers for maximizing throughput and minimizing latency.


Transport selection

The transport layer abstracts the network fabric. Kiseki automatically selects the best available transport, but manual override is possible.

Transport hierarchy (fastest to slowest)

| Transport | Typical bandwidth | Latency | Feature flag | Notes |
|---|---|---|---|---|
| CXI (HPE Slingshot) | 200 Gbps | <1 us | kiseki-transport/cxi | Requires libfabric with CXI provider. CSCS/Alps native. |
| InfiniBand verbs | 100-400 Gbps | 1-2 us | kiseki-transport/verbs | Requires RDMA-capable NICs and verbs libraries. |
| RoCE v2 | 25-100 Gbps | 2-5 us | kiseki-transport/verbs | RDMA over Converged Ethernet. Requires lossless fabric (PFC/ECN). |
| TCP | 10-100 Gbps | 50-200 us | (always available) | Fallback. Uses kernel TCP with TLS. |

Enabling high-performance transports

# Build with CXI support (requires libfabric development headers)
cargo build --release --features kiseki-transport/cxi

# Build with RDMA verbs support (requires rdma-core)
cargo build --release --features kiseki-transport/verbs

The client automatically detects available transports and selects the fastest one. Override with:

# Force TCP transport (e.g., for debugging)
KISEKI_TRANSPORT=tcp kiseki-client-fuse --mountpoint /mnt/kiseki
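The detect-then-override behavior can be sketched as a preference list. Transport names follow the hierarchy above; real discovery probes the fabric rather than consulting a static set:

```python
# Preference order, fastest first (illustrative; mirrors the transport table).
PREFERENCE = ["cxi", "verbs", "tcp"]


def select_transport(available, override=None):
    """Pick the fastest available transport, honoring an explicit
    override such as KISEKI_TRANSPORT=tcp."""
    if override:
        if override not in available:
            raise ValueError("transport %r not available" % override)
        return override
    for transport in PREFERENCE:
        if transport in available:
            return transport
    raise RuntimeError("no transport available")
```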

Transport tuning

  • Connection pooling: The transport layer maintains a pool of connections per peer. Pool size adapts to workload.
  • Keepalive: Connections are kept alive to avoid handshake overhead. Configure via KISEKI_TRANSPORT_KEEPALIVE_MS.
  • Zero-copy: CXI and verbs transports use zero-copy DMA where possible.

NUMA pinning

For multi-socket servers, NUMA-aware placement is critical for avoiding cross-socket memory traffic.

Recommendations

  • Pin kiseki-server to the NUMA node closest to the NIC:
    numactl --cpunodebind=0 --membind=0 kiseki-server
    
  • Pin NVMe interrupts to the same NUMA node:
    echo 0 > /proc/irq/<irq>/smp_affinity_list
    
  • Pin data devices to the NUMA node closest to their PCIe root complex.

systemd integration

[Service]
# Pin to NUMA node 0
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/local/bin/kiseki-server

Verification

# Check NUMA topology
numactl --hardware

# Check NIC NUMA node
cat /sys/class/net/eth0/device/numa_node

# Check NVMe NUMA node
cat /sys/block/nvme0n1/device/numa_node

Erasure coding parameters

EC parameters control the trade-off between storage overhead, repair bandwidth, and read performance.

Common configurations

| Config | Data | Parity | Overhead | Fault tolerance | Use case |
|---|---|---|---|---|---|
| 4+2 | 4 | 2 | 50% | 2 device failures | Default for NVMe. Good balance. |
| 8+3 | 8 | 3 | 37.5% | 3 device failures | Large HDD pools. Lower overhead. |
| 4+1 | 4 | 1 | 25% | 1 device failure | Low-criticality data. Minimum overhead. |
| 2+2 | 2 | 2 | 100% | 2 device failures | Small pools (<6 devices). High redundancy. |

Performance implications

  • Read amplification: Reading a chunk requires reading data_chunks fragments. More data chunks = more read I/O.
  • Write amplification: Writing a chunk requires writing data_chunks + parity_chunks fragments.
  • Repair bandwidth: Repairing a lost fragment requires reading data_chunks fragments and writing 1. Higher data_chunks = more repair bandwidth.
  • Minimum pool size: The pool must have at least data_chunks + parity_chunks devices.

EC parameters are immutable per pool after creation (I-C6). Choose carefully. Changing requires creating a new pool and migrating data.
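The trade-offs above follow from simple arithmetic on the k+m profile; a sketch:

```python
def ec_profile(data_chunks: int, parity_chunks: int) -> dict:
    """Derived properties of a data+parity erasure-coding profile."""
    return {
        "overhead": parity_chunks / data_chunks,          # extra storage vs raw data
        "fault_tolerance": parity_chunks,                 # simultaneous device losses survived
        "min_pool_devices": data_chunks + parity_chunks,  # placement requirement
        "repair_reads": data_chunks,                      # fragments read per rebuilt fragment
    }
```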


Inline threshold (ADR-030)

The inline threshold determines whether small files are stored in the metadata tier (NVMe, redb) or the data tier (block device extents).

Tuning the threshold

The system automatically adjusts the threshold per-shard based on system disk capacity (I-SF1, I-SF2). Manual adjustment:

# Set cluster-wide default for new shards
kiseki-server tuning set --inline-threshold-bytes 8192

Trade-offs

| Threshold | Metadata tier impact | Data tier impact | Latency |
|---|---|---|---|
| 128 B (floor) | Minimal metadata growth | All files in chunks | Higher for tiny files |
| 4 KB (default) | Moderate growth | Small files inline | Lower for small files |
| 64 KB (ceiling) | Large growth | More inline data | Lowest for small files |

Monitoring

# Check system disk usage
df -h /var/lib/kiseki

# Check per-store sizes
du -sh /var/lib/kiseki/small/objects.redb
du -sh /var/lib/kiseki/raft/log.redb

The Raft inline throughput guard (I-SF7) automatically reduces the threshold to the floor if inline write rate exceeds KISEKI_RAFT_INLINE_MBPS (default 10 MB/s per shard). This prevents inline data from starving metadata-only Raft operations during write storms.
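A sketch of the guard's logic, assuming a one-second evaluation window and the 4 KB default threshold (the actual windowing and any hysteresis are internal details):

```python
FLOOR_BYTES = 128  # inline threshold floor


class InlineGuard:
    """Sketch of the Raft inline throughput guard (I-SF7): if inline
    bytes written in the window exceed the limit, clamp the per-shard
    threshold to the floor."""

    def __init__(self, threshold_bytes=4096, limit_mbps=10):
        self.threshold = threshold_bytes
        self.limit_bps = limit_mbps * 1024 * 1024  # KISEKI_RAFT_INLINE_MBPS
        self.window_bytes = 0

    def record_inline_write(self, size: int):
        self.window_bytes += size

    def tick(self):
        """Called once per second: evaluate the window, then reset it."""
        if self.window_bytes > self.limit_bps:
            self.threshold = FLOOR_BYTES
        self.window_bytes = 0
```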


Cache tuning (ADR-031)

L1 cache (in-memory)

The L1 cache holds decrypted plaintext chunks in process memory.

| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L1_MAX | 1 GB | Set to 10-25% of available process memory. AI training with large datasets: increase. Memory-constrained compute: decrease. |

L2 cache (local NVMe)

The L2 cache uses local NVMe on compute nodes.

| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L2_MAX | 100 GB | Set based on available NVMe capacity. Training datasets: size to fit the working set. Inference: size to fit model weights. |

Metadata TTL

| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_META_TTL_MS | 5000 (5s) | Read-heavy workloads: increase for fewer metadata fetches. Low-latency requirements: decrease for fresher data. POSIX close-to-open consistency: 0 (no caching). |

Cache mode selection

| Workload | Recommended mode | Rationale |
|---|---|---|
| AI training (epoch reuse) | pinned | Dataset is re-read every epoch. Pin to avoid refetching. |
| AI inference | organic | Model weights are hot, prompts rotate. LRU works well. |
| HPC checkpoint/restart | bypass | Checkpoints are write-heavy. Caching checkpoints wastes NVMe. |
| Climate/weather staging | pinned | Boundary conditions staged once, read many times. |
| Interactive analysis | organic | Mixed access patterns. LRU adapts. |
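The recommendations above can be condensed into a lookup; a sketch with illustrative pattern names (real deployments set the mode explicitly per mount or job):

```python
# Workload pattern -> cache mode, mirroring the recommendations above.
MODE_BY_PATTERN = {
    "epoch_reuse": "pinned",       # dataset re-read every epoch
    "mixed_inference": "organic",  # hot weights, rotating prompts; LRU fits
    "streaming_ingest": "bypass",  # write-once data; caching wastes NVMe
}


def cache_mode(pattern: str) -> str:
    # organic is the adaptive default for unknown patterns
    return MODE_BY_PATTERN.get(pattern, "organic")
```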

Staging for training workloads

Pre-stage datasets before training begins to avoid cold-start latency:

# Slurm prolog script
kiseki-client-fuse --stage /datasets/imagenet --mountpoint /mnt/kiseki
export KISEKI_CACHE_POOL_ID=$(cat /var/cache/kiseki/pool_id)

# Workload picks up the staged cache via KISEKI_CACHE_POOL_ID
srun --export=ALL python train.py

Raft tuning

Snapshot interval

kiseki-server tuning set --raft-snapshot-interval 10000
  • Lower values (1000-5000): More frequent snapshots. Faster catch-up for new nodes. More I/O.
  • Higher values (50000-100000): Less snapshot overhead. Slower catch-up.

Compaction rate

kiseki-server tuning set --compaction-rate-mb-s 200

Higher compaction rate reduces Raft log size faster but consumes more I/O bandwidth.

View materialization poll interval

kiseki-server tuning set --stream-proc-poll-ms 50

Lower poll interval reduces view staleness but increases CPU usage.


Benchmark harness

Kiseki includes a transport benchmark for measuring raw fabric throughput:

# Run transport benchmarks (if available)
tests/hw/run_transport_bench.sh

What it measures

  • Bandwidth: Sequential read/write throughput per transport.
  • Latency: Round-trip latency (p50, p99, p999) per transport.
  • IOPS: Random read/write IOPS per transport.
  • Concurrency: Throughput scaling with connection count.

Interpreting results

| Metric | Good (CXI) | Good (TCP) | Action if below |
|---|---|---|---|
| Bandwidth | >150 Gbps | >50 Gbps | Check NIC config, MTU, NUMA pinning |
| Latency p99 | <10 us | <500 us | Check CPU frequency, interrupt coalescing |
| IOPS (4K random) | >1M | >100K | Check NVMe config, queue depth |

System tuning checklist

Kernel parameters

# Increase maximum open files
echo "fs.file-max = 1048576" >> /etc/sysctl.conf

# Increase socket buffer sizes for high-bandwidth transports
echo "net.core.rmem_max = 67108864" >> /etc/sysctl.conf
echo "net.core.wmem_max = 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 4096 87380 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 4096 65536 67108864" >> /etc/sysctl.conf

# Disable transparent hugepages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

NVMe tuning

# Set I/O scheduler to none (best for NVMe)
echo none > /sys/block/nvme0n1/queue/scheduler

# Increase queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

Process limits

# /etc/security/limits.d/kiseki.conf
kiseki  soft  nofile  1048576
kiseki  hard  nofile  1048576
kiseki  soft  memlock unlimited
kiseki  hard  memlock unlimited

Performance Tests

Benchmark results for Kiseki on GCP infrastructure.

Test Environment

| Component | Spec |
|---|---|
| HDD nodes (3) | n2-standard-16, 3 x PD-Standard 200GB each |
| Fast nodes (2) | n2-standard-16, 2 x local NVMe + 2 x PD-SSD 375GB |
| Client nodes (3) | n2-standard-8, 100GB SSD cache |
| Ctrl node (1) | e2-standard-4, orchestrator |
| Network | GCP VPC, single subnet 10.0.0.0/24 |
| Region | europe-west6-c (Zurich) |
| Raft | Single group, 5 nodes, node 1 bootstrap |
| Release | v2026.1.352 (async GatewayOps, ADR-032) |

Results (2026-04-24)

Network Bandwidth

| Path | Throughput |
|---|---|
| Client → Leader (n2-standard-8 → n2-standard-16) | 15.2 - 15.3 Gbps |
| HDD → Fast cross-tier (n2-standard-16 → n2-standard-16) | 18.3 - 20.4 Gbps |

S3 Gateway

All S3 tests run from client nodes (n2-standard-8) with 8-way parallelism.

Write Throughput (single client → leader)

| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 1,624 ms | 123.2 MB/s |
| 4 MB | 50 | 8 | 239 ms | 836.8 MB/s |
| 16 MB | 25 | 8 | 363 ms | 1,101.9 MB/s |

Read Throughput

| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 176 ms | 1,136.4 MB/s |

PUT Latency (1 KB objects, sequential)

| Percentile | Latency |
|---|---|
| p50 | 7.6 ms |
| p99 | 8.6 ms |
| avg | 7.7 ms |
| max | 9.7 ms |

Aggregate Write (3 clients, parallel)

| Workload | Time | Aggregate Throughput |
|---|---|---|
| 3 x 100 x 1 MB (8 concurrent/client) | 2,205 ms | 136.1 MB/s |

NFS / pNFS / FUSE

Not yet tested on GCP. NFS mount from client nodes requires SSH key distribution from the ctrl node (OS Login configuration pending). FUSE requires the kiseki-client binary installed on client nodes.

Local testing (3-node cluster on localhost) confirms all protocols functional via unit and integration tests.

Prometheus Metrics

Gateway request counters showed 0 during the test. The requests_total atomic counter in InMemoryGateway is not wired to the Prometheus metrics exporter yet.

Local Test Results (same binary, localhost)

For comparison, local 3-node cluster results (loopback network, no disk I/O latency, 32-way parallelism):

| Test | Result |
|---|---|
| S3 Write 1 MB x 200 (32 parallel) | 380.2 MB/s |
| S3 Write 4 MB x 50 (32 parallel) | 349.7 MB/s |
| S3 Write 16 MB x 25 (32 parallel) | 340.7 MB/s |
| S3 Read 1 MB x 200 (32 parallel) | 913.2 MB/s |
| 32 concurrent PUTs | 50 ms (no deadlock) |

Observations

  1. Small object writes improved 9.6x after ADR-032 (async GatewayOps + lock-free composition writes). The composition lock is no longer held during Raft consensus, allowing concurrent writes to proceed in parallel.

  2. Read throughput exceeds write. Reads bypass Raft consensus (served from the local composition + chunk store) and hit 1.1 GB/s even for 1 MB objects.

  3. GCP outperforms localhost for large objects. The GCP network (15+ Gbps) and n2-standard-16 nodes have more bandwidth than localhost loopback under contention. 16 MB writes: 1,102 MB/s (GCP) vs 341 MB/s (local).

  4. Latency is network-bound. p50 latency on GCP (7.6 ms) includes network RTT + Raft consensus (5-node quorum). Local latency is dominated by CPU contention on shared machine.

  5. Single Raft group is the write bottleneck. All writes go through one leader. Multi-shard deployment would distribute leaders across nodes, scaling write throughput linearly.

Known Issues

  • Concurrent write deadlock (fixed in ADR-032). The sync→async bridge (run_on_raft) caused thread starvation under concurrent load. Fixed by making GatewayOps and LogOps fully async, and moving log emission out of the composition lock scope. Result: 1 MB writes improved from 39.5 to 380.2 MB/s (9.6x).

  • NFS mount on GCP. Requires SSH key distribution from ctrl to client nodes. The ctrl service account needs osAdminLogin role and OS Login key registration.

  • Prometheus counters. gateway_requests_total not exported to /metrics endpoint.

Running the Benchmark

# Local 3-node test
cargo build --release --bin kiseki-server
# Start 3 nodes (see examples/cluster-3node.env.node{1,2,3})
# Run: bash infra/gcp/benchmarks/perf-suite.sh

# GCP deployment
cd infra/gcp
terraform apply -var="project_id=PROJECT" -var="zone=ZONE" \
  -var="release_tag=v2026.1.332"
# Deploy perf-suite.sh to ctrl node and run

See infra/gcp/benchmarks/perf-suite.sh for the full benchmark script and infra/gcp/benchmarks/run-perf.sh for the local deployment wrapper.

Comparison with Ceph and Lustre

Single-Leader Kiseki vs Typical Deployments (similar hardware scale)

| Metric | Kiseki (1 leader) | Ceph RGW (S3) | Lustre |
|---|---|---|---|
| Large object write | 1.1 GB/s (16 MB) | 0.5-2 GB/s | 1-2 GB/s per OST |
| Small object write | 122 MB/s (1 MB) | 50-200 MB/s | 200-500 MB/s |
| Read throughput | 1.1 GB/s | 1-3 GB/s | 2-10 GB/s |
| PUT latency | p50: 7.6 ms | p50: 2-5 ms | p50: <1 ms (POSIX) |
| Aggregate 3-client | 133 MB/s | 300-800 MB/s | 1-5 GB/s |
| Encryption | Always (AES-256-GCM) | Optional (rarely on) | No |

Why aggregate throughput is lower

All writes go through a single Raft leader (single Raft group). Ceph distributes across PGs/OSDs; Lustre stripes across OSTs. They parallelize writes across all nodes, while Kiseki serializes through one leader. This is a deployment constraint, not an architectural limit.

Where Kiseki is strong

  1. Per-leader throughput is excellent. 1.1 GB/s per leader with full AES-256-GCM encryption is comparable to Ceph RGW without encryption. The crypto overhead is nearly invisible (aws-lc-rs with AES-NI).

  2. Read throughput matches. Reads bypass Raft consensus entirely and serve from local composition + chunk store. Multi-node reads scale linearly since any node can serve.

  3. Latency is reasonable. 7.6 ms includes Raft consensus over network + encryption. Ceph’s 2-5 ms S3 latency is lower but typically without encryption. Lustre’s sub-ms is POSIX (kernel bypass), not comparable to HTTP/S3.

Bottleneck analysis

  • Not bottlenecked by crypto – AES-256-GCM at 1.1 GB/s means the CPU encrypts faster than the network/Raft can deliver.
  • Not bottlenecked by network – 15 Gbps available, using <10 Gbps.
  • Bottlenecked by Raft consensus – 7.6 ms per round-trip for small objects, amortized for large ones.
  • Multi-shard is the path to parity – linear scaling with shard count, same model as Ceph PGs and Lustre OSTs.

Projected multi-shard performance

| Shards | 1 MB Write | 16 MB Write | Read |
|---|---|---|---|
| 1 | 122 MB/s | 1.1 GB/s | 1.1 GB/s |
| 3 | ~366 MB/s | ~3.4 GB/s | ~3.4 GB/s |
| 5 | ~610 MB/s | ~5.7 GB/s | ~5.7 GB/s |

At 5 shards on the same hardware, Kiseki reaches parity with Ceph and approaches Lustre, while encrypting all data at rest and in transit, on commodity GCP VMs with network-attached storage (not local NVMe or InfiniBand).
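The projection above is straight linear scaling: measured per-leader throughput multiplied by shard count. A minimal sketch of that assumption (illustrative only; real scaling will be somewhat sub-linear under contention):

```rust
/// Linear write-throughput projection: one Raft leader per shard, each
/// delivering the measured single-leader rate. Illustrative only.
fn projected_mb_s(per_leader_mb_s: f64, shards: u32) -> f64 {
    per_leader_mb_s * shards as f64
}

fn main() {
    // 122 MB/s per leader for 1 MB objects -> ~366 MB/s at 3 shards.
    println!("{}", projected_mb_s(122.0, 3));
}
```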

Capacity Planning

Kiseki separates metadata and data onto different storage tiers. Proper sizing of both tiers is critical for stable operation at scale.


Storage tiers

Each storage node has two distinct storage tiers:

System disk (metadata tier)

The system partition hosts:

  • Raft log (raft/log.redb): Bounded by snapshot interval. Grows with write rate, compacted periodically.
  • Key epochs (keys/epochs.redb): Tiny (<10 MB). One entry per key epoch.
  • Chunk metadata (chunks/meta.redb): Scales linearly with file count. Approximately 80 bytes per file.
  • Inline content (small/objects.redb): Variable. Controlled by the dynamic inline threshold (ADR-030).

Requirements:

  • NVMe or SSD strongly recommended. HDD system disks trigger a boot warning because Raft fsync latency will be 5-10ms per commit.
  • RAID-1 on 2x SSD for redundancy (the system disk is not protected by Kiseki’s EC; it uses traditional RAID).
  • Size based on expected file count and inline content.

Data devices (data tier)

Data devices are JBOD-managed by Kiseki. They store chunk ciphertext as extents on raw block devices (ADR-029).

Requirements:

  • NVMe, SSD, or HDD depending on the pool’s device class.
  • Multiple devices per node for EC placement (I-D4: no two EC fragments on the same device).
  • JBOD (no RAID): Kiseki manages durability via EC or replication.

Metadata capacity sizing

Per-file metadata footprint

| Component | Per file | Notes |
|---|---|---|
| Delta log entry | ~200 bytes | Raft log entry with header fields |
| Chunk metadata | ~80 bytes | Extent index entry in chunks/meta.redb |
| Subtotal (no inline) | ~280 bytes | Fixed per file |
| Inline content | 0 to 64 KB | Only if file is below inline threshold |

Capacity examples

10 billion files, 50-node cluster, RF=3, no inline:

| Component | Cluster total | Per node |
|---|---|---|
| Delta log (metadata only) | ~2 TB | ~120 GB |
| Chunk metadata index | ~0.8 TB | ~48 GB |
| Total metadata | ~2.8 TB | ~168 GB |

At 168 GB per node, 256 GB NVMe system disks are tight. Larger system disks (512 GB or 1 TB) provide comfortable headroom.
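The per-node figure follows from the per-file constants in the footprint table. A quick sketch of the arithmetic (the constants are this page's estimates, not exact on-disk record sizes):

```rust
// Per-file metadata estimates from the footprint table above.
const DELTA_LOG_BYTES: u64 = 200;
const CHUNK_META_BYTES: u64 = 80;

/// Per-node metadata bytes: files x ~280 B, replicated `rf` times,
/// spread evenly over `nodes`.
fn per_node_metadata_bytes(files: u64, rf: u64, nodes: u64) -> u64 {
    files * (DELTA_LOG_BYTES + CHUNK_META_BYTES) * rf / nodes
}

fn main() {
    // 10 billion files, RF=3, 50 nodes -> 168 GB per node.
    let gb = per_node_metadata_bytes(10_000_000_000, 3, 50) / 1_000_000_000;
    println!("{gb} GB per node");
}
```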

10 billion files, 50-node cluster, RF=3, with inline (4 KB threshold):

| Component | Cluster total | Per node |
|---|---|---|
| Metadata (as above) | ~2.8 TB | ~168 GB |
| Inline content (10% of files < 4 KB, avg 2 KB) | ~2 TB | ~120 GB |
| Total | ~4.8 TB | ~288 GB |

This exceeds 256 GB system disks. The dynamic inline threshold (ADR-030) prevents this by automatically reducing the threshold when system disk usage approaches the soft limit.

Capacity monitoring

The system automatically monitors metadata disk usage and adjusts:

| Usage level | Response |
|---|---|
| Below KISEKI_META_SOFT_LIMIT_PCT (50%) | Normal operation |
| Above soft limit | Inline threshold reduced |
| Above KISEKI_META_HARD_LIMIT_PCT (75%) | Threshold forced to floor (128 B), alert emitted |

Alerts use out-of-band gRPC health reports (not Raft) so that a full-disk node can signal without writing Raft entries (I-SF2).


Dynamic inline threshold (ADR-030)

The inline threshold is computed per-shard as the minimum affordable threshold across all Raft voters:

available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)

Where:

| Parameter | Value |
|---|---|
| INLINE_FLOOR | 128 bytes (hard lower bound) |
| INLINE_CEILING | 64 KB (system-wide maximum) |
| KISEKI_META_SOFT_LIMIT_PCT | 50% (default) |
| KISEKI_META_HARD_LIMIT_PCT | 75% (default) |
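The formula above can be sketched in a few lines (names are illustrative; the server's implementation may differ in detail):

```rust
const INLINE_FLOOR: u64 = 128;          // bytes, hard lower bound
const INLINE_CEILING: u64 = 64 * 1024;  // 64 KB, system-wide maximum

/// Per-shard inline threshold: minimum small-file budget across Raft
/// voters, divided by the projected file count, clamped to the bounds.
fn shard_inline_threshold(voter_budgets: &[u64], projected_files: u64) -> u64 {
    let available = voter_budgets.iter().copied().min().unwrap_or(0);
    (available / projected_files.max(1)).clamp(INLINE_FLOOR, INLINE_CEILING)
}

fn main() {
    // 10 GB budget on the tightest voter, 1 M projected files -> 10 KB threshold.
    println!("{}", shard_inline_threshold(&[10_000_000_000, 40_000_000_000], 1_000_000));
}
```

Note that the minimum across voters means a single low-capacity node constrains the whole shard, which is exactly what the emergency-reduction path relies on.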

Threshold behavior

  • Decrease: Automatic and safe. New files use the chunk path. Existing inline data is not retroactively migrated (I-L9).
  • Increase: Requires cluster admin decision. May trigger optional background migration of small chunked files back to inline.
  • Emergency: If any voter reports hard-limit breach, the leader commits a threshold reduction via Raft (2/3 majority; the full-disk node’s vote is not required).

Raft throughput guard (I-SF7)

The effective inline threshold is further clamped by a per-shard Raft log throughput budget (KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline write rate exceeds this budget, the threshold temporarily drops to the floor until the rate subsides. This prevents inline data from starving metadata-only Raft operations during write storms.
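The guard can be pictured as windowed rate accounting. A minimal sketch, assuming a simple one-second window (the real mechanism may smooth or decay differently):

```rust
use std::time::{Duration, Instant};

/// Windowed inline-write accounting: if inline bytes written in the
/// current one-second window exceed the budget, signal that the inline
/// threshold should drop to the floor.
struct InlineRateGuard {
    budget_bytes_per_sec: u64,
    window_start: Instant,
    bytes_this_window: u64,
}

impl InlineRateGuard {
    fn new(budget_mbps: u64) -> Self {
        Self {
            budget_bytes_per_sec: budget_mbps * 1_000_000,
            window_start: Instant::now(),
            bytes_this_window: 0,
        }
    }

    /// Record an inline write; returns true when the budget is exceeded.
    fn record(&mut self, bytes: u64) -> bool {
        let now = Instant::now();
        if now.duration_since(self.window_start) >= Duration::from_secs(1) {
            self.window_start = now; // start a fresh accounting window
            self.bytes_this_window = 0;
        }
        self.bytes_this_window += bytes;
        self.bytes_this_window > self.budget_bytes_per_sec
    }
}

fn main() {
    let mut guard = InlineRateGuard::new(10); // KISEKI_RAFT_INLINE_MBPS default
    assert!(!guard.record(5_000_000)); // 5 MB: under budget
    assert!(guard.record(6_000_000));  // 11 MB total: over budget, clamp to floor
}
```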


Pool capacity thresholds

Data-tier capacity is managed per pool. Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels (ADR-024):

| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
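The thresholds map directly onto a state function keyed by device class and fill level. An illustrative sketch:

```rust
#[derive(Debug, PartialEq)]
enum PoolState { Healthy, Warning, Critical, ReadOnly, Full }

#[derive(Clone, Copy)]
enum DeviceClass { NvmeSsd, Hdd }

/// Map a fill percentage to a pool state using the per-class thresholds above.
fn pool_state(class: DeviceClass, fill_pct: f64) -> PoolState {
    // Lower bounds for (Warning, Critical, ReadOnly, Full) per device class.
    let (warn, crit, ro, full) = match class {
        DeviceClass::NvmeSsd => (75.0, 85.0, 92.0, 97.0),
        DeviceClass::Hdd => (85.0, 92.0, 97.0, 99.0),
    };
    match fill_pct {
        p if p >= full => PoolState::Full,
        p if p >= ro => PoolState::ReadOnly,
        p if p >= crit => PoolState::Critical,
        p if p >= warn => PoolState::Warning,
        _ => PoolState::Healthy,
    }
}

fn main() {
    // 90% fill: Critical on NVMe/SSD, still Warning on HDD.
    assert_eq!(pool_state(DeviceClass::NvmeSsd, 90.0), PoolState::Critical);
    assert_eq!(pool_state(DeviceClass::Hdd, 90.0), PoolState::Warning);
}
```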

Why NVMe/SSD thresholds are lower

NVMe and SSD devices experience write amplification from garbage collection at high fill levels. Above ~80% fill, GC pressure increases sharply, causing:

  • Increased write latency (10-100x during GC storms).
  • Reduced effective write bandwidth.
  • Accelerated wear.

Enterprise storage arrays (VAST, Pure) operate at 95%+ because they have global wear leveling across all flash in the system. JBOD devices do not have this capability, so Kiseki’s thresholds are more conservative.


Growth estimation

File count growth

Monitor kiseki_shard_delta_count to track delta (file) accumulation:

# Current delta count per shard
curl -s http://node1:9090/metrics | grep kiseki_shard_delta_count

Use the rate of delta count increase to project when the metadata tier will reach capacity.

Data volume growth

Monitor pool capacity metrics:

# Current pool utilization
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity

Projection formula

days_until_full = (capacity_total - capacity_used) / daily_write_rate

For metadata:

metadata_per_file = 280 bytes (no inline) or 280 + avg_inline_size (with inline)
days_until_full = (system_disk_capacity * soft_limit_pct - current_used) /
                  (new_files_per_day * metadata_per_file * replication_factor)
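A worked instance of the metadata projection, with all inputs hypothetical:

```rust
/// Days until the metadata tier reaches its soft limit, per the formula above.
fn days_until_full(
    disk_bytes: f64,
    soft_limit_pct: f64,
    used_bytes: f64,
    new_files_per_day: f64,
    metadata_per_file: f64,
    rf: f64,
) -> f64 {
    (disk_bytes * soft_limit_pct - used_bytes)
        / (new_files_per_day * metadata_per_file * rf)
}

fn main() {
    // Hypothetical: 512 GB disk, 50% soft limit, 100 GB used,
    // 10 M new files/day, 280 B/file, RF=3 -> roughly 18-19 days of headroom.
    let days = days_until_full(512e9, 0.50, 100e9, 10e6, 280.0, 3.0);
    println!("{days:.1} days");
}
```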

Sizing recommendations

Small deployment (development/testing)

| Component | Recommendation |
|---|---|
| Nodes | 3 (minimum for Raft) |
| System disk | 256 GB NVMe each (RAID-1 on 2x SSD) |
| Data devices | 2x 1 TB NVMe per node |
| Key manager | Co-located with storage nodes (internal KMS) |
| File count | Up to 100 million |

Medium deployment (departmental HPC)

| Component | Recommendation |
|---|---|
| Nodes | 5-10 |
| System disk | 512 GB NVMe each (RAID-1) |
| Data devices | 4-8 NVMe per node (2-8 TB each) |
| Key manager | 3 dedicated nodes |
| File count | Up to 1 billion |

Large deployment (institutional HPC/AI)

| Component | Recommendation |
|---|---|
| Nodes | 50-200 |
| System disk | 1 TB NVMe each (RAID-1) |
| Data devices | 8-24 devices per node, mixed tiers (NVMe + SSD + HDD) |
| Key manager | 5 dedicated nodes |
| File count | Up to 10 billion |
| Total capacity | 100 PB+ |

Rules of thumb

  • System disk: Size at 2x the expected metadata footprint for comfortable headroom. Include inline content estimates.
  • Data devices: At least ec_data_chunks + ec_parity_chunks devices per pool (for EC placement across distinct devices, I-D4).
  • Network: CXI or InfiniBand for clusters where storage bandwidth is critical. TCP is acceptable for cold-tier pools.
  • Memory: At least 64 GB per storage node for Raft state, chunk metadata caching, and stream processor buffers.

Capacity alerts

Configuring alerts

Use Prometheus alerting rules (see Monitoring) to detect capacity issues before they become critical:

- alert: KisekiSystemDiskWarning
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.50
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "System disk above 50% on {{ $labels.instance }}"

- alert: KisekiSystemDiskCritical
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.25
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "System disk above 75% on {{ $labels.instance }}"

When to add capacity

  • System disk above 50% (soft limit): Plan for capacity expansion. Inline threshold will start decreasing.
  • System disk above 75% (hard limit): Urgent. Inline threshold is at floor. Add nodes or upgrade system disks.
  • Pool above Warning threshold: Monitor growth. Plan for device additions.
  • Pool above Critical threshold: Writes are being rejected. Add devices immediately or evacuate data to another pool.

gRPC Services

Kiseki exposes several gRPC services across two network ports. Data-path services run on port 9100. The advisory service runs on a separate listener at port 9101 (isolated runtime, ADR-021).


LogService

Port: 9100 (data fabric)
Provider: kiseki-log (via kiseki-server)
Consumers: Composition, View stream processors, Gateway, Client

| RPC | Type | Description |
|---|---|---|
| AppendDelta | Unary | Append a delta to a shard. Returns the assigned sequence number. Commits via Raft majority before ack (I-L2). |
| ReadDeltas | Server streaming | Read a range of deltas from a shard. Used by view stream processors for materialization. |
| TruncateLog | Unary | Trigger delta GC up to the minimum consumer watermark. Returns the new GC boundary. |
| ShardHealth | Unary | Query shard health, Raft state, and replication status. |
| SplitShard | Unary | Trigger mandatory shard split at a given boundary. |
| SetMaintenance | Unary | Enable or disable maintenance mode on a shard (I-O6). |
| CompactShard | Unary | Trigger compaction (header-only merge, I-O2). |

KeyManagerService

Port: Internal network (dedicated key manager cluster)
Provider: kiseki-keymanager (via kiseki-keyserver)
Consumers: Storage nodes (chunk encryption), Gateway, Client

| RPC | Type | Description |
|---|---|---|
| FetchMasterKey | Unary | Fetch the master key for a given epoch. Used at node startup and rotation. |
| RotateKey | Unary | Rotate system or tenant keys. Creates a new epoch. |
| CryptoShred | Unary | Destroy the tenant KEK, rendering all tenant data unreadable. |
| FullReEncrypt | Unary | Trigger full re-encryption of a tenant's data under new keys. |
| FetchTenantKek | Unary | Fetch the tenant KEK for wrapping/unwrapping operations. |
| CheckKmsHealth | Unary | Check tenant KMS provider connectivity. |
| KeyManagerHealth | Unary | Query key manager cluster health and Raft state. |

System DEK derivation is local (HKDF, no RPC). Only master key fetch and tenant KEK operations require network calls (ADR-003).


ControlService

Port: Management network
Provider: kiseki-control
Consumers: Admin CLI, storage nodes, advisory runtime

Tenant management

| RPC | Description |
|---|---|
| CreateOrg | Create a new organization (top-level tenant) |
| CreateProject | Create a project within an organization |
| CreateWorkload | Create a workload within an org or project |
| DeleteOrg / DeleteProject / DeleteWorkload | Remove tenant hierarchy nodes |

Namespace and policy

| RPC | Description |
|---|---|
| CreateNamespace | Create a tenant-scoped namespace |
| SetComplianceTags | Set compliance regime tags (inherit downward) |
| SetQuota | Set resource quotas at org/project/workload level |
| SetRetentionHold | Create a retention hold on a namespace or composition |
| ReleaseRetentionHold | Release an active retention hold |

IAM

| RPC | Description |
|---|---|
| RequestAccess | Cluster admin requests access to tenant data |
| ApproveAccess | Tenant admin approves access request |
| DenyAccess | Tenant admin denies access request |

Operations

| RPC | Description |
|---|---|
| SetMaintenanceMode | Enable/disable cluster-wide maintenance mode |
| ListFlavors / MatchFlavor | Query and match deployment flavors |

Federation

| RPC | Description |
|---|---|
| RegisterFederationPeer | Register a remote Kiseki cluster for async replication |

Advisory policy

| RPC | Description |
|---|---|
| SetAdvisoryPolicy | Configure profiles, budgets, and state per scope |
| TransitionAdvisoryState | Transition advisory state (enabled/draining/disabled) |
| GetEffectiveAdvisoryPolicy | Compute effective policy for a workload (min across hierarchy) |

WorkflowAdvisoryService

Port: 9101 (data fabric, separate listener)
Provider: kiseki-advisory (via kiseki-server, isolated tokio runtime)
Consumers: Native client, any authorized tenant caller

| RPC | Type | Description |
|---|---|---|
| DeclareWorkflow | Unary | Declare a new workflow with profile, initial phase, and TTL. Returns a WorkflowRef handle and authorized pool handles. |
| EndWorkflow | Unary | End a declared workflow. Triggers audit summary and GC of workflow state. |
| PhaseAdvance | Unary | Advance to the next phase. Phase order is monotonic (I-WA13). |
| GetWorkflowStatus | Unary | Query current workflow state, phase, and budget usage. |
| AdvisoryStream | Bidirectional streaming | Multiplexed channel: hints in (client to storage), telemetry out (storage to client). |
| SubscribeTelemetry | Server streaming | Subscribe to specific telemetry channels for a workflow. |

Advisory stream message types

Inbound hints (client to storage):

  • Access pattern declaration
  • Prefetch range (up to 4096 tuples per hint, I-WA16)
  • Affinity pool preference (via opaque pool handles, I-WA19)
  • Priority class (within policy-allowed maximum)
  • Retention intent
  • Dedup intent
  • Collective checkpoint announcement
  • Deadline hint

Outbound telemetry (storage to client):

  • Backpressure signal (ok / soft / hard severity with retry-after)
  • Placement locality class (local-node / local-rack / same-pool / remote / degraded)
  • Materialization lag
  • Prefetch effectiveness
  • QoS headroom
  • Hotspot detection (caller-owned compositions only)

StorageAdminService (ADR-025)

Port: Management network
Provider: kiseki-server
Consumers: Cluster admin, SRE (read-only role)

| RPC | Type | Description |
|---|---|---|
| ClusterStatus | Unary | Cluster-wide status summary |
| ListDevices / GetDevice | Unary | Query storage devices |
| AddDevice / RemoveDevice | Unary | Add or remove a device (removal requires Removed state) |
| EvacuateDevice / CancelEvacuation | Unary | Trigger or cancel device evacuation |
| ListPools / GetPool / PoolStatus | Unary | Query affinity pools |
| CreatePool / SetPoolDurability / SetPoolThresholds | Unary | Manage pool configuration |
| RebalancePool / CancelRebalance | Unary | Trigger or cancel pool rebalance |
| ListShards / GetShard / GetShardHealth | Unary | Query shard state |
| SplitShard / SetShardMaintenance | Unary | Shard management |
| SetTuningParams / GetTuningParams | Unary | Runtime tuning parameters |
| DrainNode | Unary | Drain all shards and chunks from a node |
| TriggerScrub / RepairChunk / ListRepairs | Unary | Data integrity operations |
| DeviceHealth | Server streaming | Live device health events |
| IOStats | Server streaming | Live I/O statistics |
| DeviceIOStats | Server streaming | Per-device I/O statistics |

DiscoveryService

Port: 9100 (data fabric)
Provider: kiseki-server
Consumers: Native client

Used by the native client to discover shards, views, and gateways from the data fabric without requiring direct control plane access (I-O4, ADR-008).


Protocol binding

  • Protobuf definitions: proto/kiseki/v1/*.proto
  • Generated code: kiseki-proto crate
  • Workflow ref header: x-kiseki-workflow-ref-bin (16 raw bytes as gRPC binary metadata, not a proto field, per ADR-021)

REST & Admin API

The kiseki-server binary exposes an HTTP server (default port 9090) for health checks, Prometheus metrics, and an admin dashboard. All endpoints are served via axum.


Health and metrics

GET /health

Liveness probe for load balancers and orchestrators.

Response: 200 OK with body ok when the server is running.

GET /metrics

Prometheus text-format metrics endpoint.

Response: 200 OK with text/plain body containing all registered Prometheus metrics including:

  • Raft state per shard (leader, follower, candidate)
  • Chunk operations (reads, writes, dedup hits)
  • Transport metrics (connections, bytes, errors per transport type)
  • Pool utilization (capacity, used, free per pool)
  • View materialization lag
  • Advisory budget usage

Admin dashboard

GET /ui

HTML admin dashboard with HTMX live polling. Provides a visual overview of cluster health, node status, and operational metrics.

The dashboard polls the JSON API endpoints below for live updates.


JSON API endpoints

GET /ui/api/cluster

Cluster-wide summary with aggregated metrics from all nodes.

Response: JSON object with node count, total capacity, total used, shard count, and aggregated health status.

GET /ui/api/nodes

List of all known nodes with per-node metrics.

Response: JSON array of node objects, each with node ID, address, status, device count, shard count, and key metrics.

GET /ui/api/history

Metric time series for charting.

Query parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| hours | float | 3 | Number of hours of history to retrieve |

Response: JSON object with hours and points array containing timestamped metric snapshots.

GET /ui/api/events

Filtered event log for diagnostics and alerting.

Query parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| severity | string | (all) | Filter by severity: info, warning, error, critical |
| category | string | (all) | Filter by category: node, shard, device, tenant, security, admin |
| hours | float | 3 | Hours to look back |

Response: JSON array of event objects with timestamp, severity, category, message, and source.


Operations endpoints

These endpoints trigger operational actions and require cluster admin authentication.

POST /ui/api/ops/maintenance

Toggle maintenance mode for the cluster or specific shards.

Request body: JSON with enabled (boolean) and optional shard_id.

Effect: Sets shards to read-only. Write commands are rejected with a retriable error (I-O6). Shard splits, compaction, and GC for in-progress operations continue.

POST /ui/api/ops/backup

Trigger a backup operation.

Request body: JSON with backup configuration parameters.

Effect: Initiates backup per ADR-016. Returns a job ID for status tracking.

POST /ui/api/ops/scrub

Trigger a data integrity scrub.

Request body: JSON with optional scope (pool, device, or cluster-wide).

Effect: Verifies chunk integrity via EC checksums. Reports corrupt or missing chunks. Triggers automatic repair for recoverable issues.


HTMX fragment endpoints

These endpoints return HTML fragments for the admin dashboard’s live polling:

| Endpoint | Description |
|---|---|
| GET /ui/fragment/cluster-cards | Cluster status summary cards |
| GET /ui/fragment/node-table | Node list table rows |
| GET /ui/fragment/chart-data | Chart data for metrics graphs |
| GET /ui/fragment/alerts | Active alerts and warnings |

CLI Reference

Kiseki provides three CLI binaries: kiseki-server (which doubles as the local admin CLI), kiseki-client (native client with staging and cache commands), and kiseki-admin (standalone remote administration CLI).

All admin operations use these CLIs. The underlying gRPC API is also available for programmatic access (see gRPC), but the CLI is the primary admin interface.


kiseki-server

The server binary starts the storage node when invoked without arguments. When invoked with a subcommand, it acts as an admin CLI that connects to the local node’s gRPC endpoint.

Server mode

kiseki-server

Starts the storage node. Configuration is via environment variables (see Environment Variables).

status

kiseki-server status

Display cluster status summary: node count, shard count, device health, Raft leadership, and pool utilization.

Node management

kiseki-server node add --node-id <id>
kiseki-server node drain --node-id <id>
kiseki-server node remove --node-id <id>

Add, drain, or remove a node from the cluster. Drain migrates shard assignments before removal. See Cluster Management.

Shard management

kiseki-server shard list
kiseki-server shard info --shard-id <id>
kiseki-server shard health --shard-id <id>
kiseki-server shard split --shard-id <id> [--boundary <key>]
kiseki-server shard maintenance --shard-id <id> --enabled
kiseki-server shard maintenance --shard-id <id> --disabled

List shards, inspect details, check health, trigger manual splits, and toggle per-shard maintenance mode (I-O6).

Pool management

kiseki-server pool list
kiseki-server pool status --pool-id <id>
kiseki-server pool create --pool-id <id> --device-class <class> --ec-data <n> --ec-parity <n>
kiseki-server pool set-durability --pool-id <id> --ec-data <n> --ec-parity <n>
kiseki-server pool rebalance --pool-id <id>
kiseki-server pool cancel-rebalance --pool-id <id>
kiseki-server pool set-thresholds --pool-id <id> --warning-pct <n> --critical-pct <n>

Manage affinity pools: create, inspect capacity, set EC parameters, rebalance data, and adjust capacity thresholds (I-C5, I-C6).

Device management

kiseki-server device list
kiseki-server device info --device-id <id>
kiseki-server device evacuate --device-id <id>
kiseki-server device cancel-evacuation --device-id <id>
kiseki-server device scrub --device-id <id>

List devices, check health and SMART status, trigger evacuation or integrity scrub, and cancel in-progress evacuations (I-D2, I-D3, I-D5).

Maintenance mode

kiseki-server maintenance on
kiseki-server maintenance off

Enable or disable cluster-wide maintenance mode. Sets all shards to read-only. Write commands are rejected with a retriable error. Shard splits, compaction, and GC for in-progress operations continue but no new triggers fire from write pressure (I-O6).

Backup and recovery

kiseki-server backup create
kiseki-server backup list
kiseki-server backup delete --backup-id <id>
kiseki-server repair list
kiseki-server compact

Create, list, and delete backup snapshots. List active repairs and evacuations. Trigger Raft log compaction.

Key management

kiseki-server keymanager health
kiseki-server keymanager check-kms
kiseki-server keymanager check-kms --tenant-id <id>

Check system key manager health and tenant KMS connectivity.

S3 credentials

kiseki-server s3-credentials create --tenant-id <id> --workload-id <id>

Provision S3-compatible access keys for a tenant workload via the control plane.

Tuning parameters

kiseki-server tuning set --inline-threshold-bytes <n>
kiseki-server tuning set --raft-snapshot-interval <n>
kiseki-server tuning set --compaction-rate-mb-s <n>
kiseki-server tuning set --stream-proc-poll-ms <n>

Adjust cluster-wide tuning parameters. See Performance Tuning for guidance.


kiseki-client

The native client binary provides dataset staging and cache management commands for compute nodes.

stage --dataset

kiseki-client stage --dataset <path> [--timeout <seconds>]

Pre-fetch a dataset’s chunks into the L2 cache with pinned retention. Recursively enumerates compositions under the given namespace path, fetches all chunks from canonical, verifies by content-address (SHA-256), and stores in the L2 cache pool.

Staging is idempotent and resumable. Produces a manifest file listing staged compositions and chunk IDs.

Limits: max_staging_depth (10 levels), max_staging_files (100,000).

stage --status

kiseki-client stage --status

Show the status of the current staging operation: progress, number of chunks fetched, total size, and any errors.

stage --release

kiseki-client stage --release <path>

Release a staged dataset. Unpins cached chunks, making them eligible for LRU eviction. To pick up updates from canonical, release and re-stage.

stage --release-all

kiseki-client stage --release-all

Release all staged datasets.

cache --stats

kiseki-client cache --stats

Print cache statistics: mode, L1/L2 bytes used, hit/miss counts, errors, metadata cache stats, and wipe count.

cache –wipe

kiseki-client cache --wipe

Wipe all cached data (L1 + L2 + metadata). Zeroizes data before deletion (I-CC2).
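The zeroize-before-delete behavior can be sketched for a single cache file. This is a simplified illustration assuming plain files on a local filesystem; note that on flash media an overwrite does not guarantee physical erasure, which is why key-level crypto-shred exists as the stronger mechanism:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Overwrite a cache file with zeros, flush it, then unlink it.
fn zeroize_and_remove(path: &Path) -> std::io::Result<()> {
    let len = fs::metadata(path)?.len() as usize;
    let mut f = OpenOptions::new().write(true).open(path)?;
    f.write_all(&vec![0u8; len])?; // zeroize contents in place
    f.sync_all()?;                 // flush to disk before unlink
    fs::remove_file(path)
}

fn main() -> std::io::Result<()> {
    let p = std::env::temp_dir().join("kiseki-wipe-demo.bin");
    fs::write(&p, b"cached chunk bytes")?;
    zeroize_and_remove(&p)?;
    assert!(!p.exists());
    Ok(())
}
```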

version

kiseki-client version

Print the client version.


Environment variables (kiseki-client)

| Variable | Default | Description |
|---|---|---|
| KISEKI_CACHE_DIR | /tmp/kiseki-cache | Cache directory |
| KISEKI_CACHE_MODE | organic | Cache mode: pinned, organic, bypass |
| KISEKI_CACHE_L1_MAX | 268435456 (256 MB) | L1 max bytes |
| KISEKI_CACHE_L2_MAX | 53687091200 (50 GB) | L2 max bytes |

kiseki-admin

Standalone remote administration CLI. Runs from an admin workstation and connects to any Kiseki node via the REST API (port 9090). No server dependencies are needed on the workstation.

Default endpoint: KISEKI_ENDPOINT env var, or http://localhost:9090.

status

kiseki-admin --endpoint http://storage-node:9090 status

Cluster status summary: node count, Raft entries, gateway requests, data written/read, and active connections.

Example output:

Cluster Status
══════════════
Nodes:       3/3 healthy
Raft:        42,567 entries
Requests:    1,234 served
Written:     12.5 GB
Read:        8.2 GB
Connections: 15 active

nodes

kiseki-admin nodes

Node list with health badges and per-node metrics.

Example output:

NODE              STATUS    RAFT     REQUESTS  WRITTEN   READ      CONNS
10.0.0.1:9090     healthy   14,189   411       4.2 GB    2.7 GB    5
10.0.0.2:9090     healthy   14,189   412       4.2 GB    2.8 GB    5
10.0.0.3:9090     healthy   14,189   411       4.1 GB    2.7 GB    5

events

kiseki-admin events [--severity error] [--hours 1]

Filtered event log. Optional --severity (info, warning, error, critical) and --hours (default: 3).

Example output:

TIME      SEVERITY  CATEGORY  SOURCE    MESSAGE
12:34:56  ERROR     node      node-3    unreachable
12:35:12  ERROR     device    nvme0n1   CRC mismatch detected

history

kiseki-admin history [--hours 3]

Metric history time series for the specified number of hours (default: 3).

maintenance

kiseki-admin maintenance on
kiseki-admin maintenance off

Toggle cluster-wide maintenance mode. Enables read-only on all shards. Write commands return a retriable error (I-O6).

backup

kiseki-admin backup

Trigger a background backup operation (ADR-016).

scrub

kiseki-admin scrub

Trigger a background data integrity scrub.


Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | Connection failure (server unreachable) |
| 4 | Authentication failure (mTLS) |

Environment Variables

All Kiseki configuration is done via environment variables. No configuration files are used for runtime settings (I-K8: keys are never stored in configuration files).


Server configuration

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_DATA_ADDR | SocketAddr | 0.0.0.0:9100 | Data-path gRPC listener address |
| KISEKI_ADVISORY_ADDR | SocketAddr | 0.0.0.0:9101 | Advisory gRPC listener address (isolated runtime) |
| KISEKI_S3_ADDR | SocketAddr | 0.0.0.0:9000 | S3 HTTP gateway listener address |
| KISEKI_NFS_ADDR | SocketAddr | 0.0.0.0:2049 | NFS server listener address |
| KISEKI_METRICS_ADDR | SocketAddr | 0.0.0.0:9090 | Prometheus metrics and admin UI listener address |
| KISEKI_DATA_DIR | PathBuf | (none) | Persistent storage directory for redb databases. If unset, runs in-memory only. |
| KISEKI_NODE_ID | u64 | 0 | Raft node ID. 0 = single-node mode. |
| KISEKI_BOOTSTRAP | bool | false | Create a well-known bootstrap shard on startup. Set to true or 1 for development/testing. |

TLS configuration

TLS is enabled when all three path variables are set. Otherwise the server runs in plaintext mode (development only, logged as a warning).

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_CA_PATH | PathBuf | (none) | Cluster CA certificate PEM file |
| KISEKI_CERT_PATH | PathBuf | (none) | Node certificate chain PEM file |
| KISEKI_KEY_PATH | PathBuf | (none) | Node private key PEM file |
| KISEKI_CRL_PATH | PathBuf | (none) | Optional CRL PEM file for certificate revocation |

Raft configuration

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_RAFT_ADDR | SocketAddr | (none) | Raft RPC listen address. Required for multi-node clusters. |
| KISEKI_RAFT_PEERS | String | (empty) | Comma-separated peer list in id=addr format, e.g. 1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200 |
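As an illustration of the KISEKI_RAFT_PEERS format, a minimal parser sketch (the server's own parser may handle malformed input differently):

```rust
use std::net::SocketAddr;

/// Parse "id=addr" pairs separated by commas, e.g.
/// "1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200".
fn parse_peers(s: &str) -> Result<Vec<(u64, SocketAddr)>, String> {
    s.split(',')
        .filter(|p| !p.trim().is_empty())
        .map(|pair| {
            let (id, addr) = pair
                .split_once('=')
                .ok_or_else(|| format!("expected id=addr, got {pair:?}"))?;
            let id = id.trim().parse::<u64>().map_err(|e| e.to_string())?;
            let addr = addr.trim().parse::<SocketAddr>().map_err(|e| e.to_string())?;
            Ok((id, addr))
        })
        .collect()
}

fn main() {
    let peers = parse_peers("1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200").unwrap();
    assert_eq!(peers.len(), 3);
    assert_eq!(peers[0].0, 1);
}
```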

Metadata capacity (ADR-030)

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | u8 | 50 | Soft limit percentage for system disk metadata usage. Exceeding triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | u8 | 75 | Hard limit percentage for system disk metadata usage. Exceeding forces inline threshold to INLINE_FLOOR and emits alert (I-SF2). |
| KISEKI_RAFT_INLINE_MBPS | u32 | 10 | Per-shard Raft inline throughput cap in MB/s. Prevents inline data from starving metadata-only Raft operations (I-SF7). |

Client cache configuration

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_CACHE_MODE | String | organic | Cache mode: organic (LRU), pinned (staging-driven), or bypass (no caching) |
| KISEKI_CACHE_DIR | PathBuf | /tmp/kiseki-cache | L2 cache pool directory on local NVMe |
| KISEKI_CACHE_L2_MAX | u64 | 53687091200 (50 GB) | Maximum L2 cache size in bytes |
| KISEKI_CACHE_POOL_ID | String | (generated) | Adopt an existing L2 pool (128-bit hex). Used for staging handoff between processes. |

Transport configuration

| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_IB_DEVICE | String | (auto-detect) | InfiniBand device name for RDMA verbs transport. If unset, auto-detects the first available device. |

Observability

Standard Rust/tokio observability variables:

| Variable | Type | Default | Description |
|---|---|---|---|
| RUST_LOG | String | info | Log filter directive (e.g., kiseki_log=debug,kiseki_raft=trace) |
| OTEL_EXPORTER_OTLP_ENDPOINT | String | (none) | OpenTelemetry collector endpoint for distributed tracing |

Example: single-node development

export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_BOOTSTRAP=true
kiseki-server

Example: three-node cluster

```bash
# Node 1
export KISEKI_NODE_ID=1
export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_RAFT_ADDR=10.0.0.1:9200
export KISEKI_RAFT_PEERS=1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200
export KISEKI_CA_PATH=/etc/kiseki/ca.pem
export KISEKI_CERT_PATH=/etc/kiseki/node1.pem
export KISEKI_KEY_PATH=/etc/kiseki/node1-key.pem
export KISEKI_BOOTSTRAP=true
kiseki-server
```

Architecture Decision Records

All architectural decisions are recorded as ADRs in specs/architecture/adr/.


ADR index

| ADR | Title | Status |
|---|---|---|
| ADR-001 | Pure Rust, No Mochi Dependency | Accepted |
| ADR-002 | Two-Layer Encryption Model (C) | Accepted |
| ADR-003 | System DEK Derivation (Not Storage) | Accepted |
| ADR-004 | Schema Versioning and Rolling Upgrades | Accepted |
| ADR-005 | Erasure Coding and Chunk Durability | Accepted |
| ADR-006 | Inline Data Threshold | Accepted |
| ADR-007 | System Key Manager HA via Raft | Accepted |
| ADR-008 | Native Client Fabric Discovery | Accepted |
| ADR-009 | Audit Log Sharding and GC | Accepted |
| ADR-010 | Retention Hold Enforcement Before Crypto-Shred | Accepted |
| ADR-011 | Crypto-Shred Cache Invalidation and TTL | Accepted |
| ADR-012 | Stream Processor Tenant Isolation | Accepted |
| ADR-013 | POSIX Semantics Scope | Accepted |
| ADR-014 | S3 API Compatibility Scope | Accepted |
| ADR-015 | Observability Contract | Accepted |
| ADR-016 | Backup and Disaster Recovery | Accepted |
| ADR-017 | Dedup Refcount Metadata Access Control | Accepted |
| ADR-018 | Runtime Integrity Monitor | Accepted |
| ADR-019 | Gateway Deployment Model | Accepted |
| ADR-020 | Workflow Advisory & Client Telemetry | Accepted |
| ADR-021 | Workflow Advisory Architecture | Accepted |
| ADR-022 | Storage Backend – redb (Pure Rust) | Accepted |
| ADR-023 | Protocol RFC Compliance Scope | Accepted |
| ADR-024 | Device Management, Storage Tiers, and Capacity Thresholds | Accepted |
| ADR-025 | Storage Administration API | Accepted |
| ADR-026 | Raft Topology – Per-Shard on Fabric (Strategy A) | Accepted |
| ADR-027 | Single-Language Implementation – Rust Only | Accepted |
| ADR-028 | External Tenant KMS Providers | Accepted |
| ADR-029 | Raw Block Device Allocator | Accepted |
| ADR-030 | Dynamic Small-File Placement and Metadata Capacity Management | Accepted |
| ADR-031 | Client-Side Cache | Accepted |

ADR template

New ADRs follow this structure:

```markdown
# ADR-NNN: Title

**Status**: Proposed | Accepted | Superseded by ADR-XXX
**Date**: YYYY-MM-DD
**Context**: Why this decision is needed.

## Decision

What was decided and why.

## Consequences

What changes as a result. Trade-offs accepted.

## Alternatives considered

What else was evaluated and why it was rejected.
```

Key decisions by topic

Language and architecture

  • ADR-001: Pure Rust (no Mochi dependency)
  • ADR-027: Single-language Rust (Go control plane replaced)
  • ADR-022: redb as storage backend (pure Rust, no RocksDB)

Encryption

  • ADR-002: Two-layer encryption model (system DEK + tenant KEK)
  • ADR-003: HKDF-based DEK derivation (not per-chunk storage)
  • ADR-011: Crypto-shred cache invalidation TTL
  • ADR-028: External tenant KMS providers (Vault, KMIP, AWS KMS, PKCS#11)

Consensus and replication

  • ADR-007: System key manager HA via Raft
  • ADR-026: Per-shard Raft groups on fabric (Strategy A)
  • ADR-009: Audit log sharding and GC

Storage

  • ADR-005: Erasure coding and chunk durability
  • ADR-006: Inline data threshold
  • ADR-029: Raw block device allocator
  • ADR-030: Dynamic small-file placement

Protocols and access

  • ADR-008: Native client fabric discovery
  • ADR-013: POSIX semantics scope
  • ADR-014: S3 API compatibility scope
  • ADR-019: Gateway deployment model
  • ADR-023: Protocol RFC compliance scope

Operations

  • ADR-015: Observability contract
  • ADR-016: Backup and disaster recovery
  • ADR-024: Device management and capacity thresholds
  • ADR-025: Storage administration API

Advisory

  • ADR-020: Workflow advisory and client telemetry
  • ADR-021: Workflow advisory architecture

Client

  • ADR-031: Client-side cache

ADR-001: Pure Rust, No Mochi Dependency

**Status**: Accepted
**Date**: 2026-04-17
**Context**: Q-E3, A-E3

Decision

Build all core components in Rust. Do not depend on Mochi (Mercury/Bake/SDSKV). Learn from Mochi’s design patterns (transport abstraction, composable services).

Rationale

  • Mochi has never been deployed in regulated environments (HIPAA/GDPR)
  • C/C++ FFI creates a FIPS compliance surface across two languages
  • Single-language FIPS module boundary is cleaner for certification
  • Rust ecosystem has the building blocks (aws-lc-rs for FIPS, tokio, tonic, openraft)
  • Weakest link is libfabric/CXI Rust binding — bounded scope, solvable

Consequences

  • Must build transport abstraction in Rust (kiseki-transport)
  • Must build chunk storage engine in Rust (kiseki-chunk)
  • Must build KV backend for log storage in Rust (RocksDB via rust-rocksdb, or sled)
  • libfabric-sys crate needed for Slingshot support (immature, may need contribution)

ADR-002: Two-Layer Encryption Model (C)

**Status**: Accepted
**Date**: 2026-04-17
**Context**: Q-K-arch1, I-K1 through I-K14

Decision

Single data encryption pass at the system layer. Tenant access via key wrapping. No double encryption.

  • System DEK encrypts chunk data (AES-256-GCM via FIPS module)
  • Tenant KEK wraps access to system DEK derivation material
  • System key manager derives per-chunk DEKs via HKDF (see ADR-003)

Rationale

  • Single encryption pass at HPC line rates (200+ Gbps per NIC)
  • Double encryption doubles CPU cost for no additional security benefit given that both layers use authenticated encryption
  • Key wrapping is O(32 bytes) per operation vs O(data_size) for encryption
  • Cross-tenant dedup works: same plaintext → same chunk_id → one ciphertext, multiple tenant KEK wrappings

Consequences

  • Crypto-shred destroys tenant KEK → data unreadable but not physically deleted
  • System key compromise exposes system-layer ciphertext; combined with tenant KEK = full access. System key manager must be highly protected (ADR-007).
  • Envelope must carry both system and tenant wrapping metadata

ADR-003: System DEK Derivation (Not Storage)

**Status**: Accepted
**Date**: 2026-04-17
**Context**: B-ADV-3 (system DEK count at scale), escalation point 3

Decision

System DEKs are derived locally on storage nodes via HKDF, not stored individually and not derived via RPC to the key manager.

```text
system_dek = HKDF-SHA256(
    key  = system_master_key[epoch],
    salt = chunk_id,
    info = "kiseki-chunk-dek-v1"
)
```

Key distribution model (revised per ADV-ARCH-01)

The system key manager (kiseki-keyserver) stores and replicates master keys. Storage nodes (kiseki-server) fetch the current master key at startup and on epoch rotation. DEK derivation happens locally on the storage node — the key manager never sees individual chunk_ids.

```text
kiseki-keyserver:
  Stores: master_key per epoch
  Serves: master_key to authenticated kiseki-server processes
  Never sees: individual chunk_ids or per-chunk operations

kiseki-server:
  Caches: master_key (mlock'd, refreshed on rotation)
  Derives: per-chunk DEK = HKDF(master_key, chunk_id) — locally
  Never sends: chunk_ids to the key manager
```

This prevents the key manager from building an index of all chunk_ids (ADV-ARCH-01: HKDF leak), which would reconstruct the per-tenant refcount data we explicitly decided not to store (ADR-017).
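A minimal Rust sketch of this local-derivation flow. The type and method names (`MasterKeyCache`, `derive_chunk_dek`) are hypothetical, and the HKDF step is replaced with a non-cryptographic placeholder; the point illustrated is that derivation is deterministic, local, and fails only when the epoch's master key is not cached:

```rust
use std::collections::HashMap;

type Epoch = u32;
type MasterKey = [u8; 32];

/// Per-node master-key cache (mlock'd in the real server). Sketch only.
struct MasterKeyCache {
    keys: HashMap<Epoch, MasterKey>,
}

impl MasterKeyCache {
    fn derive_chunk_dek(&self, epoch: Epoch, chunk_id: &[u8]) -> Result<[u8; 32], String> {
        let master = self
            .keys
            .get(&epoch)
            .ok_or_else(|| format!("epoch {epoch} not cached; fetch from keyserver"))?;
        // Placeholder for HKDF-SHA256(key = master, salt = chunk_id,
        // info = "kiseki-chunk-dek-v1"). NOT cryptographically sound —
        // it only models determinism: same inputs -> same DEK.
        let mut dek = *master;
        for (i, b) in chunk_id.iter().enumerate() {
            dek[i % 32] ^= *b;
        }
        Ok(dek)
    }
}
```

Note that the chunk_id never leaves the storage node: a missing epoch triggers a master-key fetch, not a per-chunk RPC.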

Rationale

  • At petabyte scale with ~1MB average chunks: billions of chunks
  • Storing billions of 32-byte DEKs = tens of GB in the key manager
  • HKDF derivation is deterministic: same (master_key, chunk_id) → same DEK
  • The key manager stores only master keys (one per epoch) — trivial storage
  • HKDF is fast (~microseconds) and FIPS-approved
  • Local derivation eliminates per-chunk RPC to key manager (performance)
  • Key rotation: new epoch = new master key. Old master keys retained during migration window. Derivation still works for old-epoch chunks.
  • Key manager never sees chunk-level operations (ADV-ARCH-01 fix)

Consequences

  • System key manager is simpler (stores epochs, not individual DEKs)
  • Master key is cached in kiseki-server memory — this is the highest-value target on a storage node (ADV-ARCH-04, accepted risk with mitigations: mlock, MADV_DONTDUMP, seccomp, core dumps disabled)
  • Master key compromise exposes ALL system DEKs for that epoch
  • Chunk ID is used as salt — chunk ID must not change after creation
  • Tenant KEK wraps the HKDF derivation parameters (epoch + chunk_id), not the DEK itself — unwrapping + HKDF derives the DEK

ADR-004: Schema Versioning and Rolling Upgrades

**Status**: Accepted
**Date**: 2026-04-17
**Context**: A-ADV-2 (upgrade and schema evolution)

Decision

All persistent formats carry a version field. Rolling upgrades are supported with N-1/N version compatibility.

Delta envelope versioning

  • DeltaHeader.format_version: u16 — first field, fixed offset
  • Readers that encounter unknown versions fail open (skip the delta, log warning) rather than crash
  • Writers always produce the current version
  • Compaction preserves original format version (does not upgrade)
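The fail-open rule can be sketched as a version check that either processes a delta or skips it, never panics (names and the current-version constant are illustrative, not the real implementation):

```rust
const CURRENT_DELTA_VERSION: u16 = 2; // hypothetical current version

enum DeltaDisposition {
    Process,
    Skip { version: u16 }, // caller logs a warning and moves on
}

/// Fail-open check per ADR-004: readers understand the current version
/// and one version back (N-1/N); anything else is skipped, not a crash.
fn classify_delta(format_version: u16) -> DeltaDisposition {
    if format_version == CURRENT_DELTA_VERSION
        || format_version == CURRENT_DELTA_VERSION - 1
    {
        DeltaDisposition::Process
    } else {
        DeltaDisposition::Skip { version: format_version }
    }
}
```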

Chunk envelope versioning

  • EnvelopeMeta.format_version: u16
  • Algorithm ID already provides crypto-agility
  • New envelope fields are additive (protobuf-style: unknown fields preserved)

Wire protocol versioning (gRPC)

  • Protobuf with reserved fields and additive evolution
  • gRPC service versioning: /kiseki.v1.LogService, /kiseki.v2.LogService
  • Native client negotiates version on connect

View materialization

  • Stream processors declare which delta format versions they support
  • Upgrade sequence: deploy new stream processors first (can read old+new), then upgrade writers (produce new format)

Rolling upgrade sequence

  1. Deploy new kiseki-server binaries (can read old + new formats)
  2. Rolling restart storage nodes (one at a time, Raft quorum maintained)
  3. Deploy new kiseki-control (independent restart)
  4. Deploy new kiseki-client-fuse to compute nodes
  5. After all nodes upgraded: optional compaction to upgrade old deltas

Consequences

  • All format changes must be backward-compatible for at least one version
  • Breaking changes require a two-phase rollout (add new, migrate, remove old)
  • Format version is the first field read on every deserialization path

ADR-005: Erasure Coding and Chunk Durability

**Status**: Accepted
**Date**: 2026-04-17
**Context**: I-C4, escalation point 10

Decision

EC parameters are per affinity pool, configured by cluster admin.

Default profiles

| Pool type | Strategy | Rationale |
|---|---|---|
| fast-nvme (metadata, hot data) | EC 4+2 | Balance of space efficiency and rebuild speed |
| bulk-nvme (cold data, checkpoints) | EC 8+3 | Higher space efficiency for bulk data |
| meta-nvme (log SSTables, key manager) | Replication-3 | Lowest latency for consensus-critical data |
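The space-efficiency trade-off in the table follows directly from the raw-storage multiplier (k+m)/k of a k-data, m-parity profile. A small sketch (function name is illustrative):

```rust
/// Raw bytes stored per logical byte for an EC profile with k data and
/// m parity shards. Replication-N is the degenerate case k = 1, m = N-1.
fn storage_multiplier(k: u32, m: u32) -> f64 {
    (k + m) as f64 / k as f64
}
```

So EC 4+2 stores 1.5x raw bytes, EC 8+3 stores 1.375x, and Replication-3 stores 3x.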

Chunk-RDMA alignment (C-ADV-3)

Content-defined chunking produces variable-size chunks. For RDMA:

  • Chunks are stored with 4KB-aligned padding on disk
  • RDMA scatter-gather lists map logical chunk boundaries to aligned physical blocks
  • One-sided RDMA transfers use pre-registered memory regions at 4KB alignment
  • Padding overhead is bounded: max 4KB per chunk, typically <1% for chunks >256KB
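The alignment bound above can be checked with simple arithmetic (a sketch; these helpers are not the real allocator API):

```rust
const ALIGN: u64 = 4096;

/// On-disk size of a chunk after 4KB-aligned padding.
fn padded_size(len: u64) -> u64 {
    (len + ALIGN - 1) / ALIGN * ALIGN
}

/// Padding overhead as a fraction of the logical chunk size (len > 0).
fn padding_overhead(len: u64) -> f64 {
    (padded_size(len) - len) as f64 / len as f64
}
```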

Consequences

  • Pool-level EC config means all chunks in a pool share the same protection level
  • Changing EC parameters requires re-encoding existing chunks (background process)
  • RDMA alignment adds trivial storage overhead but enables zero-copy transfers

ADR-006: Inline Data Threshold

**Status**: Accepted
**Date**: 2026-04-17
**Context**: Escalation point 6, analyst session

Decision

Delta payloads may carry inline data up to 4096 bytes (4KB).

Data below this threshold is encrypted and stored directly in the delta payload. No separate chunk write occurs.

Rationale

  • Small files (symlinks, xattrs, tiny configs): avoid chunk overhead
  • DeltaFS validated this pattern at scale (inode metadata with inline data)
  • 4KB aligns with filesystem block size and NVMe sector size
  • Raft replication cost per delta increases slightly but acceptably (4KB payload vs ~200 byte metadata-only delta)
  • Standard practice: ext4, Btrfs, XFS all support inline data

Threshold selection

| Threshold | Raft cost | Use cases captured | Chunk overhead saved |
|---|---|---|---|
| 1KB | Minimal | Symlinks, xattrs | Low |
| 4KB | Acceptable | Small files, metadata, configs | Moderate |
| 8KB | Noticeable | More files inline | Higher, but Raft fan-out increases |
| 64KB | Significant | Too much data in the log | Raft becomes bottleneck |

4KB is the sweet spot: captures the majority of metadata-only operations without overloading Raft replication.
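The placement decision itself is a one-line policy; a sketch under the assumption that the threshold is inclusive ("up to 4096 bytes"), with hypothetical names:

```rust
const INLINE_THRESHOLD: usize = 4096; // cluster-level default, configurable

enum Placement {
    InlineInDelta, // encrypted payload carried in the delta; no chunk write
    ChunkStore,    // normal chunk path
}

/// Route a write by payload size. Sketch of the policy, not the real path.
fn place(payload_len: usize, threshold: usize) -> Placement {
    if payload_len <= threshold {
        Placement::InlineInDelta
    } else {
        Placement::ChunkStore
    }
}
```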

Consequences

  • Configurable per cluster (system-level setting, not per-tenant)
  • Compaction must handle deltas with inline data (encrypted payload may be larger than metadata-only deltas)

ADR-007: System Key Manager HA via Raft

**Status**: Accepted
**Date**: 2026-04-17
**Context**: I-K12, escalation point 7, B-ADV-3

Decision

The system key manager is a dedicated Raft group (3 or 5 members) running as kiseki-keyserver on dedicated nodes. It stores system master keys (one per epoch) and derives per-chunk DEKs via HKDF at runtime (ADR-003).

Architecture

```text
kiseki-keyserver (3-5 nodes, Raft)
  ├── Stores: system master keys (one per epoch, ~32 bytes each)
  ├── Derives: system DEK = HKDF(master_key, chunk_id) — stateless
  ├── Manages: epoch lifecycle (create, rotate, retain, destroy)
  └── Audits: all key events to audit log
```

Rationale

  • System key manager is the highest-severity SPOF (P0 if unavailable)
  • Must be at least as available as the Log
  • Raft provides consensus + replication + leader election
  • Separate from shard Raft groups (independent failure domain)
  • Dedicated nodes: key material never co-located with tenant data
  • Master key storage is trivial (epochs × 32 bytes)
  • DEK derivation is stateless and fast (HKDF, ~microseconds)

Deployment

  • 3 nodes for standard deployments, 5 for high-criticality
  • Dedicated hardware (or at minimum, dedicated processes on control-plane nodes)
  • Key material in memory only (mlock’d, guard pages)
  • On-disk: Raft log + snapshot of epoch state (encrypted with node-local key)

Consequences

  • Adds a deployment component (kiseki-keyserver)
  • Key manager must be deployed and healthy before any data operations
  • Cross-site: each site has its own system key manager (federation doesn’t share system keys — only tenant keys cross sites via tenant KMS)

ADR-008: Native Client Fabric Discovery

**Status**: Accepted
**Date**: 2026-04-17
**Context**: Escalation point 8, A-ADV-1, I-O4

Decision

Native clients discover shards, views, and gateways via a lightweight discovery service running on every storage node, accessible on the data fabric. No control plane access required.

Mechanism

  1. Bootstrap: client is configured with a list of seed endpoints (storage node addresses on the data fabric). Seed list can be provided via environment variable, config file, or DHCP option.

  2. Discovery query: client sends a discovery request to any seed. The storage node responds with:

    • List of active shards (shard_id, leader node, key range)
    • List of materialized views (view_id, protocol, endpoint)
    • List of gateway endpoints (protocol, transport)
    • Tenant authentication requirements
  3. Authentication: client presents mTLS certificate (Cluster CA signed, per-tenant). Optional second-stage auth via tenant IdP.

  4. Cache: discovery results cached with TTL. Periodic refresh. Shard split/merge events invalidate relevant cache entries.

  5. Transport negotiation: client probes available transports (CXI → verbs → TCP) and selects highest-performance option.
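Step 5's negotiation is a first-match scan over transports in descending performance order. A sketch, with `probe` standing in for the real fabric-discovery check (the enum and function are hypothetical):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transport {
    Cxi,    // libfabric/CXI (Slingshot)
    Verbs,  // RDMA verbs
    TcpTls, // TCP+TLS fallback
}

/// Pick the highest-performance transport that probes successfully.
fn select_transport(probe: impl Fn(Transport) -> bool) -> Option<Transport> {
    [Transport::Cxi, Transport::Verbs, Transport::TcpTls]
        .into_iter()
        .find(|t| probe(*t))
}
```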

Why not DNS-SD or multicast

  • Slingshot fabric may not support multicast reliably
  • DNS-SD requires DNS infrastructure on the data fabric
  • Seed-based discovery is simple, deterministic, and works with any transport

Consequences

  • Every storage node runs a discovery responder (lightweight, part of kiseki-server)
  • Seed list is the only bootstrap configuration for compute nodes
  • Discovery responder must not expose tenant-sensitive information (shard/view metadata is operational, not tenant content)

ADR-009: Audit Log Sharding and GC

**Status**: Accepted
**Date**: 2026-04-17
**Context**: B-ADV-1 (audit log scalability)

Decision

The audit log is sharded per tenant with its own archival lifecycle.

Architecture

```text
Audit subsystem:
  ├── Per-tenant audit shard (append-only, Raft-replicated)
  │   └── Contains: tenant events + relevant system events
  │   └── GC: events archived to cold storage after retention period
  │   └── Retention period: set by compliance tags (e.g., HIPAA = 6 years)
  │
  ├── System audit shard (cluster-wide operational events)
  │   └── Contains: node events, maintenance, non-tenant-scoped events
  │   └── GC: configurable retention (default 1 year)
  │
  └── Export pipeline
      └── Tenant export: filtered stream to tenant VLAN
      └── System export: to cluster admin's SIEM
```

GC interaction with delta GC (I-L4)

  • Each tenant audit shard tracks its own watermark per data shard
  • Delta GC checks the relevant tenant audit shard’s watermark
  • A stalled tenant audit shard blocks delta GC only for that tenant’s data shards (not cluster-wide)

Rationale

  • Single global audit log is a cluster-wide GC bottleneck (B-ADV-1)
  • Per-tenant sharding: stalled export for one tenant doesn’t block others
  • Audit retention aligns with compliance (HIPAA 6yr, GDPR varies)
  • Archived events move to cold storage (bulk-nvme pool) after active retention

GC safety valve and backpressure (analyst backpass contention 2)

Default behavior (safety valve): if a tenant’s audit export stalls for > configurable threshold (default 24 hours), data shard GC proceeds anyway. The audit gap is logged, and the compliance team is notified. Storage exhaustion is worse than an auditable gap.

Per-tenant configurable: tenants can enable audit backpressure mode. When enabled, if the audit export falls behind, write throughput for that tenant is throttled (reducing GC pressure at the source). This preserves audit completeness at the cost of write performance.

| Mode | GC behavior | Write impact | Use case |
|---|---|---|---|
| Safety valve (default) | GC proceeds after timeout | None | Most tenants |
| Backpressure (opt-in) | GC waits; writes throttled | Slower writes | Strict compliance |
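The two modes reduce to a small decision function, evaluated when GC finds the tenant's audit watermark behind (names are illustrative, not the real API):

```rust
use std::time::Duration;

enum AuditMode {
    SafetyValve { timeout: Duration }, // default timeout: 24 hours
    Backpressure,                      // opt-in: GC waits, writes throttled
}

/// May delta GC proceed for this tenant's data shards, given how long
/// the audit export has been stalled? Sketch of the policy above.
fn gc_may_proceed(mode: &AuditMode, stalled_for: Duration) -> bool {
    match mode {
        // GC proceeds after the timeout; the audit gap is logged.
        AuditMode::SafetyValve { timeout } => stalled_for >= *timeout,
        // GC always waits; backpressure throttles writes at the source.
        AuditMode::Backpressure => false,
    }
}
```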

Consequences

  • More audit shards to manage (one per tenant + one system)
  • Audit Raft groups are lightweight (small append-only logs)
  • Archival pipeline is a background process
  • Safety valve prevents storage exhaustion from stalled audit export
  • Backpressure mode available for tenants with strict audit requirements

ADR-010: Retention Hold Enforcement Before Crypto-Shred

**Status**: Accepted
**Date**: 2026-04-17
**Context**: B-ADV-4 (retention hold ordering race)

Decision

Compliance tags that imply retention requirements automatically create retention holds when data is written. Crypto-shred checks for active holds before proceeding.

Mechanism

  1. When a namespace has compliance tags (HIPAA, GDPR, etc.), the control plane derives retention requirements from the tag.
  2. A default retention hold is automatically created for the namespace with the TTL mandated by the compliance regime.
  3. Crypto-shred for a tenant checks all namespaces for active holds:
    • If holds exist: crypto-shred proceeds (KEK destroyed, data unreadable) but physical GC is blocked (correct behavior).
    • If no holds exist AND compliance tags imply retention: crypto-shred is blocked with an error requiring explicit override.
  4. Override requires force_without_hold_check: true + audit log entry documenting the override and the reason.

Compliance tag → retention mapping (configurable)

| Tag | Default retention | Source |
|---|---|---|
| HIPAA | 6 years | 45 CFR §164.530(j) |
| GDPR | Per DPA agreement | No fixed minimum |
| revFADP | Per data controller policy | Swiss FDPA Art. 6 |
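The mapping is a straightforward lookup; regimes without a statutory minimum defer to per-tenant policy. A sketch with illustrative names and a simplified 365-day year:

```rust
const SECONDS_PER_YEAR: u64 = 365 * 24 * 3600; // simplified; real code may use a date library

/// Default retention for a compliance tag, in seconds. None means the
/// retention comes from per-tenant agreement/policy rather than statute.
fn default_retention_secs(tag: &str) -> Option<u64> {
    match tag {
        "HIPAA" => Some(6 * SECONDS_PER_YEAR), // 45 CFR §164.530(j)
        "GDPR" | "revFADP" => None,            // per-DPA / per-controller policy
        _ => None,
    }
}
```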

Consequences

  • Retention holds are created automatically, reducing risk of human error
  • Crypto-shred with override is audited (compliance team can review)
  • Tenant admin can extend holds but not shorten below compliance minimum

ADR-011: Crypto-Shred Cache Invalidation and TTL

**Status**: Accepted
**Date**: 2026-04-17
**Context**: B-ADV-5 (crypto-shred propagation)

Decision

Maximum tenant KEK cache TTL is 60 seconds. Crypto-shred triggers an active invalidation broadcast in addition to TTL expiry.

Mechanism

  1. Default cache TTL: 60 seconds (configurable per tenant, cannot exceed max)
  2. On crypto-shred:
    a. KEK destroyed in tenant KMS
    b. Invalidation broadcast to all known gateways, stream processors, and native clients for that tenant
    c. Components receiving invalidation immediately purge cached KEK
    d. Components unreachable during broadcast will expire naturally at TTL
  3. Crypto-shred operation returns success after KEK destruction + broadcast (does not wait for all acknowledgments)
  4. Maximum residual window: 60 seconds (cache TTL for unreachable components)

TTL configuration (analyst backpass contention 3)

The 60-second TTL is the default, not a fixed value. TTL is configurable per tenant within bounds:

| Parameter | Value | Rationale |
|---|---|---|
| Minimum TTL | 5 seconds | Below this, KMS load becomes problematic (key fetch every 5s per component) |
| Default TTL | 60 seconds | Reasonable for most deployments |
| Maximum TTL | 300 seconds (5 min) | Beyond this, the crypto-shred window is unreasonable |

Tenants under stricter regulation can request shorter TTL (e.g., 10s). The trade-off is higher KMS load (more frequent key fetches). The control plane validates that the requested TTL is within [min, max] and warns if KMS capacity may be insufficient.
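The bounds check described above is simple to state precisely; a sketch (the constants mirror the table, the function name is hypothetical, and the KMS-capacity warning is omitted):

```rust
const MIN_TTL_SECS: u32 = 5;   // hard engineering limit
const MAX_TTL_SECS: u32 = 300; // upper bound on the crypto-shred window

/// Validate a tenant-requested KEK cache TTL against the [5s, 300s] bounds.
fn validate_ttl(requested_secs: u32) -> Result<u32, String> {
    if requested_secs < MIN_TTL_SECS {
        Err(format!("TTL {requested_secs}s below hard minimum {MIN_TTL_SECS}s"))
    } else if requested_secs > MAX_TTL_SECS {
        Err(format!("TTL {requested_secs}s above maximum {MAX_TTL_SECS}s"))
    } else {
        Ok(requested_secs)
    }
}
```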

HIPAA/GDPR acceptability

  • GDPR Art. 17 requires erasure “without undue delay” — even 300 seconds is within reasonable interpretation for a distributed system
  • HIPAA does not specify a time bound for deletion
  • The audit log records exact times: KEK destroyed, broadcast sent, cache TTL expiry — providing compliance evidence
  • Configurable TTL allows compliance-sensitive tenants to reduce the window

Consequences

  • Default 60-second window where data is technically readable after shred
  • Configurable per tenant within [5s, 300s] bounds
  • Components must handle invalidation broadcast (new message type)
  • Native clients on unreachable compute nodes: data readable until their process exits or TTL expires (whichever comes first)
  • Shorter TTLs increase KMS load (more frequent key fetches)
  • TTL bounds are performance parameters that may conflict with compliance — the minimum (5s) is a hard engineering limit, not a policy choice

ADR-012: Stream Processor Tenant Isolation

**Status**: Accepted
**Date**: 2026-04-17
**Context**: B-ADV-6 (stream processor isolation)

Decision

Stream processors for different tenants run in separate OS processes on storage nodes. Key material is protected with mlock and guard pages.

Isolation model

| Mechanism | Purpose |
|---|---|
| Separate processes | OS-level memory isolation between tenants |
| mlock on key pages | Prevent key material from swapping to disk |
| Guard pages | Detect buffer overflows near key material |
| seccomp (Linux) | Restrict syscalls to minimum needed |
| Separate cgroups | Resource isolation (CPU, memory) per tenant |

Co-location policy

  • Small tenants: multiple stream processors per node (process isolation)
  • Large/sensitive tenants: dedicated nodes (configurable via placement policy)
  • Compliance tags can mandate dedicated nodes (e.g., HIPAA with strict isolation)

Hardware isolation (future)

  • AMD SEV-SNP / Intel TDX confidential VMs: out of scope for initial build
  • Envelope format and key wrapping are compatible with confidential compute (keys are already protected end-to-end; adding a TEE is additive, not architectural change)

Consequences

  • More processes per storage node (one per tenant per view)
  • Process management in kiseki-server (spawn, monitor, restart)
  • Memory overhead per process (Rust process ~10-20MB base)
  • Key material never in shared memory between tenants

ADR-013: POSIX Semantics Scope

**Status**: Accepted
**Date**: 2026-04-17
**Context**: A-ADV-4 (POSIX semantics depth)

Decision

POSIX support via FUSE with explicit compatibility matrix.

Supported (full semantics)

| Operation | Notes |
|---|---|
| open, close, read, write | Standard file I/O |
| create, unlink, mkdir, rmdir | Directory operations |
| rename (within namespace) | Atomic within shard |
| stat, fstat, lstat | File metadata |
| chmod, chown | Permission changes (stored in delta attributes) |
| readdir, readdirplus | Directory listing from view |
| symlink, readlink | Stored as inline data in delta |
| truncate, ftruncate | Composition resize |
| fsync, fdatasync | Flush to durable (delta committed) |
| extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr |
| POSIX file locks (fcntl) | Per-gateway lock state |
| O_APPEND | Atomic append via delta |
| O_CREAT, O_EXCL | Atomic create-if-not-exists |

Supported (limited semantics)

| Operation | Limitation |
|---|---|
| rename (cross-namespace) | Returns EXDEV (ADR: I-L8) |
| hard links | Within namespace only; cross-namespace returns EXDEV |
| sparse files | Holes tracked in composition; zero-fill on read |
| O_DIRECT | Bypasses client cache but still goes through FUSE |
| flock (advisory) | Best-effort; not guaranteed across gateway failover |
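The EXDEV behavior for cross-namespace rename is the same contract `mv` already handles for cross-filesystem moves (it falls back to copy + unlink). A sketch of the check, with illustrative names:

```rust
/// errno for "Invalid cross-device link" on Linux.
const EXDEV: i32 = 18;

/// Rename is atomic within a namespace (same shard); across namespaces
/// the gateway returns EXDEV and the client falls back to copy + unlink.
fn check_rename(src_namespace: &str, dst_namespace: &str) -> Result<(), i32> {
    if src_namespace == dst_namespace {
        Ok(())
    } else {
        Err(EXDEV)
    }
}
```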

Not supported

| Operation | Reason |
|---|---|
| mmap (shared, writable) | Distributed shared writable mmap requires page-level coherence — not tractable for a distributed system at HPC scale. Read-only mmap is supported. The FUSE client returns ENOTSUP with a log message: “writable shared mmap not supported; use write() instead.” |
| ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without significant benefit for the target workload. Revisit if needed. |
| chroot, pivot_root | Filesystem-level operations, not meaningful for FUSE mount |

Consequences

  • mmap restriction documented prominently (HPC users expect it)
  • Read-only mmap works (useful for model loading)
  • Writable mmap requires application changes (use write() instead)
  • No POSIX ACLs simplifies the permission model

ADR-014: S3 API Compatibility Scope

**Status**: Accepted
**Date**: 2026-04-17
**Context**: A-ADV-5 (S3 API compatibility scope)

Decision

Implement a subset of S3 API covering the operations needed by HPC/AI workloads. Not a complete S3 implementation.

Supported (full)

| API | Notes |
|---|---|
| PutObject | Single-part upload |
| GetObject | Including byte-range reads |
| HeadObject | Metadata retrieval |
| DeleteObject | Tombstone or delete marker (versioning) |
| ListObjectsV2 | Prefix, delimiter, pagination |
| CreateMultipartUpload | |
| UploadPart | |
| CompleteMultipartUpload | |
| AbortMultipartUpload | |
| ListMultipartUploads | |
| ListParts | |
| CreateBucket | Maps to namespace creation |
| DeleteBucket | Maps to namespace deletion |
| HeadBucket | Existence check |
| ListBuckets | Per-tenant bucket listing |

Supported (versioning)

| API | Notes |
|---|---|
| GetObjectVersion | Specific version retrieval |
| ListObjectVersions | Version listing |
| DeleteObjectVersion | Delete specific version |

Supported (conditional)

| API | Notes |
|---|---|
| If-None-Match, If-Match | Conditional writes |
| If-Modified-Since | Conditional reads |

Not supported (initial build)

| API | Reason | Future? |
|---|---|---|
| Lifecycle policies | Complex; competes with Kiseki’s own tiering | Maybe |
| Event notifications | Requires message bus integration | Maybe |
| SSE-S3, SSE-KMS, SSE-C | Kiseki’s encryption is always-on; S3 SSE headers are acknowledged but don’t change behavior | N/A |
| Presigned URLs | Useful; add after core is stable | Yes |
| Bucket policies | Kiseki uses its own IAM/policy model | No |
| CORS | Not relevant for HPC/AI workloads | No |
| Object Lock | Covered by Kiseki’s retention holds | Mapping possible |
| Select (S3 Select) | Out of scope | No |

SSE header handling

S3 clients may send SSE headers. Kiseki always encrypts (I-K1).

  • SSE-S3 headers: acknowledged, no-op (system encryption is always on)
  • SSE-KMS headers with key ARN: if ARN matches tenant KMS config, acknowledged. If different: error (tenant can’t specify arbitrary keys)
  • SSE-C headers: rejected (Kiseki manages encryption, not the client)
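The three cases reduce to a small dispatch on the `x-amz-server-side-encryption` value. A sketch of the gateway policy (the enum, function, and ARN strings are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum SseDisposition {
    AcknowledgeNoop,           // system encryption is always on anyway
    Reject(&'static str),
}

/// Decide how to treat an S3 SSE request header. `kms_arn` is the key
/// ARN from the request, `tenant_arn` the tenant's configured KMS key.
fn handle_sse(header: &str, kms_arn: Option<&str>, tenant_arn: &str) -> SseDisposition {
    match header {
        // SSE-S3: acknowledged, no-op.
        "AES256" => SseDisposition::AcknowledgeNoop,
        // SSE-KMS: only the tenant's own configured key is acceptable.
        "aws:kms" => match kms_arn {
            Some(arn) if arn == tenant_arn => SseDisposition::AcknowledgeNoop,
            _ => SseDisposition::Reject("SSE-KMS key does not match tenant KMS config"),
        },
        // SSE-C and anything else: Kiseki manages encryption, not the client.
        _ => SseDisposition::Reject("SSE-C / unknown SSE mode not supported"),
    }
}
```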

Consequences

  • S3-compatible tooling (aws cli, boto3, rclone) works for supported operations
  • Unsupported operations return 501 Not Implemented
  • SSE headers are handled gracefully without breaking encryption model

ADR-015: Observability Contract

**Status**: Accepted
**Date**: 2026-04-17
**Context**: A-ADV-7 (observability)

Decision

OpenTelemetry-native observability with tenant-aware metric scoping.

Metrics (Prometheus-compatible, via OpenTelemetry)

| Context | Key metrics |
|---|---|
| Log | delta_append_latency, raft_commit_latency, shard_count, shard_size, compaction_duration, election_count |
| Chunk | write_latency, read_latency, dedup_hit_rate, gc_chunks_collected, repair_count, pool_utilization |
| Composition | create_latency, delete_count, multipart_in_progress, refcount_operations |
| View | materialization_lag_ms, staleness_violation_count, rebuild_progress, pin_count |
| Gateway | request_latency (p50/p99/p999), requests_per_sec, error_rate, active_connections |
| Client | fuse_latency, transport_type, cache_hit_rate, prefetch_effectiveness |
| Key Mgr | derive_latency, rotation_in_progress, kms_reachability, cache_hit_rate |
| Control | tenant_count, namespace_count, quota_utilization, federation_sync_lag |

Zero-trust metric scoping

  • Cluster admin sees: aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless cluster admin has approved access for that tenant.
  • Tenant admin sees: their own tenant’s metrics via tenant audit export.
  • No metric exposes: file names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).

Distributed tracing

  • Every write/read path carries a trace ID (OpenTelemetry context propagation)
  • Traces span: client → gateway → composition → log → chunk → view
  • Tenant-scoped traces are visible only to the tenant admin
  • Cluster admin sees system-level spans (no tenant content in span attributes)

Structured logging

  • JSON structured logs, one line per event
  • Log levels: ERROR, WARN, INFO, DEBUG, TRACE
  • Tenant-identifying fields are present but content fields are encrypted
  • Logs ship to the same audit/observability pipeline

Consequences

  • OpenTelemetry SDK in both Rust and Go codebases
  • Metric cardinality must be bounded (no unbounded label values)
  • Tracing overhead ~1-2% on data path (acceptable for production)

ADR-016: Backup and Disaster Recovery

**Status**: Accepted
**Date**: 2026-04-17
**Context**: A-ADV-8 (backup and DR)

Decision

Federation is the primary DR mechanism. External backup is additive and optional.

Site-level DR via federation

  • Federated-async replication to a secondary site is the primary DR story
  • RPO: bounded by async replication lag (seconds to minutes)
  • RTO: secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration
  • Data replication is ciphertext-only (no key material in replication stream)

What is replicated

| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |

External backup (optional, additive)

  • Cluster admin can configure external backup targets (S3-compatible store)
  • Backup contains: encrypted chunk data + log snapshots + control plane state
  • Backup is encrypted with the system key (at rest) — no plaintext in backup
  • HIPAA requirement met: backup is encrypted
  • Backup frequency: configurable (hourly/daily snapshots of control plane, continuous for chunk data)

Recovery scenarios

| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds-minutes |
| Multiple node loss | Raft reconfiguration + EC repair | 0 | Minutes |
| Full site loss | Failover to federated peer | Replication lag | Minutes-hours |
| Site loss, no federation | Restore from external backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable (I-K11) | N/A | N/A |

Consequences

  • Federation is the recommended (and primary) DR strategy
  • External backup is for defense-in-depth, not primary recovery
  • RTO for site failover depends on control plane reconfiguration speed
  • System key manager is per-site — site failover requires the secondary site’s own system key manager (different master keys, but tenants’ data is accessible because tenant KMS is shared cross-site)

ADR-017: Dedup Refcount Metadata Access Control

Status: Accepted Date: 2026-04-17 Context: B-ADV-2 (cross-tenant dedup refcount metadata)

Decision

Chunk refcount metadata stores total refcount only, without per-tenant attribution. Tenant-to-chunk mapping is derived from composition metadata (which is tenant-encrypted).

Mechanism

ChunkMeta:
  chunk_id: abc123
  total_refcount: 3      ← visible to system
  per_tenant_refs: N/A   ← NOT stored

Tenant attribution is in the composition deltas:
  org-pharma/composition-X references chunk abc123   ← encrypted in delta payload
  org-biotech/composition-Y references chunk abc123  ← encrypted in delta payload

Access control

  • Cluster admin can see: chunk_id, total_refcount, pool, EC status
  • Cluster admin CANNOT see: which tenants reference which chunks (this information is in tenant-encrypted delta payloads)
  • System dedup process: compares chunk_ids (in the clear for dedup), but does not record which tenant triggered the dedup match

Residual risk

  • Total refcount > 1 reveals that SOME dedup occurred, but not who
  • Timing side channel: a dedup hit is faster than a full write. An observer who can measure write latency precisely could infer dedup. Mitigation: add random delay to normalize write timing (optional, configurable per tenant).

Consequences

  • No per-tenant refcount tracking in chunk metadata
  • Refcount decrement on crypto-shred: the crypto-shred process walks the tenant’s compositions (decrypted with tenant key during shred) to identify which chunks to decrement
  • This is slower than a per-tenant refcount lookup but only happens during crypto-shred (rare operation)
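The composition-walk decrement can be sketched as follows. This is a minimal in-memory illustration, not the real metadata code: `Composition`, `shred_decrement`, and the `HashMap` stand-ins are hypothetical names for what the ADR describes (walk the tenant's decrypted compositions, decrement total refcounts, collect chunks that reach zero).

```rust
use std::collections::HashMap;

type ChunkId = [u8; 32];

// Hypothetical stand-in: a decrypted composition is just a list of chunk refs here.
struct Composition {
    chunks: Vec<ChunkId>,
}

/// Walk a tenant's (already-decrypted) compositions, decrement the global
/// total refcounts, and return the chunks whose refcount reached zero —
/// these become reclaimable once the tenant is crypto-shredded.
fn shred_decrement(
    tenant_compositions: &[Composition],
    refcounts: &mut HashMap<ChunkId, u64>,
) -> Vec<ChunkId> {
    let mut dead = Vec::new();
    for comp in tenant_compositions {
        for chunk in &comp.chunks {
            if let Some(rc) = refcounts.get_mut(chunk) {
                *rc = rc.saturating_sub(1);
                if *rc == 0 {
                    refcounts.remove(chunk);
                    dead.push(*chunk);
                }
            }
        }
    }
    dead
}
```

Note that the only per-tenant information consulted is the tenant-encrypted composition metadata, consistent with the decision above.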

ADR-018: Runtime Integrity Monitor

Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-04 (master key in memory), analyst backpass contention 1

Decision

A runtime integrity monitor runs as a side process on every storage node, detecting signs of key material extraction attempts.

Detection signals

| Signal | Detection method | Severity |
|---|---|---|
| ptrace attachment to kiseki processes | Monitor /proc/pid/status TracerPid | Critical |
| /proc/pid/mem reads on kiseki processes | inotify/audit on /proc/pid/mem | Critical |
| Debugger presence (gdb, lldb, strace) | Process enumeration | High |
| Core dump generation attempt | Monitor core_pattern, catch SIGABRT | Critical |
| Unexpected LD_PRELOAD on kiseki processes | Check /proc/pid/environ at startup | High |
| Process memory mapping changes | Monitor /proc/pid/maps periodically | Medium |

Response

  1. Alert: cluster admin + affected tenant admin(s) immediately
  2. Log: audit event with full context (pid, signal, timestamp)
  3. Optional auto-response (configurable):
    • Rotate system master key (new epoch, invalidate cached key)
    • Evict cached tenant KEKs (force re-fetch from KMS)
    • Kill the suspect process
  4. Do NOT: shut down the storage node (availability over prevention — the attacker may already have the key; shutting down just causes an outage)

Performance impact

Negligible. The monitor checks /proc periodically (every 1-5 seconds), not on every crypto operation. Crypto operations themselves are not a performance concern:

  • HKDF derivation: ~1μs per call, ~25,000 calls/sec at line rate = ~25ms CPU/sec
  • AES-256-GCM (the actual encryption): with AES-NI, ~5-10% of one core at 200 Gbps
  • The bottleneck is the AEAD data encryption, not key derivation or monitoring

Consequences

  • Additional process per storage node (lightweight)
  • Linux-specific (/proc-based detection); needs a platform abstraction for other OSes
  • Not a prevention mechanism — it’s detection and response
  • False positives possible (legitimate debugging during development); monitor should be disableable in dev/test mode

ADR-019: Gateway Deployment Model

Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-03 (monolith blast radius), analyst backpass contention 4

Decision

Gateways run in-process with kiseki-server (monolith per node). Client resilience is provided by multi-endpoint resolution, not per-process gateway isolation.

Rationale

This is a distributed system with no master. Every storage node runs kiseki-server (log + chunk + composition + view + gateways). Clients resolve to multiple endpoints:

Client (NFS/S3/native)
  │
  ├── DNS round-robin: kiseki-nfs.cluster.local → [node1, node2, node3, ...]
  ├── Multiple A/AAAA records
  ├── Native client: seed list → discovery → multiple endpoints
  │
  └── On node failure: client reconnects to next endpoint
      (NFS: automatic reconnect; S3: retry to different host;
       native: transport failover)
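The reconnect behaviour common to all three client types reduces to one loop: try each resolved endpoint in order, fall through on failure. A generic sketch (the `with_failover` helper and its signature are illustrative, not the actual client API; `attempt` stands in for a protocol-specific connect or request):

```rust
/// Try each resolved endpoint in order; return the first success.
/// `attempt` stands in for NFS reconnect, S3 retry, or native transport failover.
fn with_failover<T, E>(
    endpoints: &[&str],
    mut attempt: impl FnMut(&str) -> Result<T, E>,
) -> Option<T> {
    for ep in endpoints {
        if let Ok(v) = attempt(ep) {
            return Some(v);
        }
        // On failure, fall through to the next endpoint.
    }
    None
}
```

The real clients layer retry budgets and backoff on top, but the resilience model is exactly this: multiple endpoints on the client side, no server-side process isolation required.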

Why monolith is acceptable

| Concern | Mitigation |
|---|---|
| Gateway crash = node crash | Client reconnects to another node (seconds) |
| All tenants on crashed node affected | Tenants are served by multiple nodes; one node loss = partial, not total |
| Memory leak in gateway affects log/chunk | Resource limits via cgroups; OOM killer targets the process, not the node |
| Bug in NFS gateway affects S3 gateway | Accept — both are in the same process. Isolation adds operational complexity disproportionate to the risk |

Why NOT separate gateway processes

  • Additional process management per node (spawn, monitor, restart, IPC)
  • Performance overhead of IPC between gateway and log/chunk/view
  • Operational complexity (more processes to configure, monitor, upgrade)
  • The resilience model is client-side multi-endpoint, not server-side process isolation

Client resolution

| Client type | Resolution mechanism |
|---|---|
| NFS | DNS (multiple A records), NFS mount with multiple server addresses |
| S3 | DNS round-robin, HTTP retry to next endpoint on 5xx |
| Native | Seed list → fabric discovery → multiple endpoints, automatic failover |

Consequences

  • kiseki-server remains a single-process monolith per node
  • Client-side resilience is the primary availability mechanism
  • Update failure-modes.md: F-D1 (gateway crash) → node-scoped, not protocol-scoped
  • Node loss tolerance depends on tenant data distribution across nodes

ADR-020: Workflow Advisory & Client Telemetry

Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: new capability — HPC/AI workloads need to steer storage (prefetch, affinity, priority, phase-adaptive tuning) and consume caller-scoped feedback (backpressure, locality, materialization lag, QoS headroom). ADR-015 covers operator-facing observability; this ADR covers the orthogonal client-facing advisory/telemetry surface.

Decision (analyst-level; architect will refine interfaces)

Introduce a Workflow Advisory cross-cutting concern carrying two flows over one bidirectional advisory channel per declared workflow:

  1. Hints (client → storage) — advisory, never authoritative (I-WA1).
  2. Telemetry feedback (storage → client) — caller-scoped only (I-WA5, I-WA6).

Workflow is not a bounded context. It is a correlation + steering construct owned entirely by the client, with a stateless routing layer on the server side and bounded per-workflow state.

Correlation identity

Every data-path operation issued while a workflow is active carries:

(org_id, project_id?, workload_id, client_id, workflow_id, phase_id)
  • client_id pinned per native-client process (I-WA4).
  • workflow_id ≥128-bit opaque, unique within workload (I-WA10).
  • phase_id monotonic within workflow, bounded phase history (I-WA13).

Advisory channel

  • One bidi gRPC stream per active workflow, on the data fabric, under the same mTLS tenant certificate as the data path (I-Auth1, I-WA3).
  • Authorization is per-operation on the stream, not only at establishment (I-WA3). Certificate revocation tears down the stream.
  • Side-by-side with the data path — not in-band. Data-path requests may be annotated with a short workflow_ref header that the data-path code passes through; server-side the annotation is routed to the advisory subsystem asynchronously (I-WA2). Annotation is strictly best-effort:
    • malformed workflow_ref → ignored, no data-path impact
    • workflow_ref for an expired workflow → dropped silently on the advisory side with a hint-rejected: workflow_unknown audit event
    • advisory subsystem overloaded or unavailable → annotation enqueued with bounded buffer; on overflow dropped with a rate-limited annotation_dropped audit event. Data-path operation outcome is never affected (I-WA2).
  • Closure of the advisory stream without End auto-expires the workflow on TTL. Process restart produces a fresh client_id; the old workflow expires on TTL and the new process must redeclare. No reattach protocol is defined in this ADR — it may be revisited as a follow-up feature with its own spec + adversary review.
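The bounded-buffer, drop-and-audit handoff described above can be sketched with a non-blocking queue. This is an illustrative std-only sketch (the `AnnotationQueue` type is hypothetical; the real system emits a rate-limited annotation_dropped audit event where this sketch merely counts drops):

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Best-effort annotation handoff from the data path to the advisory
/// subsystem: bounded buffer, never blocks, drops (and counts) on overflow.
struct AnnotationQueue {
    tx: SyncSender<[u8; 16]>, // raw 16-byte workflow_ref handles
    dropped: u64,
}

impl AnnotationQueue {
    fn enqueue(&mut self, workflow_ref: [u8; 16]) {
        match self.tx.try_send(workflow_ref) {
            Ok(()) => {}
            // Buffer full or advisory side gone: drop-and-audit.
            // The data-path operation outcome is never affected (I-WA2).
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped += 1;
            }
        }
    }
}
```

`try_send` is the key design point: the data path never awaits the advisory side, so an overloaded or dead advisory subsystem costs at most a dropped annotation.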

Hint taxonomy (must-have)

| Category | Example values | Acted on by |
|---|---|---|
| Workload profile | ai-training, ai-inference, hpc-checkpoint, batch-etl, interactive | Control Plane policy gate; tunes other hint defaults |
| Phase marker | stage-in, compute, checkpoint, stage-out, epoch-N (opaque semantic tag) | View (cache policy), Composition (write-absorb), Chunk (placement hot-set) |
| Access pattern | sequential / random / strided / broadcast | Native Client (prefetch), View (materialization priority) |
| Prefetch range | list of (composition_id, offset, length) | View, Chunk (opportunistic warm) |
| Priority class | interactive / batch / bulk within policy-allowed max | Gateway / Client QoS scheduler |
| Affinity preference | pool / rack / node preference within policy | Chunk placement engine |
| Retention intent | temp / working / final | Composition GC urgency, Chunk EC policy selection |
| Dedup intent | shared-ensemble / per-rank | Chunk dedup path (still bounded by I-X2) |
| Collective announcement | {ranks, bytes_per_rank, deadline} | Chunk write-absorb provisioning |

Hint taxonomy (nice-to-have, deferred)

Co-access grouping, deadline, transient markers (discardable after epoch N), NUMA/GPU topology, peer-rank state. Architect may add these in a follow-up.

Telemetry feedback (must-have)

| Signal | Shape | Scoping |
|---|---|---|
| Backpressure | severity enum + retry_after_ms | Caller’s own resources only |
| Materialization lag | ms | Caller’s views only |
| Locality class | bucketed enum (local-node, local-rack, same-pool, remote, degraded) | Caller-owned chunks only |
| Prefetch effectiveness | bucketed hit-rate | Caller’s declared prefetches only |
| QoS headroom | bucketed fraction | Caller’s workflow/workload |
| Own-hotspot | composition_id + coarse level | Caller’s own compositions |

Tenant-hierarchy scoping

  • Policy chain: cluster → org → project → workload. Each level narrows (never broadens) its parent’s ceilings (I-WA7). Profile allow-lists inherit the same way.
  • Workflow lives strictly within one workload (I-WA3).
  • Disable switch at any level (I-WA12) — data path unaffected when advisory is disabled.
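The narrow-only inheritance rule (I-WA7) amounts to folding a `min` down the policy chain. A minimal sketch, assuming ceilings are simple numeric budgets (`effective_ceiling` is a hypothetical helper, not the control-plane API):

```rust
/// Fold a ceiling down the policy chain (cluster -> org -> project -> workload).
/// Each level may only narrow its parent's ceiling; a level with no explicit
/// setting inherits unchanged, and an attempt to broaden has no effect.
fn effective_ceiling(cluster: u64, overrides: &[Option<u64>]) -> u64 {
    overrides.iter().fold(cluster, |ceiling, lvl| match lvl {
        Some(v) => ceiling.min(*v), // narrowing only
        None => ceiling,            // inherit parent's ceiling
    })
}
```

Profile allow-lists inherit the same way, with set intersection taking the place of `min`.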

Security posture

  • Hints cannot extend capability (I-WA14).
  • Telemetry is not an existence oracle (I-WA6) — unauthorized target → same shape as absent target, including timing distribution.
  • Telemetry aggregation uses k-anonymity over neighbour workloads, k ≥ 5 (I-WA5).
  • Covert-channel hardening: rejection latency and telemetry response size are bucketed (I-WA15).
  • All advisory decisions audited on tenant shard; cluster-admin view sees opaque hashes (I-WA8, consistent with I-A3 / ADR-015).

Isolation from data path

  • Advisory channel on a separate gRPC service and (ideally) a separate server-side tokio runtime / goroutine pool from the data path.
  • Hint handling is best-effort with bounded buffering; on overload the handler drops-and-audits rather than queuing.
  • Data-path code never awaits advisory responses. At most it emits fire-and-forget annotations.

Alternatives considered

  1. Attach hints as headers on existing data-path RPCs, no separate channel. Rejected: couples hint handling to data-path latency, violates I-WA2 isolation, and makes bidirectional telemetry awkward.
  2. Model workflow as a new bounded context with durable state. Rejected: workflows are ephemeral correlation handles. Persisting them invites a new shared-state problem and gives little value beyond what the audit log already provides.
  3. Expose ADR-015 observability directly to clients. Rejected: ADR-015 is operator-facing with aggregate/anonymized scope. Clients need caller-scoped, near-real-time feedback with a different privacy boundary (I-WA5/6).
  4. Server-authoritative hints (storage can infer and inject its own). Rejected: inferring client intent from data-path patterns is already done internally; the point of this ADR is to let clients supply authoritative-to-themselves hints. Server-side inference remains available as a fallback when hints are absent.

Consequences

  • New crate kiseki-advisory (Rust) — hint validation, routing, rate limiting, telemetry emission, audit emission. Side-by-side with kiseki-server, not inside the data-path crates.
  • New protobuf service WorkflowAdvisory with DeclareWorkflow, EndWorkflow, PhaseAdvance, a bidi AdvisoryStream, and SubscribeTelemetry (may be a stream within AdvisoryStream).
  • Control Plane extensions: profile allow-lists, hint budgets, opt-out switches — inherited org → project → workload.
  • Native Client extensions: WorkflowSession handle; existing data-path methods accept an optional &WorkflowSession for automatic correlation annotation.
  • Audit additions: new event types per I-WA8. Tenant audit export (I-A2) includes them; cluster-admin export (I-A3) hashes the tenant-scoped identifiers.
  • Metric additions (ADR-015 operator view): advisory_hints_accepted, _rejected, _throttled, active_workflows, advisory_channel_latency, tenant-anonymized.
  • Performance: hint handling overhead target < 5µs p99 per accepted hint; telemetry emission frequency capped per subscription.
  • Failure mode F-ADV-1: advisory-subsystem outage → data path unaffected; clients observe advisory_unavailable until restoration. To be added to specs/failure-modes.md (severity P2, blast radius: steering quality only).

Changes from adversary gate-0 review

  • I-WA6 extended to cover hint rejection (previously telemetry-only).
  • I-WA3 tightened to per-operation authorization.
  • I-WA5 defines explicit low-k behaviour (fixed-sentinel neighbour component, unchanged response shape).
  • New invariants I-WA16 (hint payload size bound), I-WA17 (declare rate bound), I-WA18 (prospective policy application).
  • I-WA11 tightened to enumerate forbidden advisory target field types.
  • I-WA12 defines three-state opt-out with draining transition.
  • I-WA13 specifies CAS serialization for PhaseAdvance.
  • Reattach protocol explicitly dropped; TTL-only recovery.
  • client_id construction simplified to CSPRNG (≥128 bits), pinning enforced by registrar.
  • F-ADV-1 (advisory outage) and F-ADV-2 (audit storm) added to failure-modes.md.
  • A-ADV-1..A-ADV-4 added to assumptions.md.

Follow-ups (architect’s scope)

  • gRPC service definitions and message schemas.
  • Exact integration surface between kiseki-advisory and each of Chunk, View, Composition, Gateway.
  • Concrete k-anonymity bucketing algorithm and parameters.
  • Exact latency-bucketing and size-bucketing schemes for I-WA15.
  • Phase-history compaction format and retention per workload.
  • Reattach protocol for process-restart scenarios (I-WA4 scenario).

Follow-ups (adversary’s scope — gate 0 before architect)

  • Threat-model the covert-channel surface (timing, size, error-code).
  • Validate that the inherent side-channels from backpressure signals are truly k-anonymised under worst-case neighbour composition.
  • Probe the reattach protocol once drafted.

ADR-021: Workflow Advisory Architecture

Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: ADR-020 analyst-level decision; this ADR commits the architecture (crate shape, runtime isolation, advisory-to-data-path coupling, protobuf + intra-Rust boundaries).

Decision

Three structural commitments that, together, make the analyst-level invariants in ADR-020 enforceable at compile time and at runtime.

1. Advisory is a separate crate with an isolated runtime

  • New Rust crate kiseki-advisory, located at crates/kiseki-advisory/.
  • Compiled into kiseki-server but runs on a dedicated tokio runtime with its own thread pool, separate from the data-path runtime. Configured via kiseki-server at process start.
  • All advisory ingress (AdvisoryStream, DeclareWorkflow, PhaseAdvance, telemetry subscriptions) is accepted on a separate gRPC listener from the data-path gRPC listeners.
  • Advisory-audit emission uses kiseki-audit’s existing tenant-shard path but with its own bounded queue and drop-and-record-on-overflow policy (no awaits out of the advisory runtime into the data path).
  • Structural enforcement of I-WA2: data-path crates do not depend on kiseki-advisory in their Cargo manifests. The only way an advisory event can affect data-path behaviour is through well-typed domain-level preferences (see §3), which the data path treats as advisory hints — never as preconditions.

2. Shared domain types live in kiseki-common

A small set of enums and structs representing “the advisory context of one operation” is declared in kiseki-common (already a dependency of every context). This lets data-path crates accept an Option<&OperationAdvisory> on their operations without pulling in the advisory runtime.

kiseki-common        (domain types: WorkflowRef, OperationAdvisory, enums)
  ↑
kiseki-{log,chunk,composition,view,gateway-*,client}
  (accept Option<&OperationAdvisory>, use for preferences only)

kiseki-advisory      (runtime, router, budget, audit emitter)
  ├── depends on kiseki-common
  ├── depends on kiseki-audit
  └── depends on kiseki-proto (for WorkflowAdvisoryService)
  ↑
kiseki-server        (wires advisory runtime to each context)

Cycle-free: no data-path crate depends on kiseki-advisory; the runtime wiring happens only in the kiseki-server binary.

3. Pull-based advisory lookup (not push into the data path)

When a data-path request arrives carrying a workflow_ref header:

3.a Header mechanism

The workflow_ref is carried as a gRPC metadata entry, not as a protobuf field on any data-path message. Concrete binding:

  • Metadata key: x-kiseki-workflow-ref-bin (binary metadata, per gRPC convention for raw-bytes values)
  • Metadata value: the raw 16-byte WorkflowRef handle
  • All data-path protos remain unchanged — this is the structural payoff that makes I-WA2 tractable (data-path code stays advisory-unaware).
  • A gRPC interceptor in kiseki-server lifts the header into a request-scoped context at ingress. The context is accessed by each data-path handler through a small kiseki-common helper (CurrentAdvisory::from_request_context()), which returns an Option<OperationAdvisory> by calling AdvisoryLookup::lookup_fast.
  • For intra-Rust calls (e.g., native client’s native API path), the same helper reads from a task-local set by the caller. The native client’s WorkflowSession handle scopes this automatically.
  • For external protocols (NFS, S3) the HTTP-level header is x-kiseki-workflow-ref (plain, hex-encoded), translated by the protocol gateway into the gRPC binary metadata entry x-kiseki-workflow-ref-bin before forwarding to any internal gRPC service. This keeps external clients unaware of gRPC conventions.
  • No data-path proto file contains workflow_ref. Any future attempt to add it is rejected at architecture review.
  1. The kiseki-server gRPC interceptor extracts workflow_ref and stores it in the request context.
  2. The data-path operation (e.g., WriteChunk) optionally consults CurrentAdvisory::from_request_context() to obtain an Option<OperationAdvisory>.
  3. The data-path code may, synchronously and fallibly, call AdvisoryLookup::lookup_fast(workflow_ref) -> Option<OperationAdvisory> with a strict bounded deadline (≤ 500 µs, configurable, default 200 µs). The method name carries the contract: implementations MUST NOT block, allocate on the happy path, or call non-O(1) functions.
  4. On timeout, unavailability, or cache miss the lookup returns None. The data-path code proceeds exactly as it would for an operation without any workflow_ref.
  5. There is no blocking wait, no retry, and no propagated error. The lookup is a hot-path cache read (see §4 below).

This guarantees I-WA2 structurally: the data path cannot be stalled or corrupted by the advisory subsystem. At worst, advisory context is unavailable and steering quality degrades.

4. Advisory state shape and hot path

kiseki-advisory maintains three bounded in-memory caches keyed by workflow:

| Cache | Contents | Size bound | Eviction |
|---|---|---|---|
| Workflow table | (workflow_id) → { mTLS-identity, profile, current_phase, budgets, TTL } | policy-bounded max concurrent workflows per workload × total workloads | TTL + End |
| Effective-hints table | (workflow_id) → OperationAdvisory (latest accepted hints, merged across phase) | 1 row per active workflow | replaced on new accept |
| Prefetch ring | per-workflow ring buffer of accepted prefetch tuples | max_prefetch_tuples_per_hint × in-flight phases | FIFO on cap |

Reads from the data path hit the effective-hints table (O(1)). Writes into these caches happen on the advisory runtime only. Cross-thread access uses arc-swap (snapshot-read, copy-on-write) so the data-path read never takes a lock held by the advisory runtime.
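The snapshot-read, copy-on-write pattern can be sketched with std types only. This is an approximation: the real code uses the arc-swap crate for a lock-free read, while this sketch holds a `Mutex` just long enough to clone the `Arc`; the `EffectiveHints` and `OperationAdvisory` shapes here are illustrative stand-ins.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type WorkflowId = u128;

#[derive(Clone, PartialEq, Debug)]
struct OperationAdvisory {
    priority: u8, // stand-in for the real validated hint bundle
}

/// Snapshot-read table: the data path clones an Arc to the current
/// immutable map; the advisory runtime publishes by swapping in a rebuilt
/// map, so readers never observe a partially updated table.
struct EffectiveHints {
    current: Mutex<Arc<HashMap<WorkflowId, OperationAdvisory>>>,
}

impl EffectiveHints {
    /// Hot-path read: O(1) snapshot grab + hash lookup, never blocks on
    /// an in-progress publish. A miss is just `None` (I-WA2: the data
    /// path proceeds as if no workflow_ref were present).
    fn lookup_fast(&self, wf: WorkflowId) -> Option<OperationAdvisory> {
        let snap = self.current.lock().unwrap().clone(); // clones the Arc, not the map
        snap.get(&wf).cloned()
    }

    /// Advisory-runtime side: rebuild and swap (copy-on-write), never
    /// mutate the map readers might be holding.
    fn publish(&self, wf: WorkflowId, adv: OperationAdvisory) {
        let mut guard = self.current.lock().unwrap();
        let mut next = (**guard).clone();
        next.insert(wf, adv);
        *guard = Arc::new(next);
    }
}
```

With arc-swap the `Mutex` disappears from the read side entirely, which is what keeps the data-path read free of any lock held by the advisory runtime.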

5. gRPC service shape

One new service, WorkflowAdvisoryService, on its own gRPC listener. Unary: DeclareWorkflow, EndWorkflow, PhaseAdvance, GetWorkflowStatus (for admin/debug within the caller’s own scope). Bidi streaming: AdvisoryStream (hints in, telemetry out over the same stream, multiplexed). Server-streaming: SubscribeTelemetry, for callers who don’t want to send hints (may alternatively be served as a stream within AdvisoryStream).

Full schema in specs/architecture/proto/kiseki/v1/advisory.proto.

6. Control-plane integration

New Go package control/pkg/advisory:

  • Policy CRUD for profile allow-lists, budgets, opt-out state per org/project/workload. Inheritance computed server-side; effective policy returned to kiseki-advisory via existing ControlService.
  • Opt-out state transitions (enabled/draining/disabled) are Raft-backed in the existing control-plane state store.
  • Federation does NOT replicate workflow state (ephemeral, local). It DOES replicate policy (existing async config replication path).

7. k-anonymity bucketing: concrete algorithm

For pool/shard saturation signals that incorporate cross-workload aggregate:

  1. Compute aggregate metric A over all contributing workloads on the pool/shard.
  2. Count distinct contributing workloads k.
  3. If k ≥ 5 (policy-configurable minimum): return severity = bucket(A); retry-after = bucket(compute_retry(A)).
  4. If k < 5: return severity = bucket(A_caller_only); retry-after = bucket(compute_retry(A_caller_only)). The response shape is identical to the k≥5 case; only the value of the neighbour-derived component is replaced by a sentinel bucket (ok, regardless of true aggregate) chosen to minimize caller utility of detecting the substitution.

Bucket function: fixed set {ok, soft, hard} for severity, {<50ms, 50-250ms, 250-1000ms, 1-10s, >10s} for retry-after.
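The gate in steps 1-4 can be sketched directly. The severity thresholds below are illustrative (the ADR fixes only the bucket set, not the cut-points), and `backpressure_severity` is a hypothetical helper name:

```rust
#[derive(PartialEq, Debug, Clone, Copy)]
enum Severity {
    Ok,
    Soft,
    Hard,
}

/// Bucket an aggregate saturation metric in 0.0..=1.0 into the fixed
/// severity set {ok, soft, hard}. Thresholds here are illustrative.
fn bucket(a: f64) -> Severity {
    if a < 0.5 {
        Severity::Ok
    } else if a < 0.85 {
        Severity::Soft
    } else {
        Severity::Hard
    }
}

/// k-anonymity gate from section 7: with fewer than k_min distinct
/// contributing workloads, the neighbour-derived aggregate is withheld and
/// the caller-only value is bucketed instead — same response shape either way.
fn backpressure_severity(aggregate: f64, caller_only: f64, k: usize, k_min: usize) -> Severity {
    if k >= k_min {
        bucket(aggregate)
    } else {
        bucket(caller_only)
    }
}
```

Because both branches return the same enum through the same bucket function, a caller cannot tell from the response shape whether the low-k substitution occurred.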

8. Covert-channel hardening: concrete widths

  • Rejection response timing: every advisory rejection path (hint, subscription, declare, phase) pads response emission to the next 100-µs boundary after a fixed minimum of 300 µs. Enforced by a common emit_bucketed_response helper in kiseki-advisory.
  • Telemetry message sizes: protobuf messages padded to one of {128, 256, 512, 1024} bytes with a reserved bytes padding field repeated to the target size. Selection uses the nearest bucket ≥ actual size.
  • Error codes: every rejection caused by authorization or scope violation returns the SCOPE_NOT_FOUND code with the same message payload, regardless of whether the cause was “unauthorized” or “absent”. Internal audit records carry the true reason.
  • gRPC status code: WorkflowAdvisoryService MUST return gRPC status NOT_FOUND (code 5) for every SCOPE_NOT_FOUND case. Using PERMISSION_DENIED (code 7) or UNAUTHENTICATED (code 16) on authorization failures would leak the distinction via the gRPC trailers, defeating the canonicalization above. All gRPC clients and middleware expose the status code, so this is not a “docs-only” rule — it is enforced by an integration test at Phase 11.5 exit that compares status-code distributions across authorized-absent and unauthorized-existing cases.

9. Phase-history compaction format

Per workflow, keep the last 64 phase records in the workflow table (ring buffer of PhaseRecord { phase_id, tag_hash, entered_at, hints_accepted_count, hints_rejected_count }). On eviction, the evicted record is rolled up into a per-workflow PhaseSummary { from_phase_id, to_phase_id, total_hints_accepted, total_hints_rejected, duration_ms } audit event emitted to the tenant audit shard. The summary replaces all evicted individual records in audit history.
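The ring-plus-rollup structure can be sketched as follows. This is a simplified illustration: field names are shortened, and where the real code emits the evicted rollup as a tenant audit event, this sketch only accumulates it in memory (duration and tag hashing omitted).

```rust
use std::collections::VecDeque;

struct PhaseRecord {
    phase_id: u64,
    hints_accepted: u64,
    hints_rejected: u64,
}

#[derive(PartialEq, Debug)]
struct PhaseSummary {
    from_phase_id: u64,
    to_phase_id: u64,
    total_accepted: u64,
    total_rejected: u64,
}

/// Keep the last `cap` phase records; on eviction, roll the evicted record
/// into a running summary that replaces the individual records.
struct PhaseHistory {
    cap: usize,
    ring: VecDeque<PhaseRecord>,
    summary: Option<PhaseSummary>,
}

impl PhaseHistory {
    fn push(&mut self, rec: PhaseRecord) {
        if self.ring.len() == self.cap {
            let old = self.ring.pop_front().unwrap();
            let s = self.summary.get_or_insert(PhaseSummary {
                from_phase_id: old.phase_id,
                to_phase_id: old.phase_id,
                total_accepted: 0,
                total_rejected: 0,
            });
            s.to_phase_id = old.phase_id; // summary always ends at the newest eviction
            s.total_accepted += old.hints_accepted;
            s.total_rejected += old.hints_rejected;
        }
        self.ring.push_back(rec);
    }
}
```

The bound matters for I-WA13: per-workflow state stays fixed-size regardless of how many phases a long-running workflow advances through.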

Alternatives considered

  1. Put advisory code inside each data-path crate behind a feature flag. Rejected: tight coupling; impossible to guarantee I-WA2 (hot data-path code lives in the advisory lifecycle), and per-crate feature flags multiply combinatorics of build variants.

  2. Separate OS process for advisory runtime, IPC’d from kiseki-server. Rejected: IPC adds serialization cost on the hot-path lookup (§3) and complicates deployment (another process per node). The isolated-tokio-runtime pattern gives enough blast-radius reduction at much lower overhead.

  3. Define advisory traits in a new tiny crate kiseki-advisory-api separate from kiseki-common. Considered. Rejected for now: the advisory domain types (OperationAdvisory, enums) are small, stable, and already conceptually part of the shared vocabulary (Workflow, Phase, AccessPattern appear in ubiquitous-language.md). Adding a one-concept crate adds build-graph overhead without payoff. Can be split out later if the type set grows.

  4. Push hints directly into each context via per-context channels (no OperationAdvisory aggregation). Rejected: spreads fan-out logic across every context and makes I-WA11 (target-field restriction) and I-WA16 (size cap) harder to enforce. Centralizing in kiseki-advisory and passing an already-validated bundle simplifies data-path code.

10. Schema versioning

advisory.proto ships as kiseki.v1. Forward-evolution rules:

  • Additions (new fields, new oneof variants, new enum values) stay within v1. Unknown fields are preserved by gRPC clients.
  • Deprecations mark fields with reserved after one minor release; old clients continue to work.
  • Breaking changes (semantic change of a field, required removal) move to v2 with a deprecation window ≥ 2 releases in which both versions are served.
  • Advisory-policy changes in the control plane (profile allow-list additions, budget changes) are config, not schema — no version bump needed.

11. Padding to bucket size

AdvisoryError.padding, AdvisoryServerMessage.padding, TelemetryEvent.padding, WorkflowStatus.padding, and AdvisoryAuditBody.padding carry the variable bytes needed to hit one of the bucket sizes {128, 256, 512, 1024, 2048 for audit bodies}. Computation at emit time:

serialized_size = serialize(rest_of_message).len();
target_bucket   = smallest bucket >= serialized_size + padding_overhead;
padding_len     = target_bucket - serialized_size - varint_overhead(target_bucket);

varint_overhead(N) accounts for the two-byte (tag + length-varint) prefix of the padding field; standard protobuf wire format. Implementations MUST use the kiseki-advisory::emit_bucketed_response helper. Property test at Phase 11.5 exit: every response on WorkflowAdvisoryService is exactly one of the bucket sizes.
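The fixed-point interaction between the padding length and its own varint prefix can be made concrete. A sketch, assuming a single-byte field tag (field number below 16) and payloads under 2 KiB so the length varint is one or two bytes; this is illustrative arithmetic, not the emit_bucketed_response implementation:

```rust
const BUCKETS: [usize; 4] = [128, 256, 512, 1024];

/// Protobuf wire overhead of a `bytes padding` field holding `len` bytes:
/// 1 tag byte + a 1- or 2-byte length varint (lengths here never exceed 2 KiB).
fn padding_overhead(len: usize) -> usize {
    let varint_len = if len < 128 { 1 } else { 2 };
    1 + varint_len
}

/// Padding payload length so the full message lands exactly on a bucket
/// boundary, or None if it exceeds the largest bucket.
fn padding_len(serialized_size: usize) -> Option<usize> {
    for &bucket in &BUCKETS {
        // Smallest bucket that fits the message plus a minimal padding field.
        if bucket < serialized_size + padding_overhead(0) {
            continue;
        }
        // Fixed point: the pad length affects the size of its own prefix.
        let mut pad = bucket - serialized_size - padding_overhead(0);
        while serialized_size + padding_overhead(pad) + pad > bucket {
            pad -= 1;
        }
        if serialized_size + padding_overhead(pad) + pad == bucket {
            return Some(pad);
        }
        // Off-by-one boundary cases fall through to the next bucket,
        // preserving the exact-bucket-size invariant.
    }
    None
}
```

The property test mentioned above corresponds to checking that `serialized_size + padding_overhead(pad) + pad` is always exactly one of the bucket sizes.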

Consequences

  • Adds one Rust crate (kiseki-advisory), one Go package (control/pkg/advisory), one proto file (proto/kiseki/v1/advisory.proto), one data-model stub (data-models/advisory.rs).
  • Adds a new phase to the build sequence (see build-phases.md).
  • Every data-path *Ops trait in api-contracts.md gains an optional advisory: Option<&OperationAdvisory> parameter on its methods. Callers that don’t care pass None.
  • Isolation requires kiseki-server to instantiate two tokio runtimes. Accepted cost.
  • The arc-swap hot-path read is the only cross-runtime coupling. Property-test and benchmark-verified at Phase 11 exit.

Open items (escalated to adversary gate-1)

  • Validate that §3 (pull-based lookup) cannot itself become a DoS surface: a malicious client pummelling workflow_ref headers causes lookups. Mitigation: lookup cache is per-node, bounded, and miss cost is a None return (no upstream RPC).
  • Validate §4 (arc-swap snapshot) meets latency targets on the actual data-path hot code (FUSE read/write, chunk write, view read).
  • Validate §8 covert-channel widths are large enough to mask actual work variance under realistic load.
  • Confirm §9 audit summary compaction does not itself become an existence oracle (size of summary varies with workflow activity).

ADR-022: Storage Backend — redb (Pure Rust)

Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.

Context

The system needs persistent storage for:

  1. Raft log entries — append-heavy, sequential reads for replay
  2. State machine snapshots — periodic full-state serialization
  3. Chunk metadata index — key-value mapping (chunk_id → placement, refcount)
  4. View watermark checkpoints — small, frequently updated

The spec references “RocksDB or equivalent” (build-phases.md Phase 3) but does not commit to a specific engine. RocksDB is C++ and brings ~200MB build dependency via cmake/clang/librocksdb.

Decision

Use redb v2 for all structured persistent storage.

What redb handles

| Data | redb table | Key | Value |
|---|---|---|---|
| Raft log entries | raft_log | u64 (log index) | bincode-serialized entry |
| Raft vote/term | raft_meta | &str (“vote”, “term”) | u64 |
| State machine snapshot | sm_snapshot | "latest" | bincode-serialized state |
| Chunk metadata | chunk_meta | [u8; 32] (chunk_id) | bincode ChunkMeta |
| Device allocation | device_alloc | (DeviceId, u64) (device + offset) | [u8; 32] (chunk_id) — reverse index |
| View watermarks | view_wm | [u8; 16] (view_id) | u64 (sequence) |

What redb does NOT handle

Chunk ciphertext data is written directly to raw block devices (or file-backed fallback for VMs/CI) via the DeviceBackend trait in kiseki-block (ADR-029). redb stores metadata only; chunk ciphertext never passes through redb.

$KISEKI_DATA_DIR/
  devices/
    /dev/nvme0n1          # raw block device (default, ADR-029)
    /dev/nvme1n1          # raw block device
    /tmp/kiseki-dev0.img  # file-backed fallback (VMs/CI)
  raft/
    db.redb               # redb database file (metadata only)

redb tracks chunk placement: chunk_meta table maps chunk_id → (device_id, offset, size, fragment_index). The device_alloc table provides a reverse index (device_id, offset) → chunk_id for bitmap rebuild and scrub. Bitmap allocation updates are journaled in redb before application to the on-device bitmap (ADR-029).

Why pool files, not per-chunk files:

  • At 100TB / 64KB avg = 1.6B chunks → filesystem inode exhaustion
  • Pool files support O_DIRECT and RDMA pre-registration (single mmap region)
  • Chunks are 4KB-aligned within the pool file for NVMe block alignment
  • Pool file is sparse: only allocated regions consume disk space

EC fragment placement (CRUSH-like)

Fragments placed across devices via deterministic hashing:

fn place_fragment(chunk_id: &ChunkId, frag_idx: usize, pool_devices: &[DeviceId]) -> DeviceId {
    let mut candidates: Vec<DeviceId> = pool_devices.to_vec();
    // Ensure no two fragments land on the same device: remove the devices
    // deterministically chosen for the earlier fragment indices.
    for prior in 0..frag_idx {
        let placed = candidates[hash(chunk_id, prior) % candidates.len()];
        candidates.retain(|d| *d != placed);
    }
    candidates[hash(chunk_id, frag_idx) % candidates.len()]
}

Deterministic — can recalculate placement without storing it. Reverse index (device_id, chunk_id) → fragment_index in redb enables efficient repair on device failure.

Raft snapshots

  • Trigger: Every 10,000 log entries
  • Format: bincode::serialize(&state_machine_inner)
  • Storage: redb sm_snapshot table, key = "latest"
  • Restore: Deserialize snapshot → replay log entries after snapshot index
  • Log cleanup: Truncate entries before snapshot index after snapshot
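The snapshot-then-truncate cycle can be sketched with in-memory stand-ins for the redb tables. This is illustrative only: `RaftStore` is a hypothetical type, the bincode serialization of the state machine is stubbed out, and the real code writes the snapshot to the sm_snapshot table before truncating raft_log.

```rust
use std::collections::BTreeMap;

/// Sketch of the snapshot cadence: every `interval` appended entries,
/// record the snapshot index (the serialized state write is stubbed here)
/// and truncate log entries at or before it.
struct RaftStore {
    log: BTreeMap<u64, Vec<u8>>, // stand-in for the raft_log table
    snapshot_index: u64,         // part of the sm_snapshot "latest" record
    interval: u64,
}

impl RaftStore {
    fn append(&mut self, index: u64, entry: Vec<u8>) {
        self.log.insert(index, entry);
        if index - self.snapshot_index >= self.interval {
            // Real code: bincode-serialize the state machine into sm_snapshot first.
            self.snapshot_index = index;
            // Keep only entries strictly after the snapshot index.
            self.log = self.log.split_off(&(index + 1));
        }
    }
}
```

Restore is the inverse walk: deserialize the snapshot, then replay whatever log entries survive with index greater than `snapshot_index`.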

Rationale

| Criterion | redb | RocksDB | fjall | Custom files |
|---|---|---|---|---|
| Pure Rust | Yes | No (C++) | Yes | Yes |
| Build deps | None | cmake, clang, librocksdb | None | None |
| Binary size | ~50KB | ~5MB | ~100KB | 0 |
| ACID | Yes (COW) | Yes (WAL) | Yes (WAL) | Manual (fsync) |
| Crash recovery | Automatic | Automatic | Automatic | Manual replay |
| Compaction | None needed (B-tree) | Required (LSM) | Required (LSM) | None |
| Maturity | 1.0, used by Firefox | Very mature | Newer | N/A |
| Write amplification | Low (COW) | High (LSM) | High (LSM) | Low |

redb wins on simplicity, zero deps, and sufficient performance for Raft log append + metadata lookup.

Consequences

  • No LSM-tree compaction complexity
  • No C++ build toolchain required
  • Chunk blobs as files: simple, inspectable, compatible with RDMA
  • redb’s COW B-tree has higher read amplification than LSM for range scans — acceptable for our workload (point lookups + append)
  • If redb proves insufficient for high-throughput Raft log append, migrate to fjall (LSM, same API pattern)

References

  • redb: https://github.com/cberner/redb
  • RFC 1813 §3: NFS3 procedure semantics
  • build-phases.md Phase 3: “SSTable” storage (now redb B-tree)
  • ADR-029: Raw Block Device Allocator (chunk data I/O)

ADR-023: Protocol RFC Compliance Scope

Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.

Context

Kiseki exposes three protocol interfaces: S3 HTTP, NFSv3, NFSv4.2. ADR-013 (POSIX semantics) and ADR-014 (S3 API scope) define the functional subset but don’t reference specific RFC sections or define wire-format compliance testing.

Now that wire protocol implementations exist, we need to codify which RFC requirements are met and how compliance is verified.

Decision

Protocol scope

Protocol  Standard    Implemented Subset    Total in Standard
NFSv3     RFC 1813    7 of 22 procedures    22 procedures
NFSv4.2   RFC 7862    10 of ~60 operations  ~60 operations
S3        AWS S3 API  5 of 40+ operations   40+ operations

NFSv3 (RFC 1813) — implemented procedures

#   Procedure  Status       Notes
0   NULL       Implemented  Ping/health check
1   GETATTR    Implemented  File/directory attributes
3   LOOKUP     Implemented  Name → file handle resolution
6   READ       Implemented  Byte-range file read
7   WRITE      Implemented  File data write
8   CREATE     Implemented  Create new file + directory index entry
16  READDIR    Implemented  Directory listing with real filenames

Not implemented: SETATTR, ACCESS, READLINK, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, READDIRPLUS, FSSTAT, FSINFO, PATHCONF, COMMIT.

NFSv4.2 (RFC 7862) — implemented COMPOUND operations

Op  Name             Status       Notes
9   GETATTR          Implemented  Bitmap-selected attributes
10  GETFH            Implemented  Return current file handle
15  LOOKUP           Stub         Delegates to directory index
24  PUTROOTFH        Implemented  Set root file handle
25  READ             Implemented  Via stateid + offset + count
38  WRITE            Implemented  Via stateid + offset + stable
42  EXCHANGE_ID      Implemented  Random client IDs (C-ADV-7)
43  CREATE_SESSION   Implemented  Random session IDs (C-ADV-2)
44  DESTROY_SESSION  Implemented  Session teardown
53  SEQUENCE         Implemented  Per-request sequencing
63  IO_ADVISE        Implemented  Accepted (advisory integration pending)

S3 API — implemented operations

Operation      HTTP Method           Status
PutObject      PUT /:bucket/:key     Implemented
GetObject      GET /:bucket/:key     Implemented
HeadObject     HEAD /:bucket/:key    Implemented
DeleteObject   DELETE /:bucket/:key  Stub (returns 204)
ListObjectsV2  GET /:bucket          Not yet

Compliance testing approach

  1. BDD feature files map to RFC sections:

    • specs/features/nfs3-rfc1813.feature (14 scenarios)
    • specs/features/nfs4-rfc7862.feature (20 scenarios)
    • specs/features/s3-api.feature (10 scenarios)
  2. Wire format validation via Python e2e tests:

    • NFS: raw TCP with struct.pack for ONC RPC framing
    • S3: requests library for HTTP
  3. Real client interop (future):

    • NFS: mount -t nfs -o nfsvers=3,tcp in Docker
    • S3: boto3 / aws-cli
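The wire-format tests above exercise ONC RPC record marking (RFC 5531 §11): each message on TCP is prefixed with a 4-byte big-endian header whose high bit flags the last fragment and whose low 31 bits carry the fragment length. A self-contained sketch of the framing (in Rust here for consistency; the actual e2e tests do the equivalent in Python with struct.pack):

```rust
/// Frame one ONC RPC message for TCP using record marking (RFC 5531 §11):
/// 4-byte big-endian header, top bit = last fragment, low 31 bits = length.
fn frame_record(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() < (1 << 31), "fragment too large for record marking");
    let header: u32 = 0x8000_0000 | payload.len() as u32;
    let mut out = header.to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}
```

A 3-byte payload framed this way yields the header bytes 0x80 0x00 0x00 0x03 followed by the payload.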

Consequences

  • Clear documentation of what’s implemented vs what’s not
  • BDD scenarios serve as living compliance spec
  • Real client interop deferred until wire format proven via raw tests
  • Expanding the subset (e.g., adding REMOVE, RENAME) requires: new BDD scenario → new step definition → implementation → test green

References

  • RFC 1813: NFS Version 3 Protocol Specification
  • RFC 7862: NFS Version 4.2 Protocol
  • RFC 5531: ONC RPC Version 2
  • RFC 4506: XDR: External Data Representation Standard
  • AWS S3 API Reference
  • ADR-013: POSIX Semantics Scope
  • ADR-014: S3 API Scope

ADR-024: Device Management, Storage Tiers, and Capacity Thresholds

Status: Accepted (19/19 device-management BDD scenarios pass). Date: 2026-04-20. Deciders: Architect + domain expert.

Context

The current design (ADR-005) defines three NVMe device classes but does not address:

  • HDD / spinning disk tiers (common in cost-optimized HPC clusters)
  • System partition vs data partition separation
  • Capacity thresholds and degradation behavior
  • Device health monitoring and proactive replacement
  • Memory-attached storage (CXL, persistent memory)
  • Mixed-tier deployments (SSD+HDD, fast-SSD+cheap-SSD)

Real HPC deployments often have:

  • System partition: RAID-1 (or RAID-1+0) on 2 SSDs for OS + Kiseki binaries + redb
  • Data partitions: JBOD — each NVMe/SSD/HDD is an independent pool member
  • Tiering: Hot data on fast NVMe, warm on cheap SSD, cold on HDD

Decision

Device classification

Extend DeviceClass to cover the full storage hierarchy:

Class                 Medium                Use case                      Typical capacity
NvmeU2                NVMe U.2 TLC/MLC      Metadata, hot data, Raft log  1-8 TB
NvmeQlc               NVMe QLC              Checkpoints, warm data        4-30 TB
NvmePersistentMemory  Intel Optane / CXL    Cache, ultra-hot metadata     128 GB - 1 TB
SsdSata               SATA SSD              Budget fast storage           1-8 TB
HddEnterprise         SAS/SATA HDD 10k/15k  Cold data, archive            4-20 TB
HddBulk               SATA HDD 7.2k         Deep archive, bulk cold       10-20 TB
Custom(String)        User-defined          Vendor-specific               Varies

Server disk layout

Server node:
├── System partition (RAID-1 on 2× SSD)
│   ├── /boot, /root, OS
│   ├── /var/lib/kiseki/redb/       ← Raft log, metadata index
│   └── /var/lib/kiseki/config/     ← Node config, certs
│
├── Data devices (JBOD, managed by Kiseki)
│   ├── /dev/nvme0n1 → pool "fast-nvme"  (device member)
│   ├── /dev/nvme1n1 → pool "fast-nvme"  (device member)
│   ├── /dev/sda     → pool "bulk-ssd"   (device member)
│   ├── /dev/sdb     → pool "cold-hdd"   (device member)
│   └── ...
│
└── Optional: CXL memory → pool "pmem" (hot cache tier)

JBOD for data, RAID-1 for system. Kiseki manages data durability via EC/replication across JBOD members. The system partition uses traditional RAID-1 because redb and Raft log must survive single-disk failure without Kiseki’s own repair mechanism.

Pool capacity management

Per-device-class capacity thresholds

Thresholds vary by device type because NVMe/SSD suffer GC-induced write amplification at high fill levels, while HDD does not. Enterprise arrays (VAST, Pure) can operate at 95%+ because they have global wear leveling — JBOD does not have that luxury.

State     NVMe/SSD  HDD      Behavior
Healthy   0-75%     0-85%    Normal writes, background rebalance
Warning   75-85%    85-92%   Log warning, emit telemetry
Critical  85-92%    92-97%   Reject new placements, advisory backpressure
ReadOnly  92-97%    97-99%   In-flight writes drain, no new writes
Full      97-100%   99-100%  ENOSPC to clients

Rationale: NVMe/SSD GC pressure increases sharply above ~80% fill. QLC is worse than TLC. The SSD Warning threshold (75%) gives the placement engine time to redirect before the GC cliff. HDD has no such cliff — outer-track vs inner-track difference is ~20%, not a performance wall.

Implementation:

pub enum PoolHealth {
    Healthy,
    Warning { used_percent: u8 },
    Critical { used_percent: u8 },
    ReadOnly { used_percent: u8 },
    Full,
}

pub struct CapacityThresholds {
    pub warning_pct: u8,
    pub critical_pct: u8,
    pub readonly_pct: u8,
    pub full_pct: u8,
}

impl CapacityThresholds {
    pub fn for_device_class(class: &DeviceClass) -> Self {
        match class {
            DeviceClass::NvmeU2 | DeviceClass::NvmeQlc
            | DeviceClass::NvmePersistentMemory | DeviceClass::SsdSata => Self {
                warning_pct: 75,
                critical_pct: 85,
                readonly_pct: 92,
                full_pct: 97,
            },
            DeviceClass::HddEnterprise | DeviceClass::HddBulk => Self {
                warning_pct: 85,
                critical_pct: 92,
                readonly_pct: 97,
                full_pct: 99,
            },
            DeviceClass::Custom(_) => Self {
                warning_pct: 80,
                critical_pct: 90,
                readonly_pct: 95,
                full_pct: 99,
            },
        }
    }
}

impl AffinityPool {
    pub fn health(&self) -> PoolHealth {
        let pct = ((self.used_bytes * 100) / self.capacity_bytes) as u8;
        // Thresholds come from the pool's device class (see table above);
        // this assumes the pool records its device_class.
        let t = CapacityThresholds::for_device_class(&self.device_class);
        match pct {
            p if p < t.warning_pct => PoolHealth::Healthy,
            p if p < t.critical_pct => PoolHealth::Warning { used_percent: p },
            p if p < t.readonly_pct => PoolHealth::Critical { used_percent: p },
            p if p < t.full_pct => PoolHealth::ReadOnly { used_percent: p },
            _ => PoolHealth::Full,
        }
    }
}

Placement engine behavior:

  • Healthy: Place chunks according to affinity policy
  • Warning: Continue placing but emit telemetry; cluster admin should add capacity
  • Critical: Reject new placements; redirect to same device-class sibling only
  • ReadOnly: In-flight writes complete; new writes fail with retriable error
  • Full: ENOSPC — client gets permanent error

Pool redirection policy: When a pool is Critical, the placement engine redirects to another pool of the same device class only. Never cross device-class boundaries (e.g., never NVMe → HDD). If no same-class sibling has capacity, return ENOSPC to client. This preserves performance SLAs and compliance tag enforcement.
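The same-class-only rule can be sketched as a selection function. Types here are simplified stand-ins for the real placement engine structures, not the actual API:

```rust
// Simplified stand-ins for the placement engine's types (illustrative only).
#[derive(PartialEq, Clone)]
enum DeviceClass {
    NvmeU2,
    HddBulk,
}

struct Pool {
    id: &'static str,
    class: DeviceClass,
    free_bytes: u64,
    critical: bool, // at or above the Critical capacity threshold
}

/// Redirect to the same-class sibling with the most free space; ENOSPC if
/// no sibling qualifies. Never falls back to another device class.
fn redirect<'a>(pools: &'a [Pool], class: &DeviceClass) -> Result<&'a str, &'static str> {
    pools
        .iter()
        .filter(|p| &p.class == class && !p.critical && p.free_bytes > 0)
        .max_by_key(|p| p.free_bytes)
        .map(|p| p.id)
        .ok_or("ENOSPC")
}
```

Note the filter never inspects pools of another class: an NVMe pool running out of space produces ENOSPC rather than silently landing hot data on HDD.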

System partition

OS-managed RAID-1 on 2× SSD. Kiseki does not manage the RAID.

Kiseki monitors system partition health:

  1. On startup: check /proc/mdstat for RAID health
  2. If degraded → log WARNING, continue operating
  3. If both drives failed → log CRITICAL, refuse to start
  4. Periodic check every 60 seconds

Admin is responsible for replacing failed system drives and rebuilding the RAID. Kiseki trusts the OS for system partition durability.
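The startup check in step 1 amounts to scanning /proc/mdstat for a missing mirror member: a healthy two-disk RAID-1 shows `[UU]`, a degraded one shows `[U_]`. A deliberately naive, illustrative parser (not the actual monitoring code):

```rust
/// Naive /proc/mdstat scan: an underscore inside a member-status bracket
/// like "[U_]" marks a missing RAID member. Illustrative only.
fn md_degraded(mdstat: &str) -> bool {
    mdstat
        .lines()
        .filter(|l| l.contains("[U") || l.contains("[_"))
        .any(|l| l.contains('_'))
}
```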

Device health monitoring

Each device reports SMART/health metrics:

Metric             Threshold           Action
Temperature        >70°C               Warning; throttle if >80°C
Wear level (SSD)   >90% life used      Warning; proactive replacement window
Bad sectors (HDD)  >0 reallocated      Warning at 1; evacuate at >100
Latency            >10× baseline       Mark degraded; reduce placement priority
Errors             Uncorrectable read  Mark suspect; verify EC/replicas for affected chunks

Device states:

Healthy → Degraded → Failed → Removed
     ↘       ↗
   Evacuating → Removed

Eviction and evacuation policy

Key principle: Unhealthy devices are evacuated proactively, not waited on until failure. Full devices are write-blocked, not evicted (data is still readable).

Trigger                     Action                                              Automatic?        Priority
SMART wear >90% (SSD)       Evacuate — migrate chunks to other pool members     Yes (background)  Normal
Bad sectors >100 (HDD)      Evacuate — migrate before cascading failure         Yes (background)  High
Uncorrectable read error    Evacuate + EC repair for affected chunks            Yes (immediate)   Critical
Temperature >80°C           Throttle I/O, alert admin                           Yes               High
Device unresponsive         Mark Failed — trigger EC repair from survivors      Yes (immediate)   Critical
Pool at Critical threshold  Block writes — redirect to sibling pools            Yes               Normal
Pool at ReadOnly threshold  Drain writes — no new data, existing completes      Yes               Normal
Admin-initiated             Evacuate — controlled migration before physical removal  Manual       Normal

Evacuation process:

  1. Mark device Evacuating
  2. For each chunk on device: read fragment, write to another healthy device in pool
  3. Update chunk metadata (redb) with new placement
  4. When all chunks migrated: mark device Removed
  5. Admin can physically pull the device

Evacuation speed: Bounded by network and destination device throughput. At 1 GB/s NVMe write speed, a 4TB device evacuates in ~67 minutes. EC repair (from parity) is faster since only the missing fragments need reconstruction.

Invariant: A device in Evacuating state accepts no new writes but serves reads for chunks not yet migrated.

Storage backend per JBOD device

Raw block (ADR-029)
  Pros: Zero FS overhead, direct I/O, aligned writes, bitmap allocator with redb journal
  Cons: Custom allocator in kiseki-block
  Recommendation: Default — recommended for production

File-backed (ADR-029)
  Pros: Same DeviceBackend trait, works in VMs/CI without raw devices
  Cons: Slight overhead from host FS
  Recommendation: VMs and CI environments

xfs
  Pros: Scales to 100M+ files, good NVMe support
  Cons: Extra FS overhead, inode pressure at scale
  Recommendation: Legacy / deprecated

Default: Raw block device I/O via kiseki-block (DeviceBackend trait with auto-detection of device characteristics). File-backed fallback for VMs and CI. XFS is deprecated as a chunk storage backend; existing XFS deployments can migrate via background evacuation.

Device discovery

Manual configuration (MVP):

  • Admin provides device list in node config (kiseki-server.toml)
  • Each device: path, class, pool assignment
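A minimal sketch of such a device stanza in kiseki-server.toml. Field and table names here are hypothetical illustrations of the three required attributes, not the actual config schema:

```toml
# Hypothetical kiseki-server.toml device stanza (names illustrative).
[[device]]
path  = "/dev/nvme0n1"
class = "NvmeU2"
pool  = "fast-nvme"

[[device]]
path  = "/dev/sda"
class = "SsdSata"
pool  = "bulk-ssd"
```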

Future: Auto-discovery:

  • Scan /sys/block/ for NVMe/SSD/HDD devices

  • Classify by transport (NVMe, SATA, SAS) and media (rotational flag)

  • Present to admin for pool assignment confirmation

Device state definitions:

  • Healthy: Normal I/O

  • Degraded: Elevated errors or latency; reduce write priority

  • Evacuating: Admin-initiated; migrate chunks to other devices, then remove

  • Failed: I/O errors; trigger EC repair for all chunks

  • Removed: Device physically absent; metadata cleaned up

Tiering and data movement

Static placement (MVP): Admin assigns pools to device classes. Chunk placement is determined at write time by the composition’s view descriptor affinity policy. No automatic migration.

Future: Reactive tiering (per assumption A8):

  • Compositions with high read frequency auto-promote from cold → hot
  • Compositions with no reads for >N days auto-demote from hot → cold
  • Promotion/demotion as background job (copy chunk, update metadata, delete old)
  • Bounded by pool capacity thresholds (don’t overfill hot tier)

Data model changes

pub enum DeviceClass {
    NvmeU2,
    NvmeQlc,
    NvmePersistentMemory,
    SsdSata,
    HddEnterprise,
    HddBulk,
    Custom(String),
}

pub struct DeviceInfo {
    pub id: DeviceId,
    pub class: DeviceClass,
    pub path: String,          // /dev/nvme0n1 or mount point
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub state: DeviceState,
    pub pool_id: Option<String>,
}

pub enum DeviceState {
    Healthy,
    Degraded { reason: String },
    Evacuating { progress_percent: u8 },
    Failed { since: u64 },
    Removed,
}

Consequences

  • Device diversity now first-class (HDD, SSD, NVMe, PMem)
  • Capacity management is explicit with defined thresholds
  • System partition (RAID-1) separated from data (JBOD)
  • Device health monitoring enables proactive replacement
  • Tiering is future work; static placement for MVP
  • Cluster admin must provision devices and assign to pools at setup time

References

  • ADR-005: EC and chunk durability (per pool)
  • ADR-022: Storage backend (redb on system partition)
  • Assumption A4: ClusterStor hardware
  • Assumption A8: Reactive tiering
  • Failure mode F-I2: Storage node failure
  • Failure mode F-I4: Disk/device failure

ADR-025: Storage Administration API

Status: Proposed. Date: 2026-04-20. Deciders: Architect + domain expert.

Context

Storage administrators need to performance-tune the system similar to Ceph (ceph osd pool set), VAST (management UI), or Lustre (lctl). The current control plane API handles tenant lifecycle but has no storage admin surface — no pool management, device management, performance tuning, or cluster-wide observability.

API-first principle: All admin interactions go through gRPC APIs. CLI (kiseki-cli), Web UI, and job orchestrators (Ansible, Terraform) are wrappers around these APIs. No SSH-and-edit-config path.

Decision

Admin API surface (new gRPC service)

service StorageAdminService {
  // === Device management ===
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);
  rpc GetDevice(GetDeviceRequest) returns (DeviceInfo);
  rpc AddDevice(AddDeviceRequest) returns (AddDeviceResponse);
  rpc RemoveDevice(RemoveDeviceRequest) returns (RemoveDeviceResponse);
  rpc EvacuateDevice(EvacuateDeviceRequest) returns (EvacuateDeviceResponse);
  rpc CancelEvacuation(CancelEvacuationRequest) returns (CancelEvacuationResponse);

  // === Pool management ===
  rpc ListPools(ListPoolsRequest) returns (ListPoolsResponse);
  rpc GetPool(GetPoolRequest) returns (PoolInfo);
  rpc CreatePool(CreatePoolRequest) returns (CreatePoolResponse);
  rpc SetPoolDurability(SetPoolDurabilityRequest) returns (SetPoolDurabilityResponse);
  rpc SetPoolThresholds(SetPoolThresholdsRequest) returns (SetPoolThresholdsResponse);
  rpc RebalancePool(RebalancePoolRequest) returns (RebalancePoolResponse);

  // === Performance tuning ===
  rpc GetTuningParams(GetTuningParamsRequest) returns (TuningParams);
  rpc SetTuningParams(SetTuningParamsRequest) returns (SetTuningParamsResponse);

  // === Cluster observability ===
  rpc ClusterStatus(ClusterStatusRequest) returns (ClusterStatus);
  rpc PoolStatus(PoolStatusRequest) returns (PoolStatus);
  rpc DeviceHealth(DeviceHealthRequest) returns (stream DeviceHealthEvent);
  rpc IOStats(IOStatsRequest) returns (stream IOStatsEvent);

  // === Shard management ===
  rpc ListShards(ListShardsRequest) returns (ListShardsResponse);
  rpc GetShard(GetShardRequest) returns (ShardInfo);
  rpc SplitShard(SplitShardRequest) returns (SplitShardResponse);
  rpc SetShardMaintenance(SetShardMaintenanceRequest) returns (SetShardMaintenanceResponse);

  // === Repair and scrub ===
  rpc TriggerScrub(TriggerScrubRequest) returns (TriggerScrubResponse);
  rpc RepairChunk(RepairChunkRequest) returns (RepairChunkResponse);
  rpc ListRepairs(ListRepairsRequest) returns (ListRepairsResponse);
}

Tuning parameters

Storage admins tune at four levels: cluster → pool → tenant → workload. Lower levels inherit from higher levels and may only narrow a setting, never broaden it.
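Narrow-only inheritance reduces to clamping a child's requested range inside its parent's. A minimal sketch, not the actual validation code:

```rust
/// Clamp a child level's requested (min, max) range inside its parent's,
/// so lower levels can tighten limits but never loosen them. Illustrative.
fn narrow(parent: (u64, u64), requested: (u64, u64)) -> (u64, u64) {
    (requested.0.max(parent.0), requested.1.min(parent.1))
}
```

For example, a workload requesting a broader IOPS range than its tenant allows is silently clamped back to the tenant's range.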

Cluster-wide tuning

Parameter               Default   Range        What it controls
compaction_rate_mb_s    100       10-1000      Background compaction throughput cap
gc_interval_s           300       60-3600      How often GC scans for reclaimable chunks
rebalance_rate_mb_s     50        0-500        Background rebalance/evacuation throughput
scrub_interval_h        168 (7d)  24-720       How often integrity scrub runs
max_concurrent_repairs  4         1-32         Parallel EC repair jobs
stream_proc_poll_ms     100       10-1000      View materialization poll interval
inline_threshold_bytes  4096      512-65536    Below this, data inlined in delta
raft_snapshot_interval  10000     1000-100000  Entries between Raft snapshots

Per-pool tuning

Parameter                Default              Range      What it controls
ec_data_chunks           4 (NVMe) / 8 (HDD)   2-16       EC data fragment count
ec_parity_chunks         2 (NVMe) / 3 (HDD)   1-8        EC parity fragment count
replication_count        3                    2-5        For replication pools (not EC)
warning_threshold_pct    per ADR-024          50-95      Capacity warning level
critical_threshold_pct   per ADR-024          60-98      Capacity critical level
readonly_threshold_pct   per ADR-024          70-99      Read-only level
target_fill_pct          70 (SSD) / 80 (HDD)  50-90      Rebalance target fill level
chunk_alignment_bytes    4096                 512-65536  On-disk alignment (RDMA/NVMe)
prefer_sequential_alloc  true                 bool       Allocate sequentially in pool file

Per-tenant tuning (via ControlService, existing)

Parameter                   Existing API        What it controls
quota.capacity_bytes        SetQuota            Tenant capacity ceiling
quota.iops                  SetQuota            IOPS limit
quota.metadata_ops_per_sec  SetQuota            Metadata op rate limit
dedup_policy                CreateOrganization  Cross-tenant vs isolated dedup
compliance_tags             SetComplianceTags   Regulatory constraints

Per-workload tuning (via ControlService + Advisory)

Parameter                    API                What it controls
workload.quota               CreateWorkload     Workload-level capacity/IOPS
advisory.hints_per_sec       Advisory ceilings  Hint submission rate
advisory.prefetch_bytes_max  Advisory ceilings  Prefetch budget
advisory.profile             Advisory profiles  Allowed hint profiles

Observability API

ClusterStatus response

message ClusterStatus {
  uint32 node_count = 1;
  uint32 healthy_nodes = 2;
  uint64 total_capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 pool_count = 5;
  uint32 shard_count = 6;
  uint32 active_repairs = 7;
  uint32 evacuating_devices = 8;
  repeated PoolSummary pools = 9;
}

PoolStatus response

message PoolStatus {
  string pool_id = 1;
  PoolHealth health = 2;
  uint64 capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 device_count = 5;
  uint32 healthy_devices = 6;
  uint32 chunk_count = 7;
  // Performance metrics (rolling 60s window)
  double read_iops = 8;
  double write_iops = 9;
  double read_throughput_mb_s = 10;
  double write_throughput_mb_s = 11;
  double avg_read_latency_ms = 12;
  double avg_write_latency_ms = 13;
  double p99_read_latency_ms = 14;
  double p99_write_latency_ms = 15;
}

Streaming events

message DeviceHealthEvent {
  DeviceId device_id = 1;
  DeviceState old_state = 2;
  DeviceState new_state = 3;
  string reason = 4;
  uint64 timestamp_ms = 5;
}

message IOStatsEvent {
  string pool_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_throughput_mb_s = 4;
  double write_throughput_mb_s = 5;
  uint64 timestamp_ms = 6;
}

Admin personas and API mapping

Persona         Typical actions                                   APIs used
Cluster admin   Add/remove nodes, set cluster params, view health  StorageAdminService (all), ClusterStatus
Storage admin   Create pools, tune EC, set thresholds, rebalance   Pool*, SetTuningParams, PoolStatus
Tenant admin    Set quotas, compliance, retention, advisory        ControlService (existing)
Workload admin  Tune advisory, prefetch, dedup hints               Advisory (existing) + workload quota
On-call/SRE     View health, trigger repair, check alerts          ClusterStatus, DeviceHealth stream, TriggerScrub

CLI mapping (kiseki-cli)

kiseki cluster status              → ClusterStatus
kiseki pool list                   → ListPools
kiseki pool status fast-nvme       → PoolStatus
kiseki pool create --name bulk-hdd --class HddBulk --ec 8+3
kiseki pool tune fast-nvme --warning-pct 75 --target-fill 70
kiseki device list                 → ListDevices
kiseki device add /dev/nvme2n1 --pool fast-nvme
kiseki device evacuate dev-uuid    → EvacuateDevice
kiseki device health --watch       → DeviceHealth stream
kiseki tune set --compaction-rate 200 --gc-interval 120
kiseki shard list                  → ListShards
kiseki shard split shard-uuid      → SplitShard
kiseki repair scrub --pool fast-nvme
kiseki iostat --pool fast-nvme     → IOStats stream

Authorization model

API                          Who can call        Auth
StorageAdminService (all)    Cluster admin only  mTLS cert with admin OU
ControlService (tenant ops)  Tenant admin        mTLS cert with tenant OU
Advisory (workload ops)      Workload identity   mTLS cert + workflow token
Read-only observability      Cluster admin, SRE  mTLS cert with admin/sre OU

Tenant admins cannot access StorageAdminService. They see their own quotas and compliance tags, not pool health or device state. This preserves the zero-trust boundary (I-T4).
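The gate reduces to a check on the caller's certificate OU before dispatch. A hedged sketch — the enum and its variants are illustrative, not the actual authorization code:

```rust
// Illustrative stand-in for the identity extracted from the mTLS cert OU.
enum CallerOu {
    Admin,
    Sre,
    Tenant(String),
}

/// Gate for StorageAdminService: cluster admins get everything, SREs get
/// read-only observability, tenant identities are always rejected.
fn may_call_storage_admin(ou: &CallerOu, read_only: bool) -> bool {
    match ou {
        CallerOu::Admin => true,
        CallerOu::Sre => read_only,
        CallerOu::Tenant(_) => false,
    }
}
```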

Consequences

  • Full API-first admin surface — no SSH-and-edit needed
  • CLI, UI, automation all use the same gRPC APIs
  • Performance tuning at four levels with inheritance
  • Streaming observability for real-time monitoring
  • Clear authorization boundary between cluster admin and tenant admin
  • Significantly expands the gRPC surface (20+ new RPCs)

References

  • ADR-024: Device management and capacity thresholds
  • ADR-005: EC and chunk durability
  • ADR-020: Workflow advisory (workload-level tuning)
  • Ceph: ceph osd pool set command reference
  • Lustre: lctl set_param tunables
  • I-T4: Zero-trust infra/tenant boundary

Addendum: Adversarial Review Resolutions (2026-04-20)

C1: Per-tenant resource usage → ControlService, not StorageAdminService

Per-tenant resource usage (capacity, IOPS attribution) is exposed via ControlService with tenant-admin authorization, NOT via StorageAdminService. Cluster admin sees pool-level aggregates only. Tenant admin sees their own usage. This preserves I-T4.

// In ControlService (not StorageAdminService):
rpc GetTenantUsage(GetTenantUsageRequest) returns (TenantUsage);
// Requires tenant admin cert (mTLS OU = tenant ID)

C2: Per-device I/O stats added

rpc DeviceIOStats(DeviceIOStatsRequest) returns (stream DeviceIOStatsEvent);

message DeviceIOStatsEvent {
  string device_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_latency_p50_ms = 4;
  double read_latency_p99_ms = 5;
  double errors_per_sec = 6;
  uint64 timestamp_ms = 7;
}

C3: Shard health observability added

rpc GetShardHealth(GetShardHealthRequest) returns (ShardHealthInfo);

message ShardHealthInfo {
  string shard_id = 1;
  uint64 leader_node_id = 2;
  uint32 replica_count = 3;
  uint32 reachable_count = 4;
  uint32 recent_elections = 5;
  uint64 commit_lag_entries = 6;
}

C4: EC parameters immutable per pool

New invariant I-C6: EC parameters (data_chunks, parity_chunks) are immutable per pool. SetPoolDurability applies only to NEW chunks. Existing chunks retain their original EC configuration. Explicit re-encoding via ReencodePool RPC (long-running, cancellable).

C5: Compaction rate validation

Protobuf-level validation: compaction_rate_mb_s ∈ [10, 1000]. API rejects values outside range. Audit event on every change.

C6: Inline threshold is prospective

New invariant I-L9: A delta’s inlined payload is immutable after write. inline_threshold_bytes changes do NOT retroactively affect existing deltas. Old and new thresholds coexist in the log.

C7: RemoveDevice requires evacuated state

New invariant I-D5: RemoveDevice rejects if device state is not Removed (post-evacuation). Precondition: EvacuateDevice must complete first. Error code: DEVICE_NOT_EVACUATED.

C8: Pool modifications audited to affected tenants

New invariant I-T4c: Cluster admin modifications to pools containing tenant data (SetPoolDurability, EvacuateDevice) are audit-logged to the affected tenant’s audit shard. Tenant admin can review.

C9: Tuning change audit trail

New invariant I-A6: All tuning parameter changes via SetTuningParams are recorded in the cluster audit shard with parameter name, old value, new value, timestamp, and admin identity.

H5: SRE roles defined

Role                   Access
cluster-admin          Full StorageAdminService (read + write)
sre-on-call            Read-only: List*, Get*, Status, Health streams
sre-incident-response  SRE + TriggerScrub, RepairChunk

Enforced via mTLS certificate OU field.

M4: DrainNode added

rpc DrainNode(DrainNodeRequest) returns (stream DrainNodeProgress);

Internally evacuates all devices on the node, then removes them. Idempotent, safe to retry.

ADR-026: Raft Topology — Per-Shard on Fabric (Strategy A)

Status: Accepted. Date: 2026-04-20. Deciders: Architect + domain expert.

Context

Kiseki needs multi-node Raft for durability (I-L2) and failover. The cluster operates on a shared Slingshot fabric (200 Gbps per node) where control messages (Raft) and data (chunk I/O) share bandwidth.

Three strategies were evaluated:

  • A: Raft per shard, all traffic on fabric
  • B: Raft for metadata only, primary-copy for data (Ceph-like)
  • C: Multi-Raft with batched transport (TiKV-like)

Decision

Strategy A: Raft per shard, on the data fabric.

Start with A, add C’s batching optimization when monitoring shows it’s needed (>1000 connections per node).

Why this works

Raft traffic is negligible compared to data fabric capacity:

Scale       Shards  Groups/node  Heartbeat/node  Replication/node  % of 200 Gbps
10 nodes    100     30           78 KB/s         3 MB/s            <0.001%
100 nodes   1000    30           78 KB/s         3 MB/s            <0.001%
1000 nodes  10,000  30           78 KB/s         3 MB/s            <0.001%

Groups-per-node stays constant at ~30 because shard count scales with node count (each node hosts ~30 shard replicas regardless of cluster size).
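The Heartbeat/node column is a straight product. The message size and rate below are assumptions chosen to reproduce the table's figure, not measured values:

```rust
/// Back-of-envelope heartbeat bandwidth per node: groups hosted on the node
/// × followers per group × heartbeat rate × message size. The ~130 B
/// AppendEntries heartbeat and 10 Hz rate are assumptions for illustration.
fn heartbeat_bytes_per_sec(groups: u64, followers: u64, hz: u64, msg_bytes: u64) -> u64 {
    groups * followers * hz * msg_bytes
}
```

With 30 groups per node, 2 followers each, 10 heartbeats/s, and ~130 B per message, this gives 78,000 B/s, i.e. the 78 KB/s in the table, which is vanishingly small next to 200 Gbps.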

Key insight: Raft only for metadata

Chunk data does NOT go through Raft. The write path:

Large write:
  Client → Gateway → encrypt → chunk to NVMe (EC direct) → delta to Raft (1KB metadata)

Small write (<4KB):
  Client → Gateway → encrypt → inline in delta → Raft only

Raft replicates delta metadata (~1KB per operation). Chunk ciphertext (64KB-64MB) is written directly to NVMe devices via EC. This means:

  • Write throughput limited by NVMe/network, NOT by Raft
  • Raft consensus adds ~30-60µs (RDMA) or ~75-250µs (TCP) per metadata op
  • 50-100k metadata ops/sec per shard, shards in parallel
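The small/large split in the write path above is a single threshold comparison. A sketch, using the inline_threshold_bytes default (4096) from ADR-025:

```rust
// Illustrative routing decision for the write path described above.
enum WritePath {
    InlineInDelta,  // payload rides inside the Raft delta
    ChunkThenDelta, // EC ciphertext to NVMe first, ~1KB delta through Raft
}

fn route_write(len: usize, inline_threshold: usize) -> WritePath {
    if len < inline_threshold {
        WritePath::InlineInDelta
    } else {
        WritePath::ChunkThenDelta
    }
}
```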

Projected performance vs competition

Metric               Kiseki (projected)  Lustre     Ceph        GPFS
Write GB/s /node     25-40               5-12       1-3         5-15
Read GB/s /node      40-50               10-20      3-8         10-30
Write latency        30-250µs            100-500µs  500-2000µs  100-300µs
Metadata IOPS /node  1.5-3M              50-100k    10-50k      200k

Raft group configuration

Raft group                Members  Where
Key manager               3-5      Dedicated keyserver nodes
Log shard (per shard)     3        Spread across storage nodes
Audit shard (per tenant)  3        Spread across storage nodes

Placement rule: no two members of the same group on the same node (or same rack if rack-aware placement is configured).

Transport

Phase                 Transport                       Optimization
Phase 1 (now)         TCP + TLS                       Direct connections, one per Raft peer
Phase 2 (10+ nodes)   TCP + TLS + connection pooling  Reuse connections across groups
Phase 3 (100+ nodes)  Batched transport (Strategy C)  Coalesce heartbeats per node pair
Future                Slingshot CXI / RDMA            Sub-10µs Raft RTT

Election storm mitigation

Correlated failure (rack power loss) causes simultaneous elections for all Raft groups on affected nodes (~30 groups per node × N nodes).

Mitigations:

  1. Randomized election timeouts: openraft already does this (150-300ms jitter)
  2. Staggered group startup: on node restart, groups start elections over a 5-second window (not all at once)
  3. Leader sticky: prefer re-electing the same leader if it recovers within the election timeout (avoids unnecessary leader changes)
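Mitigation 2 can be sketched as deriving a deterministic per-group delay inside the 5-second window, so a restarting node spreads its ~30 first election timers instead of firing them together. The mixing constant is an arbitrary choice for illustration:

```rust
/// Deterministic per-group startup delay in [0, window_ms): the same group
/// always gets the same slot, so restarts stay spread out. The multiplier
/// is a Fibonacci-hashing constant; any good integer mixer works.
fn startup_delay_ms(group_id: u64, window_ms: u64) -> u64 {
    group_id.wrapping_mul(0x9E37_79B9_7F4A_7C15) % window_ms
}
```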

Network requirements

Network                           Purpose                     Kiseki traffic
Data fabric (Slingshot/ethernet)  Chunk I/O + Raft            99.99% data, 0.01% Raft
Management network (if available) ControlService, monitoring  Optional: route Raft here to fully isolate

Management network is NOT required. Raft on the fabric is fine because the overhead is <0.001% of capacity. If a management network exists (common in HPC), Raft CAN be routed there for belt-and-suspenders isolation, but it’s not necessary.

Consequences

  • Simplest implementation: use openraft’s built-in TCP transport
  • No separate management network required (but can use one)
  • Scales to ~10k shards / 1000 nodes without transport optimization
  • Add batching (Strategy C) as a pure transport optimization later
  • Election storms during correlated failure are bounded by randomized timeouts
  • Raft adds ~30-250µs to metadata write latency (acceptable for HPC)

Migration path

If Strategy A proves insufficient at extreme scale:

  1. Add batched transport (C) — pure transport change, no protocol change
  2. If even C is insufficient, partition shards into metadata-Raft and data-EC groups (B) — larger refactor but data model already supports it

References

  • ADR-005: EC and chunk durability
  • ADR-022: Storage backend (redb)
  • ADR-024: Device management
  • TiKV Multi-Raft: https://tikv.org/deep-dive/scalability/multi-raft/
  • openraft: https://datafuselabs.github.io/openraft/
  • Slingshot fabric: ~5-10µs RTT, 200 Gbps per endpoint

ADR-027: Single-Language Implementation — Rust Only

Status: Accepted (implemented 2026-04-21, Go code removed). Date: 2026-04-20 (proposed), 2026-04-21 (accepted + migrated). Supersedes: the implicit language split in docs/analysis/design-conversation.md §2.13; no prior ADR recorded the Rust/Go decision.

Context

Kiseki’s original design split the implementation across two languages:

  • Rust for the core (log, chunks, views, native client, hot paths)
  • Go for the control plane (tenancy, IAM, policy, flavor, federation, audit export, CLI) and one half each of two cross-cutting contexts (kiseki-audit + control/pkg/audit; kiseki-advisory + control/pkg/advisory)
  • gRPC/protobuf as the boundary

The split was recorded in docs/analysis/design-conversation.md §2.13 but never promoted to an ADR. It surfaces in specs/architecture/module-graph.md (Go modules section), .claude/coding/go.md, and in two contexts that are currently split across both languages. The split pre-dates ADR-001 (pure-Rust, no Mochi/FFI), which already identified “FIPS compliance surface across two languages” as a cost.

At proposal time, 1,490 lines of Go business logic existed with 32/32 BDD scenarios passing (godog, Strict:true). The migration ported all 32 scenarios to cucumber-rs backed by a new kiseki-control Rust crate (~650 lines, 10 modules) before deleting the Go code. See specs/implementation/adr027-go-to-rust-migration.md for the migration plan and specs/findings/adr027-adversarial.md for the gate-1 review.

Decision

Implement Kiseki in Rust only. Retire the Go control plane, the Go CLI, and the Go halves of audit and advisory. Keep gRPC/protobuf as the wire boundary between the control plane and data plane so that a future non-Rust control plane remains possible.

Concretely:

  1. New Rust crates replace the Go packages one-for-one:
    • kiseki-control — control plane daemon (tenancy, IAM, policy, flavor, federation, audit export, discovery)
    • kiseki-cli — admin CLI
    • The control/pkg/audit half is absorbed into kiseki-audit
    • The control/pkg/advisory half is absorbed into kiseki-advisory
  2. gRPC/protobuf stays as the wire boundary. kiseki-control serves ControlService, AuditExportService, and policy endpoints over gRPC. kiseki-server consumes them as a client. No in-process shortcut across the boundary, even though both sides are now Rust.
  3. Architectural firewall is enforced by crate dependencies, not by language. kiseki-control and kiseki-cli depend only on kiseki-common and kiseki-proto. They MUST NOT depend on any data-path crate (kiseki-log, kiseki-chunk, kiseki-composition, kiseki-view, kiseki-gateway-*, kiseki-client, kiseki-keymanager). Enforced by a cargo-deny or workspace-level architectural lint at CI.
  4. Control plane binaries live alongside data-plane binaries in crates/bin/:
    • bin/kiseki-control/ (new)
    • bin/kiseki-cli/ (new)
  5. gRPC server framework: tonic (already the Rust-side choice). Config: figment or config-rs for layered YAML/env overrides (parity with Go’s viper pattern).
  6. Federation / state machine: kiseki-control uses openraft (already the project’s Raft choice per ADR-026) for replicated control-plane state (policy, opt-out state, tenant topology). This also eliminates the second Raft vendor that a Go control plane would have required (etcd client or dragonboat).

Rationale

One domain model

specs/ubiquitous-language.md defines Tenant, Org, Project, Workload, RetentionHold, Policy, Flavor, WorkflowRef, OperationAdvisory. Every one of these would otherwise need two implementations (Rust enums/structs plus Go types). Two implementations drift: fields get renamed on one side, validation differs subtly, invariants are enforced on one side only. Consolidating removes the class of bug where control-plane Go accepts a name that data-path Rust rejects (or vice versa).
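Consolidation also gives each domain type a single parse-point that both planes share. A minimal std-only sketch; the type name is from the spec, but the validation rules and API shown here are illustrative assumptions, not the real kiseki-common code:

```rust
/// Hypothetical shared tenant-id newtype: parse once, use everywhere.
#[derive(Debug, Clone, PartialEq)]
pub struct TenantId(String);

#[derive(Debug, PartialEq)]
pub enum TenantIdError {
    Empty,
    TooLong,
    BadChar(char),
}

impl TenantId {
    /// Single source of truth for validation; the control plane and the
    /// data path both call this same function, so they cannot disagree.
    pub fn parse(raw: &str) -> Result<Self, TenantIdError> {
        if raw.is_empty() {
            return Err(TenantIdError::Empty);
        }
        if raw.len() > 63 {
            return Err(TenantIdError::TooLong);
        }
        match raw
            .chars()
            .find(|c| !(c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '-'))
        {
            Some(c) => Err(TenantIdError::BadChar(c)),
            None => Ok(TenantId(raw.to_string())),
        }
    }
}

fn main() {
    assert!(TenantId::parse("org-alpha-01").is_ok());
    assert_eq!(
        TenantId::parse("Org!").unwrap_err(),
        TenantIdError::BadChar('O')
    );
}
```

In the two-language design, the same rules would live twice (Rust and Go) and could silently diverge.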

One error taxonomy

specs/architecture/error-taxonomy.md enumerates retriable / permanent / security error categories. A Go control plane would have to mirror the Rust taxonomy as Go types plus gRPC status mappings. One language means one thiserror-derived enum hierarchy and one mapping to tonic::Status.
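A std-only sketch of what the single taxonomy buys. Variant names and the classification are illustrative assumptions; the real crate would use thiserror and return tonic::Status, but the shape of "one enum, one mapping" is the point:

```rust
/// Illustrative error variants, one per taxonomy category.
#[derive(Debug)]
pub enum KisekiError {
    ShardUnavailable, // retriable
    TenantNotFound,   // permanent
    KekUnwrapDenied,  // security
}

#[derive(Debug, PartialEq)]
pub enum ErrorClass {
    Retriable,
    Permanent,
    Security,
}

impl KisekiError {
    /// One classification, shared by control plane and data path.
    pub fn class(&self) -> ErrorClass {
        match self {
            KisekiError::ShardUnavailable => ErrorClass::Retriable,
            KisekiError::TenantNotFound => ErrorClass::Permanent,
            KisekiError::KekUnwrapDenied => ErrorClass::Security,
        }
    }

    /// Numeric gRPC status code (kept dependency-free here; a real
    /// implementation would construct tonic::Status from this).
    pub fn grpc_code(&self) -> u32 {
        match self.class() {
            ErrorClass::Retriable => 14, // UNAVAILABLE
            ErrorClass::Permanent => 5,  // NOT_FOUND
            ErrorClass::Security => 7,   // PERMISSION_DENIED
        }
    }
}

fn main() {
    assert_eq!(KisekiError::ShardUnavailable.grpc_code(), 14);
    assert_eq!(KisekiError::KekUnwrapDenied.class(), ErrorClass::Security);
}
```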

Smaller FIPS surface

ADR-001 already cited “FIPS compliance surface across two languages” as a reason to reject C/C++ FFI. The same cost applies to Go: either BoringCrypto (Go’s FIPS module) is part of the certification boundary, or the control plane sits outside the FIPS module boundary and the certification scope has to be carefully drawn. Rust-only gives one aws-lc-rs FIPS module boundary for the whole system.

Cross-context crates stop being split

kiseki-audit and kiseki-advisory are currently split across Rust and Go. That means two queue implementations, two filter implementations, two sets of integration tests, two ways that tenant-scope validation can drift. In Rust-only, each is one crate with one set of invariants.

Eliminated toolchain duplication

Today’s per-commit gate has to run: cargo fmt, clippy, cargo-deny, cargo test and go fmt, go vet, golangci-lint, go test -race. Rust-only halves the CI configuration, halves the local developer setup, and removes one supply-chain audit surface (Go module proxy + checksum DB alongside crates.io).

Reuse of kiseki-common and kiseki-proto

The CLI and control plane can import the real domain types rather than regenerated protobuf Go structs. Validation logic written once in kiseki-common (e.g., tenant-id parsing, flavor matching, policy inheritance) is reused verbatim in the control plane and the CLI.

Build-phase cost is low now

At proposal time, Phase 0 had not started. Adding two Rust crates (kiseki-control, kiseki-cli) is cheaper than maintaining a separate control/ Go module, its go.mod, its generated proto outputs, and its CI lane. The cost rises monotonically with every phase that ships Go code.

Hiring and cognitive load

Contributors need one language, one async runtime (tokio), one tracing stack, one error model. Code review crosses fewer idiom boundaries. Onboarding doc shrinks.

Alternatives considered

  1. Keep Go as specified.

    • Pro: Go’s ecosystem for control planes (cobra, viper, operator-sdk, client-go patterns) is the golden path; k8s, etcd, Consul all use it. GC is fine on cold paths. Operators extending the system are more likely to know Go.
    • Pro: the language wall is the architectural wall — the Go control plane physically cannot reach into data-plane memory or internals.
    • Con: every benefit above comes with the duplication, drift, and FIPS-surface costs enumerated in “Rationale”. With only ~1,500 lines of Go written at proposal time, the ecosystem-maturity argument carried less weight than it would at a later stage.
  2. Port only the CLI to Rust, keep the Go control-plane daemon.

    • Pro: preserves Go for the longer-lived daemon code where operator-sdk patterns matter most. Low churn.
    • Con: doesn’t remove duplication for the split contexts (audit, advisory). Doesn’t shrink the FIPS surface. Doesn’t remove the second toolchain from CI. Half-measure.
  3. Rewrite the core in Go (single-language Go).

    • Rejected immediately: Go GC and lack of precise control over allocation and layout disqualify it from the hot data path at 200 Gbps per NIC. This inverts the original rationale for Rust in the core.
  4. Separate Rust crate per Go package, but share no runtime (same-language boundary still isolated by process).

    • Considered. Rejected: unnecessary. The isolation value of “separate OS process” is already provided by kiseki-control being a distinct binary. Running two daemons is orthogonal to the language question.
  5. Defer the decision until after Phase 3.

    • Rejected: the decision is cheapest to reverse now. Each build phase that ships Go code raises the cost of consolidation and lets duplication set in. The analyst already flagged the split without recording a decision; formalizing now is overdue.

Consequences

Positive

  • Single toolchain: cargo fmt, clippy, cargo-deny, cargo test, cargo audit. Lefthook configuration shrinks.
  • Single FIPS module boundary (aws-lc-rs).
  • Domain types (Tenant, Policy, RetentionHold, Flavor, WorkflowRef, OperationAdvisory) exist once in kiseki-common.
  • kiseki-audit and kiseki-advisory become whole crates rather than split halves. Their invariants (I-A1..I-A3, I-WA1..I-WA16) are enforced in one place.
  • kiseki-control can reuse openraft (ADR-026) for its replicated state rather than requiring a second Raft implementation (etcd/dragonboat).
  • No generated Go protobuf stubs to keep in sync; one generated tree under crates/kiseki-proto/.
  • CI matrix shrinks; no go test -race lane.

Negative

  • Loses the “language wall as architectural wall” property. Must be replaced with crate-graph enforcement (see “Enforcement” below). This is a discipline cost and must be tooled, not trusted.
  • Rust’s CLI/operator ecosystem (clap, tonic, figment) is less mature than Go’s (cobra, viper, operator-sdk). Some patterns (admission webhooks, CRD controllers) will require more bespoke code if we ever grow a k8s operator.
  • Contributors with Go-only platform experience face a higher barrier to writing control-plane extensions.
  • kiseki-control uses tokio for async I/O and is exposed to async-Rust complexity on request handlers (cancellation safety, 'static bounds) that Go handlers would not have had.
  • One-time rewrite cost for the control-plane spec surface (api-contracts.md, module-graph.md, .claude/coding/go.md → remove or archive, build-phases.md may need to re-sequence control-plane phases).

Enforcement (replacing the language wall)

The split previously enforced “control plane never reaches into data plane” structurally. In Rust-only, this is enforced by:

  1. Crate-graph rule. kiseki-control and kiseki-cli depend only on kiseki-common and kiseki-proto. This is asserted by a CI check that greps Cargo manifests, or by cargo-deny’s bans section, or by a custom workspace lint.
  2. No re-export shortcut. kiseki-common MUST NOT re-export internal types from data-path crates. This is already the case; restated here as a rule.
  3. gRPC boundary preserved. Even though both sides are now Rust, control-plane-to-data-plane traffic still goes through tonic over gRPC, not through in-process trait calls. This keeps the wire contract as the source of truth and preserves the option of a non-Rust control plane later.
  4. Runtime separation. kiseki-control runs as its own binary (bin/kiseki-control/), not as a library linked into kiseki-server. The isolation that process separation provides is preserved.
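The crate-graph rule (point 1) can be sketched as a small CI check. A minimal sketch, assuming manifests are read in as strings; substring matching also catches the kiseki-gateway-* family, and everything here (function names, the exact forbidden list handling) is illustrative:

```rust
/// Data-path crates that control-plane manifests must never depend on
/// (list taken from the enforcement rule above; matching is by prefix,
/// so "kiseki-gateway" covers all kiseki-gateway-* crates).
const FORBIDDEN: &[&str] = &[
    "kiseki-log",
    "kiseki-chunk",
    "kiseki-composition",
    "kiseki-view",
    "kiseki-gateway",
    "kiseki-client",
    "kiseki-keymanager",
];

/// Returns the forbidden crates named after the [dependencies] header
/// of a Cargo.toml, if any. A real check would parse TOML properly.
fn firewall_violations(manifest: &str) -> Vec<&'static str> {
    let deps = manifest.split("[dependencies]").nth(1).unwrap_or("");
    FORBIDDEN
        .iter()
        .copied()
        .filter(|c| deps.contains(*c))
        .collect()
}

fn main() {
    let good = "[package]\nname = \"kiseki-control\"\n\
                [dependencies]\nkiseki-common = \"0.1\"\nkiseki-proto = \"0.1\"\n";
    let bad = "[package]\nname = \"kiseki-control\"\n\
               [dependencies]\nkiseki-chunk = \"0.1\"\n";
    assert!(firewall_violations(good).is_empty());
    assert_eq!(firewall_violations(bad), vec!["kiseki-chunk"]);
}
```

cargo-deny's bans section or a workspace lint would replace this grep-level check with something structural, as the open items note.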

Migration

The code migration is covered by specs/implementation/adr027-go-to-rust-migration.md (see Context); the remaining migration is a spec update:

  1. docs/analysis/design-conversation.md §2.13: annotate with a pointer to this ADR.
  2. specs/architecture/module-graph.md: delete the “Go modules (control plane)” section; add the new Rust crates (kiseki-control, kiseki-cli) and update the “Bounded context → module mapping” table to say Rust for every row.
  3. specs/architecture/build-phases.md: review Phase sequencing — the Go control-plane phase collapses into a Rust phase; audit/advisory phases no longer have a “Go side” task.
  4. .claude/CLAUDE.md and .claude/guidelines/go.md: remove Go from the workflow router; keep .claude/coding/go.md archived (move to specs/archive/ or delete) as a historical record.
  5. .claude/coding/rust.md: add a “control plane” section describing kiseki-control/kiseki-cli conventions (config with figment, CLI with clap, server with tonic + axum for any REST admin surface).
  6. Makefile (when it exists): drop Go lanes.
  7. specs/features/control-plane.feature: BDD scenarios remain; the step definitions move from godog to cucumber-rs.

Open items (escalated to adversary gate-1)

  • Verify the crate-graph rule (control plane depends only on kiseki-common/kiseki-proto) is enforceable with cargo-deny alone, or whether a custom workspace lint is needed.
  • Confirm cucumber-rs covers the Gherkin features that godog was planned to run, without step-definition regressions.
  • Confirm FIPS posture: aws-lc-rs covers the control-plane’s TLS needs (mTLS CA, admin endpoints) as well as the data-plane’s. No Go BoringCrypto equivalent is needed.
  • Verify that removing the Go language wall does not create a realistic path by which a control-plane code change accidentally links data-path crates. Propose a pre-merge check if manifest-grep is insufficient.
  • Decide the fate of control/pkg/discovery: if fabric discovery uses libfabric/CXI, it was already going to need a Rust FFI layer; confirm the Rust-only home for it is kiseki-control (or a new kiseki-discovery crate).

References

  • ADR-001: Pure Rust, No Mochi Dependency (FIPS surface precedent).
  • ADR-021: Workflow Advisory Architecture (defines the Rust+Go split for advisory that this ADR collapses).
  • ADR-026: Raft Topology — openraft is the Rust-side Raft; now also the control plane’s Raft.
  • docs/analysis/design-conversation.md §2.13 — original (now superseded) language-split rationale.
  • specs/architecture/module-graph.md — current two-language module layout (to be rewritten).
  • .claude/coding/go.md — Go coding standards (to be archived on acceptance).

ADR-028: External Tenant KMS Providers

Status: Accepted
Date: 2026-04-22
Context: I-K11, ADR-002, ADR-003, ADR-007
Adversarial review: 2026-04-22 (8 findings: 2H 5M 1L, all resolved)

Problem

ADR-002 defines a two-layer encryption model where tenant KEKs wrap access to system DEK derivation material. The current implementation hardcodes tenant KEK as a locally-managed [u8; 32] — there is no mechanism for tenants to bring their own key management infrastructure.

HPC and enterprise tenants require integration with their existing KMS:

  • Regulatory compliance (FIPS 140-2/3, Common Criteria, SOC 2)
  • Centralized key lifecycle management
  • Hardware-backed key storage (HSMs)
  • Audit trails in their own systems
  • Key escrow and disaster recovery under their own policies

Decision

Introduce a TenantKmsProvider trait with five backend implementations. Tenant KEK sourcing becomes pluggable per-tenant via control-plane configuration. The system key manager (ADR-007) remains unchanged — only the tenant KEK layer is externalized.

Provider Backends

# | Backend         | Type        | Standard               | Transport   | Material model
1 | Kiseki Internal | Built-in    | (none)                 | In-process  | Local
2 | HashiCorp Vault | Open source | Proprietary (Transit)  | HTTPS       | Local (cached)
3 | KMIP 2.1        | Standard    | OASIS KMIP, SP 800-57  | mTLS (TTLV) | Remote or local
4 | AWS KMS         | Cloud       | AWS Sig V4             | HTTPS       | Remote only
5 | PKCS#11 v3.0    | HSM         | OASIS PKCS#11          | Local (FFI) | Remote only (HSM)

Material model: “Local” = KEK material cached in Kiseki process memory. “Remote” = material never leaves the provider; all wrap/unwrap operations are remote calls. The trait fully encapsulates this distinction — callers never branch on provider type.

Provider 1: Kiseki Internal (default)

The existing behavior. Kiseki manages tenant KEKs internally. Suitable for deployments where tenants trust the operator or where external KMS is unavailable.

  • Tenant KEK generated internally on tenant creation
  • Stored in a separate Raft group from system master keys (independent compromise domain — see Security Considerations §6)
  • Rotation managed by Kiseki’s epoch mechanism
  • No external dependency

This is the zero-configuration default. Existing tenants and single-operator deployments use this without change.

Security trade-off: Internal mode does not provide the full two-layer security guarantee of ADR-002. A compromise of both the system key manager and the tenant key store (even though they are separate Raft groups) yields full access. Compliance-sensitive tenants should use an external provider where the tenant KEK is under the tenant’s own operational control.

Provider 2: HashiCorp Vault (Transit secrets engine)

Vault’s Transit engine provides encryption-as-a-service with key versioning that maps cleanly to Kiseki’s epoch model.

Operations mapping:

Kiseki operation | Vault API
wrap             | POST /transit/encrypt/:name (with context = AAD)
unwrap           | POST /transit/decrypt/:name (with context = AAD)
rotate           | POST /transit/keys/:name/rotate
rewrap           | POST /transit/rewrap/:name (server-side, no plaintext exposure)
destroy          | DELETE /transit/keys/:name (after enabling deletion)

Authentication methods (tenant-configurable):

  • TLS certificate — maps to Kiseki’s SPIFFE/mTLS identity
  • AppRole — role_id + secret_id for service authentication
  • Kubernetes — ServiceAccount JWT (for k8s-deployed Kiseki)
  • OIDC/JWT — external IdP token

Vault namespaces: Multi-tenant Vault deployments use namespaces to isolate tenant key material. The tenant’s Vault namespace is configured at onboarding.

Caching: Vault provider may optionally cache KEK material locally (fetched via POST /transit/datakey/plaintext/:name). When caching is disabled, all wrap/unwrap calls go through Vault directly. Caching mode is configurable per tenant.

Rust crate: vaultrs (maintained, async, supports Transit engine).

Provider 3: KMIP 2.1 (OASIS standard)

KMIP is the interoperability standard for enterprise key management. A single KMIP client covers: Thales CipherTrust Manager, IBM Security Guardium Key Lifecycle Manager, Fortanix SDKMS, Entrust KeyControl, NetApp StorageGRID KMS, Dell PowerProtect, and any KMIP-compliant HSM.

Relevant OASIS specifications:

  • KMIP Specification v2.1 (2019) — protocol and operations
  • KMIP Profiles v2.1 — conformance levels
  • KMIP Usage Guide v2.1 — implementation guidance

Operations mapping:

Kiseki operation       | KMIP operation
wrap                   | Encrypt with Correlation Value (AAD)
unwrap                 | Decrypt with Correlation Value (AAD)
rotate                 | ReKey, or Create + Activate + Revoke old
destroy (crypto-shred) | Destroy (state → Destroyed, irrecoverable)

Transport: TTLV (Tag-Type-Length-Value) binary encoding over mTLS. The KMIP spec mandates mutual TLS with X.509 certificates.

Key object attributes: KMIP keys carry rich metadata — Cryptographic Algorithm, Cryptographic Length, State (Pre-Active/Active/Deactivated/Compromised/Destroyed), Activation Date, Deactivation Date. These map to Kiseki’s EpochInfo (is_current, migration_complete).

Material model: Depends on KMIP server configuration. Some servers allow Get to extract key material (local caching). Others enforce non-extractable keys (remote-only wrap/unwrap). The provider detects this via the Extractable attribute (KMIP's analogue of PKCS#11's CKA_EXTRACTABLE) and adapts.

Rust implementation: No mature KMIP crate exists. Implement a minimal KMIP client covering the Symmetric Key Foundry Client profile (KMIP Profiles v2.1 §4.1). The wire format (TTLV) is straightforward — ~1500 lines for the operations Kiseki needs.
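To give a sense of why the TTLV format is tractable, here is a sketch of the encoding as described in the KMIP spec: a 3-byte tag, a 1-byte type, a 4-byte big-endian length, then the value zero-padded to an 8-byte boundary. The tag constant and function names are examples, not a proposed API:

```rust
/// Encode one TTLV item: tag (3 bytes), type (1 byte),
/// big-endian length (4 bytes), value padded to 8-byte alignment.
fn ttlv_item(tag: [u8; 3], item_type: u8, value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + value.len() + 7);
    out.extend_from_slice(&tag);
    out.push(item_type);
    out.extend_from_slice(&(value.len() as u32).to_be_bytes());
    out.extend_from_slice(value);
    // Pad the value with zeros to the next 8-byte boundary.
    while out.len() % 8 != 0 {
        out.push(0);
    }
    out
}

/// KMIP Integer: type 0x02, 4-byte big-endian value, padded to 8.
fn ttlv_integer(tag: [u8; 3], v: i32) -> Vec<u8> {
    ttlv_item(tag, 0x02, &v.to_be_bytes())
}

fn main() {
    // Example: an integer field with a KMIP-style 0x42xxxx tag.
    let enc = ttlv_integer([0x42, 0x00, 0x6A], 2);
    assert_eq!(enc.len(), 16); // 8-byte header + 4-byte value + 4 pad
    assert_eq!(&enc[..8], &[0x42, 0x00, 0x6A, 0x02, 0x00, 0x00, 0x00, 0x04]);
}
```

Structures nest by making the value itself a concatenation of TTLV items, which is what keeps a minimal client in the ~1500-line range.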

Provider 4: AWS KMS (cloud KMS exemplar)

AWS KMS as the reference cloud implementation. Azure Key Vault and GCP Cloud KMS follow the same adapter pattern.

Operations mapping:

Kiseki operation | AWS KMS API
wrap             | Encrypt (with EncryptionContext = AAD)
unwrap           | Decrypt (with EncryptionContext = AAD)
rotate           | CreateKey + CreateAlias (manual) or EnableKeyRotation (automatic annual)
rewrap           | ReEncrypt (server-side, no plaintext exposure)

Key difference: With cloud KMS, the KEK material never leaves the cloud provider. Kiseki sends the derivation parameters (epoch + chunk_id) to KMS for wrapping/unwrapping. This is strictly more secure than local caching but adds network latency per operation.

Caching strategy: Kiseki caches the unwrapped derivation parameters (not the KEK itself, which never leaves KMS). The existing KeyCache TTL mechanism applies — after TTL expiry, a new Decrypt call to KMS is required.

Auth: IAM role assumption via STS, instance metadata, or environment credentials. For Azure: AAD/Managed Identity. For GCP: service account key or Workload Identity.

Rust crates: aws-sdk-kms, azure_security_keyvault, google-cloud-kms (all maintained, async).

Provider 5: PKCS#11 v3.0 (HSM direct)

For tenants with on-premises HSMs (Thales Luna, Utimaco, nCipher, YubiHSM). PKCS#11 is the standard C API for cryptographic tokens.

Relevant standards:

  • OASIS PKCS#11 v3.0 (2020) — Cryptographic Token Interface
  • PKCS#11 Profiles v3.0 — baseline/extended profiles

Operations mapping:

Kiseki operation | PKCS#11 function
wrap             | C_WrapKey (AES-KWP per RFC 5649, with pParameter = AAD)
unwrap           | C_UnwrapKey
rotate           | C_GenerateKey + C_DestroyObject (old, after migration)
destroy          | C_DestroyObject

Material model: Remote only. HSM keys are CKA_SENSITIVE and CKA_EXTRACTABLE=FALSE by default — material never leaves the HSM. All wrap/unwrap operations execute on the HSM hardware. Kiseki caches unwrapped derivation parameters (same as cloud KMS model).

Transport: Local — PKCS#11 is a C shared library (.so/.dylib) loaded via FFI. The HSM may be network-attached (e.g., Luna Network HSM), but the PKCS#11 interface is local to the host.

Rust crate: cryptoki (maintained, wraps PKCS#11 C API).

Trait Interface

use async_trait::async_trait;
use zeroize::Zeroizing;

/// Provider for tenant key encryption keys (KEKs).
///
/// Each tenant configures exactly one provider. The provider handles
/// authentication, key lifecycle, and wrapping/unwrapping operations.
/// The trait fully encapsulates the provider's material model — callers
/// never need to know whether wrapping happens locally or remotely.
///
/// Providers that cache KEK material locally (Internal, Vault) manage
/// their own cache internally. Providers where material never leaves
/// the backend (AWS KMS, PKCS#11) perform remote wrap/unwrap calls.
/// The caller's code path is identical in both cases.
#[async_trait]
pub trait TenantKmsProvider: Send + Sync {
    /// Wrap DEK derivation parameters (epoch + chunk_id) with the
    /// tenant KEK. The `aad` binds the wrapped ciphertext to its
    /// envelope context (typically chunk_id), preventing splice attacks.
    /// Returns opaque ciphertext stored in the envelope.
    async fn wrap(
        &self,
        tenant: &OrgId,
        plaintext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Unwrap DEK derivation parameters from envelope ciphertext.
    /// The `aad` must match the value used during wrapping.
    async fn unwrap(
        &self,
        tenant: &OrgId,
        ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Zeroizing<Vec<u8>>, KmsProviderError>;

    /// Rotate the tenant KEK to a new version/epoch.
    /// Returns the new provider-specific epoch identifier.
    async fn rotate(
        &self,
        tenant: &OrgId,
    ) -> Result<KmsEpochId, KmsProviderError>;

    /// Re-wrap ciphertext from old key version to current version
    /// without exposing plaintext (server-side re-wrap where supported).
    /// Falls back to unwrap + wrap if the provider doesn't support
    /// server-side re-wrap. The `aad` is preserved across the re-wrap.
    async fn rewrap(
        &self,
        tenant: &OrgId,
        old_ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Destroy the tenant KEK (crypto-shred). Irrecoverable.
    /// Also purges any locally cached material for this tenant.
    async fn destroy(
        &self,
        tenant: &OrgId,
    ) -> Result<(), KmsProviderError>;

    /// Check provider health and connectivity.
    async fn health(&self) -> KmsHealthStatus;

    /// Provider name for logging and diagnostics (never includes
    /// credentials or key material).
    fn provider_name(&self) -> &'static str;
}

AAD usage: Callers pass chunk_id.as_bytes() as aad for per-chunk envelope wrapping. Each provider maps aad to its native authenticated context mechanism:

Provider | AAD mechanism
Internal | AES-256-GCM additional data (existing "kiseki-tenant-wrap-v1" prefix + aad)
Vault    | Transit context parameter (base64-encoded)
KMIP     | Correlation Value attribute on Encrypt/Decrypt
AWS KMS  | EncryptionContext key-value map ({"chunk_id": "<hex>"})
PKCS#11  | pParameter field in mechanism struct
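A toy illustration of why the AAD binding blocks splice attacks. XOR and a std hasher stand in for AES-GCM here, so this is not real cryptography; the provider struct and method names are illustrative only:

```rust
use std::collections::hash_map::DefaultHasher;
use std::convert::TryInto;
use std::hash::{Hash, Hasher};

/// Mock provider: the "authentication tag" covers plaintext AND aad,
/// mimicking how AES-GCM binds additional authenticated data.
struct MockProvider {
    kek: u64,
}

impl MockProvider {
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Vec<u8> {
        let tag = self.tag(plaintext, aad);
        let mut out: Vec<u8> = plaintext.iter().map(|b| b ^ (self.kek as u8)).collect();
        out.extend_from_slice(&tag.to_be_bytes()); // tag binds the aad
        out
    }

    fn unwrap(&self, ciphertext: &[u8], aad: &[u8]) -> Result<Vec<u8>, &'static str> {
        let (body, tag_bytes) = ciphertext.split_at(ciphertext.len() - 8);
        let plaintext: Vec<u8> = body.iter().map(|b| b ^ (self.kek as u8)).collect();
        let tag = u64::from_be_bytes(tag_bytes.try_into().unwrap());
        if tag != self.tag(&plaintext, aad) {
            return Err("aad mismatch: envelope spliced onto the wrong chunk");
        }
        Ok(plaintext)
    }

    fn tag(&self, plaintext: &[u8], aad: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        (self.kek, plaintext, aad).hash(&mut h);
        h.finish()
    }
}

fn main() {
    let p = MockProvider { kek: 0xA5 };
    let env = p.wrap(b"derivation-params", b"chunk-001");
    assert!(p.unwrap(&env, b"chunk-001").is_ok());
    // Replaying the envelope under another chunk's id fails authentication.
    assert!(p.unwrap(&env, b"chunk-002").is_err());
}
```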

Tenant Configuration

Stored in the control plane (kiseki-control) per-tenant:

use zeroize::Zeroizing;

pub struct TenantKmsConfig {
    /// Provider type.
    pub provider: KmsProviderType,
    /// Provider-specific endpoint (URL, socket path, or "internal").
    pub endpoint: String,
    /// Authentication configuration. All secret fields use Zeroizing
    /// wrappers and implement Debug redaction (I-K8 extended).
    pub auth: KmsAuthConfig,
    /// Key identifier within the provider.
    pub key_name: String,
    /// Provider namespace (Vault namespace, KMIP group, KMS alias prefix).
    pub namespace: Option<String>,
    /// Cache TTL override (bounded by I-K15: 5s-300s).
    pub cache_ttl_secs: Option<u64>,
}

pub enum KmsProviderType {
    Internal,
    Vault,
    Kmip,
    AwsKms,
    AzureKeyVault,
    GcpCloudKms,
    Pkcs11,
}

/// Authentication configuration for external KMS providers.
///
/// All secret fields use `Zeroizing<String>` for automatic memory
/// clearing on drop. The `Debug` impl prints variant names only —
/// never credential contents (I-K8 extended to provider credentials).
pub enum KmsAuthConfig {
    /// Internal provider — no external auth needed.
    None,
    /// mTLS client certificate (KMIP, Vault TLS auth).
    TlsCert {
        cert_pem: String,
        key_pem: Zeroizing<String>,
    },
    /// Vault AppRole.
    AppRole {
        role_id: String,
        secret_id: Zeroizing<String>,
    },
    /// OIDC/JWT token (Vault, cloud providers).
    Oidc {
        token_endpoint: String,
        client_id: String,
    },
    /// AWS IAM role assumption.
    AwsIamRole {
        role_arn: String,
        region: String,
    },
    /// Azure Managed Identity or Service Principal.
    AzureIdentity {
        tenant_id: String,
        client_id: String,
    },
    /// GCP Service Account.
    GcpServiceAccount {
        credentials_json: Zeroizing<String>,
    },
    /// PKCS#11 library path + slot/pin.
    Pkcs11 {
        library_path: String,
        slot_id: u64,
        pin: Zeroizing<String>,
    },
}

I-K8 extended: KmsAuthConfig implements Debug with redaction:

use std::fmt;

impl fmt::Debug for KmsAuthConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::None => write!(f, "KmsAuthConfig::None"),
            Self::TlsCert { .. } => write!(f, "KmsAuthConfig::TlsCert(***)"),
            Self::AppRole { role_id, .. } => write!(f, "KmsAuthConfig::AppRole({})", role_id),
            // ... all variants redact secret fields
        }
    }
}

Caching and Fallback

The existing KeyCache (cache.rs) is reused for providers with local material. Remote-only providers (AWS KMS, PKCS#11) cache unwrapped derivation parameters instead.

Provider | What is cached                              | Cache miss action
Internal | KEK material (32 bytes)                     | Fetch from tenant key Raft store
Vault    | KEK material or nothing (configurable)      | POST /transit/decrypt
KMIP     | KEK material or nothing (depends on server) | Encrypt/Decrypt operation
AWS KMS  | Unwrapped derivation params                 | Decrypt API call
PKCS#11  | Unwrapped derivation params                 | C_UnwrapKey

I-K15 applies: cache TTL is bounded to [5s, 300s] regardless of provider. This bounds the exposure window: once a tenant crypto-shreds a key at the external KMS, any copy still in Kiseki's cache expires within at most 300s.
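The clamp can be sketched in a few lines. The bounds are from I-K15; the 60s default is an assumption for illustration, since the ADR only fixes the bounds:

```rust
use std::time::Duration;

const TTL_MIN: u64 = 5; // I-K15 lower bound
const TTL_MAX: u64 = 300; // I-K15 upper bound
const TTL_DEFAULT: u64 = 60; // assumed default; not specified in the ADR

/// Effective cache TTL: whatever the tenant configured (or the default),
/// clamped into the I-K15 window.
fn effective_ttl(configured_secs: Option<u64>) -> Duration {
    Duration::from_secs(configured_secs.unwrap_or(TTL_DEFAULT).clamp(TTL_MIN, TTL_MAX))
}

fn main() {
    assert_eq!(effective_ttl(None), Duration::from_secs(60));
    assert_eq!(effective_ttl(Some(1)), Duration::from_secs(5)); // clamped up
    assert_eq!(effective_ttl(Some(86_400)), Duration::from_secs(300)); // clamped down
}
```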

Provider unavailability:

  • Within TTL window: cached material serves reads (degraded mode)
  • Beyond TTL: reads fail with TenantKekUnavailable (retriable)
  • Writes always require fresh validation (no stale-cache writes)

Resilience (adversarial finding #5):

  • Circuit breaker per provider endpoint: open after 5 consecutive failures/timeouts, half-open probe every 30s
  • Jittered cache TTL: actual TTL = configured TTL ± 10% (random) to prevent synchronized expiry across storage nodes
  • Concurrency limit: max 10 concurrent KMS requests per tenant per storage node (backpressure, not queuing)
  • Timeout bounds: 2s connect timeout, 5s operation timeout for all network-based providers
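The per-endpoint breaker described above can be sketched with std types only. The thresholds (5 failures, 30s probe) come from the bullet list; the struct and method names are illustrative:

```rust
use std::time::{Duration, Instant};

const FAILURE_THRESHOLD: u32 = 5; // open after 5 consecutive failures
const PROBE_AFTER: Duration = Duration::from_secs(30); // half-open probe interval

#[derive(Clone, Copy)]
enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: State,
    consecutive_failures: u32,
}

impl Breaker {
    fn new() -> Self {
        Breaker { state: State::Closed, consecutive_failures: 0 }
    }

    /// May a KMS request be attempted right now?
    fn allow(&mut self) -> bool {
        if let State::Open { since } = self.state {
            if since.elapsed() >= PROBE_AFTER {
                self.state = State::HalfOpen; // allow a single probe
                true
            } else {
                false // fail fast instead of hammering the endpoint
            }
        } else {
            true
        }
    }

    fn record(&mut self, success: bool) {
        if success {
            self.state = State::Closed;
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= FAILURE_THRESHOLD {
                self.state = State::Open { since: Instant::now() };
            }
        }
    }
}

fn main() {
    let mut b = Breaker::new();
    for _ in 0..5 {
        assert!(b.allow());
        b.record(false);
    }
    assert!(!b.allow()); // breaker is open: requests are rejected locally
}
```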

I-K11 unchanged: Kiseki provides no escrow. If the tenant loses access to their external KMS and has no backup, their data is unrecoverable. This is documented and accepted.

Provider Migration

Changing a tenant’s KMS provider (e.g., Internal → Vault) requires re-wrapping all existing envelopes (adversarial finding #3):

  1. Provision new KEK in the target provider
  2. Configure the new provider as “pending” in control plane
  3. Background re-wrap: for each envelope, old_provider.unwrap() → new_provider.wrap() with the same AAD
  4. Track progress (same mechanism as epoch re-wrap: RewrapProgress)
  5. Once 100% re-wrapped, atomically switch active provider
  6. Decommission old provider KEK

During migration, reads use whichever provider matches the envelope’s tenant_epoch. The envelope carries a provider-version tag to disambiguate.

Constraint: Provider migration is an operator-initiated, audited action. It cannot be triggered by the tenant API alone.
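Step 3 of the migration can be sketched with simplified mock providers. The trait here is synchronous and infallible for illustration (the real TenantKmsProvider is async and fallible), and XOR stands in for actual key wrapping:

```rust
/// Simplified stand-in for the provider trait: sync, infallible, mock.
trait Kms {
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Vec<u8>;
    fn unwrap(&self, ciphertext: &[u8], aad: &[u8]) -> Vec<u8>;
}

/// Toy provider: XOR with a fixed byte (NOT cryptography).
struct XorKms {
    key: u8,
}

impl Kms for XorKms {
    fn wrap(&self, p: &[u8], _aad: &[u8]) -> Vec<u8> {
        p.iter().map(|b| b ^ self.key).collect()
    }
    fn unwrap(&self, c: &[u8], aad: &[u8]) -> Vec<u8> {
        self.wrap(c, aad) // XOR is its own inverse
    }
}

/// Re-wrap one envelope from the old provider to the new one.
/// The AAD (the chunk id) is carried across unchanged, as step 3 requires.
fn rewrap_envelope(old: &dyn Kms, new: &dyn Kms, env: &[u8], aad: &[u8]) -> Vec<u8> {
    let params = old.unwrap(env, aad);
    new.wrap(&params, aad)
}

fn main() {
    let (old_p, new_p) = (XorKms { key: 0x11 }, XorKms { key: 0x22 });
    let env = old_p.wrap(b"params", b"chunk-9");
    let migrated = rewrap_envelope(&old_p, &new_p, &env, b"chunk-9");
    // After migration, only the new provider can open the envelope.
    assert_eq!(new_p.unwrap(&migrated, b"chunk-9"), b"params");
}
```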

Crypto-Shred Interaction

Crypto-shred (tenant KEK destruction) behavior per provider:

Provider | Crypto-shred mechanism
Internal | Delete KEK from tenant key store; purge cache
Vault    | POST /transit/keys/:name/config with deletion_allowed=true, then DELETE /transit/keys/:name
KMIP     | Destroy operation (state → Destroyed, irrecoverable)
AWS KMS  | DisableKey (immediate, blocks all operations) + ScheduleKeyDeletion (permanent, 7-30 day window)
PKCS#11  | C_DestroyObject

AWS KMS: DisableKey is called immediately on crypto-shred to block all wrap/unwrap operations. ScheduleKeyDeletion follows for permanent destruction. The AWS-enforced waiting period (7-30 days, minimum 7) applies to permanent deletion only — the key is operationally dead from the moment DisableKey is called. The health() check reports supports_immediate_shred: true (via DisableKey) so tenants can verify crypto-shred SLA compliance at configuration time.

Security Considerations

  1. Credential protection: KMS auth credentials stored in the control plane are encrypted at rest with the system master key. All secret fields use Zeroizing<String> for memory protection. Debug implementations redact all credential content (I-K8 extended). Credentials are excluded from core dumps via MADV_DONTDUMP on the containing allocation.

  2. Network isolation: External KMS calls are made from storage nodes, not the control plane. This avoids routing tenant data through the control plane. mTLS is required for all network-based providers.

  3. Provider compromise: If a tenant’s external KMS is compromised, only that tenant’s data is at risk. System master keys and other tenants are unaffected (tenant isolation, I-T3).

  4. Mixed providers: Different tenants can use different providers. A single Kiseki cluster can serve tenants using Vault, AWS KMS, and internal management simultaneously.

  5. FIPS compliance: The HKDF derivation and AES-256-GCM encryption remain on Kiseki’s FIPS-validated aws-lc-rs module regardless of provider. The external KMS only handles the tenant KEK wrapping layer — the system encryption layer is always FIPS.

  6. Internal provider isolation: Tenant KEKs in Internal mode are stored in a separate Raft group from system master keys. This provides an independent compromise domain — system key manager compromise alone does not yield tenant KEKs, and vice versa. However, an operator with access to both stores has full access. Compliance-sensitive tenants should use an external provider where the KEK is under their own operational control.

Implementation Phases

  1. Phase K1: TenantKmsProvider trait + Internal backend (refactor current code to use the trait; no behavioral change)
  2. Phase K2: Vault backend (Transit engine, vaultrs crate)
  3. Phase K3: KMIP 2.1 backend (custom TTLV client, ~1500 lines)
  4. Phase K4: AWS KMS backend (aws-sdk-kms crate)
  5. Phase K5: PKCS#11 backend (cryptoki crate)

Phases K2-K5 are independent and can be built in any order.

Alternatives Considered

  1. BYOK (Bring Your Own Key) upload model: Tenant uploads raw key material to Kiseki. Rejected — defeats the purpose of external KMS (key material leaves tenant’s control boundary).

  2. Single cloud KMS only: Support only AWS KMS. Rejected — HPC customers are frequently on-premises or multi-cloud.

  3. KMIP only: Use KMIP as the sole external standard. Rejected — Vault and cloud KMS are too prevalent to ignore, and KMIP client implementation cost is non-trivial.

  4. No internal provider: Require all tenants to configure external KMS. Rejected — creates unnecessary deployment friction for simple or single-operator clusters.

  5. fetch_kek in trait interface: Original design included fetch_kek() -> Option<TenantKekMaterial> with None for cloud providers. Rejected after adversarial review — leaky abstraction that forces callers to branch on provider model. wrap/unwrap as the universal interface fully encapsulates the distinction.

Adversarial Review Findings (2026-04-22)

# | Severity | Finding                                        | Resolution
1 | High     | Credential fields as plaintext String          | Zeroizing<String> + Debug redaction
2 | High     | fetch_kek leaky abstraction                    | Removed; wrap/unwrap are universal
3 | Medium   | No provider migration path                     | Migration protocol documented
4 | Medium   | No AAD in wrap/unwrap                          | aad: &[u8] parameter added
5 | Medium   | No rate limiting/circuit breaker               | Circuit breaker + jitter + limits specified
6 | Medium   | PKCS#11 C_GetAttributeValue violates HSM model | Removed; HSM uses C_WrapKey/C_UnwrapKey only
7 | Medium   | Internal KEK co-located with system keys       | Separate Raft group for tenant KEKs
8 | Low      | AWS KMS 7-day deletion window                  | DisableKey immediate + ScheduleKeyDeletion deferred

Consequences

  • Adds kiseki-kms crate (or module within kiseki-keymanager)
  • Tenant key Raft group added (separate from system key manager)
  • Control plane gains KMS configuration endpoints
  • Each storage node needs network access to tenant KMS endpoints
  • KMIP requires custom protocol implementation (~1500 lines)
  • PKCS#11 requires unsafe FFI (contained within cryptoki crate)
  • Testing requires mock KMS servers (Vault dev mode, LocalStack, SoftHSM for PKCS#11)

ADR-029: Raw Block Device Allocator

Status: Accepted
Date: 2026-04-22
Adversarial review: 2026-04-22 (8 findings: 2H 4M 2L, all resolved)
Context: ADR-022, ADR-024, ADR-005, I-C1 through I-C6

Problem

Chunk ciphertext needs to persist on JBOD data devices. ADR-024 specifies XFS on each device as the default, but filesystem overhead becomes the bottleneck at HPC scale:

  • Double journaling: XFS journals its metadata, then redb journals ours — redundant durability cost
  • Page cache pollution: OS caches data we already manage in our own cache layer, wasting DRAM
  • Inode contention: Billions of chunks = billions of inodes; XFS metadata operations become the throughput ceiling
  • Indirection: Every I/O traverses VFS → XFS → block layer → device; raw access removes two layers

Ceph’s migration from FileStore (XFS) to BlueStore (raw block) was driven by exactly these issues. DAOS uses SPDK for the same reason.

Decision

New crate: kiseki-block

A device I/O crate that manages raw block devices (and file-backed fallback for VMs/CI). Separate from kiseki-chunk (domain logic). kiseki-chunk depends on kiseki-block for storage.

Device Backend Trait

/// Abstraction over a storage device — raw block or file-backed.
/// Auto-detects device characteristics and adapts I/O strategy.
#[async_trait]
pub trait DeviceBackend: Send + Sync {
    /// Allocate a contiguous extent of at least `size` bytes.
    /// Alignment matches the device's physical block size.
    fn alloc(&self, size: u64) -> Result<Extent, AllocError>;

    /// Write data at the given extent.
    fn write(&self, extent: &Extent, data: &[u8]) -> Result<(), BlockError>;

    /// Read data from the given extent.
    fn read(&self, extent: &Extent) -> Result<Vec<u8>, BlockError>;

    /// Free an extent, returning blocks to the free pool.
    fn free(&self, extent: &Extent) -> Result<(), AllocError>;

    /// Sync all pending writes to stable storage.
    fn sync(&self) -> Result<(), BlockError>;

    /// Device capacity: (used_bytes, total_bytes).
    fn capacity(&self) -> (u64, u64);

    /// Probed device characteristics (read-only after open).
    fn characteristics(&self) -> &DeviceCharacteristics;
}

Auto-detection (no manual configuration)

On DeviceManager::open(path), probe sysfs (Linux):

/sys/block/<dev>/queue/rotational         → 0 (SSD/NVMe) or 1 (HDD)
/sys/block/<dev>/queue/physical_block_size → 512 or 4096
/sys/block/<dev>/queue/optimal_io_size    → device-preferred I/O size
/sys/block/<dev>/queue/max_hw_sectors_kb  → max single I/O size
/sys/block/<dev>/device/model             → model string
/sys/block/<dev>/device/numa_node         → NUMA node (-1 if none)
/sys/block/<dev>/queue/discard_max_bytes  → TRIM support (>0 = yes)
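
The probed sysfs values might map to a detected medium roughly like this. A hedged sketch: the `classify` helper and its string-based signature are assumptions for illustration; real code would read the `/sys/block/<dev>/...` files listed above and handle I/O errors:

```rust
#[derive(Debug, PartialEq)]
pub enum DetectedMedium {
    NvmeSsd,
    SataSsd,
    Hdd,
    Virtual,
    Unknown,
}

/// Classify from raw probe results: the device name (e.g. "nvme0n1"),
/// the contents of queue/rotational ("0"/"1"), and the device/model string.
pub fn classify(dev_name: &str, rotational: &str, model: &str) -> DetectedMedium {
    // Virtio devices identify themselves in the model string.
    if model.to_ascii_lowercase().contains("virtio") {
        return DetectedMedium::Virtual;
    }
    match rotational.trim() {
        "1" => DetectedMedium::Hdd,
        "0" if dev_name.starts_with("nvme") => DetectedMedium::NvmeSsd,
        "0" => DetectedMedium::SataSsd,
        _ => DetectedMedium::Unknown,
    }
}
```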

Derived properties:

pub struct DeviceCharacteristics {
    pub medium: DetectedMedium,
    pub physical_block_size: u32,
    pub optimal_io_size: u32,
    pub rotational: bool,
    pub numa_node: Option<u32>,
    pub supports_trim: bool,
    pub supports_smart: bool,
    pub io_strategy: IoStrategy,
}

pub enum DetectedMedium {
    NvmeSsd,       // /sys/block/nvme*/ + rotational=0
    SataSsd,       // rotational=0, not NVMe
    Hdd,           // rotational=1
    Virtual,       // virtio in model, no SMART
    Unknown,
}

pub enum IoStrategy {
    DirectAligned,       // O_DIRECT | O_DSYNC — NVMe, SATA SSD
    BufferedSequential,  // O_SYNC — HDD (readahead benefits)
    FileBacked,          // Default flags — VM, dev, CI
}

For non-Linux / VMs without sysfs: detect virtio in model string or absence of block device properties → fall back to IoStrategy::FileBacked with sparse file. All three strategies implement the same DeviceBackend trait transparently.
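
The fallback logic above can be sketched as a small selection function. The boolean-flag signature is an assumption for brevity — the real selection would consult the full DeviceCharacteristics:

```rust
#[derive(Debug, PartialEq)]
pub enum IoStrategy {
    DirectAligned,      // O_DIRECT | O_DSYNC
    BufferedSequential, // O_SYNC
    FileBacked,         // default flags
}

/// Pick the I/O strategy from probe results, per the rules above:
/// no sysfs or a virtual device falls back to a sparse file;
/// rotational media keeps the page cache; flash goes direct.
pub fn strategy(rotational: bool, is_virtual: bool, has_sysfs: bool) -> IoStrategy {
    if is_virtual || !has_sysfs {
        IoStrategy::FileBacked // VM / dev / CI: sparse file, default flags
    } else if rotational {
        IoStrategy::BufferedSequential // HDD: O_SYNC, benefits from readahead
    } else {
        IoStrategy::DirectAligned // NVMe / SATA SSD: O_DIRECT | O_DSYNC
    }
}
```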

On-disk format

Per data device:

Offset 0:     [Superblock — 4K]
Offset 4K:    [Primary Bitmap — variable size]
Offset M:     [Mirror Bitmap — same size as primary]
Offset N:     [Data Region — remainder of device]

Superblock (4K, first block):

pub struct Superblock {
    pub magic: [u8; 8],              // b"KISEKI\x01\x00"
    pub version: u32,                // Format version (1)
    pub device_id: [u8; 16],         // UUID
    pub block_size: u32,             // Physical block size (probed)
    pub total_blocks: u64,           // Device capacity in blocks
    pub bitmap_offset: u64,          // Byte offset of primary bitmap
    pub bitmap_mirror_offset: u64,   // Byte offset of mirror bitmap
    pub bitmap_blocks: u64,          // Size of each bitmap in blocks
    pub data_offset: u64,            // Byte offset of data region
    pub generation: u64,             // Monotonic, incremented on bitmap flush
    pub checksum: [u8; 32],          // SHA-256 of superblock fields
}

Allocation bitmap (primary + mirror): 1 bit per block in the data region. Stored twice at different offsets for redundancy.

  • At 4K blocks: 4TB device = 1 billion blocks = 128MB × 2 = 256MB
  • At 512B blocks: 4TB device = 8 billion blocks = 1GB × 2 = 2GB
  • Bitmap overhead: 0.006% (4K) to 0.048% (512B)
  • On read: verify primary against mirror. On mismatch, use the copy consistent with the redb journal.

Per-extent CRC32: Every data extent has a 4-byte CRC32 trailer written after the payload data (within the same aligned block).

  • On read: verify CRC32 before returning data.
  • CRC mismatch → hardware corruption → trigger EC repair from parity fragments (not a security incident).
  • AES-GCM auth_tag failure after CRC pass → actual tampering (security incident, alert + audit).
  • This distinguishes hardware failure from cryptographic attack, enabling correct operational response.
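
A minimal sketch of the seal/verify step around an extent. The `crc32` function below is a bit-at-a-time CRC-32 (IEEE, reflected) standing in for whatever CRC crate the real code uses; `seal`/`verify` are illustrative names, not Kiseki APIs:

```rust
/// Bit-at-a-time CRC-32 (IEEE polynomial, reflected, final complement).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Append the 4-byte CRC trailer after the payload (within the same block).
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Verify the trailer before returning data. A mismatch here means
/// hardware corruption (trigger EC repair); only an AES-GCM auth_tag
/// failure *after* the CRC passes indicates tampering.
pub fn verify(sealed: &[u8]) -> Option<&[u8]> {
    if sealed.len() < 4 {
        return None;
    }
    let (payload, tag) = sealed.split_at(sealed.len() - 4);
    let expect = u32::from_le_bytes(tag.try_into().ok()?);
    (crc32(payload) == expect).then_some(payload)
}
```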

Allocation algorithm

Extent-based best-fit with free-list cache (Ceph BlueStore pattern, simpler than DAOS VEA):

  • In-memory: B-tree of free extents (offset, block_count), sorted by offset. On alloc, scan for smallest extent >= requested blocks. On free, insert and coalesce with neighbors.
  • Concurrency: alloc() and free() are serialized per device via Mutex on the allocator state. This is acceptable — allocation is a B-tree lookup (microseconds); I/O is the bottleneck, not allocation. Ceph BlueStore also serializes allocation per OSD.
  • On-disk: Bitmap is ground truth. Free-list rebuilt from bitmap on startup (~100ms for 4TB at 4K blocks).
  • Crash safety: Bitmap updates are journaled in redb (device_alloc table) before being applied to the device bitmap region. On crash recovery: reload the bitmap from the device, replay pending journal entries from redb, rebuild the free-list.
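
The in-memory best-fit free-list can be sketched with a BTreeMap keyed by offset. This is a simplified illustration of the algorithm described above (the `FreeList` type is an assumption); the real allocator additionally journals to redb and updates the on-disk bitmap:

```rust
use std::collections::BTreeMap;

/// Free extents keyed by offset (block units), value = block count.
pub struct FreeList {
    by_offset: BTreeMap<u64, u64>,
}

impl FreeList {
    pub fn new(total_blocks: u64) -> Self {
        let mut m = BTreeMap::new();
        m.insert(0, total_blocks); // whole device starts free
        Self { by_offset: m }
    }

    /// Best-fit: the smallest free extent that satisfies the request.
    pub fn alloc(&mut self, blocks: u64) -> Option<u64> {
        let (off, len) = self
            .by_offset
            .iter()
            .filter(|e| *e.1 >= blocks)
            .min_by_key(|e| *e.1)
            .map(|(&o, &l)| (o, l))?;
        self.by_offset.remove(&off);
        if len > blocks {
            // Split: the remainder stays free.
            self.by_offset.insert(off + blocks, len - blocks);
        }
        Some(off)
    }

    /// Free an extent and coalesce with contiguous neighbors.
    pub fn free(&mut self, mut off: u64, mut blocks: u64) {
        // Merge with a contiguous predecessor, if any.
        let prev = self.by_offset.range(..off).next_back().map(|(&o, &l)| (o, l));
        if let Some((poff, plen)) = prev {
            if poff + plen == off {
                self.by_offset.remove(&poff);
                off = poff;
                blocks += plen;
            }
        }
        // Merge with a contiguous successor, if any.
        let succ = self.by_offset.get(&(off + blocks)).copied();
        if let Some(slen) = succ {
            self.by_offset.remove(&(off + blocks));
            blocks += slen;
        }
        self.by_offset.insert(off, blocks);
    }
}
```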

Allocation flow (WAL-ordered for crash safety):

  1. Round up requested size to block_size boundary
  2. Search free-list for best-fit extent
  3. Split extent if larger than needed
  4. Journal intent in redb (device_alloc table: alloc intent)
  5. Mark bits in bitmap (pwrite to bitmap region)
  6. Return Extent { offset, length }
  7. Caller writes data to extent, then commits chunk_meta to redb
  8. Clear intent from device_alloc journal (write complete)

On crash recovery: scan device_alloc for pending intents. If the corresponding chunk_meta exists → write completed, clear intent. If no chunk_meta → write was interrupted, free the extent (clear bitmap bits, remove intent). This is the standard WAL pattern — Ceph BlueStore uses the same approach.
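
The recovery decision can be sketched as follows. HashMaps and plain tuples stand in for the redb tables and extents; the `recover` function and `Recovery` enum are illustrative names, not Kiseki APIs:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum Recovery {
    Completed,  // chunk_meta exists: data + metadata both landed
    RolledBack, // no chunk_meta: write interrupted, extent freed
}

/// Replay pending alloc intents after a crash. An intent whose chunk_meta
/// exists is complete (just clear the intent); one without chunk_meta was
/// interrupted, so its extent is freed (bitmap bits cleared).
pub fn recover(
    pending_intents: &[(u64 /* chunk_id */, (u64, u64) /* extent */)],
    chunk_meta: &HashSet<u64>,
    freed: &mut Vec<(u64, u64)>,
) -> HashMap<u64, Recovery> {
    let mut out = HashMap::new();
    for &(chunk_id, extent) in pending_intents {
        if chunk_meta.contains(&chunk_id) {
            out.insert(chunk_id, Recovery::Completed);
        } else {
            freed.push(extent); // return the extent to the free pool
            out.insert(chunk_id, Recovery::RolledBack);
        }
    }
    out
}
```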

Free flow:

  1. Journal the deallocation intent in redb
  2. Clear bits in bitmap
  3. Insert freed extent into free-list, coalesce neighbors
  4. If supports_trim: add to TRIM batch queue (see below)
  5. Clear dealloc intent from journal

TRIM batching: Freed extents accumulate in a TRIM queue per device. A batched BLKDISCARD ioctl is issued periodically (every 60 seconds or when queue exceeds 1GB). This avoids write amplification from many small TRIM commands.

Maximum extent size: 16MB. Allocations larger than 16MB are split into multiple extents. FragmentLocation in chunk_meta already supports multiple extents per chunk via Vec<FragmentLocation>.
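
Splitting an oversized allocation against the 16 MB cap is straightforward; a hedged sketch (the function name is illustrative):

```rust
const MAX_EXTENT: u64 = 16 * 1024 * 1024; // 16 MB cap per extent

/// Split a requested byte count into per-extent sizes, each <= MAX_EXTENT.
/// The resulting extents map onto Vec<FragmentLocation> in chunk_meta.
pub fn split_request(size: u64) -> Vec<u64> {
    let mut out = Vec::new();
    let mut left = size;
    while left > 0 {
        let n = left.min(MAX_EXTENT);
        out.push(n);
        left -= n;
    }
    out
}
```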

I/O strategy per device type

| Strategy | Open flags | Alignment | Sync | Use case |
|----------|------------|-----------|------|----------|
| DirectAligned | O_DIRECT \| O_DSYNC | physical_block_size | Implicit (O_DSYNC) | NVMe, SATA SSD |
| BufferedSequential | O_SYNC | 512B | fdatasync() | HDD |
| FileBacked | default | 4K (simulated) | fsync() | VM, dev, CI |

FileBacked alignment: FileBackedDevice enforces the same 4K alignment as RawBlockDevice to ensure tests faithfully reproduce raw block behavior. Code that passes CI will not fail on real hardware due to alignment issues.

  • Write buffers aligned via std::alloc::Layout::from_size_align for O_DIRECT compatibility
  • NUMA-aware: pin allocator thread to numa_node if detected
  • TRIM/UNMAP on free if supports_trim (SSD wear management)
  • optimal_io_size used for write batching (coalesce small writes up to this size before issuing I/O)
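
The first bullet — aligned write buffers via std::alloc::Layout — might look like the sketch below. The `with_aligned_buf` helper is an assumption for illustration; real code would pool these buffers rather than allocate per I/O:

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Run `f` with a zeroed buffer of `len` bytes aligned to `align`
/// (e.g. the probed physical_block_size), as O_DIRECT requires.
pub fn with_aligned_buf<R>(len: usize, align: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    let layout = Layout::from_size_align(len, align).expect("invalid layout");
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null(), "allocation failed");
        ptr.write_bytes(0, len); // zero-init before exposing as a slice
        let r = f(std::slice::from_raw_parts_mut(ptr, len));
        dealloc(ptr, layout);
        r
    }
}
```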

Metadata in redb (system partition)

ADR-022’s redb on the RAID-1 system partition stores chunk metadata:

Table: chunk_meta

Key:   [u8; 32]  (chunk_id)
Value: bincode-serialized ChunkMeta {
    refcount: u64,
    retention_holds: Vec<String>,
    pool_name: String,
    stored_bytes: u64,
    fragments: Vec<FragmentLocation {
        device_id: [u8; 16],
        offset: u64,
        length: u64,
    }>,
    envelope_meta: EnvelopeMeta {
        nonce: [u8; 12],
        auth_tag: [u8; 16],
        system_epoch: u64,
        tenant_epoch: Option<u64>,
        tenant_wrapped_material: Option<Vec<u8>>,
    },
}

Table: device_alloc (bitmap journal for crash safety)

Key:   (device_id: [u8; 16], generation: u64)
Value: bincode-serialized Vec<AllocJournalEntry {
    offset: u64,
    length: u64,
    is_alloc: bool,  // true = allocate, false = free
}>

Separation of concerns

The allocator does NOT know about device subclasses (NvmeU2 vs NvmeQlc, HddEnterprise vs HddBulk). Those are pool/placement concerns in kiseki-chunk and kiseki-control (ADR-024).

| Layer | Cares about | Doesn't care about |
|-------|-------------|--------------------|
| kiseki-block | physical_block_size, rotational, O_DIRECT | TLC vs QLC, RPM, pool policy |
| kiseki-chunk | pool thresholds, EC config, placement | block alignment, I/O flags |
| kiseki-control | device class, pool assignment, tiering | how bytes reach the device |

The DeviceClass enum (ADR-024) stays in kiseki-chunk/kiseki-control. DeviceCharacteristics (auto-probed) stays in kiseki-block.

Integration with existing code

  • ChunkOps trait (ADR-005) unchanged — callers unaware of backend
  • New PersistentChunkStore in kiseki-chunk implements ChunkOps:
    • write_chunk(): EC encode → alloc extents per device via DeviceBackend → write fragments → update redb chunk_meta
    • read_chunk(): lookup redb chunk_meta → DeviceBackend::read per fragment → EC decode if needed → return Envelope
    • gc(): free extents via DeviceBackend::free → update bitmap → remove from redb
  • DeviceManager in kiseki-block opens devices at startup, probes characteristics, creates appropriate DeviceBackend per device
  • Server runtime (kiseki-server) wires DeviceManager → pools → PersistentChunkStore when KISEKI_DATA_DIR is set

Crate structure

kiseki-block/
├── Cargo.toml
└── src/
    ├── lib.rs
    ├── backend.rs        # DeviceBackend trait
    ├── raw.rs            # RawBlockDevice (O_DIRECT)
    ├── file.rs           # FileBackedDevice (sparse file)
    ├── probe.rs          # Sysfs device probing
    ├── superblock.rs     # On-disk superblock format
    ├── bitmap.rs         # Allocation bitmap
    ├── allocator.rs      # Extent allocator (free-list + bitmap)
    ├── extent.rs         # Extent type
    ├── manager.rs        # DeviceManager
    └── error.rs          # BlockError, AllocError

Rationale

  • Raw block over XFS: Eliminates FS overhead (journaling, inode, page cache) that becomes the bottleneck at NVMe line rate. Ceph BlueStore validated this approach at scale.
  • Auto-detection over manual config: Reduces deployment friction. Admin provides device paths; Kiseki probes characteristics. Works correctly on bare metal, VMs, and CI without config changes.
  • Bitmap over B-tree free-list on disk: Simpler crash recovery (fixed-size, position-indexed). Free-list is derived in-memory. DAOS VEA uses B-tree on persistent memory, but we don’t require PMEM — bitmap on block device with redb journal is sufficient.
  • File-backed fallback: Same trait, different backend. Tests and CI don’t need raw devices. VMs work without device passthrough.
  • Separate crate: kiseki-block has no domain knowledge (chunks, EC, pools). Clean dependency boundary. Testable in isolation.

Alternatives Considered

  1. XFS on each JBOD device (ADR-024 original default): Rejected for production — FS overhead at NVMe line rate is unacceptable. Still available as FileBacked strategy for dev/VM.

  2. SPDK userspace I/O (DAOS model): Rejected — requires dedicated devices (no kernel access), complicates deployment, needs custom memory management (DMA buffers). Future optimization path if kernel I/O overhead is measured as bottleneck.

  3. Pool files (one large file per device): Rejected — still has FS overhead (XFS metadata for the pool file itself). Raw block eliminates the FS entirely.

  4. redb for chunk data: Rejected — B-tree not designed for multi-GB blob storage. Acceptable for metadata only.

Consequences

  • Adds kiseki-block crate to workspace (~2000 lines estimated)
  • Data devices must be provisioned as raw (no filesystem). Operator provides device paths in config; Kiseki writes superblock on init.
  • VMs and CI use file-backed mode transparently (no raw devices needed)
  • Crash recovery depends on redb journal + device bitmap consistency
  • Device initialization is a destructive operation (writes superblock, bitmap — existing data on device is lost). Safety checks before init: (1) check for existing Kiseki superblock magic — require --force if found, (2) check for known FS signatures (XFS, ext4, NTFS magic) — refuse with clear error, (3) audit log the init
  • TRIM/UNMAP support improves SSD endurance but is optional
  • Future: SPDK backend can implement DeviceBackend trait for userspace I/O without changing upper layers

Adversarial Review Findings (2026-04-22)

| # | Severity | Finding | Resolution |
|---|----------|---------|------------|
| 1 | High | Write ordering — data before metadata creates phantom chunks on crash | WAL intent journal: alloc → journal intent → write data → commit chunk_meta → clear intent. Recovery replays intents. |
| 2 | High | No per-extent checksum — silent corruption indistinguishable from tampering | CRC32 trailer per extent. CRC fail = hardware corruption (EC repair). Auth tag fail after CRC pass = tampering (security alert). |
| 3 | Medium | Bitmap single point of failure per device | Primary + mirror bitmap at different offsets. On mismatch, use copy consistent with redb journal. |
| 4 | Medium | No device init safety — accidental overwrite of existing data | Safety checks: existing Kiseki magic → require --force. Known FS signatures → refuse. Audit log init. |
| 5 | Medium | File-backed mode doesn't enforce alignment — CI misses bugs | FileBacked enforces same 4K alignment as RawBlockDevice. |
| 6 | Medium | Concurrent alloc race on shared free-list | Mutex per device on allocator state. Allocation is microseconds; I/O is the bottleneck. |
| 7 | Low | Immediate TRIM on free causes write amplification | Batch TRIM queue: accumulate, issue BLKDISCARD every 60s or at 1GB threshold. |
| 8 | Low | No max extent size — unbounded alloc fragments bitmap scan | Max extent 16MB. Larger chunks split into multiple extents. |

References

  • Ceph BlueStore: Architecture
  • DAOS VOS/VEA: Storage Model
  • ADR-022: Storage backend (redb for metadata)
  • ADR-024: Device management and capacity thresholds
  • ADR-005: EC and chunk durability

ADR-030: Dynamic Small-File Placement and Metadata Capacity Management

Status: Accepted
Date: 2026-04-22
Deciders: Architect + domain expert
Adversarial review: 2026-04-22 (6 findings: 1C 2H 2M 1L, all resolved)
Context: ADR-024 (device management), ADR-029 (raw block allocator), I-L9 (inline threshold), I-C5 (capacity thresholds), I-C8 (bitmap ground truth)

Problem

At scale (10B+ files, 100PB+), the metadata tier (redb on system NVMe) becomes a sizing bottleneck. The per-file metadata footprint (~280 bytes) is unavoidable, but small-file content inlined into deltas causes the metadata tier to scale with data volume, not just file count.

Current state:

  • inline_threshold_bytes is specified (I-L9) but not implemented
  • No dynamic adjustment mechanism exists
  • No awareness of system disk capacity or media type
  • No workload-driven shard placement across heterogeneous nodes

Capacity example

10B files, 100PB total, 50-node cluster, RF=3, 256GB NVMe root disks:

| Component | Per file | Cluster total | Per node |
|-----------|----------|---------------|----------|
| Delta log (no inline) | ~200 B | ~2 TB | ~120 GB |
| Chunk metadata | ~80 B | ~0.8 TB | ~48 GB |
| Subtotal (metadata only) | ~280 B | ~2.8 TB | ~168 GB |
| Small-file content (if inlined) | variable | 3–200 TB | blows budget |

Metadata alone consumes 168 GB/node at 50 nodes. Adding inline content makes 256 GB root disks insufficient.

Decision

1. System disk auto-detection and budget calculation

At server boot, detect the system partition’s capacity and media type. Compute a metadata budget with configurable soft and hard limits.

KISEKI_DATA_DIR → stat() → total_bytes, fs_type
/sys/block/{dev}/queue/rotational → 0 = SSD/NVMe, 1 = HDD
/sys/block/{dev}/device/model → device identification

Defaults (configurable via env or config file):

| Parameter | Default | Description |
|-----------|---------|-------------|
| KISEKI_META_SOFT_LIMIT_PCT | 50% | Normal operating ceiling |
| KISEKI_META_HARD_LIMIT_PCT | 75% | Absolute maximum, triggers emergency |
| KISEKI_META_INLINE_FLOOR | 128 B | Hard lower bound for inline (metadata-like payloads only) |

Warning: If the system disk is rotational (HDD), emit a persistent warning at boot and in health reports:

WARNING: system disk is rotational (HDD). Raft fsync latency will
be 5-10ms per commit. Production deployments require NVMe or SSD
for the metadata partition. See ADR-030.

Reported to cluster (via gRPC health reports, not Raft — see SF-ADV-4 resolution):

struct NodeMetadataCapacity {
    total_bytes: u64,
    used_bytes: u64,
    soft_limit_bytes: u64,
    hard_limit_bytes: u64,
    media_type: MediaType,  // Nvme, Ssd, Hdd
    small_file_budget_bytes: u64,  // derived: soft_limit - reserved - metadata
}

2. Two-tier redb layout on system disk

Separate metadata (Raft log, chunk index) from small-file content:

KISEKI_DATA_DIR/
├── raft/log.redb            ← Raft log entries (bounded by snapshot policy)
├── keys/epochs.redb         ← Key epoch metadata (tiny, <10 MB)
├── chunks/meta.redb         ← Chunk extent index (scales with file count)
└── small/objects.redb       ← Small-file encrypted content (capacity-managed)

The first three are structural metadata — required regardless of inline threshold. The fourth (small/objects.redb) is a data-tier extension whose size is controlled by the inline threshold.

This separation enables:

  • Independent monitoring of each tier’s growth
  • Emergency response: disable inline (threshold → floor) without touching structural metadata
  • Backup/restore of structural metadata without bulk data

GC contract (SF-ADV-6): When truncate_log or compact_shard removes a delta that references an inline object, the corresponding small/objects.redb entry is also deleted. The GC path must cover both stores — orphan entries in small/objects.redb are a capacity leak. The chunk_id key is shared between small/objects.redb and the block device extent mapping, so deletion is keyed identically.

3. Per-shard dynamic inline threshold

The inline threshold determines whether a file’s encrypted content is stored in small/objects.redb (metadata tier) or as a chunk extent on a raw block device (data tier).

Threshold is per-shard, not per-node, because all Raft replicas of a shard must agree on whether content is inline or chunked (state machine determinism).

Computation: The shard leader computes the threshold from the minimum small-file budget across all nodes hosting that shard:

available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate (from delta count heuristic)
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)

Where INLINE_CEILING is a system-wide maximum (e.g., 64 KB) to prevent pathological cases.

Raft log throughput guard (SF-ADV-1): The threshold is further clamped by a per-shard Raft log throughput budget (KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline write rate (measured over a sliding 10-second window) would exceed this budget at the current threshold, the effective threshold is temporarily reduced to floor until the rate drops. This prevents inline data from starving metadata-only Raft operations (large-file chunk_ref deltas, maintenance commands, watermark advances) during write storms.

effective_threshold = if shard.inline_write_rate_mbps > RAFT_INLINE_MBPS:
    INLINE_FLOOR
else:
    shard_threshold
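
The computation and throughput guard above can be combined into one function. A sketch under stated assumptions: the constants mirror the defaults in this section, but the function name and signature are illustrative:

```rust
const INLINE_FLOOR: u64 = 128;          // KISEKI_META_INLINE_FLOOR (bytes)
const INLINE_CEILING: u64 = 64 * 1024;  // system-wide maximum (bytes)
const RAFT_INLINE_MBPS: f64 = 10.0;     // KISEKI_RAFT_INLINE_MBPS default

/// Per-shard effective inline threshold: min budget across voters divided by
/// projected file count, clamped to [floor, ceiling], demoted to floor while
/// the shard's inline write rate exceeds the Raft throughput budget.
pub fn shard_threshold(
    voter_budgets: &[u64],       // small_file_budget_bytes per voter
    projected_files: u64,        // delta-count heuristic
    inline_write_rate_mbps: f64, // sliding 10-second window
) -> u64 {
    if inline_write_rate_mbps > RAFT_INLINE_MBPS {
        return INLINE_FLOOR; // throughput guard: protect metadata-only deltas
    }
    let available = voter_budgets.iter().copied().min().unwrap_or(0);
    let raw = available / projected_files.max(1);
    raw.clamp(INLINE_FLOOR, INLINE_CEILING)
}
```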

Threshold adjustment rules (I-L9 compatibility):

  • Threshold can decrease dynamically (safe — new files use chunks)
  • Threshold changes are prospective only — existing inline data is not retroactively migrated
  • Threshold increase requires cluster admin decision and may trigger background migration of small chunked files back to inline (optional, maintenance-mode operation)
  • Threshold is stored in ShardConfig and replicated via Raft

Read latency note (SF-ADV-3): After a threshold decrease, existing inline files remain in small/objects.redb (fast, NVMe reads) while new files of the same size go to block device extents (potentially slower, especially on HDD). This bimodal latency for same-sized files is expected behavior. Administrators can normalize it via the maintenance-mode migration path (move old inline content to chunks), but this is optional and not automatic.

Emergency override (SF-ADV-4): Capacity alerts use out-of-band gRPC health reports, not Raft. Each node periodically reports its NodeMetadataCapacity to the shard leader (or control plane) via the data-path gRPC channel. If any voter reports hard-limit breach, the leader commits a threshold reduction via Raft. This works because:

  • The full-disk node doesn’t need to write Raft entries for the signal
  • The leader commits the threshold change with 2/3 majority (the full-disk node’s vote is not required)
  • The full-disk node receives the committed threshold change via Raft replication (read-only, no disk write needed until next apply)

4. Small-file data path

Inline content flows through Raft (SF-ADV-2): Inline content is carried as payload in the Raft log entry (LogCommand::AppendDelta with payload field). The state machine’s apply() method offloads the payload to small/objects.redb on apply, keyed by chunk_id. The in-memory state machine retains only the delta header (no payload).

This ensures:

  • Snapshot correctness: build_snapshot() reads inline content from small/objects.redb, includes it in the serialized snapshot. install_snapshot() writes it back. Learners and restarted nodes receive all inline content via snapshot transfer.
  • State machine determinism: all replicas apply the same log entries and write to their local small/objects.redb identically.
  • Memory efficiency: inline payloads are not held in memory after apply — only the redb reference remains.

Below threshold (inline path):

client write → gateway encrypt → delta with payload →
  Raft client_write (payload in log entry) →
  replicated to voters →
  state machine apply() → offload payload to small/objects.redb →
  in-memory state: header only (no payload)

Above threshold (chunk path, unchanged):

client write → gateway encrypt → chunk alloc on DeviceBackend →
  extent write (O_DIRECT) → delta with chunk_ref (no payload) →
  Raft client_write → replicated (metadata only)

Read path: ChunkOps::get() checks small/objects.redb first (keyed by chunk_id). If not found, reads from block device extent. This is transparent to callers.
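
The two-store lookup can be sketched as a simple fallback. HashMaps stand in for small/objects.redb and the block-device extent reads; the `get` function is illustrative, not the actual ChunkOps signature:

```rust
use std::collections::HashMap;

/// Read path: check the small-object store first (keyed by chunk_id),
/// then fall back to the block-device extent. Transparent to callers.
pub fn get(
    chunk_id: u64,
    small_objects: &HashMap<u64, Vec<u8>>, // stand-in for small/objects.redb
    extents: &HashMap<u64, Vec<u8>>,       // stand-in for DeviceBackend reads
) -> Option<Vec<u8>> {
    small_objects
        .get(&chunk_id)
        .or_else(|| extents.get(&chunk_id))
        .cloned()
}
```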

5. Workload-driven shard placement (heterogeneous clusters)

When the cluster has mixed node types (HDD + SSD), the control plane can migrate shards to better-suited nodes using Raft membership changes.

Placement levers (ordered by preference, topology-dependent):

| Lever | When to use | Mechanism |
|-------|-------------|-----------|
| Lower inline threshold | Always available | ShardConfig update via Raft |
| Split shard | Shard exceeds I-L6 ceiling | Standard shard split |
| Migrate to larger-NVMe node | Heterogeneous cluster, metadata pressure | Raft add_learner → promote → demote |
| Migrate to SSD node | Heterogeneous, small-file-heavy shard | Raft add_learner → promote → demote |

Decision tree (control plane policy):

IF shard.metadata_pressure > soft_limit:
  IF can_lower_threshold(shard):
    lower_threshold(shard)               # cheapest, always try first
  ELSE IF shard.exceeds_split_ceiling:
    split_shard(shard)                   # distributes load
  ELSE IF cluster.has_better_node(shard):
    migrate_shard(shard, better_node)    # needs heterogeneous cluster
  ELSE:
    alert("metadata tier at capacity, no placement options available")

In a homogeneous cluster, only the first two levers exist. The policy prunes itself based on what’s available.

Shard migration via Raft:

Migration is not a special operation — it’s a Raft membership change:

  1. raft.add_learner(target_node) — target receives log/snapshot
  2. Wait for learner to catch up (snapshot transfer, then log replay)
  3. raft.change_membership(new_voter_set) — promote target, demote source
  4. Old node removed from voter set, its data eventually GC’d

Properties:

  • Zero downtime: reads/writes continue during migration
  • Zero data loss: old node stays in membership until new node is caught up
  • Reversible: if migration fails, learner is removed, no state change

6. Placement change rate limiting

Placement changes (shard migration, learner add/remove) consume snapshot transfer bandwidth. In HPC environments, workload profiles shift at job boundaries (hours to days), not continuously.

Exponential backoff per shard:

| After N-th change | Observation window |
|-------------------|--------------------|
| 1st | 2 hours (initial observation, minimum floor) |
| 2nd | 2 hours (backoff never resets below the 2h floor) |
| 3rd | 4 hours |
| 4th | 8 hours |
| … | doubles each change |
| cap | 24 hours (maximum interval) |

Reset (SF-ADV-5): The backoff resets to the minimum floor of 2 hours, not to a shorter interval. Even when the shard’s workload profile changes significantly (e.g., small-file ratio crosses a threshold boundary), the shard cannot be migrated more than once per 2 hours. This prevents oscillating workloads from causing continuous snapshot transfers. The 2-hour floor is chosen because:

  • HPC job boundaries are typically hours apart
  • A snapshot transfer of a large shard takes minutes, and the target node needs time to stabilize before being evaluated again
  • The floor applies per-shard, so different shards can migrate concurrently within the cluster-wide rate limit

Per-cluster rate limit: at most max(1, num_nodes / 10) concurrent shard migrations cluster-wide, to bound snapshot transfer bandwidth.
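
The backoff schedule and cluster-wide limit reduce to two small functions. A sketch — the names are illustrative, and the doubling formula is derived from the schedule above:

```rust
/// Observation window (hours) before the N-th placement change (1-indexed):
/// 2h floor for the first two changes, doubling from the 3rd, capped at 24h.
pub fn backoff_hours(nth_change: u32) -> u32 {
    match nth_change {
        0 | 1 | 2 => 2,                        // floor
        n => (2u32 << (n - 2).min(4)).min(24), // 4, 8, 16, then 24h cap
    }
}

/// Cluster-wide concurrency bound on shard migrations.
pub fn max_concurrent_migrations(num_nodes: u32) -> u32 {
    (num_nodes / 10).max(1)
}
```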

7. SSD nodes as read accelerators (Raft learners)

For read-heavy small-file workloads, SSD nodes can serve as non-voting Raft learners:

  • Learners receive the full Raft log (including small-file content)
  • Learners do NOT participate in elections or commit quorum
  • Learners serve read requests (state machine is up-to-date)
  • Add/remove learners without disturbing the voter set

Use case: a shard has RF=3 on HDD voters (for capacity) plus 1-2 SSD learners (for read IOPS). The SSD learners handle small-file reads, HDD voters handle bulk writes.

Correction after suboptimal placement: Initial shard placement does not need to be optimal. The control plane observes shard metrics (small-file ratio, read IOPS, p99 latency) and corrects placement via Raft membership changes. Adding an SSD learner, promoting it to voter, and demoting an HDD voter is a zero-downtime, zero-data-loss operation. The cost is one snapshot transfer per migrated shard — bounded by the rate limiting in §6.

Promotion path: if workload shifts permanently, a learner can be promoted to voter (and an HDD voter demoted) via standard membership change.

Consequences

Positive

  • Metadata tier sizing becomes self-managing
  • Small files handled efficiently without manual tuning
  • Mixed HDD/SSD clusters used optimally
  • Placement corrections have zero downtime and zero data loss
  • I-L9 compatibility preserved (prospective-only threshold changes)
  • Snapshot transfer includes inline content (SF-ADV-2 resolved)

Negative

  • Per-shard threshold adds complexity to ShardConfig
  • ChunkOps::get() now checks two stores (redb + block device)
  • Snapshot transfer is the bottleneck for migration speed
  • Threshold computation requires cluster-wide metadata aggregation
  • Inline writes under high load may be temporarily demoted to chunk path (throughput guard), causing brief latency increase for small files

Neutral

  • Threshold floor (128 B) means truly tiny files are always inline
  • Homogeneous clusters get simpler behavior (fewer levers)
  • Migration mechanism is just Raft membership changes — no new protocol
  • Bimodal read latency after threshold decrease is expected (SF-ADV-3)

Adversarial findings (resolved)

| ID | Severity | Finding | Resolution |
|----|----------|---------|------------|
| SF-ADV-1 | High | Raft log throughput saturation from inline writes | Per-shard throughput budget (§3), temporarily lowers threshold to floor under load |
| SF-ADV-2 | Critical | Inline content missing from Raft snapshots | Inline content flows through Raft log; state machine offloads to redb on apply; snapshot reads from redb (§4) |
| SF-ADV-3 | Medium | Bimodal read latency after threshold decrease | Documented as expected; optional admin migration path to normalize (§3) |
| SF-ADV-4 | High | Emergency override fails if full-disk node can't write Raft entries | Capacity reporting via out-of-band gRPC, not Raft; leader commits with 2/3 majority (§3) |
| SF-ADV-5 | Low | Backoff reset allows frequent migrations from oscillating workloads | Minimum 2-hour floor that never resets below (§6) |
| SF-ADV-6 | Medium | No GC path for small/objects.redb | GC contract: truncate_log and compact_shard delete corresponding redb entries (§2) |

Invariant impact

| Invariant | Impact |
|-----------|--------|
| I-L9 | Extended: threshold is now per-shard and dynamic, but still prospective-only. Increase requires admin action. |
| I-C5 | Unchanged: capacity thresholds on data devices unaffected. |
| I-C8 | Unchanged: bitmap remains ground truth for block device allocations. |
| I-K3 | Unchanged: inline content is still encrypted with system DEK, wrapped with tenant KEK. |

New invariants

| ID | Invariant |
|----|-----------|
| I-SF1 | The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard's voter set. Threshold stored in ShardConfig, replicated via Raft. |
| I-SF2 | System disk metadata usage must not exceed hard_limit_pct of system partition capacity. Exceeding soft limit triggers threshold reduction; exceeding hard limit forces threshold to floor and emits alert. Alert uses out-of-band gRPC, not Raft. |
| I-SF3 | Shard migration via Raft membership change must not proceed until the target node has fully caught up (learner state matches leader's committed index). |
| I-SF4 | Placement change rate per shard follows exponential backoff (2h floor, 24h cap). Backoff resets never go below 2h floor. Cluster-wide concurrent migrations bounded by max(1, num_nodes / 10). |
| I-SF5 | Inline content is carried in Raft log entries and offloaded to small/objects.redb on state machine apply. Snapshots include inline content read from redb. No inline content is held in the in-memory state machine after apply. |
| I-SF6 | GC (truncate_log, compact_shard) must delete corresponding entries from small/objects.redb when removing deltas that reference inline objects. Orphan redb entries are a capacity leak. |
| I-SF7 | Per-shard Raft inline throughput must not exceed KISEKI_RAFT_INLINE_MBPS (default 10 MB/s). When exceeded, effective inline threshold drops to floor until rate subsides. |

Spec references

  • specs/invariants.md — I-L9, I-C5, I-C8, I-K3
  • specs/architecture/adr/024-device-management-and-capacity.md — device classes, server disk layout
  • specs/architecture/adr/029-raw-block-device-allocator.md — DeviceBackend trait, extent allocation
  • specs/architecture/adr/026-raft-topology.md — Raft membership, multi-Raft pattern
  • specs/implementation/phase-7-9-assessment.md — open design question on small files

ADR-031: Client-Side Cache

Status: Accepted Date: 2026-04-23 Deciders: Architect + domain expert Adversarial review: 2026-04-23 (14 findings: 2C 4H 5M 3L, all resolved)

Context

ADR-013 (POSIX semantics scope), ADR-019 (gateway deployment model), ADR-020 (workflow advisory), ADR-030 (dynamic small-file placement), control-plane.feature (policy distribution precedent), native-client.feature (client architecture).

CSCS workload mix: LLM pretraining (epoch reuse of tokenized datasets), LLM inference (model weight cold-start), climate/weather simulation (bounded input staging with hard deadlines), HPC checkpoint/restart. Common pattern: compute nodes repeatedly pull the same encrypted chunks across the fabric.

Existing client architecture: kiseki-client crate with feature flags (fuse, ffi, python, pure-Rust default). Performs tenant-layer encryption — plaintext never leaves the workload process. The existing ClientCache is an in-memory HashMap<ChunkId, Vec<u8>> with TTL and max-entries eviction.

Problem

  1. Repeat reads of the same chunks cross the fabric unnecessarily. Training datasets are read epoch after epoch. Inference weights are loaded identically by multiple model replicas. Climate boundary conditions are staged identically to every simulation rank.

  2. In-memory cache (current ClientCache) is bounded by process memory, which is primarily needed for computation. Compute-node NVMe is available and underutilized.

  3. No mechanism for pre-staging datasets. Jobs start with cold cache and pay first-access latency on every rank simultaneously, creating a thundering-herd pattern on the storage fabric.

  4. No cache mode differentiation. Training (pin everything), inference (pin weights, LRU prompts), and HPC checkpoint (don’t cache) have fundamentally different cache needs.

Decision

1. Cache architecture

The client-side cache is a library-level module in kiseki-client, shared across all linkage modes (FUSE, FFI, Python, native Rust). It operates on decrypted plaintext chunks keyed by ChunkId.

canonical (fabric) → decrypt → cache store (NVMe) → serve to caller
                                    ↑
                          cache hit path (no fabric, no decrypt)

Two-tier storage:

| Tier | Backing | Purpose | Eviction |
|---|---|---|---|
| Hot (L1) | In-memory HashMap | Sub-microsecond hits for active working set | LRU, bounded by max_memory_bytes |
| Warm (L2) | Local NVMe file or directory | Large capacity for datasets and weights | Per-mode policy (see §2) |

L2 layout on NVMe (CC-ADV-4 resolved: per-process subdirectories):

$KISEKI_CACHE_DIR/
├── <tenant_id_hex>/
│   ├── <pool_id>/                 ← per-process pool (128-bit CSPRNG)
│   │   ├── chunks/
│   │   │   ├── <prefix>/
│   │   │   │   └── <chunk_id_hex> ← plaintext + CRC32 trailer
│   │   │   └── ...
│   │   ├── meta/
│   │   │   └── file_chunks.db
│   │   ├── staging/
│   │   │   └── <dataset_id>.manifest
│   │   └── pool.lock              ← flock, proves process is alive
│   └── <pool_id>/                 ← another concurrent process
│       └── ...
└── ...

Each client process creates its own pool_id directory (128-bit CSPRNG, same generation as client_id per I-WA4). The pool.lock file holds an flock for the process lifetime. Multiple concurrent same-tenant processes on the same node have fully independent pools with no contention.

L2 integrity (CC-ADV-3 resolved): Each L2 chunk file stores the plaintext data followed by a 4-byte CRC32 trailer, computed at insert time. On L2 read, the CRC32 is verified before serving. Full SHA-256 content-address verification occurs only at fetch time (when the chunk is first retrieved from canonical). CRC32 catches bit-flips and filesystem corruption at ~1 GB/s throughput cost. CRC mismatch triggers bypass to canonical and L2 entry deletion (I-CC7).
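The trailer scheme can be sketched as follows. The CRC-32 here is a bitwise ISO-HDLC implementation for self-containment (a real build would use a table-driven or hardware-accelerated crate), and the little-endian trailer encoding is an assumption, not something the text above specifies:

```rust
/// Bitwise CRC-32 (ISO-HDLC polynomial, the zlib/PNG variant).
/// Slow but dependency-free; production code would use an accelerated crate.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Insert time: append the 4-byte CRC32 trailer to the plaintext chunk
/// before the L2 file is written.
fn append_trailer(mut chunk: Vec<u8>) -> Vec<u8> {
    let crc = crc32(&chunk);
    chunk.extend_from_slice(&crc.to_le_bytes());
    chunk
}

/// L2 read: verify the trailer before serving. `None` means the entry is
/// corrupt -- per I-CC7 the caller bypasses to canonical and deletes the file.
fn verify_trailer(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (data, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(data) == stored).then_some(data)
}
```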

Security model (plaintext cache):

The L2 cache holds decrypted plaintext on local NVMe. This is acceptable because:

  • The compute node already holds decrypted data in process memory (computation requires plaintext)
  • L2 NVMe is local to the compute node, same trust domain as process memory
  • L2 is ephemeral — wiped on process exit and on long disconnect
  • zeroize on eviction/wipe: overwrite chunk data before deallocation (I-CC2)
  • File permissions: 0600, owned by process UID
  • Crash recovery: startup scavenger + periodic scrubber clean orphaned pools (CC-ADV-1 resolved, see §9)

Residual risk (CC-ADV-10 acknowledged): Software zeroize on NVMe/SSD provides logical-level erasure only. The Flash Translation Layer may retain physical copies of overwritten data until internal garbage collection. For deployments requiring physical erasure guarantees, use NVMe drives with hardware encryption (OPAL/SED) and rotate the drive encryption key on node reboot. This is an operational hardening measure, not a baseline requirement.

2. Cache modes

Three modes, selectable per client instance at session establishment:

Pinned mode

For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate (boundary conditions).

  • Chunks are retained against eviction until explicit release
  • Populated via the staging API (§6) or on first access
  • L2 is the primary tier; L1 is a hot subset
  • Eviction: only on explicit release() or process exit
  • Capacity bounded by max_cache_bytes (§8); staging beyond capacity returns an error, does not evict pinned chunks

Dataset versioning (CC-ADV-8 resolved): Pinned mode stages a point-in-time snapshot of the dataset. The staged version is immutable in the cache regardless of canonical updates. This is intentional — training runs require a stable dataset across epochs. To pick up dataset updates, the user must explicitly release and re-stage. There is no automatic dataset-level version check.

Organic mode

Default for mixed workloads. LRU with usage-weighted retention.

  • Chunks cached on first read, evicted on LRU when capacity is reached
  • Frequently accessed chunks promoted to L1
  • L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
  • Metadata cache (file→chunk_list) with configurable TTL (default 5s)
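One way to realize “chunks accessed N times survive N eviction rounds” is a per-entry weight that each eviction round decrements before anything is removed. A minimal sketch, with chunk ids reduced to u64 and all names illustrative:

```rust
use std::collections::HashMap;

struct Entry {
    last_access: u64, // logical clock tick of last read
    weight: u32,      // remaining eviction rounds this entry survives
    bytes: u64,
}

struct OrganicL2 {
    entries: HashMap<u64, Entry>, // keyed by a stand-in chunk id
    used: u64,
    capacity: u64,
}

impl OrganicL2 {
    /// One eviction round, oldest-first. Entries with weight > 1 are spared
    /// but lose one round of protection; weight-1 entries are evicted until
    /// `need` extra bytes fit. Callers loop rounds until the insert fits.
    fn evict_round(&mut self, need: u64) -> Vec<u64> {
        let mut order: Vec<u64> = self.entries.keys().copied().collect();
        order.sort_by_key(|id| self.entries[id].last_access);
        let mut evicted = Vec::new();
        for id in order {
            if self.used + need <= self.capacity {
                break;
            }
            let e = self.entries.get_mut(&id).expect("id came from keys()");
            if e.weight > 1 {
                e.weight -= 1; // survives this round
            } else {
                self.used -= e.bytes;
                self.entries.remove(&id);
                evicted.push(id);
            }
        }
        evicted
    }
}
```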

Bypass mode

For workloads that don’t benefit from caching: streaming ingest, one-shot scans, checkpoint writes, compute-bound codes with no repeat reads.

  • All reads go directly to canonical
  • No L1 or L2 storage consumed
  • Zero overhead beyond mode selection

3. Metadata cache

The cache stores file-to-chunk-list mappings with a bounded TTL:

struct MetadataEntry {
    chunk_list: Vec<ChunkId>,
    fetched_at: Instant,
    ttl: Duration,
}

I-CC3 (metadata freshness and authority): File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical before serving chunks that depend on it. Within the TTL window, the cached mapping is authoritative — it may serve data for files that have since been modified or deleted in canonical. This is an accepted consequence of the TTL window, not a correctness violation. Modifications create new compositions with new chunk_ids; the old mapping points to valid immutable chunks that were the file’s content at fetch time. Deletions remove the composition; the cached mapping continues to serve the deleted file’s data until TTL expiry.

I-CC5 (staleness bound): Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical will be visible to a caching client within at most one metadata TTL period. The default TTL (5 seconds) balances freshness against metadata lookup cost.

Write-through: When the client writes a file (creating new chunks and a new composition), the local metadata cache is updated immediately with the new chunk list. This provides read-your-writes consistency within a single client process without waiting for TTL expiry.
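A sketch of the TTL gate and the write-through path, with ChunkId reduced to u64 and the canonical re-fetch left to the caller:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct MetadataEntry {
    chunk_list: Vec<u64>, // stand-in for Vec<ChunkId>
    fetched_at: Instant,
    ttl: Duration,
}

struct MetadataCache {
    map: HashMap<String, MetadataEntry>,
    default_ttl: Duration, // metadata_ttl_ms from policy, default 5s
}

impl MetadataCache {
    /// Serve the mapping only within its TTL (I-CC3). `None` means expired
    /// or absent: the caller must re-fetch from canonical before serving.
    fn lookup(&self, path: &str) -> Option<&[u64]> {
        let e = self.map.get(path)?;
        (e.fetched_at.elapsed() < e.ttl).then(|| e.chunk_list.as_slice())
    }

    /// Write-through: a local write installs the new chunk list immediately,
    /// giving read-your-writes within this process without waiting for TTL.
    fn write_through(&mut self, path: &str, chunk_list: Vec<u64>) {
        self.map.insert(
            path.to_string(),
            MetadataEntry { chunk_list, fetched_at: Instant::now(), ttl: self.default_ttl },
        );
    }
}
```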

4. Correctness invariants

The cache’s correctness rests on a small set of stated invariants. Each case where the cache serves (rather than bypasses) is backed by one or more of these invariants. Cases not covered bypass to canonical.

I-CC1 (chunk immutability): Chunks are immutable in canonical (I-C1). A chunk fetched, verified by content-address (SHA-256 of plaintext matches chunk_id derivation), and stored in cache is correct for all future reads of that chunk_id. No TTL needed for chunk data.

I-CC2 (plaintext security): Cached plaintext is overwritten with zeros (zeroize) before deallocation, eviction, or cache wipe. File-level: overwrite contents before unlink. Memory-level: Zeroizing<Vec<u8>> for L1 entries. This provides logical-level erasure; physical-level erasure on flash storage requires hardware encryption (see §1 residual risk).

I-CC6 (disconnect threshold): Cached entries remain authoritative across fabric disconnects shorter than max_disconnect_seconds (default 300s). Beyond this threshold, the entire cache (L1 + L2) is wiped. Disconnect is defined as: no successful RPC to any canonical endpoint (storage node or gateway) for max_disconnect_seconds consecutive seconds. The client maintains a last_successful_rpc timestamp updated on every successful data-path or heartbeat RPC. Background heartbeat RPCs (every 60s, piggybacked on metadata TTL refresh when idle) keep this timestamp current. Transient single-RPC failures do not trigger the disconnect timer — only sustained unreachability across all endpoints does.
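The disconnect definition reduces to a single timestamp; a sketch (the RPC plumbing that would call record_success is assumed):

```rust
use std::time::{Duration, Instant};

struct DisconnectTracker {
    last_successful_rpc: Instant,
    threshold: Duration, // max_disconnect_seconds, default 300s
}

impl DisconnectTracker {
    fn new(threshold: Duration) -> Self {
        Self { last_successful_rpc: Instant::now(), threshold }
    }

    /// Called on every successful data-path or heartbeat RPC to ANY
    /// canonical endpoint. A single failed RPC never advances the timer.
    fn record_success(&mut self) {
        self.last_successful_rpc = Instant::now();
    }

    /// True once no endpoint has answered for the full threshold; the
    /// caller then wipes L1 + L2 (I-CC6).
    fn must_wipe(&self) -> bool {
        self.last_successful_rpc.elapsed() >= self.threshold
    }
}
```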

I-CC7 (error bypass): Any local cache error (L2 I/O failure, corrupt chunk detected by CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. Failed L2 reads are not retried from L2 — they go to canonical immediately.

I-CC8 (wipe on restart / crash recovery): On process start, the client either creates a new L2 pool (wiping any prior orphaned pools) or adopts an existing pool identified by KISEKI_CACHE_POOL_ID environment variable (see §6 staging handoff). Orphaned pools are detected by attempting flock on each pool.lock — if the lock succeeds, the pool is orphaned (no live process holds it) and is wiped (zeroized and deleted). A separate kiseki-cache-scrub service runs on node boot and periodically (every 60s) to clean orphaned pools across all tenants, covering crash recovery when no subsequent kiseki process starts on that node.

I-CC13 (L2 integrity): L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer on each chunk file. On L2 read, the CRC32 is verified before serving. CRC mismatch triggers bypass to canonical and L2 entry deletion.

5. Policy authority and distribution

Cache policy follows the same distribution mechanism as quotas (per control-plane.feature scenario “Quota enforcement during control plane outage”).

Policy hierarchy

cluster default → org override → project override → workload override
                                                      → session selection

Each level narrows (never broadens) the parent’s settings, consistent with ADR-020 / I-WA7.

Policy attributes

| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set {pinned, organic, bypass} | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache filesystem |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |

Narrowing rules (same as I-WA7):

  • cache_enabled = false at any level → disabled for all children
  • allowed_modes at child ⊆ allowed_modes at parent
  • max_cache_bytes at child ≤ max_cache_bytes at parent
  • metadata_ttl_ms at child ≤ metadata_ttl_ms at parent
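The narrowing rules compose mechanically level by level. A sketch, with field names condensed from the policy table and modes as an enum:

```rust
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Mode { Pinned, Organic, Bypass }

#[derive(Clone)]
struct CachePolicy {
    cache_enabled: bool,
    allowed_modes: HashSet<Mode>,
    max_cache_bytes: u64,
    metadata_ttl_ms: u64,
}

/// Apply one child level on top of its parent. Every rule only narrows:
/// a child can disable, shrink, or lower -- never widen or raise.
fn narrow(parent: &CachePolicy, child: &CachePolicy) -> CachePolicy {
    CachePolicy {
        // disabled at any level stays disabled for all children
        cache_enabled: parent.cache_enabled && child.cache_enabled,
        // child modes must be a subset of parent modes
        allowed_modes: child
            .allowed_modes
            .intersection(&parent.allowed_modes)
            .cloned()
            .collect(),
        // ceilings can only move down
        max_cache_bytes: child.max_cache_bytes.min(parent.max_cache_bytes),
        metadata_ttl_ms: child.metadata_ttl_ms.min(parent.metadata_ttl_ms),
    }
}
```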

Distribution mechanism

Cache policy is carried in the same TenantConfig structure that carries quotas. At session establishment, the client resolves its effective policy through multiple paths (CC-ADV-6 resolved):

  1. Primary: GetCachePolicy RPC on the data-path gRPC channel to any connected storage node. Storage nodes have TenantConfig (same data they use for quota enforcement). No gateway or control plane reachability required — the client only needs the data fabric.
  2. Secondary: fetch from gateway’s locally-cached TenantConfig (if gateway is reachable)
  3. Stale tolerance: last-known policy persisted in L2 pool directory (policy.json). Remains effective during outages, consistent with quota scenario in control-plane.feature.
  4. Fallback: if no policy resolvable (first-ever session, all paths unreachable), use conservative defaults (cache enabled, organic mode, 10 GB max, 5s TTL)
  5. Reconciliation: on control-plane recovery, client re-fetches policy and applies prospectively (I-WA18 pattern — active sessions continue under session-start policy; new sessions use updated policy)

No parallel policy-distribution path is introduced. Cache policy is one more field in TenantConfig, alongside quotas, compliance tags, and advisory settings.

I-CC9 (policy fallback): When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). The cache is a performance feature; failing to resolve policy must not prevent data access.

I-CC10 (prospective policy): Cache policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment, consistent with I-WA18.

6. Staging API

Client-local operation for pre-populating the cache with a dataset’s chunks in pinned mode. Pull-based — the client fetches from canonical.

Interface

# CLI (Slurm prolog, manual use)
kiseki-client stage --dataset <namespace_path> [--timeout <seconds>]
kiseki-client stage --status [--dataset <namespace_path>]
kiseki-client stage --release <namespace_path>
kiseki-client stage --release-all

# Rust API
impl CacheManager {
    async fn stage(&self, namespace_path: &str) -> Result<StageResult>;
    fn stage_status(&self) -> Vec<StagedDataset>;
    fn release(&self, namespace_path: &str);
    fn release_all(&self);
}

# Python API (via PyO3)
client.stage(namespace_path="/training/imagenet")
client.stage_status()
client.release(namespace_path="/training/imagenet")

# C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs)
kiseki_stage_status(handle, &status)
kiseki_release(handle, "/training/imagenet")

Flow (CC-ADV-11 resolved: directory tree handling)

  1. Resolve namespace_path to composition metadata via canonical. If the path is a directory, recursively enumerate all files (compositions) up to max_staging_depth (default 10) and max_staging_files (default 100,000). If limits are exceeded, return an error with the count of files discovered.
  2. Extract full chunk list from all resolved compositions
  3. For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention
  4. Write staging/<dataset_id>.manifest listing all compositions, their chunk_ids, and the total byte count
  5. Report progress (chunks staged / total, bytes, elapsed)

Staging is idempotent — re-staging an already-staged dataset is a no-op (chunks already present). Partial staging (interrupted) can be resumed by re-running the command.
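Idempotency and resumability both fall out of checking L2 membership before fetching. A sketch with the fetch/decrypt/verify pipeline reduced to a closure and chunk ids to u64:

```rust
use std::collections::HashSet;

/// Stage a resolved chunk list into L2 with pinned retention. Chunks already
/// present are skipped, so re-running after an interruption resumes where it
/// left off. Returns the number of chunks newly fetched from canonical.
fn stage_chunks(
    chunk_list: &[u64],
    l2_pinned: &mut HashSet<u64>,
    mut fetch_decrypt_verify: impl FnMut(u64) -> Result<Vec<u8>, String>,
) -> Result<usize, String> {
    let mut fetched = 0;
    for &chunk_id in chunk_list {
        if l2_pinned.contains(&chunk_id) {
            continue; // already staged: no-op
        }
        let _plaintext = fetch_decrypt_verify(chunk_id)?; // SHA-256 checked here
        // (real code writes plaintext + CRC32 trailer into the pool directory)
        l2_pinned.insert(chunk_id);
        fetched += 1;
    }
    Ok(fetched)
}
```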

Staging handoff (CC-ADV-5 resolved)

The staging CLI creates a cache pool and holds the pool.lock flock for its lifetime. The workload process adopts the staging pool instead of creating a new one:

  1. Staging CLI: kiseki-client stage --dataset /training/imagenet
    • Creates pool, writes pool_id to stdout and to $KISEKI_CACHE_DIR/<tenant>/staging_pool_id
    • Stages chunks, holds flock, stays alive (daemon mode)
  2. Workload process: sets KISEKI_CACHE_POOL_ID=<pool_id> (from Slurm prolog output, Lattice env injection, or the file)
    • On start, detects existing pool with matching pool_id
    • Adopts pool: takes over flock from staging daemon
    • Staging daemon detects flock loss, exits cleanly
  3. If KISEKI_CACHE_POOL_ID is not set: normal fresh-pool behavior (create new pool, wipe orphans)

Slurm integration:

# prolog.sh:
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE

# epilog.sh:
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID

Lattice integration: injects KISEKI_CACHE_POOL_ID into the workload environment after parallel staging completes across the node set. Queries stage --status to verify readiness before launching the workload.

I-CC11 (staging correctness): Staged chunks are fetched from canonical, verified by content-address, and stored with pinned retention. The staging manifest records the compositions and chunk_ids at staging time as a point-in-time snapshot. If the dataset is modified in canonical after staging, the staged version remains correct for its chunk_ids (immutable chunks) but stale relative to the current dataset version. To pick up updates, the user must explicitly release and re-stage.

7. Cache invalidation

The cache is primarily self-consistent due to chunk immutability (I-C1). Explicit invalidation is needed only for metadata:

Metadata invalidation: TTL-based. No push invalidation from canonical to client. The metadata TTL is the sole freshness mechanism.

Chunk invalidation: Not needed under normal operation (chunks are immutable). Two exceptional cases:

  1. Crypto-shred (CC-ADV-2 resolved): When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths, plus an unreachability fallback:

    • Periodic key health check: Client pings KMS every key_health_interval (default 30s). If the tenant KEK is reported destroyed (KEK_DESTROYED error), wipe immediately.
    • Advisory channel: If connected, receives shred notification immediately (fast path, best-effort).
    • KMS error on next operation: Any key fetch that returns KEK_DESTROYED triggers immediate wipe.
    • Unreachability fallback: If KMS is unreachable for max_disconnect_seconds, the disconnect timer triggers a full cache wipe (I-CC6), which covers the case where the KMS is unreachable because the KEK was destroyed.

    Maximum time between crypto-shred event and cache wipe is bounded by min(key_health_interval, max_disconnect_seconds) — default 30 seconds.

  2. Key rotation: When the system key epoch rotates, existing cached plaintext remains valid (same content, different encryption at rest). No cache action needed — the cache holds plaintext, not ciphertext.

I-CC12 (crypto-shred wipe): On crypto-shred event, all cached plaintext for the affected tenant is wiped from L1 and L2 with zeroize. Detection bounded by key_health_interval (default 30s). No cached data from a shredded tenant is served after detection.

8. Capacity management

Per-process limits:

| Parameter | Default | Source |
|---|---|---|
| max_memory_bytes (L1) | 256 MB | env KISEKI_CACHE_L1_MAX or API |
| max_cache_bytes (L2) | 50 GB | policy ceiling or env KISEKI_CACHE_L2_MAX |

Per-node limit (CC-ADV-9 resolved):

max_node_cache_bytes (default: 80% of $KISEKI_CACHE_DIR filesystem capacity). Enforced cooperatively: before inserting into L2, each process sums total usage across all pool directories in $KISEKI_CACHE_DIR. If total exceeds max_node_cache_bytes, the insert is rejected (organic: evict first; pinned: staging error). The disk-pressure check (90% filesystem utilization) remains as a hard backstop.
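The cooperative check can be sketched with a recursive directory walk. Error handling here is simplified (a failed scan admits the insert), and the 90% disk-pressure backstop is omitted:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum file sizes under one directory.
fn dir_bytes(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_bytes(&entry.path())?;
        } else {
            total += meta.len();
        }
    }
    Ok(total)
}

/// Cooperative node-wide check: before an L2 insert, sum every tenant/pool
/// directory under the cache root and reject the insert if it would push the
/// node past max_node_cache_bytes. A scan error falls back to admitting;
/// real code would handle pools vanishing mid-scan more carefully.
fn admit_insert(cache_root: &Path, insert_bytes: u64, max_node_cache_bytes: u64) -> bool {
    let used = dir_bytes(cache_root).unwrap_or(0);
    used + insert_bytes <= max_node_cache_bytes
}
```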

Capacity enforcement:

  • L1: strict LRU eviction at max_memory_bytes
  • L2 organic mode: LRU eviction at max_cache_bytes
  • L2 pinned mode: staging requests rejected with CacheCapacityExceeded when staged + proposed > max_cache_bytes. No eviction of pinned data.
  • Combined pinned + organic: pinned chunks are never evicted by organic LRU. Organic eviction only considers non-pinned chunks.
  • Node-wide: cooperative check against max_node_cache_bytes before any L2 insert.

9. Lifecycle

Process start (CC-ADV-1 resolved: crash recovery):

  1. If KISEKI_CACHE_POOL_ID set: adopt existing pool (§6 handoff)
  2. Otherwise: create new pool with CSPRNG pool_id
  3. Scavenge orphans: scan all pool directories under $KISEKI_CACHE_DIR/<tenant_id>/, attempt flock on each pool.lock. If lock succeeds (no live holder), the pool is orphaned — zeroize all chunk files, delete directory. This catches prior crashes.
  4. Resolve effective cache policy (§5)
  5. Initialize L1 (empty HashMap)
  6. Start background tasks: metadata TTL eviction, disk-pressure check, key health check (every key_health_interval), heartbeat RPC (every 60s for disconnect detection)
  7. Cache operational

Crash recovery service (kiseki-cache-scrub): A systemd one-shot service (or cron job) that runs on node boot and every 60 seconds. Scans $KISEKI_CACHE_DIR for all tenant/pool directories, wipes any whose pool.lock has no live flock holder. This covers the case where no subsequent kiseki process starts on the node after a crash.

Steady state:

  • Reads: L1 → L2 (CRC32 verify) → canonical (decrypt + SHA-256 verify → store in L1/L2 with CRC32 trailer)
  • Writes: straight to canonical; update local metadata cache
  • Background: periodic L1 expired-entry eviction, L2 disk-pressure check, key health check, heartbeat RPC
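The steady-state read path above can be sketched end to end. The checksum is a one-byte toy stand-in for the CRC32 trailer, and decrypt + SHA-256 verification are folded into the fetch_canonical closure; all names are illustrative:

```rust
use std::collections::HashMap;

type ChunkId = u64;

/// Toy stand-in for the CRC32 trailer check (I-CC13).
fn l2_entry_ok(data: &[u8], checksum: u8) -> bool {
    data.iter().fold(0u8, |a, b| a.wrapping_add(*b)) == checksum
}

struct TieredCache {
    l1: HashMap<ChunkId, Vec<u8>>,
    l2: HashMap<ChunkId, (Vec<u8>, u8)>, // (plaintext, toy checksum)
}

impl TieredCache {
    /// L1 → L2 (verify) → canonical. Any L2 failure bypasses to canonical
    /// and deletes the bad entry (I-CC7); a canonical fetch repopulates
    /// both tiers.
    fn read(
        &mut self,
        id: ChunkId,
        mut fetch_canonical: impl FnMut(ChunkId) -> Vec<u8>,
    ) -> Vec<u8> {
        // L1: hot hit, memory-resident plaintext, no verify needed
        if let Some(data) = self.l1.get(&id) {
            return data.clone();
        }
        // L2: verify before serving; corrupt entries are deleted below
        let mut corrupt = false;
        if let Some((data, sum)) = self.l2.get(&id) {
            if l2_entry_ok(data, *sum) {
                let data = data.clone();
                self.l1.insert(id, data.clone()); // promote to hot tier
                return data;
            }
            corrupt = true;
        }
        if corrupt {
            self.l2.remove(&id); // I-CC7: delete, then bypass to canonical
        }
        // Canonical: fetch (decrypt + SHA-256 assumed inside), repopulate
        let data = fetch_canonical(id);
        let sum = data.iter().fold(0u8, |a, b| a.wrapping_add(*b));
        self.l2.insert(id, (data.clone(), sum));
        self.l1.insert(id, data.clone());
        data
    }
}
```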

Disconnect (fabric unreachable):

  • Reads from L1/L2 continue to be served (chunks are immutable)
  • After max_disconnect_seconds with no successful RPC to any canonical endpoint: wipe entire cache (I-CC6)
  • On reconnect before threshold: resume normal operation

Process exit (clean):

  • Wipe L2 (zeroize all chunk files, delete pool directory)
  • L1 freed with process memory (Zeroizing drop handles cleanup)
  • Release flock on pool.lock

Process exit (crash):

  • L2 chunk files remain on NVMe (no zeroize opportunity)
  • Next process start or kiseki-cache-scrub service detects orphaned pool via flock check and wipes it

10. Configuration surface

| Linkage mode | Configuration mechanism |
|---|---|
| FUSE mount | Mount options: -o cache_mode=organic,cache_l2_max=50G,cache_dir=/tmp/kiseki |
| Rust API | CacheConfig struct passed to Client::new() |
| Python | kiseki.Client(cache_mode="pinned", cache_l2_max=50*1024**3) |
| C FFI | kiseki_open() with KisekiCacheConfig struct |
| Environment | KISEKI_CACHE_MODE, KISEKI_CACHE_DIR, KISEKI_CACHE_L1_MAX, KISEKI_CACHE_L2_MAX, KISEKI_CACHE_META_TTL_MS, KISEKI_CACHE_POOL_ID |

Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings (§5).
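The priority chain for one numeric knob reduces to an or-chain plus a clamp; a sketch with illustrative names:

```rust
/// Resolve one numeric setting: API/mount option wins over environment,
/// which wins over the policy default; whatever wins is clamped to the
/// policy ceiling (§5).
fn effective_value(
    api: Option<u64>,
    env: Option<u64>,
    policy_default: u64,
    policy_ceiling: u64,
) -> u64 {
    api.or(env).unwrap_or(policy_default).min(policy_ceiling)
}
```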

11. Observability

Cache metrics exposed via the client’s local metrics (not Prometheus — client runs on compute nodes, not storage nodes):

| Metric | Type | Description |
|---|---|---|
| cache_l1_hits | counter | L1 (memory) cache hits |
| cache_l2_hits | counter | L2 (NVMe) cache hits |
| cache_misses | counter | Cache misses (bypassed to canonical) |
| cache_bypasses | counter | Bypass mode reads (intentional non-cache) |
| cache_errors | counter | L2 I/O errors (bypassed to canonical per I-CC7) |
| cache_l1_bytes | gauge | Current L1 memory usage |
| cache_l2_bytes | gauge | Current L2 disk usage |
| cache_staged_datasets | gauge | Number of pinned datasets |
| cache_staged_bytes | gauge | Total bytes in pinned datasets |
| cache_meta_hits | counter | Metadata cache hits (within TTL) |
| cache_meta_misses | counter | Metadata cache misses (TTL expired or absent) |
| cache_wipes | counter | Full cache wipes (disconnect threshold, restart, crypto-shred) |
| cache_l2_read_latency_us | histogram | L2 NVMe read latency |
| cache_l2_write_latency_us | histogram | L2 NVMe write latency |

Metrics available via workflow advisory telemetry (scoped to caller) and via local API (cache_stats()).

Consequences

Positive

  • Repeat reads served from local NVMe: order-of-magnitude latency reduction for training datasets, inference weights, simulation input
  • Staging API with scheduler handoff eliminates thundering-herd on job start
  • Three modes match the three dominant workload patterns precisely
  • Plaintext cache means cache hits avoid decryption cost entirely
  • Policy model reuses existing TenantConfig distribution — no new subsystem
  • Content-addressed chunk immutability makes cache correctness simple (I-C1 is the foundation)
  • Crash recovery via flock-based orphan detection + scrubber service

Negative

  • Plaintext on local NVMe is a security surface. Mitigated by zeroize, file permissions, wipe-on-exit, crash scrubber, and ephemeral-only semantics (I-CC2, I-CC8). Residual FTL risk documented.
  • Metadata TTL introduces a staleness window, including for deleted files. Mitigated by the short default (5s) and write-through for a process’s own writes (I-CC3, I-CC5)
  • L2 NVMe cache competes with application use of local NVMe (e.g., scratch, checkpoint). Mitigated by configurable per-process ceiling, per-node ceiling, and disk-pressure backoff (§8)
  • No cross-process chunk sharing within a tenant means duplicate chunks when multiple jobs for the same tenant run on the same node. Accepted trade-off: simplicity over hit-rate optimization

Neutral

  • Bypass mode has zero overhead (no cache code on read path)
  • Staging is idempotent and resumable
  • Cache wipe on long disconnect is conservative but safe
  • Policy distribution via data-path gRPC works in all deployment topologies (no gateway or control plane access required)

Adversarial findings

| ID | Severity | Section | Finding | Resolution |
|---|---|---|---|---|
| CC-ADV-1 | Critical | §1, §9 | Crash leaves plaintext on NVMe unreachable by zeroize. Process crash skips the exit wipe path. | Resolved: startup scavenger wipes orphaned pools (flock detection). kiseki-cache-scrub systemd/cron service runs on boot + every 60s for nodes where no subsequent kiseki process starts. Residual FTL risk documented. §9 updated. |
| CC-ADV-2 | Critical | §7 | Crypto-shred detection has no reliable delivery path. Advisory channel is best-effort. | Resolved: periodic key health check (default 30s) as primary detection. Advisory channel as fast path. KMS error on next operation as tertiary. Unreachability falls through to disconnect timer (I-CC6). Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30s default. §7 updated, I-CC12 revised. |
| CC-ADV-3 | High | §1 | L2 read verification unspecified. Full SHA-256 on every read too expensive for training throughput. | Resolved: CRC32 trailer on each L2 chunk file, verified on read. SHA-256 only at fetch time. CRC32 catches bit-flips at ~1 GB/s cost. CRC mismatch → bypass to canonical + delete entry. New I-CC13. §1 updated. |
| CC-ADV-4 | High | §1 | cache.lock flock contradicts separate-pool semantics for concurrent same-tenant processes. | Resolved: per-process pool_id subdirectory (128-bit CSPRNG). Each process has own pool.lock. No contention between concurrent processes. L2 layout updated in §1. |
| CC-ADV-5 | High | §6 | Staging CLI is separate process — workload’s wipe-on-start destroys staged data. | Resolved: staging daemon holds flock; workload adopts pool via KISEKI_CACHE_POOL_ID env var instead of creating new pool. Handoff mechanism specified in §6. I-CC8 revised to include adoption path. |
| CC-ADV-6 | High | §5 | Policy resolution via gateway unreachable in some topologies. | Resolved: primary path is GetCachePolicy RPC on data-path gRPC channel to any storage node. No gateway or control plane access required. Fallback chain: data-path → gateway → persisted last-known → conservative defaults. §5 updated. |
| CC-ADV-7 | Medium | §3, §4 | Metadata TTL authority doesn’t explicitly cover file deletion case. | Resolved: I-CC3 text now explicitly states that serving data for a deleted file within TTL is an accepted consequence. I-CC5 updated to cover deletion. §3 updated. |
| CC-ADV-8 | Medium | §2 | Pinned mode has no mechanism to detect canonical dataset updates. | Resolved: documented as intentional. Pinned mode stages a point-in-time snapshot. Update requires explicit release + re-stage. §2 updated. |
| CC-ADV-9 | Medium | §8 | No aggregate capacity enforcement across processes on same node. | Resolved: max_node_cache_bytes policy attribute (default 80% of cache filesystem). Cooperative enforcement: each process sums all pools before inserting. Disk-pressure 90% as hard backstop. §8 updated, policy table updated. |
| CC-ADV-10 | Medium | §1 | NVMe FTL retains physical copies after software zeroize. | Resolved: acknowledged as residual risk. Recommended hardening: OPAL/SED NVMe with per-boot key rotation. §1 updated. |
| CC-ADV-11 | Medium | §6 | Staging conflates namespace path with single composition. | Resolved: staging flow now specifies recursive directory enumeration with max_staging_depth (default 10) and max_staging_files (default 100,000). §6 flow updated. |
| CC-ADV-12 | Low | §4 | I-CC3, I-CC4, I-CC5 partially overlap. | Resolved: I-CC3 and I-CC4 consolidated into single I-CC3 covering freshness, authority, and deletion case. I-CC5 retained as the externally-facing staleness guarantee. Invariant table updated. |
| CC-ADV-13 | Low | §9 | Disconnect detection mechanism unspecified. | Resolved: defined as “no successful RPC to any canonical endpoint for max_disconnect_seconds consecutive seconds.” Client maintains last_successful_rpc timestamp. Background heartbeat every 60s. I-CC6 updated with detection mechanism. |
| CC-ADV-14 | Low | §11 | Missing L2 read/write latency metrics. | Resolved: added cache_l2_read_latency_us and cache_l2_write_latency_us histograms to metrics table. §11 updated. |

Invariant impact

| Invariant | Impact |
|---|---|
| I-C1 | Foundation: chunk immutability enables the cache. No change to I-C1. |
| I-K1, I-K2 | Unchanged: plaintext never leaves the compute node. Cache stores plaintext locally, same trust domain as process memory. |
| I-WA18 | Reused: cache policy changes apply prospectively. |
| I-WA7 | Reused: scope narrowing pattern for policy hierarchy. |

New invariants

| ID | Invariant |
| --- | --- |
| I-CC1 | A chunk in pinned or organic mode is served from cache if and only if (a) the chunk was fetched from canonical and verified by chunk_id content-address match (SHA-256) at fetch time, and (b) no crypto-shred event has been detected for that tenant since fetch. Chunks are immutable in canonical (I-C1); therefore a verified chunk remains correct indefinitely absent crypto-shred. |
| I-CC2 | Cached plaintext is overwritten with zeros (zeroize) before deallocation, eviction, or cache wipe. File level: overwrite contents before unlink. Memory level: `Zeroizing<Vec<u8>>` for L1 entries. This provides logical-level erasure; physical-level erasure on flash storage requires hardware encryption (OPAL/SED). |
| I-CC3 | File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical. Within the TTL window, the cached mapping is authoritative: it may serve data for files that have since been modified or deleted in canonical. This is the sole freshness window in the cache design; chunk data itself has no TTL. |
| I-CC5 | Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical is visible to a caching client within at most one metadata TTL period (default 5s). |
| I-CC6 | Cached entries remain authoritative across fabric disconnects shorter than max_disconnect_seconds (default 300s). Beyond this threshold, the entire cache (L1 + L2) is wiped. Disconnect is defined as no successful RPC to any canonical endpoint for the threshold duration. Background heartbeat RPCs (every 60s) maintain the last_successful_rpc timestamp. |
| I-CC7 | Any local cache error (L2 I/O failure, CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. |
| I-CC8 | The cache is ephemeral. On process start, the client either creates a new L2 pool (wiping orphaned pools detected via flock) or adopts an existing pool via KISEKI_CACHE_POOL_ID. A kiseki-cache-scrub service runs on node boot and periodically thereafter to clean orphaned pools left by crashed processes. |
| I-CC9 | When the effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). Policy is fetched via data-path gRPC (primary), gateway (secondary), persisted last-known (tertiary), or conservative defaults (fallback). |
| I-CC10 | Cache policy changes apply to new sessions only. Active sessions continue under their session-start policy (consistent with I-WA18). |
| I-CC11 | Staged chunks are fetched from canonical, verified by content address, and stored with pinned retention as a point-in-time snapshot. The staged version is immutable in the cache regardless of canonical updates; to pick up updates, the user must explicitly release and re-stage. Staging enumerates directory trees recursively up to max_staging_depth (10) and max_staging_files (100,000). |
| I-CC12 | On a crypto-shred event, all cached plaintext for the affected tenant is wiped from L1 and L2 with zeroize. Detection is via periodic key health check (default 30s), advisory-channel notification, or a KMS error on the next operation. Maximum detection latency is bounded by min(key_health_interval, max_disconnect_seconds). |
| I-CC13 | L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer. On L2 read, the CRC32 is verified before serving. A mismatch triggers bypass to canonical and deletion of the L2 entry. |
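I-CC13's trailer scheme can be sketched in a few lines of plain Rust. The bitwise CRC-32 below is a stand-in for whatever checksum implementation the client actually uses, and the entry layout (little-endian trailer) plus the function names are illustrative:

```rust
/// Bitwise CRC-32 (IEEE polynomial, reflected); stand-in for a real crate.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if lsb set, else zero
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// At insert time: append the checksum as a 4-byte little-endian trailer.
fn seal_entry(plaintext: &[u8]) -> Vec<u8> {
    let mut entry = plaintext.to_vec();
    entry.extend_from_slice(&crc32(plaintext).to_le_bytes());
    entry
}

/// On L2 read: verify the trailer before serving; None means bypass to canonical.
fn open_entry(entry: &[u8]) -> Option<&[u8]> {
    if entry.len() < 4 {
        return None;
    }
    let (data, trailer) = entry.split_at(entry.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(data) == stored).then_some(data)
}
```

Per I-CC7, a `None` from the verify step would make the caller delete the L2 entry and refetch from canonical rather than attempt any repair.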

Spec references

  • specs/features/native-client.feature — cache hit/invalidation/staging scenarios (extend)
  • specs/features/control-plane.feature — cache policy distribution scenarios (extend)
  • specs/invariants.md — add I-CC1 through I-CC13
  • specs/ubiquitous-language.md — add cache-specific terms
  • specs/failure-modes.md — add F-CC1 through F-CC4
  • specs/assumptions.md — add A-CC1 through A-CC4

ADR-032: Async GatewayOps

Status: Accepted
Date: 2026-04-24
Traces: I-L2, I-L5, I-V3, I-WA2, I-C2, I-C5, I-L8

Context

GatewayOps is a synchronous trait used by all three protocol gateways (S3, NFS, FUSE) to perform reads and writes through the composition and chunk stores. When the Raft-backed log store was introduced, the sync trait required a sync→async bridge (run_on_raft) that blocks the calling thread while waiting for Raft consensus.

Under concurrent load (in-flight requests ≥ the Raft runtime thread count), this causes thread starvation: every Raft runtime thread ends up blocked polling a client_write future, leaving no thread for the Raft core loop to dispatch entries. The current mitigation (KISEKI_RAFT_THREADS = cpus/2) works but wastes resources and caps concurrency at the thread count.

For HPC/ML workloads with hundreds to thousands of concurrent writers, the thread-per-request model is unsustainable. The gateway must not block OS threads while waiting for Raft consensus.

Decision

Make GatewayOps an async trait. All protocol gateways call async methods directly. NFS and FUSE callers bridge async→sync via tokio::runtime::Handle::block_on on a dedicated runtime (the reverse of the current problem, but on threads they own — OS threads that are explicitly meant to block).
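What `rt.block_on` does on those dedicated threads is conceptually simple: poll the future, and park the OS thread until a waker fires. A minimal std-only sketch of that bridge (illustrative only; production code uses tokio's `Handle::block_on`, which additionally drives the runtime's I/O reactor and timers):

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the blocked OS thread.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal async→sync bridge: poll the future, park until woken.
/// This is what an NFS/FUSE worker thread effectively does; it owns
/// the thread, so blocking here starves nothing.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}
```

The key contrast with the old `run_on_raft` bridge is ownership: parking an NFS or FUSE thread is harmless, whereas parking a tokio worker thread removes it from the shared pool and invites starvation.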

Trait change

```rust
// Before (sync)
pub trait GatewayOps: Send + Sync {
    fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    fn list(...) -> Result<...>;
    fn delete(...) -> Result<...>;
    // ...
}

// After (async)
pub trait GatewayOps: Send + Sync {
    async fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    async fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    async fn list(...) -> Result<...>;
    async fn delete(...) -> Result<...>;
    // ...
}
```

Mutex strategy

Replace std::sync::Mutex with tokio::sync::Mutex for CompositionStore and ChunkStore in InMemoryGateway. Lock guards must NOT be held across .await points that perform disk I/O or Raft submissions — acquire, do in-memory work, drop, then await I/O.

Protocol gateway changes

| Protocol | Current | After |
| --- | --- | --- |
| S3 (axum) | `block_in_place(\|\| gateway.write())` | `gateway.write().await` |
| NFS (std::thread) | `gateway.write()` | `rt.block_on(gateway.write())` on the NFS thread |
| FUSE (fuser threads) | `gateway.write()` | `rt.block_on(gateway.write())` on the fuser thread |

S3 becomes fully non-blocking. NFS and FUSE threads block as before, but they own their threads (not tokio worker threads) so no starvation.

LogOps bridge

LogOps::append_delta becomes async. The run_on_raft bridge is removed — the Raft runtime’s handle is used directly via .await from async gateway methods. No mpsc::recv blocking, no thread starvation.

Invariant preservation

The async conversion preserves all invariants by maintaining the same happens-before ordering via .await:

| Invariant | Guarantee |
| --- | --- |
| I-L2 | Gateway awaits Raft commit before returning to the client |
| I-L5 | Chunk writes are awaited before the composition finalize delta |
| I-V3 | Read-your-writes: last_written_seq is set after the awaited write |
| I-C2 | Refcount operations happen after the awaited chunk confirm |
| I-C5 | Capacity check happens before async write submission |
| I-L8 | Shard membership is validated before async rename |
| I-WA2 | Advisory lookups remain sync and bounded (≤500 µs timeout) |

Concurrency model

With async GatewayOps, the concurrency ceiling becomes the tokio task limit (effectively unbounded) instead of the thread count. Thousands of concurrent writes share a fixed thread pool without starvation.

Migration

Big-bang conversion. All callers updated in one pass:

  1. Make GatewayOps async (trait + InMemoryGateway impl)
  2. Replace std::sync::Mutex with tokio::sync::Mutex in the gateway
  3. Make LogOps async, remove run_on_raft bridge
  4. Update S3 handlers: remove block_in_place, use .await
  5. Update NFS server: add rt.block_on() wrapper on NFS threads
  6. Update FUSE daemon: add rt.block_on() wrapper on fuser threads
  7. Update all tests + BDD step definitions
  8. Remove KISEKI_RAFT_THREADS (no longer needed)

Consequences

Benefits:

  • No thread starvation under any concurrency level
  • S3 handler is fully non-blocking (proper async axum)
  • Removes run_on_raft, block_in_place, KISEKI_RAFT_THREADS
  • Single Raft runtime (no dedicated runtime needed)
  • Clean async-all-the-way data path

Costs:

  • Large refactor touching all protocol gateways and tests
  • NFS/FUSE need a tokio runtime handle for block_on
  • tokio::sync::Mutex has slightly higher per-lock overhead than std::sync::Mutex (but eliminates thread starvation)
  • Async trait requires Send + 'static bounds on futures

Risks:

  • tokio::sync::Mutex held across .await can cause deadlocks if not careful. Mitigated by code review rule: never hold gateway mutex across Raft submission or disk I/O.
  • NFS/FUSE block_on on a non-tokio thread: works correctly but must not be called from within a tokio context (same issue we already solved with std::thread::spawn for runtime creation).

Implementation Notes (2026-04-24)

CompositionOps reverted to sync. The initial implementation made CompositionOps async, but holding tokio::sync::Mutex<CompositionStore> across emit_delta().await serialized all writes behind a single Raft round-trip — the same bottleneck as before, just without thread starvation.

Final architecture:

  • GatewayOps: async (S3 handlers await directly)
  • LogOps: async (Raft consensus)
  • CompositionOps: sync (in-memory HashMap operations only)

Gateway write pattern (lock-free):

  1. Lock compositions → create() (sync, microseconds) → drop lock
  2. Emit delta to log (async, Raft consensus, ~8ms) — no lock held
  3. If emission fails, re-acquire lock and rollback (PIPE-ADV-1)
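Assuming compositions are a plain map behind a mutex, the pattern above can be sketched in std-only Rust. All names and the error type are illustrative, and a synchronous `emit_delta` closure stands in for the async Raft round-trip:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct EmitError; // stand-in for the log-emission error type

fn write(
    compositions: &Mutex<HashMap<String, u64>>,
    key: &str,
    emit_delta: impl Fn(&str) -> Result<u64, EmitError>, // stand-in for the Raft emit
) -> Result<u64, EmitError> {
    // 1. Lock, create the composition in memory (microseconds), drop the lock.
    {
        let mut map = compositions.lock().unwrap();
        map.insert(key.to_string(), 0);
    } // guard dropped here; no lock is held during consensus

    // 2. Emit the delta to the log (async Raft consensus, ~8 ms, in the real client).
    match emit_delta(key) {
        Ok(seq) => {
            // Record the committed sequence under a fresh, short-lived lock.
            compositions.lock().unwrap().insert(key.to_string(), seq);
            Ok(seq)
        }
        Err(e) => {
            // 3. Emission failed: re-acquire the lock and roll back (PIPE-ADV-1).
            compositions.lock().unwrap().remove(key);
            Err(e)
        }
    }
}
```

The point of the scoped block in step 1 is that the guard is provably dropped before the slow step begins, which is exactly the rule the mutex strategy section imposes on `.await` points.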

NFS/FUSE bridge: the block_gateway() helper uses block_in_place when called on a tokio worker thread (tests), and a direct block_on on plain OS threads (the production NFS/FUSE daemons).

Result: 1 MB write throughput improved from 39.5 to 380.2 MB/s (9.6×), and 32 concurrent S3 PUTs complete in 50 ms with no deadlock.