Kiseki
Kiseki is a distributed storage system built for HPC and AI workloads. It provides a unified data plane that serves files and objects through multiple protocol gateways (S3, NFS, FUSE) while handling encryption, replication, and caching transparently.
Key Features
- **S3 and NFS gateways** – access the same data through S3-compatible HTTP, NFSv3/v4.2, or a native FUSE mount. Protocol gateways translate wire protocols into operations on the shared log-structured data model.
- **Client-side cache with staging** – a two-tier cache (L1 in-memory, L2 local NVMe) on compute nodes eliminates repeated fabric traversals. Three modes (pinned, organic, bypass) match the dominant workload patterns: epoch-reuse training, mixed inference, and streaming ingest.
- **Per-shard Raft consensus** – every shard is a single-tenant Raft group. Deltas (metadata mutations) are totally ordered within a shard and replicated to a quorum before acknowledgement.
- **Erasure coding and placement** – chunks are stored across affinity pools with configurable EC profiles. The placement engine distributes data across device classes (fast-NVMe, bulk-NVMe) and rebuilds lost chunks from parity.
- **FIPS 140-2/3 encryption** – always-on, two-layer envelope encryption. System DEKs (AES-256-GCM via aws-lc-rs) encrypt chunk data; tenant KEKs wrap the DEKs for access control. Five tenant KMS backends: Kiseki-Internal, HashiCorp Vault, KMIP 2.1, AWS KMS, PKCS#11.
- **GPU-direct and fabric transports** – the native client selects the fastest available transport: libfabric/CXI (Slingshot), RDMA verbs, or TCP+TLS. Transport selection is automatic, based on fabric discovery.
- **Multi-tenant isolation** – tenant hierarchy (organization / project / workload) with per-level quotas, compliance tags, and key isolation. Shards are single-tenant. Cross-tenant data access is out of scope by design.
- **OIDC and mTLS authentication** – Keycloak (or any OIDC provider) for identity; cluster-CA-signed mTLS certificates for data-fabric authentication. Certificate identity is carried in the SAN, so no control-plane access is needed on the hot path.
- **Workflow advisory** – a bidirectional advisory channel carries workload hints (access pattern, prefetch range, priority) inbound and telemetry feedback (backpressure, locality, staleness) outbound. The advisory path runs side-by-side with the data path – it never blocks or delays data operations.
Architecture at a Glance
Kiseki is a single-language Rust system organized as 18 crates in a Cargo workspace:
| Layer | Crates |
|---|---|
| Foundation | kiseki-common, kiseki-proto, kiseki-crypto, kiseki-transport |
| Data path | kiseki-log, kiseki-block, kiseki-chunk, kiseki-composition, kiseki-view |
| Protocol | kiseki-gateway (NFS + S3) |
| Client | kiseki-client (FUSE, FFI, Python via PyO3) |
| Infrastructure | kiseki-raft, kiseki-keymanager, kiseki-audit, kiseki-advisory, kiseki-control |
| Integration | kiseki-server, kiseki-acceptance |
The data model is log-structured: mutations are recorded as deltas appended to per-shard Raft logs. Compositions describe how content-addressed, encrypted chunks assemble into files or objects. Views are materialized projections of shard state, maintained incrementally by stream processors and served by protocol gateways.
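The log-and-view flow can be sketched in a few lines of Python. This is a toy model, not Kiseki's actual types: deltas receive a total order within their shard, and a view is simply the fold of the delta log.

```python
# Toy sketch of the log-structured data model: deltas appended to a
# per-shard log, and a view materialized by folding deltas in order.
from dataclasses import dataclass, field

@dataclass
class Delta:
    seq: int          # total order within the shard
    op: str           # "put" or "delete"
    key: str
    value: bytes = b""

@dataclass
class Shard:
    log: list = field(default_factory=list)

    def append(self, op, key, value=b""):
        # In the real system, Raft replicates the entry to a quorum
        # before acknowledging the write.
        delta = Delta(seq=len(self.log), op=op, key=key, value=value)
        self.log.append(delta)
        return delta

def materialize(log):
    """Fold the delta log into a view (key -> value)."""
    view = {}
    for d in log:
        if d.op == "put":
            view[d.key] = d.value
        elif d.op == "delete":
            view.pop(d.key, None)
    return view

shard = Shard()
shard.append("put", "a.txt", b"v1")
shard.append("put", "a.txt", b"v2")
shard.append("delete", "a.txt")
shard.append("put", "b.txt", b"data")
print(materialize(shard.log))  # {'b.txt': b'data'}
```

In the real system the fold is incremental (stream processors apply new deltas to an existing view) rather than replayed from scratch, but the semantics are the same.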
Four binaries are produced:
| Binary | Role |
|---|---|
| kiseki-server | Storage node (log + chunk + composition + view + gateways + audit + advisory) |
| kiseki-keyserver | HA system key manager (Raft-replicated) |
| kiseki-client-fuse | Compute-node FUSE mount with native client |
| kiseki-control | Control plane (tenancy, IAM, policy, federation) |
Target Workloads
| Workload | How Kiseki helps |
|---|---|
| LLM training | Tokenized datasets staged once per job, served from local NVMe cache across epochs. Pinned cache mode prevents eviction. |
| LLM inference | Model weights cold-started into cache on first load, then served locally for all replicas on the node. |
| Climate / weather simulation | Boundary conditions staged with hard deadline via Slurm prolog. Input files cached; checkpoint writes bypass the cache. |
| HPC checkpoint/restart | Checkpoint writes go straight to canonical (bypass mode). Restart reads benefit from organic caching if the same node is reused. |
Quick Links
- Getting Started – Docker Compose quickstart
- S3 API – supported operations, examples
- NFS Access – NFSv3/v4.2 mount instructions
- FUSE Mount – native client mount
- Python SDK – PyO3 bindings
- Client Cache & Staging – ADR-031 cache modes
Getting Started
This guide walks through running a single-node Kiseki stack with Docker Compose, verifying the deployment, and performing basic S3 operations.
Prerequisites
- Docker 24+ with Compose V2 (docker compose)
- curl (for health checks)
- aws-cli (optional, for S3 operations)

If building from source instead of Docker:

- Rust 1.78+ (stable)
- Protobuf compiler (protoc)
Quick Start with Docker Compose
The repository includes a docker-compose.yml that brings up a
single-node Kiseki server with supporting services:
| Service | Port | Purpose |
|---|---|---|
| kiseki-server | 9000 | S3 HTTP gateway |
| kiseki-server | 2049 | NFS (v3 + v4.2) |
| kiseki-server | 9090 | Prometheus metrics |
| kiseki-server | 9100 | Data-path gRPC |
| kiseki-server | 9101 | Advisory gRPC |
| jaeger | 16686 | Tracing UI |
| jaeger | 4317 | OTLP gRPC receiver |
| vault | 8200 | HashiCorp Vault (dev mode, tenant KMS) |
| keycloak | 8080 | Keycloak (OIDC identity provider) |
Start the stack:
docker compose up --build -d
Wait for all services to become healthy:
docker compose ps
The kiseki-server container sets KISEKI_BOOTSTRAP=true, which
creates an initial shard for immediate use.
Verify the Deployment
Health Check
The data-path gRPC port responds to TCP connections when the server is ready:
# TCP probe on the data-path port
timeout 1 bash -c 'echo > /dev/tcp/127.0.0.1/9100'
echo $? # 0 = healthy
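The same readiness probe can be done from Python with only the standard library, which is handy in orchestration scripts. The snippet below demonstrates against a throwaway local listener so it is self-contained; in practice you would probe `<node>:9100`.

```python
# TCP readiness probe in stdlib Python. A successful connect means the
# data-path port is accepting connections.
import socket

def is_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: bind an ephemeral local listener and probe it.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(is_ready("127.0.0.1", port))  # True
listener.close()
```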
Prometheus Metrics
curl -s http://localhost:9090/metrics | head -20
Jaeger Tracing
Open http://localhost:16686 in a browser to view distributed traces. The server exports traces via OTLP to Jaeger automatically.
Vault (Dev Mode)
Vault runs in dev mode with root token kiseki-e2e-token:
curl -s http://localhost:8200/v1/sys/health | python3 -m json.tool
Keycloak
Keycloak is available at http://localhost:8080
with admin credentials admin / admin.
S3 Operations
With aws-cli configured to point at the local S3 gateway:
# Configure a local profile (no real AWS credentials needed)
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1
# Create a bucket (maps to a Kiseki namespace)
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
# Upload a file
echo "hello kiseki" > /tmp/hello.txt
aws --endpoint-url http://localhost:9000 s3 cp /tmp/hello.txt s3://test-bucket/hello.txt
# Download and verify
aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/hello.txt /tmp/hello-back.txt
cat /tmp/hello-back.txt
Or with curl directly:
# List buckets
curl -s http://localhost:9000/
# PUT an object
curl -X PUT http://localhost:9000/test-bucket/greeting.txt \
-d "hello from curl"
# GET it back
curl -s http://localhost:9000/test-bucket/greeting.txt
Multi-Node Cluster
A three-node cluster configuration is also provided:
docker compose -f docker-compose.3node.yml up --build -d
This starts three kiseki-server instances that form Raft groups for
shard replication.
Building from Source
# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki
cargo build --release
# Run the server
KISEKI_BOOTSTRAP=true \
KISEKI_DATA_DIR=/tmp/kiseki-data \
KISEKI_S3_ADDR=0.0.0.0:9000 \
KISEKI_NFS_ADDR=0.0.0.0:2049 \
KISEKI_DATA_ADDR=0.0.0.0:9100 \
KISEKI_METRICS_ADDR=0.0.0.0:9090 \
./target/release/kiseki-server
Next Steps
- S3 API – full list of supported S3 operations
- NFS Access – mount via NFS
- FUSE Mount – native client mount on compute nodes
- Python SDK – use Kiseki from Python workloads
- Client Cache & Staging – pre-stage datasets for training jobs
S3 API
Kiseki exposes an S3-compatible HTTP gateway on port 9000 (configurable
via KISEKI_S3_ADDR). The gateway implements the subset of S3 API
operations needed by HPC/AI workloads (ADR-014). Unsupported operations
return 501 Not Implemented.
Endpoint
http://<node>:9000
In the Docker Compose development stack, the endpoint is
http://localhost:9000.
Authentication
Kiseki supports AWS Signature Version 4 authentication:
- Authorization header – standard SigV4 signing for aws-cli, boto3, and other SDK clients.
- Presigned URLs – planned for a future release (not yet implemented).
In development mode (Docker Compose), any access key and secret key values are accepted.
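For reference, SigV4 signing starts from a signing key derived through a fixed chain of HMAC-SHA256 operations; this is what aws-cli and boto3 do under the hood for every request. A stdlib-only sketch with placeholder credentials (not Kiseki-specific code):

```python
# SigV4 signing-key derivation: AWS4-prefixed secret -> date -> region ->
# service -> "aws4_request". Each step is HMAC-SHA256.
import hashlib
import hmac

def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    k_date = _hmac(("AWS4" + secret).encode(), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# Placeholder dev credentials, as used in the Docker Compose stack.
key = signing_key("kiseki", "20250101", "us-east-1", "s3")
print(len(key))  # 32 (an HMAC-SHA256 digest)
```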
Supported Operations
Bucket Operations
S3 buckets map to Kiseki namespaces. Creating a bucket creates a tenant-scoped namespace; deleting a bucket deletes the namespace.
| Operation | S3 API | Notes |
|---|---|---|
| Create bucket | PUT /{bucket} | Maps to namespace creation |
| Delete bucket | DELETE /{bucket} | Maps to namespace deletion |
| Head bucket | HEAD /{bucket} | Existence check |
| List buckets | GET / | Per-tenant bucket listing |
Object Operations
| Operation | S3 API | Notes |
|---|---|---|
| Put object | PUT /{bucket}/{key} | Single-part upload |
| Get object | GET /{bucket}/{key} | Including byte-range reads (Range header) |
| Head object | HEAD /{bucket}/{key} | Metadata retrieval |
| Delete object | DELETE /{bucket}/{key} | Tombstone or delete marker (versioning) |
| List objects | GET /{bucket}?list-type=2 | ListObjectsV2 with prefix, delimiter, pagination |
Multipart Upload
For objects larger than a single PUT (large datasets, model weights):
| Operation | S3 API | Notes |
|---|---|---|
| Create multipart upload | POST /{bucket}/{key}?uploads | Returns upload ID |
| Upload part | PUT /{bucket}/{key}?partNumber={n}&uploadId={id} | Upload one part |
| Complete multipart upload | POST /{bucket}/{key}?uploadId={id} | Assemble parts into final object |
| Abort multipart upload | DELETE /{bucket}/{key}?uploadId={id} | Clean up incomplete upload |
| List multipart uploads | GET /{bucket}?uploads | List in-progress uploads |
| List parts | GET /{bucket}/{key}?uploadId={id} | List parts of an in-progress upload |
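The multipart lifecycle can be illustrated with a toy in-memory assembler (a hypothetical helper, not the gateway implementation): parts may be uploaded in any order, and completion concatenates them by ascending part number.

```python
# Toy multipart-upload lifecycle: create -> upload parts -> complete/abort.
import uuid

class MultipartStore:
    def __init__(self):
        self.uploads = {}   # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> assembled bytes

    def create(self, key):
        upload_id = uuid.uuid4().hex   # "Returns upload ID"
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        self.uploads[upload_id][part_number] = data

    def complete(self, key, upload_id):
        parts = self.uploads.pop(upload_id)
        # Parts are concatenated in ascending part-number order.
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort(self, upload_id):
        self.uploads.pop(upload_id, None)  # clean up incomplete upload

store = MultipartStore()
uid = store.create("models/gpt.bin")
store.upload_part(uid, 2, b"world")      # out-of-order arrival is fine
store.upload_part(uid, 1, b"hello ")
store.complete("models/gpt.bin", uid)
print(store.objects["models/gpt.bin"])   # b'hello world'
```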
Versioning
| Operation | S3 API | Notes |
|---|---|---|
| Get object version | GET /{bucket}/{key}?versionId={v} | Specific version retrieval |
| List object versions | GET /{bucket}?versions | Version listing |
| Delete object version | DELETE /{bucket}/{key}?versionId={v} | Delete specific version |
Conditional Operations
| Header | Direction | Notes |
|---|---|---|
| If-None-Match | Write | Conditional write (create-if-not-exists) |
| If-Match | Write | Conditional write (update-if-matches) |
| If-Modified-Since | Read | Conditional read |
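The write-side semantics in the table can be sketched with a dict-backed store and MD5 ETags. This is an illustrative helper, not the gateway's code:

```python
# Conditional-write semantics: If-None-Match: * (create-if-not-exists)
# and If-Match: <etag> (update-if-matches).
import hashlib

class ConditionFailed(Exception):
    pass

def etag(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def conditional_put(store, key, data, if_none_match=None, if_match=None):
    if if_none_match == "*" and key in store:
        raise ConditionFailed("object already exists")   # create-if-not-exists
    if if_match is not None and etag(store.get(key, b"")) != if_match:
        raise ConditionFailed("ETag mismatch")           # update-if-matches
    store[key] = data
    return etag(data)

store = {}
tag = conditional_put(store, "cfg", b"v1", if_none_match="*")  # creates
try:
    conditional_put(store, "cfg", b"v2", if_none_match="*")    # rejected
except ConditionFailed as e:
    print(e)  # object already exists
conditional_put(store, "cfg", b"v2", if_match=tag)             # succeeds
```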
Examples
aws-cli
# Set up environment
export AWS_ACCESS_KEY_ID=kiseki
export AWS_SECRET_ACCESS_KEY=kiseki
export AWS_DEFAULT_REGION=us-east-1
ENDPOINT="--endpoint-url http://localhost:9000"
# Bucket operations
aws $ENDPOINT s3 mb s3://datasets
aws $ENDPOINT s3 ls
# Upload a directory
aws $ENDPOINT s3 sync ./training-data/ s3://datasets/imagenet/
# Download a file
aws $ENDPOINT s3 cp s3://datasets/imagenet/train.tar /tmp/train.tar
# Multipart upload (automatic for large files)
aws $ENDPOINT s3 cp ./large-model.bin s3://datasets/models/gpt.bin
# List objects with prefix
aws $ENDPOINT s3 ls s3://datasets/imagenet/ --recursive
# Delete
aws $ENDPOINT s3 rm s3://datasets/imagenet/train.tar
curl
# Create a bucket
curl -X PUT http://localhost:9000/my-bucket
# PUT an object
curl -X PUT http://localhost:9000/my-bucket/config.json \
-H "Content-Type: application/json" \
-d '{"epochs": 100, "batch_size": 32}'
# GET an object
curl -s http://localhost:9000/my-bucket/config.json
# HEAD an object (metadata only)
curl -I http://localhost:9000/my-bucket/config.json
# Byte-range read (first 1024 bytes)
curl -s http://localhost:9000/my-bucket/large-file.bin \
-H "Range: bytes=0-1023"
# DELETE an object
curl -X DELETE http://localhost:9000/my-bucket/config.json
# List objects (ListObjectsV2)
curl -s "http://localhost:9000/my-bucket?list-type=2&prefix=models/"
# Delete a bucket
curl -X DELETE http://localhost:9000/my-bucket
Python (boto3)
import boto3
s3 = boto3.client(
"s3",
endpoint_url="http://localhost:9000",
aws_access_key_id="kiseki",
aws_secret_access_key="kiseki",
region_name="us-east-1",
)
# Create bucket
s3.create_bucket(Bucket="training")
# Upload
s3.put_object(Bucket="training", Key="data.csv", Body=b"col1,col2\n1,2\n")
# Download
obj = s3.get_object(Bucket="training", Key="data.csv")
print(obj["Body"].read().decode())
# List
for item in s3.list_objects_v2(Bucket="training")["Contents"]:
print(item["Key"], item["Size"])
Bucket-to-Namespace Mapping
Every S3 bucket maps 1:1 to a Kiseki namespace within the authenticated tenant’s scope. Bucket names become namespace identifiers. Buckets from different tenants are fully isolated – two tenants can have buckets with the same name without conflict.
Objects within a bucket map to Kiseki compositions. Each object version corresponds to a sequence of deltas in the shard that owns the namespace.
Encryption Handling
Kiseki always encrypts all data (invariant I-K1). S3 server-side encryption headers are handled as follows:
| Header | Behavior |
|---|---|
| SSE-S3 (x-amz-server-side-encryption: AES256) | Acknowledged, no-op. System encryption is always on. |
| SSE-KMS with matching ARN | Acknowledged if the ARN matches the tenant KMS config. |
| SSE-KMS with different ARN | Rejected. Tenants cannot specify arbitrary keys. |
| SSE-C (x-amz-server-side-encryption-customer-*) | Rejected. Kiseki manages encryption, not the client. |
Limitations
The following S3 features are not implemented:
| Feature | Reason |
|---|---|
| Lifecycle policies | Kiseki has its own tiering and retention model |
| Event notifications (SNS/SQS) | Requires message bus integration |
| Presigned URLs | Planned for future release |
| Bucket policies / IAM | Kiseki uses its own IAM and policy model |
| CORS | Not relevant for HPC/AI workloads |
| Object Lock | Covered by Kiseki’s retention hold mechanism |
| S3 Select | Out of scope |
| Replication configuration | Kiseki manages replication internally |
| Storage classes | Kiseki uses affinity pools, not S3 storage classes |
NFS Access
Kiseki exposes an NFS gateway on port 2049 (configurable via
KISEKI_NFS_ADDR) supporting both NFSv3 and NFSv4.2. The gateway
translates NFS operations into reads and writes against materialized
views and the composition log.
Protocol Support
| Protocol | Status | Notes |
|---|---|---|
| NFSv3 | Supported | Stateless, lower overhead |
| NFSv4.2 | Supported | Stateful, with lock support and extended attributes |
Mounting
Basic Mount
mount -t nfs <node>:/ /mnt/kiseki
With explicit version and options:
# NFSv4.2
mount -t nfs -o vers=4.2,proto=tcp <node>:/ /mnt/kiseki
# NFSv3
mount -t nfs -o vers=3,proto=tcp <node>:/ /mnt/kiseki
Docker Compose (Development)
When using the development Docker Compose stack, the NFS port is published to the host:
mount -t nfs -o vers=4.2,proto=tcp,port=2049 127.0.0.1:/ /mnt/kiseki
fstab Entry
<node>:/ /mnt/kiseki nfs vers=4.2,proto=tcp,hard,intr 0 0
Authentication
| Mode | Use case | Notes |
|---|---|---|
| AUTH_SYS | Development and testing | UID/GID-based, no Kerberos |
| Kerberos (RPCSEC_GSS) | Production | krb5, krb5i, or krb5p security flavors |
In development (Docker Compose), AUTH_SYS is used with no additional configuration. For production deployments, Kerberos provides authentication and optional integrity/privacy protection on the wire.
Kiseki always encrypts data at rest regardless of the NFS authentication mode. The gateway performs tenant-layer encryption: clients send plaintext over TLS to the gateway, and the gateway encrypts before writing to the log and chunk store.
Supported Operations
Full Semantics
| Operation | Notes |
|---|---|
| open, close, read, write | Standard file I/O |
| create, unlink | File creation and deletion |
| mkdir, rmdir | Directory creation and deletion |
| rename (within namespace) | Atomic within shard |
| stat, fstat, lstat | File metadata |
| chmod, chown | Permission changes (stored in delta attributes) |
| readdir, readdirplus | Directory listing from materialized view |
| symlink, readlink | Stored as inline data in delta |
| truncate, ftruncate | Composition resize |
| fsync, fdatasync | Flush to durable (delta committed to Raft quorum) |
| Extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr |
| POSIX file locks (fcntl) | Per-gateway lock state |
| O_APPEND | Atomic append via delta |
| O_CREAT, O_EXCL | Atomic create-if-not-exists |
Limited Semantics
| Operation | Limitation |
|---|---|
| rename (cross-namespace) | Returns EXDEV – cannot rename across shards |
| Hard links | Within namespace only; cross-namespace returns EXDEV |
| Sparse files | Holes tracked in composition; zero-fill on read |
| O_DIRECT | Bypasses client cache but still traverses the gateway |
| flock (advisory) | Best-effort; not guaranteed across gateway failover |
Not Supported
| Operation | Reason |
|---|---|
| Writable shared mmap | Distributed shared writable mmap requires page-level coherence that is not tractable at HPC scale. Read-only mmap is supported. The gateway returns ENOTSUP. See ADR-013. |
| POSIX ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without benefit for the target workloads. |
Namespace Mapping
The NFS root (/) lists the tenant’s namespaces as top-level
directories. Each namespace contains the compositions (files and
directories) belonging to that namespace. This is analogous to the S3
bucket mapping – the same namespace appears as a bucket via S3 and as a
top-level directory via NFS.
/mnt/kiseki/
training/ <- namespace "training"
imagenet/
train.tar
val.tar
checkpoints/ <- namespace "checkpoints"
epoch-001.pt
Performance Considerations
- **Readdir performance** – directory listings are served from materialized views, not reconstructed from the log on each request. Views are updated incrementally by stream processors.
- **Write path** – writes flow through the gateway to the composition context, which appends deltas to the shard log. An fsync ensures the delta is committed to a Raft quorum before returning.
- **Concurrent access** – multiple NFS clients can read the same files concurrently. Write contention within a shard is serialized by the Raft leader.
- **Large files** – large files are chunked using content-defined chunking (Rabin fingerprinting). Byte-range reads are served by fetching only the relevant chunks.
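To illustrate why content-defined chunking keeps byte-range reads cheap and chunk boundaries stable under insertions, here is a minimal sketch. Kiseki uses Rabin fingerprinting; the toy rolling hash, mask, and size parameters below are illustrative only, not the real algorithm or configuration.

```python
# Minimal content-defined chunking sketch: a rolling hash is tested
# against a boundary mask; a chunk ends where the masked hash is zero,
# subject to minimum and maximum chunk sizes.
import random

def cdc_chunks(data: bytes, mask=0x3F, min_size=32, max_size=4096):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy rolling hash, not Rabin
        size = i - start + 1
        if size >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])   # content-derived boundary
            start, h = i + 1, 0
        elif size >= max_size:
            chunks.append(data[start:i + 1])   # forced boundary
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(0)  # deterministic pseudo-random test data
data = bytes(random.getrandbits(8) for _ in range(20_000))
chunks = cdc_chunks(data)
assert b"".join(chunks) == data   # chunking is lossless
print(len(chunks), max(len(c) for c in chunks))
```

Because boundaries depend only on local content, inserting bytes early in a file shifts at most the chunks around the edit; later chunks keep their identity, which is what makes content-addressed storage and caching effective.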
Limitations Summary
- **No writable shared mmap** – applications that use writable shared memory-mapped files must use write() instead. Read-only mmap works and is useful for model loading.
- **Cross-namespace rename returns EXDEV** – renaming a file from one namespace to another requires a copy-and-delete at the application level, the same as moving files across filesystem boundaries on a traditional system.
- **No POSIX ACLs** – only standard Unix permissions (mode bits). Fine-grained access control is handled by Kiseki’s tenant IAM model, not filesystem-level ACLs.
- **Lock state is per-gateway** – POSIX file locks (fcntl) are maintained by the gateway instance. If a gateway fails over, lock state is lost. Advisory locks (flock) are best-effort.
FUSE Mount
The Kiseki native client provides a FUSE mount that exposes the distributed storage as a local filesystem on compute nodes. Unlike the NFS gateway, the FUSE client runs in the workload’s process space and performs client-side encryption – plaintext never leaves the process.
Building
The FUSE mount is feature-gated. Build the client binary with the fuse
feature:
cargo build --release --bin kiseki-client-fuse --features fuse
This requires the fuser crate, which depends on the FUSE kernel module
being available on the host:
- Linux: install fuse3 or libfuse3-dev
- macOS: install macFUSE
Mounting
kiseki-client-fuse mount /mnt/kiseki \
--data-addr <storage-node>:9100 \
--tenant <tenant-id> \
--namespace <namespace-id>
Mount Options
Options are passed with -o:
kiseki-client-fuse mount /mnt/kiseki \
-o cache_mode=organic \
-o cache_dir=/local-nvme/kiseki-cache \
-o cache_l2_max=100G \
-o meta_ttl_ms=5000
| Option | Values | Default | Description |
|---|---|---|---|
| cache_mode | pinned, organic, bypass | organic | Cache operating mode (see Client Cache) |
| cache_dir | path | /tmp/kiseki-cache | L2 NVMe cache directory |
| cache_l1_max | bytes | 256M | L1 (in-memory) cache size |
| cache_l2_max | bytes | 50G | L2 (NVMe) cache size per process |
| meta_ttl_ms | milliseconds | 5000 | Metadata cache TTL |
Environment Variables
Mount options can also be set via environment variables. Mount options take priority over environment variables.
| Variable | Equivalent option |
|---|---|
| KISEKI_CACHE_MODE | cache_mode |
| KISEKI_CACHE_DIR | cache_dir |
| KISEKI_CACHE_L1_MAX | cache_l1_max |
| KISEKI_CACHE_L2_MAX | cache_l2_max |
| KISEKI_CACHE_META_TTL_MS | meta_ttl_ms |
| KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (see staging handoff) |
Supported Operations
Read/Write
| Operation | Supported | Notes |
|---|---|---|
| read | Yes | Served from cache (L1 -> L2 -> canonical) |
| write | Yes | Writes to canonical; local metadata cache updated immediately |
| open / close | Yes | Standard file handles |
| fsync / fdatasync | Yes | Flushes delta to Raft quorum |
| truncate / ftruncate | Yes | Composition resize |
| O_APPEND | Yes | Atomic append via delta |
| O_CREAT / O_EXCL | Yes | Atomic create-if-not-exists |
| O_DIRECT | Limited | Bypasses client cache, still goes through FUSE |
Directory Operations
| Operation | Supported | Notes |
|---|---|---|
| mkdir / rmdir | Yes | Create and remove directories |
| readdir / readdirplus | Yes | Listing from materialized view |
| rename (within namespace) | Yes | Atomic within shard |
| rename (cross-namespace) | No | Returns EXDEV |
Metadata and Links
| Operation | Supported | Notes |
|---|---|---|
| stat / fstat / lstat | Yes | File metadata |
| chmod / chown | Yes | Stored in delta attributes |
| symlink / readlink | Yes | Symlink targets stored as inline data |
| Hard links (within namespace) | Yes | |
| Hard links (cross-namespace) | No | Returns EXDEV |
| xattr operations | Yes | getxattr, setxattr, listxattr, removexattr |
Nested Directories and Write-at-Offset
The FUSE filesystem supports full directory trees within a namespace. Files can be created in nested directories, and writes at arbitrary offsets within a file are supported (the composition tracks chunk references and handles sparse regions with zero-fill).
mkdir -p /mnt/kiseki/experiments/run-42/logs
echo "epoch 1 loss: 0.3" > /mnt/kiseki/experiments/run-42/logs/train.log
# Write at offset (sparse file)
dd if=/dev/zero of=/mnt/kiseki/data/sparse.bin bs=1 count=1 seek=1048576
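The hole-handling described above can be sketched in a few lines: a composition maps extents of written data to offsets, and any read that falls in a gap comes back as zeros. Illustrative only, not the real composition structure.

```python
# Zero-fill for sparse regions: reads inside holes return zeros.
def sparse_read(extents, offset, length):
    """extents: sorted, non-overlapping list of (start_offset, bytes)."""
    out = bytearray(length)  # holes read back as zeros by default
    for start, data in extents:
        lo = max(start, offset)
        hi = min(start + len(data), offset + length)
        if lo < hi:  # extent overlaps the requested range
            out[lo - offset:hi - offset] = data[lo - start:hi - start]
    return bytes(out)

# A file with a header at offset 0 and one byte written at 1 MiB.
extents = [(0, b"header"), (1048576, b"\x01")]
print(sparse_read(extents, 0, 6))    # b'header'
print(sparse_read(extents, 100, 4))  # b'\x00\x00\x00\x00' (a hole)
```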
Not Supported
| Operation | Reason |
|---|---|
| Writable shared mmap | Returns ENOTSUP. Read-only mmap works. Use write() instead. (ADR-013) |
| POSIX ACLs | Unix permissions only (uid/gid/mode) |
Cache Mode Selection
The cache mode determines how aggressively the client caches data on local storage. Choose the mode that matches your workload:
| Mode | Best for | Behavior |
|---|---|---|
| pinned | Training (epoch reuse), inference (model weights) | Chunks retained until explicit release. Populate via staging API. |
| organic | Mixed workloads, interactive use | LRU eviction with usage-weighted retention. Default. |
| bypass | Streaming ingest, checkpoint writes, one-shot scans | No caching. All reads go directly to canonical storage. |
# Training job: pin the dataset
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=pinned
# Interactive exploration
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=organic
# Checkpoint writer
kiseki-client-fuse mount /mnt/kiseki -o cache_mode=bypass
See Client Cache & Staging for staging pre-fetch, Slurm integration, and policy configuration.
Transport Selection
The native client automatically selects the fastest available transport to reach storage nodes:
- libfabric/CXI (Slingshot) – if available on the fabric
- RDMA verbs – if InfiniBand/RoCE is available
- TCP+TLS – universal fallback
Transport selection is automatic and requires no configuration. The client discovers available transports during fabric discovery at startup (ADR-008).
Unmounting
fusermount -u /mnt/kiseki # Linux
umount /mnt/kiseki # macOS
On clean unmount, the L2 cache pool is wiped (all chunk files are
zeroized and deleted). On crash, the orphaned cache pool is cleaned up
by the next client process or by the kiseki-cache-scrub service.
Python SDK
Kiseki provides Python bindings via PyO3, exposing
the native client’s cache, staging, and workflow advisory APIs to Python
workloads. The bindings are part of the kiseki-client crate, enabled
with the python feature flag.
Building
Build and install the Python module using maturin:
pip install maturin
maturin develop --features python
This builds the native Rust code and installs the kiseki module into
the active Python environment.
For a release build:
maturin build --release --features python
pip install target/wheels/kiseki-*.whl
Quick Start
import kiseki
# Create a client with organic caching (default)
client = kiseki.Client(cache_mode="organic", cache_dir="/tmp/kiseki-cache")
# Stage a dataset into the local cache
client.stage("/training/imagenet")
# ... workload reads via FUSE or native API ...
# Check cache statistics
stats = client.cache_stats()
print(stats)
# CacheStats(l1_hits=42, l2_hits=1500, misses=200, l1_bytes=134217728, l2_bytes=5368709120, wipes=0)
# Release the staged dataset
client.release("/training/imagenet")
# Clean up
client.close()
API Reference
kiseki.Client
The main entry point. Each Client instance manages its own cache pool
(L1 in-memory + L2 NVMe) and advisory session.
Constructor
client = kiseki.Client(
cache_mode="organic", # "pinned", "organic", or "bypass"
cache_dir="/tmp/kiseki-cache", # L2 NVMe cache directory
cache_l2_max=50 * 1024**3, # L2 max bytes (default: 50 GB)
meta_ttl_ms=5000, # Metadata TTL in ms (default: 5000)
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_mode | str | "organic" | Cache mode: "pinned", "organic", or "bypass" |
| cache_dir | str | "/tmp/kiseki-cache" | Directory for L2 NVMe cache files |
| cache_l2_max | int or None | None (50 GB) | Maximum L2 cache size in bytes |
| meta_ttl_ms | int | 5000 | Metadata cache TTL in milliseconds |
stage(namespace_path: str) -> None
Pre-fetch a dataset’s chunks into the local cache with pinned retention.
The dataset is identified by its namespace path (e.g.,
"/training/imagenet"). Staging is idempotent – re-staging an
already-staged dataset is a no-op.
client.stage("/training/imagenet")
client.stage("/training/imagenet") # no-op, already staged
For directory paths, staging recursively enumerates all files up to a depth of 10 and a maximum of 100,000 files.
stage_status() -> list[str]
Return the namespace paths of all currently staged datasets.
paths = client.stage_status()
# ["/training/imagenet", "/models/gpt-3"]
release(namespace_path: str) -> None
Release a staged dataset, unpinning its chunks and making them eligible for eviction.
client.release("/training/imagenet")
release_all() -> None
Release all staged datasets.
client.release_all()
cache_stats() -> CacheStatsView
Return current cache statistics.
stats = client.cache_stats()
print(f"L1 hits: {stats.l1_hits}")
print(f"L2 hits: {stats.l2_hits}")
print(f"Misses: {stats.misses}")
print(f"L1 used: {stats.l1_bytes / 1024**2:.0f} MB")
print(f"L2 used: {stats.l2_bytes / 1024**3:.1f} GB")
print(f"Wipes: {stats.wipes}")
cache_mode() -> str
Return the current cache mode as a string.
print(client.cache_mode()) # "organic"
declare_workflow() -> int
Declare a new workflow for advisory integration. Returns a workflow ID (128-bit integer) that can be used to correlate operations with the advisory channel for telemetry feedback.
wf_id = client.declare_workflow()
# ... run training epochs ...
client.end_workflow(wf_id)
end_workflow(workflow_id: int) -> None
End a previously declared workflow.
wipe() -> None
Immediately wipe the entire cache (L1 + L2). All cached plaintext is zeroized before deletion.
close() -> None
Wipe the cache and release resources. Call this when the workload is
done. Equivalent to wipe().
kiseki.CacheStatsView
Read-only statistics object returned by cache_stats().
| Attribute | Type | Description |
|---|---|---|
| l1_hits | int | Number of L1 (memory) cache hits |
| l2_hits | int | Number of L2 (NVMe) cache hits |
| misses | int | Number of cache misses (fetched from canonical) |
| l1_bytes | int | Current L1 memory usage in bytes |
| l2_bytes | int | Current L2 disk usage in bytes |
| wipes | int | Number of full cache wipes |
Example: Training Workflow
import kiseki
def train():
# Pin the dataset for the duration of training
client = kiseki.Client(cache_mode="pinned", cache_dir="/local-nvme/cache")
# Pre-stage the dataset (ideally done in Slurm prolog)
client.stage("/training/imagenet-22k")
# Declare a workflow for advisory telemetry
wf_id = client.declare_workflow()
try:
for epoch in range(100):
# Dataset reads hit L2 cache after first epoch
# ... training loop reads from /mnt/kiseki/training/imagenet-22k/ ...
pass
stats = client.cache_stats()
hits = stats.l1_hits + stats.l2_hits
print(f"Cache hit rate: {hits / (hits + stats.misses) * 100:.1f}%")
finally:
client.end_workflow(wf_id)
client.release_all()
client.close()
if __name__ == "__main__":
train()
Example: Inference with Organic Caching
import kiseki
client = kiseki.Client(cache_mode="organic", cache_l2_max=20 * 1024**3)
# Model weights are cached on first load, then served from L2
# Prompt data is cached with LRU eviction
wf_id = client.declare_workflow()
try:
# ... inference serving loop ...
pass
finally:
client.end_workflow(wf_id)
client.close()
Example: Checkpoint Writer (No Caching)
import kiseki
# Bypass mode: checkpoint writes go straight to canonical
client = kiseki.Client(cache_mode="bypass")
# ... write checkpoints to /mnt/kiseki/checkpoints/ ...
client.close()
Environment Variable Overrides
The Python client respects the same environment variables as the FUSE mount and CLI:
| Variable | Description |
|---|---|
| KISEKI_CACHE_MODE | Override cache mode |
| KISEKI_CACHE_DIR | Override cache directory |
| KISEKI_CACHE_L1_MAX | Override L1 max bytes |
| KISEKI_CACHE_L2_MAX | Override L2 max bytes |
| KISEKI_CACHE_META_TTL_MS | Override metadata TTL |
| KISEKI_CACHE_POOL_ID | Adopt an existing cache pool (staging handoff) |
Constructor parameters take priority over environment variables. All client-set values are clamped to the effective policy ceilings set by tenant and cluster administrators.
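The resolution order can be sketched as a small helper: explicit parameter beats environment variable beats default, and the result is clamped to the administrator's policy ceiling. The function name and shape are illustrative, not the client's actual internals.

```python
# Config precedence sketch: parameter > env var > default, then clamp
# to the effective policy ceiling (ceilings always win).
import os

def resolve_l2_max(param=None, default=50 * 1024**3, ceiling=None):
    value = param
    if value is None:
        env = os.environ.get("KISEKI_CACHE_L2_MAX")
        value = int(env) if env else default
    if ceiling is not None:
        value = min(value, ceiling)  # client-set values never exceed policy
    return value

os.environ["KISEKI_CACHE_L2_MAX"] = str(100 * 1024**3)
print(resolve_l2_max())                      # env var overrides the default
print(resolve_l2_max(param=20 * 1024**3))    # explicit parameter wins
print(resolve_l2_max(ceiling=10 * 1024**3))  # clamped to the policy ceiling
```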
Client Cache & Staging
The client-side cache (ADR-031) eliminates repeated data transfers
across the storage fabric by caching decrypted plaintext chunks on
compute-node local NVMe. It is a library-level module in
kiseki-client, shared across all access modes: FUSE, FFI, Python, and
native Rust.
Architecture
canonical (fabric) -> decrypt -> cache store (NVMe) -> serve to caller
^
cache hit path (no fabric, no decrypt)
Two-Tier Storage
| Tier | Backing | Capacity | Purpose |
|---|---|---|---|
| L1 (Hot) | In-memory HashMap | 256 MB default | Sub-microsecond hits for active working set |
| L2 (Warm) | Local NVMe files | 50 GB default | Large capacity for datasets and model weights |
Read path: L1 -> L2 (with CRC32 verification) -> canonical (decrypt + SHA-256 verify + store in L1/L2).
L2 files are organized per-process with isolated cache pools:
$KISEKI_CACHE_DIR/
<tenant_id_hex>/
<pool_id>/ <- per-process pool (128-bit CSPRNG)
chunks/
<prefix>/
<chunk_id_hex> <- plaintext + CRC32 trailer
meta/
file_chunks.db
staging/
<dataset_id>.manifest
pool.lock <- flock proves process is alive
Each client process creates its own pool directory. Multiple concurrent same-tenant processes on the same node have fully independent pools with no contention.
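The chunk-file format in the tree above (plaintext plus a CRC32 trailer, sharded across prefix directories, owner-only permissions) can be sketched with the standard library. The path helpers mirror the layout loosely; this is not the client's actual on-disk code.

```python
# L2 cache chunk file sketch: plaintext followed by a 4-byte CRC32
# trailer, verified on every read (the "L2 with CRC32 verification" step).
import os
import struct
import tempfile
import zlib

def chunk_path(cache_dir, tenant_id, pool_id, chunk_id):
    prefix = chunk_id[:2]  # shard chunk files across prefix directories
    return os.path.join(cache_dir, tenant_id, pool_id, "chunks", prefix, chunk_id)

def write_chunk(path, plaintext: bytes):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    crc = zlib.crc32(plaintext)
    with open(path, "wb") as f:
        f.write(plaintext + struct.pack("<I", crc))
    os.chmod(path, 0o600)  # owner-only, per the security model

def read_chunk(path) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    plaintext, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    if zlib.crc32(plaintext) != crc:
        raise IOError("CRC32 mismatch: corrupt cache entry")
    return plaintext

with tempfile.TemporaryDirectory() as d:
    p = chunk_path(d, "74656e", "pool01", "deadbeef")
    write_chunk(p, b"decrypted chunk bytes")
    print(read_chunk(p))  # b'decrypted chunk bytes'
```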
Security Model
The cache stores decrypted plaintext on local NVMe. This is acceptable because:
- The compute node already holds decrypted data in process memory (computation requires plaintext)
- L2 NVMe is local to the compute node, same trust domain as process memory
- L2 is ephemeral – wiped on process exit and on long disconnect
- All cached data is overwritten with zeros (`zeroize`) before deallocation or eviction
- File permissions are `0600`, owned by the process UID
- Orphaned pools from crashes are cleaned by the `kiseki-cache-scrub` service
Cache Modes
Three modes are available, selected per client instance at session establishment.
Pinned Mode
For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate simulations (boundary conditions).
- Chunks are retained against eviction until explicit `release()`
- Populated via the staging API or on first access
- Staging captures a point-in-time snapshot; canonical updates do not invalidate pinned data
- Capacity bounded by `max_cache_bytes`; staging beyond capacity returns `CacheCapacityExceeded`
Organic Mode
Default for mixed workloads. LRU with usage-weighted retention.
- Chunks cached on first read, evicted when capacity is reached
- Frequently accessed chunks promoted to L1
- L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
- Metadata cache with configurable TTL (default 5 seconds)
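The usage-weighted eviction rule ("accessed N times survives N rounds") can be modeled as a round-based weight decrement (an illustrative model, not the client's actual data structures):

```python
def eviction_round(cache, pinned):
    """One usage-weighted eviction round (sketch): walk chunks from
    least- to most-recently used, decrement each candidate's access
    weight, and evict the first chunk whose weight is exhausted.
    Pinned chunks are never candidates."""
    for chunk_id in sorted(cache, key=lambda c: cache[c]["last_access"]):
        if chunk_id in pinned:
            continue
        cache[chunk_id]["weight"] -= 1
        if cache[chunk_id]["weight"] <= 0:
            del cache[chunk_id]
            return chunk_id
    return None

# "a" was read once, "b" three times: "b" survives three rounds.
cache = {"a": {"last_access": 1, "weight": 1},
         "b": {"last_access": 2, "weight": 3}}
```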
Bypass Mode
For workloads that do not benefit from caching: streaming ingest, one-shot scans, checkpoint writes.
- All reads go directly to canonical
- No L1 or L2 storage consumed
- Zero overhead beyond mode selection
Staging API
Client-local operation for pre-populating the cache in pinned mode. Pull-based – the client fetches from canonical.
CLI
# Stage a dataset
kiseki-client stage --dataset /training/imagenet
# Stage in daemon mode (for Slurm prolog)
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
# Check staging status
kiseki-client stage --status
# Release a dataset
kiseki-client stage --release /training/imagenet
# Release all
kiseki-client stage --release-all
Rust API
let result = cache_manager.stage("/training/imagenet").await?;
let datasets = cache_manager.stage_status();
cache_manager.release("/training/imagenet");
cache_manager.release_all();
Python API
client.stage("/training/imagenet")
paths = client.stage_status()
client.release("/training/imagenet")
client.release_all()
C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs);
kiseki_stage_status(handle, &status);
kiseki_release(handle, "/training/imagenet");
Staging Flow
- Resolve `namespace_path` to compositions via canonical. For directory paths, recursively enumerate all files up to `max_staging_depth` (10) and `max_staging_files` (100,000).
- Extract the full chunk list from all resolved compositions.
- For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention.
- Write a staging manifest listing all compositions and chunk IDs.
- Report progress (chunks staged / total, bytes, elapsed).
Staging is idempotent – re-staging an already-staged dataset is a no-op. Partial staging (interrupted) can be resumed by re-running the command.
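The idempotency and resumability follow from the skip-if-cached check in step 3; a minimal model (hypothetical `stage` helper, with a dict standing in for L2):

```python
import hashlib

def stage(chunk_ids, l2, fetch_and_decrypt):
    """Staging sketch: fetch only chunks absent from L2, verify the
    SHA-256 content address, store with pinned retention. Re-running
    over an already-staged set is a no-op."""
    staged = 0
    for cid in chunk_ids:
        if cid in l2:                  # resume / idempotency check
            continue
        data = fetch_and_decrypt(cid)
        assert hashlib.sha256(data).hexdigest() == cid, "content-address mismatch"
        l2[cid] = {"data": data, "pinned": True}
        staged += 1
    return staged

table = {hashlib.sha256(p).hexdigest(): p for p in (b"x", b"y")}
fetched = []
def fetch(cid):
    fetched.append(cid)
    return table[cid]

l2 = {}
first = stage(list(table), l2, fetch)    # stages both chunks
second = stage(list(table), l2, fetch)   # no-op
```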
Slurm Integration
Staging Handoff
The staging CLI creates a cache pool and holds its pool.lock flock.
The workload process adopts the pool instead of creating a new one:
- Prolog: the staging CLI fetches chunks in daemon mode and outputs `pool_id`.
- Workload: sets `KISEKI_CACHE_POOL_ID=<pool_id>`, starts, adopts the existing pool, and takes over the flock.
- Staging daemon: detects the flock loss and exits cleanly.
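The handoff can be demonstrated with Python's fcntl in a single process (the release/re-acquire choreography below is illustrative only; the real daemon and workload coordinate across processes):

```python
import fcntl
import os
import tempfile

pool = tempfile.mkdtemp()
lock_path = os.path.join(pool, "pool.lock")

daemon_fd = open(lock_path, "w")
fcntl.flock(daemon_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)    # 1. prolog daemon stages, holds the lock

workload_fd = open(lock_path, "w")
fcntl.flock(daemon_fd, fcntl.LOCK_UN)                    # 2. daemon yields for adoption
fcntl.flock(workload_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # 3. workload adopts the pool

try:                                                     # 4. daemon observes flock loss
    fcntl.flock(daemon_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    daemon_still_owner = True
except OSError:
    daemon_still_owner = False                           # -> daemon exits cleanly
```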
Prolog Script
#!/bin/bash
# prolog.sh -- run before the job starts
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE
Epilog Script
#!/bin/bash
# epilog.sh -- run after the job completes
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID
Lattice Integration
Lattice injects KISEKI_CACHE_POOL_ID into the workload environment
after parallel staging completes across the node set. It queries
stage --status to verify readiness before launching the workload.
Policy Hierarchy
Cache policy follows the same distribution mechanism as quotas, using
the existing TenantConfig structure.
cluster default -> org override -> project override -> workload override
-> session selection
Each level narrows (never broadens) the parent’s settings.
Policy Attributes
| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache FS |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |
Policy Resolution
At session establishment, the client resolves its effective policy through multiple paths:
- Primary: `GetCachePolicy` RPC on the data-path gRPC channel to any storage node. No gateway or control plane access required.
- Secondary: the gateway’s locally cached `TenantConfig`.
- Stale tolerance: last-known policy persisted in the L2 pool directory (`policy.json`).
- Fallback: conservative defaults (organic mode, 10 GB max, 5 s TTL).
Policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment.
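The resolution chain amounts to "first source that answers wins, else conservative defaults" (a sketch; `resolve_policy` and the source callables are hypothetical):

```python
def resolve_policy(sources):
    """Try each policy source in documented priority order; fall back
    to conservative defaults if none answers."""
    for fetch in sources:
        try:
            policy = fetch()
            if policy is not None:
                return policy
        except ConnectionError:
            continue                         # try the next source
    return {"mode": "organic",               # conservative defaults
            "max_cache_bytes": 10 * 1024**3,
            "metadata_ttl_ms": 5000}

def rpc_down():
    raise ConnectionError("no storage node reachable")

# GetCachePolicy RPC down, gateway cache empty, no persisted policy:
policy = resolve_policy([rpc_down, lambda: None, rpc_down])
```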
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| KISEKI_CACHE_MODE | Cache mode | organic |
| KISEKI_CACHE_DIR | L2 cache directory | /tmp/kiseki-cache |
| KISEKI_CACHE_L1_MAX | L1 memory max bytes | 256 MB |
| KISEKI_CACHE_L2_MAX | L2 NVMe max bytes | 50 GB |
| KISEKI_CACHE_META_TTL_MS | Metadata TTL (ms) | 5000 |
| KISEKI_CACHE_POOL_ID | Adopt existing pool | (none) |
Mount Options (FUSE)
kiseki-client-fuse mount /mnt/kiseki \
-o cache_mode=pinned \
-o cache_dir=/local-nvme/kiseki \
-o cache_l2_max=100G
API (Rust)
let config = CacheConfig {
    mode: CacheMode::Pinned,
    cache_dir: PathBuf::from("/local-nvme/kiseki"),
    max_cache_bytes: 100 * 1024 * 1024 * 1024,
    metadata_ttl: Duration::from_secs(5),
    ..CacheConfig::default()
};
API (Python)
client = kiseki.Client(
cache_mode="pinned",
cache_dir="/local-nvme/kiseki",
cache_l2_max=100 * 1024**3,
)
Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings.
Cache Invalidation
Metadata
TTL-based only. No push invalidation from canonical. The metadata TTL (default 5 seconds) is the sole freshness mechanism and the upper bound on read staleness.
Write-through: when the client writes a file, the local metadata cache is updated immediately, providing read-your-writes consistency within a single process.
Crypto-Shred
When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths:
- Periodic key health check (default every 30 seconds) – primary.
- Advisory channel notification – fast path, best-effort.
- KMS error on next operation – tertiary.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30 seconds by default.
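With the defaults above, the worst-case detection latency works out as:

```python
def max_shred_detection_s(key_health_interval_s, max_disconnect_s):
    """Worst-case crypto-shred detection latency: bounded by whichever
    fires first, the periodic key health check or the disconnect wipe."""
    return min(key_health_interval_s, max_disconnect_s)

latency = max_shred_detection_s(30, 300)  # defaults: the 30 s health check wins
```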
Disconnect
If the client cannot reach any canonical endpoint for
max_disconnect_seconds (default 300 seconds), the entire cache is
wiped. Background heartbeat RPCs (every 60 seconds) maintain the
disconnect timer.
Capacity Management
| Limit | Scope | Default | Enforcement |
|---|---|---|---|
| max_memory_bytes (L1) | Per-process | 256 MB | Strict LRU eviction |
| max_cache_bytes (L2) | Per-process | 50 GB | LRU (organic), reject (pinned) |
| max_node_cache_bytes | Per-node | 80% of cache FS | Cooperative check before L2 insert |
| Disk pressure backstop | Per-node | 90% utilization | Hard backstop |
Pinned chunks are never evicted by organic LRU. Organic eviction considers only non-pinned chunks.
Crash Recovery
- On process start: the client scans for orphaned cache pools (those whose `pool.lock` has no live `flock` holder), zeroizes their contents, and deletes them.
- `kiseki-cache-scrub` service: a systemd one-shot (or cron job) that runs on node boot and every 60 seconds, covering the case where no subsequent Kiseki process starts on the node after a crash.
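The orphan check reduces to "can a stranger take the flock?" (a sketch of the scrub logic; `find_orphaned_pools` is a hypothetical helper):

```python
import fcntl
import os
import tempfile

def find_orphaned_pools(tenant_dir):
    """A pool is orphaned when nothing holds the flock on its
    pool.lock: a non-blocking acquire succeeds only if the owning
    process is gone."""
    orphans = []
    for pool_id in os.listdir(tenant_dir):
        lock_path = os.path.join(tenant_dir, pool_id, "pool.lock")
        if not os.path.exists(lock_path):
            continue
        fd = open(lock_path, "w")
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            orphans.append(pool_id)        # lock free -> no live owner
            fcntl.flock(fd, fcntl.LOCK_UN)
        except OSError:
            pass                           # held by a live process
        finally:
            fd.close()
    return orphans

tenant_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(tenant_dir, "live"))       # a live pool: lock held
live_fd = open(os.path.join(tenant_dir, "live", "pool.lock"), "w")
fcntl.flock(live_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
os.makedirs(os.path.join(tenant_dir, "crashed"))    # a crashed pool: lock free
open(os.path.join(tenant_dir, "crashed", "pool.lock"), "w").close()

orphans = find_orphaned_pools(tenant_dir)
```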
Deployment
This guide covers deploying Kiseki in development, multi-node cluster, and bare-metal production environments.
Docker Compose (development)
The single-node development stack includes Kiseki plus supporting services for tracing, KMS, and identity.
Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| kiseki-server | Dockerfile.server (local build) | 2049, 9000, 9090, 9100, 9101 | Storage node |
| jaeger | jaegertracing/all-in-one:latest | 4317, 16686 | Distributed tracing (OTLP) |
| vault | hashicorp/vault:1.19 | 8200 | Tenant KMS backend (Transit engine) |
| keycloak | quay.io/keycloak/keycloak:26.0 | 8080 | OIDC identity provider |
Starting the stack
# Build and start all services
docker compose up --build
# Run in background for e2e tests
docker compose up --build -d && pytest tests/e2e/
Port map (single-node)
| Port | Protocol | Service |
|---|---|---|
| 2049 | TCP | NFS (v3 + v4.2) |
| 9000 | HTTP | S3 gateway |
| 9090 | HTTP | Prometheus metrics + admin dashboard |
| 9100 | gRPC | Data-path (log, chunk, composition, view) |
| 9101 | gRPC | Workflow advisory |
| 4317 | gRPC | Jaeger OTLP receiver |
| 16686 | HTTP | Jaeger UI |
| 8200 | HTTP | Vault API |
| 8080 | HTTP | Keycloak admin console |
Environment (dev defaults)
The development compose file sets these environment variables on the
kiseki-server container:
KISEKI_DATA_ADDR: "0.0.0.0:9100"
KISEKI_ADVISORY_ADDR: "0.0.0.0:9101"
KISEKI_S3_ADDR: "0.0.0.0:9000"
KISEKI_NFS_ADDR: "0.0.0.0:2049"
KISEKI_METRICS_ADDR: "0.0.0.0:9090"
KISEKI_DATA_DIR: "/data"
KISEKI_BOOTSTRAP: "true"
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4317"
OTEL_SERVICE_NAME: "kiseki-server"
The KISEKI_BOOTSTRAP=true flag tells the node to create an initial
shard on first start, enabling immediate use without manual cluster
initialization.
Vault (dev mode)
Vault runs in dev mode with the root token kiseki-e2e-token. This is
suitable only for development and testing. The Transit secrets engine is
used by Kiseki as a tenant KMS backend (ADR-028 Provider 2).
# Verify Vault is ready
curl http://localhost:8200/v1/sys/health
Keycloak (dev mode)
Keycloak runs with start-dev and default admin credentials
(admin/admin). Configure OIDC realms for tenant identity provider
integration.
Docker Compose (3-node cluster)
The multi-node compose file (docker-compose.3node.yml) deploys a
3-node Raft cluster for testing consensus, replication, and failover.
Starting
docker compose -f docker-compose.3node.yml up --build -d
# Run multi-node tests
KISEKI_E2E_COMPOSE=docker-compose.3node.yml pytest tests/e2e/test_multi_node.py
Node configuration
All three nodes share the same Raft peer list and each has a unique
KISEKI_NODE_ID:
| Node | Node ID | Data gRPC | Advisory gRPC | S3 | Raft |
|---|---|---|---|---|---|
| kiseki-node1 | 1 | localhost:9100 | localhost:9101 | localhost:9000 | 9300 |
| kiseki-node2 | 2 | localhost:9110 | localhost:9111 | localhost:9010 | 9300 |
| kiseki-node3 | 3 | localhost:9120 | localhost:9121 | localhost:9020 | 9300 |
The Raft peer list is configured identically on all nodes:
KISEKI_RAFT_PEERS=1=kiseki-node1:9300,2=kiseki-node2:9300,3=kiseki-node3:9300
Node 1 is the bootstrap node. Each node has an independent data volume
(node1-data, node2-data, node3-data).
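The KISEKI_RAFT_PEERS format can be parsed as follows (an illustrative parser, not the server's code):

```python
def parse_raft_peers(spec):
    """Parse the KISEKI_RAFT_PEERS format ("id=host:port,...") into a
    dict of node id -> (host, port)."""
    peers = {}
    for entry in spec.split(","):
        node_id, addr = entry.split("=", 1)
        host, port = addr.rsplit(":", 1)
        peers[int(node_id)] = (host, int(port))
    return peers

peers = parse_raft_peers(
    "1=kiseki-node1:9300,2=kiseki-node2:9300,3=kiseki-node3:9300")
```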
Verifying cluster health
# Check node 1 is healthy (metrics port 9090; nodes 2 and 3 expose
# /health on the metrics ports mapped in docker-compose.3node.yml)
curl -s http://localhost:9090/health && echo "node1 OK"
# View cluster status via the admin dashboard
open http://localhost:9090/ui
Bare metal deployment
Build from source
Prerequisites: Rust stable toolchain, protobuf compiler (protoc),
OpenSSL development headers, pkg-config.
# Clone and build
git clone https://github.com/your-org/kiseki.git
cd kiseki
# Release build (all binaries)
cargo build --release
# Binaries produced:
# target/release/kiseki-server — storage node
# target/release/kiseki-keyserver — system key manager (HA)
# target/release/kiseki-client-fuse — FUSE client for compute nodes
# target/release/kiseki-control — control plane
Optional feature flags:
# Enable CXI/Slingshot transport (requires libfabric)
cargo build --release --features kiseki-transport/cxi
# Enable RDMA verbs transport
cargo build --release --features kiseki-transport/verbs
# Enable tenant opt-in compression
cargo build --release --features kiseki-chunk/compression
Disk layout
Each storage node should follow the recommended disk layout:
Server node:
System partition (RAID-1 on 2x SSD):
/var/lib/kiseki/raft/log.redb Raft log entries
/var/lib/kiseki/keys/epochs.redb Key epoch metadata
/var/lib/kiseki/chunks/meta.redb Chunk extent index
/var/lib/kiseki/small/objects.redb Small-file inline content
/var/lib/kiseki/config/ Node config, TLS certs
Data devices (JBOD, managed by Kiseki):
/dev/nvme0n1 -> pool "fast-nvme"
/dev/nvme1n1 -> pool "fast-nvme"
/dev/sda -> pool "bulk-ssd"
/dev/sdb -> pool "cold-hdd"
JBOD for data devices, RAID-1 for the system partition. Kiseki manages data durability itself via EC/replication across JBOD members, but the redb stores and the Raft log on the system partition must survive a single disk failure without the benefit of Kiseki’s own repair mechanisms, hence RAID-1.
systemd unit: kiseki-server
[Unit]
Description=Kiseki Storage Node
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
User=kiseki
Group=kiseki
ExecStart=/usr/local/bin/kiseki-server
Restart=on-failure
RestartSec=5
# Environment
Environment=KISEKI_DATA_ADDR=0.0.0.0:9100
Environment=KISEKI_ADVISORY_ADDR=0.0.0.0:9101
Environment=KISEKI_S3_ADDR=0.0.0.0:9000
Environment=KISEKI_NFS_ADDR=0.0.0.0:2049
Environment=KISEKI_METRICS_ADDR=0.0.0.0:9090
Environment=KISEKI_DATA_DIR=/var/lib/kiseki
Environment=KISEKI_NODE_ID=1
Environment=KISEKI_RAFT_PEERS=1=node1.example.com:9300,2=node2.example.com:9300,3=node3.example.com:9300
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9300
# TLS
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
# Observability
Environment=OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.internal:4317
Environment=OTEL_SERVICE_NAME=kiseki-server
Environment=RUST_LOG=kiseki=info
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0
[Install]
WantedBy=multi-user.target
systemd unit: kiseki-keyserver
[Unit]
Description=Kiseki System Key Manager
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=kiseki-keys
Group=kiseki-keys
ExecStart=/usr/local/bin/kiseki-keyserver
Restart=on-failure
RestartSec=5
Environment=KISEKI_DATA_DIR=/var/lib/kiseki-keys
Environment=KISEKI_RAFT_PEERS=1=keysrv1:9400,2=keysrv2:9400,3=keysrv3:9400
Environment=KISEKI_RAFT_ADDR=0.0.0.0:9400
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/keyserver.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/keyserver.key
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/kiseki-keys
PrivateTmp=yes
MemoryDenyWriteExecute=yes
LimitCORE=0
[Install]
WantedBy=multi-user.target
systemd unit: kiseki-client-fuse
[Unit]
Description=Kiseki FUSE Client
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/kiseki-client-fuse --mountpoint /mnt/kiseki
ExecStop=/bin/fusermount -u /mnt/kiseki
Restart=on-failure
RestartSec=5
Environment=KISEKI_DATA_ADDR=node1.example.com:9100,node2.example.com:9100,node3.example.com:9100
Environment=KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
Environment=KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
Environment=KISEKI_KEY_PATH=/etc/kiseki/tls/client.key
Environment=KISEKI_CACHE_MODE=organic
Environment=KISEKI_CACHE_DIR=/var/cache/kiseki
Environment=KISEKI_CACHE_L1_MAX=1073741824
Environment=KISEKI_CACHE_L2_MAX=107374182400
[Install]
WantedBy=multi-user.target
Configuration checklist
Before starting a production cluster, verify the following:
TLS certificates
- Cluster CA certificate generated and distributed to all nodes
- Per-node server certificate signed by Cluster CA
- Per-tenant client certificates signed by Cluster CA
- Key manager server certificate signed by Cluster CA
- CRL distribution point configured (if using CRL-based revocation)
- Certificate SANs include all node hostnames and IP addresses
- All certificates use ECDSA P-256 or RSA 2048+ keys
Data directories
- `KISEKI_DATA_DIR` exists and is owned by the `kiseki` user
- System partition has sufficient capacity for metadata (see Capacity Planning)
- Data devices formatted and accessible (raw block or file-backed)
- Separate RAID-1 for system partition
Bootstrap
- Exactly one node has `KISEKI_BOOTSTRAP=true` on first start
- After initial bootstrap, set `KISEKI_BOOTSTRAP=false` on the bootstrap node (or remove the variable)
- `KISEKI_RAFT_PEERS` is identical on all nodes
- `KISEKI_NODE_ID` is unique per node
- System key manager cluster is started before storage nodes
Network
- Data-fabric ports (9100, 9101) reachable between all nodes
- Raft port (9300) reachable between all nodes
- Metrics port (9090) accessible to monitoring infrastructure
- NFS port (2049) accessible to clients
- S3 port (9000) accessible to clients
- Management network separated from data fabric (recommended)
Observability
- Jaeger or OTLP-compatible collector endpoint configured
- Prometheus scrape target added for each node’s `:9090/metrics`
- `RUST_LOG` level set appropriately (production: `kiseki=info`)
Health verification
After deployment, verify the cluster is healthy:
HTTP health endpoint
# Returns "OK" when the node is ready
curl http://node1:9090/health
Prometheus metrics
# Verify metrics are being exported
curl -s http://node1:9090/metrics | head -20
Admin dashboard
Open http://node1:9090/ui in a browser. The dashboard shows:
- Cluster health (nodes healthy / total)
- Raft entries applied
- Gateway requests served
- Data written and read
- Active transport connections
Any node in the cluster serves the full cluster-wide view by scraping metrics from its peers.
Raft consensus
Verify that the Raft cluster has elected a leader:
# Check the cluster status via the admin API
curl -s http://node1:9090/ui/api/cluster | jq .
S3 connectivity
# Test S3 access (if a tenant namespace is configured)
aws --endpoint-url http://node1:9000 s3 ls
NFS connectivity
# Test NFS mount
mount -t nfs node1:/ /mnt/kiseki -o vers=4.2
FUSE client
# Mount via FUSE (on a compute node)
kiseki-client-fuse --mountpoint /mnt/kiseki
ls /mnt/kiseki
Configuration Reference
Kiseki is configured entirely through environment variables. There are no configuration files to manage. Every tunable parameter has a sensible default. Variables are grouped by function below.
Network addresses
| Variable | Default | Description |
|---|---|---|
| KISEKI_DATA_ADDR | 0.0.0.0:9100 | Listen address for data-path gRPC (log, chunk, composition, view, discovery). |
| KISEKI_ADVISORY_ADDR | 0.0.0.0:9101 | Listen address for the Workflow Advisory gRPC service. Runs on a dedicated tokio runtime, isolated from the data path (ADR-021). |
| KISEKI_S3_ADDR | 0.0.0.0:9000 | Listen address for the S3 HTTP gateway. |
| KISEKI_NFS_ADDR | 0.0.0.0:2049 | Listen address for the NFS gateway (v3 + v4.2). |
| KISEKI_METRICS_ADDR | 0.0.0.0:9090 | Listen address for Prometheus metrics (/metrics), health endpoint (/health), and admin dashboard (/ui). |
| KISEKI_RAFT_ADDR | 0.0.0.0:9300 | Listen address for Raft consensus traffic between nodes. |
All addresses accept the host:port format. Use 0.0.0.0 to bind to
all interfaces or a specific IP to restrict to one network.
Cluster membership
| Variable | Default | Description |
|---|---|---|
| KISEKI_NODE_ID | (required) | Unique integer identifier for this node within the cluster. Must be stable across restarts. |
| KISEKI_RAFT_PEERS | (required) | Comma-separated list of id=host:port pairs for all Raft voters. Example: 1=node1:9300,2=node2:9300,3=node3:9300. Must be identical on every node. |
| KISEKI_BOOTSTRAP | false | When true, the node creates an initial shard on first start. Set to true on exactly one node during initial cluster formation, then set back to false. |
Storage
| Variable | Default | Description |
|---|---|---|
| KISEKI_DATA_DIR | /var/lib/kiseki | Root directory for all persistent state. Contains Raft log (raft/log.redb), key epochs (keys/epochs.redb), chunk metadata (chunks/meta.redb), and inline small-file content (small/objects.redb). Must reside on a low-latency device (NVMe or SSD strongly recommended; HDD triggers a boot warning). |
Data directory layout
KISEKI_DATA_DIR/
  raft/log.redb        Raft log entries (bounded by snapshot policy)
  keys/epochs.redb     Key epoch metadata (<10 MB)
  chunks/meta.redb     Chunk extent index (scales with file count)
  small/objects.redb   Small-file encrypted content (capacity-managed)
TLS / mTLS
| Variable | Default | Description |
|---|---|---|
| KISEKI_CA_PATH | (none) | Path to the Cluster CA certificate (PEM). Required for production. When set, all gRPC connections require mTLS. |
| KISEKI_CERT_PATH | (none) | Path to this node’s TLS certificate (PEM), signed by the Cluster CA. |
| KISEKI_KEY_PATH | (none) | Path to this node’s TLS private key (PEM). Never logged, printed, or transmitted. |
| KISEKI_CRL_PATH | (none) | Path to a CRL file (PEM) for certificate revocation. Reloaded periodically. Optional; if not set, CRL checking is disabled. |
When KISEKI_CA_PATH is not set, the server runs without TLS. This is
acceptable for development but must not be used in production.
Client-side cache (ADR-031)
These variables configure the native client cache on compute nodes
running kiseki-client-fuse.
| Variable | Default | Description |
|---|---|---|
| KISEKI_CACHE_MODE | organic | Cache operating mode. One of: pinned (staging-driven, eviction-resistant), organic (LRU with usage-weighted retention), bypass (no caching). Mode is per session, not per file. |
| KISEKI_CACHE_DIR | $KISEKI_DATA_DIR/cache | Directory for L2 cache pools on local NVMe. Each client process creates an isolated pool with a unique pool_id. |
| KISEKI_CACHE_L1_MAX | 1073741824 (1 GB) | Maximum bytes for the in-memory L1 cache (decrypted plaintext chunks). Bounded by process memory. |
| KISEKI_CACHE_L2_MAX | 107374182400 (100 GB) | Maximum bytes for the on-disk L2 cache on local NVMe. Per-process, per-tenant isolation via pool directories. |
| KISEKI_CACHE_META_TTL_MS | 5000 (5 seconds) | Metadata TTL in milliseconds. File-to-chunk-list mappings are served from cache within this window. After expiry, mappings are re-fetched from canonical. This is the sole freshness window: chunk data itself has no TTL because chunks are immutable (I-C1). |
| KISEKI_CACHE_POOL_ID | (none) | Adopt an existing L2 cache pool instead of creating a new one. Used for staging handoff from a Slurm prolog daemon to a workload process. |
Cache behavior notes
- Pinned mode: Pre-staged datasets remain in cache until explicitly released. Best for training workloads that re-read the same data across epochs.
- Organic mode: LRU eviction with usage-weighted retention. Default for mixed workloads.
- Bypass mode: No caching at all. Best for checkpoint/restart and streaming workloads.
- On process restart, the client creates a new L2 pool (wiping orphaned pools). A `kiseki-cache-scrub` service cleans orphans on node boot.
- Disconnects longer than 300 seconds (configurable) wipe the entire cache.
- Crypto-shred events wipe all cached plaintext for the affected tenant within the key health check interval (default 30 seconds).
Metadata capacity (ADR-030)
These variables control the dynamic inline threshold for small-file placement.
| Variable | Default | Description |
|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | 50 | Normal operating ceiling for system disk metadata usage, as a percentage of system partition capacity. Exceeding this triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | 75 | Absolute maximum for system disk metadata usage. Exceeding this forces the inline threshold to the floor (128 bytes) and emits an alert via out-of-band gRPC (not Raft). |
The inline threshold determines whether a file’s encrypted content is
stored in small/objects.redb (metadata tier, NVMe) or as a chunk
extent on a raw block device (data tier). The threshold is computed
per-shard as the minimum affordable threshold across all Raft voters,
clamped between 128 bytes (floor) and 64 KB (ceiling).
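That computation can be expressed directly (a sketch; the voter inputs are hypothetical values):

```python
def effective_inline_threshold(voter_affordable, floor=128, ceiling=64 * 1024):
    """Per-shard inline threshold (sketch): the minimum affordable
    threshold across all Raft voters, clamped to [floor, ceiling]."""
    return max(floor, min(min(voter_affordable), ceiling))

# One voter under metadata pressure drags the shard threshold down:
threshold = effective_inline_threshold([64 * 1024, 16 * 1024, 512])
```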
Observability
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OpenTelemetry OTLP gRPC endpoint for distributed traces. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name reported in traces. Set to kiseki-keyserver or kiseki-client for other binaries. |
| RUST_LOG | info | Logging filter directive for the tracing crate. Supports per-module granularity. Examples: kiseki=debug, kiseki_raft=trace,kiseki=info, warn. |
| KISEKI_LOG_FORMAT | text | Log output format. text for human-readable, json for structured JSON (one line per event). Use json in production for log aggregation. |
Tuning parameters (runtime)
The following parameters are set at runtime via the StorageAdminService
gRPC API (SetTuningParams / GetTuningParams), not via environment
variables. They are listed here for reference.
Cluster-wide tuning
| Parameter | Default | Range | Description |
|---|---|---|---|
| compaction_rate_mb_s | 100 | 10-1000 | Background compaction throughput cap (MB/s). |
| gc_interval_s | 300 | 60-3600 | Interval between GC scans for reclaimable chunks. |
| rebalance_rate_mb_s | 50 | 0-500 | Background rebalance/evacuation throughput (MB/s). |
| scrub_interval_h | 168 (7 days) | 24-720 | Interval between integrity scrub runs. |
| max_concurrent_repairs | 4 | 1-32 | Maximum parallel EC repair jobs. |
| stream_proc_poll_ms | 100 | 10-1000 | View materialization polling interval (ms). |
| inline_threshold_bytes | 4096 | 512-65536 | Default inline threshold for new shards. |
| raft_snapshot_interval | 10000 | 1000-100000 | Entries between Raft snapshots. |
Per-pool tuning
| Parameter | Default | Range | Description |
|---|---|---|---|
| ec_data_chunks | 4 (NVMe) / 8 (HDD) | 2-16 | EC data fragment count. Immutable per pool after creation (I-C6). |
| ec_parity_chunks | 2 (NVMe) / 3 (HDD) | 1-8 | EC parity fragment count. Immutable per pool after creation. |
| replication_count | 3 | 2-5 | For replication pools (non-EC). |
| warning_threshold_pct | Per device class | 50-95 | Pool capacity warning level. |
| critical_threshold_pct | Per device class | 60-98 | Pool capacity critical level. Writes rejected. |
| readonly_threshold_pct | Per device class | 70-99 | Read-only level. In-flight writes drain. |
| target_fill_pct | 70 (SSD) / 80 (HDD) | 50-90 | Rebalance target fill level. |
Default capacity thresholds by device class:
| State | NVMe/SSD | HDD |
|---|---|---|
| Healthy | 0-75% | 0-85% |
| Warning | 75-85% | 85-92% |
| Critical | 85-92% | 92-97% |
| ReadOnly | 92-97% | 97-99% |
| Full | 97-100% | 99-100% |
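The threshold table can be read as a utilization-to-state mapping (a sketch using the default NVMe/SSD and HDD bounds from the table above):

```python
def capacity_state(utilization_pct, device_class):
    """Map pool utilization to the documented capacity state using the
    default per-device-class thresholds."""
    bounds = {"nvme": (75, 85, 92, 97),   # NVMe/SSD column
              "hdd": (85, 92, 97, 99)}    # HDD column
    warn, crit, ro, full = bounds[device_class]
    if utilization_pct < warn:
        return "Healthy"
    if utilization_pct < crit:
        return "Warning"
    if utilization_pct < ro:
        return "Critical"
    if utilization_pct < full:
        return "ReadOnly"
    return "Full"
```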
All tuning parameter changes via SetTuningParams are recorded in the
cluster audit shard with parameter name, old value, new value, timestamp,
and admin identity (I-A6).
Environment variable summary
Quick reference of all environment variables:
# Network
KISEKI_DATA_ADDR=0.0.0.0:9100
KISEKI_ADVISORY_ADDR=0.0.0.0:9101
KISEKI_S3_ADDR=0.0.0.0:9000
KISEKI_NFS_ADDR=0.0.0.0:2049
KISEKI_METRICS_ADDR=0.0.0.0:9090
KISEKI_RAFT_ADDR=0.0.0.0:9300
# Cluster
KISEKI_NODE_ID=1
KISEKI_RAFT_PEERS=1=node1:9300,2=node2:9300,3=node3:9300
KISEKI_BOOTSTRAP=false
# Storage
KISEKI_DATA_DIR=/var/lib/kiseki
# TLS
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem
# Cache (client only)
KISEKI_CACHE_MODE=organic
KISEKI_CACHE_DIR=/var/cache/kiseki
KISEKI_CACHE_L1_MAX=1073741824
KISEKI_CACHE_L2_MAX=107374182400
KISEKI_CACHE_META_TTL_MS=5000
# Metadata capacity
KISEKI_META_SOFT_LIMIT_PCT=50
KISEKI_META_HARD_LIMIT_PCT=75
# Observability
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=kiseki-server
RUST_LOG=kiseki=info
KISEKI_LOG_FORMAT=json
Cluster Management
This guide covers day-to-day cluster operations: adding and removing nodes, managing shards and pools, maintenance mode, and schema migration.
Node management
Kiseki uses Raft consensus groups for metadata and log replication. Adding or removing nodes is done through Raft membership changes, which are zero-downtime and zero-data-loss operations.
Adding a node
1. Deploy `kiseki-server` on the new host with a unique `KISEKI_NODE_ID` and the full `KISEKI_RAFT_PEERS` list (including the new node).
2. Start the service. The node registers with the cluster and begins receiving Raft log entries as a learner.
3. Promote the node to a voter once it has caught up:
   kiseki-server node add --node-id 4
4. The node receives shard assignments and begins participating in Raft elections and commit quorums.
Catch-up requirement (I-SF3): A learner must fully catch up with the leader’s committed index before being promoted to voter. The old voter remains in membership until the new voter is promoted.
Removing a node
1. Drain the node to migrate its shard assignments to other nodes:
   kiseki-server node drain --node-id 4
2. Wait for all shards to be migrated. The drain operation uses Raft membership changes (add learner on target, promote, demote source) for each shard hosted on the node.
3. Once drained, remove the node from the cluster:
   kiseki-server node remove --node-id 4
4. Stop the `kiseki-server` process and decommission the hardware.
Safety: Removing a node without draining first triggers automatic shard repair, but this is reactive rather than proactive. Always drain first for orderly removal.
Cluster sizing
- Minimum: 3 nodes (Raft requires a majority quorum; 2-of-3 for writes).
- Recommended: 5+ nodes for production. Tolerates 2 simultaneous node failures.
- Key manager: Deploy on a dedicated 3-5 node HA cluster, separate from storage nodes. The system key manager must be at least as available as the log (I-K12).
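The sizing guidance follows from Raft's majority-quorum arithmetic:

```python
def fault_tolerance(n_nodes):
    """Raft majority-quorum arithmetic: a cluster of n voters needs
    floor(n/2) + 1 nodes to commit and tolerates the rest failing."""
    quorum = n_nodes // 2 + 1
    return quorum, n_nodes - quorum

quorum3, tol3 = fault_tolerance(3)   # 2-of-3, tolerates 1 failure
quorum5, tol5 = fault_tolerance(5)   # 3-of-5, tolerates 2 failures
```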
Shard management
Shards are the smallest unit of totally-ordered deltas, backed by one Raft group. They split automatically when size or throughput thresholds are exceeded (I-L6).
Viewing shard status
# List all shards
kiseki-server shard list
# Get details for a specific shard
kiseki-server shard info --shard-id shard-0001
# Check shard health
kiseki-server shard health --shard-id shard-0001
Automatic shard split
Shards have a hard ceiling triggering mandatory split (I-L6). The ceiling is configurable across three dimensions:
- Delta count: Maximum number of deltas in a shard.
- Byte size: Maximum total size of shard data.
- Write throughput: Maximum sustained write rate.
Any dimension exceeding its ceiling forces a split. The split operation:
- Selects a split boundary (key range partition).
- Creates a new shard for the upper range.
- Continues accepting writes during the split (I-O1).
- Notifies the control plane, views, and clients of the new shard topology.
Manual shard split
kiseki-server shard split --shard-id shard-0001 --boundary "..."
Shard maintenance mode
Set a shard to read-only for maintenance operations:
# Enable maintenance mode (writes rejected with retriable error)
kiseki-server shard maintenance --shard-id shard-0001 --enabled
During maintenance mode (I-O6):
- Write commands are rejected with a retriable error.
- Read operations continue normally.
- In-progress compaction and GC continue but no new triggers fire from write pressure.
- Shard splits do not initiate.
Cross-shard operations
Cross-shard rename returns EXDEV (I-L8). Shards are independent
consensus domains with no two-phase commit. Applications must handle
cross-shard moves via copy + delete.
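On a FUSE mount, the copy + delete fallback can be sketched as below; `move_across` is a hypothetical helper, not part of the Kiseki client:

```shell
# Hypothetical helper: move a path that may cross a shard boundary on
# a FUSE mount. rename(2) fails with EXDEV across shards, so fall
# back to copy + delete as the section above requires.
move_across() {
  src=$1; dst=$2
  if mv "$src" "$dst" 2>/dev/null; then
    return 0                             # same shard: plain rename worked
  fi
  cp -a "$src" "$dst" && rm -rf "$src"   # cross-shard: copy, then delete
}
```

Note the fallback path is not atomic: readers can observe both copies mid-move, and a crash between `cp` and `rm` leaves both behind.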
Pool management
Affinity pools are groups of storage devices sharing a device class. Pools are the unit of capacity management and durability policy.
Viewing pools
# List all pools
kiseki-server pool list
# Get pool details including capacity and health
kiseki-server pool status --pool-id fast-nvme
Creating a pool
kiseki-server pool create --pool-id fast-nvme --device-class NvmeU2 \
--ec-data 4 --ec-parity 2
Important: EC parameters (ec_data_chunks, ec_parity_chunks) are
immutable per pool after creation (I-C6). Changing them requires
creating a new pool and migrating data via ReencodePool.
Setting pool durability
# Switch pool durability strategy (applies to new chunks only)
kiseki-server pool set-durability --pool-id fast-nvme \
--ec-data 4 --ec-parity 2
Existing chunks retain their original EC config. Re-encoding requires
an explicit ReencodePool RPC.
Rebalancing a pool
Rebalance distributes data evenly across devices in a pool:
# Start rebalance
kiseki-server pool rebalance --pool-id fast-nvme
# Cancel a running rebalance
kiseki-server pool cancel-rebalance --pool-id fast-nvme
Rebalance runs at the configured rebalance_rate_mb_s (default
50 MB/s) to limit impact on production traffic.
Device evacuation
When a device shows signs of failure (SMART wear > 90% for SSD, > 100 bad sectors for HDD), automatic evacuation is triggered (I-D3). Evacuation can also be initiated manually:
# Start evacuation
kiseki-server device evacuate --device-id nvme-0001
# Cancel evacuation
kiseki-server device cancel-evacuation --device-id nvme-0001
Evacuation migrates all chunks from the device to other devices in the
same pool. Device removal (RemoveDevice) is rejected unless the device
state is Removed (post-evacuation) (I-D5).
Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed.
All transitions are recorded in the audit log (I-D2).
Pool capacity thresholds
Pool writes are rejected when the pool reaches the Critical threshold (I-C5). Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels:
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Pool redirection stays within the same device class only. ENOSPC is returned when the pool is Full.
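The threshold table can be read as a simple classifier over fill percentage. A sketch, assuming integer percentages and assigning boundary values to the more severe state (the table does not specify boundary handling):

```shell
# Classify pool state from fill percentage, mirroring the threshold
# table above. Boundary values go to the more severe state -- an
# assumption, since the table leaves boundary handling unspecified.
pool_state() {
  class=$1; pct=$2            # class: nvme | hdd; pct: integer 0-100
  case $class in
    nvme) w=75; c=85; r=92; f=97 ;;
    hdd)  w=85; c=92; r=97; f=99 ;;
    *) echo unknown; return 1 ;;
  esac
  if   [ "$pct" -ge "$f" ]; then echo Full
  elif [ "$pct" -ge "$r" ]; then echo ReadOnly
  elif [ "$pct" -ge "$c" ]; then echo Critical
  elif [ "$pct" -ge "$w" ]; then echo Warning
  else echo Healthy; fi
}
```

The lower NVMe/SSD thresholds reflect the GC-pressure rationale stated above: flash devices degrade earlier at high fill levels.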
Maintenance mode
Cluster-wide or per-shard maintenance mode sets the cluster (or specific shards) to read-only (I-O6).
Enabling cluster-wide maintenance
# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
# Via the kiseki-server CLI
kiseki-server maintenance on
Maintenance mode behavior
- All write commands are rejected with a retriable error code (MaintenanceMode). Clients can retry after maintenance ends.
- Read operations continue normally.
- In-progress compaction and GC complete their current run.
- New shard splits, compaction triggers, and GC triggers from write pressure are suppressed.
- Maintenance mode is the prerequisite for:
  - Schema migration on upgrade
  - Inline threshold increase (optional migration of small chunked files back to inline)
  - Full cluster re-encryption
Disabling maintenance
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": false}'
Writes resume immediately. Clients that were retrying will succeed on their next attempt.
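The retry behavior described above can be sketched as a client-side wrapper; `retry_write` is a hypothetical helper, not part of any Kiseki tool:

```shell
# Hypothetical client-side wrapper for writes rejected with the
# retriable MaintenanceMode error: retry with linear backoff until the
# wrapped command succeeds or attempts run out.
retry_write() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    sleep "$i"             # back off a little longer each attempt
    i=$((i + 1))
  done
  return 1
}
```

Because the rejection is a distinct, retriable error code, clients can loop like this without treating maintenance as a hard failure.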
Schema migration on upgrade
Kiseki uses versioned on-disk formats. Upgrades that change the schema follow this procedure:
- Read the release notes for migration requirements. Not every release requires migration.
- Enable maintenance mode on the cluster to prevent writes during migration.
- Stop all nodes in the cluster.
- Upgrade the binaries on all nodes (kiseki-server, kiseki-keyserver, kiseki-client-fuse).
- Start nodes one at a time. On startup, each node detects the old schema version (via the superblock on each data device and the redb metadata version) and applies migration automatically.
- Verify migration by checking the admin dashboard and node logs.
- Disable maintenance mode to resume normal operations.
Rolling upgrades
For minor releases that do not change the on-disk format, rolling upgrades are supported:
- Drain a node (DrainNode).
- Stop the node.
- Upgrade the binary.
- Start the node.
- Wait for it to rejoin and catch up.
- Repeat for the next node.
The superblock on each data device carries a format version (ADR-029). Format version mismatches are detected at device open and handled by the migration path.
Admin Dashboard
Kiseki includes a built-in web dashboard for cluster monitoring and basic operations. The dashboard is served by every storage node on the metrics HTTP port.
Access
http://<node>:9090/ui
Any node in the cluster serves the full cluster-wide view. The dashboard scrapes metrics from peer nodes in the background and aggregates them locally. There is no dedicated dashboard server; connect to whichever node is most convenient.
The metrics HTTP server also serves:
| Path | Purpose |
|---|---|
| /health | Health probe (returns 200 OK). Used by load balancers. |
| /metrics | Prometheus text exposition format. |
| /ui | Admin dashboard (HTML + HTMX + Chart.js). |
| /ui/logo | Kiseki logo image. |
Technology
The dashboard is a single-page HTML application using:
- HTMX for live updates via HTML fragment polling.
- Chart.js for time-series and per-node comparison charts.
- No build step, no JavaScript framework, no node_modules.
The dashboard HTML is embedded in the kiseki-server binary at compile
time (include_str!). No external files to deploy or manage.
Overview tab
The main view shows six metric cards at the top, a time-series chart in the middle, and a node table at the bottom. All data refreshes automatically via HTMX polling.
Metric cards
| Card | Source metric | Description |
|---|---|---|
| Cluster Health | Node liveness | N/M nodes healthy with color coding: green (all healthy), yellow (degraded), red (all down). |
| Raft Entries | kiseki_raft_entries_total | Total Raft entries applied across the cluster. |
| Gateway Requests | kiseki_gateway_requests_total | Total S3 and NFS requests served. |
| Data Written | kiseki_chunk_write_bytes_total | Aggregate chunk bytes written. |
| Data Read | kiseki_chunk_read_bytes_total | Aggregate chunk bytes read. |
| Connections | kiseki_transport_connections_active | Active transport connections. |
Numbers are formatted with SI suffixes (K, M, B) and byte units (KB, MB, GB, TB) for readability.
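A sketch of the SI-suffix formatting the cards use (K, M, B); the dashboard's exact rounding is not specified, so this version truncates via integer division:

```shell
# Sketch of SI-suffix formatting for the metric cards (K, M, B).
# Truncating integer division -- the dashboard's rounding behavior
# is an implementation detail not documented here.
format_count() {
  n=$1
  if   [ "$n" -ge 1000000000 ]; then echo "$(( n / 1000000000 ))B"
  elif [ "$n" -ge 1000000 ];    then echo "$(( n / 1000000 ))M"
  elif [ "$n" -ge 1000 ];       then echo "$(( n / 1000 ))K"
  else echo "$n"; fi
}
```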
Time-series charts
The dashboard stores up to 3 hours of metric history (configurable) in memory. Time-series charts show:
- Raft entries over time
- Gateway request rate
- Chunk write/read throughput
- Connection count
Historical data is available via the API:
# Get 3 hours of history (default)
curl http://node1:9090/ui/api/history
# Get 1 hour of history
curl http://node1:9090/ui/api/history?hours=1
Node table
A table listing every node in the cluster with per-node metrics:
| Column | Description |
|---|---|
| Node | Node address (hostname:port) |
| Status | Health badge: green “Healthy” or red “Unreachable” |
| Raft | Raft entries applied by this node |
| Requests | Gateway requests served by this node |
| Written | Chunk bytes written by this node |
| Read | Chunk bytes read by this node |
| Conns | Active transport connections on this node |
Click a node row to drill down to the node detail view.
Performance tab
The performance tab shows per-node comparison charts for identifying hotspots and imbalances:
- Write throughput by node: Bar chart comparing chunk bytes written per node.
- Read throughput by node: Bar chart comparing chunk bytes read per node.
- Request count by node: Bar chart comparing gateway requests per node.
Chart data is sourced from the chart-data API:
curl http://node1:9090/ui/fragment/chart-data
# Returns: {"labels": [...], "writes": [...], "reads": [...], "requests": [...]}
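Since the chart-data shape is documented (parallel `labels`/`writes`/`reads`/`requests` arrays), hotspot checks are easy to script. A sketch using jq against an illustrative sample payload (the node names and numbers below are made up):

```shell
# Find the node with the highest write volume from the documented
# chart-data shape. Sample payload is illustrative only.
chart=$(mktemp)
cat <<'EOF' > "$chart"
{"labels": ["node1:9100", "node2:9100", "node3:9100"],
 "writes": [1048576, 5242880, 2097152],
 "reads":  [4096, 8192, 1024],
 "requests": [120, 340, 95]}
EOF

# Pair each label with its write count, then pick the maximum.
jq -r '[.labels, .writes] | transpose | max_by(.[1]) | .[0]' "$chart"
```

Against a live node, replace the sample file with `curl -s http://node1:9090/ui/fragment/chart-data`.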
Alerts tab
The alerts tab shows health status and capacity warnings. Each alert is a row with a colored dot (green, yellow, red, blue), a message, and a timestamp.
Alert types
| Dot | Meaning | Example |
|---|---|---|
| Green | All clear | “All 3 nodes healthy” |
| Red | Critical | “Node node2:9100 unreachable” |
| Blue | Informational | “Capacity monitoring active (3 nodes reporting)” |
| Green | Activity | “node1:9100: 1.2K gateway requests served” |
Alerts are generated by comparing the current cluster state against expected conditions. The alert endpoint returns HTML fragments for HTMX polling:
curl http://node1:9090/ui/fragment/alerts
Operations tab
The operations tab provides buttons for common administrative actions. Each action calls a REST endpoint and records an event in the diagnostic event store.
Available operations
| Operation | Endpoint | Method | Description |
|---|---|---|---|
| Maintenance Mode | /ui/api/ops/maintenance | POST | Enable or disable cluster-wide maintenance mode. Body: {"enabled": true} or {"enabled": false}. |
| Backup | /ui/api/ops/backup | POST | Initiate a background backup. |
| Scrub | /ui/api/ops/scrub | POST | Initiate a background integrity scrub. |
Example:
# Enable maintenance mode
curl -X POST http://node1:9090/ui/api/ops/maintenance \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
# Trigger a scrub
curl -X POST http://node1:9090/ui/api/ops/scrub
All operations return {"status": "ok", "message": "..."} on success.
Node drill-down
Click a node in the node table to see its detailed view. The drill-down shows:
- Node-specific metric history (time-series)
- Device health for devices attached to that node
- Shard assignments on that node
- Raft role (leader/follower/learner) per shard
API endpoints
All dashboard data is available via JSON APIs for scripting and integration:
| Endpoint | Method | Description |
|---|---|---|
| /ui/api/cluster | GET | Cluster summary: healthy nodes, total nodes, aggregate metrics. |
| /ui/api/nodes | GET | List of all nodes with per-node metrics and health status. |
| /ui/api/history | GET | Time-series metric history. Query: ?hours=3 (default). |
| /ui/api/events | GET | Diagnostic event log. Query parameters below. |
Event log query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| severity | string | (all) | Filter by severity: info, warning, error, critical. |
| category | string | (all) | Filter by category: node, shard, device, tenant, security, admin, gateway, raft. |
| hours | float | 3 | Hours to look back. |
| limit | integer | 100 | Maximum events to return. |
Example:
# Get last 50 error events in the past hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1&limit=50'
Response format:
{
"count": 2,
"events": [
{
"timestamp": "2026-04-23T14:30:00Z",
"severity": "error",
"category": "device",
"source": "nvme-0001",
"message": "Device SMART wear exceeds 90%"
}
]
}
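The response format lends itself to client-side filtering with jq. A sketch against a sample payload in the documented shape (the second and third events below are illustrative):

```shell
# Filter the documented events response client-side, e.g. pulling
# only device-category events. Sample payload follows the response
# format shown above; extra events are illustrative.
events=$(mktemp)
cat <<'EOF' > "$events"
{"count": 3,
 "events": [
   {"timestamp": "2026-04-23T14:30:00Z", "severity": "error",
    "category": "device", "source": "nvme-0001",
    "message": "Device SMART wear exceeds 90%"},
   {"timestamp": "2026-04-23T14:31:00Z", "severity": "info",
    "category": "node", "source": "node2:9100",
    "message": "Node rejoined"},
   {"timestamp": "2026-04-23T14:32:00Z", "severity": "critical",
    "category": "device", "source": "nvme-0002",
    "message": "Device failed"}
 ]}
EOF

jq -r '.events[] | select(.category == "device") | "\(.severity): \(.source)"' "$events"
```

Against a live node, pipe `curl -s 'http://node1:9090/ui/api/events?category=device'` into the same jq filter.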
Cluster-wide view architecture
Every node in the cluster runs the same dashboard. The cluster-wide
view is assembled by scraping /metrics from peer nodes:
- Each node knows its peers from KISEKI_RAFT_PEERS.
- A background task scrapes each peer’s /metrics endpoint at a configurable interval (default 10 seconds).
- Scraped metrics are cached locally in a MetricsAggregator.
- Dashboard requests aggregate local + cached peer metrics.
This means:
- No single point of failure. Any node serves the dashboard.
- Stale data tolerance. If a peer is unreachable, the dashboard shows the last known state and marks the node as “Unreachable.”
- No additional infrastructure. No dedicated monitoring server is needed for basic cluster visibility.
For production monitoring with alerting and long-term retention, use Prometheus and Grafana (see Monitoring).
Backup & Recovery
Kiseki’s primary disaster recovery mechanism is federation (async replication to a secondary site). External backup is additive and optional, providing defense-in-depth for deployments that require it.
Architecture overview
Federation as primary DR
Federated-async replication to a secondary site is the recommended DR strategy (ADR-016). Properties:
- RPO: Bounded by async replication lag (seconds to minutes).
- RTO: Secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration.
- Data replication: Ciphertext-only. No key material in the replication stream.
What is replicated
| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |
External backup
Cluster admins can configure external backup targets (S3-compatible object store). Backup data is encrypted with the system key at rest.
Backup operations
Creating a backup
# Via the admin dashboard
curl -X POST http://node1:9090/ui/api/ops/backup
# Via the kiseki-server CLI
kiseki-server backup create
Backup contents
Each backup snapshot contains:
- Per-shard metadata: Raft log snapshots for each shard, capturing the delta history up to the snapshot point.
- Chunk extent manifests: The chunks/meta.redb index mapping chunk IDs to device extents.
- Inline content: The small/objects.redb database (small-file data below the inline threshold).
- Control plane state: Tenant configuration, namespace mappings, quotas, compliance tags, federation peer registry.
- Key epoch metadata: Key epoch records from keys/epochs.redb (key material itself is NOT included in backups; it is managed by the system key manager and tenant KMS independently).
All backup data is encrypted. No plaintext chunk data appears in backup output. Backups reference chunk ciphertext on data devices by extent coordinates, not by copying the raw ciphertext (which would require reading and re-encrypting terabytes of data).
Listing backups
kiseki-server backup list
Deleting a backup
kiseki-server backup delete --backup-id backup-20260423-001
Retention policy
Backup retention is configurable per cluster. Defaults:
| Setting | Default | Description |
|---|---|---|
| Retention period | 7 days | Backups older than this are automatically deleted. |
| Maximum backups | 10 | Maximum number of retained backup snapshots. |
| Backup frequency | Daily | How often automatic backups are created (if enabled). |
Retention is enforced by a background task that runs on the Raft leader. Deletion of expired backups is recorded in the cluster audit log.
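The two default limits compose as an OR: a snapshot is expired once it is older than the retention period or falls outside the maximum count. A sketch, assuming the limits apply independently (the document does not state how they interact):

```shell
# Sketch of the default retention rule. A backup is expired when it is
# older than 7 days OR ranked outside the 10 newest snapshots.
# Assumption: the two limits apply independently, whichever prunes more.
is_expired() {
  age_days=$1; rank=$2         # rank: 1 = newest snapshot
  [ "$age_days" -gt 7 ] || [ "$rank" -gt 10 ]
}
```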
Recovery procedures
Single node failure
Recovery path: Raft re-election + EC repair.
- The Raft group detects the failed node and elects a new leader (if the failed node was leader).
- EC repair automatically rebuilds chunk fragments that were on the failed node’s devices.
- RPO: 0 (committed data is on a majority of replicas). RTO: seconds to minutes.
No manual intervention required. Monitor the repair progress via:
kiseki-server repair list
Multiple node failure (quorum maintained)
Recovery path: Raft reconfiguration + EC repair.
If the cluster still has a Raft majority (e.g., 2 of 3 nodes alive), recovery is automatic:
- Raft continues operating with the surviving majority.
- EC repair rebuilds lost chunk fragments.
- Deploy replacement nodes and add them to the cluster.
Multiple node failure (quorum lost)
Recovery path: Manual Raft reconfiguration.
If the majority is lost (e.g., 2 of 3 nodes down), Raft cannot make progress. Recovery requires manual intervention:
- Identify the surviving node(s) with the most recent committed state.
- Force a new Raft configuration with the surviving node(s) as the initial voter set.
- Deploy replacement nodes and add them as learners.
- Promote learners to voters once they catch up.
Data loss risk: Deltas committed on the failed majority but not yet replicated to the surviving minority may be lost.
Full site failure (with federation)
Recovery path: Failover to federated peer.
- Redirect clients to the secondary site (DNS, load balancer, or manual reconfiguration).
- The secondary site has replicated chunk data, log deltas, and control plane config.
- Tenant KMS must be reachable from the secondary site (same KMS serves both sites).
- The secondary site’s system key manager has its own master keys, but tenant data is accessible because tenant KEKs come from the shared tenant KMS.
RPO: Replication lag. RTO: Minutes to hours (depends on control plane reconfiguration speed).
Full site failure (without federation)
Recovery path: Restore from external backup.
- Deploy a new cluster.
- Restore the backup snapshot to the new cluster.
- The system key manager on the new cluster generates new system master keys.
- Tenant KMS must be reconfigured to point to the new cluster.
- Re-wrap all envelopes with new system master keys.
RPO: Time since last backup. RTO: Hours (depends on data volume).
Tenant KMS loss
Unrecoverable (I-K11). If the tenant loses their KMS and has no backup of their KEK material, all data encrypted under those keys is permanently unreadable. Kiseki documents this requirement but provides no system-side escrow. The tenant controls and is responsible for their keys.
Recovery summary
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds-minutes |
| Multiple node loss (quorum held) | Raft reconfiguration + EC repair | 0 | Minutes |
| Multiple node loss (quorum lost) | Manual Raft reconfig | Possible delta loss | Minutes-hours |
| Full site loss (with federation) | Failover to peer | Replication lag | Minutes-hours |
| Full site loss (no federation) | Restore from backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable | N/A | N/A |
Limitations
- No point-in-time restore. Backups are snapshots, not continuous journals. Recovery restores the cluster to the state at the snapshot time. Deltas committed after the snapshot are lost unless federation has replicated them.
- Backup does not include key material. System master keys and tenant KEKs are managed by their respective key managers. Backup and recovery of key material is the responsibility of the key manager operator (cluster admin for system keys, tenant admin for tenant KEKs).
- Chunk ciphertext is referenced, not copied. Backup manifests reference chunk extents on data devices. If data devices are destroyed, the chunk ciphertext is lost. Federation replicates the actual ciphertext to a secondary site, which is why it is the primary DR mechanism.
- Cross-site backup requires federation. There is no built-in mechanism to ship backup snapshots to a remote site outside of the federation framework. For cross-site backup without federation, operators must arrange their own transport of backup snapshots.
Monitoring & Observability
Kiseki provides three observability pillars: metrics (Prometheus), structured logging (tracing), and distributed traces (OpenTelemetry). All three are tenant-aware, respecting the zero-trust boundary between cluster admin and tenant admin (ADR-015).
Prometheus metrics
Every kiseki-server node exposes Prometheus metrics in text exposition
format on the metrics HTTP port.
Endpoint
GET http://<node>:9090/metrics
Registered metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
| kiseki_raft_commit_latency_seconds | Histogram | shard | Raft commit latency per shard. Buckets: 100us to 1s. |
| kiseki_raft_entries_total | Counter | (none) | Total Raft entries applied on this node. |
| kiseki_chunk_write_bytes_total | Counter | (none) | Total chunk bytes written. |
| kiseki_chunk_read_bytes_total | Counter | (none) | Total chunk bytes read. |
| kiseki_chunk_ec_encode_seconds | Histogram | strategy | EC encode latency. Buckets: 100us to 50ms. |
| kiseki_gateway_requests_total | Counter | method, status | Gateway request count by method (GET, PUT, DELETE, etc.) and HTTP status. |
| kiseki_gateway_request_duration_seconds | Histogram | method | Gateway request duration. Buckets: 1ms to 5s. |
| kiseki_pool_capacity_total_bytes | Gauge | pool | Total capacity per pool in bytes. |
| kiseki_pool_capacity_used_bytes | Gauge | pool | Used capacity per pool in bytes. |
| kiseki_transport_connections_active | Gauge | (none) | Active transport connections. |
| kiseki_transport_connections_idle | Gauge | (none) | Idle transport connections. |
| kiseki_shard_delta_count | Gauge | shard | Current delta count per shard. |
| kiseki_key_rotation_total | Counter | (none) | Key rotations performed (system + tenant). |
| kiseki_crypto_shred_total | Counter | (none) | Crypto-shred operations performed. |
Metric scoping (zero-trust)
Per ADR-015, metric scoping respects the zero-trust boundary:
- Cluster admin sees: Aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless the cluster admin has approved access for that tenant.
- Tenant admin sees: Their own tenant’s metrics via the tenant audit export.
- No metric exposes: File names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).
Metric cardinality
Metric cardinality is bounded by design. Label values are drawn from fixed sets (shard IDs, pool names, HTTP methods, strategy names). There are no unbounded label values such as file paths, tenant IDs, or user identifiers in metrics labels.
Structured logging
Kiseki uses the tracing crate for structured logging. Every log event
is a structured record with typed fields.
Configuration
| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Filter directive. Supports per-module granularity. |
| KISEKI_LOG_FORMAT | text | Output format: text (human-readable) or json (structured). |
Filter examples
# Default: info-level for all Kiseki modules
RUST_LOG=kiseki=info
# Debug for the Raft subsystem, info for everything else
RUST_LOG=kiseki_raft=debug,kiseki=info
# Trace-level for the chunk subsystem (very verbose)
RUST_LOG=kiseki_chunk=trace,kiseki=info
# Warnings only (quiet)
RUST_LOG=warn
JSON output format
In production, set KISEKI_LOG_FORMAT=json for structured log
aggregation (ELK, Loki, Datadog, etc.):
{
"timestamp": "2026-04-23T14:30:00.123Z",
"level": "INFO",
"target": "kiseki_raft",
"message": "Raft leader elected",
"shard": "shard-0001",
"node_id": 1,
"term": 42
}
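One-record-per-line JSON logs in this format can be sliced directly with jq. A sketch against sample lines in the shape shown above (the WARN and ERROR records are illustrative):

```shell
# Pull WARN-and-above records out of a JSON log stream. Sample lines
# follow the format shown above; the WARN/ERROR records are made up.
log=$(mktemp)
cat <<'EOF' > "$log"
{"timestamp":"2026-04-23T14:30:00.123Z","level":"INFO","target":"kiseki_raft","message":"Raft leader elected","shard":"shard-0001"}
{"timestamp":"2026-04-23T14:30:05.456Z","level":"WARN","target":"kiseki_chunk","message":"Pool capacity warning","pool":"fast-nvme"}
{"timestamp":"2026-04-23T14:30:09.789Z","level":"ERROR","target":"kiseki_raft","message":"Commit timeout","shard":"shard-0002"}
EOF

jq -r 'select(.level == "WARN" or .level == "ERROR") | "\(.level) \(.target): \(.message)"' "$log"
```

In production the same filter would typically live in the log aggregation pipeline (Loki, ELK) rather than a shell.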
Log levels
| Level | Usage |
|---|---|
| ERROR | Unrecoverable failures, invariant violations, data loss events. |
| WARN | Recoverable issues, degraded state, approaching capacity limits. |
| INFO | Significant state changes: leader election, key rotation, shard split, node join/leave. |
| DEBUG | Detailed operational events: individual RPCs, cache hits/misses, EC operations. |
| TRACE | Wire-level detail: Raft message contents, HKDF inputs, bitmap operations. |
Security in logs
- Tenant-identifying fields (tenant_id, namespace) are present for correlation.
- Content fields (file names, chunk plaintext, key material) are never logged (I-K8).
- Logs ship to the same audit/observability pipeline.
Distributed tracing (OpenTelemetry)
Kiseki uses OpenTelemetry for distributed tracing across the full write/read path.
Configuration
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP gRPC endpoint. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name in traces. |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (1.0 = 100%, 0.1 = 10%). Reduce in production for high-throughput workloads. |
Trace propagation
Every write/read path carries a trace ID via OpenTelemetry context propagation. Traces span:
client -> gateway -> composition -> log -> chunk -> view
For the native client path:
client (FUSE) -> transport -> composition -> log -> chunk
Jaeger integration
The development Docker Compose stack includes Jaeger for trace visualization:
- Jaeger UI: http://localhost:16686
- OTLP gRPC receiver: localhost:4317
Trace scoping
Traces respect the zero-trust boundary:
- Tenant-scoped traces are visible only to the tenant admin (via tenant audit export).
- Cluster admin sees system-level spans. No tenant content appears in span attributes visible to the cluster admin.
- Trace overhead is approximately 1-2% on the data path (acceptable for production).
Event store
The admin dashboard maintains an in-memory event store for diagnostic events. Events are categorized and severity-tagged.
Event categories
| Category | Events |
|---|---|
| node | Node join, node leave, node unreachable, node recovered. |
| shard | Shard created, shard split, shard maintenance entered/exited. |
| device | Device added, device failed, SMART warning, evacuation started/completed. |
| tenant | Tenant created, tenant deleted, quota changed. |
| security | Auth failure, cert revocation, crypto-shred. |
| admin | Maintenance mode toggle, backup requested, scrub requested, tuning parameter change. |
| gateway | Protocol errors, connection surge, rate limiting. |
| raft | Leader election, membership change, snapshot transfer. |
Event severities
| Severity | Description |
|---|---|
| info | Normal operations. |
| warning | Attention needed, but the system is operating. |
| error | Failure requiring investigation. |
| critical | Immediate action required (data at risk, quorum lost). |
Event API
# All events from the last 3 hours
curl http://node1:9090/ui/api/events
# Errors from the last hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1'
# Device events, last 50
curl 'http://node1:9090/ui/api/events?category=device&limit=50'
# Security events from the last 24 hours
curl 'http://node1:9090/ui/api/events?category=security&hours=24'
Historical metrics API
# Metric snapshots from the last 3 hours
curl http://node1:9090/ui/api/history
# Last 6 hours
curl 'http://node1:9090/ui/api/history?hours=6'
The history endpoint returns time-series data points suitable for charting. The default retention is 3 hours in memory. For longer retention, use Prometheus.
Grafana integration
For production monitoring with alerting and long-term storage, configure Prometheus to scrape Kiseki metrics and visualize with Grafana.
Prometheus scrape configuration
scrape_configs:
- job_name: 'kiseki'
scrape_interval: 15s
static_configs:
- targets:
- 'node1:9090'
- 'node2:9090'
- 'node3:9090'
metrics_path: '/metrics'
Recommended Grafana dashboards
Cluster overview dashboard:
- Cluster health (up/down per node)
- Total Raft entries/sec (rate of kiseki_raft_entries_total)
- Gateway request rate (rate of kiseki_gateway_requests_total)
- Gateway latency p50/p99 (kiseki_gateway_request_duration_seconds)
- Pool utilization (kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes)
Per-node dashboard:
- Raft commit latency histogram (kiseki_raft_commit_latency_seconds)
- Chunk read/write throughput
kiseki_raft_commit_latency_seconds) - Chunk read/write throughput
- Transport connection count
- Shard delta count per shard
Capacity dashboard:
- Pool fill percentage over time
- Pool capacity trend (linear projection for capacity planning)
- Delta count growth rate (shard split prediction)
Key management dashboard:
- Key rotation count over time (kiseki_key_rotation_total)
- Crypto-shred count (kiseki_crypto_shred_total)
Alerting rules
Recommended Prometheus alerting rules:
groups:
- name: kiseki
rules:
- alert: KisekiNodeDown
expr: up{job="kiseki"} == 0
for: 1m
labels:
severity: critical
- alert: KisekiPoolCapacityWarning
expr: >
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.85
for: 5m
labels:
severity: warning
- alert: KisekiPoolCapacityCritical
expr: >
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.92
for: 1m
labels:
severity: critical
- alert: KisekiGatewayLatencyHigh
expr: >
histogram_quantile(0.99, rate(kiseki_gateway_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
- alert: KisekiRaftCommitLatencyHigh
expr: >
histogram_quantile(0.99, rate(kiseki_raft_commit_latency_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
Key Management
Kiseki uses a two-layer encryption model where system-level encryption protects data at rest and tenant-level key wrapping controls access. This page covers operational aspects of key management: rotation, re-encryption, crypto-shred, and external KMS integration.
Encryption model
Kiseki implements Model (C) from ADR-002: single data encryption pass at the system layer, with tenant access via key wrapping. No double encryption.
Plaintext chunk
|
v
System DEK (AES-256-GCM) --> Ciphertext (stored on disk)
|
v
System KEK (wraps DEK derivation material)
|
v
Tenant KEK (wraps system DEK derivation parameters per tenant)
System keys
- System DEK: Per-chunk symmetric key derived locally on each storage node via HKDF-SHA256 (ADR-003). Never stored, never transmitted. Derivation: HKDF(master_key[epoch], chunk_id, "kiseki-chunk-dek-v1").
- System master key: Per-epoch master key stored in the system key manager (kiseki-keyserver). Storage nodes fetch it at startup and on epoch rotation, then derive per-chunk DEKs locally. The key manager never sees individual chunk IDs.
- System KEK: Wraps system master keys. Managed by the cluster admin.
Tenant keys
- Tenant KEK: Key wrapping key managed by the tenant’s chosen KMS backend. Wraps access to system DEK derivation parameters (epoch + chunk_id). Destroying the tenant KEK = crypto-shred (data becomes unreadable).
- No Tenant DEK: Model (C) does not double-encrypt. The tenant layer is key-wrapping, not data-encryption.
Invariants
- I-K1: No plaintext chunk is ever persisted to storage.
- I-K2: No plaintext payload is ever sent on the wire.
- I-K7: Authenticated encryption (AES-256-GCM) everywhere.
- I-K8: Keys are never logged, printed, transmitted in the clear, or stored in configuration files.
System key manager
The system key manager (kiseki-keyserver) is a dedicated HA service
backed by its own Raft consensus group.
Deployment
Deploy on 3-5 dedicated nodes, separate from storage nodes. The system key manager must be at least as available as the log (I-K12) because its unavailability blocks all chunk writes cluster-wide.
Key distribution
kiseki-keyserver:
Stores: master_key per epoch (Raft-replicated)
Serves: master_key to authenticated kiseki-server processes (mTLS)
Never sees: individual chunk_ids or per-chunk operations
kiseki-server:
Caches: master_key (mlock'd, MADV_DONTDUMP, seccomp)
Derives: per-chunk DEK = HKDF(master_key, chunk_id) -- locally
Never sends: chunk_ids to the key manager
This design prevents the key manager from building an index of all chunk IDs, which would leak per-tenant access patterns.
Key rotation
System key rotation
System key rotation creates a new epoch with a new master key. The rotation process:
- Cluster admin initiates rotation via RotateSystemKey().
- The key manager generates a new master key and assigns a new epoch.
- Storage nodes are notified and fetch the new master key.
- New writes use the new epoch. Old data retains its epoch.
- Two epochs coexist during the rotation window (I-K6).
Old master keys are retained until all data encrypted under them has been re-encrypted or deleted. Full re-encryption is available as an explicit admin action.
Tenant key rotation
Tenant key rotation creates a new epoch for the tenant’s KEK:
- Tenant admin initiates rotation via RotateTenantKey(tenant).
- The tenant KMS generates or rotates the key (provider-specific).
- New envelope wrappings use the new epoch.
- Old wrapped material remains valid until background re-wrapping completes.
Background re-encryption
A background monitor detects envelopes wrapped under old epochs and schedules re-wrapping. The rewrap worker:
- Reads envelopes with old-epoch tenant wrapping.
- Unwraps with old KEK.
- Re-wraps with current KEK.
- Writes the updated envelope.
For providers that support server-side rewrap (e.g., Vault Transit), the rewrap operation never exposes plaintext derivation material to the storage node.
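The worker's inner step can be sketched as below. The `TenantKms` trait and the toy XOR scheme are invented so the sketch is runnable; they stand in for the real TenantKmsProvider backends and real key wrapping.

```rust
// Minimal sketch of one rewrap pass; names and scheme are illustrative.
trait TenantKms {
    fn unwrap(&self, epoch: u64, wrapped: &[u8]) -> Vec<u8>;
    fn wrap(&self, epoch: u64, material: &[u8]) -> Vec<u8>;
}

/// Toy provider: "wrapping" is an epoch-keyed XOR. Not cryptography.
struct XorKms;

impl TenantKms for XorKms {
    fn unwrap(&self, epoch: u64, wrapped: &[u8]) -> Vec<u8> {
        wrapped.iter().map(|&b| b ^ (epoch as u8)).collect()
    }
    fn wrap(&self, epoch: u64, material: &[u8]) -> Vec<u8> {
        material.iter().map(|&b| b ^ (epoch as u8)).collect()
    }
}

struct Envelope {
    tenant_key_epoch: u64,
    wrapped_material: Vec<u8>,
}

/// Unwrap with the old KEK epoch, re-wrap with the current one, and update
/// the envelope. Returns false when the envelope is already current.
fn rewrap(env: &mut Envelope, kms: &dyn TenantKms, current_epoch: u64) -> bool {
    if env.tenant_key_epoch == current_epoch {
        return false;
    }
    let material = kms.unwrap(env.tenant_key_epoch, &env.wrapped_material);
    env.wrapped_material = kms.wrap(current_epoch, &material);
    env.tenant_key_epoch = current_epoch;
    true
}
```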
Crypto-shred
Crypto-shred is the authoritative deletion mechanism in Kiseki. Destroying the tenant KEK renders all tenant data unreadable.
Process
- Tenant admin initiates via CryptoShred(tenant).
- The tenant KMS destroys the KEK (provider-specific: Vault key deletion, AWS KMS key scheduling, PKCS#11 key destruction).
- All cached key material for the tenant is invalidated across the cluster.
- Native clients detect the shred via key health checks (default every 30 seconds) and wipe their caches (I-CC12).
What happens after crypto-shred
- Data is semantically deleted: No component can decrypt the tenant’s data because the KEK is destroyed.
- Ciphertext remains on disk: Physical GC runs separately when chunk refcount = 0 AND no retention hold is active (I-C2b).
- Audit trail preserved: Crypto-shred events are recorded in the audit log.
Ordering requirement
If retention holds are needed, they must be set before crypto-shred:
Set retention hold -> Crypto-shred -> Hold expires -> GC eligible
This prevents a race between crypto-shred and GC (I-C2b).
Detection latency
Crypto-shred detection is bounded by:
min(key_health_interval, max_disconnect_seconds).
Default key health check interval: 30 seconds, configurable per tenant within [5s, 300s] (I-K15).
External KMS providers (ADR-028)
Kiseki supports five tenant KMS backends via the TenantKmsProvider
trait. The provider is selected per-tenant at onboarding.
Provider comparison
| # | Backend | Transport | Material model | Key material location |
|---|---|---|---|---|
| 1 | Kiseki Internal | In-process | Local | Separate Raft group in Kiseki |
| 2 | HashiCorp Vault | HTTPS | Local (cached) | Vault Transit engine |
| 3 | KMIP 2.1 | mTLS (TTLV) | Remote or local | KMIP server / HSM |
| 4 | AWS KMS | HTTPS | Remote only | AWS KMS |
| 5 | PKCS#11 v3.0 | Local (FFI) | Remote only (HSM) | Hardware Security Module |
Provider invariants
- I-K16: Provider abstraction is opaque to callers. No correctness decision depends on which backend is selected.
- I-K17: Wrap/unwrap operations include AAD (chunk_id) binding. A wrapped blob cannot be spliced from one envelope to another.
- I-K18: Provider is validated on configuration: connectivity test, wrap/unwrap round-trip, certificate chain. Validation failure prevents tenant activation.
- I-K19: Internal provider stores tenant KEKs in a separate Raft group from system master keys.
- I-K20: Provider migration (e.g., Internal to Vault) requires re-wrapping all existing envelopes. Migration is background, audited, and preserves data availability throughout.
Provider 1: Kiseki Internal (default)
Zero-configuration default. Kiseki manages tenant KEKs internally in a Raft group separate from system master keys. Suitable for single-operator deployments.
Security trade-off: Internal mode does not provide the full two-layer security guarantee, because the system key store and the tenant key store live in the same cluster trust domain; compromising that cluster yields full access. Compliance-sensitive tenants should use an external provider.
Provider 2: HashiCorp Vault
Uses Vault’s Transit secrets engine for encryption-as-a-service:
| Kiseki operation | Vault API |
|---|---|
| wrap | POST /transit/encrypt/:name (with context = AAD) |
| unwrap | POST /transit/decrypt/:name (with context = AAD) |
| rotate | POST /transit/keys/:name/rotate |
| rewrap | POST /transit/rewrap/:name (server-side, no plaintext exposure) |
| destroy | DELETE /transit/keys/:name |
Provider 3: KMIP 2.1
Standards-based integration with enterprise KMS and HSM appliances. Uses mTLS over TTLV binary protocol.
Provider 4: AWS KMS
Cloud-native KMS integration. Key material never leaves AWS. All wrap/unwrap operations are remote HTTPS calls. Suitable for hybrid cloud deployments.
Provider 5: PKCS#11 v3.0
Direct HSM integration via the PKCS#11 C API (FFI). Key material stays in the HSM. This is the highest security level, but it requires HSM hardware on, or network-accessible from, storage nodes.
OIDC integration
Tenant identity providers can be integrated for second-stage authentication (I-Auth2). This is optional and orthogonal to the KMS provider choice.
When configured, workload-level identity is validated against the tenant admin’s authorization via OIDC/JWT tokens, providing “authorized by my tenant admin” on top of the mTLS-based “belongs to this cluster” identity.
Keycloak is included in the development Docker Compose stack for OIDC testing.
Operational checklist
Key rotation schedule
| Key type | Recommended interval | Enforcement |
|---|---|---|
| System master key | Quarterly | Manual (cluster admin) |
| Tenant KEK | Per tenant policy | Manual or automated via KMS |
| TLS certificates | Annual | Cluster CA renewal |
Monitoring key health
```bash
# Check key manager health
kiseki-server keymanager health

# Check tenant KMS connectivity
kiseki-server keymanager check-kms

# Monitor key rotation metrics
curl -s http://node1:9090/metrics | grep kiseki_key_rotation_total

# Monitor crypto-shred events
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total
```
Key material security
- Master keys are mlock’d in memory on storage nodes (prevent swapping).
- Core dumps are disabled (LimitCORE=0 in systemd, MADV_DONTDUMP).
- seccomp filters restrict system calls on key-handling threads.
- Runtime integrity monitor detects ptrace, /proc/pid/mem access, and debugger attachment (I-O7).
- Keys are zeroized on deallocation (Zeroizing<Vec<u8>>).
System Overview
Kiseki is a distributed storage system designed for HPC and AI workloads. It provides a unified data fabric with POSIX (FUSE), NFS, and S3 access paths, two-layer encryption with tenant-controlled crypto-shred, and pluggable HPC transports (CXI/Slingshot, InfiniBand, RoCEv2).
Workspace structure
The codebase is a single Rust workspace with 18 crates:
| Crate | Purpose |
|---|---|
| kiseki-common | Shared types, HLC, identifiers, errors |
| kiseki-proto | Generated protobuf/gRPC code |
| kiseki-crypto | FIPS AEAD (AES-256-GCM), envelope encryption, tenant KMS providers |
| kiseki-raft | Shared Raft config, redb log store, TCP transport |
| kiseki-transport | Transport abstraction: TCP+TLS, RDMA verbs, CXI/libfabric |
| kiseki-log | Log context: delta ordering, shard lifecycle, Raft consensus |
| kiseki-block | Raw block device I/O, bitmap allocator, superblock (ADR-029) |
| kiseki-chunk | Chunk storage: placement, erasure coding, GC, device management |
| kiseki-composition | Composition context: namespace, refcount, multipart |
| kiseki-view | View materialization: stream processors, MVCC pins |
| kiseki-gateway | Protocol gateway: NFS and S3 translation |
| kiseki-client | Native client: FUSE, transport selection, client-side cache |
| kiseki-keymanager | System key manager with Raft HA |
| kiseki-audit | Append-only audit log with per-tenant shards |
| kiseki-advisory | Workflow advisory: hints, telemetry, budgets (ADR-020/021) |
| kiseki-control | Control plane: tenancy, IAM, policy, federation |
| kiseki-server | Storage node binary (composes all server-side crates) |
| kiseki-acceptance | BDD acceptance tests (cucumber-rs) |
Bounded contexts
The domain is organized into eight bounded contexts, each with a distinct responsibility, failure domain, and scaling concern:
- Log – Delta ordering, Raft consensus, shard lifecycle
- Chunk Storage – Encrypted chunk persistence, placement, EC, GC
- Composition – Tenant-scoped metadata assembly, namespace management
- View Materialization – Protocol-shaped materialized projections
- Protocol Gateway – NFS and S3 wire protocol translation
- Control Plane – Tenancy, IAM, quota, policy, federation
- Key Management – System DEK/KEK, tenant KMS providers, crypto-shred
- Workflow Advisory – Client hints, telemetry feedback (cross-cutting)
Additionally, Native Client runs on compute nodes as a separate trust boundary and Block I/O handles raw device management underneath chunk storage.
Data path
Client (plaintext) ──encrypt──► Gateway / Native Client
│
▼
Composition
(assemble chunks, record delta)
│
┌────────┴────────┐
▼ ▼
Log (Raft) Chunk Storage
(commit delta, (write encrypted
replicate) chunk to device)
Write path: The client (native or protocol) encrypts data with the tenant KEK wrapping a system DEK. The composition layer assembles chunk references and records a delta. The delta is committed through Raft on the owning shard. Chunks are written to affinity pools with erasure coding.
Read path: The client issues a view lookup (materialized from log deltas). The view resolves chunk references. Chunks are read from devices, decrypted, and returned to the client.
Control path
Admin ──► Control Plane (gRPC)
│
├── Tenant / Namespace / Quota / Policy
├── Flavor management
├── Federation (async cross-site)
└── Advisory policy (hint budgets, profiles)
The control plane manages tenant lifecycle, IAM, quotas, compliance tags,
placement policy, and federation. It communicates with storage nodes via
gRPC on the management network. The control plane depends only on
kiseki-common and kiseki-proto (crate-graph firewall, ADR-027).
Advisory path (ADR-020)
Client ──hints──► Advisory Runtime ──telemetry──► Client
│
├── Route hints to Chunk / View / Composition
├── Emit caller-scoped telemetry feedback
└── Audit advisory events
The workflow advisory system is a cross-cutting concern (not a bounded context). It carries two flows over a bidirectional gRPC channel per declared workflow:
- Hints (client to storage): advisory steering signals for prefetch, affinity, priority, and phase-adaptive tuning. Never authoritative (I-WA1).
- Telemetry feedback (storage to client): caller-scoped signals about backpressure, locality, materialization lag, and QoS headroom (I-WA5).
The advisory runtime runs on a dedicated tokio runtime, isolated from the data path. Advisory failures never block data-path operations (I-WA2).
Network ports
| Port | Purpose |
|---|---|
| 9100 | Data-path gRPC (Log, Chunk, Composition, View, Discovery) |
| 9101 | Advisory gRPC (WorkflowAdvisoryService) |
| 9000 | S3 HTTP gateway |
| 2049 | NFS server |
| 9090 | Prometheus metrics + health + admin UI |
Binaries
| Binary | Contents | Deployment |
|---|---|---|
| kiseki-server | Log, Chunk, Composition, View, Gateway, Audit, Advisory | Every storage node |
| kiseki-client-fuse | Native client with FUSE | Compute nodes |
| kiseki-control | Control plane | Management network (3+ instances) |
| kiseki-keyserver | System key manager (Raft HA) | Dedicated cluster (3-5 nodes) |
Bounded Contexts
Eight bounded contexts form the core domain model. Each has a distinct responsibility, failure domain, and scaling concern. This page describes each context’s purpose, implementing crate, key types, and governing invariants.
1. Log
Crate: kiseki-log
Purpose: Accept deltas, assign them a total order within a shard, replicate via Raft, persist durably, and support range reads for view materialization and replay.
Key types: Delta, DeltaEnvelope, Shard, ShardConfig, ShardInfo
Key invariants:
| ID | Rule |
|---|---|
| I-L1 | Within a shard, deltas have a total order |
| I-L2 | A committed delta is durable on a majority of Raft replicas before ack |
| I-L3 | A delta is immutable once committed |
| I-L4 | Delta GC requires ALL consumers (views + audit) to have advanced past the delta |
| I-L5 | A composition is not visible until all referenced chunks are durable |
| I-L6 | Shards have a hard ceiling triggering mandatory split (delta count, byte size, or throughput) |
| I-L7 | Delta envelope has separated system-visible header and tenant-encrypted payload |
| I-L8 | Cross-shard rename returns EXDEV (no 2PC across shards) |
| I-L9 | A delta’s inlined payload is immutable after write; threshold changes apply prospectively |
Failure domain: Per-shard. Leader loss causes transient latency (election). Quorum loss makes the shard unavailable.
2. Chunk Storage
Crate: kiseki-chunk (with kiseki-block for device I/O)
Purpose: Store and retrieve opaque encrypted chunks. Manage placement across affinity pools. Handle erasure coding and replication. Run GC based on refcounts and retention holds.
Key types: Chunk, ChunkId, Envelope, AffinityPool, DeviceBackend
Key invariants:
| ID | Rule |
|---|---|
| I-C1 | Chunks are immutable; new versions are new chunks |
| I-C2 | A chunk is not GC’d while any composition references it (refcount > 0) |
| I-C2b | A chunk is not GC’d while a retention hold is active |
| I-C3 | Chunks are placed according to affinity policy from the referencing view descriptor |
| I-C4 | Durability strategy is per affinity pool (EC default, N-copy replication available) |
| I-C5 | Pool writes rejected at Critical threshold (SSD 85%, HDD 92%); ENOSPC at Full |
| I-C6 | EC parameters are immutable per pool; SetPoolDurability applies to new chunks only |
| I-C7 | All chunk data writes are aligned to device physical block size (ADR-029) |
| I-C8 | Allocation bitmap is ground truth; free-list is a derived cache rebuilt on startup |
Failure domain: Per-chunk or per-device. Chunk loss recoverable via EC parity or replicas.
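Invariant I-C8 above can be sketched as follows; the function name is invented, but it shows the direction of derivation: the free-list is computed from the bitmap at startup, never persisted as authoritative state.

```rust
// Sketch of I-C8: the allocation bitmap is ground truth; the free-list is
// a derived cache rebuilt from it on startup.
fn rebuild_free_list(allocation_bitmap: &[bool]) -> Vec<usize> {
    allocation_bitmap
        .iter()
        .enumerate()
        .filter(|&(_, &allocated)| !allocated) // keep unallocated blocks
        .map(|(index, _)| index)
        .collect()
}
```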
3. Composition
Crate: kiseki-composition
Purpose: Maintain tenant-scoped metadata structures describing how chunks assemble into data units (files, objects). Manage namespaces. Record mutations as deltas in the log.
Key types: Composition, Namespace, CompositionMutation
Key invariants:
| ID | Rule |
|---|---|
| I-X1 | A composition belongs to exactly one tenant |
| I-X2 | A composition’s chunks respect the tenant’s dedup policy (global hash or per-tenant HMAC) |
| I-X3 | A composition’s mutation history is fully reconstructible from its shard’s deltas |
Failure domain: Coupled to Log. If a shard fails, its compositions are affected.
4. View Materialization
Crate: kiseki-view
Purpose: Consume deltas from shards and maintain materialized views per view descriptor. Handle view lifecycle (create, discard, rebuild) and MVCC read pins.
Key types: View, ViewDescriptor, StreamProcessor, MvccPin
Key invariants:
| ID | Rule |
|---|---|
| I-V1 | A view is derivable from its source shard(s) alone (rebuildable-from-log) |
| I-V2 | A view’s observed state is a consistent prefix of its source log(s) up to a watermark |
| I-V3 | Cross-view consistency governed by the reading protocol’s declared consistency model |
| I-V4 | MVCC read pins have bounded lifetime; pin expiration revokes the snapshot guarantee |
Failure domain: Per-view. A fallen-behind view serves stale data. A lost view can be rebuilt from the log.
5. Protocol Gateway
Crate: kiseki-gateway
Purpose: Translate wire protocol requests (NFS, S3) into operations against views and the log. Serve reads from views. Route writes as deltas to the log via composition. Perform tenant-layer encryption for protocol-path clients.
Key types: Protocol gateway instance, protocol plugin
Trust boundary: NFS/S3 clients send plaintext over TLS to the gateway. The gateway encrypts before writing to log/chunks. Plaintext exists in gateway memory only ephemerally.
Failure domain: Per-gateway. Crash disconnects affected clients. Restart and client reconnect recovers.
6. Control Plane
Crate: kiseki-control
Purpose: Declarative API for tenancy, IAM, policy, placement, discovery, compliance tagging, and federation. Manages cluster-level and tenant-level configuration.
Key types: Organization, Project, Workload, Flavor,
ComplianceRegime, RetentionHold, FederationPeer
Key invariants:
| ID | Rule |
|---|---|
| I-T1 | Tenants are fully isolated; no cross-tenant data access |
| I-T2 | Tenant resource consumption bounded by quotas at org and workload levels |
| I-T3 | Tenant keys not accessible to other tenants or shared processes |
| I-T4 | Cluster admin cannot access tenant data without tenant admin approval |
| I-T4c | Cluster admin modifications to pools with tenant data are audit-logged to tenant |
Failure domain: Control plane unavailability prevents new tenant creation and policy changes, but the existing data path continues with last-known configuration.
7. Key Management
Crates: kiseki-keymanager, kiseki-crypto
Purpose: Custody, rotation, escrow, and issuance of all key material. Two layers: system keys (cluster admin) and tenant key wrapping (tenant admin via tenant KMS). Orchestrate crypto-shred.
Key types: SystemDek, SystemKek, TenantKek, KeyEpoch,
Envelope, TenantKmsProvider
Tenant KMS providers (ADR-028): Five pluggable backends implementing
the TenantKmsProvider trait – Kiseki-Internal, HashiCorp Vault, KMIP 2.1,
AWS KMS, and PKCS#11.
Key invariants:
| ID | Rule |
|---|---|
| I-K1 | No plaintext chunk is ever persisted to storage |
| I-K2 | No plaintext payload is ever sent on the wire |
| I-K4 | System can enforce access without reading plaintext |
| I-K5 | Crypto-shred renders data unreadable within bounded time |
| I-K6 | Key rotation does not lose access to old data until explicit cutover |
| I-K7 | Authenticated encryption everywhere |
| I-K8 | Keys are never logged, printed, transmitted in the clear, or in config files |
| I-K16 | Provider abstraction is opaque to callers |
| I-K17 | Wrap/unwrap operations include AAD (chunk_id) binding |
Failure domain: KMS unavailability blocks new encrypt/decrypt operations. This context’s availability is as critical as the Log’s.
8. Workflow Advisory (cross-cutting)
Crate: kiseki-advisory
Purpose: Carry workflow hints from clients to storage and telemetry feedback from storage back to clients. Route advisory signals to the bounded context best able to act on them.
Key types: WorkflowRef, OperationAdvisory, PoolHandle,
PoolDescriptor, HintBudget
Key invariants:
| ID | Rule |
|---|---|
| I-WA1 | Hints are advisory only; no correctness decision depends on a hint |
| I-WA2 | Advisory subsystem is isolated from the data path; failures do not block data-path operations |
| I-WA3 | A workflow belongs to exactly one workload; authorization is per-operation |
| I-WA5 | Telemetry feedback is scoped to the caller’s authorization |
| I-WA6 | Advisory requests are not existence or content oracles |
| I-WA7 | Hint budgets enforced per workload within parent ceilings |
| I-WA14 | Hints do not extend tenant capabilities |
Runtime isolation: The advisory runtime runs on a dedicated tokio
runtime separate from the data-path runtime (ADR-021). No data-path crate
depends on kiseki-advisory.
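A hedged sketch of how per-workload budget enforcement within a parent ceiling (I-WA7) might look; all type and field names here are invented for illustration:

```rust
// Hypothetical sketch of I-WA7: a workload consumes hints from its own
// budget, capped by the parent (project/org) ceiling.
struct HintBudget {
    used: u32,
    workload_limit: u32,
    parent_ceiling: u32,
}

impl HintBudget {
    /// Accept `n` hints if they fit under both the workload limit and the
    /// parent ceiling. Hints over budget are simply dropped, never queued,
    /// because they are advisory only (I-WA1).
    fn try_consume(&mut self, n: u32) -> bool {
        let effective = self.workload_limit.min(self.parent_ceiling);
        if self.used.saturating_add(n) <= effective {
            self.used += n;
            true
        } else {
            false
        }
    }
}
```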
Cross-context relationships
| Producer | Consumer | What flows |
|---|---|---|
| Control Plane | All contexts | Policy, placement, tenant config, compliance tags |
| Log | Composition, View | Deltas (ordered, durable) |
| Composition | Chunk Storage | Chunk references (refcounts) |
| Key Management | Chunk Storage | System DEKs |
| Key Management | Gateway, Native Client | Tenant KEK (wrapping) |
| View Materialization | Gateway, Native Client | Materialized view state |
| Chunk Storage | View, Native Client | Chunk data (encrypted) |
Data Flow
This page describes the write, read, inline, and cross-node data paths through the Kiseki system.
Write path
┌──────────┐ plaintext ┌──────────────────┐
│ Client │ ──────────────► │ Gateway / │
│ │ (over TLS) │ Native Client │
└──────────┘ └────────┬──────────┘
│ 1. Encrypt with tenant KEK
│ wrapping system DEK
│ 2. Content-defined chunking
│ (Rabin fingerprinting)
▼
┌──────────────────┐
│ Composition │
│ │
│ 3. Record chunk │
│ references │
│ 4. Build delta │
└───────┬──────────┘
│
┌────────┴────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Log (Raft) │ │ Chunk Storage │
│ │ │ │
│ 5. Commit │ │ 6. Write │
│ delta via │ │ encrypted │
│ Raft │ │ chunk to │
│ 7. Replicate │ │ device │
│ to │ │ 8. EC encode │
│ majority │ │ across │
│ │ │ pool │
└──────────────┘ └──────────────┘
Step-by-step
- Client encrypt: The native client encrypts data before it leaves the process. Protocol-path clients (NFS/S3) send plaintext over TLS to the gateway, which encrypts on their behalf.
- Content-defined chunking: Data is split into variable-size chunks using Rabin fingerprinting. Each chunk gets a content-addressed ID (SHA-256 hash of plaintext, or HMAC when tenant opts out of cross-tenant dedup).
- Compose: The composition layer records chunk references and constructs a delta describing the mutation (create, update, delete).
- Raft commit: The delta is appended to the owning shard’s Raft log. The leader replicates to a majority of voters before acknowledging.
- Chunk write: Encrypted chunks are written to affinity pool devices with erasure coding (or N-copy replication, per pool policy).
- Ack: The write is acknowledged to the client only after the delta is committed (I-L2) and all referenced chunks are durable (I-L5).
Read path
┌──────────┐ ┌──────────────────┐
│ Client │ ◄────────────── │ Gateway / │
│ │ plaintext │ Native Client │
└──────────┘ (over TLS) └────────┬──────────┘
▲ 5. Decrypt
│
┌────────┴──────────┐
│ View Lookup │
│ │
│ 1. Resolve path │
│ to composition │
│ 2. Get chunk list │
└────────┬──────────┘
│
▼
┌──────────────────┐
│ Chunk Storage │
│ │
│ 3. Read chunks │
│ from device │
│ 4. EC decode if │
│ degraded │
└──────────────────┘
Step-by-step
- View lookup: The client or gateway queries a materialized view to resolve a path (POSIX) or key (S3) to a composition and its chunk list.
- Chunk read: Encrypted chunks are read from the storage devices. If a device is degraded, EC parity reconstructs the missing data.
- Decrypt: The client (native path) or gateway (protocol path) unwraps the system DEK using the tenant KEK, then decrypts the chunk data with AES-256-GCM.
- Return: Plaintext is returned to the client.
Inline path (ADR-030)
Small files below the configurable inline threshold bypass chunk storage entirely:
Client ──► Composition ──► Log (Raft)
│
▼
Delta with inline payload
│
▼
Raft replication to voters
│
▼
State machine apply:
store in small/objects.redb
Threshold computation: The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set:
clamp(min(voter_budgets) / file_count_estimate, INLINE_FLOOR, INLINE_CEILING)
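As a runnable sketch of the formula above — the floor and ceiling values here are invented placeholders, not Kiseki's actual defaults:

```rust
const INLINE_FLOOR: u64 = 4 * 1024; // placeholder bounds for illustration,
const INLINE_CEILING: u64 = 128 * 1024; // not the real defaults

/// Inline threshold for a shard: the minimum affordable threshold across
/// the voter set, clamped to [INLINE_FLOOR, INLINE_CEILING].
fn inline_threshold(voter_budgets: &[u64], file_count_estimate: u64) -> u64 {
    let min_budget = voter_budgets.iter().copied().min().unwrap_or(0);
    (min_budget / file_count_estimate.max(1)).clamp(INLINE_FLOOR, INLINE_CEILING)
}
```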
Key invariants:
- I-L9: Inlined payloads are immutable after write; threshold changes apply prospectively only
- I-SF5: Inline content is offloaded to small/objects.redb on state machine apply; snapshots include inline content from redb
- I-SF7: Per-shard Raft inline throughput capped at KISEKI_RAFT_INLINE_MBPS (default 10 MB/s)
Cross-node data paths
Raft replication
Each shard runs an independent Raft group (ADR-026). The leader replicates log entries (deltas) to followers via the Raft RPC transport. Replication uses mTLS on the data fabric.
Leader ──► Follower 1 (AppendEntries)
──► Follower 2 (AppendEntries)
──► Follower 3 (AppendEntries)
Committed entries are persisted in RedbRaftLogStore on each voter.
Snapshot transfer
When a follower is too far behind or a new voter joins, the leader sends a full snapshot. Snapshots are transferred as length-prefixed JSON over the Raft transport connection.
For shards with inline data, the snapshot includes all entries from small/objects.redb (I-SF5).
Chunk replication and EC
Chunks are placed across distinct physical devices within a pool using deterministic hashing (CRUSH-like). No two EC fragments of the same chunk reside on the same device (I-D4).
Device failure triggers automatic repair from EC parity or replicas (I-D1).
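The distinct-device property (I-D4) can be illustrated with a toy stand-in for the CRUSH-like placement hash; the mixing constant and retry scheme below are invented for this sketch and are not the real algorithm.

```rust
// Toy deterministic placement: same chunk_id always yields the same device
// set, and no two fragments share a device (I-D4).
fn place_fragments(chunk_id: u64, device_count: usize, fragment_count: usize) -> Vec<usize> {
    assert!(fragment_count <= device_count, "cannot place k+m fragments on fewer devices");
    let mut chosen = Vec::with_capacity(fragment_count);
    let mut attempt: u64 = 0;
    while chosen.len() < fragment_count {
        // Deterministic hash of (chunk_id, attempt); retry on collision.
        let h = (chunk_id ^ attempt).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        let device = (h % device_count as u64) as usize;
        if !chosen.contains(&device) {
            chosen.push(device);
        }
        attempt += 1;
    }
    chosen
}
```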
Federation
Federated sites replicate data asynchronously. Only ciphertext is replicated – no key material in the replication stream (I-CS3). All federated sites for a tenant connect to the same tenant KMS.
Encryption Model
Kiseki uses a two-layer encryption architecture (ADR-002, model C) that separates data encryption from access control. One encryption pass protects data; key wrapping controls who can read it.
Two-layer architecture
┌─────────────────────────────────────────────────┐
│ Tenant Layer (access) │
│ │
│ Tenant KEK (controlled by tenant admin) │
│ wraps the system DEK for tenant-scoped access │
│ │
│ Destroying the tenant KEK = crypto-shred │
│ (all tenant data rendered unreadable) │
├─────────────────────────────────────────────────┤
│ System Layer (data) │
│ │
│ System DEK encrypts chunk data (AES-256-GCM) │
│ System KEK wraps system DEKs │
│ Always on -- no unencrypted chunks │
└─────────────────────────────────────────────────┘
System layer: The system DEK encrypts every chunk using AES-256-GCM. System DEKs are derived per-chunk using HKDF-SHA256 from a master key (ADR-003). The system KEK wraps system DEKs and is managed by the cluster admin via the system key manager.
Tenant layer: The tenant KEK wraps the system DEK for tenant-scoped access control. There is no double encryption – one data encryption pass, with key wrapping for access control. The tenant admin controls the tenant KEK via the tenant KMS.
Envelope structure
Each chunk is stored as an envelope containing:
┌──────────────────────────────────────────┐
│ Envelope │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Ciphertext (AES-256-GCM) │ │
│ │ (encrypted chunk data) │ │
│ └──────────────────────────────────┘ │
│ │
│ auth_tag (16 bytes, GCM tag) │
│ nonce (12 bytes, unique per chunk) │
│ system_key_epoch (current epoch) │
│ tenant_key_epoch (current epoch) │
│ chunk_id (content-addressed) │
│ algorithm_id (for crypto-agility) │
│ │
│ System wrapping metadata │
│ Tenant wrapping metadata │
└──────────────────────────────────────────┘
The envelope carries algorithm identifiers for crypto-agility (I-K7). All metadata is authenticated – unauthenticated encryption is never acceptable.
Key derivation
System DEKs are derived locally on each storage node using HKDF-SHA256 (ADR-003). No DEK-per-chunk RPC is required:
```text
system_dek = HKDF-SHA256(
    ikm  = master_key[epoch],
    salt = chunk_id,
    info = "kiseki-chunk-dek-v1"
)
```
The master key is fetched from the system key manager at startup and on rotation events. DEK derivation is deterministic – the same chunk ID and epoch always produce the same DEK.
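The derivation shape can be sketched as follows. The real system uses HKDF-SHA256 via aws-lc-rs; the toy keyed mix here is a non-cryptographic stand-in purely to keep the example dependency-free while showing the deterministic, local, per-chunk structure.

```rust
// NOT real key derivation: an FNV-style keyed mix standing in for
// HKDF-SHA256 so the sketch runs without crypto dependencies.
fn toy_prf(key: &[u8], data: &[u8]) -> [u8; 32] {
    let mut state: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in key.iter().chain(data.iter()) {
        state ^= b as u64;
        state = state.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        state ^= i as u64;
        state = state.wrapping_mul(0x0000_0100_0000_01b3);
        *byte = (state >> 24) as u8;
    }
    out
}

/// Local, deterministic per-chunk DEK derivation: no RPC, and the same
/// (epoch master key, chunk_id) pair always yields the same DEK.
fn derive_chunk_dek(master_key: &[u8; 32], chunk_id: &[u8]) -> [u8; 32] {
    // The info string versions the derivation (crypto-agility).
    let mut input = chunk_id.to_vec();
    input.extend_from_slice(b"kiseki-chunk-dek-v1");
    toy_prf(master_key, &input)
}
```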
Key rotation
Key rotation is epoch-based (I-K6):
- The admin triggers rotation (system or tenant level)
- A new epoch is created with fresh key material
- New data is encrypted with the current epoch’s keys
- Old data retains its epoch until background re-encryption migrates it
- Two epochs coexist during the rotation window
- Full re-encryption available as an explicit admin action for key-compromise incidents
Crypto-shred
Destroying the tenant KEK renders all tenant data unreadable (I-K5):
1. Set retention hold (if compliance requires)
2. Destroy tenant KEK at tenant KMS
3. All wrapped system DEKs for this tenant become unwrappable
4. Chunk ciphertext remains on disk (system-encrypted) until GC
5. Physical GC runs separately when refcount = 0 AND no retention hold
The ordering contract (I-C2b): set hold before crypto-shred to prevent race with GC.
Client-side detection: periodic key health check (default 30s) detects
KEK_DESTROYED and triggers immediate cache wipe (I-CC12). Maximum
detection latency: min(key_health_interval, max_disconnect_seconds).
Chunk ID derivation
| Mode | Algorithm | Cross-tenant dedup |
|---|---|---|
| Default | SHA-256(plaintext) | Yes |
| Opted-out | HMAC-SHA256(plaintext, tenant_key) | No (zero co-occurrence leak) |
When a tenant opts out of cross-tenant dedup (I-X2, I-K10), chunk IDs are derived using HMAC with a tenant-specific key, making it impossible to determine whether two tenants store the same data.
Tenant KMS providers (ADR-028)
Five pluggable backends implement the TenantKmsProvider trait:
| Provider | Key model | Key location |
|---|---|---|
| Kiseki-Internal | Raft-replicated | On-cluster |
| HashiCorp Vault | Transit secrets engine | External |
| KMIP 2.1 | Standard key management protocol | External |
| AWS KMS | Cloud-managed keys | External |
| PKCS#11 | Hardware security modules | External |
Provider selection is per-tenant at onboarding. The trait fully encapsulates protocol differences – callers never branch on provider type (I-K16). Wrap/unwrap operations include AAD (chunk_id) binding to prevent envelope splicing (I-K17).
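A hedged sketch of the AAD-binding property (I-K17): the trait methods and the mock's prefix scheme below are illustrative assumptions, not the real TenantKmsProvider API, and the mock's "wrapping" is a plain AAD-prefix check standing in for real AEAD.

```rust
// Sketch only: shows that a blob wrapped for one chunk_id cannot be
// unwrapped under another (no envelope splicing).
trait TenantKmsProvider {
    fn wrap(&self, material: &[u8], aad_chunk_id: &[u8]) -> Vec<u8>;
    fn unwrap(&self, wrapped: &[u8], aad_chunk_id: &[u8]) -> Option<Vec<u8>>;
}

struct MockProvider;

impl TenantKmsProvider for MockProvider {
    fn wrap(&self, material: &[u8], aad_chunk_id: &[u8]) -> Vec<u8> {
        // Layout: [aad_len][aad][material] — illustration, not cryptography.
        let mut out = vec![aad_chunk_id.len() as u8];
        out.extend_from_slice(aad_chunk_id);
        out.extend_from_slice(material);
        out
    }

    fn unwrap(&self, wrapped: &[u8], aad_chunk_id: &[u8]) -> Option<Vec<u8>> {
        let aad_len = *wrapped.first()? as usize;
        let (aad, material) = wrapped[1..].split_at(aad_len);
        // A blob spliced into another chunk's envelope fails this check.
        (aad == aad_chunk_id).then(|| material.to_vec())
    }
}
```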
FIPS compliance
Kiseki uses aws-lc-rs with the FIPS feature flag for FIPS 140-2/3
validated cryptographic operations. The kiseki-crypto crate provides:
- AES-256-GCM authenticated encryption
- HKDF-SHA256 key derivation
- SHA-256 hashing
- HMAC-SHA256 for opted-out chunk ID derivation
- zeroize integration for all key material in memory
Delta encryption
Log delta payloads (filenames, attributes, inline data) are encrypted with the system DEK, wrapped with the tenant KEK (I-K3). The delta envelope has structurally separated:
- System-visible header (cleartext or system-encrypted): sequence number, shard ID, hashed_key, operation type, timestamp
- Tenant-encrypted payload: the actual mutation data
Compaction operates on headers only and never decrypts tenant-encrypted payloads (I-O2).
Raft Consensus
Kiseki uses Raft for ordering and replicating deltas within each shard. The implementation is based on openraft 0.10 with a custom TCP transport and redb-backed persistent storage.
Per-shard Raft groups
Each shard runs an independent Raft group (ADR-026, Strategy A). This provides:
- Independent scaling: shard count grows with data volume and throughput
- Isolated failure domains: quorum loss in one shard does not affect others
- No cross-shard coordination: cross-shard rename returns EXDEV (I-L8)
The system key manager also runs its own Raft group for high availability (ADR-007), as do audit log shards (ADR-009).
openraft integration
The kiseki-raft crate defines KisekiTypeConfig used by all Raft groups:
- Node identity: u64 node IDs
- Async runtime: tokio
- Log store: RedbRaftLogStore (persistent) or MemLogStore (testing)
- Entry format: customized per context (log deltas, key manager ops, audit events)
Each context (log, key manager, audit) defines its own request (D) and
response (R) types while sharing the node identity, entry format, and
async runtime configuration.
Persistent log: RedbRaftLogStore
Raft log entries are persisted using redb (ADR-022), a pure-Rust
embedded key-value store. The RedbRaftLogStore provides:
- Durable append and truncation of log entries
- Vote persistence (current term, voted-for)
- Snapshot metadata storage
- Crash-safe operations (redb uses write-ahead logging internally)
For shards with inline data (ADR-030), the state machine offloads inline
content to small/objects.redb on apply. The in-memory state machine does
not hold inline content after apply (I-SF5).
Snapshot transfer
When a follower falls behind or a new voter joins the group, the leader sends a full snapshot:
- Leader serializes the current state machine as length-prefixed JSON
- For shards with inline data, the snapshot includes all entries from small/objects.redb
- The snapshot is streamed over the Raft transport connection
- The follower installs the snapshot and resumes normal replication
Transport and security
Raft RPCs use a custom TCP transport with mTLS:
- All Raft communication is authenticated via per-node mTLS certificates signed by the Cluster CA (I-Auth1)
- The transport runs on the data fabric (not the management network)
- Connection pooling and keepalive are managed by the transport layer
The Raft transport address is configured via KISEKI_RAFT_ADDR.
Dynamic membership changes
Raft membership changes follow the standard joint-consensus protocol:
- Add voter: new node starts as learner, catches up to committed index, then promoted to voter
- Remove voter: validated that removal does not break quorum (safety check via can_remove_safely)
- Shard migration: target node must fully catch up (learner state matches leader’s committed index) before old voter is removed (I-SF3)
Membership changes are validated by validate_membership_change in
kiseki-raft, which checks quorum preservation and prevents unsafe
removal.
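The quorum-preservation check behind can_remove_safely can be sketched as: removal is allowed only if the voters that remain reachable after the change can still form a majority of the shrunken group. The signature below is an illustrative reconstruction, not the actual kiseki-raft API:

```rust
use std::collections::HashSet;

/// Removal is safe only if the remaining healthy voters can still form
/// a majority of the post-removal voter set.
fn can_remove_safely(voters: &HashSet<u64>, unreachable: &HashSet<u64>, to_remove: u64) -> bool {
    if !voters.contains(&to_remove) {
        return false; // not a voter in this group
    }
    let remaining = voters.len() - 1;
    if remaining == 0 {
        return false; // never remove the last voter
    }
    let quorum = remaining / 2 + 1;
    // Voters that would still be alive and in the group after removal.
    let healthy = voters
        .iter()
        .filter(|n| **n != to_remove && !unreachable.contains(n))
        .count();
    healthy >= quorum
}

fn main() {
    let voters: HashSet<u64> = [1, 2, 3].into();
    let none: HashSet<u64> = HashSet::new();
    // 3 -> 2 voters, both healthy: a quorum of 2 is still reachable.
    assert!(can_remove_safely(&voters, &none, 3));
    // If node 2 is already down, removing node 3 leaves only node 1
    // healthy in a 2-voter group: quorum would be lost.
    let down: HashSet<u64> = [2].into();
    assert!(!can_remove_safely(&voters, &down, 3));
}
```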
Shard lifecycle
| Event | Description |
|---|---|
| Create | New shard created when a namespace is created |
| Split | Mandatory split when shard exceeds ceiling (I-L6): delta count, byte size, or throughput |
| Maintenance | Shard set to read-only; writes rejected with retriable error (I-O6) |
| Compaction | Header-only merge; tenant-encrypted payloads carried opaquely (I-O2) |
| GC | Delta garbage collection after all consumers advance past the delta (I-L4) |
Shard splits do not block writes to the existing shard during the split operation (I-O1).
Consistency guarantees
| Scope | Guarantee | Mechanism |
|---|---|---|
| Intra-shard | Total order | Raft sequence numbers |
| Cross-shard | Causal ordering | HLC (Hybrid Logical Clock) |
| Cross-site | Eventual consistency | Async replication via federation |
| Writes | CP (no split-brain) | Raft majority commit (I-CS1) |
| Reads | Bounded staleness | Per view descriptor, subject to compliance floor (I-CS2) |
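The HLC row deserves a worked sketch: a Hybrid Logical Clock combines a physical component with a logical counter so causally related events compare in order even when wall clocks are skewed across shards. This is the textbook HLC algorithm, not Kiseki's exact implementation:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical: u64, // e.g. milliseconds since epoch
    logical: u32,  // tie-breaker within one physical tick
}

struct Clock {
    last: Hlc,
}

impl Clock {
    fn new() -> Self {
        Clock { last: Hlc { physical: 0, logical: 0 } }
    }

    /// Local event or send: advance past both wall clock and last issued.
    fn tick(&mut self, wall: u64) -> Hlc {
        if wall > self.last.physical {
            self.last = Hlc { physical: wall, logical: 0 };
        } else {
            self.last.logical += 1;
        }
        self.last
    }

    /// Receive: merge the remote timestamp to preserve causality.
    fn observe(&mut self, wall: u64, remote: Hlc) -> Hlc {
        let max_phys = wall.max(self.last.physical).max(remote.physical);
        let logical = if max_phys == self.last.physical && max_phys == remote.physical {
            self.last.logical.max(remote.logical) + 1
        } else if max_phys == self.last.physical {
            self.last.logical + 1
        } else if max_phys == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        self.last = Hlc { physical: max_phys, logical };
        self.last
    }
}

fn main() {
    let mut a = Clock::new();
    let mut b = Clock::new();
    // Shard A stamps a write at wall time 100, then shard B (whose wall
    // clock lags at 90) observes it. B's timestamp still orders after A's.
    let ta = a.tick(100);
    let tb = b.observe(90, ta);
    assert!(tb > ta);
}
```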
Transport Layer
The kiseki-transport crate provides a pluggable transport abstraction
for bidirectional byte-stream connections. It ships with a TCP+TLS
reference implementation and feature-flagged support for HPC fabric
transports.
Transport trait
The Transport trait is the core abstraction:
pub trait Transport: Send + Sync + 'static {
    type Connection: Connection;

    async fn connect(&self, addr: SocketAddr) -> Result<Self::Connection>;
    async fn listen(&self, addr: SocketAddr) -> Result<Listener>;
}

pub trait Connection: AsyncRead + AsyncWrite + Send + Unpin + 'static {
    fn peer_identity(&self) -> Option<&PeerIdentity>;
}
All components (client, server, Raft) use this trait, enabling transport selection without code changes.
TCP+TLS (reference implementation)
The TcpTlsTransport is always available and serves as the universal
fallback:
- mTLS: Cluster CA validation with per-tenant certificates (I-Auth1, I-K13)
- SPIFFE: SAN-based SVID validation for workload identity (I-Auth3)
- CRL: Optional certificate revocation list support via KISEKI_CRL_PATH
- Connection pooling: Configurable pool size per peer
- Keepalive: TCP keepalive for connection health
- Timeouts: Configurable connect, read, and write timeouts
Configuration: TlsConfig with CA cert, node cert, node key, and
optional CRL path.
RDMA verbs (feature: verbs)
Native InfiniBand and RoCEv2 support for low-latency HPC fabrics:
- InfiniBand: Direct RDMA over InfiniBand fabric (VerbsIb)
- RoCEv2: RDMA over Converged Ethernet (VerbsRoce)
- Device selection: Auto-detects the first available IB device, or uses the device named in KISEKI_IB_DEVICE
- Zero-copy: RDMA read/write for chunk data transfer
The verbs module uses unsafe code for FFI calls to libibverbs.
Each unsafe block has a per-block SAFETY comment.
CXI/libfabric (feature: cxi)
HPE Slingshot fabric support via libfabric:
- CXI provider: Lowest-latency transport on Slingshot-equipped systems
- libfabric: Uses the libfabric API (fi_* calls) for fabric operations
- Feature-flagged: Only compiled when the cxi feature is enabled
The CXI module uses unsafe code for FFI calls to libfabric.
FabricSelector
The FabricSelector provides priority-based transport selection with
automatic failover:
Priority 0: CXI (Slingshot, lowest latency)
Priority 1: VerbsIb (InfiniBand)
Priority 2: VerbsRoce (RoCEv2)
Priority 3: TcpTls (always available, universal fallback)
At boot, the selector probes for available transports (hardware presence check). On connection, it selects the highest-priority available transport. On failure, it falls back to the next-best transport.
The TransportHealthTracker monitors transport health and marks transports
as unhealthy after repeated failures, temporarily removing them from
selection until they recover.
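The selection logic described above amounts to "highest priority, skipping unhealthy." A minimal sketch, with the transport names from this section but an illustrative (assumed) health-tracking structure:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TransportKind {
    Cxi,       // priority 0: Slingshot, lowest latency
    VerbsIb,   // priority 1: InfiniBand
    VerbsRoce, // priority 2: RoCEv2
    TcpTls,    // priority 3: universal fallback
}

struct FabricSelector {
    /// Probed at boot, kept in priority order (hardware presence check).
    available: Vec<TransportKind>,
    /// Transports temporarily removed after repeated failures.
    unhealthy: Vec<TransportKind>,
}

impl FabricSelector {
    /// Pick the highest-priority transport that is currently healthy.
    fn select(&self) -> Option<TransportKind> {
        self.available
            .iter()
            .copied()
            .find(|t| !self.unhealthy.contains(t))
    }

    fn mark_unhealthy(&mut self, t: TransportKind) {
        if !self.unhealthy.contains(&t) {
            self.unhealthy.push(t);
        }
    }
}

fn main() {
    // A Slingshot node with TCP fallback.
    let mut sel = FabricSelector {
        available: vec![TransportKind::Cxi, TransportKind::TcpTls],
        unhealthy: vec![],
    };
    assert_eq!(sel.select(), Some(TransportKind::Cxi));

    // After repeated CXI failures, the selector falls back to TCP+TLS.
    sel.mark_unhealthy(TransportKind::Cxi);
    assert_eq!(sel.select(), Some(TransportKind::TcpTls));
}
```

Because TcpTls is always probed as available, `select` only returns `None` if every transport, including the fallback, has been marked unhealthy.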
GPU-direct (planned)
Future support for direct GPU memory access:
- NVIDIA cuFile (feature: gpu-cuda): GPUDirect Storage for direct NVMe-to-GPU data transfer
- AMD ROCm (feature: gpu-rocm): ROCm-based GPU direct access
These features bypass CPU memory for chunk data, reducing latency for AI training workloads.
NUMA-aware thread pinning
The NumaTopology module provides NUMA-aware thread pinning for optimal
memory locality:
- Auto-detects the NUMA topology on Linux; threads are pinned via sched_setaffinity
- Pins I/O threads to the NUMA node closest to the network device
- Reduces cross-NUMA memory access latency for high-throughput workloads
Metrics and health
The transport layer exports Prometheus metrics via TransportMetrics:
- Connection count per transport type
- Bytes sent/received per transport
- Connection errors and failover events
- Latency histograms per transport
Health tracking (TransportHealthTracker) provides per-transport health
status for the selector’s failover decisions.
Invariant mapping
| Invariant | How the transport layer enforces it |
|---|---|
| I-K2 | All data on the wire is TLS-encrypted (or pre-encrypted chunks over CXI) |
| I-K13 | mTLS with Cluster CA validation on every data-fabric connection |
| I-Auth1 | Client certificate required on data fabric |
| I-Auth3 | SPIFFE SVID validation via SAN matching |
Client-Side Cache (ADR-031)
The native client (kiseki-client) includes a two-tier read-only cache
of decrypted plaintext chunks. The cache is a performance feature, not a
correctness mechanism – it is ephemeral and wiped on process restart or
extended disconnect.
Architecture
┌────────────────────────────────────────────┐
│ kiseki-client process │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ L1: In-memory cache │ │
│ │ Zeroizing<Vec<u8>> entries │ │
│ │ Content-addressed by ChunkId │ │
│ └──────────────┬───────────────────────┘ │
│ │ miss │
│ ┌──────────────▼───────────────────────┐ │
│ │ L2: Local NVMe cache pool │ │
│ │ CRC32 integrity per entry │ │
│ │ Per-process, per-tenant isolation │ │
│ └──────────────┬───────────────────────┘ │
│ │ miss │
│ ▼ │
│ Fetch from canonical │
│ (verify by ChunkId SHA-256) │
└────────────────────────────────────────────┘
L1 (in-memory): Fast access to recently-used chunks. Entries use
Zeroizing<Vec<u8>> so plaintext is overwritten with zeros on eviction
or deallocation (I-CC2).
L2 (local NVMe): Larger cache on local storage. Each entry has a CRC32 checksum trailer computed at insert time (I-CC13). On read, the CRC32 is verified before serving; mismatch triggers bypass to canonical and entry deletion (I-CC7).
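The L2 entry layout and verify-on-read path can be sketched as follows. The little-endian 4-byte trailer encoding is an assumption; the bitwise CRC-32 (IEEE polynomial) below is the standard algorithm:

```rust
/// Standard CRC-32 (IEEE polynomial), bitwise form.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Insert path: append the checksum trailer computed at insert time.
fn encode_entry(chunk: &[u8]) -> Vec<u8> {
    let mut entry = chunk.to_vec();
    entry.extend_from_slice(&crc32(chunk).to_le_bytes());
    entry
}

/// Read path: verify before serving. None means "bypass to canonical
/// and delete this entry" (I-CC7).
fn decode_entry(entry: &[u8]) -> Option<&[u8]> {
    if entry.len() < 4 {
        return None;
    }
    let (chunk, trailer) = entry.split_at(entry.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().unwrap());
    (crc32(chunk) == stored).then_some(chunk)
}

fn main() {
    let entry = encode_entry(b"chunk-plaintext");
    assert_eq!(decode_entry(&entry), Some(&b"chunk-plaintext"[..]));

    // A single flipped bit is caught and the entry is rejected.
    let mut corrupt = entry.clone();
    corrupt[3] ^= 0x01;
    assert_eq!(decode_entry(&corrupt), None);
}
```

Note that CRC32 detects accidental corruption only; end-to-end integrity against substitution still comes from verifying the ChunkId SHA-256 on fetch from canonical.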
Cache modes
Three modes are available per client session (selected at session establishment):
| Mode | Behavior | Use case |
|---|---|---|
| Pinned | Staging-driven, eviction-resistant; for declared datasets | HPC pre-staging (Slurm prolog) |
| Organic | LRU with usage-weighted retention | Mixed workloads (default) |
| Bypass | No caching | Streaming, checkpoint workloads |
Mode is per session, not per file. The admin controls which modes are available for each workload.
Staging API
Staging pre-fetches a dataset’s chunks into the L2 cache with pinned retention:
kiseki-client stage --dataset /path/to/data
- Takes a namespace path and recursively enumerates compositions
- Fetches and verifies all chunks from canonical (SHA-256 match)
- Stores chunks in L2 with pinned retention
- Produces a manifest file listing staged compositions and chunk IDs
Staging is idempotent and resumable. Limits: max_staging_depth (10),
max_staging_files (100,000).
Pool handoff
The staging daemon and workload process can be different processes (e.g., Slurm prolog stages, then the workload runs):
- Staging daemon holds the L2 pool via flock on pool.lock
- Workload process adopts the pool via the KISEKI_CACHE_POOL_ID env var
- Workload takes over the flock
Each cache pool is identified by a 128-bit CSPRNG pool_id, isolated
per process and per tenant.
Freshness and staleness
Metadata TTL (default 5s): File-to-chunk-list mappings are cached with a configurable TTL. Within the TTL, cached metadata is authoritative and may serve data for files that have since been modified (I-CC3, I-CC5).
Chunk data: No TTL needed. Chunks are immutable (I-C1), so a verified chunk remains correct indefinitely absent crypto-shred.
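A minimal sketch of the TTL check, with illustrative field names: within the TTL the cached mapping is served as authoritative; after it expires the client must re-resolve against canonical.

```rust
use std::time::{Duration, Instant};

struct CachedMapping {
    chunk_ids: Vec<[u8; 32]>, // SHA-256 chunk IDs for the file
    fetched_at: Instant,
}

struct MetadataCache {
    ttl: Duration, // default 5s (I-CC3)
}

impl MetadataCache {
    /// Within the TTL the cached mapping is authoritative; the TTL is
    /// therefore the upper bound on read staleness (I-CC5).
    fn lookup<'a>(&self, entry: &'a CachedMapping, now: Instant) -> Option<&'a [[u8; 32]]> {
        (now.duration_since(entry.fetched_at) < self.ttl).then(|| entry.chunk_ids.as_slice())
    }
}

fn main() {
    let cache = MetadataCache { ttl: Duration::from_secs(5) };
    let t0 = Instant::now();
    let entry = CachedMapping { chunk_ids: vec![[0u8; 32]], fetched_at: t0 };

    // Fresh: served from cache.
    assert!(cache.lookup(&entry, t0 + Duration::from_secs(3)).is_some());
    // Expired: caller must re-fetch the mapping from canonical.
    assert!(cache.lookup(&entry, t0 + Duration::from_secs(6)).is_none());
}
```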
Crypto-shred detection
On crypto-shred (tenant KEK destruction), all cached plaintext must be wiped (I-CC12):
Detection mechanisms (in priority order):
- Advisory channel notification (if active)
- KMS error on next operation
- Periodic key health check (default every 30s)
Response: Immediate wipe of L1 and L2 with zeroize.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds).
Disconnect handling
If the client loses connectivity to all canonical endpoints for longer
than max_disconnect_seconds (default 300s), the entire cache (L1 + L2)
is wiped (I-CC6).
A background heartbeat RPC (every 60s) maintains the last_successful_rpc
timestamp for disconnect detection.
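The wipe decision reduces to a single comparison against the heartbeat-maintained timestamp. A sketch with illustrative names:

```rust
use std::time::{Duration, Instant};

/// Wipe L1 + L2 once the gap since the last successful RPC exceeds
/// max_disconnect_seconds (I-CC6).
fn must_wipe(last_successful_rpc: Instant, now: Instant, max_disconnect: Duration) -> bool {
    now.duration_since(last_successful_rpc) > max_disconnect
}

fn main() {
    let max_disconnect = Duration::from_secs(300); // default 300s
    let last = Instant::now();

    // 2 minutes offline: keep serving verified chunks from cache.
    assert!(!must_wipe(last, last + Duration::from_secs(120), max_disconnect));
    // 6 minutes offline: wipe the entire cache.
    assert!(must_wipe(last, last + Duration::from_secs(360), max_disconnect));
}
```

With the 60s heartbeat, the gap grows in roughly one-minute steps, so the wipe fires within one heartbeat interval of crossing the threshold.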
Error handling
Any local cache error bypasses to canonical unconditionally (I-CC7):
- L2 I/O failure: bypass and flag pool for scrub
- CRC32 mismatch: bypass, delete corrupt entry
- Metadata lookup failure: bypass to canonical
Invariants
| ID | Rule |
|---|---|
| I-CC1 | A cached chunk is served only if content-address verified and no crypto-shred detected |
| I-CC2 | Cached plaintext is zeroized before deallocation, eviction, or cache wipe |
| I-CC3 | File-to-chunk metadata served from cache only within TTL (default 5s) |
| I-CC5 | Metadata TTL is the upper bound on read staleness |
| I-CC6 | Disconnect beyond threshold triggers full cache wipe |
| I-CC7 | Any cache error bypasses to canonical unconditionally |
| I-CC8 | Cache is ephemeral; wiped on process start (or adopted via pool handoff) |
| I-CC9 | Unreachable cache policy falls back to conservative defaults |
| I-CC10 | Cache policy changes apply to new sessions only |
| I-CC11 | Staged chunks are a point-in-time snapshot; re-stage to pick up updates |
| I-CC12 | Crypto-shred triggers immediate cache wipe with zeroize |
| I-CC13 | L2 entries protected by CRC32 checksum trailer |
Environment variables
| Variable | Default | Description |
|---|---|---|
| KISEKI_CACHE_MODE | organic | Cache mode: organic, pinned, or bypass |
| KISEKI_CACHE_DIR | /tmp/kiseki-cache | L2 pool directory on local NVMe |
| KISEKI_CACHE_L2_MAX | 50 GB | Maximum L2 cache size in bytes |
| KISEKI_CACHE_POOL_ID | (generated) | Adopt an existing pool (for staging handoff) |
Security Model
Kiseki is designed with security as a foundational constraint, not a bolted-on feature. The system enforces strong tenant isolation, mandatory encryption, and a zero-trust boundary between infrastructure operators and tenants.
Zero-trust boundary
Kiseki enforces a strict separation between two administrative domains:
Cluster admin (infrastructure operator)
- Manages nodes, global policy, system keys, pools, devices.
- Cannot access tenant config, logs, or data without explicit tenant admin approval (I-T4).
- Sees operational metrics in tenant-anonymous or aggregated form.
- Modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c).
Tenant admin (data owner)
- Controls tenant keys, projects, workload authorization, compliance tags, user access.
- Grants or denies cluster admin access requests.
- Receives tenant-scoped audit exports sufficient for independent compliance demonstration.
- Can crypto-shred to render all tenant data unreadable.
Access request flow
When a cluster admin needs access to tenant resources (for debugging, migration, etc.):
- Cluster admin submits an access request via the control plane.
- The request is recorded in the audit log.
- Tenant admin reviews and approves or denies.
- If approved, access is time-bounded and scoped.
- All access is audit-logged to the tenant’s shard.
Encryption at rest
Every chunk stored on disk is encrypted. There are no exceptions.
- Algorithm: AES-256-GCM (authenticated encryption with associated data).
- Key derivation: HKDF-SHA256 derives per-chunk DEKs from a system master key and the chunk ID (ADR-003).
- Envelope: Each chunk carries an envelope containing ciphertext, system-layer wrapping metadata, tenant-layer wrapping metadata, and authenticated metadata (chunk ID, algorithm identifiers, key epoch).
What is encrypted
| Data | Encryption | Location |
|---|---|---|
| Chunk data on disk | System DEK (AES-256-GCM) | Data devices |
| Inline small-file content | System DEK | small/objects.redb |
| Delta payloads (filenames, attributes) | System DEK, wrapped with tenant KEK | Raft log / redb |
| Delta headers (sequence, shard, operation type, timestamp) | Cleartext or system-encrypted | Raft log / redb |
| Backup data | System-encrypted | External backup target |
| Federation replication | Ciphertext-only | Replication stream |
What is NOT encrypted
- Delta headers: Compaction operates on headers only (I-O2). Headers contain no tenant-attributable content.
- Prometheus metrics: Aggregated counters and histograms. No tenant-attributable data in metric labels.
- Health/liveness probes: bare 200 OK responses.
Encryption in transit
All data-fabric communication uses mTLS. No plaintext data crosses the network.
- Data path: mTLS with per-tenant certificates signed by the Cluster CA (I-K2).
- Raft consensus: mTLS between Raft peers.
- Key manager: mTLS between storage nodes and the key manager.
- Client to gateway: TLS (clients send plaintext over TLS; the gateway encrypts before writing).
- Native client: Client-side encryption (plaintext never leaves the workload process).
Protocol gateway encryption
Protocol gateway clients (NFS, S3) send plaintext over TLS to the gateway. The gateway performs tenant-layer encryption before writing to the storage layer. This means plaintext exists in gateway process memory but never on the wire in cleartext and never at rest.
Native client encryption
Native clients (FUSE, FFI, Python) perform tenant-layer encryption themselves. Plaintext never leaves the workload process and never traverses the data fabric.
FIPS 140-2/3 compliance
Kiseki uses aws-lc-rs as its cryptographic backend, which provides
a FIPS 140-2/3 validated implementation of:
- AES-256-GCM (authenticated encryption)
- HKDF-SHA256 (key derivation)
- SHA-256 (content-addressed chunk IDs)
- HMAC-SHA256 (per-tenant chunk IDs for opted-out tenants)
The FIPS feature is controlled by the kiseki-crypto/fips feature
flag at compile time.
Crypto-agility
Envelope metadata carries algorithm identifiers for crypto-agility. If a new algorithm is needed (e.g., post-quantum), envelopes can carry the new algorithm identifier alongside the existing one during a transition period.
No plaintext past gateway boundary (I-K1, I-K2)
This is the fundamental security invariant. Kiseki guarantees:
- No plaintext chunk is ever persisted to storage (I-K1).
- No plaintext payload is ever sent on the wire between any components (I-K2).
- The system can enforce access to ciphertext without being able to read plaintext without tenant key material (I-K4).
Where plaintext exists
Plaintext exists only in:
- Client process memory: For native clients that perform client-side encryption.
- Gateway process memory: Transiently, while the gateway encrypts protocol-path data.
- Stream processor memory: Stream processors cache tenant key material and are in the tenant trust domain (I-O3).
- Client cache (L1): In-memory cache of decrypted chunks (zeroized on eviction or deallocation, I-CC2).
- Client cache (L2): On-disk cache of decrypted chunks on local NVMe (zeroized before unlink, I-CC2).
Content-addressed chunk IDs
Chunk identity is derived from content, serving both dedup and integrity:
- Default: chunk_id = SHA-256(plaintext). Enables cross-tenant dedup.
- Opted-out tenants: chunk_id = HMAC-SHA256(plaintext, tenant_key). Cross-tenant dedup is impossible. Zero co-occurrence leak (I-K10).
Tenants that opt out of cross-tenant dedup pay a storage overhead (identical data stored separately per tenant) but gain the guarantee that no metadata (chunk IDs, refcounts) leaks information about data similarity across tenants.
Audit trail
All security-relevant events are recorded in an append-only, immutable audit log with the same durability guarantees as the data log (I-A1).
Audit events include:
- Data access (read/write by tenant, workload, client)
- Key lifecycle (rotation, crypto-shred, KMS health)
- Admin actions (pool changes, device management, tuning parameters)
- Policy changes (quotas, compliance tags, advisory policy)
- Authentication events (mTLS success/failure, cert revocation)
Audit scoping
- Tenant audit export: Filtered to the tenant’s own events plus relevant system events. Delivered on the tenant’s VLAN (I-A2). Sufficient for independent compliance demonstration (HIPAA Section 164.312 audit controls).
- Cluster admin audit view: System-level events only. Tenant-anonymous or aggregated (I-A3).
Runtime integrity
An optional runtime integrity monitor detects attempts to access Kiseki process memory (I-O7):
- ptrace detection
- /proc/pid/mem access monitoring
- Debugger attachment detection
- Core dump attempt detection
On detection, the monitor alerts both cluster admin and tenant admin. Optional auto-rotation of keys can be configured as a response.
STRIDE Threat Analysis
Systematic analysis of Kiseki’s attack surfaces using the STRIDE framework.
Spoofing (identity)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Rogue node joins cluster | Raft peer handshake | mTLS with Cluster CA — only certs signed by the cluster CA are accepted. Raft RPC server rejects plaintext when TLS is configured. | I-Auth1, I-K13 |
| Client impersonates tenant | Data fabric connection | mTLS required. OrgId extracted from cert OU or SPIFFE SAN. Fallback: UUID v5 from cert fingerprint (no anonymous access). | I-Auth1, I-Auth3 |
| Forged S3 request | S3 gateway | SigV4 signature validation with HMAC-SHA256 (constant-time comparison). x-amz-date required, host must be signed. | SigV4 auth |
| Forged JWT token | OIDC second-stage | alg=none rejected unconditionally. HS256 verified via HMAC. RS256/ES256 verified via JWKS with key ID matching. | I-Auth2 |
| NFS UID spoofing | NFS gateway | AUTH_SYS trusts client-asserted UID (known limitation). Mitigated by: network segmentation, Kerberos for production, per-export allowed method list. | NFS auth |
| Replay of captured request | S3 gateway | Timestamp validation (TODO: ±15min window). Captured Raft RPCs are harmless (Raft rejects stale term/log index). | SigV4 |
Residual risk: NFS AUTH_SYS is inherently spoofable. Production deployments MUST use Kerberos or restrict NFS to trusted networks.
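The "constant-time comparison" in the SigV4 row means the comparison time must not depend on where the first mismatching byte occurs. A hedged sketch of the technique; production code would use a vetted primitive (e.g. from aws-lc-rs) rather than hand-rolling:

```rust
/// Compare two byte strings without early exit: XOR-accumulate every
/// byte so timing does not reveal the position of the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are public for fixed-size signatures
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}

fn main() {
    // e.g. comparing a computed HMAC-SHA256 signature (hex) against the
    // one presented in the Authorization header.
    assert!(constant_time_eq(b"deadbeef", b"deadbeef"));
    assert!(!constant_time_eq(b"deadbeef", b"deadbeee"));
    assert!(!constant_time_eq(b"deadbeef", b"dead")); // length mismatch
}
```

A naive `==` on byte slices may short-circuit at the first difference, which lets an attacker recover a valid signature byte-by-byte by measuring response latency.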
Tampering (data integrity)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Modify chunk on disk | Block device | CRC32C on every extent read. Mismatch → EC repair from parity. Periodic scrub with configurable sample rate. | I-C7, I-C8 |
| Modify chunk in transit | Fabric | TLS 1.3 (authenticated encryption). RDMA paths use pre-encrypted chunks. | I-K2, I-Auth1 |
| Modify Raft log entry | Raft replication | Raft consensus — committed entries are immutable (I-L3). Log entries validated by majority before commit. WAL journal for crash-safe bitmap. | I-L2, I-L3 |
| Tamper with envelope | Crypto layer | AES-256-GCM authenticated encryption. Tampered ciphertext, auth tag, or nonce → decryption failure. AAD binding to chunk_id prevents envelope splicing (I-K17). | I-K7, I-K17 |
| Modify L2 cache file | Client NVMe | CRC32 trailer on every L2 read. Mismatch → bypass to canonical + delete corrupt entry. | I-CC7, I-CC13 |
| Corrupt staging manifest | Client cache | Invalid JSON silently skipped during manifest load. No data served from unverifiable source. | I-CC7 |
Repudiation (deniability)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Admin denies action | Control plane | All admin operations (maintenance, quota, compliance, key rotation) recorded in cluster audit shard with timestamp, identity, and parameters. | I-A1, I-A6 |
| Tenant denies access | Data path | All data access operations auditable. Tenant audit export provides filtered, coherent trail for compliance (HIPAA §164.312). | I-A2 |
| Advisory abuse denied | Workflow advisory | Advisory lifecycle events (declare, end, phase-advance, budget-exceeded) logged per-occurrence. High-volume events sampled with per-second-per-workflow counts. | I-WA8 |
| Device state change denied | Storage | Device state transitions (Healthy→Degraded→Evacuating→Failed→Removed) recorded with timestamp, reason, admin identity. | I-D2 |
| Crypto-shred denied | Key management | Shred event logged in tenant audit shard. Key health check provides detection confirmation. Cache wipe events counted. | I-K5, I-CC12 |
Information disclosure (confidentiality)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Plaintext leak on wire | All RPCs | TLS mandatory on all data fabric connections. No plaintext payloads transmitted. | I-K1, I-K2 |
| Plaintext on disk (server) | Chunk storage | All chunks encrypted at rest with system DEK (AES-256-GCM). No plaintext persisted on storage nodes. Compaction operates on headers only — never decrypts payloads. | I-K1, I-O2 |
| Plaintext on disk (client) | L2 cache | Cached plaintext on compute-node NVMe (same trust domain as process memory). File permissions 0600. Zeroize on eviction/wipe. Crash scrubber for orphaned pools. FTL residual risk documented. | I-CC2, I-CC8 |
| Cross-tenant data leak | Multi-tenant | Full tenant isolation (I-T1). Per-tenant encryption keys. Cluster admin cannot access tenant data without approval (I-T4). HMAC-keyed chunk IDs for dedup-opted-out tenants prevent co-occurrence analysis. | I-T1, I-T3, I-K10 |
| Telemetry leaks tenant info | Advisory | Telemetry scoped to caller’s authorization. k-anonymity (k≥5) over neighbour workloads. Response shape unchanged under low-k conditions. Timing and size bucketed to prevent covert channels. | I-WA5, I-WA6, I-WA15 |
| Error messages leak state | All APIs | AuthError returns generic failures. KmsError uses enum variants not freeform strings. Advisory requests for unauthorized targets return same shape as absent targets. | I-WA6 |
| Core dump exposes keys | Server/client | Key material wrapped in Zeroizing<Vec<u8>>. Runtime integrity monitor detects debugger/ptrace. | I-K8, I-O7 |
| Log messages leak data | Structured logging | Structured tracing with typed fields. No plaintext in log events. Tenant-scoped identifiers hashed in cluster-admin views. | I-A3, I-K8 |
Denial of service (availability)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Raft leader flooding | Raft consensus | MAX_RAFT_RPC_SIZE (128MB) rejects oversized messages. Per-shard throughput guard (I-SF7) limits inline write rate. | ADV-S1, I-SF7 |
| Advisory hint flooding | Workflow advisory | Per-workload hint budget (hints/sec, concurrent workflows). Budget exceeded → local degradation only. Advisory isolated from data path (I-WA2). | I-WA7, I-WA16, I-WA17 |
| Connection pool exhaustion | Transport | max_per_endpoint connection cap. Circuit breaker trips after threshold failures. FabricSelector falls back to TCP. | Transport health |
| Disk exhaustion (metadata) | System NVMe | ADR-030 dynamic inline threshold. Soft limit → threshold reduction. Hard limit → threshold floor + alert via out-of-band gRPC. | I-SF1, I-SF2 |
| Disk exhaustion (data) | Device pools | Per-pool capacity thresholds (Warning/Critical/Full). Writes rejected at Critical. Pool rebalancing. | I-C5 |
| Cache exhaustion (client) | Client NVMe | Per-process max_cache_bytes. Per-node max_node_cache_bytes (80% of filesystem). Disk-pressure backstop at 90%. | ADR-031 §8 |
| Audit log backpressure | Audit | Safety valve: if audit export stalls >24h, data GC proceeds with documented gap. Per-tenant configurable backpressure mode. | I-A5 |
| Shard split storm | Log | Exponential backoff per shard (2h floor, 24h cap). Cluster-wide concurrent migrations bounded by max(1, num_nodes/10). | I-SF4 |
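The shard-split-storm row combines two limits: per-shard exponential backoff (2h floor, 24h cap) and a cluster-wide migration bound of max(1, num_nodes/10). A sketch of both, assuming doubling per attempt (the doubling base is an assumption; the floor and cap come from the table above):

```rust
/// Per-shard retry delay in seconds: doubles each attempt, clamped to
/// the 2h floor and 24h cap (I-SF4).
fn split_backoff_secs(attempt: u32) -> u64 {
    const FLOOR: u64 = 2 * 3600; // 2h
    const CAP: u64 = 24 * 3600;  // 24h
    let delay = FLOOR.saturating_mul(1u64 << attempt.min(20));
    delay.clamp(FLOOR, CAP)
}

/// Cluster-wide bound on concurrent shard migrations.
fn max_concurrent_migrations(num_nodes: u64) -> u64 {
    (num_nodes / 10).max(1)
}

fn main() {
    assert_eq!(split_backoff_secs(0), 2 * 3600);   // first retry: floor
    assert_eq!(split_backoff_secs(1), 4 * 3600);
    assert_eq!(split_backoff_secs(3), 16 * 3600);
    assert_eq!(split_backoff_secs(4), 24 * 3600);  // 32h clamped to cap
    assert_eq!(split_backoff_secs(60), 24 * 3600); // stays at cap

    assert_eq!(max_concurrent_migrations(4), 1);   // small cluster: 1
    assert_eq!(max_concurrent_migrations(30), 3);
}
```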
Elevation of privilege (authorization)
| Threat | Attack surface | Mitigation | Invariant |
|---|---|---|---|
| Cluster admin accesses tenant data | Control plane | Zero-trust boundary. Access requires explicit tenant admin approval, time-bounded, scope-limited, audit-logged. | I-T4, I-T4c |
| Tenant escapes namespace | Data path | Namespace isolation per tenant. Cross-shard operations return EXDEV (I-L8). Compositions belong to exactly one tenant (I-X1). | I-T1, I-X1, I-L8 |
| Hint escalates priority | Advisory | Hints cannot extend capability. Cannot cause operation success that would otherwise be rejected. Cannot cross namespace/tenant boundary. Cannot bypass retention hold. | I-WA14 |
| Client escalates cache policy | Client cache | Client selections bounded by admin-set ceilings. Policy narrowing only (child ≤ parent). cache_enabled=false at any level → disabled for all children. | I-CC10, I-WA7 |
| KMS provider escalation | Key management | Provider abstraction opaque to callers (I-K16). No access-control decision depends on provider type. Provider migration requires 100% re-wrap before atomic switch. | I-K16, I-K20 |
| gRPC method escalation | Control plane | Per-method authorization. 9 admin-only methods gated by require_admin(). Unknown role → rejected. | gRPC authz |
Summary
| STRIDE Category | Threats identified | Mitigated | Residual risk |
|---|---|---|---|
| Spoofing | 6 | 5 | NFS AUTH_SYS UID spoofing (use Kerberos in prod) |
| Tampering | 6 | 6 | None — all paths have integrity verification |
| Repudiation | 5 | 5 | None — comprehensive audit trail |
| Information disclosure | 8 | 7 | Client L2 NVMe FTL residual (use OPAL/SED) |
| Denial of service | 8 | 8 | None — all paths have rate limiting/backpressure |
| Elevation of privilege | 6 | 6 | None — defense in depth at every boundary |
| Total | 39 | 37 | 2 documented residual risks |
Both residual risks have documented mitigations:
- NFS AUTH_SYS → deploy Kerberos or restrict to trusted networks
- NVMe FTL data remanence → deploy OPAL/SED with per-boot key rotation
Authentication
Kiseki uses a layered authentication model. The primary mechanism is mTLS with certificates signed by a Cluster CA. Optional second-stage authentication via tenant identity providers adds workload-level authorization.
mTLS with Cluster CA (I-Auth1)
The Cluster CA is the trust root for all data-fabric authentication. Every participant in the data fabric (storage nodes, gateways, clients, stream processors) presents a certificate signed by the Cluster CA.
Certificate hierarchy
Cluster CA (managed by cluster admin)
|
+-- Server certificates (per storage node)
| SAN: node hostname, IP address
| OU: kiseki-server
|
+-- Key manager certificates (per key server)
| SAN: keyserver hostname, IP address
| OU: kiseki-keyserver
|
+-- Admin certificates (cluster admin)
| OU: kiseki-admin
|
+-- Tenant certificates (per tenant)
SAN: tenant identifier
OU: tenant-{org_id}
Properties
- No real-time auth server on data path (I-Auth1). Certificates are local credentials. Authentication is a TLS handshake, not an RPC to a central authority. This eliminates a latency-sensitive dependency on the data path.
- Per-tenant certificates: Each tenant’s clients and gateways present certificates that identify the tenant. The storage layer validates the certificate chain and extracts the tenant identity.
- Certificate revocation: Supported via CRL (KISEKI_CRL_PATH). The CRL is reloaded periodically. Revoked certificates are rejected at the TLS handshake.
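Identity extraction from the certificate subject follows the OU conventions in the hierarchy above. A sketch that classifies a pre-extracted OU string; pulling the OU from a parsed X.509 subject is elided, and the enum is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Identity {
    Server,
    KeyServer,
    Admin,
    Tenant(String), // carries the org_id
}

/// Map an OU value to a data-fabric identity. kiseki-* OUs identify
/// infrastructure roles; "tenant-{org_id}" identifies a tenant.
fn identity_from_ou(ou: &str) -> Option<Identity> {
    match ou {
        "kiseki-server" => Some(Identity::Server),
        "kiseki-keyserver" => Some(Identity::KeyServer),
        "kiseki-admin" => Some(Identity::Admin),
        other => other
            .strip_prefix("tenant-")
            .filter(|org| !org.is_empty())
            .map(|org| Identity::Tenant(org.to_string())),
    }
}

fn main() {
    assert_eq!(
        identity_from_ou("tenant-acme-corp"),
        Some(Identity::Tenant("acme-corp".into()))
    );
    assert_eq!(identity_from_ou("kiseki-admin"), Some(Identity::Admin));
    // Unknown OUs map to no identity: the connection is rejected.
    assert_eq!(identity_from_ou("unrelated"), None);
}
```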
Configuration
# On storage nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/server.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/server.key
KISEKI_CRL_PATH=/etc/kiseki/tls/crl.pem # optional
# On client nodes
KISEKI_CA_PATH=/etc/kiseki/tls/ca.crt
KISEKI_CERT_PATH=/etc/kiseki/tls/client.crt
KISEKI_KEY_PATH=/etc/kiseki/tls/client.key
Certificate generation example
# Generate Cluster CA (do this once)
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout ca.key -out ca.crt -days 3650 -nodes \
-subj "/CN=Kiseki Cluster CA"
# Generate server certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout server.key -out server.csr -nodes \
-subj "/CN=node1.example.com/OU=kiseki-server"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 365 \
-extfile <(echo "subjectAltName=DNS:node1.example.com,IP:10.0.0.1")
# Generate tenant client certificate
openssl req -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
-keyout tenant.key -out tenant.csr -nodes \
-subj "/CN=workload-1/OU=tenant-acme-corp"
openssl x509 -req -in tenant.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out tenant.crt -days 365
SPIFFE SVID (I-Auth3)
SPIFFE (Secure Production Identity Framework for Everyone) is available as an alternative to raw mTLS certificate management.
SPIFFE ID structure
spiffe://kiseki.example.com/tenant/{org_id}/workload/{workload_id}
spiffe://kiseki.example.com/tenant/{org_id}/project/{project_id}/workload/{workload_id}
The SPIFFE ID maps directly to the tenant hierarchy (organization/project/workload).
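A sketch of that mapping, accepting both path forms shown above. The trust domain check and error handling are simplified, and the struct is illustrative:

```rust
#[derive(Debug, PartialEq)]
struct WorkloadIdentity {
    org_id: String,
    project_id: Option<String>,
    workload_id: String,
}

/// Parse a SPIFFE ID into the tenant hierarchy, with or without the
/// optional project segment.
fn parse_spiffe_id(id: &str) -> Option<WorkloadIdentity> {
    let path = id.strip_prefix("spiffe://kiseki.example.com/")?;
    let parts: Vec<&str> = path.split('/').collect();
    match parts.as_slice() {
        ["tenant", org, "workload", wl] => Some(WorkloadIdentity {
            org_id: org.to_string(),
            project_id: None,
            workload_id: wl.to_string(),
        }),
        ["tenant", org, "project", proj, "workload", wl] => Some(WorkloadIdentity {
            org_id: org.to_string(),
            project_id: Some(proj.to_string()),
            workload_id: wl.to_string(),
        }),
        _ => None,
    }
}

fn main() {
    let id = parse_spiffe_id("spiffe://kiseki.example.com/tenant/acme/project/ml/workload/train-1")
        .expect("valid SVID path");
    assert_eq!(id.org_id, "acme");
    assert_eq!(id.project_id.as_deref(), Some("ml"));
    assert_eq!(id.workload_id, "train-1");
    // Wrong trust domain: rejected.
    assert!(parse_spiffe_id("spiffe://other.domain/tenant/x/workload/y").is_none());
}
```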
SPIRE integration
SPIRE (the SPIFFE Runtime Environment) handles certificate issuance and rotation automatically:
- SPIRE Server acts as the Cluster CA (or delegates to it).
- SPIRE Agent runs on each node (storage and compute).
- Workloads receive SVIDs via the Workload API.
- Certificates rotate automatically (no manual renewal).
Benefits over raw mTLS
- Automatic certificate rotation (no manual renewal ceremonies).
- Workload attestation (verify the workload binary, not just the certificate).
- Short-lived certificates reduce the window of compromise.
S3 SigV4 authentication
The S3 gateway supports AWS Signature Version 4 authentication for S3 API clients.
How it works
- The S3 client signs each request with an access key and secret key.
- The gateway validates the signature.
- The access key is mapped to a tenant identity via the control plane.
- Subsequent authorization is based on the tenant identity.
Configuration
Access keys are provisioned via the control plane:
kiseki-server s3-credentials create --tenant-id acme-corp --workload-id training-job-1
Compatibility
The SigV4 implementation supports standard S3 clients:
# AWS CLI
aws --endpoint-url http://node1:9000 s3 ls
# boto3
import boto3
s3 = boto3.client('s3', endpoint_url='http://node1:9000',
aws_access_key_id='...', aws_secret_access_key='...')
NFS authentication
The NFS gateway supports two authentication mechanisms:
Kerberos (recommended for production)
NFSv4.2 with Kerberos provides strong authentication:
- krb5: Authentication only.
- krb5i: Authentication + integrity.
- krb5p: Authentication + integrity + privacy (encrypted).
The Kerberos principal maps to a tenant identity.
AUTH_SYS (development only)
AUTH_SYS (traditional UNIX UID/GID authentication) is supported for development and testing. It provides no real security and should not be used in production. When AUTH_SYS is used, the NFS gateway maps the export path to a tenant identity.
OIDC/JWT second-stage authentication (I-Auth2)
Optional second-stage authentication validates workload identity against the tenant admin’s authorization. This provides an additional layer beyond the mTLS “belongs to this cluster” identity.
Architecture
Workload
|
v
mTLS (Cluster CA) --> "This workload belongs to tenant X"
|
v
OIDC/JWT (Tenant IdP) --> "This workload is authorized by tenant X's admin"
Integration
- Tenant admin configures their identity provider (Keycloak, Okta, Azure AD, etc.) in the control plane.
- Workloads obtain JWT tokens from the tenant IdP.
- On connection, the workload presents both:
- mTLS certificate (Cluster CA trust chain)
- JWT token (tenant IdP authorization)
- The storage node validates both independently.
Token validation
- JWT signature verification against the tenant IdP’s JWKS endpoint.
- Token expiry and audience validation.
- Claims mapping to tenant hierarchy (org, project, workload).
- No real-time IdP dependency on the data path: JWKS keys are cached and refreshed periodically.
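The "no real-time IdP dependency" property comes from caching the JWKS document and refreshing it on a timer. A minimal Python sketch of that cache (names like `fetch_jwks` and the 300-second TTL are illustrative assumptions; actual signature verification would use a JOSE library against the cached key):

```python
import time

class JwksCache:
    """Cache a tenant IdP's JWKS keys; refresh periodically instead of per-request.

    `fetch_jwks` stands in for an HTTPS GET of the IdP's JWKS endpoint and
    returns a list of JWK dicts, each carrying a "kid" (key ID)."""

    def __init__(self, fetch_jwks, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._fetch = fetch_jwks
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict = {}      # kid -> JWK
        self._fetched_at = None

    def key_for(self, kid: str):
        """Return the JWK for a token's kid, refreshing the cache if stale.
        Returns None for an unknown kid, in which case the token is rejected."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._keys = {k["kid"]: k for k in self._fetch()}
            self._fetched_at = now
        return self._keys.get(kid)
```

A transient IdP outage therefore degrades only key rotation, not in-flight token validation.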
gRPC role-based authorization
After authentication (mTLS + optional OIDC), gRPC services enforce role-based authorization:
Roles
| Role | Authentication | Access |
|---|---|---|
| Cluster admin | Admin certificate (OU: kiseki-admin) | StorageAdminService, ControlService (full) |
| SRE (read-only) | SRE certificate | StorageAdminService (read-only: List*, Get*, Status) |
| Tenant admin | Tenant certificate + OIDC (optional) | ControlService (tenant-scoped), AuditExportService |
| Workload | Tenant certificate + OIDC (optional) | Data-path services, WorkflowAdvisoryService |
Authorization enforcement
- StorageAdminService: Cluster admin only (mTLS cert with admin OU). SRE read-only role for monitoring.
- ControlService: Cluster admin for system operations, tenant admin for tenant-scoped operations.
- Data-path services (LogService, ChunkOps, CompositionOps, ViewOps): Any authenticated tenant workload, scoped to the tenant’s own data.
- WorkflowAdvisoryService: Any authenticated tenant workload. Per-operation authorization (I-WA3): every request re-validates the caller’s mTLS identity against the workflow’s owning workload.
Cluster admin isolation (I-T4)
The cluster admin certificate grants access to infrastructure management but explicitly does NOT grant access to:
- Tenant configuration
- Tenant audit logs
- Tenant data (read or write)
- Tenant key material
Access to tenant resources requires an explicit access request approved by the tenant admin.
Client identity
Client ID (native client)
Each native client process generates a stable identifier at startup:
- 128-bit CSPRNG value.
- Bound to the workload’s mTLS certificate at first use.
- Scoped within (org, project, workload).
- Never reused across processes (I-WA4).
The client ID ties an operation stream to a single process instance. It is not a user identity and not a session token.
Workflow reference
For advisory-enabled workloads, a workflow reference is attached to
data-path RPCs as a gRPC binary metadata entry
(x-kiseki-workflow-ref-bin). This is a 16-byte opaque handle,
generated with 128+ bits of entropy, never reused, and verified
against the caller’s mTLS identity on every request (I-WA3, I-WA10).
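The handle and metadata shapes above can be sketched in a few lines of Python (a sketch only; gRPC binary metadata keys must end in `-bin`, which is why the header is named that way):

```python
import secrets

METADATA_KEY = "x-kiseki-workflow-ref-bin"  # gRPC binary metadata: key must end in "-bin"

def new_workflow_ref() -> bytes:
    """Generate a 16-byte opaque workflow handle with full CSPRNG entropy."""
    return secrets.token_bytes(16)

def attach_workflow_ref(metadata: list, ref: bytes) -> list:
    """Append the workflow reference to an outgoing RPC's metadata entries."""
    assert len(ref) == 16, "workflow refs are exactly 16 bytes"
    return metadata + [(METADATA_KEY, ref)]
```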
Tenant Isolation
Tenant isolation is a foundational invariant of Kiseki. Tenants are fully isolated with no cross-tenant data access, no delegation tokens, and no cross-tenant key sharing (I-T1).
Isolation model
Kiseki implements hierarchical tenancy with strict isolation boundaries:
Organization (billing, admin, master key authority)
|
+-- Project (optional: resource grouping, key delegation)
| |
| +-- Workload (runtime isolation unit)
| +-- Workload
|
+-- Workload (directly under org, if no projects)
Isolation guarantees
| Property | Guarantee | Invariant |
|---|---|---|
| Data access | No cross-tenant data access | I-T1 |
| Key material | Per-tenant encryption keys, never shared | I-T3, I-K3 |
| Resource consumption | Bounded by quotas at org and workload levels | I-T2 |
| Audit visibility | Tenant sees only their own events | I-A2 |
| Metrics | Tenant-anonymous for cluster admin | ADR-015 |
| Admin access | Zero-trust: cluster admin cannot access tenant data without approval | I-T4 |
Per-tenant encryption keys
Each tenant has their own KEK (Key Encryption Key) managed by their chosen KMS backend (ADR-028). The tenant KEK wraps access to system DEK derivation parameters for that tenant’s data.
Key isolation
- System DEKs are derived per-chunk and are the same for identical chunks across tenants (enabling cross-tenant dedup by default).
- Tenant KEKs are unique per tenant. Even if two tenants store the same data, each tenant wraps access to the DEK derivation parameters independently.
- Tenant keys are not accessible to other tenants or to shared system processes (I-T3).
Key storage isolation
When using the internal KMS provider (default), tenant KEKs are stored in a separate Raft group from system master keys (I-K19). Compromise of one group does not expose the other.
When using external KMS providers (Vault, KMIP, AWS KMS, PKCS#11), tenant key material is managed entirely outside of Kiseki’s storage, under the tenant’s own operational control.
HMAC-keyed chunk IDs for opted-out tenants
By default, chunk IDs are derived from plaintext content:
chunk_id = SHA-256(plaintext). This enables cross-tenant
deduplication: identical data stored by different tenants produces the
same chunk ID and shares storage.
Tenants that require stronger isolation can opt out of cross-tenant dedup (I-X2, I-K10):
Default: chunk_id = SHA-256(plaintext)
Opted-out: chunk_id = HMAC-SHA256(plaintext, tenant_key)
What opt-out provides
- No cross-tenant dedup: Identical data from different tenants produces different chunk IDs. Each tenant’s data is stored independently.
- Zero co-occurrence leak: An observer cannot determine whether two tenants store the same data by comparing chunk IDs.
- Storage overhead: Duplicate data across tenants consumes additional storage.
When to opt out
Opt-out is recommended for tenants with:
- Regulatory requirements prohibiting any form of cross-tenant data correlation (even at the metadata level).
- High-sensitivity data where the existence of shared content is itself sensitive information.
- Compliance regimes (HIPAA, ITAR) where data co-location with other tenants must be minimized.
Audit log scoping
The audit log is append-only, immutable, and system-wide (I-A1). Audit visibility is strictly scoped:
Tenant audit export (I-A2)
Each tenant receives a filtered projection of the audit log:
- All events originating from the tenant’s own operations.
- Relevant system events sufficient for a coherent, complete audit trail (e.g., a cluster admin modifying a pool that contains the tenant’s data).
- Delivered on the tenant’s VLAN.
- Sufficient for independent compliance demonstration (e.g., HIPAA Section 164.312 audit controls).
The tenant admin consumes this export. No events from other tenants appear in the export.
Cluster admin audit view (I-A3)
The cluster admin sees:
- System-level events (node joins, pool changes, key rotations).
- Tenant-anonymous or aggregated metrics.
- No tenant-attributable content.
Cluster admin modifications to pools containing tenant data are audit-logged to the affected tenant’s audit shard (I-T4c), so the tenant can review.
Advisory audit scoping (I-WA8)
Workflow advisory events (declare, end, phase-advance, hint accept/reject, etc.) are written to the tenant’s audit shard.
- Semantic phase tags and workflow IDs are tenant-scoped.
- Cluster-admin views see opaque hashes only (consistent with I-A3).
- High-volume events (hint-accepted, hint-throttled) may be batched or sampled, but at least one event per unique (workflow_id, rejection_reason) tuple is written per second.
Cache isolation (ADR-031)
The client-side cache maintains strict per-tenant isolation:
L1 (in-memory) isolation
- The L1 cache operates within a single client process.
- A client process is authenticated as a specific tenant via mTLS.
- L1 entries are decrypted plaintext chunks, keyed by chunk ID.
- On process termination, L1 entries are zeroized (I-CC2).
L2 (on-disk) isolation
- Each client process creates its own L2 cache pool on local NVMe.
- Pool isolation is enforced by:
- Unique pool ID: 128-bit CSPRNG value per process.
- flock: Ownership proven by file lock on pool.lock.
- Per-process directory: No cross-process sharing.
- Concurrent same-tenant processes have independent pools. There is no cross-process cache sharing.
- Orphaned pools (no live flock holder) are scavenged on startup or by kiseki-cache-scrub.
- On eviction or cache wipe, L2 entries are overwritten with zeros before unlink (I-CC2).
Crypto-shred cache wipe (I-CC12)
When a crypto-shred event is detected for a tenant:
- All cached plaintext for that tenant is wiped from L1 and L2.
- L1 entries: Zeroizing&lt;Vec&lt;u8&gt;&gt; ensures memory-level erasure.
- L2 entries: File contents overwritten with zeros before unlink.
- Detection mechanisms:
- Periodic key health check (default 30 seconds).
- Advisory channel notification.
- KMS error on next operation.
Maximum detection latency: min(key_health_interval, max_disconnect_seconds).
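The L2 wipe and the detection-latency bound can be sketched as follows (a sketch only; real code would also handle partial writes and I/O errors, and logical overwrite has the physical-media caveat noted below):

```python
import os

def wipe_l2_entry(path: str, block: int = 1 << 20) -> None:
    """Overwrite an L2 cache file with zeros and fsync before unlink, so no
    plaintext survives at the filesystem level (I-CC2)."""
    remaining = os.path.getsize(path)
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(block, remaining)
            f.write(b"\0" * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    os.unlink(path)

def max_detection_latency(key_health_interval: float, max_disconnect_seconds: float) -> float:
    """A crypto-shred is noticed by whichever detection channel fires first."""
    return min(key_health_interval, max_disconnect_seconds)
```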
Physical-level erasure note
Logical-level erasure (zeroize before deallocation) provides strong protection against software-level attacks. For protection against physical-level attacks on flash storage (e.g., reading NAND cells after logical deletion), hardware encryption (OPAL/SED) on the compute node’s local NVMe is required. This is outside Kiseki’s control but should be part of the compute node security policy.
Network isolation
Data fabric
All data-fabric traffic is mTLS-encrypted. Tenant identity is extracted from the client certificate and validated on every RPC.
Management network
The management network (control plane, admin API) is separate from the data fabric. Cluster admin access requires admin-OU certificates.
Tenant VLAN
Tenant audit exports are delivered on the tenant’s VLAN, providing network-level isolation of audit data.
Advisory isolation (I-WA1, I-WA2, I-WA5, I-WA6)
The workflow advisory subsystem enforces strict tenant isolation:
- No existence oracles (I-WA6): A client cannot determine the existence of resources it is not authorized to observe. Unauthorized and absent targets return identical responses (same error code, payload size, and latency distribution).
- No content oracles (I-WA11): Advisory fields never include cluster-internal identifiers (shard IDs, chunk IDs, node IDs, device IDs, rack labels).
- Telemetry scoping (I-WA5): Every telemetry value is computed over resources the caller is authorized to read. Aggregate metrics use k-anonymous bucketing (minimum k=5).
- Covert-channel hardening (I-WA15): Response timing and size do not vary with neighbor-workload state.
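The k-anonymous bucketing rule (I-WA5) can be illustrated with a small Python sketch (function and parameter names are illustrative; the point is that any bucket with fewer than k contributing members is suppressed, never reported with a small count):

```python
def k_anonymous_buckets(values_by_member: dict, bucket_of, k: int = 5) -> dict:
    """Aggregate per-member telemetry values into buckets and drop any bucket
    with fewer than k contributing members (minimum k=5)."""
    buckets: dict = {}
    for member, value in values_by_member.items():
        buckets.setdefault(bucket_of(value), set()).add(member)
    # Report only bucket -> member count for buckets meeting the k threshold.
    return {b: len(members) for b, members in buckets.items() if len(members) >= k}
```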
Pool handle isolation (I-WA19)
Affinity pools are referenced via opaque pool handles, not cluster-internal pool IDs:
- Handles are valid for one workflow’s lifetime only.
- Never reused across workflows.
- Never equal or leak the cluster-internal pool identity.
- Multiple tenants can see the same opaque label attached to different internal pools; correlation across tenants is impossible because handles differ.
Compliance support
Kiseki’s tenant isolation model supports the following compliance regimes:
| Regime | Relevant guarantees |
|---|---|
| HIPAA | Per-tenant encryption, audit export for Section 164.312, crypto-shred, bounded staleness (2s floor). |
| SOC 2 | Audit log immutability, access control separation, key management lifecycle. |
| GDPR | Crypto-shred as right-to-erasure mechanism, data isolation by design. |
| ITAR | HMAC-keyed chunk IDs (no cross-tenant correlation), dedicated tenant KMS. |
Compliance tags attach at any level of the tenant hierarchy (organization, project, workload) and inherit downward. Tags may impose additional constraints:
- Prohibit compression (HIPAA namespaces, I-K14).
- Set staleness floor (minimum 2 seconds for HIPAA).
- Require external KMS provider (no internal mode).
- Restrict pool placement (data residency).
Troubleshooting
This guide covers common issues, diagnostic tools, and resolution procedures for Kiseki clusters.
Diagnostic tools
Health endpoint
# Quick liveness check (returns "OK" or connection refused)
curl http://node1:9090/health
Event log
The event log captures categorized diagnostic events in memory. Query via the admin API:
# All events from the last 3 hours
curl http://node1:9090/ui/api/events
# Error events only
curl 'http://node1:9090/ui/api/events?severity=error'
# Critical events from the last 24 hours
curl 'http://node1:9090/ui/api/events?severity=critical&hours=24'
# Device-related events
curl 'http://node1:9090/ui/api/events?category=device'
# Raft events (elections, membership changes)
curl 'http://node1:9090/ui/api/events?category=raft'
Node status
# Per-node metrics and health
curl http://node1:9090/ui/api/nodes
# Cluster summary
curl http://node1:9090/ui/api/cluster
Structured logs
# Tail logs for errors (systemd)
journalctl -u kiseki-server -f --priority=err
# Search for specific errors in JSON logs
journalctl -u kiseki-server --output=json | jq 'select(.level == "ERROR")'
# Raft-specific logs
journalctl -u kiseki-server | grep kiseki_raft
Common issues
Connection refused on data-path port (9100)
Symptoms: Clients cannot connect. curl http://node:9090/health
returns OK but gRPC connections to port 9100 fail.
Diagnosis:
- Verify the port is listening: ss -tlnp | grep 9100
- Check firewall rules: iptables -L -n | grep 9100
- Check the server logs for bind errors: journalctl -u kiseki-server | grep "bind\|listen\|9100"
Common causes:
- Port conflict: Another process is using port 9100.
- Bind address: KISEKI_DATA_ADDR is set to 127.0.0.1:9100 instead of 0.0.0.0:9100.
- Firewall: Port 9100 is not open between nodes or to clients.
mTLS authentication failures
Symptoms: AuthenticationFailed errors in logs. Clients receive
gRPC UNAUTHENTICATED (16) status.
Diagnosis:
# Verify certificate validity
openssl x509 -in /etc/kiseki/tls/server.crt -noout -dates -subject -issuer
# Verify certificate chain
openssl verify -CAfile /etc/kiseki/tls/ca.crt /etc/kiseki/tls/server.crt
# Test TLS handshake
openssl s_client -connect node1:9100 \
-cert /etc/kiseki/tls/client.crt \
-key /etc/kiseki/tls/client.key \
-CAfile /etc/kiseki/tls/ca.crt
Common causes:
- Certificate expired: Renew the certificate.
- CA mismatch: Client and server certificates signed by different CAs.
- Missing SAN: Server certificate does not include the hostname or IP the client is connecting to.
- CRL revocation: Certificate revoked via KISEKI_CRL_PATH. Check the CRL: openssl crl -in /etc/kiseki/tls/crl.pem -text -noout
- Wrong OU: Tenant certificate has the wrong OU, or the admin certificate does not have the kiseki-admin OU.
Capacity full (ENOSPC)
Symptoms: Write operations return PoolFull errors. S3 PutObject
returns HTTP 507. NFS writes return EIO or ENOSPC.
Diagnosis:
# Check pool capacity
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity
# Check system disk usage
df -h /var/lib/kiseki
Resolution:
- Add devices to the pool to increase capacity.
- Rebalance to distribute data more evenly: kiseki-server pool rebalance --pool-id fast-nvme
- Evacuate devices from an over-full pool to a different pool (within the same device class).
- Delete data: Remove compositions/objects to free space. GC runs periodically (default every 300 seconds).
- Adjust thresholds if the defaults are too conservative for your deployment: kiseki-server pool set-thresholds --pool-id fast-nvme --warning-pct 80 --critical-pct 90
Metadata disk full (system partition)
Symptoms: Inline threshold drops to floor (128 bytes). Alert: “system disk metadata usage exceeds hard limit.” Raft may stall if the system disk is completely full.
Diagnosis:
# Check system partition usage
df -h /var/lib/kiseki
# Check individual redb sizes
du -sh /var/lib/kiseki/raft/log.redb
du -sh /var/lib/kiseki/chunks/meta.redb
du -sh /var/lib/kiseki/small/objects.redb
Resolution:
- The system automatically reduces the inline threshold to the floor (128 bytes) when the hard limit is exceeded (I-SF2).
- Trigger Raft log compaction to reduce raft/log.redb size: kiseki-server compact
- Run GC to clean up orphaned entries in small/objects.redb (I-SF6).
- Consider migrating shards to nodes with larger system disks.
- If the system partition is persistently undersized, upgrade to larger NVMe for the system RAID-1.
Raft diagnostics
Leader election issues
Symptoms: ShardUnavailable errors. Writes fail intermittently.
Diagnosis:
# Check shard health
kiseki-server shard health --shard-id shard-0001
# Check Raft events
curl 'http://node1:9090/ui/api/events?category=raft'
# Check election metrics
curl -s http://node1:9090/metrics | grep kiseki_raft
Common causes:
- Network partition: Raft peers cannot communicate. Check connectivity on port 9300 between all nodes.
- Clock skew: Large clock differences can cause election timeouts. Verify NTP synchronization. Nodes with Unsync clock quality are flagged (I-T6).
- Disk latency: HDD system disks cause 5-10 ms fsync latency per Raft commit. Use NVMe or SSD for the system partition.
Quorum loss
Symptoms: All writes fail. Reads may succeed (depending on consistency model).
Diagnosis:
# Check how many nodes are reachable
for node in node1 node2 node3; do
echo -n "$node: "
curl -s http://$node:9090/health && echo "OK" || echo "DOWN"
done
Resolution:
- If one node is down (3-node cluster): The remaining 2 nodes form a majority. Raft continues. Repair or replace the failed node.
- If two nodes are down: Quorum is lost. See Backup & Recovery for recovery procedures.
Shard split stalls
Symptoms: Shard reports high delta count or throughput but split does not complete.
Diagnosis:
kiseki-server shard info --shard-id shard-0001
Resolution:
- Verify the shard is not in maintenance mode (I-O6).
- Check if the cluster-wide concurrent migration limit is reached (I-SF4): max(1, num_nodes / 10).
- Check the exponential backoff timer (I-SF4): Minimum 2 hours between placement changes per shard.
- Manually trigger a split if auto-split is not firing:
kiseki-server shard split --shard-id shard-0001
Device issues
Integrity scrub
Trigger a manual integrity scrub to verify chunk data against EC parity:
# Scrub all devices
curl -X POST http://node1:9090/ui/api/ops/scrub
# Scrub a specific device
kiseki-server device scrub --device-id nvme-0001
The periodic scrub runs every 7 days by default (scrub_interval_h).
SMART warnings
Automatic evacuation triggers when a device reports:
- SSD: SMART wear indicator > 90%.
- HDD: > 100 bad sectors.
Check device health:
kiseki-server device info --device-id nvme-0001
Device evacuation
Monitor evacuation progress:
# List active repairs/evacuations
kiseki-server repair list
# Check device state
kiseki-server device info --device-id nvme-0001
Device state transitions: Healthy -> Degraded -> Evacuating -> Failed -> Removed (I-D2).
A device in Evacuating state can be cancelled:
kiseki-server device cancel-evacuation --device-id nvme-0001
RemoveDevice is rejected unless the device state is Removed
(post-evacuation) (I-D5).
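The documented lifecycle and the I-D5 guard can be sketched as a small state machine (a sketch only: it models the forward chain Healthy -> Degraded -> Evacuating -> Failed -> Removed and the RemoveDevice precondition; the cancel-evacuation back-transition is omitted since its target state is not specified here):

```python
class DeviceStateError(Exception):
    pass

# The documented forward chain (I-D2).
FORWARD = ["Healthy", "Degraded", "Evacuating", "Failed", "Removed"]

class Device:
    def __init__(self):
        self.state = "Healthy"

    def advance(self, new_state: str) -> None:
        """Allow only single steps along the documented forward chain."""
        if FORWARD.index(new_state) != FORWARD.index(self.state) + 1:
            raise DeviceStateError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state

    def remove_device(self) -> None:
        """RemoveDevice is rejected unless the device is already Removed (I-D5)."""
        if self.state != "Removed":
            raise DeviceStateError("RemoveDevice rejected: state is " + self.state)
```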
Key management issues
Key manager unreachable
Symptoms: KeyManagerUnavailable errors. All chunk writes fail
cluster-wide (I-K12).
Diagnosis:
# Check key manager health
kiseki-server keymanager health
# Check connectivity from storage node
curl -s http://node1:9090/metrics | grep kms_reachability
Resolution:
- The key manager is a Raft-replicated HA service. If one node is down, the remaining majority continues serving.
- If the entire key manager cluster is unreachable, storage nodes use cached master keys (mlock’d in memory) for reads but cannot process new writes.
- Restore key manager connectivity as soon as possible.
Tenant KMS unreachable
Symptoms: TenantKmsUnreachable errors for operations involving
the affected tenant. Other tenants are unaffected.
Diagnosis:
kiseki-server keymanager check-kms --tenant-id acme-corp
Resolution:
- Check network connectivity to the tenant’s KMS endpoint.
- Check KMS credentials and certificate validity.
- The tenant admin is responsible for their KMS availability (I-K11).
Crypto-shred verification
After a crypto-shred, verify that all clients have wiped their caches:
# Check crypto-shred count
curl -s http://node1:9090/metrics | grep kiseki_crypto_shred_total
# Check security events
curl 'http://node1:9090/ui/api/events?category=security'
Gateway issues
S3 errors
Common S3 error codes returned by the gateway:
| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | SigV4 authentication failure | Check access key/secret key. |
| 404 Not Found | Bucket or object does not exist | Verify namespace and key. |
| 507 Insufficient Storage | Pool full | Add capacity. See Capacity Full above. |
| 503 Service Unavailable | Raft quorum lost or maintenance mode | Wait for recovery or disable maintenance. |
NFS errors
| Error | Cause | Resolution |
|---|---|---|
| ESTALE | Shard split caused file handle invalidation | Retry the operation. |
| EIO | Internal error (chunk read failure, key manager unreachable) | Check server logs. |
| ENOSPC | Pool full | Add capacity. |
| EXDEV | Cross-shard rename (I-L8) | Use copy + delete instead. |
| ENOTSUP | Writable shared mmap (I-O8) | Use read/write instead of mmap for writes. |
Performance Tuning
Kiseki is designed for HPC and AI workloads running at 200+ Gbps per NIC. This guide covers tuning levers for maximizing throughput and minimizing latency.
Transport selection
The transport layer abstracts the network fabric. Kiseki automatically selects the best available transport, but manual override is possible.
Transport hierarchy (fastest to slowest)
| Transport | Typical bandwidth | Latency | Feature flag | Notes |
|---|---|---|---|---|
| CXI (HPE Slingshot) | 200 Gbps | <1 us | kiseki-transport/cxi | Requires libfabric with CXI provider. CSCS/Alps native. |
| InfiniBand verbs | 100-400 Gbps | 1-2 us | kiseki-transport/verbs | Requires RDMA-capable NICs and verbs libraries. |
| RoCE v2 | 25-100 Gbps | 2-5 us | kiseki-transport/verbs | RDMA over Converged Ethernet. Requires lossless fabric (PFC/ECN). |
| TCP | 10-100 Gbps | 50-200 us | (always available) | Fallback. Uses kernel TCP with TLS. |
Enabling high-performance transports
# Build with CXI support (requires libfabric development headers)
cargo build --release --features kiseki-transport/cxi
# Build with RDMA verbs support (requires rdma-core)
cargo build --release --features kiseki-transport/verbs
The client automatically detects available transports and selects the fastest one. Override with:
# Force TCP transport (e.g., for debugging)
KISEKI_TRANSPORT=tcp kiseki-client-fuse --mountpoint /mnt/kiseki
Transport tuning
- Connection pooling: The transport layer maintains a pool of connections per peer. Pool size adapts to workload.
- Keepalive: Connections are kept alive to avoid handshake overhead. Configure via KISEKI_TRANSPORT_KEEPALIVE_MS.
- Zero-copy: CXI and verbs transports use zero-copy DMA where possible.
NUMA pinning
For multi-socket servers, NUMA-aware placement is critical for avoiding cross-socket memory traffic.
Recommendations
- Pin kiseki-server to the NUMA node closest to the NIC: numactl --cpunodebind=0 --membind=0 kiseki-server
- Pin NVMe interrupts to the same NUMA node: echo 0 > /proc/irq/&lt;irq&gt;/smp_affinity_list
- Pin data devices to the NUMA node closest to their PCIe root complex.
systemd integration
[Service]
# Pin to NUMA node 0
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/local/bin/kiseki-server
Verification
# Check NUMA topology
numactl --hardware
# Check NIC NUMA node
cat /sys/class/net/eth0/device/numa_node
# Check NVMe NUMA node
cat /sys/block/nvme0n1/device/numa_node
Erasure coding parameters
EC parameters control the trade-off between storage overhead, repair bandwidth, and read performance.
Common configurations
| Config | Data | Parity | Overhead | Fault tolerance | Use case |
|---|---|---|---|---|---|
| 4+2 | 4 | 2 | 50% | 2 device failures | Default for NVMe. Good balance. |
| 8+3 | 8 | 3 | 37.5% | 3 device failures | Large HDD pools. Lower overhead. |
| 4+1 | 4 | 1 | 25% | 1 device failure | Low-criticality data. Minimum overhead. |
| 2+2 | 2 | 2 | 100% | 2 device failures | Small pools (<6 devices). High redundancy. |
Performance implications
- Read amplification: Reading a chunk requires reading data_chunks fragments. More data chunks = more read I/O.
- Write amplification: Writing a chunk requires writing data_chunks + parity_chunks fragments.
- Repair bandwidth: Repairing a lost fragment requires reading data_chunks fragments and writing 1. Higher data_chunks = more repair bandwidth.
- Minimum pool size: The pool must have at least data_chunks + parity_chunks devices.
EC parameters are immutable per pool after creation (I-C6). Choose carefully. Changing requires creating a new pool and migrating data.
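Since the parameters are immutable, it is worth computing the trade-offs before creating the pool. A small Python helper that derives the figures in the table above from a (data, parity) pair (illustrative; not a Kiseki API):

```python
def ec_profile(data_chunks: int, parity_chunks: int) -> dict:
    """Derive storage overhead, fault tolerance, and I/O figures for an
    erasure-coding configuration of data_chunks + parity_chunks."""
    return {
        "overhead_pct": 100.0 * parity_chunks / data_chunks,   # parity relative to data
        "fault_tolerance": parity_chunks,                       # device failures survived
        "min_pool_devices": data_chunks + parity_chunks,
        "write_fragments": data_chunks + parity_chunks,         # write amplification
        "repair_reads": data_chunks,                            # reads per lost fragment
    }
```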
Inline threshold (ADR-030)
The inline threshold determines whether small files are stored in the metadata tier (NVMe, redb) or the data tier (block device extents).
Tuning the threshold
The system automatically adjusts the threshold per-shard based on system disk capacity (I-SF1, I-SF2). Manual adjustment:
# Set cluster-wide default for new shards
kiseki-server tuning set --inline-threshold-bytes 8192
Trade-offs
| Threshold | Metadata tier impact | Data tier impact | Latency |
|---|---|---|---|
| 128 B (floor) | Minimal metadata growth | All files in chunks | Higher for tiny files |
| 4 KB (default) | Moderate growth | Small files inline | Lower for small files |
| 64 KB (ceiling) | Large growth | More inline data | Lowest for small files |
Monitoring
# Check system disk usage
df -h /var/lib/kiseki
# Check per-store sizes
du -sh /var/lib/kiseki/small/objects.redb
du -sh /var/lib/kiseki/raft/log.redb
The Raft inline throughput guard (I-SF7) automatically reduces the
threshold to the floor if inline write rate exceeds
KISEKI_RAFT_INLINE_MBPS (default 10 MB/s per shard). This prevents
inline data from starving metadata-only Raft operations during write
storms.
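The guard behaves like a per-shard rate limiter over a sliding window. A minimal Python sketch (the 1-second window, defaults, and class name are assumptions for illustration; only the documented floor, 4 KB default, and 10 MB/s budget come from the text):

```python
import time

class InlineThroughputGuard:
    """Drop the inline threshold to the floor when the inline write rate in the
    current window exceeds the per-shard budget (sketch of I-SF7)."""

    def __init__(self, limit_mbps: float = 10.0, floor: int = 128,
                 threshold: int = 4096, window: float = 1.0, clock=time.monotonic):
        self.limit_bytes = limit_mbps * 1024 * 1024   # budget per second
        self.floor = floor
        self.threshold = threshold
        self._window = window
        self._clock = clock
        self._window_start = clock()
        self._bytes = 0

    def record_inline_write(self, nbytes: int) -> int:
        """Account an inline write; return the threshold to use next."""
        now = self._clock()
        if now - self._window_start >= self._window:
            self._window_start, self._bytes = now, 0   # start a fresh window
        self._bytes += nbytes
        if self._bytes > self.limit_bytes * self._window:
            self.threshold = self.floor                # shed inline load
        return self.threshold
```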
Cache tuning (ADR-031)
L1 cache (in-memory)
The L1 cache holds decrypted plaintext chunks in process memory.
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L1_MAX | 1 GB | Set to 10-25% of available process memory. AI training with large datasets: increase. Memory-constrained compute: decrease. |
L2 cache (local NVMe)
The L2 cache uses local NVMe on compute nodes.
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_L2_MAX | 100 GB | Set based on available NVMe capacity. Training datasets: size to fit the working set. Inference: size to fit model weights. |
Metadata TTL
| Parameter | Default | Recommendation |
|---|---|---|
| KISEKI_CACHE_META_TTL_MS | 5000 (5s) | Read-heavy workloads: increase for fewer metadata fetches. Low-latency requirements: decrease for fresher data. POSIX close-to-open consistency: 0 (no caching). |
Cache mode selection
| Workload | Recommended mode | Rationale |
|---|---|---|
| AI training (epoch reuse) | pinned | Dataset is re-read every epoch. Pin to avoid refetching. |
| AI inference | organic | Model weights are hot, prompts rotate. LRU works well. |
| HPC checkpoint/restart | bypass | Checkpoints are write-heavy. Caching checkpoints wastes NVMe. |
| Climate/weather staging | pinned | Boundary conditions staged once, read many times. |
| Interactive analysis | organic | Mixed access patterns. LRU adapts. |
Staging for training workloads
Pre-stage datasets before training begins to avoid cold-start latency:
# Slurm prolog script
kiseki-client-fuse --stage /datasets/imagenet --mountpoint /mnt/kiseki
export KISEKI_CACHE_POOL_ID=$(cat /var/cache/kiseki/pool_id)
# Workload picks up the staged cache via KISEKI_CACHE_POOL_ID
srun --export=ALL python train.py
Raft tuning
Snapshot interval
kiseki-server tuning set --raft-snapshot-interval 10000
- Lower values (1000-5000): More frequent snapshots. Faster catch-up for new nodes. More I/O.
- Higher values (50000-100000): Less snapshot overhead. Slower catch-up.
Compaction rate
kiseki-server tuning set --compaction-rate-mb-s 200
Higher compaction rate reduces Raft log size faster but consumes more I/O bandwidth.
View materialization poll interval
kiseki-server tuning set --stream-proc-poll-ms 50
Lower poll interval reduces view staleness but increases CPU usage.
Benchmark harness
Kiseki includes a transport benchmark for measuring raw fabric throughput:
# Run transport benchmarks (if available)
tests/hw/run_transport_bench.sh
What it measures
- Bandwidth: Sequential read/write throughput per transport.
- Latency: Round-trip latency (p50, p99, p999) per transport.
- IOPS: Random read/write IOPS per transport.
- Concurrency: Throughput scaling with connection count.
Interpreting results
| Metric | Good (CXI) | Good (TCP) | Action if below |
|---|---|---|---|
| Bandwidth | >150 Gbps | >50 Gbps | Check NIC config, MTU, NUMA pinning |
| Latency p99 | <10 us | <500 us | Check CPU frequency, interrupt coalescing |
| IOPS (4K random) | >1M | >100K | Check NVMe config, queue depth |
System tuning checklist
Kernel parameters
# Increase maximum open files
echo "fs.file-max = 1048576" >> /etc/sysctl.conf
# Increase socket buffer sizes for high-bandwidth transports
echo "net.core.rmem_max = 67108864" >> /etc/sysctl.conf
echo "net.core.wmem_max = 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 4096 87380 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 4096 65536 67108864" >> /etc/sysctl.conf
# Disable transparent hugepages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
NVMe tuning
# Set I/O scheduler to none (best for NVMe)
echo none > /sys/block/nvme0n1/queue/scheduler
# Increase queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
Process limits
# /etc/security/limits.d/kiseki.conf
kiseki soft nofile 1048576
kiseki hard nofile 1048576
kiseki soft memlock unlimited
kiseki hard memlock unlimited
Performance Tests
Benchmark results for kiseki on GCP infrastructure.
Test Environment
| Component | Spec |
|---|---|
| HDD nodes (3) | n2-standard-16, 3 x PD-Standard 200GB each |
| Fast nodes (2) | n2-standard-16, 2 x local NVMe + 2 x PD-SSD 375GB |
| Client nodes (3) | n2-standard-8, 100GB SSD cache |
| Ctrl node (1) | e2-standard-4, orchestrator |
| Network | GCP VPC, single subnet 10.0.0.0/24 |
| Region | europe-west6-c (Zurich) |
| Raft | Single group, 5 nodes, node 1 bootstrap |
| Release | v2026.1.352 (async GatewayOps, ADR-032) |
Results (2026-04-24)
Network Bandwidth
| Path | Throughput |
|---|---|
| Client → Leader (n2-standard-8 → n2-standard-16) | 15.2 - 15.3 Gbps |
| HDD → Fast cross-tier (n2-standard-16 → n2-standard-16) | 18.3 - 20.4 Gbps |
S3 Gateway
All S3 tests run from client nodes (n2-standard-8) with 8-way parallelism.
Write Throughput (single client → leader)
| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 1,624 ms | 123.2 MB/s |
| 4 MB | 50 | 8 | 239 ms | 836.8 MB/s |
| 16 MB | 25 | 8 | 363 ms | 1,101.9 MB/s |
Read Throughput
| Object Size | Count | Parallelism | Time | Throughput |
|---|---|---|---|---|
| 1 MB | 200 | 8 | 176 ms | 1,136.4 MB/s |
PUT Latency (1 KB objects, sequential)
| Percentile | Latency |
|---|---|
| p50 | 7.6 ms |
| p99 | 8.6 ms |
| avg | 7.7 ms |
| max | 9.7 ms |
Aggregate Write (3 clients, parallel)
| Workload | Time | Aggregate Throughput |
|---|---|---|
| 3 x 100 x 1 MB (8 concurrent/client) | 2,205 ms | 136.1 MB/s |
NFS / pNFS / FUSE
Not yet tested on GCP. NFS mount from client nodes requires SSH key distribution from the ctrl node (OS Login configuration pending). FUSE requires the kiseki-client binary installed on client nodes.
Local testing (3-node cluster on localhost) confirms all protocols functional via unit and integration tests.
Prometheus Metrics
Gateway request counters showed 0 during the test. The
requests_total atomic counter in InMemoryGateway is not wired
to the Prometheus metrics exporter yet.
Local Test Results (same binary, localhost)
For comparison, local 3-node cluster results (loopback network, no disk I/O latency, 32-way parallelism):
| Test | Result |
|---|---|
| S3 Write 1 MB x 200 (32 parallel) | 380.2 MB/s |
| S3 Write 4 MB x 50 (32 parallel) | 349.7 MB/s |
| S3 Write 16 MB x 25 (32 parallel) | 340.7 MB/s |
| S3 Read 1 MB x 200 (32 parallel) | 913.2 MB/s |
| 32 concurrent PUTs | 50 ms (no deadlock) |
Observations
- Small object writes improved 9.6x after ADR-032 (async GatewayOps + lock-free composition writes). The composition lock is no longer held during Raft consensus, allowing concurrent writes to proceed in parallel.
- Read throughput exceeds write. Reads bypass Raft consensus (served from the local composition + chunk store) and hit 1.1 GB/s even for 1 MB objects.
- GCP outperforms localhost for large objects. The GCP network (15+ Gbps) and n2-standard-16 nodes have more bandwidth than localhost loopback under contention. 16 MB writes: 1,102 MB/s (GCP) vs 341 MB/s (local).
- Latency is network-bound. p50 latency on GCP (7.6 ms) includes network RTT + Raft consensus (5-node quorum). Local latency is dominated by CPU contention on the shared machine.
- Single Raft group is the write bottleneck. All writes go through one leader. A multi-shard deployment would distribute leaders across nodes, scaling write throughput linearly.
Known Issues
- Concurrent write deadlock (fixed in ADR-032). The sync→async bridge (`run_on_raft`) caused thread starvation under concurrent load. Fixed by making GatewayOps and LogOps fully async and moving log emission out of the composition lock scope. Result: 1 MB writes improved from 39.5 to 380.2 MB/s (9.6x).
- NFS mount on GCP. Requires SSH key distribution from ctrl to client nodes. The ctrl service account needs the `osAdminLogin` role and OS Login key registration.
- Prometheus counters. `gateway_requests_total` is not exported to the `/metrics` endpoint.
Running the Benchmark
# Local 3-node test
cargo build --release --bin kiseki-server
# Start 3 nodes (see examples/cluster-3node.env.node{1,2,3})
# Run: bash infra/gcp/benchmarks/perf-suite.sh
# GCP deployment
cd infra/gcp
terraform apply -var="project_id=PROJECT" -var="zone=ZONE" \
-var="release_tag=v2026.1.332"
# Deploy perf-suite.sh to ctrl node and run
See infra/gcp/benchmarks/perf-suite.sh for the full benchmark
script and infra/gcp/benchmarks/run-perf.sh for the local
deployment wrapper.
Comparison with Ceph and Lustre
Single-Leader Kiseki vs Typical Deployments (similar hardware scale)
| Metric | Kiseki (1 leader) | Ceph RGW (S3) | Lustre |
|---|---|---|---|
| Large object write | 1.1 GB/s (16 MB) | 0.5-2 GB/s | 1-2 GB/s per OST |
| Small object write | 122 MB/s (1 MB) | 50-200 MB/s | 200-500 MB/s |
| Read throughput | 1.1 GB/s | 1-3 GB/s | 2-10 GB/s |
| PUT latency | p50: 7.6 ms | p50: 2-5 ms | p50: <1 ms (POSIX) |
| Aggregate 3-client | 133 MB/s | 300-800 MB/s | 1-5 GB/s |
| Encryption | Always (AES-256-GCM) | Optional (rarely on) | No |
Why aggregate throughput is lower
All writes go through a single Raft leader (single Raft group). Ceph distributes across PGs/OSDs; Lustre stripes across OSTs. They parallelize writes across all nodes, while Kiseki serializes them through one leader. This is a deployment constraint, not an architectural limit.
Where Kiseki is strong
- Per-leader throughput is excellent. 1.1 GB/s per leader with full AES-256-GCM encryption is comparable to Ceph RGW without encryption. The crypto overhead is nearly invisible (aws-lc-rs with AES-NI).
- Read throughput matches. Reads bypass Raft consensus entirely and serve from the local composition + chunk store. Multi-node reads scale linearly since any node can serve.
- Latency is reasonable. 7.6 ms includes Raft consensus over the network plus encryption. Ceph’s 2-5 ms S3 latency is lower but typically without encryption. Lustre’s sub-ms is POSIX (kernel bypass), not comparable to HTTP/S3.
Bottleneck analysis
- Not bottlenecked by crypto – AES-256-GCM at 1.1 GB/s means the CPU encrypts faster than the network/Raft can deliver.
- Not bottlenecked by network – 15 Gbps available, using <10 Gbps.
- Bottlenecked by Raft consensus – 7.6 ms per round-trip for small objects, amortized for large ones.
- Multi-shard is the path to parity – linear scaling with shard count, same model as Ceph PGs and Lustre OSTs.
Projected multi-shard performance
| Shards | 1 MB Write | 16 MB Write | Read |
|---|---|---|---|
| 1 | 122 MB/s | 1.1 GB/s | 1.1 GB/s |
| 3 | ~366 MB/s | ~3.4 GB/s | ~3.4 GB/s |
| 5 | ~610 MB/s | ~5.7 GB/s | ~5.7 GB/s |
At 5 shards on the same hardware, Kiseki reaches parity with Ceph and approaches Lustre – while encrypting all data at rest and in transit, on commodity GCP VMs with network-attached storage (not local NVMe or InfiniBand).
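The projected table is a linear model: each added shard contributes an independent Raft leader. A one-line sketch (the `efficiency` knob is hypothetical, for modeling cross-shard overhead; the measured baseline is the 1-shard row):

```python
def project(single_shard_mb_s, shards, efficiency=1.0):
    """Linear multi-shard projection: throughput scales with leader count.
    efficiency < 1.0 would model cross-shard coordination overhead."""
    return single_shard_mb_s * shards * efficiency

# 1 MB write column, from the 122 MB/s single-shard baseline.
for n in (1, 3, 5):
    print(f"{n} shards: ~{project(122, n):.0f} MB/s")
# 1 shards: ~122 MB/s
# 3 shards: ~366 MB/s
# 5 shards: ~610 MB/s
```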
Capacity Planning
Kiseki separates metadata and data onto different storage tiers. Proper sizing of both tiers is critical for stable operation at scale.
Storage tiers
Each storage node has two distinct storage tiers:
System disk (metadata tier)
The system partition hosts:
- Raft log (`raft/log.redb`): Bounded by the snapshot interval. Grows with write rate, compacted periodically.
- Key epochs (`keys/epochs.redb`): Tiny (<10 MB). One entry per key epoch.
- Chunk metadata (`chunks/meta.redb`): Scales linearly with file count. Approximately 80 bytes per file.
- Inline content (`small/objects.redb`): Variable. Controlled by the dynamic inline threshold (ADR-030).
Requirements:
- NVMe or SSD strongly recommended. HDD system disks trigger a boot warning because Raft fsync latency will be 5-10 ms per commit.
- RAID-1 on 2x SSD for redundancy (the system disk is not protected by Kiseki’s EC; it uses traditional RAID).
- Size based on expected file count and inline content.
Data devices (data tier)
Data devices are JBOD-managed by Kiseki. They store chunk ciphertext as extents on raw block devices (ADR-029).
Requirements:
- NVMe, SSD, or HDD depending on the pool’s device class.
- Multiple devices per node for EC placement (I-D4: no two EC fragments on the same device).
- JBOD (no RAID): Kiseki manages durability via EC or replication.
Metadata capacity sizing
Per-file metadata footprint
| Component | Per file | Notes |
|---|---|---|
| Delta log entry | ~200 bytes | Raft log entry with header fields |
| Chunk metadata | ~80 bytes | Extent index entry in chunks/meta.redb |
| Subtotal (no inline) | ~280 bytes | Fixed per file |
| Inline content | 0 to 64 KB | Only if file is below inline threshold |
Capacity examples
10 billion files, 50-node cluster, RF=3, no inline:
| Component | Cluster total | Per node |
|---|---|---|
| Delta log (metadata only) | ~2 TB | ~120 GB |
| Chunk metadata index | ~0.8 TB | ~48 GB |
| Total metadata | ~2.8 TB | ~168 GB |
At 168 GB per node, 256 GB NVMe system disks are tight. Larger system disks (512 GB or 1 TB) provide comfortable headroom.
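The per-node figures fall out of file count × per-file footprint × replication factor, divided across nodes. A sketch reproducing the table above (function name is illustrative):

```python
def metadata_per_node_gb(files, bytes_per_file, replication_factor, nodes):
    """Replicated metadata footprint per node, in decimal GB (as in the tables)."""
    total_bytes = files * bytes_per_file * replication_factor
    return total_bytes / nodes / 1e9

FILES, RF, NODES = 10_000_000_000, 3, 50
delta_log = metadata_per_node_gb(FILES, 200, RF, NODES)  # delta log entries
chunk_idx = metadata_per_node_gb(FILES, 80, RF, NODES)   # chunk metadata index
print(f"delta log: {delta_log:.0f} GB/node, chunk index: {chunk_idx:.0f} GB/node, "
      f"total: {delta_log + chunk_idx:.0f} GB/node")
# delta log: 120 GB/node, chunk index: 48 GB/node, total: 168 GB/node
```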
10 billion files, 50-node cluster, RF=3, with inline (4 KB threshold):
| Component | Cluster total | Per node |
|---|---|---|
| Metadata (as above) | ~2.8 TB | ~168 GB |
| Inline content (10% of files < 4 KB, avg 2 KB) | ~2 TB | ~120 GB |
| Total | ~4.8 TB | ~288 GB |
This exceeds 256 GB system disks. The dynamic inline threshold (ADR-030) prevents this by automatically reducing the threshold when system disk usage approaches the soft limit.
Capacity monitoring
The system automatically monitors metadata disk usage and adjusts:
| Usage level | Response |
|---|---|
| Below `KISEKI_META_SOFT_LIMIT_PCT` (50%) | Normal operation |
| Above soft limit | Inline threshold reduced |
| Above `KISEKI_META_HARD_LIMIT_PCT` (75%) | Threshold forced to floor (128 B), alert emitted |
Alerts use out-of-band gRPC health reports (not Raft) so that a full-disk node can signal without writing Raft entries (I-SF2).
Dynamic inline threshold (ADR-030)
The inline threshold is computed per-shard as the minimum affordable threshold across all Raft voters:
available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)
Where:
| Parameter | Value |
|---|---|
| `INLINE_FLOOR` | 128 bytes (hard lower bound) |
| `INLINE_CEILING` | 64 KB (system-wide maximum) |
| `KISEKI_META_SOFT_LIMIT_PCT` | 50% (default) |
| `KISEKI_META_HARD_LIMIT_PCT` | 75% (default) |
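The formula above transcribes directly into runnable Python (function name is illustrative):

```python
INLINE_FLOOR = 128           # bytes, hard lower bound
INLINE_CEILING = 64 * 1024   # bytes, system-wide maximum

def shard_inline_threshold(voter_budgets_bytes, projected_files):
    """Per-shard inline threshold: the minimum affordable threshold across
    all Raft voters, clamped to [INLINE_FLOOR, INLINE_CEILING] (ADR-030)."""
    available = min(voter_budgets_bytes)
    raw = available // max(projected_files, 1)
    return max(INLINE_FLOOR, min(raw, INLINE_CEILING))

# Ample budget (100+ GiB per voter, 1M files): threshold rides the ceiling.
print(shard_inline_threshold([100 * 2**30, 120 * 2**30], 1_000_000))   # 65536
# Tight budget (1 GiB, 100M projected files): clamps to the floor.
print(shard_inline_threshold([2**30, 2**30], 100_000_000))             # 128
```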
Threshold behavior
- Decrease: Automatic and safe. New files use the chunk path. Existing inline data is not retroactively migrated (I-L9).
- Increase: Requires cluster admin decision. May trigger optional background migration of small chunked files back to inline.
- Emergency: If any voter reports hard-limit breach, the leader commits a threshold reduction via Raft (2/3 majority; the full-disk node’s vote is not required).
Raft throughput guard (I-SF7)
The effective inline threshold is further clamped by a per-shard Raft
log throughput budget (KISEKI_RAFT_INLINE_MBPS, default 10 MB/s).
If the shard’s inline write rate exceeds this budget, the threshold
temporarily drops to the floor until the rate subsides. This prevents
inline data from starving metadata-only Raft operations during write
storms.
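In effect the guard is a second clamp on top of the capacity-driven threshold; a minimal sketch (function name hypothetical, behavior per I-SF7 as described above):

```python
INLINE_FLOOR = 128
KISEKI_RAFT_INLINE_MBPS = 10  # default per-shard inline budget, MB/s

def effective_inline_threshold(capacity_threshold, inline_rate_mbps,
                               budget_mbps=KISEKI_RAFT_INLINE_MBPS):
    """I-SF7 guard: if inline writes exceed the Raft log budget,
    force the threshold to the floor until the rate subsides."""
    if inline_rate_mbps > budget_mbps:
        return INLINE_FLOOR
    return capacity_threshold

print(effective_inline_threshold(4096, 6.0))   # under budget: 4096
print(effective_inline_threshold(4096, 25.0))  # write storm: 128
```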
Pool capacity thresholds
Data-tier capacity is managed per pool. Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels (ADR-024):
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Why NVMe/SSD thresholds are lower
NVMe and SSD devices experience write amplification from garbage collection at high fill levels. Above ~80% fill, GC pressure increases sharply, causing:
- Increased write latency (10-100x during GC storms).
- Reduced effective write bandwidth.
- Accelerated wear.
Enterprise storage arrays (VAST, Pure) operate at 95%+ because they have global wear leveling across all flash in the system. JBOD devices do not have this capability, so Kiseki’s thresholds are more conservative.
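The threshold table above can be expressed as a simple classifier keyed on device class (a sketch; state and class names follow the table, the function itself is illustrative):

```python
# (upper-bound fill %, state) pairs per device class, from the table above.
THRESHOLDS = {
    "nvme": [(75, "Healthy"), (85, "Warning"), (92, "Critical"),
             (97, "ReadOnly"), (100, "Full")],
    "hdd":  [(85, "Healthy"), (92, "Warning"), (97, "Critical"),
             (99, "ReadOnly"), (100, "Full")],
}

def pool_state(device_class, fill_pct):
    """Map a pool's fill level to its capacity state (ADR-024)."""
    for upper, state in THRESHOLDS[device_class]:
        if fill_pct <= upper:
            return state
    return "Full"

print(pool_state("nvme", 80))  # Warning: NVMe GC pressure starts earlier
print(pool_state("hdd", 80))   # Healthy: HDD tolerates higher fill
```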
Growth estimation
File count growth
Monitor kiseki_shard_delta_count to track delta (file) accumulation:
# Current delta count per shard
curl -s http://node1:9090/metrics | grep kiseki_shard_delta_count
Use the rate of delta count increase to project when the metadata tier will reach capacity.
Data volume growth
Monitor pool capacity metrics:
# Current pool utilization
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity
Projection formula
days_until_full = (capacity_total - capacity_used) / daily_write_rate
For metadata:
metadata_per_file = 280 bytes (no inline) or 280 + avg_inline_size (with inline)
days_until_full = (system_disk_capacity * soft_limit_pct - current_used) /
(new_files_per_day * metadata_per_file * replication_factor)
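The metadata projection formula, as runnable Python with illustrative numbers (a 512 GB system disk, 100 GB already used, one million new files per day; the function name is hypothetical):

```python
def metadata_days_until_full(disk_bytes, soft_limit_pct, used_bytes,
                             new_files_per_day, metadata_per_file=280,
                             replication_factor=3):
    """Days until the metadata tier reaches the soft limit, per the formula above."""
    budget = disk_bytes * soft_limit_pct / 100 - used_bytes
    daily_bytes = new_files_per_day * metadata_per_file * replication_factor
    return budget / daily_bytes

days = metadata_days_until_full(512e9, 50, 100e9, 1_000_000)
print(f"~{days:.0f} days until soft limit")
# ~186 days until soft limit
```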
Sizing recommendations
Small deployment (development/testing)
| Component | Recommendation |
|---|---|
| Nodes | 3 (minimum for Raft) |
| System disk | 256 GB NVMe each (RAID-1 on 2x SSD) |
| Data devices | 2x 1 TB NVMe per node |
| Key manager | Co-located with storage nodes (internal KMS) |
| File count | Up to 100 million |
Medium deployment (departmental HPC)
| Component | Recommendation |
|---|---|
| Nodes | 5-10 |
| System disk | 512 GB NVMe each (RAID-1) |
| Data devices | 4-8 NVMe per node (2-8 TB each) |
| Key manager | 3 dedicated nodes |
| File count | Up to 1 billion |
Large deployment (institutional HPC/AI)
| Component | Recommendation |
|---|---|
| Nodes | 50-200 |
| System disk | 1 TB NVMe each (RAID-1) |
| Data devices | 8-24 devices per node, mixed tiers (NVMe + SSD + HDD) |
| Key manager | 5 dedicated nodes |
| File count | Up to 10 billion |
| Total capacity | 100 PB+ |
Rules of thumb
- System disk: Size at 2x the expected metadata footprint for comfortable headroom. Include inline content estimates.
- Data devices: At least `ec_data_chunks + ec_parity_chunks` devices per pool (for EC placement across distinct devices, I-D4).
- Network: CXI or InfiniBand for clusters where storage bandwidth is critical. TCP is acceptable for cold-tier pools.
- Memory: At least 64 GB per storage node for Raft state, chunk metadata caching, and stream processor buffers.
Capacity alerts
Configuring alerts
Use Prometheus alerting rules (see Monitoring) to detect capacity issues before they become critical:
- alert: KisekiSystemDiskWarning
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.50
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "System disk above 50% on {{ $labels.instance }}"
- alert: KisekiSystemDiskCritical
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.25
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "System disk above 75% on {{ $labels.instance }}"
When to add capacity
- System disk above 50% (soft limit): Plan for capacity expansion. Inline threshold will start decreasing.
- System disk above 75% (hard limit): Urgent. Inline threshold is at floor. Add nodes or upgrade system disks.
- Pool above Warning threshold: Monitor growth. Plan for device additions.
- Pool above Critical threshold: Writes are being rejected. Add devices immediately or evacuate data to another pool.
gRPC Services
Kiseki exposes several gRPC services across two network ports. Data-path services run on port 9100. The advisory service runs on a separate listener at port 9101 (isolated runtime, ADR-021).
LogService
Port: 9100 (data fabric)
Provider: kiseki-log (via kiseki-server)
Consumers: Composition, View stream processors, Gateway, Client
| RPC | Type | Description |
|---|---|---|
| `AppendDelta` | Unary | Append a delta to a shard. Returns the assigned sequence number. Commits via Raft majority before ack (I-L2). |
| `ReadDeltas` | Server streaming | Read a range of deltas from a shard. Used by view stream processors for materialization. |
| `TruncateLog` | Unary | Trigger delta GC up to the minimum consumer watermark. Returns the new GC boundary. |
| `ShardHealth` | Unary | Query shard health, Raft state, and replication status. |
| `SplitShard` | Unary | Trigger mandatory shard split at a given boundary. |
| `SetMaintenance` | Unary | Enable or disable maintenance mode on a shard (I-O6). |
| `CompactShard` | Unary | Trigger compaction (header-only merge, I-O2). |
KeyManagerService
Port: Internal network (dedicated key manager cluster)
Provider: kiseki-keymanager (via kiseki-keyserver)
Consumers: Storage nodes (chunk encryption), Gateway, Client
| RPC | Type | Description |
|---|---|---|
| `FetchMasterKey` | Unary | Fetch the master key for a given epoch. Used at node startup and rotation. |
| `RotateKey` | Unary | Rotate system or tenant keys. Creates a new epoch. |
| `CryptoShred` | Unary | Destroy tenant KEK, rendering all tenant data unreadable. |
| `FullReEncrypt` | Unary | Trigger full re-encryption of a tenant’s data under new keys. |
| `FetchTenantKek` | Unary | Fetch tenant KEK for wrapping/unwrapping operations. |
| `CheckKmsHealth` | Unary | Check tenant KMS provider connectivity. |
| `KeyManagerHealth` | Unary | Query key manager cluster health and Raft state. |
System DEK derivation is local (HKDF, no RPC). Only master key fetch and tenant KEK operations require network calls (ADR-003).
ControlService
Port: Management network
Provider: kiseki-control
Consumers: Admin CLI, storage nodes, advisory runtime
Tenant management
| RPC | Description |
|---|---|
| `CreateOrg` | Create a new organization (top-level tenant) |
| `CreateProject` | Create a project within an organization |
| `CreateWorkload` | Create a workload within an org or project |
| `DeleteOrg` / `DeleteProject` / `DeleteWorkload` | Remove tenant hierarchy nodes |
Namespace and policy
| RPC | Description |
|---|---|
| `CreateNamespace` | Create a tenant-scoped namespace |
| `SetComplianceTags` | Set compliance regime tags (inherit downward) |
| `SetQuota` | Set resource quotas at org/project/workload level |
| `SetRetentionHold` | Create a retention hold on a namespace or composition |
| `ReleaseRetentionHold` | Release an active retention hold |
IAM
| RPC | Description |
|---|---|
| `RequestAccess` | Cluster admin requests access to tenant data |
| `ApproveAccess` | Tenant admin approves access request |
| `DenyAccess` | Tenant admin denies access request |
Operations
| RPC | Description |
|---|---|
| `SetMaintenanceMode` | Enable/disable cluster-wide maintenance mode |
| `ListFlavors` / `MatchFlavor` | Query and match deployment flavors |
Federation
| RPC | Description |
|---|---|
| `RegisterFederationPeer` | Register a remote Kiseki cluster for async replication |
Advisory policy
| RPC | Description |
|---|---|
| `SetAdvisoryPolicy` | Configure profiles, budgets, and state per scope |
| `TransitionAdvisoryState` | Transition advisory state (enabled/draining/disabled) |
| `GetEffectiveAdvisoryPolicy` | Compute effective policy for a workload (min across hierarchy) |
WorkflowAdvisoryService
Port: 9101 (data fabric, separate listener)
Provider: kiseki-advisory (via kiseki-server, isolated tokio runtime)
Consumers: Native client, any authorized tenant caller
| RPC | Type | Description |
|---|---|---|
| `DeclareWorkflow` | Unary | Declare a new workflow with profile, initial phase, and TTL. Returns a WorkflowRef handle and authorized pool handles. |
| `EndWorkflow` | Unary | End a declared workflow. Triggers audit summary and GC of workflow state. |
| `PhaseAdvance` | Unary | Advance to the next phase. Phase order is monotonic (I-WA13). |
| `GetWorkflowStatus` | Unary | Query current workflow state, phase, and budget usage. |
| `AdvisoryStream` | Bidirectional streaming | Multiplexed channel: hints in (client to storage), telemetry out (storage to client). |
| `SubscribeTelemetry` | Server streaming | Subscribe to specific telemetry channels for a workflow. |
Advisory stream message types
Inbound hints (client to storage):
- Access pattern declaration
- Prefetch range (up to 4096 tuples per hint, I-WA16)
- Affinity pool preference (via opaque pool handles, I-WA19)
- Priority class (within policy-allowed maximum)
- Retention intent
- Dedup intent
- Collective checkpoint announcement
- Deadline hint
Outbound telemetry (storage to client):
- Backpressure signal (ok / soft / hard severity with retry-after)
- Placement locality class (local-node / local-rack / same-pool / remote / degraded)
- Materialization lag
- Prefetch effectiveness
- QoS headroom
- Hotspot detection (caller-owned compositions only)
StorageAdminService (ADR-025)
Port: Management network
Provider: kiseki-server
Consumers: Cluster admin, SRE (read-only role)
| RPC | Type | Description |
|---|---|---|
| `ClusterStatus` | Unary | Cluster-wide status summary |
| `ListDevices` / `GetDevice` | Unary | Query storage devices |
| `AddDevice` / `RemoveDevice` | Unary | Add or remove a device (removal requires Removed state) |
| `EvacuateDevice` / `CancelEvacuation` | Unary | Trigger or cancel device evacuation |
| `ListPools` / `GetPool` / `PoolStatus` | Unary | Query affinity pools |
| `CreatePool` / `SetPoolDurability` / `SetPoolThresholds` | Unary | Manage pool configuration |
| `RebalancePool` / `CancelRebalance` | Unary | Trigger or cancel pool rebalance |
| `ListShards` / `GetShard` / `GetShardHealth` | Unary | Query shard state |
| `SplitShard` / `SetShardMaintenance` | Unary | Shard management |
| `SetTuningParams` / `GetTuningParams` | Unary | Runtime tuning parameters |
| `DrainNode` | Unary | Drain all shards and chunks from a node |
| `TriggerScrub` / `RepairChunk` / `ListRepairs` | Unary | Data integrity operations |
| `DeviceHealth` | Server streaming | Live device health events |
| `IOStats` | Server streaming | Live I/O statistics |
| `DeviceIOStats` | Server streaming | Per-device I/O statistics |
DiscoveryService
Port: 9100 (data fabric)
Provider: kiseki-server
Consumers: Native client
Used by the native client to discover shards, views, and gateways from the data fabric without requiring direct control plane access (I-O4, ADR-008).
Protocol binding
- Protobuf definitions: `proto/kiseki/v1/*.proto`
- Generated code: `kiseki-proto` crate
- Workflow ref header: `x-kiseki-workflow-ref-bin` (16 raw bytes as gRPC binary metadata, not a proto field, per ADR-021)
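Because the key ends in `-bin`, gRPC libraries treat the value as raw bytes and base64-encode it on the wire automatically. A hedged sketch of how a Python caller might build the per-call metadata entry (the helper function is illustrative, not part of any Kiseki client API; the actual channel call is omitted):

```python
import os

WORKFLOW_REF_KEY = "x-kiseki-workflow-ref-bin"

def workflow_ref_metadata(ref: bytes):
    """Build the gRPC metadata entry carrying a 16-byte workflow ref."""
    if len(ref) != 16:
        raise ValueError("workflow ref must be exactly 16 bytes")
    # gRPC metadata is a sequence of (key, value) pairs; "-bin" keys take bytes.
    return [(WORKFLOW_REF_KEY, ref)]

meta = workflow_ref_metadata(os.urandom(16))
print(meta[0][0])  # x-kiseki-workflow-ref-bin
```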
REST & Admin API
The kiseki-server binary exposes an HTTP server (default port 9090) for
health checks, Prometheus metrics, and an admin dashboard. All endpoints
are served via axum.
Health and metrics
GET /health
Liveness probe for load balancers and orchestrators.
Response: 200 OK with body ok when the server is running.
GET /metrics
Prometheus text-format metrics endpoint.
Response: 200 OK with text/plain body containing all registered
Prometheus metrics including:
- Raft state per shard (leader, follower, candidate)
- Chunk operations (reads, writes, dedup hits)
- Transport metrics (connections, bytes, errors per transport type)
- Pool utilization (capacity, used, free per pool)
- View materialization lag
- Advisory budget usage
Admin dashboard
GET /ui
HTML admin dashboard with HTMX live polling. Provides a visual overview of cluster health, node status, and operational metrics.
The dashboard polls the JSON API endpoints below for live updates.
JSON API endpoints
GET /ui/api/cluster
Cluster-wide summary with aggregated metrics from all nodes.
Response: JSON object with node count, total capacity, total used, shard count, and aggregated health status.
GET /ui/api/nodes
List of all known nodes with per-node metrics.
Response: JSON array of node objects, each with node ID, address, status, device count, shard count, and key metrics.
GET /ui/api/history
Metric time series for charting.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `hours` | float | 3 | Number of hours of history to retrieve |
Response: JSON object with hours and points array containing
timestamped metric snapshots.
GET /ui/api/events
Filtered event log for diagnostics and alerting.
Query parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `severity` | string | (all) | Filter by severity: info, warning, error, critical |
| `category` | string | (all) | Filter by category: node, shard, device, tenant, security, admin |
| `hours` | float | 3 | Hours to look back |
Response: JSON array of event objects with timestamp, severity, category, message, and source.
Operations endpoints
These endpoints trigger operational actions and require cluster admin authentication.
POST /ui/api/ops/maintenance
Toggle maintenance mode for the cluster or specific shards.
Request body: JSON with enabled (boolean) and optional shard_id.
Effect: Sets shards to read-only. Write commands are rejected with a retriable error (I-O6). Shard splits, compaction, and GC for in-progress operations continue.
POST /ui/api/ops/backup
Trigger a backup operation.
Request body: JSON with backup configuration parameters.
Effect: Initiates backup per ADR-016. Returns a job ID for status tracking.
POST /ui/api/ops/scrub
Trigger a data integrity scrub.
Request body: JSON with optional scope (pool, device, or cluster-wide).
Effect: Verifies chunk integrity via EC checksums. Reports corrupt or missing chunks. Triggers automatic repair for recoverable issues.
HTMX fragment endpoints
These endpoints return HTML fragments for the admin dashboard’s live polling:
| Endpoint | Description |
|---|---|
| `GET /ui/fragment/cluster-cards` | Cluster status summary cards |
| `GET /ui/fragment/node-table` | Node list table rows |
| `GET /ui/fragment/chart-data` | Chart data for metrics graphs |
| `GET /ui/fragment/alerts` | Active alerts and warnings |
CLI Reference
Kiseki provides two binaries with CLI interfaces: kiseki-server (which
doubles as the admin CLI) and kiseki-client (native client with staging
and cache commands).
All admin operations use these CLIs. The underlying gRPC API is also available for programmatic access (see gRPC), but the CLI is the primary admin interface.
kiseki-server
The server binary starts the storage node when invoked without arguments. When invoked with a subcommand, it acts as an admin CLI that connects to the local node’s gRPC endpoint.
Server mode
kiseki-server
Starts the storage node. Configuration is via environment variables (see Environment Variables).
status
kiseki-server status
Display cluster status summary: node count, shard count, device health, Raft leadership, and pool utilization.
Node management
kiseki-server node add --node-id <id>
kiseki-server node drain --node-id <id>
kiseki-server node remove --node-id <id>
Add, drain, or remove a node from the cluster. Drain migrates shard assignments before removal. See Cluster Management.
Shard management
kiseki-server shard list
kiseki-server shard info --shard-id <id>
kiseki-server shard health --shard-id <id>
kiseki-server shard split --shard-id <id> [--boundary <key>]
kiseki-server shard maintenance --shard-id <id> --enabled
kiseki-server shard maintenance --shard-id <id> --disabled
List shards, inspect details, check health, trigger manual splits, and toggle per-shard maintenance mode (I-O6).
Pool management
kiseki-server pool list
kiseki-server pool status --pool-id <id>
kiseki-server pool create --pool-id <id> --device-class <class> --ec-data <n> --ec-parity <n>
kiseki-server pool set-durability --pool-id <id> --ec-data <n> --ec-parity <n>
kiseki-server pool rebalance --pool-id <id>
kiseki-server pool cancel-rebalance --pool-id <id>
kiseki-server pool set-thresholds --pool-id <id> --warning-pct <n> --critical-pct <n>
Manage affinity pools: create, inspect capacity, set EC parameters, rebalance data, and adjust capacity thresholds (I-C5, I-C6).
Device management
kiseki-server device list
kiseki-server device info --device-id <id>
kiseki-server device evacuate --device-id <id>
kiseki-server device cancel-evacuation --device-id <id>
kiseki-server device scrub --device-id <id>
List devices, check health and SMART status, trigger evacuation or integrity scrub, and cancel in-progress evacuations (I-D2, I-D3, I-D5).
Maintenance mode
kiseki-server maintenance on
kiseki-server maintenance off
Enable or disable cluster-wide maintenance mode. Sets all shards to read-only. Write commands are rejected with a retriable error. Shard splits, compaction, and GC for in-progress operations continue but no new triggers fire from write pressure (I-O6).
Backup and recovery
kiseki-server backup create
kiseki-server backup list
kiseki-server backup delete --backup-id <id>
kiseki-server repair list
kiseki-server compact
Create, list, and delete backup snapshots. List active repairs and evacuations. Trigger Raft log compaction.
Key management
kiseki-server keymanager health
kiseki-server keymanager check-kms
kiseki-server keymanager check-kms --tenant-id <id>
Check system key manager health and tenant KMS connectivity.
S3 credentials
kiseki-server s3-credentials create --tenant-id <id> --workload-id <id>
Provision S3-compatible access keys for a tenant workload via the control plane.
Tuning parameters
kiseki-server tuning set --inline-threshold-bytes <n>
kiseki-server tuning set --raft-snapshot-interval <n>
kiseki-server tuning set --compaction-rate-mb-s <n>
kiseki-server tuning set --stream-proc-poll-ms <n>
Adjust cluster-wide tuning parameters. See Performance Tuning for guidance.
kiseki-client
The native client binary provides dataset staging and cache management commands for compute nodes.
stage --dataset
kiseki-client stage --dataset <path> [--timeout <seconds>]
Pre-fetch a dataset’s chunks into the L2 cache with pinned retention. Recursively enumerates compositions under the given namespace path, fetches all chunks from canonical, verifies by content-address (SHA-256), and stores in the L2 cache pool.
Staging is idempotent and resumable. Produces a manifest file listing staged compositions and chunk IDs.
Limits: max_staging_depth (10 levels), max_staging_files (100,000).
stage --status
kiseki-client stage --status
Show the status of the current staging operation: progress, number of chunks fetched, total size, and any errors.
stage --release
kiseki-client stage --release <path>
Release a staged dataset. Unpins cached chunks, making them eligible for LRU eviction. To pick up updates from canonical, release and re-stage.
stage --release-all
kiseki-client stage --release-all
Release all staged datasets.
cache --stats
kiseki-client cache --stats
Print cache statistics: mode, L1/L2 bytes used, hit/miss counts, errors, metadata cache stats, and wipe count.
cache --wipe
kiseki-client cache --wipe
Wipe all cached data (L1 + L2 + metadata). Zeroizes data before deletion (I-CC2).
version
kiseki-client version
Print the client version.
Environment variables (kiseki-client)
| Variable | Default | Description |
|---|---|---|
| `KISEKI_CACHE_DIR` | /tmp/kiseki-cache | Cache directory |
| `KISEKI_CACHE_MODE` | organic | Cache mode: pinned, organic, bypass |
| `KISEKI_CACHE_L1_MAX` | 268435456 (256 MB) | L1 max bytes |
| `KISEKI_CACHE_L2_MAX` | 53687091200 (50 GB) | L2 max bytes |
kiseki-admin
Standalone remote administration CLI. Runs from an admin workstation and connects to any Kiseki node via the REST API (port 9090). No server dependencies are needed on the workstation.
Default endpoint: KISEKI_ENDPOINT env var, or http://localhost:9090.
status
kiseki-admin --endpoint http://storage-node:9090 status
Cluster status summary: node count, Raft entries, gateway requests, data written/read, and active connections.
Example output:
Cluster Status
══════════════
Nodes: 3/3 healthy
Raft: 42,567 entries
Requests: 1,234 served
Written: 12.5 GB
Read: 8.2 GB
Connections: 15 active
nodes
kiseki-admin nodes
Node list with health badges and per-node metrics.
Example output:
NODE STATUS RAFT REQUESTS WRITTEN READ CONNS
10.0.0.1:9090 healthy 14,189 411 4.2 GB 2.7 GB 5
10.0.0.2:9090 healthy 14,189 412 4.2 GB 2.8 GB 5
10.0.0.3:9090 healthy 14,189 411 4.1 GB 2.7 GB 5
events
kiseki-admin events [--severity error] [--hours 1]
Filtered event log. Optional --severity (info, warning, error,
critical) and --hours (default: 3).
Example output:
TIME SEVERITY CATEGORY SOURCE MESSAGE
12:34:56 ERROR node node-3 unreachable
12:35:12 ERROR device nvme0n1 CRC mismatch detected
history
kiseki-admin history [--hours 3]
Metric history time series for the specified number of hours (default: 3).
maintenance
kiseki-admin maintenance on
kiseki-admin maintenance off
Toggle cluster-wide maintenance mode. Enables read-only on all shards. Write commands return a retriable error (I-O6).
backup
kiseki-admin backup
Trigger a background backup operation (ADR-016).
scrub
kiseki-admin scrub
Trigger a background data integrity scrub.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | Connection failure (server unreachable) |
| 4 | Authentication failure (mTLS) |
Environment Variables
All Kiseki configuration is done via environment variables. No configuration files are used for runtime settings (I-K8: keys are never stored in configuration files).
Server configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_DATA_ADDR` | SocketAddr | 0.0.0.0:9100 | Data-path gRPC listener address |
| `KISEKI_ADVISORY_ADDR` | SocketAddr | 0.0.0.0:9101 | Advisory gRPC listener address (isolated runtime) |
| `KISEKI_S3_ADDR` | SocketAddr | 0.0.0.0:9000 | S3 HTTP gateway listener address |
| `KISEKI_NFS_ADDR` | SocketAddr | 0.0.0.0:2049 | NFS server listener address |
| `KISEKI_METRICS_ADDR` | SocketAddr | 0.0.0.0:9090 | Prometheus metrics and admin UI listener address |
| `KISEKI_DATA_DIR` | PathBuf | (none) | Persistent storage directory for redb databases. If unset, runs in-memory only. |
| `KISEKI_NODE_ID` | u64 | 0 | Raft node ID. 0 = single-node mode. |
| `KISEKI_BOOTSTRAP` | bool | false | Create a well-known bootstrap shard on startup. Set to true or 1 for development/testing. |
TLS configuration
TLS is enabled when all three path variables are set. Otherwise the server runs in plaintext mode (development only, logged as a warning).
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_CA_PATH` | PathBuf | (none) | Cluster CA certificate PEM file |
| `KISEKI_CERT_PATH` | PathBuf | (none) | Node certificate chain PEM file |
| `KISEKI_KEY_PATH` | PathBuf | (none) | Node private key PEM file |
| `KISEKI_CRL_PATH` | PathBuf | (none) | Optional CRL PEM file for certificate revocation |
Raft configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| `KISEKI_RAFT_ADDR` | SocketAddr | (none) | Raft RPC listen address. Required for multi-node clusters. |
| `KISEKI_RAFT_PEERS` | String | (empty) | Comma-separated peer list in `id=addr` format, e.g. `1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200` |
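The `id=addr` peer-list format is simple to parse; a minimal sketch of the expected parsing behavior (function name illustrative, not the server's actual implementation, which is in Rust):

```python
def parse_raft_peers(spec: str) -> dict[int, str]:
    """Parse a KISEKI_RAFT_PEERS value ("id=addr,id=addr,...") into {id: addr}."""
    peers = {}
    for entry in filter(None, spec.split(",")):
        node_id, _, addr = entry.partition("=")
        peers[int(node_id)] = addr
    return peers

print(parse_raft_peers("1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200"))
# {1: '10.0.0.1:9200', 2: '10.0.0.2:9200', 3: '10.0.0.3:9200'}
```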
Metadata capacity (ADR-030)
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | u8 | 50 | Soft limit percentage for system disk metadata usage. Exceeding triggers inline threshold reduction. |
| KISEKI_META_HARD_LIMIT_PCT | u8 | 75 | Hard limit percentage for system disk metadata usage. Exceeding forces inline threshold to INLINE_FLOOR and emits alert (I-SF2). |
| KISEKI_RAFT_INLINE_MBPS | u32 | 10 | Per-shard Raft inline throughput cap in MB/s. Prevents inline data from starving metadata-only Raft operations (I-SF7). |
Client cache configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_CACHE_MODE | String | organic | Cache mode: organic (LRU), pinned (staging-driven), or bypass (no caching) |
| KISEKI_CACHE_DIR | PathBuf | /tmp/kiseki-cache | L2 cache pool directory on local NVMe |
| KISEKI_CACHE_L2_MAX | u64 | 53687091200 (50 GiB) | Maximum L2 cache size in bytes |
| KISEKI_CACHE_POOL_ID | String | (generated) | Adopt an existing L2 pool (128-bit hex). Used for staging handoff between processes. |
Transport configuration
| Variable | Type | Default | Description |
|---|---|---|---|
| KISEKI_IB_DEVICE | String | (auto-detect) | InfiniBand device name for RDMA verbs transport. If unset, auto-detects the first available device. |
Observability
Standard Rust/tokio observability variables:
| Variable | Type | Default | Description |
|---|---|---|---|
| RUST_LOG | String | info | Log filter directive (e.g., kiseki_log=debug,kiseki_raft=trace) |
| OTEL_EXPORTER_OTLP_ENDPOINT | String | (none) | OpenTelemetry collector endpoint for distributed tracing |
Example: single-node development
export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_BOOTSTRAP=true
kiseki-server
Example: three-node cluster
# Node 1
export KISEKI_NODE_ID=1
export KISEKI_DATA_DIR=/var/lib/kiseki
export KISEKI_RAFT_ADDR=10.0.0.1:9200
export KISEKI_RAFT_PEERS=1=10.0.0.1:9200,2=10.0.0.2:9200,3=10.0.0.3:9200
export KISEKI_CA_PATH=/etc/kiseki/ca.pem
export KISEKI_CERT_PATH=/etc/kiseki/node1.pem
export KISEKI_KEY_PATH=/etc/kiseki/node1-key.pem
export KISEKI_BOOTSTRAP=true
kiseki-server
Architecture Decision Records
All architectural decisions are recorded as ADRs in
specs/architecture/adr/.
ADR index
| ADR | Title | Status |
|---|---|---|
| ADR-001 | Pure Rust, No Mochi Dependency | Accepted |
| ADR-002 | Two-Layer Encryption Model (C) | Accepted |
| ADR-003 | System DEK Derivation (Not Storage) | Accepted |
| ADR-004 | Schema Versioning and Rolling Upgrades | Accepted |
| ADR-005 | Erasure Coding and Chunk Durability | Accepted |
| ADR-006 | Inline Data Threshold | Accepted |
| ADR-007 | System Key Manager HA via Raft | Accepted |
| ADR-008 | Native Client Fabric Discovery | Accepted |
| ADR-009 | Audit Log Sharding and GC | Accepted |
| ADR-010 | Retention Hold Enforcement Before Crypto-Shred | Accepted |
| ADR-011 | Crypto-Shred Cache Invalidation and TTL | Accepted |
| ADR-012 | Stream Processor Tenant Isolation | Accepted |
| ADR-013 | POSIX Semantics Scope | Accepted |
| ADR-014 | S3 API Compatibility Scope | Accepted |
| ADR-015 | Observability Contract | Accepted |
| ADR-016 | Backup and Disaster Recovery | Accepted |
| ADR-017 | Dedup Refcount Metadata Access Control | Accepted |
| ADR-018 | Runtime Integrity Monitor | Accepted |
| ADR-019 | Gateway Deployment Model | Accepted |
| ADR-020 | Workflow Advisory & Client Telemetry | Accepted |
| ADR-021 | Workflow Advisory Architecture | Accepted |
| ADR-022 | Storage Backend – redb (Pure Rust) | Accepted |
| ADR-023 | Protocol RFC Compliance Scope | Accepted |
| ADR-024 | Device Management, Storage Tiers, and Capacity Thresholds | Accepted |
| ADR-025 | Storage Administration API | Accepted |
| ADR-026 | Raft Topology – Per-Shard on Fabric (Strategy A) | Accepted |
| ADR-027 | Single-Language Implementation – Rust Only | Accepted |
| ADR-028 | External Tenant KMS Providers | Accepted |
| ADR-029 | Raw Block Device Allocator | Accepted |
| ADR-030 | Dynamic Small-File Placement and Metadata Capacity Management | Accepted |
| ADR-031 | Client-Side Cache | Accepted |
ADR template
New ADRs follow this structure:
# ADR-NNN: Title
**Status**: Proposed | Accepted | Superseded by ADR-XXX
**Date**: YYYY-MM-DD
**Context**: Why this decision is needed.
## Decision
What was decided and why.
## Consequences
What changes as a result. Trade-offs accepted.
## Alternatives considered
What else was evaluated and why it was rejected.
Key decisions by topic
Language and architecture
- ADR-001: Pure Rust (no Mochi dependency)
- ADR-027: Single-language Rust (Go control plane replaced)
- ADR-022: redb as storage backend (pure Rust, no RocksDB)
Encryption
- ADR-002: Two-layer encryption model (system DEK + tenant KEK)
- ADR-003: HKDF-based DEK derivation (not per-chunk storage)
- ADR-011: Crypto-shred cache invalidation TTL
- ADR-028: External tenant KMS providers (Vault, KMIP, AWS KMS, PKCS#11)
Consensus and replication
- ADR-007: System key manager HA via Raft
- ADR-026: Per-shard Raft groups on fabric (Strategy A)
- ADR-009: Audit log sharding and GC
Storage
- ADR-005: Erasure coding and chunk durability
- ADR-006: Inline data threshold
- ADR-029: Raw block device allocator
- ADR-030: Dynamic small-file placement
Protocols and access
- ADR-008: Native client fabric discovery
- ADR-013: POSIX semantics scope
- ADR-014: S3 API compatibility scope
- ADR-019: Gateway deployment model
- ADR-023: Protocol RFC compliance scope
Operations
- ADR-015: Observability contract
- ADR-016: Backup and disaster recovery
- ADR-024: Device management and capacity thresholds
- ADR-025: Storage administration API
Advisory
- ADR-020: Workflow advisory and client telemetry
- ADR-021: Workflow advisory architecture
Client
- ADR-031: Client-side cache
ADR-001: Pure Rust, No Mochi Dependency
Status: Accepted Date: 2026-04-17 Context: Q-E3, A-E3
Decision
Build all core components in Rust. Do not depend on Mochi (Mercury/Bake/SDSKV). Learn from Mochi’s design patterns (transport abstraction, composable services).
Rationale
- Mochi has never been deployed in regulated environments (HIPAA/GDPR)
- C/C++ FFI creates a FIPS compliance surface across two languages
- Single-language FIPS module boundary is cleaner for certification
- Rust ecosystem has the building blocks (aws-lc-rs for FIPS, tokio, tonic, openraft)
- Weakest link is libfabric/CXI Rust binding — bounded scope, solvable
Consequences
- Must build transport abstraction in Rust (kiseki-transport)
- Must build chunk storage engine in Rust (kiseki-chunk)
- Must build KV backend for log storage in Rust (RocksDB via rust-rocksdb, or sled)
- libfabric-sys crate needed for Slingshot support (immature, may need contribution)
ADR-002: Two-Layer Encryption Model (C)
Status: Accepted Date: 2026-04-17 Context: Q-K-arch1, I-K1 through I-K14
Decision
Single data encryption pass at the system layer. Tenant access via key wrapping. No double encryption.
- System DEK encrypts chunk data (AES-256-GCM via FIPS module)
- Tenant KEK wraps access to system DEK derivation material
- System key manager derives per-chunk DEKs via HKDF (see ADR-003)
Rationale
- Single encryption pass at HPC line rates (200+ Gbps per NIC)
- Double encryption doubles CPU cost for no additional security benefit given that both layers use authenticated encryption
- Key wrapping is O(32 bytes) per operation vs O(data_size) for encryption
- Cross-tenant dedup works: same plaintext → same chunk_id → one ciphertext, multiple tenant KEK wrappings
Consequences
- Crypto-shred destroys tenant KEK → data unreadable but not physically deleted
- System key compromise exposes system-layer ciphertext; combined with tenant KEK = full access. System key manager must be highly protected (ADR-007).
- Envelope must carry both system and tenant wrapping metadata
ADR-003: System DEK Derivation (Not Storage)
Status: Accepted Date: 2026-04-17 Context: B-ADV-3 (system DEK count at scale), escalation point 3
Decision
System DEKs are derived locally on storage nodes via HKDF, not stored individually and not derived via RPC to the key manager.
system_dek = HKDF-SHA256(
key = system_master_key[epoch],
salt = chunk_id,
info = "kiseki-chunk-dek-v1"
)
Key distribution model (revised per ADV-ARCH-01)
The system key manager (kiseki-keyserver) stores and replicates master keys. Storage nodes (kiseki-server) fetch the current master key at startup and on epoch rotation. DEK derivation happens locally on the storage node — the key manager never sees individual chunk_ids.
kiseki-keyserver:
Stores: master_key per epoch
Serves: master_key to authenticated kiseki-server processes
Never sees: individual chunk_ids or per-chunk operations
kiseki-server:
Caches: master_key (mlock'd, refreshed on rotation)
Derives: per-chunk DEK = HKDF(master_key, chunk_id) — locally
Never sends: chunk_ids to the key manager
This prevents the key manager from building an index of all chunk_ids (ADV-ARCH-01: HKDF leak), which would reconstruct the per-tenant refcount data we explicitly decided not to store (ADR-017).
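The derivation above is standard RFC 5869 HKDF-SHA256. A minimal sketch of local derivation and its determinism, using a placeholder master key (the real implementation uses the FIPS module in aws-lc-rs, not Python):

```python
import hashlib
import hmac

def hkdf_sha256(key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF: extract with salt, then expand with info."""
    prk = hmac.new(salt, key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

master_key = bytes(32)  # placeholder epoch master key
dek = hkdf_sha256(master_key, salt=b"chunk-00000001", info=b"kiseki-chunk-dek-v1")
# Same (master_key, chunk_id) always yields the same DEK, so nothing is stored;
# a different chunk_id yields an unrelated DEK.
assert dek == hkdf_sha256(master_key, salt=b"chunk-00000001", info=b"kiseki-chunk-dek-v1")
assert dek != hkdf_sha256(master_key, salt=b"chunk-00000002", info=b"kiseki-chunk-dek-v1")
```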
Rationale
- At petabyte scale with ~1MB average chunks: billions of chunks
- Storing billions of 32-byte DEKs = tens of GB in the key manager
- HKDF derivation is deterministic: same (master_key, chunk_id) → same DEK
- The key manager stores only master keys (one per epoch) — trivial storage
- HKDF is fast (~microseconds) and FIPS-approved
- Local derivation eliminates per-chunk RPC to key manager (performance)
- Key rotation: new epoch = new master key. Old master keys retained during migration window. Derivation still works for old-epoch chunks.
- Key manager never sees chunk-level operations (ADV-ARCH-01 fix)
Consequences
- System key manager is simpler (stores epochs, not individual DEKs)
- Master key is cached in kiseki-server memory — this is the highest-value target on a storage node (ADV-ARCH-04, accepted risk with mitigations: mlock, MADV_DONTDUMP, seccomp, core dumps disabled)
- Master key compromise exposes ALL system DEKs for that epoch
- Chunk ID is used as salt — chunk ID must not change after creation
- Tenant KEK wraps the HKDF derivation parameters (epoch + chunk_id), not the DEK itself — unwrapping + HKDF derives the DEK
ADR-004: Schema Versioning and Rolling Upgrades
Status: Accepted Date: 2026-04-17 Context: A-ADV-2 (upgrade and schema evolution)
Decision
All persistent formats carry a version field. Rolling upgrades are supported with N-1/N version compatibility.
Delta envelope versioning
- DeltaHeader.format_version: u16 is the first field, at a fixed offset
- Readers that encounter an unknown version fail open (skip the delta, log a warning) rather than crash
- Writers always produce the current version
- Compaction preserves original format version (does not upgrade)
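The fail-open rule can be sketched as follows (the version number and field layout here are illustrative, not the actual wire format):

```python
import struct

CURRENT_VERSION = 1  # illustrative current delta format version

def read_delta(buf: bytes):
    """Read format_version first; skip (fail open) when the version is unknown."""
    (version,) = struct.unpack_from("<H", buf, 0)  # u16, first field, fixed offset
    if version > CURRENT_VERSION:
        # In the real system: log a warning and skip rather than crash.
        return None
    return {"version": version, "body": buf[2:]}

assert read_delta(struct.pack("<H", 1) + b"meta") is not None
assert read_delta(struct.pack("<H", 9) + b"future-format") is None  # fail open
```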
Chunk envelope versioning
- EnvelopeMeta.format_version: u16
- Algorithm ID already provides crypto-agility
- New envelope fields are additive (protobuf-style: unknown fields preserved)
Wire protocol versioning (gRPC)
- Protobuf with reserved fields and additive evolution
- gRPC service versioning: /kiseki.v1.LogService, /kiseki.v2.LogService
- Native client negotiates version on connect
View materialization
- Stream processors declare which delta format versions they support
- Upgrade sequence: deploy new stream processors first (can read old+new), then upgrade writers (produce new format)
Rolling upgrade sequence
- Deploy new kiseki-server binaries (can read old + new formats)
- Rolling restart of storage nodes (one at a time, Raft quorum maintained)
- Deploy new kiseki-control (Go, independent restart)
- Deploy new kiseki-client-fuse to compute nodes
- After all nodes are upgraded: optional compaction to upgrade old deltas
Consequences
- All format changes must be backward-compatible for at least one version
- Breaking changes require a two-phase rollout (add new, migrate, remove old)
- Format version is the first field read on every deserialization path
ADR-005: Erasure Coding and Chunk Durability
Status: Accepted Date: 2026-04-17 Context: I-C4, escalation point 10
Decision
EC parameters are per affinity pool, configured by cluster admin.
Default profiles
| Pool type | Strategy | Rationale |
|---|---|---|
| fast-nvme (metadata, hot data) | EC 4+2 | Balance of space efficiency and rebuild speed |
| bulk-nvme (cold data, checkpoints) | EC 8+3 | Higher space efficiency for bulk data |
| meta-nvme (log SSTables, key manager) | Replication-3 | Lowest latency for consensus-critical data |
Chunk-RDMA alignment (C-ADV-3)
Content-defined chunking produces variable-size chunks. For RDMA:
- Chunks are stored with 4KB-aligned padding on disk
- RDMA scatter-gather lists map logical chunk boundaries to aligned physical blocks
- One-sided RDMA transfers use pre-registered memory regions at 4KB alignment
- Padding overhead is bounded: max 4KB per chunk, typically <1% for chunks >256KB
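The padding bound is easy to verify numerically. A quick sketch of the 4 KB round-up and the resulting overhead:

```python
ALIGN = 4096  # 4 KB physical alignment for RDMA-registered regions

def pad_to_align(length: int, align: int = ALIGN) -> int:
    """Round a chunk length up to the next alignment boundary (ceiling division)."""
    return -(-length // align) * align

for length in (1_000, 100_000, 262_144, 1_048_576):
    padded = pad_to_align(length)
    print(f"{length:>9} B -> {padded:>9} B  ({(padded - length) / length:.2%} padding)")
# Overhead is at most 4 KB per chunk regardless of size, so it shrinks toward
# zero as chunks grow; for chunks well above 256 KB it is typically under 1%.
```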
Consequences
- Pool-level EC config means all chunks in a pool share the same protection level
- Changing EC parameters requires re-encoding existing chunks (background process)
- RDMA alignment adds trivial storage overhead but enables zero-copy transfers
ADR-006: Inline Data Threshold
Status: Accepted Date: 2026-04-17 Context: Escalation point 6, analyst session
Decision
Delta payloads may carry inline data up to 4096 bytes (4KB).
Data below this threshold is encrypted and stored directly in the delta payload. No separate chunk write occurs.
Rationale
- Small files (symlinks, xattrs, tiny configs): avoid chunk overhead
- DeltaFS validated this pattern at scale (inode metadata with inline data)
- 4KB aligns with filesystem block size and NVMe sector size
- Raft replication cost per delta increases slightly but acceptably (4KB payload vs ~200 byte metadata-only delta)
- Standard practice: ext4, Btrfs, XFS all support inline data
Threshold selection
| Threshold | Raft cost | Use cases captured | Chunk overhead saved |
|---|---|---|---|
| 1KB | Minimal | Symlinks, xattrs | Low |
| 4KB | Acceptable | Small files, metadata, configs | Moderate |
| 8KB | Noticeable | More files inline | Higher but Raft fan-out increases |
| 64KB | Significant | Too much data in the log | Raft becomes bottleneck |
4KB is the sweet spot: captures the majority of metadata-only operations without overloading Raft replication.
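The placement decision itself reduces to a single comparison; a sketch with hypothetical names:

```python
INLINE_THRESHOLD = 4096  # bytes, cluster-level setting (ADR-006)

def placement(data: bytes) -> str:
    """Payloads at or below the threshold ride inside the delta (encrypted);
    anything larger goes through the normal chunk write path."""
    return "inline-in-delta" if len(data) <= INLINE_THRESHOLD else "chunk-store"

assert placement(b"target/of/symlink") == "inline-in-delta"  # symlink target
assert placement(b"\x00" * 64 * 1024) == "chunk-store"       # 64 KB file
```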
Consequences
- Configurable per cluster (system-level setting, not per-tenant)
- Compaction must handle deltas with inline data (encrypted payload may be larger than metadata-only deltas)
ADR-007: System Key Manager HA via Raft
Status: Accepted Date: 2026-04-17 Context: I-K12, escalation point 7, B-ADV-3
Decision
The system key manager is a dedicated Raft group (3 or 5 members) running as kiseki-keyserver on dedicated nodes. It stores system master keys (one per epoch) and distributes them to storage nodes, which derive per-chunk DEKs locally via HKDF (ADR-003).
Architecture
kiseki-keyserver (3-5 nodes, Raft)
├── Stores: system master keys (one per epoch, ~32 bytes each)
├── Serves: master keys to authenticated kiseki-server processes; DEK derivation is local to storage nodes (ADR-003)
├── Manages: epoch lifecycle (create, rotate, retain, destroy)
└── Audits: all key events to audit log
Rationale
- System key manager is the highest-severity SPOF (P0 if unavailable)
- Must be at least as available as the Log
- Raft provides consensus + replication + leader election
- Separate from shard Raft groups (independent failure domain)
- Dedicated nodes: key material never co-located with tenant data
- Master key storage is trivial (epochs × 32 bytes)
- DEK derivation is stateless and fast (HKDF, ~microseconds)
Deployment
- 3 nodes for standard deployments, 5 for high-criticality
- Dedicated hardware (or at minimum, dedicated processes on control-plane nodes)
- Key material in memory only (mlock’d, guard pages)
- On-disk: Raft log + snapshot of epoch state (encrypted with node-local key)
Consequences
- Adds a deployment component (kiseki-keyserver)
- Key manager must be deployed and healthy before any data operations
- Cross-site: each site has its own system key manager (federation doesn’t share system keys — only tenant keys cross sites via tenant KMS)
ADR-008: Native Client Fabric Discovery
Status: Accepted Date: 2026-04-17 Context: Escalation point 8, A-ADV-1, I-O4
Decision
Native clients discover shards, views, and gateways via a lightweight discovery service running on every storage node, accessible on the data fabric. No control plane access required.
Mechanism
- Bootstrap: client is configured with a list of seed endpoints (storage node addresses on the data fabric). Seed list can be provided via environment variable, config file, or DHCP option.
- Discovery query: client sends a discovery request to any seed. The storage node responds with:
  - List of active shards (shard_id, leader node, key range)
  - List of materialized views (view_id, protocol, endpoint)
  - List of gateway endpoints (protocol, transport)
  - Tenant authentication requirements
- Authentication: client presents mTLS certificate (Cluster CA signed, per-tenant). Optional second-stage auth via tenant IdP.
- Cache: discovery results cached with TTL. Periodic refresh. Shard split/merge events invalidate relevant cache entries.
- Transport negotiation: client probes available transports (CXI → verbs → TCP) and selects the highest-performance option.
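The transport negotiation step is a preference scan over whatever the fabric actually offers. A sketch (transport names follow the document; the helper itself is illustrative):

```python
TRANSPORT_PREFERENCE = ("cxi", "verbs", "tcp")  # CXI, then RDMA verbs, then TCP fallback

def negotiate_transport(available: set) -> str:
    """Pick the highest-performance transport among those the probe discovered."""
    for transport in TRANSPORT_PREFERENCE:
        if transport in available:
            return transport
    raise RuntimeError("no usable transport discovered")

assert negotiate_transport({"tcp", "verbs"}) == "verbs"  # no CXI on this fabric
assert negotiate_transport({"tcp"}) == "tcp"             # plain Ethernet fallback
```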
Why not DNS-SD or multicast
- Slingshot fabric may not support multicast reliably
- DNS-SD requires DNS infrastructure on the data fabric
- Seed-based discovery is simple, deterministic, and works with any transport
Consequences
- Every storage node runs a discovery responder (lightweight, part of kiseki-server)
- Seed list is the only bootstrap configuration for compute nodes
- Discovery responder must not expose tenant-sensitive information (shard/view metadata is operational, not tenant content)
ADR-009: Audit Log Sharding and GC
Status: Accepted Date: 2026-04-17 Context: B-ADV-1 (audit log scalability)
Decision
The audit log is sharded per tenant with its own archival lifecycle.
Architecture
Audit subsystem:
├── Per-tenant audit shard (append-only, Raft-replicated)
│ └── Contains: tenant events + relevant system events
│ └── GC: events archived to cold storage after retention period
│ └── Retention period: set by compliance tags (e.g., HIPAA = 6 years)
│
├── System audit shard (cluster-wide operational events)
│ └── Contains: node events, maintenance, non-tenant-scoped events
│ └── GC: configurable retention (default 1 year)
│
└── Export pipeline
└── Tenant export: filtered stream to tenant VLAN
└── System export: to cluster admin's SIEM
GC interaction with delta GC (I-L4)
- Each tenant audit shard tracks its own watermark per data shard
- Delta GC checks the relevant tenant audit shard’s watermark
- A stalled tenant audit shard blocks delta GC only for that tenant’s data shards (not cluster-wide)
Rationale
- Single global audit log is a cluster-wide GC bottleneck (B-ADV-1)
- Per-tenant sharding: stalled export for one tenant doesn’t block others
- Audit retention aligns with compliance (HIPAA 6yr, GDPR varies)
- Archived events move to cold storage (bulk-nvme pool) after active retention
GC safety valve and backpressure (analyst backpass contention 2)
Default behavior (safety valve): if a tenant’s audit export stalls for longer than a configurable threshold (default 24 hours), data shard GC proceeds anyway. The audit gap is logged and the compliance team is notified. Storage exhaustion is worse than an auditable gap.
Per-tenant configurable: tenants can enable audit backpressure mode. When enabled, if the audit export falls behind, write throughput for that tenant is throttled (reducing GC pressure at the source). This preserves audit completeness at the cost of write performance.
| Mode | GC behavior | Write impact | Use case |
|---|---|---|---|
| Safety valve (default) | GC proceeds after timeout | None | Most tenants |
| Backpressure (opt-in) | GC waits; writes throttled | Slower writes | Strict compliance |
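The two modes reduce to a small decision function. A sketch under assumed names (watermarks as monotonically increasing delta sequence numbers):

```python
SAFETY_VALVE_SECS = 24 * 3600  # default stall threshold before GC proceeds anyway

def gc_may_proceed(delta_seq: int, audit_watermark: int,
                   stall_secs: float, mode: str = "safety-valve") -> bool:
    """Delta GC waits for the tenant audit watermark, with a per-mode escape hatch."""
    if delta_seq <= audit_watermark:
        return True                             # audit export has caught up
    if mode == "safety-valve":
        return stall_secs > SAFETY_VALVE_SECS   # proceed anyway; log the auditable gap
    return False                                # backpressure: GC waits, writes throttle

assert gc_may_proceed(100, 150, stall_secs=0)                            # exported
assert not gc_may_proceed(200, 150, stall_secs=60)                       # brief stall
assert gc_may_proceed(200, 150, stall_secs=25 * 3600)                    # valve opens
assert not gc_may_proceed(200, 150, stall_secs=25 * 3600, mode="backpressure")
```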
Consequences
- More audit shards to manage (one per tenant + one system)
- Audit Raft groups are lightweight (small append-only logs)
- Archival pipeline is a background process
- Safety valve prevents storage exhaustion from stalled audit export
- Backpressure mode available for tenants with strict audit requirements
ADR-010: Retention Hold Enforcement Before Crypto-Shred
Status: Accepted Date: 2026-04-17 Context: B-ADV-4 (retention hold ordering race)
Decision
Compliance tags that imply retention requirements automatically create retention holds when data is written. Crypto-shred checks for active holds before proceeding.
Mechanism
- When a namespace has compliance tags (HIPAA, GDPR, etc.), the control plane derives retention requirements from the tag.
- A default retention hold is automatically created for the namespace with the TTL mandated by the compliance regime.
- Crypto-shred for a tenant checks all namespaces for active holds:
- If holds exist: crypto-shred proceeds (KEK destroyed, data unreadable) but physical GC is blocked (correct behavior).
- If no holds exist AND compliance tags imply retention: crypto-shred is blocked with an error requiring explicit override.
- Override requires force_without_hold_check: true plus an audit log entry documenting the override and the reason.
Compliance tag → retention mapping (configurable)
| Tag | Default retention | Source |
|---|---|---|
| HIPAA | 6 years | 45 CFR §164.530(j) |
| GDPR | Per DPA agreement | No fixed minimum |
| revFADP | Per data controller policy | Swiss FDPA Art. 6 |
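The pre-shred check described above can be sketched as follows (function and field names are hypothetical):

```python
def check_crypto_shred(active_holds: list, compliance_tags: list,
                       force_without_hold_check: bool = False) -> str:
    """Gate crypto-shred on retention holds derived from compliance tags."""
    if active_holds:
        # KEK destruction proceeds; physical GC stays blocked by the holds.
        return "shred-ok-physical-gc-blocked"
    if compliance_tags and not force_without_hold_check:
        raise PermissionError(
            "compliance tags imply retention but no hold exists; "
            "explicit force_without_hold_check override required")
    return "shred-ok"  # an override, if used, is written to the audit log

assert check_crypto_shred(["hipaa-default-hold"], ["HIPAA"]) == "shred-ok-physical-gc-blocked"
assert check_crypto_shred([], ["HIPAA"], force_without_hold_check=True) == "shred-ok"
```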
Consequences
- Retention holds are created automatically, reducing risk of human error
- Crypto-shred with override is audited (compliance team can review)
- Tenant admin can extend holds but not shorten below compliance minimum
ADR-011: Crypto-Shred Cache Invalidation and TTL
Status: Accepted Date: 2026-04-17 Context: B-ADV-5 (crypto-shred propagation)
Decision
Maximum tenant KEK cache TTL is 60 seconds. Crypto-shred triggers an active invalidation broadcast in addition to TTL expiry.
Mechanism
- Default cache TTL: 60 seconds (configurable per tenant, cannot exceed max)
- On crypto-shred:
  a. KEK destroyed in tenant KMS
  b. Invalidation broadcast to all known gateways, stream processors, and native clients for that tenant
  c. Components receiving the invalidation immediately purge the cached KEK
  d. Components unreachable during the broadcast expire naturally at TTL
- Crypto-shred operation returns success after KEK destruction + broadcast (does not wait for all acknowledgments)
- Maximum residual window: 60 seconds (cache TTL for unreachable components)
TTL configuration (analyst backpass contention 3)
The 60-second TTL is the default, not a fixed value. TTL is configurable per tenant within bounds:
| Parameter | Value | Rationale |
|---|---|---|
| Minimum TTL | 5 seconds | Below this, KMS load becomes problematic (key fetch every 5s per component) |
| Default TTL | 60 seconds | Reasonable for most deployments |
| Maximum TTL | 300 seconds (5 min) | Beyond this, the crypto-shred window is unreasonable |
Tenants under stricter regulation can request shorter TTL (e.g., 10s). The trade-off is higher KMS load (more frequent key fetches). The control plane validates that the requested TTL is within [min, max] and warns if KMS capacity may be insufficient.
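The bounds check the control plane performs can be sketched (the validator name is hypothetical; the bounds are from the table above):

```python
MIN_TTL, DEFAULT_TTL, MAX_TTL = 5, 60, 300  # seconds, per the TTL bounds table

def validate_kek_ttl(requested=None) -> int:
    """Validate a tenant's requested KEK cache TTL against the [min, max] bounds."""
    if requested is None:
        return DEFAULT_TTL
    if not MIN_TTL <= requested <= MAX_TTL:
        raise ValueError(f"TTL must be within [{MIN_TTL}, {MAX_TTL}] seconds")
    return requested  # shorter TTLs mean more frequent KMS key fetches

assert validate_kek_ttl() == 60
assert validate_kek_ttl(10) == 10  # stricter tenant accepts higher KMS load
```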
HIPAA/GDPR acceptability
- GDPR Art. 17 requires erasure “without undue delay” — even 300 seconds is within reasonable interpretation for a distributed system
- HIPAA does not specify a time bound for deletion
- The audit log records exact times: KEK destroyed, broadcast sent, cache TTL expiry — providing compliance evidence
- Configurable TTL allows compliance-sensitive tenants to reduce the window
Consequences
- Default 60-second window where data is technically readable after shred
- Configurable per tenant within [5s, 300s] bounds
- Components must handle invalidation broadcast (new message type)
- Native clients on unreachable compute nodes: data readable until their process exits or TTL expires (whichever comes first)
- Shorter TTLs increase KMS load (more frequent key fetches)
- TTL bounds are performance parameters that may conflict with compliance — the minimum (5s) is a hard engineering limit, not a policy choice
ADR-012: Stream Processor Tenant Isolation
Status: Accepted Date: 2026-04-17 Context: B-ADV-6 (stream processor isolation)
Decision
Stream processors for different tenants run in separate OS processes
on storage nodes. Key material is protected with mlock and guard pages.
Isolation model
| Mechanism | Purpose |
|---|---|
| Separate processes | OS-level memory isolation between tenants |
| mlock on key pages | Prevent key material from swapping to disk |
| Guard pages | Detect buffer overflows near key material |
| seccomp (Linux) | Restrict syscalls to minimum needed |
| Separate cgroups | Resource isolation (CPU, memory) per tenant |
Co-location policy
- Small tenants: multiple stream processors per node (process isolation)
- Large/sensitive tenants: dedicated nodes (configurable via placement policy)
- Compliance tags can mandate dedicated nodes (e.g., HIPAA with strict isolation)
Hardware isolation (future)
- AMD SEV-SNP / Intel TDX confidential VMs: out of scope for initial build
- Envelope format and key wrapping are compatible with confidential compute (keys are already protected end-to-end; adding a TEE is additive, not architectural change)
Consequences
- More processes per storage node (one per tenant per view)
- Process management in kiseki-server (spawn, monitor, restart)
- Memory overhead per process (Rust process ~10-20MB base)
- Key material never in shared memory between tenants
ADR-013: POSIX Semantics Scope
Status: Accepted Date: 2026-04-17 Context: A-ADV-4 (POSIX semantics depth)
Decision
POSIX support via FUSE with explicit compatibility matrix.
Supported (full semantics)
| Operation | Notes |
|---|---|
| open, close, read, write | Standard file I/O |
| create, unlink, mkdir, rmdir | Directory operations |
| rename (within namespace) | Atomic within shard |
| stat, fstat, lstat | File metadata |
| chmod, chown | Permission changes (stored in delta attributes) |
| readdir, readdirplus | Directory listing from view |
| symlink, readlink | Stored as inline data in delta |
| truncate, ftruncate | Composition resize |
| fsync, fdatasync | Flush to durable (delta committed) |
| extended attributes (xattr) | getxattr, setxattr, listxattr, removexattr |
| POSIX file locks (fcntl) | Per-gateway lock state |
| O_APPEND | Atomic append via delta |
| O_CREAT, O_EXCL | Atomic create-if-not-exists |
Supported (limited semantics)
| Operation | Limitation |
|---|---|
| rename (cross-namespace) | Returns EXDEV (ADR: I-L8) |
| hard links | Within namespace only; cross-namespace returns EXDEV |
| sparse files | Holes tracked in composition; zero-fill on read |
| O_DIRECT | Bypasses client cache but still goes through FUSE |
| flock (advisory) | Best-effort; not guaranteed across gateway failover |
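Cross-namespace operations surface to applications as EXDEV, the same errno a local filesystem returns for a cross-device rename, so standard tools already handle it. A sketch:

```python
import errno

def rename(src_namespace: str, dst_namespace: str) -> None:
    """Rename is atomic within a namespace; crossing namespaces returns EXDEV."""
    if src_namespace != dst_namespace:
        raise OSError(errno.EXDEV, "cross-namespace rename not supported")
    # ...append an atomic rename delta within the shard...

rename("proj-a", "proj-a")  # same namespace: allowed
try:
    rename("proj-a", "proj-b")
except OSError as e:
    # Tools like mv see EXDEV and fall back to copy + unlink.
    assert e.errno == errno.EXDEV
```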
Not supported
| Operation | Reason |
|---|---|
| mmap (shared, writable) | Distributed shared writable mmap requires page-level coherence — not tractable for a distributed system at HPC scale. Read-only mmap is supported. The FUSE client returns ENOTSUP with a log message: “writable shared mmap not supported; use write() instead.” |
| ACLs (POSIX.1e) | Unix permissions only (uid/gid/mode). POSIX ACLs add complexity without significant benefit for the target workload. Revisit if needed. |
| chroot, pivot_root | Filesystem-level operations, not meaningful for FUSE mount |
Consequences
- mmap restriction documented prominently (HPC users expect it)
- Read-only mmap works (useful for model loading)
- Writable mmap requires application changes (use write() instead)
- No POSIX ACLs simplifies the permission model
ADR-014: S3 API Compatibility Scope
Status: Accepted Date: 2026-04-17 Context: A-ADV-5 (S3 API compatibility scope)
Decision
Implement a subset of S3 API covering the operations needed by HPC/AI workloads. Not a complete S3 implementation.
Supported (full)
| API | Notes |
|---|---|
| PutObject | Single-part upload |
| GetObject | Including byte-range reads |
| HeadObject | Metadata retrieval |
| DeleteObject | Tombstone or delete marker (versioning) |
| ListObjectsV2 | Prefix, delimiter, pagination |
| CreateMultipartUpload | |
| UploadPart | |
| CompleteMultipartUpload | |
| AbortMultipartUpload | |
| ListMultipartUploads | |
| ListParts | |
| CreateBucket | Maps to namespace creation |
| DeleteBucket | Maps to namespace deletion |
| HeadBucket | Existence check |
| ListBuckets | Per-tenant bucket listing |
Supported (versioning)
| API | Notes |
|---|---|
| GetObjectVersion | Specific version retrieval |
| ListObjectVersions | Version listing |
| DeleteObjectVersion | Delete specific version |
Supported (conditional)
| API | Notes |
|---|---|
| If-None-Match, If-Match | Conditional writes |
| If-Modified-Since | Conditional reads |
Not supported (initial build)
| API | Reason | Future? |
|---|---|---|
| Lifecycle policies | Complex; competes with Kiseki’s own tiering | Maybe |
| Event notifications | Requires message bus integration | Maybe |
| SSE-S3, SSE-KMS, SSE-C | Kiseki’s encryption is always-on; S3 SSE headers are acknowledged but don’t change behavior | N/A |
| Presigned URLs | Useful; add after core is stable | Yes |
| Bucket policies | Kiseki uses its own IAM/policy model | No |
| CORS | Not relevant for HPC/AI workloads | No |
| Object Lock | Covered by Kiseki’s retention holds | Mapping possible |
| Select (S3 Select) | Out of scope | No |
SSE header handling
S3 clients may send SSE headers. Kiseki always encrypts (I-K1).
- SSE-S3 headers: acknowledged, no-op (system encryption is always on)
- SSE-KMS headers with key ARN: if ARN matches tenant KMS config, acknowledged. If different: error (tenant can’t specify arbitrary keys)
- SSE-C headers: rejected (Kiseki manages encryption, not the client)
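A gateway-side sketch of this header policy (the x-amz-* names are the standard S3 request headers; the function itself is illustrative):

```python
def handle_sse_headers(headers: dict, tenant_kms_arn=None) -> str:
    """Kiseki always encrypts; SSE headers are gate-kept but never change behavior."""
    if "x-amz-server-side-encryption-customer-algorithm" in headers:
        # SSE-C: client-supplied keys conflict with Kiseki-managed encryption.
        raise ValueError("SSE-C rejected: Kiseki manages encryption, not the client")
    kms_key = headers.get("x-amz-server-side-encryption-aws-kms-key-id")
    if kms_key is not None and kms_key != tenant_kms_arn:
        raise ValueError("SSE-KMS key does not match the tenant KMS configuration")
    return "no-op"  # system encryption is always on (I-K1)

assert handle_sse_headers({"x-amz-server-side-encryption": "AES256"}) == "no-op"
```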
Consequences
- S3-compatible tooling (aws cli, boto3, rclone) works for supported operations
- Unsupported operations return 501 Not Implemented
- SSE headers are handled gracefully without breaking encryption model
ADR-015: Observability Contract
Status: Accepted Date: 2026-04-17 Context: A-ADV-7 (observability)
Decision
OpenTelemetry-native observability with tenant-aware metric scoping.
Metrics (Prometheus-compatible, via OpenTelemetry)
| Context | Key metrics |
|---|---|
| Log | delta_append_latency, raft_commit_latency, shard_count, shard_size, compaction_duration, election_count |
| Chunk | write_latency, read_latency, dedup_hit_rate, gc_chunks_collected, repair_count, pool_utilization |
| Composition | create_latency, delete_count, multipart_in_progress, refcount_operations |
| View | materialization_lag_ms, staleness_violation_count, rebuild_progress, pin_count |
| Gateway | request_latency (p50/p99/p999), requests_per_sec, error_rate, active_connections |
| Client | fuse_latency, transport_type, cache_hit_rate, prefetch_effectiveness |
| Key Mgr | derive_latency, rotation_in_progress, kms_reachability, cache_hit_rate |
| Control | tenant_count, namespace_count, quota_utilization, federation_sync_lag |
Zero-trust metric scoping
- Cluster admin sees: aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless cluster admin has approved access for that tenant.
- Tenant admin sees: their own tenant’s metrics via tenant audit export.
- No metric exposes: file names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).
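The anonymization step can be sketched as a label rewrite applied before export (helper name and hash truncation are illustrative):

```python
import hashlib

def scope_labels(labels: dict, approved: frozenset = frozenset()) -> dict:
    """Replace tenant_id with an opaque hash unless access to that tenant
    has been explicitly approved by the cluster admin."""
    scoped = dict(labels)
    tenant = scoped.get("tenant_id")
    if tenant is not None and tenant not in approved:
        scoped["tenant_id"] = hashlib.sha256(tenant.encode()).hexdigest()[:16]
    return scoped

raw = {"tenant_id": "acme-research", "shard_count": 12}
assert scope_labels(raw)["tenant_id"] != "acme-research"                    # anonymized
assert scope_labels(raw, approved=frozenset({"acme-research"})) == raw      # approved
```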
Distributed tracing
- Every write/read path carries a trace ID (OpenTelemetry context propagation)
- Traces span: client → gateway → composition → log → chunk → view
- Tenant-scoped traces are visible only to the tenant admin
- Cluster admin sees system-level spans (no tenant content in span attributes)
Structured logging
- JSON structured logs, one line per event
- Log levels: ERROR, WARN, INFO, DEBUG, TRACE
- Tenant-identifying fields are present but content fields are encrypted
- Logs ship to the same audit/observability pipeline
Consequences
- OpenTelemetry SDK in both Rust and Go codebases
- Metric cardinality must be bounded (no unbounded label values)
- Tracing overhead ~1-2% on data path (acceptable for production)
ADR-016: Backup and Disaster Recovery
Status: Accepted Date: 2026-04-17 Context: A-ADV-8 (backup and DR)
Decision
Federation is the primary DR mechanism. External backup is additive and optional.
Site-level DR via federation
- Federated-async replication to a secondary site is the primary DR story
- RPO: bounded by async replication lag (seconds to minutes)
- RTO: secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration
- Data replication is ciphertext-only (no key material in replication stream)
What is replicated
| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |
External backup (optional, additive)
- Cluster admin can configure external backup targets (S3-compatible store)
- Backup contains: encrypted chunk data + log snapshots + control plane state
- Backup is encrypted with the system key (at rest) — no plaintext in backup
- HIPAA requirement met: backup is encrypted
- Backup frequency: configurable (hourly/daily snapshots of control plane, continuous for chunk data)
Recovery scenarios
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds-minutes |
| Multiple node loss | Raft reconfiguration + EC repair | 0 | Minutes |
| Full site loss | Failover to federated peer | Replication lag | Minutes-hours |
| Site loss, no federation | Restore from external backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable (I-K11) | N/A | N/A |
Consequences
- Federation is the recommended (and primary) DR strategy
- External backup is for defense-in-depth, not primary recovery
- RTO for site failover depends on control plane reconfiguration speed
- System key manager is per-site — site failover requires the secondary site’s own system key manager (different master keys, but tenants’ data is accessible because tenant KMS is shared cross-site)
ADR-017: Dedup Refcount Metadata Access Control
Status: Accepted Date: 2026-04-17 Context: B-ADV-2 (cross-tenant dedup refcount metadata)
Decision
Chunk refcount metadata stores total refcount only, without per-tenant attribution. Tenant-to-chunk mapping is derived from composition metadata (which is tenant-encrypted).
Mechanism
ChunkMeta:
chunk_id: abc123
total_refcount: 3 ← visible to system
per_tenant_refs: N/A ← NOT stored
Tenant attribution is in the composition deltas:
org-pharma/composition-X references chunk abc123 ← encrypted in delta payload
org-biotech/composition-Y references chunk abc123 ← encrypted in delta payload
Access control
- Cluster admin can see: chunk_id, total_refcount, pool, EC status
- Cluster admin CANNOT see: which tenants reference which chunks (this information is in tenant-encrypted delta payloads)
- System dedup process: compares chunk_ids (in the clear for dedup), but does not record which tenant triggered the dedup match
Residual risk
- Total refcount > 1 reveals that SOME dedup occurred, but not who
- Timing side channel: a dedup hit is faster than a full write. An observer who can measure write latency precisely could infer dedup. Mitigation: add random delay to normalize write timing (optional, configurable per tenant).
Consequences
- No per-tenant refcount tracking in chunk metadata
- Refcount decrement on crypto-shred: the crypto-shred process walks the tenant’s compositions (decrypted with tenant key during shred) to identify which chunks to decrement
- This is slower than a per-tenant refcount lookup but only happens during crypto-shred (rare operation)
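The crypto-shred walk described above can be sketched as follows. Names and shapes are hypothetical: given the shredded tenant's decrypted compositions, decrement each referenced chunk's total refcount and report chunks that reach zero (now garbage-collectable). No per-tenant refcount is ever consulted, matching the decision.

```rust
use std::collections::HashMap;

// Illustrative sketch (hypothetical names): decrement total refcounts by
// walking the tenant's decrypted compositions; chunks reaching zero are
// returned for garbage collection.
fn shred_decrement(
    refcounts: &mut HashMap<&'static str, u64>,
    tenant_compositions: &[Vec<&'static str>],
) -> Vec<&'static str> {
    let mut collectable = Vec::new();
    for composition in tenant_compositions {
        for &chunk_id in composition {
            if let Some(rc) = refcounts.get_mut(chunk_id) {
                *rc -= 1;
                if *rc == 0 {
                    collectable.push(chunk_id);
                }
            }
        }
    }
    collectable
}

fn main() {
    // abc123 is shared with another tenant (refcount 2); def456 is not.
    let mut refcounts = HashMap::from([("abc123", 2u64), ("def456", 1u64)]);
    let tenant = vec![vec!["abc123", "def456"]];
    let gc = shred_decrement(&mut refcounts, &tenant);
    assert_eq!(gc, vec!["def456"]); // only the unshared chunk becomes collectable
    assert_eq!(refcounts["abc123"], 1); // the other tenant's reference survives
}
```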
ADR-018: Runtime Integrity Monitor
Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-04 (master key in memory), analyst backpass contention 1
Decision
A runtime integrity monitor runs as a side process on every storage node, detecting signs of key material extraction attempts.
Detection signals
| Signal | Detection method | Severity |
|---|---|---|
| ptrace attachment to kiseki processes | Monitor /proc/pid/status TracerPid | Critical |
| /proc/pid/mem reads on kiseki processes | inotify/audit on /proc/pid/mem | Critical |
| Debugger presence (gdb, lldb, strace) | Process enumeration | High |
| Core dump generation attempt | Monitor core_pattern, catch SIGABRT | Critical |
| Unexpected LD_PRELOAD on kiseki processes | Check /proc/pid/environ at startup | High |
| Process memory mapping changes | Monitor /proc/pid/maps periodically | Medium |
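The first detection signal in the table (ptrace attachment via TracerPid) can be sketched with a small /proc parser. This is a minimal, Linux-only illustration under stated assumptions: the function names are hypothetical, and the real monitor would poll every kiseki process ID rather than its own.

```rust
use std::fs;

// Parse the TracerPid field from /proc/<pid>/status text.
// A nonzero TracerPid means some process is currently tracing the target.
fn tracer_pid_from_status(status_text: &str) -> Option<u32> {
    status_text
        .lines()
        .find(|l| l.starts_with("TracerPid:"))
        .and_then(|l| l.split_whitespace().nth(1))
        .and_then(|v| v.parse().ok())
}

fn is_being_traced(pid: u32) -> std::io::Result<bool> {
    let text = fs::read_to_string(format!("/proc/{pid}/status"))?;
    Ok(tracer_pid_from_status(&text).unwrap_or(0) != 0)
}

fn main() {
    // Self-check on our own process; normally prints "traced: false"
    // unless a debugger such as gdb or strace is attached.
    match is_being_traced(std::process::id()) {
        Ok(t) => println!("traced: {t}"),
        Err(e) => eprintln!("cannot read /proc (non-Linux?): {e}"),
    }
}
```

A periodic sweep of this check across kiseki PIDs is consistent with the stated 1-5 second polling interval.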
Response
- Alert: cluster admin + affected tenant admin(s) immediately
- Log: audit event with full context (pid, signal, timestamp)
- Optional auto-response (configurable):
- Rotate system master key (new epoch, invalidate cached key)
- Evict cached tenant KEKs (force re-fetch from KMS)
- Kill the suspect process
- Do NOT: shut down the storage node (availability over prevention — the attacker may already have the key; shutting down just causes an outage)
Performance impact
Negligible. The monitor checks /proc periodically (every 1-5 seconds), not on every crypto operation. Crypto operations themselves are not a performance concern:
- HKDF derivation: ~1μs per call, ~25,000 calls/sec at line rate = ~25ms CPU/sec
- AES-256-GCM (the actual encryption): with AES-NI, ~5-10% of one core at 200 Gbps
- The bottleneck is the AEAD data encryption, not key derivation or monitoring
Consequences
- Additional process per storage node (lightweight)
- Linux-specific (/proc-based detection); needs platform abstraction for other OS
- Not a prevention mechanism — it’s detection and response
- False positives possible (legitimate debugging during development); monitor should be disableable in dev/test mode
ADR-019: Gateway Deployment Model
Status: Accepted Date: 2026-04-17 Context: ADV-ARCH-03 (monolith blast radius), analyst backpass contention 4
Decision
Gateways run in-process with kiseki-server (monolith per node). Client resilience is provided by multi-endpoint resolution, not per-process gateway isolation.
Rationale
This is a distributed system with no master. Every storage node runs kiseki-server (log + chunk + composition + view + gateways). Clients resolve to multiple endpoints:
Client (NFS/S3/native)
│
├── DNS round-robin: kiseki-nfs.cluster.local → [node1, node2, node3, ...]
├── Multiple A/AAAA records
├── Native client: seed list → discovery → multiple endpoints
│
└── On node failure: client reconnects to next endpoint
(NFS: automatic reconnect; S3: retry to different host;
native: transport failover)
Why monolith is acceptable
| Concern | Mitigation |
|---|---|
| Gateway crash = node crash | Client reconnects to another node (seconds) |
| All tenants on crashed node affected | Tenants are served by multiple nodes; one node loss = partial, not total |
| Memory leak in gateway affects log/chunk | Resource limits via cgroups; OOM killer targets the process, not the node |
| Bug in NFS gateway affects S3 gateway | Accept — both are in the same process. Isolation adds operational complexity disproportionate to the risk |
Why NOT separate gateway processes
- Additional process management per node (spawn, monitor, restart, IPC)
- Performance overhead of IPC between gateway and log/chunk/view
- Operational complexity (more processes to configure, monitor, upgrade)
- The resilience model is client-side multi-endpoint, not server-side process isolation
Client resolution
| Client type | Resolution mechanism |
|---|---|
| NFS | DNS (multiple A records), NFS mount with multiple server addresses |
| S3 | DNS round-robin, HTTP retry to next endpoint on 5xx |
| Native | Seed list → fabric discovery → multiple endpoints, automatic failover |
Consequences
- kiseki-server remains a single-process monolith per node
- Client-side resilience is the primary availability mechanism
- Update failure-modes.md: F-D1 (gateway crash) → node-scoped, not protocol-scoped
- Node loss tolerance depends on tenant data distribution across nodes
ADR-020: Workflow Advisory & Client Telemetry
Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: new capability — HPC/AI workloads need to steer storage (prefetch, affinity, priority, phase-adaptive tuning) and consume caller-scoped feedback (backpressure, locality, materialization lag, QoS headroom). ADR-015 covers operator-facing observability; this ADR covers the orthogonal client-facing advisory/telemetry surface.
Decision (analyst-level; architect will refine interfaces)
Introduce a Workflow Advisory cross-cutting concern carrying two flows over one bidirectional advisory channel per declared workflow:
- Hints (client → storage) — advisory, never authoritative (I-WA1).
- Telemetry feedback (storage → client) — caller-scoped only (I-WA5, I-WA6).
Workflow is not a bounded context. It is a correlation + steering construct owned entirely by the client, with a stateless routing layer on the server side and bounded per-workflow state.
Correlation identity
Every data-path operation issued while a workflow is active carries:
(org_id, project_id?, workload_id, client_id, workflow_id, phase_id)
- `client_id` pinned per native-client process (I-WA4).
- `workflow_id` ≥128-bit opaque, unique within workload (I-WA10).
- `phase_id` monotonic within workflow, bounded phase history (I-WA13).
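The identity rules can be sketched as below. Assumptions are hedged: the real client presumably uses a proper CSPRNG crate, while this example reads /dev/urandom to stay dependency-free (and Linux-specific, like the rest of the system); the `Workflow` struct and method names are illustrative only.

```rust
use std::fs::File;
use std::io::Read;

// Generate a 128-bit opaque handle (I-WA10). /dev/urandom stands in for a
// real CSPRNG here.
fn new_opaque_id() -> std::io::Result<[u8; 16]> {
    let mut id = [0u8; 16];
    File::open("/dev/urandom")?.read_exact(&mut id)?;
    Ok(id)
}

struct Workflow {
    workflow_id: [u8; 16],
    phase_id: u64, // monotonic within the workflow (I-WA13)
}

impl Workflow {
    fn advance_phase(&mut self) -> u64 {
        self.phase_id += 1; // only moves forward; history is bounded elsewhere
        self.phase_id
    }
}

fn main() -> std::io::Result<()> {
    let mut wf = Workflow { workflow_id: new_opaque_id()?, phase_id: 0 };
    assert_ne!(wf.workflow_id, new_opaque_id()?); // collision odds negligible
    assert_eq!(wf.advance_phase(), 1);
    assert_eq!(wf.advance_phase(), 2);
    Ok(())
}
```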
Advisory channel
- One bidi gRPC stream per active workflow, on the data fabric, under the same mTLS tenant certificate as the data path (I-Auth1, I-WA3).
- Authorization is per-operation on the stream, not only at establishment (I-WA3). Certificate revocation tears down the stream.
- Side-by-side with the data path — not in-band. Data-path requests may be annotated with a short `workflow_ref` header that the data-path code passes through; server-side, the annotation is routed to the advisory subsystem asynchronously (I-WA2). Annotation is strictly best-effort:
  - malformed `workflow_ref` → ignored, no data-path impact
  - `workflow_ref` for an expired workflow → dropped silently on the advisory side with a `hint-rejected: workflow_unknown` audit event
  - advisory subsystem overloaded or unavailable → annotation enqueued with bounded buffer; on overflow dropped with a rate-limited `annotation_dropped` audit event. Data-path operation outcome is never affected (I-WA2).
- Closure of the advisory stream without `End` auto-expires the workflow on TTL. Process restart produces a fresh `client_id`; the old workflow expires on TTL and the new process must redeclare. No reattach protocol is defined in this ADR — it may be revisited as a follow-up feature with its own spec + adversary review.
Hint taxonomy (must-have)
| Category | Example values | Acted on by |
|---|---|---|
| Workload profile | ai-training, ai-inference, hpc-checkpoint, batch-etl, interactive | Control Plane policy gate; tunes other hint defaults |
| Phase marker | stage-in, compute, checkpoint, stage-out, epoch-N (opaque semantic tag) | View (cache policy), Composition (write-absorb), Chunk (placement hot-set) |
| Access pattern | sequential / random / strided / broadcast | Native Client (prefetch), View (materialization priority) |
| Prefetch range | list of (composition_id, offset, length) | View, Chunk (opportunistic warm) |
| Priority class | interactive / batch / bulk within policy-allowed max | Gateway / Client QoS scheduler |
| Affinity preference | pool / rack / node preference within policy | Chunk placement engine |
| Retention intent | temp / working / final | Composition GC urgency, Chunk EC policy selection |
| Dedup intent | shared-ensemble / per-rank | Chunk dedup path (still bounded by I-X2) |
| Collective announcement | {ranks, bytes_per_rank, deadline} | Chunk write-absorb provisioning |
Hint taxonomy (nice-to-have, deferred)
Co-access grouping, deadline, transient markers (discardable after epoch N),
NUMA/GPU topology, peer-rank state. Architect may add these in a follow-up.
Telemetry feedback (must-have)
| Signal | Shape | Scoping |
|---|---|---|
| Backpressure | severity enum + retry_after_ms | Caller’s own resources only |
| Materialization lag | ms | Caller’s views only |
| Locality class | bucketed enum (local-node, local-rack, same-pool, remote, degraded) | Caller-owned chunks only |
| Prefetch effectiveness | bucketed hit-rate | Caller’s declared prefetches only |
| QoS headroom | bucketed fraction | Caller’s workflow/workload |
| Own-hotspot | composition_id + coarse level | Caller’s own compositions |
Tenant-hierarchy scoping
- Policy chain: cluster → org → project → workload. Each level narrows (never broadens) its parent’s ceilings (I-WA7). Profile allow-lists inherit the same way.
- Workflow lives strictly within one workload (I-WA3).
- Disable switch at any level (I-WA12) — data path unaffected when advisory is disabled.
Security posture
- Hints cannot extend capability (I-WA14).
- Telemetry is not an existence oracle (I-WA6) — unauthorized target → same shape as absent target, including timing distribution.
- Telemetry aggregation uses k-anonymity over neighbour workloads, k ≥ 5 (I-WA5).
- Covert-channel hardening: rejection latency and telemetry response size are bucketed (I-WA15).
- All advisory decisions audited on tenant shard; cluster-admin view sees opaque hashes (I-WA8, consistent with I-A3 / ADR-015).
Isolation from data path
- Advisory channel on a separate gRPC service and (ideally) a separate server-side tokio runtime / goroutine pool from the data path.
- Hint handling is best-effort with bounded buffering; on overload the handler drops-and-audits rather than queuing.
- Data-path code never awaits advisory responses. At most it emits fire-and-forget annotations.
Alternatives considered
- Attach hints as headers on existing data-path RPCs, no separate channel. Rejected: couples hint handling to data-path latency, violates I-WA2 isolation, and makes bidirectional telemetry awkward.
- Model workflow as a new bounded context with durable state. Rejected: workflows are ephemeral correlation handles. Persisting them invites a new shared-state problem and gives little value beyond what the audit log already provides.
- Expose ADR-015 observability directly to clients. Rejected: ADR-015 is operator-facing with aggregate/anonymized scope. Clients need caller-scoped, near-real-time feedback with a different privacy boundary (I-WA5/6).
- Server-authoritative hints (storage can infer and inject its own). Rejected: inferring client intent from data-path patterns is already done internally; the point of this ADR is to let clients supply authoritative-to-themselves hints. Server-side inference remains available as a fallback when hints are absent.
Consequences
- New crate `kiseki-advisory` (Rust) — hint validation, routing, rate limiting, telemetry emission, audit emission. Side-by-side with `kiseki-server`, not inside the data-path crates.
- New protobuf service `WorkflowAdvisory` with `DeclareWorkflow`, `EndWorkflow`, `PhaseAdvance`, a bidi `AdvisoryStream`, and `SubscribeTelemetry` (may be a stream within `AdvisoryStream`).
- Control Plane extensions: profile allow-lists, hint budgets, opt-out switches — inherited org → project → workload.
- Native Client extensions: `WorkflowSession` handle; existing data-path methods accept an optional `&WorkflowSession` for automatic correlation annotation.
- Audit additions: new event types per I-WA8. Tenant audit export (I-A2) includes them; cluster-admin export (I-A3) hashes the tenant-scoped identifiers.
- Metric additions (ADR-015 operator view): `advisory_hints_accepted`, `_rejected`, `_throttled`, `active_workflows`, `advisory_channel_latency`, tenant-anonymized.
- Performance: hint handling overhead target < 5µs p99 per accepted hint; telemetry emission frequency capped per subscription.
- Failure mode `F-ADV-1`: advisory-subsystem outage → data path unaffected; clients observe `advisory_unavailable` until restoration. To be added to `specs/failure-modes.md` (severity P2, blast radius: steering quality only).
Changes from adversary gate-0 review
- I-WA6 extended to cover hint rejection (previously telemetry-only).
- I-WA3 tightened to per-operation authorization.
- I-WA5 defines explicit low-k behaviour (fixed-sentinel neighbour component, unchanged response shape).
- New invariants I-WA16 (hint payload size bound), I-WA17 (declare rate bound), I-WA18 (prospective policy application).
- I-WA11 tightened to enumerate forbidden advisory target field types.
- I-WA12 defines three-state opt-out with draining transition.
- I-WA13 specifies CAS serialization for PhaseAdvance.
- Reattach protocol explicitly dropped; TTL-only recovery.
- `client_id` construction simplified to CSPRNG (≥128 bits), pinning enforced by registrar.
- A-ADV-1..A-ADV-4 added to assumptions.md.
Follow-ups (architect’s scope)
- gRPC service definitions and message schemas.
- Exact integration surface between `kiseki-advisory` and each of Chunk, View, Composition, Gateway.
- Concrete k-anonymity bucketing algorithm and parameters.
- Exact latency-bucketing and size-bucketing schemes for I-WA15.
- Phase-history compaction format and retention per workload.
- Reattach protocol for process-restart scenarios (I-WA4 scenario).
Follow-ups (adversary’s scope — gate 0 before architect)
- Threat-model the covert-channel surface (timing, size, error-code).
- Validate that the inherent side-channels from backpressure signals are truly k-anonymised under worst-case neighbour composition.
- Probe the reattach protocol once drafted.
ADR-021: Workflow Advisory Architecture
Status: Accepted (implemented, 51/51 BDD scenarios pass) Date: 2026-04-17 Context: ADR-020 analyst-level decision; this ADR commits the architecture (crate shape, runtime isolation, advisory-to-data-path coupling, protobuf + intra-Rust boundaries).
Decision
Three structural commitments that, together, make the analyst-level invariants in ADR-020 enforceable at compile time and at runtime.
1. Advisory is a separate crate with an isolated runtime
- New Rust crate `kiseki-advisory`, located at `crates/kiseki-advisory/`.
- Compiled into `kiseki-server` but runs on a dedicated tokio runtime with its own thread pool, separate from the data-path runtime. Configured via `kiseki-server` at process start.
- All advisory ingress (`AdvisoryStream`, `DeclareWorkflow`, `PhaseAdvance`, telemetry subscriptions) is accepted on a separate gRPC listener from the data-path gRPC listeners.
- Advisory-audit emission uses `kiseki-audit`'s existing tenant-shard path but with its own bounded queue and drop-and-record-on-overflow policy (no awaits out of the advisory runtime into the data path).
- Structural enforcement of I-WA2: data-path crates do not depend on `kiseki-advisory` in their Cargo manifests. The only way an advisory event can affect data-path behaviour is through well-typed domain-level preferences (see §3), which the data path treats as advisory hints — never as preconditions.
2. Shared domain types live in kiseki-common
A small set of enums and structs representing “the advisory context
of one operation” is declared in kiseki-common (already a dependency
of every context). This lets data-path crates accept an
Option<&OperationAdvisory> on their operations without pulling in
the advisory runtime.
kiseki-common (domain types: WorkflowRef, OperationAdvisory, enums)
↑
kiseki-{log,chunk,composition,view,gateway-*,client}
(accept Option<&OperationAdvisory>, use for preferences only)
kiseki-advisory (runtime, router, budget, audit emitter)
├── depends on kiseki-common
├── depends on kiseki-audit
└── depends on kiseki-proto (for WorkflowAdvisoryService)
↑
kiseki-server (wires advisory runtime to each context)
Cycle-free: no data-path crate depends on kiseki-advisory; the
runtime wiring happens only in the kiseki-server binary.
3. Pull-based advisory lookup (not push into the data path)
When a data-path request arrives carrying a workflow_ref header:
3.a Header mechanism
The workflow_ref is carried as a gRPC metadata entry, not as a
protobuf field on any data-path message. Concrete binding:
- Metadata key: `x-kiseki-workflow-ref-bin` (binary metadata, per gRPC convention for raw-bytes values)
- Metadata value: the raw 16-byte `WorkflowRef` handle
- All data-path protos remain unchanged — this is the structural payoff that makes I-WA2 tractable (data-path code stays advisory-unaware).
- A gRPC interceptor in `kiseki-server` lifts the header into a request-scoped context at ingress. The context is accessed by each data-path handler through a small `kiseki-common` helper (`CurrentAdvisory::from_request_context()`), which returns an `Option<OperationAdvisory>` by calling `AdvisoryLookup::lookup_fast`.
- For intra-Rust calls (e.g., the native client's native API path), the same helper reads from a task-local set by the caller. The native client's `WorkflowSession` handle scopes this automatically.
- For external protocols (NFS, S3) the HTTP-level header is `x-kiseki-workflow-ref` (plain, hex-encoded), translated by the protocol gateway into the gRPC binary metadata entry `x-kiseki-workflow-ref-bin` before forwarding to any internal gRPC service. This keeps external clients unaware of gRPC conventions.
- No data-path proto file contains `workflow_ref`. Any future attempt to add it is rejected at architecture review.
- The `kiseki-server` gRPC interceptor extracts `workflow_ref` and stores it in the request context.
- The data-path operation (e.g., `WriteChunk`) optionally consults `CurrentAdvisory::from_request_context()` to obtain an `Option<OperationAdvisory>`.
- The data-path code may, synchronously and fallibly, call `AdvisoryLookup::lookup_fast(workflow_ref) -> Option<OperationAdvisory>` with a strict bounded deadline (≤ 500 µs, configurable, default 200 µs). The method name carries the contract: implementations MUST NOT block, allocate on the happy path, or call non-O(1) functions.
- On timeout, unavailability, or cache miss the lookup returns `None`. The data-path code proceeds exactly as it would for an operation without any `workflow_ref`.
- There is no blocking wait, no retry, and no propagated error. The lookup is a hot-path cache read (see §4 below).
This guarantees I-WA2 structurally: the data path cannot be stalled or corrupted by the advisory subsystem. At worst, advisory context is unavailable and steering quality degrades.
4. Advisory state shape and hot path
kiseki-advisory maintains three bounded in-memory caches keyed by
workflow:
| Cache | Contents | Size bound | Eviction |
|---|---|---|---|
| Workflow table | (workflow_id) → { mTLS-identity, profile, current_phase, budgets, TTL } | policy-bounded max concurrent workflows per workload × total workloads | TTL + End |
| Effective-hints table | (workflow_id) → OperationAdvisory (latest accepted hints, merged across phase) | 1 row per active workflow | replaced on new accept |
| Prefetch ring | per-workflow ring buffer of accepted prefetch tuples | max_prefetch_tuples_per_hint × in-flight phases | FIFO on cap |
Reads from the data path hit the effective-hints table (O(1)).
Writes into these caches happen on the advisory runtime only.
Cross-thread access uses arc-swap (snapshot-read, copy-on-write)
so the data-path read never takes a lock held by the advisory
runtime.
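The snapshot-read pattern can be approximated with the standard library alone. This is a hypothetical stand-in: the production code uses the arc-swap crate for lock-free pointer swaps, whereas here an `RwLock` guards only the snapshot pointer, so readers briefly lock to clone the `Arc` and then read without holding anything shared with advisory writes.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Clone, Debug, PartialEq)]
pub struct OperationAdvisory {
    pub access_pattern: String,
}

// Approximation of the effective-hints table (names illustrative).
pub struct EffectiveHints {
    snapshot: RwLock<Arc<HashMap<u128, OperationAdvisory>>>,
}

impl EffectiveHints {
    pub fn new() -> Self {
        Self { snapshot: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Advisory-runtime side: copy-on-write replacement of the whole snapshot.
    pub fn accept_hint(&self, workflow_id: u128, adv: OperationAdvisory) {
        let mut guard = self.snapshot.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(workflow_id, adv);
        *guard = Arc::new(next);
    }

    /// Data-path side: clone the snapshot pointer, then read without holding
    /// any lock during the actual lookup. Miss cost is a `None` return.
    pub fn lookup_fast(&self, workflow_id: u128) -> Option<OperationAdvisory> {
        let snap = {
            let guard = self.snapshot.read().unwrap();
            Arc::clone(&*guard)
        };
        snap.get(&workflow_id).cloned()
    }
}

fn main() {
    let hints = EffectiveHints::new();
    assert!(hints.lookup_fast(7).is_none());
    hints.accept_hint(7, OperationAdvisory { access_pattern: "sequential".into() });
    assert_eq!(hints.lookup_fast(7).unwrap().access_pattern, "sequential");
}
```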
5. gRPC service shape
One new service, WorkflowAdvisoryService, on its own gRPC listener.
Unary: DeclareWorkflow, EndWorkflow, PhaseAdvance,
GetWorkflowStatus (for admin/debug within caller’s own scope).
Bidi streaming: AdvisoryStream (hints in, telemetry out over the
same stream, multiplexed). Server streaming: SubscribeTelemetry (a
receive-only variant for callers who don't want to send hints).
Full schema in specs/architecture/proto/kiseki/v1/advisory.proto.
6. Control-plane integration
New Go package control/pkg/advisory:
- Policy CRUD for profile allow-lists, budgets, opt-out state per org/project/workload. Inheritance computed server-side; effective policy returned to `kiseki-advisory` via the existing `ControlService`.
- Opt-out state transitions (`enabled`/`draining`/`disabled`) are Raft-backed in the existing control-plane state store.
- Federation does NOT replicate workflow state (ephemeral, local). It DOES replicate policy (existing async config replication path).
7. k-anonymity bucketing: concrete algorithm
For pool/shard saturation signals that incorporate cross-workload aggregate:
- Compute aggregate metric `A` over all contributing workloads on the pool/shard.
- Count distinct contributing workloads `k`.
- If `k ≥ 5` (policy-configurable minimum): return `severity = bucket(A)`; retry-after = `bucket(compute_retry(A))`.
- If `k < 5`: return `severity = bucket(A_caller_only)`; retry-after = `bucket(compute_retry(A_caller_only))`. The response shape is identical to the `k ≥ 5` case; only the value of the neighbour-derived component is replaced by a sentinel bucket (`ok`, regardless of true aggregate) chosen to minimize caller utility of detecting the substitution.
Bucket function: fixed set {ok, soft, hard} for severity,
{<50ms, 50-250ms, 250-1000ms, 1-10s, >10s} for retry-after.
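The bucket functions and the k-gate can be sketched as follows. The bucket sets come from the text; the load thresholds, function names, and the `compute_retry` stand-in are illustrative assumptions, not from the spec.

```rust
// Severity buckets {ok, soft, hard}; thresholds here are illustrative.
fn bucket_severity(load: f64) -> &'static str {
    match load {
        l if l < 0.7 => "ok",
        l if l < 0.9 => "soft",
        _ => "hard",
    }
}

// Retry-after buckets, fixed set from the text.
fn bucket_retry_after(ms: u64) -> &'static str {
    match ms {
        0..=49 => "<50ms",
        50..=249 => "50-250ms",
        250..=999 => "250-1000ms",
        1000..=9999 => "1-10s",
        _ => ">10s",
    }
}

// k-gate: the response shape is identical on both branches; only the value
// source changes (caller-only substitutes when k < k_min).
fn backpressure_signal(
    aggregate_load: f64,
    caller_only_load: f64,
    contributing_workloads: usize,
    k_min: usize,
) -> (&'static str, &'static str) {
    let a = if contributing_workloads >= k_min { aggregate_load } else { caller_only_load };
    let retry_ms = (a * 1000.0) as u64; // stand-in for compute_retry(A)
    (bucket_severity(a), bucket_retry_after(retry_ms))
}

fn main() {
    // k >= 5: the neighbour-derived aggregate is used.
    assert_eq!(backpressure_signal(0.95, 0.10, 7, 5).0, "hard");
    // k < 5: the caller-only value substitutes; the shape stays identical.
    assert_eq!(backpressure_signal(0.95, 0.10, 3, 5).0, "ok");
}
```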
8. Covert-channel hardening: concrete widths
- Rejection response timing: every advisory rejection path (hint, subscription, declare, phase) pads response emission to the next 100-µs boundary after a fixed minimum of 300 µs. Enforced by a common `emit_bucketed_response` helper in `kiseki-advisory`.
- Telemetry message sizes: protobuf messages padded to one of {128, 256, 512, 1024} bytes with a `reserved bytes padding` field repeated to the target size. Selection uses the nearest bucket ≥ actual size.
- Error codes: every rejection caused by authorization or scope violation returns the `SCOPE_NOT_FOUND` code with the same message payload, regardless of whether the cause was "unauthorized" or "absent". Internal audit records carry the true reason.
- gRPC status code: `WorkflowAdvisoryService` MUST return gRPC status `NOT_FOUND` (code 5) for every `SCOPE_NOT_FOUND` case. Using `PERMISSION_DENIED` (code 7) or `UNAUTHENTICATED` (code 16) on authorization failures would leak the distinction via the gRPC trailers, defeating the canonicalization above. All gRPC clients and middleware expose the status code, so this is not a "docs-only" rule — it is enforced by an integration test at Phase 11.5 exit that compares status-code distributions across authorized-absent and unauthorized-existing cases.
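The timing rule can be sketched as a small pure function (the helper name is hypothetical): pad emission to the next 100-µs boundary, with a 300-µs floor.

```rust
use std::time::Duration;

// Compute the padded emission time for a rejection response:
// take the 300 µs floor, then round up to the next 100 µs boundary
// (a value already on a boundary stays put).
fn padded_emit_delay(elapsed_us: u64) -> Duration {
    const FLOOR_US: u64 = 300;
    const BUCKET_US: u64 = 100;
    let raw = elapsed_us.max(FLOOR_US);
    let target = (raw + BUCKET_US - 1) / BUCKET_US * BUCKET_US;
    Duration::from_micros(target)
}

fn main() {
    assert_eq!(padded_emit_delay(120), Duration::from_micros(300)); // floor applies
    assert_eq!(padded_emit_delay(301), Duration::from_micros(400)); // next boundary
    assert_eq!(padded_emit_delay(400), Duration::from_micros(400)); // already aligned
}
```

An observer thus sees only a coarse grid of response times, masking the fast-rejection vs. slow-rejection distinction.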
9. Phase-history compaction format
Per workflow, keep the last 64 phase records in the workflow table
(ring buffer of PhaseRecord { phase_id, tag_hash, entered_at, hints_accepted_count, hints_rejected_count }). On eviction, the
evicted record is rolled up into a per-workflow
PhaseSummary { from_phase_id, to_phase_id, total_hints_accepted, total_hints_rejected, duration_ms } audit event emitted to the
tenant audit shard. The summary replaces all evicted individual
records in audit history.
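The compaction scheme above can be sketched with a bounded ring plus a running rollup. The field set is trimmed from the text (`tag_hash` and `entered_at` omitted for brevity) and the rollup-into-summary logic is illustrative, not the exact audit-event emission path.

```rust
use std::collections::VecDeque;

const PHASE_HISTORY_CAP: usize = 64;

#[derive(Clone)]
struct PhaseRecord {
    phase_id: u64,
    hints_accepted_count: u64,
    hints_rejected_count: u64,
}

#[derive(Default)]
struct PhaseSummary {
    from_phase_id: u64,
    to_phase_id: u64,
    total_hints_accepted: u64,
    total_hints_rejected: u64,
}

struct PhaseHistory {
    ring: VecDeque<PhaseRecord>,
    summary: PhaseSummary, // destined for the tenant audit shard on emit
}

impl PhaseHistory {
    fn new() -> Self {
        Self { ring: VecDeque::with_capacity(PHASE_HISTORY_CAP), summary: PhaseSummary::default() }
    }

    fn push(&mut self, rec: PhaseRecord) {
        if self.ring.len() == PHASE_HISTORY_CAP {
            // Evict the oldest record and fold it into the rollup summary.
            let evicted = self.ring.pop_front().unwrap();
            if self.summary.from_phase_id == 0 {
                self.summary.from_phase_id = evicted.phase_id;
            }
            self.summary.to_phase_id = evicted.phase_id;
            self.summary.total_hints_accepted += evicted.hints_accepted_count;
            self.summary.total_hints_rejected += evicted.hints_rejected_count;
        }
        self.ring.push_back(rec);
    }
}

fn main() {
    let mut h = PhaseHistory::new();
    for phase_id in 1..=70 {
        h.push(PhaseRecord { phase_id, hints_accepted_count: 2, hints_rejected_count: 1 });
    }
    assert_eq!(h.ring.len(), 64);           // bounded history (I-WA13)
    assert_eq!(h.summary.from_phase_id, 1); // phases 1..=6 rolled up
    assert_eq!(h.summary.to_phase_id, 6);
    assert_eq!(h.summary.total_hints_accepted, 12);
}
```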
Alternatives considered
- Put advisory code inside each data-path crate behind a feature flag. Rejected: tight coupling; impossible to guarantee I-WA2 (hot data-path code lives in the advisory lifecycle), and per-crate feature flags multiply the combinatorics of build variants.
- Separate OS process for advisory runtime, IPC'd from kiseki-server. Rejected: IPC adds serialization cost on the hot-path lookup (§3) and complicates deployment (another process per node). The isolated-tokio-runtime pattern gives enough blast-radius reduction at much lower overhead.
- Define advisory traits in a new tiny crate `kiseki-advisory-api` separate from `kiseki-common`. Considered. Rejected for now: the advisory domain types (`OperationAdvisory`, enums) are small, stable, and already conceptually part of the shared vocabulary (Workflow, Phase, AccessPattern appear in ubiquitous-language.md). Adding a one-concept crate adds build-graph overhead without payoff. Can be split out later if the type set grows.
- Push hints directly into each context via per-context channels (no `OperationAdvisory` aggregation). Rejected: spreads fan-out logic across every context and makes I-WA11 (target-field restriction) and I-WA16 (size cap) harder to enforce. Centralizing in `kiseki-advisory` and passing an already-validated bundle simplifies data-path code.
10. Schema versioning
advisory.proto ships as kiseki.v1. Forward-evolution rules:
- Additions (new fields, new oneof variants, new enum values) stay within `v1`. Unknown fields are preserved by gRPC clients.
- Deprecations mark fields with `reserved` after one minor release; old clients continue to work.
- Breaking changes (semantic change of a field, required removal) move to `v2` with a deprecation window ≥ 2 releases in which both versions are served.
- Advisory-policy changes in the control plane (profile allow-list additions, budget changes) are config, not schema — no version bump needed.
11. Padding to bucket size
AdvisoryError.padding, AdvisoryServerMessage.padding,
TelemetryEvent.padding, WorkflowStatus.padding, and
AdvisoryAuditBody.padding carry the variable bytes needed to hit
one of the bucket sizes {128, 256, 512, 1024, 2048 for audit bodies}.
Computation at emit time:
serialized_size = serialize(rest_of_message).len();
target_bucket = smallest bucket >= serialized_size + padding_overhead;
padding_len = target_bucket - serialized_size - varint_overhead(target_bucket);
varint_overhead(N) accounts for the two-byte (tag + length-varint)
prefix of the padding field; standard protobuf wire format.
Implementations MUST use the kiseki-advisory::emit_bucketed_response
helper. Property test at Phase 11.5 exit: every response on
WorkflowAdvisoryService is exactly one of the bucket sizes.
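The computation above can be sketched as follows. The bucket set comes from the text; the overhead model is a simplified stand-in for `varint_overhead(N)` (one tag byte plus a one- or two-byte length varint), and the function names are illustrative.

```rust
// Telemetry-size buckets from §8/§11 (audit bodies add a 2048 bucket).
const BUCKETS: [usize; 4] = [128, 256, 512, 1024];

// Simplified stand-in for varint_overhead(N): tag byte + length varint
// for the padding field.
fn padding_field_overhead(target: usize) -> usize {
    if target < 128 { 2 } else { 3 }
}

// Choose the smallest bucket that fits the message plus its padding-field
// overhead; return the padding length needed to land exactly on it, or
// None if even the largest bucket cannot fit the message.
fn padding_len(serialized_size: usize) -> Option<usize> {
    let target = *BUCKETS
        .iter()
        .find(|&&b| b >= serialized_size + padding_field_overhead(b))?;
    Some(target - serialized_size - padding_field_overhead(target))
}

fn main() {
    // A 100-byte message lands exactly on the 128-byte bucket.
    let pad = padding_len(100).unwrap();
    assert_eq!(100 + padding_field_overhead(128) + pad, 128);
    // A message too large for every bucket yields no padding plan.
    assert!(padding_len(1023).is_none());
}
```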
Consequences
- Adds one Rust crate (`kiseki-advisory`), one Go package (`control/pkg/advisory`), one proto file (`proto/kiseki/v1/advisory.proto`), one data-model stub (`data-models/advisory.rs`).
- Adds a new phase to the build sequence (see `build-phases.md`).
- Every data-path `*Op` trait in `api-contracts.md` gains an optional `advisory: Option<&OperationAdvisory>` parameter on its methods. Callers that don't care pass `None`.
- Isolation requires `kiseki-server` to instantiate two tokio runtimes. Accepted cost.
- The `arc-swap` hot-path read is the only cross-runtime coupling. Property-test and benchmark-verified at Phase 11 exit.
Open items (escalated to adversary gate-1)
- Validate that §3 (pull-based lookup) cannot itself become a DoS surface: a malicious client pummelling `workflow_ref` headers causes lookups. Mitigation: lookup cache is per-node, bounded, and miss cost is a `None` return (no upstream RPC).
- Validate §4 (`arc-swap` snapshot) meets latency targets on the actual data-path hot code (FUSE read/write, chunk write, view read).
- Validate §8 covert-channel widths are large enough to mask actual work variance under realistic load.
- Confirm §9 audit summary compaction does not itself become an existence oracle (size of summary varies with workflow activity).
ADR-022: Storage Backend — redb (Pure Rust)
Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.
Context
The system needs persistent storage for:
- Raft log entries — append-heavy, sequential reads for replay
- State machine snapshots — periodic full-state serialization
- Chunk metadata index — key-value mapping (chunk_id → placement, refcount)
- View watermark checkpoints — small, frequently updated
The spec references "RocksDB or equivalent" (build-phases.md Phase 3) but does not commit to a specific engine. RocksDB is C++ and brings a ~200 MB build-dependency chain via cmake/clang/librocksdb.
Decision
Use redb v2 for all structured persistent storage.
What redb handles
| Data | redb Table | Key | Value |
|---|---|---|---|
| Raft log entries | raft_log | u64 (log index) | bincode-serialized entry |
| Raft vote/term | raft_meta | &str (“vote”, “term”) | u64 |
| State machine snapshot | sm_snapshot | "latest" | bincode-serialized state |
| Chunk metadata | chunk_meta | [u8; 32] (chunk_id) | bincode ChunkMeta |
| Device allocation | device_alloc | (DeviceId, u64) (device + offset) | [u8; 32] (chunk_id) — reverse index |
| View watermarks | view_wm | [u8; 16] (view_id) | u64 (sequence) |
What redb does NOT handle
Chunk ciphertext data is written directly to raw block devices
(or file-backed fallback for VMs/CI) via the DeviceBackend trait
in kiseki-block (ADR-029). redb stores metadata only; chunk
ciphertext never passes through redb.
```
$KISEKI_DATA_DIR/
  devices/
    /dev/nvme0n1          # raw block device (default, ADR-029)
    /dev/nvme1n1          # raw block device
    /tmp/kiseki-dev0.img  # file-backed fallback (VMs/CI)
  raft/
    db.redb               # redb database file (metadata only)
```
redb tracks chunk placement: chunk_meta table maps
chunk_id → (device_id, offset, size, fragment_index).
The device_alloc table provides a reverse index
(device_id, offset) → chunk_id for bitmap rebuild and scrub.
Bitmap allocation updates are journaled in redb before application
to the on-device bitmap (ADR-029).
Why pool files, not per-chunk files:
- At 100TB / 64KB avg = 1.6B chunks → filesystem inode exhaustion
- Pool files support O_DIRECT and RDMA pre-registration (single mmap region)
- Chunks are 4KB-aligned within the pool file for NVMe block alignment
- Pool file is sparse: only allocated regions consume disk space
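The arithmetic behind the first and third bullets is easy to sanity-check. A small sketch with hypothetical helper names (the 100 TB / 64 KB and 4 KB figures come from the list above):

```rust
/// Estimated chunk count for a given capacity and average chunk size.
/// At ~100 TB with 64 KB average chunks this exceeds 1.6 billion entries,
/// far beyond what a per-chunk-file layout can sustain on most filesystems.
pub fn estimated_chunk_count(capacity_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    capacity_bytes / avg_chunk_bytes
}

/// Round an offset up to the next alignment boundary (4 KB for NVMe blocks).
pub fn align_up(offset: u64, alignment: u64) -> u64 {
    (offset + alignment - 1) / alignment * alignment
}
```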
EC fragment placement (CRUSH-like)
Fragments are placed across devices via deterministic hashing:

```rust
fn place_fragment(chunk_id: ChunkId, frag_idx: usize, pool_devices: &[DeviceId]) -> DeviceId {
    // Ensure no two fragments land on the same device: exclude the devices
    // deterministically chosen for lower fragment indices, then hash-select.
    let mut candidates = pool_devices.to_vec();
    for prior in 0..frag_idx {
        let placed = candidates[hash(chunk_id, prior) as usize % candidates.len()];
        candidates.retain(|d| *d != placed);
    }
    candidates[hash(chunk_id, frag_idx) as usize % candidates.len()]
}
```
Deterministic — can recalculate placement without storing it.
Reverse index (device_id, chunk_id) → fragment_index in redb
enables efficient repair on device failure.
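A self-contained version of this scheme (with a stand-in FNV-1a hash; the production hash function is not specified here) demonstrates both claims: recalculation is deterministic, and no two fragments share a device:

```rust
// Stand-in hash; the real hash function is an assumption of this sketch.
fn frag_hash(chunk_id: u64, frag_idx: usize) -> u64 {
    let mut h = 0xcbf29ce484222325u64; // FNV-1a offset basis
    for b in chunk_id.to_le_bytes().iter().chain(&(frag_idx as u64).to_le_bytes()) {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
    h
}

/// Deterministically place one EC fragment, excluding devices already
/// chosen for lower fragment indices (no two fragments on one device).
pub fn place_fragment(chunk_id: u64, frag_idx: usize, pool_devices: &[u32]) -> u32 {
    let mut candidates = pool_devices.to_vec();
    for prior in 0..frag_idx {
        let placed = candidates[(frag_hash(chunk_id, prior) % candidates.len() as u64) as usize];
        candidates.retain(|d| *d != placed);
    }
    candidates[(frag_hash(chunk_id, frag_idx) % candidates.len() as u64) as usize]
}
```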
Raft snapshots
- Trigger: every 10,000 log entries
- Format: `bincode::serialize(&state_machine_inner)`
- Storage: redb `sm_snapshot` table, key = `"latest"`
- Restore: deserialize snapshot → replay log entries after the snapshot index
- Log cleanup: truncate entries before the snapshot index after each snapshot
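The restore path can be sketched with a toy state machine (plain structs stand in for the real bincode-serialized state; the `applied_index` bookkeeping is an assumption of this sketch):

```rust
/// Toy state machine: applies u64 increments. Snapshot = a clone of the
/// state; restore = snapshot + replay of log entries after its index.
#[derive(Clone, Default, Debug, PartialEq)]
pub struct StateMachine {
    pub applied_index: u64,
    pub sum: u64,
}

impl StateMachine {
    pub fn apply(&mut self, index: u64, delta: u64) {
        self.applied_index = index;
        self.sum += delta;
    }
}

/// Restore: start from the latest snapshot, then replay only the log
/// entries with an index greater than the snapshot's applied index.
pub fn restore(snapshot: &StateMachine, log: &[(u64, u64)]) -> StateMachine {
    let mut sm = snapshot.clone();
    for &(index, delta) in log.iter().filter(|e| e.0 > snapshot.applied_index) {
        sm.apply(index, delta);
    }
    sm
}
```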
Rationale
| Criterion | redb | RocksDB | fjall | Custom files |
|---|---|---|---|---|
| Pure Rust | Yes | No (C++) | Yes | Yes |
| Build deps | None | cmake, clang, librocksdb | None | None |
| Binary size | ~50KB | ~5MB | ~100KB | 0 |
| ACID | Yes (COW) | Yes (WAL) | Yes (WAL) | Manual (fsync) |
| Crash recovery | Automatic | Automatic | Automatic | Manual replay |
| Compaction | None needed (B-tree) | Required (LSM) | Required (LSM) | None |
| Maturity | 1.0, used by Firefox | Very mature | Newer | N/A |
| Write amplification | Low (COW) | High (LSM) | High (LSM) | Low |
redb wins on simplicity, zero deps, and sufficient performance for Raft log append + metadata lookup.
Consequences
- No LSM-tree compaction complexity
- No C++ build toolchain required
- Chunk ciphertext kept outside redb (raw block device, or file-backed fallback per ADR-029): simple, inspectable, RDMA-compatible
- redb’s COW B-tree has higher read amplification than LSM for range scans — acceptable for our workload (point lookups + append)
- If redb proves insufficient for high-throughput Raft log append, migrate to fjall (LSM, same API pattern)
References
- redb: https://github.com/cberner/redb
- RFC 1813 §3: NFS3 procedure semantics
- build-phases.md Phase 3: “SSTable” storage (now redb B-tree)
- ADR-029: Raw Block Device Allocator (chunk data I/O)
ADR-023: Protocol RFC Compliance Scope
Status: Accepted. Date: 2026-04-20. Deciders: Architect + implementer.
Context
Kiseki exposes three protocol interfaces: S3 HTTP, NFSv3, NFSv4.2. ADR-013 (POSIX semantics) and ADR-014 (S3 API scope) define the functional subset but don’t reference specific RFC sections or define wire-format compliance testing.
Now that wire protocol implementations exist, we need to codify which RFC requirements are met and how compliance is verified.
Decision
Protocol scope
| Protocol | Standard | Implemented Subset | Total in Standard |
|---|---|---|---|
| NFSv3 | RFC 1813 | 7 of 22 procedures | 22 procedures |
| NFSv4.2 | RFC 7862 | 10 of ~60 operations | ~60 operations |
| S3 | AWS S3 API | 5 of 40+ operations | 40+ operations |
NFSv3 (RFC 1813) — implemented procedures
| # | Procedure | Status | Notes |
|---|---|---|---|
| 0 | NULL | Implemented | Ping/health check |
| 1 | GETATTR | Implemented | File/directory attributes |
| 3 | LOOKUP | Implemented | Name → file handle resolution |
| 6 | READ | Implemented | Byte-range file read |
| 7 | WRITE | Implemented | File data write |
| 8 | CREATE | Implemented | Create new file + directory index entry |
| 16 | READDIR | Implemented | Directory listing with real filenames |
Not implemented: SETATTR, ACCESS, READLINK, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, READDIRPLUS, FSSTAT, FSINFO, PATHCONF, COMMIT.
NFSv4.2 (RFC 7862) — implemented COMPOUND operations
| Op | Name | Status | Notes |
|---|---|---|---|
| 9 | GETATTR | Implemented | Bitmap-selected attributes |
| 10 | GETFH | Implemented | Return current file handle |
| 15 | LOOKUP | Stub (delegates to directory index) | |
| 24 | PUTROOTFH | Implemented | Set root file handle |
| 25 | READ | Implemented | Via stateid + offset + count |
| 38 | WRITE | Implemented | Via stateid + offset + stable |
| 42 | EXCHANGE_ID | Implemented | Random client IDs (C-ADV-7) |
| 43 | CREATE_SESSION | Implemented | Random session IDs (C-ADV-2) |
| 44 | DESTROY_SESSION | Implemented | Session teardown |
| 53 | SEQUENCE | Implemented | Per-request sequencing |
| 63 | IO_ADVISE | Implemented | Accepted (advisory integration pending) |
S3 API — implemented operations
| Operation | HTTP Method | Status |
|---|---|---|
| PutObject | PUT /:bucket/:key | Implemented |
| GetObject | GET /:bucket/:key | Implemented |
| HeadObject | HEAD /:bucket/:key | Implemented |
| DeleteObject | DELETE /:bucket/:key | Stub (returns 204) |
| ListObjectsV2 | GET /:bucket | Not yet |
Compliance testing approach
- BDD feature files map to RFC sections:
  - `specs/features/nfs3-rfc1813.feature` (14 scenarios)
  - `specs/features/nfs4-rfc7862.feature` (20 scenarios)
  - `specs/features/s3-api.feature` (10 scenarios)
- Wire-format validation via Python e2e tests:
  - NFS: raw TCP with `struct.pack` for ONC RPC framing
  - S3: `requests` library for HTTP
- Real client interop (future):
  - NFS: `mount -t nfs -o nfsvers=3,tcp` in Docker
  - S3: `boto3` / `aws-cli`
Consequences
- Clear documentation of what’s implemented vs what’s not
- BDD scenarios serve as living compliance spec
- Real client interop deferred until wire format proven via raw tests
- Expanding the subset (e.g., adding REMOVE, RENAME) requires: new BDD scenario → new step definition → implementation → test green
References
- RFC 1813: NFS Version 3 Protocol Specification
- RFC 7862: NFS Version 4.2 Protocol
- RFC 5531: ONC RPC Version 2
- RFC 4506: XDR: External Data Representation Standard
- AWS S3 API Reference
- ADR-013: POSIX Semantics Scope
- ADR-014: S3 API Scope
ADR-024: Device Management, Storage Tiers, and Capacity Thresholds
Status: Accepted (19/19 device-management BDD scenarios pass). Date: 2026-04-20. Deciders: Architect + domain expert.
Context
The current design (ADR-005) defines three NVMe device classes but does not address:
- HDD / spinning disk tiers (common in cost-optimized HPC clusters)
- System partition vs data partition separation
- Capacity thresholds and degradation behavior
- Device health monitoring and proactive replacement
- Memory-attached storage (CXL, persistent memory)
- Mixed-tier deployments (SSD+HDD, fast-SSD+cheap-SSD)
Real HPC deployments often have:
- System partition: RAID-1 (or RAID-1+0) on 2 SSDs for OS + Kiseki binaries + redb
- Data partitions: JBOD — each NVMe/SSD/HDD is an independent pool member
- Tiering: Hot data on fast NVMe, warm on cheap SSD, cold on HDD
Decision
Device classification
Extend DeviceClass to cover the full storage hierarchy:
| Class | Medium | Use case | Typical capacity |
|---|---|---|---|
| `NvmeU2` | NVMe U.2 TLC/MLC | Metadata, hot data, Raft log | 1-8 TB |
| `NvmeQlc` | NVMe QLC | Checkpoints, warm data | 4-30 TB |
| `NvmePersistentMemory` | Intel Optane / CXL | Cache, ultra-hot metadata | 128 GB - 1 TB |
| `SsdSata` | SATA SSD | Budget fast storage | 1-8 TB |
| `HddEnterprise` | SAS/SATA HDD 10k/15k | Cold data, archive | 4-20 TB |
| `HddBulk` | SATA HDD 7.2k | Deep archive, bulk cold | 10-20 TB |
| `Custom(String)` | User-defined | Vendor-specific | Varies |
Server disk layout
Server node:
```
├── System partition (RAID-1 on 2× SSD)
│   ├── /boot, /root, OS
│   ├── /var/lib/kiseki/redb/    ← Raft log, metadata index
│   └── /var/lib/kiseki/config/  ← Node config, certs
│
├── Data devices (JBOD, managed by Kiseki)
│   ├── /dev/nvme0n1 → pool "fast-nvme" (device member)
│   ├── /dev/nvme1n1 → pool "fast-nvme" (device member)
│   ├── /dev/sda     → pool "bulk-ssd"  (device member)
│   ├── /dev/sdb     → pool "cold-hdd"  (device member)
│   └── ...
│
└── Optional: CXL memory → pool "pmem" (hot cache tier)
```
JBOD for data, RAID-1 for system. Kiseki manages data durability via EC/replication across JBOD members. The system partition uses traditional RAID-1 because redb and Raft log must survive single-disk failure without Kiseki’s own repair mechanism.
Pool capacity management
Per-device-class capacity thresholds
Thresholds vary by device type because NVMe/SSD suffer GC-induced write amplification at high fill levels, while HDD does not. Enterprise arrays (VAST, Pure) can operate at 95%+ because they have global wear leveling — JBOD does not have that luxury.
| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |
Rationale: NVMe/SSD GC pressure increases sharply above ~80% fill. QLC is worse than TLC. The SSD Warning threshold (75%) gives the placement engine time to redirect before the GC cliff. HDD has no such cliff — outer-track vs inner-track difference is ~20%, not a performance wall.
Implementation:
```rust
pub enum PoolHealth {
    Healthy,
    Warning { used_percent: u8 },
    Critical { used_percent: u8 },
    ReadOnly { used_percent: u8 },
    Full,
}

pub struct CapacityThresholds {
    pub warning_pct: u8,
    pub critical_pct: u8,
    pub readonly_pct: u8,
    pub full_pct: u8,
}

impl CapacityThresholds {
    pub fn for_device_class(class: &DeviceClass) -> Self {
        match class {
            DeviceClass::NvmeU2
            | DeviceClass::NvmeQlc
            | DeviceClass::NvmePersistentMemory
            | DeviceClass::SsdSata => Self {
                warning_pct: 75,
                critical_pct: 85,
                readonly_pct: 92,
                full_pct: 97,
            },
            DeviceClass::HddEnterprise | DeviceClass::HddBulk => Self {
                warning_pct: 85,
                critical_pct: 92,
                readonly_pct: 97,
                full_pct: 99,
            },
            DeviceClass::Custom(_) => Self {
                warning_pct: 80,
                critical_pct: 90,
                readonly_pct: 95,
                full_pct: 99,
            },
        }
    }
}

impl AffinityPool {
    pub fn health(&self) -> PoolHealth {
        let pct = ((self.used_bytes * 100) / self.capacity_bytes) as u8;
        // Thresholds depend on the pool's device class (see table above);
        // assumes the pool carries its device class.
        let t = CapacityThresholds::for_device_class(&self.device_class);
        match pct {
            p if p < t.warning_pct => PoolHealth::Healthy,
            p if p < t.critical_pct => PoolHealth::Warning { used_percent: p },
            p if p < t.readonly_pct => PoolHealth::Critical { used_percent: p },
            p if p < t.full_pct => PoolHealth::ReadOnly { used_percent: p },
            _ => PoolHealth::Full,
        }
    }
}
```
Placement engine behavior:
- Healthy: Place chunks according to affinity policy
- Warning: Continue placing but emit telemetry; cluster admin should add capacity
- Critical: Reject new placements; redirect to same device-class sibling only
- ReadOnly: In-flight writes complete; new writes fail with retriable error
- Full: ENOSPC — client gets permanent error
Pool redirection policy: When a pool is Critical, the placement engine redirects to another pool of the same device class only. Never cross device-class boundaries (e.g., never NVMe → HDD). If no same-class sibling has capacity, return ENOSPC to client. This preserves performance SLAs and compliance tag enforcement.
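A minimal sketch of the redirection rule, assuming illustrative pool and health types rather than the real data model:

```rust
pub enum Health {
    Healthy,
    Warning,
    Critical,
    ReadOnly,
    Full,
}

pub struct Pool<'a> {
    pub id: &'a str,
    pub class: &'a str,
    pub health: Health,
}

/// On Critical, redirect only to a same-device-class sibling that still
/// accepts writes; never cross device classes. No sibling => None (ENOSPC).
pub fn redirect<'a>(from: &Pool<'a>, pools: &'a [Pool<'a>]) -> Option<&'a str> {
    pools
        .iter()
        .filter(|p| p.id != from.id && p.class == from.class)
        .find(|p| matches!(p.health, Health::Healthy | Health::Warning))
        .map(|p| p.id)
}
```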
System partition
OS-managed RAID-1 on 2× SSD. Kiseki does not manage the RAID.
Kiseki monitors system partition health:
- On startup: check `/proc/mdstat` for RAID health
- If degraded → log WARNING, continue operating
- If both drives failed → log CRITICAL, refuse to start
- Periodic check every 60 seconds
Admin is responsible for replacing failed system drives and rebuilding the RAID. Kiseki trusts the OS for system partition durability.
Device health monitoring
Each device reports SMART/health metrics:
| Metric | Threshold | Action |
|---|---|---|
| Temperature | >70°C | Warning; throttle if >80°C |
| Wear level (SSD) | >90% life used | Warning; proactive replacement window |
| Bad sectors (HDD) | >0 reallocated | Warning at 1; evacuate at >100 |
| Latency | >10× baseline | Mark degraded; reduce placement priority |
| Errors | Uncorrectable read | Mark suspect; verify EC/replicas for affected chunks |
Device states:
```
Healthy → Degraded → Failed → Removed
               ↘        ↗
              Evacuating → Removed
```
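The SMART thresholds in the table can be folded into a single most-severe-wins evaluation; the combining rule and type names below are assumptions of this sketch:

```rust
pub struct Smart {
    pub temp_c: u32,
    pub wear_pct: u8,              // SSD life used
    pub reallocated_sectors: u32,  // HDD bad sectors
    pub latency_x_baseline: f64,
    pub uncorrectable_reads: u64,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    None,
    Warn,
    Throttle,
    Degrade,
    Evacuate,
}

/// Map SMART metrics to the most severe action implied by the table above.
/// Cutoffs follow the table; the severity ordering is a sketch assumption.
pub fn evaluate(s: &Smart) -> Action {
    if s.uncorrectable_reads > 0 || s.reallocated_sectors > 100 {
        Action::Evacuate // uncorrectable read / cascading bad sectors
    } else if s.latency_x_baseline > 10.0 {
        Action::Degrade // >10x baseline latency
    } else if s.temp_c > 80 {
        Action::Throttle
    } else if s.temp_c > 70 || s.wear_pct > 90 || s.reallocated_sectors > 0 {
        Action::Warn
    } else {
        Action::None
    }
}
```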
Eviction and evacuation policy
Key principle: Unhealthy devices are evacuated proactively, not waited on until failure. Full devices are write-blocked, not evicted (data is still readable).
| Trigger | Action | Automatic? | Priority |
|---|---|---|---|
| SMART wear >90% (SSD) | Evacuate — migrate chunks to other pool members | Yes (background) | Normal |
| Bad sectors >100 (HDD) | Evacuate — migrate before cascading failure | Yes (background) | High |
| Uncorrectable read error | Evacuate + EC repair for affected chunks | Yes (immediate) | Critical |
| Temperature >80°C | Throttle I/O, alert admin | Yes | High |
| Device unresponsive | Mark Failed — trigger EC repair from survivors | Yes (immediate) | Critical |
| Pool at Critical threshold | Block writes — redirect to sibling pools | Yes | Normal |
| Pool at ReadOnly threshold | Drain writes — no new data, existing completes | Yes | Normal |
| Admin-initiated | Evacuate — controlled migration before physical removal | Manual | Normal |
Evacuation process:
1. Mark the device `Evacuating`
2. For each chunk on the device: read its fragment, write it to another healthy device in the pool
3. Update chunk metadata (redb) with the new placement
4. When all chunks are migrated: mark the device `Removed`
5. Admin can physically pull the device
Evacuation speed: Bounded by network and destination device throughput. At 1 GB/s NVMe write speed, a 4TB device evacuates in ~67 minutes. EC repair (from parity) is faster since only the missing fragments need reconstruction.
Invariant: A device in Evacuating state accepts no new writes
but serves reads for chunks not yet migrated.
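The evacuation-time estimate generalizes to a one-liner (decimal units, as in the 4 TB at 1 GB/s example above):

```rust
/// Rough evacuation-time estimate: bytes to move divided by the sustained
/// write rate of the destination. 4 TB at 1 GB/s comes to ~67 minutes.
pub fn evacuation_minutes(used_bytes: f64, write_bytes_per_sec: f64) -> f64 {
    used_bytes / write_bytes_per_sec / 60.0
}
```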
Storage backend per JBOD device
| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Raw block (ADR-029) | Zero FS overhead, direct I/O, aligned writes, bitmap allocator with redb journal | Custom allocator in kiseki-block | Default — recommended for production |
| File-backed (ADR-029) | Same DeviceBackend trait, works in VMs/CI without raw devices | Slight overhead from host FS | VMs and CI environments |
| xfs | Scales to 100M+ files, good NVMe support | Extra FS overhead, inode pressure at scale | Legacy / deprecated |
Default: Raw block device I/O via kiseki-block (DeviceBackend
trait with auto-detection of device characteristics). File-backed
fallback for VMs and CI. XFS is deprecated as a chunk storage backend;
existing XFS deployments can migrate via background evacuation.
Device discovery
Manual configuration (MVP):
- Admin provides the device list in node config (`kiseki-server.toml`)
- Each device: path, class, pool assignment

Future auto-discovery:
- Scan `/sys/block/` for NVMe/SSD/HDD devices
- Classify by transport (NVMe, SATA, SAS) and media (rotational flag)
- Present to admin for pool-assignment confirmation

Device lifecycle states:
- Healthy: normal I/O
- Degraded: elevated errors or latency; reduce write priority
- Evacuating: admin-initiated; migrate chunks to other devices, then remove
- Failed: I/O errors; trigger EC repair for all chunks
- Removed: device physically absent; metadata cleaned up
Tiering and data movement
Static placement (MVP): Admin assigns pools to device classes. Chunk placement is determined at write time by the composition’s view descriptor affinity policy. No automatic migration.
Future: Reactive tiering (per assumption A8):
- Compositions with high read frequency auto-promote from cold → hot
- Compositions with no reads for >N days auto-demote from hot → cold
- Promotion/demotion as background job (copy chunk, update metadata, delete old)
- Bounded by pool capacity thresholds (don’t overfill hot tier)
Data model changes
```rust
pub enum DeviceClass {
    NvmeU2,
    NvmeQlc,
    NvmePersistentMemory,
    SsdSata,
    HddEnterprise,
    HddBulk,
    Custom(String),
}

pub struct DeviceInfo {
    pub id: DeviceId,
    pub class: DeviceClass,
    pub path: String, // /dev/nvme0n1 or mount point
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub state: DeviceState,
    pub pool_id: Option<String>,
}

pub enum DeviceState {
    Healthy,
    Degraded { reason: String },
    Evacuating { progress_percent: u8 },
    Failed { since: u64 },
    Removed,
}
```
Consequences
- Device diversity now first-class (HDD, SSD, NVMe, PMem)
- Capacity management is explicit with defined thresholds
- System partition (RAID-1) separated from data (JBOD)
- Device health monitoring enables proactive replacement
- Tiering is future work; static placement for MVP
- Cluster admin must provision devices and assign to pools at setup time
References
- ADR-005: EC and chunk durability (per pool)
- ADR-022: Storage backend (redb on system partition)
- Assumption A4: ClusterStor hardware
- Assumption A8: Reactive tiering
- Failure mode F-I2: Storage node failure
- Failure mode F-I4: Disk/device failure
ADR-025: Storage Administration API
Status: Proposed. Date: 2026-04-20. Deciders: Architect + domain expert.
Context
Storage administrators need to performance-tune the system similar to
Ceph (ceph osd pool set), VAST (management UI), or Lustre (lctl).
The current control plane API handles tenant lifecycle but has no
storage admin surface — no pool management, device management,
performance tuning, or cluster-wide observability.
API-first principle: All admin interactions go through gRPC APIs.
CLI (kiseki-cli), Web UI, and job orchestrators (Ansible, Terraform)
are wrappers around these APIs. No SSH-and-edit-config path.
Decision
Admin API surface (new gRPC service)
```proto
service StorageAdminService {
  // === Device management ===
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);
  rpc GetDevice(GetDeviceRequest) returns (DeviceInfo);
  rpc AddDevice(AddDeviceRequest) returns (AddDeviceResponse);
  rpc RemoveDevice(RemoveDeviceRequest) returns (RemoveDeviceResponse);
  rpc EvacuateDevice(EvacuateDeviceRequest) returns (EvacuateDeviceResponse);
  rpc CancelEvacuation(CancelEvacuationRequest) returns (CancelEvacuationResponse);

  // === Pool management ===
  rpc ListPools(ListPoolsRequest) returns (ListPoolsResponse);
  rpc GetPool(GetPoolRequest) returns (PoolInfo);
  rpc CreatePool(CreatePoolRequest) returns (CreatePoolResponse);
  rpc SetPoolDurability(SetPoolDurabilityRequest) returns (SetPoolDurabilityResponse);
  rpc SetPoolThresholds(SetPoolThresholdsRequest) returns (SetPoolThresholdsResponse);
  rpc RebalancePool(RebalancePoolRequest) returns (RebalancePoolResponse);

  // === Performance tuning ===
  rpc GetTuningParams(GetTuningParamsRequest) returns (TuningParams);
  rpc SetTuningParams(SetTuningParamsRequest) returns (SetTuningParamsResponse);

  // === Cluster observability ===
  rpc ClusterStatus(ClusterStatusRequest) returns (ClusterStatus);
  rpc PoolStatus(PoolStatusRequest) returns (PoolStatus);
  rpc DeviceHealth(DeviceHealthRequest) returns (stream DeviceHealthEvent);
  rpc IOStats(IOStatsRequest) returns (stream IOStatsEvent);

  // === Shard management ===
  rpc ListShards(ListShardsRequest) returns (ListShardsResponse);
  rpc GetShard(GetShardRequest) returns (ShardInfo);
  rpc SplitShard(SplitShardRequest) returns (SplitShardResponse);
  rpc SetShardMaintenance(SetShardMaintenanceRequest) returns (SetShardMaintenanceResponse);

  // === Repair and scrub ===
  rpc TriggerScrub(TriggerScrubRequest) returns (TriggerScrubResponse);
  rpc RepairChunk(RepairChunkRequest) returns (RepairChunkResponse);
  rpc ListRepairs(ListRepairsRequest) returns (ListRepairsResponse);
}
```
Tuning parameters
Storage admins tune at four levels: cluster → pool → tenant → workload. Lower levels inherit from higher levels and can only narrow settings, not broaden them.
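The narrow-only inheritance rule can be sketched as a range intersection (illustrative types, not the real config model):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Limit {
    pub min: u64,
    pub max: u64,
}

/// A child level may only narrow its parent's range, never broaden it:
/// the effective limit is the child clamped to the parent's bounds.
pub fn narrow(parent: Limit, child: Limit) -> Limit {
    Limit {
        min: child.min.max(parent.min),
        max: child.max.min(parent.max),
    }
}
```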
Cluster-wide tuning
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| `compaction_rate_mb_s` | 100 | 10-1000 | Background compaction throughput cap |
| `gc_interval_s` | 300 | 60-3600 | How often GC scans for reclaimable chunks |
| `rebalance_rate_mb_s` | 50 | 0-500 | Background rebalance/evacuation throughput |
| `scrub_interval_h` | 168 (7d) | 24-720 | How often the integrity scrub runs |
| `max_concurrent_repairs` | 4 | 1-32 | Parallel EC repair jobs |
| `stream_proc_poll_ms` | 100 | 10-1000 | View materialization poll interval |
| `inline_threshold_bytes` | 4096 | 512-65536 | Below this, data is inlined in the delta |
| `raft_snapshot_interval` | 10000 | 1000-100000 | Entries between Raft snapshots |
Per-pool tuning
| Parameter | Default | Range | What it controls |
|---|---|---|---|
| `ec_data_chunks` | 4 (NVMe) / 8 (HDD) | 2-16 | EC data fragment count |
| `ec_parity_chunks` | 2 (NVMe) / 3 (HDD) | 1-8 | EC parity fragment count |
| `replication_count` | 3 | 2-5 | For replication pools (not EC) |
| `warning_threshold_pct` | per ADR-024 | 50-95 | Capacity warning level |
| `critical_threshold_pct` | per ADR-024 | 60-98 | Capacity critical level |
| `readonly_threshold_pct` | per ADR-024 | 70-99 | Read-only level |
| `target_fill_pct` | 70 (SSD) / 80 (HDD) | 50-90 | Rebalance target fill level |
| `chunk_alignment_bytes` | 4096 | 512-65536 | On-disk alignment (RDMA/NVMe) |
| `prefer_sequential_alloc` | true | bool | Allocate sequentially in the pool file |
Per-tenant tuning (via ControlService, existing)
| Parameter | Existing API | What it controls |
|---|---|---|
| `quota.capacity_bytes` | SetQuota | Tenant capacity ceiling |
| `quota.iops` | SetQuota | IOPS limit |
| `quota.metadata_ops_per_sec` | SetQuota | Metadata op rate limit |
| `dedup_policy` | CreateOrganization | Cross-tenant vs isolated dedup |
| `compliance_tags` | SetComplianceTags | Regulatory constraints |
Per-workload tuning (via ControlService + Advisory)
| Parameter | API | What it controls |
|---|---|---|
| `workload.quota` | CreateWorkload | Workload-level capacity/IOPS |
| `advisory.hints_per_sec` | Advisory ceilings | Hint submission rate |
| `advisory.prefetch_bytes_max` | Advisory ceilings | Prefetch budget |
| `advisory.profile` | Advisory profiles | Allowed hint profiles |
Observability API
ClusterStatus response
```proto
message ClusterStatus {
  uint32 node_count = 1;
  uint32 healthy_nodes = 2;
  uint64 total_capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 pool_count = 5;
  uint32 shard_count = 6;
  uint32 active_repairs = 7;
  uint32 evacuating_devices = 8;
  repeated PoolSummary pools = 9;
}
```
PoolStatus response
```proto
message PoolStatus {
  string pool_id = 1;
  PoolHealth health = 2;
  uint64 capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 device_count = 5;
  uint32 healthy_devices = 6;
  uint32 chunk_count = 7;
  // Performance metrics (rolling 60s window)
  double read_iops = 8;
  double write_iops = 9;
  double read_throughput_mb_s = 10;
  double write_throughput_mb_s = 11;
  double avg_read_latency_ms = 12;
  double avg_write_latency_ms = 13;
  double p99_read_latency_ms = 14;
  double p99_write_latency_ms = 15;
}
```
Streaming events
```proto
message DeviceHealthEvent {
  DeviceId device_id = 1;
  DeviceState old_state = 2;
  DeviceState new_state = 3;
  string reason = 4;
  uint64 timestamp_ms = 5;
}

message IOStatsEvent {
  string pool_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_throughput_mb_s = 4;
  double write_throughput_mb_s = 5;
  uint64 timestamp_ms = 6;
}
```
Admin personas and API mapping
| Persona | Typical actions | APIs used |
|---|---|---|
| Cluster admin | Add/remove nodes, set cluster params, view health | StorageAdminService (all), ClusterStatus |
| Storage admin | Create pools, tune EC, set thresholds, rebalance | Pool*, SetTuningParams, PoolStatus |
| Tenant admin | Set quotas, compliance, retention, advisory | ControlService (existing) |
| Workload admin | Tune advisory, prefetch, dedup hints | Advisory (existing) + workload quota |
| On-call/SRE | View health, trigger repair, check alerts | ClusterStatus, DeviceHealth stream, TriggerScrub |
CLI mapping (kiseki-cli)
```
kiseki cluster status                → ClusterStatus
kiseki pool list                     → ListPools
kiseki pool status fast-nvme         → PoolStatus
kiseki pool create --name bulk-hdd --class HddBulk --ec 8+3
kiseki pool tune fast-nvme --warning-pct 75 --target-fill 70
kiseki device list                   → ListDevices
kiseki device add /dev/nvme2n1 --pool fast-nvme
kiseki device evacuate dev-uuid      → EvacuateDevice
kiseki device health --watch         → DeviceHealth stream
kiseki tune set --compaction-rate 200 --gc-interval 120
kiseki shard list                    → ListShards
kiseki shard split shard-uuid        → SplitShard
kiseki repair scrub --pool fast-nvme
kiseki iostat --pool fast-nvme       → IOStats stream
```
Authorization model
| API | Who can call | Auth |
|---|---|---|
| StorageAdminService (all) | Cluster admin only | mTLS cert with admin OU |
| ControlService (tenant ops) | Tenant admin | mTLS cert with tenant OU |
| Advisory (workload ops) | Workload identity | mTLS cert + workflow token |
| Read-only observability | Cluster admin, SRE | mTLS cert with admin/sre OU |
Tenant admins cannot access StorageAdminService. They see their own quotas and compliance tags, not pool health or device state. This preserves the zero-trust boundary (I-T4).
Consequences
- Full API-first admin surface — no SSH-and-edit needed
- CLI, UI, automation all use the same gRPC APIs
- Performance tuning at four levels with inheritance
- Streaming observability for real-time monitoring
- Clear authorization boundary between cluster admin and tenant admin
- Significantly expands the gRPC surface (20+ new RPCs)
References
- ADR-024: Device management and capacity thresholds
- ADR-005: EC and chunk durability
- ADR-020: Workflow advisory (workload-level tuning)
- Ceph: `ceph osd pool set` command reference
- Lustre: `lctl set_param` tunables
- I-T4: Zero-trust infra/tenant boundary
Addendum: Adversarial Review Resolutions (2026-04-20)
C1: Per-tenant resource usage → ControlService, not StorageAdminService
Per-tenant resource usage (capacity, IOPS attribution) is exposed via ControlService with tenant-admin authorization, NOT via StorageAdminService. Cluster admin sees pool-level aggregates only. Tenant admin sees their own usage. This preserves I-T4.
```proto
// In ControlService (not StorageAdminService):
rpc GetTenantUsage(GetTenantUsageRequest) returns (TenantUsage);
// Requires tenant admin cert (mTLS OU = tenant ID)
```
C2: Per-device I/O stats added
```proto
rpc DeviceIOStats(DeviceIOStatsRequest) returns (stream DeviceIOStatsEvent);

message DeviceIOStatsEvent {
  string device_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_latency_p50_ms = 4;
  double read_latency_p99_ms = 5;
  double errors_per_sec = 6;
  uint64 timestamp_ms = 7;
}
```
C3: Shard health observability added
```proto
rpc GetShardHealth(GetShardHealthRequest) returns (ShardHealthInfo);

message ShardHealthInfo {
  string shard_id = 1;
  uint64 leader_node_id = 2;
  uint32 replica_count = 3;
  uint32 reachable_count = 4;
  uint32 recent_elections = 5;
  uint64 commit_lag_entries = 6;
}
```
C4: EC parameters immutable per pool
New invariant I-C6: EC parameters (data_chunks, parity_chunks) are
immutable per pool. SetPoolDurability applies only to NEW chunks.
Existing chunks retain their original EC configuration. Explicit
re-encoding via ReencodePool RPC (long-running, cancellable).
C5: Compaction rate validation
Protobuf-level validation: compaction_rate_mb_s ∈ [10, 1000].
API rejects values outside range. Audit event on every change.
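The range check is trivial but worth pinning down; a sketch of the server-side validation (function name and error shape are illustrative):

```rust
/// Validate compaction_rate_mb_s against its documented range [10, 1000];
/// out-of-range values are rejected at the API boundary before any apply.
pub fn validate_compaction_rate(mb_s: u32) -> Result<u32, String> {
    if (10..=1000).contains(&mb_s) {
        Ok(mb_s)
    } else {
        Err(format!("compaction_rate_mb_s {} outside [10, 1000]", mb_s))
    }
}
```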
C6: Inline threshold is prospective
New invariant I-L9: A delta’s inlined payload is immutable after
write. inline_threshold_bytes changes do NOT retroactively affect
existing deltas. Old and new thresholds coexist in the log.
C7: RemoveDevice requires evacuated state
New invariant I-D5: RemoveDevice rejects if device state is not
Removed (post-evacuation). Precondition: EvacuateDevice must
complete first. Error code: DEVICE_NOT_EVACUATED.
C8: Pool modifications audited to affected tenants
New invariant I-T4c: Cluster admin modifications to pools containing tenant data (SetPoolDurability, EvacuateDevice) are audit-logged to the affected tenant’s audit shard. Tenant admin can review.
C9: Tuning change audit trail
New invariant I-A6: All tuning parameter changes via SetTuningParams are recorded in the cluster audit shard with parameter name, old value, new value, timestamp, and admin identity.
H5: SRE roles defined
| Role | Access |
|---|---|
| `cluster-admin` | Full StorageAdminService (read + write) |
| `sre-on-call` | Read-only: List*, Get*, Status, Health streams |
| `sre-incident-response` | SRE + TriggerScrub, RepairChunk |
Enforced via mTLS certificate OU field.
M4: DrainNode added
rpc DrainNode(DrainNodeRequest) returns (stream DrainNodeProgress);
Internally evacuates all devices on the node, then removes them. Idempotent, safe to retry.
ADR-026: Raft Topology — Per-Shard on Fabric (Strategy A)
Status: Accepted. Date: 2026-04-20. Deciders: Architect + domain expert.
Context
Kiseki needs multi-node Raft for durability (I-L2) and failover. The cluster operates on a shared Slingshot fabric (200 Gbps per node) where control messages (Raft) and data (chunk I/O) share bandwidth.
Three strategies were evaluated:
- A: Raft per shard, all traffic on fabric
- B: Raft for metadata only, primary-copy for data (Ceph-like)
- C: Multi-Raft with batched transport (TiKV-like)
Decision
Strategy A: Raft per shard, on the data fabric.
Start with A, add C’s batching optimization when monitoring shows it’s needed (>1000 connections per node).
Why this works
Raft traffic is negligible compared to data fabric capacity:
| Scale | Shards | Groups/node | Heartbeat/node | Replication/node | % of 200 Gbps |
|---|---|---|---|---|---|
| 10 nodes | 100 | 30 | 78 KB/s | 3 MB/s | <0.001% |
| 100 nodes | 1000 | 30 | 78 KB/s | 3 MB/s | <0.001% |
| 1000 nodes | 10,000 | 30 | 78 KB/s | 3 MB/s | <0.001% |
Groups-per-node stays constant at ~30 because shard count scales with node count (each node hosts ~30 shard replicas regardless of cluster size).
Key insight: Raft only for metadata
Chunk data does NOT go through Raft. The write path:
```
Large write:
  Client → Gateway → encrypt → chunk to NVMe (EC direct) → delta to Raft (1 KB metadata)

Small write (<4 KB):
  Client → Gateway → encrypt → inline in delta → Raft only
```
Raft replicates delta metadata (~1KB per operation). Chunk ciphertext (64KB-64MB) is written directly to NVMe devices via EC. This means:
- Write throughput limited by NVMe/network, NOT by Raft
- Raft consensus adds ~30-60µs (RDMA) or ~75-250µs (TCP) per metadata op
- 50-100k metadata ops/sec per shard, shards in parallel
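The size-based fork in the write path reduces to one comparison against `inline_threshold_bytes` (default 4096 per ADR-025); the enum names below are illustrative, not the real data model:

```rust
#[derive(Debug, PartialEq)]
pub enum WritePath {
    /// Ciphertext rides inside the delta; Raft is the only persistence step.
    InlineInDelta { payload_len: usize },
    /// Ciphertext is EC-written to NVMe; only ~1 KB of delta metadata
    /// goes through Raft.
    ChunkThenDelta { payload_len: usize },
}

pub fn choose_write_path(payload_len: usize, inline_threshold: usize) -> WritePath {
    if payload_len < inline_threshold {
        WritePath::InlineInDelta { payload_len }
    } else {
        WritePath::ChunkThenDelta { payload_len }
    }
}
```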
Projected performance vs competition
| Metric | Kiseki (projected) | Lustre | Ceph | GPFS |
|---|---|---|---|---|
| Write GB/s /node | 25-40 | 5-12 | 1-3 | 5-15 |
| Read GB/s /node | 40-50 | 10-20 | 3-8 | 10-30 |
| Write latency | 30-250µs | 100-500µs | 500-2000µs | 100-300µs |
| Metadata IOPS /node | 1.5-3M | 50-100k | 10-50k | 200k |
Raft group configuration
| Raft group | Members | Where |
|---|---|---|
| Key manager | 3-5 | Dedicated keyserver nodes |
| Log shard (per shard) | 3 | Spread across storage nodes |
| Audit shard (per tenant) | 3 | Spread across storage nodes |
Placement rule: no two members of the same group on the same node (or same rack if rack-aware placement is configured).
Transport
| Phase | Transport | Optimization |
|---|---|---|
| Phase 1 (now) | TCP + TLS | Direct connections, one per Raft peer |
| Phase 2 (10+ nodes) | TCP + TLS + connection pooling | Reuse connections across groups |
| Phase 3 (100+ nodes) | Batched transport (Strategy C) | Coalesce heartbeats per node pair |
| Future | Slingshot CXI / RDMA | Sub-10µs Raft RTT |
Election storm mitigation
Correlated failure (rack power loss) causes simultaneous elections for all Raft groups on affected nodes (~30 groups per node × N nodes).
Mitigations:
- Randomized election timeouts: openraft already does this (150-300ms jitter)
- Staggered group startup: on node restart, groups start elections over a 5-second window (not all at once)
- Leader sticky: prefer re-electing the same leader if it recovers within the election timeout (avoids unnecessary leader changes)
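The staggered-startup mitigation can be sketched as a deterministic per-group offset into the 5-second window; hashing the shard id is one illustrative jitter source, not necessarily what the implementation uses:

```rust
// Sketch of staggered group startup: on node restart, each Raft group's
// election timer starts at an offset inside a 5-second window, so ~30
// groups on one node don't all start elections simultaneously.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

const STARTUP_WINDOW_MS: u64 = 5_000;

fn startup_delay(shard_id: u64) -> Duration {
    let mut h = DefaultHasher::new();
    shard_id.hash(&mut h);
    // Deterministic per-shard offset, uniform-ish over the window.
    Duration::from_millis(h.finish() % STARTUP_WINDOW_MS)
}

fn main() {
    for shard in 0..30u64 {
        let d = startup_delay(shard);
        assert!(d < Duration::from_millis(STARTUP_WINDOW_MS));
    }
}
```

This stacks with openraft's own 150-300ms randomized election timeout: the window spreads the start, the jitter spreads the retries.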
Network requirements
| Network | Purpose | Kiseki traffic |
|---|---|---|
| Data fabric (Slingshot/ethernet) | Chunk I/O + Raft | 99.99% data, 0.01% Raft |
| Management network (if available) | ControlService, monitoring | Optional: route Raft here to fully isolate |
Management network is NOT required. Raft on the fabric is fine because the overhead is <0.001% of capacity. If a management network exists (common in HPC), Raft CAN be routed there for belt-and-suspenders isolation, but it’s not necessary.
Consequences
- Simplest implementation: use openraft’s built-in TCP transport
- No separate management network required (but can use one)
- Scales to ~10k shards / 1000 nodes without transport optimization
- Add batching (Strategy C) as a pure transport optimization later
- Election storms during correlated failure are bounded by randomized timeouts
- Raft adds ~30-250µs to metadata write latency (acceptable for HPC)
Migration path
If Strategy A proves insufficient at extreme scale:
- Add batched transport (C) — pure transport change, no protocol change
- If even C is insufficient, partition shards into metadata-Raft and data-EC groups (B) — larger refactor but data model already supports it
References
- ADR-005: EC and chunk durability
- ADR-022: Storage backend (redb)
- ADR-024: Device management
- TiKV Multi-Raft: https://tikv.org/deep-dive/scalability/multi-raft/
- openraft: https://datafuselabs.github.io/openraft/
- Slingshot fabric: ~5-10µs RTT, 200 Gbps per endpoint
ADR-027: Single-Language Implementation — Rust Only
Status: Accepted (implemented 2026-04-21, Go code removed)
Date: 2026-04-20 (proposed), 2026-04-21 (accepted + migrated)
Context: Supersedes the implicit language split in docs/analysis/design-conversation.md §2.13. No prior ADR recorded the Rust/Go decision.
Context
Kiseki’s original design split the implementation across two languages:
- Rust for the core (log, chunks, views, native client, hot paths)
- Go for the control plane (tenancy, IAM, policy, flavor, federation, audit export, CLI) and one half each of two cross-cutting contexts (`kiseki-audit` + `control/pkg/audit`; `kiseki-advisory` + `control/pkg/advisory`)
- gRPC/protobuf as the boundary
The split was recorded in docs/analysis/design-conversation.md §2.13 but never promoted to an ADR. It surfaces in specs/architecture/module-graph.md (Go modules section), .claude/coding/go.md, and in two contexts that are currently split across both languages. The split pre-dates ADR-001 (pure-Rust, no Mochi/FFI), which already identified “FIPS compliance surface across two languages” as a cost.
At proposal time, 1,490 lines of Go business logic existed with 32/32 BDD
scenarios passing (godog, Strict:true). The migration ported all 32 scenarios
to cucumber-rs backed by a new kiseki-control Rust crate (~650 lines,
10 modules) before deleting the Go code. See
specs/implementation/adr027-go-to-rust-migration.md for the migration plan
and specs/findings/adr027-adversarial.md for the gate-1 review.
Decision
Implement Kiseki in Rust only. Retire the Go control plane, the Go CLI, and the Go halves of audit and advisory. Keep gRPC/protobuf as the wire boundary between the control plane and data plane so that a future non-Rust control plane remains possible.
Concretely:
- New Rust crates replace the Go packages one-for-one:
  - `kiseki-control` — control plane daemon (tenancy, IAM, policy, flavor, federation, audit export, discovery)
  - `kiseki-cli` — admin CLI
  - The `control/pkg/audit` half is absorbed into `kiseki-audit`
  - The `control/pkg/advisory` half is absorbed into `kiseki-advisory`
- gRPC/protobuf stays as the wire boundary. `kiseki-control` serves `ControlService`, `AuditExportService`, and policy endpoints over gRPC. `kiseki-server` consumes them as a client. No in-process shortcut across the boundary, even though both sides are now Rust.
- Architectural firewall is enforced by crate dependencies, not by language. `kiseki-control` and `kiseki-cli` depend only on `kiseki-common` and `kiseki-proto`. They MUST NOT depend on any data-path crate (`kiseki-log`, `kiseki-chunk`, `kiseki-composition`, `kiseki-view`, `kiseki-gateway-*`, `kiseki-client`, `kiseki-keymanager`). Enforced by a `cargo-deny` or workspace-level architectural lint at CI.
- Control plane binaries live alongside data-plane binaries in `crates/bin/`:
  - `bin/kiseki-control/` (new)
  - `bin/kiseki-cli/` (new)
- gRPC server framework: `tonic` (already the Rust-side choice). Config: `figment` or `config-rs` for layered YAML/env overrides (parity with Go's viper pattern).
- Federation / state machine: `kiseki-control` uses `openraft` (already the project's Raft choice per ADR-026) for replicated control-plane state (policy, opt-out state, tenant topology). This also eliminates the second Raft vendor that a Go control plane would have required (etcd client or dragonboat).
Rationale
One domain model
specs/ubiquitous-language.md defines Tenant, Org, Project, Workload, RetentionHold, Policy, Flavor, WorkflowRef, OperationAdvisory. Every one of these would otherwise need two implementations (Rust enums/structs + Go types). Two implementations drift: field renames, validation subtly different, invariant enforcement on one side only. Consolidating removes the class of bug where control-plane Go says a name is valid but data-path Rust rejects it (or vice versa).
One error taxonomy
specs/architecture/error-taxonomy.md enumerates retriable / permanent / security error categories. A Go implementation mirrors the Rust taxonomy as Go types + gRPC status mappings. One language means one thiserror-derived enum hierarchy and one mapping to tonic::Status.
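A minimal sketch of what a single taxonomy-to-status mapping looks like. The error variants are illustrative, and numeric gRPC status codes stand in for `tonic::Status` to keep the example dependency-free (in the real crate this would be a thiserror-derived hierarchy):

```rust
// One error taxonomy, one mapping to gRPC status codes — sketch only.
#[derive(Debug)]
enum KisekiError {
    ShardUnavailable,  // retriable
    InvalidTenantName, // permanent
    KekAccessDenied,   // security
}

/// Numeric gRPC status codes per the gRPC specification.
fn grpc_code(err: &KisekiError) -> u32 {
    match err {
        KisekiError::ShardUnavailable => 14, // UNAVAILABLE — client may retry
        KisekiError::InvalidTenantName => 3, // INVALID_ARGUMENT — permanent
        KisekiError::KekAccessDenied => 7,   // PERMISSION_DENIED — security
    }
}

fn main() {
    assert_eq!(grpc_code(&KisekiError::ShardUnavailable), 14);
    assert_eq!(grpc_code(&KisekiError::InvalidTenantName), 3);
    assert_eq!(grpc_code(&KisekiError::KekAccessDenied), 7);
}
```

With two languages this mapping exists twice and can disagree; with one, the match arm above is the only place a category-to-code decision lives.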
Smaller FIPS surface
ADR-001 already cited “FIPS compliance surface across two languages” as a reason to reject C/C++ FFI. The same cost applies to Go: either BoringCrypto (Go’s FIPS module) is part of the certification boundary, or the control plane sits outside the FIPS module boundary and the certification scope has to be carefully drawn. Rust-only gives one aws-lc-rs FIPS module boundary for the whole system.
Cross-context crates stop being split
kiseki-audit and kiseki-advisory are currently split across Rust and Go. That means two queue implementations, two filter implementations, two sets of integration tests, two ways that tenant-scope validation can drift. In Rust-only, each is one crate with one set of invariants.
Eliminated toolchain duplication
Today’s per-commit gate has to run: cargo fmt, clippy, cargo-deny, cargo test and go fmt, go vet, golangci-lint, go test -race. Rust-only halves the CI configuration, halves the local developer setup, and removes one supply-chain audit surface (Go module proxy + checksum DB alongside crates.io).
Reuse of kiseki-common and kiseki-proto
The CLI and control plane can import the real domain types rather than regenerated protobuf Go structs. Validation logic written once in kiseki-common (e.g., tenant-id parsing, flavor matching, policy inheritance) is reused verbatim in the control plane and the CLI.
Build-phase cost is low now
Phase 0 has not started. Adding two Rust crates (kiseki-control, kiseki-cli) is cheaper than maintaining a separate control/ Go module, its go.mod, its generated proto outputs, and its CI lane. The cost rises monotonically with every phase that ships Go code.
Hiring and cognitive load
Contributors need one language, one async runtime (tokio), one tracing stack, one error model. Code review crosses fewer idiom boundaries. Onboarding doc shrinks.
Alternatives considered
-
Keep Go as specified.
- Pro: Go’s ecosystem for control planes (cobra, viper, operator-sdk, client-go patterns) is the golden path; k8s, etcd, Consul all use it. GC is fine on cold paths. Operators extending the system are more likely to know Go.
- Pro: the language wall is the architectural wall — the Go control plane physically cannot reach into data-plane memory or internals.
- Con: every benefit above comes with the duplication, drift, and FIPS-surface costs enumerated in “Rationale”. With no code written, the ecosystem-maturity argument is weaker than at a later stage.
-
Port only the CLI to Rust, keep the Go control-plane daemon.
- Pro: preserves Go for the longer-lived daemon code where operator-sdk patterns matter most. Low churn.
- Con: doesn’t remove duplication for the split contexts (`audit`, `advisory`). Doesn’t shrink the FIPS surface. Doesn’t remove the second toolchain from CI. Half-measure.
-
Rewrite the core in Go (single-language Go).
- Rejected immediately: Go GC and lack of precise control over allocation and layout disqualify it from the hot data path at 200 Gbps per NIC. This inverts the original rationale for Rust in the core.
-
Separate Rust crate per Go package, but share no runtime (same-language boundary still isolated by process).
- Considered. Rejected: unnecessary. The isolation value of “separate OS process” is already provided by `kiseki-control` being a distinct binary. Running two daemons is orthogonal to the language question.
-
Defer the decision until after Phase 3.
- Rejected: the decision is cheapest to reverse now. Each build phase that ships Go code raises the cost of consolidation and lets duplication set in. The analyst already flagged the split without recording a decision; formalizing now is overdue.
Consequences
Positive
- Single toolchain: `cargo fmt`, `clippy`, `cargo-deny`, `cargo test`, `cargo audit`. Lefthook configuration shrinks.
- Single FIPS module boundary (aws-lc-rs).
- Domain types (`Tenant`, `Policy`, `RetentionHold`, `Flavor`, `WorkflowRef`, `OperationAdvisory`) exist once in `kiseki-common`.
- `kiseki-audit` and `kiseki-advisory` become whole crates rather than split halves. Their invariants (I-A1..I-A3, I-WA1..I-WA16) are enforced in one place.
- `kiseki-control` can reuse `openraft` (ADR-026) for its replicated state rather than requiring a second Raft implementation (etcd/dragonboat).
- No generated Go protobuf stubs to keep in sync; one generated tree under `crates/kiseki-proto/`.
- CI matrix shrinks; no `go test -race` lane.
Negative
- Loses the “language wall as architectural wall” property. Must be replaced with crate-graph enforcement (see “Enforcement” below). This is a discipline cost and must be tooled, not trusted.
- Rust’s CLI/operator ecosystem (`clap`, `tonic`, `figment`) is less mature than Go’s (cobra, viper, operator-sdk). Some patterns (admission webhooks, CRD controllers) will require more bespoke code if we ever grow a k8s operator.
- Contributors with Go-only platform experience face a higher barrier to writing control-plane extensions.
- `kiseki-control` uses `tokio` for async I/O and is exposed to async-Rust complexity on request handlers (cancellation safety, `'static` bounds) that Go handlers would not have had.
- One-time rewrite cost for the control-plane spec surface (`api-contracts.md`, `module-graph.md`, `.claude/coding/go.md` → remove or archive; `build-phases.md` may need to re-sequence control-plane phases).
Enforcement (replacing the language wall)
The split previously enforced “control plane never reaches into data plane” structurally. In Rust-only, this is enforced by:
- Crate-graph rule. `kiseki-control` and `kiseki-cli` depend only on `kiseki-common` and `kiseki-proto`. This is asserted by a CI check that greps Cargo manifests, or by `cargo-deny`’s `bans` section, or by a custom workspace lint.
- No re-export shortcut. `kiseki-common` MUST NOT re-export internal types from data-path crates. This is already the case; restated here as a rule.
- gRPC boundary preserved. Even though both sides are now Rust, control-plane-to-data-plane traffic still goes through `tonic` over gRPC, not through in-process trait calls. This keeps the wire contract as the source of truth and preserves the option of a non-Rust control plane later.
- Runtime separation. `kiseki-control` runs as its own binary (`bin/kiseki-control/`), not as a library linked into `kiseki-server`. The isolation that process separation provides is preserved.
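The crate-graph rule can be checked mechanically. A sketch of the manifest-grep variant of the CI check; the paths in the usage comment are assumptions, and the banned-crate list is taken from the Decision section:

```shell
#!/bin/sh
# Sketch of the manifest-grep check: fail CI if a control-plane
# Cargo.toml names any data-path crate. Illustrative, not the real check.
set -eu

BANNED='kiseki-log|kiseki-chunk|kiseki-composition|kiseki-view|kiseki-gateway-|kiseki-client|kiseki-keymanager'

check_manifest() {
    manifest="$1"
    if grep -Eq "$BANNED" "$manifest"; then
        echo "FORBIDDEN: $manifest depends on a data-path crate" >&2
        return 1
    fi
    echo "ok: $manifest"
}

# Usage in CI (paths assumed):
#   check_manifest crates/bin/kiseki-control/Cargo.toml
#   check_manifest crates/bin/kiseki-cli/Cargo.toml
```

A grep cannot see transitive dependencies, which is exactly the gap the open item below escalates to `cargo-deny` or a custom workspace lint.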
Migration
At proposal time no code existed, so migration was a spec update (the code migration itself is covered in specs/implementation/adr027-go-to-rust-migration.md):
- `docs/analysis/design-conversation.md` §2.13: annotate with a pointer to this ADR.
- `specs/architecture/module-graph.md`: delete the “Go modules (control plane)” section; add the new Rust crates (`kiseki-control`, `kiseki-cli`) and update the “Bounded context → module mapping” table to say Rust for every row.
- `specs/architecture/build-phases.md`: review phase sequencing — the Go control-plane phase collapses into a Rust phase; audit/advisory phases no longer have a “Go side” task.
- `.claude/CLAUDE.md` and `.claude/guidelines/go.md`: remove Go from the workflow router; keep `.claude/coding/go.md` archived (move to `specs/archive/` or delete) as a historical record.
- `.claude/coding/rust.md`: add a “control plane” section describing `kiseki-control`/`kiseki-cli` conventions (config with `figment`, CLI with `clap`, server with `tonic` + `axum` for any REST admin surface).
- `Makefile` (when it exists): drop Go lanes.
- `specs/features/control-plane.feature`: BDD scenarios remain; the step definitions move from `godog` to `cucumber-rs`.
Open items (escalated to adversary gate-1)
- Verify the crate-graph rule (control plane depends only on `kiseki-common`/`kiseki-proto`) is enforceable with `cargo-deny` alone, or whether a custom workspace lint is needed.
- Confirm `cucumber-rs` covers the Gherkin features that `godog` was planned to run, without step-definition regressions.
- Confirm FIPS posture: aws-lc-rs covers the control-plane’s TLS needs (mTLS CA, admin endpoints) as well as the data-plane’s. No Go BoringCrypto equivalent is needed.
- Verify that removing the Go language wall does not create a realistic path by which a control-plane code change accidentally links data-path crates. Propose a pre-merge check if manifest-grep is insufficient.
- Decide the fate of `control/pkg/discovery`: if fabric discovery uses libfabric/CXI, it was already going to need a Rust FFI layer; confirm the Rust-only home for it is `kiseki-control` (or a new `kiseki-discovery` crate).
References
- ADR-001: Pure Rust, No Mochi Dependency (FIPS surface precedent).
- ADR-021: Workflow Advisory Architecture (defines the Rust+Go split for advisory that this ADR collapses).
- ADR-026: Raft Topology — openraft is the Rust-side Raft; now also the control plane’s Raft.
- `docs/analysis/design-conversation.md` §2.13 — original (now superseded) language-split rationale.
- `specs/architecture/module-graph.md` — current two-language module layout (to be rewritten).
- `.claude/coding/go.md` — Go coding standards (to be archived on acceptance).
ADR-028: External Tenant KMS Providers
Status: Accepted
Date: 2026-04-22
Context: I-K11, ADR-002, ADR-003, ADR-007
Adversarial review: 2026-04-22 (8 findings: 2H 5M 1L, all resolved)
Problem
ADR-002 defines a two-layer encryption model where tenant KEKs wrap
access to system DEK derivation material. The current implementation
hardcodes tenant KEK as a locally-managed [u8; 32] — there is no
mechanism for tenants to bring their own key management infrastructure.
HPC and enterprise tenants require integration with their existing KMS:
- Regulatory compliance (FIPS 140-2/3, Common Criteria, SOC 2)
- Centralized key lifecycle management
- Hardware-backed key storage (HSMs)
- Audit trails in their own systems
- Key escrow and disaster recovery under their own policies
Decision
Introduce a TenantKmsProvider trait with five backend
implementations. Tenant KEK sourcing becomes pluggable per-tenant
via control-plane configuration. The system key manager (ADR-007)
remains unchanged — only the tenant KEK layer is externalized.
Provider Backends
| # | Backend | Type | Standard | Transport | Material model |
|---|---|---|---|---|---|
| 1 | Kiseki Internal | Built-in | — | In-process | Local |
| 2 | HashiCorp Vault | Open source | Proprietary (Transit) | HTTPS | Local (cached) |
| 3 | KMIP 2.1 | Standard | OASIS KMIP SP 800-57 | mTLS (TTLV) | Remote or local |
| 4 | AWS KMS | Cloud | AWS Sig V4 | HTTPS | Remote only |
| 5 | PKCS#11 v3.0 | HSM | OASIS PKCS#11 | Local (FFI) | Remote only (HSM) |
Material model: “Local” = KEK material cached in Kiseki process memory. “Remote” = material never leaves the provider; all wrap/unwrap operations are remote calls. The trait fully encapsulates this distinction — callers never branch on provider type.
Provider 1: Kiseki Internal (default)
The existing behavior. Kiseki manages tenant KEKs internally. Suitable for deployments where tenants trust the operator or where external KMS is unavailable.
- Tenant KEK generated internally on tenant creation
- Stored in a separate Raft group from system master keys (independent compromise domain — see Security Considerations §6)
- Rotation managed by Kiseki’s epoch mechanism
- No external dependency
This is the zero-configuration default. Existing tenants and single-operator deployments use this without change.
Security trade-off: Internal mode does not provide the full two-layer security guarantee of ADR-002. A compromise of both the system key manager and the tenant key store (even though they are separate Raft groups) yields full access. Compliance-sensitive tenants should use an external provider where the tenant KEK is under the tenant’s own operational control.
Provider 2: HashiCorp Vault (Transit secrets engine)
Vault’s Transit engine provides encryption-as-a-service with key versioning that maps cleanly to Kiseki’s epoch model.
Operations mapping:
| Kiseki operation | Vault API |
|---|---|
| `wrap` | `POST /transit/encrypt/:name` (with context = AAD) |
| `unwrap` | `POST /transit/decrypt/:name` (with context = AAD) |
| `rotate` | `POST /transit/keys/:name/rotate` |
| `rewrap` | `POST /transit/rewrap/:name` (server-side, no plaintext exposure) |
| `destroy` | `DELETE /transit/keys/:name` (after enabling deletion) |
Authentication methods (tenant-configurable):
- TLS certificate — maps to Kiseki’s SPIFFE/mTLS identity
- AppRole — role_id + secret_id for service authentication
- Kubernetes — ServiceAccount JWT (for k8s-deployed Kiseki)
- OIDC/JWT — external IdP token
Vault namespaces: Multi-tenant Vault deployments use namespaces to isolate tenant key material. The tenant’s Vault namespace is configured at onboarding.
Caching: Vault provider may optionally cache KEK material locally
(fetched via POST /transit/datakey/plaintext/:name). When caching is
disabled, all wrap/unwrap calls go through Vault directly. Caching mode
is configurable per tenant.
Rust crate: vaultrs (maintained, async, supports Transit engine).
Provider 3: KMIP 2.1 (OASIS standard)
KMIP is the interoperability standard for enterprise key management. A single KMIP client covers: Thales CipherTrust Manager, IBM Security Guardium Key Lifecycle Manager, Fortanix SDKMS, Entrust KeyControl, NetApp StorageGRID KMS, Dell PowerProtect, and any KMIP-compliant HSM.
Relevant OASIS specifications:
- KMIP Specification v2.1 (2019) — protocol and operations
- KMIP Profiles v2.1 — conformance levels
- KMIP Usage Guide v2.1 — implementation guidance
Operations mapping:
| Kiseki operation | KMIP operation |
|---|---|
| `wrap` | Encrypt with Correlation Value (AAD) |
| `unwrap` | Decrypt with Correlation Value (AAD) |
| `rotate` | ReKey or Create + Activate + Revoke old |
| `destroy` (crypto-shred) | Destroy (state → Destroyed, irrecoverable) |
Transport: TTLV (Tag-Type-Length-Value) binary encoding over mTLS. The KMIP spec mandates mutual TLS with X.509 certificates.
Key object attributes: KMIP keys carry rich metadata —
Cryptographic Algorithm, Cryptographic Length, State
(Pre-Active/Active/Deactivated/Compromised/Destroyed),
Activation Date, Deactivation Date. These map to Kiseki’s
EpochInfo (is_current, migration_complete).
Material model: Depends on KMIP server configuration. Some servers
allow Get to extract key material (local caching). Others enforce
non-extractable keys (remote-only wrap/unwrap). The provider detects
this via CKA_EXTRACTABLE equivalent attribute and adapts.
Rust implementation: No mature KMIP crate exists. Implement a minimal KMIP client covering the Symmetric Key Foundry Client profile (KMIP Profiles v2.1 §4.1). The wire format (TTLV) is straightforward — ~1500 lines for the operations Kiseki needs.
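To make the size estimate concrete: a TTLV item is a 3-byte tag, a 1-byte item type, a 4-byte big-endian length, and a value padded to an 8-byte boundary. A sketch of an encoder for the Integer item type (the tag value in the usage is illustrative, not a claim about any specific KMIP attribute):

```rust
// Minimal sketch of KMIP TTLV encoding — not a conformant client.
// Layout: tag (3 bytes) | type (1 byte) | length (4 bytes BE) | padded value.

fn ttlv_integer(tag: u32, value: i32) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&tag.to_be_bytes()[1..4]); // low 3 bytes of the tag
    out.push(0x02);                                  // item type: Integer
    out.extend_from_slice(&4u32.to_be_bytes());      // length of the raw value
    out.extend_from_slice(&value.to_be_bytes());     // 4-byte value
    out.extend_from_slice(&[0u8; 4]);                // pad value to 8-byte boundary
    out
}

fn main() {
    let item = ttlv_integer(0x42002A, 256); // example tag
    assert_eq!(item.len(), 16);             // 8-byte header + 8-byte padded value
    assert_eq!(&item[..4], &[0x42, 0x00, 0x2A, 0x02]);
    assert_eq!(&item[4..8], &[0, 0, 0, 4]);
}
```

Each of the operations Kiseki needs (Encrypt, Decrypt, ReKey, Destroy) is a small structure of such items over mTLS, which is why ~1500 lines is a plausible budget.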
Provider 4: AWS KMS (cloud KMS exemplar)
AWS KMS as the reference cloud implementation. Azure Key Vault and GCP Cloud KMS follow the same adapter pattern.
Operations mapping:
| Kiseki operation | AWS KMS API |
|---|---|
| `wrap` | Encrypt (with EncryptionContext = AAD) |
| `unwrap` | Decrypt (with EncryptionContext = AAD) |
| `rotate` | CreateKey + CreateAlias (manual) or EnableKeyRotation (automatic annual) |
| `rewrap` | ReEncrypt (server-side, no plaintext exposure) |
Key difference: With cloud KMS, the KEK material never leaves the cloud provider. Kiseki sends the derivation parameters (epoch + chunk_id) to KMS for wrapping/unwrapping. This is strictly more secure than local caching but adds network latency per operation.
Caching strategy: Kiseki caches the unwrapped derivation
parameters (not the KEK itself, which never leaves KMS). The
existing KeyCache TTL mechanism applies — after TTL expiry, a
new Decrypt call to KMS is required.
Auth: IAM role assumption via STS, instance metadata, or environment credentials. For Azure: AAD/Managed Identity. For GCP: service account key or Workload Identity.
Rust crates: aws-sdk-kms, azure_security_keyvault,
google-cloud-kms (all maintained, async).
Provider 5: PKCS#11 v3.0 (HSM direct)
For tenants with on-premises HSMs (Thales Luna, Utimaco, nCipher, YubiHSM). PKCS#11 is the standard C API for cryptographic tokens.
Relevant standards:
- OASIS PKCS#11 v3.0 (2020) — Cryptographic Token Interface
- PKCS#11 Profiles v3.0 — baseline/extended profiles
Operations mapping:
| Kiseki operation | PKCS#11 function |
|---|---|
| `wrap` | `C_WrapKey` (AES-KWP per RFC 5649, with `pParameter` = AAD) |
| `unwrap` | `C_UnwrapKey` |
| `rotate` | `C_GenerateKey` + `C_DestroyObject` (old, after migration) |
| `destroy` | `C_DestroyObject` |
Material model: Remote only. HSM keys are CKA_SENSITIVE and
CKA_EXTRACTABLE=FALSE by default — material never leaves the HSM.
All wrap/unwrap operations execute on the HSM hardware. Kiseki caches
unwrapped derivation parameters (same as cloud KMS model).
Transport: Local — PKCS#11 is a C shared library (.so/.dylib)
loaded via FFI. The HSM may be network-attached (e.g., Luna Network
HSM), but the PKCS#11 interface is local to the host.
Rust crate: cryptoki (maintained, wraps PKCS#11 C API).
Trait Interface
```rust
/// Provider for tenant key encryption keys (KEKs).
///
/// Each tenant configures exactly one provider. The provider handles
/// authentication, key lifecycle, and wrapping/unwrapping operations.
/// The trait fully encapsulates the provider's material model — callers
/// never need to know whether wrapping happens locally or remotely.
///
/// Providers that cache KEK material locally (Internal, Vault) manage
/// their own cache internally. Providers where material never leaves
/// the backend (AWS KMS, PKCS#11) perform remote wrap/unwrap calls.
/// The caller's code path is identical in both cases.
#[async_trait]
pub trait TenantKmsProvider: Send + Sync {
    /// Wrap DEK derivation parameters (epoch + chunk_id) with the
    /// tenant KEK. The `aad` binds the wrapped ciphertext to its
    /// envelope context (typically chunk_id), preventing splice attacks.
    /// Returns opaque ciphertext stored in the envelope.
    async fn wrap(
        &self,
        tenant: &OrgId,
        plaintext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Unwrap DEK derivation parameters from envelope ciphertext.
    /// The `aad` must match the value used during wrapping.
    async fn unwrap(
        &self,
        tenant: &OrgId,
        ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Zeroizing<Vec<u8>>, KmsProviderError>;

    /// Rotate the tenant KEK to a new version/epoch.
    /// Returns the new provider-specific epoch identifier.
    async fn rotate(
        &self,
        tenant: &OrgId,
    ) -> Result<KmsEpochId, KmsProviderError>;

    /// Re-wrap ciphertext from old key version to current version
    /// without exposing plaintext (server-side re-wrap where supported).
    /// Falls back to unwrap + wrap if the provider doesn't support
    /// server-side re-wrap. The `aad` is preserved across the re-wrap.
    async fn rewrap(
        &self,
        tenant: &OrgId,
        old_ciphertext: &[u8],
        aad: &[u8],
    ) -> Result<Vec<u8>, KmsProviderError>;

    /// Destroy the tenant KEK (crypto-shred). Irrecoverable.
    /// Also purges any locally cached material for this tenant.
    async fn destroy(
        &self,
        tenant: &OrgId,
    ) -> Result<(), KmsProviderError>;

    /// Check provider health and connectivity.
    async fn health(&self) -> KmsHealthStatus;

    /// Provider name for logging and diagnostics (never includes
    /// credentials or key material).
    fn provider_name(&self) -> &'static str;
}
```
AAD usage: Callers pass `chunk_id.as_bytes()` as `aad` for per-chunk envelope wrapping. Each provider maps `aad` to its native authenticated-context mechanism:
| Provider | AAD mechanism |
|---|---|
| Internal | AES-256-GCM additional data (existing "kiseki-tenant-wrap-v1" prefix + aad) |
| Vault | Transit context parameter (base64-encoded) |
| KMIP | Correlation Value attribute on Encrypt/Decrypt |
| AWS KMS | EncryptionContext key-value map ({"chunk_id": "<hex>"}) |
| PKCS#11 | pParameter field in mechanism struct |
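The splice-prevention role of `aad` can be seen in a toy model: a blob wrapped under one chunk_id must fail to unwrap under another. The "MAC" below is std's `DefaultHasher`, which is NOT cryptography; it only illustrates the binding that AES-GCM additional data, EncryptionContext, or Correlation Value provides for real:

```rust
// Toy demonstration of AAD binding — illustrative only, not crypto.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn tag(key: &[u8], aad: &[u8], pt: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    (key, aad, pt).hash(&mut h);
    h.finish()
}

// Toy wrap: "ciphertext" is plaintext plus an aad-bound tag.
fn wrap(key: &[u8], pt: &[u8], aad: &[u8]) -> Vec<u8> {
    let mut ct = pt.to_vec();
    ct.extend_from_slice(&tag(key, aad, pt).to_be_bytes());
    ct
}

// Unwrap succeeds only if the same aad is presented.
fn unwrap(key: &[u8], ct: &[u8], aad: &[u8]) -> Option<Vec<u8>> {
    let (pt, t) = ct.split_at(ct.len() - 8);
    (tag(key, aad, pt).to_be_bytes().as_slice() == t).then(|| pt.to_vec())
}

fn main() {
    let kek = b"tenant-kek";
    let params = b"epoch=7";
    let wrapped = wrap(kek, params, b"chunk-001");
    // Same AAD: unwrap succeeds.
    assert_eq!(unwrap(kek, &wrapped, b"chunk-001").as_deref(), Some(&params[..]));
    // Spliced onto another chunk's envelope: unwrap fails.
    assert!(unwrap(kek, &wrapped, b"chunk-999").is_none());
}
```

This is why the trait requires the same `aad` on `unwrap` and preserves it across `rewrap`: the wrapped blob is useless outside its original envelope.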
Tenant Configuration
Stored in the control plane (kiseki-control) per-tenant:
```rust
pub struct TenantKmsConfig {
    /// Provider type.
    pub provider: KmsProviderType,
    /// Provider-specific endpoint (URL, socket path, or "internal").
    pub endpoint: String,
    /// Authentication configuration. All secret fields use Zeroizing
    /// wrappers and implement Debug redaction (I-K8 extended).
    pub auth: KmsAuthConfig,
    /// Key identifier within the provider.
    pub key_name: String,
    /// Provider namespace (Vault namespace, KMIP group, KMS alias prefix).
    pub namespace: Option<String>,
    /// Cache TTL override (bounded by I-K15: 5s-300s).
    pub cache_ttl_secs: Option<u64>,
}

pub enum KmsProviderType {
    Internal,
    Vault,
    Kmip,
    AwsKms,
    AzureKeyVault,
    GcpCloudKms,
    Pkcs11,
}

/// Authentication configuration for external KMS providers.
///
/// All secret fields use `Zeroizing<String>` for automatic memory
/// clearing on drop. The `Debug` impl prints variant names only —
/// never credential contents (I-K8 extended to provider credentials).
pub enum KmsAuthConfig {
    /// Internal provider — no external auth needed.
    None,
    /// mTLS client certificate (KMIP, Vault TLS auth).
    TlsCert {
        cert_pem: String,
        key_pem: Zeroizing<String>,
    },
    /// Vault AppRole.
    AppRole {
        role_id: String,
        secret_id: Zeroizing<String>,
    },
    /// OIDC/JWT token (Vault, cloud providers).
    Oidc {
        token_endpoint: String,
        client_id: String,
    },
    /// AWS IAM role assumption.
    AwsIamRole {
        role_arn: String,
        region: String,
    },
    /// Azure Managed Identity or Service Principal.
    AzureIdentity {
        tenant_id: String,
        client_id: String,
    },
    /// GCP Service Account.
    GcpServiceAccount {
        credentials_json: Zeroizing<String>,
    },
    /// PKCS#11 library path + slot/pin.
    Pkcs11 {
        library_path: String,
        slot_id: u64,
        pin: Zeroizing<String>,
    },
}
```
I-K8 extended: KmsAuthConfig implements Debug with redaction:
```rust
impl fmt::Debug for KmsAuthConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::None => write!(f, "KmsAuthConfig::None"),
            Self::TlsCert { .. } => write!(f, "KmsAuthConfig::TlsCert(***)"),
            Self::AppRole { role_id, .. } => write!(f, "KmsAuthConfig::AppRole({})", role_id),
            // ... all variants redact secret fields
        }
    }
}
```
Caching and Fallback
The existing KeyCache (cache.rs) is reused for providers with local
material. Remote-only providers (AWS KMS, PKCS#11) cache unwrapped
derivation parameters instead.
| Provider | What is cached | Cache miss action |
|---|---|---|
| Internal | KEK material (32 bytes) | Fetch from tenant key Raft store |
| Vault | KEK material or nothing (configurable) | POST /transit/decrypt |
| KMIP | KEK material or nothing (depends on server) | Encrypt/Decrypt operation |
| AWS KMS | Unwrapped derivation params | Decrypt API call |
| PKCS#11 | Unwrapped derivation params | C_UnwrapKey |
I-K15 applies: Cache TTL bounded to [5s, 300s] regardless of provider. This ensures crypto-shred takes effect within the TTL window even if the external KMS is ahead of Kiseki’s cache.
Provider unavailability:
- Within TTL window: cached material serves reads (degraded mode)
- Beyond TTL: reads fail with `TenantKekUnavailable` (retriable)
- Writes always require fresh validation (no stale-cache writes)
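The I-K15 bound can be sketched as a clamp applied to the tenant's configured `cache_ttl_secs`; the 60-second default used when no override is set is an assumption for illustration:

```rust
// Sketch of the I-K15 clamp: whatever the tenant configures, the
// effective cache TTL lands in [5s, 300s], so crypto-shred takes
// effect within a bounded window. Default value is assumed.
use std::time::Duration;

const TTL_MIN: Duration = Duration::from_secs(5);
const TTL_MAX: Duration = Duration::from_secs(300);

fn effective_ttl(configured_secs: Option<u64>) -> Duration {
    match configured_secs {
        None => Duration::from_secs(60), // assumed default
        Some(s) => Duration::from_secs(s).clamp(TTL_MIN, TTL_MAX),
    }
}

fn main() {
    assert_eq!(effective_ttl(Some(1)), Duration::from_secs(5));      // floored
    assert_eq!(effective_ttl(Some(3600)), Duration::from_secs(300)); // capped
    assert_eq!(effective_ttl(Some(120)), Duration::from_secs(120));  // in range
}
```

Clamping at configuration time rather than at cache-insert time means a misconfigured tenant can never widen the shred window, no matter which provider backs the cache.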
Resilience (adversarial finding #5):
- Circuit breaker per provider endpoint: open after 5 consecutive failures/timeouts, half-open probe every 30s
- Jittered cache TTL: actual TTL = configured TTL ± 10% (random) to prevent synchronized expiry across storage nodes
- Concurrency limit: max 10 concurrent KMS requests per tenant per storage node (backpressure, not queuing)
- Timeout bounds: 2s connect timeout, 5s operation timeout for all network-based providers
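The per-endpoint breaker above can be sketched with std alone; the thresholds mirror the stated numbers (5 consecutive failures, 30s probe), everything else is illustrative:

```rust
// Sketch of the per-provider-endpoint circuit breaker.
use std::time::{Duration, Instant};

const FAILURE_THRESHOLD: u32 = 5;
const PROBE_INTERVAL: Duration = Duration::from_secs(30);

struct Breaker {
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new() -> Self { Breaker { consecutive_failures: 0, opened_at: None } }

    /// May a request be issued right now?
    fn allow(&self, now: Instant) -> bool {
        match self.opened_at {
            None => true,                                       // closed
            Some(t) => now.duration_since(t) >= PROBE_INTERVAL, // half-open probe
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None; // close the breaker
    }

    fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= FAILURE_THRESHOLD {
            self.opened_at = Some(now); // open: stop hammering the KMS
        }
    }
}

fn main() {
    let mut b = Breaker::new();
    let t0 = Instant::now();
    for _ in 0..5 { b.record_failure(t0); }
    assert!(!b.allow(t0));                 // open: requests rejected
    assert!(b.allow(t0 + PROBE_INTERVAL)); // half-open: probe allowed
    b.record_success();
    assert!(b.allow(t0 + PROBE_INTERVAL)); // closed again
}
```

While the breaker is open, cached material keeps serving reads inside the TTL window, so a flapping KMS degrades gracefully instead of stalling every request.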
I-K11 unchanged: Kiseki provides no escrow. If the tenant loses access to their external KMS and has no backup, their data is unrecoverable. This is documented and accepted.
Provider Migration
Changing a tenant’s KMS provider (e.g., Internal → Vault) requires re-wrapping all existing envelopes (adversarial finding #3):
- Provision a new KEK in the target provider
- Configure the new provider as “pending” in the control plane
- Background re-wrap: for each envelope, `old_provider.unwrap()` → `new_provider.wrap()` with the same AAD
- Track progress (same mechanism as epoch re-wrap: `RewrapProgress`)
- Once 100% re-wrapped, atomically switch the active provider
- Decommission the old provider KEK
During migration, reads use whichever provider matches the envelope’s
tenant_epoch. The envelope carries a provider-version tag to
disambiguate.
Constraint: Provider migration is an operator-initiated, audited action. It cannot be triggered by the tenant API alone.
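The background re-wrap step can be sketched as a loop over envelopes. Providers are modeled as plain functions here rather than the async TenantKmsProvider trait, and the XOR "providers" in the usage are toys, not cryptography:

```rust
// Sketch of the background re-wrap loop: unwrap with the old provider,
// wrap with the new one, preserving each envelope's AAD, with progress
// tracked so the active-provider switch happens only at 100%.

struct RewrapProgress { done: usize, total: usize }

fn rewrap_all(
    envelopes: &mut [(Vec<u8>, Vec<u8>)],          // (ciphertext, aad) pairs
    old_unwrap: impl Fn(&[u8], &[u8]) -> Vec<u8>,
    new_wrap: impl Fn(&[u8], &[u8]) -> Vec<u8>,
) -> RewrapProgress {
    let total = envelopes.len();
    let mut done = 0;
    for (ct, aad) in envelopes.iter_mut() {
        let pt = old_unwrap(ct, aad);
        *ct = new_wrap(&pt, aad); // same AAD preserved across the re-wrap
        done += 1;
    }
    RewrapProgress { done, total }
}

fn main() {
    // Toy "providers": old XORs with 0x11, new with 0x22 (not crypto).
    let old = |ct: &[u8], _aad: &[u8]| ct.iter().map(|b| b ^ 0x11).collect::<Vec<u8>>();
    let new = |pt: &[u8], _aad: &[u8]| pt.iter().map(|b| b ^ 0x22).collect::<Vec<u8>>();

    let mut envs = vec![(vec![0x10 ^ 0x11], b"chunk-1".to_vec())];
    let p = rewrap_all(&mut envs, old, new);
    assert_eq!((p.done, p.total), (1, 1));
    assert_eq!(envs[0].0, vec![0x10 ^ 0x22]); // now wrapped by the new provider
}
```

In the real system this loop is resumable and audited, and reads during the migration dispatch on the envelope's provider-version tag rather than assuming the loop has finished.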
Crypto-Shred Interaction
Crypto-shred (tenant KEK destruction) behavior per provider:
| Provider | Crypto-shred mechanism |
|---|---|
| Internal | Delete KEK from tenant key store; purge cache |
| Vault | POST /transit/keys/:name/config with deletion_allowed=true, then DELETE /transit/keys/:name |
| KMIP | Destroy operation (state → Destroyed, irrecoverable) |
| AWS KMS | DisableKey (immediate, blocks all operations) + ScheduleKeyDeletion (permanent, 7-30 day window) |
| PKCS#11 | C_DestroyObject |
AWS KMS: DisableKey is called immediately on crypto-shred to
block all wrap/unwrap operations. ScheduleKeyDeletion follows for
permanent destruction. The 7-day AWS-enforced waiting period applies
to permanent deletion only — the key is operationally dead from the
moment DisableKey is called. The health() check reports
supports_immediate_shred: true (via DisableKey) so tenants can
verify crypto-shred SLA compliance at configuration time.
Security Considerations
-
Credential protection: KMS auth credentials stored in the control plane are encrypted at rest with the system master key. All secret fields use `Zeroizing<String>` for memory protection. `Debug` implementations redact all credential content (I-K8 extended). Credentials are excluded from core dumps via `MADV_DONTDUMP` on the containing allocation.
Network isolation: External KMS calls are made from storage nodes, not the control plane. This avoids routing tenant data through the control plane. mTLS is required for all network-based providers.
-
Provider compromise: If a tenant’s external KMS is compromised, only that tenant’s data is at risk. System master keys and other tenants are unaffected (tenant isolation, I-T3).
-
Mixed providers: Different tenants can use different providers. A single Kiseki cluster can serve tenants using Vault, AWS KMS, and internal management simultaneously.
-
FIPS compliance: The HKDF derivation and AES-256-GCM encryption remain on Kiseki’s FIPS-validated aws-lc-rs module regardless of provider. The external KMS only handles the tenant KEK wrapping layer — the system encryption layer is always FIPS.
-
Internal provider isolation: Tenant KEKs in Internal mode are stored in a separate Raft group from system master keys. This provides an independent compromise domain — system key manager compromise alone does not yield tenant KEKs, and vice versa. However, an operator with access to both stores has full access. Compliance-sensitive tenants should use an external provider where the KEK is under their own operational control.
Implementation Phases
- Phase K1: `TenantKmsProvider` trait + Internal backend (refactor current code to use the trait; no behavioral change)
- Phase K2: Vault backend (Transit engine, `vaultrs` crate)
- Phase K3: KMIP 2.1 backend (custom TTLV client, ~1500 lines)
- Phase K4: AWS KMS backend (`aws-sdk-kms` crate)
- Phase K5: PKCS#11 backend (`cryptoki` crate)
Phases K2-K5 are independent and can be built in any order.
Alternatives Considered
- BYOK (Bring Your Own Key) upload model: Tenant uploads raw key material to Kiseki. Rejected — defeats the purpose of external KMS (key material leaves the tenant’s control boundary).
- Single cloud KMS only: Support only AWS KMS. Rejected — HPC customers are frequently on-premises or multi-cloud.
- KMIP only: Use KMIP as the sole external standard. Rejected — Vault and cloud KMS are too prevalent to ignore, and KMIP client implementation cost is non-trivial.
- No internal provider: Require all tenants to configure external KMS. Rejected — creates unnecessary deployment friction for simple or single-operator clusters.
- `fetch_kek` in trait interface: The original design included `fetch_kek() -> Option<TenantKekMaterial>`, with `None` for cloud providers. Rejected after adversarial review — a leaky abstraction that forces callers to branch on the provider model. `wrap`/`unwrap` as the universal interface fully encapsulates the distinction.
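To make the rejected alternative concrete, here is a minimal sketch of a wrap/unwrap-only provider interface with a toy in-process implementation. Names and error types are illustrative (the real trait is async and provider configs are richer); the point is that callers never branch on the provider’s key model.

```rust
/// Illustrative sketch of the wrap/unwrap-only provider interface.
/// Hypothetical names; the real trait is async and returns richer errors.
pub trait TenantKmsProvider {
    /// Wrap key material under the provider-held key.
    /// `aad` binds the ciphertext to its context (finding #4).
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Result<Vec<u8>, String>;
    /// Unwrap previously wrapped material. Must be given the same AAD.
    fn unwrap(&self, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, String>;
}

/// Toy in-process provider used only to show that wrap/unwrap fully
/// encapsulates the key model — callers never see raw provider keys.
pub struct XorProvider {
    pub key: u8,
}

impl TenantKmsProvider for XorProvider {
    fn wrap(&self, plaintext: &[u8], aad: &[u8]) -> Result<Vec<u8>, String> {
        let mut out: Vec<u8> = plaintext.iter().map(|b| b ^ self.key).collect();
        out.extend_from_slice(aad); // naive AAD binding, sketch only
        Ok(out)
    }

    fn unwrap(&self, wrapped: &[u8], aad: &[u8]) -> Result<Vec<u8>, String> {
        let split = wrapped.len().checked_sub(aad.len()).ok_or("too short")?;
        if &wrapped[split..] != aad {
            return Err("AAD mismatch".into());
        }
        Ok(wrapped[..split].iter().map(|b| b ^ self.key).collect())
    }
}
```

Whether the backend holds the KEK locally (Internal) or never releases it (cloud KMS, HSM) is invisible behind this pair of calls.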
Adversarial Review Findings (2026-04-22)
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | High | Credential fields as plaintext String | Zeroizing<String> + Debug redaction |
| 2 | High | fetch_kek leaky abstraction | Removed; wrap/unwrap are universal |
| 3 | Medium | No provider migration path | Migration protocol documented |
| 4 | Medium | No AAD in wrap/unwrap | aad: &[u8] parameter added |
| 5 | Medium | No rate limiting/circuit breaker | Circuit breaker + jitter + limits specified |
| 6 | Medium | PKCS#11 C_GetAttributeValue violates HSM model | Removed; HSM uses C_WrapKey/C_UnwrapKey only |
| 7 | Medium | Internal KEK co-located with system keys | Separate Raft group for tenant KEKs |
| 8 | Low | AWS KMS 7-day deletion window | DisableKey immediate + ScheduleKeyDeletion deferred |
Consequences
- Adds `kiseki-kms` crate (or module within `kiseki-keymanager`)
- Tenant key Raft group added (separate from system key manager)
- Control plane gains KMS configuration endpoints
- Each storage node needs network access to tenant KMS endpoints
- KMIP requires a custom protocol implementation (~1500 lines)
- PKCS#11 requires unsafe FFI (contained within the cryptoki crate)
- Testing requires mock KMS servers (Vault dev mode, LocalStack, SoftHSM for PKCS#11)
ADR-029: Raw Block Device Allocator
Status: Accepted
Date: 2026-04-22
Adversarial review: 2026-04-22 (8 findings: 2H 4M 2L, all resolved)
Context: ADR-022, ADR-024, ADR-005, I-C1 through I-C6
Problem
Chunk ciphertext needs to persist on JBOD data devices. ADR-024 specifies XFS on each device as the default, but filesystem overhead becomes the bottleneck at HPC scale:
- Double journaling: XFS journals its metadata, then redb journals ours — redundant durability cost
- Page cache pollution: OS caches data we already manage in our own cache layer, wasting DRAM
- Inode contention: Billions of chunks = billions of inodes; XFS metadata operations become the throughput ceiling
- Indirection: Every I/O traverses VFS → XFS → block layer → device; raw access removes two layers
Ceph’s migration from FileStore (XFS) to BlueStore (raw block) was driven by exactly these issues. DAOS uses SPDK for the same reason.
Decision
New crate: kiseki-block
A device I/O crate that manages raw block devices (and file-backed
fallback for VMs/CI). Separate from kiseki-chunk (domain logic).
kiseki-chunk depends on kiseki-block for storage.
Device Backend Trait
```rust
/// Abstraction over a storage device — raw block or file-backed.
/// Auto-detects device characteristics and adapts I/O strategy.
#[async_trait]
pub trait DeviceBackend: Send + Sync {
    /// Allocate a contiguous extent of at least `size` bytes.
    /// Alignment matches the device's physical block size.
    fn alloc(&self, size: u64) -> Result<Extent, AllocError>;

    /// Write data at the given extent.
    fn write(&self, extent: &Extent, data: &[u8]) -> Result<(), BlockError>;

    /// Read data from the given extent.
    fn read(&self, extent: &Extent) -> Result<Vec<u8>, BlockError>;

    /// Free an extent, returning blocks to the free pool.
    fn free(&self, extent: &Extent) -> Result<(), AllocError>;

    /// Sync all pending writes to stable storage.
    fn sync(&self) -> Result<(), BlockError>;

    /// Device capacity: (used_bytes, total_bytes).
    fn capacity(&self) -> (u64, u64);

    /// Probed device characteristics (read-only after open).
    fn characteristics(&self) -> &DeviceCharacteristics;
}
```
Auto-detection (no manual configuration)
On `DeviceManager::open(path)`, probe sysfs (Linux):

```
/sys/block/<dev>/queue/rotational          → 0 (SSD/NVMe) or 1 (HDD)
/sys/block/<dev>/queue/physical_block_size → 512 or 4096
/sys/block/<dev>/queue/optimal_io_size     → device-preferred I/O size
/sys/block/<dev>/queue/max_hw_sectors_kb   → max single I/O size
/sys/block/<dev>/device/model              → model string
/sys/block/<dev>/device/numa_node          → NUMA node (-1 if none)
/sys/block/<dev>/queue/discard_max_bytes   → TRIM support (>0 = yes)
```
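The probe logic reduces to parsing these sysfs attribute files. A sketch of the parsing helpers, written against raw file contents so they are testable without real hardware (function names are hypothetical):

```rust
/// Hypothetical parsing helpers for the sysfs attributes listed above.
/// The real probe reads these files under /sys/block/<dev>/; here the
/// raw file contents are passed in so the logic runs anywhere.

/// queue/rotational: "1\n" = spinning disk, "0\n" = SSD/NVMe.
pub fn parse_rotational(raw: &str) -> Option<bool> {
    match raw.trim() {
        "0" => Some(false),
        "1" => Some(true),
        _ => None,
    }
}

/// device/numa_node: "-1\n" means no NUMA affinity reported.
pub fn parse_numa_node(raw: &str) -> Option<u32> {
    let n: i64 = raw.trim().parse().ok()?;
    if n < 0 { None } else { Some(n as u32) }
}

/// queue/discard_max_bytes: any value > 0 means the device accepts TRIM.
pub fn parse_trim_support(raw: &str) -> bool {
    raw.trim().parse::<u64>().map(|v| v > 0).unwrap_or(false)
}
```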
Derived properties:
```rust
pub struct DeviceCharacteristics {
    pub medium: DetectedMedium,
    pub physical_block_size: u32,
    pub optimal_io_size: u32,
    pub rotational: bool,
    pub numa_node: Option<u32>,
    pub supports_trim: bool,
    pub supports_smart: bool,
    pub io_strategy: IoStrategy,
}

pub enum DetectedMedium {
    NvmeSsd, // /sys/block/nvme*/ + rotational=0
    SataSsd, // rotational=0, not NVMe
    Hdd,     // rotational=1
    Virtual, // virtio in model, no SMART
    Unknown,
}

pub enum IoStrategy {
    DirectAligned,      // O_DIRECT | O_DSYNC — NVMe, SATA SSD
    BufferedSequential, // O_SYNC — HDD (readahead benefits)
    FileBacked,         // Default flags — VM, dev, CI
}
```
For non-Linux hosts and VMs without sysfs: detect `virtio` in the model string, or the absence of block device properties, and fall back to `IoStrategy::FileBacked` with a sparse file. All three strategies implement the same `DeviceBackend` trait transparently.
On-disk format
Per data device:
Offset 0: [Superblock — 4K]
Offset 4K: [Primary Bitmap — variable size]
Offset M: [Mirror Bitmap — same size as primary]
Offset N: [Data Region — remainder of device]
Superblock (4K, first block):
```rust
pub struct Superblock {
    pub magic: [u8; 8],            // b"KISEKI\x01\x00"
    pub version: u32,              // Format version (1)
    pub device_id: [u8; 16],       // UUID
    pub block_size: u32,           // Physical block size (probed)
    pub total_blocks: u64,         // Device capacity in blocks
    pub bitmap_offset: u64,        // Byte offset of primary bitmap
    pub bitmap_mirror_offset: u64, // Byte offset of mirror bitmap
    pub bitmap_blocks: u64,        // Size of each bitmap in blocks
    pub data_offset: u64,          // Byte offset of data region
    pub generation: u64,           // Monotonic, incremented on bitmap flush
    pub checksum: [u8; 32],        // SHA-256 of superblock fields
}
```
Allocation bitmap (primary + mirror): 1 bit per block in the data region. Stored twice at different offsets for redundancy.
- At 4K blocks: 4TB device = 1 billion blocks = 128MB × 2 = 256MB
- At 512B blocks: 4TB device = 8 billion blocks = 1GB × 2 = 2GB
- Bitmap overhead: 0.006% (4K) to 0.048% (512B)
- On read: verify primary against mirror. On mismatch, use the copy consistent with the redb journal.
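The bitmap sizing above follows from one bit per data block, stored twice; a quick check of the arithmetic:

```rust
/// One allocation bit per data block, rounded up to a whole byte.
pub fn bitmap_bytes(device_bytes: u64, block_size: u64) -> u64 {
    let blocks = device_bytes / block_size;
    (blocks + 7) / 8
}

/// Overhead as a percentage of device capacity; both copies
/// (primary + mirror) count against the device.
pub fn bitmap_overhead_pct(device_bytes: u64, block_size: u64) -> f64 {
    (2 * bitmap_bytes(device_bytes, block_size)) as f64 / device_bytes as f64 * 100.0
}
```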
Per-extent CRC32: Every data extent has a 4-byte CRC32 trailer written after the payload data (within the same aligned block).
- On read: verify CRC32 before returning data.
- CRC mismatch → hardware corruption → trigger EC repair from parity fragments (not a security incident).
- AES-GCM auth_tag failure after CRC pass → actual tampering (security incident, alert + audit).
- This distinguishes hardware failure from cryptographic attack, enabling correct operational response.
Allocation algorithm
Extent-based best-fit with free-list cache (Ceph BlueStore pattern, simpler than DAOS VEA):
- In-memory: B-tree of free extents `(offset, block_count)`, sorted by offset. On alloc, scan for the smallest extent >= requested blocks. On free, insert and coalesce with neighbors.
- Concurrency: `alloc()` and `free()` are serialized per device via a Mutex on the allocator state. This is acceptable — allocation is a B-tree lookup (microseconds); I/O is the bottleneck, not allocation. Ceph BlueStore also serializes allocation per OSD.
- On-disk: The bitmap is ground truth. The free-list is rebuilt from the bitmap on startup (~100 ms for 4 TB at 4K blocks).
- Crash safety: Bitmap updates are journaled in redb (`device_alloc` table) before being applied to the device bitmap region. On crash recovery: reload the bitmap from the device, replay pending journal entries from redb, rebuild the free-list.
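A minimal sketch of the in-memory free-list (best-fit plus neighbor coalescing); the real allocator additionally journals every change to redb and mirrors it into the on-disk bitmap:

```rust
use std::collections::BTreeMap;

/// In-memory free-list: offset → block_count, kept sorted by offset
/// so neighbors can be found and coalesced cheaply.
pub struct FreeList {
    free: BTreeMap<u64, u64>,
}

impl FreeList {
    pub fn new(total_blocks: u64) -> Self {
        let mut free = BTreeMap::new();
        free.insert(0, total_blocks); // one extent covering everything
        FreeList { free }
    }

    /// Best-fit: the smallest free extent with at least `want` blocks.
    pub fn alloc(&mut self, want: u64) -> Option<(u64, u64)> {
        let mut best: Option<(u64, u64)> = None;
        for (&off, &len) in &self.free {
            if len >= want && best.map_or(true, |(_, blen)| len < blen) {
                best = Some((off, len));
            }
        }
        let (off, len) = best?;
        self.free.remove(&off);
        if len > want {
            // Split: the tail stays on the free-list.
            self.free.insert(off + want, len - want);
        }
        Some((off, want))
    }

    /// Return an extent and coalesce with adjacent free extents.
    pub fn free(&mut self, mut off: u64, mut len: u64) {
        // Merge with the predecessor if it ends exactly at `off`.
        let pred = self.free.range(..off).next_back().map(|(&o, &l)| (o, l));
        if let Some((poff, plen)) = pred {
            if poff + plen == off {
                self.free.remove(&poff);
                off = poff;
                len += plen;
            }
        }
        // Merge with the successor if it starts at `off + len`.
        if let Some(slen) = self.free.get(&(off + len)).copied() {
            self.free.remove(&(off + len));
            len += slen;
        }
        self.free.insert(off, len);
    }

    pub fn free_extent_count(&self) -> usize {
        self.free.len()
    }
}
```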
Allocation flow (WAL-ordered for crash safety):
1. Round up the requested size to a `block_size` boundary
2. Search the free-list for a best-fit extent
3. Split the extent if larger than needed
4. Journal the intent in redb (`device_alloc` table: alloc intent)
5. Mark bits in the bitmap (pwrite to the bitmap region)
6. Return `Extent { offset, length }`
7. Caller writes data to the extent, then commits `chunk_meta` to redb
8. Clear the intent from the `device_alloc` journal (write complete)
On crash recovery: scan device_alloc for pending intents. If
the corresponding chunk_meta exists → write completed, clear
intent. If no chunk_meta → write was interrupted, free the
extent (clear bitmap bits, remove intent). This is the standard
WAL pattern — Ceph BlueStore uses the same approach.
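The recovery rule is a pure function of whether the `chunk_meta` commit made it to redb; a sketch with simplified stand-ins for the redb tables:

```rust
use std::collections::HashSet;

/// Outcome of resolving one pending alloc intent on crash recovery.
#[derive(Debug, PartialEq)]
pub enum Recovery {
    ClearIntent,          // chunk_meta exists → write completed
    FreeExtent(u64, u64), // no chunk_meta → roll back (offset, length)
}

/// `chunk_meta` is a stand-in for the committed chunk-id set in redb.
pub fn resolve_intent(
    chunk_meta: &HashSet<[u8; 32]>,
    chunk_id: [u8; 32],
    extent: (u64, u64),
) -> Recovery {
    if chunk_meta.contains(&chunk_id) {
        Recovery::ClearIntent
    } else {
        Recovery::FreeExtent(extent.0, extent.1)
    }
}
```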
Free flow:
1. Journal the deallocation intent in redb
2. Clear bits in the bitmap
3. Insert the freed extent into the free-list, coalesce neighbors
4. If `supports_trim`: add to the TRIM batch queue (see below)
5. Clear the dealloc intent from the journal
TRIM batching: Freed extents accumulate in a TRIM queue per
device. A batched BLKDISCARD ioctl is issued periodically
(every 60 seconds or when queue exceeds 1GB). This avoids
write amplification from many small TRIM commands.
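The flush trigger is a simple either/or condition on the queue (constants match the defaults stated above):

```rust
/// Flush the TRIM queue when it exceeds 1 GiB or when 60 s have
/// elapsed since the last flush, whichever comes first.
pub fn should_flush_trim(queued_bytes: u64, secs_since_flush: u64) -> bool {
    const MAX_BYTES: u64 = 1 << 30; // 1 GiB
    const MAX_SECS: u64 = 60;
    queued_bytes > MAX_BYTES || secs_since_flush >= MAX_SECS
}
```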
Maximum extent size: 16 MB. Allocations larger than 16 MB are split into multiple extents. `chunk_meta` already supports multiple extents per chunk via `Vec<FragmentLocation>`.
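Splitting an oversized allocation into at-most-16 MiB pieces is straightforward; a sketch:

```rust
/// Split a requested size into extent lengths, each at most 16 MiB.
pub fn split_extents(size: u64) -> Vec<u64> {
    const MAX_EXTENT: u64 = 16 << 20; // 16 MiB
    let mut out = Vec::new();
    let mut remaining = size;
    while remaining > 0 {
        let piece = remaining.min(MAX_EXTENT);
        out.push(piece);
        remaining -= piece;
    }
    out
}
```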
I/O strategy per device type
| Strategy | Open flags | Alignment | Sync | Use case |
|---|---|---|---|---|
| DirectAligned | `O_DIRECT \| O_DSYNC` | physical_block_size | Implicit (O_DSYNC) | NVMe, SATA SSD |
| BufferedSequential | `O_SYNC` | 512 B | `fdatasync()` | HDD |
| FileBacked | default flags | 4K (simulated) | `fsync()` | VM, dev, CI |
FileBacked alignment: FileBackedDevice enforces the same 4K
alignment as RawBlockDevice to ensure tests faithfully reproduce
raw block behavior. Code that passes CI will not fail on real
hardware due to alignment issues.
- Write buffers are aligned via `std::alloc::Layout::from_size_align` for O_DIRECT compatibility
- NUMA-aware: pin the allocator thread to `numa_node` if detected
- TRIM/UNMAP on free if `supports_trim` (SSD wear management)
- `optimal_io_size` is used for write batching (coalesce small writes up to this size before issuing I/O)
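A sketch of the aligned-buffer allocation via `Layout::from_size_align` (the helper name is hypothetical); both the address and the padded length honor the device alignment, as O_DIRECT requires:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Run `f` with a zeroed buffer whose address and length are both
/// aligned to `align` (must be a power of two, e.g. the device's
/// physical block size). Helper name is illustrative.
pub fn with_aligned_buffer<R>(len: usize, align: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    // O_DIRECT requires address, length, and file offset all aligned,
    // so round the length up too.
    let padded = (len + align - 1) / align * align;
    let layout = Layout::from_size_align(padded, align).expect("invalid alignment");
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null(), "allocation failed");
        debug_assert_eq!(ptr as usize % align, 0); // address is aligned
        let buf = std::slice::from_raw_parts_mut(ptr, padded);
        let result = f(buf);
        dealloc(ptr, layout);
        result
    }
}
```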
Metadata in redb (system partition)
ADR-022’s redb on the RAID-1 system partition stores chunk metadata:
```
Table: chunk_meta
Key:   [u8; 32] (chunk_id)
Value: bincode-serialized ChunkMeta {
    refcount: u64,
    retention_holds: Vec<String>,
    pool_name: String,
    stored_bytes: u64,
    fragments: Vec<FragmentLocation {
        device_id: [u8; 16],
        offset: u64,
        length: u64,
    }>,
    envelope_meta: EnvelopeMeta {
        nonce: [u8; 12],
        auth_tag: [u8; 16],
        system_epoch: u64,
        tenant_epoch: Option<u64>,
        tenant_wrapped_material: Option<Vec<u8>>,
    },
}

Table: device_alloc (bitmap journal for crash safety)
Key:   (device_id: [u8; 16], generation: u64)
Value: bincode-serialized Vec<AllocJournalEntry {
    offset: u64,
    length: u64,
    is_alloc: bool, // true = allocate, false = free
}>
```
Separation of concerns
The allocator does NOT know about device subclasses (NvmeU2 vs
NvmeQlc, HddEnterprise vs HddBulk). Those are pool/placement
concerns in kiseki-chunk and kiseki-control (ADR-024).
| Layer | Cares about | Doesn’t care about |
|---|---|---|
| kiseki-block | physical_block_size, rotational, O_DIRECT | TLC vs QLC, RPM, pool policy |
| kiseki-chunk | pool thresholds, EC config, placement | block alignment, I/O flags |
| kiseki-control | device class, pool assignment, tiering | how bytes reach the device |

The `DeviceClass` enum (ADR-024) stays in `kiseki-chunk`/`kiseki-control`. `DeviceCharacteristics` (auto-probed) stays in `kiseki-block`.
Integration with existing code
- `ChunkOps` trait (ADR-005) unchanged — callers are unaware of the backend
- New `PersistentChunkStore` in `kiseki-chunk` implements `ChunkOps`:
  - `write_chunk()`: EC encode → alloc extents per device via `DeviceBackend` → write fragments → update redb `chunk_meta`
  - `read_chunk()`: look up redb `chunk_meta` → `DeviceBackend::read` per fragment → EC decode if needed → return Envelope
  - `gc()`: free extents via `DeviceBackend::free` → update bitmap → remove from redb
- `DeviceManager` in `kiseki-block` opens devices at startup, probes characteristics, creates the appropriate `DeviceBackend` per device
- Server runtime (`kiseki-server`) wires `DeviceManager` → pools → `PersistentChunkStore` when `KISEKI_DATA_DIR` is set
Crate structure
```
kiseki-block/
├── Cargo.toml
└── src/
    ├── lib.rs
    ├── backend.rs    # DeviceBackend trait
    ├── raw.rs        # RawBlockDevice (O_DIRECT)
    ├── file.rs       # FileBackedDevice (sparse file)
    ├── probe.rs      # Sysfs device probing
    ├── superblock.rs # On-disk superblock format
    ├── bitmap.rs     # Allocation bitmap
    ├── allocator.rs  # Extent allocator (free-list + bitmap)
    ├── extent.rs     # Extent type
    ├── manager.rs    # DeviceManager
    └── error.rs      # BlockError, AllocError
```
Rationale
- Raw block over XFS: Eliminates FS overhead (journaling, inode, page cache) that becomes the bottleneck at NVMe line rate. Ceph BlueStore validated this approach at scale.
- Auto-detection over manual config: Reduces deployment friction. Admin provides device paths; Kiseki probes characteristics. Works correctly on bare metal, VMs, and CI without config changes.
- Bitmap over B-tree free-list on disk: Simpler crash recovery (fixed-size, position-indexed). Free-list is derived in-memory. DAOS VEA uses B-tree on persistent memory, but we don’t require PMEM — bitmap on block device with redb journal is sufficient.
- File-backed fallback: Same trait, different backend. Tests and CI don’t need raw devices. VMs work without device passthrough.
- Separate crate: `kiseki-block` has no domain knowledge (chunks, EC, pools). Clean dependency boundary. Testable in isolation.
Alternatives Considered
- XFS on each JBOD device (ADR-024 original default): Rejected for production — FS overhead at NVMe line rate is unacceptable. Still available as the `FileBacked` strategy for dev/VM.
- SPDK userspace I/O (DAOS model): Rejected — requires dedicated devices (no kernel access), complicates deployment, needs custom memory management (DMA buffers). Future optimization path if kernel I/O overhead is measured as the bottleneck.
- Pool files (one large file per device): Rejected — still has FS overhead (XFS metadata for the pool file itself). Raw block eliminates the FS entirely.
- redb for chunk data: Rejected — a B-tree is not designed for multi-GB blob storage. Acceptable for metadata only.
Consequences
- Adds `kiseki-block` crate to the workspace (~2000 lines estimated)
- Data devices must be provisioned as raw (no filesystem). The operator provides device paths in config; Kiseki writes the superblock on init.
- VMs and CI use file-backed mode transparently (no raw devices needed)
- Crash recovery depends on redb journal + device bitmap consistency
- Device initialization is a destructive operation (writes superblock, bitmap — existing data on the device is lost). Safety checks before init: (1) check for an existing Kiseki superblock magic — require `--force` if found; (2) check for known FS signatures (XFS, ext4, NTFS magic) — refuse with a clear error; (3) audit-log the init
- TRIM/UNMAP support improves SSD endurance but is optional
- Future: an SPDK backend can implement the `DeviceBackend` trait for userspace I/O without changing upper layers
Adversarial Review Findings (2026-04-22)
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | High | Write ordering — data before metadata creates phantom chunks on crash | WAL intent journal: alloc → journal intent → write data → commit chunk_meta → clear intent. Recovery replays intents. |
| 2 | High | No per-extent checksum — silent corruption indistinguishable from tampering | CRC32 trailer per extent. CRC fail = hardware corruption (EC repair). Auth tag fail after CRC pass = tampering (security alert). |
| 3 | Medium | Bitmap single point of failure per device | Primary + mirror bitmap at different offsets. On mismatch, use copy consistent with redb journal. |
| 4 | Medium | No device init safety — accidental overwrite of existing data | Safety checks: existing Kiseki magic → require `--force`. Known FS signatures → refuse. Audit log init. |
| 5 | Medium | File-backed mode doesn’t enforce alignment — CI misses bugs | FileBacked enforces same 4K alignment as RawBlockDevice. |
| 6 | Medium | Concurrent alloc race on shared free-list | Mutex per device on allocator state. Allocation is microseconds; I/O is the bottleneck. |
| 7 | Low | Immediate TRIM on free causes write amplification | Batch TRIM queue: accumulate, issue BLKDISCARD every 60s or at 1GB threshold. |
| 8 | Low | No max extent size — unbounded alloc fragments bitmap scan | Max extent 16MB. Larger chunks split into multiple extents. |
References
- Ceph BlueStore: Architecture
- DAOS VOS/VEA: Storage Model
- ADR-022: Storage backend (redb for metadata)
- ADR-024: Device management and capacity thresholds
- ADR-005: EC and chunk durability
ADR-030: Dynamic Small-File Placement and Metadata Capacity Management
Status: Accepted
Date: 2026-04-22
Deciders: Architect + domain expert
Adversarial review: 2026-04-22 (6 findings: 1C 2H 2M 1L, all resolved)
Context: ADR-024 (device management), ADR-029 (raw block allocator), I-L9 (inline threshold), I-C5 (capacity thresholds), I-C8 (bitmap ground truth)
Problem
At scale (10B+ files, 100PB+), the metadata tier (redb on system NVMe) becomes a sizing bottleneck. The per-file metadata footprint (~280 bytes) is unavoidable, but small-file content inlined into deltas causes the metadata tier to scale with data volume, not just file count.
Current state:
- `inline_threshold_bytes` is specified (I-L9) but not implemented
- No dynamic adjustment mechanism exists
- No awareness of system disk capacity or media type
- No workload-driven shard placement across heterogeneous nodes
Capacity example
10B files, 100PB total, 50-node cluster, RF=3, 256GB NVMe root disks:
| Component | Per file | Cluster total | Per node |
|---|---|---|---|
| Delta log (no inline) | ~200 B | ~2 TB | ~120 GB |
| Chunk metadata | ~80 B | ~0.8 TB | ~48 GB |
| Subtotal (metadata only) | ~280 B | ~2.8 TB | ~168 GB |
| Small-file content (if inlined) | variable | 3-200 TB | blows budget |
Metadata alone consumes 168 GB/node at 50 nodes. Adding inline content makes 256 GB root disks insufficient.
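The per-node numbers in the table can be reproduced with a one-line calculation (RF=3 means each shard’s metadata is held in full by three replicas):

```rust
/// Per-node metadata bytes for `files` files at `per_file` bytes each,
/// replicated `rf` times across `nodes` nodes.
pub fn per_node_metadata_bytes(files: u64, per_file: u64, rf: u64, nodes: u64) -> u64 {
    files * per_file * rf / nodes
}
```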
Decision
1. System disk auto-detection and budget calculation
At server boot, detect the system partition’s capacity and media type. Compute a metadata budget with configurable soft and hard limits.
```
KISEKI_DATA_DIR → stat()                  → total_bytes, fs_type
/sys/block/{dev}/queue/rotational         → 0 = SSD/NVMe, 1 = HDD
/sys/block/{dev}/device/model             → device identification
```
Defaults (configurable via env or config file):
| Parameter | Default | Description |
|---|---|---|
| KISEKI_META_SOFT_LIMIT_PCT | 50% | Normal operating ceiling |
| KISEKI_META_HARD_LIMIT_PCT | 75% | Absolute maximum, triggers emergency |
| KISEKI_META_INLINE_FLOOR | 128 B | Hard lower bound for inline (metadata-like payloads only) |
Warning: If the system disk is rotational (HDD), emit a persistent warning at boot and in health reports:
WARNING: system disk is rotational (HDD). Raft fsync latency will
be 5-10ms per commit. Production deployments require NVMe or SSD
for the metadata partition. See ADR-030.
Reported to cluster (via gRPC health reports, not Raft — see SF-ADV-4 resolution):
```rust
struct NodeMetadataCapacity {
    total_bytes: u64,
    used_bytes: u64,
    soft_limit_bytes: u64,
    hard_limit_bytes: u64,
    media_type: MediaType,        // Nvme, Ssd, Hdd
    small_file_budget_bytes: u64, // derived: soft_limit - reserved - metadata
}
```
2. Two-tier redb layout on system disk
Separate metadata (Raft log, chunk index) from small-file content:
```
KISEKI_DATA_DIR/
├── raft/log.redb      ← Raft log entries (bounded by snapshot policy)
├── keys/epochs.redb   ← Key epoch metadata (tiny, <10 MB)
├── chunks/meta.redb   ← Chunk extent index (scales with file count)
└── small/objects.redb ← Small-file encrypted content (capacity-managed)
```
The first three are structural metadata — required regardless of the inline threshold. The fourth (`small/objects.redb`) is a data-tier extension — its size is controlled by the inline threshold.
This separation enables:
- Independent monitoring of each tier’s growth
- Emergency response: disable inline (threshold → floor) without touching structural metadata
- Backup/restore of structural metadata without bulk data
GC contract (SF-ADV-6): When `truncate_log` or `compact_shard` removes a delta that references an inline object, the corresponding `small/objects.redb` entry is also deleted. The GC path must cover both stores — orphan entries in `small/objects.redb` are a capacity leak. The `chunk_id` key is shared between `small/objects.redb` and the block device extent mapping, so deletion is keyed identically.
3. Per-shard dynamic inline threshold
The inline threshold determines whether a file’s encrypted content is
stored in small/objects.redb (metadata tier) or as a chunk extent on
a raw block device (data tier).
Threshold is per-shard, not per-node, because all Raft replicas of a shard must agree on whether content is inline or chunked (state machine determinism).
Computation: The shard leader computes the threshold from the minimum small-file budget across all nodes hosting that shard:
```
available       = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate   (from delta count heuristic)
raw_threshold   = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)
```
Where INLINE_CEILING is a system-wide maximum (e.g., 64 KB) to
prevent pathological cases.
Raft log throughput guard (SF-ADV-1): The threshold is further
clamped by a per-shard Raft log throughput budget
(KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline
write rate (measured over a sliding 10-second window) would exceed
this budget at the current threshold, the effective threshold is
temporarily reduced to floor until the rate drops. This prevents
inline data from starving metadata-only Raft operations (large-file
chunk_ref deltas, maintenance commands, watermark advances) during
write storms.
```
effective_threshold = if shard.inline_write_rate_mbps > RAFT_INLINE_MBPS:
                          INLINE_FLOOR
                      else:
                          shard_threshold
```
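Both clamps can be sketched as pure functions (constants match the defaults in this section; the sliding-window rate measurement is abstracted to a single input):

```rust
const INLINE_FLOOR: u64 = 128;          // bytes (KISEKI_META_INLINE_FLOOR)
const INLINE_CEILING: u64 = 64 * 1024;  // 64 KiB system-wide maximum
const RAFT_INLINE_MBPS: f64 = 10.0;     // KISEKI_RAFT_INLINE_MBPS default

/// Budget-based per-shard threshold: min voter budget divided by the
/// projected file count, clamped to [floor, ceiling].
pub fn shard_threshold(min_budget_bytes: u64, projected_files: u64) -> u64 {
    let raw = min_budget_bytes / projected_files.max(1);
    raw.clamp(INLINE_FLOOR, INLINE_CEILING)
}

/// Raft throughput guard: drop to the floor while the measured inline
/// write rate exceeds the per-shard budget.
pub fn effective_threshold(shard_threshold: u64, inline_write_rate_mbps: f64) -> u64 {
    if inline_write_rate_mbps > RAFT_INLINE_MBPS {
        INLINE_FLOOR // write storm: protect metadata-only Raft traffic
    } else {
        shard_threshold
    }
}
```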
Threshold adjustment rules (I-L9 compatibility):
- Threshold can decrease dynamically (safe — new files use chunks)
- Threshold changes are prospective only — existing inline data is not retroactively migrated
- Threshold increase requires cluster admin decision and may trigger background migration of small chunked files back to inline (optional, maintenance-mode operation)
- Threshold is stored in `ShardConfig` and replicated via Raft
Read latency note (SF-ADV-3): After a threshold decrease, existing
inline files remain in small/objects.redb (fast, NVMe reads) while
new files of the same size go to block device extents (potentially
slower, especially on HDD). This bimodal latency for same-sized files
is expected behavior. Administrators can normalize it via the
maintenance-mode migration path (move old inline content to chunks),
but this is optional and not automatic.
Emergency override (SF-ADV-4): Capacity alerts use out-of-band
gRPC health reports, not Raft. Each node periodically reports its
NodeMetadataCapacity to the shard leader (or control plane) via the
data-path gRPC channel. If any voter reports hard-limit breach, the
leader commits a threshold reduction via Raft. This works because:
- The full-disk node doesn’t need to write Raft entries for the signal
- The leader commits the threshold change with 2/3 majority (the full-disk node’s vote is not required)
- The full-disk node receives the committed threshold change via Raft replication (read-only, no disk write needed until next apply)
4. Small-file data path
Inline content flows through Raft (SF-ADV-2): Inline content is
carried as payload in the Raft log entry (LogCommand::AppendDelta
with payload field). The state machine’s apply() method offloads
the payload to small/objects.redb on apply, keyed by chunk_id.
The in-memory state machine retains only the delta header (no payload).
This ensures:
- Snapshot correctness: `build_snapshot()` reads inline content from `small/objects.redb` and includes it in the serialized snapshot. `install_snapshot()` writes it back. Learners and restarted nodes receive all inline content via snapshot transfer.
- State machine determinism: all replicas apply the same log entries and write to their local `small/objects.redb` identically.
- Memory efficiency: inline payloads are not held in memory after apply — only the redb reference remains.
Below threshold (inline path):
```
client write → gateway encrypt → delta with payload →
Raft client_write (payload in log entry) →
replicated to voters →
state machine apply() → offload payload to small/objects.redb →
in-memory state: header only (no payload)
```
Above threshold (chunk path, unchanged):
```
client write → gateway encrypt → chunk alloc on DeviceBackend →
extent write (O_DIRECT) → delta with chunk_ref (no payload) →
Raft client_write → replicated (metadata only)
```
Read path: ChunkOps::get() checks small/objects.redb first
(keyed by chunk_id). If not found, reads from block device extent.
This is transparent to callers.
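The two-store lookup is a simple first-hit chain; a sketch with HashMaps standing in for `small/objects.redb` and the device extent map:

```rust
use std::collections::HashMap;

/// Check the small-file store first, fall back to the block-device
/// extent. Both stores are keyed by the same chunk_id.
pub fn get_chunk(
    small_objects: &HashMap<[u8; 32], Vec<u8>>,
    block_device: &HashMap<[u8; 32], Vec<u8>>,
    chunk_id: &[u8; 32],
) -> Option<Vec<u8>> {
    small_objects
        .get(chunk_id)
        .or_else(|| block_device.get(chunk_id))
        .cloned()
}
```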
5. Workload-driven shard placement (heterogeneous clusters)
When the cluster has mixed node types (HDD + SSD), the control plane can migrate shards to better-suited nodes using Raft membership changes.
Placement levers (ordered by preference, topology-dependent):
| Lever | When to use | Mechanism |
|---|---|---|
| Lower inline threshold | Always available | ShardConfig update via Raft |
| Split shard | Shard exceeds I-L6 ceiling | Standard shard split |
| Migrate to larger-NVMe node | Heterogeneous cluster, metadata pressure | Raft add_learner → promote → demote |
| Migrate to SSD node | Heterogeneous, small-file-heavy shard | Raft add_learner → promote → demote |
Decision tree (control plane policy):
IF shard.metadata_pressure > soft_limit:
IF can_lower_threshold(shard):
lower_threshold(shard) # cheapest, always try first
ELSE IF shard.exceeds_split_ceiling:
split_shard(shard) # distributes load
ELSE IF cluster.has_better_node(shard):
migrate_shard(shard, better_node) # needs heterogeneous cluster
ELSE:
alert("metadata tier at capacity, no placement options available")
In a homogeneous cluster, only the first two levers exist. The policy prunes itself based on what’s available.
Shard migration via Raft:
Migration is not a special operation — it’s a Raft membership change:
raft.add_learner(target_node)— target receives log/snapshot- Wait for learner to catch up (snapshot transfer, then log replay)
raft.change_membership(new_voter_set)— promote target, demote source- Old node removed from voter set, its data eventually GC’d
Properties:
- Zero downtime: reads/writes continue during migration
- Zero data loss: old node stays in membership until new node is caught up
- Reversible: if migration fails, learner is removed, no state change
6. Placement change rate limiting
Placement changes (shard migration, learner add/remove) consume snapshot transfer bandwidth. In HPC environments, workload profiles shift at job boundaries (hours to days), not continuously.
Exponential backoff per shard:
| Observation window | After N-th change |
|---|---|
| 2 hours | 1st (initial observation, minimum floor) |
| 2 hours | 2nd (backoff never resets below 2 h) |
| 4 hours | 3rd |
| 8 hours | 4th |
| … | doubles each time |
| 24 hours | cap (maximum interval) |
Reset (SF-ADV-5): The backoff resets to the minimum floor of 2 hours, not to a shorter interval. Even when the shard’s workload profile changes significantly (e.g., small-file ratio crosses a threshold boundary), the shard cannot be migrated more than once per 2 hours. This prevents oscillating workloads from causing continuous snapshot transfers. The 2-hour floor is chosen because:
- HPC job boundaries are typically hours apart
- A snapshot transfer of a large shard takes minutes, and the target node needs time to stabilize before being evaluated again
- The floor applies per-shard, so different shards can migrate concurrently within the cluster-wide rate limit
Per-cluster rate limit: at most max(1, num_nodes / 10) concurrent
shard migrations cluster-wide, to bound snapshot transfer bandwidth.
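The backoff schedule and the cluster-wide limit can be sketched directly from the numbers above (the doubling formula is an interpretation of the table):

```rust
/// Minimum hours before the n-th placement change of a shard:
/// 2 h floor for the first two changes, doubling after, capped at 24 h.
pub fn backoff_hours(nth_change: u32) -> u32 {
    if nth_change <= 2 {
        2 // minimum floor — never resets below this
    } else {
        (2u32 << (nth_change - 2)).min(24) // 4, 8, 16, then capped at 24
    }
}

/// Cluster-wide cap on concurrent shard migrations: max(1, nodes / 10).
pub fn max_concurrent_migrations(num_nodes: u32) -> u32 {
    (num_nodes / 10).max(1)
}
```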
7. SSD nodes as read accelerators (Raft learners)
For read-heavy small-file workloads, SSD nodes can serve as non-voting Raft learners:
- Learners receive the full Raft log (including small-file content)
- Learners do NOT participate in elections or commit quorum
- Learners serve read requests (state machine is up-to-date)
- Add/remove learners without disturbing the voter set
Use case: a shard has RF=3 on HDD voters (for capacity) plus 1-2 SSD learners (for read IOPS). The SSD learners handle small-file reads, HDD voters handle bulk writes.
Correction after suboptimal placement: Initial shard placement does not need to be optimal. The control plane observes shard metrics (small-file ratio, read IOPS, p99 latency) and corrects placement via Raft membership changes. Adding an SSD learner, promoting it to voter, and demoting an HDD voter is a zero-downtime, zero-data-loss operation. The cost is one snapshot transfer per migrated shard — bounded by the rate limiting in §6.
Promotion path: if workload shifts permanently, a learner can be promoted to voter (and an HDD voter demoted) via standard membership change.
Consequences
Positive
- Metadata tier sizing becomes self-managing
- Small files handled efficiently without manual tuning
- Mixed HDD/SSD clusters used optimally
- Placement corrections have zero downtime and zero data loss
- I-L9 compatibility preserved (prospective-only threshold changes)
- Snapshot transfer includes inline content (SF-ADV-2 resolved)
Negative
- Per-shard threshold adds complexity to `ShardConfig`
- `ChunkOps::get()` now checks two stores (redb + block device)
- Snapshot transfer is the bottleneck for migration speed
- Threshold computation requires cluster-wide metadata aggregation
- Inline writes under high load may be temporarily demoted to chunk path (throughput guard), causing brief latency increase for small files
Neutral
- Threshold floor (128 B) means truly tiny files are always inline
- Homogeneous clusters get simpler behavior (fewer levers)
- Migration mechanism is just Raft membership changes — no new protocol
- Bimodal read latency after threshold decrease is expected (SF-ADV-3)
Adversarial findings (resolved)
| ID | Severity | Finding | Resolution |
|---|---|---|---|
| SF-ADV-1 | High | Raft log throughput saturation from inline writes | Per-shard throughput budget (§3), temporarily lowers threshold to floor under load |
| SF-ADV-2 | Critical | Inline content missing from Raft snapshots | Inline content flows through Raft log; state machine offloads to redb on apply; snapshot reads from redb (§4) |
| SF-ADV-3 | Medium | Bimodal read latency after threshold decrease | Documented as expected; optional admin migration path to normalize (§3) |
| SF-ADV-4 | High | Emergency override fails if full-disk node can’t write Raft entries | Capacity reporting via out-of-band gRPC, not Raft; leader commits with 2/3 majority (§3) |
| SF-ADV-5 | Low | Backoff reset allows frequent migrations from oscillating workloads | Minimum 2-hour floor that never resets below (§6) |
| SF-ADV-6 | Medium | No GC path for small/objects.redb | GC contract: truncate_log and compact_shard delete corresponding redb entries (§2) |
Invariant impact
| Invariant | Impact |
|---|---|
| I-L9 | Extended: threshold is now per-shard and dynamic, but still prospective-only. Increase requires admin action. |
| I-C5 | Unchanged: capacity thresholds on data devices unaffected. |
| I-C8 | Unchanged: bitmap remains ground truth for block device allocations. |
| I-K3 | Unchanged: inline content is still encrypted with system DEK, wrapped with tenant KEK. |
New invariants
| ID | Invariant |
|---|---|
| I-SF1 | The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set. Threshold stored in ShardConfig, replicated via Raft. |
| I-SF2 | System disk metadata usage must not exceed hard_limit_pct of system partition capacity. Exceeding soft limit triggers threshold reduction; exceeding hard limit forces threshold to floor and emits alert. Alert uses out-of-band gRPC, not Raft. |
| I-SF3 | Shard migration via Raft membership change must not proceed until the target node has fully caught up (learner state matches leader’s committed index). |
| I-SF4 | Placement change rate per shard follows exponential backoff (2h floor, 24h cap). Backoff resets never go below 2h floor. Cluster-wide concurrent migrations bounded by max(1, num_nodes / 10). |
| I-SF5 | Inline content is carried in Raft log entries and offloaded to small/objects.redb on state machine apply. Snapshots include inline content read from redb. No inline content is held in the in-memory state machine after apply. |
| I-SF6 | GC (truncate_log, compact_shard) must delete corresponding entries from small/objects.redb when removing deltas that reference inline objects. Orphan redb entries are a capacity leak. |
| I-SF7 | Per-shard Raft inline throughput must not exceed KISEKI_RAFT_INLINE_MBPS (default 10 MB/s). When exceeded, effective inline threshold drops to floor until rate subsides. |
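The threshold rules in I-SF1 (minimum across the voter set, never below the 128 B floor) and I-SF7 (drop to floor while the inline throughput budget is exceeded) can be sketched as follows. This is a minimal illustration under assumed names — `effective_threshold`, `guarded_threshold`, and `THRESHOLD_FLOOR` are not the actual kiseki-server API:

```rust
// Illustrative sketch of I-SF1 and I-SF7; names are hypothetical.

const THRESHOLD_FLOOR: u64 = 128; // bytes — truly tiny files stay inline

/// I-SF1: a shard's inline threshold is the minimum affordable threshold
/// across all nodes hosting its voter set, clamped to the floor.
fn effective_threshold(voter_affordable: &[u64]) -> u64 {
    voter_affordable
        .iter()
        .copied()
        .min()
        .unwrap_or(THRESHOLD_FLOOR)
        .max(THRESHOLD_FLOOR)
}

/// I-SF7: when the per-shard inline write rate exceeds the budget
/// (default 10 MB/s), the effective threshold drops to the floor
/// until the rate subsides.
fn guarded_threshold(threshold: u64, inline_mbps: f64, budget_mbps: f64) -> u64 {
    if inline_mbps > budget_mbps {
        THRESHOLD_FLOOR
    } else {
        threshold
    }
}

fn main() {
    // One HDD voter can only afford 1 KiB inline → the shard gets 1 KiB.
    let t = effective_threshold(&[4096, 1024, 65536]);
    assert_eq!(t, 1024);
    // Throughput guard active: threshold collapses to the floor.
    assert_eq!(guarded_threshold(t, 12.0, 10.0), THRESHOLD_FLOOR);
    assert_eq!(guarded_threshold(t, 3.0, 10.0), 1024);
}
```

Because the minimum is taken over the voter set only, adding an SSD learner never changes a shard's threshold until the learner is promoted to voter.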
Spec references
- `specs/invariants.md` — I-L9, I-C5, I-C8, I-K3
- `specs/architecture/adr/024-device-management-and-capacity.md` — device classes, server disk layout
- `specs/architecture/adr/029-raw-block-device-allocator.md` — DeviceBackend trait, extent allocation
- `specs/architecture/adr/026-raft-topology.md` — Raft membership, multi-Raft pattern
- `specs/implementation/phase-7-9-assessment.md` — open design question on small files
ADR-031: Client-Side Cache
Status: Accepted Date: 2026-04-23 Deciders: Architect + domain expert Adversarial review: 2026-04-23 (14 findings: 2C 4H 4M 4L, all resolved)
Context
ADR-013 (POSIX semantics scope), ADR-019 (gateway deployment model),
ADR-020 (workflow advisory), ADR-030 (dynamic small-file placement),
control-plane.feature (policy distribution precedent),
native-client.feature (client architecture).
CSCS workload mix: LLM pretraining (epoch reuse of tokenized datasets), LLM inference (model weight cold-start), climate/weather simulation (bounded input staging with hard deadlines), HPC checkpoint/restart. Common pattern: compute nodes repeatedly pull the same encrypted chunks across the fabric.
Existing client architecture: kiseki-client crate with feature flags
(fuse, ffi, python, pure-Rust default). Performs tenant-layer
encryption — plaintext never leaves the workload process. The existing
ClientCache is an in-memory HashMap<ChunkId, Vec<u8>> with TTL and
max-entries eviction.
Problem
- Repeat reads of the same chunks cross the fabric unnecessarily. Training datasets are read epoch after epoch. Inference weights are loaded identically by multiple model replicas. Climate boundary conditions are staged identically to every simulation rank.
- The in-memory cache (the current `ClientCache`) is bounded by process memory, which is primarily needed for computation. Compute-node NVMe is available and underutilized.
- No mechanism for pre-staging datasets. Jobs start with a cold cache and pay first-access latency on every rank simultaneously, creating a thundering-herd pattern on the storage fabric.
- No cache mode differentiation. Training (pin everything), inference (pin weights, LRU prompts), and HPC checkpoint (don't cache) have fundamentally different cache needs.
Decision
1. Cache architecture
The client-side cache is a library-level module in kiseki-client,
shared across all linkage modes (FUSE, FFI, Python, native Rust). It
operates on decrypted plaintext chunks keyed by ChunkId.
canonical (fabric) → decrypt → cache store (NVMe) → serve to caller
↑
cache hit path (no fabric, no decrypt)
Two-tier storage:
| Tier | Backing | Purpose | Eviction |
|---|---|---|---|
| Hot (L1) | In-memory HashMap | Sub-microsecond hits for active working set | LRU, bounded by max_memory_bytes |
| Warm (L2) | Local NVMe file or directory | Large capacity for datasets and weights | Per-mode policy (see §2) |
L2 layout on NVMe (CC-ADV-4 resolved: per-process subdirectories):
$KISEKI_CACHE_DIR/
├── <tenant_id_hex>/
│ ├── <pool_id>/ ← per-process pool (128-bit CSPRNG)
│ │ ├── chunks/
│ │ │ ├── <prefix>/
│ │ │ │ └── <chunk_id_hex> ← plaintext + CRC32 trailer
│ │ │ └── ...
│ │ ├── meta/
│ │ │ └── file_chunks.db
│ │ ├── staging/
│ │ │ └── <dataset_id>.manifest
│ │ └── pool.lock ← flock, proves process is alive
│ └── <pool_id>/ ← another concurrent process
│ └── ...
└── ...
Each client process creates its own pool_id directory (128-bit
CSPRNG, same generation as client_id per I-WA4). The pool.lock
file holds an flock for the process lifetime. Multiple concurrent
same-tenant processes on the same node have fully independent pools
with no contention.
L2 integrity (CC-ADV-3 resolved): Each L2 chunk file stores the plaintext data followed by a 4-byte CRC32 trailer, computed at insert time. On L2 read, the CRC32 is verified before serving. Full SHA-256 content-address verification occurs only at fetch time (when the chunk is first retrieved from canonical). CRC32 catches bit-flips and filesystem corruption at ~1 GB/s throughput cost. CRC mismatch triggers bypass to canonical and L2 entry deletion (I-CC7).
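The trailer scheme can be sketched as below. To stay dependency-free the example uses a bitwise IEEE CRC32 (reflected, polynomial `0xEDB88320`); a real client would use a table-driven or SIMD implementation to hit the stated ~1 GB/s. Function names and the little-endian trailer encoding are illustrative assumptions, not the actual on-disk format:

```rust
// Sketch of the I-CC13 trailer: plaintext + 4-byte CRC32, verified on read.

fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if LSB set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compute and append the trailer at L2 insert time.
fn with_trailer(plaintext: &[u8]) -> Vec<u8> {
    let mut out = plaintext.to_vec();
    out.extend_from_slice(&crc32(plaintext).to_le_bytes());
    out
}

/// Verify the trailer on L2 read. `None` means the caller must bypass to
/// canonical and delete the L2 entry (I-CC7).
fn verify_and_strip(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (body, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(body) == stored).then_some(body)
}

fn main() {
    let stored = with_trailer(b"chunk payload");
    assert_eq!(verify_and_strip(&stored), Some(&b"chunk payload"[..]));

    let mut corrupt = stored.clone();
    corrupt[0] ^= 0xFF; // simulated bit-flip on NVMe
    assert_eq!(verify_and_strip(&corrupt), None);
}
```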
Security model (plaintext cache):
The L2 cache holds decrypted plaintext on local NVMe. This is acceptable because:
- The compute node already holds decrypted data in process memory (computation requires plaintext)
- L2 NVMe is local to the compute node, same trust domain as process memory
- L2 is ephemeral — wiped on process exit and on long disconnect
- `zeroize` on eviction/wipe: overwrite chunk data before deallocation (I-CC2)
- File permissions: `0600`, owned by process UID
- Crash recovery: startup scavenger + periodic scrubber clean orphaned pools (CC-ADV-1 resolved, see §9)
Residual risk (CC-ADV-10 acknowledged): Software zeroize on NVMe/SSD provides logical-level erasure only. The Flash Translation Layer may retain physical copies of overwritten data until internal garbage collection. For deployments requiring physical erasure guarantees, use NVMe drives with hardware encryption (OPAL/SED) and rotate the drive encryption key on node reboot. This is an operational hardening measure, not a baseline requirement.
2. Cache modes
Three modes, selectable per client instance at session establishment:
Pinned mode
For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate (boundary conditions).
- Chunks are retained against eviction until explicit release
- Populated via the staging API (§6) or on first access
- L2 is the primary tier; L1 is a hot subset
- Eviction: only on explicit `release()` or process exit
- Capacity bounded by `max_cache_bytes` (§8); staging beyond capacity returns an error, does not evict pinned chunks
Dataset versioning (CC-ADV-8 resolved): Pinned mode stages a
point-in-time snapshot of the dataset. The staged version is immutable
in the cache regardless of canonical updates. This is intentional —
training runs require a stable dataset across epochs. To pick up
dataset updates, the user must explicitly release and re-stage.
There is no automatic dataset-level version check.
Organic mode
Default for mixed workloads. LRU with usage-weighted retention.
- Chunks cached on first read, evicted on LRU when capacity is reached
- Frequently accessed chunks promoted to L1
- L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
- Metadata cache (file→chunk_list) with configurable TTL (default 5s)
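The usage-weighted retention rule ("accessed N times survives N eviction rounds") can be sketched as follows. This is a minimal model under assumed names (`OrganicL2`, `Entry`, a logical clock for last-access ordering), not the actual kiseki-client eviction code:

```rust
use std::collections::HashMap;

// Sketch of organic-mode L2 eviction: LRU order, weighted by access count.
struct Entry {
    last_access: u64, // logical clock tick of the most recent access
    weight: u32,      // remaining eviction rounds this entry survives
}

struct OrganicL2 {
    clock: u64,
    entries: HashMap<u64, Entry>, // key: chunk_id
}

impl OrganicL2 {
    fn new() -> Self {
        Self { clock: 0, entries: HashMap::new() }
    }

    /// Record an access: refresh recency and buy one extra eviction round.
    fn touch(&mut self, chunk_id: u64) {
        self.clock += 1;
        let e = self
            .entries
            .entry(chunk_id)
            .or_insert(Entry { last_access: 0, weight: 0 });
        e.last_access = self.clock;
        e.weight += 1;
    }

    /// One eviction round: the least-recently-used entry either spends a
    /// weight point and survives, or is evicted. Returns the evicted id.
    fn eviction_round(&mut self) -> Option<u64> {
        let id = *self
            .entries
            .iter()
            .min_by_key(|(_, e)| e.last_access)
            .map(|(id, _)| id)?;
        let e = self.entries.get_mut(&id).unwrap();
        if e.weight > 1 {
            e.weight -= 1;
            None
        } else {
            self.entries.remove(&id);
            Some(id)
        }
    }
}

fn main() {
    let mut l2 = OrganicL2::new();
    l2.touch(1); l2.touch(1); l2.touch(1); // hot chunk, weight 3
    l2.touch(2);                           // cold chunk, weight 1
    // Chunk 1 is older, so it is the round's candidate — but it survives
    // two rounds before being evicted on the third.
    assert_eq!(l2.eviction_round(), None);
    assert_eq!(l2.eviction_round(), None);
    assert_eq!(l2.eviction_round(), Some(1));
    assert_eq!(l2.eviction_round(), Some(2));
}
```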
Bypass mode
For workloads that don’t benefit from caching: streaming ingest, one-shot scans, checkpoint writes, compute-bound codes with no repeat reads.
- All reads go directly to canonical
- No L1 or L2 storage consumed
- Zero overhead beyond mode selection
3. Metadata cache
The cache stores file-to-chunk-list mappings with a bounded TTL:
struct MetadataEntry {
    chunk_list: Vec<ChunkId>,
    fetched_at: Instant,
    ttl: Duration,
}
I-CC3 (metadata freshness and authority): File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical before serving chunks that depend on it. Within the TTL window, the cached mapping is authoritative — it may serve data for files that have since been modified or deleted in canonical. This is an accepted consequence of the TTL window, not a correctness violation. Modifications create new compositions with new chunk_ids; the old mapping points to valid immutable chunks that were the file’s content at fetch time. Deletions remove the composition; the cached mapping continues to serve the deleted file’s data until TTL expiry.
I-CC5 (staleness bound): Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical will be visible to a caching client within at most one metadata TTL period. The default TTL (5 seconds) balances freshness against metadata lookup cost.
Write-through: When the client writes a file (creating new chunks and a new composition), the local metadata cache is updated immediately with the new chunk list. This provides read-your-writes consistency within a single client process without waiting for TTL expiry.
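The TTL gate and the write-through path can be sketched around `MetadataEntry` like this. Container and method names (`MetadataCache`, `write_through`) are illustrative assumptions, not the actual kiseki-client types:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type ChunkId = u64;

// Sketch of the I-CC3/I-CC5 freshness rule and write-through update.
struct MetadataEntry {
    chunk_list: Vec<ChunkId>,
    fetched_at: Instant,
    ttl: Duration,
}

struct MetadataCache {
    entries: HashMap<String, MetadataEntry>,
}

impl MetadataCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Serve a mapping only while fresh; an expired entry returns `None`,
    /// forcing a re-fetch from canonical before dependent chunks are served.
    fn get(&self, path: &str) -> Option<&[ChunkId]> {
        let e = self.entries.get(path)?;
        (e.fetched_at.elapsed() <= e.ttl).then_some(e.chunk_list.as_slice())
    }

    fn insert(&mut self, path: &str, chunks: Vec<ChunkId>, ttl: Duration) {
        self.entries.insert(
            path.into(),
            MetadataEntry { chunk_list: chunks, fetched_at: Instant::now(), ttl },
        );
    }

    /// Write-through: a local write installs the new chunk list immediately,
    /// giving read-your-writes within this process without waiting for TTL.
    fn write_through(&mut self, path: &str, new_chunks: Vec<ChunkId>, ttl: Duration) {
        self.insert(path, new_chunks, ttl);
    }
}

fn main() {
    let mut cache = MetadataCache::new();
    cache.insert("/training/imagenet/part-0", vec![10, 11], Duration::from_secs(5));
    assert_eq!(cache.get("/training/imagenet/part-0"), Some(&[10u64, 11][..]));

    // A zero TTL is immediately stale: the mapping must be re-fetched.
    cache.insert("/tmp/expired", vec![99], Duration::ZERO);
    std::thread::sleep(Duration::from_millis(1));
    assert_eq!(cache.get("/tmp/expired"), None);

    // Own write is visible at once.
    cache.write_through("/training/imagenet/part-0", vec![12], Duration::from_secs(5));
    assert_eq!(cache.get("/training/imagenet/part-0"), Some(&[12u64][..]));
}
```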
4. Correctness invariants
The cache’s correctness rests on a small set of stated invariants. Each case where the cache serves (rather than bypasses) is backed by one or more of these invariants. Cases not covered bypass to canonical.
I-CC1 (chunk immutability): Chunks are immutable in canonical (I-C1). A chunk fetched, verified by content-address (SHA-256 of plaintext matches chunk_id derivation), and stored in cache is correct for all future reads of that chunk_id. No TTL needed for chunk data.
I-CC2 (plaintext security): Cached plaintext is overwritten with
zeros (zeroize) before deallocation, eviction, or cache wipe.
File-level: overwrite contents before unlink. Memory-level:
Zeroizing<Vec<u8>> for L1 entries. This provides logical-level
erasure; physical-level erasure on flash storage requires hardware
encryption (see §1 residual risk).
I-CC6 (disconnect threshold): Cached entries remain authoritative
across fabric disconnects shorter than max_disconnect_seconds
(default 300s). Beyond this threshold, the entire cache (L1 + L2) is
wiped. Disconnect is defined as: no successful RPC to any canonical
endpoint (storage node or gateway) for max_disconnect_seconds
consecutive seconds. The client maintains a last_successful_rpc
timestamp updated on every successful data-path or heartbeat RPC.
Background heartbeat RPCs (every 60s, piggybacked on metadata TTL
refresh when idle) keep this timestamp current. Transient single-RPC
failures do not trigger the disconnect timer — only sustained
unreachability across all endpoints does.
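The disconnect timer reduces to a single timestamp plus a threshold check. A minimal sketch, with illustrative names (`DisconnectTracker`, `must_wipe`) and a shortened window for demonstration:

```rust
use std::time::{Duration, Instant};

// Sketch of the I-CC6 disconnect timer.
struct DisconnectTracker {
    last_successful_rpc: Instant,
    max_disconnect: Duration, // default 300s in the ADR
}

impl DisconnectTracker {
    fn new(max_disconnect: Duration) -> Self {
        Self { last_successful_rpc: Instant::now(), max_disconnect }
    }

    /// Called on every successful data-path or heartbeat RPC.
    fn record_success(&mut self) {
        self.last_successful_rpc = Instant::now();
    }

    /// True once no canonical endpoint has answered for the full window —
    /// the caller must then wipe L1 + L2.
    fn must_wipe(&self) -> bool {
        self.last_successful_rpc.elapsed() >= self.max_disconnect
    }
}

fn main() {
    let mut t = DisconnectTracker::new(Duration::from_millis(50));
    assert!(!t.must_wipe()); // just connected

    std::thread::sleep(Duration::from_millis(60));
    assert!(t.must_wipe()); // sustained unreachability → wipe

    t.record_success(); // any single successful RPC resets the timer
    assert!(!t.must_wipe());
}
```

Because a single successful RPC to any endpoint resets the timestamp, transient per-RPC failures never trip the wipe; only sustained, cluster-wide unreachability does.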
I-CC7 (error bypass): Any local cache error (L2 I/O failure, corrupt chunk detected by CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. Failed L2 reads are not retried from L2 — they go to canonical immediately.
I-CC8 (wipe on restart / crash recovery): On process start, the
client either creates a new L2 pool (wiping any prior orphaned pools)
or adopts an existing pool identified by KISEKI_CACHE_POOL_ID
environment variable (see §6 staging handoff). Orphaned pools are
detected by attempting flock on each pool.lock — if the lock
succeeds, the pool is orphaned (no live process holds it) and is
wiped (zeroized and deleted). A separate kiseki-cache-scrub service
runs on node boot and periodically (every 60s) to clean orphaned
pools across all tenants, covering crash recovery when no subsequent
kiseki process starts on that node.
I-CC13 (L2 integrity): L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer on each chunk file. On L2 read, the CRC32 is verified before serving. CRC mismatch triggers bypass to canonical and L2 entry deletion.
5. Policy authority and distribution
Cache policy follows the same distribution mechanism as quotas
(per control-plane.feature scenario “Quota enforcement during
control plane outage”).
Policy hierarchy
cluster default → org override → project override → workload override
→ session selection
Each level narrows (never broadens) the parent’s settings, consistent with ADR-020 / I-WA7.
Policy attributes
| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set{pinned, organic, bypass} | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache filesystem |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |
Narrowing rules (same as I-WA7):
- `cache_enabled = false` at any level → disabled for all children
- `allowed_modes` at child ⊆ `allowed_modes` at parent
- `max_cache_bytes` at child ≤ `max_cache_bytes` at parent
- `metadata_ttl_ms` at child ≤ `metadata_ttl_ms` at parent
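The narrowing rules are mechanical: booleans AND together, mode sets intersect, numeric ceilings take the minimum. A sketch with an assumed policy struct (the real `TenantConfig` carries more fields):

```rust
use std::collections::BTreeSet;

// Illustrative policy subset; not the actual TenantConfig layout.
#[derive(Clone, Debug, PartialEq)]
struct CachePolicy {
    cache_enabled: bool,
    allowed_modes: BTreeSet<&'static str>,
    max_cache_bytes: u64,
    metadata_ttl_ms: u64,
}

/// Child levels may only narrow the parent, never broaden it (I-WA7 pattern).
fn narrow(parent: &CachePolicy, child: &CachePolicy) -> CachePolicy {
    CachePolicy {
        cache_enabled: parent.cache_enabled && child.cache_enabled,
        allowed_modes: parent
            .allowed_modes
            .intersection(&child.allowed_modes)
            .copied()
            .collect(),
        max_cache_bytes: parent.max_cache_bytes.min(child.max_cache_bytes),
        metadata_ttl_ms: parent.metadata_ttl_ms.min(child.metadata_ttl_ms),
    }
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    let cluster = CachePolicy {
        cache_enabled: true,
        allowed_modes: ["pinned", "organic", "bypass"].into_iter().collect(),
        max_cache_bytes: 50 * GIB,
        metadata_ttl_ms: 5000,
    };
    let org = CachePolicy {
        cache_enabled: true,
        allowed_modes: ["organic", "bypass"].into_iter().collect(),
        max_cache_bytes: 100 * GIB, // broader than parent — gets clamped
        metadata_ttl_ms: 2000,
    };
    let effective = narrow(&cluster, &org);
    assert!(effective.cache_enabled);
    assert!(!effective.allowed_modes.contains("pinned")); // org removed it
    assert_eq!(effective.max_cache_bytes, 50 * GIB);      // parent ceiling wins
    assert_eq!(effective.metadata_ttl_ms, 2000);          // child narrowed
}
```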
Distribution mechanism
Cache policy is carried in the same TenantConfig structure that
carries quotas. At session establishment, the client resolves its
effective policy through multiple paths (CC-ADV-6 resolved):
- Primary: `GetCachePolicy` RPC on the data-path gRPC channel to any connected storage node. Storage nodes have `TenantConfig` (the same data they use for quota enforcement). No gateway or control plane reachability required — the client only needs the data fabric.
- Secondary: fetch from the gateway's locally-cached `TenantConfig` (if the gateway is reachable)
- Stale tolerance: last-known policy persisted in the L2 pool directory (`policy.json`). Remains effective during outages, consistent with the quota scenario in `control-plane.feature`.
- Fallback: if no policy is resolvable (first-ever session, all paths unreachable), use conservative defaults (cache enabled, organic mode, 10 GB max, 5s TTL)
- Reconciliation: on control-plane recovery, client re-fetches policy and applies prospectively (I-WA18 pattern — active sessions continue under session-start policy; new sessions use updated policy)
No parallel policy-distribution path is introduced. Cache policy is
one more field in TenantConfig, alongside quotas, compliance tags,
and advisory settings.
I-CC9 (policy fallback): When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). The cache is a performance feature; failing to resolve policy must not prevent data access.
I-CC10 (prospective policy): Cache policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment, consistent with I-WA18.
6. Staging API
Client-local operation for pre-populating the cache with a dataset’s chunks in pinned mode. Pull-based — the client fetches from canonical.
Interface
# CLI (Slurm prolog, manual use)
kiseki-client stage --dataset <namespace_path> [--timeout <seconds>]
kiseki-client stage --status [--dataset <namespace_path>]
kiseki-client stage --release <namespace_path>
kiseki-client stage --release-all
# Rust API
impl CacheManager {
async fn stage(&self, namespace_path: &str) -> Result<StageResult>;
fn stage_status(&self) -> Vec<StagedDataset>;
fn release(&self, namespace_path: &str);
fn release_all(&self);
}
# Python API (via PyO3)
client.stage(namespace_path="/training/imagenet")
client.stage_status()
client.release(namespace_path="/training/imagenet")
# C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs)
kiseki_stage_status(handle, &status)
kiseki_release(handle, "/training/imagenet")
Flow (CC-ADV-11 resolved: directory tree handling)
- Resolve `namespace_path` to composition metadata via canonical. If the path is a directory, recursively enumerate all files (compositions) up to `max_staging_depth` (default 10) and `max_staging_files` (default 100,000). If limits are exceeded, return an error with the count of files discovered.
- Extract the full chunk list from all resolved compositions
- For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention
- Write `staging/<dataset_id>.manifest` listing all compositions, their chunk_ids, and the total byte count
- Report progress (chunks staged / total, bytes, elapsed)
Staging is idempotent — re-staging an already-staged dataset is a no-op (chunks already present). Partial staging (interrupted) can be resumed by re-running the command.
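The idempotency falls out of the fetch filter: only chunks absent from L2 cross the fabric, so re-running stage on an already- or partially-staged dataset fetches exactly the missing set. A minimal sketch with illustrative names:

```rust
use std::collections::HashSet;

type ChunkId = u64;

// Sketch of the idempotent staging step: diff the manifest against L2.
fn chunks_to_fetch(manifest: &[ChunkId], l2: &HashSet<ChunkId>) -> Vec<ChunkId> {
    manifest.iter().copied().filter(|c| !l2.contains(c)).collect()
}

fn main() {
    let manifest = vec![1, 2, 3, 4];
    // A partial prior stage (e.g. interrupted) left chunks 1 and 3 in L2.
    let mut l2: HashSet<ChunkId> = [1, 3].into_iter().collect();

    // Resume: only the missing chunks are fetched from canonical.
    let missing = chunks_to_fetch(&manifest, &l2);
    assert_eq!(missing, vec![2, 4]);
    l2.extend(missing);

    // Re-staging a fully staged dataset is a no-op.
    assert!(chunks_to_fetch(&manifest, &l2).is_empty());
}
```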
Staging handoff (CC-ADV-5 resolved)
The staging CLI creates a cache pool and holds the pool.lock flock
for its lifetime. The workload process adopts the staging pool instead
of creating a new one:
- Staging CLI: `kiseki-client stage --dataset /training/imagenet`
  - Creates pool, writes `pool_id` to stdout and to `$KISEKI_CACHE_DIR/<tenant>/staging_pool_id`
  - Stages chunks, holds flock, stays alive (daemon mode)
- Workload process: sets `KISEKI_CACHE_POOL_ID=<pool_id>` (from Slurm prolog output, Lattice env injection, or the file)
  - On start, detects the existing pool with matching `pool_id`
  - Adopts the pool: takes over the flock from the staging daemon
  - Staging daemon detects the flock loss, exits cleanly
- If `KISEKI_CACHE_POOL_ID` is not set: normal fresh-pool behavior (create a new pool, wipe orphans)
Slurm integration:
# prolog.sh:
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE
# epilog.sh:
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID
Lattice integration: injects KISEKI_CACHE_POOL_ID into the
workload environment after parallel staging completes across the
node set. Queries stage --status to verify readiness before
launching the workload.
I-CC11 (staging correctness): Staged chunks are fetched from
canonical, verified by content-address, and stored with pinned
retention. The staging manifest records the compositions and chunk_ids
at staging time as a point-in-time snapshot. If the dataset is modified
in canonical after staging, the staged version remains correct for its
chunk_ids (immutable chunks) but stale relative to the current dataset
version. To pick up updates, the user must explicitly release and
re-stage.
7. Cache invalidation
The cache is primarily self-consistent due to chunk immutability (I-C1). Explicit invalidation is needed only for metadata:
Metadata invalidation: TTL-based. No push invalidation from canonical to client. The metadata TTL is the sole freshness mechanism.
Chunk invalidation: Not needed under normal operation (chunks are immutable). Two exceptional cases:
- Crypto-shred (CC-ADV-2 resolved): When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection via three paths:
  - Periodic key health check: Client pings KMS every `key_health_interval` (default 30s). If the tenant KEK is reported destroyed (`KEK_DESTROYED` error), wipe immediately.
  - Advisory channel: If connected, receives shred notification immediately (fast path, best-effort).
  - KMS error on next operation: Any key fetch that returns `KEK_DESTROYED` triggers an immediate wipe.
  - Unreachability fallback: If the KMS is unreachable for `max_disconnect_seconds`, the disconnect timer triggers a full cache wipe (I-CC6), which covers the case where the KMS is unreachable because the KEK was destroyed.

  Maximum time between a crypto-shred event and the cache wipe is bounded by `min(key_health_interval, max_disconnect_seconds)` — default 30 seconds.
- Key rotation: When the system key epoch rotates, existing cached plaintext remains valid (same content, different encryption at rest). No cache action needed — the cache holds plaintext, not ciphertext.
I-CC12 (crypto-shred wipe): On crypto-shred event, all cached
plaintext for the affected tenant is wiped from L1 and L2 with
zeroize. Detection bounded by key_health_interval (default 30s).
No cached data from a shredded tenant is served after detection.
8. Capacity management
Per-process limits:
| Parameter | Default | Source |
|---|---|---|
| max_memory_bytes (L1) | 256 MB | env KISEKI_CACHE_L1_MAX or API |
| max_cache_bytes (L2) | 50 GB | policy ceiling or env KISEKI_CACHE_L2_MAX |
Per-node limit (CC-ADV-9 resolved):
max_node_cache_bytes (default: 80% of $KISEKI_CACHE_DIR filesystem
capacity). Enforced cooperatively: before inserting into L2, each
process sums total usage across all pool directories in
$KISEKI_CACHE_DIR. If total exceeds max_node_cache_bytes, the
insert is rejected (organic: evict first; pinned: staging error).
The disk-pressure check (90% filesystem utilization) remains as a
hard backstop.
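The cooperative check can be sketched as a recursive directory sum over the cache root, run before each L2 insert. This is an illustrative sketch (`dir_bytes`, `l2_insert_allowed`, and the demo paths are assumptions); a real implementation would cache the sum and rescan periodically rather than walk the tree per insert:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum file sizes under a directory (all tenants, all pools).
fn dir_bytes(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        total += if meta.is_dir() {
            dir_bytes(&entry.path())?
        } else {
            meta.len()
        };
    }
    Ok(total)
}

/// Reject the insert when all pools together would exceed the node ceiling.
fn l2_insert_allowed(cache_root: &Path, chunk_len: u64, node_limit: u64) -> io::Result<bool> {
    Ok(dir_bytes(cache_root)? + chunk_len <= node_limit)
}

fn main() -> io::Result<()> {
    // Demo layout under a temp dir standing in for $KISEKI_CACHE_DIR.
    let root = std::env::temp_dir().join("kiseki-cache-demo");
    let pool = root.join("tenant-a").join("pool-1").join("chunks");
    fs::create_dir_all(&pool)?;
    fs::write(pool.join("chunk-0"), vec![0u8; 600])?;

    assert!(l2_insert_allowed(&root, 100, 1024)?);  // 600 + 100 <= 1024
    assert!(!l2_insert_allowed(&root, 500, 1024)?); // 600 + 500 > 1024

    fs::remove_dir_all(&root)?;
    Ok(())
}
```

Enforcement is cooperative, not atomic: two processes can race past the check simultaneously, which is why the 90% disk-pressure check remains as the hard backstop.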
Capacity enforcement:
- L1: strict LRU eviction at `max_memory_bytes`
- L2 organic mode: LRU eviction at `max_cache_bytes`
- L2 pinned mode: staging requests rejected with `CacheCapacityExceeded` when staged + proposed > `max_cache_bytes`. No eviction of pinned data.
- Combined pinned + organic: pinned chunks are never evicted by organic LRU. Organic eviction only considers non-pinned chunks.
- Node-wide: cooperative check against `max_node_cache_bytes` before any L2 insert.
9. Lifecycle
Process start (CC-ADV-1 resolved: crash recovery):
- If `KISEKI_CACHE_POOL_ID` is set: adopt the existing pool (§6 handoff)
- Otherwise: create a new pool with a CSPRNG `pool_id`
- Scavenge orphans: scan all pool directories under `$KISEKI_CACHE_DIR/<tenant_id>/`, attempt flock on each `pool.lock`. If the lock succeeds (no live holder), the pool is orphaned — zeroize all chunk files, delete the directory. This catches prior crashes.
- Resolve effective cache policy (§5)
- Initialize L1 (empty `HashMap`)
- Start background tasks: metadata TTL eviction, disk-pressure check, key health check (every `key_health_interval`), heartbeat RPC (every 60s for disconnect detection)
- Cache operational
Crash recovery service (kiseki-cache-scrub):
A systemd one-shot service (or cron job) that runs on node boot and
every 60 seconds. Scans $KISEKI_CACHE_DIR for all tenant/pool
directories, wipes any whose pool.lock has no live flock holder.
This covers the case where no subsequent kiseki process starts on the
node after a crash.
Steady state:
- Reads: L1 → L2 (CRC32 verify) → canonical (decrypt + SHA-256 verify → store in L1/L2 with CRC32 trailer)
- Writes: straight to canonical; update local metadata cache
- Background: periodic L1 expired-entry eviction, L2 disk-pressure check, key health check, heartbeat RPC
Disconnect (fabric unreachable):
- Reads from L1/L2 continue to be served (chunks are immutable)
- After `max_disconnect_seconds` with no successful RPC to any canonical endpoint: wipe the entire cache (I-CC6)
- On reconnect before the threshold: resume normal operation
Process exit (clean):
- Wipe L2 (zeroize all chunk files, delete pool directory)
- L1 freed with process memory (`Zeroizing` drop handles cleanup)
- Release flock on `pool.lock`
Process exit (crash):
- L2 chunk files remain on NVMe (no zeroize opportunity)
- Next process start or the `kiseki-cache-scrub` service detects the orphaned pool via flock check and wipes it
10. Configuration surface
| Linkage mode | Configuration mechanism |
|---|---|
| FUSE mount | Mount options: -o cache_mode=organic,cache_l2_max=50G,cache_dir=/tmp/kiseki |
| Rust API | CacheConfig struct passed to Client::new() |
| Python | kiseki.Client(cache_mode="pinned", cache_l2_max=50*1024**3) |
| C FFI | kiseki_open() with KisekiCacheConfig struct |
| Environment | KISEKI_CACHE_MODE, KISEKI_CACHE_DIR, KISEKI_CACHE_L1_MAX, KISEKI_CACHE_L2_MAX, KISEKI_CACHE_META_TTL_MS, KISEKI_CACHE_POOL_ID |
Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings (§5).
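The priority chain plus the policy clamp reduces to an `Option` fallback and a `min`. A sketch with an illustrative function name, shown for the L2 ceiling:

```rust
/// Resolve one config value: API/mount option wins over the environment,
/// which wins over the policy default; the result is always clamped to
/// the policy ceiling (§5). Name is illustrative, not the real API.
fn effective_l2_max(
    api: Option<u64>,
    env: Option<u64>,
    policy_default: u64,
    ceiling: u64,
) -> u64 {
    api.or(env).unwrap_or(policy_default).min(ceiling)
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // API value wins over env, but is clamped to the policy ceiling.
    assert_eq!(effective_l2_max(Some(100 * GIB), Some(20 * GIB), 50 * GIB, 50 * GIB), 50 * GIB);
    // No API value: the env var applies, already under the ceiling.
    assert_eq!(effective_l2_max(None, Some(20 * GIB), 50 * GIB, 50 * GIB), 20 * GIB);
    // Nothing set: policy default.
    assert_eq!(effective_l2_max(None, None, 50 * GIB, 50 * GIB), 50 * GIB);
}
```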
11. Observability
Cache metrics exposed via the client’s local metrics (not Prometheus — client runs on compute nodes, not storage nodes):
| Metric | Type | Description |
|---|---|---|
| cache_l1_hits | counter | L1 (memory) cache hits |
| cache_l2_hits | counter | L2 (NVMe) cache hits |
| cache_misses | counter | Cache misses (bypassed to canonical) |
| cache_bypasses | counter | Bypass mode reads (intentional non-cache) |
| cache_errors | counter | L2 I/O errors (bypassed to canonical per I-CC7) |
| cache_l1_bytes | gauge | Current L1 memory usage |
| cache_l2_bytes | gauge | Current L2 disk usage |
| cache_staged_datasets | gauge | Number of pinned datasets |
| cache_staged_bytes | gauge | Total bytes in pinned datasets |
| cache_meta_hits | counter | Metadata cache hits (within TTL) |
| cache_meta_misses | counter | Metadata cache misses (TTL expired or absent) |
| cache_wipes | counter | Full cache wipes (disconnect threshold, restart, crypto-shred) |
| cache_l2_read_latency_us | histogram | L2 NVMe read latency |
| cache_l2_write_latency_us | histogram | L2 NVMe write latency |
Metrics available via workflow advisory telemetry (scoped to caller)
and via local API (cache_stats()).
Consequences
Positive
- Repeat reads served from local NVMe: order-of-magnitude latency reduction for training datasets, inference weights, simulation input
- Staging API with scheduler handoff eliminates thundering-herd on job start
- Three modes match the three dominant workload patterns precisely
- Plaintext cache means cache hits avoid decryption cost entirely
- Policy model reuses the existing `TenantConfig` distribution — no new subsystem
- Content-addressed chunk immutability makes cache correctness simple (I-C1 is the foundation)
- Crash recovery via flock-based orphan detection + scrubber service
Negative
- Plaintext on local NVMe is a security surface. Mitigated by zeroize, file permissions, wipe-on-exit, crash scrubber, and ephemeral-only semantics (I-CC2, I-CC8). Residual FTL risk documented.
- Metadata TTL introduces a staleness window including for deleted files. Mitigated by short default (5s) and write-through for own writes (I-CC3, I-CC5)
- L2 NVMe cache competes with application use of local NVMe (e.g., scratch, checkpoint). Mitigated by configurable per-process ceiling, per-node ceiling, and disk-pressure backoff (§8)
- No cross-process chunk sharing within a tenant means duplicate chunks when multiple jobs for the same tenant run on the same node. Accepted trade-off: simplicity over hit-rate optimization
Neutral
- Bypass mode has zero overhead (no cache code on read path)
- Staging is idempotent and resumable
- Cache wipe on long disconnect is conservative but safe
- Policy distribution via data-path gRPC works in all deployment topologies (no gateway or control plane access required)
Adversarial findings
| ID | Severity | Section | Finding | Resolution |
|---|---|---|---|---|
| CC-ADV-1 | Critical | §1, §9 | Crash leaves plaintext on NVMe unreachable by zeroize. Process crash skips the exit wipe path. | Resolved: startup scavenger wipes orphaned pools (flock detection). kiseki-cache-scrub systemd/cron service runs on boot + every 60s for nodes where no subsequent kiseki process starts. Residual FTL risk documented. §9 updated. |
| CC-ADV-2 | Critical | §7 | Crypto-shred detection has no reliable delivery path. Advisory channel is best-effort. | Resolved: periodic key health check (default 30s) as primary detection. Advisory channel as fast path. KMS error on next operation as tertiary. Unreachability falls through to disconnect timer (I-CC6). Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30s default. §7 updated, I-CC12 revised. |
| CC-ADV-3 | High | §1 | L2 read verification unspecified. Full SHA-256 on every read too expensive for training throughput. | Resolved: CRC32 trailer on each L2 chunk file, verified on read. SHA-256 only at fetch time. CRC32 catches bit-flips at ~1 GB/s cost. CRC mismatch → bypass canonical + delete entry. New I-CC13. §1 updated. |
| CC-ADV-4 | High | §1 | cache.lock flock contradicts separate-pool semantics for concurrent same-tenant processes. | Resolved: per-process pool_id subdirectory (128-bit CSPRNG). Each process has own pool.lock. No contention between concurrent processes. L2 layout updated in §1. |
| CC-ADV-5 | High | §6 | Staging CLI is separate process — workload’s wipe-on-start destroys staged data. | Resolved: staging daemon holds flock; workload adopts pool via KISEKI_CACHE_POOL_ID env var instead of creating new pool. Handoff mechanism specified in §6. I-CC8 revised to include adoption path. |
| CC-ADV-6 | High | §5 | Policy resolution via gateway unreachable in some topologies. | Resolved: primary path is GetCachePolicy RPC on data-path gRPC channel to any storage node. No gateway or control plane access required. Fallback chain: data-path → gateway → persisted last-known → conservative defaults. §5 updated. |
| CC-ADV-7 | Medium | §3, §4 | Metadata TTL authority doesn’t explicitly cover file deletion case. | Resolved: I-CC3 text now explicitly states that serving data for a deleted file within TTL is an accepted consequence. I-CC5 updated to cover deletion. §3 updated. |
| CC-ADV-8 | Medium | §2 | Pinned mode has no mechanism to detect canonical dataset updates. | Resolved: documented as intentional. Pinned mode stages a point-in-time snapshot. Update requires explicit release + re-stage. §2 updated. |
| CC-ADV-9 | Medium | §8 | No aggregate capacity enforcement across processes on same node. | Resolved: max_node_cache_bytes policy attribute (default 80% of cache filesystem). Cooperative enforcement: each process sums all pools before inserting. Disk-pressure 90% as hard backstop. §8 updated, policy table updated. |
| CC-ADV-10 | Medium | §1 | NVMe FTL retains physical copies after software zeroize. | Resolved: acknowledged as residual risk. Recommended hardening: OPAL/SED NVMe with per-boot key rotation. §1 updated. |
| CC-ADV-11 | Medium | §6 | Staging conflates namespace path with single composition. | Resolved: staging flow now specifies recursive directory enumeration with max_staging_depth (default 10) and max_staging_files (default 100,000). §6 flow updated. |
| CC-ADV-12 | Low | §4 | I-CC3, I-CC4, I-CC5 partially overlap. | Resolved: I-CC3 and I-CC4 consolidated into single I-CC3 covering freshness, authority, and deletion case. I-CC5 retained as the externally-facing staleness guarantee. Invariant table updated. |
| CC-ADV-13 | Low | §9 | Disconnect detection mechanism unspecified. | Resolved: defined as “no successful RPC to any canonical endpoint for max_disconnect_seconds consecutive seconds.” Client maintains last_successful_rpc timestamp. Background heartbeat every 60s. I-CC6 updated with detection mechanism. |
| CC-ADV-14 | Low | §11 | Missing L2 read/write latency metrics. | Resolved: added cache_l2_read_latency_us and cache_l2_write_latency_us histograms to metrics table. §11 updated. |
Invariant impact
| Invariant | Impact |
|---|---|
| I-C1 | Foundation: chunk immutability enables the cache. No change to I-C1. |
| I-K1, I-K2 | Unchanged: plaintext never leaves the compute node. Cache stores plaintext locally, same trust domain as process memory. |
| I-WA18 | Reused: cache policy changes apply prospectively. |
| I-WA7 | Reused: scope narrowing pattern for policy hierarchy. |
New invariants
| ID | Invariant |
|---|---|
| I-CC1 | A chunk in pinned or organic mode is served from cache if and only if (a) the chunk was fetched from canonical and verified by chunk_id content-address match (SHA-256) at fetch time, and (b) no crypto-shred event has been detected for that tenant since fetch. Chunks are immutable in canonical (I-C1); therefore a verified chunk remains correct indefinitely absent crypto-shred. |
| I-CC2 | Cached plaintext is overwritten with zeros (zeroize) before deallocation, eviction, or cache wipe. File-level: overwrite contents before unlink. Memory-level: Zeroizing<Vec<u8>> for L1 entries. This provides logical-level erasure; physical-level erasure on flash storage requires hardware encryption (OPAL/SED). |
| I-CC3 | File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical. Within the TTL window, the cached mapping is authoritative: it may serve data for files that have since been modified or deleted in canonical. This is the sole freshness window in the cache design — chunk data itself has no TTL. |
| I-CC5 | Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical is visible to a caching client within at most one metadata TTL period (default 5s). |
| I-CC6 | Cached entries remain authoritative across fabric disconnects shorter than max_disconnect_seconds (default 300s). Beyond this threshold, the entire cache (L1 + L2) is wiped. Disconnect defined as: no successful RPC to any canonical endpoint for the threshold duration. Background heartbeat RPCs (every 60s) maintain the last_successful_rpc timestamp. |
| I-CC7 | Any local cache error (L2 I/O failure, CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. |
| I-CC8 | The cache is ephemeral. On process start, the client either creates a new L2 pool (wiping orphaned pools detected via flock) or adopts an existing pool via KISEKI_CACHE_POOL_ID. A kiseki-cache-scrub service runs on node boot and periodically to clean orphaned pools from crashed processes. |
| I-CC9 | When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). Policy is fetched via data-path gRPC (primary), gateway (secondary), persisted last-known (tertiary), or conservative defaults (fallback). |
| I-CC10 | Cache policy changes apply to new sessions only. Active sessions continue under session-start policy (consistent with I-WA18). |
| I-CC11 | Staged chunks are fetched from canonical, verified by content-address, and stored with pinned retention as a point-in-time snapshot. The staged version is immutable in the cache regardless of canonical updates. To pick up updates, the user must explicitly release and re-stage. Staging enumerates directory trees recursively up to max_staging_depth (10) and max_staging_files (100,000). |
| I-CC12 | On crypto-shred event, all cached plaintext for the affected tenant is wiped from L1 and L2 with zeroize. Detection via periodic key health check (default 30s), advisory channel notification, or KMS error on next operation. Maximum detection latency bounded by min(key_health_interval, max_disconnect_seconds). |
| I-CC13 | L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer. On L2 read, the CRC32 is verified before serving. Mismatch triggers bypass to canonical and L2 entry deletion. |
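As a concrete illustration of the I-CC13 check, the sketch below encodes an L2 entry with a 4-byte CRC32 trailer at insert time and verifies it on read, returning nothing on mismatch (the bypass-to-canonical path). The entry layout, function names, and little-endian trailer encoding are assumptions for illustration; only the CRC32 trailer and the mismatch-triggers-bypass behavior come from the invariant.

```rust
// Illustrative sketch of the I-CC13 read-path check. Names, entry layout,
// and little-endian trailer encoding are assumptions, not the real API.

/// CRC32 (IEEE 802.3, reflected, polynomial 0xEDB88320), bitwise.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if LSB set, else 0
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Insert time: append the 4-byte CRC trailer (little-endian assumed).
fn encode_entry(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Read time: verify the trailer before serving. `None` means
/// "bypass to canonical and delete the L2 entry" per I-CC13.
fn decode_entry(entry: &[u8]) -> Option<&[u8]> {
    if entry.len() < 4 {
        return None;
    }
    let (payload, trailer) = entry.split_at(entry.len() - 4);
    let stored = u32::from_le_bytes(trailer.try_into().ok()?);
    (crc32(payload) == stored).then_some(payload)
}

fn main() {
    let entry = encode_entry(b"chunk-bytes");
    assert_eq!(decode_entry(&entry), Some(&b"chunk-bytes"[..]));

    // A single flipped bit must fail verification and trigger bypass.
    let mut corrupted = entry.clone();
    corrupted[0] ^= 0x01;
    assert_eq!(decode_entry(&corrupted), None);
}
```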
Spec references
- specs/features/native-client.feature — cache hit/invalidation/staging scenarios (extend)
- specs/features/control-plane.feature — cache policy distribution scenarios (extend)
- specs/invariants.md — add I-CC1 through I-CC13
- specs/ubiquitous-language.md — add cache-specific terms
- specs/failure-modes.md — add F-CC1 through F-CC4
- specs/assumptions.md — add A-CC1 through A-CC4
ADR-032: Async GatewayOps
Status: Accepted
Date: 2026-04-24
Traces: I-L2, I-L5, I-V3, I-WA2, I-C2, I-C5, I-L8
Context
GatewayOps is a synchronous trait used by all three protocol gateways
(S3, NFS, FUSE) to perform reads and writes through the composition and
chunk stores. When the Raft-backed log store was introduced, the sync
trait required a sync→async bridge (run_on_raft) that blocks the
calling thread while waiting for Raft consensus.
Under concurrent load (concurrent requests ≥ the Raft runtime thread count), this causes thread
starvation: all Raft threads are occupied polling client_write futures,
leaving no thread for the Raft core loop to dispatch entries. The current
mitigation (KISEKI_RAFT_THREADS = cpus/2) works but wastes resources
and imposes a concurrency ceiling equal to the thread count.
For HPC/ML workloads with hundreds to thousands of concurrent writers, the thread-per-request model is unsustainable. The gateway must not block OS threads while waiting for Raft consensus.
Decision
Make GatewayOps an async trait. All protocol gateways call async
methods directly. NFS and FUSE callers bridge async→sync via
tokio::runtime::Handle::block_on on a dedicated runtime (the reverse
of the current problem, but on threads they own — OS threads that are
explicitly meant to block).
Trait change
// Before (sync)
pub trait GatewayOps: Send + Sync {
    fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    fn list(...) -> Result<...>;
    fn delete(...) -> Result<...>;
    // ...
}

// After (async)
pub trait GatewayOps: Send + Sync {
    async fn write(&self, req: WriteRequest) -> Result<WriteResponse, GatewayError>;
    async fn read(&self, req: ReadRequest) -> Result<ReadResponse, GatewayError>;
    async fn list(...) -> Result<...>;
    async fn delete(...) -> Result<...>;
    // ...
}
Mutex strategy
Replace std::sync::Mutex with tokio::sync::Mutex for
CompositionStore and ChunkStore in InMemoryGateway. Lock guards
must NOT be held across .await points that perform disk I/O or Raft
submissions — acquire, do in-memory work, drop, then await I/O.
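The acquire/work/drop/await rule can be illustrated by scoping the guard to a block so it drops before the slow operation begins. The sketch below uses std::sync::Mutex and a synchronous stub for the awaited work to stay dependency-free; the ADR itself specifies tokio::sync::Mutex, and every name here is hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Stand-in for the awaited Raft submission or disk I/O; in the real
// gateway this would be an `.await` on an async operation.
fn slow_io_or_raft_submit(_delta: &str) {}

fn write(store: &Mutex<HashMap<String, u64>>, key: &str) -> u64 {
    // 1. Acquire, do the in-memory work, and drop the guard at the end
    //    of this block, before any slow I/O or Raft submission.
    let version = {
        let mut guard = store.lock().unwrap();
        let v = guard.entry(key.to_string()).or_insert(0);
        *v += 1;
        *v
    }; // guard dropped here

    // 2. Perform the slow work with no lock held, so other writers are
    //    not serialized behind it.
    slow_io_or_raft_submit(key);
    version
}

fn main() {
    let store = Mutex::new(HashMap::new());
    assert_eq!(write(&store, "obj-a"), 1);
    assert_eq!(write(&store, "obj-a"), 2);
}
```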
Protocol gateway changes
| Protocol | Current | After |
|---|---|---|
| S3 (axum) | block_in_place(|| gateway.write()) | gateway.write().await |
| NFS (std::thread) | gateway.write() | rt.block_on(gateway.write()) on NFS thread |
| FUSE (fuser threads) | gateway.write() | rt.block_on(gateway.write()) on fuser thread |
S3 becomes fully non-blocking. NFS and FUSE threads block as before, but they block on threads they own (not tokio worker threads), so there is no starvation.
LogOps bridge
LogOps::append_delta becomes async. The run_on_raft bridge is
removed — the Raft runtime’s handle is used directly via .await from
async gateway methods. No mpsc::recv blocking, no thread starvation.
Invariant preservation
The async conversion preserves all invariants by maintaining the same
happens-before ordering via .await:
| Invariant | Guarantee |
|---|---|
| I-L2 | Gateway awaits Raft commit before returning to client |
| I-L5 | Chunk writes awaited before composition finalize delta |
| I-V3 | Read-your-writes: last_written_seq set after awaited write |
| I-C2 | Refcount ops after awaited chunk confirm |
| I-C5 | Capacity check before async write submission |
| I-L8 | Shard membership validated before async rename |
| I-WA2 | Advisory lookups remain sync + bounded (≤500 µs timeout) |
Concurrency model
With async GatewayOps, the concurrency ceiling becomes the tokio task limit (effectively unbounded) instead of the thread count. Thousands of concurrent writes share a fixed thread pool without starvation.
Migration
Big-bang conversion. All callers updated in one pass:
- Make GatewayOps async (trait + InMemoryGateway impl)
- Replace std::sync::Mutex → tokio::sync::Mutex in gateway
- Make LogOps async, remove the run_on_raft bridge
- Update S3 handlers: remove block_in_place, use .await
- Update NFS server: add rt.block_on() wrapper on NFS threads
- Update FUSE daemon: add rt.block_on() wrapper on fuser threads
- Update all tests + BDD step definitions
- Remove KISEKI_RAFT_THREADS (no longer needed)
Consequences
Benefits:
- No thread starvation under any concurrency level
- S3 handler is fully non-blocking (proper async axum)
- Removes run_on_raft, block_in_place, KISEKI_RAFT_THREADS
- Single Raft runtime (no dedicated runtime needed)
- Clean async-all-the-way data path
Costs:
- Large refactor touching all protocol gateways and tests
- NFS/FUSE need a tokio runtime handle for block_on
- tokio::sync::Mutex has slightly higher per-lock overhead than std::sync::Mutex (but eliminates thread starvation)
- Async trait requires Send + 'static bounds on futures
Risks:
- tokio::sync::Mutex held across .await can cause deadlocks if not careful. Mitigated by a code review rule: never hold the gateway mutex across a Raft submission or disk I/O.
- NFS/FUSE block_on on a non-tokio thread: works correctly, but must not be called from within a tokio context (the same issue already solved with std::thread::spawn for runtime creation).
Implementation Notes (2026-04-24)
CompositionOps reverted to sync. The initial implementation made
CompositionOps async, but holding tokio::sync::Mutex<CompositionStore>
across emit_delta().await serialized all writes behind a single Raft
round-trip — the same bottleneck as before, just without thread starvation.
Final architecture:
- GatewayOps: async (S3 handlers await directly)
- LogOps: async (Raft consensus)
- CompositionOps: sync (in-memory HashMap operations only)
Gateway write pattern (lock-free):
- Lock compositions → create() (sync, microseconds) → drop lock
- Emit delta to log (async, Raft consensus, ~8ms) — no lock held
- If emission fails, re-acquire lock and rollback (PIPE-ADV-1)
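The three steps above can be sketched as follows. All types and names are illustrative stand-ins (the real emission is an async Raft round-trip, faked here with a boolean); the point is the lock scoping and the PIPE-ADV-1 rollback when emission fails.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch of the lock-scoped gateway write pattern with the
// PIPE-ADV-1 rollback. Types and names are hypothetical, not the real API.
struct Gateway {
    compositions: Mutex<HashMap<String, String>>,
}

impl Gateway {
    fn write(&self, id: &str, body: &str, emit_ok: bool) -> Result<(), String> {
        // Step 1: lock, create in memory (microseconds), drop the lock.
        {
            let mut comps = self.compositions.lock().unwrap();
            comps.insert(id.to_string(), body.to_string());
        }

        // Step 2: emit the delta with no lock held. In the real gateway
        // this is the async Raft round-trip; a boolean stands in here.
        let emitted = emit_ok;

        // Step 3: on emission failure, re-acquire the lock and roll back.
        if !emitted {
            self.compositions.lock().unwrap().remove(id);
            return Err("delta emission failed; composition rolled back".into());
        }
        Ok(())
    }
}

fn main() {
    let gw = Gateway { compositions: Mutex::new(HashMap::new()) };

    assert!(gw.write("a", "v1", true).is_ok());
    assert!(gw.compositions.lock().unwrap().contains_key("a"));

    // A failed emission leaves no trace of the composition (rollback).
    assert!(gw.write("b", "v1", false).is_err());
    assert!(!gw.compositions.lock().unwrap().contains_key("b"));
}
```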
NFS/FUSE bridge: block_gateway() helper uses block_in_place
when on a tokio worker thread (tests), or direct block_on on OS
threads (production NFS/FUSE daemon).
Result: 1MB write throughput: 39.5 → 380.2 MB/s (9.6x improvement). 32 concurrent S3 PUTs complete in 50ms with no deadlock.