ADR-031: Client-Side Cache
Status: Accepted
Date: 2026-04-23
Deciders: Architect + domain expert
Adversarial review: 2026-04-23 (14 findings: 2C 4H 4M 4L, all resolved)
Context
ADR-013 (POSIX semantics scope), ADR-019 (gateway deployment model),
ADR-020 (workflow advisory), ADR-030 (dynamic small-file placement),
control-plane.feature (policy distribution precedent),
native-client.feature (client architecture).
CSCS workload mix: LLM pretraining (epoch reuse of tokenized datasets), LLM inference (model weight cold-start), climate/weather simulation (bounded input staging with hard deadlines), HPC checkpoint/restart. Common pattern: compute nodes repeatedly pull the same encrypted chunks across the fabric.
Existing client architecture: kiseki-client crate with feature flags
(fuse, ffi, python, pure-Rust default). Performs tenant-layer
encryption — plaintext never leaves the workload process. The existing
ClientCache is an in-memory HashMap<ChunkId, Vec<u8>> with TTL and
max-entries eviction.
Problem
- Repeat reads of the same chunks cross the fabric unnecessarily. Training datasets are read epoch after epoch. Inference weights are loaded identically by multiple model replicas. Climate boundary conditions are staged identically to every simulation rank.
- The in-memory cache (current ClientCache) is bounded by process memory, which is primarily needed for computation. Compute-node NVMe is available and underutilized.
- No mechanism for pre-staging datasets. Jobs start with a cold cache and pay first-access latency on every rank simultaneously, creating a thundering-herd pattern on the storage fabric.
- No cache mode differentiation. Training (pin everything), inference (pin weights, LRU prompts), and HPC checkpoint (don’t cache) have fundamentally different cache needs.
Decision
1. Cache architecture
The client-side cache is a library-level module in kiseki-client,
shared across all linkage modes (FUSE, FFI, Python, native Rust). It
operates on decrypted plaintext chunks keyed by ChunkId.
canonical (fabric) → decrypt → cache store (NVMe) → serve to caller
                                      ↑
                       cache hit path (no fabric, no decrypt)
Two-tier storage:
| Tier | Backing | Purpose | Eviction |
|---|---|---|---|
| Hot (L1) | In-memory HashMap | Sub-microsecond hits for active working set | LRU, bounded by max_memory_bytes |
| Warm (L2) | Local NVMe file or directory | Large capacity for datasets and weights | Per-mode policy (see §2) |
L2 layout on NVMe (CC-ADV-4 resolved: per-process subdirectories):
$KISEKI_CACHE_DIR/
├── <tenant_id_hex>/
│ ├── <pool_id>/ ← per-process pool (128-bit CSPRNG)
│ │ ├── chunks/
│ │ │ ├── <prefix>/
│ │ │ │ └── <chunk_id_hex> ← plaintext + CRC32 trailer
│ │ │ └── ...
│ │ ├── meta/
│ │ │ └── file_chunks.db
│ │ ├── staging/
│ │ │ └── <dataset_id>.manifest
│ │ └── pool.lock ← flock, proves process is alive
│ └── <pool_id>/ ← another concurrent process
│ └── ...
└── ...
Each client process creates its own pool_id directory (128-bit
CSPRNG, same generation as client_id per I-WA4). The pool.lock
file holds an flock for the process lifetime. Multiple concurrent
same-tenant processes on the same node have fully independent pools
with no contention.
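A minimal sketch of pool creation under these rules, assuming the fs2 crate for flock and rand’s OsRng for the pool_id (helper name and error handling are illustrative):

```rust
use std::{fs, io, path::{Path, PathBuf}};
use fs2::FileExt;   // assumed flock wrapper crate
use rand::RngCore;  // OsRng supplies the 128-bit CSPRNG pool_id

/// Create a per-process L2 pool. The returned lock file must stay open for
/// the process lifetime, so that pool.lock proves the process is alive.
fn create_pool(cache_dir: &Path, tenant_id_hex: &str) -> io::Result<(PathBuf, fs::File)> {
    let mut id = [0u8; 16]; // 128-bit, same generation as client_id (I-WA4)
    rand::rngs::OsRng.fill_bytes(&mut id);
    let pool_id: String = id.iter().map(|b| format!("{b:02x}")).collect();

    let pool = cache_dir.join(tenant_id_hex).join(&pool_id);
    for sub in ["chunks", "meta", "staging"] {
        fs::create_dir_all(pool.join(sub))?;
    }
    let lock = fs::File::create(pool.join("pool.lock"))?;
    lock.try_lock_exclusive()?; // held until exit; orphaned pools fail this probe
    Ok((pool, lock))
}
```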
L2 integrity (CC-ADV-3 resolved): Each L2 chunk file stores the plaintext data followed by a 4-byte CRC32 trailer, computed at insert time. On L2 read, the CRC32 is verified before serving. Full SHA-256 content-address verification occurs only at fetch time (when the chunk is first retrieved from canonical). CRC32 catches bit-flips and filesystem corruption at ~1 GB/s throughput cost. CRC mismatch triggers bypass to canonical and L2 entry deletion (I-CC7).
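A sketch of the trailer framing, assuming the crc32fast crate and a little-endian trailer (the ADR fixes only a 4-byte CRC32; the byte order here is an assumption):

```rust
use crc32fast::Hasher;

/// Frame a chunk for L2 at insert time: plaintext followed by a CRC32 trailer.
fn frame_chunk(plaintext: &[u8]) -> Vec<u8> {
    let mut h = Hasher::new();
    h.update(plaintext);
    let mut framed = Vec::with_capacity(plaintext.len() + 4);
    framed.extend_from_slice(plaintext);
    framed.extend_from_slice(&h.finalize().to_le_bytes());
    framed
}

/// Verify on L2 read. None means CRC mismatch: bypass to canonical and
/// delete the entry (I-CC7).
fn unframe_chunk(framed: &[u8]) -> Option<&[u8]> {
    let (data, trailer) = framed.split_at(framed.len().checked_sub(4)?);
    let mut h = Hasher::new();
    h.update(data);
    (h.finalize() == u32::from_le_bytes(trailer.try_into().ok()?)).then_some(data)
}
```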
Security model (plaintext cache):
The L2 cache holds decrypted plaintext on local NVMe. This is acceptable because:
- The compute node already holds decrypted data in process memory (computation requires plaintext)
- L2 NVMe is local to the compute node, same trust domain as process memory
- L2 is ephemeral — wiped on process exit and on long disconnect
- zeroize on eviction/wipe: overwrite chunk data before deallocation (I-CC2)
- File permissions: 0600, owned by process UID
- Crash recovery: startup scavenger + periodic scrubber clean orphaned pools (CC-ADV-1 resolved, see §9)
Residual risk (CC-ADV-10 acknowledged): Software zeroize on NVMe/SSD provides logical-level erasure only. The Flash Translation Layer may retain physical copies of overwritten data until internal garbage collection. For deployments requiring physical erasure guarantees, use NVMe drives with hardware encryption (OPAL/SED) and rotate the drive encryption key on node reboot. This is an operational hardening measure, not a baseline requirement.
2. Cache modes
Three modes, selectable per client instance at session establishment:
Pinned mode
For workloads that declare their dataset upfront: training runs (epoch reuse), inference (model weights), climate (boundary conditions).
- Chunks are retained against eviction until explicit release
- Populated via the staging API (§6) or on first access
- L2 is the primary tier; L1 is a hot subset
- Eviction: only on explicit release() or process exit
- Capacity bounded by max_cache_bytes (§8); staging beyond capacity returns an error, does not evict pinned chunks
Dataset versioning (CC-ADV-8 resolved): Pinned mode stages a
point-in-time snapshot of the dataset. The staged version is immutable
in the cache regardless of canonical updates. This is intentional —
training runs require a stable dataset across epochs. To pick up
dataset updates, the user must explicitly release and re-stage.
There is no automatic dataset-level version check.
Organic mode
Default for mixed workloads. LRU with usage-weighted retention.
- Chunks cached on first read, evicted on LRU when capacity is reached
- Frequently accessed chunks promoted to L1
- L2 eviction: LRU by last-access timestamp, weighted by access count (chunks accessed N times survive N eviction rounds)
- Metadata cache (file→chunk_list) with configurable TTL (default 5s)
Bypass mode
For workloads that don’t benefit from caching: streaming ingest, one-shot scans, checkpoint writes, compute-bound codes with no repeat reads.
- All reads go directly to canonical
- No L1 or L2 storage consumed
- Zero overhead beyond mode selection
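The three modes reduce to a small selection surface in kiseki-client. A minimal sketch (type name and derive list illustrative):

```rust
/// Cache mode, selected per client instance at session establishment (§2).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CacheMode {
    /// Retain staged chunks until explicit release() or process exit.
    Pinned,
    /// LRU with usage-weighted retention; hot chunks promote to L1.
    Organic,
    /// Every read goes to canonical; no L1/L2 storage consumed.
    Bypass,
}
```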
3. Metadata cache
The cache stores file-to-chunk-list mappings with a bounded TTL:
```rust
use std::time::{Duration, Instant};

struct MetadataEntry {
    chunk_list: Vec<ChunkId>, // immutable chunks composing the file at fetch time
    fetched_at: Instant,      // when the mapping was fetched from canonical
    ttl: Duration,            // freshness window (default 5s)
}
```
I-CC3 (metadata freshness and authority): File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical before serving chunks that depend on it. Within the TTL window, the cached mapping is authoritative — it may serve data for files that have since been modified or deleted in canonical. This is an accepted consequence of the TTL window, not a correctness violation. Modifications create new compositions with new chunk_ids; the old mapping points to valid immutable chunks that were the file’s content at fetch time. Deletions remove the composition; the cached mapping continues to serve the deleted file’s data until TTL expiry.
I-CC5 (staleness bound): Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical will be visible to a caching client within at most one metadata TTL period. The default TTL (5 seconds) balances freshness against metadata lookup cost.
Write-through: When the client writes a file (creating new chunks and a new composition), the local metadata cache is updated immediately with the new chunk list. This provides read-your-writes consistency within a single client process without waiting for TTL expiry.
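A sketch of the TTL check and the write-through update, assuming a HashMap-backed store and a FilePath key type (both illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

impl MetadataEntry {
    /// Serve from cache only inside the TTL window (I-CC3).
    fn is_fresh(&self) -> bool {
        self.fetched_at.elapsed() < self.ttl
    }
}

/// Write-through on a local write: the new composition replaces the cached
/// mapping immediately, giving read-your-writes without waiting for expiry.
fn on_local_write(
    meta: &mut HashMap<FilePath, MetadataEntry>,
    path: FilePath,
    new_chunks: Vec<ChunkId>,
    ttl: Duration,
) {
    meta.insert(path, MetadataEntry {
        chunk_list: new_chunks,
        fetched_at: Instant::now(),
        ttl,
    });
}
```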
4. Correctness invariants
The cache’s correctness rests on a small set of stated invariants. Each case where the cache serves (rather than bypasses) is backed by one or more of these invariants. Cases not covered bypass to canonical.
I-CC1 (chunk immutability): Chunks are immutable in canonical (I-C1). A chunk fetched, verified by content-address (SHA-256 of plaintext matches chunk_id derivation), and stored in cache is correct for all future reads of that chunk_id. No TTL needed for chunk data.
I-CC2 (plaintext security): Cached plaintext is overwritten with
zeros (zeroize) before deallocation, eviction, or cache wipe.
File-level: overwrite contents before unlink. Memory-level:
Zeroizing<Vec<u8>> for L1 entries. This provides logical-level
erasure; physical-level erasure on flash storage requires hardware
encryption (see §1 residual risk).
I-CC6 (disconnect threshold): Cached entries remain authoritative
across fabric disconnects shorter than max_disconnect_seconds
(default 300s). Beyond this threshold, the entire cache (L1 + L2) is
wiped. Disconnect is defined as: no successful RPC to any canonical
endpoint (storage node or gateway) for max_disconnect_seconds
consecutive seconds. The client maintains a last_successful_rpc
timestamp updated on every successful data-path or heartbeat RPC.
Background heartbeat RPCs (every 60s, piggybacked on metadata TTL
refresh when idle) keep this timestamp current. Transient single-RPC
failures do not trigger the disconnect timer — only sustained
unreachability across all endpoints does.
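A minimal sketch of the detection state, using std types only (struct and method names illustrative):

```rust
use std::time::{Duration, Instant};

/// Disconnect detection per I-CC6: wipe L1 + L2 once no canonical endpoint
/// has answered for max_disconnect_seconds.
struct DisconnectTracker {
    last_successful_rpc: Instant, // updated on every successful data-path or heartbeat RPC
    max_disconnect: Duration,     // policy max_disconnect_seconds (default 300s)
}

impl DisconnectTracker {
    fn record_success(&mut self) {
        self.last_successful_rpc = Instant::now();
    }

    /// True once sustained unreachability crosses the threshold. Transient
    /// single-RPC failures never trip this; only elapsed time does.
    fn should_wipe(&self) -> bool {
        self.last_successful_rpc.elapsed() >= self.max_disconnect
    }
}
```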
I-CC7 (error bypass): Any local cache error (L2 I/O failure, corrupt chunk detected by CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. Failed L2 reads are not retried from L2 — they go to canonical immediately.
I-CC8 (wipe on restart / crash recovery): On process start, the
client either creates a new L2 pool (wiping any prior orphaned pools)
or adopts an existing pool identified by KISEKI_CACHE_POOL_ID
environment variable (see §6 staging handoff). Orphaned pools are
detected by attempting flock on each pool.lock — if the lock
succeeds, the pool is orphaned (no live process holds it) and is
wiped (zeroized and deleted). A separate kiseki-cache-scrub service
runs on node boot and periodically (every 60s) to clean orphaned
pools across all tenants, covering crash recovery when no subsequent
kiseki process starts on that node.
I-CC13 (L2 integrity): L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer on each chunk file. On L2 read, the CRC32 is verified before serving. CRC mismatch triggers bypass to canonical and L2 entry deletion.
5. Policy authority and distribution
Cache policy follows the same distribution mechanism as quotas
(per control-plane.feature scenario “Quota enforcement during
control plane outage”).
Policy hierarchy
cluster default → org override → project override → workload override
→ session selection
Each level narrows (never broadens) the parent’s settings, consistent with ADR-020 / I-WA7.
Policy attributes
| Attribute | Type | Admin levels | Client selectable | Default |
|---|---|---|---|---|
| cache_enabled | bool | cluster, org, project, workload | No | true |
| allowed_modes | set{pinned, organic, bypass} | cluster, org | No | {pinned, organic, bypass} |
| max_cache_bytes | u64 | cluster, org, workload | Up to ceiling | 50 GB |
| max_node_cache_bytes | u64 | cluster | No | 80% of cache filesystem |
| metadata_ttl_ms | u64 | cluster, org | Up to ceiling | 5000 |
| max_disconnect_seconds | u64 | cluster | No | 300 |
| key_health_interval_ms | u64 | cluster | No | 30000 |
| staging_enabled | bool | cluster, org | No | true |
| mode | enum | workload (default) | Yes (within allowed) | organic |
Narrowing rules (same as I-WA7):
- cache_enabled = false at any level → disabled for all children
- allowed_modes at child ⊆ allowed_modes at parent
- max_cache_bytes at child ≤ max_cache_bytes at parent
- metadata_ttl_ms at child ≤ metadata_ttl_ms at parent
Distribution mechanism
Cache policy is carried in the same TenantConfig structure that
carries quotas. At session establishment, the client resolves its
effective policy through multiple paths (CC-ADV-6 resolved):
- Primary: GetCachePolicy RPC on the data-path gRPC channel to any connected storage node. Storage nodes have TenantConfig (the same data they use for quota enforcement). No gateway or control plane reachability required — the client only needs the data fabric.
- Secondary: fetch from the gateway’s locally-cached TenantConfig (if the gateway is reachable)
- Stale tolerance: last-known policy persisted in the L2 pool directory (policy.json) remains effective during outages, consistent with the quota scenario in control-plane.feature
- Fallback: if no policy is resolvable (first-ever session, all paths unreachable), use conservative defaults (cache enabled, organic mode, 10 GB max, 5s TTL)
- Reconciliation: on control-plane recovery, client re-fetches policy and applies prospectively (I-WA18 pattern — active sessions continue under session-start policy; new sessions use updated policy)
No parallel policy-distribution path is introduced. Cache policy is
one more field in TenantConfig, alongside quotas, compliance tags,
and advisory settings.
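A sketch of the fallback chain; apart from the GetCachePolicy RPC named above, all method names are illustrative:

```rust
/// Resolution chain per §5 (CC-ADV-6): data path → gateway → persisted → defaults.
async fn resolve_policy(client: &Client) -> CachePolicy {
    if let Ok(p) = client.get_cache_policy_via_data_path().await {
        return p; // primary: any connected storage node
    }
    if let Ok(p) = client.get_cache_policy_via_gateway().await {
        return p; // secondary: gateway's cached TenantConfig
    }
    if let Some(p) = client.load_persisted_policy() {
        return p; // stale tolerance: policy.json in the L2 pool
    }
    CachePolicy::conservative_defaults() // I-CC9: organic, 10 GB, 5s TTL
}
```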
I-CC9 (policy fallback): When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). The cache is a performance feature; failing to resolve policy must not prevent data access.
I-CC10 (prospective policy): Cache policy changes apply to new sessions only. Active sessions continue under the policy effective at session establishment, consistent with I-WA18.
6. Staging API
Client-local operation for pre-populating the cache with a dataset’s chunks in pinned mode. Pull-based — the client fetches from canonical.
Interface
# CLI (Slurm prolog, manual use)
kiseki-client stage --dataset <namespace_path> [--timeout <seconds>]
kiseki-client stage --status [--dataset <namespace_path>]
kiseki-client stage --release <namespace_path>
kiseki-client stage --release-all
# Rust API
impl CacheManager {
async fn stage(&self, namespace_path: &str) -> Result<StageResult>;
fn stage_status(&self) -> Vec<StagedDataset>;
fn release(&self, namespace_path: &str);
fn release_all(&self);
}
# Python API (via PyO3)
client.stage(namespace_path="/training/imagenet")
client.stage_status()
client.release(namespace_path="/training/imagenet")
# C FFI
kiseki_stage(handle, "/training/imagenet", timeout_secs)
kiseki_stage_status(handle, &status)
kiseki_release(handle, "/training/imagenet")
Flow (CC-ADV-11 resolved: directory tree handling)
- Resolve namespace_path to composition metadata via canonical. If the path is a directory, recursively enumerate all files (compositions) up to max_staging_depth (default 10) and max_staging_files (default 100,000). If limits are exceeded, return an error with the count of files discovered.
- Extract the full chunk list from all resolved compositions
- For each chunk not already in L2: fetch from canonical, decrypt, verify content-address (SHA-256), store in L2 with CRC32 trailer and pinned retention
- Write staging/<dataset_id>.manifest listing all compositions, their chunk_ids, and the total byte count
- Report progress (chunks staged / total, bytes, elapsed)
Staging is idempotent — re-staging an already-staged dataset is a no-op (chunks already present). Partial staging (interrupted) can be resumed by re-running the command.
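A per-chunk sketch of the fetch-verify-store step of the flow; the field and helper names on CacheManager are illustrative:

```rust
impl CacheManager {
    /// Stage one chunk (§6 flow). Idempotent: chunks already in L2 are
    /// skipped, which also makes interrupted runs resumable.
    async fn stage_chunk(&self, chunk_id: ChunkId) -> Result<(), StageError> {
        if self.l2.contains(chunk_id) {
            return Ok(()); // already staged; no-op
        }
        let ciphertext = self.fabric.fetch_chunk(chunk_id).await?;
        let plaintext = self.keys.decrypt(&ciphertext)?;
        // Content-address verification at fetch time (I-CC1).
        if !chunk_id.matches_sha256(&plaintext) {
            return Err(StageError::ContentAddressMismatch(chunk_id));
        }
        self.l2.insert_pinned(chunk_id, &plaintext)?; // CRC32 trailer added on insert
        Ok(())
    }
}
```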
Staging handoff (CC-ADV-5 resolved)
The staging CLI creates a cache pool and holds the pool.lock flock
for its lifetime. The workload process adopts the staging pool instead
of creating a new one:
- Staging CLI: kiseki-client stage --dataset /training/imagenet
  - Creates pool, writes pool_id to stdout and to $KISEKI_CACHE_DIR/<tenant>/staging_pool_id
  - Stages chunks, holds flock, stays alive (daemon mode)
- Workload process: sets KISEKI_CACHE_POOL_ID=<pool_id> (from Slurm prolog output, Lattice env injection, or the file)
  - On start, detects existing pool with matching pool_id
  - Adopts pool: takes over flock from staging daemon
  - Staging daemon detects flock loss, exits cleanly
- If KISEKI_CACHE_POOL_ID is not set: normal fresh-pool behavior (create new pool, wipe orphans)
Slurm integration:
# prolog.sh:
POOL_ID=$(kiseki-client stage --dataset /training/imagenet --daemon)
echo "export KISEKI_CACHE_POOL_ID=$POOL_ID" >> $SLURM_EXPORT_FILE
# epilog.sh:
kiseki-client stage --release-all --pool $KISEKI_CACHE_POOL_ID
Lattice integration: injects KISEKI_CACHE_POOL_ID into the
workload environment after parallel staging completes across the
node set. Queries stage --status to verify readiness before
launching the workload.
I-CC11 (staging correctness): Staged chunks are fetched from
canonical, verified by content-address, and stored with pinned
retention. The staging manifest records the compositions and chunk_ids
at staging time as a point-in-time snapshot. If the dataset is modified
in canonical after staging, the staged version remains correct for its
chunk_ids (immutable chunks) but stale relative to the current dataset
version. To pick up updates, the user must explicitly release and
re-stage.
7. Cache invalidation
The cache is primarily self-consistent due to chunk immutability (I-C1). Explicit invalidation is needed only for metadata:
Metadata invalidation: TTL-based. No push invalidation from canonical to client. The metadata TTL is the sole freshness mechanism.
Chunk invalidation: Not needed under normal operation (chunks are immutable). Two exceptional cases:
- Crypto-shred (CC-ADV-2 resolved): When a tenant’s KEK is destroyed, all cached plaintext for that tenant must be wiped. Detection runs along three paths, with unreachability as a final backstop:
  - Periodic key health check: the client pings the KMS every key_health_interval (default 30s). If the tenant KEK is reported destroyed (KEK_DESTROYED error), wipe immediately.
  - Advisory channel: if connected, receives the shred notification immediately (fast path, best-effort).
  - KMS error on next operation: any key fetch that returns KEK_DESTROYED triggers an immediate wipe.
  - Unreachability fallback: if the KMS is unreachable for max_disconnect_seconds, the disconnect timer triggers a full cache wipe (I-CC6), which covers the case where the KMS is unreachable because the KEK was destroyed.

  Maximum time between a crypto-shred event and the cache wipe is bounded by min(key_health_interval, max_disconnect_seconds) — default 30 seconds.
- Key rotation: When the system key epoch rotates, existing cached plaintext remains valid (same content, different encryption at rest). No cache action needed — the cache holds plaintext, not ciphertext.
I-CC12 (crypto-shred wipe): On crypto-shred event, all cached
plaintext for the affected tenant is wiped from L1 and L2 with
zeroize. Detection bounded by key_health_interval (default 30s).
No cached data from a shredded tenant is served after detection.
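A sketch of the primary detection path as a background task, assuming tokio for timing; KmsClient, KekStatus, and wipe_tenant_cache are illustrative:

```rust
use std::time::Duration;
use tokio::time::interval;

/// Background key health check (§7): primary crypto-shred detection path.
async fn key_health_task(kms: KmsClient, tenant: TenantId, every: Duration) {
    let mut tick = interval(every); // key_health_interval, default 30s
    loop {
        tick.tick().await;
        match kms.check_kek(tenant).await {
            Ok(KekStatus::Active) => {} // healthy: nothing to do
            Ok(KekStatus::Destroyed) => {
                wipe_tenant_cache(tenant).await; // I-CC12: zeroize L1 + L2
                return;
            }
            Err(_) => {} // unreachable KMS: the disconnect timer (I-CC6) is the backstop
        }
    }
}
```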
8. Capacity management
Per-process limits:
| Parameter | Default | Source |
|---|---|---|
| max_memory_bytes (L1) | 256 MB | env KISEKI_CACHE_L1_MAX or API |
| max_cache_bytes (L2) | 50 GB | policy ceiling or env KISEKI_CACHE_L2_MAX |
Per-node limit (CC-ADV-9 resolved):
max_node_cache_bytes (default: 80% of $KISEKI_CACHE_DIR filesystem
capacity). Enforced cooperatively: before inserting into L2, each
process sums total usage across all pool directories in
$KISEKI_CACHE_DIR. If total exceeds max_node_cache_bytes, the
insert is rejected (organic: evict first; pinned: staging error).
The disk-pressure check (90% filesystem utilization) remains as a
hard backstop.
Capacity enforcement:
- L1: strict LRU eviction at max_memory_bytes
- L2 organic mode: LRU eviction at max_cache_bytes
- L2 pinned mode: staging requests rejected with CacheCapacityExceeded when staged + proposed > max_cache_bytes. No eviction of pinned data.
- Combined pinned + organic: pinned chunks are never evicted by organic LRU. Organic eviction only considers non-pinned chunks.
- Node-wide: cooperative check against max_node_cache_bytes before any L2 insert (sketched below).
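A sketch of the cooperative check using only std (helper names illustrative):

```rust
use std::{fs, io, path::Path};

/// Cooperative per-node ceiling (§8, CC-ADV-9): sum usage across every pool
/// under $KISEKI_CACHE_DIR before an L2 insert; reject when over the cap.
fn node_cache_bytes(cache_dir: &Path) -> io::Result<u64> {
    fn dir_size(dir: &Path) -> io::Result<u64> {
        let mut total = 0u64;
        for entry in fs::read_dir(dir)? {
            let entry = entry?;
            let meta = entry.metadata()?;
            total += if meta.is_dir() { dir_size(&entry.path())? } else { meta.len() };
        }
        Ok(total)
    }
    dir_size(cache_dir) // every tenant's pools: chunks + meta + staging
}

fn may_insert(cache_dir: &Path, incoming: u64, max_node: u64) -> io::Result<bool> {
    Ok(node_cache_bytes(cache_dir)? + incoming <= max_node)
}
```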
9. Lifecycle
Process start (CC-ADV-1 resolved: crash recovery):
- If KISEKI_CACHE_POOL_ID is set: adopt the existing pool (§6 handoff)
- Otherwise: create a new pool with a CSPRNG pool_id
- Scavenge orphans: scan all pool directories under $KISEKI_CACHE_DIR/<tenant_id>/, attempt flock on each pool.lock. If the lock succeeds (no live holder), the pool is orphaned — zeroize all chunk files, delete the directory. This catches prior crashes.
- Resolve effective cache policy (§5)
- Initialize L1 (empty HashMap)
- Start background tasks: metadata TTL eviction, disk-pressure check, key health check (every key_health_interval), heartbeat RPC (every 60s for disconnect detection)
- Cache operational
Crash recovery service (kiseki-cache-scrub):
A systemd one-shot service (or cron job) that runs on node boot and
every 60 seconds. Scans $KISEKI_CACHE_DIR for all tenant/pool
directories, wipes any whose pool.lock has no live flock holder.
This covers the case where no subsequent kiseki process starts on the
node after a crash.
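A sketch of the probe both recovery paths share, again assuming fs2 for flock; zeroize_and_delete stands in for the wipe routine:

```rust
use std::{fs, path::Path};
use fs2::FileExt;

/// Orphan detection per I-CC8: a pool whose pool.lock can be flock'd has no
/// live owner and is wiped. Shared by client startup and kiseki-cache-scrub.
fn scavenge_orphans(tenant_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(tenant_dir)? {
        let pool = entry?.path();
        let Ok(lock) = fs::File::open(pool.join("pool.lock")) else { continue };
        if lock.try_lock_exclusive().is_ok() {
            // No live holder: orphaned pool from a crashed process.
            zeroize_and_delete(&pool)?; // overwrite chunk files, then remove the directory
        }
    }
    Ok(())
}
```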
Steady state:
- Reads: L1 → L2 (CRC32 verify) → canonical (decrypt + SHA-256 verify, store in L1/L2 with CRC32 trailer)
- Writes: straight to canonical; update local metadata cache
- Background: periodic L1 expired-entry eviction, L2 disk-pressure check, key health check, heartbeat RPC
Disconnect (fabric unreachable):
- Reads from L1/L2 continue to be served (chunks are immutable)
- After max_disconnect_seconds with no successful RPC to any canonical endpoint: wipe entire cache (I-CC6)
- On reconnect before threshold: resume normal operation
Process exit (clean):
- Wipe L2 (zeroize all chunk files, delete pool directory)
- L1 freed with process memory (Zeroizing drop handles cleanup)
- Release flock on pool.lock
Process exit (crash):
- L2 chunk files remain on NVMe (no zeroize opportunity)
- Next process start or kiseki-cache-scrub service detects the orphaned pool via flock check and wipes it
10. Configuration surface
| Linkage mode | Configuration mechanism |
|---|---|
| FUSE mount | Mount options: -o cache_mode=organic,cache_l2_max=50G,cache_dir=/tmp/kiseki |
| Rust API | CacheConfig struct passed to Client::new() |
| Python | kiseki.Client(cache_mode="pinned", cache_l2_max=50*1024**3) |
| C FFI | kiseki_open() with KisekiCacheConfig struct |
| Environment | KISEKI_CACHE_MODE, KISEKI_CACHE_DIR, KISEKI_CACHE_L1_MAX, KISEKI_CACHE_L2_MAX, KISEKI_CACHE_META_TTL_MS, KISEKI_CACHE_POOL_ID |
Priority: API/mount options > environment variables > policy defaults. All client-set values are clamped to policy ceilings (§5).
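A sketch of the clamping rule for one parameter (field names illustrative):

```rust
/// Effective L2 ceiling per §10: API/mount > env > policy default,
/// clamped to the policy ceiling from §5.
fn effective_l2_max(api: Option<u64>, env: Option<u64>, policy: &CachePolicy) -> u64 {
    api.or(env)
        .unwrap_or(policy.default_l2_max) // policy default when nothing was set
        .min(policy.max_cache_bytes)      // client-set values never exceed the ceiling
}
```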
11. Observability
Cache metrics exposed via the client’s local metrics (not Prometheus — client runs on compute nodes, not storage nodes):
| Metric | Type | Description |
|---|---|---|
| cache_l1_hits | counter | L1 (memory) cache hits |
| cache_l2_hits | counter | L2 (NVMe) cache hits |
| cache_misses | counter | Cache misses (fetched from canonical) |
| cache_bypasses | counter | Bypass mode reads (intentional non-cache) |
| cache_errors | counter | L2 I/O errors (bypassed to canonical per I-CC7) |
| cache_l1_bytes | gauge | Current L1 memory usage |
| cache_l2_bytes | gauge | Current L2 disk usage |
| cache_staged_datasets | gauge | Number of pinned datasets |
| cache_staged_bytes | gauge | Total bytes in pinned datasets |
| cache_meta_hits | counter | Metadata cache hits (within TTL) |
| cache_meta_misses | counter | Metadata cache misses (TTL expired or absent) |
| cache_wipes | counter | Full cache wipes (disconnect threshold, restart, crypto-shred) |
| cache_l2_read_latency_us | histogram | L2 NVMe read latency |
| cache_l2_write_latency_us | histogram | L2 NVMe write latency |
Metrics available via workflow advisory telemetry (scoped to caller)
and via local API (cache_stats()).
Consequences
Positive
- Repeat reads served from local NVMe: order-of-magnitude latency reduction for training datasets, inference weights, simulation input
- Staging API with scheduler handoff eliminates thundering-herd on job start
- Three modes match the three dominant workload patterns precisely
- Plaintext cache means cache hits avoid decryption cost entirely
- Policy model reuses existing TenantConfig distribution — no new subsystem
- Content-addressed chunk immutability makes cache correctness simple (I-C1 is the foundation)
- Crash recovery via flock-based orphan detection + scrubber service
Negative
- Plaintext on local NVMe is a security surface. Mitigated by zeroize, file permissions, wipe-on-exit, crash scrubber, and ephemeral-only semantics (I-CC2, I-CC8). Residual FTL risk documented.
- Metadata TTL introduces a staleness window, including for deleted files. Mitigated by the short default (5s) and by write-through for the client’s own writes (I-CC3, I-CC5)
- L2 NVMe cache competes with application use of local NVMe (e.g., scratch, checkpoint). Mitigated by configurable per-process ceiling, per-node ceiling, and disk-pressure backoff (§8)
- No cross-process chunk sharing within a tenant means duplicate chunks when multiple jobs for the same tenant run on the same node. Accepted trade-off: simplicity over hit-rate optimization
Neutral
- Bypass mode has zero overhead (no cache code on read path)
- Staging is idempotent and resumable
- Cache wipe on long disconnect is conservative but safe
- Policy distribution via data-path gRPC works in all deployment topologies (no gateway or control plane access required)
Adversarial findings
| ID | Severity | Section | Finding | Resolution |
|---|---|---|---|---|
| CC-ADV-1 | Critical | §1, §9 | Crash leaves plaintext on NVMe unreachable by zeroize. Process crash skips the exit wipe path. | Resolved: startup scavenger wipes orphaned pools (flock detection). kiseki-cache-scrub systemd/cron service runs on boot + every 60s for nodes where no subsequent kiseki process starts. Residual FTL risk documented. §9 updated. |
| CC-ADV-2 | Critical | §7 | Crypto-shred detection has no reliable delivery path. Advisory channel is best-effort. | Resolved: periodic key health check (default 30s) as primary detection. Advisory channel as fast path. KMS error on next operation as tertiary. Unreachability falls through to disconnect timer (I-CC6). Maximum detection latency: min(key_health_interval, max_disconnect_seconds) = 30s default. §7 updated, I-CC12 revised. |
| CC-ADV-3 | High | §1 | L2 read verification unspecified. Full SHA-256 on every read too expensive for training throughput. | Resolved: CRC32 trailer on each L2 chunk file, verified on read. SHA-256 only at fetch time. CRC32 catches bit-flips at ~1 GB/s cost. CRC mismatch → bypass canonical + delete entry. New I-CC13. §1 updated. |
| CC-ADV-4 | High | §1 | cache.lock flock contradicts separate-pool semantics for concurrent same-tenant processes. | Resolved: per-process pool_id subdirectory (128-bit CSPRNG). Each process has own pool.lock. No contention between concurrent processes. L2 layout updated in §1. |
| CC-ADV-5 | High | §6 | Staging CLI is separate process — workload’s wipe-on-start destroys staged data. | Resolved: staging daemon holds flock; workload adopts pool via KISEKI_CACHE_POOL_ID env var instead of creating new pool. Handoff mechanism specified in §6. I-CC8 revised to include adoption path. |
| CC-ADV-6 | High | §5 | Policy resolution via gateway unreachable in some topologies. | Resolved: primary path is GetCachePolicy RPC on data-path gRPC channel to any storage node. No gateway or control plane access required. Fallback chain: data-path → gateway → persisted last-known → conservative defaults. §5 updated. |
| CC-ADV-7 | Medium | §3, §4 | Metadata TTL authority doesn’t explicitly cover file deletion case. | Resolved: I-CC3 text now explicitly states that serving data for a deleted file within TTL is an accepted consequence. I-CC5 updated to cover deletion. §3 updated. |
| CC-ADV-8 | Medium | §2 | Pinned mode has no mechanism to detect canonical dataset updates. | Resolved: documented as intentional. Pinned mode stages a point-in-time snapshot. Update requires explicit release + re-stage. §2 updated. |
| CC-ADV-9 | Medium | §8 | No aggregate capacity enforcement across processes on same node. | Resolved: max_node_cache_bytes policy attribute (default 80% of cache filesystem). Cooperative enforcement: each process sums all pools before inserting. Disk-pressure 90% as hard backstop. §8 updated, policy table updated. |
| CC-ADV-10 | Medium | §1 | NVMe FTL retains physical copies after software zeroize. | Resolved: acknowledged as residual risk. Recommended hardening: OPAL/SED NVMe with per-boot key rotation. §1 updated. |
| CC-ADV-11 | Medium | §6 | Staging conflates namespace path with single composition. | Resolved: staging flow now specifies recursive directory enumeration with max_staging_depth (default 10) and max_staging_files (default 100,000). §6 flow updated. |
| CC-ADV-12 | Low | §4 | I-CC3, I-CC4, I-CC5 partially overlap. | Resolved: I-CC3 and I-CC4 consolidated into single I-CC3 covering freshness, authority, and deletion case. I-CC5 retained as the externally-facing staleness guarantee. Invariant table updated. |
| CC-ADV-13 | Low | §9 | Disconnect detection mechanism unspecified. | Resolved: defined as “no successful RPC to any canonical endpoint for max_disconnect_seconds consecutive seconds.” Client maintains last_successful_rpc timestamp. Background heartbeat every 60s. I-CC6 updated with detection mechanism. |
| CC-ADV-14 | Low | §11 | Missing L2 read/write latency metrics. | Resolved: added cache_l2_read_latency_us and cache_l2_write_latency_us histograms to metrics table. §11 updated. |
Invariant impact
| Invariant | Impact |
|---|---|
| I-C1 | Foundation: chunk immutability enables the cache. No change to I-C1. |
| I-K1, I-K2 | Unchanged: plaintext never leaves the compute node. Cache stores plaintext locally, same trust domain as process memory. |
| I-WA18 | Reused: cache policy changes apply prospectively. |
| I-WA7 | Reused: scope narrowing pattern for policy hierarchy. |
New invariants
| ID | Invariant |
|---|---|
| I-CC1 | A chunk in pinned or organic mode is served from cache if and only if (a) the chunk was fetched from canonical and verified by chunk_id content-address match (SHA-256) at fetch time, and (b) no crypto-shred event has been detected for that tenant since fetch. Chunks are immutable in canonical (I-C1); therefore a verified chunk remains correct indefinitely absent crypto-shred. |
| I-CC2 | Cached plaintext is overwritten with zeros (zeroize) before deallocation, eviction, or cache wipe. File-level: overwrite contents before unlink. Memory-level: Zeroizing<Vec<u8>> for L1 entries. This provides logical-level erasure; physical-level erasure on flash storage requires hardware encryption (OPAL/SED). |
| I-CC3 | File→chunk_list metadata mappings are served from cache only within the configured TTL (default 5s). After TTL expiry, the mapping must be re-fetched from canonical. Within the TTL window, the cached mapping is authoritative: it may serve data for files that have since been modified or deleted in canonical. This is the sole freshness window in the cache design — chunk data itself has no TTL. |
| I-CC5 | Metadata TTL is the upper bound on read staleness. A file modified or deleted in canonical is visible to a caching client within at most one metadata TTL period (default 5s). |
| I-CC6 | Cached entries remain authoritative across fabric disconnects shorter than max_disconnect_seconds (default 300s). Beyond this threshold, the entire cache (L1 + L2) is wiped. Disconnect defined as: no successful RPC to any canonical endpoint for the threshold duration. Background heartbeat RPCs (every 60s) maintain the last_successful_rpc timestamp. |
| I-CC7 | Any local cache error (L2 I/O failure, CRC32 mismatch, metadata lookup failure) bypasses to canonical unconditionally. The cache never serves data it cannot verify. |
| I-CC8 | The cache is ephemeral. On process start, the client either creates a new L2 pool (wiping orphaned pools detected via flock) or adopts an existing pool via KISEKI_CACHE_POOL_ID. A kiseki-cache-scrub service runs on node boot and periodically to clean orphaned pools from crashed processes. |
| I-CC9 | When effective cache policy is unreachable at session start, the client operates with conservative defaults (cache enabled, organic mode, 10 GB ceiling, 5s metadata TTL). Policy is fetched via data-path gRPC (primary), gateway (secondary), persisted last-known (tertiary), or conservative defaults (fallback). |
| I-CC10 | Cache policy changes apply to new sessions only. Active sessions continue under session-start policy (consistent with I-WA18). |
| I-CC11 | Staged chunks are fetched from canonical, verified by content-address, and stored with pinned retention as a point-in-time snapshot. The staged version is immutable in the cache regardless of canonical updates. To pick up updates, the user must explicitly release and re-stage. Staging enumerates directory trees recursively up to max_staging_depth (10) and max_staging_files (100,000). |
| I-CC12 | On crypto-shred event, all cached plaintext for the affected tenant is wiped from L1 and L2 with zeroize. Detection via periodic key health check (default 30s), advisory channel notification, or KMS error on next operation. Maximum detection latency bounded by min(key_health_interval, max_disconnect_seconds). |
| I-CC13 | L2 cache entries are protected by a CRC32 checksum computed at insert time and stored as a 4-byte trailer. On L2 read, the CRC32 is verified before serving. Mismatch triggers bypass to canonical and L2 entry deletion. |
Spec references
- specs/features/native-client.feature — cache hit/invalidation/staging scenarios (extend)
- specs/features/control-plane.feature — cache policy distribution scenarios (extend)
- specs/invariants.md — add I-CC1 through I-CC13
- specs/ubiquitous-language.md — add cache-specific terms
- specs/failure-modes.md — add F-CC1 through F-CC4
- specs/assumptions.md — add A-CC1 through A-CC4