ADR-030: Dynamic Small-File Placement and Metadata Capacity Management
Status: Accepted Date: 2026-04-22 Deciders: Architect + domain expert Adversarial review: 2026-04-22 (6 findings: 1C 2H 2M 1L, all resolved) Context: ADR-024 (device management), ADR-029 (raw block allocator), I-L9 (inline threshold), I-C5 (capacity thresholds), I-C8 (bitmap ground truth)
Problem
At scale (10B+ files, 100PB+), the metadata tier (redb on system NVMe) becomes a sizing bottleneck. The per-file metadata footprint (~280 bytes) is unavoidable, but small-file content inlined into deltas causes the metadata tier to scale with data volume, not just file count.
Current state:
inline_threshold_bytesis specified (I-L9) but not implemented- No dynamic adjustment mechanism exists
- No awareness of system disk capacity or media type
- No workload-driven shard placement across heterogeneous nodes
Capacity example
10B files, 100PB total, 50-node cluster, RF=3, 256GB NVMe root disks:
| Component | Per file | Cluster total | Per node |
|---|---|---|---|
| Delta log (no inline) | ~200 B | ~2 TB | ~120 GB |
| Chunk metadata | ~80 B | ~0.8 TB | ~48 GB |
| Subtotal (metadata only) | ~280 B | ~2.8 TB | ~168 GB |
| Small-file content (if inlined) | variable | 3-200 TB | blows budget |
Metadata alone consumes 168 GB/node at 50 nodes. Adding inline content makes 256 GB root disks insufficient.
Decision
1. System disk auto-detection and budget calculation
At server boot, detect the system partition’s capacity and media type. Compute a metadata budget with configurable soft and hard limits.
KISEKI_DATA_DIR → stat() → total_bytes, fs_type
/sys/block/{dev}/queue/rotational → 0 = SSD/NVMe, 1 = HDD
/sys/block/{dev}/device/model → device identification
Defaults (configurable via env or config file):
| Parameter | Default | Description |
|---|---|---|
KISEKI_META_SOFT_LIMIT_PCT | 50% | Normal operating ceiling |
KISEKI_META_HARD_LIMIT_PCT | 75% | Absolute maximum, triggers emergency |
KISEKI_META_INLINE_FLOOR | 128 B | Hard lower bound for inline (metadata-like payloads only) |
Warning: If the system disk is rotational (HDD), emit a persistent warning at boot and in health reports:
WARNING: system disk is rotational (HDD). Raft fsync latency will
be 5-10ms per commit. Production deployments require NVMe or SSD
for the metadata partition. See ADR-030.
Reported to cluster (via gRPC health reports, not Raft — see SF-ADV-4 resolution):
#![allow(unused)]
fn main() {
struct NodeMetadataCapacity {
total_bytes: u64,
used_bytes: u64,
soft_limit_bytes: u64,
hard_limit_bytes: u64,
media_type: MediaType, // Nvme, Ssd, Hdd
small_file_budget_bytes: u64, // derived: soft_limit - reserved - metadata
}
}
2. Two-tier redb layout on system disk
Separate metadata (Raft log, chunk index) from small-file content:
KISEKI_DATA_DIR/
├── raft/log.redb ← Raft log entries (bounded by snapshot policy)
├── keys/epochs.redb ← Key epoch metadata (tiny, <10 MB)
├── chunks/meta.redb ← Chunk extent index (scales with file count)
└── small/objects.redb ← Small-file encrypted content (capacity-managed)
The first three are structural metadata — required regardless of
inline threshold. The fourth (small/objects.redb) is data-tier
extension — its size is controlled by the inline threshold.
This separation enables:
- Independent monitoring of each tier’s growth
- Emergency response: disable inline (threshold → floor) without touching structural metadata
- Backup/restore of structural metadata without bulk data
GC contract (SF-ADV-6): When truncate_log or compact_shard
removes a delta that references an inline object, the corresponding
small/objects.redb entry is also deleted. The GC path must cover
both stores — orphan entries in small/objects.redb are a capacity
leak. The chunk_id key is shared between small/objects.redb and
the block device extent mapping, so deletion is keyed identically.
3. Per-shard dynamic inline threshold
The inline threshold determines whether a file’s encrypted content is
stored in small/objects.redb (metadata tier) or as a chunk extent on
a raw block device (data tier).
Threshold is per-shard, not per-node, because all Raft replicas of a shard must agree on whether content is inline or chunked (state machine determinism).
Computation: The shard leader computes the threshold from the minimum small-file budget across all nodes hosting that shard:
available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate (from delta count heuristic)
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)
Where INLINE_CEILING is a system-wide maximum (e.g., 64 KB) to
prevent pathological cases.
Raft log throughput guard (SF-ADV-1): The threshold is further
clamped by a per-shard Raft log throughput budget
(KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline
write rate (measured over a sliding 10-second window) would exceed
this budget at the current threshold, the effective threshold is
temporarily reduced to floor until the rate drops. This prevents
inline data from starving metadata-only Raft operations (large-file
chunk_ref deltas, maintenance commands, watermark advances) during
write storms.
effective_threshold = if shard.inline_write_rate_mbps > RAFT_INLINE_MBPS:
INLINE_FLOOR
else:
shard_threshold
Threshold adjustment rules (I-L9 compatibility):
- Threshold can decrease dynamically (safe — new files use chunks)
- Threshold changes are prospective only — existing inline data is not retroactively migrated
- Threshold increase requires cluster admin decision and may trigger background migration of small chunked files back to inline (optional, maintenance-mode operation)
- Threshold is stored in
ShardConfigand replicated via Raft
Read latency note (SF-ADV-3): After a threshold decrease, existing
inline files remain in small/objects.redb (fast, NVMe reads) while
new files of the same size go to block device extents (potentially
slower, especially on HDD). This bimodal latency for same-sized files
is expected behavior. Administrators can normalize it via the
maintenance-mode migration path (move old inline content to chunks),
but this is optional and not automatic.
Emergency override (SF-ADV-4): Capacity alerts use out-of-band
gRPC health reports, not Raft. Each node periodically reports its
NodeMetadataCapacity to the shard leader (or control plane) via the
data-path gRPC channel. If any voter reports hard-limit breach, the
leader commits a threshold reduction via Raft. This works because:
- The full-disk node doesn’t need to write Raft entries for the signal
- The leader commits the threshold change with 2/3 majority (the full-disk node’s vote is not required)
- The full-disk node receives the committed threshold change via Raft replication (read-only, no disk write needed until next apply)
4. Small-file data path
Inline content flows through Raft (SF-ADV-2): Inline content is
carried as payload in the Raft log entry (LogCommand::AppendDelta
with payload field). The state machine’s apply() method offloads
the payload to small/objects.redb on apply, keyed by chunk_id.
The in-memory state machine retains only the delta header (no payload).
This ensures:
- Snapshot correctness:
build_snapshot()reads inline content fromsmall/objects.redb, includes it in the serialized snapshot.install_snapshot()writes it back. Learners and restarted nodes receive all inline content via snapshot transfer. - State machine determinism: all replicas apply the same log
entries and write to their local
small/objects.redbidentically. - Memory efficiency: inline payloads are not held in memory after apply — only the redb reference remains.
Below threshold (inline path):
client write → gateway encrypt → delta with payload →
Raft client_write (payload in log entry) →
replicated to voters →
state machine apply() → offload payload to small/objects.redb →
in-memory state: header only (no payload)
Above threshold (chunk path, unchanged):
client write → gateway encrypt → chunk alloc on DeviceBackend →
extent write (O_DIRECT) → delta with chunk_ref (no payload) →
Raft client_write → replicated (metadata only)
Read path: ChunkOps::get() checks small/objects.redb first
(keyed by chunk_id). If not found, reads from block device extent.
This is transparent to callers.
5. Workload-driven shard placement (heterogeneous clusters)
When the cluster has mixed node types (HDD + SSD), the control plane can migrate shards to better-suited nodes using Raft membership changes.
Placement levers (ordered by preference, topology-dependent):
| Lever | When to use | Mechanism |
|---|---|---|
| Lower inline threshold | Always available | ShardConfig update via Raft |
| Split shard | Shard exceeds I-L6 ceiling | Standard shard split |
| Migrate to larger-NVMe node | Heterogeneous cluster, metadata pressure | Raft add_learner → promote → demote |
| Migrate to SSD node | Heterogeneous, small-file-heavy shard | Raft add_learner → promote → demote |
Decision tree (control plane policy):
IF shard.metadata_pressure > soft_limit:
IF can_lower_threshold(shard):
lower_threshold(shard) # cheapest, always try first
ELSE IF shard.exceeds_split_ceiling:
split_shard(shard) # distributes load
ELSE IF cluster.has_better_node(shard):
migrate_shard(shard, better_node) # needs heterogeneous cluster
ELSE:
alert("metadata tier at capacity, no placement options available")
In a homogeneous cluster, only the first two levers exist. The policy prunes itself based on what’s available.
Shard migration via Raft:
Migration is not a special operation — it’s a Raft membership change:
raft.add_learner(target_node)— target receives log/snapshot- Wait for learner to catch up (snapshot transfer, then log replay)
raft.change_membership(new_voter_set)— promote target, demote source- Old node removed from voter set, its data eventually GC’d
Properties:
- Zero downtime: reads/writes continue during migration
- Zero data loss: old node stays in membership until new node is caught up
- Reversible: if migration fails, learner is removed, no state change
6. Placement change rate limiting
Placement changes (shard migration, learner add/remove) consume snapshot transfer bandwidth. In HPC environments, workload profiles shift at job boundaries (hours to days), not continuously.
Exponential backoff per shard:
| Observation window | After N-th change |
|---|---|
| 2 hours | 1st (initial observation, minimum floor) |
| 2 hours | 2nd (backoff resets never go below 2h) |
| 4 hours | 3rd |
| 8 hours | 4th |
| … | doubles each time |
| 24 hours | cap (maximum interval) |
Reset (SF-ADV-5): The backoff resets to the minimum floor of 2 hours, not to a shorter interval. Even when the shard’s workload profile changes significantly (e.g., small-file ratio crosses a threshold boundary), the shard cannot be migrated more than once per 2 hours. This prevents oscillating workloads from causing continuous snapshot transfers. The 2-hour floor is chosen because:
- HPC job boundaries are typically hours apart
- A snapshot transfer of a large shard takes minutes, and the target node needs time to stabilize before being evaluated again
- The floor applies per-shard, so different shards can migrate concurrently within the cluster-wide rate limit
Per-cluster rate limit: at most max(1, num_nodes / 10) concurrent
shard migrations cluster-wide, to bound snapshot transfer bandwidth.
7. SSD nodes as read accelerators (Raft learners)
For read-heavy small-file workloads, SSD nodes can serve as non-voting Raft learners:
- Learners receive the full Raft log (including small-file content)
- Learners do NOT participate in elections or commit quorum
- Learners serve read requests (state machine is up-to-date)
- Add/remove learners without disturbing the voter set
Use case: a shard has RF=3 on HDD voters (for capacity) plus 1-2 SSD learners (for read IOPS). The SSD learners handle small-file reads, HDD voters handle bulk writes.
Correction after suboptimal placement: Initial shard placement does not need to be optimal. The control plane observes shard metrics (small-file ratio, read IOPS, p99 latency) and corrects placement via Raft membership changes. Adding an SSD learner, promoting it to voter, and demoting an HDD voter is a zero-downtime, zero-data-loss operation. The cost is one snapshot transfer per migrated shard — bounded by the rate limiting in §6.
Promotion path: if workload shifts permanently, a learner can be promoted to voter (and an HDD voter demoted) via standard membership change.
Consequences
Positive
- Metadata tier sizing becomes self-managing
- Small files handled efficiently without manual tuning
- Mixed HDD/SSD clusters used optimally
- Placement corrections have zero downtime and zero data loss
- I-L9 compatibility preserved (prospective-only threshold changes)
- Snapshot transfer includes inline content (SF-ADV-2 resolved)
Negative
- Per-shard threshold adds complexity to
ShardConfig ChunkOps::get()now checks two stores (redb + block device)- Snapshot transfer is the bottleneck for migration speed
- Threshold computation requires cluster-wide metadata aggregation
- Inline writes under high load may be temporarily demoted to chunk path (throughput guard), causing brief latency increase for small files
Neutral
- Threshold floor (128 B) means truly tiny files are always inline
- Homogeneous clusters get simpler behavior (fewer levers)
- Migration mechanism is just Raft membership changes — no new protocol
- Bimodal read latency after threshold decrease is expected (SF-ADV-3)
Adversarial findings (resolved)
| ID | Severity | Finding | Resolution |
|---|---|---|---|
| SF-ADV-1 | High | Raft log throughput saturation from inline writes | Per-shard throughput budget (§3), temporarily lowers threshold to floor under load |
| SF-ADV-2 | Critical | Inline content missing from Raft snapshots | Inline content flows through Raft log; state machine offloads to redb on apply; snapshot reads from redb (§4) |
| SF-ADV-3 | Medium | Bimodal read latency after threshold decrease | Documented as expected; optional admin migration path to normalize (§3) |
| SF-ADV-4 | High | Emergency override fails if full-disk node can’t write Raft entries | Capacity reporting via out-of-band gRPC, not Raft; leader commits with 2/3 majority (§3) |
| SF-ADV-5 | Low | Backoff reset allows frequent migrations from oscillating workloads | Minimum 2-hour floor that never resets below (§6) |
| SF-ADV-6 | Medium | No GC path for small/objects.redb | GC contract: truncate_log and compact_shard delete corresponding redb entries (§2) |
Invariant impact
| Invariant | Impact |
|---|---|
| I-L9 | Extended: threshold is now per-shard and dynamic, but still prospective-only. Increase requires admin action. |
| I-C5 | Unchanged: capacity thresholds on data devices unaffected. |
| I-C8 | Unchanged: bitmap remains ground truth for block device allocations. |
| I-K3 | Unchanged: inline content is still encrypted with system DEK, wrapped with tenant KEK. |
New invariants
| ID | Invariant |
|---|---|
| I-SF1 | The inline threshold for a shard is the minimum affordable threshold across all nodes hosting that shard’s voter set. Threshold stored in ShardConfig, replicated via Raft. |
| I-SF2 | System disk metadata usage must not exceed hard_limit_pct of system partition capacity. Exceeding soft limit triggers threshold reduction; exceeding hard limit forces threshold to floor and emits alert. Alert uses out-of-band gRPC, not Raft. |
| I-SF3 | Shard migration via Raft membership change must not proceed until the target node has fully caught up (learner state matches leader’s committed index). |
| I-SF4 | Placement change rate per shard follows exponential backoff (2h floor, 24h cap). Backoff resets never go below 2h floor. Cluster-wide concurrent migrations bounded by max(1, num_nodes / 10). |
| I-SF5 | Inline content is carried in Raft log entries and offloaded to small/objects.redb on state machine apply. Snapshots include inline content read from redb. No inline content is held in the in-memory state machine after apply. |
| I-SF6 | GC (truncate_log, compact_shard) must delete corresponding entries from small/objects.redb when removing deltas that reference inline objects. Orphan redb entries are a capacity leak. |
| I-SF7 | Per-shard Raft inline throughput must not exceed KISEKI_RAFT_INLINE_MBPS (default 10 MB/s). When exceeded, effective inline threshold drops to floor until rate subsides. |
Spec references
specs/invariants.md— I-L9, I-C5, I-C8, I-K3specs/architecture/adr/024-device-management-and-capacity.md— device classes, server disk layoutspecs/architecture/adr/029-raw-block-device-allocator.md— DeviceBackend trait, extent allocationspecs/architecture/adr/026-raft-topology.md— Raft membership, multi-Raft patternspecs/implementation/phase-7-9-assessment.md— open design question on small files