Capacity Planning

Kiseki separates metadata and data onto different storage tiers. Proper sizing of both tiers is critical for stable operation at scale.

Storage tiers

Each storage node has two distinct storage tiers:

System disk (metadata tier)

The system partition hosts:

Raft log (raft/log.redb): Bounded by snapshot interval. Grows with write rate, compacted periodically.
Key epochs (keys/epochs.redb): Tiny (<10 MB). One entry per key epoch.
Chunk metadata (chunks/meta.redb): Scales linearly with file count. Approximately 80 bytes per file.
Inline content (small/objects.redb): Variable. Controlled by the dynamic inline threshold (ADR-030).

Requirements:

NVMe or SSD strongly recommended. HDD system disks trigger a boot warning because Raft fsync latency will be 5-10ms per commit.
RAID-1 on 2x SSD for redundancy (the system disk is not protected by Kiseki’s EC; it uses traditional RAID).
Size based on expected file count and inline content.

Data devices (data tier)

Data devices are JBOD-managed by Kiseki. They store chunk ciphertext as extents on raw block devices (ADR-029).

Requirements:

NVMe, SSD, or HDD depending on the pool’s device class.
Multiple devices per node for EC placement (I-D4: no two EC fragments on the same device).
JBOD (no RAID): Kiseki manages durability via EC or replication.

Metadata capacity sizing

Per-file metadata footprint

Component	Per file	Notes
Delta log entry	~200 bytes	Raft log entry with header fields
Chunk metadata	~80 bytes	Extent index entry in `chunks/meta.redb`
Subtotal (no inline)	~280 bytes	Fixed per file
Inline content	0 to 64 KB	Only if file is below inline threshold

Capacity examples

10 billion files, 50-node cluster, RF=3, no inline:

Component	Cluster total	Per node
Delta log (metadata only)	~2 TB	~120 GB
Chunk metadata index	~0.8 TB	~48 GB
Total metadata	~2.8 TB	~168 GB

At 168 GB per node, 256 GB NVMe system disks are tight. Larger system disks (512 GB or 1 TB) provide comfortable headroom.

10 billion files, 50-node cluster, RF=3, with inline (4 KB threshold):

Component	Cluster total	Per node
Metadata (as above)	~2.8 TB	~168 GB
Inline content (10% of files < 4 KB, avg 2 KB)	~2 TB	~120 GB
Total	~4.8 TB	~288 GB

This exceeds 256 GB system disks. The dynamic inline threshold (ADR-030) prevents this by automatically reducing the threshold when system disk usage approaches the soft limit.

Capacity monitoring

The system automatically monitors metadata disk usage and adjusts:

Usage level	Response
Below `KISEKI_META_SOFT_LIMIT_PCT` (50%)	Normal operation
Above soft limit	Inline threshold reduced
Above `KISEKI_META_HARD_LIMIT_PCT` (75%)	Threshold forced to floor (128 B), alert emitted

Alerts use out-of-band gRPC health reports (not Raft) so that a full-disk node can signal without writing Raft entries (I-SF2).

Dynamic inline threshold (ADR-030)

The inline threshold is computed per-shard as the minimum affordable threshold across all Raft voters:

available = min(node.small_file_budget_bytes for node in shard.voters)
projected_files = shard.file_count_estimate
raw_threshold = available / max(projected_files, 1)
shard_threshold = clamp(raw_threshold, INLINE_FLOOR, INLINE_CEILING)

Where:

Parameter	Value
`INLINE_FLOOR`	128 bytes (hard lower bound)
`INLINE_CEILING`	64 KB (system-wide maximum)
`KISEKI_META_SOFT_LIMIT_PCT`	50% (default)
`KISEKI_META_HARD_LIMIT_PCT`	75% (default)

Threshold behavior

Decrease: Automatic and safe. New files use the chunk path. Existing inline data is not retroactively migrated (I-L9).
Increase: Requires cluster admin decision. May trigger optional background migration of small chunked files back to inline.
Emergency: If any voter reports hard-limit breach, the leader commits a threshold reduction via Raft (2/3 majority; the full-disk node’s vote is not required).

The effective inline threshold is further clamped by a per-shard Raft log throughput budget (KISEKI_RAFT_INLINE_MBPS, default 10 MB/s). If the shard’s inline write rate exceeds this budget, the threshold temporarily drops to the floor until the rate subsides. This prevents inline data from starving metadata-only Raft operations during write storms.

Pool capacity thresholds

Data-tier capacity is managed per pool. Thresholds vary by device class to account for SSD/NVMe GC pressure at high fill levels (ADR-024):

State	NVMe/SSD	HDD	Behavior
Healthy	0-75%	0-85%	Normal writes, background rebalance
Warning	75-85%	85-92%	Log warning, emit telemetry
Critical	85-92%	92-97%	Reject new placements, advisory backpressure
ReadOnly	92-97%	97-99%	In-flight writes drain, no new writes
Full	97-100%	99-100%	ENOSPC to clients

Why NVMe/SSD thresholds are lower

NVMe and SSD devices experience write amplification from garbage collection at high fill levels. Above ~80% fill, GC pressure increases sharply, causing:

Increased write latency (10-100x during GC storms).
Reduced effective write bandwidth.
Accelerated wear.

Enterprise storage arrays (VAST, Pure) operate at 95%+ because they have global wear leveling across all flash in the system. JBOD devices do not have this capability, so Kiseki’s thresholds are more conservative.

Growth estimation

File count growth

Monitor kiseki_shard_delta_count to track delta (file) accumulation:

# Current delta count per shard
curl -s http://node1:9090/metrics | grep kiseki_shard_delta_count

Use the rate of delta count increase to project when the metadata tier will reach capacity.

Data volume growth

Monitor pool capacity metrics:

# Current pool utilization
curl -s http://node1:9090/metrics | grep kiseki_pool_capacity

Projection formula

days_until_full = (capacity_total - capacity_used) / daily_write_rate

For metadata:

metadata_per_file = 280 bytes (no inline) or 280 + avg_inline_size (with inline)
days_until_full = (system_disk_capacity * soft_limit_pct - current_used) /
                  (new_files_per_day * metadata_per_file * replication_factor)

Sizing recommendations

Small deployment (development/testing)

Component	Recommendation
Nodes	3 (minimum for Raft)
System disk	256 GB NVMe each (RAID-1 on 2x SSD)
Data devices	2x 1 TB NVMe per node
Key manager	Co-located with storage nodes (internal KMS)
File count	Up to 100 million

Medium deployment (departmental HPC)

Component	Recommendation
Nodes	5-10
System disk	512 GB NVMe each (RAID-1)
Data devices	4-8 NVMe per node (2-8 TB each)
Key manager	3 dedicated nodes
File count	Up to 1 billion

Large deployment (institutional HPC/AI)

Component	Recommendation
Nodes	50-200
System disk	1 TB NVMe each (RAID-1)
Data devices	8-24 devices per node, mixed tiers (NVMe + SSD + HDD)
Key manager	5 dedicated nodes
File count	Up to 10 billion
Total capacity	100 PB+

Rules of thumb

System disk: Size at 2x the expected metadata footprint for comfortable headroom. Include inline content estimates.
Data devices: At least ec_data_chunks + ec_parity_chunks devices per pool (for EC placement across distinct devices, I-D4).
Network: CXI or InfiniBand for clusters where storage bandwidth is critical. TCP is acceptable for cold-tier pools.
Memory: At least 64 GB per storage node for Raft state, chunk metadata caching, and stream processor buffers.

Capacity alerts

Configuring alerts

Use Prometheus alerting rules (see Monitoring) to detect capacity issues before they become critical:

- alert: KisekiSystemDiskWarning
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.50
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "System disk above 50% on {{ $labels.instance }}"

- alert: KisekiSystemDiskCritical
  expr: >
    node_filesystem_avail_bytes{mountpoint="/var/lib/kiseki"} /
    node_filesystem_size_bytes{mountpoint="/var/lib/kiseki"} < 0.25
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "System disk above 75% on {{ $labels.instance }}"

When to add capacity

System disk above 50% (soft limit): Plan for capacity expansion. Inline threshold will start decreasing.
System disk above 75% (hard limit): Urgent. Inline threshold is at floor. Add nodes or upgrade system disks.
Pool above Warning threshold: Monitor growth. Plan for device additions.
Pool above Critical threshold: Writes are being rejected. Add devices immediately or evacuate data to another pool.

Kiseki Documentation