Performance Tuning

Kiseki is designed for HPC and AI workloads running at 200+ Gbps per NIC. This guide covers tuning levers for maximizing throughput and minimizing latency.


Transport selection

The transport layer abstracts the network fabric. Kiseki automatically selects the best available transport, but manual override is possible.

Transport hierarchy (fastest to slowest)

Transport           | Typical bandwidth | Latency   | Feature flag           | Notes
--------------------|-------------------|-----------|------------------------|------
CXI (HPE Slingshot) | 200 Gbps          | <1 us     | kiseki-transport/cxi   | Requires libfabric with CXI provider. CSCS/Alps native.
InfiniBand verbs    | 100-400 Gbps      | 1-2 us    | kiseki-transport/verbs | Requires RDMA-capable NICs and verbs libraries.
RoCE v2             | 25-100 Gbps       | 2-5 us    | kiseki-transport/verbs | RDMA over Converged Ethernet. Requires lossless fabric (PFC/ECN).
TCP                 | 10-100 Gbps       | 50-200 us | (always available)     | Fallback. Uses kernel TCP with TLS.

Enabling high-performance transports

# Build with CXI support (requires libfabric development headers)
cargo build --release --features kiseki-transport/cxi

# Build with RDMA verbs support (requires rdma-core)
cargo build --release --features kiseki-transport/verbs

The client automatically detects available transports and selects the fastest one. Override with:

# Force TCP transport (e.g., for debugging)
KISEKI_TRANSPORT=tcp kiseki-client-fuse --mountpoint /mnt/kiseki

Transport tuning

  • Connection pooling: The transport layer maintains a pool of connections per peer. Pool size adapts to workload.
  • Keepalive: Connections are kept alive to avoid handshake overhead. Configure via KISEKI_TRANSPORT_KEEPALIVE_MS (see the example below).
  • Zero-copy: CXI and verbs transports use zero-copy DMA where possible.
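
For example, a longer keepalive can help clients that alternate between compute and I/O phases and would otherwise pay handshake costs on each burst. The value below is illustrative, not a recommended default:

# Keep transport connections warm across 60 s of idle time (illustrative value)
KISEKI_TRANSPORT_KEEPALIVE_MS=60000 kiseki-client-fuse --mountpoint /mnt/kiseki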

NUMA pinning

For multi-socket servers, NUMA-aware placement is critical for avoiding cross-socket memory traffic.

Recommendations

  • Pin kiseki-server to the NUMA node closest to the NIC:
    numactl --cpunodebind=0 --membind=0 kiseki-server
    
  • Pin NVMe interrupts to the CPUs of the same NUMA node (smp_affinity_list expects a CPU list, not a node number; see the sketch after this list):
    echo <node0-cpu-list> > /proc/irq/<irq>/smp_affinity_list
    
  • Pin data devices to the NUMA node closest to their PCIe root complex.
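
The interrupt step can be scripted; the sketch below pins all nvme0 interrupts to the CPUs of the device's NUMA node. The nvme0/nvme0n1 names are placeholders, and kernels that use managed IRQ affinity may ignore the write:

# Sketch: pin nvme0 interrupts to the CPUs of the device's NUMA node (run as root).
# Managed-affinity IRQs may reject the write; check /proc/irq/<irq>/effective_affinity_list.
NODE=$(cat /sys/block/nvme0n1/device/numa_node)
CPUS=$(cat /sys/devices/system/node/node${NODE}/cpulist)
for irq in $(awk '/nvme0q/ {sub(":", "", $1); print $1}' /proc/interrupts); do
    echo "$CPUS" > "/proc/irq/${irq}/smp_affinity_list"
done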

systemd integration

[Service]
# Pin to NUMA node 0
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/local/bin/kiseki-server

Verification

# Check NUMA topology
numactl --hardware

# Check NIC NUMA node
cat /sys/class/net/eth0/device/numa_node

# Check NVMe NUMA node
cat /sys/block/nvme0n1/device/numa_node

Erasure coding parameters

EC parameters control the trade-off between storage overhead, repair bandwidth, and read performance.

Common configurations

Config | Data | Parity | Overhead | Fault tolerance   | Use case
-------|------|--------|----------|-------------------|---------
4+2    | 4    | 2      | 50%      | 2 device failures | Default for NVMe. Good balance.
8+3    | 8    | 3      | 37.5%    | 3 device failures | Large HDD pools. Lower overhead.
4+1    | 4    | 1      | 25%      | 1 device failure  | Low-criticality data. Minimum overhead.
2+2    | 2    | 2      | 100%     | 2 device failures | Small pools (<6 devices). High redundancy.

Performance implications

  • Read amplification: Reading a chunk requires reading data_chunks fragments. More data chunks = more read I/O.
  • Write amplification: Writing a chunk requires writing data_chunks + parity_chunks fragments.
  • Repair bandwidth: Repairing a lost fragment requires reading data_chunks fragments and writing 1. Higher data_chunks = more repair bandwidth.
  • Minimum pool size: The pool must have at least data_chunks + parity_chunks devices.

EC parameters are immutable per pool after creation (I-C6). Choose carefully. Changing requires creating a new pool and migrating data.
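
As a quick back-of-the-envelope check before creating a pool (the 8+3 values below are illustrative, not defaults):

# Overhead = parity / data; minimum pool size = data + parity
data=8; parity=3
awk -v d="$data" -v p="$parity" \
    'BEGIN { printf "overhead: %.1f%%, minimum pool size: %d devices\n", 100*p/d, d+p }'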


Inline threshold (ADR-030)

The inline threshold determines whether small files are stored in the metadata tier (NVMe, redb) or the data tier (block device extents).

Tuning the threshold

The system automatically adjusts the threshold per-shard based on system disk capacity (I-SF1, I-SF2). Manual adjustment:

# Set cluster-wide default for new shards
kiseki-server tuning set --inline-threshold-bytes 8192

Trade-offs

Threshold       | Metadata tier impact    | Data tier impact    | Latency
----------------|-------------------------|---------------------|--------
128 B (floor)   | Minimal metadata growth | All files in chunks | Higher for tiny files
4 KB (default)  | Moderate growth         | Small files inline  | Lower for small files
64 KB (ceiling) | Large growth            | More inline data    | Lowest for small files

Monitoring

# Check system disk usage
df -h /var/lib/kiseki

# Check per-store sizes
du -sh /var/lib/kiseki/small/objects.redb
du -sh /var/lib/kiseki/raft/log.redb

The Raft inline throughput guard (I-SF7) automatically reduces the threshold to the floor if inline write rate exceeds KISEKI_RAFT_INLINE_MBPS (default 10 MB/s per shard). This prevents inline data from starving metadata-only Raft operations during write storms.
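
If a workload legitimately sustains a higher inline write rate (for example, heavy small-file ingest), the guard can be raised through the server's environment. The systemd drop-in below is a sketch; the unit name kiseki-server.service is an assumption:

# /etc/systemd/system/kiseki-server.service.d/inline-guard.conf (sketch; unit name assumed)
[Service]
# Raise the per-shard inline write budget from the 10 MB/s default (illustrative value)
Environment=KISEKI_RAFT_INLINE_MBPS=50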


Cache tuning (ADR-031)

L1 cache (in-memory)

The L1 cache holds decrypted plaintext chunks in process memory.

Parameter           | Default | Recommendation
--------------------|---------|---------------
KISEKI_CACHE_L1_MAX | 1 GB    | Set to 10-25% of available process memory. AI training with large datasets: increase. Memory-constrained compute: decrease.

L2 cache (local NVMe)

The L2 cache uses local NVMe on compute nodes.

Parameter           | Default | Recommendation
--------------------|---------|---------------
KISEKI_CACHE_L2_MAX | 100 GB  | Set based on available NVMe capacity. Training datasets: size to fit the working set. Inference: size to fit model weights.

Metadata TTL

Parameter                | Default    | Recommendation
-------------------------|------------|---------------
KISEKI_CACHE_META_TTL_MS | 5000 (5 s) | Read-heavy workloads: increase for fewer metadata fetches. Low-latency requirements: decrease for fresher data. POSIX close-to-open consistency: 0 (no caching).
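
A combined sketch for a training node with ample host memory and local NVMe; the sizes are illustrative, and the suffixed size syntax is an assumption (use byte counts if suffixes are not accepted):

# Illustrative cache sizing for an AI training node (values are examples, not defaults)
export KISEKI_CACHE_L1_MAX=16G          # roughly 10-25% of process memory
export KISEKI_CACHE_L2_MAX=1T           # size the L2 cache to the epoch working set
export KISEKI_CACHE_META_TTL_MS=30000   # read-heavy job: tolerate 30 s of metadata staleness
kiseki-client-fuse --mountpoint /mnt/kiseki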

Cache mode selection

Workload                  | Recommended mode | Rationale
--------------------------|------------------|----------
AI training (epoch reuse) | pinned           | Dataset is re-read every epoch. Pin to avoid refetching.
AI inference              | organic          | Model weights are hot, prompts rotate. LRU works well.
HPC checkpoint/restart    | bypass           | Checkpoints are write-heavy. Caching checkpoints wastes NVMe.
Climate/weather staging   | pinned           | Boundary conditions staged once, read many times.
Interactive analysis      | organic          | Mixed access patterns. LRU adapts.

Staging for training workloads

Pre-stage datasets before training begins to avoid cold-start latency:

# Slurm prolog script
kiseki-client-fuse --stage /datasets/imagenet --mountpoint /mnt/kiseki
export KISEKI_CACHE_POOL_ID=$(cat /var/cache/kiseki/pool_id)

# Workload picks up the staged cache via KISEKI_CACHE_POOL_ID
srun --export=ALL python train.py

Raft tuning

Snapshot interval

kiseki-server tuning set --raft-snapshot-interval 10000

  • Lower values (1000-5000): More frequent snapshots. Faster catch-up for new nodes. More I/O.
  • Higher values (50000-100000): Less snapshot overhead. Slower catch-up.

Compaction rate

kiseki-server tuning set --compaction-rate-mb-s 200

Higher compaction rate reduces Raft log size faster but consumes more I/O bandwidth.

View materialization poll interval

kiseki-server tuning set --stream-proc-poll-ms 50

Lower poll interval reduces view staleness but increases CPU usage.


Benchmark harness

Kiseki includes a transport benchmark for measuring raw fabric throughput:

# Run transport benchmarks (if available)
tests/hw/run_transport_bench.sh

What it measures

  • Bandwidth: Sequential read/write throughput per transport.
  • Latency: Round-trip latency (p50, p99, p999) per transport.
  • IOPS: Random read/write IOPS per transport.
  • Concurrency: Throughput scaling with connection count.

Interpreting results

Metric           | Good (CXI) | Good (TCP) | Action if below
-----------------|------------|------------|----------------
Bandwidth        | >150 Gbps  | >50 Gbps   | Check NIC config, MTU, NUMA pinning
Latency p99      | <10 us     | <500 us    | Check CPU frequency, interrupt coalescing
IOPS (4K random) | >1M        | >100K      | Check NVMe config, queue depth
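
A few quick spot-checks that map to the "Action if below" column; the eth0 and nvme0n1 device names are placeholders:

# Spot-check common culprits when benchmark numbers fall short
ip link show eth0 | grep -o 'mtu [0-9]*'                   # jumbo frames are usually expected on data fabrics
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # "performance" avoids frequency dips
cat /sys/block/nvme0n1/queue/nr_requests                   # NVMe queue depth
cat /sys/class/net/eth0/device/numa_node                   # confirm NUMA locality of the NIC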

System tuning checklist

Kernel parameters

# Increase maximum open files
echo "fs.file-max = 1048576" >> /etc/sysctl.conf

# Increase socket buffer sizes for high-bandwidth transports
echo "net.core.rmem_max = 67108864" >> /etc/sysctl.conf
echo "net.core.wmem_max = 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem = 4096 87380 67108864" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem = 4096 65536 67108864" >> /etc/sysctl.conf

# Disable transparent hugepages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

NVMe tuning

# Set I/O scheduler to none (best for NVMe)
echo none > /sys/block/nvme0n1/queue/scheduler

# Increase queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
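
To persist these queue settings across reboots, one option is a udev rule; the rule below is a sketch and the filename is arbitrary:

# Reapply NVMe queue settings on device add/change events
cat > /etc/udev/rules.d/60-kiseki-nvme.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none", ATTR{queue/nr_requests}="1024"
EOF
udevadm control --reload && udevadm trigger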

Process limits

# /etc/security/limits.d/kiseki.conf
kiseki  soft  nofile  1048576
kiseki  hard  nofile  1048576
kiseki  soft  memlock unlimited
kiseki  hard  memlock unlimited
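
Note that pam_limits files such as the one above apply to login sessions, not to systemd-managed services. When kiseki-server runs as a unit, the equivalent directives are LimitNOFILE and LimitMEMLOCK (sketch below; the unit name is assumed):

# /etc/systemd/system/kiseki-server.service.d/limits.conf (sketch; unit name assumed)
[Service]
LimitNOFILE=1048576
LimitMEMLOCK=infinity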