Durability
This document is the operator’s reference for what each performance knob trades against data durability. Read this before turning on any of the group-commit flags — most of them shift latency from per-write fsync to a periodic background flush, which means a power loss can discard writes that the application has already seen succeed.
Production target is 10-100+ nodes with Replication-3 or EC-4+2 (or larger codes). At that scale Raft replication + the under-replication scrub recover any per-node loss window automatically — the group-commit flags are a pure throughput win with no operator-visible durability cost. The single-node and 3-node configurations documented below are dev / smoke-test setups; they tolerate the group-commit loss window because they don’t carry production data, not because anyone has designed them to be safe under power loss.
Default behavior
With no env vars set, Kiseki provides the following durability guarantees:
| Operation | Durability |
|---|---|
| gateway.write() returns | composition row + chunk fragments are on disk (fsync per write) |
| gateway.delete() returns | tombstone is on disk |
| FUSE / NFS close(2) | best-effort flush — POSIX permits losing writes that weren’t fsync’d |
| FUSE / NFS fsync(2) | data is on disk before the call returns |
| Process crash / SIGKILL | no loss — kernel still owns the dirty pages |
| Power loss / kernel panic | no loss — every accepted write was fsynced |
This is the conservative configuration. It is also the slowest. The local single-node FUSE put-heavy benchmark hits ~2 800 op/s under this mode (May 2026 matrix). Group commit moves the ceiling to ~10 000+ op/s in exchange for a bounded loss window.
Performance flags that trade durability for throughput
KISEKI_CHUNK_FLUSH_INTERVAL_MS (chunk-store flush interval)
| Aspect | Value |
|---|---|
| Default interval | 100 ms |
| Group commit | always on when KISEKI_DATA_DIR is set (since 681de37) |
| Disable group commit | not exposed — would require code change to set_sync_per_write(true) |
| Recommended for production | default (100 ms) |
| Recommended for perf bench | default |
When KISEKI_DATA_DIR is set, the chunk store ALWAYS runs in group-commit mode: per-write fsync is skipped, and a periodic flush task issues one device.sync() every N ms. The env var only tunes N. The gateway’s fsync_pending hook also drives an explicit device.sync() so FUSE / NFS callers using fsync(2) still get POSIX-compliant durability.
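As a concrete sketch (the data-directory path below is illustrative, not a required location), enabling the store and widening the flush period looks like:

```shell
# KISEKI_DATA_DIR enables the chunk store and, with it, group commit;
# the interval variable only changes the flush period (default 100 ms).
export KISEKI_DATA_DIR=/var/lib/kiseki        # illustrative path
export KISEKI_CHUNK_FLUSH_INTERVAL_MS=250     # widen the loss window to 250 ms
```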
This was landed in 681de37 because per-write fsync was serializing parallel chunk writers through the kernel’s writeback queue (5 parallel 16 MiB PUTs went from ~620 ms total to ~370 ms after the fix). Reverting to per-write fsync is not exposed via env var because no production workload has been identified that benefits from it — the under-replication scrub handles the loss-window mitigation on multi-node, and single-node deployments without replication shouldn’t be hosting durable data anyway.
Loss window on power loss: up to N ms of chunk data that the gateway accepted but the kernel hadn’t yet flushed.
Multi-node mitigation: the under-replication scrub (kiseki-chunk-cluster::scrub_scheduler) walks cluster_chunk_state rows on restart. Any chunk whose local copy is missing gets re-fetched from peers via GetFragment. For Replication-3 with at most one simultaneous power loss, every lost chunk has 2 surviving copies. For EC-4+2 the scrub reconstructs from parity. At production scale (10-100+ nodes), simultaneous loss across a quorum is the dominant constraint, not the per-node loss window.
Single-node: dev / smoke-test only. The loss window is inherent to the group-commit pattern; single-node durability under power loss is not a design target.
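The loss window can be sized with simple arithmetic: multiply the sustained accept rate by the flush interval. The op rate below is illustrative, not a measured number:

```shell
# Back-of-envelope: acknowledged-but-unflushed writes at risk on power loss.
ops_per_sec=10000                                   # illustrative accept rate
window_ms=${KISEKI_CHUNK_FLUSH_INTERVAL_MS:-100}    # default flush interval
echo "at-risk writes: $(( ops_per_sec * window_ms / 1000 ))"
```

With the env var unset this prints `at-risk writes: 1000` — a thousand acknowledged writes per node that the multi-node scrub must be able to re-fetch.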
KISEKI_COMPOSITION_FLUSH_INTERVAL_MS (composition redb group commit)
| Aspect | Value |
|---|---|
| Default | unset (per-write fsync via Durability::Immediate) |
| Recommended value when set | 100 ms |
| Recommended for production | unset on single-node, 100 on multi-node |
| Recommended for perf bench | 100 |
When set, the composition redb runs at Durability::None — every txn.commit() skips fsync and stays in the WAL. A periodic task issues one Immediate-durability commit every N ms, forcing the WAL to disk. The gateway’s fsync_pending hook drives an immediate flush so FUSE / NFS fsync(2) callers are POSIX-compliant.
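A minimal sketch of the two modes this flag selects between:

```shell
# Unset -> Durability::Immediate: fsync on every composition commit.
# Set   -> Durability::None per commit, plus one Immediate commit every N ms.
export KISEKI_COMPOSITION_FLUSH_INTERVAL_MS=100   # N = 100 ms
```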
This was added to fix the FUSE p99 = 160 ms tail observed in the May 2026 single-node matrix — root cause was per-write composition redb fsync contending with the chunk store’s periodic device.sync on the same disk.
Loss window on power loss: up to N ms of composition metadata that the gateway accepted. The chunks are durable (per the chunk-store group-commit guarantee above), but the composition_id → chunks mapping may be lost — meaning the user-visible file is unrecoverable even though the bytes are on disk.
Multi-node mitigation: the composition hydrator (kiseki-composition::hydrator) replays Raft-committed deltas on restart. Any composition whose local copy is missing gets reconstructed from the per-shard Raft log replayed by peers. At production scale this means the per-node loss window is invisible to applications.
Single-node: dev only — same caveat as the chunk flag.
KISEKI_OBSERVABILITY=off (perf-cluster opt-out)
| Aspect | Value |
|---|---|
| Default | unset (on) |
| Set to off / 0 / false / disabled | bypass InstrumentedLogOps + InstrumentedKeyManager wrappers |
| Recommended for production | unset (on) |
| Recommended for perf bench | off (clean baseline) |
Disables the metric-record + tracing-span wrappers around LogOps and KeyManagerOps calls. Pure observability change — does not affect durability. Use for perf-cluster runs (infra/gcp/transport) where you need a clean baseline.
The hot-path #[tracing::instrument] macros are already at level = "debug", so production RUST_LOG=info / warn settings already short-circuit span creation. The wrappers themselves cost ~100 ns per call (one atomic counter increment + one histogram observation). For a 100 k op/s workload that’s ~0.01 % overhead; turn off only when measuring against a sub-µs baseline.
KISEKI_HYDRATOR_TRANSIENT_RETRIES (composition hydrator backoff)
| Aspect | Value |
|---|---|
| Default | 5 |
| Range | any non-negative integer |
| Recommended for production | default |
Number of retries before the hydrator declares a delta stuck and halts. Lower values fail faster on a malformed delta; higher values absorb more transient failures. Does not affect durability of already-applied deltas — only how aggressively the hydrator gives up on a problematic one. A stuck hydrator surfaces via the kiseki_composition_hydrator_halted gauge.
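One way to watch for this from a healthcheck is to parse a saved /metrics snapshot. The gauge name is from this document; the snapshot filename and the Prometheus-style text format are assumptions (in production you would fetch the snapshot from the server’s /metrics endpoint):

```shell
# Sample snapshot standing in for a real /metrics fetch (format assumed
# Prometheus text exposition; filename is illustrative).
printf 'kiseki_composition_hydrator_halted 1\n' > metrics.txt
halted=$(awk '/^kiseki_composition_hydrator_halted/ {print $2}' metrics.txt)
if [ "${halted:-0}" != "0" ]; then
  echo "hydrator halted: inspect the stuck delta before restarting"
fi
```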
KISEKI_RAFT_THREADS (Raft worker pool size)
| Aspect | Value |
|---|---|
| Default | max(num_cpus / 2, 4) |
| Recommended | default |
Worker thread count for the dedicated Raft tokio runtime. Does not affect durability — every Raft write still goes through the configured per-shard log store (redb-backed when KISEKI_DATA_DIR is set, in-memory otherwise). Tuning is purely a CPU-allocation trade-off between Raft and the data-path runtime.
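The default formula from the table can be reproduced in the shell to see what a given host would pick:

```shell
# Reproduce the default: max(num_cpus / 2, 4).
cpus=$(nproc)                                 # nproc: GNU coreutils
threads=$(( cpus / 2 > 4 ? cpus / 2 : 4 ))
echo "KISEKI_RAFT_THREADS default here: $threads"
```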
fsync(2) semantics
POSIX defines close(2) and fsync(2) differently:
- close(2): releases the file descriptor. No durability guarantee. POSIX explicitly permits losing data that was written but not fsync’d before close.
- fsync(2): blocks until all of this file’s data and metadata is on stable storage.
Kiseki’s FUSE driver follows POSIX:
- FUSE_FLUSH (called on close(2)) drains the dirty buffer through the gateway but does not force fsync on the backing stores. Group commit’s bounded loss window applies.
- FUSE_FSYNC (called on fsync(2)) drains the dirty buffer and drives gateway.fsync_pending() — every registered fsync hook runs (composition redb + chunk-store device). The call only returns once data is durable.
Apps that need durability must call fsync(2) after critical writes. This matches the contract of every other Linux filesystem (ext4, btrfs, XFS); apps that rely on close(2) for durability are buggy even on those filesystems.
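From a shell, one way to issue that explicit fsync(2) is GNU coreutils sync, which accepts file arguments in coreutils ≥ 8.24. The temp file below is just for the demo; on Kiseki you would use a path under the FUSE mount:

```shell
# Demo on a temp file; substitute a path under the Kiseki FUSE mount.
f=$(mktemp)
echo "critical record" > "$f"   # close(2) alone carries no durability promise
sync -d "$f"                    # fdatasync(2); returns once data is durable
```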
NFS clients get the same behavior via the COMMIT operation (NFSv3::commit, NFSv4::commit) — the server-side handler invokes the same gateway.fsync_pending() path. (NFSv3 with O_SYNC writes also forces durability per write.)
Failure-mode matrix
Three columns to track:
- Default = KISEKI_DATA_DIR set, no other env vars. Chunk group commit is on (always — see above), composition is at Durability::Immediate.
- +composition GC, single-node = KISEKI_COMPOSITION_FLUSH_INTERVAL_MS=100.
- +composition GC, multi-node R-3 = same env var on a 3-node Replication-3 cluster.
| Failure | Default | +composition GC, single-node | +composition GC, multi-node R-3 |
|---|---|---|---|
| Process crash / SIGKILL | no loss | no loss | no loss |
| OS reboot, no power loss | no loss | no loss | no loss |
| Power loss after close(2) | up to 100 ms of chunk data (chunk GC always on; composition is durable per-write so the comp→chunks mapping points at chunks that may be lost) | up to 100 ms of chunk data + composition metadata; comp+chunks may not align — orphan checks at startup | no loss (Raft re-replicates both chunks and compositions from peers) |
| Power loss after fsync(2) | no loss (fsync_pending forces device.sync) | no loss (fsync_pending forces both device.sync + redb commit) | no loss |
| Single disk failure | no loss (RAID / EC handles) | no loss | no loss |
| Single node failure (multi-node) | no loss (Raft majority) | n/a | no loss |
| Quorum loss (≥ 2 of 3 down) | accepted writes durable, new writes blocked | n/a | same |
| Whole-cluster power loss | per-node loss window depends on most-recent chunk fsync; operator runs scrub at startup | per-node loss window of chunk + composition state | same as default for chunk; composition recovers from Raft log on each node — cluster recovery reconciles via leader log |
Key invariant: fsync(2) is always durable, regardless of which flags are on. The group-commit flags only affect the implicit durability of writes that the application didn’t explicitly flush.
Recommended configurations by deployment
Single-node dev / laptop
```shell
# Defaults. Chunk group commit is always on (100 ms loss window on
# unclean power loss); composition is durable per-write. Apps that
# need durability call fsync(2) explicitly.
unset KISEKI_COMPOSITION_FLUSH_INTERVAL_MS
unset KISEKI_OBSERVABILITY
```
Multi-node production (3+ nodes, Replication-3 or EC)
```shell
# Composition group commit ON. Raft re-replication recovers the
# ≤ 100 ms loss window from peers on restart.
export KISEKI_COMPOSITION_FLUSH_INTERVAL_MS=100
```
Perf cluster (GCP transport profile, throughput baseline)
```shell
# Composition group commit ON + observability OFF for clean baseline.
export KISEKI_COMPOSITION_FLUSH_INTERVAL_MS=100
export KISEKI_OBSERVABILITY=off
```
Single-node FUSE benchmark
```shell
# Composition group commit ON, observability OFF. Accepts the 100 ms
# loss window (the benchmark itself doesn't care; operators DO need
# to know).
export KISEKI_COMPOSITION_FLUSH_INTERVAL_MS=100
export KISEKI_OBSERVABILITY=off
```
Verifying durability under group commit
After enabling either group-commit flag, sanity-check the fsync hook path:
```shell
# 1. Write a file via FUSE.
echo "hello world" > /mnt/kiseki/durability-test.txt

# 2. fsync(2) explicitly.
fsync_test /mnt/kiseki/durability-test.txt  # any tool that calls fsync

# 3. Hard kill (NOT a graceful shutdown — kill -9 the kiseki-server).
sudo kill -9 $(pgrep kiseki-server)

# 4. Restart and read back.
kiseki-server &
sleep 2
cat /mnt/kiseki/durability-test.txt  # MUST print "hello world"
```
If step 4 fails, the fsync hook isn’t wired. File a bug with the /metrics snapshot showing kiseki_chunk_persistent_write_phase_duration and the runtime log line fsync hook: composition redb registered.
Cross-reference
- Performance numbers and the May 2026 fix sweep: docs/performance/README.md
- ADR-029 §Group commit for the chunk-store rationale
- ADR-040 §D6 / §D8.1 for the composition redb / hydrator design
- commit 681de37 — chunk-store group commit landing
- commit 56ec297 (todo, after this work) — composition redb group commit landing