ADR-029: Raw Block Device Allocator
Status: Accepted
Adversarial review: 2026-04-22 (8 findings: 2H 4M 2L, all resolved)
Date: 2026-04-22
Context: ADR-022, ADR-024, ADR-005, I-C1 through I-C6
Problem
Chunk ciphertext needs to persist on JBOD data devices. ADR-024 specifies XFS on each device as the default, but filesystem overhead becomes the bottleneck at HPC scale:
- Double journaling: XFS journals its metadata, then redb journals ours — redundant durability cost
- Page cache pollution: OS caches data we already manage in our own cache layer, wasting DRAM
- Inode contention: Billions of chunks = billions of inodes; XFS metadata operations become the throughput ceiling
- Indirection: Every I/O traverses VFS → XFS → block layer → device; raw access removes two layers
Ceph’s migration from FileStore (XFS) to BlueStore (raw block) was driven by exactly these issues. DAOS uses SPDK for the same reason.
Decision
New crate: kiseki-block
A device I/O crate that manages raw block devices (and file-backed
fallback for VMs/CI). Separate from kiseki-chunk (domain logic).
kiseki-chunk depends on kiseki-block for storage.
Device Backend Trait
/// Abstraction over a storage device: raw block or file-backed.
/// Auto-detects device characteristics and adapts I/O strategy.
#[async_trait]
pub trait DeviceBackend: Send + Sync {
    /// Allocate a contiguous extent of at least `size` bytes.
    /// Alignment matches the device's physical block size.
    fn alloc(&self, size: u64) -> Result<Extent, AllocError>;

    /// Write data at the given extent.
    fn write(&self, extent: &Extent, data: &[u8]) -> Result<(), BlockError>;

    /// Read data from the given extent.
    fn read(&self, extent: &Extent) -> Result<Vec<u8>, BlockError>;

    /// Free an extent, returning blocks to the free pool.
    fn free(&self, extent: &Extent) -> Result<(), AllocError>;

    /// Sync all pending writes to stable storage.
    fn sync(&self) -> Result<(), BlockError>;

    /// Device capacity: (used_bytes, total_bytes).
    fn capacity(&self) -> (u64, u64);

    /// Probed device characteristics (read-only after open).
    fn characteristics(&self) -> &DeviceCharacteristics;
}
Auto-detection (no manual configuration)
On DeviceManager::open(path), probe sysfs (Linux):
/sys/block/<dev>/queue/rotational → 0 (SSD/NVMe) or 1 (HDD)
/sys/block/<dev>/queue/physical_block_size → 512 or 4096
/sys/block/<dev>/queue/optimal_io_size → device-preferred I/O size
/sys/block/<dev>/queue/max_hw_sectors_kb → max single I/O size
/sys/block/<dev>/device/model → model string
/sys/block/<dev>/device/numa_node → NUMA node (-1 if none)
/sys/block/<dev>/queue/discard_max_bytes → TRIM support (>0 = yes)
Derived properties:
pub struct DeviceCharacteristics {
    pub medium: DetectedMedium,
    pub physical_block_size: u32,
    pub optimal_io_size: u32,
    pub rotational: bool,
    pub numa_node: Option<u32>,
    pub supports_trim: bool,
    pub supports_smart: bool,
    pub io_strategy: IoStrategy,
}

pub enum DetectedMedium {
    NvmeSsd, // /sys/block/nvme*/ + rotational=0
    SataSsd, // rotational=0, not NVMe
    Hdd,     // rotational=1
    Virtual, // virtio in model, no SMART
    Unknown,
}

pub enum IoStrategy {
    DirectAligned,      // O_DIRECT | O_DSYNC: NVMe, SATA SSD
    BufferedSequential, // O_SYNC: HDD (readahead benefits)
    FileBacked,         // Default flags: VM, dev, CI
}
For non-Linux / VMs without sysfs: detect virtio in model string
or absence of block device properties → fall back to
IoStrategy::FileBacked with sparse file. All three strategies
implement the same DeviceBackend trait transparently.
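The mapping from probed values to a medium and an I/O strategy can be sketched as a pure function over the raw sysfs strings. This is a minimal illustration, not the crate's actual probe code: the `SysfsProbe` struct and the `classify` function are hypothetical names, and treating `Unknown` the same as `Virtual` (falling back to `FileBacked`) is an assumption based on the fallback rule above.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum DetectedMedium { NvmeSsd, SataSsd, Hdd, Virtual, Unknown }

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum IoStrategy { DirectAligned, BufferedSequential, FileBacked }

/// Hypothetical container for raw sysfs contents; `None` means the file was missing.
pub struct SysfsProbe<'a> {
    pub dev_name: &'a str,           // e.g. "nvme0n1" or "sda"
    pub rotational: Option<&'a str>, // contents of queue/rotational
    pub model: Option<&'a str>,      // contents of device/model
}

/// Derive the medium and I/O strategy from probed values.
pub fn classify(p: &SysfsProbe) -> (DetectedMedium, IoStrategy) {
    let is_virtio = p
        .model
        .map_or(false, |m| m.to_ascii_lowercase().contains("virtio"));

    let medium = match p.rotational.map(str::trim) {
        _ if is_virtio => DetectedMedium::Virtual,
        Some("0") if p.dev_name.starts_with("nvme") => DetectedMedium::NvmeSsd,
        Some("0") => DetectedMedium::SataSsd,
        Some("1") => DetectedMedium::Hdd,
        // Missing or unparseable sysfs entry: no block device properties.
        _ => DetectedMedium::Unknown,
    };

    let strategy = match medium {
        DetectedMedium::NvmeSsd | DetectedMedium::SataSsd => IoStrategy::DirectAligned,
        DetectedMedium::Hdd => IoStrategy::BufferedSequential,
        // Assumption: anything that is not a recognized physical device is file-backed.
        DetectedMedium::Virtual | DetectedMedium::Unknown => IoStrategy::FileBacked,
    };
    (medium, strategy)
}
```

Keeping the classification free of file I/O means it can be unit-tested with literal strings on any platform, while the code that actually reads /sys/block stays Linux-specific.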
On-disk format
Per data device:
Offset 0: [Superblock — 4K]
Offset 4K: [Primary Bitmap — variable size]
Offset M: [Mirror Bitmap — same size as primary]
Offset N: [Data Region — remainder of device]
Superblock (4K, first block):
pub struct Superblock {
    pub magic: [u8; 8],            // b"KISEKI\x01\x00"
    pub version: u32,              // Format version (1)
    pub device_id: [u8; 16],       // UUID
    pub block_size: u32,           // Physical block size (probed)
    pub total_blocks: u64,         // Device capacity in blocks
    pub bitmap_offset: u64,        // Byte offset of primary bitmap
    pub bitmap_mirror_offset: u64, // Byte offset of mirror bitmap
    pub bitmap_blocks: u64,        // Size of each bitmap in blocks
    pub data_offset: u64,          // Byte offset of data region
    pub generation: u64,           // Monotonic, incremented on bitmap flush
    pub checksum: [u8; 32],        // SHA-256 of superblock fields
}
Allocation bitmap (primary + mirror): 1 bit per block in the data region. Stored twice at different offsets for redundancy.
- At 4K blocks: 4TB device = 1 billion blocks = 128MB × 2 = 256MB
- At 512B blocks: 4TB device = 8 billion blocks = 1GB × 2 = 2GB
- Bitmap overhead: 0.006% (4K) to 0.048% (512B)
- On read: verify primary against mirror. On mismatch, use the copy consistent with the redb journal.
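The sizing figures above follow from one bit per block. The sketch below reproduces the arithmetic and derives the region offsets. It makes two simplifying assumptions that the ADR does not state: the bitmap is sized for every block on the device (slightly more than the data region needs), and the mirror is placed immediately after the primary. `DeviceLayout`, `bitmap_bytes`, and `layout` are illustrative names.

```rust
pub const SUPERBLOCK_BYTES: u64 = 4096;

/// Bytes needed for ONE bitmap copy: 1 bit per block, rounded up to a whole block.
pub fn bitmap_bytes(device_bytes: u64, block_size: u64) -> u64 {
    let blocks = device_bytes / block_size;
    let raw = (blocks + 7) / 8; // 8 blocks tracked per byte
    (raw + block_size - 1) / block_size * block_size
}

#[derive(Debug, PartialEq)]
pub struct DeviceLayout {
    pub bitmap_offset: u64,
    pub bitmap_mirror_offset: u64,
    pub data_offset: u64,
}

/// Superblock, then primary bitmap, then mirror bitmap, then data.
pub fn layout(device_bytes: u64, block_size: u64) -> DeviceLayout {
    let bm = bitmap_bytes(device_bytes, block_size);
    DeviceLayout {
        bitmap_offset: SUPERBLOCK_BYTES,
        bitmap_mirror_offset: SUPERBLOCK_BYTES + bm,
        data_offset: SUPERBLOCK_BYTES + 2 * bm,
    }
}
```

For a 4 TiB device, `bitmap_bytes(1 << 42, 4096)` yields 128 MiB per copy and `bitmap_bytes(1 << 42, 512)` yields 1 GiB per copy, matching the figures in the list.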
Per-extent CRC32: Every data extent has a 4-byte CRC32 trailer written after the payload data (within the same aligned block).
- On read: verify CRC32 before returning data.
- CRC mismatch → hardware corruption → trigger EC repair from parity fragments (not a security incident).
- AES-GCM auth_tag failure after CRC pass → actual tampering (security incident, alert + audit).
- This distinguishes hardware failure from cryptographic attack, enabling correct operational response.
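The trailer scheme can be illustrated with a small, dependency-free sketch. The ADR does not name the CRC32 variant or the trailer byte order, so the IEEE polynomial (reflected form 0xEDB88320) and a little-endian trailer are assumptions here, as are the function names. Padding up to the aligned block is left out for brevity.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected). Slow but table-free; fine for a sketch.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Append the 4-byte CRC trailer to the payload (little-endian, by assumption).
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 4);
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Marker error: the stored bytes do not match their trailer.
#[derive(Debug, PartialEq)]
pub struct CrcMismatch;

/// Verify the trailer and return the payload slice.
/// A mismatch is treated as hardware corruption, so the caller triggers EC repair.
pub fn verify(stored: &[u8]) -> Result<&[u8], CrcMismatch> {
    if stored.len() < 4 {
        return Err(CrcMismatch);
    }
    let (payload, trailer) = stored.split_at(stored.len() - 4);
    let expected = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32(payload) == expected { Ok(payload) } else { Err(CrcMismatch) }
}
```

Only after `verify` succeeds would the caller attempt AES-GCM decryption. A tag failure at that point cannot be explained by a bit flip on the medium, which is what justifies escalating it as a security incident.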
Allocation algorithm
Extent-based best-fit with free-list cache (Ceph BlueStore pattern, simpler than DAOS VEA):
- In-memory: B-tree of free extents (offset, block_count), sorted by offset. On alloc, scan for smallest extent >= requested blocks. On free, insert and coalesce with neighbors.
- Concurrency: alloc() and free() are serialized per device via Mutex on the allocator state. This is acceptable: allocation is a B-tree lookup (microseconds); I/O is the bottleneck, not allocation. Ceph BlueStore also serializes allocation per OSD.
- On-disk: Bitmap is ground truth. Free-list rebuilt from bitmap on startup (~100ms for 4TB at 4K blocks).
- Crash safety: Bitmap updates journaled in redb (device_alloc table) before applied to device bitmap region. On crash recovery: reload bitmap from device, replay pending journal entries from redb, rebuild free-list.
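The in-memory side of the allocator can be sketched with a `BTreeMap` keyed by offset. This is an illustration of best-fit with neighbor coalescing under stated assumptions, not the crate's implementation: `FreeList` and its methods are hypothetical names, units are blocks, and journaling and bitmap updates are omitted. In the real allocator the structure would sit behind the per-device Mutex described above.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Extent { pub offset: u64, pub blocks: u64 }

/// Free extents: starting block offset -> length in blocks.
pub struct FreeList { free: BTreeMap<u64, u64> }

impl FreeList {
    pub fn with_range(offset: u64, blocks: u64) -> Self {
        let mut free = BTreeMap::new();
        free.insert(offset, blocks);
        Self { free }
    }

    /// Best fit: take the smallest free extent that holds `blocks`
    /// (lowest offset wins ties) and return the unused tail to the map.
    pub fn alloc(&mut self, blocks: u64) -> Option<Extent> {
        if blocks == 0 {
            return None;
        }
        let (&off, &len) = self
            .free
            .iter()
            .filter(|(_, len)| **len >= blocks)
            .min_by_key(|(_, len)| **len)?;
        self.free.remove(&off);
        if len > blocks {
            self.free.insert(off + blocks, len - blocks);
        }
        Some(Extent { offset: off, blocks })
    }

    /// Return an extent and merge it with adjacent free neighbors.
    pub fn free(&mut self, e: Extent) {
        let (mut off, mut len) = (e.offset, e.blocks);
        let end = e.offset + e.blocks;

        // Predecessor that ends exactly where `e` starts.
        let prev = self.free.range(..off).next_back().map(|(&o, &l)| (o, l));
        if let Some((p_off, p_len)) = prev {
            if p_off + p_len == off {
                self.free.remove(&p_off);
                off = p_off;
                len += p_len;
            }
        }
        // Successor that starts exactly where `e` ends.
        if let Some(n_len) = self.free.remove(&end) {
            len += n_len;
        }
        self.free.insert(off, len);
    }
}
```

The linear scan in `alloc` mirrors the "scan for smallest extent" wording. A second index keyed by length would make the best-fit lookup logarithmic, at the cost of keeping two maps in sync.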
Allocation flow (WAL-ordered for crash safety):
- Round up requested size to block_size boundary
- Search free-list for best-fit extent
- Split extent if larger than needed
- Journal intent in redb (device_alloc table: alloc intent)
- Mark bits in bitmap (pwrite to bitmap region)
- Return Extent { offset, length }
- Caller writes data to extent, then commits chunk_meta to redb
- Clear intent from device_alloc journal (write complete)
On crash recovery: scan device_alloc for pending intents. If
the corresponding chunk_meta exists → write completed, clear
intent. If no chunk_meta → write was interrupted, free the
extent (clear bitmap bits, remove intent). This is the standard
WAL pattern — Ceph BlueStore uses the same approach.
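The recovery rule reduces to one decision per pending intent. The sketch below models it as a pure planning function. It assumes each intent records the chunk_id it was allocated for, which the device_alloc schema in this ADR does not show, and it stands in for the redb lookup with a `HashSet`. All names are illustrative.

```rust
use std::collections::HashSet;

/// Hypothetical pending entry; the `chunk_id` field is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct AllocIntent {
    pub chunk_id: [u8; 32],
    pub offset: u64,
    pub length: u64,
}

#[derive(Debug, PartialEq)]
pub enum RecoveryAction {
    /// chunk_meta exists: the write completed, only the intent is stale.
    ClearIntent,
    /// No chunk_meta: the write was interrupted, so release the extent too.
    FreeExtentAndClearIntent,
}

/// Decide what to do with each pending intent found in device_alloc.
/// `committed` stands in for "chunk_meta contains this chunk_id".
pub fn plan_recovery(
    pending: &[AllocIntent],
    committed: &HashSet<[u8; 32]>,
) -> Vec<(AllocIntent, RecoveryAction)> {
    pending
        .iter()
        .map(|intent| {
            let action = if committed.contains(&intent.chunk_id) {
                RecoveryAction::ClearIntent
            } else {
                RecoveryAction::FreeExtentAndClearIntent
            };
            (intent.clone(), action)
        })
        .collect()
}
```

Separating the plan from its execution keeps the decision logic testable without a device or a redb instance.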
Free flow:
- Journal the deallocation intent in redb
- Clear bits in bitmap
- Insert freed extent into free-list, coalesce neighbors
- If supports_trim: add to TRIM batch queue (see below)
- Clear dealloc intent from journal
TRIM batching: Freed extents accumulate in a TRIM queue per
device. A batched BLKDISCARD ioctl is issued periodically
(every 60 seconds or when queue exceeds 1GB). This avoids
write amplification from many small TRIM commands.
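A per-device queue with the two flush triggers might look like the following sketch. `TrimQueue` and its methods are hypothetical, 1GB is interpreted as 2^30 bytes, and the current time is passed in so the logic can be tested without sleeping. Issuing the BLKDISCARD ioctl for the drained extents is left to the caller.

```rust
use std::time::{Duration, Instant};

pub struct TrimQueue {
    pending: Vec<(u64, u64)>, // (byte offset, byte length) of freed extents
    pending_bytes: u64,
    last_flush: Instant,
}

impl TrimQueue {
    pub const MAX_AGE: Duration = Duration::from_secs(60);
    pub const MAX_BYTES: u64 = 1 << 30; // assumption: "1GB" means 1 GiB

    pub fn new(now: Instant) -> Self {
        Self { pending: Vec::new(), pending_bytes: 0, last_flush: now }
    }

    /// Record a freed extent for a later batched discard.
    pub fn push(&mut self, offset: u64, length: u64) {
        self.pending.push((offset, length));
        self.pending_bytes += length;
    }

    /// Flush when the batch is old enough or large enough; never when empty.
    pub fn should_flush(&self, now: Instant) -> bool {
        if self.pending.is_empty() {
            return false;
        }
        self.pending_bytes > Self::MAX_BYTES
            || now.saturating_duration_since(self.last_flush) >= Self::MAX_AGE
    }

    /// Hand the batch to the caller and reset the counters.
    pub fn drain(&mut self, now: Instant) -> Vec<(u64, u64)> {
        self.pending_bytes = 0;
        self.last_flush = now;
        std::mem::take(&mut self.pending)
    }
}
```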
Maximum extent size: 16MB. Allocations larger than 16MB are
split into multiple extents. FragmentLocation in chunk_meta
already supports multiple extents per chunk via Vec<FragmentLocation>.
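Splitting a large allocation is simple arithmetic. A minimal sketch, assuming 16MB means 16 MiB and that the requested length is already rounded to the block size (`split_lengths` is an illustrative name):

```rust
pub const MAX_EXTENT_BYTES: u64 = 16 * 1024 * 1024;

/// Break `total` bytes into extent lengths of at most MAX_EXTENT_BYTES each.
/// All pieces are full-sized except possibly the last.
pub fn split_lengths(total: u64) -> Vec<u64> {
    let mut out = Vec::new();
    let mut remaining = total;
    while remaining > 0 {
        let n = remaining.min(MAX_EXTENT_BYTES);
        out.push(n);
        remaining -= n;
    }
    out
}
```

A 40 MiB request becomes three extents of 16, 16, and 8 MiB, each of which would be allocated independently and recorded as its own FragmentLocation.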
I/O strategy per device type
| Strategy | Open flags | Alignment | Sync | Use case |
|---|---|---|---|---|
| DirectAligned | O_DIRECT \| O_DSYNC | physical_block_size | Implicit (O_DSYNC) | NVMe, SATA SSD |
| BufferedSequential | O_SYNC | 512B | fdatasync() | HDD |
| FileBacked | default | 4K (simulated) | fsync() | VM, dev, CI |
FileBacked alignment: FileBackedDevice enforces the same 4K
alignment as RawBlockDevice to ensure tests faithfully reproduce
raw block behavior. Code that passes CI will not fail on real
hardware due to alignment issues.
- Write buffers aligned via std::alloc::Layout::from_size_align for O_DIRECT compatibility
- NUMA-aware: pin allocator thread to numa_node if detected
- TRIM/UNMAP for freed extents if supports_trim, issued through the batched TRIM queue (SSD wear management)
- optimal_io_size used for write batching (coalesce small writes up to this size before issuing I/O)
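O_DIRECT requires the buffer address, length, and file offset to be aligned to the device's logical block size. An aligned write buffer built on `std::alloc::Layout::from_size_align` could be sketched as follows. `AlignedBuf` is a hypothetical type, and rounding the length up to a multiple of the alignment is an assumption about how callers would pad short writes.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::ptr::NonNull;

/// Heap buffer whose address and length are both multiples of `align`.
pub struct AlignedBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBuf {
    /// `align` must be a power of two; `len` is rounded up to a multiple of it.
    pub fn new(len: usize, align: usize) -> Option<Self> {
        if len == 0 || !align.is_power_of_two() {
            return None;
        }
        let size = len.checked_add(align - 1)? & !(align - 1);
        let layout = Layout::from_size_align(size, align).ok()?;
        // SAFETY: layout.size() is non-zero because len > 0.
        let ptr = NonNull::new(unsafe { alloc_zeroed(layout) })?;
        Some(Self { ptr, layout })
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: ptr is valid for layout.size() initialized (zeroed) bytes.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: as above, and &mut self guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

Passing `physical_block_size` as `align` would satisfy the DirectAligned row of the table, and using 4096 in the file-backed backend would exercise the same code path in CI.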
Metadata in redb (system partition)
ADR-022’s redb on the RAID-1 system partition stores chunk metadata:
Table: chunk_meta
Key: [u8; 32] (chunk_id)
Value: bincode-serialized ChunkMeta {
refcount: u64,
retention_holds: Vec<String>,
pool_name: String,
stored_bytes: u64,
fragments: Vec<FragmentLocation {
device_id: [u8; 16],
offset: u64,
length: u64,
}>,
envelope_meta: EnvelopeMeta {
nonce: [u8; 12],
auth_tag: [u8; 16],
system_epoch: u64,
tenant_epoch: Option<u64>,
tenant_wrapped_material: Option<Vec<u8>>,
},
}
Table: device_alloc (bitmap journal for crash safety)
Key: (device_id: [u8; 16], generation: u64)
Value: bincode-serialized Vec<AllocJournalEntry {
offset: u64,
length: u64,
is_alloc: bool, // true = allocate, false = free
}>
Separation of concerns
The allocator does NOT know about device subclasses (NvmeU2 vs
NvmeQlc, HddEnterprise vs HddBulk). Those are pool/placement
concerns in kiseki-chunk and kiseki-control (ADR-024).
| Layer | Cares about | Doesn’t care about |
|---|---|---|
| kiseki-block | physical_block_size, rotational, O_DIRECT | TLC vs QLC, RPM, pool policy |
| kiseki-chunk | pool thresholds, EC config, placement | block alignment, I/O flags |
| kiseki-control | device class, pool assignment, tiering | how bytes reach the device |
The DeviceClass enum (ADR-024) stays in kiseki-chunk/kiseki-control.
DeviceCharacteristics (auto-probed) stays in kiseki-block.
Integration with existing code
- ChunkOps trait (ADR-005) unchanged: callers unaware of backend
- New PersistentChunkStore in kiseki-chunk implements ChunkOps:
  - write_chunk(): EC encode → alloc extents per device via DeviceBackend → write fragments → update redb chunk_meta
  - read_chunk(): lookup redb chunk_meta → DeviceBackend::read per fragment → EC decode if needed → return Envelope
  - gc(): free extents via DeviceBackend::free → update bitmap → remove from redb
- DeviceManager in kiseki-block opens devices at startup, probes characteristics, creates appropriate DeviceBackend per device
- Server runtime (kiseki-server) wires DeviceManager → pools → PersistentChunkStore when KISEKI_DATA_DIR is set
Crate structure
kiseki-block/
├── Cargo.toml
└── src/
    ├── lib.rs
    ├── backend.rs      # DeviceBackend trait
    ├── raw.rs          # RawBlockDevice (O_DIRECT)
    ├── file.rs         # FileBackedDevice (sparse file)
    ├── probe.rs        # Sysfs device probing
    ├── superblock.rs   # On-disk superblock format
    ├── bitmap.rs       # Allocation bitmap
    ├── allocator.rs    # Extent allocator (free-list + bitmap)
    ├── extent.rs       # Extent type
    ├── manager.rs      # DeviceManager
    └── error.rs        # BlockError, AllocError
Rationale
- Raw block over XFS: Eliminates FS overhead (journaling, inode, page cache) that becomes the bottleneck at NVMe line rate. Ceph BlueStore validated this approach at scale.
- Auto-detection over manual config: Reduces deployment friction. Admin provides device paths; Kiseki probes characteristics. Works correctly on bare metal, VMs, and CI without config changes.
- Bitmap over B-tree free-list on disk: Simpler crash recovery (fixed-size, position-indexed). Free-list is derived in-memory. DAOS VEA uses B-tree on persistent memory, but we don’t require PMEM — bitmap on block device with redb journal is sufficient.
- File-backed fallback: Same trait, different backend. Tests and CI don’t need raw devices. VMs work without device passthrough.
- Separate crate: kiseki-block has no domain knowledge (chunks, EC, pools). Clean dependency boundary. Testable in isolation.
Alternatives Considered
- XFS on each JBOD device (ADR-024 original default): Rejected for production because FS overhead at NVMe line rate is unacceptable. Still available as FileBacked strategy for dev/VM.
- SPDK userspace I/O (DAOS model): Rejected because it requires dedicated devices (no kernel access), complicates deployment, and needs custom memory management (DMA buffers). Future optimization path if kernel I/O overhead is measured as bottleneck.
- Pool files (one large file per device): Rejected because it still has FS overhead (XFS metadata for the pool file itself). Raw block eliminates the FS entirely.
- redb for chunk data: Rejected because its B-tree is not designed for multi-GB blob storage. Acceptable for metadata only.
Consequences
- Adds kiseki-block crate to workspace (~2000 lines estimated)
- Data devices must be provisioned as raw (no filesystem). Operator provides device paths in config; Kiseki writes superblock on init.
- VMs and CI use file-backed mode transparently (no raw devices needed)
- Crash recovery depends on redb journal + device bitmap consistency
- Device initialization is a destructive operation (it writes the superblock and bitmap, so existing data on the device is lost). Safety checks before init: (1) check for existing Kiseki superblock magic and require --force if found, (2) check for known FS signatures (XFS, ext4, NTFS magic) and refuse with a clear error, (3) audit log the init
- TRIM/UNMAP support improves SSD endurance but is optional
- Future: SPDK backend can implement DeviceBackend trait for userspace I/O without changing upper layers
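The pre-init safety checks amount to signature matching on the first bytes of the device. The sketch below uses the well-known on-disk magics (XFS "XFSB" at offset 0, the ext2/3/4 superblock magic 0xEF53 stored little-endian at byte 1080, the NTFS OEM ID at offset 3) alongside the Kiseki magic from the superblock definition. `InitCheck` and `check_before_init` are illustrative names, and the caller is assumed to have read at least the first 2 KiB of the device.

```rust
#[derive(Debug, PartialEq)]
pub enum InitCheck {
    /// No known signature found; init may proceed.
    Clean,
    /// Existing Kiseki superblock; proceed only with --force.
    ExistingKiseki,
    /// Foreign filesystem detected; refuse with a clear error.
    ForeignFilesystem(&'static str),
}

const KISEKI_MAGIC: &[u8] = b"KISEKI\x01\x00";

/// True when `head` contains `sig` at byte offset `off`.
fn has_sig(head: &[u8], off: usize, sig: &[u8]) -> bool {
    head.get(off..off + sig.len()) == Some(sig)
}

/// Classify a device from its leading bytes before a destructive init.
pub fn check_before_init(head: &[u8]) -> InitCheck {
    if has_sig(head, 0, KISEKI_MAGIC) {
        InitCheck::ExistingKiseki
    } else if has_sig(head, 0, b"XFSB") {
        InitCheck::ForeignFilesystem("XFS")
    } else if has_sig(head, 1080, &[0x53u8, 0xEF]) {
        InitCheck::ForeignFilesystem("ext2/3/4")
    } else if has_sig(head, 3, b"NTFS    ") {
        InitCheck::ForeignFilesystem("NTFS")
    } else {
        InitCheck::Clean
    }
}
```

A short read is treated as "signature absent" rather than an error here; a production check would more likely refuse to initialize when it cannot read enough of the device to be sure.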
Adversarial Review Findings (2026-04-22)
| # | Severity | Finding | Resolution |
|---|---|---|---|
| 1 | High | Write ordering — data before metadata creates phantom chunks on crash | WAL intent journal: alloc → journal intent → write data → commit chunk_meta → clear intent. Recovery replays intents. |
| 2 | High | No per-extent checksum — silent corruption indistinguishable from tampering | CRC32 trailer per extent. CRC fail = hardware corruption (EC repair). Auth tag fail after CRC pass = tampering (security alert). |
| 3 | Medium | Bitmap single point of failure per device | Primary + mirror bitmap at different offsets. On mismatch, use copy consistent with redb journal. |
| 4 | Medium | No device init safety: accidental overwrite of existing data | Safety checks: existing Kiseki magic → require --force. Known FS signatures → refuse. Audit log init. |
| 5 | Medium | File-backed mode doesn’t enforce alignment — CI misses bugs | FileBacked enforces same 4K alignment as RawBlockDevice. |
| 6 | Medium | Concurrent alloc race on shared free-list | Mutex per device on allocator state. Allocation is microseconds; I/O is the bottleneck. |
| 7 | Low | Immediate TRIM on free causes write amplification | Batch TRIM queue: accumulate, issue BLKDISCARD every 60s or at 1GB threshold. |
| 8 | Low | No max extent size — unbounded alloc fragments bitmap scan | Max extent 16MB. Larger chunks split into multiple extents. |
References
- Ceph BlueStore: Architecture
- DAOS VOS/VEA: Storage Model
- ADR-022: Storage backend (redb for metadata)
- ADR-024: Device management and capacity thresholds
- ADR-005: EC and chunk durability