ADR-024: Device Management, Storage Tiers, and Capacity Thresholds

Status: Accepted (19/19 device-management BDD scenarios pass). Date: 2026-04-20. Deciders: Architect + domain expert.

Context

The current design (ADR-005) defines three NVMe device classes but does not address:

  • HDD / spinning disk tiers (common in cost-optimized HPC clusters)
  • System partition vs data partition separation
  • Capacity thresholds and degradation behavior
  • Device health monitoring and proactive replacement
  • Memory-attached storage (CXL, persistent memory)
  • Mixed-tier deployments (SSD+HDD, fast-SSD+cheap-SSD)

Real HPC deployments often have:

  • System partition: RAID-1 (or RAID-1+0) on 2 SSDs for OS + Kiseki binaries + redb
  • Data partitions: JBOD — each NVMe/SSD/HDD is an independent pool member
  • Tiering: Hot data on fast NVMe, warm on cheap SSD, cold on HDD

Decision

Device classification

Extend DeviceClass to cover the full storage hierarchy:

| Class | Medium | Use case | Typical capacity |
|---|---|---|---|
| NvmeU2 | NVMe U.2 TLC/MLC | Metadata, hot data, Raft log | 1-8 TB |
| NvmeQlc | NVMe QLC | Checkpoints, warm data | 4-30 TB |
| NvmePersistentMemory | Intel Optane / CXL | Cache, ultra-hot metadata | 128 GB - 1 TB |
| SsdSata | SATA SSD | Budget fast storage | 1-8 TB |
| HddEnterprise | SAS/SATA HDD 10k/15k | Cold data, archive | 4-20 TB |
| HddBulk | SATA HDD 7.2k | Deep archive, bulk cold | 10-20 TB |
| Custom(String) | User-defined | Vendor-specific | Varies |

Server disk layout

Server node:
├── System partition (RAID-1 on 2× SSD)
│   ├── /boot, /root, OS
│   ├── /var/lib/kiseki/redb/       ← Raft log, metadata index
│   └── /var/lib/kiseki/config/     ← Node config, certs
│
├── Data devices (JBOD, managed by Kiseki)
│   ├── /dev/nvme0n1 → pool "fast-nvme"  (device member)
│   ├── /dev/nvme1n1 → pool "fast-nvme"  (device member)
│   ├── /dev/sda     → pool "bulk-ssd"   (device member)
│   ├── /dev/sdb     → pool "cold-hdd"   (device member)
│   └── ...
│
└── Optional: CXL memory → pool "pmem" (hot cache tier)

JBOD for data, RAID-1 for system. Kiseki manages data durability via EC/replication across JBOD members. The system partition uses traditional RAID-1 because redb and Raft log must survive single-disk failure without Kiseki’s own repair mechanism.

Pool capacity management

Per-device-class capacity thresholds

Thresholds vary by device type because NVMe/SSD suffer GC-induced write amplification at high fill levels, while HDD does not. Enterprise arrays (VAST, Pure) can operate at 95%+ because they have global wear leveling — JBOD does not have that luxury.

| State | NVMe/SSD | HDD | Behavior |
|---|---|---|---|
| Healthy | 0-75% | 0-85% | Normal writes, background rebalance |
| Warning | 75-85% | 85-92% | Log warning, emit telemetry |
| Critical | 85-92% | 92-97% | Reject new placements, advisory backpressure |
| ReadOnly | 92-97% | 97-99% | In-flight writes drain, no new writes |
| Full | 97-100% | 99-100% | ENOSPC to clients |

Rationale: NVMe/SSD GC pressure increases sharply above ~80% fill. QLC is worse than TLC. The SSD Warning threshold (75%) gives the placement engine time to redirect before the GC cliff. HDD has no such cliff — outer-track vs inner-track difference is ~20%, not a performance wall.

Implementation:

```rust
pub enum PoolHealth {
    Healthy,
    Warning { used_percent: u8 },
    Critical { used_percent: u8 },
    ReadOnly { used_percent: u8 },
    Full,
}

pub struct CapacityThresholds {
    pub warning_pct: u8,
    pub critical_pct: u8,
    pub readonly_pct: u8,
    pub full_pct: u8,
}

impl CapacityThresholds {
    pub fn for_device_class(class: &DeviceClass) -> Self {
        match class {
            DeviceClass::NvmeU2 | DeviceClass::NvmeQlc
            | DeviceClass::NvmePersistentMemory | DeviceClass::SsdSata => Self {
                warning_pct: 75,
                critical_pct: 85,
                readonly_pct: 92,
                full_pct: 97,
            },
            DeviceClass::HddEnterprise | DeviceClass::HddBulk => Self {
                warning_pct: 85,
                critical_pct: 92,
                readonly_pct: 97,
                full_pct: 99,
            },
            DeviceClass::Custom(_) => Self {
                warning_pct: 80,
                critical_pct: 90,
                readonly_pct: 95,
                full_pct: 99,
            },
        }
    }
}

impl AffinityPool {
    pub fn health(&self) -> PoolHealth {
        let pct = ((self.used_bytes * 100) / self.capacity_bytes) as u8;
        // Thresholds depend on the pool's device class (see table above);
        // this assumes AffinityPool records the class of its members.
        let t = CapacityThresholds::for_device_class(&self.device_class);
        match pct {
            p if p >= t.full_pct => PoolHealth::Full,
            p if p >= t.readonly_pct => PoolHealth::ReadOnly { used_percent: p },
            p if p >= t.critical_pct => PoolHealth::Critical { used_percent: p },
            p if p >= t.warning_pct => PoolHealth::Warning { used_percent: p },
            _ => PoolHealth::Healthy,
        }
    }
}
```

Placement engine behavior:

  • Healthy: Place chunks according to affinity policy
  • Warning: Continue placing but emit telemetry; cluster admin should add capacity
  • Critical: Reject new placements; redirect to same device-class sibling only
  • ReadOnly: In-flight writes complete; new writes fail with retriable error
  • Full: ENOSPC — client gets permanent error

Pool redirection policy: When a pool is Critical, the placement engine redirects to another pool of the same device class only. Never cross device-class boundaries (e.g., never NVMe → HDD). If no same-class sibling has capacity, return ENOSPC to client. This preserves performance SLAs and compliance tag enforcement.
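The redirection rule can be sketched as follows. `Pool`, `redirect`, and the fields used here are illustrative stand-ins for the placement engine's internals, not a confirmed API:

```rust
#[derive(Clone, Debug, PartialEq)]
enum DeviceClass { NvmeU2, HddBulk }

#[derive(Debug)]
struct Pool {
    name: &'static str,
    class: DeviceClass,
    critical: bool, // pool at or above its Critical threshold
}

/// Redirect only to a sibling pool of the *same* device class.
/// Err maps to ENOSPC: never cross device-class boundaries.
fn redirect(from: &Pool, pools: &[Pool]) -> Result<&'static str, &'static str> {
    pools
        .iter()
        .find(|p| p.name != from.name && p.class == from.class && !p.critical)
        .map(|p| p.name)
        .ok_or("ENOSPC")
}
```

Note that a Critical NVMe pool with only HDD siblings returns ENOSPC rather than silently degrading the performance SLA.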

System partition

OS-managed RAID-1 on 2× SSD. Kiseki does not manage the RAID.

Kiseki monitors system partition health:

  1. On startup: check /proc/mdstat for RAID health
  2. If degraded → log WARNING, continue operating
  3. If both drives failed → log CRITICAL, refuse to start
  4. Periodic check every 60 seconds

Admin is responsible for replacing failed system drives and rebuilding the RAID. Kiseki trusts the OS for system partition durability.
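A minimal sketch of the degraded-array check, assuming the standard /proc/mdstat format where a failed member appears as `_` inside the status brackets (e.g. `[2/1] [U_]`). This is illustrative, not the actual kiseki-server implementation:

```rust
/// Returns true if any md array in the given /proc/mdstat contents
/// reports a missing member ("_" inside the [UU...] status brackets).
fn mdstat_degraded(mdstat: &str) -> bool {
    mdstat
        .lines()
        // status lines look like: "1048512 blocks [2/2] [UU]"
        .filter(|line| line.contains('[') && line.contains(']'))
        .any(|line| line.contains('_'))
}
```

The real check would also distinguish "one member failed" (WARNING, keep running) from "array inactive" (CRITICAL, refuse to start).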

Device health monitoring

Each device reports SMART/health metrics:

| Metric | Threshold | Action |
|---|---|---|
| Temperature | >70°C | Warning; throttle if >80°C |
| Wear level (SSD) | >90% life used | Warning; proactive replacement window |
| Bad sectors (HDD) | >0 reallocated | Warning at 1; evacuate at >100 |
| Latency | >10× baseline | Mark degraded; reduce placement priority |
| Errors | Uncorrectable read | Mark suspect; verify EC/replicas for affected chunks |

Device states:

Healthy → Degraded → Failed → Removed
     ↘       ↗
   Evacuating → Removed

Eviction and evacuation policy

Key principle: Unhealthy devices are evacuated proactively rather than left to run until they fail. Full devices are write-blocked, not evicted (their data remains readable).

| Trigger | Action | Automatic? | Priority |
|---|---|---|---|
| SMART wear >90% (SSD) | Evacuate — migrate chunks to other pool members | Yes (background) | Normal |
| Bad sectors >100 (HDD) | Evacuate — migrate before cascading failure | Yes (background) | High |
| Uncorrectable read error | Evacuate + EC repair for affected chunks | Yes (immediate) | Critical |
| Temperature >80°C | Throttle I/O, alert admin | Yes | High |
| Device unresponsive | Mark Failed — trigger EC repair from survivors | Yes (immediate) | Critical |
| Pool at Critical threshold | Block writes — redirect to sibling pools | Yes | Normal |
| Pool at ReadOnly threshold | Drain writes — no new data, existing completes | Yes | Normal |
| Admin-initiated | Evacuate — controlled migration before physical removal | Manual | Normal |

Evacuation process:

  1. Mark device Evacuating
  2. For each chunk on device: read fragment, write to another healthy device in pool
  3. Update chunk metadata (redb) with new placement
  4. When all chunks migrated: mark device Removed
  5. Admin can physically pull the device

Evacuation speed: Bounded by network and destination device throughput. At 1 GB/s NVMe write speed, a 4TB device evacuates in ~67 minutes. EC repair (from parity) is faster since only the missing fragments need reconstruction.
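The back-of-envelope estimate above can be expressed as a helper; the function name and signature are illustrative:

```rust
/// Evacuation time is bounded by the slower of network throughput and
/// destination write throughput (both in bytes per second).
fn evacuation_secs(used_bytes: u64, network_bps: u64, dest_write_bps: u64) -> u64 {
    used_bytes / network_bps.min(dest_write_bps)
}
```

At 1 GB/s, a 4 TB device takes 4,000 s, roughly 67 minutes.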

Invariant: A device in Evacuating state accepts no new writes but serves reads for chunks not yet migrated.

Storage backend per JBOD device

| Approach | Pros | Cons | Recommendation |
|---|---|---|---|
| Raw block (ADR-029) | Zero FS overhead, direct I/O, aligned writes, bitmap allocator with redb journal | Custom allocator in kiseki-block | Default — recommended for production |
| File-backed (ADR-029) | Same DeviceBackend trait, works in VMs/CI without raw devices | Slight overhead from host FS | VMs and CI environments |
| xfs | Scales to 100M+ files, good NVMe support | Extra FS overhead, inode pressure at scale | Legacy / deprecated |

Default: Raw block device I/O via kiseki-block (DeviceBackend trait with auto-detection of device characteristics). File-backed fallback for VMs and CI. XFS is deprecated as a chunk storage backend; existing XFS deployments can migrate via background evacuation.
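A hedged sketch of what the DeviceBackend trait could look like (ADR-029 is the authority; method names here are assumptions), with the file-backed variant modeled in memory for brevity:

```rust
use std::io;

/// Hypothetical shape of the kiseki-block DeviceBackend trait.
pub trait DeviceBackend {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<()>;
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()>;
    fn capacity_bytes(&self) -> u64;
}

/// In-memory stand-in for the file-backed variant used in VMs/CI.
pub struct MemBackend { data: Vec<u8> }

impl DeviceBackend for MemBackend {
    fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<()> {
        let o = offset as usize;
        buf.copy_from_slice(&self.data[o..o + buf.len()]);
        Ok(())
    }
    fn write_at(&mut self, offset: u64, buf: &[u8]) -> io::Result<()> {
        let o = offset as usize;
        self.data[o..o + buf.len()].copy_from_slice(buf);
        Ok(())
    }
    fn capacity_bytes(&self) -> u64 { self.data.len() as u64 }
}
```

A single trait lets the placement engine treat raw block devices and file-backed devices uniformly, which is what makes the CI fallback cheap.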

Device discovery

Manual configuration (MVP):

  • Admin provides device list in node config (kiseki-server.toml)
  • Each device: path, class, pool assignment
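A hypothetical kiseki-server.toml fragment illustrating such a device list; the key names are assumptions for illustration, not a confirmed schema:

```toml
# Each data device: path, class, pool assignment (illustrative keys)
[[device]]
path = "/dev/nvme0n1"
class = "NvmeU2"
pool = "fast-nvme"

[[device]]
path = "/dev/sdb"
class = "HddBulk"
pool = "cold-hdd"
```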

Future: Auto-discovery:

  • Scan /sys/block/ for NVMe/SSD/HDD devices
  • Classify by transport (NVMe, SATA, SAS) and media (rotational flag)
  • Present to admin for pool assignment confirmation

Device state semantics:

  • Healthy: Normal I/O
  • Degraded: Elevated errors or latency; reduce write priority
  • Evacuating: Admin- or health-initiated; migrate chunks to other devices, then remove
  • Failed: I/O errors; trigger EC repair for all chunks
  • Removed: Device physically absent; metadata cleaned up

Tiering and data movement

Static placement (MVP): Admin assigns pools to device classes. Chunk placement is determined at write time by the composition’s view descriptor affinity policy. No automatic migration.

Future: Reactive tiering (per assumption A8):

  • Compositions with high read frequency auto-promote from cold → hot
  • Compositions with no reads for >N days auto-demote from hot → cold
  • Promotion/demotion as background job (copy chunk, update metadata, delete old)
  • Bounded by pool capacity thresholds (don’t overfill hot tier)
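The future promotion/demotion decision could be sketched as below; `TierAction`, the telemetry inputs, and the thresholds are illustrative placeholders, not a committed design:

```rust
#[derive(Debug, PartialEq)]
enum TierAction { Promote, Demote, Stay }

/// reads_per_day and days_since_last_read would come from telemetry;
/// the 30-day and 100-reads thresholds are placeholder values.
fn tier_action(reads_per_day: u64, days_since_last_read: u64,
               hot_tier_critical: bool) -> TierAction {
    if days_since_last_read > 30 {
        TierAction::Demote
    } else if reads_per_day > 100 && !hot_tier_critical {
        // Bounded by capacity thresholds: never promote into a Critical hot tier.
        TierAction::Promote
    } else {
        TierAction::Stay
    }
}
```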

Data model changes

```rust
pub enum DeviceClass {
    NvmeU2,
    NvmeQlc,
    NvmePersistentMemory,
    SsdSata,
    HddEnterprise,
    HddBulk,
    Custom(String),
}

pub struct DeviceInfo {
    pub id: DeviceId,
    pub class: DeviceClass,
    pub path: String,          // /dev/nvme0n1 or mount point
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub state: DeviceState,
    pub pool_id: Option<String>,
}

pub enum DeviceState {
    Healthy,
    Degraded { reason: String },
    Evacuating { progress_percent: u8 },
    Failed { since: u64 },
    Removed,
}
```

Consequences

  • Device diversity now first-class (HDD, SSD, NVMe, PMem)
  • Capacity management is explicit with defined thresholds
  • System partition (RAID-1) separated from data (JBOD)
  • Device health monitoring enables proactive replacement
  • Tiering is future work; static placement for MVP
  • Cluster admin must provision devices and assign to pools at setup time

References

  • ADR-005: EC and chunk durability (per pool)
  • ADR-022: Storage backend (redb on system partition)
  • Assumption A4: ClusterStor hardware
  • Assumption A8: Reactive tiering
  • Failure mode F-I2: Storage node failure
  • Failure mode F-I4: Disk/device failure