ADR-025: Storage Administration API

Status: Proposed. Date: 2026-04-20. Deciders: Architect + domain expert.

Context

Storage administrators need to performance-tune the system similar to Ceph (ceph osd pool set), VAST (management UI), or Lustre (lctl). The current control plane API handles tenant lifecycle but has no storage admin surface — no pool management, device management, performance tuning, or cluster-wide observability.

API-first principle: All admin interactions go through gRPC APIs. CLI (kiseki-cli), Web UI, and job orchestrators (Ansible, Terraform) are wrappers around these APIs. No SSH-and-edit-config path.

Decision

Admin API surface (new gRPC service)

service StorageAdminService {
  // === Device management ===
  rpc ListDevices(ListDevicesRequest) returns (ListDevicesResponse);
  rpc GetDevice(GetDeviceRequest) returns (DeviceInfo);
  rpc AddDevice(AddDeviceRequest) returns (AddDeviceResponse);
  rpc RemoveDevice(RemoveDeviceRequest) returns (RemoveDeviceResponse);
  rpc EvacuateDevice(EvacuateDeviceRequest) returns (EvacuateDeviceResponse);
  rpc CancelEvacuation(CancelEvacuationRequest) returns (CancelEvacuationResponse);

  // === Pool management ===
  rpc ListPools(ListPoolsRequest) returns (ListPoolsResponse);
  rpc GetPool(GetPoolRequest) returns (PoolInfo);
  rpc CreatePool(CreatePoolRequest) returns (CreatePoolResponse);
  rpc SetPoolDurability(SetPoolDurabilityRequest) returns (SetPoolDurabilityResponse);
  rpc SetPoolThresholds(SetPoolThresholdsRequest) returns (SetPoolThresholdsResponse);
  rpc RebalancePool(RebalancePoolRequest) returns (RebalancePoolResponse);

  // === Performance tuning ===
  rpc GetTuningParams(GetTuningParamsRequest) returns (TuningParams);
  rpc SetTuningParams(SetTuningParamsRequest) returns (SetTuningParamsResponse);

  // === Cluster observability ===
  rpc ClusterStatus(ClusterStatusRequest) returns (ClusterStatus);
  rpc PoolStatus(PoolStatusRequest) returns (PoolStatus);
  rpc DeviceHealth(DeviceHealthRequest) returns (stream DeviceHealthEvent);
  rpc IOStats(IOStatsRequest) returns (stream IOStatsEvent);

  // === Shard management ===
  rpc ListShards(ListShardsRequest) returns (ListShardsResponse);
  rpc GetShard(GetShardRequest) returns (ShardInfo);
  rpc SplitShard(SplitShardRequest) returns (SplitShardResponse);
  rpc SetShardMaintenance(SetShardMaintenanceRequest) returns (SetShardMaintenanceResponse);

  // === Repair and scrub ===
  rpc TriggerScrub(TriggerScrubRequest) returns (TriggerScrubResponse);
  rpc RepairChunk(RepairChunkRequest) returns (RepairChunkResponse);
  rpc ListRepairs(ListRepairsRequest) returns (ListRepairsResponse);
}

Tuning parameters

Storage admins tune at four levels: cluster → pool → tenant → workload. Lower levels inherit from higher, can only narrow (not broaden).

Cluster-wide tuning

Parameter	Default	Range	What it controls
`compaction_rate_mb_s`	100	10-1000	Background compaction throughput cap
`gc_interval_s`	300	60-3600	How often GC scans for reclaimable chunks
`rebalance_rate_mb_s`	50	0-500	Background rebalance/evacuation throughput
`scrub_interval_h`	168 (7d)	24-720	How often integrity scrub runs
`max_concurrent_repairs`	4	1-32	Parallel EC repair jobs
`stream_proc_poll_ms`	100	10-1000	View materialization poll interval
`inline_threshold_bytes`	4096	512-65536	Below this, data inlined in delta
`raft_snapshot_interval`	10000	1000-100000	Entries between Raft snapshots

Per-pool tuning

Parameter	Default	Range	What it controls
`ec_data_chunks`	4 (NVMe) / 8 (HDD)	2-16	EC data fragment count
`ec_parity_chunks`	2 (NVMe) / 3 (HDD)	1-8	EC parity fragment count
`replication_count`	3	2-5	For replication pools (not EC)
`warning_threshold_pct`	per ADR-024	50-95	Capacity warning level
`critical_threshold_pct`	per ADR-024	60-98	Capacity critical level
`readonly_threshold_pct`	per ADR-024	70-99	Read-only level
`target_fill_pct`	70 (SSD) / 80 (HDD)	50-90	Rebalance target fill level
`chunk_alignment_bytes`	4096	512-65536	On-disk alignment (RDMA/NVMe)
`prefer_sequential_alloc`	true	bool	Allocate sequentially in pool file

Per-tenant tuning (via ControlService, existing)

Parameter	Existing API	What it controls
`quota.capacity_bytes`	SetQuota	Tenant capacity ceiling
`quota.iops`	SetQuota	IOPS limit
`quota.metadata_ops_per_sec`	SetQuota	Metadata op rate limit
`dedup_policy`	CreateOrganization	Cross-tenant vs isolated dedup
`compliance_tags`	SetComplianceTags	Regulatory constraints

Per-workload tuning (via ControlService + Advisory)

Parameter	API	What it controls
`workload.quota`	CreateWorkload	Workload-level capacity/IOPS
`advisory.hints_per_sec`	Advisory ceilings	Hint submission rate
`advisory.prefetch_bytes_max`	Advisory ceilings	Prefetch budget
`advisory.profile`	Advisory profiles	Allowed hint profiles

Observability API

ClusterStatus response

message ClusterStatus {
  uint32 node_count = 1;
  uint32 healthy_nodes = 2;
  uint64 total_capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 pool_count = 5;
  uint32 shard_count = 6;
  uint32 active_repairs = 7;
  uint32 evacuating_devices = 8;
  repeated PoolSummary pools = 9;
}

PoolStatus response

message PoolStatus {
  string pool_id = 1;
  PoolHealth health = 2;
  uint64 capacity_bytes = 3;
  uint64 used_bytes = 4;
  uint32 device_count = 5;
  uint32 healthy_devices = 6;
  uint32 chunk_count = 7;
  // Performance metrics (rolling 60s window)
  double read_iops = 8;
  double write_iops = 9;
  double read_throughput_mb_s = 10;
  double write_throughput_mb_s = 11;
  double avg_read_latency_ms = 12;
  double avg_write_latency_ms = 13;
  double p99_read_latency_ms = 14;
  double p99_write_latency_ms = 15;
}

Streaming events

message DeviceHealthEvent {
  DeviceId device_id = 1;
  DeviceState old_state = 2;
  DeviceState new_state = 3;
  string reason = 4;
  uint64 timestamp_ms = 5;
}

message IOStatsEvent {
  string pool_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_throughput_mb_s = 4;
  double write_throughput_mb_s = 5;
  uint64 timestamp_ms = 6;
}

Admin personas and API mapping

Persona	Typical actions	APIs used
Cluster admin	Add/remove nodes, set cluster params, view health	StorageAdminService (all), ClusterStatus
Storage admin	Create pools, tune EC, set thresholds, rebalance	Pool*, SetTuningParams, PoolStatus
Tenant admin	Set quotas, compliance, retention, advisory	ControlService (existing)
Workload admin	Tune advisory, prefetch, dedup hints	Advisory (existing) + workload quota
On-call/SRE	View health, trigger repair, check alerts	ClusterStatus, DeviceHealth stream, TriggerScrub

CLI mapping (kiseki-cli)

kiseki cluster status              → ClusterStatus
kiseki pool list                   → ListPools
kiseki pool status fast-nvme       → PoolStatus
kiseki pool create --name bulk-hdd --class HddBulk --ec 8+3
kiseki pool tune fast-nvme --warning-pct 75 --target-fill 70
kiseki device list                 → ListDevices
kiseki device add /dev/nvme2n1 --pool fast-nvme
kiseki device evacuate dev-uuid    → EvacuateDevice
kiseki device health --watch       → DeviceHealth stream
kiseki tune set --compaction-rate 200 --gc-interval 120
kiseki shard list                  → ListShards
kiseki shard split shard-uuid      → SplitShard
kiseki repair scrub --pool fast-nvme
kiseki iostat --pool fast-nvme     → IOStats stream

Authorization model

API	Who can call	Auth
StorageAdminService (all)	Cluster admin only	mTLS cert with admin OU
ControlService (tenant ops)	Tenant admin	mTLS cert with tenant OU
Advisory (workload ops)	Workload identity	mTLS cert + workflow token
Read-only observability	Cluster admin, SRE	mTLS cert with admin/sre OU

Tenant admins cannot access StorageAdminService. They see their own quotas and compliance tags, not pool health or device state. This preserves the zero-trust boundary (I-T4).

Consequences

Full API-first admin surface — no SSH-and-edit needed
CLI, UI, automation all use the same gRPC APIs
Performance tuning at four levels with inheritance
Streaming observability for real-time monitoring
Clear authorization boundary between cluster admin and tenant admin
Significantly expands the gRPC surface (20+ new RPCs)

References

ADR-024: Device management and capacity thresholds
ADR-005: EC and chunk durability
ADR-020: Workflow advisory (workload-level tuning)
Ceph: ceph osd pool set command reference
Lustre: lctl set_param tunables
I-T4: Zero-trust infra/tenant boundary

Addendum: Adversarial Review Resolutions (2026-04-20)

C1: Per-tenant resource usage → ControlService, not StorageAdminService

Per-tenant resource usage (capacity, IOPS attribution) is exposed via ControlService with tenant-admin authorization, NOT via StorageAdminService. Cluster admin sees pool-level aggregates only. Tenant admin sees their own usage. This preserves I-T4.

// In ControlService (not StorageAdminService):
rpc GetTenantUsage(GetTenantUsageRequest) returns (TenantUsage);
// Requires tenant admin cert (mTLS OU = tenant ID)

C2: Per-device I/O stats added

rpc DeviceIOStats(DeviceIOStatsRequest) returns (stream DeviceIOStatsEvent);

message DeviceIOStatsEvent {
  string device_id = 1;
  double read_iops = 2;
  double write_iops = 3;
  double read_latency_p50_ms = 4;
  double read_latency_p99_ms = 5;
  double errors_per_sec = 6;
  uint64 timestamp_ms = 7;
}

C3: Shard health observability added

rpc GetShardHealth(GetShardHealthRequest) returns (ShardHealthInfo);

message ShardHealthInfo {
  string shard_id = 1;
  uint64 leader_node_id = 2;
  uint32 replica_count = 3;
  uint32 reachable_count = 4;
  uint32 recent_elections = 5;
  uint64 commit_lag_entries = 6;
}

C4: EC parameters immutable per pool

New invariant I-C6: EC parameters (data_chunks, parity_chunks) are immutable per pool. SetPoolDurability applies only to NEW chunks. Existing chunks retain their original EC configuration. Explicit re-encoding via ReencodePool RPC (long-running, cancellable).

Role	Access
`cluster-admin`	Full StorageAdminService (read + write)
`sre-on-call`	Read-only: List, Get, Status, Health streams
`sre-incident-response`	SRE + TriggerScrub, RepairChunk

Enforced via mTLS certificate OU field.

M4: DrainNode added

rpc DrainNode(DrainNodeRequest) returns (stream DrainNodeProgress);

Internally evacuates all devices on the node, then removes them. Idempotent, safe to retry.

Kiseki Documentation