
Monitoring & Observability

Kiseki provides three observability pillars: metrics (Prometheus), structured logging (tracing), and distributed traces (OpenTelemetry). All three are tenant-aware, respecting the zero-trust boundary between cluster admin and tenant admin (ADR-015).


Prometheus metrics

Every kiseki-server node exposes Prometheus metrics in text exposition format on the metrics HTTP port.

Endpoint

GET http://<node>:9090/metrics
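
For a quick sanity check, fetch the endpoint directly and filter for a metric family (node1 below is a placeholder hostname):

# Fetch the exposition output and show the pool capacity gauges
curl -s http://node1:9090/metrics | grep '^kiseki_pool_capacity'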

Registered metrics

| Metric name | Type | Labels | Description |
|---|---|---|---|
| kiseki_raft_commit_latency_seconds | Histogram | shard | Raft commit latency per shard. Buckets: 100us to 1s. |
| kiseki_raft_entries_total | Counter | (none) | Total Raft entries applied on this node. |
| kiseki_chunk_write_bytes_total | Counter | (none) | Total chunk bytes written. |
| kiseki_chunk_read_bytes_total | Counter | (none) | Total chunk bytes read. |
| kiseki_chunk_ec_encode_seconds | Histogram | strategy | EC encode latency. Buckets: 100us to 50ms. |
| kiseki_gateway_requests_total | Counter | method, status | Gateway request count by method (GET, PUT, DELETE, etc.) and HTTP status. |
| kiseki_gateway_request_duration_seconds | Histogram | method | Gateway request duration. Buckets: 1ms to 5s. |
| kiseki_pool_capacity_total_bytes | Gauge | pool | Total capacity per pool in bytes. |
| kiseki_pool_capacity_used_bytes | Gauge | pool | Used capacity per pool in bytes. |
| kiseki_transport_connections_active | Gauge | (none) | Active transport connections. |
| kiseki_transport_connections_idle | Gauge | (none) | Idle transport connections. |
| kiseki_shard_delta_count | Gauge | shard | Current delta count per shard. |
| kiseki_key_rotation_total | Counter | (none) | Key rotations performed (system + tenant). |
| kiseki_crypto_shred_total | Counter | (none) | Crypto-shred operations performed. |
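
As an illustration of the exposition format, a histogram family is exported as cumulative buckets plus a running sum and count. The sample below is hypothetical (shard name and values are invented for illustration; intermediate buckets elided):

# HELP kiseki_raft_commit_latency_seconds Raft commit latency per shard.
# TYPE kiseki_raft_commit_latency_seconds histogram
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="0.0001"} 12
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="0.001"} 857
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="1"} 5321
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="+Inf"} 5321
kiseki_raft_commit_latency_seconds_sum{shard="shard-0001"} 14.2
kiseki_raft_commit_latency_seconds_count{shard="shard-0001"} 5321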

Metric scoping (zero-trust)

Per ADR-015, metric scoping respects the zero-trust boundary:

  • Cluster admin sees: Aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with an opaque hash) unless access to that tenant's metrics has been approved.
  • Tenant admin sees: Their own tenant’s metrics via the tenant audit export.
  • No metric exposes: File names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).

Metric cardinality

Metric cardinality is bounded by design. Label values are drawn from fixed sets (shard IDs, pool names, HTTP methods, strategy names). There are no unbounded label values such as file paths, tenant IDs, or user identifiers in metrics labels.
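
One way to spot-check this bound on a running node is to count the exposed series (a rough proxy for cardinality; node1 is a placeholder hostname):

# Count the kiseki series currently exposed by one node
curl -s http://node1:9090/metrics | grep -c '^kiseki_'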


Structured logging

Kiseki uses the tracing crate for structured logging. Every log event is a structured record with typed fields.

Configuration

| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Filter directive. Supports per-module granularity. |
| KISEKI_LOG_FORMAT | text | Output format: text (human-readable) or json (structured). |

Filter examples

# Default: info-level for all Kiseki modules
RUST_LOG=kiseki=info

# Debug for the Raft subsystem, info for everything else
RUST_LOG=kiseki_raft=debug,kiseki=info

# Trace-level for the chunk subsystem (very verbose)
RUST_LOG=kiseki_chunk=trace,kiseki=info

# Warnings only (quiet)
RUST_LOG=warn

JSON output format

In production, set KISEKI_LOG_FORMAT=json for structured log aggregation (ELK, Loki, Datadog, etc.):

{
  "timestamp": "2026-04-23T14:30:00.123Z",
  "level": "INFO",
  "target": "kiseki_raft",
  "message": "Raft leader elected",
  "shard": "shard-0001",
  "node_id": 1,
  "term": 42
}
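
With JSON output, standard log tooling applies directly. A minimal sketch, assuming logs are captured to a file (kiseki.log is a hypothetical name):

# Show only ERROR events from a captured JSON log stream
jq -c 'select(.level == "ERROR")' kiseki.log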

Log levels

| Level | Usage |
|---|---|
| ERROR | Unrecoverable failures, invariant violations, data loss events. |
| WARN | Recoverable issues, degraded state, approaching capacity limits. |
| INFO | Significant state changes: leader election, key rotation, shard split, node join/leave. |
| DEBUG | Detailed operational events: individual RPCs, cache hits/misses, EC operations. |
| TRACE | Wire-level detail: Raft message contents, HKDF inputs, bitmap operations. |

Security in logs

  • Tenant-identifying fields (tenant_id, namespace) are present for correlation.
  • Content fields (file names, chunk plaintext, key material) are never logged (I-K8).
  • Logs ship to the same audit/observability pipeline as metrics and traces.

Distributed tracing (OpenTelemetry)

Kiseki uses OpenTelemetry for distributed tracing across the full write/read path.

Configuration

| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP gRPC endpoint. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name in traces. |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (1.0 = 100%, 0.1 = 10%). Reduce in production for high-throughput workloads. |
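
A typical shell setup for enabling tracing, assuming kiseki-server reads these variables at startup (the endpoint and sampling rate here are examples):

# Export traces to a local Jaeger collector, sampling 10% of requests
export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
export OTEL_SERVICE_NAME=kiseki-server
export OTEL_TRACES_SAMPLER_ARG=0.1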

Trace propagation

Every write/read path carries a trace ID via OpenTelemetry context propagation. Traces span:

client -> gateway -> composition -> log -> chunk -> view

For the native client path:

client (FUSE) -> transport -> composition -> log -> chunk

Jaeger integration

The development Docker Compose stack includes Jaeger for trace visualization:

  • Jaeger UI: http://localhost:16686
  • OTLP gRPC receiver: localhost:4317
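
Outside the dev Compose stack, a standalone Jaeger instance with matching ports can be started with Docker (a sketch using the upstream all-in-one image):

# Run Jaeger all-in-one with the OTLP gRPC receiver enabled
docker run --rm -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 \
  jaegertracing/all-in-one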

Trace scoping

Traces respect the zero-trust boundary:

  • Tenant-scoped traces are visible only to the tenant admin (via tenant audit export).
  • Cluster admin sees system-level spans. No tenant content appears in span attributes visible to the cluster admin.
  • Trace overhead is approximately 1-2% on the data path (acceptable for production).

Event store

The admin dashboard maintains an in-memory event store for diagnostic events. Events are categorized and severity-tagged.

Event categories

| Category | Events |
|---|---|
| node | Node join, node leave, node unreachable, node recovered. |
| shard | Shard created, shard split, shard maintenance entered/exited. |
| device | Device added, device failed, SMART warning, evacuation started/completed. |
| tenant | Tenant created, tenant deleted, quota changed. |
| security | Auth failure, cert revocation, crypto-shred. |
| admin | Maintenance mode toggle, backup requested, scrub requested, tuning parameter change. |
| gateway | Protocol errors, connection surge, rate limiting. |
| raft | Leader election, membership change, snapshot transfer. |

Event severities

| Severity | Description |
|---|---|
| info | Normal operations. |
| warning | Attention needed, but the system is operating. |
| error | Failure requiring investigation. |
| critical | Immediate action required (data at risk, quorum lost). |

Event API

# All events from the last 3 hours
curl http://node1:9090/ui/api/events

# Errors from the last hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1'

# Device events, last 50
curl 'http://node1:9090/ui/api/events?category=device&limit=50'

# Security events from the last 24 hours
curl 'http://node1:9090/ui/api/events?category=security&hours=24'

Historical metrics API

# Metric snapshots from the last 3 hours
curl http://node1:9090/ui/api/history

# Last 6 hours
curl 'http://node1:9090/ui/api/history?hours=6'

The history endpoint returns time-series data points suitable for charting. The default retention is 3 hours in memory. For longer retention, use Prometheus.


Grafana integration

For production monitoring with alerting and long-term storage, configure Prometheus to scrape Kiseki metrics and visualize with Grafana.

Prometheus scrape configuration

scrape_configs:
  - job_name: 'kiseki'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'node1:9090'
          - 'node2:9090'
          - 'node3:9090'
    metrics_path: '/metrics'
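
After merging this snippet into prometheus.yml, the configuration can be validated with promtool (shipped with Prometheus):

# Validate the scrape configuration before reloading Prometheus
promtool check config prometheus.yml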

Cluster overview dashboard:

  • Cluster health (up/down per node)
  • Total Raft entries/sec (rate of kiseki_raft_entries_total)
  • Gateway request rate (rate of kiseki_gateway_requests_total)
  • Gateway latency p50/p99 (kiseki_gateway_request_duration_seconds)
  • Pool utilization (kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes)
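
The panels above map to PromQL along these lines (example queries, not shipped dashboards):

# Raft entries/sec across the cluster
sum(rate(kiseki_raft_entries_total[5m]))

# Gateway request rate by method and status
sum by (method, status) (rate(kiseki_gateway_requests_total[5m]))

# Gateway p99 latency
histogram_quantile(0.99, sum by (le) (rate(kiseki_gateway_request_duration_seconds_bucket[5m])))

# Pool utilization as a ratio per pool
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes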

Per-node dashboard:

  • Raft commit latency histogram (kiseki_raft_commit_latency_seconds)
  • Chunk read/write throughput
  • Transport connection count
  • Shard delta count per shard

Capacity dashboard:

  • Pool fill percentage over time
  • Pool capacity trend (linear projection for capacity planning)
  • Delta count growth rate (shard split prediction)
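
For the linear projection, PromQL's predict_linear extrapolates the recent trend; for example, projected pool usage one week out based on the last six hours of samples:

# Projected pool usage in 7 days (604800 seconds)
predict_linear(kiseki_pool_capacity_used_bytes[6h], 604800)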

Key management dashboard:

  • Key rotation count over time (kiseki_key_rotation_total)
  • Crypto-shred count (kiseki_crypto_shred_total)
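
Rotation and shred activity chart naturally as daily increases (example queries):

# Key rotations over the last day
increase(kiseki_key_rotation_total[1d])

# Crypto-shred operations over the last day
increase(kiseki_crypto_shred_total[1d])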

Alerting rules

Recommended Prometheus alerting rules:

groups:
  - name: kiseki
    rules:
      - alert: KisekiNodeDown
        expr: up{job="kiseki"} == 0
        for: 1m
        labels:
          severity: critical

      - alert: KisekiPoolCapacityWarning
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.85
        for: 5m
        labels:
          severity: warning

      - alert: KisekiPoolCapacityCritical
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.92
        for: 1m
        labels:
          severity: critical

      - alert: KisekiGatewayLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning

      - alert: KisekiRaftCommitLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_raft_commit_latency_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
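
These rules can be saved to a file (for example kiseki-rules.yml, referenced from rule_files in prometheus.yml) and validated with promtool:

# Validate the alerting rules before loading them
promtool check rules kiseki-rules.yml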