
Monitoring & Observability

Kiseki provides three observability pillars: metrics (Prometheus), structured logging (tracing), and distributed traces (OpenTelemetry). All three are tenant-aware, respecting the zero-trust boundary between cluster admin and tenant admin (ADR-015).


Prometheus metrics

Every kiseki-server node exposes Prometheus metrics in text exposition format on the metrics HTTP port.

Endpoint

GET http://<node>:9090/metrics
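
For a quick sanity check, fetch the endpoint directly and filter for a metric family (node1 below is a placeholder hostname):

# Fetch the exposition output and show the pool capacity gauges
curl -s http://node1:9090/metrics | grep '^kiseki_pool_capacity'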

Registered metrics

| Metric name | Type | Labels | Description |
|---|---|---|---|
| kiseki_raft_commit_latency_seconds | Histogram | shard | Raft commit latency per shard. Buckets: 100us to 1s. |
| kiseki_raft_entries_total | Counter | (none) | Total Raft entries applied on this node. |
| kiseki_chunk_write_bytes_total | Counter | (none) | Total chunk bytes written. |
| kiseki_chunk_read_bytes_total | Counter | (none) | Total chunk bytes read. |
| kiseki_chunk_ec_encode_seconds | Histogram | strategy | EC encode latency. Buckets: 100us to 50ms. |
| kiseki_gateway_requests_total | Counter | method, status | Gateway request count by method (GET, PUT, DELETE, etc.) and HTTP status. |
| kiseki_gateway_request_duration_seconds | Histogram | method | Gateway request duration. Buckets: 1ms to 5s. |
| kiseki_pool_capacity_total_bytes | Gauge | pool | Total capacity per pool in bytes. |
| kiseki_pool_capacity_used_bytes | Gauge | pool | Used capacity per pool in bytes. |
| kiseki_transport_connections_active | Gauge | (none) | Active transport connections. |
| kiseki_transport_connections_idle | Gauge | (none) | Idle transport connections. |
| kiseki_shard_delta_count | Gauge | shard | Current delta count per shard. |
| kiseki_key_rotation_total | Counter | (none) | Key rotations performed (system + tenant). |
| kiseki_crypto_shred_total | Counter | (none) | Crypto-shred operations performed. |
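
As an illustration of the exposition format, a histogram family is exported as cumulative buckets plus a running sum and count. The sample below is hypothetical (shard name and values are invented for illustration; intermediate buckets elided):

# HELP kiseki_raft_commit_latency_seconds Raft commit latency per shard.
# TYPE kiseki_raft_commit_latency_seconds histogram
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="0.0001"} 12
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="0.001"} 857
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="1"} 5321
kiseki_raft_commit_latency_seconds_bucket{shard="shard-0001",le="+Inf"} 5321
kiseki_raft_commit_latency_seconds_sum{shard="shard-0001"} 14.2
kiseki_raft_commit_latency_seconds_count{shard="shard-0001"} 5321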

Metric scoping (zero-trust)

Per ADR-015, metric scoping respects the zero-trust boundary:

  • Cluster admin sees: Aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with an opaque hash) unless access to that tenant's metrics has been approved.
  • Tenant admin sees: Their own tenant’s metrics via the tenant audit export.
  • No metric exposes: File names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).

Metric cardinality

Metric cardinality is bounded by design. Label values are drawn from fixed sets (shard IDs, pool names, HTTP methods, strategy names). There are no unbounded label values such as file paths, tenant IDs, or user identifiers in metrics labels.
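
One way to spot-check this bound on a running node is to count the exposed series (a rough proxy for cardinality; node1 is a placeholder hostname):

# Count the kiseki series currently exposed by one node
curl -s http://node1:9090/metrics | grep -c '^kiseki_'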


Structured logging

Kiseki uses the tracing crate for structured logging. Every log event is a structured record with typed fields.

Configuration

| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Filter directive. Supports per-module granularity. |
| KISEKI_LOG_FORMAT | text | Output format: text (human-readable) or json (structured). |

Filter examples

# Default: info-level for all Kiseki modules
RUST_LOG=kiseki=info

# Debug for the Raft subsystem, info for everything else
RUST_LOG=kiseki_raft=debug,kiseki=info

# Trace-level for the chunk subsystem (very verbose)
RUST_LOG=kiseki_chunk=trace,kiseki=info

# Warnings only (quiet)
RUST_LOG=warn

JSON output format

In production, set KISEKI_LOG_FORMAT=json for structured log aggregation (ELK, Loki, Datadog, etc.):

{
  "timestamp": "2026-04-23T14:30:00.123Z",
  "level": "INFO",
  "target": "kiseki_raft",
  "message": "Raft leader elected",
  "shard": "shard-0001",
  "node_id": 1,
  "term": 42
}
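
With JSON output, standard log tooling applies directly. A minimal sketch, assuming logs are captured to a file (kiseki.log is a hypothetical name):

# Show only ERROR events from a captured JSON log stream
jq -c 'select(.level == "ERROR")' kiseki.log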

Log levels

| Level | Usage |
|---|---|
| ERROR | Unrecoverable failures, invariant violations, data loss events. |
| WARN | Recoverable issues, degraded state, approaching capacity limits. |
| INFO | Significant state changes: leader election, key rotation, shard split, node join/leave. |
| DEBUG | Detailed operational events: individual RPCs, cache hits/misses, EC operations. |
| TRACE | Wire-level detail: Raft message contents, HKDF inputs, bitmap operations. |

Security in logs

  • Tenant-identifying fields (tenant_id, namespace) are present for correlation.
  • Content fields (file names, chunk plaintext, key material) are never logged (I-K8).
  • Logs ship to the same audit/observability pipeline as metrics and traces.

Distributed tracing (OpenTelemetry)

Kiseki uses OpenTelemetry for distributed tracing across the full write/read path.

Configuration

| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP gRPC endpoint. Example: http://jaeger:4317. When not set, tracing is disabled. |
| OTEL_SERVICE_NAME | kiseki-server | Service name in traces. |
| OTEL_TRACES_SAMPLER_ARG | 1.0 | Sampling rate (1.0 = 100%, 0.1 = 10%). Reduce in production for high-throughput workloads. |
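
A typical shell setup for enabling tracing, assuming kiseki-server reads these variables at startup (the endpoint and sampling rate here are examples):

# Export traces to a local Jaeger collector, sampling 10% of requests
export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
export OTEL_SERVICE_NAME=kiseki-server
export OTEL_TRACES_SAMPLER_ARG=0.1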

Trace propagation

Every write/read path carries a trace ID via OpenTelemetry context propagation. Traces span:

client -> gateway -> composition -> log -> chunk -> view

For the native client path:

client (FUSE) -> transport -> composition -> log -> chunk

Jaeger integration

The development Docker Compose stack includes Jaeger for trace visualization:

  • Jaeger UI: http://localhost:16686
  • OTLP gRPC receiver: localhost:4317
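
Outside the dev Compose stack, a standalone Jaeger instance with matching ports can be started with Docker (a sketch using the upstream all-in-one image):

# Run Jaeger all-in-one with the OTLP gRPC receiver enabled
docker run --rm -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 \
  jaegertracing/all-in-one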

Trace scoping

Traces respect the zero-trust boundary:

  • Tenant-scoped traces are visible only to the tenant admin (via tenant audit export).
  • Cluster admin sees system-level spans. No tenant content appears in span attributes visible to the cluster admin.
  • Trace overhead is approximately 1-2% on the data path (acceptable for production).

Event store

The admin dashboard maintains an in-memory event store for diagnostic events. Events are categorized and severity-tagged.

Event categories

| Category | Events |
|---|---|
| node | Node join, node leave, node unreachable, node recovered. |
| shard | Shard created, shard split, shard maintenance entered/exited. |
| device | Device added, device failed, SMART warning, evacuation started/completed. |
| tenant | Tenant created, tenant deleted, quota changed. |
| security | Auth failure, cert revocation, crypto-shred. |
| admin | Maintenance mode toggle, backup requested, scrub requested, tuning parameter change. |
| gateway | Protocol errors, connection surge, rate limiting. |
| raft | Leader election, membership change, snapshot transfer. |

Event severities

| Severity | Description |
|---|---|
| info | Normal operations. |
| warning | Attention needed, but the system is operating. |
| error | Failure requiring investigation. |
| critical | Immediate action required (data at risk, quorum lost). |

Event API

# All events from the last 3 hours
curl http://node1:9090/ui/api/events

# Errors from the last hour
curl 'http://node1:9090/ui/api/events?severity=error&hours=1'

# Device events, last 50
curl 'http://node1:9090/ui/api/events?category=device&limit=50'

# Security events from the last 24 hours
curl 'http://node1:9090/ui/api/events?category=security&hours=24'

Historical metrics API

# Metric snapshots from the last 3 hours
curl http://node1:9090/ui/api/history

# Last 6 hours
curl 'http://node1:9090/ui/api/history?hours=6'

The history endpoint returns time-series data points suitable for charting. The default retention is 3 hours in memory. For longer retention, use Prometheus.


Grafana integration

For production monitoring with alerting and long-term storage, configure Prometheus to scrape Kiseki metrics and visualize with Grafana.

Prometheus scrape configuration

scrape_configs:
  - job_name: 'kiseki'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'node1:9090'
          - 'node2:9090'
          - 'node3:9090'
    metrics_path: '/metrics'
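
After merging this snippet into prometheus.yml, the configuration can be validated with promtool (shipped with Prometheus):

# Validate the scrape configuration before reloading Prometheus
promtool check config prometheus.yml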

Cluster overview dashboard:

  • Cluster health (up/down per node)
  • Total Raft entries/sec (rate of kiseki_raft_entries_total)
  • Gateway request rate (rate of kiseki_gateway_requests_total)
  • Gateway latency p50/p99 (kiseki_gateway_request_duration_seconds)
  • Pool utilization (kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes)
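
The panels above map to PromQL along these lines (example queries, not shipped dashboards):

# Raft entries/sec across the cluster
sum(rate(kiseki_raft_entries_total[5m]))

# Gateway request rate by method and status
sum by (method, status) (rate(kiseki_gateway_requests_total[5m]))

# Gateway p99 latency
histogram_quantile(0.99, sum by (le) (rate(kiseki_gateway_request_duration_seconds_bucket[5m])))

# Pool utilization as a ratio per pool
kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes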

Per-node dashboard:

  • Raft commit latency histogram (kiseki_raft_commit_latency_seconds)
  • Chunk read/write throughput
  • Transport connection count
  • Shard delta count per shard

Capacity dashboard:

  • Pool fill percentage over time
  • Pool capacity trend (linear projection for capacity planning)
  • Delta count growth rate (shard split prediction)
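
For the linear projection, PromQL's predict_linear extrapolates the recent trend; for example, projected pool usage one week out based on the last six hours of samples:

# Projected pool usage in 7 days (604800 seconds)
predict_linear(kiseki_pool_capacity_used_bytes[6h], 604800)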

Key management dashboard:

  • Key rotation count over time (kiseki_key_rotation_total)
  • Crypto-shred count (kiseki_crypto_shred_total)
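
Rotation and shred activity chart naturally as daily increases (example queries):

# Key rotations over the last day
increase(kiseki_key_rotation_total[1d])

# Crypto-shred operations over the last day
increase(kiseki_crypto_shred_total[1d])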

Alerting rules

Recommended Prometheus alerting rules:

groups:
  - name: kiseki
    rules:
      - alert: KisekiNodeDown
        expr: up{job="kiseki"} == 0
        for: 1m
        labels:
          severity: critical

      - alert: KisekiPoolCapacityWarning
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.85
        for: 5m
        labels:
          severity: warning

      - alert: KisekiPoolCapacityCritical
        expr: >
          kiseki_pool_capacity_used_bytes / kiseki_pool_capacity_total_bytes > 0.92
        for: 1m
        labels:
          severity: critical

      - alert: KisekiGatewayLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning

      - alert: KisekiRaftCommitLatencyHigh
        expr: >
          histogram_quantile(0.99, rate(kiseki_raft_commit_latency_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
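
These rules can be saved to a file (for example kiseki-rules.yml, referenced from rule_files in prometheus.yml) and validated with promtool:

# Validate the alerting rules before loading them
promtool check rules kiseki-rules.yml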