Node Lifecycle

Design Principle

Nodes follow a formal state machine with well-defined transitions, timeouts, and operator actions. The node agent drives transitions locally; the quorum records ownership changes with strong consistency. Running allocations are never disrupted by state transitions unless the node is genuinely unhealthy.

State Machine

                    ┌────────────────────────────────────────────┐
                    │                                            │
                    ▼                                            │
  ┌─────────┐   boot   ┌──────────┐   health ok   ┌─────────┐    │
  │ Unknown │────────→ │ Booting  │──────────────→│  Ready  │    │
  └─────────┘          └──────────┘               └────┬────┘    │
       ▲                     │                         │         │
       │               boot fail                       │         │
       │                     │        ┌────────────────┤         │
       │                     ▼        │                │         │
       │               ┌──────────┐   │  drain cmd     │         │
       │               │  Failed  │   │       │        │         │
       │               └──────────┘   │       ▼        │         │
       │                     │        │  ┌──────────┐  │  remediated
       │               wipe/reboot    │  │ Draining │  │         │
       │                     │        │  └─────┬────┘  │         │
       │                     │        │   allocs done  │         │
       │                     │        │        │       │         │
       │                     │        │        ▼       │         │
       │                     │        │  ┌──────────┐  │         │
       │                     │        │  │ Drained  │  │         │
       │                     │        │  └─────┬────┘  │         │
       │                     │        │ undrain│       │         │
       │                     │        │        │       │         │
       │                     │        │        ▼       │         │
       │                     │        └──→ (Ready) ◄───┘         │
       │                     │                                   │
       │                     │    heartbeat miss    ┌───────────┐│
       │                     │    ┌────────────────→│ Degraded  ││
       │                     │    │   (Ready)       └─────┬─────┘│
       │                     │    │                 grace timeout│
       │                     │    │                       │      │
       │                     │    │                       ▼      │
       │                     └────┼──────────────────┌─────────┐ │
       │                          │                  │  Down   │ │
       └──────────────────────────┼──────────────────└────┬────┘ │
                                  │                 reboot│      │
                                  │                       └──────┘
                                  │
                         heartbeat resume
                          (within grace)
                                  │
                                  └──→ (Ready)

States

| State    | Description                                                              | Schedulable | Allocations Run          |
|----------|--------------------------------------------------------------------------|-------------|--------------------------|
| Unknown  | Node exists in inventory but has never reported                          | No          | No                       |
| Booting  | OpenCHAMI is booting/reimaging the node                                  | No          | No                       |
| Ready    | Healthy, agent reporting, available for scheduling                       | Yes         | Yes                      |
| Degraded | Heartbeat missed or minor issue detected                                 | No (new)    | Yes (existing)           |
| Down     | Confirmed failure, grace period expired                                  | No          | No (requeued)            |
| Draining | Operator or scheduler requested drain; waiting for allocations to finish | No (new)    | Yes (existing, draining) |
| Drained  | All allocations completed/migrated after drain                           | No          | No                       |
| Failed   | Boot failure or unrecoverable hardware error                             | No          | No                       |
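The state table can be encoded directly; the sketch below mirrors the Schedulable and Allocations Run columns. The `NodeState` enum and its flag attributes are illustrative, not an actual scheduler API.

```python
from enum import Enum

# Each member carries (label, schedulable, allocations_run), matching the
# table above. Names and structure are hypothetical.
class NodeState(Enum):
    UNKNOWN = ("Unknown", False, False)
    BOOTING = ("Booting", False, False)
    READY = ("Ready", True, True)
    DEGRADED = ("Degraded", False, True)   # existing allocations keep running
    DOWN = ("Down", False, False)
    DRAINING = ("Draining", False, True)   # existing allocations drain out
    DRAINED = ("Drained", False, False)
    FAILED = ("Failed", False, False)

    def __init__(self, label: str, schedulable: bool, allocations_run: bool):
        self.label = label
        self.schedulable = schedulable
        self.allocations_run = allocations_run
```

A scheduler filtering candidates would then only need to check `state.schedulable`.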

Transitions

Ready → Degraded

Trigger: Heartbeat silence exceeding heartbeat_timeout.

Timeout: heartbeat_timeout (default: 30s, i.e. three missed heartbeats at the default 10s interval). If no heartbeat is received within this window, the quorum marks the node Degraded.

Effect: Node is removed from scheduling candidates for new allocations. Running allocations continue undisturbed. No user notification.

Sensitive override: Sensitive nodes use a longer degradation window (default: 2 minutes) to avoid false positives from transient network issues.

Degraded → Ready

Trigger: Heartbeat resumes within the grace period.

Effect: Node re-enters the scheduling pool. No allocation disruption occurred. Event logged but no alert.

Degraded → Down

Trigger: Grace period expired without heartbeat recovery.

Timeouts:

| Node Type | Grace Period | Rationale                                                     |
|-----------|--------------|---------------------------------------------------------------|
| Standard  | 60s          | Balances fast recovery against false positives                |
| Sensitive | 5 minutes    | Sensitive allocations are high-value; avoid premature requeue |
| Borrowed  | 30s          | Borrowed nodes should be reclaimed quickly                    |

Effect:

  1. All allocations on the node are evaluated per their requeue policy (cross-ref: failure-modes.md)
  2. Node ownership released (Raft commit)
  3. Alert raised to operators
  4. OpenCHAMI notified for out-of-band investigation (Redfish BMC check)
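The per-type grace periods can be sketched as a simple lookup; the node-type names and defaults come from the table above, while the function name and the fallback-to-standard behavior are assumptions for illustration.

```python
from datetime import timedelta

# Hypothetical helper: Degraded -> Down grace period by node type.
# Defaults match the grace-period table; unknown types fall back to standard.
GRACE_PERIODS = {
    "standard": timedelta(seconds=60),
    "sensitive": timedelta(minutes=5),
    "borrowed": timedelta(seconds=30),
}

def grace_period(node_type: str) -> timedelta:
    """Return the grace period for a node type, defaulting to standard."""
    return GRACE_PERIODS.get(node_type, GRACE_PERIODS["standard"])
```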

Ready → Draining

Trigger: Explicit operator command (lattice node drain <id>) or scheduler-initiated (upgrade, conformance drift on sensitive node).

Effect:

  1. Node removed from scheduling candidates
  2. Running allocations continue until completion
  3. For urgent drains: scheduler may trigger checkpoint on running allocations (cross-ref: checkpoint-broker.md)
  4. No new allocations assigned
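The drain steps above can be sketched as follows. The `Node` and `Allocation` types are stand-ins, and the checkpoint step is a placeholder for the checkpoint-broker call rather than a real API.

```python
from dataclasses import dataclass, field

# Minimal stand-in types; field names are illustrative.
@dataclass
class Allocation:
    alloc_id: str
    checkpointed: bool = False

@dataclass
class Node:
    node_id: str
    state: str = "Ready"
    allocations: list[Allocation] = field(default_factory=list)

def drain(node: Node, urgent: bool = False) -> None:
    node.state = "Draining"  # removed from scheduling candidates
    if urgent:
        for alloc in node.allocations:
            alloc.checkpointed = True  # placeholder for the checkpoint trigger
    # Otherwise, running allocations simply continue until completion.
```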

Draining → Drained

Trigger: All running allocations on the node have completed, been checkpointed, or been migrated.

Effect: Node is idle and safe for maintenance. Operator can upgrade, reboot, or reimage.

Drained → Ready

Trigger: Operator undrain (lattice node undrain <id>). Typically after maintenance.

Precondition: Node agent health check passes (heartbeat, GPU detection, network test, conformance fingerprint computed).

Effect: Node re-enters scheduling pool.

Any → Down (hardware failure)

Trigger: OpenCHAMI Redfish BMC detects critical hardware failure (PSU, uncorrectable ECC, GPU fallen off bus).

Effect: Immediate transition to Down, bypassing grace period. Same allocation handling as Degraded → Down.

Down → Booting

Trigger: Operator or automated remediation initiates reboot/reimage via OpenCHAMI.

Effect: Node enters Booting state. OpenCHAMI BSS serves the appropriate image.

Booting → Ready

Trigger: Node agent starts, passes health check, reports to quorum.

Health check: Heartbeat received, GPU count matches capabilities, NIC firmware detected, conformance fingerprint computed and reported.

Booting → Failed

Trigger: Boot timeout (default: 10 minutes) or repeated boot failures (3 consecutive).

Effect: Node marked Failed. Alert raised. Operator must investigate.
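A minimal sketch of this boot-failure policy, using the defaults stated above (10-minute timeout, 3 consecutive failures); the function and constant names are illustrative.

```python
# Boot-failure policy: Failed after a boot timeout or repeated boot failures.
BOOT_TIMEOUT_S = 10 * 60   # default boot timeout
MAX_BOOT_FAILURES = 3      # consecutive failures before giving up

def booting_next_state(elapsed_s: float, consecutive_failures: int) -> str:
    """Decide whether a Booting node should be marked Failed."""
    if elapsed_s > BOOT_TIMEOUT_S or consecutive_failures >= MAX_BOOT_FAILURES:
        return "Failed"
    return "Booting"
```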

Sensitive Node Lifecycle Extensions

Sensitive nodes have additional constraints:

| Event             | Standard Node        | Sensitive Node                                               |
|-------------------|----------------------|--------------------------------------------------------------|
| Claim             | Scheduler assigns    | User claims explicitly, Raft-committed                       |
| Degraded grace    | 60s                  | 5 minutes                                                    |
| Down → requeue    | Automatic            | Operator intervention required                               |
| Release           | Node returns to pool | Node must be wiped (OpenCHAMI secure erase) before returning |
| Conformance drift | Deprioritized        | Immediate Draining, audit logged                             |

Sensitive Release Sequence

1. User releases sensitive allocation
2. Quorum releases node ownership (Raft commit, audit entry)
3. Node enters Draining (if other sensitive allocations) or proceeds to wipe
4. OpenCHAMI initiates secure wipe:
   a. GPU memory clear
   b. NVMe secure erase (if present)
   c. RAM scrub
   d. Reboot into clean image
5. Wipe confirmation reported to quorum (Raft commit, audit entry)
6. Node transitions to Ready and returns to general pool

Wipe Failure Handling

If the OpenCHAMI secure wipe fails or times out during sensitive node release:

  1. Timeout: Default wipe timeout is 30 minutes (configurable: sensitive.wipe_timeout). If wipe does not complete within this window, the node enters a Quarantine state (treated as Down by the scheduler).
  2. Quarantine: Quarantined nodes are excluded from scheduling and flagged for operator intervention. They do not return to the general pool.
  3. Operator intervention: The operator investigates (BMC console, hardware diagnostics) and either:
    • Retries the wipe: lattice admin node wipe <id> --force
    • Replaces the node hardware
    • Marks the node as permanently failed: lattice node disable <id>
  4. Audit: Wipe failures are logged as critical audit events (Raft-committed for sensitive nodes). The audit entry records: node ID, wipe start time, failure reason, operator action.
  5. Alert: lattice_sensitive_wipe_failure_total counter incremented; critical alert fired.

Operator Commands

| Command                            | Effect                                   | Confirmation Required                   |
|------------------------------------|------------------------------------------|-----------------------------------------|
| lattice node drain <id>            | Start draining                           | No                                      |
| lattice node drain <id> --urgent   | Drain with checkpoint trigger            | Yes (allocations will be checkpointed)  |
| lattice node undrain <id>          | Re-enable scheduling                     | No                                      |
| lattice node disable <id>          | Transition to Down immediately           | Yes (allocations will be requeued/failed) |
| lattice node enable <id>           | Re-enable a disabled node (Down → Ready) | No                                      |
| lattice node status <id>           | Show current state, allocations, health  | No                                      |
| lattice node list --state=degraded | List nodes in a given state              | No                                      |

Heartbeat Protocol

Node agents send heartbeats to the quorum at a configurable interval:

| Parameter              | Default | Description                                       |
|------------------------|---------|---------------------------------------------------|
| heartbeat_interval     | 10s     | How often the agent sends a heartbeat             |
| heartbeat_timeout      | 30s     | Quorum marks the node Degraded after this silence |
| grace_period           | 60s     | Degraded → Down after this additional silence     |
| sensitive_grace_period | 5m      | Extended grace period for sensitive nodes         |
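Taken together, these parameters partition heartbeat silence into the Ready/Degraded/Down states. A sketch of that classification, using the defaults above (the function itself is illustrative):

```python
# Classify a node from the time since its last heartbeat, using the
# defaults from the parameter table. Illustrative only.
def classify_silence(silence_s: float, sensitive: bool = False) -> str:
    timeout = 30.0                         # heartbeat_timeout
    grace = 300.0 if sensitive else 60.0   # (sensitive_)grace_period
    if silence_s <= timeout:
        return "Ready"
    if silence_s <= timeout + grace:
        return "Degraded"
    return "Down"
```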

Heartbeats include:

  • Monotonic sequence number (replay detection)
  • Node health summary (GPU count, temperature, ECC errors)
  • Conformance fingerprint (if recomputed since last heartbeat)
  • Running allocation count

Heartbeats are lightweight (~200 bytes) and sent over the management traffic class (cross-ref: security.md).

Agent Restart and State Recovery

The node agent persists active allocation state to /var/lib/lattice/agent-state.json (configurable via --state-file). This enables workload survival across agent restarts.

On graceful shutdown (SIGTERM):

  1. Agent writes current allocation state (PIDs, cgroup paths, runtime type, mount points) to the state file
  2. Agent exits without killing workloads (systemd KillMode=process)

On startup:

  1. Agent reads the persisted state file
  2. For each allocation, checks if the process is still alive (kill(pid, 0))
  3. Alive processes are reattached — agent resumes heartbeating their status
  4. Dead processes are treated as orphans — cgroup scopes are destroyed, mounts cleaned up
  5. Stray cgroup scopes under workload.slice/alloc-*.scope with no matching state entry are also cleaned up
  6. Agent re-registers with quorum and resumes normal operation

Crash recovery: If the agent crashes without writing the state file, the startup scan of cgroup scopes under workload.slice/ provides a fallback discovery mechanism for orphaned workloads.
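The startup scan (steps 1–4 above) can be sketched as follows. The state-file schema is an assumption based on the fields named above; the signal-0 liveness probe is the standard kill(pid, 0) existence check.

```python
import json
import os

# Sketch of startup recovery: read the persisted state file and partition
# allocations into live (reattach) and dead (clean up). Schema is assumed.
def recover(state_path: str = "/var/lib/lattice/agent-state.json"):
    with open(state_path) as f:
        state = json.load(f)
    alive, orphaned = [], []
    for alloc in state.get("allocations", []):
        try:
            os.kill(alloc["pid"], 0)   # signal 0: existence check only
            alive.append(alloc)        # reattach, resume heartbeating status
        except ProcessLookupError:
            orphaned.append(alloc)     # destroy cgroup scope, clean up mounts
        except PermissionError:
            alive.append(alloc)        # process exists, owned by another user
    return alive, orphaned
```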

Cross-References