Node Lifecycle
Design Principle
Nodes follow a formal state machine with well-defined transitions, timeouts, and operator actions. The node agent drives transitions locally; the quorum records ownership changes with strong consistency. Running allocations are never disrupted by state transitions unless the node is genuinely unhealthy.
State Machine
- Unknown → Booting: boot
- Booting → Ready: health ok
- Booting → Failed: boot fail
- Failed: exits via wipe/reboot
- Ready → Draining: drain cmd
- Draining → Drained: allocs done
- Drained → Ready: undrain
- Ready → Degraded: heartbeat miss
- Degraded → Ready: heartbeat resume (within grace)
- Degraded → Down: grace timeout
- Down → Booting: reboot (remediated)
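One way to make the state machine checkable is a lookup table of allowed transitions. The following is a minimal sketch, not the actual implementation: the names `ALLOWED` and `can_transition` are hypothetical, states are plain strings, and only transitions described in this document are encoded. The exit from Failed and the operator `enable` command are not modeled.

```python
# Hypothetical transition table; state names follow this document.
ALLOWED = {
    ("Unknown", "Booting"),    # boot
    ("Booting", "Ready"),      # health ok
    ("Booting", "Failed"),     # boot fail
    ("Ready", "Degraded"),     # heartbeat miss
    ("Degraded", "Ready"),     # heartbeat resume (within grace)
    ("Degraded", "Down"),      # grace timeout
    ("Ready", "Draining"),     # drain cmd
    ("Draining", "Drained"),   # allocs done
    ("Drained", "Ready"),      # undrain
    ("Down", "Booting"),       # reboot / remediated
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if src -> dst is a transition described in this document."""
    if dst == "Down":
        # "Any -> Down" on critical hardware failure, bypassing the grace period.
        return src != "Down"
    return (src, dst) in ALLOWED
```

A caller would reject any state change for which `can_transition` returns False before committing it.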
States
| State | Description | Schedulable | Allocations Run |
|---|---|---|---|
| Unknown | Node exists in inventory but has never reported | No | No |
| Booting | OpenCHAMI booting/reimaging the node | No | No |
| Ready | Healthy, agent reporting, available for scheduling | Yes | Yes |
| Degraded | Heartbeat missed or minor issue detected | No (new) | Yes (existing) |
| Down | Confirmed failure, grace period expired | No | No (requeued) |
| Draining | Operator or scheduler requested drain, waiting for allocations to finish | No (new) | Yes (existing, draining) |
| Drained | All allocations completed/migrated after drain | No | No |
| Failed | Boot failure or unrecoverable hardware error | No | No |
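The table can be mirrored in code so that the scheduler asks the state, rather than hard-coding state names, whether a node may take new work. This is an illustrative sketch; the enum `NodeState` and the helpers `schedulable` and `runs_allocations` are assumed names, not part of any documented API.

```python
from enum import Enum


class NodeState(Enum):
    UNKNOWN = "Unknown"
    BOOTING = "Booting"
    READY = "Ready"
    DEGRADED = "Degraded"
    DOWN = "Down"
    DRAINING = "Draining"
    DRAINED = "Drained"
    FAILED = "Failed"


# (accepts new allocations, existing allocations keep running), per the table.
_PROPERTIES = {
    NodeState.UNKNOWN: (False, False),
    NodeState.BOOTING: (False, False),
    NodeState.READY: (True, True),
    NodeState.DEGRADED: (False, True),
    NodeState.DOWN: (False, False),
    NodeState.DRAINING: (False, True),
    NodeState.DRAINED: (False, False),
    NodeState.FAILED: (False, False),
}


def schedulable(state: NodeState) -> bool:
    """True if the scheduler may place new allocations on the node."""
    return _PROPERTIES[state][0]


def runs_allocations(state: NodeState) -> bool:
    """True if allocations already on the node keep running."""
    return _PROPERTIES[state][1]
```

Only Ready is schedulable; Degraded and Draining keep existing allocations alive without accepting new ones.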
Transitions
Ready → Degraded
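Trigger: Missed heartbeats (no heartbeat arrives within heartbeat_timeout).
Timeout: heartbeat_timeout (default: 30s; agents send a heartbeat every heartbeat_interval, default: 10s). If no heartbeat is received within this window, the quorum marks the node Degraded.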
Effect: Node is removed from scheduling candidates for new allocations. Running allocations continue undisturbed. No user notification.
Sensitive override: Sensitive nodes use a longer degradation window (default: 2 minutes) to avoid false positives from transient network issues.
Degraded → Ready
Trigger: Heartbeat resumes within the grace period.
Effect: Node re-enters the scheduling pool. No allocation disruption occurred. Event logged but no alert.
Degraded → Down
Trigger: Grace period expired without heartbeat recovery.
Timeouts:
| Node Type | Grace Period | Rationale |
|---|---|---|
| Standard | 60s | Balance between fast recovery and false positive avoidance |
| Sensitive | 5 minutes | Sensitive allocations are high-value; avoid premature requeue |
| Borrowed | 30s | Borrowed nodes should be reclaimed quickly |
Effect:
- All allocations on the node are evaluated per their requeue policy (cross-ref: failure-modes.md)
- Node ownership released (Raft commit)
- Alert raised to operators
- OpenCHAMI notified for out-of-band investigation (Redfish BMC check)
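The combined effect of the heartbeat timeout and the per-type grace period can be sketched as a pure function of how long a node has been silent. This is a minimal sketch under stated assumptions: `Timeouts`, `TIMEOUTS`, and `classify_silence` are hypothetical names, the sensitive heartbeat window uses the 2-minute default from the Ready → Degraded section, and borrowed nodes are assumed to use the standard 30s heartbeat timeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    heartbeat_timeout_s: float  # silence before Ready -> Degraded
    grace_period_s: float       # additional silence before Degraded -> Down


# Defaults taken from this document; the borrowed heartbeat timeout is assumed.
TIMEOUTS = {
    "standard": Timeouts(heartbeat_timeout_s=30, grace_period_s=60),
    "sensitive": Timeouts(heartbeat_timeout_s=120, grace_period_s=300),
    "borrowed": Timeouts(heartbeat_timeout_s=30, grace_period_s=30),
}


def classify_silence(node_type: str, silence_s: float) -> str:
    """Map seconds since the last heartbeat to the state the quorum would record."""
    t = TIMEOUTS[node_type]
    if silence_s < t.heartbeat_timeout_s:
        return "Ready"
    if silence_s < t.heartbeat_timeout_s + t.grace_period_s:
        return "Degraded"
    return "Down"
```

Under these defaults a standard node goes Down after 90s of silence, a borrowed node after 60s, and a sensitive node after 7 minutes.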
Ready → Draining
Trigger: Explicit operator command (lattice node drain <id>) or a scheduler-initiated drain (upgrade, conformance drift on a sensitive node).
Effect:
- Node removed from scheduling candidates
- Running allocations continue until completion
- For urgent drains: scheduler may trigger checkpoint on running allocations (cross-ref: checkpoint-broker.md)
- No new allocations assigned
Draining → Drained
Trigger: All running allocations on the node have completed, been checkpointed, or been migrated.
Effect: Node is idle and safe for maintenance. Operator can upgrade, reboot, or reimage.
Drained → Ready
Trigger: Operator undrain (lattice node undrain <id>). Typically after maintenance.
Precondition: Node agent health check passes (heartbeat, GPU detection, network test, conformance fingerprint computed).
Effect: Node re-enters scheduling pool.
Any → Down (hardware failure)
Trigger: OpenCHAMI Redfish BMC detects critical hardware failure (PSU, uncorrectable ECC, GPU fallen off bus).
Effect: Immediate transition to Down, bypassing grace period. Same allocation handling as Degraded → Down.
Down → Booting
Trigger: Operator or automated remediation initiates reboot/reimage via OpenCHAMI.
Effect: Node enters Booting state. OpenCHAMI BSS serves the appropriate image.
Booting → Ready
Trigger: Node agent starts, passes health check, reports to quorum.
Health check: Heartbeat received, GPU count matches capabilities, NIC firmware detected, conformance fingerprint computed and reported.
Booting → Failed
Trigger: Boot timeout (default: 10 minutes) or repeated boot failures (3 consecutive).
Effect: Node marked Failed. Alert raised. Operator must investigate.
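The two ways out of Booting can be expressed as a small decision function. This is an illustrative sketch only: `next_boot_state` and its parameters are hypothetical, and it assumes the boot timeout is measured from the start of the current boot attempt.

```python
BOOT_TIMEOUT_S = 10 * 60            # default boot timeout from this document
MAX_CONSECUTIVE_BOOT_FAILURES = 3   # repeated-failure threshold


def next_boot_state(elapsed_s: float, consecutive_failures: int, health_ok: bool) -> str:
    """Decide the state of a node that is currently Booting."""
    if consecutive_failures >= MAX_CONSECUTIVE_BOOT_FAILURES:
        return "Failed"
    if elapsed_s >= BOOT_TIMEOUT_S:
        return "Failed"
    if health_ok:
        # Agent started, passed its health check, and reported to the quorum.
        return "Ready"
    return "Booting"
```

A node that is still within the timeout and has not passed its health check simply stays in Booting.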
Sensitive Node Lifecycle Extensions
Sensitive nodes have additional constraints:
| Event | Standard Node | Sensitive Node |
|---|---|---|
| Claim | Scheduler assigns | User claims explicitly, Raft-committed |
| Degraded grace | 60s | 5 minutes |
| Down → requeue | Automatic | Operator intervention required |
| Release | Node returns to pool | Node must be wiped (OpenCHAMI secure erase) before returning |
| Conformance drift | Deprioritized | Immediate Draining, audit logged |
Sensitive Release Sequence
1. User releases sensitive allocation
2. Quorum releases node ownership (Raft commit, audit entry)
3. Node enters Draining if other sensitive allocations are still running on it; otherwise it proceeds directly to wipe
4. OpenCHAMI initiates secure wipe:
   a. GPU memory clear
   b. NVMe secure erase (if present)
   c. RAM scrub
   d. Reboot into clean image
5. Wipe confirmation reported to quorum (Raft commit, audit entry)
6. Node transitions to Ready and returns to general pool
Wipe Failure Handling
If the OpenCHAMI secure wipe fails or times out during sensitive node release:
- Timeout: Default wipe timeout is 30 minutes (configurable: sensitive.wipe_timeout). If wipe does not complete within this window, the node enters a Quarantine state (treated as Down by the scheduler).
- Quarantine: Quarantined nodes are excluded from scheduling and flagged for operator intervention. They do not return to the general pool.
- Operator intervention: The operator investigates (BMC console, hardware diagnostics) and either:
  - Retries the wipe: lattice admin node wipe <id> --force
  - Replaces the node hardware
  - Marks the node as permanently failed: lattice node disable <id>
- Audit: Wipe failures are logged as critical audit events (Raft-committed for sensitive nodes). The audit entry records: node ID, wipe start time, failure reason, operator action.
- Alert: lattice_sensitive_wipe_failure_total counter incremented; critical alert fired.
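The wipe outcome rules above can be sketched as a function that turns step results and elapsed time into either Ready or Quarantine. This is a hedged sketch, not the real wipe driver: `WipeResult`, `evaluate_wipe`, and the step names are hypothetical, and the actual OpenCHAMI calls are out of scope.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

WIPE_TIMEOUT_S = 30 * 60  # default for sensitive.wipe_timeout


@dataclass(frozen=True)
class WipeResult:
    state: str                     # "Ready" or "Quarantine"
    failure_reason: Optional[str]  # would be recorded in the audit entry


def evaluate_wipe(
    steps: List[Tuple[str, bool]],
    elapsed_s: float,
    timeout_s: float = WIPE_TIMEOUT_S,
) -> WipeResult:
    """steps is the ordered list of (step name, succeeded) pairs from the wipe."""
    if elapsed_s > timeout_s:
        return WipeResult("Quarantine", "timeout")
    for name, succeeded in steps:
        if not succeeded:
            return WipeResult("Quarantine", f"step failed: {name}")
    return WipeResult("Ready", None)
```

Any result other than Ready keeps the node out of the general pool until an operator retries the wipe, replaces the hardware, or disables the node.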
Operator Commands
| Command | Effect | Confirmation Required |
|---|---|---|
| lattice node drain <id> | Start draining | No |
| lattice node drain <id> --urgent | Drain with checkpoint trigger | Yes (allocations will be checkpointed) |
| lattice node undrain <id> | Re-enable scheduling | No |
| lattice node disable <id> | Transition to Down immediately | Yes (allocations will be requeued/failed) |
| lattice node enable <id> | Re-enable a disabled node (Down → Ready) | No |
| lattice node status <id> | Show current state, allocations, health | No |
| lattice node list --state=degraded | List nodes in specific state | No |
Heartbeat Protocol
Node agents send heartbeats to the quorum at a configurable interval:
| Parameter | Default | Description |
|---|---|---|
| heartbeat_interval | 10s | How often the agent sends a heartbeat |
| heartbeat_timeout | 30s | Quorum marks Degraded after this silence |
| grace_period | 60s | Degraded → Down after this additional silence |
| sensitive_grace_period | 5m | Extended grace for sensitive nodes |
Heartbeats include:
- Monotonic sequence number (replay detection)
- Node health summary (GPU count, temperature, ECC errors)
- Conformance fingerprint (if recomputed since last heartbeat)
- Running allocation count
Heartbeats are lightweight (~200 bytes) and sent over the management traffic class (cross-ref: security.md).
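A heartbeat payload and the sequence-number replay check might look like the sketch below. The field names, the `Heartbeat` and `ReplayGuard` classes, and the in-memory bookkeeping are assumptions for illustration; the wire format and authentication are not described here and are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Heartbeat:
    node_id: str
    seq: int                  # monotonic sequence number
    gpu_count: int
    max_temp_c: float
    ecc_errors: int
    running_allocations: int
    conformance_fingerprint: Optional[str] = None  # only if recomputed


class ReplayGuard:
    """Rejects heartbeats whose sequence number does not advance."""

    def __init__(self) -> None:
        self._last_seq: Dict[str, int] = {}

    def accept(self, hb: Heartbeat) -> bool:
        last = self._last_seq.get(hb.node_id)
        if last is not None and hb.seq <= last:
            return False  # replayed or out-of-order heartbeat
        self._last_seq[hb.node_id] = hb.seq
        return True
```

Sequence numbers are tracked per node, so a heartbeat with `seq=1` from one node does not interfere with another node's stream.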
Agent Restart and State Recovery
The node agent persists active allocation state to /var/lib/lattice/agent-state.json (configurable via --state-file). This enables workload survival across agent restarts.
On graceful shutdown (SIGTERM):
- Agent writes current allocation state (PIDs, cgroup paths, runtime type, mount points) to the state file
- Agent exits without killing workloads (systemd KillMode=process)
On startup:
- Agent reads the persisted state file
- For each allocation, checks if the process is still alive (kill(pid, 0))
- Alive processes are reattached: agent resumes heartbeating their status
- Dead processes are treated as orphans: cgroup scopes are destroyed, mounts cleaned up
- Stray cgroup scopes under workload.slice/alloc-*.scope with no matching state entry are also cleaned up
- Agent re-registers with quorum and resumes normal operation
Crash recovery: If the agent crashes without writing the state file, the startup scan of cgroup scopes under workload.slice/ provides a fallback discovery mechanism for orphaned workloads.
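The startup liveness check can be sketched as follows. This is a POSIX-only illustration under stated assumptions: the state file schema (an "allocations" list with "id" and "pid" keys) and the function names `pid_alive` and `recover` are hypothetical, and cgroup or mount cleanup is left out.

```python
import json
import os
from typing import List, Tuple


def pid_alive(pid: int) -> bool:
    """Probe a process with signal 0, which checks existence without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def recover(state_path: str) -> Tuple[List[str], List[str]]:
    """Return (reattachable allocation IDs, orphaned allocation IDs)."""
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # No usable state file: the caller falls back to scanning cgroup scopes.
        return [], []
    alive, orphans = [], []
    for alloc in state.get("allocations", []):
        (alive if pid_alive(alloc["pid"]) else orphans).append(alloc["id"])
    return alive, orphans
```

Note that a PID alone can be reused by an unrelated process after a reboot, so a production check would likely also compare something like the process start time or cgroup path; that refinement is omitted here.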
Cross-References
- failure-modes.md — Allocation requeue on node failure
- conformance.md — Conformance drift triggers drain on sensitive nodes
- upgrades.md — Drain/undrain during rolling upgrades
- checkpoint-broker.md — Checkpoint on urgent drain
- sensitive-workloads.md — Sensitive node claim/release/wipe
- security.md — Heartbeat authentication (mTLS, sequence numbers)