Upgrades and Rollouts
Design Principle
Zero-downtime upgrades. No running allocation is disrupted by an upgrade. Components are upgraded independently. Protocol backward compatibility ensures mixed-version operation during rolling upgrades.
Protocol Versioning
All gRPC services are versioned (lattice.v1.*):
- New fields are additive (backward compatible within a major version)
- Breaking changes require a new version (lattice.v2.*)
- During rolling upgrades, node agents and quorum members must support both version N and N-1
- Version negotiation on connection establishment: components advertise supported versions, use the highest common version
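The negotiation rule above can be sketched in a few lines. This is only an illustration: representing versions as integers and the function name negotiate_version are assumptions, and the actual handshake encoding is not specified in this document.

```python
from typing import Iterable, Optional


def negotiate_version(local: Iterable[int], remote: Iterable[int]) -> Optional[int]:
    """Return the highest major version both sides support, or None if none is shared.

    Hypothetical helper: versions are modeled as plain integers (1 for lattice.v1.*).
    """
    common = set(local) & set(remote)
    return max(common) if common else None


# An upgraded node agent (v1 and v2) connecting to a not-yet-upgraded quorum (v1 only).
print(negotiate_version([1, 2], [1]))     # 1
# Both sides upgraded: the newer protocol is selected.
print(negotiate_version([1, 2], [1, 2]))  # 2
# No overlap, which the N / N-1 support rule is meant to prevent.
print(negotiate_version([2], [1]))        # None
```

Because every component supports both N and N-1 during the upgrade window, the intersection should never be empty in practice; the None case is there to make a violated assumption visible.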
Upgrade Order
Components are upgraded in dependency order, from leaf to core:
1. Node agents (rolling, batched)
2. vCluster schedulers (rolling)
3. API servers (rolling)
4. Quorum members (Raft rolling membership change, one at a time)
This order ensures that core components (quorum) speak the old protocol until all clients (node agents, schedulers) are upgraded. The quorum is upgraded last because it’s the most critical and the hardest to roll back.
Node Agent Rolling Upgrade
Procedure
For each batch of nodes:
- Drain: Stop scheduling new allocations to the node. Node enters Draining state. If no allocations are running, it transitions directly to Drained.
- Wait: Running allocations complete naturally. The scheduler loop transitions the node from Draining to Drained once all allocations finish. For urgent upgrades: checkpoint running allocations and migrate (cross-ref: checkpoint-broker.md).
- Upgrade: Replace node agent binary while node is Drained. Configuration is preserved.
- Restart: Node agent starts, re-registers with quorum using new protocol version.
- Health check: Node passes health check (heartbeat, GPU detection, network test).
- Undrain: Operator runs undrain. Node transitions from Drained to Ready and is available for scheduling.
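The node state transitions in this procedure can be modeled as a small state machine. The sketch below is illustrative only: the type and function names are hypothetical, and requiring a passed health check before undrain is an assumption drawn from the step order rather than a documented rule.

```python
from enum import Enum


class NodeState(Enum):
    READY = "Ready"
    DRAINING = "Draining"
    DRAINED = "Drained"


def drain(running_allocations: int) -> NodeState:
    """Drain step: an idle node skips Draining and goes straight to Drained."""
    return NodeState.DRAINED if running_allocations == 0 else NodeState.DRAINING


def scheduler_tick(state: NodeState, running_allocations: int) -> NodeState:
    """Wait step: the scheduler loop promotes Draining to Drained once idle."""
    if state is NodeState.DRAINING and running_allocations == 0:
        return NodeState.DRAINED
    return state


def undrain(state: NodeState, healthy: bool) -> NodeState:
    """Undrain step: only a Drained node returns to Ready.

    Assumption: the health check must have passed before undrain is accepted.
    """
    if state is not NodeState.DRAINED:
        raise ValueError(f"cannot undrain a node in state {state.value}")
    if not healthy:
        raise ValueError("node has not passed its health check")
    return NodeState.READY


state = drain(running_allocations=2)                  # Draining
state = scheduler_tick(state, running_allocations=0)  # Drained
state = undrain(state, healthy=True)                  # Ready
print(state.value)  # Ready
```

The binary replacement and restart happen while the node sits in Drained, so they do not appear as state changes here.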
Canary Strategy
- Upgrade 1-2 nodes first (canary set)
- Monitor canary nodes for the observation window (default: 15 minutes):
  - Scheduling cycle latency within SLO (cross-ref: telemetry.md scheduler self-monitoring)
  - No increase in allocation failures on canary nodes
  - Heartbeat latency stable
  - Node health check pass rate = 100%
- If canary passes: proceed with rolling batches (batch size configurable, default: 5% of nodes)
- If canary fails: stop rollout, revert canary nodes (see Rollback below)
Batch Sizing
| Cluster Size | Canary Size | Batch Size | Total Batches |
|---|---|---|---|
| < 50 nodes | 1 node | 5 nodes | ~10 |
| 50-500 nodes | 2 nodes | 25 nodes | ~20 |
| 500+ nodes | 5 nodes | 50 nodes | varies |
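As a rough illustration of how the table translates into a rollout plan, the sketch below picks the canary and batch sizes by cluster size and derives the number of batches. The function name plan_rollout is hypothetical, it follows the fixed sizes in the table rather than the configurable percentage default, and treating a cluster of exactly 500 nodes as part of the 500+ row is an assumption, since the table leaves that boundary ambiguous.

```python
import math
from typing import Tuple


def plan_rollout(total_nodes: int) -> Tuple[int, int, int]:
    """Return (canary_size, batch_size, batch_count) following the sizing table.

    Assumption: exactly 500 nodes falls into the "500+" row.
    """
    if total_nodes < 50:
        canary, batch = 1, 5
    elif total_nodes < 500:
        canary, batch = 2, 25
    else:
        canary, batch = 5, 50
    canary = min(canary, total_nodes)
    remaining = total_nodes - canary  # nodes left after the canary set
    return canary, batch, math.ceil(remaining / batch)


print(plan_rollout(40))    # (1, 5, 8)
print(plan_rollout(200))   # (2, 25, 8)
print(plan_rollout(1000))  # (5, 50, 20)
```

The batch counts in the table are approximate upper bounds for each size range; the computed count for a specific cluster will usually be lower.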
vCluster Scheduler Rolling Upgrade
Schedulers are stateless — they read state from the quorum each cycle:
- Stop scheduler instance
- Upgrade binary
- Restart
- Verify: scheduling cycle completes successfully, proposals accepted by quorum
During scheduler downtime, the affected vCluster pauses scheduling (no new allocations). Running allocations are unaffected. Multiple scheduler replicas (if deployed) provide continuity.
API Server Rolling Upgrade
API servers are stateless, behind a load balancer:
- Remove instance from load balancer
- Drain active connections (grace period: 30s)
- Upgrade binary
- Restart
- Health check passes → re-add to load balancer
Client impact: brief connection reset for long-lived streams (StreamMetrics, StreamLogs). Clients reconnect automatically.
Quorum Rolling Upgrade
The most sensitive upgrade. Members are upgraded one at a time (a 5-member quorum can take two followers at once), always maintaining a quorum majority:
3-Member Quorum
- Upgrade follower A: remove from Raft group → upgrade → re-add
- Wait for follower A to catch up (Raft log sync)
- Upgrade follower B: remove → upgrade → re-add
- Wait for follower B to catch up
- Trigger leader transfer to an upgraded follower
- Upgrade old leader: remove → upgrade → re-add
Constraint: Never more than 1 member down simultaneously (2/3 majority required).
5-Member Quorum
Same procedure but can upgrade 2 followers in parallel (3/5 majority maintained):
- Upgrade followers A and B in parallel
- Wait for catch-up
- Upgrade followers C and D in parallel
- Wait for catch-up
- Leader transfer → upgrade old leader
Constraint: Never more than 2 members down simultaneously (3/5 majority required).
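Both constraints follow from the same majority arithmetic, and the upgrade sequence for either quorum size can be derived from it. The sketch below is a minimal illustration with hypothetical function names; it only plans the order and does not perform any Raft membership change.

```python
from typing import List


def majority(members: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    return members // 2 + 1


def max_members_down(members: int) -> int:
    """How many members can be offline while a majority stays up."""
    return members - majority(members)


def upgrade_waves(members: List[str], leader: str) -> List[List[str]]:
    """Group followers into parallel waves, with the leader upgraded last.

    The final wave assumes leadership has been transferred to an upgraded follower.
    """
    width = max_members_down(len(members))
    if width < 1:
        raise ValueError("quorum too small to upgrade without losing majority")
    followers = [m for m in members if m != leader]
    waves = [followers[i:i + width] for i in range(0, len(followers), width)]
    waves.append([leader])
    return waves


print(majority(3), max_members_down(3))  # 2 1
print(majority(5), max_members_down(5))  # 3 2
print(upgrade_waves(["A", "B", "L"], leader="L"))
# [['A'], ['B'], ['L']]
print(upgrade_waves(["A", "B", "C", "D", "L"], leader="L"))
# [['A', 'B'], ['C', 'D'], ['L']]
```

Each wave should be followed by a catch-up wait before the next one starts, as in the procedures above.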
Quorum Upgrade Verification
After each member upgrade:
- Raft log replication is current (no lag)
- Commit latency within SLO (< 5s)
- Leader election succeeds if triggered
- All node ownership state is consistent
Canary Criteria
Metrics from scheduler self-monitoring (cross-ref: telemetry.md) that gate rollout progression:
| Metric | Threshold | Severity |
|---|---|---|
| lattice_scheduling_cycle_duration_seconds | p99 < 30s | Warning: pause rollout |
| lattice_scheduling_proposals_total{result="rejected"} | No increase > 10% | Warning: pause rollout |
| lattice_agent_heartbeat_latency_seconds | p99 < 5s | Warning: pause rollout |
| lattice_raft_commit_latency_seconds | p99 < 5s | Critical: stop rollout |
| lattice_api_requests_total{status="5xx"} | No increase > 5% | Warning: pause rollout |
| Allocation failure rate | No increase | Critical: stop rollout |
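One way to read the table is as a gate function that maps a metrics snapshot to a rollout decision, with Critical breaches taking precedence over Warning breaches. The sketch below assumes the metrics have already been reduced to p99 values and percentage increases relative to a pre-upgrade baseline; the dictionary keys and the evaluate_gate name are hypothetical and do not correspond to an existing API.

```python
from enum import Enum
from typing import Dict


class Decision(Enum):
    PROCEED = "proceed"
    PAUSE = "pause"  # at least one Warning threshold breached
    STOP = "stop"    # at least one Critical threshold breached


def evaluate_gate(m: Dict[str, float]) -> Decision:
    """Apply the canary thresholds; keys are hypothetical pre-aggregated values."""
    critical = [
        m["raft_commit_p99_s"] >= 5.0,
        m["allocation_failure_rate_increase_pct"] > 0.0,
    ]
    warning = [
        m["scheduling_cycle_p99_s"] >= 30.0,
        m["rejected_proposals_increase_pct"] > 10.0,
        m["heartbeat_p99_s"] >= 5.0,
        m["api_5xx_increase_pct"] > 5.0,
    ]
    if any(critical):
        return Decision.STOP
    if any(warning):
        return Decision.PAUSE
    return Decision.PROCEED


healthy = {
    "raft_commit_p99_s": 0.8,
    "allocation_failure_rate_increase_pct": 0.0,
    "scheduling_cycle_p99_s": 12.0,
    "rejected_proposals_increase_pct": 2.0,
    "heartbeat_p99_s": 1.5,
    "api_5xx_increase_pct": 0.0,
}
print(evaluate_gate(healthy).value)                               # proceed
print(evaluate_gate({**healthy, "heartbeat_p99_s": 6.0}).value)   # pause
print(evaluate_gate({**healthy, "raft_commit_p99_s": 7.0}).value) # stop
```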
Rollback
Node Agent Rollback
- Drain canary/failed nodes
- Replace binary with previous version
- Restart
- Verify old-version operation
- Protocol backward compatibility ensures the rolled-back agent works with the rest of the cluster
Scheduler/API Rollback
Stateless — replace binary and restart.
Quorum Rollback
- Remove new-version member from Raft group
- Add old-version member back
- Protocol backward compatibility ensures mixed-version operation during the transition
Rollback is always safe because N-1 protocol support is maintained throughout the upgrade window.
Configuration Hot-Reload
Not all changes require a binary upgrade. Configuration changes that can be hot-reloaded via quorum without restart:
| Change | Hot-Reloadable | Mechanism |
|---|---|---|
| Cost function weights | Yes | Quorum config update, schedulers pick up next cycle |
| vCluster policies | Yes | Quorum config update |
| Telemetry mode (prod/debug/audit) | Yes | API call to node agent |
| Tenant quotas | Yes | Quorum config update |
| Node drain/undrain | Yes | API call |
| Protocol version | No | Binary upgrade required |
| Raft cluster size | No | Membership change (safe, but not hot-reload) |
Cross-References
- telemetry.md — Scheduler self-monitoring metrics used for canary criteria
- failure-modes.md — Failure detection during upgrades
- security.md — Certificate rotation during upgrades
- checkpoint-broker.md — Checkpoint before drain for urgent upgrades