
# Troubleshooting Guide

## Allocation Stuck in Pending

**Symptom:** `lattice status` shows the allocation in `Pending` for longer than expected.

Diagnosis:

```sh
# Check why the allocation isn't being scheduled
lattice status 12345 --verbose
```

| Verbose Output | Cause | Fix |
|---|---|---|
| `waiting for quota headroom` | Tenant hard quota (`max_nodes` or `max_concurrent_allocations`) exceeded | Cancel other allocations or request a quota increase |
| `no nodes matching constraints` | No nodes with the requested GPU type, features, or topology | Relax constraints (`--topology=any`); check `lattice nodes --state=ready` |
| `data staging in progress` | Input data being pre-staged from the warm/cold tier | Wait (check progress with `lattice status 12345 --verbose`), or submit without `tier_hint: hot` |
| `insufficient conformance group` | Not enough nodes with a matching conformance fingerprint for a multi-node job | Reduce the node count, or wait for OpenCHAMI to remediate drifted nodes |
| `all suitable nodes occupied` | Resources are busy; the allocation is queued normally | Wait; check queue depth with `lattice status --state=pending` |
| `soft quota penalty (low score)` | GPU-hours budget nearly exhausted; allocation deprioritized | Request a budget increase from the tenant admin or the Waldur portal |
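
The last row describes a scoring penalty rather than a hard reject: the allocation stays schedulable but loses priority as the budget runs out. A minimal sketch of that idea, with the function name, threshold, and weight all hypothetical:

```python
def allocation_score(base_score: float, gpu_hours_used: float,
                     gpu_hours_budget: float, penalty_weight: float = 0.5) -> float:
    """Deprioritize (but never reject) allocations whose tenant is near
    its GPU-hours budget. Illustrative only, not the Lattice scheduler's
    actual cost model."""
    utilization = gpu_hours_used / gpu_hours_budget
    if utilization < 0.9:
        return base_score                       # plenty of headroom: no penalty
    # linear penalty as the budget approaches exhaustion
    overage = min(utilization, 1.0) - 0.9
    return base_score * (1.0 - penalty_weight * (overage / 0.1))
```

A tenant at 95% of budget scores lower than one at 50%, so its allocations sit longer in `Pending` under contention, which is exactly the symptom in the table.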

Deeper investigation:

```sh
# Check that the scheduler cycle is running
lattice admin scheduler status --vcluster=hpc-batch

# Check whether proposals are being rejected
lattice admin raft status

# View scheduling metrics
# (a high proposal rejection rate may indicate race conditions or quota contention)
```

## Scheduling Cycle Slow

**Symptom:** `lattice_scheduling_cycle_duration_seconds` p99 > 30s.

Diagnosis:

| Check | Command | What to Look For |
|---|---|---|
| Queue depth | `lattice status --state=pending --count` | > 500 pending allocations |
| Cost function time | Grafana: `lattice_scheduling_cost_function_duration_seconds` | Dominant component of the cycle |
| Conformance group fragmentation | `lattice nodes -o wide \| sort -k7 \| uniq -c` | Many small groups |
| Topology solver | Grafana: cycle time breakdown | Multi-group spanning is expensive |

Fixes:

| Cause | Fix |
|---|---|
| Too many pending allocations | Increase the cycle interval to batch more proposals |
| Cost function slow | Check whether custom metrics (f₅ `data_readiness`) are causing TSDB query delays |
| Conformance fragmented | Standardize firmware, or reduce w₉ for tolerant workloads |
| Topology solver | Reduce backfill depth, or allow `topology: any` for more jobs |
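
Two of the fixes above adjust cost-function terms (f₅) and weights (w₉). A minimal sketch of such a weighted node cost, with names, signature, and defaults all hypothetical:

```python
def node_cost(data_readiness: float, conformance_drift: float,
              w5: float = 1.0, w9: float = 1.0) -> float:
    """Weighted sum of per-node cost components; lower is better.
    Setting w9 to 0 makes a drift-tolerant workload indifferent to
    conformance fragmentation. Illustrative only."""
    return w5 * data_readiness + w9 * conformance_drift
```

This is why reducing w₉ helps fragmented clusters: a drifted node stops being penalized, so the solver no longer hunts for a single large conformance group.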

## Node Stuck in Degraded/Down

**Symptom:** The node shows `Degraded` or `Down` in `lattice nodes`.

Diagnosis:

```sh
# Check node details
lattice nodes x1000c0s0b0n0

# Check the heartbeat
# If the heartbeat is missing, the node agent may be down or the network partitioned
```

| State | Duration | Likely Cause | Fix |
|---|---|---|---|
| `Degraded` | < 2 min | Transient network blip | Wait; likely self-resolves |
| `Degraded` | > 5 min | Agent crash or network partition | SSH to the node and check the agent: `systemctl status lattice-agent` |
| `Down` | — | Agent not recovering | Check the BMC via OpenCHAMI: `manta node status x1000c0s0b0n0` |
| `Down` | — | BMC unreachable: hardware failure | Physical inspection required |
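
The thresholds in the table can be read as a small heartbeat-age classifier. This sketch is purely illustrative, not the agent's actual state machine:

```python
def node_state(seconds_since_heartbeat: float, bmc_reachable: bool = True) -> str:
    """Classify a node from heartbeat age, mirroring the table's
    thresholds. Names and return values are hypothetical."""
    if seconds_since_heartbeat < 120:
        return "Ready"              # recent heartbeat, or a transient blip
    if seconds_since_heartbeat < 300:
        return "Degraded"           # 2-5 min: may still self-resolve
    if bmc_reachable:
        return "Down"               # agent not recovering; check BMC via OpenCHAMI
    return "Down (hardware)"        # BMC unreachable: physical inspection
```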

Recovery:

```sh
# If the agent crashed, restart it
ssh x1000c0s0b0n0 systemctl restart lattice-agent

# If the node needs a reboot
lattice node disable x1000c0s0b0n0
# (coordinate with OpenCHAMI for the reboot)
lattice node undrain x1000c0s0b0n0  # after reboot + health check
```

## Raft Commit Latency High

**Symptom:** `lattice_raft_commit_latency_seconds` p99 > 1s.

Diagnosis:

| Check | What to Look For |
|---|---|
| Disk I/O on quorum members | WAL write latency; quorum members need fast SSDs |
| Network between quorum members | Packet loss or high latency between quorum nodes |
| Leader overloaded | Too many proposals per second |
| Log compaction | Snapshot in progress (a one-time spike is normal) |

Fixes:

| Cause | Fix |
|---|---|
| Slow disk | Move the WAL to a dedicated NVMe SSD |
| Network latency | Ensure quorum members are on a low-latency network (same rack or switch) |
| Leader overload | Increase the scheduling cycle interval to reduce the proposal rate |
| Log too large | Reduce the snapshot interval (more frequent snapshots = smaller log) |
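
The last fix follows from Raft log compaction: a restarting member replays every entry committed since the last snapshot, so halving the snapshot interval halves the worst-case replay. A back-of-the-envelope model (purely illustrative):

```python
def replay_log_entries(proposals_per_sec: float,
                       snapshot_interval_sec: float) -> float:
    """Worst-case number of Raft log entries a restarting quorum member
    must replay: everything committed since the last snapshot."""
    return proposals_per_sec * snapshot_interval_sec
```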

## Allocation Fails During Prologue

**Symptom:** The allocation moves from `Running` to `Failed` within seconds of starting.

Diagnosis:

```sh
lattice logs 12345
# Look for prologue errors:
#   "uenv pull failed: hash mismatch"
#   "mount failed: ENOSPC"
#   "NFS mount timeout"
```

| Error | Cause | Fix |
|---|---|---|
| Hash mismatch | Corrupted image in the cache or registry | `lattice cache evict --image=... --node=...` and retry |
| `ENOSPC` | Node-local cache full; eviction couldn't free space | Check cache status: `lattice cache status --node=...`; evict unused images manually |
| NFS mount timeout | VAST unavailable or a network issue | Check VAST health; check the Slingshot storage traffic class |
| Image not found | uenv name/version doesn't exist in the registry | Verify with `lattice cache status --node=...` or check the uenv registry directly |
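
The hash-mismatch case boils down to comparing the pulled image's digest with the expected one. A minimal sketch, assuming SHA-256 and a hypothetical function name:

```python
import hashlib

def verify_image(image_bytes: bytes, expected_sha256: str) -> bool:
    """Return True if the pulled image matches its expected digest;
    a mismatch means the cached or registry copy is corrupted."""
    return hashlib.sha256(image_bytes).hexdigest() == expected_sha256
```

Evicting the cached copy and retrying forces a fresh pull, which resolves the mismatch unless the registry copy itself is corrupted.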

## Preemption Not Working

**Symptom:** A higher-priority allocation is waiting despite lower-priority allocations running on suitable nodes.

Diagnosis:

```sh
lattice status 12345 --verbose

# Check whether preemption is enabled for this vCluster
lattice admin vcluster show hpc-batch
```

| Cause | Fix |
|---|---|
| Pending job's priority class ≤ running jobs' class | Preemption only works downward; check priority classes |
| Running jobs are non-preemptible (`checkpoint: none` + high class) | Wait for them to complete |
| Running jobs are near completion (> 90% walltime) | The scheduler avoids preempting near-completion jobs; wait |
| vCluster doesn't allow preemption | Check the vCluster config; service vClusters only preempt borrowed nodes |
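
The first three causes combine into a simple eligibility test. This sketch mirrors the table; the signature and threshold are hypothetical, not the scheduler's actual code:

```python
def can_preempt(pending_class: int, running_class: int,
                running_preemptible: bool, walltime_fraction: float) -> bool:
    """A pending job may preempt a running one only if all three
    conditions from the table hold. Illustrative only."""
    if pending_class <= running_class:
        return False            # preemption only goes strictly downward
    if not running_preemptible:
        return False            # e.g. checkpoint: none + high priority class
    if walltime_fraction > 0.9:
        return False            # near completion: let it finish
    return True
```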

## Autoscaling Not Triggering

**Symptom:** A reactive allocation stays at `min_nodes` despite a high metric value.

Diagnosis:

```sh
# Check the current metric value
lattice top 12345 --metric=gpu_utilization

# Check scaling events
lattice status 12345 --verbose
```

| Cause | Fix |
|---|---|
| Metric below target | Scaling only triggers when the metric exceeds the target for `scale_up_window` (2 min) |
| Cooldown period active | Recent scale event; wait for the cooldown (3 min default) |
| TSDB query failing | Check the `lattice_autoscaling_metric_query_failures_total` metric |
| Tenant quota exhausted | `max_nodes` reached; scale-up is a no-op |
| Metric name wrong | Verify the metric exists in the TSDB: `lattice top 12345 --metric=<name>` |
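
The first, second, and fourth causes combine into one scale-up decision. A sketch with defaults mirroring the 2 min window and 3 min cooldown above; the function name and signature are hypothetical:

```python
def should_scale_up(metric_above_target_secs: float,
                    secs_since_last_scale: float,
                    current_nodes: int, max_nodes: int,
                    scale_up_window: float = 120.0,
                    cooldown: float = 180.0) -> bool:
    """Scale up only when the metric has exceeded the target for the
    full window, the cooldown has elapsed, and quota has headroom.
    Illustrative only."""
    if metric_above_target_secs < scale_up_window:
        return False            # metric not above target long enough
    if secs_since_last_scale < cooldown:
        return False            # recent scale event still cooling down
    if current_nodes >= max_nodes:
        return False            # tenant quota exhausted: scale-up is a no-op
    return True
```

Any single failing condition keeps the allocation at its current size, which is why the symptom can persist even with a visibly high metric.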

## Sensitive Node Won’t Accept Claims

**Symptom:** A sensitive-node claim is rejected.

Diagnosis:

| Check | What to Look For |
|---|---|
| `lattice nodes <id>` | Is the node in `Ready` state (not `Degraded`, `Down`, or `Draining`)? |
| Conformance | Does the node's conformance fingerprint match the sensitive baseline? |
| Pool size | Is the `sensitive_pool_size` quota exhausted? |
| Previous wipe | Was the node properly wiped after its last sensitive use? |
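
All four checks must pass before a claim is admitted; failing any one of them rejects it. A sketch of that gate, with the signature and field names hypothetical:

```python
def claim_admitted(state: str, fingerprint: str, baseline: str,
                   pool_used: int, pool_size: int, wiped: bool) -> bool:
    """Admit a sensitive-node claim only if every check from the
    table passes. Illustrative only."""
    return (state == "Ready"            # not Degraded, Down, or Draining
            and fingerprint == baseline  # conformance matches the sensitive baseline
            and pool_used < pool_size    # sensitive_pool_size has headroom
            and wiped)                   # node wiped after its last sensitive use
```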

Fix:

```sh
# Check conformance
lattice nodes x1000c0s0b0n0 -o wide
# If drifted: coordinate with OpenCHAMI for remediation

# Check the sensitive pool
lattice admin tenant show hospital-a --quotas
# If exhausted: release unused sensitive nodes or increase the pool
```

## Log Collection

When filing a bug report or escalating, collect:

```sh
# System overview
lattice admin raft status > diag/raft.txt
lattice nodes -o json > diag/nodes.json
lattice status --all -o json > diag/allocations.json

# Recent scheduler metrics (last hour)
lattice admin metrics dump --component=scheduler --duration=1h > diag/scheduler-metrics.json

# Specific node agent logs (if relevant)
ssh x1000c0s0b0n0 journalctl -u lattice-agent --since="1 hour ago" > diag/agent.log
```

## Cross-References