Troubleshooting Guide

Allocation Stuck in Pending

Symptom: lattice status shows allocation in Pending for longer than expected.

Diagnosis:

# Check why the allocation isn't being scheduled
lattice status 12345 --verbose

Verbose Output	Cause	Fix
`waiting for quota headroom`	Tenant hard quota (`max_nodes` or `max_concurrent_allocations`) exceeded	Cancel other allocations or request quota increase
`no nodes matching constraints`	No nodes with requested GPU type, features, or topology	Relax constraints (`--topology=any`), check `lattice nodes --state=ready`
`data staging in progress`	Input data being pre-staged from warm/cold tier	Wait (check progress with `lattice status 12345 --verbose`), or submit without `tier_hint: hot`
`insufficient conformance group`	Not enough nodes with matching conformance fingerprint for multi-node job	Reduce node count, or wait for OpenCHAMI to remediate drifted nodes
`all suitable nodes occupied`	Resources are busy; allocation is queued normally	Wait; check queue depth with `lattice status --state=pending`
`soft quota penalty (low score)`	GPU-hours budget nearly exhausted; allocation deprioritized	Request budget increase from tenant admin or Waldur portal

Deeper investigation:

# Check scheduler cycle is running
lattice admin scheduler status --vcluster=hpc-batch

# Check if proposals are being rejected
lattice admin raft status

# View scheduling metrics
# (high proposal rejection rate may indicate race conditions or quota contention)

Scheduling Cycle Slow

Symptom: lattice_scheduling_cycle_duration_seconds p99 > 30s.

Diagnosis:

Check	Command	What to Look For
Queue depth	`lattice status --state=pending --count`	> 500 pending allocations
Cost function time	Grafana: `lattice_scheduling_cost_function_duration_seconds`	Dominant component of cycle
Conformance group fragmentation	`lattice nodes -o wide \| sort -k7 \| uniq -c`	Many small groups
Topology solver	Grafana: cycle time breakdown	Multi-group spanning expensive

Fixes:

Cause	Fix
Too many pending allocations	Increase cycle interval to batch more proposals
Cost function slow	Check if custom metrics (f₅ data_readiness) are causing TSDB query delays
Conformance fragmented	Standardize firmware, or reduce w₉ for tolerant workloads
Topology solver	Reduce backfill depth, or allow `topology: any` for more jobs

Node Stuck in Degraded/Down

Symptom: Node shows Degraded or Down in lattice nodes.

Diagnosis:

# Check node details
lattice nodes x1000c0s0b0n0

# Check heartbeat
# If heartbeat missing: node agent may be down or network partitioned

State	Duration	Likely Cause
Degraded, < 2 min	Transient network blip	Wait; likely self-resolves
Degraded, > 5 min	Agent crash or network partition	SSH to node, check agent: `systemctl status lattice-agent`
Down	Agent not recovering	Check BMC via OpenCHAMI: `manta node status x1000c0s0b0n0`
Down, BMC unreachable	Hardware failure	Physical inspection required

Recovery:

# If agent crashed, restart it
ssh x1000c0s0b0n0 systemctl restart lattice-agent

# If node needs reboot
lattice node disable x1000c0s0b0n0
# (coordinate with OpenCHAMI for reboot)
lattice node undrain x1000c0s0b0n0  # after reboot + health check

Raft Commit Latency High

Symptom: lattice_raft_commit_latency_seconds p99 > 1s.

Diagnosis:

Check	What to Look For
Disk I/O on quorum members	WAL write latency. Quorum members need fast SSD.
Network between quorum members	Packet loss or high latency between quorum nodes
Leader overloaded	Too many proposals per second
Log compaction	Snapshot in progress (one-time spike, normal)

Fixes:

Cause	Fix
Slow disk	Move WAL to dedicated NVMe SSD
Network latency	Ensure quorum members are on low-latency network (same rack or switch)
Leader overload	Increase scheduling cycle interval to reduce proposal rate
Log too large	Reduce snapshot interval (more frequent snapshots = smaller log)

Allocation Fails During Prologue

Symptom: Allocation moves from Running to Failed within seconds of starting.

Diagnosis:

lattice logs 12345
# Look for prologue errors:
#   "uenv pull failed: hash mismatch"
#   "mount failed: ENOSPC"
#   "NFS mount timeout"

Error	Cause	Fix
Hash mismatch	Corrupted image in cache or registry	`lattice cache evict --image=... --node=...` and retry
ENOSPC	Node-local cache full, eviction couldn’t free space	Check cache status: `lattice cache status --node=...`. Evict unused images manually.
NFS mount timeout	VAST unavailable or network issue	Check VAST health. Check Slingshot storage traffic class.
Image not found	uenv name/version doesn’t exist in registry	Verify with `lattice cache status --node=...` or check the uenv registry directly

Preemption Not Working

Symptom: Higher-priority allocation waiting despite lower-priority allocations running on suitable nodes.

Diagnosis:

lattice status 12345 --verbose
# Check if preemption is enabled for this vCluster
lattice admin vcluster show hpc-batch

Cause	Fix
Pending job’s priority class ≤ running jobs’ class	Preemption only works downward. Check priority classes.
Running jobs are non-preemptible (`checkpoint: none` + high class)	Wait for them to complete
Running jobs are near completion (>90% walltime)	Scheduler avoids preempting near-completion jobs. Wait.
vCluster doesn’t allow preemption	Check vCluster config. Service vClusters only preempt borrowed nodes.

Autoscaling Not Triggering

Symptom: Reactive allocation stays at min_nodes despite high metric value.

Diagnosis:

# Check current metric value
lattice top 12345 --metric=gpu_utilization

# Check scaling events
lattice status 12345 --verbose

Cause	Fix
Metric below target	Scaling only triggers when metric > target for `scale_up_window` (2 min)
Cooldown period active	Recent scale event; wait for cooldown (3 min default)
TSDB query failing	Check `lattice_autoscaling_metric_query_failures_total` metric
Tenant quota exhausted	`max_nodes` reached; scale-up is a no-op
Metric name wrong	Verify metric exists in TSDB: `lattice top 12345 --metric=<name>`

Sensitive Node Won’t Accept Claims

Symptom: Sensitive node claim rejected.

Diagnosis:

Check	What to Look For
`lattice nodes <id>`	Is node in `Ready` state? (Not `Degraded`, `Down`, `Draining`)
Conformance	Is node’s conformance fingerprint matching the sensitive baseline?
Pool size	Is `sensitive_pool_size` quota exhausted?
Previous wipe	Was the node properly wiped after last sensitive use?

Fix:

# Check conformance
lattice nodes x1000c0s0b0n0 -o wide
# If drifted: coordinate with OpenCHAMI for remediation

# Check sensitive pool
lattice admin tenant show hospital-a --quotas
# If exhausted: release unused sensitive nodes or increase pool

Log Collection

When filing a bug report or escalating, collect:

# System overview
lattice admin raft status > diag/raft.txt
lattice nodes -o json > diag/nodes.json
lattice status --all -o json > diag/allocations.json

# Recent scheduler metrics (last hour)
lattice admin metrics dump --component=scheduler --duration=1h > diag/scheduler-metrics.json

# Specific node agent logs (if relevant)
ssh x1000c0s0b0n0 journalctl -u lattice-agent --since="1 hour ago" > diag/agent.log

Cross-References

failure-modes.md — Expected failure patterns and recovery
node-lifecycle.md — Node state transitions and timeouts
preemption.md — Preemption policy and classes
autoscaling.md — Scaling loop and error handling
data-staging.md — Cache management and staging pipeline
tuning-guide.md — Cost function tuning for performance issues

Keyboard shortcuts

Lattice Documentation