Data Plane & Storage Architecture
Tiered Storage Model
┌─ Hot Tier (VAST-like) ─────────────────────────────────┐
│ Protocol: NFS + S3 (native multiprotocol)              │
│ Use: active datasets, home dirs, checkpoints, scratch  │
│ Performance: NVMe-speed, low-latency                   │
│ Scheduler integration: QoS per export, pre-staging     │
│ Sensitive: encrypted pool, access-logged               │
└────────────────────┬───────────────────────────────────┘
                     │ policy-driven data mover
┌────────────────────┴───────────────────────────────────┐
│ Warm Tier (capacity storage)                           │
│ Protocol: S3-compatible                                │
│ Use: completed outputs, older datasets, cold models    │
│ Cost: significantly lower than hot                     │
└────────────────────┬───────────────────────────────────┘
                     │ archive policy
┌────────────────────┴───────────────────────────────────┐
│ Cold Tier (tape/object archive)                        │
│ Protocol: S3-compatible (Glacier-style retrieval)      │
│ Use: regulatory retention, long-term archival          │
│ Sensitive: 7+ year retention, immutable                │
└────────────────────────────────────────────────────────┘
Protocol Standardization
Only two protocols for user-facing access:
- NFS: POSIX workloads, home directories, uenv images, legacy codes that expect a filesystem
- S3: Object access for checkpoints, datasets, model artifacts, any cloud-native tooling
No Lustre/GPFS client required. VAST delivers parallel-file-system performance via NFS.
Job Data Requirements
Explicit Declaration
Users who know their data needs can declare them:
data:
  mounts:
    - source: "s3://training-data/imagenet"
      target: "/data/input"
      tier_hint: "hot"
      access: "read-only"
    - source: "nfs://home/{user}"
      target: "/home/{user}"
      access: "read-write"
  output: "s3://{tenant}/{project}/{allocation_id}/"
  scratch_per_node: "500GB"
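To illustrate how such a declaration might be resolved before mounting, here is a minimal Python sketch. It assumes the declaration has already been parsed into plain dictionaries; `Mount`, `resolve_mount`, and the two `ALLOWED_*` sets are hypothetical names introduced for illustration, not part of the design above.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Assumption: only the two user-facing protocols are accepted as sources.
ALLOWED_SCHEMES = {"nfs", "s3"}
# Assumption: these are the only access modes (the two seen in the example).
ALLOWED_ACCESS = {"read-only", "read-write"}


@dataclass(frozen=True)
class Mount:
    source: str
    target: str
    access: str
    tier_hint: Optional[str] = None


def resolve_mount(raw: dict, context: dict) -> Mount:
    """Expand {placeholders} and validate one entry of data.mounts."""
    source = raw["source"].format(**context)
    target = raw["target"].format(**context)
    scheme = urlparse(source).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported protocol: {scheme!r}")
    if not target.startswith("/"):
        raise ValueError(f"target must be an absolute path: {target!r}")
    if raw["access"] not in ALLOWED_ACCESS:
        raise ValueError(f"unknown access mode: {raw['access']!r}")
    return Mount(source, target, raw["access"], raw.get("tier_hint"))
```

Rejecting any scheme other than `nfs` or `s3` at resolution time keeps the two-protocol rule enforceable at submission rather than at mount time.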
Sane Defaults (for users who don’t specify)
Every allocation automatically gets:
- Home directory: mounted via NFS from hot tier (/home/{user})
- Node-local scratch: NVMe-backed ephemeral storage (/scratch/local/) if NVMe is available; tmpfs or network scratch otherwise
- Output directory: s3://{tenant}/{project}/{allocation_id}/ auto-created
- Checkpoint directory: s3://{tenant}/{project}/{allocation_id}/checkpoints/ (if checkpoint != none)
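The defaults above could be derived with a small helper along these lines. This is a sketch under assumptions: the function name, the dictionary keys, and the `"auto"` checkpoint value used in the usage note are hypothetical, and the path of the fallback scratch is left out because the text does not specify it.

```python
def default_data_config(tenant, project, allocation_id, user,
                        has_nvme, checkpoint="none"):
    """Build the default data layout for an allocation (illustrative only)."""
    prefix = f"s3://{tenant}/{project}/{allocation_id}/"
    config = {
        "home": f"/home/{user}",  # NFS mount from the hot tier
        # Node-local NVMe scratch when available; otherwise tmpfs or network scratch.
        "scratch_backing": "nvme" if has_nvme else "tmpfs-or-network",
        "output": prefix,         # auto-created output directory
    }
    if has_nvme:
        config["scratch"] = "/scratch/local/"
    if checkpoint != "none":
        config["checkpoints"] = prefix + "checkpoints/"
    return config
```

For example, `default_data_config("acme", "climate", "a1", "alice", has_nvme=True, checkpoint="auto")` would include a `checkpoints` entry, while the same call with the default `checkpoint="none"` would not.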
Data Staging (Scheduler-Integrated)
The scheduler integrates with the storage API for intelligent data movement:
- Pre-staging during queue wait: When a job is queued and its data is on warm/cold tier, the data mover begins warming it to hot tier. Queue wait time becomes useful instead of idle.
- QoS allocation at job start: The scheduler calls the VAST API to set bandwidth guarantees for the job’s NFS export. Prevents I/O-intensive jobs from starving latency-sensitive services.
- Checkpoint coordination: The checkpoint broker pre-allocates storage bandwidth windows to avoid I/O storms when many jobs checkpoint simultaneously.
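One way to picture checkpoint coordination is as a lane scheduler: at most a fixed number of jobs write at once, and each job receives a start and end time for its window. The sketch below is an assumption about how a broker might do this, not the broker's actual algorithm; the function name, the per-job bandwidth figure, and the concurrency cap are all hypothetical parameters.

```python
import heapq


def schedule_checkpoint_windows(sizes_gb, per_job_gb_per_s, max_concurrent):
    """Assign (start_s, end_s) windows so at most max_concurrent jobs write at once.

    sizes_gb maps job name -> checkpoint size in GB. Jobs are placed in the
    order given, each on whichever lane becomes free first.
    """
    lanes = [0.0] * max_concurrent  # time at which each lane becomes free
    heapq.heapify(lanes)
    windows = {}
    for job, size_gb in sizes_gb.items():
        start = heapq.heappop(lanes)              # earliest free lane
        end = start + size_gb / per_job_gb_per_s  # time to drain this checkpoint
        windows[job] = (start, end)
        heapq.heappush(lanes, end)
    return windows
```

With three jobs of 100, 50, and 50 GB, 10 GB/s per job, and two lanes, the third job starts at 5 s, when the second finishes, rather than competing with both at time zero.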
VAST API Integration Points
| Operation | VAST API | When |
|---|---|---|
| Create export with QoS | POST /exports + QoS policy | Job starts |
| Query data locality | GET /catalog?path=… | Scheduling (data_readiness score) |
| Create snapshot | POST /snapshots | Job start (reproducibility) or checkpoint |
| Pre-stage from warm | POST /dataspace/prefetch | Job queued, data not on hot tier |
| Set bandwidth floor | PATCH /exports/{id}/qos | Job starts |
| Audit log query | GET /audit/logs?path=… | Compliance reporting |
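As a rough illustration of how the scheduler might issue two of these calls, the sketch below builds (but does not send) HTTP requests with the Python standard library. The endpoint paths come from the table; the base URL and the JSON field names (`path`, `min_bandwidth_mbps`) are assumptions made for the example and are not taken from VAST documentation.

```python
import json
from urllib.request import Request

# Hypothetical base URL; replace with the real management endpoint.
BASE_URL = "https://vast.example.internal/api"


def build_request(method, path, body=None):
    """Construct a JSON request object without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def prefetch_request(dataset_path):
    # "Pre-stage from warm": issued while the job is still queued.
    # The body field name "path" is an assumption.
    return build_request("POST", "/dataspace/prefetch", {"path": dataset_path})


def bandwidth_floor_request(export_id, min_bandwidth_mbps):
    # "Set bandwidth floor": issued when the job starts.
    # The body field name "min_bandwidth_mbps" is an assumption.
    return build_request("PATCH", f"/exports/{export_id}/qos",
                         {"min_bandwidth_mbps": min_bandwidth_mbps})
```

Keeping request construction separate from sending makes the integration points easy to unit-test without a live storage cluster.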
Sensitive Storage Policy
vcluster: sensitive-secure
storage_policy:
  encryption: aes-256-at-rest
  pool: dedicated                 # separate VAST view/tenant
  wipe_on_release: true           # scrub after allocation ends
  access_logging: full            # every read/write logged
  data_sovereignty: "ch"          # data stays in Swiss jurisdiction
  retention:
    data: "as_specified_by_user"
    audit_logs: "7_years"
  tier_restriction: "hot_only"    # no unencrypted copies on warm/cold
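A data mover could consult this policy before placing or copying sensitive data. The following is a minimal sketch of such a check, assuming the policy has been loaded into a dictionary; `policy_violations` and its parameters are hypothetical, and the `region` value is assumed to use the same code as `data_sovereignty`.

```python
# The storage_policy above, as a plain dictionary (subset of fields).
SENSITIVE_POLICY = {
    "encryption": "aes-256-at-rest",
    "data_sovereignty": "ch",
    "tier_restriction": "hot_only",
}


def policy_violations(policy, tier, region, encrypted):
    """Return a list of reasons why a proposed placement breaks the policy."""
    problems = []
    if policy.get("tier_restriction") == "hot_only" and tier != "hot":
        problems.append(f"tier '{tier}' is not allowed (hot_only)")
    sovereignty = policy.get("data_sovereignty")
    if sovereignty and region != sovereignty:
        problems.append(f"region '{region}' is outside '{sovereignty}'")
    if policy.get("encryption") and not encrypted:
        problems.append("unencrypted copy is not allowed")
    return problems
```

An empty list means the placement is acceptable; returning reasons rather than a boolean lets the caller record them in the audit trail.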
Log Storage
Allocation logs are persisted to S3 alongside output data. See observability.md for the log storage layout:
s3://{tenant}/{project}/{alloc_id}/logs/
├── stdout/{node_id}/{chunk_000..N}.log.zst
├── stderr/{node_id}/{chunk_000..N}.log.zst
└── metadata.json
Sensitive allocation logs are stored in the encrypted sensitive S3 pool with access logging enabled.
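The layout above maps directly onto a key-building helper. This sketch assumes chunk indices are zero-padded to at least three digits, as the `chunk_000` pattern suggests; the function name is hypothetical.

```python
def log_chunk_key(tenant, project, alloc_id, stream, node_id, chunk):
    """Return the S3 URI of one compressed log chunk."""
    if stream not in ("stdout", "stderr"):
        raise ValueError(f"unknown stream: {stream!r}")
    if chunk < 0:
        raise ValueError("chunk index must be non-negative")
    return (f"s3://{tenant}/{project}/{alloc_id}/logs/"
            f"{stream}/{node_id}/chunk_{chunk:03d}.log.zst")
```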
Node-Local Storage (Optional)
Nodes may have NVMe SSDs managed by the node agent. Local storage is not a hard requirement — nodes without NVMe operate with reduced performance but full functionality.
When NVMe is present:
- Scratch: ephemeral, wiped between allocations. For temp files, staging.
- Image cache: persistent across allocations. Caches uenv squashfs images and OCI layers.
  - LRU eviction policy
  - Cache hit avoids network pull from registry
  - Popular images stay warm automatically
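The LRU behaviour of the image cache can be sketched with an ordered dictionary that tracks sizes. This is an illustrative model only: the class name and byte-based capacity are assumptions, and a real node agent would also manage the files on disk.

```python
from collections import OrderedDict


class ImageCache:
    """Size-bounded LRU index of cached images (bookkeeping only)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # image ref -> size, least recent first

    def get(self, ref):
        """Return True on a cache hit and mark the image as recently used."""
        if ref not in self._entries:
            return False
        self._entries.move_to_end(ref)
        return True

    def put(self, ref, size_bytes):
        """Insert an image, evicting least recently used ones; return evicted refs."""
        if size_bytes > self.capacity_bytes:
            raise ValueError("image larger than cache capacity")
        if ref in self._entries:
            self.used_bytes -= self._entries.pop(ref)
        evicted = []
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_ref, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_ref)
        self._entries[ref] = size_bytes
        self.used_bytes += size_bytes
        return evicted
```

Because every hit moves the image to the most-recent end, frequently used images are the last candidates for eviction, which is the "popular images stay warm" property.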
When NVMe is absent:
- Scratch: falls back to tmpfs (RAM-backed) or a network-mounted scratch directory. Capacity is limited by available RAM or network storage quota.
- Image cache: no persistent local cache. Images are pulled from the registry on every allocation start (or served from a shared NFS cache if configured). Higher startup latency.
- Allocations requesting the nvme_scratch feature constraint will not be scheduled on these nodes.
The node agent detects local storage at startup and reports its availability as part of node capabilities (features: ["nvme_scratch"]).
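Feature-based placement then reduces to a subset check between the features an allocation requires and those a node reports. The sketch below assumes node capabilities are available as a mapping from node name to feature list; both function names are hypothetical.

```python
def node_features(has_nvme):
    """Features a node agent would report, based on detected local storage."""
    return ["nvme_scratch"] if has_nvme else []


def eligible_nodes(nodes, required_features):
    """Return sorted names of nodes whose features cover every required one."""
    required = set(required_features)
    return sorted(name for name, features in nodes.items()
                  if required <= set(features))
```

An allocation with no feature constraints matches every node, so nodes without NVMe remain fully usable for jobs that do not ask for `nvme_scratch`.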