Memory Topology
Design Principle
Vendor-neutral abstraction over CPU-memory-GPU memory topology. The scheduler reasons about “memory domains” and “interconnect bandwidth,” not vendor-specific terms like NUMA node IDs or NVLink-C2C. Node agents discover and report memory topology; the scheduler uses it for placement decisions and memory policy configuration.
This complements gpu-topology.md, which models GPU interconnects. Memory topology models the CPU-memory-GPU memory hierarchy: NUMA domains, unified memory architectures, and CXL-attached memory tiers.
Memory Domain Types
| Type | Hardware Example | Characteristics | Discovery |
|---|---|---|---|
| Discrete NUMA | Multi-socket Intel Xeon, AMD EPYC | Separate DRAM per socket, asymmetric access latencies | /sys/devices/system/node/ |
| Unified CPU-GPU | NVIDIA Grace Hopper GH200 | NVLink-C2C coherent, single address space across CPU and GPU | NVML + /sys/devices/system/node/ |
| APU / Unified Die | AMD MI300A | CPU + GPU on same package, shared HBM3 pool | ROCm-SMI + hwloc |
| CXL-Attached | CXL Type 3 memory expanders | Pooled or device-attached memory, higher latency than local DRAM | /sys/bus/cxl/ |
| Single-Socket | Single-socket servers | Trivial: one NUMA node, uniform access | /sys/devices/system/node/ |
Abstraction Model
MemoryTopology {
domains: Vec<MemoryDomain>,
interconnects: Vec<MemoryInterconnect>,
total_capacity_bytes: u64,
}
MemoryDomain {
id: u32,
domain_type: MemoryDomainType, // Dram | Hbm | CxlAttached | Unified
capacity_bytes: u64,
numa_node: Option<u32>, // Linux NUMA node ID, if applicable
attached_cpus: Vec<u32>, // CPU IDs with local access
attached_gpus: Vec<u32>, // GPU indices with local/coherent access
}
MemoryInterconnect {
domain_a: u32,
domain_b: u32,
link_type: MemoryLinkType, // NumaLink | CxlSwitch | CoherentFabric
bandwidth_gbps: f64,
latency_ns: u64,
}
enum MemoryDomainType { Dram, Hbm, CxlAttached, Unified }
enum MemoryLinkType { NumaLink, CxlSwitch, CoherentFabric }
The node agent populates this structure at startup alongside GpuTopology and reports it with node capabilities and health data.
Interconnect Bandwidth and Latency
| Link Type | Typical Bandwidth | Typical Latency | Notes |
|---|---|---|---|
| Local DRAM access | 50-100 GB/s per channel | ~80 ns | Same-socket, same NUMA node |
| Remote NUMA (UPI/xGMI) | 20-40 GB/s | ~150-300 ns | Cross-socket, 1.5-3x local latency |
| NVLink-C2C (GH200) | 900 GB/s | ~100 ns | CPU-GPU coherent fabric |
| Infinity Fabric (MI300A) | 896 GB/s aggregate | ~100 ns | On-package CPU-GPU interconnect |
| CXL 2.0 (Type 3) | 32-64 GB/s | ~200-400 ns | Memory expander, higher latency |
| PCIe Gen5 (discrete GPU) | 64 GB/s | ~1-2 us | Non-coherent, requires explicit transfer |
Actual bandwidth and latency are discovered at runtime, not hardcoded.
Superchip Architectures
NVIDIA Grace Hopper (GH200)
Grace CPU + Hopper GPU connected via NVLink-C2C (900 GB/s bidirectional). The CPU and GPU share a single coherent address space — no explicit cudaMemcpy required for data movement.
┌────────────────────────────────────────────────────┐
│ GH200 Superchip │
│ │
│ ┌─────────────────┐ NVLink-C2C ┌─────────────┐ │
│ │ Grace CPU │◄──900 GB/s───►│ Hopper GPU │ │
│ │ 72 cores │ coherent │ 80 GB HBM3 │ │
│ │ 512 GB LPDDR5X │ │ │ │
│ └─────────────────┘ └─────────────┘ │
│ │
│ Single coherent address space (CPU + GPU) │
│ → Maps to one Unified MemoryDomain │
└────────────────────────────────────────────────────┘
Mapping to abstraction:
- One
MemoryDomain { type: Unified }spanning CPU LPDDR5X + GPU HBM3 attached_cpus: all Grace cores;attached_gpus: [Hopper GPU index]- One
MemoryInterconnect { type: CoherentFabric, bandwidth: 900 }between CPU and GPU sub-domains
AMD Instinct MI300A
APU with CDNA 3 GPU + Zen 4 CPU on the same package, sharing HBM3 memory pool. No discrete CPU DRAM — all memory is HBM3 accessible by both CPU and GPU.
┌──────────────────────────────────────────────────┐
│ MI300A Package │
│ │
│ ┌─────────────┐ Infinity ┌────────────────┐ │
│ │ Zen 4 CPU │ ◄──Fabric──► │ CDNA 3 GPU │ │
│ │ 24 cores │ 896 GB/s │ 6 XCDs │ │
│ └──────┬──────┘ └───────┬────────┘ │
│ │ │ │
│ └──────┐ ┌───────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ Shared HBM3 Pool │ │
│ │ 128 GB │ │
│ └─────────────────────┘ │
│ │
│ → Maps to one Unified MemoryDomain │
└──────────────────────────────────────────────────┘
Mapping to abstraction:
- One
MemoryDomain { type: Unified }for the shared HBM3 pool attached_cpus: all Zen 4 cores;attached_gpus: [MI300A GPU index]- Internal Infinity Fabric interconnect is not separately modeled (on-package, always present)
Discovery
The node agent discovers memory topology at startup using platform-specific sources:
| Source | What It Provides | Platform |
|---|---|---|
/sys/devices/system/node/ | NUMA node count, CPU-to-node mapping, memory per node | Linux (all) |
numactl --hardware | NUMA distances (latency matrix between nodes) | Linux (all) |
hwloc | Portable topology discovery, cache hierarchy, PCI locality | Linux (all) |
| NVML | GPU-to-NUMA affinity, NVLink-C2C detection (GH200) | NVIDIA GPUs |
| ROCm-SMI | GPU-to-NUMA affinity, MI300A detection | AMD GPUs |
/sys/bus/cxl/ | CXL device enumeration, memory regions, interleave config | CXL-capable systems |
Superchip Detection
GH200 and MI300A superchips are identified by GPU model string during GPU discovery (cross-ref: gpu-topology.md). When detected:
- The node agent queries the coherent memory size via vendor API (NVML for GH200, ROCm-SMI for MI300A)
- NUMA nodes associated with both CPU and GPU are merged into a single
Unifieddomain - The coherent interconnect bandwidth is reported as a
CoherentFabriclink
Discovery Fallback
If vendor APIs are unavailable (e.g., driver not loaded), the node agent falls back to hwloc for topology and reports Dram domains only. GPU memory domains are still reported via the GPU topology path but without coherent interconnect metadata.
Scheduling Impact
Extending f₄ (topology_fitness)
Memory topology extends the intra-node component of f₄ alongside GPU topology:
intra_node_fitness = β · gpu_link_fitness + (1-β) · memory_locality_fitness
memory_locality_fitness(j, selected_nodes) =
average over selected nodes of:
fraction of allocation's CPUs and GPUs in the same memory domain
β = 0.7 for GPU-heavy workloads (GPU interconnect dominates)
β = 0.3 for CPU-heavy workloads with GPU offload (memory locality dominates)
β = 0.5 default
Constraint Hints
Allocations can specify memory topology preferences:
| Constraint | Effect |
|---|---|
prefer_same_numa | Soft: prefer placing all CPUs in a single NUMA domain |
require_unified_memory | Hard: only schedule on nodes with Unified memory domains (GH200, MI300A) |
prefer_local_memory | Soft: prefer NUMA-local memory allocation policy |
allow_cxl_memory | Opt-in: allow scheduling on CXL-expanded memory capacity |
Hard constraints filter nodes before the knapsack solver runs. Soft constraints contribute to memory_locality_fitness.
Intra-Node CPU-GPU Co-location
On discrete NUMA systems (e.g., dual-socket with 4 GPUs per socket), the node agent co-locates an allocation’s CPU cores and GPUs within the same NUMA domain when possible:
For an allocation requesting k CPUs and g GPUs on a multi-NUMA node:
1. Identify NUMA domains that have both free CPUs and GPUs with local affinity
2. Prefer the domain where GPU-to-NIC affinity is best (for inter-node traffic)
3. Assign CPUs and GPUs from the same domain via cgroup/cpuset
4. If the allocation spans domains: prefer domains connected by highest-bandwidth link
Memory Mapping Policies
The node agent configures memory allocation policy at allocation start via numactl (or equivalent). This is transparent to the user unless they specify a preference.
| Policy | numactl Flag | When Used |
|---|---|---|
| Local | --localalloc | Default: allocate on the NUMA node where the thread runs |
| Interleave | --interleave=all | Large shared datasets that all threads access equally |
| Preferred | --preferred=<node> | Pin to a specific NUMA node (for known data locality) |
| Bind | --membind=<nodes> | Strict: only allocate from specified nodes (sensitive isolation) |
On unified memory architectures (GH200, MI300A), NUMA policy has reduced impact since CPU and GPU share the same memory pool. The node agent skips numactl configuration for allocations on unified nodes unless the user explicitly requests a policy.
Allocation-Level Override
Users can specify memory policy in the allocation request:
resources:
cpus: 24
gpus: 1
memory_gb: 128
constraints:
memory_policy: interleave # optional: local | interleave | preferred | bind
require_unified_memory: true # optional: only unified architectures
CXL Memory Tiers
CXL Type 3 memory expanders add a new capacity tier: higher latency than local DRAM but lower cost per GB. The scheduler treats CXL memory as a separate resource dimension.
Capacity Model
Node memory capacity:
local_dram_bytes: 512 GB (fast, NUMA-local)
cxl_memory_bytes: 2 TB (slower, CXL-attached)
total_bytes: 2.5 TB
Allocation can request:
memory_gb: 256 # scheduler satisfies from local DRAM
memory_gb: 1024 # scheduler must use CXL tier (exceeds local DRAM)
memory_gb: 1024
allow_cxl_memory: true # explicit opt-in for CXL tier
Scheduling Rules
- By default, allocations are placed using local DRAM capacity only
- If
allow_cxl_memory: true, CXL capacity is included in available memory - Allocations requesting more memory than local DRAM are only placed on CXL-capable nodes when the constraint is set
- CXL memory appears as a separate
CxlAttacheddomain inMemoryTopology
Cross-References
- gpu-topology.md — GPU interconnect topology, NIC affinity, intra-node GPU selection
- telemetry.md — NUMA locality metrics collection (eBPF), memory utilization
- scheduling-algorithm.md — f₄ topology_fitness, knapsack solver, constraint handling
- node-lifecycle.md — Node agent startup, health reporting, capability discovery
- conformance.md — Hardware configuration fingerprint (includes memory architecture)