Proposal: new numatopology collector for per-NUMA-node capacity and VM assignment

Li Liu

May 6, 2026, 2:30:41 AM
to Prometheus Developers

This proposal is submitted on behalf of IBM and Red Hat. Both organizations operate large-scale KVM/libvirt environments where NUMA-aware VM placement is a routine operational concern, and we have a shared interest in making the necessary observability available through node_exporter rather than maintaining out-of-tree forks or parallel exporters.

Background

In KVM/libvirt environments where VMs are NUMA-pinned for latency-sensitive workloads, operators need to answer questions like:

  • How much CPU and memory capacity does each NUMA node have on this host?
  • Which VMs are pinned to which NUMA nodes, and how much of each node's capacity are they consuming?
  • Is a node oversubscribed, or do we have room to schedule another VM on it?

Today, node_exporter exposes per-NUMA-node memory statistics via meminfo_numa, but there is no built-in way to:

  1. Read per-NUMA-node CPU capacity (/sys/devices/system/node/node*/cpulist).
  2. Correlate VMs to the host NUMA nodes they are pinned to.

We've been running a collector internally that fills this gap and would like to propose contributing it upstream.

Proposed collector: numatopology

Default-disabled (opt-in), Linux-only, and guarded by the !nonumatopology build tag, following the existing no<collector> convention.

Metrics (all gauges; a sample exposition follows the list):

  • node_numatopology_cpu_capacity — labels: node — source: /sys/devices/system/node/node*/cpulist
  • node_numatopology_memory_capacity_bytes — labels: node — source: /sys/devices/system/node/node*/meminfo
  • node_numatopology_cpu_used — labels: node — source: aggregated from libvirt domain XML
  • node_numatopology_memory_used_bytes — labels: node — source: aggregated from libvirt domain XML
  • node_numatopology_vm_cpu — labels: node, vm — source: per-VM, parsed from libvirt domain XML
  • node_numatopology_vm_memory_bytes — labels: node, vm — source: per-VM, parsed from libvirt domain XML
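
For illustration, a two-node host with one pinned VM might expose something like the following (sample values only):

    node_numatopology_cpu_capacity{node="0"} 32
    node_numatopology_cpu_capacity{node="1"} 32
    node_numatopology_memory_capacity_bytes{node="0"} 1.37438953472e+11
    node_numatopology_memory_capacity_bytes{node="1"} 1.37438953472e+11
    node_numatopology_cpu_used{node="0"} 8
    node_numatopology_memory_used_bytes{node="0"} 3.4359738368e+10
    node_numatopology_vm_cpu{node="0",vm="guest01"} 8
    node_numatopology_vm_memory_bytes{node="0",vm="guest01"} 3.4359738368e+10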

Flags (example invocation after the list):

  • --collector.numatopology.libvirt-xml-dir (default /run/libvirt/qemu) — directory containing libvirt domstatus/domain XML files.
  • --collector.numatopology.vm-metrics (default true) — emit per-VM metrics. Operators concerned about cardinality on hosts with many VMs can set --no-collector.numatopology.vm-metrics to keep only the per-node aggregates.
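
For example, assuming the standard node_exporter opt-in flag form (--collector.<name>), enabling the collector with per-VM metrics suppressed would look like:

    node_exporter --collector.numatopology \
        --collector.numatopology.libvirt-xml-dir=/run/libvirt/qemu \
        --no-collector.numatopology.vm-metrics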

Why a new collector and not an extension to meminfo_numa?

meminfo_numa is purely sysfs-driven and is tightly scoped to memory statistics. The proposed collector adds a second data source (libvirt XML) and a different conceptual model (VM ↔ NUMA assignment), which doesn't fit cleanly under meminfo_numa. Keeping it separate also lets operators who only want the memory view continue to use meminfo_numa unchanged.

Cardinality

  • Per-node metrics: bounded by the number of NUMA nodes (typically 1–8 per host). Negligible.
  • Per-VM metrics: scale with (NUMA nodes × pinned VMs per host). In typical KVM hosts this is dozens, not thousands. The collector is defaultDisabled, and the per-VM metrics can be turned off independently via the flag above.
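
As a rough worked example: a host with 4 NUMA nodes and 30 pinned VMs would emit at most 4 × 30 = 120 series per per-VM metric, and usually far fewer, since most VMs are pinned to a single node.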

Implementation notes

  • Source layout: collector/numa_topology_linux.go for the collector, plus a collector/numatopology/ subpackage containing pure parsing functions (CountCPUList, ParseMeminfo, ParseVirshXML) that are unit-testable without sysfs or libvirt; a rough sketch follows this list.
  • Sysfs access uses the same filepath.Glob(sysFilePath(...)) + os.ReadFile pattern as meminfo_numa_linux.go, cpu_linux.go, and edac_linux.go. We considered using prometheus/procfs/sysfs, but it currently has no helper for cpulist or per-node meminfo (only VMStatNUMA); adding such helpers upstream would be a separate effort.
  • Tests use sysfs fixtures (fixtures/sys) plus inline libvirt XML strings, covering both pinned and unpinned VMs, and the vm-metrics=false path.
  • Libvirt XML parsing is read-only and handles both <domstatus> (running domains under /run/libvirt/qemu) and bare <domain> formats. VMs without explicit <numatune><memnode> pinning are skipped.
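
To make the shape of the subpackage concrete, here is a rough sketch of the cpulist-counting helper and the libvirt XML handling. Names and signatures are illustrative only; parseDomainXML in particular is a hypothetical helper name, not part of the API listed above.

    // Sketch only: illustrative versions of helpers from the proposed
    // collector/numatopology subpackage.
    package numatopology

    import (
        "encoding/xml"
        "strconv"
        "strings"
    )

    // CountCPUList counts the CPUs in a sysfs cpulist string such as "0-3,8-11,16".
    func CountCPUList(list string) (int, error) {
        list = strings.TrimSpace(list)
        if list == "" {
            return 0, nil
        }
        total := 0
        for _, part := range strings.Split(list, ",") {
            if lo, hi, isRange := strings.Cut(part, "-"); isRange {
                start, err := strconv.Atoi(lo)
                if err != nil {
                    return 0, err
                }
                end, err := strconv.Atoi(hi)
                if err != nil {
                    return 0, err
                }
                total += end - start + 1
            } else if _, err := strconv.Atoi(part); err != nil {
                return 0, err
            } else {
                total++
            }
        }
        return total, nil
    }

    // Minimal subset of the libvirt domain XML the collector cares about:
    // the <numatune><memnode> elements that pin guest cells to host NUMA nodes.
    type domain struct {
        Name     string `xml:"name"`
        MemNodes []struct {
            CellID  int    `xml:"cellid,attr"`
            Nodeset string `xml:"nodeset,attr"`
        } `xml:"numatune>memnode"`
    }

    // parseDomainXML accepts both a bare <domain> document and the
    // <domstatus> wrapper used for running domains under /run/libvirt/qemu.
    func parseDomainXML(data []byte) (*domain, error) {
        var wrapped struct {
            Domain domain `xml:"domain"`
        }
        if err := xml.Unmarshal(data, &wrapped); err == nil && wrapped.Domain.Name != "" {
            return &wrapped.Domain, nil
        }
        var d domain
        if err := xml.Unmarshal(data, &d); err != nil {
            return nil, err
        }
        return &d, nil
    }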

Open questions for the community

  1. Does the maintainer team see this as in-scope for node_exporter, or would it be more appropriate as a separate exporter (e.g., a libvirt-exporter)? Our reasoning for node_exporter: the per-node CPU/memory capacity portion is purely sysfs and clearly belongs here; the per-VM portion is the logical join with the host's NUMA topology, which is awkward to compute without being co-located on the host.
  2. Naming: numatopology vs. e.g. numa_topology or splitting into numa_capacity + numa_vm_assignment? We picked the single-collector form because the data sources need to be read together to compute cpu_used / memory_used_bytes.
  3. Is the libvirt-XML coupling acceptable, or would the team prefer a pluggable interface (e.g., reading from virsh command output or the libvirt API)? File-based parsing is the lightest dependency — no libvirt client libraries, no socket access — but it does tie the collector to the on-disk format.

Happy to open a draft PR for review if the direction is welcome. We have the implementation, fixtures, and tests ready to go.


Thank you

Regards

Li Liu
