Proposal: new numatopology collector for per-NUMA-node capacity and VM assignment

Li Liu

May 6, 2026, 2:30:41 AM
to Prometheus Developers

This proposal is submitted on behalf of IBM and Red Hat. Both organizations operate large-scale KVM/libvirt environments where NUMA-aware VM placement is a routine operational concern, and we have a shared interest in making the necessary observability available through node_exporter rather than maintaining out-of-tree forks or parallel exporters.

Background

In KVM/libvirt environments where VMs are NUMA-pinned for latency-sensitive workloads, operators need to answer questions like:

  • How much CPU and memory capacity does each NUMA node have on this host?
  • Which VMs are pinned to which NUMA nodes, and how much of each node's capacity are they consuming?
  • Is a node oversubscribed, or do we have room to schedule another VM on it?

Today, node_exporter exposes per-NUMA-node memory statistics via meminfo_numa, but there is no built-in way to:

  1. Read per-NUMA-node CPU capacity (/sys/devices/system/node/node*/cpulist).
  2. Correlate VMs to the host NUMA nodes they are pinned to.

We've been running a collector internally that fills this gap and would like to propose contributing it upstream.

Proposed collector: numatopology

Default-disabled (opt-in), Linux-only, and guarded by the !nonumatopology build tag, following the existing no<collector> convention.

Metrics (all gauges; a sample exposition follows the list):

  • node_numatopology_cpu_capacity — labels: node — source: /sys/devices/system/node/node*/cpulist
  • node_numatopology_memory_capacity_bytes — labels: node — source: /sys/devices/system/node/node*/meminfo
  • node_numatopology_cpu_used — labels: node — source: aggregated from libvirt domain XML
  • node_numatopology_memory_used_bytes — labels: node — source: aggregated from libvirt domain XML
  • node_numatopology_vm_cpu — labels: node, vm — source: per-VM, parsed from libvirt domain XML
  • node_numatopology_vm_memory_bytes — labels: node, vm — source: per-VM, parsed from libvirt domain XML
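
For illustration, a two-node host with one pinned VM might expose something like the following (sample values only):

    node_numatopology_cpu_capacity{node="0"} 32
    node_numatopology_cpu_capacity{node="1"} 32
    node_numatopology_memory_capacity_bytes{node="0"} 1.37438953472e+11
    node_numatopology_memory_capacity_bytes{node="1"} 1.37438953472e+11
    node_numatopology_cpu_used{node="0"} 8
    node_numatopology_memory_used_bytes{node="0"} 3.4359738368e+10
    node_numatopology_vm_cpu{node="0",vm="guest01"} 8
    node_numatopology_vm_memory_bytes{node="0",vm="guest01"} 3.4359738368e+10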

Flags (example invocation after the list):

  • --collector.numatopology.libvirt-xml-dir (default /run/libvirt/qemu) — directory containing libvirt domstatus/domain XML files.
  • --collector.numatopology.vm-metrics (default true) — emit per-VM metrics. Operators concerned about cardinality on hosts with many VMs can set --no-collector.numatopology.vm-metrics to keep only the per-node aggregates.
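
For example, assuming the standard node_exporter opt-in flag form (--collector.<name>), enabling the collector with per-VM metrics suppressed would look like:

    node_exporter --collector.numatopology \
        --collector.numatopology.libvirt-xml-dir=/run/libvirt/qemu \
        --no-collector.numatopology.vm-metrics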

Why a new collector and not an extension to meminfo_numa?

meminfo_numa is purely sysfs-driven and is tightly scoped to memory statistics. The proposed collector adds a second data source (libvirt XML) and a different conceptual model (VM ↔ NUMA assignment), which doesn't fit cleanly under meminfo_numa. Keeping it separate also lets operators who only want the memory view continue to use meminfo_numa unchanged.

Cardinality

  • Per-node metrics: bounded by the number of NUMA nodes (typically 1–8 per host). Negligible.
  • Per-VM metrics: scale with (NUMA nodes × pinned VMs per host). In typical KVM hosts this is dozens, not thousands. The collector is defaultDisabled, and the per-VM metrics can be turned off independently via the flag above.
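
As a rough worked example: a host with 4 NUMA nodes and 30 pinned VMs would emit at most 4 × 30 = 120 series per per-VM metric, and usually far fewer, since most VMs are pinned to a single node.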

Implementation notes

  • Source layout: collector/numa_topology_linux.go for the collector, plus a collector/numatopology/ subpackage containing pure parsing functions (CountCPUList, ParseMeminfo, ParseVirshXML) that are unit-testable without sysfs or libvirt; a rough sketch follows this list.
  • Sysfs access uses the same filepath.Glob(sysFilePath(...)) + os.ReadFile pattern as meminfo_numa_linux.go, cpu_linux.go, and edac_linux.go. We considered using prometheus/procfs/sysfs, but it currently has no helper for cpulist or per-node meminfo (only VMStatNUMA); adding such helpers upstream would be a separate effort.
  • Tests use sysfs fixtures (fixtures/sys) plus inline libvirt XML strings, covering both pinned and unpinned VMs, and the vm-metrics=false path.
  • Libvirt XML parsing is read-only and handles both <domstatus> (running domains under /run/libvirt/qemu) and bare <domain> formats. VMs without explicit <numatune><memnode> pinning are skipped.
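
To make the shape of the subpackage concrete, here is a rough sketch of the cpulist-counting helper and the libvirt XML handling. Names and signatures are illustrative only; parseDomainXML in particular is a hypothetical helper name, not part of the API listed above.

    // Sketch only: illustrative versions of helpers from the proposed
    // collector/numatopology subpackage.
    package numatopology

    import (
        "encoding/xml"
        "strconv"
        "strings"
    )

    // CountCPUList counts the CPUs in a sysfs cpulist string such as "0-3,8-11,16".
    func CountCPUList(list string) (int, error) {
        list = strings.TrimSpace(list)
        if list == "" {
            return 0, nil
        }
        total := 0
        for _, part := range strings.Split(list, ",") {
            if lo, hi, isRange := strings.Cut(part, "-"); isRange {
                start, err := strconv.Atoi(lo)
                if err != nil {
                    return 0, err
                }
                end, err := strconv.Atoi(hi)
                if err != nil {
                    return 0, err
                }
                total += end - start + 1
            } else if _, err := strconv.Atoi(part); err != nil {
                return 0, err
            } else {
                total++
            }
        }
        return total, nil
    }

    // Minimal subset of the libvirt domain XML the collector cares about:
    // the <numatune><memnode> elements that pin guest cells to host NUMA nodes.
    type domain struct {
        Name     string `xml:"name"`
        MemNodes []struct {
            CellID  int    `xml:"cellid,attr"`
            Nodeset string `xml:"nodeset,attr"`
        } `xml:"numatune>memnode"`
    }

    // parseDomainXML accepts both a bare <domain> document and the
    // <domstatus> wrapper used for running domains under /run/libvirt/qemu.
    func parseDomainXML(data []byte) (*domain, error) {
        var wrapped struct {
            Domain domain `xml:"domain"`
        }
        if err := xml.Unmarshal(data, &wrapped); err == nil && wrapped.Domain.Name != "" {
            return &wrapped.Domain, nil
        }
        var d domain
        if err := xml.Unmarshal(data, &d); err != nil {
            return nil, err
        }
        return &d, nil
    }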

Open questions for the community

  1. Does the maintainer team see this as in-scope for node_exporter, or would it be more appropriate as a separate exporter (e.g., a libvirt-exporter)? Our reasoning for node_exporter: the per-node CPU/memory capacity portion is purely sysfs and clearly belongs here; the per-VM portion is the logical join with the host's NUMA topology, which is awkward to compute without being co-located on the host.
  2. Naming: numatopology vs. e.g. numa_topology or splitting into numa_capacity + numa_vm_assignment? We picked the single-collector form because the data sources need to be read together to compute cpu_used / memory_used_bytes.
  3. Is the libvirt-XML coupling acceptable, or would the team prefer a pluggable interface (e.g., reading from virsh command output or the libvirt API)? File-based parsing is the lightest dependency — no libvirt client libraries, no socket access — but it does tie the collector to the on-disk format.

Happy to open a draft PR for review if the direction is welcome. We have the implementation, fixtures, and tests ready to go.


Thank you

Regards

Li Liu
