Just brainstorming here, fwiw:
Since C++ already has std::thread::hardware_concurrency, well, what about
creating something like:
std::thread::hardware_concurrency_tree
It would return a tree structure that describes, or encodes, the way the
system actually arranges the CPUs counted by
std::thread::hardware_concurrency. You basically get a root node that
holds all of the NUMA nodes in the "system".
So, crudely and quickly, for instance: if there are 2 NUMA nodes, it
gives you a list NL with two items, each of which is a NUMA node. Each
NUMA node in NL holds its CPU packages; each package has multiple nodes
representing its cores; each core holds its "hyper-threading" logical
CPUs; and so on, down to per-thread leaves...
Afaict, this is zooming down from the high level of 2 NUMA nodes into
the per-thread leaves. The "fractal" zoom ends in a tree whose
terminating leaves are thread nodes. I guess we could go one level deeper
and add n fibers to each per-thread leaf, transforming the thread leaf
into a fiber parent, in a sense. Humm... The overall layout from NUMA down
to thread and/or fiber has always sounded sort of "fractal in nature" to me.
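For concreteness, here is a minimal sketch of the kind of node such a tree
could hand back. All of these names (topology_level, topology_node, and the
hardware_concurrency_tree call itself) are made up for illustration; nothing
like this exists in the standard today:

    #include <cstddef>
    #include <vector>

    // Which rung of the hierarchy a node sits on.
    enum class topology_level { system, numa_node, package, core, thread };

    struct topology_node {
        topology_level             level;
        std::size_t                id;        // e.g. NUMA node 0, core 3, logical CPU 7
        std::vector<topology_node> children;  // empty for the per-thread leaves
    };

    // Hypothetical usage: the root is the "system", its children are the
    // NUMA nodes, and the number of leaves equals hardware_concurrency():
    //
    //   topology_node root = std::thread::hardware_concurrency_tree();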
My crude rationale for a possible proposal would be:
We have std::thread::hardware_concurrency. This is good information;
however, it's not very useful wrt mapping out the system. An additional,
optional function just might be in order:
std::thread::hardware_concurrency_tree, or something like it.
That gives deeper information on how the result of
std::thread::hardware_concurrency is arranged.
Also, what about:
std::thread::hardware_memory_tree
that gives the various cache line sizes (the numbers you need for padding)
and how the caches are arranged within the NUMA tree.
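Again purely hypothetical, but it could mirror the concurrency tree above,
something like this (cache_info, memory_node, and hardware_memory_tree are
all invented names):

    #include <cstddef>
    #include <vector>

    struct cache_info {
        std::size_t level;       // 1 = L1, 2 = L2, ...
        std::size_t line_size;   // bytes per line, i.e. what you pad/align to
        std::size_t total_size;  // bytes of cache shared at this node
    };

    struct memory_node {
        std::vector<cache_info>  caches;    // caches shared by everything below
        std::vector<memory_node> children;  // mirrors the concurrency tree shape
    };

    // Hypothetical query:
    //
    //   memory_node mroot = std::thread::hardware_memory_tree();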
Combine this with optional thread affinity masking and standard atomics,
and a program in C and/or C++ can build highly scalable, low-level
systems.
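Standard C++ has no affinity API today, so, just as a sketch of the
"affinity masking" half using what exists right now, here is the
Linux/POSIX extension (not anything standard or portable):

    #include <pthread.h>   // pthread_setaffinity_np (Linux/glibc extension)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);  // pin the calling thread to logical CPU 0

        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }
        // ... per-CPU work, data kept on the local NUMA node, etc ...
        return 0;
    }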
Imvvvho, something like:
std::thread::hardware_concurrency_tree
is probably "low-level enough" for C or C++ wrt aiding a programmer who
wants to use threading and atomics effectively, while staying within the
standard to get the cache line size needed to properly pad and align
their structures in memory: we do not want any false sharing, damn it!
;^)
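Fwiw, C++17 already hands us the padding half of that. A sketch along
these lines; the 64-byte fallback is just my assumption for common
x86-64 parts:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <new>       // std::hardware_destructive_interference_size (C++17)

    #ifdef __cpp_lib_hardware_interference_size
    inline constexpr std::size_t cache_line =
        std::hardware_destructive_interference_size;
    #else
    inline constexpr std::size_t cache_line = 64;  // assumed fallback, common x86-64
    #endif

    // Each counter gets its own cache line, so threads bumping neighboring
    // slots never false-share.
    struct alignas(cache_line) padded_counter {
        std::atomic<std::uint64_t> value{0};
    };

    padded_counter per_thread_counters[8];  // e.g. one slot per worker thread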