
Finding CPU utilization of NUMA nodes


Paavo Helde

Feb 8, 2017, 3:44:22 AM

Sorry for off-topic post, but comp.programming.threads seems to be long
dead.

In short, I need to find out current CPU utilization of different NUMA
nodes in the computer. Could you suggest a portable (at least
Windows+Linux) C or C++ library for doing this? My google-fu is somehow
failing me today...

TIA
Paavo

Scott Lurndal

Feb 8, 2017, 8:22:38 AM
Not particularly portable, but:

On Linux, you can get the cpu utilization by reading /proc/stat.
You can determine the numa configuration using libnuma.

man 3 numa
man 5 proc

$ cat /proc/stat
cpu 468313627 3965537 187019877 9970190111 2918517 161185 4280251 0 4637258
cpu0 227884368 566411 17349912 1087913613 258729 1448 619035 0 1120146
cpu1 53250242 485882 21573270 1265011780 133365 99 341893 0 3024854
cpu2 41850867 643670 30634919 1254931613 188815 262 725175 0 224627
cpu3 28215097 729617 22785914 1259180937 2203630 601 598373 0 115891
cpu4 34656271 407981 28902824 1268742744 26178 395 590566 0 59265
cpu5 23679534 331028 20223783 1273092087 43387 431 494331 0 25066
cpu6 34225622 389284 25292245 1278171824 19862 157438 514237 0 43466
cpu7 24551623 411659 20257006 1283145510 44548 508 396638 0 23941
intr 19978667854 44748 0 0 2 2 0 0 0 1 0 0 0 0 0 0 0 174 0 0 0 0 0 0 1745712340 0 0 0 0 17853204 11517450 67817 488310827 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 1633898650406
btime 1473175794
processes 8949363
procs_running 5
procs_blocked 0
softirq 34362581836 0 3495208294 7685981 493519202 24340829 0 4019633 2692771779 2676700522 3493499116
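
Fwiw, a minimal Linux-only sketch of combining the two (illustrative only, not production code; it assumes libnuma and C++17, compile with g++ -std=c++17 numa_util.cpp -lnuma). It samples /proc/stat twice and folds the per-cpu deltas into per-node utilization with numa_node_of_cpu():

// Illustrative sketch: per-NUMA-node CPU utilization from two /proc/stat
// samples, mapping each cpuN line to its node with numa_node_of_cpu().
#include <numa.h>
#include <cctype>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <thread>

struct CpuTimes { unsigned long long busy = 0, total = 0; };

// Parse the "cpuN ..." lines of /proc/stat into per-cpu busy/total jiffies.
static std::map<int, CpuTimes> read_proc_stat()
{
    std::map<int, CpuTimes> result;
    std::ifstream stat("/proc/stat");
    std::string line;
    while (std::getline(stat, line)) {
        if (line.compare(0, 3, "cpu") != 0 ||
            !std::isdigit(static_cast<unsigned char>(line[3])))
            continue;                       // skip the aggregate "cpu" line, intr, ...
        std::istringstream is(line.substr(3));
        int cpu = 0;
        is >> cpu;
        unsigned long long v = 0, idle = 0, total = 0;
        for (int field = 0; is >> v; ++field) {
            total += v;
            if (field == 3 || field == 4)   // idle + iowait fields
                idle += v;
        }
        result[cpu] = { total - idle, total };
    }
    return result;
}

int main()
{
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    auto before = read_proc_stat();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    auto after = read_proc_stat();

    std::map<int, CpuTimes> per_node;       // node -> summed per-cpu deltas
    for (auto& [cpu, now] : after) {
        int node = numa_node_of_cpu(cpu);
        per_node[node].busy  += now.busy  - before[cpu].busy;
        per_node[node].total += now.total - before[cpu].total;
    }
    for (auto& [node, t] : per_node)
        std::printf("node %d: %.1f%% busy\n", node,
                    t.total ? 100.0 * t.busy / t.total : 0.0);
}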

Paavo Helde

Feb 8, 2017, 4:00:12 PM
On 8.02.2017 15:22, Scott Lurndal wrote:
> Paavo Helde <myfir...@osa.pri.ee> writes:
>>
>> Sorry for off-topic post, but comp.programming.threads seems to be long
>> dead.
>>
>> In short, I need to find out current CPU utilization of different NUMA
>> nodes in the computer. Could you suggest a portable (at least
>> Windows+Linux) C or C++ library for doing this? My google-fu is somehow
>> failing me today...
>
> Not particularly portable, but:
>
> On Linux, you can get the cpu utilization by reading /proc/stat.
> You can determine the numa configuration using libnuma.
>
> man 3 numa
> man 5 proc
>
> $ cat /proc/stat
> cpu 468313627 3965537 187019877 9970190111 2918517 161185 4280251 0 4637258
> cpu0 227884368 566411 17349912 1087913613 258729 1448 619035 0 1120146
> cpu1 53250242 485882 21573270 1265011780 133365 99 341893 0 3024854
> cpu2 41850867 643670 30634919 1254931613 188815 262 725175 0 224627
> cpu3 28215097 729617 22785914 1259180937 2203630 601 598373 0 115891

So for finding out the current load I need to read /proc/stat, wait a
bit, and read it again? Is there nothing like /proc/loadavg per CPU?

Actually, I am not sure any more whether there is any point in trying to
measure the CPU load. I was thinking of using it for binding the
instances of my program to different NUMA nodes, but measuring the
current performance does not help in any way if N instances of the
program are launched at the same time.

Thanks
Paavo

Ian Collins

Feb 8, 2017, 11:52:14 PM
On 02/ 9/17 09:59 AM, Paavo Helde wrote:
> On 8.02.2017 15:22, Scott Lurndal wrote:
>> Paavo Helde <myfir...@osa.pri.ee> writes:
>>>
>>> Sorry for off-topic post, but comp.programming.threads seems to be long
>>> dead.
>>>
>>> In short, I need to find out current CPU utilization of different NUMA
>>> nodes in the computer. Could you suggest a portable (at least
>>> Windows+Linux) C or C++ library for doing this? My google-fu is somehow
>>> failing me today...
>>
>> Not particularly portable, but:
>>
>> On Linux, you can get the cpu utilization by reading /proc/stat.
>> You can determine the numa configuration using libnuma.
>>
>> man 3 numa
>> man 5 proc
>>
>> $ cat /proc/stat
>> cpu 468313627 3965537 187019877 9970190111 2918517 161185 4280251 0 4637258
>> cpu0 227884368 566411 17349912 1087913613 258729 1448 619035 0 1120146
>> cpu1 53250242 485882 21573270 1265011780 133365 99 341893 0 3024854
>> cpu2 41850867 643670 30634919 1254931613 188815 262 725175 0 224627
>> cpu3 28215097 729617 22785914 1259180937 2203630 601 598373 0 115891
>
> So for finding out the current load I need to read /proc/stat, wait a
> bit, and read it again? There is nothing like /proc/loadavg per CPU?

I have some code for gathering and collating stats from /proc if you
are interested.

--
Ian

Paavo Helde

Feb 9, 2017, 3:53:55 AM
I am not so sure any more whether I actually need these stats. But if
it is not too much trouble, please do post the code or download
instructions here or to my e-mail.

Thanks,
Paavo

Chris M. Thomasson

Feb 9, 2017, 4:08:53 PM
I guess you can give this a try:

https://software.intel.com/en-us/intel-vtune-amplifier-xe

Also, I guess you could pin some threads to the NUMA nodes; that can
give some crude timing data. How are you arranging your overall
synchronization scheme? Are you using affinity masks at all, or letting
the OS handle things from that perspective?

Chris M. Thomasson

Feb 9, 2017, 4:17:31 PM
On 2/8/2017 12:44 AM, Paavo Helde wrote:
>
Fwiw, a somewhat related thread:

https://groups.google.com/d/topic/comp.programming.threads/XU6BtGNSkF0/discussion

(read all if interested, it involves using timing for hazard ptr impl)

This was back when comp.programming.threads was really great!

Paavo Helde

Feb 9, 2017, 4:43:56 PM
On 9.02.2017 23:08, Chris M. Thomasson wrote:
> On 2/8/2017 12:44 AM, Paavo Helde wrote:
>>
>> Sorry for off-topic post, but comp.programming.threads seems to be long
>> dead.
>>
>> In short, I need to find out current CPU utilization of different NUMA
>> nodes in the computer. Could you suggest a portable (at least
>> Windows+Linux) C or C++ library for doing this? My google-fu is somehow
>> failing me today...
>
> I guess you can give this a try:
>
> https://software.intel.com/en-us/intel-vtune-amplifier-xe

AFAIK VTune is an application, not a library. When I said "I need to
find out" I actually meant "my library code needs to find out, when
running as part of some unknown application on some unknown hardware".

> Also guess you can perhaps pin some threads to the NUMA nodes that can
> give some crude timing data. How are you arranging your overall
> synchronization scheme? Are you using affinity masks at all, or letting
> the OS handling things from that perspective?

Yes, I am attempting to bind a parallel thread pool in my program to a
single NUMA node (because profiling shows there is no point in letting
it spread across nodes; it would just be wasting computer resources).
What I need is to figure out which NUMA node to bind to.
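
For the binding step itself, on Linux a minimal libnuma sketch might look like the one below (illustrative only; the function name is made up, and the Windows side would need its own affinity calls):

// Illustrative sketch: pin the calling thread (and its preferred memory
// allocations) to one NUMA node with libnuma. Link with -lnuma.
#include <numa.h>

bool bind_current_thread_to_node(int node)
{
    if (numa_available() < 0 || node < 0 || node > numa_max_node())
        return false;
    if (numa_run_on_node(node) != 0)  // restrict scheduling to this node's CPUs
        return false;
    numa_set_preferred(node);         // prefer this node for new allocations
    return true;
}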

Cheers
Paavo


Chris M. Thomasson

Feb 11, 2017, 12:18:39 AM
Indeed. Sharing across NUMA nodes should be avoided like the damn plague!


> What I need is to figure out which NUMA node to bind to.

That can be a rather nasty problem; what is your criterion for
choosing? Think one step further and bind worker threads from the pool
to the local processors on the NUMA node. Or should the OS handle
that set of details?

Okay, iirc this contrived kind of thing helped my code figure out a
mapping for a system. This is from a while ago (a decade+) and the idea
went something like this (a rough sketch follows at the end of this post):

take the number of NUMA nodes, and create a thread pool for each one.
Then make all of the pools just lock their individual mutexes, increment
a local NUMA counter, and unlock, with no mutex contention between the
nodes. They do this in a loop until a signal is hit, from a single
controlling thread, that tells them to stop. The NUMA counters that end
up with the highest numbers indicate the nodes that are more active and
better "performing" in the system. But keep in mind this data can be
totally arbitrary, or can go out of date rather quickly, depending on
load in the system. I have no idea how this system is going to affect
the counters. Quite honestly, and for some reason, your question kind
of reminds me of:

https://groups.google.com/d/topic/comp.lang.c++/7u_rLgQe86k/discussion

;^)
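
Here is a rough modern-C++ rendering of that probing idea (illustrative only, not the original decade-old code; Linux + libnuma assumed, compile with g++ -std=c++17 probe.cpp -lnuma -pthread):

// One probe thread per NUMA node, bound to its node, bumping a private
// counter under its own (uncontended) mutex until the controlling thread
// signals stop; the fastest-climbing counters mark the least loaded nodes.
#include <numa.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
    if (numa_available() < 0) return 1;
    const int nodes = numa_max_node() + 1;

    std::atomic<bool> stop{false};
    std::vector<unsigned long long> counters(nodes, 0);
    std::vector<std::mutex> locks(nodes);
    std::vector<std::thread> probes;

    for (int n = 0; n < nodes; ++n) {
        probes.emplace_back([&, n] {
            numa_run_on_node(n);            // keep this probe on node n
            while (!stop.load(std::memory_order_relaxed)) {
                std::lock_guard<std::mutex> g(locks[n]);  // node-local lock
                ++counters[n];
            }
        });
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    stop = true;                            // single controlling thread says stop
    for (auto& t : probes) t.join();

    for (int n = 0; n < nodes; ++n)         // higher count == more spare cycles
        std::printf("node %d: %llu increments\n", n, counters[n]);
}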

Paavo Helde

Feb 11, 2017, 12:55:18 PM
The optimal criterion is that no NUMA node should sit idle while some
other NUMA node is overloaded. But it looks like this is not so easy to
achieve in general, as the jobs I am binding can be in different
processes and their number and duration are not known in advance.

> Think one step further and bind worker threads from the pool
> to the local processors on the NUMA node. Well, should the OS handle
> that set of details?

I believe the OS would handle thread scheduling within a single NUMA
node better than I could. Actually, I have tried binding threads to
single PUs on non-NUMA systems and the performance stayed the same or
got worse, depending on the OS/hardware.


Chris M. Thomasson

Feb 11, 2017, 11:18:58 PM
Hummmm.... For now, think of binding a really low priority thread per
NUMA node. When it wakes, it increments a counter and checks it against
a maximum. Okay, when this is hit, the node can say it's in a LOW-load
condition or something, reset the counter and repeat. Now, this is
highly contrived, but iirc, it worked fairly well as entropy for some
heuristics during the runtime of the system. The low priority thread
can decide to steal the shi% out of higher chunks of work when it hits
the threshold in a work-stealing/requesting hybrid sync setup. I will
try to find some more info; it is escaping me right now. Sorry Paavo. ;^o
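
A rough sketch of such a probe (illustrative only; Linux-specific, since SCHED_IDLE is a Linux scheduling policy, and the per-node bookkeeping is omitted; build as C++ with g++, which defines _GNU_SOURCE for SCHED_IDLE, and link -lnuma -pthread):

// A SCHED_IDLE probe thread per NUMA node: it runs only when nothing else
// on its node wants the CPUs, so the rate at which idle_ticks climbs is a
// crude "spare capacity" signal for that node.
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <atomic>

std::atomic<bool> stop{false};
std::atomic<unsigned long long> idle_ticks{0};  // one per node in real code

void* idle_probe(void* arg)
{
    int node = *static_cast<int*>(arg);
    numa_run_on_node(node);                     // stay on this node's CPUs

    sched_param p{};                            // SCHED_IDLE ignores priority
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &p);

    while (!stop.load(std::memory_order_relaxed))
        ++idle_ticks;                           // climbs fast only when the node is idle
    return nullptr;
}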



>
>> Think one step further and bind worker threads from the pool
>> to the local processors on the NUMA node. Well, should the OS handle
>> that set of details?
>
> I believe OS would handle thread scheduling in a single NUMA node better
> than me. Actually I have tried to bind threads to single PU-s on
> non-NUMA systems and the performance stayed the same or went worse,
> depending on the OS/hardware.

Agreed. Sometimes, very rarely, it can be good to separate GC threads,
or proxy gc counting threads. Will try to remember. This was back a damn
decade ago. Shi%, I am getting old.

;^o Yikes!

Robert Wessel

Feb 12, 2017, 3:48:27 AM
On Fri, 10 Feb 2017 21:18:30 -0800, "Chris M. Thomasson"
<inv...@invalid.invalid> wrote:

>On 2/9/2017 1:43 PM, Paavo Helde wrote:
>> On 9.02.2017 23:08, Chris M. Thomasson wrote:
>>> On 2/8/2017 12:44 AM, Paavo Helde wrote:
>>>>
>>>> Sorry for off-topic post, but comp.programming.threads seems to be long
>>>> dead.
>>>>
>>>> In short, I need to find out current CPU utilization of different NUMA
>>>> nodes in the computer. Could you suggest a portable (at least
>>>> Windows+Linux) C or C++ library for doing this? My google-fu is somehow
>>>> failing me today...
>>>
>>> I guess you can give this a try:
>>>
>>> https://software.intel.com/en-us/intel-vtune-amplifier-xe
>>
>> AFAIK vtune is an application, not a library. When I said "I need to
>> find out" I actually meant "my library code needs to find out, when
>> running as a part of a some unknown application on some unknown hardware.
>>
>>> Also guess you can perhaps pin some threads to the NUMA nodes that can
>>> give some crude timing data. How are you arranging your overall
>>> synchronization scheme? Are you using affinity masks at all, or letting
>>> the OS handling things from that perspective?
>>
>> Yes I am attempting to bind a parallel thread pool in my program to a
>> single NUMA node (because profiling shows there is no point to let it
>> spread across nodes, it would be just wasting computer resources).
>
>Indeed. Sharing across NUMA nodes should be avoided like the damn plague!


FSVO "plague".

The whole point of NUMA is to allow some level of intra-node sharing
with significantly lower latency than across a cluster. If you don't
have/need that, then why are you running on an expensive NUMA box
rather than on a cheap cluster?

Which is not to detract from your point, which is that data sharing
between nodes has a very significant penalty compared to items held
(exclusively) local to a node.

Chris M. Thomasson

Feb 12, 2017, 9:09:13 PM
Well, the phrasing was too harsh, but that kind of sharing should still
be minimized.


> The whole point of NUMA is to allow some level of intra-node sharing
> with significantly lower latency than across a cluster.

Indeed. I worry about the details of the kinds of communication that
make NUMA nodes share. Think of a node going down! Or the process that
holds the thread or fiber initiating the sync dying outright? Iirc, a
mixture of transactional memory, clever lock-free fault-tolerant
algorithms, and/or downright robust mutexes can get the job done.
Also, one can try to emulate this with a so-called "watchdog" process
that can notify others when an active entity has died in the middle of
something critical.



> If you don't
> have/need that, then why are you running on an expensive NUMA box
> rather than on a cheap cluster?

What about multiple NUMA computers in a cluster?



> Which is not to detract from your point, which is that data sharing
> between nodes has a very significant penalty compared to items held
> (exclusively) local to a node.

I should say: try to avoid it; however, sometimes that is easier said
than done.

;^o
