hwloc support in cp2k-trunk


Axel

Jan 20, 2012, 4:48:52 PM
to cp...@googlegroups.com
hi everybody,

can somebody give me some pointers to debug
the hwloc/libnuma support in the current cp2k trunk?

it appears that libnuma support is not (yet) functioning,
even after i fixed a trivial bug in the C language wrapper.

[install@master makefiles]$ svn diff machine 
Index: machine/ma_linux.c
===================================================================
--- machine/ma_linux.c       (revision 12108)
+++ machine/ma_linux.c       (working copy)
@@ -51,7 +51,7 @@
 
   //libnuma has no support for I/O devices
   topo->nnetcards = 0;
-  local->nnetcards = 0;
+  local_topo->nnetcards = 0;
 
   topo->nsockets = linux_get_nsockets();
   local_topo->nsockets = topo->nsockets;


it also seems that the hwloc support requires a somewhat recent version
of the hwloc package. RHEL-6.x ships with 1.1, which
is missing some defines, but using a self-compiled hwloc-1.3.1
resulted in a working executable.

however, when playing with the options i see some
inconsistencies, especially when looking at the MPI task
and thread placement lists for a cp2k.psmp binary.

i currently use a shell script based on numactl to assign
MPI tasks to processors/memory and to place threads,
and the performance data confirms it to be correct. however,
the corresponding output from the MACHINE_ARCH flags
is inconsistent with that: all MPI tasks seem to be
located on the same physical node (which is not true)
and the thread assignments are "crazy" as well.

i would very much appreciate it if somebody could
tell me how much of this is still work in progress,
on what platforms this has been tested, and who
would be the person to send patches, ideas
for modifications, or debug info to.

thanks in advance,
     axel.

Christiane Pousa

Jan 23, 2012, 2:59:49 AM
to cp...@googlegroups.com
Hi Axel,

I'm Christiane, the one responsible for the hwloc/libnuma support in cp2k.

Concerning libnuma, the affinity support is much simpler than with hwloc: it covers only thread/process affinity. I'll check the wrapper to see why it is not working and let you know.

About hwloc, it is true that it requires the latest version, because of the PCI support for network cards and GPUs. By default this module only attaches processes and their memory to NUMA nodes. Their threads are not pinned to any cores, so they can move within a NUMA node. There are other strategies for placing MPI tasks/threads that can be selected by setting the machine_arch keys.
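
A minimal sketch of the default behaviour described above, assuming a Linux system that exposes its NUMA nodes under /sys/devices/system/node. It is not CP2K's hwloc-based code, and the function names are made up for illustration. It restricts a process to the cores of one NUMA node without pinning individual threads. It does not bind memory, which would need libnuma or hwloc.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids.

    Only plain ids and simple 'lo-hi' ranges are handled in this sketch.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_numa_node(node):
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        return parse_cpulist(handle.read())


def bind_process_to_node(node):
    """Restrict the caller to the CPUs of `node` and return that CPU set.

    On Linux the mask applies to the calling thread; threads created
    afterwards inherit it, so they stay on the node but may still migrate
    between its cores.
    """
    cpus = cpus_of_numa_node(node)
    os.sched_setaffinity(0, cpus)
    return cpus
```

Calling `bind_process_to_node(0)` early at startup, before any threads exist, would keep the process and the threads it later creates on node 0.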

Could you send me the input and the machine_arch keys that you used for these tests? I've tested the hwloc support on local Intel/AMD machines (with and without GPUs) and on Cray machines, all of them with NUMA characteristics, and I have seen no errors like that.

When you use numactl, how do you determine the cores for threads and MPI tasks? Do you assign processes to NUMA nodes, so that their threads are consequently attached to the same set of cores as their parent?

So, if you have any suggestions or comments, we can discuss them and also solve the problems that you have found.

--
[]'s
Christiane Pousa Ribeiro
 

Axel

Jan 24, 2012, 12:25:26 PM
to cp...@googlegroups.com


On Monday, January 23, 2012 2:59:49 AM UTC-5, Christiane Pousa Ribeiro wrote:
> Hi Axel,

hi christiane,
 
> I'm Christiane, the one responsible for the hwloc/libnuma support in cp2k.

thanks for taking the time to look into this.
 
> Concerning libnuma, the affinity support is much simpler than with hwloc: it covers only thread/process affinity. I'll check the wrapper to see why it is not working and let you know.

ok. it may be version specific, too.  

[akohlmey@g002 input]$ rpm -qif /usr/lib64/libnuma.so.1 
Name        : numactl                      Relocations: (not relocatable)
Version     : 2.0.3                             Vendor: Red Hat, Inc.
Release     : 9.el6                         Build Date: Thu Jun 17 10:46:17 2010
 
> About hwloc, it is true that it requires the latest version, because of the PCI support for network cards and GPUs. By default this module only attaches processes and their memory to NUMA nodes. Their threads are not pinned to any cores, so they can move within a NUMA node. There are other strategies for placing MPI tasks/threads that can be selected by setting the machine_arch keys.

yes, this kind of behavior is what i would have expected.
this should also help with the internal threading in OpenMPI.
 
> Could you send me the input and the machine_arch keys that you used for these tests? I've tested the hwloc support on local Intel/AMD machines (with and without GPUs) and on Cray machines, all of them with NUMA characteristics, and I have seen no errors like that.

please have a look at the attached file. you'll see that there
are some entries that don't look right. particularly the node
names are all that of MPI rank 0.

> When you use numactl, how do you determine the cores for threads and MPI tasks? Do you assign processes to NUMA nodes, so that their threads are consequently attached to the same set of cores as their parent?

yes. our MPI installation is configured by default to have a 1:1 core to MPI
rank mapping (since there is practically nobody yet using MPI+OpenMP)
with memory affinity for giving people the best MPI-only performance.

at the end of the attached file i include a copy of the wrapper script,
that is OpenMPI specific (since that is the only MPI library installed).
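
The kind of wrapper logic described here could look roughly like the following sketch. It is not the script from the attachment: the block mapping of local ranks to NUMA nodes, the helper names, the node count, and the input file name are all assumptions. `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_COMM_WORLD_LOCAL_SIZE` are environment variables that Open MPI exports to each launched process, and `--cpunodebind` and `--membind` are numactl options.

```python
import os


def numa_node_for_local_rank(local_rank, ranks_per_host, numa_nodes):
    """Block mapping: consecutive local ranks fill one NUMA node, then the next."""
    ranks_per_numa = max(1, ranks_per_host // numa_nodes)
    return (local_rank // ranks_per_numa) % numa_nodes


def numactl_command(local_rank, ranks_per_host, numa_nodes, program_args):
    """Build a numactl command line that binds CPUs and memory to one node."""
    node = numa_node_for_local_rank(local_rank, ranks_per_host, numa_nodes)
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *program_args]


def command_from_environment(program_args, numa_nodes, environ=os.environ):
    """Use the per-host rank and size that Open MPI exports for each process."""
    local_rank = int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    local_size = int(environ.get("OMPI_COMM_WORLD_LOCAL_SIZE", "1"))
    return numactl_command(local_rank, local_size, numa_nodes, program_args)


# Example with assumed values: 8 ranks per host, 2 NUMA nodes, local rank 5.
example = command_from_environment(
    ["cp2k.psmp", "-i", "input.inp"],  # input.inp is a placeholder name
    numa_nodes=2,
    environ={"OMPI_COMM_WORLD_LOCAL_RANK": "5", "OMPI_COMM_WORLD_LOCAL_SIZE": "8"},
)
print(" ".join(example))
```

A real wrapper would finish by replacing itself with the generated command, for example through `os.execvp(example[0], example)`, so that mpirun starts the wrapper and the wrapper starts the bound application.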

overall, it looks to me like the default settings are giving a desirable
processor and memory affinity (which is great) that is consistent with
the best settings i could get using my wrapper script, but the diagnostics
seem to be off and may confuse people, particularly technical
support staff in computing centers, who are often too literal and assume
that any software is always giving 100% correct information. ;-)

cheers,
     axel.
cp2k-hwloc.txt

Christiane Pousa

Jan 26, 2012, 3:50:00 AM
to cp...@googlegroups.com
> ok. it may be version specific, too.
>
> [akohlmey@g002 input]$ rpm -qif /usr/lib64/libnuma.so.1
> Name        : numactl                      Relocations: (not relocatable)
> Version     : 2.0.3                             Vendor: Red Hat, Inc.
> Release     : 9.el6                         Build Date: Thu Jun 17 10:46:17 2010

The version that I use is the latest stable one, but I don't believe that the error comes from there. I still have to take a look at the libnuma support.
 
> yes, this kind of behavior is what i would have expected.
> this should also help with the internal threading in OpenMPI.

The main goal is to avoid memory allocations and accesses by different MPI processes on remote NUMA nodes. But if you also want to pin threads, you can try the Linear strategy, which pins both processes and threads.
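
One possible reading of a linear strategy is sketched below: every thread of every local MPI rank gets its own core, taken in consecutive order. This is only an illustration of pinning both processes and threads. The function name and the consecutive ordering are assumptions, and the placement that CP2K actually performs for this key may differ.

```python
def linear_placement(local_ranks, threads_per_rank, cores):
    """Assign each (rank, thread) pair to the next core in `cores`.

    `cores` is an ordered list of core ids available on one host.
    Returns a dict that maps (rank, thread) to a core id.
    """
    needed = local_ranks * threads_per_rank
    if needed > len(cores):
        raise ValueError(f"need {needed} cores, only {len(cores)} available")
    placement = {}
    for rank in range(local_ranks):
        for thread in range(threads_per_rank):
            placement[(rank, thread)] = cores[rank * threads_per_rank + thread]
    return placement


# Example with assumed values: 2 ranks with 4 threads each on an 8-core host.
for (rank, thread), core in sorted(linear_placement(2, 4, list(range(8))).items()):
    print(f"rank {rank} thread {thread} -> core {core}")
```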
 

> please have a look at the attached file. you'll see that there
> are some entries that don't look right. particularly the node
> names are all that of MPI rank 0.

I made some changes to fix this. Could you try the latest version of CP2K?
 

> yes. our MPI installation is configured by default to have a 1:1 core to MPI
> rank mapping (since there is practically nobody yet using MPI+OpenMP)
> with memory affinity for giving people the best MPI-only performance.


Ok. So, for threads, even with this installation you cannot specify their cores?
 
> at the end of the attached file i include a copy of the wrapper script,
> that is OpenMPI specific (since that is the only MPI library installed).

thanks for the script.
 

> overall, it looks to me like the default settings are giving a desirable
> processor and memory affinity (which is great) that is consistent with
> the best settings i could get using my wrapper script, but the diagnostics
> seem to be off and may confuse people, particularly technical
> support staff in computing centers, who are often too literal and assume
> that any software is always giving 100% correct information. ;-)

Now, it should work :) Let me know if you find new bugs.

Regarding your machine, the core numbering problem came from the fact that I was using the numbers that the OS gives to the cores. Now I'm using the logical ones. BTW, is your machine Intel?
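
The difference between OS numbering and logical numbering can be illustrated with a small sketch. hwloc reports both a physical/OS index (P#) and its own logical index (L#) for each object. The code below only approximates a logical numbering by sorting CPUs by socket and core id, so it is not hwloc's algorithm and not the CP2K fix. The function names are hypothetical, and the sysfs files used by the reader are the standard Linux topology entries.

```python
import glob
import os
import re


def read_os_topology():
    """Return {os_cpu: (socket_id, core_id)} from Linux sysfs topology files."""
    topology = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        cpu = int(re.search(r"cpu(\d+)$", path).group(1))
        try:
            with open(os.path.join(path, "topology", "physical_package_id")) as f:
                socket_id = int(f.read())
            with open(os.path.join(path, "topology", "core_id")) as f:
                core_id = int(f.read())
        except OSError:
            continue  # an offline CPU has no topology directory
        topology[cpu] = (socket_id, core_id)
    return topology


def logical_numbering(os_topology):
    """Map OS CPU numbers to indexes ordered by (socket, core, OS number)."""
    ordered = sorted(os_topology, key=lambda cpu: os_topology[cpu] + (cpu,))
    return {os_cpu: logical for logical, os_cpu in enumerate(ordered)}


# Assumed example: the OS alternates between two sockets when numbering CPUs.
interleaved = {cpu: (cpu % 2, cpu // 2) for cpu in range(8)}
mapping = logical_numbering(interleaved)
for os_cpu in sorted(mapping):
    print(f"OS cpu {os_cpu} -> logical index {mapping[os_cpu]}")
```

With such an interleaved layout, OS CPUs 0 and 1 sit on different sockets, so output that lists raw OS numbers can look scattered even when the placement itself is compact.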
 

> cheers,
>      axel.


cheers,

Christiane Pousa Ribeiro
 

Axel

Jan 26, 2012, 7:38:04 PM
to cp...@googlegroups.com


On Thursday, January 26, 2012 3:50:00 AM UTC-5, Christiane Pousa Ribeiro wrote:
 
 
>> yes, this kind of behavior is what i would have expected.
>> this should also help with the internal threading in OpenMPI.

> The main goal is to avoid memory allocations and accesses by different MPI processes on remote NUMA nodes.

i know. ;)
 
> But if you also want to pin threads, you can try the Linear strategy, which pins both processes and threads.

unlike with binding MPI tasks to "NUMA units",
i didn't see a significant difference in performance.


>> please have a look at the attached file. you'll see that there
>> are some entries that don't look right. particularly the node
>> names are all that of MPI rank 0.

> I made some changes to fix this. Could you try the latest version of CP2K?

yes. updated, compiled and tested. it gives the output that i expect now.
 

>> yes. our MPI installation is configured by default to have a 1:1 core to MPI
>> rank mapping (since there is practically nobody yet using MPI+OpenMP)
>> with memory affinity for giving people the best MPI-only performance.


> Ok. So, for threads, even with this installation you cannot specify their cores?

it is not only a matter of "want". the majority of users that i am working
with don't care (well, they do care if things run faster, but they don't
care so much if it looks/sounds/is complicated). by hiding most of
the complexity in a script and having it not allow unreasonable choices,
i don't get the maximal flexibility, but all i need to do is to tell people:
just use this wrapper and it'll work. if every application were hardware
topology aware and adjusted itself as needed, that would be even better, and that
is why i am trying to compile cp2k this way.

>> at the end of the attached file i include a copy of the wrapper script,
>> that is OpenMPI specific (since that is the only MPI library installed).

> thanks for the script.
 

>> overall, it looks to me like the default settings are giving a desirable
>> processor and memory affinity (which is great) that is consistent with
>> the best settings i could get using my wrapper script, but the diagnostics
>> seem to be off and may confuse people, particularly technical
>> support staff in computing centers, who are often too literal and assume
>> that any software is always giving 100% correct information. ;-)

> Now, it should work :) Let me know if you find new bugs.

thanks a lot. much appreciated. will let you know,
if i run across any additional problems.
 
> Regarding your machine, the core numbering problem came from the fact that I was using the numbers that the OS gives to the cores. Now I'm using the logical ones. BTW, is your machine Intel?

We have both. Intel and AMD (which is forcing me to use compiler settings,
that are compatible with a common subset of both). overall the AMD ones
benefit the most from using processor and memory affinity, but i was surprised
how much impact it has on the X5677 Intel CPUs (quad-core westmere ep
with 3.5GHz). just proves that there is always something new to learn...
 
thanks again,
    axel.

Christiane Pousa

Jan 27, 2012, 3:31:14 AM
to cp...@googlegroups.com
> unlike with binding MPI tasks to "NUMA units",
> i didn't see a significant difference in performance.


Yes, most of the time the improvements will be around 5-15%. It really depends on how the OS manages processes/threads and how the application itself was developed.


> yes. updated, compiled and tested. it gives the output that i expect now.

good. Libnuma support should now work too. 


> We have both. Intel and AMD (which is forcing me to use compiler settings,
> that are compatible with a common subset of both). overall the AMD ones
> benefit the most from using processor and memory affinity, but i was surprised
> how much impact it has on the X5677 Intel CPUs (quad-core westmere ep
> with 3.5GHz). just proves that there is always something new to learn...

Yes, that is true. On the machines that I have worked on, this was the rule too. It is difficult to explain the reasons (cache size, cache/memory protocol?) :)
 
cheers,

--
[]'s
Christiane Pousa Ribeiro
 
"Judge a man by his questions, rather than by his answers"