Hi Axel,
I'm Christiane, the one responsible for the hwloc/libnuma support in CP2K.
Concerning libnuma, its affinity support is much simpler than the one with hwloc: it covers only thread/process affinity. I'll check this wrapper to see why it is not working and let you know.
About hwloc, it is true that it requires the latest version, because of the PCI support for network cards and GPUs. By default this module only attaches processes and their memory to NUMA nodes. Their threads are not pinned to any cores, so they can move within a NUMA node. There are other strategies for placing MPI processes and threads, which can be selected by setting the machine_arch keys.
Could you send me the input and the machine_arch keys that you used for these tests? I've tested the hwloc support on local Intel/AMD machines (with and without GPUs) and on Cray machines, all of them with NUMA characteristics, and I have seen no errors like that.
When you use numactl, how do you determine the cores for threads and MPI tasks? Do you assign processes to NUMA nodes so that, consequently, threads are attached to the same set of cores as their parent?
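To make the question about threads following their parent concrete, here is a minimal sketch, assuming a Linux machine and Python's standard os module. The function name thread_inherits_affinity is made up for illustration and is not part of CP2K, hwloc or numactl. It restricts the calling thread to one allowed core, similar in effect to launching under numactl or taskset, and then checks which cores a newly created thread is allowed to run on.

```python
import os
import threading


def thread_inherits_affinity():
    """Pin the calling thread to one core and report the mask a new thread sees.

    Hypothetical helper for illustration only (Linux-specific calls).
    """
    original = os.sched_getaffinity(0)   # cores the calling thread may use now
    target = {min(original)}             # one core that is certainly allowed
    seen = {}
    try:
        os.sched_setaffinity(0, target)  # restrict the calling thread
        worker = threading.Thread(
            target=lambda: seen.update(mask=os.sched_getaffinity(0))
        )
        worker.start()                   # the new thread inherits the creator's mask
        worker.join()
    finally:
        os.sched_setaffinity(0, original)  # restore the original mask
    return target, seen["mask"]


if __name__ == "__main__":
    target, mask = thread_inherits_affinity()
    print("parent restricted to:", sorted(target))
    print("child thread sees:  ", sorted(mask))
```

The child thread reports the same restricted mask as its creator, which is why binding an MPI process to a NUMA node also confines its OpenMP threads to that node unless they are pinned individually.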
ok. it may be version specific, too.
[akohlmey@g002 input]$ rpm -qif /usr/lib64/libnuma.so.1
Name        : numactl                      Relocations: (not relocatable)
Version     : 2.0.3                             Vendor: Red Hat, Inc.
Release     : 9.el6                         Build Date: Thu Jun 17 10:46:17 2010
yes, this kind of behavior is what i would have expected. this should also help with the internal threading in OpenMPI.
please have a look at the attached file. you'll see that there are some entries that don't look right. particularly the node names are all that of MPI rank 0.
yes. our MPI installation is configured by default to have a 1:1 core to MPI rank mapping (since there is practically nobody yet using MPI+OpenMP) with memory affinity for giving people the best MPI-only performance.
at the end of the attached file i include a copy of the wrapper script, that is OpenMPI specific (since that is the only MPI library installed).
overall, it looks to me like the default settings are giving a desirable processor and memory affinity (which is great) that is consistent with the best settings i could get using my wrapper script, but the diagnostics seem to be off and may be confusing people, particularly technical support in computing centers, who are often too literal and assume that any software is always giving 100% correct information. ;-)
cheers,
axel.
yes, this kind of behavior is what i would have expected. this should also help with the internal threading in OpenMPI.
The main goal is to avoid memory allocations and accesses by different MPI processes on remote NUMA nodes.
But if you also want to pin threads, you can try the Linear strategy, which will pin both processes and threads.
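As a rough illustration of what pinning both processes and threads in a linear fashion could look like, here is a minimal sketch. It is an assumption about the general idea, not the actual CP2K implementation: the function linear_placement and its parameters are hypothetical, and cores are simply handed out in increasing logical order, rank by rank and thread by thread.

```python
def linear_placement(n_ranks, threads_per_rank, n_cores):
    """Assign one core per thread, walking the cores in increasing order.

    Hypothetical illustration only; not taken from the CP2K sources.
    Returns a dict mapping MPI rank -> list of core indices, one per thread.
    """
    placement = {}
    for rank in range(n_ranks):
        first = rank * threads_per_rank
        # Wrap around if more threads than cores are requested.
        placement[rank] = [(first + t) % n_cores for t in range(threads_per_rank)]
    return placement


if __name__ == "__main__":
    # 2 MPI ranks with 4 threads each on an 8-core node.
    for rank, cores in linear_placement(2, 4, 8).items():
        print(f"rank {rank}: threads pinned to cores {cores}")
```

With 2 ranks of 4 threads on 8 cores, rank 0 gets cores 0 to 3 and rank 1 gets cores 4 to 7, so each rank's threads stay together and no thread can migrate.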
please have a look at the attached file. you'll see that there are some entries that don't look right. particularly the node names are all that of MPI rank 0.
I made some changes to fix this. Could you try the latest version of CP2K?
yes. our MPI installation is configured by default to have a 1:1 core to MPI rank mapping (since there is practically nobody yet using MPI+OpenMP) with memory affinity for giving people the best MPI-only performance.
Ok. So, for threads, even with this installation you cannot specify their cores?
at the end of the attached file i include a copy of the wrapper script, that is OpenMPI specific (since that is the only MPI library installed).
thanks for the script.
overall, it looks to me like the default settings are giving a desirable processor and memory affinity (which is great) that is consistent with the best settings i could get using my wrapper script, but the diagnostics seem to be off and may be confusing people, particularly technical support in computing centers, who are often too literal and assume that any software is always giving 100% correct information. ;-)
Now, it should work :) Let me know if you find new bugs.
Concerning your machine, the core-numbering problem came from the fact that I was using the numbers that the OS gives to the cores. Now I'm using the logical ones. BTW, is your machine Intel?
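To show why OS core numbers and logical numbers can disagree, here is a small sketch. The topology is a made-up example (a two-socket, four-cores-per-socket node where the OS alternates sockets when numbering cores), and logical_numbering is a hypothetical helper, not an hwloc or CP2K function. It renumbers the cores so that all cores of socket 0 come first, which is the kind of topology-ordered index meant by "logical" here.

```python
def logical_numbering(os_topology):
    """Map OS core indices to topology-ordered (logical) indices.

    os_topology: dict of OS index -> (socket, core within socket).
    Hypothetical helper for illustration only.
    """
    ordered = sorted(os_topology, key=lambda os_id: os_topology[os_id])
    return {os_id: logical for logical, os_id in enumerate(ordered)}


# Assumed example machine: the OS numbers cores round-robin across 2 sockets.
os_topology = {os_id: (os_id % 2, os_id // 2) for os_id in range(8)}
mapping = logical_numbering(os_topology)

if __name__ == "__main__":
    for os_id in sorted(mapping):
        socket, core = os_topology[os_id]
        print(f"OS core {os_id} (socket {socket}) -> logical core {mapping[os_id]}")
```

On such a machine OS cores 0, 2, 4, 6 sit on socket 0 and become logical cores 0 to 3, so a report printed with OS numbers looks scattered even when the placement itself is compact.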
unlike with binding MPI tasks to "NUMA units", i didn't see a significant difference in performance.
yes. updated, compiled and tested. it gives the output that i expect now.
We have both, Intel and AMD (which is forcing me to use compiler settings that are compatible with a common subset of both). overall the AMD ones benefit the most from using processor and memory affinity, but i was surprised how much impact it has on the X5677 Intel CPUs (quad-core westmere ep with 3.5GHz). just proves that there is always something new to learn...