Likwid measures wrong cores


Mike

Jun 14, 2022, 4:57:15 AM
to likwid-users
Hello Likwid-Team,

I have a hybrid application that I want to measure with likwid-perfctr. Before I make the hybrid measurements, I want to measure a pure multi-threading version and then a pure MPI version of it. Measuring the multi-threading version works fine, but I am running into problems with the MPI version. My problem is that likwid does not seem to measure the cores I want it to measure.

From the documentation I gathered that there are 3 main ways of measuring pure MPI applications:

       
  1. mpirun -np 32 likwid-perfctr -c S0:0-31 -g CACHE -o mpi_%h.txt ./a.out (from likwid-perfctr documentation)

  2. likwid-mpirun -np 4 -pin S0:0_S0:1_S0:2_S0:3 -g CACHE ./a.out > mpi.txt (from likwid-mpirun documentation and TutorialMPI)

  3. mpirun -np 1 likwid-perfctr -c S0:0 -g CACHE ./a.out : -np 1 likwid-perfctr -c S0:1 -g CACHE ./a.out : … (from TutorialMPI)


Some background information before I explain my problem. I work on a two-socket machine with 64 cores per socket. Hyper-threading is enabled, but I only want one process per core. In physical indexing, core 0 has threads 0 and 128, core 1 has threads 1 and 129, and so on, while in logical indexing core 0 has threads 0 and 1, core 1 has threads 2 and 3, and so on. I do the CPU binding via hwloc inside my code, so I only need likwid for measuring. I use Likwid 5.2.
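(Aside, a minimal sketch for checking this mapping: the HWThread-to-core/socket table and the affinity domains LIKWID derives from it can be printed with the following commands; the exact table layout depends on the LIKWID version.)

    likwid-topology    # prints the HWThread -> Thread/Core/Socket table plus the cache topology
    likwid-pin -p      # lists the affinity domains (N, S0, S1, ...) and the hardware threads they contain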


In this example I want to run 32 processes, all of them on S0 on cores 0-31. Later I will probably also want other configurations.


When using the first option I get the error “WARN: Selected affinity domain S0 has only -1 hardware threads, but selection string evaluates to 32 threads. This results in multiple threads on the same hardware thread.” and the run crashes. I find this weird because domain S0 does contain 64 cores (likwid-topology also shows this). When omitting the S0 and just writing “-c 0-31” the code runs, but the output file sometimes shows that the correct HWThreads were measured and sometimes that HWThreads 64-95 were measured instead, and some of those measurements are just a zero or a dash. It also gives me the info “INFO: You are running LIKWID in a cpuset with 128 CPUs. Taking given IDs as logical ID in cpuset”, so I assume that I should use the logical CPU IDs on the command line. When doing that it again sometimes measures the correct cores and sometimes completely different ones. Is measuring the wrong cores just a weird side effect of something else I am doing, and should I just redo the measurement until the correct cores show up in the output file? If I use “-o mpi%r.txt” instead of “-o mpi%h.txt” I get a file per rank; am I right to assume that when using the latter, the output is just combined into one file?
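(Aside, a minimal sketch: the INFO line means the ranks run inside a restricted cpuset. To see which hardware threads a rank is actually allowed to use, the cpuset can be queried from inside the job, started through the same launcher as the application:)

    mpirun -np 1 grep Cpus_allowed_list /proc/self/status   # kernel view of the allowed hardware threads
    mpirun -np 1 sh -c 'taskset -cp $$'                      # same information via util-linux taskset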


When using the second option with 32 processes and thread group “S0” with IDs 0-31 it will always measure cores: 0, 65, 2, 67, 4, 69, 6, 71, 8, 73, 10, 75, 12, 77, 14, 79, 16, 81, 18, 83, 20, 85, 22, 87, 24, 89, 26, 91, 28, 93, 30, 95.

When using the thread group “N” with IDs 0,2,4,6… the cores: 0, 66, 4, 70, 8, 74, 12, 78, 16, 82, 20, 86, 24, 90, 28, 94, 32, 98, 36, 102, 40, 106, 44, 110, 48, 114, 52, 118, 56, 122, 60, 126 are measured.

When using no thread group with IDs 0,2,4,6… the same cores as with “N” are measured, which makes sense since both are logical orderings, but they are still not the ones I stated in the command line.

When using the third option, pretty much the same thing happens as with the second option.

I realize this question is really long and I am thankful to everyone who read it. So, am I missing or misunderstanding something when it comes to measuring specific cores? The first option does measure the correct cores sometimes, but since it does not do so always, it has made me doubtful of the results (even though they were in the expected range and could very well be right). Thank you for your responses.

Thomas Gruber

Nov 10, 2022, 6:59:50 AM
to likwid-users
Hello,

There are multiple applications in interplay: the batch system, MPI and LIKWID. The batch system provides the resources and limits what you can use. The MPI runtime tries to be clever and may further limit the cpuset. LIKWID tries to do what the user tells it, but if the hardware threads are not part of the cpuset, it cannot do anything about it. Since limited cpusets are a pain to use, LIKWID switches to logical pinning, so your 0-31 are not hardware thread IDs but indices into the list of available hardware threads. If the list contains only the hardware threads of S1, those will be used. Moreover, only when using the <domain>:<list> notation are the physical cores selected first. In all other cases, the list is left untouched, containing the physical and SMT threads interleaved (commonly), so 0-31 is commonly not what you want.
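(A minimal sketch to illustrate the difference between the selection variants, assuming no restricting cpuset; the E: expression syntax is the variant described in the likwid-pin documentation:)

    # plain list: physical hardware thread IDs (or indices into the cpuset, if one restricts you)
    likwid-perfctr -c 0-31 -g CACHE ./a.out

    # <domain>:<list>: logical IDs within the domain, physical cores are filled first,
    # so S0:0-31 gives one hardware thread per physical core of socket 0
    likwid-perfctr -c S0:0-31 -g CACHE ./a.out

    # expression syntax E:<domain>:<numthreads>:<chunk>:<stride>, e.g. every second
    # hardware thread of socket 0 (one per physical core with 2-way SMT)
    likwid-perfctr -c E:S0:32:1:2 -g CACHE ./a.out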

If you submit a job (and run it through mpirun) and execute likwid-topology, does every hardware thread have its asterisk marking it as available? You can also run likwid-pin -p to see the pinning domains containing only the available hardware threads.
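(A minimal sketch, with mpirun standing for whatever starter your MPI uses; run it inside the job allocation so you see the same cpuset your ranks inherit:)

    mpirun -np 1 likwid-topology   # every usable hardware thread should carry the availability asterisk
    mpirun -np 1 likwid-pin -p     # the domains should list all hardware threads you expect to use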

> Is this measuring wrong cores just a weird side effect of something else I am doing and should I just redo the measurement until the correct cores show in the output file?
Depending on your batch system and MPI implementation, it might be that a different skip mask is required. The MPI runtimes sometimes start shepherd/progress/background threads which should not be pinned. You can skip over them with the -s 0xX option. likwid-mpirun has some skip masks embedded because they depend on the MPI vendor, MPI version, OpenMP vendor, number of nodes and number of ranks per node.
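(A minimal sketch of the manual variant using the -s/--skip option of likwid-perfctr; the concrete mask value is an assumption here and has to be determined for your MPI/OpenMP combination:)

    # assumption: 0x1 skips the first additionally created (shepherd/progress) thread during
    # pinning, 0x3 would skip the first two; adjust after checking which thread stays unpinned
    mpirun -np 32 likwid-perfctr -c S0:0-31 -s 0x1 -g CACHE ./a.out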

> If I use “-o mpi%r.txt” instead of “-o mpi%h.txt” I get a file per rank, am I right to assume that when using the latter, the output is just combined into one file?
No, all ranks on the host will compete over who writes the file. likwid-mpirun uses both, i.e. mpi_%h_%r.txt or similar.
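(A minimal sketch combining both placeholders, so each rank on each host writes its own file:)

    mpirun -np 32 likwid-perfctr -c S0:0-31 -g CACHE -o mpi_%h_%r.txt ./a.out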

So my guidelines for proper measurements (a combined sketch follows after the list):
- Add --exclusive or similar (-c <all_hw_threads>) to the batch system submission
- Deactivate pinning of the MPI implementation (I_MPI_PIN=off or similar)
- Check that you are not caged into a cpuset and have all resources
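(A hedged end-to-end sketch of these guidelines for a SLURM + Intel MPI setup; the option and variable names are examples and differ for other batch systems and MPI implementations:)

    #SBATCH --exclusive            # whole node, no restricting cpuset
    export I_MPI_PIN=off           # example for Intel MPI: disable its own pinning
    mpirun -np 32 likwid-perfctr -c S0:0-31 -g CACHE -o mpi_%h_%r.txt ./a.out

The likwid-mpirun variant from option 2 should then behave the same way, since the cpuset no longer hides hardware threads.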

Best,
Thomas