likwid-mpirun and MPI tasks distributed using slurm


Federico Tesser

May 4, 2020, 3:19:58 PM
to likwid-users
Good evening.

I am Federico Tesser,

and I am writing to you because I do not understand some problems I am having
with likwid-mpirun (version 5.0.1) and slurm (version 18.08.3) when distributing
8 MPI tasks over two different sockets.
If I run the code without likwid-mpirun, the MPI tasks are split
across the two sockets, while if I use likwid-mpirun, I get 4 tasks on the first
4 CPUs of one socket, and the remaining 4 tasks on the same set of CPUs.

To be sure, I also checked the "Cpus_allowed" field of /proc/self/status in the two cases:
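
(For reference, an equivalent quick check from the shell would be something along the lines of "srun -n 8 grep Cpus_allowed_list /proc/self/status"; the values reported below, however, are printed from inside the MPI code itself.)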

1) without likwid-mpirun

Process 000: VmRSS = 789276 KB, VmSize = 887076 KB Cpus_allowed= 0
Process 001: VmRSS = 789276 KB, VmSize = 887076 KB Cpus_allowed= 10
Process 002: VmRSS = 789344 KB, VmSize = 887076 KB Cpus_allowed= 1
Process 003: VmRSS = 789348 KB, VmSize = 887076 KB Cpus_allowed= 11
Process 004: VmRSS = 789276 KB, VmSize = 887076 KB Cpus_allowed= 2
Process 005: VmRSS = 789276 KB, VmSize = 887076 KB Cpus_allowed= 12
Process 006: VmRSS = 789284 KB, VmSize = 887076 KB Cpus_allowed= 3
Process 007: VmRSS = 789332 KB, VmSize = 887076 KB Cpus_allowed= 13

2) with likwid-mpirun

DEBUG: Resolved NperDomain string E:N:8:1 to CPUs: [0] [1] [2] [3] [16] [17] [18] [19] 
DEBUG: Process 1 runs on CPUs 0
DEBUG: Process 2 runs on CPUs 1
DEBUG: Process 3 runs on CPUs 2
DEBUG: Process 4 runs on CPUs 3
DEBUG: Process 5 runs on CPUs 16
DEBUG: Process 6 runs on CPUs 17
DEBUG: Process 7 runs on CPUs 18
DEBUG: Process 8 runs on CPUs 19
DEBUG: Assign 8 processes with 8 per node and 1 threads per process to 1 hosts

Process 000: VmRSS = 803852 KB, VmSize = 908684 KB Cpus_allowed= 0
Process 001: VmRSS = 803828 KB, VmSize = 908596 KB Cpus_allowed= 1
Process 002: VmRSS = 803828 KB, VmSize = 908596 KB Cpus_allowed= 2
Process 003: VmRSS = 803832 KB, VmSize = 908596 KB Cpus_allowed= 3
Process 004: VmRSS = 803860 KB, VmSize = 908684 KB Cpus_allowed= 0
Process 005: VmRSS = 803828 KB, VmSize = 908596 KB Cpus_allowed= 1
Process 006: VmRSS = 803828 KB, VmSize = 908596 KB Cpus_allowed= 2
Process 007: VmRSS = 803824 KB, VmSize = 908596 KB Cpus_allowed= 3


For this last case I also report the metrics I obtain for a kernel inside the
code (it is just a sort of triad: given two vectors a and b (each composed of
10000 doubles) and a matrix A (10000x10000 doubles), the kernel just computes
a(i) = a(i) + A(i,j)*b(j)), for the group MEM_DP:

+-----------------------------------+--------------+------------+------------+------------+------------+------------+------------+------------+
|              Metric               |  node07:0:0  | node07:1:1 | node07:2:2 | node07:3:3 | node07:4:0 | node07:5:1 | node07:6:2 | node07:7:3 |
+-----------------------------------+--------------+------------+------------+------------+------------+------------+------------+------------+
|        Runtime (RDTSC) [s]        |       0.8160 |     0.8183 |     0.8129 |     0.8128 |     0.8127 |     0.8202 |     0.8099 |     0.8125 |
|       Runtime unhalted [s]        |       0.0007 |     0.0020 |     0.0060 |     0.0011 |     0.0042 |     0.0060 |     0.0026 |     0.0057 |
|            Clock [MHz]            |    2087.3790 |  1883.5659 |  2097.8637 |  3825.3865 |  2057.0348 |  2054.0730 |  2116.5902 |  2099.8554 |
|                CPI                |       0.5563 |     0.4859 |     0.4020 |     0.4662 |     0.4700 |     0.4673 |     0.4480 |     0.4340 |
|            Energy [J]             |     946.6923 |          0 |          0 |          0 |   944.6807 |          0 |          0 |          0 |
|             Power [W]             |    1160.2304 |          0 |          0 |          0 |  1162.3878 |          0 |          0 |          0 |
|          Energy DRAM [J]          |     118.0798 |          0 |          0 |          0 |   117.7544 |          0 |          0 |          0 |
|          Power DRAM [W]           |     144.7141 |          0 |          0 |          0 |   144.8916 |          0 |          0 |          0 |
|           DP [MFLOP/s]            |       0.1969 |     1.0108 |     1.7574 |     0.0495 |     0.8381 |     1.9277 |     0.8498 |     1.5763 |
|         AVX DP [MFLOP/s]          |            0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |
|         Packed [MUOPS/s]          |            0 |          0 |          0 |          0 |          0 |          0 |          0 |          0 |
|         Scalar [MUOPS/s]          |       0.1969 |     1.0108 |     1.7574 |     0.0495 |     0.8381 |     1.9277 |     0.8498 |     1.5763 |
| Memory read bandwidth [MBytes/s]  |    5113.0600 |          0 |          0 |          0 |  5129.3341 |          0 |          0 |          0 |
| Memory read data volume [GBytes]  |       4.1720 |          0 |          0 |          0 |     4.1686 |          0 |          0 |          0 |
| Memory write bandwidth [MBytes/s] |     176.7559 |          0 |          0 |          0 |   202.8835 |          0 |          0 |          0 |
| Memory write data volume [GBytes] |       0.1442 |          0 |          0 |          0 |     0.1649 |          0 |          0 |          0 |
|    Memory bandwidth [MBytes/s]    |    5289.8159 |          0 |          0 |          0 |  5332.2176 |          0 |          0 |          0 |
|    Memory data volume [GBytes]    |       4.3162 |          0 |          0 |          0 |     4.3335 |          0 |          0 |          0 |
|       Operational intensity       | 3.721738e-05 |          0 |          0 |          0 |     0.0002 |          0 |          0 |          0 |
+-----------------------------------+--------------+------------+------------+------------+------------+------------+------------+------------+

Here I also report the most important settings from the slurm batch file:

#SBATCH -N 1
#SBATCH -n 8
#SBATCH --distribution=*:cyclic
#SBATCH --cpus-per-task=1
##SBATCH -B 2:4:1
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks-per-socket=4
#SBATCH --ntasks-per-core=1
SLURM_MPI_TYPE=pmi2
SLURM_CPU_FREQ_REQ=2100000-2100000

export SLURM_MPI_TYPE
export SLURM_CPU_FREQ_REQ

likwid-mpirun -d -np 8 -g MEM_DP -m ./a.out

I also take advantage of this e-mail to ask why in this case SLURM_CPU_FREQ_REQ seems to be ignored, and why the
DP fields look so strange (or rather, why the operational intensity is so low).


Thank you for your time and best regards,

Federico Tesser

Thomas Gruber

May 5, 2020, 5:14:50 AM
to likwid-users
Hi,

It's difficult to answer without knowing the system. Do the compute nodes have 8 or 10 CPU cores per socket? SMT is on or off? (best would be the output of likwid-topology or likwid-pin -p inside a job)

likwid-mpirun uses a subset of the SLURM_* environment variables (basically only SLURM_TASKS_PER_NODE but in some situations also SLURM_CPUS_ON_NODE and SLURM_CPUS_PER_TASK). For likwid-mpirun to work properly, you should tell likwid-mpirun where you want the processes and threads and not SLURM. If you are explicit with your process distribution (-nperdomain S:4 = four processes per socket), does it still pin the threads to "wrong" cores?
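
For example, something like this (just a sketch, adapt the group and binary to your case):

likwid-mpirun -nperdomain S:4 -g MEM_DP -m ./a.out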

Another point is the cpusets which are often forced by SLURM. If likwid-mpirun is allowed to use only 4 cores, it can only schedule the processes to these cores. Moreover, SLURM does not want you to change the scheduling inside a job. So if you set ntasks-per-node to 8, you cannot change this setting inside the job without warnings. There is a patch for that in the current master branch, developed in collaboration with CINECA.

There is definitely a problem with likwid-mpirun on your machine. Although it detects cores 0-3 and 16-19 as usable in E:N:8:1, it uses only cores 0-3 later. It kind of surprises me that likwid-mpirun uses cores twice because there is a check for that. Consequently, it reads core 0 twice (memory traffic). So you have a low FLOP rate (don't know why as your kernel seems reasonable) and doubled memory data volume which results in a very low intensity.
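
(Quick sanity check with your own numbers: the operational intensity is simply the DP rate divided by the memory bandwidth, e.g. for node07:0:0 that is 0.1969 MFLOP/s / 5289.8159 MBytes/s, roughly 3.72e-05 FLOP/Byte, which is exactly the value in your table. So the metric itself is consistent; the anomaly is the tiny FLOP rate combined with the full memory traffic.)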

The environment is not touched by likwid-mpirun (with SLURM), so no clue about SLURM_CPU_FREQ_REQ. Can you please post the complete output of -d for your runs? You can also try what happens when specifying the MPI (-mpi intelmpi|openmpi) or using the force flag (-f).
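
For example (untested here, just to see whether the behavior changes):

likwid-mpirun -d -mpi openmpi -np 8 -g MEM_DP -m ./a.out
likwid-mpirun -d -f -np 8 -g MEM_DP -m ./a.out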

I hope this helps.
Best,
Thomas

Federico Tesser

May 5, 2020, 9:24:57 AM
to likwid...@googlegroups.com
Hi,

and thank you for your reply.

On Tue, May 5, 2020 at 11:14 AM 'Thomas Gruber' via likwid-users <likwid...@googlegroups.com> wrote:
Hi,

It's difficult to answer without knowing the system. Do the compute nodes have 8 or 10 CPU cores per socket? SMT is on or off? (best would be the output of likwid-topology or likwid-pin -p inside a job)

Here is the output of likwid-topology (two 16-core sockets, without SMT):

Hardware Thread Topology
********************************************************************************
Sockets: 2
Cores per socket: 16
Threads per core: 1
--------------------------------------------------------------------------------
HWThread Thread Core Socket Available
0 0 0 0 *
1 0 1 0 *
2 0 2 0 *
3 0 3 0 *
4 0 4 0 *
5 0 5 0 *
6 0 6 0 *
7 0 7 0 *
8 0 8 0 *
9 0 9 0 *
10 0 10 0 *
11 0 11 0 *
12 0 12 0 *
13 0 13 0 *
14 0 14 0 *
15 0 15 0 *
16 0 16 1 *
17 0 17 1 *
18 0 18 1 *
19 0 19 1 *
20 0 20 1 *
21 0 21 1 *
22 0 22 1 *
23 0 23 1 *
24 0 24 1 *
25 0 25 1 *
26 0 26 1 *
27 0 27 1 *
28 0 28 1 *
29 0 29 1 *
30 0 30 1 *
31 0 31 1 *
--------------------------------------------------------------------------------
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level: 1
Size: 32 kB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 ) ( 8 ) ( 9 ) ( 10 ) ( 11 ) ( 12 ) ( 13 ) ( 14 ) ( 15 ) ( 16 ) ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 ) ( 23 ) ( 24 ) ( 25 ) ( 26 ) ( 27 ) ( 28 ) ( 29 ) ( 30 ) ( 31 )
--------------------------------------------------------------------------------
Level: 2
Size: 1 MB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 ) ( 8 ) ( 9 ) ( 10 ) ( 11 ) ( 12 ) ( 13 ) ( 14 ) ( 15 ) ( 16 ) ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 ) ( 23 ) ( 24 ) ( 25 ) ( 26 ) ( 27 ) ( 28 ) ( 29 ) ( 30 ) ( 31 )
--------------------------------------------------------------------------------
Level: 3
Size: 22 MB
Cache groups: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ) ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains: 2
--------------------------------------------------------------------------------
Domain: 0
Processors: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Distances: 10 21
Free memory: 92532.9 MB
Total memory: 97693.2 MB
--------------------------------------------------------------------------------
Domain: 1
Processors: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
Distances: 21 10
Free memory: 95485 MB
Total memory: 98304 MB
--------------------------------------------------------------------------------
likwid-mpirun uses a subset of the SLURM_* environment variables (basically only SLURM_TASKS_PER_NODE but in some situations also SLURM_CPUS_ON_NODE and SLURM_CPUS_PER_TASK). For likwid-mpirun to work properly, you should tell likwid-mpirun where you want the processes and threads and not SLURM. If you are explicit with your process distribution (-nperdomain S:4 = four processes per socket), does it still pin the threads to "wrong" cores?

Yes, it continues to pin to the same cores.

Another point is the cpusets which are often forced by SLURM. If likwid-mpirun is allowed to use only 4 cores, it can only schedule the processes to these cores. Moreover, SLURM does not want you to change the scheduling inside a job. So if you set ntasks-per-node to 8, you cannot change this setting inside the job without warnings. There is a patch for that in the current master branch, developed in collaboration with CINECA.

Interesting, I will definitely check this patch.

There is definitely a problem with likwid-mpirun on your machine. Although it detects cores 0-3 and 16-19 as usable in E:N:8:1, it uses only cores 0-3 later. It kind of surprises me that likwid-mpirun uses cores twice because there is a check for that. Consequently, it reads core 0 twice (memory traffic). So you have a low FLOP rate (don't know why as your kernel seems reasonable) and doubled memory data volume which results in a very low intensity.

Yes, you are right. However, the intensity value is still very low, even accounting for a doubled memory data volume.

The environment is not touched by likwid-mpirun (with SLURM), so no clue about SLURM_CPU_FREQ_REQ. Can you please post the complete output of -d for your runs? You can also try what happens when specifying the MPI (-mpi intelmpi|openmpi) or using the force flag (-f).

Here is the complete output of "-d":

DEBUG: Executable given on commandline: ./a.out
DEBUG: Using MPI implementation slurm
DEBUG: Using OpenMP implementation gnu
DEBUG: Reading hostfile from batch system
Available hosts for scheduling:
Host                 Slots MaxSlots Interface
node07               8 8
Available hosts for scheduling:
Host                 Slots MaxSlots Interface
node07               8 8
DEBUG: Switch to perfctr mode, there are 1 eventsets given on the commandline
DEBUG: Working on host node07 with 8 slots and 8 slots maximally
DEBUG: NperDomain string E:N:8:1 covers the domains: N
DEBUG: Resolved NperDomain string E:N:8:1 to CPUs: [0] [1] [2] [3] [16] [17] [18] [19]
DEBUG: Process 1 runs on CPUs 0
DEBUG: Process 2 runs on CPUs 1
DEBUG: Process 3 runs on CPUs 2
DEBUG: Process 4 runs on CPUs 3
DEBUG: Process 5 runs on CPUs 16
DEBUG: Process 6 runs on CPUs 17
DEBUG: Process 7 runs on CPUs 18
DEBUG: Process 8 runs on CPUs 19
DEBUG: Assign 8 processes with 8 per node and 1 threads per process to 1 hosts
DEBUG: Add Host node07 with 8 slots to host list
DEBUG: Scheduling on hosts:
DEBUG: Host node07 with 8 slots (max. 8 slots)
DEBUG: Process 0 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3,PWR_PKG_ENERGY:PWR0,PWR_DRAM_ENERGY:PWR3,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1
DEBUG: Process 1 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
DEBUG: Process 2 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
DEBUG: Process 3 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
DEBUG: Process 4 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3,PWR_PKG_ENERGY:PWR0,PWR_DRAM_ENERGY:PWR3,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1
DEBUG: Process 5 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
DEBUG: Process 6 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
DEBUG: Process 7 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
EXEC (Rank 0): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 0 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3,PWR_PKG_ENERGY:PWR0,PWR_DRAM_ENERGY:PWR3,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 1): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 1 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 2): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 2 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 3): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 3 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 4): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 16 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3,PWR_PKG_ENERGY:PWR0,PWR_DRAM_ENERGY:PWR3,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 5): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 17 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 6): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 18 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC (Rank 7): /usr/local/likwid-5.0.1/bin/likwid-perfctr -m  -C 19 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_SCALAR_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3 -o /work/ftesser/likwid_prova/.output_191143_%r_%h.csv ./a.out
EXEC: srun --nodes 1 --ntasks-per-node=8 --cpu_bind=none  /work/ftesser/likwid_prova/.likwidscript_191143.txt
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset
INFO: You are running LIKWID in a cpuset with 8 CPUs. Taking given IDs as logical ID in cpuset

I am sorry, but doesn't the "-f" option just force the writing of the registers?

Sorry to continue bothering you guys, but I am really stuck on this problem.

Best regards,

Federico



Thomas Gruber

May 5, 2020, 11:18:22 AM
to likwid-users
Hi,

It's difficult to answer without knowing the system. Do the compute nodes have 8 or 10 CPU cores per socket? SMT is on or off? (best would be the output of likwid-topology or likwid-pin -p inside a job)

Here is the output of likwid-topology (two 16-core sockets, without SMT):

Hardware Thread Topology
********************************************************************************
Sockets: 2
Cores per socket: 16
Threads per core: 1
--------------------------------------------------------------------------------
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
--------------------------------------------------------------------------------

likwid-mpirun uses a subset of the SLURM_* environment variables (basically only SLURM_TASKS_PER_NODE but in some situations also SLURM_CPUS_ON_NODE and SLURM_CPUS_PER_TASK). For likwid-mpirun to work properly, you should tell likwid-mpirun where you want the processes and threads and not SLURM. If you are explicit with your process distribution (-nperdomain S:4 = four processes per socket), does it still pin the threads to "wrong" cores?

Yes, it continues to pin to the same cores.

See explanation below.
 

Another point is the cpusets which are often forced by SLURM. If likwid-mpirun is allowed to use only 4 cores, it can only schedule the processes to these cores. Moreover, SLURM does not want you to change the scheduling inside a job. So if you set ntasks-per-node to 8, you cannot change this setting inside the job without warnings. There is a patch for that in the current master branch, developed in collaboration with CINECA.

Interesting, I will definitely check this patch.



There is definitely a problem with likwid-mpirun on your machine. Although it detects cores 0-3 and 16-19 as usable in E:N:8:1, it uses only cores 0-3 later. It kind of surprises me that likwid-mpirun uses cores twice because there is a check for that. Consequently, it reads core 0 twice (memory traffic). So you have a low FLOP rate (don't know why as your kernel seems reasonable) and doubled memory data volume which results in a very low intensity.

Yes, you are right. However, the intensity value is still very low, even accounting for a doubled memory data volume.

Well, the FLOP rate for the individual processes is extremely low as well
DP [MFLOP/s]                        |       0.1969 |     1.0108 |     1.7574 |     0.0495 |     0.8381 |     1.9277 |     0.8498 |     1.5763 |

The last INFO lines tell the story. LIKWID schedules the processes properly for socket 0 (0-3) and socket 1 (16-19). From likwid-mpirun's perspective, rank 5 runs on a different socket, so it schedules the memory-traffic events there as well, so that hardware threads 0 and 16 measure those events. As expected. But when executing the srun command, srun puts the MPI processes in a cpuset 0-3,16-19 and calls the script.

EXEC: srun --nodes 1 --ntasks-per-node=8 --cpu_bind=none  /work/ftesser/likwid_prova/.likwidscript_191143.txt
The script contains a big if-else block: if rank == 0 then likwid-pin -c 0 ...; else if rank == 1 then likwid-pin -c ..., etc. But likwid-pin realizes that you are running inside a cpuset and switches to logical pinning, so it assumes 0-3,16-19 are indices into the cpuset of 8 hardware threads. If an index is outside of the list, it does a round-robin access, and 16 % 8 = 0, so it takes index 0, which results in CPU 0. Since the processes don't know anything about each other at that point, two MPI processes end up scheduled on the same hardware thread.
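
In shell terms, the generated script looks roughly like this (a simplified sketch; the real script uses the full likwid-perfctr command lines from the EXEC lines above, and the actual rank variable depends on the MPI/PMI environment, SLURM_PROCID is just an example here):

if [ "$SLURM_PROCID" = "0" ]; then
    likwid-perfctr -m -C 0 -g MEM_DP ./a.out
elif [ "$SLURM_PROCID" = "4" ]; then
    # "-C 16" is meant as physical CPU 16, but inside the 8-CPU cpuset it is
    # taken as a logical index: 16 % 8 = 0, which again resolves to CPU 0
    likwid-perfctr -m -C 16 -g MEM_DP ./a.out
# ...analogous branches for the remaining ranks
fi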

This is normally avoided by --cpu-bind=none, but it seems your system forces the usage of cpusets. If you set -mpi [intelmpi|openmpi], it uses mpirun instead of srun; maybe that helps.
 

I am sorry, but doesn't the "-f" option just force the writing of the registers?

Yes and no. For likwid-mpirun, "-f" skips some problematic settings (overloading, ...), but it is also forwarded to likwid-perfctr to force the writing of the registers.
 

Sorry to continue bothering you guys, but I am really stuck on this problem.


No problem, that's what the mailing list is for.

Best,
Thomas
 


Federico Tesser

May 6, 2020, 8:17:31 AM
to likwid...@googlegroups.com
Hello again, and sorry for the late reply.

First of all, thank you for the link!


On Tue, May 5, 2020 at 5:18 PM 'Thomas Gruber' via likwid-users <likwid...@googlegroups.com> wrote:
Hi,

It's difficult to answer without knowing the system. Do the compute nodes have 8 or 10 CPU cores per socket? SMT is on or off? (best would be the output of likwid-topology or likwid-pin -p inside a job)

Here is the output of likwid-topology (two 16-core sockets, without SMT):

Hardware Thread Topology
********************************************************************************
Sockets: 2
Cores per socket: 16
Threads per core: 1
--------------------------------------------------------------------------------
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
--------------------------------------------------------------------------------

likwid-mpirun uses a subset of the SLURM_* environment variables (basically only SLURM_TASKS_PER_NODE but in some situations also SLURM_CPUS_ON_NODE and SLURM_CPUS_PER_TASK). For likwid-mpirun to work properly, you should tell likwid-mpirun where you want the processes and threads and not SLURM. If you are explicit with your process distribution (-nperdomain S:4 = four processes per socket), does it still pin the threads to "wrong" cores?

Yes, it continues to pin to the same cores.

See explanation below.
 

Another point is the cpusets which are often forced by SLURM. If likwid-mpirun is allowed to use only 4 cores, it can only schedule the processes to these cores. Moreover, SLURM does not want you to change the scheduling inside a job. So if you set ntasks-per-node to 8, you cannot change this setting inside the job without warnings. There is a patch for that in the current master branch, developed in collaboration with CINECA.

Interesting, I will definitely check this patch.



There is definitely a problem with likwid-mpirun on your machine. Although it detects cores 0-3 and 16-19 as usable in E:N:8:1, it uses only cores 0-3 later. It kind of surprises me that likwid-mpirun uses cores twice because there is a check for that. Consequently, it reads core 0 twice (memory traffic). So you have a low FLOP rate (don't know why as your kernel seems reasonable) and doubled memory data volume which results in a very low intensity.

Yes, you are right. However, the intensity value is still very low, even accounting for a doubled memory data volume.

Well, the FLOP rate for the individual processes is extremely low as well
DP [MFLOP/s]                        |       0.1969 |     1.0108 |     1.7574 |     0.0495 |     0.8381 |     1.9277 |     0.8498 |     1.5763 |
 

Exactly, but what I do not understand is why. For this kernel I expect, more or less, an intensity of about 0.06. The FLOP rate is given by the memory bandwidth and the computational intensity, so why is the C.I. so low in this case, while in the
case where I split the processes within the same socket (and there the pinning is correct) it is more or less in line with expectations (and also, in that case, the CPU frequency is set correctly on all the CPUs by SLURM_CPU_FREQ_REQ)?
Why? Do you have any suggestions? Note that on this system the task/cgroup plugin is not enabled. Moreover, replacing "pmi2" with "openmpi" (although I am fond of slurm) gives some bizarre (no offense, of course) outputs (I report just the events):

+-------------------------------------------+---------+------------+------------+------------+--------------+--------------+--------------+------------+------------+
|                   Event                   | Counter | node07:0:0 | node07:1:1 | node07:2:2 |  node07:3:3  |  node07:4:0  |  node07:5:1  | node07:6:2 | node07:7:3 |
+-------------------------------------------+---------+------------+------------+------------+--------------+--------------+--------------+------------+------------+
|                Region calls               |   CTR   |      10000 |      10000 |      10000 |        10000 |        10000 |        10000 |      10000 |      10000 |
|             INSTR_RETIRED_ANY             |  FIXC0  |  108576800 |  341357900 |  302815900 |         5795 |      6461717 | 1.844674e+19 |    2492546 |  307227100 |
|           CPU_CLK_UNHALTED_CORE           |  FIXC1  |   48901170 |  144400200 |  126492600 |        19790 |      2698055 | 1.844674e+19 |    1044222 |  123969500 |
|           CPU_CLK_UNHALTED_REF            |  FIXC2  |   49034160 |  144408100 |  126622800 |        16548 |      2679768 |        13020 |    1050336 |  123968700 |
|              PWR_PKG_ENERGY               |   PWR0  |   854.8477 |          0 |          0 |            0 |     850.5041 |            0 |          0 |          0 |
|              PWR_DRAM_ENERGY              |   PWR3  |   115.8163 |          0 |          0 |            0 |     115.6404 |            0 |          0 |          0 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE  |   PMC0  |          0 |          0 |          0 |            0 |            0 |            0 |          0 |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE    |   PMC1  |    3785385 |   15208360 |   14167850 | 1.844674e+19 |       360674 |           15 |     140107 |   15388460 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE  |   PMC2  |          0 |          0 |          0 |            0 |            0 |            0 |          0 |          0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE  |   PMC3  |        nan |        nan |        nan |          nan |          nan |          nan |        nan |        nan |
|               CAS_COUNT_RD                | MBOX0C0 |   11387430 |          0 |          0 |            0 |     11264370 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX0C1 |    1545163 |          0 |          0 |            0 |       724347 |            0 |          0 |          0 |
|               CAS_COUNT_RD                | MBOX1C0 |   11489690 |          0 |          0 |            0 |     11088720 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX1C1 |    1537717 |          0 |          0 |            0 |       717601 |            0 |          0 |          0 |
|               CAS_COUNT_RD                | MBOX2C0 |   11580810 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX2C1 |    1584951 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_RD                | MBOX3C0 |   11440630 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX3C1 |    1623920 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_RD                | MBOX4C0 |   11429260 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX4C1 |    1582884 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_RD                | MBOX5C0 |   11363280 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
|               CAS_COUNT_WR                | MBOX5C1 |    1509423 |          0 |          0 |            0 | 1.844674e+19 |            0 |          0 |          0 |
+-------------------------------------------+---------+------------+------------+------------+--------------+--------------+--------------+------------+------------+


 

I am sorry, but doesn't the "-f" option just force the writing of the registers?

Yes and no. For likwid-mpirun, "-f" skips some problematic settings (overloading, ...), but it is also forwarded to likwid-perfctr to force the writing of the registers.

The "-f" option gives me the following error:

ERROR: Processes requested exceeds maximally available slots of given hosts. Maximal processes: 1



 

Sorry to continue bothering you guys, but I am really stuck on this problem.


No problem, that's what the mailing list is for.

Best,
Thomas

Good afternoon and best regards,

Federico

Thomas Gruber

May 6, 2020, 12:40:55 PM
to likwid-users
I assume this is caused by the threads running on the same cores. Both processes start, stop, read and (maybe) reset the counters independently but affect the counts of the other process.
Similar reason as above: MPI processes running on the same hardware threads. You get the e+19 counts when the stop value is smaller than the start value (1.844674e+19 is roughly 2^64, i.e. the difference wraps around as an unsigned 64-bit value; LIKWID should catch this and return 0). The nans for the 512B events might be caused by a hardware bug on Skylake/Cascadelake when transactional memory is enabled in the BIOS. You should get an output telling you what to do in this case.
(for reference: https://groups.google.com/forum/#!topic/likwid-users/VlllTO0qv-c)


I have no clue how to prevent SLURM from putting the MPI processes in a cpuset, other than --cpu-bind=none. Since the SLURM cgroup module is not active on your site, I assume that's the default behavior of SLURM.
 

 

I am sorry, but doesn't the "-f" option just force the writing of the registers?

Yes and no. For likwid-mpirun, "-f" skips some problematic settings (overloading, ...), but it is also forwarded to likwid-perfctr to force the writing of the registers.

The "-f" option gives me the following error:

ERROR: Processes requested exceeds maximally available slots of given hosts. Maximal processes: 1

This is caused by the cpusets/nodesets of SLURM. You start 8 tasks with one process per task. likwid-mpirun respects the cpuset and "sees" only a single slot. Maybe you should try fewer SLURM options and let likwid-mpirun do its thing.
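
For example, a stripped-down batch file (just a sketch, keep only what you really need) could look like this:

#SBATCH -N 1
#SBATCH -n 8
likwid-mpirun -np 8 -nperdomain S:4 -g MEM_DP -m ./a.out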
SLURM is really a pain if you want to break out of it.

I would recommend applying the CINECA patch first because it might fix some of your problems. You can also clone the master branch directly to get the latest changes.

Best,
Thomas

Thomas Gruber

May 7, 2020, 12:53:26 PM
to likwid-users
I thought about it again today, but I could not come up with anything that can be tested remotely on the machine. So, if it's possible, could I get an account on your system for testing? If yes, please contact me by email (Thomas dot Gruber at fau dot de).

Best,
Thomas