Anomalous MEM measurement with 2-rank MPI Sendrecv case


Nichols A. Romero

Sep 27, 2023, 5:59:25 PM
to likwid...@googlegroups.com
Hi,

I am using LIKWID on a dual socket Cascade Lake machine.

I have used likwid-mpirun without issue for the most part, but I have stumbled upon a test case whose results I do not understand.

If I run the test case as follows, I get reasonable results:
likwid-perfctr -C S1 -m -g MEM  mpirun -np 2 ./sendrecv.likwid.x

If I run the test case through likwid-mpirun, the memory metrics always come out close to zero:
likwid-mpirun -np 2 -m -g MEM  ./sendrecv.likwid.x

I have read through this wiki:

and have tried different values of the `-nperdomain` flag, but I always get something close to zero.

I am using MPICH, which is not officially supported.

On other proxy apps, `likwid-mpirun -np N -m -g MEM <bin>` gives reasonable results. The main difference is that those runs use at least 8 MPI ranks.

I appreciate any insight from the LIKWID user community.


-- sendrecv.c --
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <unistd.h>
#include "likwid.h"

/**
 * @brief Paired communication between two MPI processes, each sending a message to the other.
 **/
int main(int argc, char* argv[])
{
    int ierr;
 
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_REGISTER("main");
    LIKWID_MARKER_REGISTER("initial");
    LIKWID_MARKER_REGISTER("sendrecv");
    LIKWID_MARKER_START("main");
   
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
   
 
    // Make sure exactly 2 MPI processes are used
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size != 2)
    {
        printf("%d MPI processes used, please use 2.\n", size);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
 
    // Prepare parameters
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    const int buffer_size = 100000000;
    double *buffer_send;
    double *buffer_recv;
    int tag_send = 0;
    int tag_recv = tag_send;
    int peer = (my_rank == 0) ? 1 : 0;
    int init_value;

    // Allocate memory
    buffer_send = (double *)malloc(buffer_size * sizeof(double));
    buffer_recv = (double *)malloc(buffer_size * sizeof(double));
   
    // Initialize
    init_value = (my_rank == 0) ? 12345 : 67890;
    LIKWID_MARKER_START("initial");
    for (int i = 0; i < buffer_size; i++) {
      buffer_send[i] = (double)(init_value + i);
    }
    LIKWID_MARKER_STOP("initial");
   
    if ((buffer_send == NULL) || (buffer_recv == NULL)) {
      printf("Out of memory. \n");
      exit(1);
    }
   
    // Issue the send + receive at the same time
    // printf("MPI process %d sends value %f to MPI process %d.\n", my_rank, buffer_send[0], peer);
    LIKWID_MARKER_START("sendrecv");
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Sendrecv(buffer_send, buffer_size, MPI_DOUBLE, peer, tag_send,
                 buffer_recv, buffer_size, MPI_DOUBLE, peer, tag_recv, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // MPI_Sendrecv(buffer_send, 8*buffer_size, MPI_BYTE, peer, tag_send,
    //     buffer_recv, 8*buffer_size, MPI_BYTE, peer, tag_recv, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // memcpy(buffer_recv, buffer_send, 8*buffer_size);
    MPI_Barrier(MPI_COMM_WORLD);
    LIKWID_MARKER_STOP("sendrecv");
    // printf("MPI process %d received value %f from MPI process %d.\n", my_rank, buffer_recv[0], peer);

    MPI_Barrier(MPI_COMM_WORLD);

    printf("made it here \n");

    fflush(NULL);

    MPI_Finalize();

    LIKWID_MARKER_STOP("main");
    LIKWID_MARKER_CLOSE;

    exit(0);
}
-- likwid.h --
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif
--
Nichols A. Romero, Ph.D.

Nichols A. Romero

Sep 27, 2023, 8:56:53 PM
to likwid...@googlegroups.com
Here is the debug output section from doing:
likwid-mpirun -d -np 2 -g MEM -m ./sendrecv.x

DEBUG: Executable given on commandline: ./sendrecv.x
DEBUG: Using MPI implementation mvapich2
WARN: Cannot extract OpenMP vendor from executable or commandline, assuming no OpenMP
DEBUG: Switch to perfctr mode, there are 1 eventsets given on the commandline
DEBUG: Working on host localhost with 32 slots and 32 slots maximally
DEBUG: NperDomain string E:N:2:1 covers the domains: N
DEBUG: Resolved NperDomain string E:N:2:1 to CPUs: [0] [1]
DEBUG: Process 1 runs on CPUs 0
DEBUG: Process 2 runs on CPUs 1
DEBUG: Assign 2 processes with 2 per node and 1 threads per process to 1 hosts
DEBUG: Add Host localhost with 2 slots to host list
DEBUG: Scheduling on hosts:
DEBUG: Host localhost with 2 slots (max. 32 slots)
DEBUG: Process 0 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1
DEBUG: Process 1 measures with event set: INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2
EXEC (Rank 0): /sal/home/n.a.romero/spack/opt/spack/linux-ubuntu20.04-cascadelake/gcc-9.4.0/likwid-5.2.2-vxj743a2ir5ceedldjnds5vctiyriz47/bin/likwid-perfctr -m  -C 0 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1 -o /sal/home/n.a.romero/test/.output_1642703_%r_%h.csv ./sendrecv.x
EXEC (Rank 1): /sal/home/n.a.romero/spack/opt/spack/linux-ubuntu20.04-cascadelake/gcc-9.4.0/likwid-5.2.2-vxj743a2ir5ceedldjnds5vctiyriz47/bin/likwid-perfctr -m  -C 1 -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2 -o /sal/home/n.a.romero/test/.output_1642703_%r_%h.csv ./sendrecv.x
EXEC: /sal/home/n.a.romero/spack/opt/spack/linux-ubuntu20.04-cascadelake/gcc-9.4.0/mpich-4.1.1-uevczbo7dxfcliexy5yy2k77vdc7gw3r/bin/mpirun -f /sal/home/n.a.romero/test/.hostfile_1642703.txt -np 2 -ppn 2 -genv MV2_ENABLE_AFFINITY 0   /sal/home/n.a.romero/test/.likwidscript_1642703.txt

Thomas Gruber

Sep 30, 2023, 11:33:34 AM
to likwid-users
Hi Nichols,

Why are you so sure that the measurements are anomalous? What values have you measured?
Some remarks:
- I would not measure the first init loop after malloc. This is known to behave strangely due to page faults, copy-on-write with the zero page, and so on.
- buffer_recv is first written inside MPI_Sendrecv; again, strange things could occur there.
- Potential segfault: you malloc two arrays and write to them before checking whether the mallocs succeeded (see the sketch below).
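
Something along these lines would cover all three points (an untested sketch, reusing the variable names from your sendrecv.c):

    // Allocate memory and check the pointers *before* touching the buffers
    buffer_send = (double *)malloc(buffer_size * sizeof(double));
    buffer_recv = (double *)malloc(buffer_size * sizeof(double));
    if ((buffer_send == NULL) || (buffer_recv == NULL)) {
      printf("Out of memory.\n");
      MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    // Touch both buffers once outside any marker region so that first-touch
    // page faults and zero-page copy-on-write do not end up in the measurement
    init_value = (my_rank == 0) ? 12345 : 67890;
    for (int i = 0; i < buffer_size; i++) {
      buffer_send[i] = (double)(init_value + i);
      buffer_recv[i] = 0.0;
    }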

Your code on an Intel Cascade Lake SP system, for the sendrecv region:
+-----------------------------------+-----------------+-----------------+
|               Metric              | casclakesp2:0:0 | casclakesp2:1:1 |
+-----------------------------------+-----------------+-----------------+
|        Runtime (RDTSC) [s]        |          0.1705 |          0.1818 |
|        Runtime unhalted [s]       |          0.0001 |          0.0121 |
|            Clock [MHz]            |       3884.9374 |       3803.5545 |
|                CPI                |         14.3660 |          0.5525 |
|  Memory read bandwidth [MBytes/s] |      11103.9158 |               0 |
|  Memory read data volume [GBytes] |          1.8937 |               0 |
| Memory write bandwidth [MBytes/s] |       9364.2436 |               0 |
| Memory write data volume [GBytes] |          1.5970 |               0 |
|    Memory bandwidth [MBytes/s]    |      20468.1593 |               0 |
|    Memory data volume [GBytes]    |          3.4908 |               0 |
+-----------------------------------+-----------------+-----------------+

Each of the two MPI processes reads 100,000,000 elements (800 MB) from buffer_send and writes 100,000,000 elements (800 MB) to buffer_recv, so the two processes together move at least 2 × 1.6 GB = 3.2 GB of data.

Have a nice weekend,
Thomas

Nichols A. Romero

Sep 30, 2023, 11:54:46 PM
to likwid...@googlegroups.com
Hi Thomas,

Thank you for your insightful comments on my test case. I will make it a point to clean it up.

Also, thank you for running the test case on your own Cascade Lake system. Your results look sensible and are aligned with my expectations.

So I am perplexed by what I am seeing on my dual-socket Cascade Lake machine.

Let us focus on the sendrecv region only.

Here is the output of:

likwid-mpirun -np 2 -m -g MEM ./sendrecv.likwid.x

(My apologies for the atrocious formatting.)

+-----------------------------------+------------------+------------------+
|               Metric              | sal-nplnpl04:0:0 | sal-nplnpl04:1:1 |
+-----------------------------------+------------------+------------------+
|        Runtime (RDTSC) [s]        |           0.7268 |           0.7263 |
|        Runtime unhalted [s]       |           0.3986 |           0.4000 |
|            Clock [MHz]            |        3396.2627 |        3398.4766 |
|                CPI                |          11.0939 |           8.7243 |
|  Memory read bandwidth [MBytes/s] |         638.2525 |                0 |
|  Memory read data volume [GBytes] |           0.4639 |                0 |
| Memory write bandwidth [MBytes/s] |         177.4336 |                0 |
| Memory write data volume [GBytes] |           0.1290 |                0 |
|    Memory bandwidth [MBytes/s]    |         815.6861 |                0 |
|    Memory data volume [GBytes]    |           0.5929 |                0 |
+-----------------------------------+------------------+------------------+

Also, if I try:
likwid-mpirun -np 2 -nperdomain S:2 -m -g MEM ./sendrecv.likwid.x
+-----------------------------------+------------------+------------------+
|               Metric              | sal-nplnpl04:0:0 | sal-nplnpl04:1:1 |
+-----------------------------------+------------------+------------------+
|        Runtime (RDTSC) [s]        |           0.7202 |           0.7309 |
|        Runtime unhalted [s]       |           0.3961 |           0.4126 |
|            Clock [MHz]            |        3420.0206 |        3420.6251 |
|                CPI                |           8.9967 |           5.1744 |
|  Memory read bandwidth [MBytes/s] |         630.9186 |                0 |
|  Memory read data volume [GBytes] |           0.4544 |                0 |
| Memory write bandwidth [MBytes/s] |         165.3840 |                0 |
| Memory write data volume [GBytes] |           0.1191 |                0 |
|    Memory bandwidth [MBytes/s]    |         796.3026 |                0 |
|    Memory data volume [GBytes]    |           0.5735 |                0 |
+-----------------------------------+------------------+------------------+

And another attempt:

likwid-mpirun -np 2 -nperdomain S:1  -g MEM -m ./sendrecv.x


The result looks sensible, but only for sal-nplnpl04:0:0:


+-----------------------------------+------------------+------------------+
|               Metric              | sal-nplnpl04:0:0 | sal-nplnpl04:1:8 |
+-----------------------------------+------------------+------------------+
|        Runtime (RDTSC) [s]        |           0.8881 |           0.9138 |
|        Runtime unhalted [s]       |           0.6408 |           0.8429 |
|            Clock [MHz]            |        3488.1405 |        3481.5082 |
|                CPI                |          12.5748 |           2.0762 |
|  Memory read bandwidth [MBytes/s] |        1415.8138 |        5487.5057 |
|  Memory read data volume [GBytes] |           1.2574 |           5.0144 |
| Memory write bandwidth [MBytes/s] |        1983.3906 |        5438.4525 |
| Memory write data volume [GBytes] |           1.7615 |           4.9695 |
|    Memory bandwidth [MBytes/s]    |        3399.2044 |       10925.9582 |
|    Memory data volume [GBytes]    |           3.0188 |           9.9839 |
+-----------------------------------+------------------+------------------+


Thanks for your help.





Thomas Gruber

Oct 3, 2023, 8:05:49 AM
to likwid-users
Hi,

The problem is that you don't know how the communication is implemented inside the MPI library. Your last example might be valid if the library creates two separate buffers on S1, copies all the data to S1 first, and then distributes it back to the real arrays. There might also be a separate progress thread running on S1 that takes care of the transfers. In your first two examples, I would check the L2 and L3 groups to see whether the traffic shows up there. For the inter-socket test, you could also measure UPI/QPI data transfers.
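
For example, something along these lines (group names as on my system; `likwid-perfctr -a` lists what is available on yours):

likwid-mpirun -np 2 -m -g L3 ./sendrecv.likwid.x
likwid-mpirun -np 2 -nperdomain S:1 -m -g UPI ./sendrecv.likwid.x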

In fact, in your example code, it is safe for the compiler to remove the communication completely because the arrays are not accessed afterwards. Maybe add an access there, like a runtime-dependent but always-false condition (e.g. if (buffer_recv[buffer_size >> 1] < 0)).
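
A minimal sketch of such a guard (the exact check does not matter, as long as its outcome depends on the received data):

    // After the sendrecv region: consume the received data so the compiler
    // cannot treat the receive buffer as dead
    if (buffer_recv[buffer_size >> 1] < 0.0) {
      printf("unexpected value %f\n", buffer_recv[buffer_size >> 1]);
    }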

Best,
Thomas

Nichols A. Romero

Oct 3, 2023, 2:42:02 PM
to likwid...@googlegroups.com
Hi Thomas,

Thank you for the analysis. BTW, is your Cascade Lake a single-socket system?

Here is what I get for L2:
likwid-mpirun -np 2 -m -g L2 ./sendrecv.likwid.x

For the sendrecv region, it is higher than I would expect but still within reason:
+--------------------------------+------------------+------------------+
|             Metric             | sal-nplnpl04:0:0 | sal-nplnpl04:1:1 |
+--------------------------------+------------------+------------------+
|       Runtime (RDTSC) [s]      |           0.7565 |           0.7573 |
|      Runtime unhalted [s]      |           0.4407 |           0.4422 |
|           Clock [MHz]          |        3401.4026 |        3398.5831 |
|               CPI              |          12.7456 |           5.2816 |
|  L2D load bandwidth [MBytes/s] |        3930.3812 |        4007.1689 |
|  L2D load data volume [GBytes] |           2.9732 |           3.0346 |
| L2D evict bandwidth [MBytes/s] |        1714.6305 |        1789.3433 |
| L2D evict data volume [GBytes] |           1.2971 |           1.3551 |
|     L2 bandwidth [MBytes/s]    |        5750.8499 |        5907.1241 |
|     L2 data volume [GBytes]    |           4.3503 |           4.4734 |
+--------------------------------+------------------+------------------+

For the inter-socket test, the numbers look larger than expected but believable:
likwid-mpirun -np 2 -nperdomain S:1  -g UPI -m ./sendrecv.x

+-----------------------------------+------------------+------------------+
|               Metric              | sal-nplnpl04:0:0 | sal-nplnpl04:1:8 |
+-----------------------------------+------------------+------------------+
|        Runtime (RDTSC) [s]        |           0.8944 |           0.9837 |
|        Runtime unhalted [s]       |           0.6564 |           0.9430 |
|            Clock [MHz]            |        3480.2619 |        3477.8602 |
|                CPI                |          14.4162 |           1.5735 |
| Received data bandwidth [MByte/s] |        2201.4771 |        2302.3332 |
|    Received data volume [GByte]   |           1.9690 |           2.2647 |
|   Sent data bandwidth [MByte/s]   |        2415.8236 |        2010.8707 |
|      Sent data volume [GByte]     |           2.1607 |           1.9780 |
|   Total data bandwidth [MByte/s]  |        4617.3007 |        4313.2039 |
|     Total data volume [GByte]     |           4.1297 |           4.2427 |
+-----------------------------------+------------------+------------------+

This looks similar to what one gets from the L2 group.

For kicks, I created a different test where I don't have MPI at all and just do a memcpy from the send buffer to the recv buffer. Furthermore, I simply ran:
mpirun -np 2 memcopy.x (again, there is no MPI in the code; I am just running two independent copies).
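
In essence, the measured part of that test is just the following (a simplified sketch, not the exact code; the region name is illustrative and memcpy needs <string.h>):

    // memcopy.x: same buffers and sizes as in sendrecv.c, but no MPI at all
    LIKWID_MARKER_START("memcopy");
    memcpy(buffer_recv, buffer_send, (size_t)buffer_size * sizeof(double));
    LIKWID_MARKER_STOP("memcopy");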

I have the same issue: I don't see the data volume in the `MEM` group, but I do see it in the L2 group, and there it corresponds exactly to the size of the buffer:
+--------------------------------+------------------+------------------+
|             Metric             | sal-nplnpl04:0:0 | sal-nplnpl04:1:1 |
+--------------------------------+------------------+------------------+
|       Runtime (RDTSC) [s]      |           1.0830 |           1.0797 |
|      Runtime unhalted [s]      |           0.9252 |           0.9217 |
|           Clock [MHz]          |        3409.1523 |        3408.2470 |
|               CPI              |          21.7268 |          21.6472 |
|  L2D load bandwidth [MBytes/s] |         761.1532 |         759.6254 |
|  L2D load data volume [GBytes] |           0.8243 |           0.8202 |
| L2D evict bandwidth [MBytes/s] |          31.5678 |          32.7180 |
| L2D evict data volume [GBytes] |           0.0342 |           0.0353 |
|     L2 bandwidth [MBytes/s]    |         792.7889 |         792.4088 |
|     L2 data volume [GBytes]    |           0.8586 |           0.8556 |
+--------------------------------+------------------+------------------+

As you alluded to earlier, I guess that I don't really know what MPI is doing under the hood.

Thomas Gruber

Oct 9, 2023, 7:11:39 AM
to likwid-users
Hi,

The test system is a dual-socket CLX (Cascade Lake SP) system.

In your memcopy case, have you compiled with -ffreestanding? Memory copy loops are often replaced by compilers with calls to memcpy(), and the memcpy() implementations are often not cleanly measurable: they use non-temporal stores and other tricks.
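
For illustration, this is the kind of loop the transformation applies to (a sketch; the exact behaviour depends on compiler and optimization level, and with GCC -fno-tree-loop-distribute-patterns is the switch for this particular transformation):

    // With optimization enabled, GCC may recognize this copy loop and replace
    // it with a call to the library memcpy(); -fno-tree-loop-distribute-patterns
    // disables that transformation (the -ffreestanding route mentioned above
    // is the broader hammer)
    for (int i = 0; i < buffer_size; i++) {
      buffer_recv[i] = buffer_send[i];
    }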

Best,
Thomas

Nichols A. Romero

Oct 13, 2023, 8:57:43 AM
to likwid...@googlegroups.com
Hello Thomas,

Sorry for the delay; I am on vacation and will reply when I get back next week.


Sent from Gmail Mobile

