GASNet on Leonardo

Mario DI RENZO

Feb 2, 2024, 12:45:39 PM
to gasnet...@lbl.gov
Dear all,
I am a Legion user and I am trying to install GASNet on Leonardo, a new HPC system in Italy.
The machine's specifications are available at https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+LEONARDO+UserGuide

I’ve installed GASNet 2023.3.0 with the following options: --enable-segment-fast --enable-par --disable-seq --disable-parsync --enable-pthreads --disable-auto-conduit-detect --enable-ibv --enable-pshm --enable-mpi-compat --with-ibv-max-hcas=4 --enable-kind-cuda-uva=probe

When I run benchmark tests such as “testlarge” to check whether my installation works, I get the following output:
```
run -p boost_usr_prod --qos=boost_qos_dbg -N 2 --gres=gpu:4 -W 10 --exclusive ./testlarge -m -in 10000 4194304 B
Timer granularity: <= 0.006 us, overhead: ~ 0.008 us
=====> testlarge nprocs=2 config=RELEASE=2023.3.0,SPEC=0.16,CONDUIT=IBV(IBV-2.12/IBV-2.12),THREADMODEL=PAR,SEGMENT=FAST,PTR=64bit,CACHE_LINE_BYTES=64,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native,notiopt compiler=GNU/11.3.0 sys=x86_64-pc-linux-gnu
node 0/2 hostname is: lrdn3095.leonardo.local (supernode=0 pid=98564)
node 1/2 hostname is: lrdn3096.leonardo.local (supernode=1 pid=81609)
node 0/2 Running 10000 iterations of bulk put/get with local addresses inside the segment for sizes: 16...4194304

B:   0 -         16 byte :   10000 iters, throughput   43.759074 MB/sec (PutNBI+DEFER throughput)
B:   0 -         16 byte :   10000 iters, throughput   66.083972 MB/sec (GetNBI throughput)
B:   0 -         32 byte :   10000 iters, throughput   87.442917 MB/sec (PutNBI+DEFER throughput)
B:   0 -         32 byte :   10000 iters, throughput  133.206365 MB/sec (GetNBI throughput)
B:   0 -         64 byte :   10000 iters, throughput  175.287640 MB/sec (PutNBI+DEFER throughput)
B:   0 -         64 byte :   10000 iters, throughput  265.370245 MB/sec (GetNBI throughput)
B:   0 -        128 byte :   10000 iters, throughput  286.415562 MB/sec (PutNBI+DEFER throughput)
B:   0 -        128 byte :   10000 iters, throughput  529.130093 MB/sec (GetNBI throughput)
B:   0 -        256 byte :   10000 iters, throughput  565.926344 MB/sec (PutNBI+DEFER throughput)
B:   0 -        256 byte :   10000 iters, throughput 1064.721435 MB/sec (GetNBI throughput)
B:   0 -        512 byte :   10000 iters, throughput 1107.967438 MB/sec (PutNBI+DEFER throughput)
B:   0 -        512 byte :   10000 iters, throughput 2133.164045 MB/sec (GetNBI throughput)
B:   0 -       1024 byte :   10000 iters, throughput 2125.734654 MB/sec (PutNBI+DEFER throughput)
B:   0 -       1024 byte :   10000 iters, throughput 4242.235013 MB/sec (GetNBI throughput)
B:   0 -       2048 byte :   10000 iters, throughput 4050.445873 MB/sec (PutNBI+DEFER throughput)
B:   0 -       2048 byte :   10000 iters, throughput 8506.641986 MB/sec (GetNBI throughput)
B:   0 -       4096 byte :   10000 iters, throughput 7081.671501 MB/sec (PutNBI+DEFER throughput)
B:   0 -       4096 byte :   10000 iters, throughput   60.838666 MB/sec (GetNBI throughput)
B:   0 -       8192 byte :   10000 iters, throughput 13009.991674 MB/sec (PutNBI+DEFER throughput)
B:   0 -       8192 byte :   10000 iters, throughput 33879.011275 MB/sec (GetNBI throughput)
B:   0 -      16384 byte :   10000 iters, throughput 20462.283918 MB/sec (PutNBI+DEFER throughput)
B:   0 -      16384 byte :   10000 iters, throughput   87.669686 MB/sec (GetNBI throughput)
B:   0 -      32768 byte :   10000 iters, throughput 26130.947404 MB/sec (PutNBI+DEFER throughput)
B:   0 -      32768 byte :   10000 iters, throughput 3091.610605 MB/sec (GetNBI throughput)
B:   0 -      65536 byte :   10000 iters, throughput 34041.394336 MB/sec (PutNBI+DEFER throughput)
B:   0 -      65536 byte :   10000 iters, throughput 2771.028783 MB/sec (GetNBI throughput)
B:   0 -     131072 byte :   10000 iters, throughput 39429.688979 MB/sec (PutNBI+DEFER throughput)
B:   0 -     131072 byte :   10000 iters, throughput 6250.125003 MB/sec (GetNBI throughput)
B:   0 -     262144 byte :   10000 iters, throughput 42084.708100 MB/sec (PutNBI+DEFER throughput)
B:   0 -     262144 byte :   10000 iters, throughput 2720.546547 MB/sec (GetNBI throughput)
B:   0 -     524288 byte :   10000 iters, throughput 45004.095373 MB/sec (PutNBI+DEFER throughput)
B:   0 -     524288 byte :   10000 iters, throughput 2925.322941 MB/sec (GetNBI throughput)
B:   0 -    1048576 byte :   10000 iters, throughput 45655.220903 MB/sec (PutNBI+DEFER throughput)
B:   0 -    1048576 byte :   10000 iters, throughput 2663.579046 MB/sec (GetNBI throughput)
B:   0 -    2097152 byte :   10000 iters, throughput 44549.581457 MB/sec (PutNBI+DEFER throughput)
B:   0 -    2097152 byte :   10000 iters, throughput 2613.585443 MB/sec (GetNBI throughput)
B:   0 -    4194304 byte :   10000 iters, throughput 44293.305842 MB/sec (PutNBI+DEFER throughput)
B:   0 -    4194304 byte :   10000 iters, throughput 3805.424099 MB/sec (GetNBI throughput)
done.
```
There are some bandwidth outliers (in this run, GetNBI at 4096 and 16384 bytes) and, most notably, repeating the test gives very non-deterministic results.
Updating to a newer version of GASNet does not seem to make any difference.
When I use this GASNet installation with Legion in my application, I see very slow communication and occasional deadlocks.
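
For readers who want to flag such outliers automatically rather than by eye, the following is a minimal Python sketch. It assumes the testlarge output format shown above; the function names `parse_testlarge` and `flag_drops` and the 0.5 drop threshold are illustrative choices, not part of GASNet.

```python
import re
from collections import defaultdict

# Matches result lines such as:
# "B:   0 -       4096 byte :   10000 iters, throughput   60.838666 MB/sec (GetNBI throughput)"
LINE_RE = re.compile(
    r"-\s*(\d+)\s+byte\s*:\s*\d+\s+iters,\s*throughput\s+([\d.]+)\s+MB/sec\s+\((.+?) throughput\)"
)

def parse_testlarge(text):
    """Return {operation: [(size_bytes, mb_per_sec), ...]} in order of appearance."""
    results = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            size, mbps, op = int(m.group(1)), float(m.group(2)), m.group(3)
            results[op].append((size, mbps))
    return dict(results)

def flag_drops(results, ratio=0.5):
    """Flag points whose throughput is below `ratio` times that of the previous (smaller) size.

    The default ratio of 0.5 is an arbitrary illustrative threshold.
    """
    flagged = []
    for op, points in results.items():
        for (_, prev_bw), (size, bw) in zip(points, points[1:]):
            if bw < ratio * prev_bw:
                flagged.append((op, size, bw))
    return flagged
```

Calling `flag_drops(parse_testlarge(output_text))` on a saved run returns a list of `(operation, size, throughput)` tuples. Since throughput is normally expected to grow or plateau with message size, a sharp drop relative to the next smaller size is a reasonable first heuristic, though it will miss a slow result at the smallest size.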

I was wondering if you could suggest any test or configuration options to investigate what is going on and hopefully stabilize the performance of my installation of GASNet on Leonardo.

Thank you in advance for your help
Best regards
Mario

Paul H. Hargrove

Feb 2, 2024, 6:27:42 PM
to Mario DI RENZO, gasnet...@lbl.gov
Mario,

I am the maintainer for GASNet-EX's ibv-conduit.  However, I don't immediately have any good theories which would explain the results you have provided and the text about "very non-deterministic results".   This is not a pattern of behavior I have seen before.

In particular, I cannot think of anything which would negatively impact the Get operations in this test while leaving the Put operations running well.  So, first I'd like to ask if "very non-deterministic results" includes runs in which the Put results contain outliers (as opposed to only ever seeing outliers in Get performance).  While the answer won't lead directly to a solution, it might help to narrow (or widen) what we need to consider.

My second thought is that despite the `--exclusive` flag to your run command it might be possible that something else is running on one or both compute nodes, leading to interference with the test.  If you can run `top` on each compute node to confirm nothing else is using significant CPU time, then we can hopefully eliminate that possibility.  You should watch the output for tens of seconds, to catch things that might be waking only periodically.  Alternatively, you could re-run on two different compute nodes to see if the behavior remains (though it would remain if they also have stray processes running).
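
Where an interactive `top` is not practical, a non-interactive check along the same lines could be scripted. The sketch below is one possible approach in Python, assuming a Linux node with the standard `ps` utility; the helper names `parse_ps` and `busy_processes` and the 5% threshold are hypothetical. Note that `ps` reports `%CPU` as CPU time divided by process lifetime, not an instantaneous value as in `top`, so a process that wakes only periodically may show a low figure.

```python
import subprocess

def parse_ps(output, min_cpu=5.0):
    """Parse `ps -eo pid,pcpu,comm` output.

    Returns [(pid, cpu_percent, command)] for processes at or above min_cpu,
    sorted with the busiest process first.
    """
    busy = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, pcpu, comm = parts
        try:
            cpu = float(pcpu)
        except ValueError:
            continue  # skip rows whose CPU column is not a plain number
        if cpu >= min_cpu:
            busy.append((int(pid), cpu, comm.strip()))
    return sorted(busy, key=lambda row: row[1], reverse=True)

def busy_processes(min_cpu=5.0):
    """Run ps on the local node and return processes using at least min_cpu percent CPU."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out, min_cpu)
```

The script would have to be launched once per compute node from within the job, and sampling it several times over tens of seconds gives a better chance of catching intermittent activity.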

If interference from other processes does not turn out to be the issue, then perhaps you can provide me with access to run some tests?  If that is a possibility, then please email me directly (not as a reply to this list).

-Paul


--
Paul H. Hargrove <PHHar...@lbl.gov>
Pronouns: he, him, his
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department
Lawrence Berkeley National Laboratory

Mario DI RENZO

Feb 3, 2024, 6:50:18 AM
to Paul H. Hargrove, gasnet...@lbl.gov
Paul,

Thank you very much for the prompt reply.

I have also seen outliers in the Put operations, so I do not think that only Get performance is affected.
I cannot open an interactive session on each node, so it is difficult to run top and check CPU utilization.
However, I run the tests in different allocations on different nodes, so if another process is interfering with the tests, it must be running on every node I have used.
Moreover, I have been in contact with the system's support team, and they provided an MPI-based put/get benchmark that does not show this problem.
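
One way to put numbers on the variation across allocations is to compare the same message size over several runs. The following is a minimal Python sketch under that assumption; `run_to_run_spread` is a hypothetical helper, and each input dict is expected to map message size in bytes to throughput in MB/sec for a single run of one operation.

```python
from statistics import median

def run_to_run_spread(runs):
    """Summarize throughput variation across repeated runs.

    runs: list of {size_bytes: mb_per_sec} dicts, one per run of the same operation.
    Returns {size_bytes: (min, median, max, max_over_min)} for sizes present in every run.
    """
    common = set(runs[0]).intersection(*runs[1:]) if runs else set()
    spread = {}
    for size in sorted(common):
        values = [run[size] for run in runs]
        lo, hi = min(values), max(values)
        ratio = hi / lo if lo > 0 else float("inf")
        spread[size] = (lo, median(values), hi, ratio)
    return spread
```

A max-over-min ratio close to 1 at a given size indicates stable results, while a large ratio points to the kind of run-to-run variation described here.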

Best,
Mario

Paul H. Hargrove

Feb 5, 2024, 9:21:59 AM
to Mario DI RENZO, gasnet...@lbl.gov
Mario,

Thanks for the clarification that you also see Put performance impacted in some runs.
That makes the results slightly less "odd", but does not change the fact that I've never encountered this sort of behavior.

The fact that MPI-based testing does not see similar anomalies makes this even more odd in my mind.
The only thing that comes to mind is that if the MPI testing was for message passing, then a test of MPI RMA might be more representative of what GASNet's testlarge is measuring.  Examples of MPI RMA benchmarks include osu_put_bw and osu_get_bw from the OSU Micro-Benchmarks or IMB-RMA (subtests unidir_put and unidir_get) from the Intel MPI Benchmarks. 

-Paul