Hi all,
I am currently able to build GASNet-1.32.0 with the ofi conduit on top of libfabric/EFA, but not all tests pass. The configure command and the output of the failing test are shown below. It seems that when the test switches from medium to large message sizes, something is not properly implemented in AWS libfabric. I would like to know whether there is an 'easy' way either to fix what is missing inside libfabric or to work around its current limitations.
You can see from the test log that I already increased the maximum medium message size from 8192 to 65536, but I don’t think raising the medium size to a very large number is the real solution.
I can run more tests and provide logs if somebody is willing to help me debug this.
Thank you,
--Gabriel Tanase
This is how I configure GASNet:
./configure --prefix=/home/ec2-user/GASNET \
--enable-ofi \
--enable-force-ofi \
--with-ofihome=/home/ec2-user/LIBFABRIC \
--with-ofi-provider=efa \
--enable-pthreads \
--enable-par \
--enable-segment-fast \
--with-segment-mmap-max=4GB \
--disable-seq \
--disable-parsync \
--disable-ibv-rcv-thread \
--disable-aligned-segments \
--disable-pshm \
--disable-fca \
--disable-mxm
And this is the output of the failing test (testcore2):
WARNING: Using OFI provider (efa), which has not been validated to provide
WARNING: acceptable GASNet performance. You should consider using a more
WARNING: hardware-appropriate GASNet conduit. See ofi-conduit/README.
WARNING: Using GASNet's ofi-conduit, which exists for portability convenience.
WARNING: Support was detected for native GASNet conduits: ibv
WARNING: You should *really* use the high-performance native GASNet conduit
WARNING: if communication performance is at all important in this program run.
=====> testcore2 nprocs=2 config=RELEASE=1.32.0,SPEC=1.8,CONDUIT=OFI(OFI-0.5/OFI-0.5),THREADMODEL=PAR,SEGMENT=FAST,PTR=64bit,noalign,nopshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/7.2.1 sys=x86_64-unknown-linux-gnu
node 0/2 hostname is: compute-st-r5n24xlarge-1 (supernode=0 pid=77890)
OFI conduit: v0.5 GASNET_ALIGNED_SEGMENTS=0
gasnet_AMMaxArgs(): 16
gasnet_AMMaxMedium(): 65536
gasnet_AMMaxLongRequest(): 2147483647
gasnet_AMMaxLongReply(): 2147483647
Running multi-threaded AM correctness test with 10 iterations, max_payload=1048576, depth=16...
payload = 1
node 1/2 hostname is: compute-st-r5n24xlarge-2 (supernode=1 pid=77197)
payload = 2
payload = 4
payload = 8
payload = 16
payload = 32
payload = 64
payload = 128
payload = 256
payload = 512
payload = 1024
payload = 2048
payload = 4096
payload = 8192
payload = 16384
payload = 32768
payload = 65536
ERROR: node 0/2 TH0 data mismatch at sz=65536 iter=0 chunk=1 elem=8880 : actual=60 expected=10 in Long Request (at /home/ec2-user/GASNet-1.32.0/tests/testcore2.c:64)
ERROR: node 0/2 TH0 data mismatch at sz=65536 iter=0 chunk=1 elem=8881 : actual=61 expected=11 in Long Request (at /home/ec2-user/GASNet-1.32.0/tests/testcore2.c:64)
ERROR: node 0/2 TH0 data mismatch at sz=65536 iter=0 chunk=1 elem=8882 : actual=62 expected=12 in Long Request (at /home/ec2-user/GASNet-1.32.0/tests/testcore2.c:64)
ERROR: node 0/2 TH0 data mismatch at sz=65536 iter=0 chunk=1 elem=8883 : actual=63 expected=13 in Long Request (at /home/ec2-user/GASNet-1.32.0/tests/testcore2.c:64)
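In case it helps whoever looks at this, the mismatch lines can be condensed with a short script. This is only a minimal sketch and not part of GASNet: the function name `summarize` is hypothetical, and the regular expression assumes the exact testcore2 message format shown in the log above.

```python
import re
import sys
from collections import Counter

# Assumed line format, copied from the testcore2 ERROR lines in this post.
MISMATCH = re.compile(
    r"node (\d+)/\d+ TH(\d+) data mismatch at sz=(\d+) iter=(\d+) "
    r"chunk=(\d+) elem=(\d+) : actual=(\d+) expected=(\d+) in (.+?) \(at "
)


def summarize(lines):
    """Count mismatches by (AM category, payload size, actual - expected)."""
    counts = Counter()
    for line in lines:
        m = MISMATCH.search(line)
        if m is None:
            continue  # not a mismatch line (banner, "payload = N", ...)
        size = int(m.group(3))
        delta = int(m.group(7)) - int(m.group(8))
        counts[(m.group(9), size, delta)] += 1
    return counts


if __name__ == "__main__":
    # Reads the test log from standard input.
    for (kind, size, delta), n in sorted(summarize(sys.stdin).items()):
        print(f"{kind}: sz={size} actual-expected={delta:+d} count={n}")
```

For the four ERROR lines above, every element differs from the expected value by the same amount (+50), and all of them are in the Long Request category at sz=65536, so the corruption shows a constant offset rather than random values.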