intel-mpi and zen2

mart...@gmail.com

Apr 19, 2021, 3:59:48 PM4/19/21
to Spack
I keep getting an "Illegal instruction" error when running Spack-built applications that use target=zen2 with intel-mpi, on zen2 hardware. The same binary runs fine on zen1 and on Intel CPUs. An example of that is HPL, e.g.
spack install hpl%g...@10.2.0 target=zen2

This only happens with intel-mpi (which we have as our default). MPICH works fine, e.g.:
spack install hpl%g...@10.2.0^mpich target=zen2

Hand-built HPL with the Spack-installed intel-mpi runs fine as well. I have also changed the RPATH in Spack's HPL/intel-mpi binary to match the hand-built one, and compared the disassembly of the two via objdump; they are identical apart from some addressing differences.
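
For the objdump comparison, a small normalizer makes the "apart from some addressing differences" part mechanical. A sketch (`norm_asm` is a made-up helper name, and the xhpl paths in the usage comments are placeholders):

```shell
# norm_asm: strip the address and raw-byte columns from `objdump -d`
# output (they are tab-separated fields) and blank out absolute hex
# addresses in operands, so two builds of the same source diff cleanly
# despite different load addresses.
norm_asm() {
    cut -f3- | sed 's/0x[0-9a-f]*//g'
}
# Usage (placeholder paths):
#   objdump -d spack-build/xhpl | norm_asm > a.asm
#   objdump -d hand-build/xhpl  | norm_asm > b.asm
#   diff a.asm b.asm
```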

So I am really perplexed as to what could be causing this behavior, and am wondering if someone else has seen this as well.

Thanks,
MC

Gamblin, Todd

Apr 19, 2021, 4:51:36 PM4/19/21
to mart...@gmail.com, Spack
What type of node is this?  Is it a cloud instance type that we could try to reproduce?  If it’s bare metal zen2 then it’s probably a bug in the arguments we’re passing — but I believe we’re just using -march=znver2 -mtune=znver2 for gcc… so I would be surprised if that was the issue…

Are you linking your spack HPL installation to some external libraries that are maybe incompatible?

In the past we’ve found that cloud providers will sometimes disable particular instructions, which can result in this type of problem.
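
On bare metal it is also worth double-checking which ISA extensions the CPU actually advertises before blaming the compiler flags. A minimal Linux-only sketch (`has_flag` is a hypothetical helper name, not part of any tool mentioned here):

```shell
# has_flag FLAG: reads /proc/cpuinfo-style text on stdin and succeeds
# if the first "flags" line advertises FLAG (e.g. avx2, fma, bmi2 --
# extensions that -march=znver2 code may rely on).
has_flag() {
    grep -m1 '^flags' | tr ' ' '\n' | grep -qx "$1"
}
# Usage on a live system:
#   has_flag avx2 < /proc/cpuinfo && echo "avx2 present"
```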

-Todd




mart...@gmail.com

Apr 19, 2021, 6:06:13 PM4/19/21
to Spack
Hi Todd,

it's our cluster node, so bare metal. Pretty standard CentOS 7:
$ uname -a
Linux notch190 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

with a hand built gcc/10.2.0

and Spack installed intel-mpi:
$ spack find -dl hpl
-- linux-centos7-zen2 / g...@10.2.0 ------------------------------
qq5iquf hpl@2.3
uzs2iv2     inte...@2020.3.279
myqt2sr     inte...@2019.8.254

The dynamic libraries loaded by the two binaries (Spack-built, hand-built) are the same:
[u0101881@notch190 zen2]$ ldd /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/hpl-2.3-qq5iquf3hbvxd673iqox44mos3fejjpn/bin/xhpl
    linux-vdso.so.1 =>  (0x00007ffd63d7f000)
    libmkl_intel_lp64.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f34e6987000)
    libmkl_sequential.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_sequential.so (0x00007f34e4dc9000)
    libmkl_core.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_core.so (0x00007f34e0803000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f34e05e7000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f34e02e5000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f34e00e1000)
    libmpifort.so.12 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/libmpifort.so.12 (0x00007f34dfd22000)
    libmpi.so.12 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12 (0x00007f34deb06000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f34de8fe000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f34de531000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f34e7698000)
    libgcc_s.so.1 => /uufs/chpc.utah.edu/sys/installdir/gcc/10.2.0/lib64/libgcc_s.so.1 (0x00007f34de319000)
    libfabric.so.1 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/libfabric/lib/libfabric.so.1 (0x00007f34de0d7000)

[u0101881@notch190 zen2]$ ldd /uufs/chpc.utah.edu/common/home/u0101881/bench/hpl-2.3/bin/Zen_spack_mkl/xhpl
    linux-vdso.so.1 =>  (0x00007ffca0981000)
    libfabric.so.1 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/libfabric/lib/libfabric.so.1 (0x00007f5ccdb82000)
    libmkl_intel_lp64.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f5ccce71000)
    libmkl_sequential.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_sequential.so (0x00007f5ccb2b3000)
    libmkl_core.so => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mkl-2020.3.279-uzs2iv2buhobuzfty5zvme5ppodm6tlu/compilers_and_libraries_2020.3.279/linux/mkl/lib/intel64/libmkl_core.so (0x00007f5cc6ced000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5cc6ad1000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f5cc67cf000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f5cc65cb000)
    libmpifort.so.12 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/libmpifort.so.12 (0x00007f5cc620c000)
    libmpi.so.12 => /uufs/chpc.utah.edu/sys/spack/linux-centos7-zen2/gcc-10.2.0/intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254/linux/mpi/intel64/lib/release/libmpi.so.12 (0x00007f5cc4ff0000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f5cc4de8000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f5cc4a1b000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5ccddc4000)
    libgcc_s.so.1 => /uufs/chpc.utah.edu/sys/installdir/gcc/10.2.0/lib64/libgcc_s.so.1 (0x00007f5cc4803000)

(note intel-mpi's confusing packaging, intel-mpi-2019.8.254-myqt2sr3rpud2yuagvnvacshjo45na2d/compilers_and_libraries_2020.2.254, but I verified that the symbolic link does point to version 2019.8.254).

I also verified that Spack injects -march=znver2 -mtune=znver2 and nothing else.

Thanks,
MC

mart...@gmail.com

Apr 19, 2021, 6:12:20 PM4/19/21
to Spack
Ah, this may be an issue with intel-mpi/2019. I built with Spack using intel-mpi/2018.4.274 and it's working OK, no illegal instruction. I'll fish around the other intel-mpi versions to see which work and which don't. I should have thought of trying other intel-mpi versions sooner.

MC

Gamblin, Todd

Apr 19, 2021, 6:16:35 PM4/19/21
to mart...@gmail.com, Spack
It would be interesting to know, if you can find out, what the illegal instruction is.  I wouldn’t be surprised if Intel didn’t get the instructions right for AMD’s processors, so it may well be that.  Finding out the instruction would at least give us an idea of what target zen2 is being mixed up with.
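
If a core file is available, the faulting instruction can be recovered in batch mode. A sketch (the gdb commands are commented out since they need the real binary and core file; `faulting_symbol` is a made-up helper for pulling the `<symbol+offset>` part out of gdb's output):

```shell
# Recover the faulting instruction from a SIGILL core dump (sketch):
#   ulimit -c unlimited           # before reproducing the crash
#   gdb ./xhpl core -batch -ex 'x/i $pc' -ex 'info sharedlibrary'
# 'x/i $pc' disassembles the instruction at the faulting address;
# 'info sharedlibrary' shows which library that address falls in.
faulting_symbol() {
    # Extracts "symbol+offset" from a gdb disassembly line on stdin,
    # e.g. "=> 0x7fff <foo+42>:  (bad)" -> "foo+42"
    sed -n 's/.*<\([^>]*\)>.*/\1/p'
}
```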

-Todd


mart...@gmail.com

Apr 20, 2021, 4:34:33 PM4/20/21
to Spack
Looks like I found the reason for the problem: an MPI runtime/build incompatibility. I was too lazy to create a local module for Spack's intel-mpi/2019.8, so, while compiling with it, I ran with mpirun from a locally installed intel-mpi/2019.5, relying on the notion that minor intel-mpi versions are compatible with each other, which they have been so far.

Now I have created a module for the Spack-installed intel-mpi/2019.8, and the xhpl binary that gives the illegal instruction under the intel-mpi/2019.5 mpirun runs fine under the intel-mpi/2019.8 mpirun.
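
A quick check along these lines can catch a build/runtime MPI mismatch before it bites. A sketch assuming the usual intel-mpi layout (`<prefix>/intel64/bin/mpirun` and `<prefix>/intel64/lib/release/libmpi.so.12`); `build_mpi_prefix` is a hypothetical helper and the paths in the usage comments are illustrative:

```shell
# build_mpi_prefix: reads `ldd <binary>` output on stdin and prints the
# intel64 prefix the binary is RPATH'd against, by stripping the known
# library suffix off the resolved libmpi.so.12 path.
build_mpi_prefix() {
    awk '/libmpi\.so\.12/ {print $3}' | sed 's|/lib/release/libmpi\.so\.12$||'
}
# Usage:
#   built=$(ldd ./xhpl | build_mpi_prefix)
#   runtime=$(dirname "$(dirname "$(command -v mpirun)")")
#   [ "$built" = "$runtime" ] || echo "WARNING: built against $built, running with $runtime"
```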

Looking at the core dump in gdb:

 > 0x7f6bec7e1cf5 <fi_param_get_+293>      (bad)

There's no instruction starting at address 0x7f6bec7e1cf5 in the disassembly; the faulting address falls between instruction boundaries:
...
 0x7f6bec7e1cf0 <fi_param_get_+288>      xor    %eax,%eax
 0x7f6bec7e1cf2 <fi_param_get_+290>      jmpq   0x7f6bec7e1c10 <fi_param_get_+64>
 0x7f6bec7e1cf7 <fi_param_get_+295>      movzbl (%rbx),%r12d
 0x7f6bec7e1cfb <fi_param_get_+299>      movzbl 0x24de7(%rip),%eax        # 0x7f6bec806ae9
...

so it looks to me like a bad program reference caused by the mixup of the two intel-mpi versions, with libfabric in particular. It seems that even minor intel-mpi releases change libfabric, since with other intel-mpi versions I get different errors, all related to libfabric.

There is probably still some Spack involvement through the RPATH, since the hand-built binary does run fine, but it's not worth digging deeper as the root cause is mixing different intel-mpi versions.

MC