Running MPI with Infiniband on Azure


Paul Edwards

Apr 25, 2018, 8:01:54 AM
to singularity
Hi,

I'm trying to use Singularity on Azure with their InfiniBand. I am using the CentOS 7.1 HPC image provided (which has the drivers and Intel MPI installed), and I built Singularity. I would just like to run the MPI benchmarks that come with Intel MPI, but I get a "no such device" error when I set it to use the DAPL driver. Below is my def file:

$ cat centos.def
Bootstrap: yum
OSVersion: 7
Include: yum

%runscript
source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh
exec "$@"

%files
intel.tgz /opt/intel.tgz

%post
yum install -y tar gzip libmlx4 librdmacm libibverbs dapl rdma net-tools numactl
cd /opt
tar zxf intel.tgz

The Intel tar file is taken from the host. I build the image (centos7.simg) and then run the following:

mpirun -np 2 \
-genv I_MPI_FALLBACK 0 \
-genv I_MPI_FABRICS dapl \
-genv I_MPI_DAPL_PROVIDER ofa-v2-ib0 \
-genv I_MPI_DYNAMIC_CONNECTION 0 \
./centos7.simg /opt/intel/impi/5.1.3.223/bin64/IMB-MPI1 Allreduce

The error I get is:

singularity:CMA:1b5b:67212b40: 71 us(71 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
singularity:CMA:1b5c:b8359b40: 77 us(77 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
[1] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

Note: this will all work if I use tcp rather than dapl.

I'm new to singularity and any help/pointers would be greatly appreciated.

Thanks,
Paul

John Hearns

Apr 25, 2018, 8:17:18 AM
to singu...@lbl.gov
This says that Intel MPI is installed: https://azure.microsoft.com/en-gb/blog/introducing-mpi-support-for-linux-on-azure-batch/

I would say run ibv_devinfo, which lists the Verbs-capable devices.
The last time I dealt with DAPL was on an Omni-Path system, and I forget most of the details (sorry).
As I remember, there is some subtlety with the /etc/dat.conf file.

I would start by looking at /etc/dat.conf, and apologies if I am leading you down the wrong path.
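Something along these lines would compare what the container sees against the host (a sketch only; the image name and dat.conf path are taken from your earlier message, so adjust as needed):

```shell
# Sketch: compare Verbs devices and the DAPL registry between host and container.
# centos7.simg is the image name from earlier in the thread.

# 1. Verbs-capable devices on the host vs. inside the container
ibv_devinfo
singularity exec centos7.simg ibv_devinfo

# 2. The DAT registry that DAPL consults (location varies by distro:
#    /etc/dat.conf or /etc/rdma/dat.conf)
diff /etc/rdma/dat.conf <(singularity exec centos7.simg cat /etc/rdma/dat.conf)
```

If the diff shows anything, the container is resolving a different set of DAPL providers than the host.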

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.

Paul Edwards

Apr 25, 2018, 8:43:00 AM
to singu...@lbl.gov
Hi John,

Thanks for the pointers, but it's still not working :(  ibv_devinfo shows exactly the same output for the host and the container. I have the correct driver set (using I_MPI_FABRICS), as this works on the host - I only get the error when running in the container.

Best regards,
Paul


John Hearns

Apr 25, 2018, 9:03:29 AM
to singu...@lbl.gov
Look closely at the /etc/dat.conf file which defines the DAPL devices.
When I worked with Omnipath I had to change that file as I recall.

Paul Edwards

Apr 25, 2018, 9:34:05 AM
to singu...@lbl.gov
Yes - I know about /etc/dat.conf (/etc/rdma/dat.conf in my case), but with Intel MPI you select the provider via the I_MPI_DAPL_PROVIDER environment variable (my mistake in the last email, where I said I_MPI_FABRICS). Anyway, both host and container have exactly the same dat.conf, and I'm using the same environment variable. Adding debug output shows that it is trying the right provider.
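For reference, the check amounts to confirming that the provider name passed via I_MPI_DAPL_PROVIDER is actually registered in the dat.conf the container sees (a sketch; the provider and image names are the ones from my mpirun command earlier in the thread):

```shell
# Sketch: check that the requested DAPL provider is registered inside the container.
# ofa-v2-ib0 and centos7.simg come from the mpirun command earlier in the thread.
singularity exec centos7.simg grep ofa-v2-ib0 /etc/rdma/dat.conf \
  && echo "provider registered in container" \
  || echo "provider missing in container"
```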

Rémy Dernat

Apr 25, 2018, 9:36:07 AM
to singu...@lbl.gov

Paul Edwards

Apr 25, 2018, 10:12:14 AM
to singu...@lbl.gov
It was /etc/rdma/dat.conf (many thanks, John). The copy inside the container was different from the one on the host - the Azure image must modify it from the original dapl RPM.

Thanks again :)
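In case it helps anyone else: instead of baking the file into the image, one way to keep the two copies in sync is to bind-mount the host's dat.conf into the container at run time. A sketch, untested, using the same paths and flags as my earlier command (singularity run passes the trailing arguments to the %runscript, which sources mpivars.sh before exec'ing them):

```shell
# Sketch: bind the host's DAT registry over the container's copy so the
# container always resolves the same DAPL providers as the host.
mpirun -np 2 \
  -genv I_MPI_FALLBACK 0 \
  -genv I_MPI_FABRICS dapl \
  -genv I_MPI_DAPL_PROVIDER ofa-v2-ib0 \
  -genv I_MPI_DYNAMIC_CONNECTION 0 \
  singularity run -B /etc/rdma/dat.conf:/etc/rdma/dat.conf \
    centos7.simg /opt/intel/impi/5.1.3.223/bin64/IMB-MPI1 Allreduce
```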