[slurm-dev] OpenMPI PMI2 with 14.03 not working

[slurm-dev] OpenMPI PMI2 with 14.03 not working Anthony Alba 4/11/14 11:58 AM
Not sure whether this is a SLURM or an OMPI issue, so I'm starting here.

The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2.
https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue

I have built both OpenMPI 1.7.5 and 1.8 against slurm 14.03 with pmi2.

When I launch openmpi/examples/hello_c on a single-node allocation with srun --mpi=pmi2 -N 1 hello_c, I get:

srun -N 1 --mpi=pmi2 hello_c
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


With --slurmd-debug=9 (I'm not sure what "ip 111.110.61.48 sd 14" below means.
Is that "ip" as in IP address? It is not the IP address of any node in my partition):

slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: _tree_listen_read: accepted tree connection: ip 111.110.61.48 sd 14
slurmstepd: _handle_accept_rank: going to read() client rank
slurmstepd: _handle_accept_rank: got client rank 1478164480 on fd 14
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
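
On the "ip 111.110.61.48" question above, one quick check is to look at the four octets as raw bytes rather than as an address. The sketch below does only that; the helper name octets_as_text is made up for illustration. The octets turn out to be printable ASCII, which would be consistent with the field holding something other than a real address, but nothing in this thread confirms why.

```python
def octets_as_text(dotted_quad):
    """Interpret the four octets of a dotted-quad string as ASCII bytes.

    Hypothetical helper for illustration; it is not part of Slurm or OpenMPI.
    """
    octets = [int(part) for part in dotted_quad.split(".")]
    # Non-ASCII byte values are replaced rather than raising an error.
    return bytes(octets).decode("ascii", errors="replace")


# The value printed by slurmstepd in the debug log above.
print(octets_as_text("111.110.61.48"))  # prints: on=0
```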

Launching with salloc/sbatch works.

- Anthony

[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working David Bigagli 4/11/14 12:09 PM

Hi,
    this Slurm bug has been fixed, and the fix will be in 14.03.1,
which will be released soon. Until then, it is available in HEAD.
You should find a slurmstepd core file in the directory where you
ran the srun command.
--

Thanks,
       /David/Bigagli

www.schedmd.com
[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working David Bigagli 4/11/14 12:25 PM

Correction: the core file is in the log directory.
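
To track down that log directory, one option is to read it out of slurm.conf. This is a minimal sketch that assumes "the log directory" means the directory of the SlurmdLogFile setting, which the message does not actually state. The function name and the sample path are hypothetical.

```python
import posixpath


def slurmd_log_dir(conf_text):
    """Return the directory of SlurmdLogFile from slurm.conf text, or None.

    Assumption: the core file lands next to the slurmd log file.
    """
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        # slurm.conf keywords are case-insensitive.
        if sep and key.strip().lower() == "slurmdlogfile":
            return posixpath.dirname(value.strip())
    return None


# Hypothetical slurm.conf fragment; the path is only an example.
sample = "# logging\nSlurmdLogFile=/var/log/slurm/slurmd.log\n"
print(slurmd_log_dir(sample))  # prints: /var/log/slurm
```

On a live system the same value can be read from the running configuration with "scontrol show config" instead of parsing the file.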
[slurm-dev] Re: OpenMPI PMI2 with 14.03 not working Anthony Alba 4/11/14 1:01 PM

Thanks David, that explains it. I'll watch out for the 14.03.1 tag.

I'll revert to the 2.6.9 tag in the meantime, as pmi2 seems to work there with OpenMPI 1.8.

- Anthony