Re: [slurm-dev] slurm controller died when executing regression test15.17

Danny Auble

Jul 23, 2008, 11:40:58 AM
to slur...@lists.llnl.gov
Do you happen to have a backtrace of the controller from when it died? If not, could you run the controller with the -D option, see if you can get a core file, and send us the backtrace?
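
Something like this should do it (the binary path is an assumption; the
core file name may also differ on your system):

$ ulimit -c unlimited
$ slurmctld -D
(reproduce the crash; the core file lands in the working directory)
$ gdb /usr/sbin/slurmctld core
(gdb) bt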

> When I ran test15.17 of the regression test suite, the slurm controller
> died (this can be reproduced). Could someone please look into it to
> find the cause?
> I have a cluster of 12 JS22 blade nodes (ppc64).
>
> Here is the output of test15.17:
>
> ============================================
> TEST: 15.17
> spawn /usr/bin/salloc -N1-4 -t1 /bin/bash
> salloc: Granted job allocation 22
> /usr/bin/sbatch --jobid=22 -o none -e none test15.17.input
> cluster-ib-1:~/SLURM/testsuite/expect # /usr/bin/sbatch --jobid=22 -o none -e none test15.17.input
> sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
> sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
> cluster-ib-1:~/SLURM/testsuite/expect #
> FAILURE: salloc not responding
> cancelling 22
> test15.17 FAILURE
> ============================================
>
> And the tail of slurmctld.log at the time it died:
> ==============================
> cluster-ib-1:~/SLURM/testsuite/expect # tail /tmp/slurm/slurmctld.log
> [Jul 23 09:12:53] completing job 21
> [Jul 23 09:12:53] job_complete for JobId=21 successful
> [Jul 23 09:12:54] _slurm_rpc_allocate_resources JobId=22
> NodeList=cluster-ib-[1-4] usec=137
> [Jul 23 09:12:54] user 0 attempting to run batch script within an existing
> job
> ==============================
>
>
> Regards,
>
> Hien Nguyen
> Linux Technology Center (Austin)
> Phone: (512) 838-4140 Tie Line: 678-4140
> e-mail: hi...@us.ibm.com
>

Jeff Squyres

Jul 23, 2008, 9:48:50 AM
to slur...@lists.llnl.gov
Greetings Dirk.

Unfortunately, Open MPI does not yet support using srun to launch MPI
processes without our mpirun. We OMPI developers have talked about
it, and wavered back and forth on whether we're going to do it or
not. So far, we haven't done it. I can't predict what will happen in
future versions of OMPI, but I can tell you that support for "srun ...
ompi_mpi_executable" will *not* be included in the upcoming OMPI
v1.3. Sorry. :-(

However, there are two simple workarounds:

1. s/srun/salloc/ and use our mpirun, meaning:

$ salloc -N 2 -n 2 mpirun ./helloMPIworld

2. Or put the mpirun command in a script and launch it via sbatch:

$ cat > myscript <<EOF
mpirun ./helloMPIworld
EOF
$ chmod +x myscript
$ sbatch -N 2 -n 2 myscript

To be absolutely clear: Open MPI *does* use SLURM support under the
covers to effect mpirun's launching of remote processes, etc. We just
don't support "srun ... helloMPIworld".

Hope that helps.

On Jul 23, 2008, at 9:16 AM, Dirk Eddelbuettel wrote:

>
> I am preparing some example scripts for a presentation that covers,
> among other things, OpenMPI and then slurm. I hit a small conceptual
> snag with srun.
>
> Using a 'hello World' mpi example, I get the correct rank/size
> output using
> OpenMPI's orterun:
>
> $ orterun -n 2 -H ron,mccoy ./helloMPIworld
> Hello, rank 0, size 2 on processor ron
> Hello, rank 1, size 2 on processor mccoy
>
> Two hosts, total size 2, ranks 0 and 1 -- and I was thinking I could
> get that
> too using srun. But I always end up with rank 0 and size 1.
>
> $ srun -m arbitrary -w ron,mccoy -n2 helloMPIworld
> Hello, rank 0, size 1 on processor ron
> Hello, rank 0, size 1 on processor mccoy
> $ srun -N 2 -n 2 ./helloMPIworld
> Hello, rank 0, size 1 on processor ron
> Hello, rank 0, size 1 on processor mccoy
>
> Am I simply misunderstanding how this is supposed to work?
>
> I was under the impression that srun does what orterun does (plus of
> course a
> slew of other things). Are the different MPI instances that are
> launched
> aware of each other or not?
>
> I am using Debian with version 1.2.7 of OpenMPI and 1.3.4 of slurm.
>
> Dirk
>
> --
> Three out of two people have difficulties with fractions.


--
Jeff Squyres
Cisco Systems

Manuel Prinz

Jul 23, 2008, 10:27:06 AM
to slur...@lists.llnl.gov
Hi Jeff and Dirk!

On Wednesday, 23.07.2008, at 09:48 -0400, Jeff Squyres wrote:
> 2. Or put the mpirun command in a script and launch it via sbatch:
>
> $ cat > myscript <<EOF
> mpirun ./helloMPIworld
> EOF
> $ chmod +x myscript
> $ sbatch -N 2 -n 2 myscript

This is how I use OpenMPI via SLURM and can confirm that it works as
expected. Changing the executable bits is optional; sbatch will happily
execute the script without it being executable. TTBOMK, a shebang is
mandatory, though.

Another nice thing is that you can pass sbatch parameters using lines
starting with "#SBATCH", e.g.

$ cat >myscript <<EOF
#!/bin/sh
#SBATCH -n 2
#SBATCH -N 2
#SBATCH --mail-type=ALL
mpirun ./helloMPIworld
EOF
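
The script can then be submitted without repeating the options on the
command line:

$ sbatch myscript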

Best regards
Manuel

jet...@llnl.gov

Jul 23, 2008, 11:48:53 AM
to slur...@lists.llnl.gov
Try "slurmctld -Dvvvvvv" (the "v" for extra verbose logging)
Also send your slurm.conf file so we have some idea what your
configuration looks like.
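
For example, to capture the verbose output in a file (the log path is
just an illustration):

$ slurmctld -Dvvvvvv 2>&1 | tee /tmp/slurmctld.debug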

Hien Nguyen

Jul 23, 2008, 12:27:47 PM
to slur...@lists.llnl.gov, owner-s...@lists.llnl.gov

Files slurmctld.log and slurm.conf are attached at the bottom of this e-mail.

Reran test15.17 with slurmctld in debug mode:
=====================
cluster-ib-1:~/SLURM/testsuite/expect # sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up   infinite    12   idle cluster-ib-[1-12]
cluster-ib-1:~/SLURM/testsuite/expect # ./test15.17
============================================
TEST: 15.17
spawn /usr/bin/salloc -N1-4 -t1 /bin/bash
salloc: Granted job allocation 2
/usr/bin/sbatch --jobid=2 -o none -e none test15.17.input
cluster-ib-1:~/SLURM/testsuite/expect # /usr/bin/sbatch --jobid=2 -o none -e none test15.17.input
sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
cluster-ib-1:~/SLURM/testsuite/expect #
FAILURE: salloc not responding
cancelling 2
scancel: error: Kill job error on job id 2: Unable to contact slurm controller (connect failure)
    while executing
"exec $scancel -q $job_id"
    (procedure "cancel_job" line 5)
    invoked from within
"cancel_job $job_id_1"
    invoked from within
"expect -nobrace -re {Granted job allocation ([0-9]+)} {
                set job_id_1 $expect_out(1,string)
                send "$sbatch --jobid=$job_id_1 -o none -e none $file_i..."
    invoked from within
"expect {
        -re "Granted job allocation ($number)" {
                set job_id_1 $expect_out(1,string)
                send "$sbatch --jobid=$job_id_1 -o none -e none $file_in \n"..."
    (file "./test15.17" line 65)
cluster-ib-1:~/SLURM/testsuite/expect #
=========================


Regards,

Hien Nguyen
Linux Technology Center (Austin)
Phone: (512) 838-4140            Tie Line: 678-4140
e-mail: hi...@us.ibm.com



Attachments: slurmctld.log, slurm.conf

jet...@llnl.gov

Jul 23, 2008, 1:53:58 PM
to Hien Nguyen, slur...@lists.llnl.gov
We've tried this on a couple of PPC systems here using the same
slurm.conf file as you (just changing some paths and the host names)
with no failures. We've used both AIX and SUSE Linux.

Here are some suggestions:
1. Upgrade to slurm v1.3.5; I don't know if this will make any difference,
   but it should be easy for you to try.
2. Produce a backtrace of the slurmctld failure, perhaps by running
   slurmctld under the control of a debugger (see the sketch below).
3. Add some more logging to the slurmctld code to locate the problem.
   You should start at step_create() in src/slurmctld/step_mgr.c.
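
For suggestion 2, a minimal sketch (assuming the slurmctld binary is
at /usr/sbin/slurmctld):

$ gdb /usr/sbin/slurmctld
(gdb) run -D
(reproduce the failure; gdb stops at the fault)
(gdb) bt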



-- 
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette       jet...@llnl.gov                 925-423-4856
Integrated Computational Resource Management Group   fax 925-423-6961
Livermore Computing            Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
"The problem with the world is that we draw the circle of our family
 too small."  - Mother Teresa
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Hien Nguyen

Jul 23, 2008, 4:01:28 PM
to slur...@lists.llnl.gov, owner-s...@lists.llnl.gov

Here is what I got from gdb:

cluster-ib-1:~ # gdb slurmctld 9621
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ppc-suse-linux"...
Using host libthread_db library "/lib/power6x/libthread_db.so.1".
Attaching to program: /usr/sbin/slurmctld, process 9621
Reading symbols from /usr/lib/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.8
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/power6x/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -134467584 (LWP 9621)]
[New Thread -138267424 (LWP 9625)]
[New Thread -137218848 (LWP 9624)]
[New Thread -136170272 (LWP 9623)]
[New Thread -134904608 (LWP 9622)]
Loaded symbols for /lib/power6x/libpthread.so.0
Reading symbols from /lib/power6x/libc.so.6...done.
Loaded symbols for /lib/power6x/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
Reading symbols from /usr/lib/slurm/crypto_openssl.so...done.
Loaded symbols for /usr/lib/slurm/crypto_openssl.so
Reading symbols from /usr/lib/slurm/select_linear.so...done.
Loaded symbols for /usr/lib/slurm/select_linear.so
Reading symbols from /usr/lib/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/lib/slurm/checkpoint_none.so
Reading symbols from /usr/lib/slurm/accounting_storage_none.so...done.
Loaded symbols for /usr/lib/slurm/accounting_storage_none.so
Reading symbols from /usr/lib/slurm/jobacct_gather_none.so...done.
Loaded symbols for /usr/lib/slurm/jobacct_gather_none.so
Reading symbols from /usr/lib/slurm/switch_none.so...done.
Loaded symbols for /usr/lib/slurm/switch_none.so
Reading symbols from /usr/lib/slurm/jobcomp_none.so...done.
Loaded symbols for /usr/lib/slurm/jobcomp_none.so
Reading symbols from /usr/lib/slurm/sched_backfill.so...done.
Loaded symbols for /usr/lib/slurm/sched_backfill.so
Reading symbols from /usr/lib/slurm/auth_munge.so...done.
Loaded symbols for /usr/lib/slurm/auth_munge.so
Reading symbols from /usr/lib/libmunge.so.2...done.
Loaded symbols for /usr/lib/libmunge.so.2
0x0fd64830 in nanosleep () from /lib/power6x/libc.so.6
(gdb) br step_create
Breakpoint 1 at 0x1003c834: file step_mgr.c, line 802.
(gdb) s
Single stepping until exit from function nanosleep,
which has no line number information.
[New Thread -153094944 (LWP 9864)]
Cannot get thread event message: debugger service failed
(gdb) s
Single stepping until exit from function __libc_disable_asynccancel,
which has no line number information.
0x0fd64848 in nanosleep () from /lib/power6x/libc.so.6
(gdb) s
Single stepping until exit from function nanosleep,
which has no line number information.
0x0fd645b8 in sleep () from /lib/power6x/libc.so.6
(gdb) s
Single stepping until exit from function sleep,
which has no line number information.
main (argc=<value optimized out>, argv=<value optimized out>)
    at controller.c:985
985     controller.c: No such file or directory.
        in controller.c
(gdb) s
986     in controller.c
(gdb) s
985     in controller.c
(gdb) s
986     in controller.c
(gdb) s
988     in controller.c
(gdb) s
1007    in controller.c
(gdb) s
1008    in controller.c
(gdb) s
1009    in controller.c
(gdb) s
1008    in controller.c
(gdb) s
1009    in controller.c
(gdb) s
Cannot get thread event message: debugger service failed
(gdb) bt
#0  debug2 (fmt=0x1008b910 "Performing job time limit and checkpoint test")
    at log.c:741
#1  0x10015200 in main (argc=<value optimized out>, argv=<value optimized out>)
    at controller.c:1009
-------------
cluster-ib-1:~/SLURM/testsuite/expect # ./test15.17
============================================
TEST: 15.17
spawn /usr/bin/salloc -N1-4 -t1 /bin/bash
salloc: error: slurm_receive_msg: Socket timed out on send/recv operation
salloc: error: Failed to allocate resources: Socket timed out on send/recv operation
 
FAILURE: job allocation failure
-------------

Regards,

Hien Nguyen
Linux Technology Center (Austin)
Phone: (512) 838-4140            Tie Line: 678-4140
e-mail: hi...@us.ibm.com


