large job launching


Michael Di Domenico

Apr 10, 2012, 9:03:31 AM
to sicortex-users
I have a 5832 on which a user wants to launch jobs using more than 512 nodes.
I'm not sure we've ever tried this before (the machine is typically used for
many smallish jobs), but I can't even seem to run Linpack 2.0 using more than
512 nodes. Up to 512 nodes the job launches just fine; moving up to 640, I
can't get HPL to start. I suspect a rank is failing to check in.

How should I diagnose this on the SiCortex? Since SiCortex closed its doors,
our in-house brain trust has been seeping away, and I can't recall all the
diagnostic methods. We did run the machine through a burn-in using the system
diags; we found some bad hardware and removed it, and now the machine runs
through the system diags clean. I'm not sure where the problem might lie.

Lawrence Stewart

Apr 10, 2012, 10:33:58 AM
to sicorte...@googlegroups.com, Michael Di Domenico, ste...@serissa.com
512 ranks or 512 nodes * 6 cores each for 3072 ranks?

In either case, this ought to work smoothly.

If I remember correctly, the thing to do is to break it down to find the cause:

1) Job launching generally

Try

srun -n 3072 hostname

or something like that, to see whether the job-launching machinery is working. You can turn on successively higher levels of debugging in srun.
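
For instance, assuming the SiCortex build of srun takes the standard SLURM verbosity flags, repeating -v should print the launch steps as they happen:

srun -vvv -n 3072 hostname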

2) Loading the executable

Make sure that the executable is actually available to all the assigned nodes, such as with

srun -n 3072 size <executable>

Make sure that all of its required shared libraries are there as well.
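
One way to spot a missing library (just a sketch, assuming ldd is installed on the compute nodes) is to run ldd under srun and look for "not found" entries:

srun -n 3072 ldd <executable> | grep "not found"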

You can also generally speed up job launch for large jobs by making the executable local to all the nodes, with
something like

salloc -n 3072 /bin/bash
> sbcast <executable> /tmp/<executable>
> srun /tmp/<executable>
<^D>
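
Inside that salloc shell, srun should pick up the task count from the allocation environment; if it does not, you can pass it explicitly (same placeholder as above):

> srun -n 3072 /tmp/<executable>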

3) Running out of memory

It can happen that some HPL configurations are simply running out of memory, so some ranks never start. Instead of

srun -n 512

which will run 512 ranks packed six per node (on roughly 512/6 ≈ 86 nodes), try

srun -N 512

which will run one copy on each of 512 nodes.
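
As a quick sanity check on placement (just standard tools, nothing SiCortex-specific), counting hostnames shows how many ranks land on each node; the -n form should show about six per node and the -N form one per node:

srun -n 512 hostname | sort | uniq -c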

You can also check on resource usage while the program is "hung" by running commands like this (needs to be run as root):

srun -w <nodelist> --no-alloc cat /proc/meminfo

or whatever. Root can run jobs with --no-alloc (IIRC) on nodes that are already assigned to other jobs.
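
For example, to pull just the free-memory line from each node in the job's node list (same as-root --no-alloc trick as above):

srun -w <nodelist> --no-alloc grep MemFree /proc/meminfo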

4) Communications setup

It can happen that MPI cannot allocate fabric resources. Typically this will happen if you try to run more than 10 or so ranks per node, but it can also
happen if the "bigphysalloc" memory that the fabric uses for buffers gets fragmented. IIRC the 5832 needs a boot allocation of about 450 MB of bigphysalloc
in order to run full size jobs.

I think this case should be detectable by looking at /proc/bigphysarea, and it would break <any> MPI job of the given size, not just HPL.
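
A sketch of how to check it across the suspect nodes (the exact /proc/bigphysarea format is SiCortex-specific, and --no-alloc again needs root):

srun -w <nodelist> --no-alloc cat /proc/bigphysarea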

5) More debugging.

It might be useful to relink with the debug version of libmpi, for which there are instructions in the programmer's guide.

-Larry / random recollections of debugging techniques


Narayan Desai

Apr 10, 2012, 10:55:35 AM
to sicorte...@googlegroups.com, Michael Di Domenico, ste...@serissa.com
Our 5832 will run full-system jobs, with its dwindling number of nodes. We do
end up seeing memory fragmentation in the memory region for the fabric, which
requires a reboot. I don't recall whether we saw this issue before we upgraded
to the 4.0 FT version.
-nld

Michael Di Domenico

Apr 10, 2012, 11:41:04 AM
to sicortex-users
Looks like it might be related to module 21: if I exclude that module, HPL
starts and completes with 768 nodes. However, I can find no clear indication
of why removing that particular module makes it work.