How should I diagnose this on the SiCortex? Since the SiCortex door is closing, our SiCortex brain trust is starting to seep out, and I can't recall all the diag methods. We did run the machine through a burn-in using the system diags; we found some bad hardware and removed it, and now the machine runs through the system diags clean. I'm not sure where the problem might lie.
In either case, this ought to work smoothly.
If I remember correctly, the thing to do is to break it down to find the cause:
1) Job launching generally
Try
srun -n 3072 hostname
or something like that, to see whether the job-launching machinery is working. You can turn on successively higher levels of debugging in srun.
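For instance, in standard SLURM each additional -v bumps up srun's own debug output (I'm assuming the SiCortex srun behaves the same way):
srun -v -n 3072 hostname
srun -vvv -n 3072 hostname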
2) Loading the executable
Make sure that the executable is actually available to all the assigned nodes, such as with
srun -n 3072 size <executable>
Make sure that all of its required shared libraries are there as well.
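A quick way to check the shared libraries, assuming ldd is installed on the nodes:
srun -n 3072 ldd <executable>
Any library reported as "not found" on some node is a likely culprit.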
You can also generally speed up job launch for large jobs by making the executable local to all the nodes, with
something like
salloc -n 3072 /bin/bash
> sbcast <executable> /tmp/<executable>
> srun /tmp/<executable>
> ^D   (Ctrl-D exits the allocation shell)
3) Running out of memory
It can happen that some HPL configurations are simply running out of memory, so that some ranks never start. Instead of
srun -n 512 <executable>
which packs 6 copies onto each of 512/6 ≈ 86 nodes, try
srun -N 512 <executable>
which runs one copy on each of 512 nodes.
You can also check on resource usage while the program is "hung" by running commands like this (needs to be done as root):
srun -w <nodelist> --no-alloc cat /proc/meminfo
or whatever. Root can run jobs with --no-alloc (IIRC) on nodes that are already assigned to other jobs.
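To find the node list for the hung job in the first place, the standard SLURM squeue format specifiers should do it (assuming they apply here):
squeue -j <jobid> -o %N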
4) Communications setup
It can happen that MPI cannot allocate fabric resources. Typically this happens if you try to run more than 10 or so ranks per node, but it can also happen if the "bigphysalloc" memory that the fabric uses for buffers gets fragmented. IIRC the SC5832 needs a boot allocation of about 450 MB of bigphysalloc in order to run full-size jobs.
I think this case should be detectable by looking at /proc/bigphysarea, and it would break <any> MPI job of the given size, not just HPL.
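For example, to eyeball the bigphysarea state on every node of an allocation (one copy per node; <nnodes> is whatever job size you're testing):
srun -N <nnodes> cat /proc/bigphysarea
Heavy fragmentation, or a total well under the ~450 MB figure above, would point at this case.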
5) More debugging.
It might be useful to relink with the debug version of libmpi, for which there are instructions in the programmer's guide.
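Just to show the shape of it, a hypothetical relink line (the real compiler driver, library name, and flags are the ones in the programmer's guide; -lmpi_debug here is only a placeholder):
mpicc -g -o xhpl.debug <objects> -lmpi_debug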
-Larry (random recollections of debugging techniques)