If you have IB, you will need to add OFED and OpenMPI with SGE over IB,
or add MVAPICH with IB and SGE.
If you have GPGPUs, you will need to add CUDA.
By the way, for MPICH2 you may need to add tight integration with SGE
and change some environment variables;
see wikis.rocksclusters.org or a recent email archive thread.
Tight integration is just for control of the MPI jobs.
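For example, a tightly integrated PE definition might look roughly like this (a sketch only; the exact paths, PE name and script options depend on your SGE roll and MPI build):

  # qconf -sp mpich
  pe_name            mpich
  slots              9999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
  stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min

control_slaves TRUE plus the -catch_rsh wrapper is what lets SGE track and kill the slave MPI tasks.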
regards
-LT
Sent from my iPad
On Apr 1, 2012, at 22:02, Derrick LIN <kli...@gmail.com> wrote:
> hi guys,
>
> I have finally deployed Rocks 5.4.3 (with SGE) on our new cluster. I was
> surprised to find that SGE comes with an MPI PE pre-configured with the
> proper start & stop handlers.
>
> But when I tried to submit an MPI job (HPL 2.0), the job failed with errors
> and stopped. The output contains the following info:
>
> [root@cluster Linux_ATHLON_FBLAS]# cat job_hpl_2.0.o15
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> error: executing task of job 15 failed: execution daemon on host
> "omega-0-2.local" didn't accept task
> --------------------------------------------------------------------------
> A daemon (pid 24403) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> The same job ran fine after I simply changed the mpi PE's Start Proc Args and
> Stop Proc Args to /bin/true. I am wondering whether SGE itself needs to be
> configured further to make the MPI start and stop handlers work, or whether I
> am missing something in my qsub job.
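> (From the command line, the equivalent change would be something like
> running "qconf -mp mpi" and then setting
>   start_proc_args    /bin/true
>   stop_proc_args     /bin/true
> in the editor that opens.)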
>
> Thanks in advance
>
> Derrick
If you add the SGE parameter line of:
#$ -S /bin/bash
near the top of your qsub job submission script, does it make the
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
errors go away? Can you post your qsub script?
--
Wilson Cheung UCSD Mathematics Department Computing Support (MCS)
wch...@ucsd.edu 9500 Gilman Drive # 0112, La Jolla, CA 92093-0112
math...@math.ucsd.edu | (858) 534-2762 | AP&M 5018
After adding #$ -S /bin/bash, the shell error/warning went away. But the
mpi job still failed with the same mpi error.
My qsub script is very simple:
#$ -S /bin/bash
#
#
# Export all environment variables
#$ -V
# specify the PE and core #
#$ -pe mpi 128
# Customize job name
#$ -N job_hpl_2.0
# Use current working directory
#$ -cwd
# Join stdout and stderr into one file
#$ -j y
# The mpirun command; note the lack of host names, as SGE will provide them on-the-fly.
mpirun -np $NSLOTS ./xhpl >> xhpl.out
Besides the error in the main output log, I also found this in the .po log:
[root@cluster Linux_ATHLON_FBLAS]# cat job_hpl_2.0.po20
/opt/gridengine/default/spool/omega-0-18/active_jobs/20.1/pe_hostfile
omega-0-18
  (the line "omega-0-18" is repeated 64 times, one per slot)
t-0-17
  (the line "t-0-17" is repeated 64 times, one per slot)
rm: cannot remove `/tmp/20.1.tem.q/rsh': No such file or directory
My cluster uses 1GbE only, no IB etc. Every setting is the Rocks pre-configured default.
Regards,
Derrick
Dear Derrick,
can you try adding the following
-machinefile $TMPDIR/machines
to your mpirun command line:
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./xhpl
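With that change, the whole submission script would look something like this (based on your script above):

  #$ -S /bin/bash
  #$ -V
  #$ -pe mpi 128
  #$ -N job_hpl_2.0
  #$ -cwd
  #$ -j y
  mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./xhpl >> xhpl.out

The mpi PE's start script (startmpi.sh) writes the list of allocated hosts to $TMPDIR/machines on the job's master node, which is why mpirun can find its hosts there.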
Sincerely,
Luca
It's much clearer to me now; things work perfectly.
Cheers,
D
On Tue, Apr 3, 2012 at 11:03 PM, Hung-Sheng Tsao (LaoTsao) Ph.D <
lao...@gmail.com> wrote:
> correction:
> with mpich2 and hydra one can also use the orte PE
> regards
>
> Sent from my iPad
>
> On Apr 3, 2012, at 3:58, "Hung-Sheng Tsao (LaoTsao) Ph.D" <
> lao...@gmail.com> wrote:
>
> > in the Rocks SGE PEs:
> > mpi is loosely integrated;
> > mpich and orte are tightly integrated.
> > the qsub args required are different for mpi, mpich and orte.
> >
> > mpi and mpich need a machinefile.
> >
> > by default,
> > mpi and mpich are for mpich2;
> > orte is for openmpi.
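> > for example, the two submissions differ roughly like this (a sketch):
> >
> >   # loosely integrated mpi PE: pass the machinefile yourself
> >   #$ -pe mpi 128
> >   mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./xhpl
> >
> >   # tightly integrated orte PE with OpenMPI: no machinefile needed,
> >   # mpirun picks the hosts up from SGE automatically
> >   #$ -pe orte 128
> >   mpirun -np $NSLOTS ./xhpl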
> > regards
> > -LT
> >
> >
> >
> >
> >
> > Sent from my iPad
> >