Hi,
I have read the questions and answers again and again, and I think this is the proper place to ask about my problem.
I have set up a PBS/TORQUE environment with an NFS share. The share is not the home directory itself but /home/user/workdir, and I have put my binaries in that directory. I am using Ubuntu 16 and installed abyss-pe with the apt-get package installer.
Nodes: master, client1, client2 (clients are running the torque-mom, master runs the scheduler and server)
If I run the qsub command (posted below) with only one free node, it works. For example, if client1 is free (client2 disabled) and I send the job to client1, it works. If I enable client2, disable client1, and send the job to client2, it also works.
If I enable (free) torque-mom on both client1 and client2 and submit the job to them, only client2 runs the job. I can see this from the network traffic and memory usage.
I would like to use all of my nodes (client1 and client2) in parallel.
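For reference, a minimal sketch of how the node states mentioned above could be listed from the master. It assumes the usual `pbsnodes -a` layout, where each node name starts in the first column and its attributes (including `state = ...`) are indented below it. The helper name node_states is made up for this example.

```shell
# Read `pbsnodes -a` style output on stdin and print "node state" per node.
# Assumption: node names start in column one, attribute lines are indented.
node_states() {
    awk '
        /^[^ \t]/         { node = $1 }        # unindented line: a node name
        /^[ \t]+state = / { print node, $3 }   # indented "state = <value>" line
    '
}

# On the master one would run:
#   pbsnodes -a | node_states
```

Both clients should be reported as free before a nodes=2 job is submitted.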
I have also tried enabling both client1 and client2 and forcing mpirun (-H client1) to use only the client1 host, but it says "Host key verification fault" and I also get the following:
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
/usr/bin/abyss-pe:470: recipe for target 'abysstest-1.fa' failed
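The "Host key verification" message looks like it comes from SSH rather than from Open MPI itself, which would fit the "lack of authority" item in the list above. A minimal sketch for checking whether non-interactive SSH works between the nodes is shown here. It assumes mpirun starts its remote daemons over ssh in this setup, which is not confirmed, and the function name check_ssh_hosts is made up for this example.

```shell
# Report which hosts accept non-interactive SSH logins from the current node.
# BatchMode=yes makes ssh fail instead of prompting, so an unknown host key
# or a missing key pair shows up as FAIL rather than as a hanging prompt.
check_ssh_hosts() {
    local host failed=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call with the host names from this post, run as the job user:
#   check_ssh_hosts client1 client2
```

It is probably worth running this on each client as well as on the master, because mpirun is started on whichever node Torque picks as the first node of the job.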
My shell script (b2.sh) looks like this:
#PBS -N abysstest
#PBS -o /home/simpa/workdir/abysstest.log
#PBS -e /home/simpa/workdir/abysstest.err
#PBS -l nodes=2:ppn=1
#PBS -l walltime=700:00:00
#PBS -r y
cd /home/simpa/workdir/
module load openmpi
abyss-pe mpirun='mpirun -H client1' np=2 k=25 in='/home/simpa/workdir/SRR1955491_1.fastq /home/simpa/workdir/SRR1955491_2.fastq' > runinfo.txt
cat $PBS_NODEFILE > runnodes.txt
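As a variation on the script above, here is a sketch that passes the Torque node file to mpirun instead of a fixed -H host. It assumes Open MPI's mpirun (which accepts --hostfile) and that options can be passed through the abyss-pe mpirun variable as in b2.sh. The helper summarize_nodefile is made up for this example, and whether this spreads the job over both clients is untested.

```shell
# Print "host count" for each distinct host listed in a Torque node file.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Only attempt the run inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile "$PBS_NODEFILE" > runnodes.txt
    # One MPI process per line of the node file (2 for nodes=2:ppn=1).
    np=$(wc -l < "$PBS_NODEFILE")
    abyss-pe mpirun="mpirun --hostfile $PBS_NODEFILE" np="$np" k=25 \
        in='/home/simpa/workdir/SRR1955491_1.fastq /home/simpa/workdir/SRR1955491_2.fastq' \
        > runinfo.txt
fi
```

With nodes=2:ppn=1 the summary in runnodes.txt should show one slot on each client.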
I submit the job like this:
qsub -V -N abysstest b2.sh
$PBS_NODEFILE always contains the correct node names; I have checked it.
I have also thought about trying mpirun.mpich, that is, using MPICH instead of Open MPI.
I modified the script accordingly, changed mpirun to mpiexec, and set mpirun='/pathtonfsshare/mpirun.mpich -hosts client1', but in the end client2 started the job...
Thank you for your patience in reading these lines. I have been struggling with this for a long time and would really appreciate some help or ideas.
Regards!
Tarziciusz