Peterson, Kirk
Jun 7, 2013, 1:06:34 PM
to <dirac-users@googlegroups.com>
Dear Dirac experts,
Due to memory constraints on an HPC cluster I'm starting to run Dirac on, I need to allocate entire nodes to my Dirac jobs but use only a single processor on each node. Thus I'm having to use an alternative to the usual $PBS_NODEFILE machinefile in my Open MPI runs. In my Torque qsub script, I might set the following for a 2-node/2-process run (each node has 12 cores):
#PBS -l nodes=2:ppn=12    # allocate 2 entire nodes
I generate a machinefile via:
cat $PBS_NODEFILE | uniq > $base/.nodefile.$$
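(Plain `uniq` without a preceding `sort` is enough here, since Torque writes $PBS_NODEFILE with each node's slots listed contiguously. A minimal sketch with a hypothetical nodefile illustrating the collapse:)

```shell
# Hypothetical $PBS_NODEFILE contents for nodes=2:ppn=12 (abbreviated):
# Torque repeats each hostname once per allocated core, contiguously.
printf 'node41\nnode41\nnode41\nnode39\nnode39\nnode39\n' > pbs_nodefile

# uniq collapses adjacent duplicate lines, leaving one line per node:
uniq pbs_nodefile
# node41
# node39
```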
and call pam with something like:
pam --mw 7000 --aw 20000 --noarch --mpi=2 --machfile=$base/.nodefile.$$ --inp=test.inp --mol=test.mol
The resulting machinefile is generated correctly:
contents of $base/.nodefile.$$ :
node41
node39
The standard output from pam seems to confirm this:
Creating the scratch directory.
Copying file " dirac.x " to scratch dir.
pam: Copying u.mol to /scratch/kipeters/DIRAC_u1_u_31621/MOLECULE.MOL
pam: Copying u1.inp to /scratch/kipeters/DIRAC_u1_u_31621/DIRAC.INP
Machinefile read, list of unique nodes obtained: ['node39']
Copying selected content of master scratch directory to nodes : ['node39']
scp /scratch/kipeters/DIRAC_u1_u_31621/dirac.x node39:/scratch/kipeters/DIRAC_u1_u_31621/dirac.x
scp /scratch/kipeters/DIRAC_u1_u_31621/MOLECULE.MOL node39:/scratch/kipeters/DIRAC_u1_u_31621/MOLECULE.MOL
scp /scratch/kipeters/DIRAC_u1_u_31621/DIRAC.INP node39:/scratch/kipeters/DIRAC_u1_u_31621/DIRAC.INP
But I just now noticed that the final mpirun command doesn't contain a --machinefile option:
DIRAC command : /home/clarklab/kipeters/lib/openmpi-1.6.3/bin/mpirun -np 2 /scratch/kipeters/DIRAC_u1_u_31621/dirac.x (PID=31631)
In the output file from Dirac, I get the following:
** interface to 64-bit integer MPI enabled **
DIRAC master (node41) starts by allocating 7000000000 words ( 53405 MB) of memory
DIRAC node 1 (node41) starts by allocating 7000000000 words ( 53405 MB) of memory
DIRAC master (node41) to allocate at most 14000000000 words ( 106811 MB) of memory
Note: maximum allocatable memory for master+nodes can be set by -aw flag (MW) in pam
DIRAC node 1 (node41) to allocate at most 14000000000 words ( 106811 MB) of memory
This shows that Dirac is using the first two entries in $PBS_NODEFILE and not my new machinefile. While the job was running, I also ssh'd to node39 and confirmed there was no dirac.x process there, while two were running on node41.
Is there an easy hack to pam to make it pass my new machinefile to mpirun, or is it more complicated than that?
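(For what it's worth, one workaround I've been considering but have not tested: Open MPI's mpirun honors the MCA parameter orte_default_hostfile, which can be set through the environment, so — assuming pam's environment is inherited by the final mpirun — the machinefile could be supplied without patching pam at all. A sketch, with /tmp standing in for $base:)

```shell
# Untested sketch: hand the deduplicated nodefile to Open MPI via the
# orte_default_hostfile MCA parameter, so that the "mpirun -np 2 dirac.x"
# launched by pam picks it up even though pam omits --machinefile.
nodefile=/tmp/.nodefile.$$                       # stand-in for $base/.nodefile.$$
uniq "${PBS_NODEFILE:-/dev/null}" > "$nodefile"  # one line per unique node
export OMPI_MCA_orte_default_hostfile=$nodefile
# pam --mw 7000 --aw 20000 --noarch --mpi=2 --inp=test.inp --mol=test.mol
```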
thanks in advance and apologies for what turned into a long post,
-Kirk