If any of you have successfully managed to run Gaussian 09 with Linda 8.2 and
integrate it with SGE on Rocks 5.4, I'd appreciate some tips. I have read
through the mailing list and tried a few things on my own, but in vain so far.
Thanks in advance for your time and help.
Best,
g
--
Gowtham
Advanced IT Research Support
Michigan Technological University
I modified the global configuration file (qconf -mconf) as mentioned in this
post http://osdir.com/ml/clustering.gridengine.users/2007-03/msg00029.html
and added this option:
execd_params NOTIFY_KILL=INT
Then I changed the way our submit files used with qsub are generated to
include the -notify option.
Now, 60 seconds before the SIGKILL signal is sent, a SIGINT is sent, which
gives Gaussian time to close the slave connections and the submit script time
to clean up.
If required it could also be set to NOTIFY_KILL=TERM to send SIGTERM instead
of SIGINT.
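
For illustration, here is a minimal sketch of a submit script that makes use of
that warning signal. It is not taken from the generated submit files mentioned
above: SCRATCH_DIR, the cleanup function and the "sleep 1" placeholder are all
assumptions, and the real g09 command line would go where the placeholder is.
As far as I know, the length of the delay comes from the queue's "notify"
setting.

```shell
#!/bin/bash
#$ -notify
# Sketch only. With NOTIFY_KILL=INT in execd_params, a job submitted with
# -notify receives SIGINT some time before the final SIGKILL.

SCRATCH_DIR=$(mktemp -d)   # hypothetical scratch area for this job
CLEANED_UP=no

cleanup() {
    # Remove the scratch area; guarded so that it only runs once.
    [ "$CLEANED_UP" = yes ] && return
    rm -rf "$SCRATCH_DIR"
    CLEANED_UP=yes
}

# Clean up on the warning signal and also on a normal exit.
trap 'cleanup; exit 130' INT
trap cleanup EXIT

# Placeholder for the real work (for example the g09 command line).
# Bash runs a trap only after the current foreground command has returned,
# so the program gets to finish its own shutdown first.
sleep 1
```

The job is kept in the foreground on purpose: a non-interactive bash starts
background commands with SIGINT ignored, so a backgrounded program would not
react to the warning signal at all.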
Hope this helps,
James
On Thu, Aug 4, 2011 at 3:56 AM, James Rudd <james...@gmail.com> wrote:
> We had problems with SGE not properly killing Linda jobs if they
> were canceled. Master would stop but slaves would keep on running.
--
I contacted Gaussian Support, but they essentially said that they don't run
SGE themselves and so rely on information from users. Their first suggestion
was to add -catch_rsh to the PE config file.
Then they suggested looking at the signals sent:
>>>>>>>>>>>>>>>>
I have not heard from other users who I
have discussed this with but I did get an insight from some users on
PBS systems. The qdel command with PBS actually sends out two signals,
first is SIGTERM and then followed by a variable delay SIGKILL. If
we set the delay between the two signals to about 120 seconds this gave
sufficient time for the master to reliably advise the workers to
exit.
>>>>>>>>>>>>>>>>>>>
After looking around, I found that SGE just sends a SIGKILL, which kills the
master before it has time to send out the shutdown signal. My previous post is
what I sent back to Gaussian to let them know how I had resolved the problem.
Regards,
James
1. SGE submission script is attached with this email
2. I followed James Rudd's tip on 'qconf -mconf' and set
execd_params NOTIFY_KILL=INT
3. The attachment may also be found in
http://sgowtham.net/misc/gaussian_2009p_l82.txt
The Gaussian 09 calculation runs fine, the log file reports
that
...
%NProcShared=4
Will use up to 4 processors via shared memory.
%LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
...
Is there some way (a script or a tool) with which I can
confidently make sure that all the said compute nodes are
actually being used by this Gaussian 09 calculation?
Thanks again for your time and help,
g
--
Gowtham
Advanced IT Research Support
Michigan Technological University
-------------- next part --------------
#! /bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 4
#
# Set required variables [PATH, LD_LIBRARY_PATH, G09 stuff, etc.]
. /share/apps/bin/batch_env.sh
# Folder where the files are located
# Folder where the calculation will be done
export INIT_DIR="/home/sgowtham/test_runs/G09"
# Name of the Gaussian 2009 input file
export INAME="Test_G09_Linda82"
# Prepare for running Gaussian 2009
export GAUSS_LFLAGS=' -vv -opt "Tsnet.Node.lindarsharg: ssh"'
LINDAWORKERS=$(grep -v "catch_rsh" "$PE_HOSTFILE" | awk -F '.' '{ print $1 }' | tr '\n' ',' | sed 's/,$//')
# Prepend input deck with necessary information to run
# Gaussian 2009 with Linda 8.2
# Run Gaussian 2009
( echo %NProcShared=${NSLOTS}; echo %LindaWorkers=${LINDAWORKERS}; cat ${INAME}.com ) | \
/share/apps/g09/g09 > ${INAME}_${NSLOTS}.log
# Delete the core dumps, if any
/bin/rm -f ${INIT_DIR}/core*
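
To see what the LINDAWORKERS line in the script produces, the sketch below
builds the same kind of comma-separated list from a sample PE hostfile. It is a
variant, not the pipeline from the script: it takes the short host name from
the first column, so it also works when the host name contains no dot. The
function name and the sample hostfile entries are made up for illustration.

```shell
#!/bin/bash
# Build a comma-separated %LindaWorkers value from an SGE PE hostfile.
# Each hostfile line has the form "hostname slots queue processor-range".
linda_workers_from_hostfile() {
    # Take column 1, keep the part before the first dot, join with commas.
    awk '{ split($1, name, "."); print name[1] }' "$1" | paste -s -d, -
}

# Sample hostfile with made-up entries, for illustration only.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
compute-1-21.local 1 all.q@compute-1-21.local UNDEFINED
compute-1-19.local 1 all.q@compute-1-19.local UNDEFINED
EOF

WORKERS=$(linda_workers_from_hostfile "$SAMPLE")
echo "%LindaWorkers=${WORKERS}"
rm -f "$SAMPLE"
```

In a real job the function would be called with "$PE_HOSTFILE" instead of the
sample file.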
After adding nodes to the Rocks DB via "rocks add" and running
"rocks sync config", the tentakel report lists only some of the 48 nodes
I added (and so does the tentakel.conf file).
"rocks run host" has the correct members for the groups (racks).
How can I fix tentakel?
Thanks
Eva
On 03.08.2011 at 17:49, Gowtham wrote:
> If any of you have successfully managed to run Gaussian 09 with Linda 8.2 and integrate it with SGE on Rocks 5.4, I'd appreciate some tips. I have read through the mailing list and tried a few things on my own, but in vain so far.
What approach have you used so far? Have you already defined PEs in SGE and provided a list of Linda workers? The setup for g03 > d.01 that I posted should also work for g09.
-- Reuti
> Thanks in advance for your time and help.
>
> Best,
> g
>
> --
> Gowtham
> Advanced IT Research Support
> Michigan Technological University
>
> (906) 487/3593
>
> _______________________________________________
> users mailing list
> us...@gridengine.org
> https://gridengine.org/mailman/listinfo/users
On 04.08.2011 at 17:10, Gowtham wrote:
> Thanks all for your tips and suggestions. Here's what I did
> and it seems to be working.
>
> 1. SGE submission script is attached with this email
>
> 2. I followed James Rudd's tip on 'qconf -mconf' and make
>
> execd_params NOTIFY_KILL=INT
>
> 3. The attachment may also be found in
>
> http://sgowtham.net/misc/gaussian_2009p_l82.txt
>
>
> The Gaussian 09 calculation runs fine, the log file reports
> that
>
> ...
> %NProcShared=4
> Will use up to 4 processors via shared memory.
> %LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
> ...
>
>
> Is there some way (a script or a tool) with which I can
> confidently make sure that all the said compute nodes are
> actually being used by this Gaussian 09 calculation?
You will have to go to each node and check whether the processes are there. As Gaussian runs some of its links serially and others in parallel, the processes may come and go on the slave nodes.
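
Going to each node can be scripted. The sketch below reads the %LindaWorkers
line from the Gaussian log and asks each listed host for matching processes.
Several things here are assumptions: the function names, the REMOTE_SHELL
variable (defaulting to ssh with password-less login), and the process name
pattern 'l[0-9]+\.exe', which should be adjusted to whatever the link
processes are called on your installation.

```shell
#!/bin/bash
# Print the hosts named on the %LindaWorkers line of a Gaussian log, one per
# line. A trailing ":n" on a host entry, if present, is stripped.
linda_hosts_from_log() {
    grep -m 1 -i '%LindaWorkers=' "$1" \
        | sed 's/^.*=//' | tr -d ' \r' | tr ',' '\n' | sed 's/:.*$//'
}

# Ask every host from the log for processes that look like Gaussian links.
# REMOTE_SHELL is a hypothetical knob; it defaults to ssh.
check_linda_hosts() {
    local log=$1 host
    for host in $(linda_hosts_from_log "$log"); do
        echo "== $host =="
        # pgrep exits non-zero when nothing matches, which triggers the echo.
        ${REMOTE_SHELL:-ssh} "$host" "pgrep -fl -u $(id -un) 'l[0-9]+\.exe'" </dev/null \
            || echo "no matching processes found"
    done
}

# Usage (log name taken from the submit script earlier in the thread):
#   check_linda_hosts Test_G09_Linda82_4.log
```

Because the processes come and go, a single empty result on a slave node does
not prove that the node is unused; running the check a few times during the
parallel links gives a better picture.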
But let me add some statements here:
1. To achieve a tight integration into SGE, it's necessary to change the following line near the end of the file linda8.2/opteron-linux/bin/linda_rsh:
*) exec /usr/bin/rsh $host $user -n "$@"
to
*) exec rsh $host $user -n "$@"
This way the rsh-wrapper of SGE can catch the call and use a PE from MPICH with the setup -catch_rsh
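
The edit itself can be scripted. The following is a minimal sketch, assuming
GNU sed and that the line in your copy of linda_rsh looks exactly like the one
quoted above; the function name is made up. It keeps a .bak copy of the
original file.

```shell
#!/bin/bash
# Replace the absolute rsh path in linda_rsh so that rsh is looked up via
# PATH, where the rsh wrapper set up by the PE start script can be found.
patch_linda_rsh() {
    # -i.bak edits in place and keeps the original as <file>.bak
    sed -i.bak 's|exec /usr/bin/rsh |exec rsh |' "$1"
}

# Usage (path as given in the message above, relative to the Linda install):
#   patch_linda_rsh linda8.2/opteron-linux/bin/linda_rsh
```

Comparing the file with its .bak copy afterwards (for example with diff) shows
whether exactly one line changed.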
2. I patched the rsh wrapper so that it a) doesn't echo the commands that start the slave tasks into the user's output file and b) forwards all variables to all nodes:
if [ x$just_wrap = x ]; then
   if [ $minus_n -eq 1 ]; then
#      echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
   else
#      echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
      exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
   fi
else
I create a dedicated PE for each parallel library.
3. The necessary PE would be:
$ qconf -sp linda
pe_name            linda
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/cluster/linda/startlinda.sh -catch_rsh $pe_hostfile
stop_proc_args     /usr/sge/cluster/linda/stoplinda.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
4. While startlinda.sh is just a renamed startmpi.sh, stoplinda.sh includes commands to remove empty output and error files. You will have to adjust the test on the counter in case the listed machinefile has more than one line (my startlinda.sh already assembles the %lindaworkers line, so I get only 2 lines in the output file: the echo of the start options and the %lindaworkers line).
#!/bin/sh
rm $TMPDIR/machines
rshcmd=rsh
case "$ARC" in
hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
*) ;;
esac
rm $TMPDIR/$rshcmd
if [ -r "$SGE_STDOUT_PATH" -a -f "$SGE_STDOUT_PATH" ] ; then
counter=`wc -l $SGE_STDOUT_PATH`
[ "${counter%%$SGE_STDOUT_PATH}" -eq 2 ] && rm -f $SGE_STDOUT_PATH
fi
[ -r "$SGE_STDERR_PATH" -a -f "$SGE_STDERR_PATH" ] && [ ! -s "$SGE_STDERR_PATH" ] && rm -f $SGE_STDERR_PATH
exit 0
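
If the counter test needs adjusting, it may be easier with the expected line
count as a parameter. The sketch below is a variant of the check above, not a
drop-in replacement from the original setup: the function name is made up, and
it reads the count with "wc -l < file" so that the file name does not have to
be stripped from the output of wc.

```shell
#!/bin/bash
# Remove a file if it contains exactly the given number of lines, i.e. only
# the lines written by the PE start script and nothing from the job itself.
remove_if_line_count() {
    local file=$1 expected=$2 lines
    # Nothing to do unless the file is a readable regular file.
    [ -r "$file" ] && [ -f "$file" ] || return 0
    lines=$(wc -l < "$file")
    [ "$lines" -eq "$expected" ] && rm -f "$file"
    return 0
}

# Usage inside a stop script, with 2 as the count described above:
#   remove_if_line_count "$SGE_STDOUT_PATH" 2
```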
5. Changing the input file to get the list of lindaworkers
As said, I prefer to change a copy of the input file, but whether you prefer one job per directory or just arbitrary ones might be a matter of taste.
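
A minimal sketch of the "change a copy" approach, assuming the two Link 0
lines used earlier in this thread are all that needs to be prepended. The
function name and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Write a copy of the input file with %NProcShared and %LindaWorkers lines
# prepended; the original input file is left untouched.
prepare_input() {
    local src=$1 dst=$2 nproc=$3 workers=$4
    {
        echo "%NProcShared=${nproc}"
        echo "%LindaWorkers=${workers}"
        cat "$src"
    } > "$dst"
}

# Usage inside a job script, with a worker list built from $PE_HOSTFILE:
#   prepare_input job.com job_${JOB_ID}.com "$NSLOTS" "$LINDAWORKERS"
```

Keeping the generated copy next to the log file also records which nodes a
given run was asked to use.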
-- Reuti
I have written a script to generate SGE submit files for Gaussian that is
used by our users.
https://github.com/ruddj/UNSW-Rocks-Config/blob/master/submit/g09gen.pl
You will probably need to modify it a fair bit for your environment. (e.g.
Emails, PE and Queue names, etc)
Regards,
James
I have added '#$ -notify' to the relevant portion that
generates g09 stuff.
Thank you,
g
--
Gowtham
Advanced IT Research Support
Michigan Technological University