[Rocks-Discuss] Rocks 5.4: Gaussian 09 B.01 + Linda 8.2 + SGE 6.5

Gowtham

Aug 3, 2011, 11:49:03 AM

to NPACI Rocks Discussion List, SGE Discussion List

Dear fellow users,

If any of you have successfully managed to run Gaussian 09 with Linda 8.2 and
integrate it with SGE on Rocks 5.4, I'd appreciate some tips. I have read
through the mailing list and tried a few things on my own, but in vain so far.

Thanks in advance for your time and help.

Best,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University

(906) 487/3593

Sudarshan Wadkar

Aug 3, 2011, 1:01:00 PM

to Discussion of Rocks Clusters

I managed it with Torque/Maui, but I can certainly help if you
run into problems.
First and foremost, go through the install script line by line.
IIRC, one of the steps, where a directory or files are linked to the g09 bin
directory, is broken or doesn't work, so I had to modify the
installation script.

--
-Sudarshan Wadkar

"Success is getting what you want. Happiness is wanting what you get."
- Dale Carnegie
"It's always our decision who we are"
- Robert Solomon in Waking Life
"The Truth is The Truth, so all you can do is live with it."
- $udhi :)

LaoTsao

Aug 3, 2011, 2:16:10 PM

to Discussion of Rocks Clusters, SGEDiscussion List, NPACI Rocks Discussion List

google is your friend

e.g.
http://research.gc.cuny.edu/wiki/index.php/Submitting_parallel_Gaussian_03/Linda_jobs_to_the_Sun_Grid_Engine

Sent from my iPad
Hung-Sheng Tsao ( LaoTsao) Ph.D

James Rudd

Aug 3, 2011, 6:26:19 PM

to Discussion of Rocks Clusters

We had problems with SGE not properly killing Linda jobs if they
were canceled. Master would stop but slaves would keep on running.

I modified the global configuration file (qconf -mconf) as mentioned in this
post http://osdir.com/ml/clustering.gridengine.users/2007-03/msg00029.html
and added this option:
execd_params NOTIFY_KILL=INT

Then I changed the way our submit files used with qsub are generated to
include the -notify option.

Now, 60 seconds before the SIGKILL signal is sent, a SIGINT is sent, which gives
Gaussian time to close the slave connections and the submit script time to
clean up.
If required, it could also be set to NOTIFY_KILL=TERM to send SIGTERM instead
of SIGINT.
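
A minimal sketch of how a submit script might react to that early SIGINT, assuming a bash job script. The cleanup function and its body are hypothetical placeholders, and the self-sent signal only stands in for what SGE would deliver under -notify:

```shell
#!/bin/bash
#$ -notify
# Sketch only: with "execd_params NOTIFY_KILL=INT" and "-notify",
# the job is expected to see SIGINT some time before the final SIGKILL.

CLEANED_UP=no

# Hypothetical cleanup hook; replace the body with site-specific steps
# (for example, removing scratch files).
cleanup() {
    CLEANED_UP=yes
    echo "caught SIGINT, cleaning up"
}
trap cleanup INT

# Stand-in for the real g09 invocation: simulate the warning signal by
# sending SIGINT to this shell. bash runs the handler once the current
# foreground command has returned.
kill -INT $$
echo "cleanup ran: ${CLEANED_UP}"
```

In a real job the kill line would be replaced by the g09 command, and the handler would only fire if the job is deleted or hits a limit.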

Hope this helps,
James


Sudarshan Wadkar

Aug 3, 2011, 11:10:33 PM

to Discussion of Rocks Clusters

Interesting approach, James.
I was faced with the same problem. I did not think of using Torque's
similar feature (I am not sure if it's there in Torque).
I had to hack the way Gaussian is run. I kept a trail of running
Gaussian jobs and used post-job scripts to clean the node of stale
Gaussian jobs.
It (the problem) was really annoying and was hogging a lot of system resources.
I reported the problem to Gaussian Support, but they didn't respond
(except for a mail or two asking for debug outputs).

-Sudarshan Wadkar

On Thu, Aug 4, 2011 at 3:56 AM, James Rudd <james...@gmail.com> wrote:
> We had problems with SGE not properly killing Linda jobs if they
> were canceled. Master would stop but slaves would keep on running.

--

James Rudd

Aug 3, 2011, 11:47:21 PM

to Discussion of Rocks Clusters

I agree about it wasting resources. We found that with big jobs the
l502.exel process could keep running for days on slave nodes after it had
been terminated on master.

I contacted Gaussian Support, but they pretty much said they don't run SGE,
so they rely on info from users. Their first suggestion was to add -catch_rsh to
the PE config file.
Then they suggested looking at the signals sent:
>>>>>>>>>>>>>>>>
I have not heard from other users who I
have discussed this with but I did get an insight from some users on
PBS systems. The qdel command with PBS actually sends out two signals,
first is SIGTERM and then followed by a variable delay SIGKILL. If
we set the delay between the two signals to about 120 seconds this gave
sufficient time for the master to reliably advise the workers to
exit.
>>>>>>>>>>>>>>>>>>>

After looking around, I found that SGE just sends a SIGKILL, which kills the master
with no time to send out the shutdown signal. My previous post is what I
sent back to Gaussian to let them know how I had resolved the problem.

Regards,
James


Gowtham

Aug 4, 2011, 11:10:54 AM

to NPACI Rocks Discussion List, SGE Discussion List

Thanks all for your tips and suggestions. Here's what I did
and it seems to be working.

1. The SGE submission script is attached to this email.

2. I followed James Rudd's tip on 'qconf -mconf' and set

execd_params NOTIFY_KILL=INT

3. The attachment may also be found in

http://sgowtham.net/misc/gaussian_2009p_l82.txt

The Gaussian 09 calculation runs fine, and the log file reports
that

...
%NProcShared=4
Will use up to 4 processors via shared memory.
%LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
...

Is there some way (a script or a tool) with which I can
confidently make sure that all the said compute nodes are
actually being used by this Gaussian 09 calculation?

Thanks again for your time and help,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University

(906) 487/3593

-------------- next part --------------
#! /bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 4
#

# Set required variables [PATH, LD_LIBRARY_PATH, G09 stuff, etc.]
. /share/apps/bin/batch_env.sh

# Folder where the files are located
# Folder where the calculation will be done
export INIT_DIR="/home/sgowtham/test_runs/G09"

# Name of the Gaussian 2009 input file
export INAME="Test_G09_Linda82"

# Prepare for running Gaussian 2009
export GAUSS_LFLAGS=' -vv -opt "Tsnet.Node.lindarsharg: ssh"'
LINDAWORKERS=$(cat $PE_HOSTFILE | grep -v "catch_rsh" | awk -F '.' '{ print $1}' | tr '\n' ',' | sed 's/,$//')

# Prepend input deck with necessary information to run
# Gaussian 2009 with Linda 8.2
# Run Gaussian 2009
( echo %NProcShared=${NSLOTS}; echo %LindaWorkers=${LINDAWORKERS}; cat ${INAME}.com ) | \
/share/apps/g09/g09 > ${INAME}_${NSLOTS}.log

# Delete the core dumps, if any
/bin/rm -f ${INIT_DIR}/core*
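
The LINDAWORKERS pipeline in the script above could also be written as a single awk call. This is a sketch, assuming the usual PE hostfile layout of one "host slots queue processor-range" line per node; the function name linda_workers is made up for illustration:

```shell
#!/bin/bash
# Build a comma-separated list of short host names from an SGE PE hostfile.
linda_workers() {
    awk '{
        split($1, name, ".")          # drop the domain part, if any
        printf "%s%s", sep, name[1]
        sep = ","
    } END { if (NR > 0) print "" }' "$1"
}

# Inside a job, SGE sets PE_HOSTFILE; outside one this prints nothing.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "${PE_HOSTFILE}" ]; then
    LINDAWORKERS=$(linda_workers "${PE_HOSTFILE}")
    echo "%LindaWorkers=${LINDAWORKERS}"
fi
```

Reading only the first column avoids depending on where the first dot happens to fall in the line.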

Sudarshan Wadkar

Aug 4, 2011, 3:25:52 PM

to Discussion of Rocks Clusters

What about the scratch file directory?
What if NSLOTS is different for different hosts?
I had developed a wonderful script that would do all of this.
Unfortunately, I don't have direct access to the system anymore.
I will see if I can get it emailed to me, and
will post it on the mailing list for future benefit.
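
On the question of uneven slot counts: the second column of the PE hostfile holds the slots granted on each host, so a script can inspect it. Below is a sketch that reports the smallest per-host count, which one might use for %NProcShared instead of the job-wide NSLOTS. That use is an assumption and not something confirmed in this thread; min_slots is a hypothetical helper name:

```shell
#!/bin/bash
# Print the smallest slot count found in an SGE PE hostfile.
min_slots() {
    awk '{
        n = $2 + 0                      # force a numeric comparison
        if (NR == 1 || n < min) min = n
    } END { if (NR > 0) print min }' "$1"
}

# Inside a job, SGE sets PE_HOSTFILE; outside one this prints nothing.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "${PE_HOSTFILE}" ]; then
    echo "%NProcShared=$(min_slots "${PE_HOSTFILE}")"
fi
```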

~$udhi :)

Eva Hocks

Aug 4, 2011, 4:43:00 PM

to Discussion of Rocks Clusters

After adding nodes to the Rocks DB via "rocks add" and running
"rocks sync config", the tentakel report lists only part of the 48 nodes
I added (and so does the tentakel.conf file).

"rocks run host" has the correct members for the groups (racks).

How can I fix tentakel?

Thanks
Eva

Reuti

Aug 3, 2011, 12:07:24 PM

to Gowtham, SGE Discussion List, NPACI Rocks Discussion List

Hi,

On 03.08.2011 at 17:49, Gowtham wrote:

> If any of you have successfully managed to run Gaussian 09 with Linda 8.2 and integrate it with SGE on Rocks 5.4, I'd appreciate some tips. I have read through the mailing list and tried a few things on my own, but in vain so far.

What approach have you used so far? Have you already defined PEs in SGE and provided a list of lindaworkers? The setup I posted for g03 > D.01 should also work for g09.

-- Reuti

> Thanks in advance for your time and help.
>
> Best,
> g
>
> --
> Gowtham
> Advanced IT Research Support
> Michigan Technological University
>
> (906) 487/3593
>


Reuti

Aug 4, 2011, 2:01:31 PM

to Gowtham, SGE Discussion List, NPACI Rocks Discussion List

Hi,

On 04.08.2011 at 17:10, Gowtham wrote:

> Thanks all for your tips and suggestions. Here's what I did
> and it seems to be working.
>
> 1. SGE submission script is attached with this email
>
> 2. I followed James Rudd's tip on 'qconf -mconf' and make
>
> execd_params NOTIFY_KILL=INT
>
> 3. The attachment may also be found in
>
> http://sgowtham.net/misc/gaussian_2009p_l82.txt
>
>
> The Gaussian 09 calculation runs fine, the log file reports
> that
>
> ...
> %NProcShared=4
> Will use up to 4 processors via shared memory.
> %LindaWorkers=compute-1-21,compute-1-19,compute-1-18,compute-1-20
> ...
>
>
> Is there some way (a script or a tool) with which I can
> confidently make sure that all the said compute nodes are
> actually being used by this Gaussian 09 calculation?

You will have to go to each node and check whether the processes are there. As Gaussian runs some of its links serially and others in parallel, the processes may come and go on the slave nodes.
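
A rough sketch of such a check, looping over the hosts from the %LindaWorkers line. It assumes passwordless ssh to the compute nodes. The process-name pattern is also an assumption, based on the l502.exel name mentioned earlier in the thread; check_linda_nodes and REMOTE_SH are names invented here:

```shell
#!/bin/bash
# Command used to reach a node; ssh is an assumption, override as needed.
REMOTE_SH=${REMOTE_SH:-ssh}

# List Gaussian/Linda link processes (lNNN.exel) on each worker node.
check_linda_nodes() {
    # $1: comma-separated host list, as printed on the %LindaWorkers line
    local host
    for host in $(echo "$1" | tr ',' ' '); do
        echo "== ${host}"
        ${REMOTE_SH} "${host}" \
            "ps -eo user,pid,pcpu,args | grep 'l[0-9][0-9]*\.exel'" \
            || echo "   no Gaussian link processes found"
    done
}

# Example:
# check_linda_nodes compute-1-21,compute-1-19,compute-1-18,compute-1-20
```

Because links come and go, a single empty result does not prove a node is idle; running the check a few times during the job gives a better picture.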

But let me add some statements here:

1. To achieve tight integration with SGE, it's necessary to change this line near the end of the file linda8.2/opteron-linux/bin/linda_rsh:

*) exec /usr/bin/rsh $host $user -n "$@"

to

*) exec rsh $host $user -n "$@"

This way, SGE's rsh wrapper can catch the call, and you can use an MPICH-style PE set up with -catch_rsh.
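
The one-line change to linda_rsh could be applied with sed, which also keeps a backup copy. This is a sketch; patch_linda_rsh is a hypothetical helper name, and the path in the example is the one quoted above, relative to wherever Gaussian is installed:

```shell
#!/bin/bash
# Replace the hard-coded /usr/bin/rsh with a plain rsh looked up via PATH.
patch_linda_rsh() {
    # $1: path to linda_rsh; the original is kept as <file>.bak
    sed -i.bak 's|exec /usr/bin/rsh |exec rsh |' "$1"
}

# Example:
# patch_linda_rsh linda8.2/opteron-linux/bin/linda_rsh
```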

2. I patched the rsh wrapper so that it a) doesn't echo the commands that start slave tasks into the user's output file and b) forwards all variables to all nodes (the -V option):

if [ x$just_wrap = x ]; then
    if [ $minus_n -eq 1 ]; then
#       echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V -nostdin $rhost $cmd
    else
#       echo $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh -inherit -V $rhost $cmd
    fi
else

I create a dedicated PE for each parallel library.

3. The necessary PE would be:

$ qconf -sp linda
pe_name            linda
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/cluster/linda/startlinda.sh -catch_rsh $pe_hostfile
stop_proc_args     /usr/sge/cluster/linda/stoplinda.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
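
A sketch of how such a PE could be registered from a file rather than typed in interactively. The quoted here-document keeps $pe_hostfile and $round_robin literal. The file path and the queue name all.q are assumptions, and the qconf calls are left commented out because they need a live SGE installation:

```shell
#!/bin/bash
# Write the PE definition shown above to a file.
write_linda_pe() {
    # $1: file to write the PE definition to
    cat > "$1" <<'EOF'
pe_name            linda
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/cluster/linda/startlinda.sh -catch_rsh $pe_hostfile
stop_proc_args     /usr/sge/cluster/linda/stoplinda.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
EOF
}

# Example:
# write_linda_pe /tmp/linda_pe.txt
# qconf -Ap /tmp/linda_pe.txt               # add the PE from the file
# qconf -aattr queue pe_list linda all.q    # attach it to a queue
```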

4. While startlinda.sh is just a renamed startmpi.sh, stoplinda.sh includes commands to remove empty output and error files. You will have to adjust the test on the counter in case the listed machinefile has more than one line (my startlinda.sh already assembles the %lindaworkers line, so I get only 2 lines in the output file: the echo of the start options and the %lindaworkers line).

#!/bin/sh
# Remove the machine file and the rsh wrapper from the job's temporary directory
rm $TMPDIR/machines

rshcmd=rsh
case "$ARC" in
    hp|hp10|hp11|hp11-64) rshcmd=remsh ;;
    *) ;;
esac
rm $TMPDIR/$rshcmd

# Remove the output file if it holds only the 2 lines written at PE startup
if [ -r "$SGE_STDOUT_PATH" -a -f "$SGE_STDOUT_PATH" ] ; then
    counter=`wc -l $SGE_STDOUT_PATH`
    [ "${counter%%$SGE_STDOUT_PATH}" -eq 2 ] && rm -f $SGE_STDOUT_PATH
fi
# Remove the error file if it is empty
[ -r "$SGE_STDERR_PATH" -a -f "$SGE_STDERR_PATH" ] && [ ! -s "$SGE_STDERR_PATH" ] && rm -f $SGE_STDERR_PATH

exit 0

5. Changing the input file to get the list of lindaworkers

As said, I prefer to change a copy of the input file, but it may be a matter of taste whether you prefer one job per directory or arbitrary ones.
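
One way to leave the original input untouched is to write the Link 0 lines into a copy and run Gaussian on that. This is a sketch only, not the script used on the poster's cluster; make_linda_input and the file names in the example are hypothetical:

```shell
#!/bin/bash
# Write a copy of a Gaussian input with the parallel Link 0 lines prepended.
make_linda_input() {
    # $1: original input   $2: copy to write
    # $3: value for %NProcShared   $4: comma-separated Linda worker list
    {
        echo "%NProcShared=$3"
        echo "%LindaWorkers=$4"
        cat "$1"
    } > "$2"
}

# Example:
# make_linda_input Test.com "${TMPDIR}/Test.com" 4 compute-1-21,compute-1-19
```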

-- Reuti


James Rudd

Aug 4, 2011, 9:15:17 PM

to Discussion of Rocks Clusters

Hi Gowtham,
You also need to add a "#$ -notify" to the submit file to tell SGE to send
the signal.

I have written a script to generate SGE submit files for Gaussian that is
used by our users.

https://github.com/ruddj/UNSW-Rocks-Config/blob/master/submit/g09gen.pl

You will probably need to modify it a fair bit for your environment. (e.g.
Emails, PE and Queue names, etc)

Regards,
James


Gowtham

Aug 5, 2011, 9:21:59 AM

to Discussion of Rocks Clusters

Yes Sir! I have a similar script, albeit in BASH - called
'getscript', that interactively generates SGE submission
scripts for every software suite installed on this cluster.

I have added '#$ -notify' to the relevant portion that
generates g09 stuff.

Thank you,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University

(906) 487/3593
