SAGA Jobs always end up in `Eqw` state on SGE cluster, but manual submission of starting script works


Robert Meyer

Jul 16, 2015, 10:42:06 AM
to saga-...@googlegroups.com
Hi,

I played around with saga a bit, and I like it a lot. I quickly managed to start remote jobs via `ssh`.
But I still cannot figure out how to submit jobs to an SGE cluster with `sge+ssh`.

Every time I submit a job (I am simply using a slightly modified version of the SGE touch example,
http://saga-python.readthedocs.org/en/latest/adaptors/saga.adaptor.sgejob.html)  with a given service

``js = saga.job.Service("sge+ssh://mycluster.myuni.de",
session=session)``

it ends up in the FAILED or `Eqw` state, respectively.
What is odd, though, is that I was able to copy the shell script generated by SAGA from the temporary folder on our cluster,
and it runs perfectly with the `qsub` command.
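
For context, here is a rough sketch of what a job description behind such a submission might look like. The attribute values are read off the `#$` directives of the generated script quoted in this message. The mapping from attributes to directives (in particular `spmd_variation` to `-pe mp 1`) is an assumption, and `submit_touch_job` is a hypothetical helper name, not part of saga-python.

```python
# Sketch only: values are taken from the "#$" lines of the generated script;
# which attribute produces which directive is an assumption.
TOUCH_JOB = {
    "executable": "/bin/touch",
    "arguments": ["$FILENAME"],
    "environment": {"FILENAME": "testfile"},   # -v FILENAME=testfile
    "working_directory": "/net/homes2/informatik/augustin/robm/working_dir",  # -wd
    "output": "examplejob.out",                # -o
    "error": "examplejob.err",                 # -e
    "wall_time_limit": 1,                      # minutes, cf. -l h_rt=0:1:00
    "queue": "short",                          # -q
    "project": "TG-MCB090174",                 # -A
    "spmd_variation": "mp",                    # presumably -pe mp 1
    "total_cpu_count": 1,
}


def submit_touch_job(url="sge+ssh://mycluster.myuni.de"):
    """Submit the touch job and return (state, exit_code). Hypothetical helper."""
    import saga  # imported lazily so TOUCH_JOB can be inspected without saga

    session = saga.Session()
    js = saga.job.Service(url, session=session)

    # Copy the plain dict onto a saga job description.
    jd = saga.job.Description()
    for name, value in TOUCH_JOB.items():
        setattr(jd, name, value)

    job = js.create_job(jd)
    job.run()
    job.wait()
    result = (job.state, job.exit_code)
    js.close()
    return result
```

Keeping the description in a plain dictionary makes it easy to compare each value against the directive it is expected to produce.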

The script that was produced by SAGA is:

#!/bin/bash
#$ -S /bin/bash
#$ -V
#$ -v FILENAME=testfile
#$ -wd /net/homes2/informatik/augustin/robm/working_dir
#$ -o examplejob.out
#$ -e examplejob.err
#$ -l h_rt=0:1:00
#$ -q short
#$ -A TG-MCB090174
#$ -pe mp 1
function aborted() {
  echo Aborted with signal $1.
  echo "signal: $1" >>$HOME/.saga/adaptors/sge_job/$JOB_ID
  echo "end_time: $(LC_ALL=en_US.utf8 date '+%a %b %d %H:%M:%S %Y')" >>$HOME/.saga/adaptors/sge_job/$JOB_ID
  exit -1
}
mkdir -p $HOME/.saga/adaptors/sge_job
for sig in SIGHUP SIGINT SIGQUIT SIGTERM SIGUSR1 SIGUSR2; do trap "aborted $sig" $sig; done
echo "hostname: $HOSTNAME" >$HOME/.saga/adaptors/sge_job/$JOB_ID
echo "qsub_time: Thu Jul 16 15:37:34 2015" >>$HOME/.saga/adaptors/sge_job/$JOB_ID
echo "start_time: $(LC_ALL=en_US.utf8 date '+%a %b %d %H:%M:%S %Y')" >>$HOME/.saga/adaptors/sge_job/$JOB_ID
/bin/touch $FILENAME
echo "exit_status: $?" >>$HOME/.saga/adaptors/sge_job/$JOB_ID
echo "end_time: $(LC_ALL=en_US.utf8 date '+%a %b %d %H:%M:%S %Y')" >>$HOME/.saga/adaptors/sge_job/$JOB_ID


And as I said, manually typing `qsub this_script.sh` works fine. How come I can start it manually, but submitting it
from a different computer via saga fails? The error reason provided by `qstat -j <jobnumber>` is the following:

error reason    1:          07/16/2015 16:23:19 [35257:14045]: execvlp(/var/spool/sge/node070/job_scripts/5228798, "/var/spool/s
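
When many jobs end up in `Eqw`, it can help to pull these lines out of the `qstat -j` output automatically. The following is a minimal sketch that assumes the line layout shown above (`error reason`, a task number, a colon, then the message); `error_reasons` is a hypothetical helper name, and the sample text is the truncated line from this post.

```python
import re

# Matches lines such as: "error reason    1:          07/16/2015 ..."
_ERROR_REASON = re.compile(r"^error reason\s+(\d+):\s+(.*)$")


def error_reasons(qstat_output):
    """Return (task_number, message) pairs found in `qstat -j` text."""
    reasons = []
    for line in qstat_output.splitlines():
        match = _ERROR_REASON.match(line.strip())
        if match:
            reasons.append((int(match.group(1)), match.group(2)))
    return reasons


# The sample is truncated exactly as in the original post.
sample = (
    "error reason    1:          07/16/2015 16:23:19 [35257:14045]: "
    'execvlp(/var/spool/sge/node070/job_scripts/5228798, "/var/spool/s'
)
for task, message in error_reasons(sample):
    print(task, message)
```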

Any ideas what's wrong?

Thanks and cheers,
Robert

Andre Merzky

Jul 16, 2015, 11:23:16 AM
to saga-...@googlegroups.com
Hi Robert,

thanks for the feedback! :)

We are not exactly sure what snafu you are hitting there. Would you
mind running your test again while setting

export SAGA_VERBOSE=DEBUG

in your shell environment? That will produce quite a number of debug
messages on stderr. Please capture them and post them, if you don't
mind. If you use password-based ssh connections, please make sure no
account credentials show up in the logs, though.
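
One way to capture that output is to launch the test script from a small wrapper that sets the variable and redirects stderr to a file. This is only a sketch: `run_with_saga_debug` is a hypothetical helper, and `touch_example.py` in the usage comment is a placeholder for your own script. The log still needs to be checked by hand for credentials before posting.

```python
import os
import subprocess
import sys


def run_with_saga_debug(argv, log_path):
    """Run argv with SAGA_VERBOSE=DEBUG and write its stderr to log_path."""
    # Copy the current environment and add the debug switch.
    env = dict(os.environ, SAGA_VERBOSE="DEBUG")
    with open(log_path, "wb") as log:
        # Only stderr is redirected; stdout still goes to the terminal.
        return subprocess.call(argv, env=env, stderr=log)


# Example usage (placeholder script name):
# run_with_saga_debug([sys.executable, "touch_example.py"], "debug_output.txt")
```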

Thanks, Andre.


PS.: I am not sure I understand what the `Eqw` state would refer to --
is that an SGE backend state, or a typo?



--
99 little bugs in the code.
99 little bugs in the code.
Take one down, patch it around.

127 little bugs in the code...

Andre Merzky

Jul 16, 2015, 11:28:10 AM
to saga-...@googlegroups.com
On Thu, Jul 16, 2015 at 11:22 AM, Andre Merzky <an...@merzky.net> wrote:
>
> PS.: I am not sure I understand what the `Eqw` state would refer to --
> is that an SGE backend state, or a typo?

Never mind that part, found it meanwhile *blush*
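
For readers with the same question: `Eqw` is a state string from SGE's `qstat`, where each letter is a separate flag, so `Eqw` reads as error, queued, waiting. A small sketch of decoding such strings follows. The table covers only a subset of the letters `qstat` can print, and `decode_sge_state` is a hypothetical helper name.

```python
# Subset of the single-letter job state flags printed by SGE's qstat.
SGE_STATE_LETTERS = {
    "E": "error",
    "q": "queued",
    "w": "waiting",
    "r": "running",
    "t": "transferring",
    "h": "hold",
    "d": "deletion",
    "s": "suspended",
}


def decode_sge_state(state):
    """Translate a qstat state string such as 'Eqw' into readable flags."""
    return [SGE_STATE_LETTERS.get(letter, "unknown (%s)" % letter) for letter in state]


print(decode_sge_state("Eqw"))  # ['error', 'queued', 'waiting']
```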

Robert Meyer

Jul 16, 2015, 1:46:03 PM
to saga-...@googlegroups.com
Hi,

I attached the DEBUG output as a txt file; it's way too long to post here directly :-).

Cheers,
Robert
debug_output.txt

Robert Meyer

Jul 16, 2015, 1:53:07 PM
to saga-...@googlegroups.com
Well, that was not the debug output for the touch example, sorry.
Anyway, here is the debug output for the touch example from the website (slightly modified, though, to work with our cluster).
debug_output.txt

Mark Santcroos

Jul 18, 2015, 10:41:06 PM
to saga-...@googlegroups.com
Hi Robert,

Thanks for the debugging info.

In the last file you sent I see:
"failed 27 : searching requested shell\nexit_status 0 "

The script generated by the SGE adaptor starts with "/bin/bash".
Could it be that bash is not available on the cluster?
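
One way to test this hypothesis is to read the requested shell out of the generated script and check that it exists and is executable on the host where the check runs (ideally an execution node, not only the login node). The helpers below are hypothetical names and only a sketch. They assume the `#$ -S <shell>` directive form seen in the generated script and fall back to the shebang line.

```python
import os


def requested_shell(script_text):
    """Return the shell named by '#$ -S', else the shebang interpreter, else None."""
    lines = script_text.splitlines()
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#$" and parts[1] == "-S":
            return parts[2]
    # No '-S' directive found: fall back to the first line if it is a shebang.
    if lines and lines[0].startswith("#!"):
        interpreter = lines[0][2:].split()
        return interpreter[0] if interpreter else None
    return None


def shell_is_usable(path):
    """True if path is a regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


script = "#!/bin/bash\n#$ -S /bin/bash\n#$ -V\n"
shell = requested_shell(script)
print(shell, shell_is_usable(shell))
```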

Thanks

Mark

Robert Meyer

Jul 19, 2015, 10:13:33 AM
to saga-...@googlegroups.com, mark.sa...@rutgers.edu
Hi Mark,

it is available; `/bin/bash` seems to work. I mean, I can manually run the bash script that was created by saga-python, and it works.

Cheers,
Robert

Mark Santcroos

Jul 19, 2015, 1:41:27 PM
to saga-...@googlegroups.com
Hi Robert,

I've created a SAGA ticket for this on GitHub. I would like to ask you to continue the discussion there, as you have clearly run into a bug.

https://github.com/radical-cybertools/saga-python/issues/434

Thanks!

Mark