shepherd error


Victor Ruotti

Feb 27, 2008, 12:01:40 PM
to Grid Engine Life Science SIG
Hello,
Can anyone comment on these? Thanks in advance.
Victor

There are two problems that we keep seeing when running the Solexa
pipeline with SGE.
1- There seem to be a few problems with the recursive part of
qmake. The pipeline tends to halt after Firecrest completes,
and I often have to restart it from within the base-calling
directory.
This is how we currently use qsub and qmake:

qsub -cwd -v PATH -pe make 8-20 /common/bin/qsubmake_v1.1

and /common/bin/qsubmake_v1.1 looks like this:

#!/usr/bin/sh
qmake -cwd -v PATH -inherit -- recursive

I have no idea why the pipeline sometimes fails to execute the
makefiles recursively.

2- We keep seeing the "shepherd error", and the pipeline halts shortly
after it appears. The Solexa pipeline documentation describes a way to
avoid this error (quoted below), but we have not tried it yet.
Maybe the NFS server or connection on my end is not fast enough?

"Another problem that we have observed in the past is grid engine
throwing up a "shepherd error". In our own experience, this error
could be prevented by keeping all log files that the grid engine
daemons produce on a fast local hard drive."


Sun Grid Engine

In order to submit non-interactive batch jobs to the grid engine, a
short wrapper script that can be submitted using qsub is needed. See
the grid engine documentation for details.

An example of such a script is:

#!/usr/bin/sh
qmake -cwd -v PATH -inherit -- recursive

One problem to be aware of with qmake is that the rsh implementation
used by Grid Engine tends to run out of available ports for large
degrees of parallelisation. Several parts of the pipeline farm out
short jobs, and ports may be used up before they expire. The work-
around we used is to switch to ssh as a remote shell. This is
described in http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html.
Common problems with SGE are described in
http://gridengine.sunsource.net/howto/commonproblems.html.

Another problem that we have observed in the past is grid engine
throwing up a "shepherd error". In our own experience, this error
could be prevented by keeping all log files that the grid engine
daemons produce on a fast local hard drive.
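
One way to check the point about log locations is to see where the execution daemons spool and whether that directory sits on NFS. The sketch below assumes that `qconf -sconf` prints an `execd_spool_dir` line and that an NFS file system shows up in `df -P` with a `host:/path` device name. The function names are made up for illustration, and a host-specific configuration (`qconf -sconf <hostname>`) may override the global value.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Print the execd_spool_dir value from `qconf -sconf`-style output on stdin.
spool_dir_from_conf() {
    awk '$1 == "execd_spool_dir" { print $2; exit }'
}

# Read `df -P <dir>` output on stdin and succeed if the device column of the
# data line looks like an NFS export ("host:/path").
looks_like_nfs() {
    awk 'NR == 2 && index($1, ":/") > 1 { found = 1 } END { exit !found }'
}

# Example (needs the SGE tools on PATH, so it is left commented out):
#   dir=$(qconf -sconf | spool_dir_from_conf)
#   if df -P "$dir" | looks_like_nfs; then
#       echo "$dir is on NFS; consider a local spool directory"
#   fi
```

If the spool directory does turn out to be on NFS, that would be consistent with the advice quoted above, though it does not prove that NFS is the cause of the error.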

Heywood, Todd

Mar 4, 2008, 4:04:53 PM
to Victor Ruotti, Grid Engine Life Science SIG
Victor,

(1) We run Solexa with the same qsub/qmake setup you do without problems
(unless some input configuration error was made)

(2) By "shepherd error", do you mean "cannot get connection to shepherd"?
We've hit that problem, and I asked about it on the SGE list:

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20496

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20509

These errors occur every now and then, and on some random node. Re-running
the pipeline usually succeeds, but a failed run can be expensive in terms of
time and resources.
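
Since a second attempt usually gets through, one stopgap is to let the wrapper script retry the make step a few times before giving up. This is only a sketch: `retry` and `RETRY_DELAY` are hypothetical names, and it assumes that re-running qmake in the same directory is safe, as it is with a manual restart.

```shell
#!/bin/sh
# Sketch only: retry a command up to N times, pausing between attempts.
# RETRY_DELAY (seconds, default 60) is a hypothetical tuning knob.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while true; do
        if "$@"; then
            return 0
        else
            status=$?   # exit status of the failed command
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (last exit status $status)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed with status $status, retrying" >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-60}"
    done
}

# Example use inside the qsub wrapper script (left commented out):
#   retry 3 qmake -cwd -v PATH -inherit -- recursive
```

A retry hides the symptom rather than fixing it, so it is worth keeping the messages on stderr to see how often a second attempt was needed.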

I managed to reduce, but not eliminate, the number of these errors by
mounting NFS via tcp instead of udp, and by adding options to ssh in the SGE
config:

rsh_command /usr/bin/ssh -o ConnectionAttempts=5 -o \
ConnectTimeout=60

Also, I was told that SGE spool directories should be on nodes' local disk
(they are, here).
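
To find which NFS mounts on a node still use UDP, something like the following may help. It assumes a Linux node where /proc/mounts lists the transport as `proto=udp` (or as a bare `udp` option); `nfs_udp_mounts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list mount points of NFS file systems mounted over UDP.
# Input on stdin is in /proc/mounts format:
#   device mountpoint fstype options dump pass
nfs_udp_mounts() {
    awk '$3 ~ /^nfs/ && ("," $4 ",") ~ /,(proto=)?udp,/ { print $2 }'
}

# Example (Linux only, left commented out):
#   nfs_udp_mounts < /proc/mounts
```

Any mount point it prints is a candidate for remounting with TCP.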

We also get occasional "can't get qrsh_exit_code" errors which stop the
pipeline. These occur when Solexa tasks are running on the same node with
other tasks which cause a high load (like swapping).

Todd Heywood

Victor Ruotti

Mar 10, 2008, 4:00:40 PM
to Grid Engine Life Science SIG
Thanks Todd,
I will have to move stdout to a local disk and test it that way.

I wonder if this is a network thing with SGE and NFS. Maybe tuning NFS
would help too.

victor



Hawk|

Mar 14, 2008, 9:21:53 AM
to Grid Engine Life Science SIG
Hi,

I also used an NFS server for the cluster, but it was not stable
enough to handle that many I/O requests.
I crashed it several times per pipeline run.
You should test some cluster file systems, which should be able to
handle such I/O load better than NFS.
If you need more help, don't hesitate to ask.

Greetings Hawk

Hawk|

Mar 14, 2008, 9:27:01 AM
to Grid Engine Life Science SIG
Hi,

Me again. I forgot to ask:
which version of SGE do you use? I have never
encountered a shepherd error.
Maybe it's an issue with a specific SGE version.

Greetings Hawk