(1) We run Solexa with the same qsub/qmake setup you do, without problems
(unless some input configuration error was made).
(2) By "shepherd error", do you mean "cannot get connection to shepherd"?
We've hit that problem, and I asked about it on the SGE list:
http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20496
http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=20509
These errors occur every now and then, and on some random node. Re-running
the pipeline usually succeeds, but a failed run can be expensive in terms of
time and resources.
I managed to reduce, but not eliminate, the number of these errors by
mounting NFS via tcp instead of udp, and by adding options to ssh in the SGE
config:
    rsh_command /usr/bin/ssh -o ConnectionAttempts=5 -o \
    ConnectTimeout=60
Also, I was told that SGE spool directories should be on each node's local
disk (they are, here).
We also get occasional "can't get qrsh_exit_code" errors which stop the
pipeline. These occur when Solexa tasks are running on the same node as
other tasks that cause a high load (e.g. heavy swapping).
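Since re-running usually gets past the shepherd connection errors, one
option is to wrap the pipeline launch in a small retry loop. What follows is
only a minimal sketch in Python: run_with_retries, the attempt count and the
delay are hypothetical names and values chosen for illustration. They are
not part of SGE or the Solexa pipeline, and the command you pass in is
whatever you normally use to start the run.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=60, sleep=time.sleep):
    """Run cmd (a list of arguments) until it exits 0 or attempts run out.

    Hypothetical helper, not part of SGE. Returns the exit code of the
    last attempt, so the caller can still tell that the run failed.
    """
    returncode = 1
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.call(cmd)
        if returncode == 0:
            return 0
        print("attempt %d of %d failed with exit code %d"
              % (attempt, max_attempts, returncode), file=sys.stderr)
        # Pause before the next try; assumption: a transient connection
        # problem may have cleared by then. No pause after the last try.
        if attempt < max_attempts:
            sleep(delay_seconds)
    return returncode


# Example call (the command line is a placeholder, substitute your own):
# exit_code = run_with_retries(["qmake", "-cwd", "-v", "PATH", "--", "all"])
```

This does nothing about the cause of the errors; it only saves noticing the
failure and restarting by hand. If the run is driven by make/qmake, a retry
should skip targets that were already built, so it costs less than a full
rerun.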
Todd Heywood