[Rocks-Discuss] SGE execd not working with head node (Rocks 5.2.1)


Ivan Adzhubey

Sep 18, 2009, 6:17:42 PM
to npaci-rocks...@sdsc.edu
Hi,

I am having the same problem as Fernando with ROCKS 5.2.2. I installed the SGE
execute daemon on the frontend using the /opt/gridengine/install_execd script.
Everything looks fine, except that sgeexecd on the frontend loses its connection
to qmaster a few seconds after startup. I can see it listed without errors in
qstat -f output for about 10 seconds initially (even with the system load
reported properly), but then the queue instance turns into the 'au' state and
can't be reached anymore.

# tail /opt/gridengine/default/spool/qmaster/messages

09/18/2009 18:05:20|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:06:00|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:06:40|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:07:20|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:08:00|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:08:40|worker|anko|E|no execd known on host anko.local to send
conf notification
09/18/2009 18:09:20|worker|anko|E|no execd known on host anko.local to send
conf notification

# ps -efl |grep sge
5 S sge 28844 1 0 85 0 - 54639 184466 17:54 ?
00:00:00 /opt/gridengine/bin/lx26-amd64/sge_qmaster
5 S sge 29633 1 0 75 0 - 32782 184466 17:55 ?
00:00:00 /opt/gridengine/bin/lx26-amd64/sge_execd

# qping anko.local 537 execd 1
09/18/2009 18:12:36 endpoint anko.local/execd/1 at port 537 is up since 999
seconds
09/18/2009 18:12:37 endpoint anko.local/execd/1 at port 537 is up since 1000
seconds
09/18/2009 18:12:38 endpoint anko.local/execd/1 at port 537 is up since 1001
seconds
09/18/2009 18:12:39 endpoint anko.local/execd/1 at port 537 is up since 1002
seconds
09/18/2009 18:12:40 endpoint anko.local/execd/1 at port 537 is up since 1003
seconds

# qstat -explain a
queuename qtype resv/used/tot. load_avg arch
states
---------------------------------------------------------------------------------
al...@anko.local BIP 0/0/8 -NA- lx26-amd64 au
error: no value for "np_load_avg" because execd is in unknown state
---------------------------------------------------------------------------------
al...@compute-0-0.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------
al...@compute-0-1.local BIP 0/0/8 0.00 lx26-amd64
---------------------------------------------------------------------------------

--Ivan

> Hi, thanks, I didn't notice that!
> I reinstalled (and restarted) a few times, and ps -efl is now:
>
> # ps -efl | grep sge
> 5 S sge 1163 1 0 75 0 - 32788 184466 17:10 ?
> 00:00:00 /opt/gridengine/bin/lx26-amd64/sge_execd
> 0 S root 3068 25790 0 78 0 - 15294 pipe_w 17:20 pts/2
> 00:00:00 grep sge
> 5 S sge 31010 1 0 85 0 - 55162 184466 16:59 ?
> 00:00:00 /opt/gridengine/bin/lx26-amd64/sge_qmaster
>
> as it should've been. Unfortunately, it still doesn't work. :(
> I've restarted both sgemaster and sgeexecd, removed the headnode with
> qconf, reinstalled, etc. Every time I restart, I get some weird error.
> For example:
>
> 08/10/2009 17:23:04| main|jambo|E|commlib error: got select error (Broken
> pipe)
>
> they're mostly related to commlib. Is it possible that the headnode
> might be trying to connect to itself via the external IP? If so, how
> could I fix it?
>
> As additional information, qping to the execd is working:
>
> # qping jambo.local 537 execd 1
> 08/10/2009 17:23:00 endpoint jambo.local/execd/1 at port 537 is up
> since 64 seconds
> 08/10/2009 17:23:01 endpoint jambo.local/execd/1 at port 537 is up
> since 65 seconds
> 08/10/2009 17:23:02 endpoint jambo.local/execd/1 at port 537 is up
> since 66 seconds
> 08/10/2009 17:23:03 endpoint jambo.local/execd/1 at port 537 is up
> since 67 seconds
>
> and the state in qstat -f always becomes "au" when submitting a job,
> even though it can get all the data (ncpu, mem, etc...) before running
> the job.
>
> Running out of ideas :/
>
> Thanks again,
> Fernando


The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Partners Compliance HelpLine at
http://www.partners.org/complianceline . If the e-mail was sent to you in error
but does not contain patient information, please contact the sender and properly
dispose of the e-mail.

Ivan Adzhubey

Sep 18, 2009, 7:54:29 PM
to Discussion of Rocks Clusters
Hi,

Further investigation uncovered several problems. First, the install_execd
script, when run on the master node, adds it to the SGE queue under its local
hostname, e.g. 'anko.local', while the SGE master node is configured with its
public FQDN, e.g. anko.partners.org. This supposedly should not make a
difference with the SGE 'use domain name' configuration option set to 'false'
(the default), but in practice it looks like it does, and the anko.local host
is rejected by qmaster because the name is missing from the @allhosts list
(and possibly from several other places in the SGE config).

Another problem is that, because the qmaster configuration uses the public
FQDN, sgeexecd daemons on the compute nodes communicate with qmaster (the
frontend) via NAT over the public network. This creates an unnecessary loop
through the public network and wastes CPU resources. I would consider this a
bug, unless someone can convince me there were some good intentions behind
this configuration.

This is not the first time I have been hit by SGE name resolution quirks, of
course. But I am not sure how to fix it on ROCKS. Can I edit the SGE
configuration manually? I have noticed RCS metadata directories present
under /opt/gridengine on the compute nodes. Are these just leftovers, or are
the SGE files on the nodes indeed under some kind of version control?

Thanks,
Ivan


Fernando Rozenblit

Sep 18, 2009, 8:30:02 PM
to Discussion of Rocks Clusters
Hello Ivan,

We solved this by doing the following (it was solved in another thread
on this list): http://gridengine.sunsource.net/howto/multi_intrfcs.html
or, in summary:

* Create /opt/gridengine/default/common/host_aliases with the following:
[frontend].local [frontend].external.name

* Edit /opt/gridengine/default/common/act_qmaster to be [frontend].local

Then restart both qmaster and execd and everything will start working. :)

It seems that Rocks makes a non-canonical install of SGE, so there are
these config problems when installing execd on the headnode. I think this
is a bug in Rocks, because with 5.1 (as LaoTsao has tested) this
works with no further changes.

[]s,
Fernando
P.S.: Developers, could Rocks make these changes during install? It would
help a lot. Is there a feature/bug tracker where I should report this?


On Fri, Sep 18, 2009 at 8:54 PM, Ivan Adzhubey
<iadz...@rics.bwh.harvard.edu> wrote:
> Hi,
>
> Further investigation discovered several problems. First, install_execd
> script, when run on master node adds it to SGE queue with local hostname,
> e.g. 'anko.local', while SGE master node is configured with public FQDN, e.g.
> anko.partners.org. It supposedly should not make a difference with SGE 'use
> domain name' configuration option set to 'false' (the default), but in
> practice it looks like it does and anko.local host is rejected by qmaster
> because the name is missing from @allhosts list (and possibly from several
> other places in SGE config).

[...]

Mason J. Katz

Sep 18, 2009, 8:43:16 PM
to Discussion of Rocks Clusters
SGE installed with the SGE Roll should work out of the box if your
hostname is an FQDN and resolvable in DNS. I'm not sure what happens if
you manually re-run the install script.

What issues did you see out of the box?

WRT the naming used: while the public hostname of the frontend is
used by the execution hosts, there is no NAT going on. There is a
static host route that forces all traffic directed to the frontend's
public interface (from the compute nodes) onto the private network,
which means all SGE traffic stays on the private network.
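
That static host route can be illustrated roughly as follows (a sketch only;
the addresses and interface name are invented placeholders, not taken from
this thread):

```
# Hypothetical compute-node view, with 198.51.100.10 as the frontend's
# public IP and eth0 as the private interface:
#
#   ip route add 198.51.100.10/32 dev eth0
#
# "ip route get 198.51.100.10" would then name eth0 directly rather than
# the default gateway, so SGE traffic addressed to the public hostname
# never leaves the private network.
```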

mason katz
+1.240.724.6825

Ivan Adzhubey

Sep 18, 2009, 11:03:30 PM
to Discussion of Rocks Clusters
Hi Mason,

On Friday 18 September 2009 08:43:16 pm Mason J. Katz wrote:
> SGE installed w/ the SGE Roll should work out of the box is your
> hostname is FQDN and resolvable in DNS.

Yes, it works perfectly as configured. But the default out-of-the-box config
does not include the frontend in the list of SGE execute nodes, which is a
huge waste with modern multi-CPU, multi-core nodes. One can have as many as
32 cores per node with current hardware!

The problems only begin when you try to add the frontend as an SGE execute
node manually. I believe I have described them in detail in my post. There is
also a confirmation of this bug and a description of the fix in Fernando's
post above. Apparently, the SGE roll maintainers never tested this scenario.
I think it is worth looking into, since it will be quite a common one as the
CPU core count keeps climbing.

> I'm not sure what happens if you manually re-run the install script.

Yep, just try it and you will see ;-)

> What issues did you see out of the box?

None.

> WRT to the naming used, while the public hostname of the frontend is
> used by the execution hosts there is no NAT going on. There is a
> static host route that forces all traffic directed to the frontend
> public interface (from the compute nodes) onto the private network.
> Which means all SGE traffic is private network only.

Figured it out myself already, sorry for the false alarm. Still, I see it as
unnecessary complexity. Although in my particular case, exactly such a setup
is what I am looking for, since I want to eventually add access to qmaster
from several submit nodes on the public network, outside the cluster.

Best,
Ivan


Ivan Adzhubey

Sep 18, 2009, 11:46:01 PM
to Discussion of Rocks Clusters
On Friday 18 September 2009 08:30:02 pm Fernando Rozenblit wrote:
> Hello Ivan,
>
> We solved this by doing the following (it was solved on another thread
> at this list): http://gridengine.sunsource.net/howto/multi_intrfcs.html
> or, resumed:
>
> * Create /opt/gridengine/default/common/host_aliases with the following:
> [frontend].local [frontend].external.name
>
> * Edit /opt/gridengine/default/common/act_qmaster to be [frontend].local
>
> Then restart both qmaster and execd and everything will start working. :)

It seems that every time /etc/init.d/sgemaster.<cluster> is run, it resets
the contents of the act_qmaster file back to the master host's public FQDN,
after which things get broken again. I am trying to figure out what's doing
this.

--Ivan

Ivan Adzhubey

Sep 19, 2009, 12:20:12 AM
to Discussion of Rocks Clusters

OK, found the offender:

# tail -1 /etc/init.d/sgemaster.anko
/bin/hostname --fqdn > /opt/gridengine/default/common/act_qmaster

This is definitely a ROCKS addition to the standard qmaster init file. What
was the purpose of it? After commenting it out, your recipe finally worked.
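
For anyone scripting the same change, a sed one-liner along these lines would
comment the offending line out (a sketch; INIT is a placeholder path, and the
'anko' cluster name is just the one from this thread):

```shell
# Comment out the init-script line that overwrites act_qmaster on every
# restart. INIT is a placeholder; the script is named after your cluster.
INIT="${INIT:-/etc/init.d/sgemaster.anko}"
sed -i 's|^/bin/hostname --fqdn > .*act_qmaster$|# &|' "$INIT"
```

Note that a Rocks update could re-add the line, so it is worth re-checking
the init script after upgrades.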

Thank you!

--Ivan
