Exec host in permanent "unknown" state

Sean Davis

Apr 14, 2008, 2:57:43 PM
to Grid Engine Life Science SIG
I have a single execution host that seems to be parked in a permanent
"unknown" state:

# qstat -explain a
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
al...@octopus.nci.nih.gov      BIP   0/8       0.00     lx24-amd64
----------------------------------------------------------------------------
al...@pressa.nci.nih.gov       BIP   0/4       -NA-     lx24-amd64    au
        error: no value for "np_load_avg" because execd is in unknown state
----------------------------------------------------------------------------
al...@shakespeare.nci.nih.gov  BIP   0/8       4.29     lx24-amd64

I have tried stopping and restarting the execd several times to no
avail. Upon looking into the qmaster/messages file, I see it full of
messages like:

04/14/2008 14:47:51|qmaster|shakespeare|C|denied: request for user
"Administrator" does not match credentials for connection
<pressa.nci.nih.gov,execd,1>

I'm assuming that these are associated with requests like the qstat
request above. Any suggestions why this is occurring? The cluster is
not used much, so re-installing the execution host is not out of the
question if it will solve the problem.
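
To see how many of these "denied" entries there are, and whether they all
name the same user and host, the qmaster messages file can be tallied with a
short script. This is only a sketch: it assumes each entry has the
pipe-separated layout of the line quoted above (timestamp, component, host,
level, message), and the function names are made up for illustration.

```python
import re
import sys
from collections import Counter

# A log entry starts with a timestamp such as "04/14/2008 14:47:51|".
ENTRY_START = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")

# Wording copied from the message quoted in this thread; other Grid Engine
# versions may phrase it differently (assumption).
DENIED = re.compile(
    r'denied: request for user "(?P<user>[^"]+)" does not match '
    r"credentials for connection <(?P<host>[^,>]+),(?P<comp>[^,>]+),\d+>"
)


def parse_denied(text):
    """Yield (timestamp, level, user, host, component) per denied entry."""
    entries = []
    for line in text.splitlines():
        if ENTRY_START.match(line):
            entries.append(line.strip())
        elif entries and line.strip():
            # Continuation of a wrapped entry (as in a pasted email).
            entries[-1] += " " + line.strip()
    for entry in entries:
        fields = entry.split("|", 4)
        if len(fields) != 5:
            continue
        timestamp, _component, _host, level, message = fields
        match = DENIED.search(message)
        if match:
            yield timestamp, level, match["user"], match["host"], match["comp"]


def count_denied(text):
    """Count denied entries per (user, host) pair."""
    return Counter((user, host) for _, _, user, host, _ in parse_denied(text))


if __name__ == "__main__":
    # Usage: python denied.py /path/to/qmaster/messages
    with open(sys.argv[1]) as handle:
        for (user, host), n in count_denied(handle.read()).most_common():
            print(f"{n:6d}  {user}  {host}")
```

If every entry names the same user and the same execd host, that points at
how the execd on that one machine identifies itself rather than at client
commands such as qstat.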

Thanks,
Sean

Jesse Becker

Apr 14, 2008, 3:06:13 PM
to Sean Davis, Grid Engine Life Science SIG
Are you running qstat (or other commands) from a Windows box?
Usually, there isn't an account named "Administrator" on *nix systems.
That seems a bit odd, and I wonder if there isn't some sort of
account credential mis-match going on here.

Are there any logs for the execd itself? Check the
$SGE_ROOT/$SGE_CELL/spool/<exechost>/messages file.

Are you sure the execd process is actually running after the restarts?

--
Jesse Becker
GPG Fingerprint -- BD00 7AA4 4483 AFCC 82D0 2720 0083 0931 9A2B 06A2

Sean Davis

Apr 14, 2008, 3:20:06 PM
to Jesse Becker, Grid Engine Life Science SIG
On Mon, Apr 14, 2008 at 3:06 PM, Jesse Becker <haw...@gmail.com> wrote:
>
> Are you running qstat (or other commands) from a Windows box?
> Usually, there isn't an account named "Administrator" on *nix systems.
> That seems a bit odd, and I wonder if there isn't some sort of
> account credential mis-match going on here.

Thanks, Jesse, for the quick answer.

No, we have not gone the Windows route for anything. However, your
hypothesis of credential mismatch may be true. We have a couple of
"desktop" Linux machines in the cluster and the problematic machine is
one of them. For those machines, we do have an "Administrator"
account, which is a local, failsafe login. All other login info is
coming from LDAP. The installation was done several months ago, but
we did not use the cluster; now I am getting back to it, so it is
quite possible that something has changed on the execution machine.
In any case, any suggestions on what to do next?
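
One thing that could be checked on the problem host is whether the local
"Administrator" account shares a numeric UID with another account (root or
the Grid Engine admin user, say), since a daemon's UID would then resolve to
an unexpected name. Nothing in this thread confirms that this is the cause;
the sketch below only illustrates the check, and its function names are
hypothetical.

```python
import os
import pwd
from collections import defaultdict


def shared_uids(entries):
    """Return {uid: [names]} for every UID used by more than one name.

    `entries` is an iterable of (name, uid) pairs; names keep their
    original order.
    """
    by_uid = defaultdict(list)
    for name, uid in entries:
        if name not in by_uid[uid]:
            by_uid[uid].append(name)
    return {uid: names for uid, names in by_uid.items() if len(names) > 1}


def local_accounts():
    """(name, uid) pairs from the password database visible on this host."""
    return [(entry.pw_name, entry.pw_uid) for entry in pwd.getpwall()]


if __name__ == "__main__":
    for uid, names in sorted(shared_uids(local_accounts()).items()):
        print(f"UID {uid} is shared by: {', '.join(names)}")
    # The name the current effective UID resolves to on this host.
    print("running as:", pwd.getpwuid(os.geteuid()).pw_name)
```

Running it on both the problem host and a working execution host, as the
user that starts the execd, would show whether the two machines resolve
that UID to different names.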

> Are there any logs for the execd itself? Check the
> $SGE_ROOT/$SGE_CELL/spool/<exechost>/messages file.

There are no recent messages in here.

> Are you sure the execd process is actually running after the restarts?

Yes, it is running according to "ps".
