Re: [slurm-users] NHC and slurm

Michael Jennings

Apr 20, 2021, 9:18:01 PM
to Heitor, Slurm User Community List, LBNL Node Health Check
On Thursday, 15 April 2021, at 10:58:31 (-0300),
Heitor wrote:

> I'm trying to set up NHC[0] for our Slurm cluster, but I'm not
> getting it to work properly.

Just for future reference, NHC has its own mailing lists, and even
though your question does relate to Slurm tangentially, it's really an
NHC question, not a Slurm question. :-)

So I've set the Reply-To header to redirect to "n...@lbl.gov" instead.
It's an open list, but I would still encourage you to consider
subscribing at https://groups.google.com/a/lbl.gov/g/nhc

> $ sudo nhc
> ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running
>
> I know sshd is running because I logged in to this machine with ssh.

By itself, this doesn't guarantee that there is still an sshd running
as root. When you connect, the main root-owned sshd process forks off
a separate sshd which is owned by you. It's entirely possible for the
root-owned sshd to exit or crash without impacting existing SSH
sessions. Just to be pedantic. ;-)
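
To see this on a given node, one could tally the sshd processes by owner. The following is only a sketch: `sshd_owners` is a made-up helper name, and it assumes input lines of the form "USER ARGS..." such as `ps -e -o user= -o args=` produces.

```shell
# Hypothetical helper: count sshd processes per owner.
# Reads "USER ARGS..." lines on stdin and looks only at the first word
# of the command line. Session processes often retitle themselves
# (for example "sshd: user@pts/0"), so a trailing colon is accepted.
sshd_owners() {
    awk '$2 ~ /(^|\/)sshd:?$/ { n[$1]++ }
         END { for (u in n) print u, n[u] }' | sort
}
```

Usage: `ps -e -o user= -o args= | sshd_owners`. If root is missing from the output, no root-owned sshd is running, even though user-owned session processes may still be alive.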

> And `systemctl status sshd` shows it is active.

Now *that* is a horse of a different color! ;-) So clearly you're
correct; sshd is definitely running.

> Here's a sample of my nhc.conf:
>
> * || check_ps_service munged
> * || check_ps_service -u root sshd
> * || check_ps_service -u root ssh
> * || check_ps_service ssh
> * || check_ps_service sshd

Only the first 2 lines are correct. The 3rd and 4th lines would look
for "ssh" processes instead of "sshd" processes, and the 5th one would
misinterpret user-owned sshd processes as the main listening sshd
that's owned by root. Not good. ;-) Your config should have:

* || check_ps_service munged
* || check_ps_service -u root sshd

You can also add "-S" to each of those checks if you'd like NHC to
attempt to start the service for you automatically if it's found to
not be running. First, though, we need to figure out why the 2nd
check isn't exhibiting the desired behavior!
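
For reference, with "-S" added, the two checks might look like the fragment below. This is a sketch only; it assumes the option is given before the service name, alongside "-u".

```
* || check_ps_service -S munged
* || check_ps_service -S -u root sshd
```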

> If I run `sudo nhc -a` to run all the tests, it gives 4 errors about
> ssh.
>
> NHC can find munge running, so what's the problem with ssh? What am I
> missing?

Well, if systemd reports the service as being active, it's definitely
running. So the check should pass unless there's something weird
going on....

The next step I'd recommend is to run either in Debug Mode (via -d) or
Trace Mode (via -x); either of those 2 options will show you
everything NHC is receiving back from the "ps" command it runs to
gather process data. In fact, when *I'm* troubleshooting a check,
I'll generally use *both*, and I also use "-e" so that I don't have to
wade through all the other stuff in the config. :-) So I'd do this:

nhc -x -d -l - -e 'check_ps_service -u root sshd'

Or if you prefer, you can send the output to a file by changing the
"-l -" to "-l <file>" instead. That output will show you all the
lines of "ps" output NHC is parsing through and should help to
determine what's going awry.

I should also note that I have never personally run NHC on Debian or
Ubuntu, so it's possible there's a bug lurking somewhere that I just
haven't run across yet....

Hope that helps; let me know what you find (over on n...@lbl.gov)! :-)

Michael

--
Michael E. Jennings <m...@lanl.gov> - [PGPH: he/him/his/Mr] -- hpc.lanl.gov
HPC Systems Engineer -- Platforms Team -- HPC Systems Group (HPC-SYS)
Strategic Computing Complex, Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001

Heitor

Apr 23, 2021, 2:26:21 PM
to n...@lbl.gov
Hello Michael,

On Tue, 20 Apr 2021 19:17:57 -0600
Michael Jennings <m...@lanl.gov> wrote:


> By itself, this doesn't guarantee that there is still an sshd running
> as root. When you connect, the main root-owned sshd process forks off
> a separate sshd which is owned by you. It's entirely possible for the
> root-owned sshd to exit or crash without impacting existing SSH
> sessions. Just to be pedantic. ;-)

I did not know that! That's a very clear explanation, thanks! Now I get
why we should check for an sshd process owned by root instead of any
sshd process.

> The next step I'd recommend is to run either in Debug Mode (via -d) or
> Trace Mode (via -x); either of those 2 options will show you
> everything NHC is receiving back from the "ps" command it runs to
> gather process data. In fact, when *I'm* troubleshooting a check,
> I'll generally use *both*, and I also use "-e" so that I don't have to
> wade through all the other stuff in the config. :-) So I'd do this:
>
> nhc -x -d -l - -e 'check_ps_service -u root sshd'

This is very insightful! The sshd process in Ubuntu is reported as
`sshd:` instead of `sshd`, which is why NHC doesn't find it:

$ nhc -l - -e 'check_ps_service -u root sshd'
ERROR: nhc: Health check failed: check_ps_service: Service sshd (process sshd) owned by root not running
$ nhc -l - -e 'check_ps_service -u root sshd:'
$ ps ax | grep sshd
root 256 0.0 0.0 12176 7272 ? Ss 15:51 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
root 10962 0.0 0.0 13808 9024 ? Ss 16:58 0:00 sshd: ubuntu [priv]
ubuntu 11020 0.0 0.0 13944 6072 ? S 16:58 0:00 sshd: ubuntu@pts/0
$ ps ax | grep munge
10754 ? Sl 0:00 /usr/sbin/munged

It is interesting that sshd has this special name in Ubuntu. I wonder
why? What is the best way forward? Should I check for `sshd:`?
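
The mismatch can be modeled with a small sketch. This is not NHC's actual matching code; it assumes the check compares the first word of each root-owned process's command line (or the last component of its path) against the service name, which is consistent with the output above. `has_root_proc` is a hypothetical helper name.

```shell
# Hypothetical model of the match, NOT taken from NHC's source.
# Reads "USER ARGS..." lines on stdin (e.g. ps -e -o user= -o args=)
# and succeeds if a root-owned process has a first command-line word
# equal to NAME, or a path whose last component is NAME.
has_root_proc() {
    awk -v name="$1" '
        $1 == "root" && ($2 == name ||
            substr($2, length($2) - length(name)) == "/" name) { found = 1 }
        END { exit !found }
    '
}
```

Usage: `ps -e -o user= -o args= | has_root_proc sshd: && echo running`. Under this model, a first word of `sshd:` does not equal `sshd`, so the plain `sshd` check fails while `sshd:` passes.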

Heitor