How to best use NHC to check that time is in sync?


Johan Guldmyr

Feb 28, 2017, 7:37:21 AM2/28/17
to LBNL Node Health Check
Hello!

I'd like to make a node go idle if the time is too much out of sync.

I managed to put together a one-liner that parses the output of "chronyc tracking".
However, it doesn't alter the return code, and I don't know how well NHC checks handle piped output:

echo $(chronyc tracking|grep "System time"|cut -d ":" -f2|cut -d " " -f2) '<' 100|bc -l
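One way the one-liner above could set an exit status instead of printing 1 or 0 is to do the parsing and the comparison in awk. This is only a minimal sketch: `offset_within_limit` is a hypothetical helper name (not part of NHC), and it assumes the "System time" line has the form shown in the comment.

```shell
# Hypothetical helper (not part of NHC): reads `chronyc tracking` output on
# stdin and exits 0 when the reported offset is below LIMIT seconds,
# 1 when it is not, and 2 when no "System time" line was found.
# Assumes a line of the form:
#   System time     : 0.000001501 seconds slow of NTP time
offset_within_limit() {
    awk -v limit="$1" '
        /^System time/ {
            split($0, parts, ":")          # parts[2] holds everything after the colon
            split(parts[2], words, " ")    # words[1] is the offset in seconds
            status = (words[1] + 0 < limit + 0) ? 0 : 1
            found = 1
            exit
        }
        END { exit (found ? status : 2) }
    '
}

# Usage on a node: chronyc tracking | offset_within_limit 100
```

Because the result is an exit status rather than printed text, it can be tested directly with `if` or `||`.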

Does anybody have a better way to check this other than writing a new check?

// Johan

Johan Guldmyr

Feb 28, 2017, 8:14:09 AM2/28/17
to LBNL Node Health Check
** Correction: I'd like to drain a node if the time is too far out of sync.



Lachlan Musicman

Feb 28, 2017, 4:16:02 PM2/28/17
to n...@lbl.gov
I can only presume that the jobs you run depend heavily on jobs running on other nodes, or are otherwise constrained?

If not, can I ask why you don't run a time server like ntp internally? That should keep everything within seconds of each other.

Cheers
L.

--
You received this message because you are subscribed to the Google Groups "LBNL Node Health Check" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nhc+uns...@lbl.gov.
To post to this group, send email to n...@lbl.gov.
Visit this group at https://groups.google.com/a/lbl.gov/group/nhc/.

Johan Guldmyr

Feb 28, 2017, 10:39:02 PM2/28/17
to LBNL Node Health Check
Hi, sometimes there are hardware failures, say of the BIOS battery, that leave the clock very wrong on boot. Even chrony takes a long time to correct that, and in the meantime Slurm is unhappy.

We run a local time server.

Daniel Letai

Mar 1, 2017, 12:32:54 AM3/1/17
to n...@lbl.gov

Just run

ntpdate -u <ntpserver>

as a startup script. chrony will keep the clock correct once the initial sync is done.

Johan Guldmyr

Mar 1, 2017, 12:48:58 AM3/1/17
to LBNL Node Health Check
That's already being planned. An idea is to also run "hwclock".
Now can we go back to my question about having NHC drain the node if time is out of sync?

// Johan

Michael Jennings

Mar 1, 2017, 5:17:41 PM3/1/17
to LBNL Node Health Check
The simplest way to do this, in my mind, is probably to use
check_cmd_output(), something like this:

* || check_cmd_output -t 10 -m '!/offset -?[0-9]{3,}\./' ntpdate -q pool.ntp.org

This will capture the output of the command above (ntpdate -q
pool.ntp.org) and fail the check if either ntpdate returns non-zero or
if the output contains a number with 3 or more digits before the
decimal (i.e., 100 or higher). You could also use the same technique
with your "chronyc tracking" command as well:

* || check_cmd_output -t 2 -m '!/: [0-9]{3,}\.[0-9]+ seconds [a-z]+ of NTP time/' chronyc tracking

If you wanted to use your specific command, you should still be able
to use check_cmd_output(), but you may need to play around a bit with
quoting and such, and possibly invoke it via /bin/bash -c '...' or
split it up into 2 separate lines (one that sets a variable and one
that checks its value). I haven't tested it, but something like this
might work:

* || export NHC_NTP_TIME_DELTA=$(chronyc tracking|grep "System time"|cut -d ":" -f2|cut -d " " -f2)
* || check_cmd_output -m 1 expr $NHC_NTP_TIME_DELTA '<' 100
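
A caveat on the untested snippet above: expr only compares integers numerically and falls back to string comparison otherwise, while "chronyc tracking" reports a decimal offset, so a value such as 20.5 may not compare against 100 as intended. A minimal sketch of a numeric comparison using awk follows; `float_lt` is a hypothetical helper name and is not part of NHC.

```shell
# float_lt A B: exit 0 when A is numerically less than B, 1 otherwise.
# Hypothetical helper, not part of NHC; awk handles the decimal values
# that chronyc reports.
float_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}

# Example: 20.5 is below 100, so this prints "in sync".
if float_lt 20.5 100; then
    echo "in sync"
else
    echo "out of sync"
fi
```

Adding 0 to each value forces awk to treat it as a number, so the comparison does not depend on how the strings happen to sort.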

So yeah, there are lots of ways to accomplish what you want, I think! :-)

Also, if you run ntpd locally on the nodes, you'll likely want to have
NHC check that too, via check_ps_service().

Hopefully something in the above will be helpful to you! :-)

Michael



--
Michael Jennings (KainX) https://medium.com/@mej0/ <m...@eterm.org>
Linux/HPC Systems Engineer, LANL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"The trouble with doing something right the first time is that nobody
appreciates how difficult it was." -- Walt West

Johan Guldmyr

Mar 2, 2017, 1:20:01 AM3/2/17
to LBNL Node Health Check
Nice!

Thanks Michael :)

I think some of these should work quite nicely.

// Johan