> Hi, I'm looking for a reliable NTP sync check on our cluster.
> So far we've been using:
> node_timex_tick_seconds > 0.5
> Check. But I can't tell either how reliable that is or where my colleagues
> got the idea from.
The node_timex_* metrics come from the adjtimex() system call/C library
function (depending on platform; on Linux it's a system call). I'm
not certain if tick_seconds is a useful metric; on most of our Linux
machines it's 0.01, and on ones that run NTP daemons it is sometimes
slightly different, presumably because the NTP daemon is adjusting the clock.
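If you want to survey what values node_timex_tick_seconds actually takes
across your machines, one quick PromQL sketch is to group hosts by the
metric's current value with the count_values aggregation (the "tick"
label name here is just something I picked):
count_values("tick", node_timex_tick_seconds)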
The node_timex_sync_status metric corresponds to whether or not the
adjtimex() return value is TIME_ERROR.
(node_timex_status corresponds to the .status bitmap/field of the
structure that adjtimex() writes information to.)
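If you trust your NTP daemons to maintain this kernel state, a simple
starting point for an alert expression is something like the following
sketch, since node_timex_sync_status is 1 when the kernel believes the
clock is synchronized:
node_timex_sync_status == 0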
One hopes that an NTP daemon that loses proper synchronization to time
sources will call the kernel to tell it to turn off the adjtimex() 'we
are synchronized' status, but I'm not sure that it will. You may want
to test that. Also, depending on how your NTP daemon is configured,
it may be willing to consider itself synchronized even when what it's
synchronized to is not a reliable source of absolute time (for example,
you have a closed cycle of NTP servers all using each other for time
sources, all drifting down to stratum 10).
My experience with node_ntp_sanity has been mixed. In particular, some
of our hosts have asserted a 0 value for this for unclear reasons, when
as far as I could tell their NTP daemons had maintained sync status to
good time sources. At the time we saw this I didn't know about the
TIME.md documentation; if I had, I might have been able to decode why.
I would certainly look at how often your hosts have a 0 value for this
before setting an alert for it, and possibly tune the relevant command
line options for node_exporter.
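One rough way to see how often this happens is a PromQL sketch like the
following, which relies on node_ntp_sanity only ever being 0 or 1, so
that its average over time is the fraction of samples where it was 1
(the 7d window is an arbitrary choice):
avg_over_time(node_ntp_sanity[7d]) < 1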
If you know that there are certain invariants for your NTP daemon
configuration when it's in proper order, you may want to explicitly
check for them too. In particular, NTP stratum; in many cases, in normal
operation your NTP daemons will all be at or below some stratum
number. If their stratum numbers are rising it is usually a sign that they've
synchronized to a bad clock source or your upstream clock sources have
problems.
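As a sketch, if you've decided that your NTP daemons should normally be
at stratum 3 or better, the corresponding alert expression is simple
(the '3' is just an illustrative threshold, and the node_ntp_* metrics
require node_exporter's ntp collector to be enabled):
node_ntp_stratum > 3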
In our Prometheus installation, I wound up writing a program to make
SNTP queries to our NTP servers and directly report the resulting raw
information as metrics[*]. The node_ntp_* metrics don't fully capture all
of the information that you can get here, and as noted in TIME.md some
of the metrics are a bit processed. One important piece of information
that I don't think you can get directly from the node_ntp_* metrics is whether or not
the NTP daemon considers itself to be a valid time source, although
probably if it isn't, node_ntp_sanity will be 0.
(As a side benefit of our SNTP query program, we can query our upstream
NTP clock sources too.)
Also, all of this is somewhat irrelevant if what you care about is
that hosts have closely synchronized time. My test for that case is:
abs(node_time_seconds - timestamp(node_time_seconds))
which finds the difference between a host's time and the time the
Prometheus server has. Of course, you might want to make any alert
here conditional on the Prometheus server thinking it has good time
synchronization itself.
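As a concrete sketch, an alert condition built on this might look like
the following, where the one second threshold is an arbitrary choice
(and bear in mind that the measured difference also picks up any delay
in the scrape itself):
abs(node_time_seconds - timestamp(node_time_seconds)) > 1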
- cks
[*: We currently run it against various hosts through the script exporter,
https://github.com/ricoberger/script_exporter
]