Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Sudden change in precision and jitter

81 views
Skip to first unread message

A C

unread,
Jun 1, 2013, 7:47:30 PM6/1/13
to
I restarted ntpd early this morning (no configuration changes, just
rolling a log and it's been running for over 12 hours now), but for some
reason, for the first time, it recomputed the precision to -12 (used to
be -13) and the jitter on PPS is now reaching a floor value of
0.000244141 which is exactly double (with a slight rounding error) of
its previous floor value 0.000122070. I've seen this doubling once
before prior to switching to 4.2.7 (the original value was 0.000061035
so another exact doubling) but I've never seen the precision change.

Nothing has changed in the hardware and loopstats indicates that it's
doing quite well for offset and dispersion, achieving the same
dispersions as it used to prior to the restart. I don't quite
understand why it recalculated these values and why they are exactly
double the old values when nothing has changed. I've restarted many
times in the past few months and it always returned to -13/122070 (and
usually did so in a few hours). The stability value in the loopstats
file is also reasonable and currently is about the same as before the
restart.

Here's an example from today's loopstats file before and after the restart:

[prior to restart about 0100 UTC]
56444 615.967 0.000005475 -73.823 0.000122070 0.000133 3
56444 623.967 0.000005310 -73.823 0.000122070 0.000137 3
56444 631.984 0.000005350 -73.823 0.000122070 0.000141 3
56444 639.971 0.000004189 -73.823 0.000122070 0.000139 3
56444 647.968 0.000004221 -73.823 0.000122070 0.000138 3
56444 655.985 0.000004861 -73.823 0.000122070 0.000139 3
56444 663.969 0.000004501 -73.823 0.000122070 0.000139 3

[after restart about 2300 UTC]
56444 84759.530 0.000002948 -74.118 0.000244141 0.000134 3
56444 84767.533 0.000005786 -74.118 0.000244141 0.000140 3
56444 84775.551 0.000001827 -74.118 0.000244141 0.000133 3
56444 84783.530 0.000004656 -74.118 0.000244141 0.000134 3
56444 84791.532 0.000002302 -74.118 0.000244141 0.000128 3
56444 84799.534 0.000005531 -74.118 0.000244141 0.000134 3
56444 84807.521 0.000001769 -74.118 0.000244141 0.000126 3
56444 84815.537 0.000002605 -74.118 0.000244141 0.000122 3


The PPM changes because there's a bit of a temperature swing but that
hasn't affected the system before. During the last three days the
temperature swing (I record the room temperature) was very similar with
no ill effects.

The machine's sole purpose in life is ntpd. No major cron jobs (two log
rotations each month), no other applications running save for ntpd and
gpsd (to feed an SHM, PPS is by ATOM).

It's very odd that the new floor is now double the previous floor. It
hasn't shown signs of changing below the new value for the past six
hours. I'll see what happens for the next 12 hours just in case it
regains sanity.

Harlan Stenn

unread,
Jun 1, 2013, 8:01:12 PM6/1/13
to
What OS?

I'm guessing that when ntpd started this time the calculation loop took
a bit longer than before and that caused your precision to get a notch
worse, which triggered the doubling of the base jitter.

H

A C

unread,
Jun 1, 2013, 8:05:18 PM6/1/13
to
NetBSD 5.1 (sparc)

That would be interesting since I have a cron job restarting it at an
odd hour away from any other cron jobs left. I'll check and see if
there are any others that I missed and then move the restart somewhere
if there's an overlap.

I wish I could figure out how 4.2.6 and 4.2.7 differ in the calculations
because under 4.2.6 my jitter is always 0.061 while under 4.2.7 it is
always 0.122 no matter what I do.

Harlan Stenn

unread,
Jun 1, 2013, 8:47:17 PM6/1/13
to
A C writes:
> NetBSD 5.1 (sparc)

Thanks.

> That would be interesting since I have a cron job restarting it at an
> odd hour away from any other cron jobs left. I'll check and see if
> there are any others that I missed and then move the restart somewhere
> if there's an overlap.

See ntp_proto.c:measure_precision() .

> I wish I could figure out how 4.2.6 and 4.2.7 differ in the calculations
> because under 4.2.6 my jitter is always 0.061 while under 4.2.7 it is
> always 0.122 no matter what I do.

There may be more changes in this area soon, as part of the fix for bug
2085.

And you can bring up the source file in question in a web browser
pointed at bk.ntp.org and see the changes that way - you should be able
to bring up the different changed versions.

H
--

David Woolley

unread,
Jun 2, 2013, 5:24:17 AM6/2/13
to
A C wrote:
>
> That would be interesting since I have a cron job restarting it at an
> odd hour away from any other cron jobs left. I'll check and see if

Why are you restarting it? ntpd works best if left to run continuously.

> there are any others that I missed and then move the restart somewhere
> if there's an overlap.
>
> I wish I could figure out how 4.2.6 and 4.2.7 differ in the calculations
> because under 4.2.6 my jitter is always 0.061 while under 4.2.7 it is
> always 0.122 no matter what I do.

There are two factors that affect precision: the clock tick rate and
the time needed to read the clock. The precision will tend to be based
on the latter, rounded up to a multiple of the former. This is then
rounded to a power of two fraction of a second.

If the time to read is very close to an exact multiple of the hardware
resolution, a small perturbation could change the number of ticks, which
might then change the result into the next power of two range.
>

A C

unread,
Jun 2, 2013, 1:44:23 PM6/2/13
to
On 6/2/2013 02:24, David Woolley wrote:
> A C wrote:
>>
>> That would be interesting since I have a cron job restarting it at an
>> odd hour away from any other cron jobs left. I'll check and see if
>
> Why are you restarting it? ntpd works best if left to run continuously.

I know it does...unless there is a bug (a compound bug between ntpd and
the kernel) that causes ntpd to spin out of control every few weeks and
forces me to restart it anyway. By spin out of control I do mean that
CPU usage goes to near 100% and ntpd stops disciplining the clock after
it managed to force the clock to run at some insane rate (e.g. nominal
PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
a few minutes). The end result is that the clock is very wrong, ntpd
has totally stopped doing anything, but somehow it's caught in an
infinite loop with maximum CPU usage meaning almost nothing else on the
system is working right.

I have a remote system that watches the billboard from this instance of
ntpd (by running ntpq -p <IP> from another machine) and when the problem
happens you can see all the offsets are in the tens of thousands and the
log file indicates a series of moderate (less than one second) clock
spikes and clock_syncs followed by either enough of a shift that ntpd
stops bothering to fix the clock (deselects all peers and sits) or an
absurd calculated clock step of approximately 2^32 - 1 seconds even
though the clock itself is actually only out by tens or hundreds of
seconds at most (the initial clock step correction applied when ntpd
restarts has never been more than 200 seconds).

And before anyone says anything, the machine/clock is not broken. It
keeps very good time (offset from PPS is typically less than 30
microseconds) right up until some event trips the bug. At that point
ntpd starts hunting and stepping the clock back and forth (four to five
clock spike_detects within a period of less than five minutes) and the
crash. After I restart it, everything settles back down and stays fine
for several weeks. A few weeks later everything repeats. The timing
between the repeats is not exact, sometimes it happens in three weeks,
sometimes in five. Once in a great while it has happened within days of
a restart but that is rare. Three to five weeks of run time before the
bug appears is the common failure mode.

>
>> there are any others that I missed and then move the restart somewhere
>> if there's an overlap.
>>
>> I wish I could figure out how 4.2.6 and 4.2.7 differ in the
>> calculations because under 4.2.6 my jitter is always 0.061 while under
>> 4.2.7 it is always 0.122 no matter what I do.
>
> There are two factors that affect precision: the clock tick rate and
> the time needed to read the clock. The precision will tend to be based
> on the latter, rounded up to a multiple of the former. This is then
> rounded to a power of two fraction of a second.
>
> If the time to read is very close to an exact multiple of the hardware
> resolution, a small perturbation could change the number of ticks, which
> might then change the result into the next power of two range.


The issue you quote from my message is a matter of calculation not
hardware. 4.2.6 consistently calculates a jitter of 0.061 while 4.2.7
consistently calculates a jitter of 0.122 regardless of what the system
is (or is not) doing at the time. That's entirely a computation change,
it would not be repeatable if it was the hardware. But Harlan pointed
out that the method of calculation did change so that would explain the
difference I see between 4.2.6 and 4.2.7.

However, a small perturbation is exactly what happened to the most
recent restart that sent the jitter from 0.122 to 0.244. I restarted
ntpd manually yesterday when nothing else was running and it recomputed
a jitter of 0.122 again. I did actually discover one overlapping cron
job that would have kept the system busy during the restart of ntpd so I
moved ntpd's restart to an odd time. I'll see what happens in a few
weeks when the cron job restarts ntpd again.

Harlan Stenn

unread,
Jun 2, 2013, 3:42:37 PM6/2/13
to
A C writes:
> On 6/2/2013 02:24, David Woolley wrote:
>>> That would be interesting since I have a cron job restarting it at an
>>> odd hour away from any other cron jobs left. I'll check and see if
>>
>> Why are you restarting it? ntpd works best if left to run continuously.
>
> I know it does...unless there is a bug (a compound bug between ntpd and
> the kernel) that causes ntpd to spin out of control every few weeks and
> forces me to restart it anyway. By spin out of control I do mean that
> CPU usage goes to near 100% and ntpd stops disciplining the clock after
> it managed to force the clock to run at some insane rate (e.g. nominal
> PPM tick adjustment might be -78 and it ramps the tick to +350 PPM over
> a few minutes). The end result is that the clock is very wrong, ntpd
> has totally stopped doing anything, but somehow it's caught in an
> infinite loop with maximum CPU usage meaning almost nothing else on the
> system is working right.

It would be really great to run truss or something in this case, and
I would also hope that the GSoC project to upgrade NTP's loggingd and
debugging interface will help with this too.

If you want to open a bug report on this that's fine too - we won't be
able to do much about it until we can repeat it or debug it, but it
might be useful information for the aforementioned GSoC project.

H

unruh

unread,
Jun 2, 2013, 4:20:48 PM6/2/13
to
Do you have all logging set up (peerstats, loopstats, refclocks, ....)
so you can post the contents of those files around the time that ntp
goes mad? It sure should not be doing that.

David Woolley

unread,
Jun 2, 2013, 4:33:11 PM6/2/13
to

A C

unread,
Jun 2, 2013, 4:43:44 PM6/2/13
to
Yes, all logging is turned on. Main, peer, loop, clock, sys, and raw.
I'll post to this thread next time it takes off. I've been trying to
track this bug down for a long time with no luck so far.

A C

unread,
Jun 2, 2013, 6:26:40 PM6/2/13
to
Possibly but there's some situation unique to ntpd that triggers the
issues. I kept going with that thread on the NetBSD list because the
problem still showed up although with a longer period before the crash.
Various test programs were written and compiled to stress all of the
floating point operators that ntpd might use but nothing ever happened,
the system ran the test applications without fail. That's why I said at
the beginning of this particular thread that it's a combination bug,
NetBSD on sparc plus something very peculiar to ntpd's operations trips
it up. Any other program never has an issue.

A C

unread,
Jun 11, 2013, 12:46:37 PM6/11/13
to
It didn't take too long before everything destabilized and ntpd stopped.
In this case it just spun out and dropped sync but still remained
responsive.

Belowis the main log snippet just before it went haywire. The rest of
the logs will get posted online for download. Everything was working
pretty well (just sys_fuzz messages showing up in the logs prior to
this) and then there's a sudden set of spike detects as the clock
wobbles and falls over (no_sys_peer). It's still performing
calculations and responding to ntpq queries but it's no longer
controlling the clock and, according to the offsets with the external
servers, the clock was now over 30 seconds off.

The times below are UTC. In the remaining logs, the incident occurs
somewhere near 56454d 32900s. I'm leaving the full log intact since
0000 UTC so you can see any trends. Everything is quite stable right up
until very near 56454 32906. According to the clockstats file, the
offset jumps from -0.000022767 to 0.004875262 in one second. Ten
seconds later the offset is 0.034419235. It ramps up until it hits
0.497311954 and suddenly changes sign to -0.497750784. Then it ramps
down to zero, ramps up to 0.496989164 and the sign flips again.

The clockstats doesn't show anything else out of the ordinary. It has
the line from the SHM driver showing 16 samples received and some of
them being bad but the majority good. But this happens for quite a
while and can be seen throughout the clockstats file.

In the loops file, the PPM value was sitting stable near -75.68 when it
suddenly starts to change (dropping to -74.3, then -71.6 then -67.85)
and then a rapid jump to -139.457 where it steadily dropped to -376
until I killed the process.

The loops file doesn't seem to show anything abnormal until the event
occurs but I'm not certain. Same for the rawstats and sysstats files.
I also don't really see anything in peers.

The diagnostic log files are at http://acarver.net/ntpd/crashdata/


Jun 11 06:16:54 sunipx2 ntpd[7545]: ts_min 1370931414.425737206 ts_prev
1370931414.425632206 ts 1370931414.462732423
Jun 11 06:16:54 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz
0.000041586, prior 0.000088949
Jun 11 06:16:54 sunipx2 ntpd[7545]: get_systime prev result
0xd5613f56.7a3a2e65 is 0.014676266 later than 0xd5613f56.76785b81
Jun 11 09:09:02 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.167383 s
Jun 11 09:11:26 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:12:22 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.157203 s
Jun 11 09:14:54 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:15:42 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.147021 s
Jun 11 09:18:14 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_synckk
Jun 11 09:19:02 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.136838 s
Jun 11 09:21:34 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:22:30 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.166043 s
Jun 11 09:24:54 sunipx2 ntpd[7545]: 0.0.0.0 0115 05 clock_sync
Jun 11 09:25:50 sunipx2 ntpd[7545]: 0.0.0.0 0113 03 spike_detect +0.155848 s
Jun 11 09:26:59 sunipx2 ntpd[7545]: 0.0.0.0 0118 08 no_sys_peer
Jun 11 09:50:12 sunipx2 ntpd[7545]: ts_min 1370944211.340914811 ts_prev
1370944211.340809811 ts 1370944212.335866579
Jun 11 09:50:12 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz
0.000002690, prior 0.000025571
Jun 11 09:50:12 sunipx2 ntpd[7545]: get_systime prev result
0xd5617154.564041fa is 0.001012620 later than 0xd5617154.55fde50a
Jun 11 12:44:12 sunipx2 ntpd[7545]: ts_min 1370954651.338258199 ts_prev
1370954651.338153199 ts 1370954652.344166908
Jun 11 12:44:12 sunipx2 ntpd[7545]: sys_fuzz 105000 nsec, this fuzz
0.000079607, prior 0.000012962
Jun 11 12:44:12 sunipx2 ntpd[7545]: get_systime prev result
0xd5619a1c.58628820 is 0.001006960 later than 0xd5619a1c.58208a22
Jun 11 14:26:14 sunipx2 ntpd[7545]: ntpd exiting on signal 15 (Terminated)

A C

unread,
Aug 10, 2013, 1:07:43 PM8/10/13
to
Old thread but new data coming up. After running for a nice while ntpd
finally spun out of control as I've described before. It swung the
clock around and then finally stopped doing anything. When I finally
restarted the clock was over 90 seconds off (the appropriate log entry
here):
Aug 10 16:23:02 sunipx2 ntpd[23542]: 0.0.0.0 c41c 0c clock_step -95.543901 s

I have all stats files turned on so below is a link to a combined file
from the configuration, main log, peers (both filtered for ATOM and SHM
and an unfiltered version), clockstats, loopstats, sysstats, and
rawstats for the time period when the system spun out.

Perhaps any of you can spot something that I'm overlooking in these
files. Everything works great and then it collapses very quickly
(within one or two polling cycles at most).

http://acarver.net/ntpd/combinedlogs20130810.txt

If you need/want more data just say so.

David Lord

unread,
Aug 10, 2013, 2:52:52 PM8/10/13
to
Hi

what hit me was your "tos minsane 1"

Both my GPS and MSF sources I'm told cannot be blacked out by
weather conditions but I also see flying saucers.

ntp2.lordynet.org.uk has been in the pool since late 2009:

# ntp.conf

tos minsane 3
tos orphan 10
tos mindist 0.01

# radioclkd2 -s timepps tty00:-dcd
server 127.127.28.0
fudge 127.127.28.0 stratum 4 time1 0.024000 refid MSFa # 13062901

peer -4 ntp1.lordynet.org minpoll 6 maxpoll 7 iburst
peer -4 ntp3.lordynet.org minpoll 6 maxpoll 7 iburst

server -4 ntp0.lordynet.org.uk minpoll 6 maxpoll 7 iburst
server -4 xxxxx minpoll 6 maxpoll 7 iburst prefer

server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12 iburst
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12
server -4 xxxxx minpoll 8 maxpoll 12


# ntpq -c rv -p

associd=0 status=061d leap_none, sync_ntp, 1 event, kern,
version="ntpd 4.2.6p5-o Wed Feb 1 07:49:06 UTC 2012 (import)",
processor="i386", system="NetBSD/6.1_STABLE", leap=00, stratum=2,
precision=-18, rootdelay=0.695, rootdisp=411.242, refid=192.168.59.61,
reftime=d5b0fec3.f491b27b Sat, Aug 10 2013 18:02:43.955,
clock=d5b0ff83.4cef85c4 Sat, Aug 10 2013 18:05:55.300, peer=38671, tc=7,
mintc=3, offset=0.163, frequency=-49.792, sys_jitter=0.378,
clk_jitter=0.096, clk_wander=0.001, tai=35, leapsec=201207010000,
expire=201312010000

remote refid st t when poll reach delay offset
jitter
==============================================================================
-SHM(0) .MSFa. 4 l 92 64 132 0.000 -3.624
2.700
-ns1.lordynet.or 129.215.42.240 3 u 30 128 376 0.299 -0.597
0.056
-ns3.lordynet.or 129.215.160.240 3 u 85 128 376 0.371 -0.006
0.099
-ns0.lordynet.or 195.173.57.232 3 u 12 128 377 0.383 1.133
0.652
*xxxxxxxxxxxxxxx .PPSb. 1 u 64 128 377 0.695 0.163
0.378
xxxxxxxxxxxxxxxx xxxxxxxxxxxxxx 2 u 98 256 377 21.771 -0.086
0.307
xxxxxxxxxxxxxxxx xxxxxxxxxxxxxx 2 u 190 256 377 23.925 0.833
0.348
xxxxxxxxxxxxxxxx xxxxxxxxxxxxxx 2 u 36 256 377 29.408 0.156
0.170
xxxxxxxxxxxxxxxx xxxxxxxxxxxxxx 2 u 66 256 377 30.961 1.058
0.219
+xxxxxxxxxxxxxxx xxxxxxxxxxxxxx 2 u 111 256 377 48.076 0.179
0.355

David


A C

unread,
Aug 10, 2013, 5:02:46 PM8/10/13
to
On 8/10/2013 11:52, David Lord wrote:
> A C wrote:
>> Old thread but new data coming up. After running for a nice while
>> ntpd finally spun out of control as I've described before. It swung
>> the clock around and then finally stopped doing anything. When I
>> finally restarted the clock was over 90 seconds off (the appropriate
>> log entry here):
>> Aug 10 16:23:02 sunipx2 ntpd[23542]: 0.0.0.0 c41c 0c clock_step
>> -95.543901 s
>>
>> I have all stats files turned on so below is a link to a combined file
>> from the configuration, main log, peers (both filtered for ATOM and
>> SHM and an unfiltered version), clockstats, loopstats, sysstats, and
>> rawstats for the time period when the system spun out.
>>
>> Perhaps any of you can spot something that I'm overlooking in these
>> files. Everything works great and then it collapses very quickly
>> (within one or two polling cycles at most).
>>
>> http://acarver.net/ntpd/combinedlogs20130810.txt
>>
>> If you need/want more data just say so.

>
> Hi
>
> what hit me was your "tos minsane 1"
>
> Both my GPS and MSF sources I'm told cannot be blacked out by
> weather conditions but I also see flying saucers.

In theory my GPS shouldn't get knocked out either. I've very rarely
seen it misbehave and certainly not on the order of once every few weeks
(more like once a year). I tossed in minsane 1 in the event that my
connections to the remotes failed so that it would continue with the
local GPS data (PPS from ATOM and time from SHM). I think when I first
started tinkering with it I did have some network problems like that. I
know the mindist has helped during flakiness of the whole system when
things might just teeter on the edge of the default mindist, especially
the SHM data which moves around by many tens of milliseconds (+/- 70ms
is fairly common).

The billboard on the system right now after having restarted it a few
hours ago looks fairly normal:

associd=0 status=011d leap_none, sync_pps, 1 event, kern,
version="ntpd 4.2....@1.2483-o Sun Apr 1 07:58:59 UTC 2012 (1)",
processor="sparc", system="NetBSD/5.1", leap=00, stratum=1,
precision=-13, rootdelay=0.000, rootdisp=50.195, refid=PPS,
reftime=d5b12777.e1de3481 Sat, Aug 10 2013 20:56:23.882,
clock=d5b12785.46699541 Sat, Aug 10 2013 20:56:37.275, peer=39391, tc=4,
mintc=3, offset=0.013, frequency=-74.925, sys_jitter=0.122,
clk_jitter=0.010, clk_wander=0.015, tai=35, leapsec=201207010000,
expire=201312280000
remote refid st t when poll reach delay offset jitter
======================================================================
oPPS(0) .PPS. 0 l 14 16 377 0.000 0.013 0.122
-SHM(0) .GPSD. 4 l 16 16 377 0.000 52.092 5.353
+v v 2 u 163 512 377 40.618 1.243 0.422
+w w 2 u 26 512 377 34.010 1.060 0.528
-x x 2 u 98 512 377 93.175 0.418 0.352
-y y 2 u 536 512 377 77.223 2.690 2.236
+z z 2 u 143 512 377 85.161 0.621 0.231


I just can't figure out what causes the issue since it's very random.
Sometimes it can go weeks or months without issue and other times it can
crash in days.

A C

unread,
Aug 12, 2013, 12:35:46 AM8/12/13
to
I just took a look at two days worth of loop files (plotting with
ntpdloopwatch) which shows that the clock offset is reasonable (-0.048
ms to +0.027 ms) over that period. So initially it seems that the
system can keep fairly reasonable time which makes the sudden upsets
even more peculiar.

http://acarver.net/ntpd/loops20130812.PNG
0 new messages