NTP and DCF77 leap logs

Serge Bets

unread,

Jan 2, 2006, 8:17:48 AM1/2/06

to

Hello everybody, and best wishes to you all. Here are some commented
logs and my quick notes about *The Leap*:

Leap second insertion on Saturday 31 December 2005 at 23:59:60 UTC.
Local time of event is Sunday 1 January 2006 at 00:59:60 CET +0100.

DCF77 receiver: Jonathan A. Buzzard's Conrad 641138 simple DIY clock.
Recorder: Jon Atkins' radioclkd2 0.06 slightly modified.
Recorded DCF77 pulses: "." == 0, "1" == 1, "x" == no pulse.

TZL mmmmMMM hhhhHH ddddDDjjjmmmmMyyyyYYYY
|0 |5 |10 |15 |20 |25 |30 |35 |40 |45 |50 |55 |60
..................111...11.11.......1.....1111.....11.....1x 2006-01-01 (day 7) 00:58 CET leap second soon main ant
..................1111..11.1........1.....1111.....11.....1x 2006-01-01 (day 7) 00:59 CET leap second soon main ant
..................111........1.....11.....1111.....11.....1.x 2006-01-01 (day 7) 01:00 CET leap second soon main ant
..................1.11......11.....11.....1111.....11.....1x 2006-01-01 (day 7) 01:01 CET main ant
..................1.1.1.....11.....11.....1111.....11.....1x 2006-01-01 (day 7) 01:02 CET main ant
|0 |5 |10 |15 |20 |25 |30 |35 |40 |45 |50 |55 |60

data(60): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,2,2,1,2,2,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,0,
data(60): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,2,2,1,2,1,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,0,
data(61): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,1,0,
data(60): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,2,2,1,1,1,1,1,1,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,0,
data(60): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,2,1,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,0,

Note that both radioclk-1.0 and radioclkd2 0.06 original radioclock
drivers failed to decode the 01:00 minute: They did not detect the
minute marker, fooled by a non-monotonic gettimeofday() repeating the
second 59. Example radioclkd2 0.06: Instead of decoding one 61 seconds
minute 01:00, followed by one 60 seconds minute 01:01, it decoded one
120 seconds minute 01:01. Raw data:

data(120): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,2,2,1,1,1,1,1,1,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,2,1,1,1,1,1,2,2,1,1,1,1,1,2,0,

In clear: The minute marker sequence "0,1" was wrongly timed as one "1".
Raw gettimeofday() timings (not corrected by time1 0.020 offset):

pulse start: at 1136073598.019534
pulse end: length 0.191548 - 58: 2
pulse start: at 1136073599.021453 (that's the 1st second 59)
pulse end: length 0.091303 - 59: 1
pulse start: at 1136073600.018443 (that's the only second 0)
pulse end: length 0.096868 - 60: 1
pulse start: at 1136073601.018891
pulse end: length 0.095290 - 61: 1

Main stratum 1 Linux Debian Woody recent ntp-dev 4.2.0b server. DCF
refclock SHM(0) interfaced with Jonathan A. Buzzard's radioclkd 1.0
patched to deal with leap seconds. It leaped gracefully. Just that
instead of inserting a second 60, it repeated twice the second 59.

Sequence of events began in early December, when the server informed by
the NIST leapseconds file started to announce leap=01 30 days in
advance.
Then it has set kernel leap bits (0x11 ==> PLL+INS) 1 day in advance.
Then DCF77 radio announced the leap 1 hour in advance.
And the kernel leaped.

Dec 31 01:00:45 kassad ntpd[13775]: kernel time sync enabled 0011
Jan 1 00:01:00 kassad radioclkd[259]: DCF77 on DCD line announces leap second insertion
Jan 1 00:59:59 kassad kernel: Clock: inserting leap second 23:59:60 UTC
Jan 1 01:01:00 kassad radioclkd[259]: DCF77 on DCD line stops announcing leap second

Later DCF77 shut off bit 19.
The NTP daemon announced leap=00 and tai=33.
All was well, frequency and phase.
Success: Congratulations, professor Mills! Much thanks to you, and to
all the team that made this possible. :-)

I also had a bunch of various stratum 2 clients.

- Woody ntp-dev 4.2.0b 20051225 leaped gracefully.

- Old Linux xntpd 3-5.93e leaped gracefully. Only difference is it took
the leap=01 only on the last day, as per NTPv3 specification.

Dec 31 01:00:23 biben xntpd[130]: system event 'event_sync_chg' (0x03) status 'leap_add_sec, sync_ntp, 7 events, event_peer/strat_chg' (0x4674)
Jan 1 00:59:59 biben kernel: Clock: inserting leap second 23:59:60 UTC
Jan 1 01:00:19 biben xntpd[130]: system event 'event_sync_chg' (0x03) status 'leap_none, sync_ntp, 15 events, event_peer/strat_chg' (0x6f4)

- Windows 98SE running Tardis 2000. Never took leap bits, and whether
it leaped or not is unsure. I failed the logging.

- Windows 2000 running w32time. Took the leap=01 one month in advance,
and was since then repeatedly polling the server each day some seconds
around midnight UTC. Previous midnights it didn't leap, this one it
leaped gracefully.

- 2 autonomous DCF clocks did not leap. But they resynced to the radio
during the following couple of minutes. Normal hourly resync.

- Windows 2000 running ntp-4...@1.1436mbg-xmas-o-win32-setup.exe was
apparently behaving well. Announcing leap=01 long in advance, the
service doing itself the leap...

Dec 22 22:12:59 oniko NTP: N/A: Detected positive leap second announcement for 2006-01-01 00:00:00 UTC
Jan 1 01:00:03 oniko NTP: N/A: Inserting positive leap second.

But all went wrong after the leap. The service entered a cycle of
wildly bad frequency, like +50 or -120 instead of -20.8 PPM, with offset
quickly wandering, then time reset. Wander, reset -150 ms. And so on for
hours, without any sign of stabilisation. 20 hours later it is still
totally broken. I give up: Deleted drift file and restarted.

Serge.
--
Serge point Bets arobase laposte point net

Martin Burnicki

unread,

Jan 2, 2006, 3:12:16 PM1/2/06

to

Serge,

Serge Bets wrote:
[...]

> - Windows 2000 running ntp-4...@1.1436mbg-xmas-o-win32-setup.exe was
> apparently behaving well. Announcing leap=01 long in advance, the
> service doing itself the leap...
>
> Dec 22 22:12:59 oniko NTP: N/A: Detected positive leap second announcement
> for 2006-01-01 00:00:00 UTC
> Jan 1 01:00:03 oniko NTP: N/A: Inserting positive leap second.
>
> But all went wrong after the leap. The service entered a cycle of
> wildly bad frequency, like +50 or -120 instead of -20.8 PPM, with offset
> quickly wandering, then time reset. Wander, reset -150 ms. And so on for
> hours, without any sign of stabilisation. 20 hours later it is still
> totally broken. I give up: Deleted drift file and restarted.

Please see my reply to David J. Taylor. Could you please also have a look at
the event logs of those machines and report if there have been any
remarkable messages?

Martin
--
Martin Burnicki

Meinberg Funkuhren
Bad Pyrmont
Germany

Serge Bets

unread,

Jan 3, 2006, 7:03:26 AM1/3/06

to

Hello Martin,

On Monday, January 2, 2006 at 21:12:16 +0100, Martin Burnicki wrote:

> Serge Bets wrote:
>> But all went wrong after the leap. The service entered a cycle of
>> wildly bad frequency, like +50 or -120 instead of -20.8 PPM, with
>> offset quickly wandering, then time reset.

> Could you please also have a look at the event logs of those machines
> and report if there have been any remarkable messages?

I made a mistake: The bad frequency began nearly 2 hours before the
leap, at 22:11 UTC. I was confused by the first "time reset -0.149742 s"
that happened only later at 0:18 UTC, followed by many others.

There are many clockhops, between other S2 peers and S1 server of the
LAN, and also lost reachability (probably good servers seen as insane),
and some minutes later resync to the same server. Those clockhops follow
each reset. The LAN was isolated, and all other peers/servers of the LAN
leaped OK and had stable frequency and sub-ms offset, *after the leap*.
There was not a single bad source.

But what happened 2 hours before is much less clear. I was finishing the
scripts to monitor the event, and especially had done a pair of dirty
experiments on one other S2 machine. Suspect.

Another notable fact: The bad frequency was around +50 (+-10) all the
(very long) time between 22:11 and 6:00, despite the wander, despite the
resets, and when the normal drift of this machine is extremely stable
-20.8 PPM (+-0.3 of thermal effect). Only after 6:00 freq jumped to
-120, then later to +450, and such. Since killed drift file and service
restarted, the freq is back normal:

| $ ntpq -c "rv 0 frequency" Oniko
| status=06f4 leap_none, sync_ntp, 15 events, event_peer/strat_chg
| frequency=-20.756

Hint for David: Look at 4th column of loopstats file for the frequency
correction recalculated at each poll.

David J Taylor

unread,

Jan 3, 2006, 9:02:36 AM1/3/06

to

Serge Bets wrote:
[]

> Hint for David: Look at 4th column of loopstats file for the frequency
> correction recalculated at each poll.
>
>
> Serge.

Thanks, Serge. Both the two client PCs below were running the ntpd
software which inserted the leap second into Windows, and were taking the
time both from the Internet and two local servers. Unfortunately, I don't
have the loopstats for the Internet-only facing PCs, which were not
running the newer ntpd.

On client Odin, during 2005 Dec 31 the drift changed from:

-7.449 at midnight
-7.628 at 34741 seconds
-7.485 at 85768 seconds

On 2006 January 01

500.000 at 906 seconds
drifting gradually down to
480.562 at 27190 seconds

when I deleted the drift file and restarted ntpd
-120.441 at 28797 seconds
-41.424 at 46727 seconds
-9.302 at 86295 seconds

On client Stamsund, the Dec 31 drift value was around -12

On 2006 Jan 01

-11.775 at 53736 971 seconds
-12.715 at 7135 seconds
-10.968 at 86384 seconds

so that system behaved quite reasonably. I did not have to restart ntpd
on that system.

Cheers,
David

Serge Bets

unread,

Jan 5, 2006, 7:15:15 AM1/5/06

to

On Tuesday, January 3, 2006 at 13:03:26 +0100, Serge Bets wrote:

> But what happened 2 hours before is much less clear.

Update after some more log crossing, and a pair of replay experiments:
The bad frequency I saw on Windows was entirely due to a perturbating
source 2 hours before the leap. Now, the fact that this initial
perturbation was not at all stabilised and even aggravating 22 hours
later is really annoying.

The Windows ntpd specific new super-fast slewing works very well, as 2
replays in normal conditions have shown me a very clean post-leap
behaviour. Stable frequency and offset. Bravo Martin.

In contrast, I replayed a leap with older code: False time during
10 minutes, time reset -1.000938 s at 0:10:46, and frequency explosion
-629 followed by a long period of instability. Many other time resets.
I can imagine what would happen if Murphy makes sources unreachable at
this moment... Wander of 3 centuries per minute.

Conclusion: The inclusion in the daemon of code doing the leap on
platforms where the kernel does not itself leap should be reevaluated.
Bug 508 discussion shows it's not easy to do universaly right, but the
Win prototype from Meinberg nicely proves it's possible.

While at it, I also simulated a negative leap deletion: Step +1 and
frequency explosion exceeding +500 PPM unfortunately.

David L. Mills

unread,

Jan 5, 2006, 10:53:46 AM1/5/06

to

Serge,

I've said this several times; I say it again. The frequency surge must
be a bug; the only acceptable behavior is either to obey the leap as
designed or to ignore the leap and coast for 17 minutes one second off
the sources and then step to the correct offset. The frequency should
not be affected. This is not a case of frequency discipline
mismanagement; this is a bug.

Dave

Serge Bets

unread,

Jan 5, 2006, 1:59:51 PM1/5/06

to

On Thursday, January 5, 2006 at 15:53:46 +0000, David L. Mills wrote:

> the only acceptable behavior is either to obey the leap as designed or
> to ignore the leap and coast for 17 minutes one second off the sources
> and then step to the correct offset. The frequency should not be
> affected.

With all due respect, I'll formulate the following opinion: The coast
and step behaviour can't be placed in the acceptable category.

- Gives wrong time on the machine.
- Serves wrong time to downstream clients.
- Does a step backward at some unpredictable moment.
- Depends on sources leaping well or not.

Serge, frightened.

David L. Mills

unread,

Jan 6, 2006, 2:12:25 PM1/6/06

to

Serge,

Please lapse back into engineering mode and reformulate your opinion.
Seriously consider what happens when (a) upstream sources indicate a
leap and the kernel does not implement a leap function, (b) the radio
does not recognize the leap warning and steps only some minutes later,
(c) one or more samples of extreme error arrive outside the step
threshold and then return to nominal, (d) there is no advance warning
and the upstream sources step on or after the leap. Once you consider
these cases in detail and provide a compromise solution, I would be very
happy to learn your opinion. Meanwhile, the envelope of acceptable
behavior is the current implementation.

Dave

Serge Bets

unread,

Jan 6, 2006, 6:54:27 PM1/6/06

to

On Friday, January 6, 2006 at 19:12:25 +0000, David L. Mills wrote:

> Serge Bets wrote:
>> The coast and step behaviour can't be placed in the acceptable
>> category.
>> - Gives wrong time on the machine.
>> - Serves wrong time to downstream clients.
>> - Does a step backward at some unpredictable moment.
>> - Depends on sources leaping well or not.

And that step is a time reset, wiping the carefully accumulated history.

> (a) upstream sources indicate a leap and the kernel does not implement
> a leap function

This case is perhaps not the perfect one, but is the best available on
some platforms. It is there considered to be normal conditions. To coast
and step is ugly. A nicer behaviour would be: Ntpd performs the leap,
compensating kernel deficiency.

The other cases (b), (c), and (d) are broken: No way to be sure if it's
really a leap. Coast and step is acceptable. The response could be
changed if we knew in advance a leap is occuring, either thru other
sources or thru leapseconds file. Then kernel or ntpd would leap in
time, and perhaps we receive some wrong samples until the bad source
itself leaps late.

> Once you consider these cases in detail and provide a compromise
> solution, I would be very happy to learn your opinion.

So I'll reformulate: Coast and step is ugly. Doing such things in fubar
conditions is indeed acceptable, when no better behaviour can be
imagined. But doing it in normal conditions is not.

The scope of my coast and step criticism was case (a) only.

How many different platforms have non-leaping kernels? How many
stratum 1 or 2 servers run ntpd on such platforms? How many good leaping
kernel clients can be disturbed by such servers? What percentage of the
December 2005 leap problems can be attributed directly or indirectly to
coast and step?

Serge.