System date/time not synced on freeBSD

727 views
Skip to first unread message

Luca Baldini

unread,
Nov 23, 2016, 8:40:38 AM11/23/16
to gce-discussion
Hi, I've several instances that were all working perfectly for months. In the latest days I start to see problems related to date/time synchronization and NTPD.

I haven't changed anything in my configurations but here is what I see:

My /etc/ntp.conf is this:

server metadata.google.internal iburst

restrict
default kod nomodify notrap nopeer noquery
restrict
-6 default kod nomodify notrap nopeer noquery

restrict
127.0.0.1
restrict
-6 ::1
restrict
127.127.1.0

logfile
/var/log/ntpd.log


The ntpd service is up and running:
*******:/ # ps daux | grep ntp
root        
1352   0.0  0.4  22016 13928  -  Ss    8:54AM  0:00.04 |-- /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f /var/db/ntpd.drift


The log shows no errors whatsoever:
*******:/ # cat /var/log/ntpd.log
23 Nov 08:54:05 ntpd[1352]: Listen and drop on 0 v6wildcard [::]:123
23 Nov 08:54:05 ntpd[1352]: Listen and drop on 1 v4wildcard 0.0.0.0:123
23 Nov 08:54:05 ntpd[1352]: Listen normally on 2 vtnet0 [**MYLOCALIP**]:123
23 Nov 08:54:05 ntpd[1352]: Listen normally on 3 lo0 [::1]:123
23 Nov 08:54:05 ntpd[1352]: Listen normally on 4 lo0 [fe80::1%2]:123
23 Nov 08:54:05 ntpd[1352]: Listen normally on 5 lo0 127.0.0.1:123
23 Nov 08:54:05 ntpd[1352]: Listening on routing socket on fd #26 for interface updates
where **MYLOCALIP** is my local ip address.


If I stop and start the ntpd service everything seems working:
********:/ # ntpq -p
     remote           refid      st t
when poll reach   delay   offset  jitter
==============================================================================
*metadata.google 52.117.126.26    2 u    1   64    1    0.328  -272.00 172.488

********:/ # ntpq -c sysinfo
associd
=0 status=0618 leap_none, sync_ntp, 1 event, no_sys_peer,
system peer
:        metadata.google.internal:123
system peer mode
:   client
leap indicator
:     00
stratum
:            3
log2 precision
:     -24
root delay
:         1.332
root dispersion
:    7991.809
reference ID
:       169.254.169.254
reference time
:     dbdfd9b9.45b11621  Wed, Nov 23 2016  9:02:49.272
system jitter
:      0.000000
clock jitter
:       18.991
clock wander
:       0.000
broadcast delay
:    -50.000
symm
. auth. delay:  0.000
root@test
-orologio:/var/log #


but, after some time (2-3 minutes), the synching procedure stops and I get this:
associd=0 status=0613 leap_none, sync_ntp, 1 event, spike_detect,
(spike_detect)


then this:
********:/ # ntpq -p
     remote           refid      st t
when poll reach   delay   offset  jitter
==============================================================================
 metadata
.google 52.117.126.26    2 u   13   64    3    0.524  -3669.7 3348.20


********:/ # ntpq -c sysinfo
associd
=0 status=0018 leap_none, sync_unspec, 1 event, no_sys_peer,
system peer
:        0.0.0.0:0
system peer mode
:   unspec
leap indicator
:     00
stratum
:            3
log2 precision
:     -24
root delay
:         1.184
root dispersion
:    789.028
reference ID
:       169.254.169.254
reference time
:     dbdfd9c3.4d614c5a  Wed, Nov 23 2016  9:02:59.302
system jitter
:      0.000000
clock jitter
:       18.991
clock wander
:       0.000
broadcast delay
:    -50.000
symm
. auth. delay:  0.000
No more "*" before the metadata.google server name and no_sys_peer...


The following command
ntpdate -d metadata.google.internal


gives the following result:
transmit(169.254.169.254)
receive
(169.254.169.254)
transmit
(169.254.169.254)
receive
(169.254.169.254)
transmit
(169.254.169.254)
receive
(169.254.169.254)
transmit
(169.254.169.254)
receive
(169.254.169.254)
server
169.254.169.254, port 123
stratum
2, precision -20, leap 00, trust 000
refid
[169.254.169.254], delay 0.02588, dispersion 0.11003
transmitted
4, in filter 4
reference time
:    dbdfda89.e993bffe  Wed, Nov 23 2016  9:06:17.912
originate timestamp
: dbdfda89.e993c001  Wed, Nov 23 2016  9:06:17.912
transmit timestamp
:  dbdfda96.131fd2bf  Wed, Nov 23 2016  9:06:30.074
filter delay
:  0.02626  0.02589  0.02588  0.02592
         
0.00000  0.00000  0.00000  0.00000
filter offset
: -11.8329 -11.9424 -12.0527 -12.1624
         
0.000000 0.000000 0.000000 0.000000
delay
0.02588, dispersion 0.11003
offset
-12.052761


23 Nov 09:06:30 ntpdate[1560]: step time server 169.254.169.254 offset -12.052761 sec
so it seems that the ntp is able to call the server, get synch info but it does not apply them... 
And my server is already off by 12 seconds in 5-6 minutes... what could be wrong?

I've tryied changing NTP server and using the NIST one with no luck. What should I look at? Someone else is seeing date/time sync problems on google cloud instances?

Carlos (Cloud Platform Support)

unread,
Nov 23, 2016, 4:23:18 PM11/23/16
to gce-discussion
Hi Luca, 

I deployed a new VM using the image provided on this forum, the NTP synchronization worked.  I compared the configuration file and looks like yours.  I would suggest deploying a new VM using the same image since the test might help to isolate the issue.  I would also try comparing the software and kernel versions. 

freebsd-version -k 
10.3-RELEASE-p11 

$ ntpd --version 
ntpd 4.2.8p8-a (1) 

Andy Carrel

unread,
Nov 23, 2016, 4:29:31 PM11/23/16
to gce-discussion
Your NTP config is probably fine. What you're observing is actually drift in the kernel timecounter. I've also seen substantial drift on my own FreeBSD VMs that use the ACPI-fast timer. You can see which timer you're using with `sysctl kern.timecounter.hardware'.

You can switch to the TSC timer on the CPU by adding the following lines to your system config files:
...in /boot/loader.conf:
> machdep.disable_tsc_calibration=1
> kern.timecounter.invariant_tsc=1
...in /etc/sysctl.conf:
> kern.timecounter.hardware=TSC-low

The loader.conf changes instruct the kernel to not try to do calibration of the TSC, since this calibration could give bad results due to the virtualization, and to ignore the fact that the "invariant TSC" feature is not advertised by the CPU and just assume the CPU has this feature.

The sysctl.conf change instructs the kernel to use the TSC-low timecounter which was made available by the loader.conf changes.

Because there are changes to /boot/loader.conf, you must reboot your kernel to pick up this configuration change.

Hope this helps.

Luca Baldini

unread,
Nov 24, 2016, 12:16:24 PM11/24/16
to gce-discussion
Thanks @AndyCarrel, I've investigated a little bit more and I've seen that changing from TSC-low to ACPI-fast solves(?) the problem... 
A few findings:
- not all my machines shows the problem...
- if a working machine is rebooted it could show the problem (not always)... probably due to hardware change (?)
- if a non-working machine is detected I do the following:
     service ntpd stop
     kern.timecounter.hardware=ACPI-fast
     service ntpd start
and after that everything runs perfectly (after a day the maximum offset is 0.105 ms!!!! and NTP is still working and adapting the time...)
- not all my server has ACPI-fast... some have ACPI-safe (that as far as I understand it's an automatic bug recovery made by the system that sees a buggy ACPI).

So I'll probably write some code that checks the time... and if it drifts too much it changes the kern.timecounter.hardware and keeps checking until it finds a good one.. I need this because a reboot could change hardware and I'm worried about "live migrations" (that I don't know if it could change hardware without SO knowing it).
BUT I'll try your "loader.conf" suggestion... it seems interesting (and maybe solves everything without code) :-)))

@Carlos Thanks! I've an image that was created some time ago and it has been updated recently (it's now 10.2-RELEASE)... I've tryied update my image installing security patches but it did not solve the problem. I have even tryied update the machine to the latest 11.0-RELEASE but no luck)... As I was saying before not all my machines shows the problem... I'm pretty sure I can fix it by running periodic checks and change timer source to the best one... 
I will try what @Andy was saying (change to loader.conf)... it seems a reasonable test to be made... :-)))

Thanks a lot!!!! I'll write here future findings... Cheers!

Luca Baldini

unread,
Nov 24, 2016, 12:35:47 PM11/24/16
to gce-discussion
I've checked my current settings (i.e. settings that are inside the image I use to create new instances):

My boot/loader.conf:

autoboot_delay="-1"
beastie_disable
="YES"
loader_logo
="none"
hw
.memtest.tests="0"
console
="comconsole"
hw
.vtnet.mq_disable=1
kern
.timecounter.hardware=ACPI-safe
aesni_load
="YES"
nvme_load
="YES"

and I see an "unexpected" kern.timecounter.hardware=ACPI-safe there...

That machine does not even have an ACPI-safe...:
> sysctl kern.timecounter.choice

kern
.timecounter.choice: TSC-low(1000) i8254(0) ACPI-fast(900) dummy(-1000000)

And, infact, that machine uses
kern.timecounter.hardware: TSC-low

I've also seen that 
> sysctl  machdep.disable_tsc_calibration
machdep.disable_tsc_calibration: 0
> sysctl  kern.timecounter.invariant_tsc
kern.timecounter.invariant_tsc: 1

so I think that I only need to set disable_tsc_calibration to 1 in loader.conf... I'll give it a try and I'll let you know what I see...

Thanks!!!!!!!!!!!!

Luca Baldini

unread,
Nov 28, 2016, 9:13:36 AM11/28/16
to gce-discussion
I've tryied what Andy suggested and I think that his suggestion IS the solution... :-))

I've tryied looking at var/log/messages and I've seen this:
Nov 20 12:01:00 [SERVER] kernel: Timecounter "TSC-low" frequency 1181407212 Hz quality 1000       delta = -68592788 (-5,49%)
Nov 21 14:13:41 [SERVER] kernel: Timecounter "TSC-low" frequency 1181396059 Hz quality 1000       delta = -68603941 (-5,49%)
Nov 22 08:13:23 [SERVER] kernel: Timecounter "TSC-low" frequency 1250051155 Hz quality 1000       delta = 51155 (0,00%)
Nov 22 13:54:14 [SERVER] kernel: Timecounter "TSC-low" frequency 1181412587 Hz quality 1000       delta = -68587413 (-5,49%)
Nov 22 14:12:29 [SERVER] kernel: Timecounter "TSC-low" frequency 1112737584 Hz quality 1000       delta = -137262416 (-10,98%)
Nov 23 08:13:21 [SERVER] kernel: Timecounter "TSC-low" frequency 1250064322 Hz quality 1000       delta = 64322 (0,01%)
Nov 24 09:12:13 [SERVER] kernel: Timecounter "TSC-low" frequency 1250000000 Hz quality 1000


As shown in the last days the TSC-low counter was calibrated but, somehow, it gave strange results... 
The second 22nd's reboot gave a value that is almost 11% off the "true" (1250000000 Hz) value... that's, probably, the reason why my clock was gone and NTP was unable to cope with that. 

Anyway, putting the machdep.disable_tsc_calibration=1 into the loader.conf file "fixed" the problem... Thanks everyone!

Reply all
Reply to author
Forward
0 new messages