time issues and ZFS

Daniel Braniss

Jan 21, 2013, 6:33:17 AM

After many trials (and errors), here are some facts:

host: DELL PowerEdge R710, 16GB,
mfi0: <Dell PERC H700 Integrated>
mfid0: 14305280MB (29297213440 sectors) RAID volume 'r5' is optimal
mfi1: <Dell PERC 6>
mfid1: 12393472MB (25381830656 sectors) RAID volume 'Virtual Disk 0' is optimal

we have NO problems with FreeBSD-8.3-STABLE, but with 9.1-STABLE, the real-time
clock slows down when doing some zfs stuff like send|receive. Typing 'date'
when less than 1000s have gone by seems to correct the problem:
ntpd kicks in and the clock is on track again.

I have a cron job just logging date every 5 minutes, and the loghost sees:

|-- local time on loghost | time on problematic host
Jan 20 19:56:19 store-02.cs.huji.ac.il Jan 20 19:56:19 danny: Sun Jan 20 19:56:19 IST 2013 -- ok
Jan 20 20:15:00 store-02.cs.huji.ac.il Jan 20 20:15:00 danny: Sun Jan 20 20:15:00 IST 2013 -- ok
Jan 20 21:30:00 store-02.cs.huji.ac.il Jan 20 20:21:06 danny: Sun Jan 20 20:21:06 IST 2013 -- off by 1:09
Jan 20 21:33:53 store-02.cs.huji.ac.il Jan 20 20:25:00 danny: Sun Jan 20 20:25:00 IST 2013 -- off by 1:08
Jan 20 21:38:54 store-02.cs.huji.ac.il Jan 20 20:30:00 danny: Sun Jan 20 20:30:00 IST 2013 -- off by 1:09
...
Jan 20 22:03:54 store-02.cs.huji.ac.il Jan 20 20:55:00 danny: Sun Jan 20 20:55:00 IST 2013 -- diff is now constant
..
Jan 20 22:04:13 store-02.cs.huji.ac.il Jan 20 20:55:19 ntpd[1848]: time correction of 4134 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
...
Jan 20 22:58:53 store-02.cs.huji.ac.il Jan 20 21:50:00 danny: Sun Jan 20 21:50:00 IST 2013
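
The lag can be worked out from the two timestamps on each log line. The following is only a sketch in Python, not something from the original post: the sample pairs are copied from the excerpt above, and the helper names (`parse`, `lag_seconds`, `lags`) are made up for illustration. It assumes both hosts log in the same time zone and that the year is 2013, as the log text says.

```python
from datetime import datetime

# (time stamped by the loghost, time reported by the problematic host),
# copied from the log excerpt above.
SAMPLES = [
    ("Jan 20 20:15:00", "Jan 20 20:15:00"),
    ("Jan 20 21:30:00", "Jan 20 20:21:06"),
    ("Jan 20 21:33:53", "Jan 20 20:25:00"),
    ("Jan 20 21:38:54", "Jan 20 20:30:00"),
    ("Jan 20 22:03:54", "Jan 20 20:55:00"),
]


def parse(stamp, year=2013):
    """Parse a syslog-style stamp; syslog omits the year, so it is passed in."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def lag_seconds(loghost, host):
    """Seconds by which the host clock is behind the loghost clock."""
    return int((parse(loghost) - parse(host)).total_seconds())


def lags(samples):
    return [lag_seconds(loghost, host) for loghost, host in samples]


if __name__ == "__main__":
    for (loghost, host), lag in zip(SAMPLES, lags(SAMPLES)):
        print(f"loghost={loghost}  host={host}  lag={lag}s")
```

For these samples the lag goes from 0 to about 4134 seconds and then stays there, which is the same figure ntpd reports when it refuses the correction.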

strangely, when running 8.3, ACPI-fast is chosen:
kern.timecounter.choice: TSC(-100) HPET(900) ACPI-fast(1000) i8254(0) dummy(-1000000)
but with 9.1 TSC-low gets chosen:
kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)
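
To compare these lists across hosts, the `name(quality)` pairs can be parsed and ranked. This is a rough Python sketch, assuming the kernel simply prefers the entry with the highest quality value (which matches the choices reported above); `parse_choice` and `best` are hypothetical helper names, and the two strings are the sysctl values quoted in this message.

```python
import re

CHOICE_83 = "TSC(-100) HPET(900) ACPI-fast(1000) i8254(0) dummy(-1000000)"
CHOICE_91 = "TSC-low(1000) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)"


def parse_choice(text):
    """Return {name: quality} from a kern.timecounter.choice style value."""
    return {name: int(q) for name, q in re.findall(r"([\w-]+)\((-?\d+)\)", text)}


def best(text):
    """Name with the highest quality (assumed to be what the kernel selects)."""
    choices = parse_choice(text)
    return max(choices, key=choices.get)


if __name__ == "__main__":
    print("8.3:", best(CHOICE_83))
    print("9.1:", best(CHOICE_91))
```

The same parsing should work on the output of sysctl kern.eventtimer.choice, since it uses the same `name(quality)` layout.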

so I did sysctl kern.timecounter.hardware=ACPI-fast, but the same happens -
unless it can't be changed after boot.

I really need help here!

thanks,
danny

_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"

Adrian Chadd

Jan 21, 2013, 9:13:02 AM

Hi,

Try experimenting with kern.eventtimer.periodic and kern.eventtimer.idletick.

If this fixes it for you, please file a PR with all the relevant details.

Thanks!

Adrian

Daniel Braniss

Jan 21, 2013, 9:37:47 AM

> Hi,
>
> Try experimenting with kern.eventtimer.periodic and kern.eventtimer.idletick.
>

can you give/point to some info about this?

btw, I just noticed that on this hardware I get:
9.1-STABLE:

> vmstat -i
interrupt                 total  rate
irq3: uart1                 931     0
irq4: uart0                   5     0
irq19: ehci0               1331     0
irq20: hpet0 uhci3      1687937  1163
irq21: uhci2 ehci1           29     0
irq23: atapci0               48     0
irq256: bce0              52270    36
irq260: mfi0              14690    10
irq261: mfi1               3088     2
Total                   1760329  1213

no cpu timer, instead irq20: hpet0 uhci3,
and when 8.3-STABLE:
> vmstat -i
interrupt                 total  rate
irq3: uart1                1048     0
irq4: uart0                   5     0
irq19: ehci0             280451     1
irq21: uhci2 ehci1           29     0
irq23: atapci0               52     0
cpu0:timer            313544623  1125
irq256: bce0           30791673   110
irq260: mfi0            1372186     4
cpu1:timer              1294093     4
...

total                 384382790  1380

is this OK?
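
One way to compare the two listings is to pull out only the rows that look like timer interrupts. The Python sketch below is an illustration, not part of the original message: it assumes that `cpuN:timer` rows are the per-CPU (LAPIC) timer interrupts and that a row mentioning `hpet` is the HPET event timer; `parse_vmstat_i` and `timer_rows` are hypothetical names, and the sample text is abridged from the output above.

```python
import re

SAMPLE_91 = """\
interrupt total rate
irq19: ehci0 1331 0
irq20: hpet0 uhci3 1687937 1163
irq256: bce0 52270 36
Total 1760329 1213
"""

SAMPLE_83 = """\
interrupt total rate
irq19: ehci0 280451 1
cpu0:timer 313544623 1125
irq256: bce0 30791673 110
cpu1:timer 1294093 4
"""


def parse_vmstat_i(text):
    """Map each interrupt name to (total, rate) from 'vmstat -i' style text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows end in two integers; skip the header, prompts and "...".
        if len(parts) < 3 or not (parts[-2].isdigit() and parts[-1].isdigit()):
            continue
        rows[" ".join(parts[:-2])] = (int(parts[-2]), int(parts[-1]))
    return rows


def timer_rows(rows):
    """Names that look like event timer interrupts (per-CPU timer or HPET)."""
    return sorted(n for n in rows if re.fullmatch(r"cpu\d+:timer", n) or "hpet" in n)


if __name__ == "__main__":
    print("9.1:", timer_rows(parse_vmstat_i(SAMPLE_91)))
    print("8.3:", timer_rows(parse_vmstat_i(SAMPLE_83)))
```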

> If this fixes it for you, please file a PR with all the relevant details.
>

I will!

Ian Lepore

Jan 21, 2013, 10:03:08 AM

What's the output of sysctl kern.eventtimer? Does the bad behavior
change if you set kern.eventtimer.periodic=1?

-- Ian

Daniel Braniss

Jan 21, 2013, 10:35:16 AM

...

>
> What's the output of sysctl kern.eventtimer?

kern.eventtimer.periodic is 0

> Does the bad behavior
> change if you set kern.eventtimer.periodic=1?
>

setting kern.eventtimer.timer=LAPIC
instead of the default HPET made the missing cpu timers appear:
# vmstat -i
interrupt                 total  rate
irq3: uart1                1695     0
irq4: uart0                   5     0
irq19: ehci0               3875     0
irq20: hpet0 uhci3      5495755  1135
irq21: uhci2 ehci1           29     0
irq23: atapci0               48     0
cpu0:timer                 7063     1
irq256: bce0             117073    24
irq260: mfi0              51083    10
irq261: mfi1               3088     0
cpu1:timer                  484     0
cpu14:timer                  36     0
cpu6:timer                  486     0
cpu8:timer                   38     0
cpu5:timer                   38     0
cpu15:timer                  38     0
cpu7:timer                   32     0
cpu12:timer                  38     0
cpu3:timer                   40     0
cpu9:timer                   36     0
cpu10:timer                  34     0
cpu11:timer                  37     0
cpu2:timer                   33     0
cpu13:timer                  40     0
cpu4:timer                   36     0
Total                   5681160  1173

is this relevant?

danny

Ian Lepore

Jan 21, 2013, 10:54:27 AM

I'll have to let someone who knows modern x86 hardware better comment on
the relative merits of hpet vs. lapic timers. If it was using hpet in
one-shot mode, and changing it to hpet in periodic mode makes the
problem go away, that might be a clue that there's something wrong in
the hpet eventtimer start or interrupt routines.

I wonder if a single missed interrupt in one-shot mode would bring an
eventtimer to a halt like that? And if so, then what is it about
manually asking for the date that kicks it into running again?

Adrian Chadd

Jan 21, 2013, 3:09:21 PM

I still firmly believe the ACPI event timer code is racy, and what we
may be seeing here is the fallout from that.

It's very possible that we're missing interrupts here - the new
eventtimer code that made it into 9.x puts the halt behind a critical
section, with interrupts disabled. The only idle methods that correctly
implement enable-interrupts-and-halt atomically are the HLT (well, and
the don't-sleep-at-all) idle loops on i386/amd64. The default method
is to use the ACPI sleep method, which doesn't do atomic interrupt
enable / halt.

I'm still seeing odd stuff on some of my ACPI-using netbooks when
doing net80211/ath development and it all goes away whenever I fiddle
with the above settings.

So, play with kern.eventtimer.periodic, kern.eventtimer.idletick and
machdep.idle (try setting machdep.idle to hlt, or something else
listed in machdep.idle_available) - please report back what the
results are.

Adrian

Daniel Braniss

Jan 22, 2013, 2:28:34 AM

> I still firmly believe the ACPI event timer code is racy, and what we
> may be seeing here is the fallout from that.
>
> It's very possible that we're missing interrupts here - the new
> eventtimer code that made it into 9.x puts the halt behind a critical
> section, with interrupts disabled. The only idle methods that correctly
> implement enable-interrupts-and-halt atomically are the HLT (well, and
> the don't-sleep-at-all) idle loops on i386/amd64. The default method
> is to use the ACPI sleep method, which doesn't do atomic interrupt
> enable / halt.
>
> I'm still seeing odd stuff on some of my ACPI-using netbooks when
> doing net80211/ath development and it all goes away whenever I fiddle
> with the above settings.
>
> So, play with kern.eventtimer.periodic, kern.eventtimer.idletick and
> machdep.idle (try setting machdep.idle to hlt, or something else
> listed in machdep.idle_available) - please report back what the
> results are.
>
>
> Adrian
>

Adrian,
you mention that ACPI is racy, which event timer are you talking about?

how is the quality chosen?

at the moment switching kern.eventtimer.timer to LAPIC seems to have done the
trick. I'll have to wait another 24 hours to make sure.

In the meantime here is some info:
Intel(R) Xeon(R) CPU E5645: running with no problems
LAPIC(600) HPET(450) HPET1(440) HPET2(440) HPET3(440) i8254(100) RTC(0)

Intel(R) Xeon(R) CPU X5550: this is the problematic one, at least for the moment
HPET(450) HPET1(440) HPET2(440) HPET3(440) LAPIC(400) i8254(100) RTC(0)

Dual-Core AMD Opteron(tm) Processor 2218: running with no problems
LAPIC(400) RTC(0)

so if someone is running 9.1 on any of the following, it would be nice if
they could provide the output of sysctl kern.eventtimer.choice:

Intel(R) Xeon(R) CPU E5410
Intel(R) Xeon(R) CPU E5507

btw, all the above are on server MBs.

thanks,
danny

Adrian Chadd

Jan 22, 2013, 4:40:02 AM

Daniel,

Have you run tests with the machdep.idle value changed, and fiddling
kern.eventtimer.periodic / kern.eventtimer.idletick ?

adrian

Daniel Braniss

Jan 22, 2013, 4:55:39 AM

> Daniel,
>
> Have you run tests with the machdep.idle value changed, and fiddling
> kern.eventtimer.periodic / kern.eventtimer.idletick ?

Adrian,

not yet, for several reasons:
1- as I explained, I can't really force the problem; it happens when we run some
zfs scripts, like mirror, but I have to wait till enough changes have happened on
the source, usually after 24 hours.
2- changing to LAPIC seems to have solved the problem.
3- I'm now learning all I can about event timers and you have not answered some
of my questions :-)

danny

Julian Stecklina

Jan 22, 2013, 7:27:24 AM

Thus spake Daniel Braniss <da...@cs.huji.ac.il>:

> In the meantime here is some info:
> Intel(R) Xeon(R) CPU E5645: running with no problems
> LAPIC(600) HPET(450) HPET1(440) HPET2(440) HPET3(440) i8254(100) RTC(0)
>
> Intel(R) Xeon(R) CPU X5550: this is the problematic one, at least for the moment
> HPET(450) HPET1(440) HPET2(440) HPET3(440) LAPIC(400) i8254(100) RTC(0)

Does anyone know why the LAPIC is given a lower priority than HPET in
this case? If you have an LAPIC, it should always be preferred to HPET,
unless something is seriously wrong with it...

Julian

Ryan Stone

Jan 22, 2013, 8:53:28 AM

On Tue, Jan 22, 2013 at 7:27 AM, Julian Stecklina <
jste...@os.inf.tu-dresden.de> wrote:

> Does anyone know why the LAPIC is given a lower priority than HPET in
> this case? If you have an LAPIC, it should always be preferred to HPET,
> unless something is seriously wrong with it...
>

On many processors the lapic timer does not work correctly in states lower
than C1. There are many processors that will automatically enter a "C1E"
mode when the processor is idle, and in that state I have seen the lapic
timer run slower than the programmed frequency, causing time to move too
slowly on idle FreeBSD systems.

Adam McDougall

Jan 22, 2013, 9:35:39 AM

This may help:

"Problem with LAPIC timer is that it stops working when CPU goes to C3
or deeper idle state. These states are not enabled by default, so unless
you enabled them explicitly, it is safe to use LAPIC. In any case
present 9-STABLE system should prevent you from using unsafe C-state if
LAPIC timer is used. From all other perspectives LAPIC is preferable, as
it is faster and easier to operate than HPET. Latest CPUs fixed the
LAPIC timer problem, so I don't think that switching to it will be
pessimistic in foreseeable future.

--
Alexander Motin"

Adrian Chadd

Jan 22, 2013, 1:42:21 PM

Hi!

As I said before, the problem with non-HLT loops with event-timer in
-9 and -head is that it calls the idle function inside a critical
section (critical_enter and critical_exit) which blocks interrupts
from occurring.

The EI;HLT instruction pair on i386/amd64 atomically and correctly
handles things from what I've been told.

However, there's no atomic way to do this using ACPI sleeping, so
there's a small window where an interrupt may come in but it isn't
handled; waiting for the next interrupt to occur before it'll wake up
and respond to that interrupt.

I kept hitting my head against this when doing network testing. :(

Now - specifically for timekeeping it shouldn't matter; that's to do
with whether the counters are reliable or not (and heck, are even in
lock-step on CPUs.) But extra latency could show up weirdly, hence why
I was asking for you to try different timer configurations and idle
loops.

Thanks,

Adrian

Andriy Gapon

Jan 23, 2013, 9:58:19 AM

on 22/01/2013 20:42 Adrian Chadd said the following:

> Hi!
>
> As I said before, the problem with non-HLT loops with event-timer in
> -9 and -head is that it calls the idle function inside a critical
> section (critical_enter and critical_exit) which blocks interrupts
> from occurring.
>
> The EI;HLT instruction pair on i386/amd64 atomically and correctly
> handles things from what I've been told.
>
> However, there's no atomic way to do this using ACPI sleeping, so
> there's a small window where an interrupt may come in but it isn't
> handled; waiting for the next interrupt to occur before it'll wake up
> and respond to that interrupt.

I don't think that this is true of x86 hardware in general.
You might have hit some limitation or a quirk or a bug or an erratum for some
particular hardware.

E.g. a chipset on this machine has a bit described as such:
"Set to 1 to skip the C state transition if there is break event
when entering C state."
The bit is set indeed and as far as I can tell the behavior matches the description.

Most modern (non-embedded) machines seem to behave this way. Attempt to enter a
deeper C state while a break event is pending still incurs some overhead, but it's
not as bad as waiting for the next break event.

> I kept hitting my head against this when doing network testing. :(
>
> Now - specifically for timekeeping it shouldn't matter; that's to do
> with whether the counters are reliable or not (and heck, are even in
> lock-step on CPUs.) But extra latency could show up weirdly, hence why
> I was asking for you to try different timer configurations and idle
> loops.

--
Andriy Gapon


Adrian Chadd

Jan 23, 2013, 11:20:52 AM

On 23 January 2013 06:58, Andriy Gapon <a...@freebsd.org> wrote:

> I don't think that this is true of x86 hardware in general.
> You might have hit some limitation or a quirk or a bug or an erratum for some
> particular hardware.
>
> E.g. a chipset on this machine has a bit described as such:
> "Set to 1 to skip the C state transition if there is break event
> when entering C state."
> The bit is set indeed and as far as I can tell the behavior matches the description.
>
> Most modern (non-embedded) machines seem to behave this way. Attempt to enter a
> deeper C state while a break event is pending still incurs some overhead, but it's
> not as bad as waiting for the next break event.

I'll reverify the behaviour on my netbooks when I'm back home.

It may be a quirk of an older 9.x, which is fixed in -HEAD. It may be
a quirk of the older generation celeron hardware - in which case, we
need to tell the user somehow..

Adrian

John Nielsen

Jan 23, 2013, 6:11:57 PM

On Jan 22, 2013, at 2:40 AM, Adrian Chadd <adr...@freebsd.org> wrote:

> On Jan 21, 2013, at 4:33 AM, Daniel Braniss <da...@cs.huji.ac.il> wrote:
>
>> host: DELL PowerEdge R710, 16GB,

I administer a Dell PowerEdge R710 and I've been seeing the exact same thing. It's currently running FreeBSD 9.0-STABLE #0 r236355. It has a ZFS pool which sees moderate load most of the time but can be very high at times (when certain scripts run, etc.). I hadn't previously correlated the issue with ZFS load but that is very possible.

I set a cron job to restart ntpd when it dies (because the time difference exceeds the sanity check). The cron job runs "every 20 minutes", but that varies greatly when the system stops counting. The time offset from ntpdate (which the script runs before restarting ntpd) varies a lot, but always in increments of 300 seconds. I've seen everything from 1200 to 23100. (Yes, that's 23 thousand seconds aka 6 hours 25 minutes that the system wasn't keeping time for.)
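
The arithmetic behind those figures can be checked quickly. This is just an illustrative Python sketch; `describe_offset` is a hypothetical helper, and the 300-second step is the increment mentioned above.

```python
def describe_offset(seconds, step=300):
    """Break an ntpdate offset into whole steps and hours/minutes/seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return {
        "steps": seconds // step,
        "exact_multiple": seconds % step == 0,
        "hms": (hours, minutes, secs),
    }


if __name__ == "__main__":
    for offset in (1200, 23100):
        print(offset, describe_offset(offset))
```

So 23100 seconds is 77 steps of 300 seconds, or 6 hours 25 minutes, and 1200 seconds is 4 steps, or 20 minutes.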

Sysctl kern.timecounter.hardware defaults to HPET. I experimented with setting it to ACPI-fast but the issue persisted so I put it back.
kern.timecounter.choice: TSC-low(-100) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)

I first installed the box with an older 9.0-STABLE and this issue was not present. I have been tracking -STABLE on it (albeit irregularly) so I'm not sure when the issue came up.

> Have you run tests with the machdep.idle value changed, and fiddling
> kern.eventtimer.periodic / kern.eventtimer.idletick ?

I would love to resolve this and am able to do some experimenting. I've _usually_ been seeing the issue 2-3 times every 1-2 days, but I did just make some changes:
disabling ZFS compression and deduplication on all pools
updated to 9.1-STABLE from yesterday (r245821)

If the issue persists I will try changing some of the sysctls above and follow up with the result. If it goes away, I'll try to remember to report that too.

JN

Daniel Braniss

Jan 24, 2013, 1:50:02 AM

> On Jan 22, 2013, at 2:40 AM, Adrian Chadd <adr...@freebsd.org> wrote:
>
> > On Jan 21, 2013, at 4:33 AM, Daniel Braniss <da...@cs.huji.ac.il> wrote:
> >
> >> host: DELL PowerEdge R710, 16GB,
>
> I administer a Dell PowerEdge R710 and I've been seeing the exact same thing. It's currently running FreeBSD 9.0-STABLE #0 r236355. It has a ZFS pool which sees moderate load most of the time but can be very high at times (when certain scripts run, etc.). I hadn't previously correlated the issue with ZFS load but that is very possible.
>
> I set a cron job to restart ntpd when it dies (because the time difference exceeds the sanity check). The cron job runs "every 20 minutes", but that varies greatly when the system stops counting. The time offset from ntpdate (which the script runs before restarting ntpd) varies a lot, but always in increments of 300 seconds. I've seen everything from 1200 to 23100. (Yes, that's 23 thousand seconds aka 6 hours 25 minutes that the system wasn't keeping time for.)
>
> Sysctl kern.timecounter.hardware defaults to HPET. I experimented with setting it to ACPI-fast but the issue persisted so I put it back.
> kern.timecounter.choice: TSC-low(-100) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)
>
> I first installed the box with an older 9.0-STABLE and this issue was not present. I have been tracking -STABLE on it (albeit irregularly) so I'm not sure when the issue came up.
>
> > Have you run tests with the machdep.idle value changed, and fiddling
> > kern.eventtimer.periodic / kern.eventtimer.idletick ?
>
> I would love to resolve this and am able to do some experimenting. I've _usually_ been seeing the issue 2-3 times every 1-2 days, but I did just make some changes:
> disabling ZFS compression and deduplication on all pools
> updated to 9.1-STABLE from yesterday (r245821)
>
> If the issue persists I will try changing some of the sysctls above and follow up with the result. If it goes away, I'll try to remember to report that too.
>
> JN

set kern.eventtimer.timer=LAPIC
this solved it for me.

danny

Andriy Gapon

Jan 26, 2013, 5:15:10 AM

on 23/01/2013 18:20 Adrian Chadd said the following:

> It may be a quirk of an older 9.x, which is fixed in -HEAD. It may be
> a quirk of the older generation celeron hardware - in which case, we
> need to tell the user somehow..

This is not software related at all. It's the hardware feature (or its absence).
I wonder if your celerons report PBE feature.

--
Andriy Gapon

Adrian Chadd

Jan 27, 2013, 12:27:49 PM

On 26 January 2013 02:15, Andriy Gapon <a...@freebsd.org> wrote:
> on 23/01/2013 18:20 Adrian Chadd said the following:
>> It may be a quirk of an older 9.x, which is fixed in -HEAD. It may be
>> a quirk of the older generation celeron hardware - in which case, we
>> need to tell the user somehow..
>
> This is not software related at all. It's the hardware feature (or its absence).
> I wonder if your celerons report PBE feature.

What am I looking for?

And personally, requiring (much) more recent hardware to get
sane/correct (but inefficient) behaviour out of the timekeeping
framework is a little .. suboptimal. :)

Adrian

Andriy Gapon

Jan 27, 2013, 12:54:27 PM

on 27/01/2013 19:27 Adrian Chadd said the following:

> On 26 January 2013 02:15, Andriy Gapon <a...@freebsd.org> wrote:
>> on 23/01/2013 18:20 Adrian Chadd said the following:
>>> It may be a quirk of an older 9.x, which is fixed in -HEAD. It may be
>>> a quirk of the older generation celeron hardware - in which case, we
>>> need to tell the user somehow..
>>
>> This is not software related at all. It's the hardware feature (or its absence).
>> I wonder if your celerons report PBE feature.
>
> What am I looking for?

PBE in dmesg

> And personally, requiring (much) more recent hardware to get
> sane/correct (but inefficient) behaviour out of the timekeeping
> framework is a little .. suboptimal. :)

Well, I never knew about this issue before but I always assumed that the
reasonable behavior was the behavior. And I never encountered any evidence to
the contrary.

--
Andriy Gapon