
cvs commit: src/sys/amd64/amd64 cpu_switch.S machdep.c


David Xu

Oct 17, 2005, 7:10:31 PM
to src-com...@freebsd.org, cvs...@freebsd.org, cvs...@freebsd.org
davidxu 2005-10-17 23:10:31 UTC

FreeBSD src repository

Modified files:
sys/amd64/amd64 cpu_switch.S machdep.c
Log:
Micro optimization for context switch. Eliminate code for saving gs.base
and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
when the user wants to set them; in the context switch routine, we only need
to write them into the registers, and we never have to read them back out
when a thread is switched away. Since rdmsr is a serializing instruction,
a micro benchmark shows this is worth doing.

Reviewed by: peter, jhb

Revision Changes Path
1.154 +0 -15 src/sys/amd64/amd64/cpu_switch.S
1.642 +2 -0 src/sys/amd64/amd64/machdep.c

Poul-Henning Kamp

Oct 18, 2005, 9:47:36 AM
to Andrew Gallatin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org
In message <2005101809...@grasshopper.cs.duke.edu>, Andrew Gallatin writes:

>It is a shame we can't find a way to use the TSC as a timecounter on
>SMP systems. It seems that about 40% of the context switch time is
>spent just waiting for the PIO read of the ACPI-fast or i8254 to
>return.

No, the shame is that the scheduler tries to partition time rather
than cpu cycles because that approximation got goldplated in some
random standard years back.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
p...@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

Andrew Gallatin

Oct 18, 2005, 9:44:02 AM
to David Xu, cvs...@freebsd.org, src-com...@freebsd.org, cvs...@freebsd.org
David Xu [dav...@FreeBSD.org] wrote:
> davidxu 2005-10-17 23:10:31 UTC
>
> FreeBSD src repository
>
> Modified files:
> sys/amd64/amd64 cpu_switch.S machdep.c
> Log:
> Micro optimization for context switch. Eliminate code for saving gs.base
> and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
> when user wants to set them, in context switch routine, we only need to
> write them into registers, we never have to read them out from registers
> when thread is switched away. Since rdmsr is a serialization instruction,
> micro benchmark shows it is worthy to do.

Nice. This reduces lmbench context switch latency by about 0.4us (7.2
-> 6.8us), and reduces TCP loopback latency by about 0.9us (36.1 ->
35.2us) on my dual-core 3800+.

It is a shame we can't find a way to use the TSC as a timecounter on
SMP systems. It seems that about 40% of the context switch time is
spent just waiting for the PIO read of the ACPI-fast or i8254 to
return.


Drew

Andrew Gallatin

Oct 18, 2005, 10:05:18 AM
to Poul-Henning Kamp, cvs...@freebsd.org, src-com...@freebsd.org, cvs...@freebsd.org

Poul-Henning Kamp writes:
> In message <2005101809...@grasshopper.cs.duke.edu>, Andrew Gallatin writes:
>
> >It is a shame we can't find a way to use the TSC as a timecounter on
> >SMP systems. It seems that about 40% of the context switch time is
> >spent just waiting for the PIO read of the ACPI-fast or i8254 to
> >return.
>
> No, the shame is that the scheduler tries to partition time rather
> than cpu cycles because that approximation got goldplated in some
> random standard years back.

Sorry if I misspoke. I guess the shame is twofold.

First, we insist on not trying to keep the TSC in sync, and so we don't use
it for SMP timekeeping like other OSes do, which means that getting a
microsecond-granularity timestamp is orders of magnitude more
expensive for us. To compound the problem, we insist on using the
expensive non-TSC binuptime() to get a runtime measurement on each
context switch, rather than being able to use something cheap like
ticks, or a per-cpu cycle counter.

If anybody is looking for low-hanging fruit in the SMP context switch
path, figuring some acceptable way to avoid reading the ACPI or i8254
timecounter is it.


Drew

Scott Long

Oct 18, 2005, 10:07:53 AM
to Andrew Gallatin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

The TSC represents the clock rate of the CPU, and thus can vary wildly
when thermal and power management controls kick in, and there is no way
to know when it changes. Because of this, I think that it's
practically useless on Pentium-Mobile and Pentium-M chips, among many
others. There is also the issue of multiple CPUs having to keep their
TSC's somewhat in sync in order to get consistent counting in the
system. The best that you can do is to periodically read a stable
counter and try to recalibrate, but then you'll likely start getting
wild operational variances. It's a shame that a PIO read is still so
expensive. I'd hate to see just how bad your benchmark becomes when
ACPI-slow is used instead of ACPI-fast.

I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
of an idea. Having preemption in the kernel means that ithreads can run
right away instead of having to wait for a tick, and various fixes to
4BSD in the past year have eliminated bugs that would make the CPU wait
for up to a tick to schedule a thread. So all we're getting now is a
10x increase in scheduler overhead, including reading the timecounters.

Scott

Andrew Gallatin

Oct 18, 2005, 10:25:14 AM
to Scott Long, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

As I pointed out in another thread, both linux and solaris do it.
Solaris seems to have a nice algorithm for keeping things in sync, and
accounting for the TSC getting cleared after suspend/resume etc. At
my level of understanding, this argument is nothing more than "but
Mom, all the other kids are doing it". I was just hoping that
somebody with real understanding could pick up on it.

> It's a shame that a PIO read is still so
> expensive. I'd hate to see just how bad your benchmark becomes when
> ACPI-slow is used instead of ACPI-fast.

It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
switch is otherwise 4us, it adds up. i8254 is much worse on this
system (6.5us).

> I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
> of an idea. Having preemption in the kernel means that ithreads can run
> right away instead of having to wait for a tick, and various fixes to
> 4BSD in the past year have eliminated bugs that would make the CPU wait
> for up to a tick to schedule a thread. So all we're getting now is a
> 10x increase in scheduler overhead, including reading the timecounters.

Yeah. I moved my back to hz=1000 when I noticed 4000 interrupts/sec
on an idle system.

Drew

Scott Long

Oct 18, 2005, 10:34:52 AM
to Andrew Gallatin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

Steering multiple TSC's together isn't that hard and there are plenty of
examples, as you point out. Accounting for the changes due to thermal
and power management (note that this isn't the same problem as suspend
and resume) is what worries me.

>
> > It's a shame that a PIO read is still so
> > expensive. I'd hate to see just how bad your benchmark becomes when
> > ACPI-slow is used instead of ACPI-fast.
>
> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
> switch is otherwise 4us, it adds up. i8254 is much worse on this
> system (6.5us).
>
> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
> > of an idea. Having preemption in the kernel means that ithreads can run
> > right away instead of having to wait for a tick, and various fixes to
> > 4BSD in the past year have eliminated bugs that would make the CPU wait
> > for up to a tick to schedule a thread. So all we're getting now is a
> > 10x increase in scheduler overhead, including reading the timecounters.
>
> Yeah. I moved my back to hz=1000 when I noticed 4000 interrupts/sec
> on an idle system.
>
> Drew

Do you mean 1000 or 100 here? Anyway, the high clock interrupt rate is
so that we can use the local apic clock to get the various system ticks
that we have, instead of continuing to fight motherboards that no longer
hook up the 8259 in a sane way. This is why 5.x doesn't work well on a
number of new motherboards (nvidia ones especially) but 6.x works just
fine.

Scott

David O'Brien

Oct 18, 2005, 10:40:58 AM
to Andrew Gallatin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org
On Tue, Oct 18, 2005 at 09:44:02AM -0400, Andrew Gallatin wrote:
> It is a shame we can't find a way to use the TSC as a timecounter on
> SMP systems. It seems that about 40% of the context switch time is
> spent just waiting for the PIO read of the ACPI-fast or i8254 to
> return.

Revision F Opterons will have the RDTSCP (read serialized TSC pair)
instruction, which helps some. Slide 13 of
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_kernel_summit_08_RB.pdf

Future Opterons (or whatever AMD will call them then) will have a P-state
invariant TSC in 2007.
http://lwn.net/Articles/144098/

--
-- David (obr...@FreeBSD.org)

Andrew Gallatin

Oct 18, 2005, 10:48:45 AM
to Scott Long, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

Scott Long writes:
> Andrew Gallatin wrote:
> > As I pointed out in another thread, both linux and solaris do it.
> > Solaris seems to have a nice algorithm for keeping things in sync, and
> > accounting for the TSC getting cleared after suspend/resume etc. At
> > my level of understanding, this argument is nothing more than "but
> > Mom, all the other kids are doing it". I was just hoping that
> > somebody with real understanding could pick up on it.
>
> Steering mutliple TSC's together isn't that hard and there are plenty of
> examples, as you point out. Accounting for the changes due to thermal
> and power management (note that this isn't the same problem as suspend
> and resume) is what worries me.

Yes, I have no answer for this :(

> > Yeah. I moved my back to hz=1000 when I noticed 4000 interrupts/sec
> > on an idle system.
> >
> > Drew
>
> Do you mean 1000 or 100 here? Anyways, the high clock interrupt rate is

Sorry.. That was a typo. I meant hz=100.

Drew

Poul-Henning Kamp

Oct 18, 2005, 11:31:31 AM
to Scott Long, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, David Xu
In message <435508...@samsco.org>, Scott Long writes:

[At the risk of repeating myself once more...]

>Steering mutliple TSC's together isn't that hard and there are plenty of
>examples, as you point out. Accounting for the changes due to thermal
>and power management (note that this isn't the same problem as suspend
>and resume) is what worries me.

It all depends what you mean by "hard" and what benefit you expect
to arrive at.

One of the things you have to realize is that once you go down this
road you need a lot of code for all the conditionals.

For instance, you need to make sure that every new timestamp you
hand out is not prior to another one, no matter what is happening to
the clocks.

Imagine one CPU throttling because of heat: that CPU will be handing
out timestamps in the past until the TSC slowdown has been corrected,
while the other CPU in the system churns on at full speed.

To solve this, you need to pessimize every timestamp with an inter-CPU
lock, compare against the previous timestamp, and if it is less, do
the Lamport trick and return "previous timestamp + epsilon".

Then there is the question of how you adapt; a stepwise adaptation
is hard to get right without overshoot, and stability is far from
a given.

Dave Mills implemented a scheme on Alpha with per-CPU PLLs clocked
by a common interrupt from the RTC. The results were interesting,
but hardly revolutionary, and performance-wise it sucked.

So, yes, it may not be "hard" in the "write an OS from scratch" sense
of "hard", but it is certainly far from trivial, comes with a heavy
penalty in complexity and a notable shortage of successful prior art.


One of the things we pride ourselves on in FreeBSD is stability,
and the current code (finally!) provides that: It has been a long
time since we last had timecounter issues with broken hardware.

But if people are certain their TSC's are good and sound, they can
override the default safe selection of ACPI with a sysctl, and in
doing so, they can take a calculated risk.

That, IMO, is the correct "FreeBSD way" to handle this:

"Safe out of the box. Informed tweaking may be profitable."

I would hate to have to go to the other side where some fraction
of users which happen to use hardware with problems in this space
will have to disable something to get stable operation or to
avoid unexplained undesirable transient phenomena.

>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>> system (6.5us).

i8254 is always bad, and about as bad as it can be. Mostly because
of the need to disable interrupts (actually, that's a critical
section today, isn't it?), and it is also hobbled by the three 8-bit
ISA-bus(-like) accesses needed.

>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>> > of an idea.

The main benefit was getting more precise timeouts, something we have
at various times thought about implementing with deadline counters
on platforms that have it. Nobody has done it though.


So, instead of looking for "quick fixes", let's look at this with a
designer's or architect's view:

On a busy system the scheduler runs a hundred thousand times per
second, but on most systems nobody ever looks at the times(2) data.

The smart solution is therefore to postpone the heavy stuff into
times(2) and make the scheduler work as fast as it can.

So the scheduler should read the TSC and schedule in TSC-ticks.

times(2) will then have to convert this to clock_t compatible
numbers.

According to The Open Group, clock_t is in microseconds, by way
of historical standards mistakes.

However, I can see nowhere that would collide with an interpretation
that said "clock_t is microseconds PROVIDED the cpu had run at full
speed", so a simple once-a-second routine to latch the highest number
of TSC-ticks we've seen in a second would be sufficient to generate
the conversion factor.

And in many ways this would be a much more useful metric to offer
(in top(1)) than the current rubber-band-cpu-seconds.

Poul-Henning

[1] A problem with this plan, of course, is that some CPUs don't
have TSCs, but a fallback mechanism could use whatever timecounter
is active in place of the TSC.

John Baldwin

Oct 18, 2005, 11:01:02 AM
to Andrew Gallatin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

You can try it by just setting the kern.timecounter.smp_tsc=1 tunable on boot.
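For reference, as a boot-time tunable that would go in loader.conf:

```shell
# /boot/loader.conf -- opt in to the TSC timecounter on SMP
# (only sensible if you trust your CPUs' TSCs to stay in sync)
kern.timecounter.smp_tsc=1
```

After boot, `sysctl kern.timecounter.hardware` shows which timecounter was actually selected.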

--
John Baldwin <j...@FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org

Andrew Gallatin

Oct 18, 2005, 11:54:37 AM
to John Baldwin, cvs...@freebsd.org, src-com...@freebsd.org, David Xu, cvs...@freebsd.org

John Baldwin writes:
> On Tuesday 18 October 2005 09:44 am, Andrew Gallatin wrote:
> > David Xu [dav...@FreeBSD.org] wrote:
> > > davidxu 2005-10-17 23:10:31 UTC
> > >
> > > FreeBSD src repository
> > >
> > > Modified files:
> > > sys/amd64/amd64 cpu_switch.S machdep.c
> > > Log:
> > > Micro optimization for context switch. Eliminate code for saving
> > > gs.base and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
> > > when user wants to set them, in context switch routine, we only need to
> > > write them into registers, we never have to read them out from registers
> > > when thread is switched away. Since rdmsr is a serialization instruction,
> > > micro benchmark shows it is worthy to do.
> >
> > Nice. This reduces lmbench context switch latency by about 0.4us (7.2
> > -> 6.8us), and reduces TCP loopback latency by about 0.9us (36.1 ->
> > 35.2) on my dual core 3800+
> >
> > It is a shame we can't find a way to use the TSC as a timecounter on
> > SMP systems. It seems that about 40% of the context switch time is
> > spent just waiting for the PIO read of the ACPI-fast or i8254 to
> > return.
>
> You can try it by just setting the kern.timecounter.smp_tsc=1 tunable on boot.

Yes, that's how I get my figure of 3us for the PIO read, and 3.8us for
the rest of the context switch. But it's not currently practical on most
machines, since we don't sync the TSC between cpus, or do anything to
account for drift.

Drew

Nate Lawson

Oct 18, 2005, 12:50:37 PM
to Andrew Gallatin, cvs...@freebsd.org, Poul-Henning Kamp, src-com...@freebsd.org, cvs...@freebsd.org
Andrew Gallatin wrote:
> Poul-Henning Kamp writes:
> > In message <2005101809...@grasshopper.cs.duke.edu>, Andrew Gallatin writes:
> >
> > >It is a shame we can't find a way to use the TSC as a timecounter on
> > >SMP systems. It seems that about 40% of the context switch time is
> > >spent just waiting for the PIO read of the ACPI-fast or i8254 to
> > >return.
> >
> > No, the shame is that the scheduler tries to partition time rather
> > than cpu cycles because that approximation got goldplated in some
> > random standard years back.
>
> Sorry if I mi-spoke. I guess the shame twofold.
>
> First we insist on not trying keep the TSC in sync and so we don't use
> it for SMP timekeeping like other OSes do, which means that getting a
> micro-second granularity timestamp is orders of magnitude more
> expensive for us. To compound the problem, we insist on using the
> expensive non-TSC binuptime() to get a runtime measurement on each
> context switch, rather than being able to use something cheap like
> ticks, or a per-cpu cycle counter.

I have good information that in the near future, most designs will have
guaranteed synchronized TSC across all CPUs.

> If anybody is looking for low-hanging fruit in the SMP context switch
> path, figuring some acceptable way to avoid reading the ACPI or i8254
> timecounter is it.

The ACPI timecounter involves a 32 bit read from IO space. The actual
timecounter is 24 or 32 bits. Since it's maintained in the chipset and
has strict requirements for being reliable in many modes of system
operation (i.e. C3), this read takes a while.

Using it at task switch time is overkill. As you suggest, it's better
to use TSC and calibrate via the ACPI timer. More info on this in my
next email.

--
Nate

Nate Lawson

Oct 18, 2005, 1:09:33 PM
to Nate Lawson, cvs...@freebsd.org, Poul-Henning Kamp, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org
Nate Lawson wrote:
> Andrew Gallatin wrote:
>
>> Poul-Henning Kamp writes:
>> > In message <2005101809...@grasshopper.cs.duke.edu>, Andrew Gallatin writes:
>> >
>> > >It is a shame we can't find a way to use the TSC as a timecounter on
>> > >SMP systems. It seems that about 40% of the context switch time is
>> > >spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> > >return.
>> >
>> > No, the shame is that the scheduler tries to partition time rather
>> > than cpu cycles because that approximation got goldplated in some
>> > random standard years back.
>>
>> Sorry if I mi-spoke. I guess the shame twofold.
>> First we insist on not trying keep the TSC in sync and so we don't use
>> it for SMP timekeeping like other OSes do, which means that getting a
>> micro-second granularity timestamp is orders of magnitude more
>> expensive for us. To compound the problem, we insist on using the
>> expensive non-TSC binuptime() to get a runtime measurement on each
>> context switch, rather than being able to use something cheap like
>> ticks, or a per-cpu cycle counter.
>
>
> I have good information that in the near future, most designs will have
> guaranteed synchronized TSC across all CPUs.

Oops, I meant not only "synchronized" but also "the same value".

--
Nate

Nate Lawson

Oct 18, 2005, 1:31:14 PM
to Scott Long, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, David Xu
Scott Long wrote:

> Andrew Gallatin wrote:
>> Nice. This reduces lmbench context switch latency by about 0.4us (7.2
>> -> 6.8us), and reduces TCP loopback latency by about 0.9us (36.1 ->
>> 35.2) on my dual core 3800+
>>
>> It is a shame we can't find a way to use the TSC as a timecounter on
>> SMP systems. It seems that about 40% of the context switch time is
>> spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> return.
>
> The TSC represents the clock rate of the CPU, and thus can vary wildly
> when thermal and power management controls kick in, and there is no way
> to know when it changes. Because of this, I think that it's
> practically useless on Pentium-Mobile and Pentium-M chips, among many
> others.

This is a myth. It is not as dismal as you portray, and cpufreq(4) gives
both the kernel and userland a way of getting the necessary info in an
MI way (including notification of clock rate changes), and of controlling
it when possible. There are a number of mechanisms actually in the world
today:

* SMM-based clock switching: most laptops have SMM code (i.e. BIOS) that
checks the power line status on boot and sets the base clock rate. They
use the standard platform mechanism (i.e. enh speedstep, speedstep-ich)
to set the frequency and cpufreq(4) allows the user or kernel to freely
override it at runtime. All that is left to do is for timecounters to
export a "re-calibrate" option that works at runtime and for cpufreq(4)
to call it when the frequency is changed by the kernel/usermode. bde@
supplied some code I hope to import soon once I have it well tested that
implements such a runtime calibration, although it is just used
internally by cpufreq(4), not hooked into timecounters at the moment.
Note that no BIOS I know of actually changes the value after boot, so
TSC is reliable unless we change it ourselves.

* p4tcc: thermal control circuit. Version 1 does x/8 throttling of the
CPU by an internal stop clock cycle, where "x" is an integer. Version 2
also can step the clock rate via enh speedstep. There are two parts to
this, the platform (BIOS) setting and "on demand" (kernel) setting. The
OS can use the on demand setting via cpufreq(4) to save power or for
passive cooling. We initiate this ourselves, so once the timecounter
interface can accept an updated calibration, there is no issue here.
The platform setting is worse in that we don't know when it kicks in.
However, it is intended as an emergency measure like if a fan dies. All
known BIOSen set this value just below the thermal shutdown circuit
(i.e. the processor stops operation completely). As such, this is an
edge case that we do not have to handle particularly efficiently. It
suffices to periodically check the calibration of TSC (perhaps every 10
seconds?) via the ACPI timer and update our settings if it has changed.
Since cpufreq(4) knows all the possible settings, it suffices to just
measure the clock rate and compare it to a table of valid settings.
There is no ambiguity (yet) since every CPU control mechanism has
discrete settings.

> There is also the issue of multiple CPUs having to keep their
> TSC's somewhat in sync in order to get consistent counting in the
> system. The best that you can do is to periodically read a stable
> counter and try to recalibrate, but then you'll likely start getting
> wild operational variances.

> It's a shame that a PIO read is still so
> expensive. I'd hate to see just how bad your benchmark becomes when
> ACPI-slow is used instead of ACPI-fast.

ACPI-slow should not be used at all. If the acpi timer is unreliable,
use a different one. Also, I think most systems that had unreliable
acpi timers were older and not likely to have variable CPU clocks. So
I'd prefer TSC on such systems anyway.

> I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
> of an idea. Having preemption in the kernel means that ithreads can run
> right away instead of having to wait for a tick, and various fixes to
> 4BSD in the past year have eliminated bugs that would make the CPU wait
> for up to a tick to schedule a thread. So all we're getting now is a
> 10x increase in scheduler overhead, including reading the timecounters.

I use hz=100 on my systems due to the 1 kHz noise from C3 sleep.
Windows has the same problem.

--
Nate

Poul-Henning Kamp

Oct 18, 2005, 1:57:28 PM
to Nate Lawson, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org
In message <435527DD...@root.org>, Nate Lawson writes:


>I have good information that in the near future, most designs will have
>guaranteed synchronized TSC across all CPUs.

...and when those chips arrive, we can hopefully identify them by some
bit in some MSR and then we can use the TSC on them.

This is a good move and it is only too bad that it's taken the chip
manufacturers 10 years to figure this out.

Poul-Henning Kamp

Oct 18, 2005, 2:05:27 PM
to Nate Lawson, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
In message <43553162...@root.org>, Nate Lawson writes:

>> The TSC represents the clock rate of the CPU, and thus can vary wildly
>> when thermal and power management controls kick in, and there is no way
>> to know when it changes. Because of this, I think that it's
>> practically useless on Pentium-Mobile and Pentium-M chips, among many
>> others.
>
>This is a myth.

It isn't a myth.

As recently as this year, chips have been put on the market which
will throttle their cpu-clock and TSC under certain chip stress
conditions without giving any timely indication to any part of
the BIOS or OS.

One major BIOS supplier still mucks up SMP TSC synchronization on
certain SMM bios actions.

And remember: not everybody runs intel or AMD chips.

We need to work on sparc64 and alpha chips as well.

Alpha is particularly nasty as some of the older chips have a SAW
generated CPU clock which is not synchronized to the bus clock.

>> There is also the issue of multiple CPUs having to keep their
>> TSC's somewhat in sync in order to get consistent counting in the
>> system.

For "somewhat" read: "exact"

Unless we want to do the Lamport-trick and pay the overhead of
inter-CPU locks when we calculate timestamps, they have to be in
_exact_ synchronization _and_ syntonization.


The solution to the context switch problem is _not_ to botch
the timekeeping, the solution is to not _need_ the timekeeping.

Andre Oppermann

Oct 18, 2005, 2:49:17 PM
to Poul-Henning Kamp, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, Nate Lawson
Poul-Henning Kamp wrote:
>
> In message <435527DD...@root.org>, Nate Lawson writes:
>
> >I have good information that in the near future, most designs will have
> >guaranteed synchronized TSC across all CPUs.
>
> ...and when those chips arrive, we can hopefully identify them by some
> bit in some MSR and then we can use the TSC on them.
>
> This is a good move and it is only too bad that it's taken the chip
> manufacturers 10 years to figure this out.

Considering that Nate knows about it, and that it took cpu manufacturers
so long, I suspect they did it to make some DRM schemes work.

--
Andre

Nate Lawson

Oct 18, 2005, 3:47:22 PM
to Andre Oppermann, cvs...@freebsd.org, Poul-Henning Kamp, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org

Nah, ACPI tables on new machines tend to give info about what major OS
vendors will soon support.

--
Nate

Nate Lawson

Oct 18, 2005, 5:38:29 PM
to Poul-Henning Kamp, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
Poul-Henning Kamp wrote:
> In message <43553162...@root.org>, Nate Lawson writes:
>
>
>>>The TSC represents the clock rate of the CPU, and thus can vary wildly
>>>when thermal and power management controls kick in, and there is no way
>>>to know when it changes. Because of this, I think that it's
>>>practically useless on Pentium-Mobile and Pentium-M chips, among many
>>>others.
>>
>>This is a myth.
>
> It isn't a myth.
>
> As recent as this year chips have been sent on the market which
> will throttle their cpu-clock and TSC on certain chip stress
> conditions without giving any timely indication to any part of
> the BIOS or OS.

Does this refer to the p4tcc platform limit that I described, or is it
something different? In my analysis, the limit is set very high and
should not be hit unless a fan fails. This info seems to match my
observations:

"Thermal Monitor controls the processor temperature by modulating
(starting and stopping) the processor core clocks. Automatic and
On-Demand modes are used to activate the thermal control circuit (TCC).
When automatic mode is enabled, the TCC will activate only when the
internal die temperature is very near the temperature limits of the
processor."

http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/box_processors/mobile/celeron_m/technical_reference/97374.htm

For Prescott, the temp for automatic cut-in is around 72C. We may be
able to detect this on SMP systems via IPIs (I don't know which ones).

http://softwareforums.intel.com/ids/board/message?board.id=49&message.id=456

That doesn't mean we can ignore it, just that we don't have to optimize
for that case. When your CPU is about to melt down, having slower
scheduling for a few seconds doesn't seem unreasonable.

> The solution to the context switch problem is _not_ to botch
> the timekeeping, the solution is to not _need_ the timekeeping.

Yes, I agree. We need to fix context switching to not be
binuptime()-based and separately improve TSC support so it can be used
more often as a timecounter.

--
Nate

Bruce Evans

Oct 20, 2005, 1:45:21 AM
to Scott Long, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, David Xu
On Tue, 18 Oct 2005, Scott Long wrote:

[Excessive quoting retained since I want to comment on separate points.]

> Andrew Gallatin wrote:
>> Scott Long writes:
>> > Andrew Gallatin wrote:
>> > > David Xu [dav...@FreeBSD.org] wrote:
>> > > > davidxu 2005-10-17 23:10:31 UTC
>> > > >
>> > > > FreeBSD src repository
>> > > >
>> > > > Modified files:
>> > > > sys/amd64/amd64 cpu_switch.S machdep.c
>> > > > Log:
>> > > > Micro optimization for context switch. Eliminate code for saving gs.base
>> > > > and fs.base. We always update pcb.pcb_gsbase and pcb.pcb_fsbase
>> > > > when user wants to set them, in context switch routine, we only need to
>> > > > write them into registers, we never have to read them out from registers
>> > > > when thread is switched away. Since rdmsr is a serialization instruction,
>> > > > micro benchmark shows it is worthy to do.
>> > >
>> > > Nice. This reduces lmbench context switch latency by about 0.4us (7.2
>> > > -> 6.8us), and reduces TCP loopback latency by about 0.9us (36.1 ->
>> > > 35.2) on my dual core 3800+

I wonder if this reduces the context switch latency from about 1.320
usec to 0.900 usec on my A64-3000. The latency is only .520 usec in
i386 mode. I use a TSC timecounter of course.

The fastest loopback latency that I've seen is 5.638 usec under
Linux-2.2.9 on the same machine. In Linux-2.6.10, it has regressed
to 17.1 usec. In FreeBSD last year, it was 10.8 usec on the same
machine in i386 mode and 19.0 in amd64 mode. So the A64 can almost
keep up with an AXP-1400 running a pre-SMPng version of FreeBSD where
it was 9.94 usec.

[... Nonsense by phk already snipped]

The timecounter is not used by schedulers, so the inefficiency of non-TSC
timecounters and its effect on context switching has nothing to do with
schedulers. Schedulers use mainly tick counts, and intentionally don't
try hard to keep track of interrupt times because the fine-grained
timekeeping needed to keep track of interrupts would be too expensive.
It is still too expensive, but is now done (except for fast interrupts),
but is not used by schedulers. The timestamps taken by mi_switch() are
used mainly by userland statistics utilities. They are very useful for
debugging and for otherwise understanding system behaviour, but are
sometimes too inefficient.

>> > > It is a shame we can't find a way to use the TSC as a timecounter on
>> > > SMP systems. It seems that about 40% of the context switch time is
>> > > spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> > > return.

It seems to be more like 95% in my case.

>> > > Drew
>> >
>> > The TSC represents the clock rate of the CPU, and thus can vary wildly
>> > when thermal and power management controls kick in, and there is no way
>> > to know when it changes. Because of this, I think that it's
>> > practically useless on Pentium-Mobile and Pentium-M chips, among many
>> > others. There is also the issue of multiple CPUs having to keep their
>> > TSC's somewhat in sync in order to get consistent counting in the
>> > system. The best that you can do is to periodically read a stable
>> > counter and try to recalibrate, but then you'll likely start getting
>> > wild operational variances.

I agree that it's too hard to sync the TSC on systems with power
management. It would be easy enough to sync with the i8254 every HZ,
but even that would give extreme nonlinearities when the TSC frequency
jumps up or down. Jumping up is the worst case. E.g, if the TSC
frequency starts at 1GHz and HZ is 1000 expect the TSC count to increment
by 10^6 in the next msec. If the TSC frequency jumps up to 2GHz, then
the TSC count will actually increment by 2*10^6. I see nothing better
than recalibrating half way into the next msec (when the TSC count
reaches 10^6) and then wildly slewing the TSC clock to keep the 10^6
increment in the count expected in the next half a msec from causing
another half-msec error.

>> As I pointed out in another thread, both linux and solaris do it.
>> Solaris seems to have a nice algorithm for keeping things in sync, and
>> accounting for the TSC getting cleared after suspend/resume etc. At
>> my level of understanding, this argument is nothing more than "but
>> Mom, all the other kids are doing it". I was just hoping that
>> somebody with real understanding could pick up on it.
>
> Steering multiple TSC's together isn't that hard and there are plenty of
> examples, as you point out. Accounting for the changes due to thermal
> and power management (note that this isn't the same problem as suspend
> and resume) is what worries me.

Possibly the systems with power management don't matter here. Power
management is currently only essential for portable machines, and the
portable machines won't have multi-Gb/s networks to keep up with and
might not have such strict real time requirements.

>> > It's a shame that a PIO read is still so
>> > expensive. I'd hate to see just how bad your benchmark becomes when
>> > ACPI-slow is used instead of ACPI-fast.
>>
>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>> system (6.5us).

I don't know why your system is so slow. I get ~50nsec for TSC, ~1000 nsec
for ACPI-fast, ~3000 nsec for ACPI-slow and ~4000 nsec for i8254. But PIO
keeps getting slower even in absolute terms. My (nearly) newest system
(nForce2) has ISA PIO times of 1133 nsec for the i8254 registers where
my first PCI system (with an early Intel chipset) has a read time of 703
nsec and a write time of 1180 nsec. The nForce2 system also has a PCI
PIO read time of 290 nsec for the same PCI card that can be read in 125
nsec (overclocked) or 150 nsec (not overclocked) on a KT266A system.

>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>> > of an idea. Having preemption in the kernel means that ithreads can run
>> > right away instead of having to wait for a tick, and various fixes to
>> > 4BSD in the past year have eliminated bugs that would make the CPU wait
>> > for up to a tick to schedule a thread. So all we're getting now is a
>> > 10x increase in scheduler overhead, including reading the timecounters.
>>
>> Yeah. I moved mine back to hz=1000 when I noticed 4000 interrupts/sec
>> on an idle system.
>

> Do you mean 1000 or 100 here? Anyways, the high clock interrupt rate is
> so that we can use the local apic clock to get the various system ticks
> that we have instead of continuing to fight motherboards that no longer
> hook up the 8259 in a sane way. This is why 5.x doesn't work well on a
> number of new motherboards (nvidia ones especially) but 6.x works just
> fine.

[Dan actually meant 100.]

I use 100 and never downgraded to use 1000 except for testing how bad
it is. The default number is now up to <number of CPUs> * 2 * HZ.
E.g., it is 4000 on sledge.freebsd.org. While 4000 interrupts/sec can
be handled easily by any new machine, 4000 is a disgustingly large
number to use for clock interrupts. Have a look at vmstat -i output
on almost any machine. On most machines in the freebsd cluster, the
total number of interrupts is dominated by clock interrupts even with
HZ = 100.

The main use for a large HZ is to support low quality hardware and applications
that need or want to poll very often.

Bruce

Bruce Evans

Oct 20, 2005, 1:53:15 AM10/20/05
to John Baldwin, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, David Xu
On Tue, 18 Oct 2005, John Baldwin wrote:

> On Tuesday 18 October 2005 09:44 am, Andrew Gallatin wrote:
>> It is a shame we can't find a way to use the TSC as a timecounter on
>> SMP systems. It seems that about 40% of the context switch time is
>> spent just waiting for the PIO read of the ACPI-fast or i8254 to
>> return.
>
> You can try it by just setting the kern.timecounter.smp_tsc=1 tunable on boot.

There is no need for this. Just set the timecounter using sysctl after
booting (and quickly switch it back if it doesn't work).

This tuneable, like most, shouldn't exist. It may be a relic from
when the TSC wasn't put in the list of available timecounters in the
SMP case. It is now put in the list with a negative "quality", but
the sysctl to set the timecounter is correctly not restricted by the
quality.

Bruce

David Xu

Oct 20, 2005, 3:39:20 AM10/20/05
to Bruce Evans, cvs...@freebsd.org, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org

we can avoid reloading userland GS.base MSR and FS.base MSR for system
threads, I am not sure if it can reduce interrupt thread latency.

David Xu


Bruce Evans

Oct 20, 2005, 4:01:38 AM10/20/05
to Poul-Henning Kamp, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
On Tue, 18 Oct 2005, Poul-Henning Kamp wrote:

> [At the risk of repeating myself once more...]

> ...

> One of the things you have to realize is that once you go down this
> road you need a lot of code for all the conditionals.
>
> For instance you need to make sure that every new timestamp you
> hand out not prior to another one, no matter what is happening to
> the clocks.

Clocks are already incoherent in many ways:
- the times returned by the get*() functions are incoherent with the ones
returned by the functions that read the hardware, because the latter
are always in advance of the former and the difference is sometimes
visible at the active resolution. POSIX tests of file times have
been reporting this incoherency since timecounters were implemented.
The tests use time() to determine the current time and stat() to
determine file times. In the sequence:

t1 = time(...):
sleep(1)
touch(file);
stat(file);
t2 = mtime(file);

t2 should be > t1, but the bug lets t2 == t1 happen.

- times are incoherent between threads unless the threads use their
own expensive locking to prevent this. This is not very different
from timestamps being incoherent between CPUs unless the system uses
expensive locking to prevent it.

> ...

>>> It seems like reading ACPI-fast is "only" 3us or so, but when the ctx
>>> switch is otherwise 4us, it adds up. i8254 is much worse on this
>>> system (6.5us).
>
> i8254 is always bad, and about as bad as it can be.

The i8254 is not that bad, and far from as bad as can be.

> Mostly because
> of the need to disable interrupts (Actually, that's a critical
> section today, isn't it ?) and also hobbled by the three 8 bit
> ISA-bus(-like) accesses needed.

Mostly not:
- disabling interrupts is not necessary; it was done mainly because it
is most efficient except (apparently) on P4's. It is only necessary
to repeat the read if the conditions were changed underneath us by an
interrupt. Whether there was an interrupt can easily be determined
by looking at the interrupt count.

Disabling of interrupts is still always used, at least on i386's. This
is essential in the non-lapic case and good in the lapic case:
- In the non-lapic case, the code hasn't changed significantly lately
and still has an explicit hard-disablement. There is a magic number
of 20 i8254 cycles (spelled TIMER0_LATCH_COUNT in axed code) that
gives a real-time requirement on the maximum time between the i8254
timer read and the check for rollover. Disabling interrupts is not
sufficient to meet this requirement since bus activity may lengthen
the time for the combined i/o to many more than 20 cycles (I've
measured about 200 for similar code in getit()), but it mostly works.
If interrupts were not hard-disabled, then almost any interrupt would
break this requirement.
- In the lapic case, there is now only a spin mutex on the clock lock.
The lock is essential, and it gives a critical section which is almost
as essential (since without the critical section a low priority
thread reading the i8254 might be preempted while holding the
lock). Spin mutexes still hard-disable interrupts, so interrupts
are still hard-disabled as a side effect. Hard-disabling interrupts
for spinlocks is a bug, but here it is good though not essential.
It prevents fast interrupt handlers and low-level non-context-switching
interrupt code from running. There is no longer a requirement for
completing the function in 20 i8254 cycles, but doing so is safest.

The simplification in the lapic case has very little to do with
interrupts, clock or otherwise. The real-time requirement is now that
i8254_get_timecount() be called significantly more often than the
i8254 rolls over. This is now easily satisfied by increasing the
rollover period to ~55 msec and depending on users not configuring
HZ to permitted values of <= 18 Hz. Even HZ = 100 provides a safety
margin. This method could also be used for the non-lapic case,
using either another source of periodic interrupts to keep calling
i8254_get_timecount() significantly more often than every 1/HZ seconds,
or by using another source for hardclock interrupts. On i386's, the
RTC would work perfectly for clock interrupts too except for minor
problems in schedulers and maybe applications wanting timeouts of
exactly 10 msec.

- only 1 or 2 accesses are needed:
- 2 with only the LSB of the count used. This requires HZ to be larger than about
5000. Large HZ are undesirable in general but are sometimes good for
dumb hardware like the i8254.
- 1 with unlatched reads. I could never get this to work.

>>> > I wonder if moving to HZ=1000 on amd64 and i386 was really all that good
>>> > of an idea.
>
> The main benefit was getting more precise timeouts, something we have
> at various times thought about implementing with deadline counters
> on platforms that have it. Nobody has done it though.

Dragonfly did it.

> So, instead of looking for "quick fixes", lets look at this with a
> designers or architects view:
>
> On a busy system the scheduler works hundred thousand times per
> second, but on most systems nobody ever looks at the times(2) data.

More like 1000 times a second. Even stathz = 128 gives too many decisions
per second for the 4BSD scheduler, so it is divided down to 16 per second.
Processes blocking on i/o may cause many more than 128/sec calls to the
scheduler, but there should be nothing much to decide then.

> The smart solution is therefore to postpone the heavy stuff into
> times(2) and make the scheduler work as fast as it can.

Once more: schedulers haven't used anything related to times(2) since
the ancient version of 3BSD or 4BSD where times() was superseded by
gettimeofday(), and have never used timecounters. (Even times(2) doesn't
use anything related to scheduling except to fake 4BSD scheduler clock
ticks in its API.)

> So the scheduler should read the TSC and schedule in TSC-ticks.

Schedulers never read the TSC. They schedule in statclock ticks.

> times(2) will then have to convert this to clock_t compatible
> numbers.

It has converted from real times to clock_t's since before FreeBSD-1.
The real times happen to be implemented using timecounters and the
timecounter may be the TSC. times() doesn't really care. OTOH,
getrusage() reports process times in real times (with only some
resolution lost by converting MD times to bintimes and then bintimes
to timevals).

> According the The Open Group, clock_t is in microseconds by means
> of historical standards mistakes.

clock_t in microseconds is required for historical mistakes in OS's
supported by The Open Group. FreeBSD never had these particular
mistakes. It has different ones, and has sysconf(_SC_CLK_TCK) fixed
at 128 to support them. (Note that the units for clock_t are not the
same for all uses of clock_t, but for the historical times() mistake
they are 1/sysconf(_SC_CLK_TCK) seconds. As an implementation detail,
FreeBSD uses 1/128 for all clock_t's even in cases where the historical
mistakes have less inertia.)

> However, I can see nowhere that would collide with an interpretation
> that said "clock_t is microseconds PROVIDED the cpu had run at full
> speed", so a simple one second routine to latch the highest number
> of TSC-tics we've seen in a second would be sufficient to generate
> the conversion factor.
>
> And in many ways this would be a much more useful metric to offer
> (in top(1)) than the current rubber-band-cpu-seconds.

You seem to have left out a "not" here. Users mostly only care about
the real time taken by their processes. If the conversion factor is
constant then it is possible for even users to apply it to convert from
the units displayed by top and friends to their favourite units, but
with variable conversion factors it would be difficult for even
applications to do the conversion. Syscalls would have to return a
table giving their best idea of the conversion factors at different
times in the processes lifetime, and applications would have to
integrate over time to convert to a single number to display to the
user, according to user-specified weights. Better yet, put the
integration in the kernel and use syscalls to tell the kernel the
weights ;-).

Anyway, getrusage() has fewer historical mistakes than times(), and
maintaining non-broken support for it requires using timecounters in
mi_switch() almost like we already do. Hmm. Checking the history
shows some anachronisms in what I said in the above. It is only
necessary to go back as far as FreeBSD-1 to find a BSD where ticks are
used for getrusage() too. In FreeBSD-1, there wasn't even an mi_swtch().
Context switches went directly to MD code in swtch() and swtch() was
missing calls to microtime()/bintime() and many other expenses. The
bogusness in times() and getrusage() was sort of reversed -- getrusage()
(actually hardclock()) converted from low-resolution tick counts to
high resolution timevals and times() just returned the tick counts;
now getrusage() only uses the tick counts for dividing up the total
time and times() converts from the high-res units back to low-res ones
and ends up with less accuracy than it started with due to double
rounding.

So the current pessimizations from timecounter calls in mi_switch()
are an end result of general pessimizations of swtch() starting in
4.4BSD. I rather like this part of the pessimizations...

Bruce

Poul-Henning Kamp

Oct 20, 2005, 4:27:09 AM10/20/05
to Bruce Evans, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
In message <2005102015...@delplex.bde.org>, Bruce Evans writes:

>> One of the things you have to realize is that once you go down this
>> road you need a lot of code for all the conditionals.
>>
>> For instance you need to make sure that every new timestamp you
>> hand out not prior to another one, no matter what is happening to
>> the clocks.
>
>Clocks are already incoherent in many ways:
>- the times returned by the get*() functions incoherent with the ones
> returned by the functions that read the hardware, because the latter
> are always in advance of the former and the difference is sometimes
> visible at the active resolution.

Sorry Bruce, but this is just FUD: The entire point of the get*
family of functions is to provide "good enough" timestamps, very
fast, for code that knows it doesn't need better than roughly 1/hz
precision.

> visible at the active resolution. POSIX tests of file times have
> been reporting this incoherency since timecounters were implemented.
> The tests use time() to determine the current time and stat() to
> determine file times. In the sequence:
>
> t1 = time(...):
> sleep(1)
> touch(file);
> stat(file);
> t2 = mtime(file);
>
> t2 should be > t1, but the bug lets t2 == t1 happen.

t2 == t1 is not illegal.

The morons who defined a non-extensible timestamp format obviously
didn't believe in Andy Moore, but given a sufficiently fast computer
the resolution of the standardized timestamps prevents t2 > t1 in
the above test code.

>- times are incoherent between threads unless the threads use their
> own expensive locking to prevent this. This is not very different
> from timestamps being incoherent between CPUs unless the system uses
> expensive locking to prevent it.

Only if the get* family of functions is used in places where they
> shouldn't be. I believe there is a sysctl which determines if it
> is used for vfs timestamps. The default can be changed if necessary.

>> So, instead of looking for "quick fixes", lets look at this with a
>> designers or architects view:
>>
>> On a busy system the scheduler works hundred thousand times per
>> second, but on most systems nobody ever looks at the times(2) data.
>
>More like 1000 times a second. Even stathz = 128 gives too many decisions
>per second for the 4BSD scheduler, so it is divided down to 16 per second.
>Processes blocking on i/o may cause many more than 128/sec calls to the
>scheduler, but there should be nothing much to decide then.

I'm regularly running into 5 digits in the Csw field in systat -vm.
I don't know what events you talk about, but they are clearly not
the same as the ones I'm talking about.

The problem here is context-switch time, and while we can argue if
this is really scheduler related or not, the fact that the scheduler
decides which thread to context-switch to should be enough to
avoid a silly discussion of semantics.

>So the current pessimizations from timecounter calls in mi_switch()
>are an end result of general pessimizations of swtch() starting in
>4.4BSD. I rather like this part of the pessimizations...

It's so nice to have you back in action Bruce :-)

Bruce Evans

Oct 20, 2005, 7:11:11 AM10/20/05
to David Xu, cvs...@freebsd.org, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org
On Thu, 20 Oct 2005, David Xu wrote:

> Bruce Evans wrote:
>> I wonder if this reduces the context switch latency from about 1.320
>> usec to 0.900 usec on my A64-3000. The latency is only .520 usec in
>> i386 mode. I use a TSC timecounter of course.
>

> we can avoid reloading userland GS.base MSR and FS.base MSR for system
> threads, I am not sure if it can reduce interrupt thread latency.

I think it would recover some of the other 0.400 usec of the extra
overhead for the amd64 case.

We already avoid null reloads of %cr3 and avoiding null reloads of
FS/GS.base would be similar. Both are null only for intra-kernel
switches, so the savings are smaller than for the stores of FS/GS.base
since the reloads can't always be avoided.

Bruce

Bruce Evans

Oct 20, 2005, 8:55:23 AM10/20/05
to Poul-Henning Kamp, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
On Thu, 20 Oct 2005, Poul-Henning Kamp wrote:

> In message <2005102015...@delplex.bde.org>, Bruce Evans writes:
>
>>> One of the things you have to realize is that once you go down this
>>> road you need a lot of code for all the conditionals.
>>>
>>> For instance you need to make sure that every new timestamp you
>>> hand out not prior to another one, no matter what is happening to
>>> the clocks.
>>
>> Clocks are already incoherent in many ways:
>> - the times returned by the get*() functions incoherent with the ones
>> returned by the functions that read the hardware, because the latter
>> are always in advance of the former and the difference is sometimes
>> visible at the active resolution.
>
> Sorry Bruce, but this is just FUD: The entire point of the get*
> familiy of functions is to provide "good enough" timestamps, very
> fast, for code that knows it doesn't need better than roughly 1/hz
> precision.

This bug shows that the get* functions don't actually provide "good
enough" timestamps, even for what is probably their primary use --
ffs file times are probably their primary use, and these only need
a resolution of 1 second; however, they need to be accurate relative
to other clocks, and a precision of ~1/hz doesn't provide enough
accuracy due to implementation details.

>> visible at the active resolution. POSIX tests of file times have
>> been reporting this incoherency since timecounters were implemented.
>> The tests use time() to determine the current time and stat() to
>> determine file times. In the sequence:
>>
>> t1 = time(...):
>> sleep(1)
>> touch(file);
>> stat(file);
>> t2 = mtime(file);
>>
>> t2 should be > t1, but the bug lets t2 == t1 happen.
>
> t2 == t1 is not illegal.

It is just invalid and of low quality for file systems that provide a
resolution of 1 second in their timestamps. The sleep of 1 second in
there is specific to such file systems; it is to ensure that at least
1 second has elapsed between the time() and the touch().

> The morons who defined a non-extensible timestamp format obviously
> didn't belive in Andy Moore, but given a sufficiently fast computer
> the resolution of the standardized timestamps prevents t2 > t1 in
> the above test code.

POSIX specifies the resolution for file times but doesn't specify their
accuracy AFAIK (not far). Quality of implementation specifies their
accuracy. The above is a simple test for strict monotonicity of file
times that happens to test for accuracy and coherency too. This
monotonicity is very easy to get right. sleep(3) is required to sleep
for at least 1 second. nanosleep(2) is sloppy about this -- it uses
a get* function so it risks similar bugs, but I think none here since
the extra tick in the timeout provides a sufficient margin for error.
After sleeping for at least 1 second, the time has surely advanced by
1 second and timestamps taken by a coherent clock will see this. With
a time(2) in it, the test would just not see incoherencies of 1 second.

>> - times are incoherent between threads unless the threads use their
>> own expensive locking to prevent this. This is not very different
>> from timestamps being incoherent between CPUs unless the system uses
>> expensive locking to prevent it.
>
> Only if the get* family of functions is used in places where they
> shouldn't be. I belive there is a sysctl which determines if it
> is used for vfs timestamp. The default can be changed if necessary.

This point is for all the functions. A timestamp taken by 1 thread
might not be used until after many timestamps are taken and used by
other threads. Naive comparison of these timestamps would then give
apparent incoherencies. It is up to the threads to provide synchronization
points if they want to compare times. More interestingly, there is
no need to keep the timecounters seen by different threads perfectly
in sync except at synchronization points, since any differences would
be indistinguishable from ones caused by unsynchronized preemption.
(Strict real time to ~nanoseconds accuracy wouldn't work for either.)

I use the sysctl in POSIX tests so as not to keep seeing the file
times bugs, but I sometimes forget to use it so I get reminded of the
bugs anyway. IIRC, I got jdp to change the sysctl a bit to handle
more cases. He wanted an option for more resolution and I wanted one
to unbreak seconds resolution. The implementation actually uses
the get* functions for seconds and 1/hz resolution and the non-get*
functions for microseconds and nanoseconds resolution. So I use an
unnecessarily high resolution to avoid the bug.

>>> On a busy system the scheduler works hundred thousand times per
>>> second, but on most systems nobody ever looks at the times(2) data.
>>
>> More like 1000 times a second. Even stathz = 128 gives too many decisions
>> per second for the 4BSD scheduler, so it is divided down to 16 per second.
>> Processes blocking on i/o may cause many more than 128/sec calls to the
>> scheduler, but there should be nothing much to decide then.
>
> I'm regularly running into 5 digits in the Csw field in systat -vm.
> I don't know what events you talk about, but they are clearly not
> the same as the ones I'm talking about.

I just looked at csw values on machines in the freebsd cluster. They
may be underpowered and not heavily used, but they are more active
than any machine that I run and may be representative of general server
machines. On hub a few hours ago, csw was a transient 100-500 and the
average since boot time was 1010. The count since boot time may have
overflowed but the average is reasonable. hub has been up for 236
days and an average of 1010/sec gives a count of just below INT_MAX.

The 128/16 events are for timekeeping for scheduling. 4BSD does little
more than increment a tick count here. ULE does a bit more. Then
there are the rescheduling every second for 4BSD, and more distributed
rescheduling for ULE. On context switches, the scheduler has (or
should have) little to do. It is context switching itself that makes
the timestamps that become too expensive when csw is high.

> The problem here is context-switch time, and while we can argue if
> this is really scheduler related or not, the fact that the scheduler
> decides which thread to context-switch to should be enough to
> avoid a silly discussion of semantics.

The problem is still unrelated to (non-broken) schedulers. Most context
switches happens because something blocks on i/o or is preempted by
an interrupt handler (it's a very low level of scheduling -- just
interrupt priority -- that allows the preemption, so I don't count
it as part of scheduling). So unavoidable context switches can happen
a lot on busy machines and the scheduler can't/shouldn't affect their
count except possibly to reduce it a bit. Given that they happen a lot
on some systems, they should be as efficient as possible. I think the
timecounter part of their inefficiency is not very important except in
the usual case of a slow timecounter. Losses from busted caches may
dominate.

>> So the current pessimizations from timecounter calls in mi_switch()
>> are an end result of general pessimizations of swtch() starting in
>> 4.4BSD. I rather like this part of the pessimizations...
>
> It's so nice to have you back in action Bruce :-)

I don't plan to stay very active.

Bruce

Poul-Henning Kamp

Oct 20, 2005, 9:05:19 AM10/20/05
to Bruce Evans, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu
In message <200510202...@delplex.bde.org>, Bruce Evans writes:
>On Thu, 20 Oct 2005, Poul-Henning Kamp wrote:

>This point is for all the functions. A timestamp taken by 1 thread
>might not be used until after many timestamps are taken and used by
>other threads. Naive comparison of these timestamps would then give
>apparent incoherencies.

Ahh, but now we're into the "programmer doesn't understand concurrency"
territory, that has little to do with our timekeeping functions.

>On hub a few hours ago, csw was a transient 100-500 and the
>average since boot time was 1010.

The average since boot should not be optimized for, since we don't
really care what the machine does (or doesn't) when we are not
offering any workload to it.

>So unavoidable context switches can happen
>a lot on busy machines and the scheduler can't/shouldn't affect their
>count except possibly to reduce it a bit. Given that they happen a lot
>on some systems, they should be as efficient as possible. I think the
>timecounter part of their inefficiency is not very important except in
>the usual case of a slow timecounter. Losses from busted caches may
>dominate.

I would tend to agree with you there, but any sensible optimization
should be done.

>> It's so nice to have you back in action Bruce :-)
>
>I don't plan to stay very active.

Too bad, your considered opinion, even though we often disagree,
is one of the things I really enjoy around here: it forces me to
think harder.

John Baldwin

Oct 20, 2005, 9:59:19 AM10/20/05
to Bruce Evans, cvs...@freebsd.org, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, David Xu

Ah, I didn't realize the sysctl let you use negative quality timecounters.
The tunable does serve to automate it for remote machines I guess since it
doesn't pessimize the quality on SMP.

John Baldwin

Oct 20, 2005, 9:58:07 AM10/20/05
to Bruce Evans, Scott Long, src-com...@freebsd.org, Andrew Gallatin, cvs...@freebsd.org, cvs...@freebsd.org, David Xu

Note that on 4.x you don't get to see the interrupt counts for the hz + stathz
* (cpus - 1) IPIs for all the clock interrupts, so in real numbers, each CPU
has gone from hz + stathz to hz * 2 interrupts. However, the higher number
is offset by the fact that the interrupt handler for the lapic case doesn't
have to touch any hardware, and it also works much more reliably (getting
irq0 to work in APIC mode on some amd64 nvidia chipsets required several
quirks, and future motherboards will probably continue to require quirks
since Windows uses the APIC timer in APIC mode and doesn't require irq0 to
work in APIC mode).

Scott Long

Oct 20, 2005, 10:34:38 AM10/20/05
to John Baldwin, src-com...@freebsd.org, Andrew Gallatin, Bruce Evans, cvs...@freebsd.org, cvs...@freebsd.org, David Xu

I'm in complete agreement that using the APIC timer is the right thing
to do, and I believe that we did some tests to show that the high
interrupt rate didn't have an appreciable effect on performance.
However, I'd like to revisit the HZ=1000 decision for 7-CURRENT.

Scott

John Baldwin

Oct 20, 2005, 10:53:01 AM10/20/05
to Scott Long, src-com...@freebsd.org, Andrew Gallatin, Bruce Evans, cvs...@freebsd.org, cvs...@freebsd.org, David Xu

Agreed.

M. Warner Losh

Oct 20, 2005, 11:19:25 AM10/20/05
to sco...@samsco.org, src-com...@freebsd.org, j...@freebsd.org, b...@zeta.org.au, cvs...@freebsd.org, cvs...@freebsd.org, dav...@freebsd.org, gall...@cs.duke.edu
In message: <4357AAFE...@samsco.org>
Scott Long <sco...@samsco.org> writes:
: However, I'd like to revisit the HZ=1000 decision for 7-CURRENT.

At Timing Solutions, we run with HZ=1000 to reduce the latency for
interacting with serial devices (since we have highly synchronous
protocols that are spoken over them). Other than that, we've seen no
performance differences between HZ=100 and HZ=1000 in other areas of
our systems. We have noted a small increase in overhead with 1000,
but since we have plenty of CPU to burn, we burn a little to get
better latencies... We'll likely tune the number based on our
experience, so changing the default HZ won't impact us.

Warner

Robert Watson

Oct 20, 2005, 11:46:28 AM10/20/05
to M. Warner Losh, sco...@samsco.org, src-com...@freebsd.org, j...@freebsd.org, b...@zeta.org.au, cvs...@freebsd.org, cvs...@freebsd.org, dav...@freebsd.org, gall...@cs.duke.edu

I've seen reports of TCP improvements as a result of more precise timing,
but I've also seen reports of minor performance reduction as a result of
the increased overhead. Some of the problems here were reduced by
removing naive uses of callouts that ran every tick in order to run their
own job scheduler which then selected to run jobs only every now and then.
This still exists in some of the RPC-related code in NFS, and needs to be
addressed. It's also important for Xen, because in Xen it's desirable to
only run per-domain clock ticks if there's work to do, so there are
optimizations in Xen to use programmable timers for callouts rather than
running them frequently in order to avoid having to run all the domains
every time a timer tick fires.

Revisiting the 1000hz decision does make sense, but there are real
trade-offs here: higher accuracy in timing potentially improves the
behavior of retransmission and drop detection for network services in high
performance environments. With time scales on packet processing events
being on the order of a millionth of a second, things are a lot different
than previously.

Robert N M Watson
