Kernel information about the calibrated value of tsc_khz and
tsc_stability (the result of the tsc warp test) is useful
for any app that wants to use the TSC directly. Export this read-only
information in sysfs.
Signed-off-by: Venkatesh Pallipadi <ve...@google.com>
Signed-off-by: Dan Magenheimer <dan.mag...@oracle.com>
---
arch/x86/kernel/tsc.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 76 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91a..24dd484 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -10,6 +10,7 @@
#include <linux/clocksource.h>
#include <linux/percpu.h>
#include <linux/timex.h>
+#include <linux/sysdev.h>
#include <asm/hpet.h>
#include <asm/timer.h>
@@ -857,6 +858,81 @@ static void __init init_tsc_clocksource(void)
clocksource_register(&clocksource_tsc);
}
+#ifdef CONFIG_SYSFS
+/*
+ * Export TSC-related info to user land. This reflects kernel usage of the
+ * TSC, as hints to userspace users of the TSC. The read-only info provided
+ * here:
+ * - tsc_stable: 1 implies the system has a TSC that always counts at a
+ * constant rate, is synced across CPUs and has passed the kernel warp test.
+ * - tsc_khz: TSC frequency in kHz.
+ * - tsc_mult and tsc_shift: multiplier and shift to optimally convert a
+ * TSC delta to ns; ns = ((u64) delta * mult) >> shift
+ */
+
+#define define_show_var_function(_name, _var) \
+static ssize_t show_##_name( \
+ struct sys_device *dev, struct sysdev_attribute *attr, char *buf) \
+{ \
+ return sprintf(buf, "%u\n", (unsigned int) _var);\
+}
+
+define_show_var_function(tsc_stable, !tsc_unstable);
+define_show_var_function(tsc_khz, tsc_khz);
+define_show_var_function(tsc_mult, clocksource_tsc.mult);
+define_show_var_function(tsc_shift, clocksource_tsc.shift);
+
+static SYSDEV_ATTR(tsc_stable, 0444, show_tsc_stable, NULL);
+static SYSDEV_ATTR(tsc_khz, 0444, show_tsc_khz, NULL);
+static SYSDEV_ATTR(tsc_mult, 0444, show_tsc_mult, NULL);
+static SYSDEV_ATTR(tsc_shift, 0444, show_tsc_shift, NULL);
+
+static struct sysdev_attribute *tsc_attrs[] = {
+ &attr_tsc_stable,
+ &attr_tsc_khz,
+ &attr_tsc_mult,
+ &attr_tsc_shift,
+};
+
+static struct sysdev_class tsc_sysclass = {
+ .name = "tsc",
+};
+
+static struct sys_device device_tsc = {
+ .id = 0,
+ .cls = &tsc_sysclass,
+};
+
+static int __init init_tsc_sysfs(void)
+{
+ int err, i = 0;
+
+ err = sysdev_class_register(&tsc_sysclass);
+ if (err)
+ return err;
+
+ err = sysdev_register(&device_tsc);
+ if (err)
+ goto fail;
+
+ for (i = 0; i < ARRAY_SIZE(tsc_attrs); i++) {
+ err = sysdev_create_file(&device_tsc, tsc_attrs[i]);
+ if (err)
+ goto fail;
+ }
+
+ return 0;
+
+fail:
+ while (--i >= 0)
+ sysdev_remove_file(&device_tsc, tsc_attrs[i]);
+
+ sysdev_unregister(&device_tsc);
+ sysdev_class_unregister(&tsc_sysclass);
+ return err;
+}
+device_initcall(init_tsc_sysfs);
+#endif
+
#ifdef CONFIG_X86_64
/*
* calibrate_cpu is used on systems with fixed rate TSCs to determine
--
1.7.0.1
> From: Dan Magenheimer <dan.mag...@oracle.com>
>
> Kernel information about calibrated value of tsc_khz and
> tsc_stability (result of tsc warp test) are useful bits of information
> for any app that wants to use TSC directly. Export this read_only
> information in sysfs.
Is this really a good idea? It will encourage applications
to use RDTSC directly, but there are all kinds of constraints on
that. Even the kernel has a hard time with them; how likely
is it that applications will get all that right?
It would be better to fix them to use the vsyscalls instead.
Or, if they can't use the vsyscalls for some reason today, fix that.
This way if anything changes again in TSC the kernel could
shield the applications.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
The fail path will call sysdev_unregister(&device_tsc) even when
sysdev_register() itself has failed, which is not appropriate; please
fix this goto.
Thanks,
--
Jaswinder Singh.
Indeed, that is what it is intended to do.
> that. Even the kernel has a hard time with them, how likely
> is it that applications will get all that right?
That's the point of exposing the tsc_reliable kernel data.
If the processor has Invariant TSC and the system has
successfully passed Ingo's warp test and, as a result
the kernel is using TSC as a clocksource, why not enable
userland apps that need to obtain timestamp data
tens or hundreds of thousands of times per second to
also use the TSC directly?
> It would be better to fix them to use the vsyscalls instead.
> Or if they can't use the vsyscalls for some reason today fix them.
The problem is that, from an app point of view, there is no vsyscall.
There are two syscalls: gettimeofday and clock_gettime. Sometimes,
if it gets lucky, they turn out to be very fast, and sometimes
it doesn't get lucky and they are VERY slow (resulting in a performance
hit of 10% or more), depending on a number of factors completely
out of the control of the app and even undetectable to the app.
Note also that even a vsyscall with the TSC as the clocksource will
still be significantly slower than rdtsc, especially in the
common case where a timestamp is directly stored and the
delta between two timestamps is later evaluated; in the
vsyscall case, each timestamp is a function call and a conversion
to nsec, but in the TSC case, each timestamp is a single
instruction.
> This way if anything changes again in TSC the kernel could
> shield the applications.
If tsc_reliable is 1, the system and the kernel are guaranteeing
to the app that nothing will change in the TSC. In an Invariant
TSC system that has passed Ingo's warp test (to eliminate the
possibility of a fixed interprocessor TSC gap due to a broken BIOS
in a multi-node NUMA system), if anything changes in the clock
signal that drives the TSC, the system is badly broken and far
worse things -- like inter-processor cache incoherency -- may happen.
Is it finally possible to get past the horrible SMP TSC problems
of the past and allow apps, under the right conditions, to be able
to use rdtsc again? This patch argues "yes".
Thanks for catching this. Will fix this in the patch refresh.
Thanks,
Venki
I am a little concerned about applications getting this right, with
rdtsc and related barriers etc. Maybe that calls for a userspace
library.
Despite that, it is useful information for a human user to know
whether the TSC is stable and what the TSC frequency is. It would be
very hard for userspace to do the TSC calibration. And to know whether
the TSC warp test passed or whether the TSC is marked unstable by the
kernel, there is no one place to get these answers today. Users have
to look at clocksource, dmesg, etc.
Thanks,
Venki
> > It would be better to fix them to use the vsyscalls instead.
> > Or if they can't use the vsyscalls for some reason today fix them.
>
> The problem is from an app point-of-view there is no vsyscall.
> There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> if it gets lucky, they turn out to be very fast and sometimes
> it doesn't get lucky and they are VERY slow (resulting in a
> performance hit of 10% or more), depending on a number of factors
> completely out of the control of the app and even undetectable to the
> app.
But the point is.. in the case you get that 10% hit.... that is exactly
the case where tsc would not work either!!!
>
> If tsc_reliable is 1, the system and the kernel are guaranteeing
> to the app that nothing will change in the TSC. In an Invariant
> TSC system that has passed Ingo's warp test (to eliminate the
> possibility of a fixed interprocessor TSC gap due to a broken BIOS
> in a multi-node NUMA system), if anything changes in the clock
just when we're trying to get rid of this constraint by allowing a per
cpu offset... (this is needed to cope with cpus not powering on at the
exact same time... including hotplug cpu etc etc)
oh and.. what notification mechanism do you have to notify the
application that the tsc now is no longer reliable? Such conditions
can exist... for example due to a CPU being hotplugged, or some SMM
screwing around and the kernel detecting that or .. or ...
really. Use the vsyscall. If the vsyscall does not do exactly what you
want, make a better vsyscall.
But friends don't let friends use rdtsc in application code.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> But friends don't let friends use rdtsc in application code.
Um, I realize that many people have been burned by this
many times over the years so it is a "hot stove". I also
realize that there are many environments where using
rdtsc is risking stepping on landmines. But I (we?) also
know there are many environments now where using rdtsc is
NOT risky at all... and with the vast majority of new
systems soon shipping with Invariant TSC and a single socket
(and even most multiple-socket systems with non-broken
BIOSes passing a warp test), why should past burns outlaw
userland use of a very fast, very useful CPU feature? After
all, CPU designers at both Intel and AMD have spent
a great deal of design effort and transistors to FINALLY
provide an Invariant TSC.
> > The problem is from an app point-of-view there is no vsyscall.
> > There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> > if it gets lucky, they turn out to be very fast and sometimes
> > it doesn't get lucky and they are VERY slow (resulting in a
> > performance hit of 10% or more), depending on a number of factors
> > completely out of the control of the app and even undetectable to the
> > app.
>
> But the point is.. in the case you get that 10% hit.... that is exactly
> the case where tsc would not work either!!!
Yes, understood. But the kernel doesn't expose a "gettimeofday
performance sucks" flag either. If it did (or in the case of
the patch, if tsc_reliable is zero) the application could at least
choose to turn off the 10000-100000 timestamps/second and log
a message saying "you are running on old hardware so you get
fewer features".
> just when we're trying to get rid of this constraint by allowing a per
> cpu offset... (this is needed to cope with cpus not powering on at the
> exact same time... including hotplug cpu etc etc)
>
> oh and.. what notification mechanism do you have to notify the
> application that the tsc now is no longer reliable? Such conditions
> can exist... for example due to a CPU being hotplugged, or some SMM
> screwing around and the kernel detecting that or .. or ...
The proposal doesn't provide a notification mechanism (though I'm
not against it)... if the tsc can EVER become unreliable,
tsc_reliable should be 0.
A CPU-hotpluggable system is a good example of a case where
the kernel should expose tsc_reliable as 0. (I've heard
anecdotally that CPU hotplug into a QPI or Hypertransport system
will have some other interesting challenges, so it may require some
special kernel parameters anyway.) Even if tsc_reliable were
only enabled when a "no-cpu_hotplug" kernel parameter is set,
that would still be useful. And with cores-per-socket (and even
nodes-per-socket) going up seemingly every day, multi-socket
systems will likely be an ever smaller percentage of new
systems.
A virtual machine where live migration to another physical machine
may occur is another good example where tsc_reliable should be 0.
Xen now has a VM config feature that says "migration is disallowed"
for this reason; the Invariant TSC flag is always off for a VM
unless this "no_migrate" flag is set (or rdtsc is emulated).
> really. Use the vsyscall. If the vsyscall does not do exactly what you
> want, make a better vsyscall.
If this discussion results in a better vsyscall and/or a way
for applications to easily determine (and report loudly) that
the system does NOT provide a good way to do a fast timestamp,
that may be sufficient. But please propose how that will be done
as the current software choices are inadequate and the CPU
designers have finally fixed the problem for the vast majority
of systems. I am already aware of some enterprise software
that is doing its best to guess whether TSC is reliable by
looking at CPU families and socket counts, but this is doomed
to failure in userland and is something that the kernel knows
and should now expose.
Thanks,
Dan
On Sat, 15 May 2010, Dan Magenheimer wrote:
> > From: Andi Kleen [mailto:an...@firstfloor.org]
> >
> > > Kernel information about calibrated value of tsc_khz and
> > > tsc_stability (result of tsc warp test) are useful bits of
> > information
> > > for any app that wants to use TSC directly. Export this read_only
> > > information in sysfs.
> >
> > Is this really a good idea? It will encourage the applications
> > to use RDTSC directly, but there are all kinds of constraints on
>
> Indeed, that is what it is intended to do.
And you had better not.
Short story: the TSC sucks in all respects. Never ever let an
application rely on it, however tempting it may be.
> > that. Even the kernel has a hard time with them, how likely
> > is it that applications will get all that right?
>
> That's the point of exposing the tsc_reliable kernel data.
The tsc_reliable bit is useless outside of the kernel.
> If the processor has Invariant TSC and the system has
> successfully passed Ingo's warp test and, as a result
> the kernel is using TSC as a clocksource, why not enable
> userland apps that need to obtain timestamp data
> tens or hundreds of thousands of times per second to
> also use the TSC directly?
Simply because at the time of this writing there is no single reliable
TSC instance available.
Yeah, the CPU has that "P and C state invariant feature bit", but it's
_not_ worth a penny.
Lemme explain some of the reasons in random order:
1) SMI:
We have proof that SMIs fiddle with the TSC to hide the fact that
they happened. Yes, that's stupid, but a matter of fact. We have no
reliable way to detect that shit in the kernel yet, but we are
working on it. Some of those "intelligent" BIOS screwups can be
detected already, and all we can do then is disable the TSC.
That's going to be easier once the TSC is no longer writeable and
we instead get a writeable per-cpu offset register. That way we
can observe the SMI tricks far more easily, but even then we cannot
reliably undo them before some TSC user which is out of the kernel's
control can access it.
2) Boot offset / hotplug
Even if the TSC is completely in sync frequency wise there is no
way to prevent per core/HT offsets. I'm writing this from a box
where a perfectly in sync TSC (with the nice "I'm stable and
reliable" bit set) is hosed by some BIOS magic which manages to
offset the non boot cpu TSCs by > 300k cycles.
3) Multi socket
The "reliable" TSCs of a package are driven by the same clock, but
on multi-socket systems this is not the case. Each socket derives
its TSC clock via a PLL from a globally distributed clock, at least in
theory. But there is no guarantee that a board manufacturer really
distributes that global base clock rather than using a separate
"global" clock on each socket.
Aside from that, even if all the PLLs are driven by the same global
clock, there is no guarantee that the resulting PLL'ed clocks are in
sync. They are not, and they never ever will be. The PLL accuracy
differs in the ppm range and is also prone to temperature
variations. The result over time is that the TSCs of different
sockets diverge via drift in an observable way. We already have bug
reports about the resulting user-space-observable time-going-backwards
problems.
> > It would be better to fix them to use the vsyscalls instead.
> > Or if they can't use the vsyscalls for some reason today fix them.
>
> The problem is from an app point-of-view there is no vsyscall.
> There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> if it gets lucky, they turn out to be very fast and sometimes
> it doesn't get lucky and they are VERY slow (resulting in a performance
> hit of 10% or more), depending on a number of factors completely
> out of the control of the app and even undetectable to the app.
And they get slow for a reason: simply because the stupid hardware is
not reliable whether it has some "I claim to be reliable tag" on it or
not.
> Note also that even vsyscall with TSC as the clocksource will
> still be significantly slower than rdtsc, especially in the
> common case where a timestamp is directly stored and the
> delta between two timestamps is later evaluated; in the
> vsyscall case, each timestamp is a function call and a convert
> to nsec but in the TSC case, each timestamp is a single
> instruction.
That is all understandable, but as long as we do not have some really
reliable hardware I'm going to NACK any exposure of the gory details
to user space simply because I have to deal with the fallout of this.
What we can talk about is a vget_tsc_raw() interface along with a
vconvert_tsc_delta() interface, where vget_tsc_raw() returns you a
nasty error code for everything which is not usable.
> > This way if anything changes again in TSC the kernel could
> > shield the applications.
>
> If tsc_reliable is 1, the system and the kernel are guaranteeing
Wrong. The kernel is not guaranteeing anything. See above.
> to the app that nothing will change in the TSC. In an Invariant
> TSC system that has passed Ingo's warp test (to eliminate the
> possibility of a fixed interprocessor TSC gap due to a broken BIOS
> in a multi-node NUMA system), if anything changes in the clock
> signal that drives the TSC, the system is badly broken and far
> worse things -- like inter-processor cache incoherency -- may happen.
>
> Is it finally possible to get past the horrible SMP TSC problems
> of the past and allow apps, under the right conditions, to be able
> to use rdtsc again? This patch argues "yes".
Dream on while working with the 2 machines at your desk which
represent about 90% of the sane subset in the x86 universe!
We are working on solutions to get the TSC reliably usable in the case
of the "P/C state invariant" feature bit being set, but that will be
restricted to a vsyscall and you won't be able to use it reliably in
the way you envision until either
- chip manufacturers finally grasp that reliable and fast access to
timestamps is something important
- BIOS tinkerers finally grasp that fiddling with time is a NO-NO - or
chip manufacturers prevent them from doing so
or until we get something which myself and others proposed > 10 years
ago:
A simple master-clock-driven 1 MHz == resolution 1 us counter which
can be synced / preset by simple mechanisms and which was, btw,
developed in 1990s cluster computing environments.
Thanks,
tglx
> > From: Arjan van de Ven [mailto:ar...@infradead.org]
> (Arjan comments reordered somewhat)
>
> > But friends don't let friends use rdtsc in application code.
>
> Um, I realize that many people have been burned by this
> many times over the years so it is a "hot stove". I also
> realize that there are many environments where using
> rdtsc is risking stepping on landmines.
> But I (we?) also
> know there are many environments now where using rdtsc is
> NOT risky at all...
I see a lot of Intel hardware.. (stuff that you likely don't see yet ;-)
and I have not yet seen a system where the kernel would be able to give
the guarantee as you describe it in your email.
If you want a sysfs variable that is always 0... go wild.
> and with the vast majority of new
> systems soon shipping with Invariant TSC and a single socket
> (and even most multiple-socket systems with non-broken
> BIOSes passing a warp test),
(the warp test is going away)
On a multi-socket system that passes a warp test you can still get
skew over time... due to things like SMM, thermal throttling etc etc.
> why should past burns outlaw
> userland use of a very fast, very useful CPU feature? After
> all, CPU designers at both Intel and AMD have spent
> a great deal of design effort and transistors to FINALLY
> provide an Invariant TSC.
sadly even with all these transistors no system that I know of today
can guarantee the guarantee by the rules you state.
> > oh and.. what notification mechanism do you have to notify the
> > application that the tsc now is no longer reliable? Such conditions
> > can exist... for example due to a CPU being hotplugged, or some SMM
> > screwing around and the kernel detecting that or .. or ...
>
> The proposal doesn't provide a notification mechanism (though I'm
> not against it)... if the tsc can EVER become unreliable,
> tsc_reliable should be 0.
then it should be 0 always on all of todays hardware.
SMM, thermal overload, etc etc ... you name it.
Things the kernel will get notified about...
> A CPU-hotplugable system is a good example of a case where
> the kernel should expose that tsc_reliable is 0. (I've heard
> anecdotally that CPU hotplug into a QPI or Hypertransport system
> will have some other interesting challenges, so may require some
> special kernel parameters anyway.)
eh no.
hot add works just fine.
(hot remove is a very different ballgame)
> > really. Use the vsyscall. If the vsyscall does not do exactly what
> > you want, make a better vsyscall.
>
> If this discussion results in a better vsyscall and/or a way
> for applications to easily determine (and report loudly) that
> the system does NOT provide a good way to do a fast timestamp,
> that may be sufficient. But please propose how that will be done
> as the current software choices are inadequate and the CPU
> designers have finally fixed the problem for the vast majority
> of systems.
*cough*
> I am already aware of some enterprise software
> that is doing its best to guess whether TSC is reliable by
> looking at CPU families and socket counts, but this is doomed
> to failure in userland and is something that the kernel knows
> and should now expose.
can you name said "enterprise" software by name please? We need a huge
advertisement to let people know not to trust their important data to
it..
--
Arjan van de Ven Intel Open Source Technology Centre
Nah, there are systems which will have it set to 1:
Dig out your good old Pentium-I box and enjoy.
> > > oh and.. what notification mechanism do you have to notify the
> > > application that the tsc now is no longer reliable? Such conditions
> > > can exist... for example due to a CPU being hotplugged, or some SMM
> > > screwing around and the kernel detecting that or .. or ...
> >
> > The proposal doesn't provide a notification mechanism (though I'm
> > not against it)... if the tsc can EVER become unreliable,
> > tsc_reliable should be 0.
>
> then it should be 0 always on all of todays hardware.
> SMM, thermal overload, etc etc ... you name it.
> Things the kernel will get notified about...
What we could expose is an estimate about the performance of
gettimeofday/clock_gettime. The kernel has all the information to do
that, but this still does not solve the notification problem when we
need to switch to a different clock source.
Thanks,
tglx
I'm open to something like that provided:
1) It works (whenever possible) without changing privilege levels
or causing vmexits or other "hidden slowness" problems when
used both in bare-metal Linux and in a virtual machine.
2) The "transformation" performed by the kernel on the TSC
does not require some hidden pcpu number that won't work
in a virtual machine.
If TSC is indeed reliable (see below), it is both faster AND
meets the above constraints.
> > From: Arjan van de Ven [mailto:ar...@infradead.org]
> > If you want a sysfs variable that is always 0... go wild.
>
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> Nah, there are systems which will have it set to 1:
> Dig out your good old Pentium-I box and enjoy.
Hot stove syndrome again? Are you truly saying that there
are NO single-socket multi-core systems that don't have
stupid firmware (SMI and/or BIOS)? Or are you saying that
significant TSC clock skew occurs even between the cores
on a single-socket Nehalem system?
If things are this bad, why on earth would the kernel itself
EVER use TSC even as its own internal clocksource? Or
even to provide additional precision to a slow platform timer?
Or are you saying that many systems (and especially large
multi-socket systems) DO exist where the kernel isn't able
to proactively determine that the firmware is broken and/or
significant thermal variation may occur across sockets?
This I believe.
I understand that you both are involved in pushing the
limits of large systems and that time synchronization is
a very hard problem, perhaps effectively unsolvable,
in these systems.
But that doesn't mean the vast majority of latest generation
single-socket systems can't set "tsc_reliable" to 1. Or that
the kernel is responsible for detecting and/or correcting
every system with buggy firmware.
Maybe the best way to solve the "buggy firmware problem"
is exactly by encouraging enterprise apps to use TSC
and to expose and *blacklist* systems and/or system vendors
who ship boxes with crappy firmware!
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> What we could expose is an estimate about the performance of
> gettimeofday/clock_gettime. The kernel has all the information to do
> that, but this still does not solve the notification problem when we
> need to switch to a different clock source.
This would at least be a big step in the right direction.
But if we go with a vget_raw_tsc() or direct TSC solution,
you have convinced me of the need for notification.
Maybe this is a perfect use for (at least one bit in)
the TSC_AUX register and the rdtscp instruction?
And I do agree with Venki that some user library (or at
least published sample code) should be made available
to demonstrate proper usage and to dampen out the worst
of the "broken user problem".
> > From: Arjan van de Ven [mailto:ar...@infradead.org]
> > can you name said "enterprise" software by name please? We need a huge
> > advertisement to let people know not to trust their important data to
> > it..
For obvious reasons I can't do that, but I can point to
enterprise *operating systems* that have long since solved
this same problem one way or another: Solaris on x86 and
HP-UX (the latter admittedly on ia64). Enterprise app
vendors are quite happy with requiring conformance to a
very completely specified software/hardware/firmware stack
before providing support to an app customer. I'm just trying
to ensure that Linux can be part of that spec.
On Sun, 16 May 2010, Dan Magenheimer wrote:
> > From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > What we can talk about is a vget_tsc_raw() interface along with a
> > vconvert_tsc_delta() interface, where vget_tsc_raw() returns you an
> > nasty error code for everything which is not usable.
>
> I'm open to something like that provided:
>
> 1) It works (whenever possible) without changing privilege levels
> or causing vmexits or other "hidden slowness" problems when
> used both in bare-metal Linux and in a virtual machine.
> 2) The "transformation" performed by the kernel on the TSC
> does not require some hidden pcpu number that won't work
> in a virtual machine.
What I have in mind and what I'm working on for quite a while is going
to work on both bare metal and VMs w/o hidden slowness.
> If TSC is indeed reliable (see below), it is both faster AND
> meets the above constraints.
>
> > > From: Arjan van de Ven [mailto:ar...@infradead.org]
> > > If you want a sysfs variable that is always 0... go wild.
> >
> > From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > Nah, there are systems which will have it set to 1:
> > Dig out your good old Pentium-I box and enjoy.
>
> Hot stove syndrome again? Are you truly saying that there
Kinda hot stove, yes. I'm unfortunately forced to deal with the 500+
different variants of borked timers and that makes me very reluctant
to believe anything what chip/board/bios vendors promise. It's not the
one time hot stove experience, it's the constant exposure to the never
ending supply of hot stoves, which makes me nervous.
I wish I could say something different.
> are NO single-socket multi-core systems that don't have
> stupid firmware (SMI and/or BIOS)? Or are you saying that
There are single socket multi-core x86 systems with a sane BIOS, but
there is no reliable way to tell which ones belong into that category.
> significant TSC clock skew occurs even between the cores
> on a single-socket Nehalem system?
There is no clock skew between the cores of a package - at least we
are not aware of such a problem. Though I wouldn't rely on that
forever: they also said that the Titanic was unsinkable :)
> If things are this bad, why on earth would the kernel itself
> EVER use TSC even as its own internal clocksource? Or
We try to use it for performance's sake, but the kernel at least does
its very best to find out when it goes bad. We then switch back to
hpet or pm-timer, which is horrible performance-wise but does not
completely screw up timekeeping and everything which relies on it.
> even to provide additional precision to a slow platform timer?
We don't do that anymore.
> Or are you saying that many systems (and especially large
> multi-socket systems) DO exist where the kernel isn't able
> to proactively determine that the firmware is broken and/or
> significant thermal variation may occur across sockets?
> This I believe.
As I said, we try our very best to determine when things go awry, but
there are small errors which occur either sporadic or after longer
uptime which we cannot yet detect reliably. Multi-socket falls into
that category, but we are working on that.
> I understand that you both are involved in pushing the
> limits of large systems and that time synchronization is
> a very hard problem, perhaps effectively unsolvable,
> in these systems.
Well, it would be solvable in hardware and it has been done in
hardware more than 20 years ago. Just not there where it would have
been important: inside of x86 cpus. Hint: there are other
architectures which got that right from the very beginning even on
multi-socket systems.
Admitted, x86 made progress, but we are still some steps away from
something which I would consider reliable under all circumstances.
But you are right, some of the problems with the existing hardware are
just unsolvable and I spent a serious amount of time on trying to
convince myself otherwise.
The nasty thing about the subtle wreckage is that it is really hard to
investigate and debug and I wasted a whole week recently to figure out
what caused the time going backwards problem on a dual socket
westmere. Not fun !
> But that doesn't mean the vast majority of latest generation
> single-socket systems can't set "tsc_reliable" to 1. Or that
> the kernel is responsible for detecting and/or correcting
> every system with buggy firmware.
It _IS_ responsible for detecting buggy firmware, otherwise we would
just drown in bug reports about broken timekeeping. We've been there;
no way to go back to that.
> Maybe the best way to solve the "buggy firmware problem"
> is exactly by encouraging enterprise apps to use TSC
> and to expose and *blacklist* systems and/or system vendors
> who ship boxes with crappy firmware!
Blacklists are the last resort if a problem is not detectable by the
kernel. We usually detect the non usability of TSC and emit a
prominent warning into dmesg. Those warnings are there for years, but
the number of systems with BIOS caused TSC wreckage has grown.
> > From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > What we could expose is an estimate about the performance of
> > gettimeofday/clock_gettime. The kernel has all the information to do
> > that, but this still does not solve the notification problem when we
> > need to switch to a different clock source.
>
> This would at least be a big step in the right direction.
Ok.
> But if we go with a vget_raw_tsc() or direct TSC solution,
> you have convinced me of the need for notification.
> Maybe this is a perfect use for (at least one bit in)
> the TSC_AUX register and the rdtscp instruction?
Uurgh, no. The vsyscall will return a proper error code when shit
happens. And really, we don't want to encourage the direct use of
rdtsc at all. Also, rdtscp waits for all preceding instructions to
complete, which is probably not what you want when you need fast
timestamps.
> And I do agree with Venki that some user library (or at
> least published sample code) should be made available
> to demonstrate proper usage and to dampen out the worst
> of the "broken user problem".
Using a vsyscall is the best way to achieve that. Simple function call
interface with a well defined ABI and a proper return code. If the
user ignores the return code - none of my problems.
Further it allows us
- to keep the various CPU generation specific quirks well confined in
the kernel and we can even do fixups for correctable wreckage.
- to expose coarser grained fast timestamps when the TSC is not
usable. [So the best name for it would be vget_timestamp(), which,
btw., allows us to provide the same interface to non-x86 as well.]
Thoughts ?
> > > From: Arjan van de Ven [mailto:ar...@infradead.org]
> > > can you name said "enterprise" software by name please? We need a huge
> > > advertisement to let people know not to trust their important data to
> > > it..
>
> For obvious reasons I can't do that, but I can point to
> enterprise *operating systems* that have long since solved
> this same problem one way or another: Solaris on x86 and
On a well-selected subset of the machines which they control themselves.
> HP-UX (the latter admittedly on ia64). Enterprise app
But that's probably just a property of ia64 and not the merit of HP,
as their x86 machines have a proven track record of BIOS/SMI problems.
> vendors are quite happy with requiring conformance to a
> very completely specified software/hardware/firmware stack
> before providing support to an app customer. I'm just trying
> to ensure that Linux can be part of that spec.
I understand that, and I'm willing to help, but in a sane and
controlled way which does not expose me to a new category of unfixable
bug reports and complaints.
Thanks,
tglx
> > From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > Nah, there are systems which will have it set to 1:
> > Dig out your good old Pentium-I box and enjoy.
>
> Hot stove syndrome again? Are you truly saying that there
> are NO single-socket multi-core systems that don't have
> stupid firmware (SMI and/or BIOS)?
There are no systems *where we can know* this.
Some of the stupid SMIs only trigger in higher-temperature situations,
etc. Impossible to know up front.
> If things are this bad, why on earth would the kernel itself
> EVER use TSC even as its own internal clocksource?
Why do you think we do extensive and continuous validation of the TSC
(and soon, continuous recalibration)?
> But that doesn't mean the vast majority of latest generation
> single-socket systems can't set "tsc_reliable" to 1. Or that
> the kernel is responsible for detecting and/or correcting
> every system with buggy firmware.
Sadly this also shows up on single-socket systems... much more than we
like.
This is why I really really hate having apps run tsc directly.
A VDSO call at least gives the kernel the option to ensure
correctness... even if it starts out fast and goes slow suddenly after
3 weeks when the AC in the datacenter got maintenance for an hour.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Well, if this can be done today/soon and is fast enough
(say <2x the cycles of a rdtsc), I am very interested and
"won't let my friends use rdtsc" :-) Anything I can do
to help?
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> We try to use it for performance sake, but the kernel does at least
> it's very best to find out when it goes bad. We then switch back to a
> hpet or pm-timer which is horrible performance wise but does not screw
> up timekeeping and everything which relies on it completely.
> :
> As I said, we try our very best to determine when things go awry, but
> there are small errors which occur either sporadic or after longer
> uptime which we cannot yet detect reliably. Multi-socket falls into
> that category, but we are working on that.
> From: Arjan van de Ven [mailto:ar...@infradead.org]
> Why do you think we do extensive and continuous validation of the tsc
> (and soon, continuous recalibration)
So the kernel has the ability to detect that the TSC
is "OK for now", but must use some kind of polling
(periodic warp test) to recognize that TSC has
gone "bad". As long as TSC is good AND a sophisticated
enterprise app understands that TSC might go bad at
some point in the future AND if the kernel exposes
"goodness" information AND the app (like the kernel) is
resilient** to the possibility that there might be some
period of time that obtained timestamps might be
"bad" before the app polls the kernel to find out that
the kernel says they are indeed "bad"... why should it
be forbidden for an app to use TSC?
(** e.g. increments its own tsc_last to ensure time never goes
backwards)
It seems like the only advantages the kernel has here over
a reasonably intelligent app is that: 1) the kernel can run
a warp test and the app can't, and 2) the kernel can
estimate the frequency of the TSC and the app can't.
AND, in the case of a virtual machine, the kernel has
neither of these advantages anyway.
So though I now understand and agree that neither the kernel
nor an app can guarantee that TSC won't unexpectedly go from
"good" to "bad", I still don't understand why "TSC goodness"
information shouldn't be exposed to userland, where an
intelligent enterprise app can choose to use TSC when it is good
(for the same reason the kernel does: "for performance sake")
and choose to stop using it when it goes bad (for the same
reason the kernel does: to "not screw up timekeeping").
It sounds as if you are saying that "the kernel is allowed
to use a rope because if it accidentally gets the rope
around its neck, it has a knife to ensure it doesn't hang
itself" BUT "the app isn't allowed to use a rope because
it might hang itself and we'll be damned if we loan
our knife to an app because, well... because it doesn't
need a knife because we said it shouldn't use the rope".
I think you can understand why this isn't a very satisfying
explanation.
P.S. Thanks for taking the time to discuss this!
>
> It seems like the only advantages the kernel has here over
> a reasonably intelligent app is that: 1) the kernel can run
> a warp test and the app can't, and 2) the kernel can
> estimate the frequency of the TSC and the app can't.
and 3) the kernel gets thermal interrupts and the app does not
and 4) the kernel decides which power management to use when
and 5) the kernel can find out if SMIs happened, and the app cannot.
and 6) the kernel can access the tsc and per-cpu offset/frequency
data atomically, without being scheduled to another CPU. The app cannot
[well, it can ask the kernel to be pinned, and that's a 99.99% thing,
but still]
[snipped a bunch of twists of my argument that are not correct]
look we're not disabling ring 3 tsc. We could, but we don't.
we're just telling you that WE as kernel cannot tell you, in
an architectural and long term (multiple kernel versions and
hardware generations) stable way, when the tsc is "usable".
Because WE know it is barely if any so. We continuously add
workarounds, calibrations and tweaks for this, and stop using it
at runtime when something smells funny and defeats our logic.
If you want to find out yourself if the tsc is good enough for you
that is one thing.... but if you want the kernel to have an official
interface for it.... the kernel has to live by that commitment.
We cannot put in that interface "oh and you need to implement the same
workarounds, scaling and offsets as the kernel does", because that's
in a huge flux, and will change from kernel version to kernel version.
The only shot you could get is some vsyscall/vdso function that gives
you a unit (but that is not easy given per cpu offset/frequency/etc..
but at least the kernel can try)
> > > > From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > > > What we can talk about is a vget_tsc_raw() interface along with a
> >
> > What I have in mind and what I'm working on for quite a while is going
> > to work on both bare metal and VMs w/o hidden slowness.
>
> Well, if this can be done today/soon and is fast enough
> (say <2x the cycles of a rdtsc), I am very interested and
Are you going to measure that with rdtsc()? :)
> "won't let my friends use rdtsc" :-) Anything I can do
> to help?
Yes, stop trying to convince me that rdtsc in apps is a good idea. :)
> It sounds as if you are saying that "the kernel is allowed
> to use a rope because if it accidentally gets the rope
> around its neck, it has a knife to ensure it doesn't hang
> itself" BUT "the app isn't allowed to use a rope because
> it might hang itself and we'll be damned if we loan
> our knife to an app because, well... because it doesn't
> need a knife because we said it shouldn't use the rope".
>
> I think you can understand why this isn't a very satisfying
> explanation.
What I understand is that you want us to give out only the rope, and
when things go wrong, let us kernel developers deal with the bug
reports about the missing knife.
Please understand that once we expose that tsc_reliable information we
are responsible for its correctness. People will use it whether the
enterprise entity who wants this feature has qualified that particular
piece of hardware or not. And while the support of that entity refuses
to help on non-qualified hardware (your own words), we'll end up with
the mess which was created to help that very entity.
I think you understand that I have no intention to put a ticking time
bomb into the code I'm responsible for. I really have better things to
do than shooting myself in the foot.
Thanks,
tglx
On Sat, May 15, 2010 at 06:29:25AM -0700, Dan Magenheimer wrote:
> The problem is from an app point-of-view there is no vsyscall.
> There are two syscalls: gettimeofday and clock_gettime. Sometimes,
> if it gets lucky, they turn out to be very fast and sometimes
> it doesn't get lucky and they are VERY slow (resulting in a performance
> hit of 10% or more), depending on a number of factors completely
> out of the control of the app and even undetectable to the app.
What would the application do in the 10% case?
(Assuming modern kernels, I know older kernels had trouble sometimes):
That's the case when the TSC doesn't work reliably, so if it
uses it anyways it won't get good time.
It seems to me you're bordering on violating Steinberg's rule
of system programming here :-)
>
> Note also that even vsyscall with TSC as the clocksource will
> still be significantly slower than rdtsc, especially in the
> common case where a timestamp is directly stored and the
> delta between two timestamps is later evaluated; in the
> vsyscall case, each timestamp is a function call and a convert
> to nsec but in the TSC case, each timestamp is a single
> instruction.
First, the single instruction is typically quite slow. Then,
to really get monotonic time you need a barrier anyway.
When I originally wrote the vsyscalls, the overhead of all that
wasn't big compared to open coding. The only thing that could
be stripped might be the unit conversion. In principle
a new vsyscall could be added for that (what units do you need?)
I remember when they were converted to clocksources they got
somewhat slower, but I suspect with some tuning work that
could be also fixed again.
I think glibc also still does an unnecessary indirect jump
(which might hurt you if your CPU cannot predict it), but that could
be fixed too. I think I have an old patch for that, in fact,
if you're still willing to use the old-style vsyscalls.
>
> > This way if anything changes again in TSC the kernel could
> > shield the applications.
>
> If tsc_reliable is 1, the system and the kernel are guaranteeing
> to the app that nothing will change in the TSC. In an Invariant
> TSC system that has passed Ingo's warp test (to eliminate the
> possibility of a fixed interprocessor TSC gap due to a broken BIOS
> in a multi-node NUMA system), if anything changes in the clock
That only handles cases visible at boot. If the TSC breaks
longer term the kernel catches it with its watchdog, but your
user application won't.
> signal that drives the TSC, the system is badly broken and far
> worse things -- like inter-processor cache incoherency -- may happen.
I don't think that's true. There are various large systems with
non synchronized TSC and I haven't heard of any unique cache coherency
problems on that.
Also, often the TSC is actually synchronized, but unfortunately
runs with an offset.
>
> Is it finally possible to get past the horrible SMP TSC problems
> of the past and allow apps, under the right conditions, to be able
> to use rdtsc again? This patch argues "yes".
Yes but why not let them use vsyscalls?
I know vsyscalls still have some issues today, but these
would be better fixed than worked around like this.
e.g.
Is the idea to use the TSC on not-fully-synchronized systems?
I haven't fully kept track, but at some point there was an attempt
to have more POSIX clocks with looser semantics (like per-thread
monotonic). If you use those you'll get fast time (well, not time of
day, but perhaps useful time) which might be good enough without
hacks like this?
If the semantics are not exactly right I think more POSIX clocks
could be added too.
Or if the time conversion is a problem we could add a posix_gettime_otherunit()
or so (e.g. with a second vsyscall that converts units so you don't
need to do it in the fast path)
A long time ago there was also the idea to export the information
if gettimeofday()/clock_gettime() was fast or not. If this helps this could
be probably revisited. But I'm not sure what the application
should really do in this case.
32bit doesn't have a fast ring 3 gtod() today but that could be also fixed.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
I don't think anyone would object to exporting such a flag if
it's cleanly designed.
Getting the semantics right for that might be somewhat tricky
though. How is "slow" defined?
> A CPU-hotplugable system is a good example of a case where
> the kernel should expose that tsc_reliable is 0. (I've heard
That would mean that a large class of systems which
are always hotplug capable (even if it's not used)
would never get fast TSC time.
Wasn't the goal here to be faster?
> anecdotally that CPU hotplug into a QPI or Hypertransport system
> will have some other interesting challenges, so may require some
> special kernel parameters anyway.) Even if tsc_reliable were
> only enabled if a "no-cpu_hotplug" kernel parameter is set,
> that is still useful. And with cores-per-socket (and even
> nodes-per-socket) going up seemingly every day, multi-socket
> systems will likely be an ever smaller percentage of new
> systems.
Still the people running them will expect as good performance
as possible.
-Andi
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> > "won't let my friends use rdtsc" :-) Anything I can do
> > to help?
>
> Yes, stop trying to convince me that rdtsc in apps is a good idea. :)
You've convinced ME that it is NOT a good idea in many many cases
and that some of those cases are very very hard to detect. Unfortunately,
convincing ME is not sufficient. The problem is that Pandora's
box is already open and it is getting ever more wide open. (Each
of you has implied that TSC on future systems is getting even
more likely to be reliable.) Not only is rdtsc unprivileged
but so is the CPUID Invariant TSC bit. And the kernel already
publishes current_clocksource and puts a tempting MHz rate in
its logs. While we all might hope that every system programmer
will read this long thread and also be convinced never to use
rdtsc, the problem is there is a high and increasing chance that
any given systems programmer will write code that uses rdtsc and,
on all of his/her test machines, will NEVER see a problem. But,
sadly, some of the customers using that app WILL see the problem
BUT neither the customers nor the systems programmer may ever know
what really went wrong (unless they hire Thomas to debug it :-)
So let me suggest inverting the logic of the patch and maybe
it will serve all of us better. (see proposal below)
> From: Arjan van de Ven [mailto:ar...@infradead.org]
> If you want to find out yourself if the tsc is good enough for you
> that is one thing.... but if you want the kernel to have an official
> interface for it.... the kernel has to live by that commitment.
> We cannot put in that interface "oh and you need to implement the same
> workarounds, scaling and offsets as the kernel does", because that's
> in a huge flux, and will change from kernel version to kernel version.
> The only shot you could get is some vsyscall/vdso function that gives
> you a unit (but that is not easy given per cpu offset/frequency/etc..
> but at least the kernel can try)
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> Please understand that once we expose that tsc_reliable information we
> are responsible for its correctness. People will use it whether the
> enterprise entity who wants this feature has qualified that particular
> piece of hardware or not. And while the support of that entity refuses
> to help on non-qualified hardware (your own words), we'll end up with
> the mess which was created to help that very entity.
OK, so let's invert the sense of the sysfs file and call it (for now)
"tsc_detected_as_UNreliable". Then anytime the kernel detects
a failed warp test (or any other suspicious condition), it changes
the bit from 0 to 1 effectively saying "if you are using rdtsc
against our recommendation, we told you that it might go bad and
it has, so consider yourself warned that some of the timestamps
you've taken since the last time you've checked this flag
may be b*rked"
IMHO, addressing the issue directly and clearly documenting it
(instead of trying to hide the dirty laundry in the kernel)
will result in far better education of systems programmers
and far fewer end user problems. Which raises another good analogy:
You are telling teenagers to abstain and I am proposing that we
instead encourage them to use a condom.
You are simply not going to stop systems programmers from using
rdtsc... let's at least allow them to practice "safe TSC usage"
which, to extend the analogy, we all know is still not 100.0%
effective but, if they are going to do "it" anyway, is still
far better than completely unprotected TSC usage.
(The remainder of this response is discussion of individual
points raised, so can be skipped by most readers.)
====
> From: Arjan van de Ven [mailto:ar...@infradead.org]
> and 3) the kernel gets thermal interrupts and the app does not
> and 4) the kernel decides which power management to use when
> and 5) the kernel can find out if SMI's happened, and the app cannot.
> and 6) the kernel can access tsc and a per cpu offset/frequency
> data atomically, without being scheduled to another CPU. The app cannot
> [well it can ask the kernel to be pinned, and that's a 99.99% thing,
> but still]
These are all very good reasons why the kernel might turn
on a "tsc_detected_as_unreliable" bit. And CPU hotplug-add,
too.
And, by the way, are any of these valid in a virtual machine?
> From: Andi Kleen [mailto:an...@firstfloor.org]
> What would the application do in the 10% case?
(and in the "tsc_detected_as_unreliable" == 1 case)
Well, some possibilities are:
- Disable the heavy TSC usage and log/notify that it has been
disabled because the TSC is not reliable.
- Switch to gettimeofday and log/notify that, sorry, overall
performance has become slower due to the fact that the
TSC has become unreliable
- Log a message telling the user to confirm that the hardware
they are using is on the app's supported/compatibility list
and that they are using the latest firmware and that their
thermal logs are OK, because something might not be quite
right with their system
> From: Andi Kleen [mailto:an...@firstfloor.org]
> 32bit doesn't have a fast ring 3 gtod() today but that could be also
> fixed.
That might have helped, because the enterprise app I mentioned
earlier was 32-bit... but I'll bet the genie is out of the
bottle now and the app has already shipped using rdtsc.
> From: Andi Kleen [mailto:an...@firstfloor.org]
> It seems to me you're bordering on violating Steinberg's rule
> of system programming here :-)
<Embarrassed to not know this rule, Dan goes off and googles
but fails to find any good matches before his TSC goes bad>
> From: Andi Kleen [mailto:an...@firstfloor.org]
> First the single instruction is typically quite slow. Then
> to really get monotonous time you need a barrier anyways.
Agreed, I've measured 30-ish and 60-ish cycles on a
couple of machines.
> From: Andi Kleen [mailto:an...@firstfloor.org]
> When I originally wrote vsyscalls that overhead wasn't that big
> with all that compared to open coding. The only thing that could
> be stripped might be the unit conversion. In principle
> a new vsyscall could be added for that (what units do you need?)
>
> I remember when they were converted to clocksources they got
> somewhat slower, but I suspect with some tuning work that
> could be also fixed again.
>
> I think glibc also still does a unnecessary indirect jump
> (might hurt you if your CPU cannot predict that), but that could
> be fixed too. I think I have an old patch for that in fact,
> if you're still willing to use the old style vsyscalls.
I think vsyscall is an excellent idea and I'm very much in
favor of continuing to improve it and encouraging people
to use it. But until it either "always" "just works" in "all"
of the software/hardware environments used by a systems
programmer in their development and testing and in the systems
the apps get deployed on (including virtual machines) OR until
it clearly provides an indicator that it is hiding dirty performance
laundry, IMHO it won't convince the (admittedly undereducated)
pro-TSC crowd.
> > signal that drives the TSC, the system is badly broken and far
> > worse things -- like inter-processor cache incoherency -- may happen.
>
> From: Andi Kleen [mailto:an...@firstfloor.org]
> I don't think that's true. There are various large systems with
> non synchronized TSC and I haven't heard of any unique cache coherency
> problems on that.
Here I was referring to clock skew/drift, not the "fixed offset"
problem. I'm far from an expert in PLLs etc., but I think if
the clock signal is delayed far enough to cause the TSC to skew
significantly, some critical cache coherency protocol is
eventually going to miss a beat and corrupt data.
> Is the idea to use the TSC on not-fully-synchronized systems?
That has never been my intent, though others might be interested
in that.
> From: Andi Kleen [mailto:an...@firstfloor.org]
> I haven't fully kept track, but at some point there was an attempt
> to have more POSIX clocks with looser semantics (like per-thread
> monotonic). If you use those you'll get fast time (well, not time of day,
> but perhaps useful time) which might be good enough without
> hacks like this?
>
> If the semantics are not exactly right I think more POSIX clocks
> could be added too.
>
> Or if the time conversion is a problem we could add a
> posix_gettime_otherunit()
> or so (e.g. with a second vsyscall that converts units so you don't
> need to do it in the fast path)
>
> A long time ago there was also the idea to export the information
> if gettimeofday()/clock_gettime() was fast or not. If this helps this
> could
> be probably revisited. But I'm not sure what the application
> should really do in this case.
>
> 32bit doesn't have a fast ring 3 gtod() today but that could be also
> fixed.
I believe (and this is strictly a personal opinion based on my
view of human psychology) that adding more obscure interfaces and
more obscure options is a losing battle.
> From: Thomas Gleixner [mailto:tg...@linutronix.de]
> I think you understand that I have no intention to put a ticking time
> bomb into the code I'm responsible for. I really have better things to
> do than shooting myself in the foot.
As we both know, the ticking time bomb is already there. IMHO this
revised proposal provides you with "plausible deniability"**
"The kernel now advertises when it detects the TSC is bad or has gone
bad... did your app vendor check the kernel-provided info?"
** http://en.wikipedia.org/wiki/Plausible_deniability
It would probably be a good idea to document all this
somewhere central, agreed.
In fact I remember wanting to write such a paper once. Possibly
even with some title borrowing from Swift. It seems it never happened :)
> > From: Andi Kleen [mailto:an...@firstfloor.org]
> > 32bit doesn't have a fast ring 3 gtod() today but that could be also
> > fixed.
>
> That might have helped, because the enterprise app I mentioned
> earlier was 32-bit... but I'll bet the genie is out of the
> bottle now and the app has already shipped using rdtsc.
Well, the enterprise will have to live with wrong timing now and then
(or perhaps totally broken timing on some systems).
> > From: Andi Kleen [mailto:an...@firstfloor.org]
> > It seems to me you're bordering on violating Steinberg's rule
> > of system programming here :-)
>
> <Embarrassed to not know this rule, Dan goes off and googles
> but fails to find any good matches before his TSC goes bad>
Sorry, it's Steinbach's rule: never test for an error condition
you don't know how to handle.
>
> > From: Andi Kleen [mailto:an...@firstfloor.org]
> > First, the single instruction is typically quite slow. Then,
> > to really get monotonic time you need a barrier anyway.
>
> Agreed, I've measured 30-ish and 60-ish cycles on a
> couple of machines.
It can be worse.
Also for correct behaviour you need the barrier on many systems.
And some other workarounds.
>
> I think vsyscall is an excellent idea and I'm very much in
> favor of continuing to improve it and encouraging people
> to use it. But until it either "always" "just works" in "all"
On 64bit it should just work, although it could be made somewhat
(but not dramatically) faster.
On older kernels vsyscall got a bad name because it was not always
as clever as it could be about choosing when to use the TSC and when
not, but that is long since fixed. It worked in any case; it was just
slower[1].
[1] That is some specific TSC misbehaviours were only fixed in later
kernels, but it's very unlikely that any other RDTSC user got
that right either.
> > From: Andi Kleen [mailto:an...@firstfloor.org]
> > I don't think that's true. There are various large systems with
> > non synchronized TSC and I haven't heard of any unique cache coherency
> > problems on that.
>
> Here I was referring to clock skew/drift, not the "fixed offset"
> problem. I'm far from an expert in PLLs etc., but I think if
> the clock signal is delayed far enough to cause the TSC to skew
> significantly, some critical cache coherency protocol is
> eventually going to miss a beat and corrupt data.
Generally small systems run on the same clock (and any drift
you see has other reasons), but large systems built out of multiple
boards run on different clocks, and the interconnect
does appropriate timing between the different domains by itself.
The interconnects also have checksums and other error-checking
and recovery mechanisms, and tend to do something appropriate when there
is a transmission problem. It does not usually lead to data corruption.
> I believe (and this is strictly a personal opinion based on my
> view of human psychology) that adding more obscure interfaces and
> more obscure options is a losing battle.
It sounds like you're arguing against your own patch here.
-Andi
Wrong. A vsyscall _is_ the protection which you want them to pull over
the rdtsc.
You are basically telling them: go ahead, but keep in mind to look for
that well-hidden tag behind the left earlobe which might suddenly
change from "no disease" to "infectious".
Thanks,
tglx
OK, well ignoring the metaphor, it's clear we disagree on a point
that neither one of us can prove: You think your decision to avoid
sharing kernel information will stop system programmers from using
rdtsc, and I think some are going to use rdtsc anyway and blame
Linux when something eventually and silently breaks.
Given that I'm not going to win this argument even by pointing
to examples, let's move forward with your solution. Can you say
more about your vget_tsc_raw() directions? Or at least describe
the API if there is anything special? I'm getting beaten on for
an answer *today*, they say (and I quote from an email today)
they "have been using vsyscalls for a while and still have a
performance headache", and rdtsc looks awfully tempting, even
if it is not 100% perfect. So what can I give them as an
alternative?
(And, yes, I understand this is not your problem, but I do worry
that it is perceived as *Linux's* problem. And as we all have
painfully learned in our careers, it's almost impossible to
beat back a perception by presenting a set of byzantine facts
that only a handful of people in the world truly understand.)
Thanks!
Dan
> OK, well ignoring the metaphor, it's clear we
> disagree on a point that neither one of us can
> prove: You think your decision to avoid sharing
> kernel information will stop system programmers
> from using rdtsc, and I think some are going to use
> rdtsc anyway and blame Linux when something
> eventually and silently breaks.
Applications can do various unreliable things; the
kernel cannot do anything about that.
The point is for the kernel not to be complicit in
practices that are technically unreliable.
So the kernel won't 'signal' that something is safe to
use if it is not safe to use.
One suggestion in this thread makes sense, I think: to
signal via sysfs that gettimeofday is slow.
Plus lets hope that we really can figure out a fast,
TSC based gettimeofday implementation. If that is
possible then user-space will get a fast gettimeofday
right out of box.
Thanks,
Ingo
> Given that I'm not going to win this argument even
> by pointing to examples, [...]
You could win the argument by coming up with a patch
that changes gettimeofday to make use of the TSC in a
reliable manner.
I really mean it - and it might be possible - but we
have not found it yet.
Thanks,
Ingo
Maybe we should.
We could then trap and emulate it with something sensible like clock
monotonic.
/me runs :-)
That would kill the vsyscall too. Remember it's running in ring 3.
That is, in theory you could disable it on systems where the vsyscall
doesn't use it, but then you would likely break huge amounts of software
unless you emulate it.
Emulation would be a possibility, but I'm not sure it would
make anyone really happy. It would be certainly slow.
-Andi
Well, software shouldn't use it, so breaking it sounds like a fine
idea ;-)
Also, a slow emulation is an incentive to actually do the right thing.
Last fall, I discovered that EVERY program on RHEL5 (U2?) uses rdtsc
because RHEL5 ld.so uses rdtsc. The uses are few and harmless but
are there nonetheless. Clearly that can be fixed, but you might
be surprised how big "huge amounts of software" is.
> Also, a slow emulation is an incentive to actually do the right thing.
Emulation is not particularly slow, especially compared to accessing
the HPET. If the kernel deems TSC is unsafe, the ring 3 vsyscall
shouldn't be using rdtsc either so the additional trap overhead
might be in the noise.
As long as there is a sysfs file that can override the setting
and there is a counter (accessible via sysfs) that can count
the number of emulated rdtsc/rdtscp instructions (possibly
optionally by pid so the "offending" userland threads can
be tracked down), setting CR4.TSD whenever the kernel deems
TSC is unsafe and emulating rdtsc might be a reasonable solution.
Infrequent rdtsc users won't know or care, and frequent users
will at least be able to learn the frequency of their "sin".
And to help Thomas/Arjan/Ingo/Andi educate users, every read
or write to any of these sysfs files could also result in a
printk of "Use of rdtsc is deprecated... use vsyscalls instead.
See Documentation/friends_dont_let_friends_use_rdtsc."
(Half ;-)
And the sysfs file could have a "strict" setting which kills any
thread that uses rdtsc, so Thomas can tell future problem
reporters: "Set the rdtsc setting to strict and if you still
have problems, call me back." (Other half of ;-)
The problem is that you throw the baby (vsyscall) out with the bathwater
(user rdtsc).
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Well, we could only flip the CR4 bit when we mark the TSC unsuitable for
gtod. That should be plenty good to tag all userspace trying to use it,
since more than half my machines don't use TSC for clocksource.
This might be an option, although it would have to be an *option*.
There are restricted uses of the TSC in userspace which are still useful
(mainly involving performance analysis and/or CPU-locked processes).
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
--
(Though I expect tglx/arjan/andi/mingo to disagree with this proposal
for similar reasons as the original one that started this thread...)
Proposal:
/sys/devices/system/tsc/native (writable by root):
0 = (default) Kernel dynamically controls TSC emulation.
When the kernel deems TSC usable as a clocksource, rdtsc
will be executed directly by the CPU. When the kernel deems
TSC unsafe to use, rdtsc will be trapped and emulated.
1 = TSC emulation is never enabled. Programs using rdtsc
directly are subject to the many known and sometimes rare
and subtle vagaries of TSC.
2 = TSC emulation is always enabled (for debug only)
3 = Processes using TSC will be treated as if they executed
an illegal instruction. [? Can the kernel recognize
use of rdtsc in a vsyscall and emulate so that,
even though vsyscall is slower, all other rdtsc
in userspace are illegal?] [? Can/should this be
enforced only on non-root processes?]
/sys/devices/system/tsc/system_count (writable by root):
Contains a count of all TSC emulations, system-wide.
Writable to allow reset to zero.
/sys/devices/system/tsc/pid_counters (writable by root):
0 = (default) TSC counts are system-wide only
1 = TSC counted per pid (at performance penalty)
counters in /proc/PID/tsc_count
/proc/PID/tsc_count (readonly):
If /sys/devices/system/tsc/pid_counters is 1,
contains the count of rdtsc instructions emulated
for this PID.
(Note: except for the actual instruction emulation
which will be faithful, rdtscp will be treated and
counted as a rdtsc.)
I'll add another reason to disagree: exporting these as sysfs variables
is non-atomic, but these are really only useful when atomically read as
a unit.
-hpa
Oops, hit send too soon.
And the reason I expect tglx/arjan/andi/mingo to disagree is because
their position is that there is NO safe use for rdtsc in userspace EVER!
Whereas your position stated earlier:
> There are restricted uses of the TSC in userspace which are still
> useful
> (mainly involving performance analysis and/or CPU-locked processes).
says there are.
While the engineer in me agrees with tglx/arjan/andi/mingo, the
realist in me agrees with you.
Which "these"? The counters? I would think the primary use
of the counters is to diagnose extreme problem cases, not
to differentiate whether a system or process did exactly
27 vs 28 rdtsc's, so I don't see why atomic read is at all
necessary.
Or did I miss your point entirely?
I should have added "that are not related to wall time" to the statement
above.
Furthermore, vsyscalls are user space from a CPU perspective.
-hpa
That is correct.
> I'm still not sure if you are in favor of optionally emulating
> PL3 rdtsc instructions or not? I thought my proposal was
> just filling out some details of your proposal and suggesting
> a default.
I'm not in favor of emulating rdtsc instructions. I would consider
letting them SIGILL (actually SIGSEGV since RDTSC #GP in userspace) when
the TSC is unavailable, though.
It's not clear to me that it's possible, though, since that also affects
RDTSCP.
-hpa
Yes... that still puts your opinion at odds with tglx/etc.
All of the cases I am concerned with ARE performance analysis
uses, not wall time uses.
> Furthermore, vsyscalls are user space from a CPU perspective.
Yes, understood, a minor semantic issue. From a kernel perspective
vsyscalls are kernelspace, so IIUC this is OK with tglx/etc.
Since vsyscall shouldn't be using rdtsc when the kernel
doesn't trust TSC, it doesn't matter if CR4.TSD is enabled when
the kernel doesn't trust TSC.
I'm still not sure if you are in favor of optionally emulating
PL3 rdtsc instructions or not? I thought my proposal was
just filling out some details of your proposal and suggesting
a default.
--
(All the variations are boggling so hard to discuss in
a linear email thread.)
IIUC, tglx/arjan consider RDTSC and RDTSCP to be in the same
category. RDTSCP simply eliminates one large class of TSC
problems, but not all the possible system TSC problems that
the kernel can (or can't) detect. So userspace (non-vsyscall)
shouldn't use either one.
Further, this one redeeming feature of RDTSCP can be useless
and/or misleading in a virtual machine the way the kernel
sets up TSC_AUX.
> when the TSC is unavailable, though.
Do you mean "when the processor doesn't support a TSC instruction"
(very rare nowadays AFAIK) or "when the kernel determines that
TSC is not safe to use as a clocksource"?
I doubt we could ever do that; it would likely break just too much
code. Yes, the code is likely already broken on some systems, but there's
a big difference between wrong time and crash.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
s/some/most/
> a big difference between wrong time and crash.
Maybe start with a patch that logs all users and start sending patches
to the respective projects to clean them up.
Once we get most of userspace running fine, we can switch it to
generating faults.
Of course closed source stuff will have to deal with it themselves, but
who cares about that anyway ;-)
One usecase that hasn't been discussed is when userspace needs this info to
calibrate the TSC.
Take NTP as an example. It does a pretty good job of observing the drift in
gettimeofday() against a reference clock and correcting for it. This seems
to work well even when GTOD uses the TSC. But, it assumes that the drift
changes slowly.
That goes out the window on reboot, because the kernel only spends 25ms on
TSC<->PIT calibration and the value of tsc_khz can vary a lot from boot to
boot. Then NTP starts up and reads a drift value from /var/lib/ntp/ntp.drift
that it *thinks* is accurate. In our experience, it'll then spend up to 48
hours doing god knows what to the clock until it converges on the real
drift at the new tsc_khz. initscripts could correct for the kernel's
recalibration, but tsc_khz isn't exported.
So it's too bad that it can't be exported somehow. The TSC on our
machines has proven to be stable for all intents and purposes; I just
checked 25 of my machines, most have uptime of >200 days, all of them
still have current_clocksource=tsc. After NTP or PTPd has been running
for a while, things converge, but being unable to reboot is a headache.
Using the HPET for gettimeofday() would be impractical for performance
reasons.
Yea, the relative instability of the tsc calibration at boot is an
issue for folks who want very very precise timekeeping immediately
after a reboot.
I proposed a solution to this awhile back via a boot option users
could use to specify the tsc_khz freq, so it would be consistent from
boot to boot. See: https://patchwork.kernel.org/patch/22492/
It didn't really go anywhere due to a lack of public interest.
However, if you're interested in playing with it, I can try to revive
the patch.
thanks
-john
The point about NTP is a very good one; with NTP you *really* want what
you're adjusting to be directly related to the crystal oscillator,
treated as what it is: precision over accuracy, since NTP takes care of
accuracy but not precision.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
--
Another possibility: Optionally trust the stamped rate for the part?
I understand that on Nehalem this value is available in
MSR_PLATFORM_INFO[15:8] (google for MSR_PLATFORM_INFO 15 8),
but I don't know if this MSR is available on older (or AMD)
processors.
Just wondering: If one were to put an ultra-precise scope on
a processor, how far off would the calibrated value be? I'd
imagine the process of calibrating one unknown crystal against
a second crystal which has a known-but-not-highly-precise
frequency, though good enough for most purposes, is not particularly
accurate. In other words, maybe the stamped rate is more accurate
than the calibrated rate anyway?
No. Not even close.
A spread-spectrum clock is inaccurate by entire percentage points.
A non-spread clock is typically ±50 ppm with typical consumer PC
oscillators, ±1 ppm with non-crappy but still cheap oscillators (e.g.
used in cell phones.)
-hpa
Hmmm. That could be an option for newer cpus that I wouldn't oppose.
While Peter is correct that the stamped value is probably not very
accurate, at least it would be constant from boot to boot, and NTP's
calculated drift value would be correct.
We'd need a check to make sure it's not way off, since NTP will give up
if it's outside 500 ppm. So as long as it's close to the calibrated value,
we probably could use it.
thanks
-john
Is that still the case? I thought newer versions of NTP could deal with
large values. Inaccuracies of way more than 500 ppm are an everyday occurrence.
-hpa
That's scary.
Yea, in the kernel the ntp freq correction tops out at 500ppm. Almost
all the systems I see tend to fall in the +/- 200ppm range (if there's
not something terribly wrong with the hardware).
So maybe things aren't so bad out there? Or is that wishful thinking?
thanks
-john
In the kernel, yes; I thought the ntp daemon itself now handled the
exceptions (basically it detects if the PLL consistently veers off the
rails and adjusts the timing constants.)
However, you're comparing apples to oranges: you're talking about
current kernels, which means a calibrated TSC, which means you're
comparing to the non-spread 14.31818 MHz clock (which feeds into the
HPET, PMTMR and 8254 on a standard PC platform.) In most PCs this is a
separate oscillator from the bus clock which is spread spectrum. As a
result, it should be in the ±50 ppm range in theory; in practice as you
observe the range is wider.
-hpa
Since Brian's concern is at boot-time at which point there is no
network or ntp, and assuming that it would be unwise to vary tsc_khz
dynamically on a clocksource==tsc machine (is it?), would optionally
lengthening the TSC<->PIT calibration beyond 25ms result in a more
consistent tsc_khz between boots? Or is the relative instability
an unavoidable result of skew between the PIT and the fixed constant
PIT_TICK_RATE combined with algorithmic/arithmetic error? Or is
the jitter of the (spread-spectrum) TSC too extreme? Or ???
If better more consistent calibration is possible, offering
that as an optional kernel parameter seems better than specifying
a fixed tsc_khz (stamped or user-specified) which may or may
not be ignored due to "too different from measured tsc_khz".
Even an (*optional*) extra second or two of boot time might
be perfectly OK if it resulted in an additional five or six
bits of tsc_khz precision.
Thoughts, Brian?
Making the calibration time longer should give a more precise result,
but of course at the expense of longer boot time.
A longer sample would make sense if the goal is to freeze it into a
kernel command line variable, but the real question is how many people
would actually do that (and how many people would then suffer problems
because they upgraded their CPU/mobo and got massive failures on post-boot.)
-hpa
I'll admit it's a feature for a minority of users. Probably why it's not
included.
And the upgraded-system issue was something I tried to address by falling
back to the calibrated value if the specified one was off by some
unreasonable amount; however, folks protested that, figuring that if it's
explicitly stated the kernel should not override it (ie: for the use case
where the calibration is broken and folks want to force the value).
Also, you don't really need extra accuracy, you just need it to be the
same from boot to boot. NTP keeps the correction factor persistent from
boot to boot via the drift file. The boot argument is just trying to
save the time (possibly hours depending on ntp config) after a reboot
for NTP to correct for the new error introduced by calibration.
thanks
-john
I was assuming that extra accuracy would decrease the ntp
convergence time by about the same factor (5-6 bits of extra
accuracy would decrease ntp convergence time by 32-64x).
Is that an incorrect assumption?
> From: H. Peter Anvin [mailto:h...@zytor.com]
> A longer sample would make sense if the goal is to freeze it into a
> kernel command line variable, but the real question is how many people
> would actually do that (and how many people would then suffer problems
> because they upgraded their CPU/mobo and got massive failures on post-
> boot.)
Not sure why upgraded mobo's would fail due to a longer sample?
As more and more systems become dependent on clocksource==tsc
and more and more people assume nanosecond-class measurements
are relatively accurate, I'd expect the accuracy of tsc_khz
to become more important. While desktop users might bristle
at an extra second of boot delay, I'll bet many server
farm administrators would gladly pay that upfront cost
if they know an option exists.
Yes.
>> From: H. Peter Anvin [mailto:h...@zytor.com]
>> A longer sample would make sense if the goal is to freeze it into a
>> kernel command line variable, but the real question is how many people
>> would actually do that (and how many people would then suffer problems
>> because they upgraded their CPU/mobo and got massive failures on post-
>> boot.)
>
> Not sure why upgraded mobo's would fail due to a longer sample?
Not due to a longer sample, but a frozen sample.
> As more and more systems become dependent on clocksource==tsc
> and more and more people assume nanosecond-class measurements
> are relatively accurate, I'd expect the accuracy of tsc_khz
> to become more important. While desktop users might bristle
> at an extra second of boot delay, I'll bet many server
> farm administrators would gladly pay that upfront cost
> if they know an option exists.
Not really. The delta measurements aren't the issue here, but rather
walltime convergence.
-hpa
Sorry, this is sort of mixing points. I was saying you don't need more
accuracy (as opposed to what H. Peter mentioned below) when setting the
tsc_khz= option I proposed, since it will be constant from boot to boot
and thus will reduce the ntp convergence time.
Without such a boot option, more accuracy from an increased calibration
time would help. However, the tradeoff of a longer boot time is one that
probably not many will want.
> > From: H. Peter Anvin [mailto:h...@zytor.com]
> > A longer sample would make sense if the goal is to freeze it into a
> > kernel command line variable, but the real question is how many people
> > would actually do that (and how many people would then suffer problems
> > because they upgraded their CPU/mobo and got massive failures on post-
> > boot.)
>
> Not sure why upgraded mobo's would fail due to a longer sample?
Again, this is mixing the discussion. The concern was users of a
tsc_khz= boot option might have problems when they upgrade, as the
actual TSC freq might not match what was specified at boot.
> As more and more systems become dependent on clocksource==tsc
> and more and more people assume nanosecond-class measurements
> are relatively accurate, I'd expect the accuracy of tsc_khz
> to become more important. While desktop users might bristle
> at an extra second of boot delay, I'll bet many server
> farm administrators would gladly pay that upfront cost
> if they know an option exists.
Maybe something like a tsc_long_calibration=1 option would allow for
this?
However, I really do like the idea of pulling the stamped value from the
MSR and, if it's close to what we quickly calibrated, using that.
thanks
-john
On a system with a synchronized TSC and multiple cores you could also
simply do a longer calibration on another core in the background, after a
quick "fast calibration".
-Andi
Oops, sorry, missed that. :-}
> Maybe something like a tsc_long_calibration=1 option would allow for
> this?
Sounds good to me. If it's non-obvious what value to choose for
the new calibration time, maybe specifying it in ms (per MAX_QUICK_PIT_MS
in arch/x86/kernel/tsc.c) would be nice.
> However, I really do like the idea of pulling the stamped value from
> the MSR and if its close to what we quickly calibrated, use that.
On a quick sample of two machines looking at the TSC calibration
done by Xen (which exposes the equivalent of tsc_khz), it appears
that the stamped value is different from the calibration by about
1000ppm. YMMV.
So pretty seriously different. Less than I would have expected, but not
out of the ballpark.
-hpa
If we're being strict, something like NTP needs to know exactly
what's driving gettimeofday(). If the clocksource changes, it
needs to know that so it could correct or trash its drift
estimate. If there's one-time calibration, it needs to know
what the result was. If there's continuous calibration, it
either needs to be notified, or have the ability to disable
it. Right? So I think exporting tsc_khz in some form is a
step in the right direction.
So what's wrong with just adding a
/sys/devices/system/clocksource/clocksource0/tsc_khz?
Maybe Thomas Gleixner's suggestion of a vget_tsc_raw()
would also suffice, I'm not sure I understand the details
enough.
Any of the other fixes people have discussed (tsc_khz=
bootopt, tsc_long_calibration=1) would be enough to make
me happy though :)
(*) Though they still need to learn enough to coax the
kernel into giving them a fast gettimeofday(). That's a
price you gotta pay if you care enough :)
As an RFC:
Add clocksource.sys_register & sys_unregister so the
current clocksource can add supplemental information to
/sys/devices/system/clocksource/clocksource0/
Export tsc_khz when current_clocksource==tsc so that
daemons like NTP can account for the variability of
calibration results.
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91a..9c99965 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -76,6 +76,11 @@ unsigned long long
sched_clock(void) __attribute__((alias("native_sched_clock")));
#endif
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
+
int check_tsc_unstable(void)
{
return tsc_unstable;
@@ -757,6 +762,8 @@ static struct clocksource clocksource_tsc = {
#ifdef CONFIG_X86_64
.vread = vread_tsc,
#endif
+ .sys_register = sysfs_tsc_register,
+ .sys_unregister = sysfs_tsc_unregister,
};
void mark_tsc_unstable(char *reason)
@@ -967,3 +974,22 @@ void __init tsc_init(void)
init_tsc_clocksource();
}
+static ssize_t show_tsc_khz(
+ struct sys_device *dev, struct sysdev_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", tsc_khz);
+}
+
+static SYSDEV_ATTR(tsc_khz, 0444, show_tsc_khz, NULL);
+
+int sysfs_tsc_register(struct sys_device *clocksource_dev,
+ struct clocksource *cs)
+{
+ return sysdev_create_file(clocksource_dev, &attr_tsc_khz);
+}
+
+void sysfs_tsc_unregister(struct sys_device *clocksource_dev,
+ struct clocksource *cs)
+{
+ sysdev_remove_file(clocksource_dev, &attr_tsc_khz);
+}
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 5ea3c60..d9f6f13 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -15,6 +15,7 @@
#include <linux/cache.h>
#include <linux/timer.h>
#include <linux/init.h>
+#include <linux/sysdev.h>
#include <asm/div64.h>
#include <asm/io.h>
@@ -156,6 +157,8 @@ extern u64 timecounter_cyc2time(struct timecounter *tc,
* @vread: vsyscall based read
* @suspend: suspend function for the clocksource, if necessary
* @resume: resume function for the clocksource, if necessary
+ * @sys_register: optional, register additional sysfs attributes
+ * @sys_unregister: optional, unregister sysfs attributes
*/
struct clocksource {
/*
@@ -194,6 +197,10 @@ struct clocksource {
struct list_head wd_list;
cycle_t wd_last;
#endif
+ int (*sys_register)(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
+ void (*sys_unregister)(struct sys_device *clocksource_dev,
+ struct clocksource *cs);
};
/*
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index f08e99c..d8b69a5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -41,6 +41,8 @@ void timecounter_init(struct timecounter *tc,
}
EXPORT_SYMBOL_GPL(timecounter_init);
+void sysfs_alter_clocksource(struct clocksource *old, struct clocksource *new);
+
/**
* timecounter_read_delta - get nanoseconds since last call of this function
* @tc: Pointer to time counter
@@ -572,6 +574,7 @@ static void clocksource_select(void)
}
if (curr_clocksource != best) {
printk(KERN_INFO "Switching to clocksource %s\n", best->name);
+ sysfs_alter_clocksource(curr_clocksource, best);
curr_clocksource = best;
timekeeping_notify(curr_clocksource);
}
@@ -834,6 +837,8 @@ static struct sys_device device_clocksource = {
.cls = &clocksource_sysclass,
};
+static int sysfs_active = 0;
+
static int __init init_clocksource_sysfs(void)
{
int error = sysdev_class_register(&clocksource_sysclass);
@@ -848,10 +853,34 @@ static int __init init_clocksource_sysfs(void)
error = sysdev_create_file(
&device_clocksource,
&attr_available_clocksource);
+
+ if (!error) {
+ mutex_lock(&clocksource_mutex);
+ if (curr_clocksource->sys_register)
+ error = curr_clocksource->sys_register(
+ &device_clocksource, curr_clocksource);
+ mutex_unlock(&clocksource_mutex);
+ }
+
+ if (!error)
+ sysfs_active = 1;
return error;
}
device_initcall(init_clocksource_sysfs);
+
+void sysfs_alter_clocksource(struct clocksource *old,
+ struct clocksource *new)
+{
+ if (!sysfs_active)
+ return;
+ if (old->sys_unregister)
+ old->sys_unregister(&device_clocksource, old);
+ if (new->sys_register)
+ new->sys_register(&device_clocksource, new);
+}
+
#endif /* CONFIG_SYSFS */
/**
I think this is a bad idea, as it creates an ABI that is arch AND
machine specific, which will cause portability problems in applications
that expect the interface to be there.
thanks
-john
It's an arch-independent ABI that returns ENOENT on
unsupported platforms ;)
Could you please explain what you envision as an
arch-independent solution to this problem?
I guess the tsc_long_calibration=1 alternative is
one.
> john stultz wrote:
> > On Tue, 2010-05-25 at 20:16 -0400, Brian Bloniarz wrote:
> >> On 05/24/2010 09:33 PM, Brian Bloniarz wrote:
> >>> So what's wrong with just adding a
> >>> /sys/devices/system/clocksource/clocksource0/tsc_khz?
> >> As an RFC:
> >>
> >> Add clocksource.sys_register & sys_unregister so the
> >> current clocksource can add supplemental information to
> >> /sys/devices/system/clocksource/clocksource0/
> >>
> >> Export tsc_khz when current_clocksource==tsc so that
> >> daemons like NTP can account for the variability of
> >> calibration results.
> >
> > I think this is a bad idea, as it creates an ABI that is arch AND
> > machine specific, which will cause portability problems in applications
> > that expect the interface to be there.
>
> It's an arch-independent ABI that returns ENOENT on
> unsupported platforms ;)
>
> Could you please explain what you envision as an
> arch-independent solution to this problem?
> I guess the tsc_long_calibration=1 alternative is
> one.
The arch-independent solution is to provide information about the current
clocksource in general. This is _NOT_ a TSC-specific problem; you
have the same trouble with any other clocksource which gets calibrated
and does not take its frequency as a constant value from the boot loader,
configuration or some CPU/chipset register. The only missing piece is
a frequency member in struct clocksource which needs to be filled in
by the arch/machine-specific code.
Thanks,
tglx
> On 05/24/2010 09:33 PM, Brian Bloniarz wrote:
> > So what's wrong with just adding a
> > /sys/devices/system/clocksource/clocksource0/tsc_khz?
It's wrong because TSC is an x86'ism.
> As an RFC:
>
> Add clocksource.sys_register & sys_unregister so the
> current clocksource can add supplemental information to
> /sys/devices/system/clocksource/clocksource0/
>
> Export tsc_khz when current_clocksource==tsc so that
> daemons like NTP can account for the variability of
> calibration results.
I'd rather see a generic solution which provides the information of
the current (and possibly those of the available) clock source(s).
This x86 centric TSC world view is horrible.
Thanks,
tglx
Actually there is already a frequency in struct clocksource except
it's represented by the two components: mult and shift. Maybe
it would be best to expose these instead of khz (for all clocksources)
so as to limit abuse by naive users.
So, Thomas and John, if Brian's patch is modified to provide:
/sys/devices/system/clocksource/clocksource0/current_mult
/sys/devices/system/clocksource/clocksource0/current_shift
and/or
/sys/devices/system/clocksource/clocksource0/current_khz
is that an acceptable arch-independent patch? (And which do
you prefer?)
You mean the TSC user space ones ?
> So, Thomas and John, if Brian's patch is modified to provide:
>
> /sys/devices/system/clocksource/clocksource0/current_mult
> /sys/devices/system/clocksource/clocksource0/current_shift
> and/or
> /sys/devices/system/clocksource/clocksource0/current_khz
>
> is that an acceptable arch-independent patch? (And which do
> you prefer?)
I'd rather prefer the frequency interface, for a simple reason: it
allows adding a command-line option which provides the NTP folks with a
sensible solution to their calibration problem, as I don't see that a
longer calibration time will reliably fix it.
So we'd get a "clocksource_freq=XXX" option which would be applied to
the clocksource which is selected on the command line with
"clocksource=NNN".
John ?
Thanks,
tglx
Right, but having applications add "Linux on x86 where the TSC is being
used" logic is a pretty poor solution. It's an issue that should be
addressed from the kernel side.
And really, if apps really wanted this info, they can fish it out
of /proc/cpuinfo.
> Could you please explain what you envision as an
> arch-independent solution to this problem?
> I guess the tsc_long_calibration=1 alternative is
> one.
...and the tsc_khz= patch I posted earlier.
thanks
-john
Yeah, sure.
> And really, if apps really wanted this info, they can fish it out
> of /proc/cpuinfo.
Really? I was under the impression that tsc_khz can differ
from cpu_mhz (invariant TSC?), and cpu_mhz can differ from what
shows up in /proc/cpuinfo's "cpu MHz" due to cpufreq scaling. I was
also under the impression that knowing or controlling tsc_khz
is what NTP needs to ensure stability (assuming the TSC is
otherwise stable, i.e. no halts-in-idle, NMI, etc. weirdness).
Dan Magenheimer wrote:
> /sys/devices/system/clocksource/clocksource0/current_khz
>
> is that an acceptable arch-independent patch? (And which do
> you prefer?)
Thomas Gleixner:
> I'd rather see a generic solution which provides the information of
> the current (and possibly those of the available) clock source(s).
Another possibility:
$ cd /sys/devices/system/clocksource/clocksource0/
$ ls -lR
available_clocksource
current_clocksource
current_clocksource_ln -> tsc
tsc/
tsc/calibration
tsc/calibrated_master -> ../hpet
tsc/khz
hpet/
hpet/calibration
hpet/khz
$ cat tsc/calibration
slave
# there has been a one-time calibration against a reference at boot time,
# the source clock is in calibrated_master and the khz is calculated
# from that
$ cat hpet/calibration
constant
# takes its value from constant value from boot loader, configuration
# or some CPU/chipset register
Would this be workable? I need to look deeper at how the other clocksources
work, for example the virtualized ones. I'm also wondering if NICs with their
own clocks & IEEE-1588 support are going to become part of the clocksource
infrastructure (see e.g. http://patchwork.ozlabs.org/patch/52626/)
Thanks everyone for the guidance.
Bah. You're right. I shouldn't be emailing this early :)
Even so, I'm still not a fan of the "expose raw details so userland apps
can hack around the kernel's inadequacies" approach.
> Dan Magenheimer wrote:
> > /sys/devices/system/clocksource/clocksource0/current_khz
> >
> > is that an acceptable arch-independent patch? (And which do
> > you prefer?)
>
> Thomas Gleixner:
> > I'd rather see a generic solution which provides the information of
> > the current (and possibly those of the available) clock source(s).
While I'm not a huge fan of it, Thomas' way would be a bit more
palatable.
NTP can check the initial freq the clocksource was registered with and,
if it's different from the last boot, decide whether it can recalculate
that into a new correction factor, or just throw out the drift file value.
Brian: is this something the NTPd folks actually want? Has anyone
checked with them before we hand down the solution from high upon
lkml mountain?
Personally I think NTPd should be a little more savvy about how far it
trusts the drift file when it starts up, since I believe its
fast-startup mode can quickly estimate the drift to well within 100 ppm,
which is about the maximum variance I've seen from the calibration code.
thanks
-john
Engaging with them is probably a good idea. In the past, the NTP core
folks have been extremely anti-Linux and pro-BSD and therefore unwilling
to talk, but that has at least in part been due to what they perceive as
unilateral actions on our part.
-hpa
I haven't checked, it's been a while since I dealt with
this problem. The NTP maintainers definitely complain about the
quick TSC calibration code like it's a bug:
(e.g. http://www.mail-archive.com/ques...@lists.ntp.org/msg02079.html).
Anyway I'll reach out before I spend any time investing in
a solution that they don't want (and you don't like :).
> Personally I think NTPd should be a little more savvy about how far it
> trusts the drift file when it starts up. Since I believe its
> fast-startup mode can quickly estimate the drift well within 100ppm,
> which is about the maximum variance I've seen from the calibration code.
The workaround we went with was to remove the drift file on
every reboot. But in our experience, even with iburst, converging takes
a long time. I don't have hard numbers since it's been a long time since
I investigated the problem, but we defined failure as >1ms offset syncing
to a server in our LAN, and a cold NTP boot takes 10-20 hours to get
there.
I was hoping that being able to reuse the drift information
across boots would shorten convergence time. I think that in principle
it's a nice thing to be able to do. Though as far as I'm aware, neither
chrony nor PTPd (IEEE 1588) attempts to do this.
Yes, Prof. Mills in particular (for those who don't know, he's "Mr.
NTP") really gets upset about the way Linux does timekeeping.
Unfortunately it's not clear to me that he's willing to work with us as
opposed to wanting things to work exactly like BSD, and swive the
non-NTP users.
-hpa
Ok. If it's been a while, you may find recent kernels (2.6.31+) are much
faster to converge, due to adjustments made to the SHIFT_PLL constant.
This was done explicitly to address issues similar to what you describe
above.
thanks
-john
My tests were pre-2.6.31, this is really good to know. I'll take
another look on recent kernels.
Although I suspect his dislike for Linux is historical, as Roman
reworked the ntp code to follow the NTPv4 reference implementation back
in the 2.6.19 timeframe.
However, I'd be more than happy to try to address any specific
deficiencies in Linux's NTP implementation if someone can better
express what Prof. Mills' critiques are.
thanks
-john
That's the $10M question... I think it starts with asking the NTP
community for advice.
-hpa
> > Yes, understood. But the kernel doesn't expose a "gettimeofday
> > performance sucks" flag either. If it did (or in the case of
> > the patch, if tsc_reliable is zero) the application could at least
> > choose to turn off the 10000-100000 timestamps/second and log
> > a message saying "you are running on old hardware so you get
> > fewer features".
>
> I don't think anyone would object to exporting such a flag if
> it's cleanly designed.
>
> Getting the semantics right for that might be somewhat tricky
> though. How is "slow" defined?
Well... if you want to know how fast gettimeofday is, perhaps doing
gettimeofday(); gettimeofday();
is good enough?
If not, perhaps you can export a 'how many clocks is gettimeofday
expected to take' variable somewhere, but...
Pavel
> > A CPU-hotplugable system is a good example of a case where
> > the kernel should expose that tsc_reliable is 0. (I've heard
>
> That would mean that a large class of systems which
> are always hotplug capable (even if it's not used)
> would never get fast TSC time.
>
> Wasn't the goal here to be faster?
>
> > anecdotally that CPU hotplug into a QPI or Hypertransport system
> > will have some other interesting challenges, so may require some
> > special kernel parameters anyway.) Even if tsc_reliable were
> > only enabled if a "no-cpu_hotplug" kernel parameter is set,
> > that is still useful. And with cores-per-socket (and even
> > nodes-per-socket) going up seemingly every day, multi-socket
> > systems will likely be an ever smaller percentage of new
> > systems.
>
> Still the people running them will expect as good performance
> as possible.
>
> -Andi
>
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html