And if we ever add a new cpuid bit in the future, we have to do that again,
which creates some complexity and delays feature adoption.
Instead, what I'm proposing in this patch is a new capability, called
KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the feature list currently
supported by the hypervisor. If we ever want to add or remove a feature,
we only need to change the HV, leaving userspace untouched.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm.h | 1 +
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 714aae2..74f0dc3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MCE:
r = KVM_MAX_MCE_BANKS;
break;
+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
+ r = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_MMU_OP) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2);
+ break;
default:
r = 0;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ce28767..1ce124f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -507,6 +507,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_DEBUGREGS 50
#endif
#define KVM_CAP_X86_ROBUST_SINGLESTEP 51
+#define KVM_CAP_X86_CPUID_FEATURE_LIST 52
#ifdef KVM_CAP_IRQ_ROUTING
--
1.6.2.2
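For illustration, this is roughly how userspace would probe the proposed
capability. A sketch only, with error handling trimmed:
KVM_CAP_X86_CPUID_FEATURE_LIST is the constant this patch introduces (not
present in released headers), and the bit tested is KVM_FEATURE_CLOCKSOURCE2
(bit 3) from asm/kvm_para.h.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_X86_CPUID_FEATURE_LIST
#define KVM_CAP_X86_CPUID_FEATURE_LIST 52	/* value proposed by this patch */
#endif

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int features;

	if (kvm < 0)
		return 1;

	/* KVM_CHECK_EXTENSION returns the bitmap built in
	 * kvm_dev_ioctl_check_extension() above */
	features = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_X86_CPUID_FEATURE_LIST);
	if (features > 0 && (features & (1 << 3)))	/* KVM_FEATURE_CLOCKSOURCE2 */
		printf("host supports the new kvmclock MSRs\n");
	return 0;
}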
This happens even on some machines that report constant_tsc in their tsc flags,
especially on multi-socket ones.
Two reads of the same kernel timestamp at approximately the same time will
likely be tsc-timestamped at different instants too. This means the delta we
calculate is unpredictable at best, and can probably be smaller on a cpu
that is legitimately reading the clock at a later instant.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I thought about using a shared variable anyway, to hold the
clock's last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global value, we know that we need to
return a higher one, even though it is not exactly the one we compared against.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, on any of the machines where I saw them before.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
CC: Avi Kivity <a...@redhat.com>
CC: Marcelo Tosatti <mtos...@redhat.com>
CC: Zachary Amsden <zam...@redhat.com>
---
arch/x86/kernel/pvclock.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..b7de0e6 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -109,11 +109,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}
+static u64 last_value = 0;
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+ u64 last;
do {
version = pvclock_get_time_values(&shadow, src);
@@ -123,6 +126,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ /*
+ * Assumption here is that last_value, a global accumulator, always goes
+ * forward. If we are less than that, we should not be much smaller.
+ * We assume there is an error marging we're inside, and then the correction
+ * does not sacrifice accuracy.
+ *
+ * For reads: global may have changed between test and return,
+ * but this means someone else updated poked the clock at a later time.
+ * We just need to make sure we are not seeing a backwards event.
+ *
+ * For updates: last_value = ret is not enough, since two vcpus could be
+ * updating at the same time, and one of them could be slightly behind,
+ * making the assumption that last_value always go forward fail to hold.
+ */
+ do {
+ last = last_value;
+ if (ret < last)
+ return last;
+ } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
+
return ret;
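For readers who want to poke at the clamp outside the kernel, here is a
minimal user-space sketch of the same idea, assuming C11 atomics in place of
cmpxchg64; raw_clock_read() is a hypothetical stand-in for the per-vcpu
delta-based computation above.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_value;	/* global accumulator, only moves forward */

uint64_t raw_clock_read(void);		/* hypothetical per-cpu, possibly non-monotonic read */

uint64_t monotonic_clock_read(void)
{
	uint64_t ret = raw_clock_read();
	uint64_t last = atomic_load(&last_value);

	for (;;) {
		if (ret < last)
			return last;	/* someone already saw a later time; never go back */
		/* try to publish our reading; on failure 'last' is refreshed */
		if (atomic_compare_exchange_weak(&last_value, &last, ret))
			return ret;
	}
}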
However, kvm clock registration happens very early, and if we ever
try to write to a non-existent MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs is supported.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++
arch/x86/kernel/kvmclock.c | 68 +++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 0cffb96..a32710a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* We could just try to use new msr values, but they are queried very early,
+ * kernel does not have idt handlers yet, and failures are fatal */
+#define KVM_FEATURE_CLOCKSOURCE2 3
+
#define MSR_KVM_WALL_CLOCK_OLD 0x11
#define MSR_KVM_SYSTEM_TIME_OLD 0x12
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..6d814ce 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,7 @@
#define KVM_SCALE 22
static int kvmclock = 1;
+static int kvm_use_new_msrs = 0;
static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +42,18 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;
+static int kvm_system_time_write_value(int low, int high)
+{
+ if (kvm_use_new_msrs)
+ return native_write_msr_safe(MSR_KVM_SYSTEM_TIME_OLD, low, high);
+ else
+ return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+}
+
+static void kvm_turnoff_clock(void)
+{
+ kvm_system_time_write_value(0, 0);
+}
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -54,7 +67,11 @@ static unsigned long kvm_get_wallclock(void)
low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
- native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+ if (kvm_use_new_msrs)
+ native_write_msr_safe(MSR_KVM_WALL_CLOCK, low, high);
+ else
+ native_write_msr(MSR_KVM_WALL_CLOCK_OLD, low, high);
vcpu_time = &get_cpu_var(hv_clock);
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +147,8 @@ static int kvm_register_clock(char *txt)
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
- return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+ return kvm_system_time_write_value(low, high);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +183,14 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#ifdef CONFIG_KEXEC
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ kvm_turnoff_clock();
native_machine_crash_shutdown(regs);
}
#endif
static void kvm_shutdown(void)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ kvm_turnoff_clock();
native_machine_shutdown();
}
@@ -181,27 +199,35 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
+ if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
+ kvm_use_new_msrs = 1;
+ else if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))
+ kvm_use_new_msrs = 0;
+ else
+ return;
+
+ printk(KERN_INFO "kvm-clock: %ssing clocksource new msrs",
+ kvm_use_new_msrs ? "U": "Not u");
+
+ if (kvm_register_clock("boot clock"))
+ return;
+ pv_time_ops.sched_clock = kvm_clock_read;
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
+ x86_cpuinit.setup_percpu_clockev =
+ kvm_setup_secondary_clock;
#endif
#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#endif
- machine_ops.shutdown = kvm_shutdown;
+ machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
- kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- }
+ kvm_get_preset_lpj();
+ clocksource_register(&kvm_clock);
+ pv_info.paravirt_enabled = 1;
+ pv_info.name = "KVM";
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 8 ++++++--
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..0cffb96 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -17,8 +17,12 @@
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
-#define MSR_KVM_WALL_CLOCK 0x11
-#define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_WALL_CLOCK_OLD 0x11
+#define MSR_KVM_SYSTEM_TIME_OLD 0x12
+
+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME 0x4b564d01
#define KVM_MAX_MMU_OP_BATCH 32
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..714aae2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_OLD, MSR_KVM_WALL_CLOCK_OLD,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_OLD:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_OLD:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1374,9 +1377,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = vcpu->arch.efer;
break;
case MSR_KVM_WALL_CLOCK:
+ case MSR_KVM_WALL_CLOCK_OLD:
data = vcpu->kvm->arch.wall_clock;
break;
case MSR_KVM_SYSTEM_TIME:
+ case MSR_KVM_SYSTEM_TIME_OLD:
data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 138 insertions(+), 0 deletions(-)
create mode 100644 Documentation/kvm/kvmclock.txt
diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
new file mode 100644
index 0000000..21008bb
--- /dev/null
+++ b/Documentation/kvm/kvmclock.txt
@@ -0,0 +1,138 @@
+KVM Paravirtual Clocksource driver
+Glauber Costa, Red Hat Inc.
+==================================
+
+1. General Description
+=======================
+
+Keeping time in virtual machine is acknowledged as a hard problem. The most
+basic mode of operation, usually done by older guests, assumes a fixed length
+between timer interrupts. It then counts the number of interrupts and
+calculates elapsed time. This method fails easily in virtual machines, since
+we can't guarantee that the virtual interrupt will be delivered in time.
+
+Another possibility is to emulate modern devices like HPET, or any other we
+see fit. A modern guest which implements something like the clocksource
+infrastructure, can then ask this virtual device about current time when it
+needs to. The problem with this approach, is that it bumps the guest out
+of guest mode operation, and in some cases, even to userspace very frequently.
+
+In this context, the best approach is to provide the guest with a
+virtualization-aware (paravirtual) clock device. It the asks the hypervisor
+about current time, guaranteeing both stable and accurate timekeeping.
+
+2. kvmclock basics
+===========================
+
+When supported by the hypervisor, guests can register a memory page
+to contain kvmclock data. This page has to be present in guest's address space
+throughout its whole life. The hypervisor continues to write to it until it is
+explicitly disabled or the guest is turned off.
+
+2.1 kvmclock availability
+-------------------------
+
+Guests that want to take advantage of kvmclock should first check its
+availability through cpuid.
+
+kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
+the present of kvmclock. Bit 0 indicates that kvmclock is present, but the
+old MSR set must be used. See section 2.3 for details.
+
+2.2 kvmclock functionality
+--------------------------
+
+Two MSRs are provided by the hypervisor, controlling kvmclock operation:
+
+ * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
+ * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
+
+The first one is only used in rare situations, like boot-time and a
+suspend-resume cycle. Data is disposable, and after used, the guest
+may use it for something else. This is hardly a hot path for anything.
+The Hypervisor fills in the address provided through this MSR with the
+following structure:
+
+struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+} __attribute__((__packed__));
+
+Guest should only trust data to be valid when version haven't changed before
+and after reads of sec and nsec. Besides not changing, it has to be an even
+number. Hypervisor may write an odd number to version field to indicate that
+an update is in progress.
+
+MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is
+constantly updated by the hypervisor with time information. The data
+written in this MSR contains two pieces of information: the address in which
+the guests expects time data to be present 4-byte aligned or'ed with an
+enabled bit. If one wants to shutdown kvmclock, it just needs to write
+anything that has 0 as its last bit.
+
+Time information presented by the hypervisor follows the structure:
+
+struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 pad[3];
+} __attribute__((__packed__));
+
+The version field plays the same role as with the one in struct
+pvclock_wall_clock. The other fields, are:
+
+ a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of
+ this cpu at the moment we recorded system_time. Note that some time is
+ inevitably spent between system_time and tsc_timestamp measurements.
+ Guests can subtract this quantity from the current value of tsc to obtain
+ a delta to be added to system_time
+
+ b. system_time: this is the most recent host-time we could be provided with.
+ host gets it through ktime_get_ts, using whichever clocksource is
+ registered at the moment
+
+ c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
+ by in order to obtain time in nanoseconds. Hypervisor is free to change
+ this value in face of events like cpu frequency change, pcpu migration,
+ etc.
+
+ d. tsc_shift: guests must shift
+
+With this information available, guest calculates current time as:
+
+ T = kt + to_nsec(tsc - tsc_0)
+
+2.3 Compatibility MSRs
+----------------------
+
+Guests running on top of older hypervisors may have to use a different set of
+MSRs. This is because originally, kvmclock MSRs were exported within a
+reserved range by accident. Guests should check cpuid leaf 0x40000001 for the
+presence of kvmclock. If bit 3 is disabled, but bit 0 is enabled, guests can
+have access to kvmclock functionality through
+
+ * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and
+ * MSR_KVM_SYSTEM_TIME_OLD, value 0x12.
+
+Note, however, that this is deprecated.
+
+3. Migration
+============
+
+Two ioctls are provided to aid the task of migration:
+
+ * KVM_GET_CLOCK and
+ * KVM_SET_CLOCK
+
+Their aim is to control an offset that can be summed to system_time, in order
+to guarantee monotonicity on the time over guest migration. Source host
+executes KVM_GET_CLOCK, obtaining the last valid timestamp in this host, while
+destination sets it with KVM_SET_CLOCK. It's the destination responsibility to
+never return time that is less than that.
+
+
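For readers less familiar with the protocol of section 2.2, a minimal
guest-side sketch of the read path, assuming the structure layout documented
above; the 128-bit multiply is a GCC/clang extension standing in for the
kernel's scale-delta helper, and rdtsc_value is whatever the guest read from
the TSC.

#include <stdint.h>

struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  pad[3];
} __attribute__((__packed__));

static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;
	/* multiply by the 32.32 fixed-point factor, keep the integer part */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

uint64_t pvclock_read(volatile struct pvclock_vcpu_time_info *ti,
		      uint64_t rdtsc_value)
{
	uint32_t version, mul;
	uint64_t system_time, tsc_timestamp;
	int8_t shift;

	do {
		version = ti->version;
		__asm__ __volatile__("" ::: "memory");	/* snapshot inside the version check */
		tsc_timestamp = ti->tsc_timestamp;
		system_time   = ti->system_time;
		mul           = ti->tsc_to_system_mul;
		shift         = ti->tsc_shift;
		__asm__ __volatile__("" ::: "memory");
	} while ((version & 1) || version != ti->version);

	return system_time + scale_delta(rdtsc_value - tsc_timestamp, mul, shift);
}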
> This patch adds a new file, kvm/kvmclock.txt, describing
> the mechanism we use in kvmclock.
>
> Signed-off-by: Glauber Costa <glo...@redhat.com>
> ---
> Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 138 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/kvmclock.txt
>
> diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
> new file mode 100644
> index 0000000..21008bb
> --- /dev/null
> +++ b/Documentation/kvm/kvmclock.txt
> @@ -0,0 +1,138 @@
> +KVM Paravirtual Clocksource driver
> +Glauber Costa, Red Hat Inc.
> +==================================
> +
> +1. General Description
> +=======================
> +
...
> +
> +2. kvmclock basics
> +===========================
> +
> +When supported by the hypervisor, guests can register a memory page
> +to contain kvmclock data. This page has to be present in guest's address space
> +throughout its whole life. The hypervisor continues to write to it until it is
> +explicitly disabled or the guest is turned off.
> +
> +2.1 kvmclock availability
> +-------------------------
> +
> +Guests that want to take advantage of kvmclock should first check its
> +availability through cpuid.
> +
> +kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
> +the present of kvmclock. Bit 0 indicates that kvmclock is present, but the
presence
but it's confusing. Is it bit 3 or bit 0? They seem to indicate the same thing.
> +old MSR set must be used. See section 2.3 for details.
"old MSR set": what does this mean?
> +
> +2.2 kvmclock functionality
> +--------------------------
> +
> +Two MSRs are provided by the hypervisor, controlling kvmclock operation:
> +
> + * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
> + * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
> +
> +The first one is only used in rare situations, like boot-time and a
> +suspend-resume cycle. Data is disposable, and after used, the guest
> +may use it for something else. This is hardly a hot path for anything.
> +The Hypervisor fills in the address provided through this MSR with the
> +following structure:
> +
> +struct pvclock_wall_clock {
> + u32 version;
> + u32 sec;
> + u32 nsec;
> +} __attribute__((__packed__));
> +
> +Guest should only trust data to be valid when version haven't changed before
has not
CPU (please)
> + inevitably spent between system_time and tsc_timestamp measurements.
> + Guests can subtract this quantity from the current value of tsc to obtain
> + a delta to be added to system_time
to system_time.
> +
> + b. system_time: this is the most recent host-time we could be provided with.
> + host gets it through ktime_get_ts, using whichever clocksource is
> + registered at the moment
moment.
> +
> + c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
> + by in order to obtain time in nanoseconds. Hypervisor is free to change
> + this value in face of events like cpu frequency change, pcpu migration,
CPU
> + etc.
> +
> + d. tsc_shift: guests must shift
missing text??
---
~Randy
All comments are relevant. I'll update the document to cover them.
As for your question: Both bit 3 and 0 are used, yes. But they
tell the guest to use a different MSR pair. However, this document
is intended for people not familiar with kvmclock. If you read it,
and could not extract that from the text, it definitely needs to
be augmented.
Thanks!
__cacheline_aligned_in_smp to keep other data from sharing the
cacheline.
Don't you need to handle wrap-around?
Is that "1.5 million"?
> (to be fair, it wasn't that bad in most of them). Investigating further, I
> found out that such warps were caused by the very offset-based calculation
> pvclock is based on.
>
Is the problem that the tscs are starting out of sync, or that they're
drifting relative to each other over time? Do the problems become worse
the longer the uptime? How large are the offsets we're talking about here?
Does this need a barrier() to prevent the compiler from re-reading
last_value for the subsequent lines? Otherwise "(ret < last)" and
"return last" could execute with different values for "last".
> + if (ret < last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
>
So if CPU A's tsc is, say, 50us behind CPU B's, then it will return a
static time (from last_time) for 50us until its tsc catches up - or if
CPU A happens to update last_time to give it a kick?
Is it worth trying to update CPU A's tsc offset at the same time?
J
This is one source of the problem, but the same thing happens at many
levels... tsc may start out of sync, drift between sockets, be badly
re-calibrated by the BIOS, etc... the issue persists even if the TSCs
are perfectly in sync - the measurement of them is not.
So reading TSC == 100,000 units at time A and then waiting 10 units, one
may read TSC == 100,010 +/- 5 units because the code stream is not
perfectly serialized - nor can it be. There will always be some amount
of error unless running in perfect lock-step, which only happens in a
simulator.
This inherent measurement error can cause apparent time to go backwards
when measured simultaneously across multiple CPUs, or when
re-calibrating against an external clocksource. Combined with other
factors as above, it can be of sufficient magnitude to be noticed. KVM
clock is particularly exposed to the problem because the TSC is measured
and recalibrated for each virtual CPU whenever there is a physical CPU
switch, so micro-adjustments forwards and backwards may occur during the
recalibration - and appear as a real backwards time warp to the guest.
I have some patches to fix that issue, but the SMP problem remains to be
fixed - and is addressed quite thoroughly by this patch.
Zach
Please define a cpuid bit that makes this optional. When we eventually
enable it in the future, it will allow a wider range of guests to enjoy it.
>
> +static u64 last_value = 0;
>
Needs to be atomic64_t.
> +
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> struct pvclock_shadow_time shadow;
> unsigned version;
> cycle_t ret, offset;
> + u64 last;
>
>
> + do {
> + last = last_value;
>
Otherwise, this assignment can see a partial update.
> + if (ret< last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
> +
> return ret;
> }
>
>
--
On a 32-bit guest, that is.
Hm. We need to update userspace anyway, since we don't like turning
features on unconditionally (it breaks live migration into an older kernel).
--
This comment has no place in a header that's oriented to userspace.
Instead, it's better to document what this bit exposes.
>
> static int kvmclock = 1;
> +static int kvm_use_new_msrs = 0;
>
Instead of this and all those if ()s all over the place, you can define
variables for the MSR indices. Set them up on initialization, the rest
of the code just looks them up.
--
This is exposed to userspace. Userspace that is compiled with the new
headers, but runs on an old kernel, will break.
So you need to keep the old names, and define a new KVM_FEATURE for the
new names.
--
atomic64_read() _is_ cmpxchg64b. Are you thinking of some clever
implementation for smp i386?
--
Fun, we also have:
X86_FEATURE_NONSTOP_TSC, which states the thing doesn't stop in C
states.
Right, do bear in mind that the x86 implementation of atomic64_read() is
terrifyingly expensive, it is better to not do that read and simply use
the result of the cmpxchg.
Right, so on x86 we have:
X86_FEATURE_CONSTANT_TSC, which only states that TSC is frequency
independent, not that it doesn't stop in C states and similar fun stuff.
X86_FEATURE_TSC_RELIABLE, which IIRC should indicate the TSC is constant
and synced between cores.
--
Sockets and boards too? (IOW, how reliable is TSC_RELIABLE)?
--
All of them? I thought tsc stops in some mwait deep REM sleep thing.
So what do we need? test for both TSC_RELIABLE and NONSTOP_TSC? IMO
TSC_RELIABLE should imply NONSTOP_TSC.
--
The idea is that TSC will not stop ever (except s2r like stuff), not
sure about the actual implementation of the NONSTOP_TSC bit though.
> So what do we need? test for both TSC_RELIABLE and NONSTOP_TSC? IMO
> TSC_RELIABLE should imply NONSTOP_TSC.
Yeah, I think RELIABLE does imply NONSTOP and CONSTANT, but NONSTOP &&
CONSTANT does not make RELIABLE.
Worrying. By the time we detect this the guest may already have gotten
confused by clocks going backwards.
Not sure, IIRC we clear that when the TSC sync test fails, eg when we
mark the tsc clocksource unusable.
--
No, what I was suggesting was to rewrite that loop not to need the
initial read but use the cmpxchg result of the previous iteration.
Something like:
u64 last = 0;

/* more stuff */

do {
	if (ret < last)
		return last;
	last = cmpxchg64(&last_value, last, ret);
} while (last != ret);
That only has a single cmpxchg8 in there per loop instead of two
(avoiding the atomic64_read() one).
I think it's implied with atomic64_read().
--
Yes it would be. I was merely trying to point out that
last = ACCESS_ONCE(last_value);
Is a narrower way of writing:
last = last_value;
barrier();
In that it need not clobber all memory locations and makes it instantly
clear what we want the barrier for.
Still have two cmpxchgs in the common case. The first iteration will
fail, fetching last_value, the second will work.
It will be better when we have contention, though, so it's worthwhile.
--
Right, another option is to put the initial read outside of the loop,
that way you'll have the best of all cases, a single LOCK'ed op in the
loop, and only a single LOCK'ed op for the fast path on sensible
architectures ;-)
last = atomic64_read(&last_value);
do {
	if (ret < last)
		return last;
	last = atomic64_cmpxchg(&last_value, last, ret);
} while (unlikely(last != ret));
Or so.
Oh yes, just trying to avoid a patch with both atomic64_read() and
ACCESS_ONCE().
--
The manual says:
> 16.11.1 Invariant TSC
>
> The time stamp counter in newer processors may support an enhancement,
> referred
> to as invariant TSC. Processor’s support for invariant TSC is indicated by
> CPUID.80000007H:EDX[8].
> The invariant TSC will run at a constant rate in all ACPI P-, C-. and
> T-states. This is
> the architectural behavior moving forward. On processors with
> invariant TSC
> support, the OS may use the TSC for wall clock timer services (instead
> of ACPI or
> HPET timers). TSC reads are much more efficient and do not incur the
> overhead
> associated with a ring transition or access to a platform resource.
and this maps to NONSTOP, so I think we're fine.
--
Yes, looks optimal when !NONSTOP_TSC.
--
But as I said, this seems to be a very deep worst case scenario. Most boxes
are not even close to being that bad.
>
> > (to be fair, it wasn't that bad in most of them). Investigating further, I
> > found out that such warps were caused by the very offset-based calculation
> > pvclock is based on.
> >
>
> Is the problem that the tscs are starting out of sync, or that they're
> drifting relative to each other over time? Do the problems become worse
> the longer the uptime? How large are the offsets we're talking about here?
The offsets usually seem pretty small, under a microsecond. So I don't think
it has anything to do with tscs starting out of sync. Especially because the
delta-based calculation has the exact purpose of handling that case.
This patch writes last_value atomically, but reads it non-atomically. A
barrier is insufficient.
--
No. On i386, the statement
last = last_value;
will be split by the compiler into two 32-bit loads. If a write
(atomic, using cmpxchg) on another cpu happens between those two loads,
then the variable last will have a corrupted value.
--
What avi says! :-)
On a 32bit machine a 64bit read is two 32bit reads, so
	last = last_value;
becomes:
	last.high = last_value.high;
	last.low = last_value.low;
(or the reverse of course)
Now imagine a write getting interleaved with that ;-)
All this ioctl says is: "This is the stuff currently supported in this HV".
It does not mandate userspace to expose all of this to the guest. It just saves
us from the job of creating yet another CAP for every bit we plan on including.
If we want to be conservative, we can keep everything but the things we know
already disabled, in userspace.
You could explicitly do:
do {
	h = last.high;
	barrier();
	l = last.low;
	barrier();
} while (last.high != h);
This works because we expect last to be always increasing, so the only
worry is low wrapping and incrementing high, and is more efficient than
making the read fully atomic (the write is still cmpxchg64). But it's
pretty ugly to open code just for 32b architectures; it's something that
might be useful to turn into a general abstraction (monotonic_read_64
FTW!). I already have code like this in the Xen time code, so I could
make immediate use of it.
J
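One possible shape of the monotonic_read_64() abstraction mentioned above, as
a sketch only (the names are illustrative, not an existing API); it relies on
the 64-bit value only ever moving forward and on the writer updating it
atomically, e.g. via cmpxchg64.

struct mono64 {
	volatile u32 low;
	volatile u32 high;
};

static inline u64 monotonic_read_64(const struct mono64 *v)
{
	u32 h, l;

	do {
		h = v->high;
		barrier();
		l = v->low;
		barrier();
	} while (v->high != h);

	/* high was stable across the low read, and it never decreases, so
	 * (h, l) is a value the variable actually held at some point */
	return ((u64)h << 32) | l;
}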
So you think they're drifting out of sync from an initially synced
state? If so, what would bound the drift?
J
Well, on a 32b system, you can explicitly order the updates of low and
high, then do a high-low-checkhigh read. That would be much more
efficient than atomic64. If we really care about 32b.
J
Marcelo can probably confirm it, but he has a nehalem with an apparently
very good tsc source. Even this machine warps.
It stops warping if we only write the pvclock data structure once and forget it
(which only updates tsc_timestamp once), according to him.
Obviously, we can't do that everywhere.
Upstream, we are marking the TSC unstable preemptively when hardware
which will eventually fail the sync test is detected, so this should be fine.
Zach
Yes.
> Marcelo can probably confirm it, but he has a nehalem with an appearently
> very good tsc source. Even this machine warps.
>
> It stops warping if we only write pvclock data structure once and forget it,
> (which only updated tsc_timestamp once), according to him.
Yes. So it's not as if the guest-visible TSCs go out of sync (they don't
on this machine Glauber mentioned, or even on a multi-core Core 2 Duo),
but the delta calculation is very hard (if not impossible) to get right.
The timewarps I've seen were in the 0-200ns range, and very rare (once
every 10 minutes or so).
(that's a serious bug wrt migration, btw)
> But let's suppose we have: What's the difference between disabling
> it in the way it is now, and disabling it with the method I am proposing?
>
No difference.
> All this ioctl says is: "This is the stuff currently supported in this HV".
> It does not mandate userspace to expose all of this to the guest. It just saves
> us from the job of creating yet another CAP for every bit we plan on including.
>
Right. Well, creating a new CAP and creating a new FEATURE flag aren't
very different, and I'd like to avoid API churn. We have enough new
APIs due to missing or badly implemented features; I'd like to avoid new
ones whenever possible.
It's not like it saves userspace anything, it has to accommodate older
kernels anyhow. We are able to ignore pre 2.6.27 kernels, but now with
kvm shipped in long term support distributions, deprecating APIs will be
much harder.
> If we want to be conservative, we can keep everything but the things we know
> already disabled, in userspace.
>
We definitely need to do that.
--
ENOPARSE?
--
I don't think this is worthwhile - the cmpxchg is not that expensive on
most kvm capable hosts (the exception is the Pentium D).
btw, do you want this code in pvclock.c, or shall we keep it kvmclock
specific?
--
Might be due to NMIs or SMIs interrupting the rdtsc(); ktime_get()
operation which establishes the timeline. We could limit it by having a
loop doing rdtsc(); ktime_get(); rdtsc(); and checking for some bound,
but it isn't worthwhile (and will break nested virtualization for
sure). Better to have the option to calibrate kvmclock just once on
machines with X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY.
--
We can detect that, and besides doing the calculation only once, also export some
bit indicating that to the guest. Humm... I'm thinking now, that because of
migration, we should check this bit every time, because we might have changed hosts.
So instead of using an expensive cpuid check, we should probably use some bit in
the vcpu_time_info structure, and use a cpuid bit just to say it is enabled.
Jeremy,
are you okay with turning one of the pad fields in the structure into a flags field?
Right, we need a bit in the structure itself that says you can trust the
time not to go backwards, and a bit in cpuid that says you can trust the
bit in the structure (unless we can be sure no implementations leak
garbage into the padding field).
--
People are reporting weirdnesses that "clocksource=jiffies" apparently
resolves. Xen and KVM are faced with the same hardware constraints, and
it wouldn't surprise me if there were small measurable
non-monotonicities in the PV clock under Xen. May as well be safe.
Of course, it kills any possibility of being able to usefully export
this interface down to usermode.
My main concern about this kind of simple fix is that if there's a long
term systematic drift between different CPU's tscs, then this will
somewhat mask the problem while giving really awful time measurement on
the "slow" CPU(s). In that case it really needs to adjust the scaling
factor to correct for the drift (*not* update the offset). But if we're
definitely only talking about fixed, relatively small time offsets then
it is fine.
J
Did you observe drift on Xen, or is this "ultimately" pointing at the
future?
--
I think it's a pvclock-level fix. I'd been hoping to avoid having
something like this, but I think its ultimately necessary.
J
Does the drift only occur on SMP VMs?
There's a perfect way to do this and it still fails to stop timewarps.
You can set the performance counters to overflow if more instructions
are issued than your code path, run an assembly instruction stream and
if the performance interrupt hits, restart the calibration.
The calibration happens not just once, but on every migration, and
currently, I believe, on every VCPU switch. Even if we reduce the
number of calibrations to the bare minimum and rule out SMIs and NMIs,
there will still be variation due to factors beyond our control because
of the unpredictable nature of cache and instruction issue.
However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply one
key feature which the code is missing today: on SMP VMs, the
calibration of kvmclock needs to be done only once, and the clock can
then be used for all VCPUs. That, I think, stops Glauber's bug from
appearing on the server side.
I will spin that into my web of patches and send the cocoon out sometime
this evening.
Zach
Instead of detecting TSC warp, c1e_idle, power_saving_mwait_init,
tsc_check_state, dmi_mark_tsc_unstable all do something similar to this
to disable TSC before warp even occurs:
static void c1e_idle(void)
{
	if (need_resched())
		return;
	if (!c1e_detected) {
		u32 lo, hi;
		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
			c1e_detected = 1;
			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
				mark_tsc_unstable("TSC halt in AMD C1E");
			printk(KERN_INFO "System has AMD C1E enabled\n");
			set_cpu_cap(&boot_cpu_data, X86_FEATURE_AMDC1E);
It's completely impractical. The PMU is a global resource that is
already shared among users and the host; programming and restoring it is
expensive; and in a virtualized environment the whole scheme may fail.
>
> The calibration happens not just once, but on every migration, and
> currently, I believe, on every VCPU switch. Even if we reduce the
> number of calibrations to the bare minimum and rule out SMIs and NMIs,
> there will still be variation due to factors beyond our control
> because of the unpredictable nature of cache and instruction issue.
Right.
>
> However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply
> one key feature which the code is missing today: on SMP VMs, the
> calibration of kvmclock needs to be done only once, and the clock can
> then be used for all VCPUs. That, I think, stops Glauber's bug from
> appearing on the server side.
That's the plan.
--
That works within a socket; but multiple socket machines need not feed
all sockets from the same crystal and reset line (esp. likely for
multiple board machines, but might even happen with single board
FSB-less processors).
--
You need to define TOD to 1, and leave out the TSC test.
For me, warps exist on every machine out there but the nehalems, so far.
Plus, replace cpuid by lfence/mfence. cpuid will trap.
--
Right. But I'm not sure lfence/mfence are sufficient - from what I
understand rdtsc can pass right through them.
I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?
J
Does lfence / mfence actually serialize? I thought there was some great
confusion about that not being the case on all AMD processors, and
possibly not at all on Intel.
A trap, however, is a great way to serialize.
I think there is no serializing instruction which can be used from
userspace and does not trap; at least, I don't know one off the top of
my head.
Zach
rsm is not technically privileged... but not quite usable from usermode ;)
J
rsm under hardware virtualization makes my head hurt
They don't.
>
> A trap, however is a great way to serialize.
>
> I think, there is no serializing instruction which can be used from
> userspace which does not trap, at least, I don't know one off the top
> of my head.
iret.
--
Either one independently is sufficient for me.
--
This happens even on some machines that report constant_tsc in their tsc flags,
especially on multi-socket ones.
Two reads of the same kernel timestamp at approximately the same time will
likely be tsc-timestamped at different instants too. This means the delta we
calculate is unpredictable at best, and can probably be smaller on a cpu
that is legitimately reading the clock at a later instant.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I thought about using a shared variable anyway, to hold the
clock's last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global value, we know that we need to
return a higher one, even though it is not exactly the one we compared against.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, on any of the machines where I saw them before.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
CC: Avi Kivity <a...@redhat.com>
CC: Marcelo Tosatti <mtos...@redhat.com>
CC: Zachary Amsden <zam...@redhat.com>
---
arch/x86/kernel/pvclock.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 8f4af7b..6cf6dec 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -118,11 +118,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}
+static atomic64_t last_value = ATOMIC64_INIT(0);
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+ u64 last;
do {
version = pvclock_get_time_values(&shadow, src);
@@ -132,6 +135,27 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ /*
+ * Assumption here is that last_value, a global accumulator, always goes
+ * forward. If we are less than that, we should not be much smaller.
+ * We assume there is an error marging we're inside, and then the correction
+ * does not sacrifice accuracy.
+ *
+ * For reads: global may have changed between test and return,
+ * but this means someone else updated poked the clock at a later time.
+ * We just need to make sure we are not seeing a backwards event.
+ *
+ * For updates: last_value = ret is not enough, since two vcpus could be
+ * updating at the same time, and one of them could be slightly behind,
+ * making the assumption that last_value always go forward fail to hold.
+ */
+ last = atomic64_read(&last_value);
+ do {
+ if (ret < last)
+ return last;
+ last = atomic64_cmpxchg(&last_value, last, ret);
+ } while (unlikely(last != ret));
+
return ret;
}
--
1.6.2.2
Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually decided via HV negotiation.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
---
arch/x86/include/asm/pvclock-abi.h | 3 ++-
arch/x86/include/asm/pvclock.h | 1 +
arch/x86/kernel/pvclock.c | 9 +++++++++
3 files changed, 12 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64 system_time;
u32 tsc_to_system_mul;
s8 tsc_shift;
- u8 pad[3];
+ u8 flags;
+ u8 pad[2];
} __attribute__((__packed__)); /* 32 bytes */
struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+ u8 flags;
};
+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+ valid_flags = flags;
+}
+
/*
* Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
* yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
dst->system_timestamp = src->system_time;
dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
dst->tsc_shift = src->tsc_shift;
+ dst->flags = src->flags;
rmb(); /* test version after fetching data */
} while ((src->version & 1) || (dst->version != src->version));
However, kvm clock registration happens very early, and if we ever
try to write to a non-existent MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs is supported.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/kernel/kvmclock.c | 56 +++++++++++++++++++++++++++----------------
1 files changed, 35 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..f2f6aee 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,8 @@
#define KVM_SCALE 22
static int kvmclock = 1;
+static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -54,7 +57,8 @@ static unsigned long kvm_get_wallclock(void)
low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
- native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+ native_write_msr_safe(msr_kvm_wall_clock, low, high);
vcpu_time = &get_cpu_var(hv_clock);
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +134,8 @@ static int kvm_register_clock(char *txt)
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
- return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+ return native_write_msr_safe(msr_kvm_system_time, low, high);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +170,15 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#ifdef CONFIG_KEXEC
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_crash_shutdown(regs);
}
#endif
static void kvm_shutdown(void)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_shutdown();
}
@@ -181,27 +187,35 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
+ if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ }
+ else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ return;
+
+ printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ msr_kvm_system_time, msr_kvm_wall_clock);
+
+ if (kvm_register_clock("boot clock"))
+ return;
+ pv_time_ops.sched_clock = kvm_clock_read;
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
+ x86_cpuinit.setup_percpu_clockev =
+ kvm_setup_secondary_clock;
#endif
#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#endif
- machine_ops.shutdown = kvm_shutdown;
+ machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
- kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- }
+ kvm_get_preset_lpj();
+ clocksource_register(&kvm_clock);
+ pv_info.paravirt_enabled = 1;
+ pv_info.name = "KVM";
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..9734808 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,10 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+
#define KVM_MAX_MMU_OP_BATCH 32
/* Operations for KVM_HC_MMU_OP */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..a2ead7f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_NEW:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_NEW:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1374,9 +1377,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = vcpu->arch.efer;
break;
case MSR_KVM_WALL_CLOCK:
+ case MSR_KVM_WALL_CLOCK_NEW:
data = vcpu->kvm->arch.wall_clock;
break;
case MSR_KVM_SYSTEM_TIME:
+ case MSR_KVM_SYSTEM_TIME_NEW:
data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
And if we ever add a new cpuid bit in the future, we have to do that again,
which creates some complexity and delays feature adoption.
Instead, what I'm proposing in this patch is a new capability, called
KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the feature list currently
supported by the hypervisor. If we ever want to add or remove a feature,
we only need to change the HV, leaving userspace untouched.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm.h | 1 +
3 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9734808..f019f8c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* This indicates that the new set of kvmclock msrs
+ * are available. The use of 0x11 and 0x12 is deprecated
+ */
+#define KVM_FEATURE_CLOCKSOURCE2 3
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2ead7f..04f04aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MCE:
r = KVM_MAX_MCE_BANKS;
break;
+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
+ r = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_MMU_OP) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2);
+ break;
default:
r = 0;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ce28767..1ce124f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -507,6 +507,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_DEBUGREGS 50
#endif
#define KVM_CAP_X86_ROBUST_SINGLESTEP 51
+#define KVM_CAP_X86_CPUID_FEATURE_LIST 52
#ifdef KVM_CAP_IRQ_ROUTING
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 5 +++++
arch/x86/include/asm/pvclock-abi.h | 1 +
arch/x86/kernel/kvmclock.c | 3 +++
arch/x86/kernel/pvclock.c | 4 ++++
4 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f019f8c..615ebb1 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,11 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
+/* The last 8 bits are used to indicate how to interpret the flags field
+ * in pvclock structure. If no bits are set, all flags are ignored.
+ */
+#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 0xf8
+
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index ec5c41a..b123bd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -39,5 +39,6 @@ struct pvclock_wall_clock {
u32 nsec;
} __attribute__((__packed__));
+#define PVCLOCK_STABLE_TSC (1 << 0)
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f2f6aee..aca2d3c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,4 +218,7 @@ void __init kvmclock_init(void)
clocksource_register(&kvm_clock);
pv_info.paravirt_enabled = 1;
pv_info.name = "KVM";
+
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_TSC))
+ pvclock_valid_flags(PVCLOCK_STABLE_TSC);
}
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 6cf6dec..43ae8d5 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -135,6 +135,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ if ((valid_flags & PVCLOCK_STABLE_TSC) &&
+ (shadow->flags & PVCLOCK_STABLE_TSC))
+ return ret;
+
/*
* Assumption here is that last_value, a global accumulator, always goes
* forward. If we are less than that, we should not be much smaller.
Is this necessary? Why not just make the pvclock driver maintain a
local flag set, and have the HV backend call into it to update it. Why
does it need to be part of the pvclock structure?
J
--
There are obvious other ways, but I don't see any of them being simpler.
If we go by the method you suggested, we'd have, for instance, to register
the memory area where this flag lives, which is a duplication of the
infrastructure already present in kvmclock.
Please drop this. It's not certain what the best method to reduce
contention on the global variable is. Whatever it is can be done later.
Wraparound?
Extra newline.
Non conformant indentation.
Remove this flag, it has been deprecated.
> + (1 << KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:
Also missing qemu-kvm/qemu patches.
The deprecation of this flag has nothing to do with this series.
Anyway, we can't pick its number, even if it is really deprecated,
since we can still have old hvs around, so I don't really see the point here.
>
> > + (1 << KVM_FEATURE_CLOCKSOURCE2);
> > + break;
> > default:
>
> Also missing qemu-kvm/qemu patches.
>
Of course it is missing. It is a totally separate story.
Ok, sorry. My bad here.
At a quick glance, I did not realize you were commenting on this
patch specifically. KVM_CAP_X86_CPUID_FEATURE_LIST is a new
entity, so there is no need to return a deprecated feature.
I will remove that.
If the units are nanoseconds, it's good for ~500,000 years of uptime.
J
set_pvclock_valid_flags() (or just set_pvclock_flags?
pvclock_set_flags()?). The original name looks like a getter.
--
The problem is lead time. If we (or the cpu vendors) fix this in the
hypervisor/processors in a year, then we want existing guests to take
advantage of it immediately.
--
Please document this in Documentation/kvm/cpuid.txt (also /msr.txt).
>
> #define MSR_KVM_WALL_CLOCK 0x11
> #define MSR_KVM_SYSTEM_TIME 0x12
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a2ead7f..04f04aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_MCE:
> r = KVM_MAX_MCE_BANKS;
> break;
> + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> + r = (1<< KVM_FEATURE_CLOCKSOURCE) |
> + (1<< KVM_FEATURE_NOP_IO_DELAY) |
> + (1<< KVM_FEATURE_MMU_OP) |
> + (1<< KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:
> r = 0;
> break;
>
Hmm. We already have an API to get cpuid bits:
KVM_GET_SUPPORTED_CPUID2. The nice thing about it is that it will mean
-cpu host will work out of the box.
--
Ok, from what I understand, KVM_GET_CPUID2 gets a set of features, and tells
userspace which of them are available. Right?
If this is the case, userspace could ask for 0xffffffff, and then we tell them
which of them are present.
Does that make sense?
No. KVM_GET_CPUID2 reads what was set by KVM_SET_CPUID, as modified by
the guest executing the cpuid instruction. KVM_GET_SUPPORTED_CPUID
tells userspace which bits are supported by the host cpu and kvm.
> If this is the case, userspace could ask for 0xffffffff, and then we tell them
> which of them are present.
>
> Does that make sense?
>
The API for KVM_GET_SUPPORTED_CPUID returns all cpuid leaves supported
in one go, IIRC.
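For concreteness, a hedged userspace sketch of that probe, assuming the
hypervisor reports the 0x40000001 leaf through KVM_GET_SUPPORTED_CPUID (error
handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	unsigned int i, nent = 64;
	struct kvm_cpuid2 *cpuid;

	cpuid = calloc(1, sizeof(*cpuid) + nent * sizeof(cpuid->entries[0]));
	cpuid->nent = nent;

	if (ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid) < 0) {
		perror("KVM_GET_SUPPORTED_CPUID");
		return 1;
	}

	for (i = 0; i < cpuid->nent; i++)
		if (cpuid->entries[i].function == 0x40000001)
			printf("kvm paravirt feature bits (eax): 0x%x\n",
			       (unsigned)cpuid->entries[i].eax);
	return 0;
}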
--
Unfortunately this is breaking Xen save/restore: if you restore on a
host which was booted more recently than the save host, the system
time will be smaller. The effect is that the domain's time leaps
forward to a fixed point, and stays there until the host catches up to
the source host...
I guess last_time needs to be reset on this type of event. I guess the
cleanest way would be for pvclock.c to register a sysdev suspend/resume
handler.
J
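A sketch of the reset being suggested here, assuming the atomic64_t
last_value from the patch above; the exact hook point (sysdev resume, the Xen
restore path) is left open, and the name is illustrative.

/* Hypothetical resume hook: forget the pre-save clamp so a smaller
 * post-restore system time is not pinned to the old value forever. */
void pvclock_resume(void)
{
	atomic64_set(&last_value, 0);
}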
> Signed-off-by: Glauber Costa <glo...@redhat.com>
> CC: Jeremy Fitzhardinge <jer...@goop.org>
> CC: Avi Kivity <a...@redhat.com>
> CC: Marcelo Tosatti <mtos...@redhat.com>
> CC: Zachary Amsden <zam...@redhat.com>
> ---
> arch/x86/kernel/pvclock.c | 23 +++++++++++++++++++++++
> 1 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 03801f2..b7de0e6 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -109,11 +109,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
> return pv_tsc_khz;
> }
>
> +static u64 last_value = 0;
> +
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> struct pvclock_shadow_time shadow;
> unsigned version;
> cycle_t ret, offset;
> + u64 last;
>
> do {
> version = pvclock_get_time_values(&shadow, src);
> @@ -123,6 +126,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> barrier();
> } while (version != src->version);
>
> + /*
> + * Assumption here is that last_value, a global accumulator, always goes
> + * forward. If we are less than that, we should not be much smaller.
> + * We assume there is an error marging we're inside, and then the correction
> + * does not sacrifice accuracy.
> + *
> + * For reads: global may have changed between test and return,
> + * but this means someone else updated poked the clock at a later time.
> + * We just need to make sure we are not seeing a backwards event.
> + *
> + * For updates: last_value = ret is not enough, since two vcpus could be
> + * updating at the same time, and one of them could be slightly behind,
> + * making the assumption that last_value always go forward fail to hold.
> + */
> + do {
> + last = last_value;
> + if (ret < last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
> +
> return ret;
> }
>
--
Shouldn't save/restore also save the timebase?
> I guess last_time needs to be reset on this type of event. I guess the
> cleanest way would be for pvclock.c to register a sysdev suspend/resume
> handler.
Should be for Xen only; kvm save/restore doesn't involve the guest.
--
The suspend/resume path does adjust the time base. Maybe something similar
should be done?
Xen doesn't guarantee the system time is monotonic across those kinds of
events. The domain could maintain its own offset to maintain an
illusion of monotonicity, but I think it's simpler to just zero
last_value on resume.
J