And if we ever add a new cpuid bit in the future, we have to do that again,
which creates some complexity and delays feature adoption.
Instead, what I'm proposing in this patch is a new capability, called
KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the feature list currently
supported by the hypervisor. If we ever want to add or remove a feature,
we only need to change the HV, leaving userspace untouched.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm.h | 1 +
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 714aae2..74f0dc3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MCE:
r = KVM_MAX_MCE_BANKS;
break;
+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
+ r = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_MMU_OP) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2);
+ break;
default:
r = 0;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ce28767..1ce124f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -507,6 +507,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_DEBUGREGS 50
#endif
#define KVM_CAP_X86_ROBUST_SINGLESTEP 51
+#define KVM_CAP_X86_CPUID_FEATURE_LIST 52
#ifdef KVM_CAP_IRQ_ROUTING
--
1.6.2.2
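For illustration, this is roughly how userspace would probe the proposed
capability. A sketch only, with error handling trimmed:
KVM_CAP_X86_CPUID_FEATURE_LIST is the constant this patch introduces (not
present in released headers), and the bit tested is KVM_FEATURE_CLOCKSOURCE2
(bit 3) from asm/kvm_para.h.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_X86_CPUID_FEATURE_LIST
#define KVM_CAP_X86_CPUID_FEATURE_LIST 52	/* value proposed by this patch */
#endif

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int features;

	if (kvm < 0)
		return 1;

	/* KVM_CHECK_EXTENSION returns the bitmap built in
	 * kvm_dev_ioctl_check_extension() above */
	features = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_X86_CPUID_FEATURE_LIST);
	if (features > 0 && (features & (1 << 3)))	/* KVM_FEATURE_CLOCKSOURCE2 */
		printf("host supports the new kvmclock MSRs\n");
	return 0;
}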
This happens even on some machines that report constant_tsc in their tsc flags,
especially on multi-socket ones.
Two reads of the same kernel timestamp at approximately the same time will
likely be tsc-timestamped at different instants too. This means the delta we
calculate is unpredictable at best, and can probably be smaller on a cpu
that is legitimately reading the clock at a later instant.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I thought about using a shared variable anyway, to hold the
clock's last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global value, we know that we need to
return a higher one, even though it is not exactly the one we compared against.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, on any of the machines where I saw them before.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
CC: Avi Kivity <a...@redhat.com>
CC: Marcelo Tosatti <mtos...@redhat.com>
CC: Zachary Amsden <zam...@redhat.com>
---
arch/x86/kernel/pvclock.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..b7de0e6 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -109,11 +109,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}
+static u64 last_value = 0;
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+ u64 last;
do {
version = pvclock_get_time_values(&shadow, src);
@@ -123,6 +126,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ /*
+ * Assumption here is that last_value, a global accumulator, always goes
+ * forward. If we are less than that, we should not be much smaller.
+ * We assume there is an error marging we're inside, and then the correction
+ * does not sacrifice accuracy.
+ *
+ * For reads: global may have changed between test and return,
+ * but this means someone else updated poked the clock at a later time.
+ * We just need to make sure we are not seeing a backwards event.
+ *
+ * For updates: last_value = ret is not enough, since two vcpus could be
+ * updating at the same time, and one of them could be slightly behind,
+ * making the assumption that last_value always go forward fail to hold.
+ */
+ do {
+ last = last_value;
+ if (ret < last)
+ return last;
+ } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
+
return ret;
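For readers who want to poke at the clamp outside the kernel, here is a
minimal user-space sketch of the same idea, assuming C11 atomics in place of
cmpxchg64; raw_clock_read() is a hypothetical stand-in for the per-vcpu
delta-based computation above.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t last_value;	/* global accumulator, only moves forward */

uint64_t raw_clock_read(void);		/* hypothetical per-cpu, possibly non-monotonic read */

uint64_t monotonic_clock_read(void)
{
	uint64_t ret = raw_clock_read();
	uint64_t last = atomic_load(&last_value);

	for (;;) {
		if (ret < last)
			return last;	/* someone already saw a later time; never go back */
		/* try to publish our reading; on failure 'last' is refreshed */
		if (atomic_compare_exchange_weak(&last_value, &last, ret))
			return ret;
	}
}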
However, kvm clock registration happens very early, and if we ever
try to write to a non-existent MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs is supported.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++
arch/x86/kernel/kvmclock.c | 68 +++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 0cffb96..a32710a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* We could just try to use new msr values, but they are queried very early,
+ * kernel does not have idt handlers yet, and failures are fatal */
+#define KVM_FEATURE_CLOCKSOURCE2 3
+
#define MSR_KVM_WALL_CLOCK_OLD 0x11
#define MSR_KVM_SYSTEM_TIME_OLD 0x12
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..6d814ce 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,7 @@
#define KVM_SCALE 22
static int kvmclock = 1;
+static int kvm_use_new_msrs = 0;
static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +42,18 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;
+static int kvm_system_time_write_value(int low, int high)
+{
+ if (kvm_use_new_msrs)
+ return native_write_msr_safe(MSR_KVM_SYSTEM_TIME_OLD, low, high);
+ else
+ return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+}
+
+static void kvm_turnoff_clock(void)
+{
+ kvm_system_time_write_value(0, 0);
+}
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -54,7 +67,11 @@ static unsigned long kvm_get_wallclock(void)
low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
- native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+ if (kvm_use_new_msrs)
+ native_write_msr_safe(MSR_KVM_WALL_CLOCK, low, high);
+ else
+ native_write_msr(MSR_KVM_WALL_CLOCK_OLD, low, high);
vcpu_time = &get_cpu_var(hv_clock);
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +147,8 @@ static int kvm_register_clock(char *txt)
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
- return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+ return kvm_system_time_write_value(low, high);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +183,14 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#ifdef CONFIG_KEXEC
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ kvm_turnoff_clock();
native_machine_crash_shutdown(regs);
}
#endif
static void kvm_shutdown(void)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ kvm_turnoff_clock();
native_machine_shutdown();
}
@@ -181,27 +199,35 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
+ if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
+ kvm_use_new_msrs = 1;
+ else if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE))
+ kvm_use_new_msrs = 0;
+ else
+ return;
+
+ printk(KERN_INFO "kvm-clock: %ssing clocksource new msrs",
+ kvm_use_new_msrs ? "U": "Not u");
+
+ if (kvm_register_clock("boot clock"))
+ return;
+ pv_time_ops.sched_clock = kvm_clock_read;
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
+ x86_cpuinit.setup_percpu_clockev =
+ kvm_setup_secondary_clock;
#endif
#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#endif
- machine_ops.shutdown = kvm_shutdown;
+ machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
- kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- }
+ kvm_get_preset_lpj();
+ clocksource_register(&kvm_clock);
+ pv_info.paravirt_enabled = 1;
+ pv_info.name = "KVM";
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 8 ++++++--
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..0cffb96 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -17,8 +17,12 @@
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
-#define MSR_KVM_WALL_CLOCK 0x11
-#define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_WALL_CLOCK_OLD 0x11
+#define MSR_KVM_SYSTEM_TIME_OLD 0x12
+
+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME 0x4b564d01
#define KVM_MAX_MMU_OP_BATCH 32
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..714aae2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_OLD, MSR_KVM_WALL_CLOCK_OLD,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_OLD:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_OLD:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1374,9 +1377,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = vcpu->arch.efer;
break;
case MSR_KVM_WALL_CLOCK:
+ case MSR_KVM_WALL_CLOCK_OLD:
data = vcpu->kvm->arch.wall_clock;
break;
case MSR_KVM_SYSTEM_TIME:
+ case MSR_KVM_SYSTEM_TIME_OLD:
data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 138 insertions(+), 0 deletions(-)
create mode 100644 Documentation/kvm/kvmclock.txt
diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
new file mode 100644
index 0000000..21008bb
--- /dev/null
+++ b/Documentation/kvm/kvmclock.txt
@@ -0,0 +1,138 @@
+KVM Paravirtual Clocksource driver
+Glauber Costa, Red Hat Inc.
+==================================
+
+1. General Description
+=======================
+
+Keeping time in virtual machine is acknowledged as a hard problem. The most
+basic mode of operation, usually done by older guests, assumes a fixed length
+between timer interrupts. It then counts the number of interrupts and
+calculates elapsed time. This method fails easily in virtual machines, since
+we can't guarantee that the virtual interrupt will be delivered in time.
+
+Another possibility is to emulate modern devices like HPET, or any other we
+see fit. A modern guest which implements something like the clocksource
+infrastructure, can then ask this virtual device about current time when it
+needs to. The problem with this approach, is that it bumps the guest out
+of guest mode operation, and in some cases, even to userspace very frequently.
+
+In this context, the best approach is to provide the guest with a
+virtualization-aware (paravirtual) clock device. It the asks the hypervisor
+about current time, guaranteeing both stable and accurate timekeeping.
+
+2. kvmclock basics
+===========================
+
+When supported by the hypervisor, guests can register a memory page
+to contain kvmclock data. This page has to be present in guest's address space
+throughout its whole life. The hypervisor continues to write to it until it is
+explicitly disabled or the guest is turned off.
+
+2.1 kvmclock availability
+-------------------------
+
+Guests that want to take advantage of kvmclock should first check its
+availability through cpuid.
+
+kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
+the present of kvmclock. Bit 0 indicates that kvmclock is present, but the
+old MSR set must be used. See section 2.3 for details.
+
+2.2 kvmclock functionality
+--------------------------
+
+Two MSRs are provided by the hypervisor, controlling kvmclock operation:
+
+ * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
+ * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
+
+The first one is only used in rare situations, like boot-time and a
+suspend-resume cycle. Data is disposable, and after used, the guest
+may use it for something else. This is hardly a hot path for anything.
+The Hypervisor fills in the address provided through this MSR with the
+following structure:
+
+struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+} __attribute__((__packed__));
+
+Guest should only trust data to be valid when version haven't changed before
+and after reads of sec and nsec. Besides not changing, it has to be an even
+number. Hypervisor may write an odd number to version field to indicate that
+an update is in progress.
+
+MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is
+constantly updated by the hypervisor with time information. The data
+written in this MSR contains two pieces of information: the address in which
+the guests expects time data to be present 4-byte aligned or'ed with an
+enabled bit. If one wants to shutdown kvmclock, it just needs to write
+anything that has 0 as its last bit.
+
+Time information presented by the hypervisor follows the structure:
+
+struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 pad[3];
+} __attribute__((__packed__));
+
+The version field plays the same role as with the one in struct
+pvclock_wall_clock. The other fields, are:
+
+ a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of
+ this cpu at the moment we recorded system_time. Note that some time is
+ inevitably spent between system_time and tsc_timestamp measurements.
+ Guests can subtract this quantity from the current value of tsc to obtain
+ a delta to be added to system_time
+
+ b. system_time: this is the most recent host-time we could be provided with.
+ host gets it through ktime_get_ts, using whichever clocksource is
+ registered at the moment
+
+ c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
+ by in order to obtain time in nanoseconds. Hypervisor is free to change
+ this value in face of events like cpu frequency change, pcpu migration,
+ etc.
+
+ d. tsc_shift: guests must shift
+
+With this information available, guest calculates current time as:
+
+ T = kt + to_nsec(tsc - tsc_0)
+
+2.3 Compatibility MSRs
+----------------------
+
+Guests running on top of older hypervisors may have to use a different set of
+MSRs. This is because originally, kvmclock MSRs were exported within a
+reserved range by accident. Guests should check cpuid leaf 0x40000001 for the
+presence of kvmclock. If bit 3 is disabled, but bit 0 is enabled, guests can
+have access to kvmclock functionality through
+
+ * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and
+ * MSR_KVM_SYSTEM_TIME_OLD, value 0x12.
+
+Note, however, that this is deprecated.
+
+3. Migration
+============
+
+Two ioctls are provided to aid the task of migration:
+
+ * KVM_GET_CLOCK and
+ * KVM_SET_CLOCK
+
+Their aim is to control an offset that can be summed to system_time, in order
+to guarantee monotonicity on the time over guest migration. Source host
+executes KVM_GET_CLOCK, obtaining the last valid timestamp in this host, while
+destination sets it with KVM_SET_CLOCK. It's the destination responsibility to
+never return time that is less than that.
+
+
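For readers less familiar with the protocol of section 2.2, a minimal
guest-side sketch of the read path, assuming the structure layout documented
above; the 128-bit multiply is a GCC/clang extension standing in for the
kernel's scale-delta helper, and rdtsc_value is whatever the guest read from
the TSC.

#include <stdint.h>

struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  pad[3];
} __attribute__((__packed__));

static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;
	/* multiply by the 32.32 fixed-point factor, keep the integer part */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

uint64_t pvclock_read(volatile struct pvclock_vcpu_time_info *ti,
		      uint64_t rdtsc_value)
{
	uint32_t version, mul;
	uint64_t system_time, tsc_timestamp;
	int8_t shift;

	do {
		version = ti->version;
		__asm__ __volatile__("" ::: "memory");	/* snapshot inside the version check */
		tsc_timestamp = ti->tsc_timestamp;
		system_time   = ti->system_time;
		mul           = ti->tsc_to_system_mul;
		shift         = ti->tsc_shift;
		__asm__ __volatile__("" ::: "memory");
	} while ((version & 1) || version != ti->version);

	return system_time + scale_delta(rdtsc_value - tsc_timestamp, mul, shift);
}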
> This patch adds a new file, kvm/kvmclock.txt, describing
> the mechanism we use in kvmclock.
>
> Signed-off-by: Glauber Costa <glo...@redhat.com>
> ---
> Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 138 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/kvmclock.txt
>
> diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
> new file mode 100644
> index 0000000..21008bb
> --- /dev/null
> +++ b/Documentation/kvm/kvmclock.txt
> @@ -0,0 +1,138 @@
> +KVM Paravirtual Clocksource driver
> +Glauber Costa, Red Hat Inc.
> +==================================
> +
> +1. General Description
> +=======================
> +
...
> +
> +2. kvmclock basics
> +===========================
> +
> +When supported by the hypervisor, guests can register a memory page
> +to contain kvmclock data. This page has to be present in guest's address space
> +throughout its whole life. The hypervisor continues to write to it until it is
> +explicitly disabled or the guest is turned off.
> +
> +2.1 kvmclock availability
> +-------------------------
> +
> +Guests that want to take advantage of kvmclock should first check its
> +availability through cpuid.
> +
> +kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
> +the present of kvmclock. Bit 0 indicates that kvmclock is present, but the
presence
but it's confusing. Is it bit 3 or bit 0? They seem to indicate the same thing.
> +old MSR set must be used. See section 2.3 for details.
"old MSR set": what does this mean?
> +
> +2.2 kvmclock functionality
> +--------------------------
> +
> +Two MSRs are provided by the hypervisor, controlling kvmclock operation:
> +
> + * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
> + * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
> +
> +The first one is only used in rare situations, like boot-time and a
> +suspend-resume cycle. Data is disposable, and after used, the guest
> +may use it for something else. This is hardly a hot path for anything.
> +The Hypervisor fills in the address provided through this MSR with the
> +following structure:
> +
> +struct pvclock_wall_clock {
> + u32 version;
> + u32 sec;
> + u32 nsec;
> +} __attribute__((__packed__));
> +
> +Guest should only trust data to be valid when version haven't changed before
has not
CPU (please)
> + inevitably spent between system_time and tsc_timestamp measurements.
> + Guests can subtract this quantity from the current value of tsc to obtain
> + a delta to be added to system_time
to system_time.
> +
> + b. system_time: this is the most recent host-time we could be provided with.
> + host gets it through ktime_get_ts, using whichever clocksource is
> + registered at the moment
moment.
> +
> + c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
> + by in order to obtain time in nanoseconds. Hypervisor is free to change
> + this value in face of events like cpu frequency change, pcpu migration,
CPU
> + etc.
> +
> + d. tsc_shift: guests must shift
missing text??
---
~Randy
All comments are relevant. I'll update the document to cover them.
As for your question: Both bit 3 and 0 are used, yes. But they
tell the guest to use a different MSR pair. However, this document
is intended for people not familiar with kvmclock. If you read it,
and could not extract that from the text, it definitely needs to
be augmented.
Thanks!
__cacheline_aligned_in_smp to keep other data from sharing the
cacheline.
Don't you need to handle wrap-around?
Is that "1.5 million"?
> (to be fair, it wasn't that bad in most of them). Investigating further, I
> found out that such warps were caused by the very offset-based calculation
> pvclock is based on.
>
Is the problem that the tscs are starting out of sync, or that they're
drifting relative to each other over time? Do the problems become worse
the longer the uptime? How large are the offsets we're talking about here?
Does this need a barrier() to prevent the compiler from re-reading
last_value for the subsequent lines? Otherwise "(ret < last)" and
"return last" could execute with different values for "last".
> + if (ret < last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
>
So if CPU A's tsc is, say, 50us behind CPU B's, then it will return a
static time (from last_time) for 50us until its tsc catches up - or if
CPU A happens to update last_time to give it a kick?
Is it worth trying to update CPU A's tsc offset at the same time?
J
This is one source of the problem, but the same thing happens at many
levels... tsc may start out of sync, drift between sockets, be badly
re-calibrated by the BIOS, etc... the issue persists even if the TSCs
are perfectly in sync - the measurement of them is not.
So reading TSC == 100,000 units at time A and then waiting 10 units, one
may read TSC == 100,010 +/- 5 units because the code stream is not
perfectly serialized - nor can it be. There will always be some amount
of error unless running in perfect lock-step, which only happens in a
simulator.
This inherent measurement error can cause apparent time to go backwards
when measured simultaneously across multiple CPUs, or when
re-calibrating against an external clocksource. Combined with other
factors as above, it can be of sufficient magnitude to be noticed. KVM
clock is particularly exposed to the problem because the TSC is measured
and recalibrated for each virtual CPU whenever there is a physical CPU
switch, so micro-adjustments forwards and backwards may occur during the
recalibration - and appear as a real backwards time warp to the guest.
I have some patches to fix that issue, but the SMP problem remains to be
fixed - and is addressed quite thoroughly by this patch.
Zach
Please define a cpuid bit that makes this optional. When we eventually
enable it in the future, it will allow a wider range of guests to enjoy it.
>
> +static u64 last_value = 0;
>
Needs to be atomic64_t.
> +
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> struct pvclock_shadow_time shadow;
> unsigned version;
> cycle_t ret, offset;
> + u64 last;
>
>
> + do {
> + last = last_value;
>
Otherwise, this assignment can see a partial update.
> + if (ret< last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
> +
> return ret;
> }
>
>
--
On a 32-bit guest, that is.
Hm. We need to update userspace anyway, since we don't like turning
features on unconditionally (it breaks live migration into an older kernel).
--
This comment has no place in a header that's oriented to userspace.
Instead, it's better to document what this bit exposes.
>
> static int kvmclock = 1;
> +static int kvm_use_new_msrs = 0;
>
Instead of this and all those if ()s all over the place, you can define
variables for the MSR indices. Set them up on initialization, the rest
of the code just looks them up.
--
This is exposed to userspace. Userspace that is compiled with the new
headers, but runs on an old kernel, will break.
So you need to keep the old names, and define a new KVM_FEATURE for the
new names.
--
atomic64_read() _is_ cmpxchg64b. Are you thinking of some clever
implementation for smp i386?
--
Fun, we also have:
X86_FEATURE_NONSTOP_TSC, which states the thing doesn't stop in C
states.
Right, do bear in mind that the x86 implementation of atomic64_read() is
terrifyingly expensive, it is better to not do that read and simply use
the result of the cmpxchg.
Right, so on x86 we have:
X86_FEATURE_CONSTANT_TSC, which only states that TSC is frequency
independent, not that it doesn't stop in C states and similar fun stuff.
X86_FEATURE_TSC_RELIABLE, which IIRC should indicate the TSC is constant
and synced between cores.
--
Sockets and boards too? (IOW, how reliable is TSC_RELIABLE)?
--
All of them? I thought tsc stops in some mwait deep REM sleep thing.
So what do we need? test for both TSC_RELIABLE and NONSTOP_TSC? IMO
TSC_RELIABLE should imply NONSTOP_TSC.
--
The idea is that TSC will not stop ever (except s2r like stuff), not
sure about the actual implementation of the NONSTOP_TSC bit though.
> So what do we need? test for both TSC_RELIABLE and NONSTOP_TSC? IMO
> TSC_RELIABLE should imply NONSTOP_TSC.
Yeah, I think RELIABLE does imply NONSTOP and CONSTANT, but NONSTOP &&
CONSTANT does not make RELIABLE.
Worrying. By the time we detect this the guest may already have gotten
confused by clocks going backwards.
Not sure, IIRC we clear that when the TSC sync test fails, eg when we
mark the tsc clocksource unusable.
--
No, what I was suggesting was to rewrite that loop not to need the
initial read but use the cmpxchg result of the previous iteration.
Something like:
u64 last = 0;

/* more stuff */

do {
	if (ret < last)
		return last;
	last = cmpxchg64(&last_value, last, ret);
} while (last != ret);
That only has a single cmpxchg8 in there per loop instead of two
(avoiding the atomic64_read() one).
I think it's implied with atomic64_read().
--
Yes it would be. I was merely trying to point out that
last = ACCESS_ONCE(last_value);
Is a narrower way of writing:
last = last_value;
barrier();
In that it need not clobber all memory locations and makes it instantly
clear what we want the barrier for.
Still have two cmpxchgs in the common case. The first iteration will
fail, fetching last_value, the second will work.
It will be better when we have contention, though, so it's worthwhile.
--
Right, another option is to put the initial read outside of the loop,
that way you'll have the best of all cases, a single LOCK'ed op in the
loop, and only a single LOCK'ed op for the fast path on sensible
architectures ;-)
last = atomic64_read(&last_value);
do {
	if (ret < last)
		return last;
	last = atomic64_cmpxchg(&last_value, last, ret);
} while (unlikely(last != ret));
Or so.
Oh yes, just trying to avoid a patch with both atomic64_read() and
ACCESS_ONCE().
--
The manual says:
> 16.11.1 Invariant TSC
>
> The time stamp counter in newer processors may support an enhancement,
> referred
> to as invariant TSC. Processor’s support for invariant TSC is indicated by
> CPUID.80000007H:EDX[8].
> The invariant TSC will run at a constant rate in all ACPI P-, C-. and
> T-states. This is
> the architectural behavior moving forward. On processors with
> invariant TSC
> support, the OS may use the TSC for wall clock timer services (instead
> of ACPI or
> HPET timers). TSC reads are much more efficient and do not incur the
> overhead
> associated with a ring transition or access to a platform resource.
and this maps to NONSTOP, so I think we're fine.
--
Yes, looks optimal when !NONSTOP_TSC.
--
But as I said, this seems to be a very deep worst case scenario. Most boxes
are not even close to being that bad.
>
> > (to be fair, it wasn't that bad in most of them). Investigating further, I
> > found out that such warps were caused by the very offset-based calculation
> > pvclock is based on.
> >
>
> Is the problem that the tscs are starting out of sync, or that they're
> drifting relative to each other over time? Do the problems become worse
> the longer the uptime? How large are the offsets we're talking about here?
The offsets usually seem pretty small, under a microsecond. So I don't think
it has anything to do with tscs starting out of sync. Especially because the
delta-based calculation has the exact purpose of handling that case.
This patch writes last_value atomically, but reads it non-atomically. A
barrier is insufficient.
--
No. On i386, the statement
last = last_value;
will be split by the compiler into two 32-bit loads. If a write
(atomic, using cmpxchg) on another cpu happens between those two loads,
then the variable last will have a corrupted value.
--
What avi says! :-)
On a 32bit machine a 64bit read is two 32bit reads, so
	last = last_value;
becomes:
	last.high = last_value.high;
	last.low = last_value.low;
(or the reverse of course)
Now imagine a write getting interleaved with that ;-)
All this ioctl says is: "This is the stuff currently supported in this HV".
It does not mandate userspace to expose all of this to the guest. It just saves
us from the job of creating yet another CAP for every bit we plan on including.
If we want to be conservative, we can keep everything but the things we know
already disabled, in userspace.
You could explicitly do:
do {
	h = last.high;
	barrier();
	l = last.low;
	barrier();
} while (last.high != h);
This works because we expect last to be always increasing, so the only
worry is low wrapping and incrementing high, and is more efficient than
making the read fully atomic (the write is still cmpxchg64). But it's
pretty ugly to open code just for 32b architectures; it's something that
might be useful to turn into a general abstraction (monotonic_read_64
FTW!). I already have code like this in the Xen time code, so I could
make immediate use of it.
J
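One possible shape of the monotonic_read_64() abstraction mentioned above, as
a sketch only (the names are illustrative, not an existing API); it relies on
the 64-bit value only ever moving forward and on the writer updating it
atomically, e.g. via cmpxchg64.

struct mono64 {
	volatile u32 low;
	volatile u32 high;
};

static inline u64 monotonic_read_64(const struct mono64 *v)
{
	u32 h, l;

	do {
		h = v->high;
		barrier();
		l = v->low;
		barrier();
	} while (v->high != h);

	/* high was stable across the low read, and it never decreases, so
	 * (h, l) is a value the variable actually held at some point */
	return ((u64)h << 32) | l;
}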
So you think they're drifting out of sync from an initially synced
state? If so, what would bound the drift?
J
Well, on a 32b system, you can explicitly order the updates of low and
high, then do a high-low-checkhigh read. That would be much more
efficient than atomic64. If we really care about 32b.
J
Marcelo can probably confirm it, but he has a nehalem with an apparently
very good tsc source. Even this machine warps.
It stops warping if we only write the pvclock data structure once and forget it
(which only updates tsc_timestamp once), according to him.
Obviously, we can't do that everywhere.
Upstream, we are marking the TSC unstable preemptively when hardware
which will eventually fail the sync test is detected, so this should be fine.
Zach
Yes.
> Marcelo can probably confirm it, but he has a nehalem with an appearently
> very good tsc source. Even this machine warps.
>
> It stops warping if we only write pvclock data structure once and forget it,
> (which only updated tsc_timestamp once), according to him.
Yes. So it's not as if the guest-visible TSCs go out of sync (they don't
on this machine Glauber mentioned, or even on a multi-core Core 2 Duo),
but the delta calculation is very hard (if not impossible) to get right.
The timewarps I've seen were in the 0-200ns range, and very rare (once
every 10 minutes or so).
(that's a serious bug wrt migration, btw)
> But let's suppose we have: What's the difference between disabling
> it in the way it is now, and disabling it with the method I am proposing?
>
No difference.
> All this ioctl says is: "This is the stuff currently supported in this HV".
> It does not mandate userspace to expose all of this to the guest. It just saves
> us from the job of creating yet another CAP for every bit we plan on including.
>
Right. Well, creating a new CAP and creating a new FEATURE flag aren't
very different, and I'd like to avoid API churn. We have enough new
APIs due to missing or badly implemented features; I'd like to avoid new
ones whenever possible.
It's not like it saves userspace anything, it has to accommodate older
kernels anyhow. We are able to ignore pre 2.6.27 kernels, but now with
kvm shipped in long term support distributions, deprecating APIs will be
much harder.
> If we want to be conservative, we can keep everything but the things we know
> already disabled, in userspace.
>
We definitely need to do that.
--
ENOPARSE?
--
I don't think this is worthwhile - the cmpxchg is not that expensive on
most kvm capable hosts (the exception is the Pentium D).
btw, do you want this code in pvclock.c, or shall we keep it kvmclock
specific?
--
Might be due to NMIs or SMIs interrupting the rdtsc(); ktime_get()
operation which establishes the timeline. We could limit it by having a
loop doing rdtsc(); ktime_get(); rdtsc(); and checking for some bound,
but it isn't worthwhile (and will break nested virtualization for
sure). Better to have the option to calibrate kvmclock just once on
machines with X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY.
--
We can detect that, and besides doing the calculation only once, also export some
bit indicating that to the guest. Humm... I'm thinking now, that because of
migration, we should check this bit every time, because we might have changed hosts.
So instead of using an expensive cpuid check, we should probably use some bit in
the vcpu_time_info structure, and use a cpuid bit just to say it is enabled.
Jeremy,
are you okay with turning one of the pad fields in the structure into a flags field?
Right, we need a bit in the structure itself that says you can trust the
time not to go backwards, and a bit in cpuid that says you can trust the
bit in the structure (unless we can be sure no implementations leak
garbage into the padding field).
--
People are reporting weirdnesses that "clocksource=jiffies" apparently
resolves. Xen and KVM are faced with the same hardware constraints, and
it wouldn't surprise me if there were small measurable
non-monotonicities in the PV clock under Xen. May as well be safe.
Of course, it kills any possibility of being able to usefully export
this interface down to usermode.
My main concern about this kind of simple fix is that if there's a long
term systematic drift between different CPU's tscs, then this will
somewhat mask the problem while giving really awful time measurement on
the "slow" CPU(s). In that case it really needs to adjust the scaling
factor to correct for the drift (*not* update the offset). But if we're
definitely only talking about fixed, relatively small time offsets then
it is fine.
J
Did you observe drift on Xen, or is this "ultimately" pointing at the
future?
--
I think it's a pvclock-level fix. I'd been hoping to avoid having
something like this, but I think its ultimately necessary.
J
Does the drift only occur on SMP VMs?
There's a perfect way to do this and it still fails to stop timewarps.
You can set the performance counters to overflow if more instructions
are issued than your code path, run an assembly instruction stream and
if the performance interrupt hits, restart the calibration.
The calibration happens not just once, but on every migration, and
currently, I believe, on every VCPU switch. Even if we reduce the
number of calibrations to the bare minimum and rule out SMIs and NMIs,
there will still be variation due to factors beyond our control because
of the unpredictable nature of cache and instruction issue.
However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply one
key feature which the code is missing today: on SMP VMs, the
calibration of kvmclock needs to be done only once, and the clock can
then be used for all VCPUs. That, I think, stops Glauber's bug from
appearing on the server side.
I will spin that into my web of patches and send the cocoon out sometime
this evening.
Zach
Instead of detecting TSC warp, c1e_idle, power_saving_mwait_init,
tsc_check_state, dmi_mark_tsc_unstable all do something similar to this
to disable TSC before warp even occurs:
static void c1e_idle(void)
{
	if (need_resched())
		return;
	if (!c1e_detected) {
		u32 lo, hi;
		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
			c1e_detected = 1;
			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
				mark_tsc_unstable("TSC halt in AMD C1E");
			printk(KERN_INFO "System has AMD C1E enabled\n");
			set_cpu_cap(&boot_cpu_data, X86_FEATURE_AMDC1E);
It's completely impractical. The PMU is a global resource that is
already shared among users and the host; programming and restoring it is
expensive; and in a virtualized environment the whole scheme may fail.
>
> The calibration happens not just once, but on every migration, and
> currently, I believe, on every VCPU switch. Even if we reduce the
> number of calibrations to the bare minimum and rule out SMIs and NMIs,
> there will still be variation due to factors beyond our control
> because of the unpredictable nature of cache and instruction issue.
Right.
>
> However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply
> one key feature which the code is missing today: on SMP VMs, the
> calibration of kvmclock needs to be done only once, and the clock can
> then be used for all VCPUs. That, I think, stops Glauber's bug from
> appearing on the server side.
That's the plan.
--
That works within a socket; but multiple socket machines need not feed
all sockets from the same crystal and reset line (esp. likely for
multiple board machines, but might even happen with single board
FSB-less processors).
--
You need to define TOD to 1, and leave out the TSC test.
For me, warps exist on every machine out there but the nehalems, so far.
Plus, replace cpuid by lfence/mfence. cpuid will trap.
--
Right. But I'm not sure lfence/mfence are sufficient - from what I
understand rdtsc can pass right through them.
I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?
J
Does lfence / mfence actually serialize? I thought there was some great
confusion about that not being the case on all AMD processors, and
possibly not at all on Intel.
A trap, however, is a great way to serialize.
I think there is no serializing instruction which can be used from
userspace and does not trap; at least, I don't know one off the top of
my head.
Zach
rsm is not technically privileged... but not quite usable from usermode ;)
J
rsm under hardware virtualization makes my head hurt
They don't.
>
> A trap, however is a great way to serialize.
>
> I think, there is no serializing instruction which can be used from
> userspace which does not trap, at least, I don't know one off the top
> of my head.
iret.
--
Either one independently is sufficient for me.
--
This happens even on some machines that report constant_tsc in their tsc flags,
especially on multi-socket ones.
Two reads of the same kernel timestamp at approximately the same time will
likely be tsc-timestamped at different instants too. This means the delta we
calculate is unpredictable at best, and can probably be smaller on a cpu
that is legitimately reading the clock at a later instant.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I thought about using a shared variable anyway, to hold the
clock's last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global value, we know that we need to
return a higher one, even though it is not exactly the one we compared against.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, on any of the machines where I saw them before.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
CC: Avi Kivity <a...@redhat.com>
CC: Marcelo Tosatti <mtos...@redhat.com>
CC: Zachary Amsden <zam...@redhat.com>
---
arch/x86/kernel/pvclock.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 8f4af7b..6cf6dec 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -118,11 +118,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
}
+static atomic64_t last_value = ATOMIC64_INIT(0);
+
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+ u64 last;
do {
version = pvclock_get_time_values(&shadow, src);
@@ -132,6 +135,27 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ /*
+ * Assumption here is that last_value, a global accumulator, always goes
+ * forward. If we are less than that, we should not be much smaller.
+ * We assume there is an error marging we're inside, and then the correction
+ * does not sacrifice accuracy.
+ *
+ * For reads: global may have changed between test and return,
+ * but this means someone else updated poked the clock at a later time.
+ * We just need to make sure we are not seeing a backwards event.
+ *
+ * For updates: last_value = ret is not enough, since two vcpus could be
+ * updating at the same time, and one of them could be slightly behind,
+ * making the assumption that last_value always go forward fail to hold.
+ */
+ last = atomic64_read(&last_value);
+ do {
+ if (ret < last)
+ return last;
+ last = atomic64_cmpxchg(&last_value, last, ret);
+ } while (unlikely(last != ret));
+
return ret;
}
--
1.6.2.2
Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually decided via HV negotiation.
Signed-off-by: Glauber Costa <glo...@redhat.com>
CC: Jeremy Fitzhardinge <jer...@goop.org>
---
arch/x86/include/asm/pvclock-abi.h | 3 ++-
arch/x86/include/asm/pvclock.h | 1 +
arch/x86/kernel/pvclock.c | 9 +++++++++
3 files changed, 12 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64 system_time;
u32 tsc_to_system_mul;
s8 tsc_shift;
- u8 pad[3];
+ u8 flags;
+ u8 pad[2];
} __attribute__((__packed__)); /* 32 bytes */
struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
/* some helper functions for xen and kvm pv clock sources */
cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+ u8 flags;
};
+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+ valid_flags = flags;
+}
+
/*
* Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
* yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
dst->system_timestamp = src->system_time;
dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
dst->tsc_shift = src->tsc_shift;
+ dst->flags = src->flags;
rmb(); /* test version after fetching data */
} while ((src->version & 1) || (dst->version != src->version));
However, kvm clock registration happens very early, and if we ever
try to write to a non-existent MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs is supported.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/kernel/kvmclock.c | 56 +++++++++++++++++++++++++++----------------
1 files changed, 35 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..f2f6aee 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,8 @@
#define KVM_SCALE 22
static int kvmclock = 1;
+static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +43,7 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -54,7 +57,8 @@ static unsigned long kvm_get_wallclock(void)
low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
- native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+ native_write_msr_safe(msr_kvm_wall_clock, low, high);
vcpu_time = &get_cpu_var(hv_clock);
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +134,8 @@ static int kvm_register_clock(char *txt)
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
- return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+ return native_write_msr_safe(msr_kvm_system_time, low, high);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +170,15 @@ static void __init kvm_smp_prepare_boot_cpu(void)
#ifdef CONFIG_KEXEC
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_crash_shutdown(regs);
}
#endif
static void kvm_shutdown(void)
{
- native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+ native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_shutdown();
}
@@ -181,27 +187,35 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
+ if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ }
+ else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ return;
+
+ printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ msr_kvm_system_time, msr_kvm_wall_clock);
+
+ if (kvm_register_clock("boot clock"))
+ return;
+ pv_time_ops.sched_clock = kvm_clock_read;
+ x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
+ x86_cpuinit.setup_percpu_clockev =
+ kvm_setup_secondary_clock;
#endif
#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
#endif
- machine_ops.shutdown = kvm_shutdown;
+ machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
- kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- }
+ kvm_get_preset_lpj();
+ clocksource_register(&kvm_clock);
+ pv_info.paravirt_enabled = 1;
+ pv_info.name = "KVM";
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..9734808 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,10 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+
#define KVM_MAX_MMU_OP_BATCH 32
/* Operations for KVM_HC_MMU_OP */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..a2ead7f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_NEW:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_NEW:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1374,9 +1377,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = vcpu->arch.efer;
break;
case MSR_KVM_WALL_CLOCK:
+ case MSR_KVM_WALL_CLOCK_NEW:
data = vcpu->kvm->arch.wall_clock;
break;
case MSR_KVM_SYSTEM_TIME:
+ case MSR_KVM_SYSTEM_TIME_NEW:
data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
And if we ever add a new cpuid bit in the future, we have to do that again,
which creates some complexity and delays feature adoption.
Instead, what I'm proposing in this patch is a new capability, called
KVM_CAP_X86_CPUID_FEATURE_LIST, that returns the feature list currently
supported by the hypervisor. If we ever want to add or remove a feature,
we only need to change the HV, leaving userspace untouched.
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++++
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm.h | 1 +
3 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9734808..f019f8c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* This indicates that the new set of kvmclock msrs
+ * are available. The use of 0x11 and 0x12 is deprecated
+ */
+#define KVM_FEATURE_CLOCKSOURCE2 3
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a2ead7f..04f04aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_MCE:
r = KVM_MAX_MCE_BANKS;
break;
+ case KVM_CAP_X86_CPUID_FEATURE_LIST:
+ r = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_MMU_OP) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2);
+ break;
default:
r = 0;
break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ce28767..1ce124f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -507,6 +507,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_DEBUGREGS 50
#endif
#define KVM_CAP_X86_ROBUST_SINGLESTEP 51
+#define KVM_CAP_X86_CPUID_FEATURE_LIST 52
#ifdef KVM_CAP_IRQ_ROUTING
Signed-off-by: Glauber Costa <glo...@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 5 +++++
arch/x86/include/asm/pvclock-abi.h | 1 +
arch/x86/kernel/kvmclock.c | 3 +++
arch/x86/kernel/pvclock.c | 4 ++++
4 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f019f8c..615ebb1 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,11 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
+/* The last 8 bits are used to indicate how to interpret the flags field
+ * in pvclock structure. If no bits are set, all flags are ignored.
+ */
+#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 0xf8
+
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index ec5c41a..b123bd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -39,5 +39,6 @@ struct pvclock_wall_clock {
u32 nsec;
} __attribute__((__packed__));
+#define PVCLOCK_STABLE_TSC (1 << 0)
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f2f6aee..aca2d3c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,4 +218,7 @@ void __init kvmclock_init(void)
clocksource_register(&kvm_clock);
pv_info.paravirt_enabled = 1;
pv_info.name = "KVM";
+
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_TSC))
+ pvclock_valid_flags(PVCLOCK_STABLE_TSC);
}
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 6cf6dec..43ae8d5 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -135,6 +135,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
barrier();
} while (version != src->version);
+ if ((valid_flags & PVCLOCK_STABLE_TSC) &&
+ (shadow->flags & PVCLOCK_STABLE_TSC))
+ return ret;
+
/*
* Assumption here is that last_value, a global accumulator, always goes
* forward. If we are less than that, we should not be much smaller.
Is this necessary? Why not just make the pvclock driver maintain a
local flag set, and have the HV backend call into it to update it. Why
does it need to be part of the pvclock structure?
J
--
There are obvious other ways, but I don't see any of them being simpler.
If we go by the method you suggested, we'd have, for instance, to register
the memory area where this flag lives, which is a duplication of the
infrastructure already present in kvmclock.
Please drop this. It's not certain what the best method to reduce
contention on the global variable is. Whatever it is can be done later.
Wraparound?
Extra newline.
Non conformant indentation.
Remove this flag, it has been deprecated.
> + (1 << KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:
Also missing qemu-kvm/qemu patches.
The deprecation of this flag has nothing to do with this series.
Anyway, we can't pick its number, even if it is really deprecated,
since we can still have old hvs around, so I don't really see the point here.
>
> > + (1 << KVM_FEATURE_CLOCKSOURCE2);
> > + break;
> > default:
>
> Also missing qemu-kvm/qemu patches.
>
Of course it is missing. It is a totally separate story.
Ok, sorry. My bad here.
At a quick glance, I did not realize you were commenting on this
patch specifically. KVM_CAP_X86_CPUID_FEATURE_LIST is a new
entity, so there is no need to return a deprecated feature.
I will remove that.
If the units are nanoseconds, it's good for ~500,000 years of uptime.
J
set_pvclock_valid_flags() (or just set_pvclock_flags?
pvclock_set_flags()?). The original name looks like a getter.
--
The problem is lead time. If we (or the cpu vendors) fix this in the
hypervisor/processors in a year, then we want existing guests to take
advantage of it immediately.
--
Please document this in Documentation/kvm/cpuid.txt (also /msr.txt).
>
> #define MSR_KVM_WALL_CLOCK 0x11
> #define MSR_KVM_SYSTEM_TIME 0x12
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a2ead7f..04f04aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1545,6 +1545,12 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_MCE:
> r = KVM_MAX_MCE_BANKS;
> break;
> + case KVM_CAP_X86_CPUID_FEATURE_LIST:
> + r = (1<< KVM_FEATURE_CLOCKSOURCE) |
> + (1<< KVM_FEATURE_NOP_IO_DELAY) |
> + (1<< KVM_FEATURE_MMU_OP) |
> + (1<< KVM_FEATURE_CLOCKSOURCE2);
> + break;
> default:
> r = 0;
> break;
>
Hmm. We already have an API to get cpuid bits:
KVM_GET_SUPPORTED_CPUID2. The nice thing about it is that it will mean
-cpu host will work out of the box.
--
Ok, from what I understand, KVM_GET_CPUID2 gets a set of features, and tells
userspace which of them are available. Right?
If this is the case, userspace could ask for 0xffffffff, and then we tell them
which of them are present.
Does that make sense?
No. KVM_GET_CPUID2 reads what was set by KVM_SET_CPUID, as modified by
the guest executing the cpuid instruction. KVM_GET_SUPPORTED_CPUID
tells userspace which bits are supported by the host cpu and kvm.
> If this is the case, userspace could ask for 0xffffffff, and then we tell them
> which of them are present.
>
> Does that make sense?
>
The API for KVM_GET_SUPPORTED_CPUID returns all cpuid leaves supported
in one go, IIRC.
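For concreteness, a hedged userspace sketch of that probe, assuming the
hypervisor reports the 0x40000001 leaf through KVM_GET_SUPPORTED_CPUID (error
handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	unsigned int i, nent = 64;
	struct kvm_cpuid2 *cpuid;

	cpuid = calloc(1, sizeof(*cpuid) + nent * sizeof(cpuid->entries[0]));
	cpuid->nent = nent;

	if (ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid) < 0) {
		perror("KVM_GET_SUPPORTED_CPUID");
		return 1;
	}

	for (i = 0; i < cpuid->nent; i++)
		if (cpuid->entries[i].function == 0x40000001)
			printf("kvm paravirt feature bits (eax): 0x%x\n",
			       (unsigned)cpuid->entries[i].eax);
	return 0;
}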
--
Unfortunately this is breaking Xen save/restore: if you restore on a
host which was booted more recently than the save host, the system
time will be smaller. The effect is that the domain's time leaps
forward to a fixed point, and stays there until the host catches up to
the source host...
I guess last_time needs to be reset on this type of event. I guess the
cleanest way would be for pvclock.c to register a sysdev suspend/resume
handler.
J
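A sketch of the reset being suggested here, assuming the atomic64_t
last_value from the patch above; the exact hook point (sysdev resume, the Xen
restore path) is left open, and the name is illustrative.

/* Hypothetical resume hook: forget the pre-save clamp so a smaller
 * post-restore system time is not pinned to the old value forever. */
void pvclock_resume(void)
{
	atomic64_set(&last_value, 0);
}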
> Signed-off-by: Glauber Costa <glo...@redhat.com>
> CC: Jeremy Fitzhardinge <jer...@goop.org>
> CC: Avi Kivity <a...@redhat.com>
> CC: Marcelo Tosatti <mtos...@redhat.com>
> CC: Zachary Amsden <zam...@redhat.com>
> ---
> arch/x86/kernel/pvclock.c | 23 +++++++++++++++++++++++
> 1 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 03801f2..b7de0e6 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -109,11 +109,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
> return pv_tsc_khz;
> }
>
> +static u64 last_value = 0;
> +
> cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
> struct pvclock_shadow_time shadow;
> unsigned version;
> cycle_t ret, offset;
> + u64 last;
>
> do {
> version = pvclock_get_time_values(&shadow, src);
> @@ -123,6 +126,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> barrier();
> } while (version != src->version);
>
> + /*
> + * Assumption here is that last_value, a global accumulator, always goes
> + * forward. If we are less than that, we should not be much smaller.
> + * We assume there is an error marging we're inside, and then the correction
> + * does not sacrifice accuracy.
> + *
> + * For reads: global may have changed between test and return,
> + * but this means someone else updated poked the clock at a later time.
> + * We just need to make sure we are not seeing a backwards event.
> + *
> + * For updates: last_value = ret is not enough, since two vcpus could be
> + * updating at the same time, and one of them could be slightly behind,
> + * making the assumption that last_value always go forward fail to hold.
> + */
> + do {
> + last = last_value;
> + if (ret < last)
> + return last;
> + } while (unlikely(cmpxchg64(&last_value, last, ret) != ret));
> +
> return ret;
> }
>
--
Shouldn't save/restore also save the timebase?
> I guess last_time needs to be reset on this type of event. I guess the
> cleanest way would be for pvclock.c to register a sysdev suspend/resume
> handler.
Should be for Xen only; kvm save/restore doesn't involve the guest.
--
The suspend/resume path does adjust the time base. Maybe something similar
should be done?
Xen doesn't guarantee the system time is monotonic across those kinds of
events. The domain could maintain its own offset to maintain an
illusion of monotonicity, but I think it's simpler to just zero
last_value on resume.
J