
[PATCH] x86: Calculate MHz using APERF/MPERF for cpuinfo and scaling_cur_freq


Len Brown

Apr 1, 2016, 12:38:20 AM
to x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
From: Len Brown <len....@intel.com>

For x86 processors with APERF/MPERF and TSC,
return meaningful and consistent MHz in
/proc/cpuinfo and
/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

MHz is computed like so:

MHz = base_MHz * delta_APERF / delta_MPERF

MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive reads
of the frequency on that processor, whether from
/proc/cpuinfo or from sysfs cpufreq/scaling_cur_freq.
As with previous methods of calculating MHz,
idle time is excluded.

base_MHz above is from TSC calibration global "cpu_khz".

This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or the Linux cpufreq sub-system is/is-not installed.

Note that frequent or concurrent reads of /proc/cpuinfo
or sysfs cpufreq/scaling_cur_freq will shorten the
measurement interval seen by each reader. The code
mitigates that issue by caching results for 100ms.

Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.

Signed-off-by: Len Brown <len....@intel.com>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 76 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/proc.c | 4 ++-
drivers/cpufreq/cpufreq.c | 7 +++-
include/linux/cpufreq.h | 13 +++++++
5 files changed, 99 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..821e31a 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -20,6 +20,7 @@ obj-y := intel_cacheinfo.o scattered.o topology.o
obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
+obj-y += aperfmperf.o

obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 0000000..9380102
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,76 @@
+/*
+ * x86 APERF/MPERF KHz calculation
+ * Used by /proc/cpuinfo and /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2015 Intel Corp.
+ * Author: Len Brown <len....@intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+ unsigned int khz;
+ unsigned long jiffies;
+ unsigned long long aperf;
+ unsigned long long mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 100ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+ unsigned long long aperf, aperf_delta;
+ unsigned long long mperf, mperf_delta;
+ unsigned long long numerator;
+ struct aperfmperf_sample *s = &get_cpu_var(samples);
+
+ /* Cache KHz for 100 ms */
+ if (time_before(jiffies, s->jiffies + HZ/10))
+ goto out;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ aperf_delta = aperf - s->aperf;
+ mperf_delta = mperf - s->mperf;
+
+ /*
+ * There is no architectural guarantee that MPERF
+ * increments faster than we can read it.
+ */
+ if (mperf_delta == 0)
+ goto out;
+
+ numerator = cpu_khz * aperf_delta;
+ s->khz = div64_u64(numerator, mperf_delta);
+ s->jiffies = jiffies;
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+out:
+ put_cpu_var(samples);
+}
+
+unsigned int aperfmperf_khz_on_cpu(int cpu)
+{
+ if (!cpu_khz)
+ return 0;
+
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+ return per_cpu(samples.khz, cpu);
+}
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 18ca99f..44507c0 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -78,9 +78,11 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);

if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = cpufreq_quick_get(cpu);
+ unsigned int freq = aperfmperf_khz_on_cpu(cpu);

if (!freq)
+ freq = cpufreq_quick_get(cpu);
+ if (!freq)
freq = cpu_khz;
seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
freq / 1000, (freq % 1000));
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b87596b..7fcd090 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -541,8 +541,13 @@ show_one(scaling_max_freq, max);
static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
{
ssize_t ret;
+ unsigned int freq;

- if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
+ freq = arch_freq_get_on_cpu(policy->cpu);
+ if (freq)
+ ret = sprintf(buf, "%u\n", freq);
+ else if (cpufreq_driver && cpufreq_driver->setpolicy &&
+ cpufreq_driver->get)
ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
else
ret = sprintf(buf, "%u\n", policy->cur);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 718e872..a9b8ec6 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -566,6 +566,19 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
/* the following funtion is for cpufreq core use only */
struct cpufreq_frequency_table *cpufreq_frequency_get_table(unsigned int cpu);

+#ifdef CONFIG_X86
+extern unsigned int aperfmperf_khz_on_cpu(int cpu);
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return aperfmperf_khz_on_cpu(cpu);
+}
+#else
+static inline unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ return 0;
+}
+#endif
+
/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
--
2.8.0.rc4.16.g56331f8
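
The arithmetic in the changelog above can be tried outside the kernel. The following is a minimal userspace sketch, not the patch's code: `struct perf_snapshot`, `avg_khz_between()` and `avg_khz_example()` are hypothetical names, the counter values are passed in rather than read with rdmsrl(), and overflow of the product is deliberately not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one APERF/MPERF reading. */
struct perf_snapshot {
    uint64_t aperf;
    uint64_t mperf;
};

/*
 * Average non-idle frequency in kHz between two snapshots:
 *   khz = base_khz * delta_APERF / delta_MPERF
 * Returns 0 if MPERF did not advance (the patch instead keeps its
 * previously cached value in that case).
 * Assumption: the product fits in 64 bits.
 */
static uint64_t avg_khz_between(uint64_t base_khz,
                                const struct perf_snapshot *prev,
                                const struct perf_snapshot *cur)
{
    uint64_t aperf_delta = cur->aperf - prev->aperf;
    uint64_t mperf_delta = cur->mperf - prev->mperf;

    if (mperf_delta == 0)
        return 0;

    return base_khz * aperf_delta / mperf_delta;
}

/* Usage example with made-up counter values. */
static void avg_khz_example(void)
{
    struct perf_snapshot prev = { .aperf = 1000, .mperf = 1000 };
    struct perf_snapshot cur  = { .aperf = 3001000, .mperf = 2001000 };

    /* 2.4 GHz base, APERF advanced 1.5x as fast as MPERF: 3.6 GHz. */
    assert(avg_khz_between(2400000, &prev, &cur) == 3600000);
}
```

Because both counters stop while the CPU is idle, the ratio reflects only the time the CPU was actually running, which is why idle time drops out of the result.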

Thomas Gleixner

Apr 1, 2016, 3:58:19 AM
to Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
On Fri, 1 Apr 2016, Len Brown wrote:
> +/*
> + * aperfmperf_snapshot_khz()
> + * On the current CPU, snapshot APERF, MPERF, and jiffies
> + * unless we already did it within 100ms
> + * calculate kHz, save snapshot
> + */
> +static void aperfmperf_snapshot_khz(void *dummy)
> +{
> + unsigned long long aperf, aperf_delta;
> + unsigned long long mperf, mperf_delta;
> + unsigned long long numerator;
> + struct aperfmperf_sample *s = &get_cpu_var(samples);

this_cpu_ptr is sufficient. That's a smp function call ...
You can avoid the function call if you check s->jiffies here.

Thanks,

tglx
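
The suggestion to check the cached timestamp before making the cross-CPU call can be illustrated with a small userspace sketch. It only models the freshness test; `tick_before()` and `sample_is_fresh()` are hypothetical names, and a real kernel version would also have to decide how to read another CPU's sample safely, which is outside this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Wrap-safe "is a before b" for a free-running unsigned tick counter,
 * in the spirit of the kernel's time_before(). Converting the unsigned
 * difference to signed is implementation-defined in ISO C; this sketch
 * assumes the usual two's-complement behaviour of GCC and Clang.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Hypothetical helper: true if a sample taken at tick 'last' is still
 * inside the caching window at tick 'now', so the caller could return
 * the cached value without issuing the cross-CPU call.
 */
static bool sample_is_fresh(unsigned long now, unsigned long last,
                            unsigned long window)
{
    /* The signed comparison is only valid within half the counter range. */
    assert(window <= LONG_MAX);

    return tick_before(now, last + window);
}
```

A caller would test `sample_is_fresh()` first and only take a new snapshot when it returns false.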

Peter Zijlstra

Apr 1, 2016, 4:03:44 AM
to Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
u64 is less typing ;-)

> + struct aperfmperf_sample *s = &get_cpu_var(samples);
> +
> + /* Cache KHz for 100 ms */
> + if (time_before(jiffies, s->jiffies + HZ/10))
> + goto out;

This puts in a lower bound, but afaict there is no upper bound. Both
users appear to be userspace controlled.

That is; if userspace doesn't request a freq reading we can go without
reading this for a very long time.

> +
> + rdmsrl(MSR_IA32_APERF, aperf);
> + rdmsrl(MSR_IA32_MPERF, mperf);
> +
> + aperf_delta = aperf - s->aperf;
> + mperf_delta = mperf - s->mperf;

That means these delta's can be arbitrarily large, in fact the MSRs can
have wrapped however many times.

> +
> + /*
> + * There is no architectural guarantee that MPERF
> + * increments faster than we can read it.
> + */
> + if (mperf_delta == 0)
> + goto out;
> +
> + numerator = cpu_khz * aperf_delta;

And since delta can be any 64bit value as per the msr range, this
multiplication can overflow.

> + s->khz = div64_u64(numerator, mperf_delta);
> + s->jiffies = jiffies;
> + s->aperf = aperf;
> + s->mperf = mperf;
> +
> +out:
> + put_cpu_var(samples);
> +}
> +
> +unsigned int aperfmperf_khz_on_cpu(int cpu)
> +{
> + if (!cpu_khz)
> + return 0;
> +
> + if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> + return 0;

You could do the jiffy compare here; avoiding the IPI.
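
On the question of wrapped counters, a short userspace sketch shows what plain unsigned subtraction does and does not cover. `counter_delta()` and `counter_delta_example()` are hypothetical names, and the readings are made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Delta of a free-running 64-bit counter. Unsigned subtraction is
 * modulo 2^64, so the result is correct across at most one wrap.
 * Additional wraps between the two samples cannot be detected from
 * the readings alone.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t cur)
{
    return cur - prev;
}

/* Usage example with made-up readings. */
static void counter_delta_example(void)
{
    /* No wrap: 500 ticks elapsed. */
    assert(counter_delta(1000, 1500) == 500);

    /* One wrap: 16 ticks up to the wrap point, then 4 more. */
    assert(counter_delta(UINT64_MAX - 15, 4) == 20);
}
```

The delta itself therefore survives a single wrap; the separate concern is the later multiplication by cpu_khz, which can exceed 64 bits long before either counter wraps.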

Peter Zijlstra

Apr 1, 2016, 4:16:51 AM
to Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
On Fri, Apr 01, 2016 at 12:37:00AM -0400, Len Brown wrote:
> From: Len Brown <len....@intel.com>
>
> For x86 processors with APERF/MPERF and TSC,
> return meaningful and consistent MHz in
> /proc/cpuinfo and
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>
> MHz is computed like so:
>
> MHz = base_MHz * delta_APERF / delta_MPERF
>
> MHz is the average frequency of the busy processor
> over a measurement interval. The interval is
> defined to be the time between successive reads
> of the frequency on that processor, whether from
> /proc/cpuinfo or from sysfs cpufreq/scaling_cur_freq.
> As with previous methods of calculating MHz,
> idle time is excluded.

Is this really a semantic you want to pin down?

Since we're looking at doing something like:

lkml.kernel.org/r/2016030316...@twins.programming.kicks-ass.net

We could also just return cpu_khz * whatever fraction we store there,
knowing it is something recent.

Stephane Gasparini

Apr 1, 2016, 4:17:01 AM
to Peter Zijlstra, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown

—
Steph
64 bits is 18 446 744 073 709 551 615

so even assuming a 10 GHz frequency if my math are good this is more than
58 years before the MSR wrap around, assuming the device ran always at max
freq.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
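
The 58-year figure can be checked with a few lines of userspace C. This is only a worked calculation; `years_until_wrap()` is a hypothetical name, and a year is taken as 365 days.

```c
#include <assert.h>
#include <stdint.h>

#define SECONDS_PER_YEAR (365ULL * 24 * 60 * 60)

/*
 * Whole years for a 64-bit counter ticking at tick_hz to wrap,
 * starting from zero and never stopping.
 */
static uint64_t years_until_wrap(uint64_t tick_hz)
{
    assert(tick_hz != 0);

    return UINT64_MAX / tick_hz / SECONDS_PER_YEAR;
}
```

`years_until_wrap(10000000000ULL)` evaluates to 58 for a counter ticking at a constant 10 GHz.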

Peter Zijlstra

Apr 1, 2016, 4:23:35 AM
to Stephane Gasparini, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown

Trim your emails

On Fri, Apr 01, 2016 at 10:16:42AM +0200, Stephane Gasparini wrote:

> > That means these delta's can be arbitrarily large, in fact the MSRs can
> > have wrapped however many times.
>
> 64 bits is 18 446 744 073 709 551 615
>
> so even assuming a 10 GHz frequency if my math are good this is more than
> 58 years before the MSR wrap around, assuming the device ran always at max
> freq.

fair enough.. but going with 10Ghz, cpu_khz would be 10e6 ~ 33 bits,
which effectively reduces the wrap/overflow time to just 31 bits, which
per that frequency is just ~1/4th of a second.


Peter Zijlstra

Apr 1, 2016, 4:29:43 AM
to Stephane Gasparini, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
On Fri, Apr 01, 2016 at 10:23:23AM +0200, Peter Zijlstra wrote:
>
> Trim your emails
>
> On Fri, Apr 01, 2016 at 10:16:42AM +0200, Stephane Gasparini wrote:
>
> > > That means these delta's can be arbitrarily large, in fact the MSRs can
> > > have wrapped however many times.
> >
> > 64 bits is 18 446 744 073 709 551 615
> >
> > so even assuming a 10 GHz frequency if my math are good this is more than
> > 58 years before the MSR wrap around, assuming the device ran always at max
> > freq.
>
> fair enough.. but going with 10Ghz, cpu_khz would be 10e6 ~ 33 bits,

I can't do maths this morning; 23 bits

> which effectively reduces the wrap/overflow time to just 31 bits, which
> per that frequency is just ~1/4th of a second.

41 giving lots more, but a reasonable time to wrap/overflow.
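
The headroom being estimated here can also be computed directly instead of counting bits. The sketch below is a userspace calculation under a stated assumption, namely that APERF advances at the base frequency for the whole interval; `overflow_headroom_seconds()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds of continuously busy running before cpu_khz * aperf_delta
 * no longer fits in 64 bits.
 * Assumption: APERF advances at cpu_khz * 1000 ticks per second,
 * i.e. the CPU runs at its base frequency the whole time.
 */
static uint64_t overflow_headroom_seconds(uint64_t cpu_khz)
{
    assert(cpu_khz != 0);

    uint64_t max_aperf_delta = UINT64_MAX / cpu_khz; /* largest safe delta */
    uint64_t aperf_hz = cpu_khz * 1000;              /* ticks per second */

    return max_aperf_delta / aperf_hz;
}
```

With cpu_khz = 10000000 (10 GHz) this gives 184 seconds, roughly three minutes; with cpu_khz = 4000000 (4 GHz) it gives 1152 seconds, roughly 19 minutes.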

Stephane Gasparini

Apr 1, 2016, 5:31:00 AM
to Peter Zijlstra, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
my comment was about your comment that MSR have wrapped however many times



> On Apr 1, 2016, at 10:03 AM, Peter Zijlstra <pet...@infradead.org> wrote:
>
> That is; if userspace doesn't request a freq reading we can go without
> reading this for a very long time.
>
>> +
>> + rdmsrl(MSR_IA32_APERF, aperf);
>> + rdmsrl(MSR_IA32_MPERF, mperf);
>> +
>> + aperf_delta = aperf - s->aperf;
>> + mperf_delta = mperf - s->mperf;
>
> That means these delta's can be arbitrarily large, in fact the MSRs can
> have wrapped however many times.

The MSRs will not wrap that often.

—
Steph

Peter Zijlstra

Apr 1, 2016, 5:38:51 AM
to Stephane Gasparini, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
On Fri, Apr 01, 2016 at 11:30:48AM +0200, Stephane Gasparini wrote:
> my comment was about your comment that MSR have wrapped however many times
>

Yes, and don't top post.

Borislav Petkov

Apr 1, 2016, 5:50:21 AM
to Stephane Gasparini, Peter Zijlstra, Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
On Fri, Apr 01, 2016 at 11:30:48AM +0200, Stephane Gasparini wrote:
> The MSRs will not wrap that often.

Unless some yahoo goes and does WRMSR APERF <big_value_close_to_wrap_around>.

I think we should handle that gracefully too, regardless of how "smart"
that move might be.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Len Brown

Apr 2, 2016, 1:23:13 AM
to Borislav Petkov, Stephane Gasparini, Peter Zijlstra, X86 ML, Linux PM list, linux-...@vger.kernel.org, Len Brown
Thanks for the comments.

Re: is this a useful semantic?

Yes, average MHz over an interval is significantly more useful than
a snapshot of the recent instantaneous frequency.
It is possible to convert the former into the latter,
but it is not possible to reliably and efficiently convert the latter
into the former.

Indeed, we stopped using MSR_PERF_STATUS for this very reason --
a snapshot of instantaneous frequency can be very misleading.

Further, the mechanism in this patch still works when Linux
has no control over frequency, for example when firmware
controls P-states or when CONFIG_CPU_FREQ=n.

Of course, when there is 1 reader, this mechanism works the best --
as they get to select whatever interval they like.
For multi-user, the interval would shorten -- possibly
degrading to the 100ms limit set here. My reasoning on the
100ms limit is that anything more frequent is abuse,
and the users should be using user-space tools like turbostat in that case.

Re: 64-bit math.

Stephane is correct, APERF and MPERF will not overflow in the uptime
of the machine.
They are both 64-bit registers, and they tick at TSC rate or slower.
(Indeed, they tick at 0 when idle)

Boris is right, this works only as long as nobody scribbles on these MSRs.
Linux used to do that in 2.6.23, but we learned our lesson and have left them
free running since then. I'm not going to worry about a yahoo scribbling on
MSRs behind the kernel's back. More than this will break if that happens.

Peter is right, in the expression "numerator = cpu_khz * aperf_delta",
the capacity of the 64-bit numerator is reduced as cpu_khz
and aperf_delta grow.

For example, if this patch runs on a busy system having a 4GHz CPU,
then APERF ticks at roughly 2^32 Hz
and cpu_khz is roughly 2^22,
so max aperf_delta without overflow is 2^64/2^22 = 2^42 cycles

2^42 cycles / 2^32 cycles/sec = 2^10 sec = 1024 seconds, about 17 minutes.

Though we could improve this range by 1024x by simply operating on
cpu_mhz instead of cpu_khz, yielding 12 days.

Or we could simply detect potential overflow:

2^64 < cpu_khz * delta_aperf
so
if (2^64/cpu_khz < delta_aperf) then overflow

and since delta_aperf and delta_mperf are much larger than cpu_khz
in this case, we can calculate this way:

khz = cpu_khz * delta_aperf / delta_mperf
khz = cpu_khz * (delta_aperf / cpu_khz) / (delta_mperf / cpu_khz)
khz = delta_aperf / (delta_mperf / cpu_khz)

No calculation here can overflow 64 bits in the uptime of the machine.

I'll send an updated patch.

thanks,
-Len
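
The detect-and-reorder approach described above can be sketched in userspace C as follows. This is a sketch under stated assumptions, not the updated patch: `khz_from_deltas()` is a hypothetical name, and the extra check for a zero divisor in the reordered branch is an addition that the text above does not mention.

```c
#include <assert.h>
#include <stdint.h>

/*
 * khz = base_khz * aperf_delta / mperf_delta, avoiding 64-bit overflow
 * of the product by reordering the division when aperf_delta is large.
 * Returns 0 when no meaningful result can be computed.
 */
static uint64_t khz_from_deltas(uint64_t base_khz, uint64_t aperf_delta,
                                uint64_t mperf_delta)
{
    assert(base_khz != 0);

    if (mperf_delta == 0)
        return 0;

    /* Product fits: use the direct form, which keeps full precision. */
    if (aperf_delta <= UINT64_MAX / base_khz)
        return base_khz * aperf_delta / mperf_delta;

    /*
     * Product would overflow: divide mperf_delta by base_khz first.
     * Guard the case where that quotient truncates to zero.
     */
    uint64_t scaled_mperf = mperf_delta / base_khz;

    if (scaled_mperf == 0)
        return 0;

    return aperf_delta / scaled_mperf;
}
```

The reordered branch loses a little precision because mperf_delta / base_khz is truncated, but in that branch both deltas are very large, so the relative error stays small.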

Len Brown

Apr 6, 2016, 4:48:14 PM
to x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
From: Len Brown <len....@intel.com>

For x86 processors with APERF/MPERF and TSC,
return meaningful and consistent MHz in
/proc/cpuinfo and
/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

MHz is computed like so:

MHz = base_MHz * delta_APERF / delta_MPERF

or when delta_APERF is large, to prevent
64-bit overflow:

MHz = delta_APERF / (delta_MPERF / base_MHz)

MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive reads
of the frequency on that processor, whether from
/proc/cpuinfo or from sysfs cpufreq/scaling_cur_freq.
As with previous methods of calculating MHz,
idle time is excluded.

base_MHz above is from TSC calibration global "cpu_khz".

This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or the Linux cpufreq sub-system is/is-not installed.

Note that frequent or concurrent reads of /proc/cpuinfo
or sysfs cpufreq/scaling_cur_freq will shorten the
measurement interval seen by each reader. The code
mitigates that issue by caching results for 100ms.

Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.

Signed-off-by: Len Brown <len....@intel.com>
---
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/aperfmperf.c | 81 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/proc.c | 4 +-
drivers/cpufreq/cpufreq.c | 7 +++-
include/linux/cpufreq.h | 13 +++++++
5 files changed, 104 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..821e31a 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -20,6 +20,7 @@ obj-y := intel_cacheinfo.o scattered.o topology.o
obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
+obj-y += aperfmperf.o

obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 0000000..3189f68
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,81 @@
+/*
+ * x86 APERF/MPERF KHz calculation
+ * Used by /proc/cpuinfo and /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2015 Intel Corp.
+ * Author: Len Brown <len....@intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+ unsigned int khz;
+ unsigned long jiffies;
+ u64 aperf;
+ u64 mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 100ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+ u64 aperf, aperf_delta;
+ u64 mperf, mperf_delta;
+ struct aperfmperf_sample *s = &get_cpu_var(samples);
+
+ /* Cache KHz for 100 ms */
+ if (time_before(jiffies, s->jiffies + HZ/10))
+ goto out;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ aperf_delta = aperf - s->aperf;
+ mperf_delta = mperf - s->mperf;
+
+ /*
+ * There is no architectural guarantee that MPERF
+ * increments faster than we can read it.
+ */
+ if (mperf_delta == 0)
+ goto out;
+
+ /*
+ * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
+ * khz = (cpu_khz * aperf_delta) / mperf_delta
+ */
+ if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
+ s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
+ else /* khz = aperf_delta / (mperf_delta / cpu_khz) */
+ s->khz = div64_u64(aperf_delta, div64_u64(mperf_delta, cpu_khz));
+ s->jiffies = jiffies;
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+out:
+ put_cpu_var(samples);
+}
+
+unsigned int aperfmperf_khz_on_cpu(int cpu)
+{
+ if (!cpu_khz)
+ return 0;
+
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+ return per_cpu(samples.khz, cpu);
+}
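
One possible alternative to the two-branch computation, not proposed anywhere in this thread, is to widen the intermediate product to 128 bits so that no reordering is needed. The userspace sketch below assumes a compiler that provides `unsigned __int128` (a GCC/Clang extension on 64-bit targets, not ISO C); `khz_from_deltas_wide()` and its example function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same formula with a 128-bit intermediate product.
 * Returns 0 if mperf_delta is 0; the quotient is truncated to 64 bits.
 */
static uint64_t khz_from_deltas_wide(uint64_t base_khz, uint64_t aperf_delta,
                                     uint64_t mperf_delta)
{
    if (mperf_delta == 0)
        return 0;

    unsigned __int128 product = (unsigned __int128)base_khz * aperf_delta;

    return (uint64_t)(product / mperf_delta);
}

/* Usage example with made-up deltas. */
static void khz_from_deltas_wide_example(void)
{
    /* 4 GHz base, deltas far past the 64-bit product limit, 1.5x ratio. */
    assert(khz_from_deltas_wide(4000000, 6000000000000000ULL,
                                4000000000000000ULL) == 6000000);
}
```

Whether a 128-bit division is acceptable in the kernel paths touched by this patch is a separate question that this sketch does not address.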

Prarit Bhargava

Apr 8, 2016, 8:26:39 AM
to linux-...@vger.kernel.org, le...@kernel.org, Prarit Bhargava
>For x86 processors with APERF/MPERF and TSC, return
> meaningful and consistent MHz in /proc/cpuinfo and
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>
>MHz is computed like so:
>
>MHz = base_MHz * delta_APERF / delta_MPERF
>
>or when delta_APERF is large, to prevent
>64-bit overflow:
>
>MHz = delta_APERF / (delta_MPERF / base_MHz)
>
>MHz is the average frequency of the busy processor
>over a measurement interval. The interval is
>defined to be the time between successive reads
>of the frequency on that processor, whether from
>/proc/cpuinfo or from sysfs cpufreq/scaling_cur_freq.
>As with previous methods of calculating MHz,
>idle time is excluded.
>
>base_MHz above is from TSC calibration global "cpu_khz".
>
>This x86 native method to calculate MHz returns a meaningful result
>no matter if P-states are controlled by hardware or firmware
>and/or the Linux cpufreq sub-system is/is-not installed.
>
>Note that frequent or concurrent reads of /proc/cpuinfo
>or sysfs cpufreq/scaling_cur_freq will shorten the
>measurement interval seen by each reader. The code
>mitigates that issue by caching results for 100ms.

I have a minor ABI concern with this patch. It seems that there is much more
variance in the output of "cpu MHz" with this patch, and I think that
needs to be noted in the changelog.

ISTR having a conversation a while ago (with you Len? with Srinivas?)
where I mentioned that "cpu MHz" used to just reflect the "marketing"
frequency of the processors on the system. Is it worth going back to
that static state, and leaving the calculation for the current frequency to
userspace programs like turbostat, cpupower, etc.?

FWIW: I *regularly* get bugzillas filed from people who do not understand
that "cpu MHz" shows the current frequency of the core. I've often
thought it would be easier to make that value static ...

P.

Len Brown

Apr 8, 2016, 7:56:41 PM
to Prarit Bhargava, linux-...@vger.kernel.org
> I have a minor ABI concern with this patch. It seems that there is much more
> variance in the output of "cpu MHz" with this patch, and I think that
> needs to be noted in the changelog.
>
> ISTR having a conversation a while ago (with you Len? with Srinivas?)
> where I mentioned that "cpu MHz" used to just reflect the "marketing"
> frequency of the processors on the system. Is it worth going back to
> that static state, and leaving the calculation for the current frequency to
> userspace programs like turbostat, cpupower, etc.?
>
> FWIW: I *regularly* get bugzillas filed from people who do not understand
> that "cpu MHz" shows the current frequency of the core. I've often
> thought it would be easier to make that value static ...

I am fine with always printing static cpu_khz in /proc/cpuinfo on all machines.

If it were up to me, I would not have allowed the cpufreq sub-system
to start messing with this.

But it did, and I figured the genie was out of the bottle.
Assuming I'd never be able to get the community to agree to stuff the
genie back in the bottle, I figured that this file should show a value
that actually means something, and isn't completely different depending
on the choice of cpufreq driver being used on that system. Indeed, your
comment on variability is right on the money: this solution is less
"variable" than some drivers, such as intel_pstate, and more variable
than others, such as acpi-cpufreq. Neither of those drivers returns a
value that is particularly meaningful. This solution, at least, has a
semantic definition.

Len Brown, Intel Open Source Technology Center

Pavel Machek

Apr 24, 2016, 12:39:00 PM
to Len Brown, x...@kernel.org, linu...@vger.kernel.org, linux-...@vger.kernel.org, Len Brown
Hi!
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -0,0 +1,76 @@
> +/*
> + * x86 APERF/MPERF KHz calculation
> + * Used by /proc/cpuinfo and /sys/.../cpufreq/scaling_cur_freq

Could we use some shorter filename here? cpu_mhz.c? mhz.c?

> +/*
> + * aperfmperf_snapshot_khz()


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html