[PATCH v1 0/2] perf,x86: add Intel RAPL PMU support

Stephane Eranian

unread,

Oct 7, 2013, 12:09:47 PM10/7/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, Haswell. The server skus add a 3rd counter.

The following events are available nd exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on socket
- rapl-energy-pkg: power consumption of all cores + LLc cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in unit
of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
the counts by multiplying them by 0.23 and divide 10^9 to
obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/device/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/uncore/cpumask. It
is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. Hotplug CPU is also supported.

We artificially limit the number of simultaneous RAPL events
to a max of 1 instance of each (so up to 3). That helps track
events and is sufficient given that RAPL events do not support
any filters, i.e., no gain in measuring the same event twice
in an event group.

The second patch adds a hrtimer to poll the counters given that
they do no interrupt on overflow. Hardware counters are 32-bit
wide.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

$ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
time counts events
1.000345931 772 278 493 rapl/rapl-energy-cores/
1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
2.000836387 771 751 936 rapl/rapl-energy-cores/
2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Stephane Eranian (2):
perf,x86: add Intel RAPL PMU support
perf,x86: add RAPL hrtimer support

arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 649 +++++++++++++++++++++++++++
tools/perf/util/evsel.c | 1 -
3 files changed, 650 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Stephane Eranian

unread,

Oct 7, 2013, 12:10:01 PM10/7/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

We artificially limit the number of simultaneous RAPL events
to a max of 1 instance of each (so up to 3). That helps track
events and is sufficient given that RAPL events do not support
any filters, i.e., no gain in measuring the same event twice
in an event group.

Signed-off-by: Stephane Eranian <era...@google.com>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 580 +++++++++++++++++++++++++++
tools/perf/util/evsel.c | 1 -
3 files changed, 581 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o
endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_rapl.o
endif

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644
index 0000000..f59dbd4
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,580 @@
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/pci.h>
+#include <linux/perf_event.h>
+#include "perf_event.h"
+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT 0 /* all cores */
+#define INTEL_RAPL_PP0 0x1 /* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT 1 /* entire package */
+#define INTEL_RAPL_PKG 0x2 /* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT 2 /* DRAM */
+#define INTEL_RAPL_RAM 0x3 /* pseudo-encoding */
+
+#define RAPL_IDX_MAX 4 /* Max number of RAPL counters */
+
+/* Desktops have PP0, PKG */
+#define RAPL_IDX_CLN (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT|\
+ 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*
+ * event code: LSB 8 bits, to pass in attr->config
+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK 0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format) \
+static ssize_t __rapl_##_var##_show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *page) \
+{ \
+ BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+ return sprintf(page, _format "\n"); \
+} \
+static struct kobj_attribute format_attr_##_var = \
+ __ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config) \
+{ \
+ .attr = __ATTR(_name, 0444, rapl_event_show, NULL), \
+ .config = _config, \
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {
+ atomic_t refcnt;
+ int hw_unit; /* 1/2^hw_unit Joule */
+ int phys_id;
+ int n_active; /* number of active events */
+ unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
+ struct perf_event *events[RAPL_IDX_MAX];
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_kfree);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+ u64 raw;
+ rdmsrl(event->hw.event_base, raw);
+ return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+ /*
+ * scale delta to smallest unit (1/2^32)
+ * users must then scale back: count * 1/(1e9*2^32) to get Joules
+ * Watts = Joules/Time delta
+ */
+ return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev_raw_count, new_raw_count;
+ s64 delta, sdelta;
+ int shift = RAPL_CNTR_WIDTH;
+
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdmsrl(event->hw.event_base, new_raw_count);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count)
+ goto again;
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ */
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+
+ sdelta = rapl_scale(delta);
+
+ local64_add(sdelta, &event->count);
+
+ return new_raw_count;
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int flags)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+
+ if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+ return;
+
+ event->hw.state = 0;
+
+ local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+ pmu->n_active++;
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int flags)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ int idx = event->hw.idx;
+
+ /* counter already in use */
+ if (__test_and_set_bit(idx, pmu->active_mask))
+ return -EAGAIN;
+
+ pmu->events[idx] = event;
+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+ if (flags & PERF_EF_START)
+ rapl_pmu_event_start(event, 0);
+
+ return 0;
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int flags)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;
+
+ /* mark event as deactivated and stopped */
+ if (__test_and_clear_bit(hwc->idx, pmu->active_mask)) {
+ WARN_ON_ONCE(pmu->n_active <= 0);
+ pmu->n_active--;
+ pmu->events[hwc->idx] = NULL;
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* check if update of sw counter is necessary */
+ if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ rapl_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_validate_group(struct perf_event *event)
+{
+ struct perf_event *leader = event->group_leader;
+ unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
+ struct perf_event *e;
+
+ /*
+ * group can only have RAPL PMU events
+ * just need to verify they don't use the same
+ * counter
+ * Although the RAPL counters are read-only and
+ * we could have as many RAPL events as we would
+ * like, we artificially limit to a maximum of
+ * one event of each kind. That helps us track
+ * events better especially when we want to avoid
+ * missing counter overflows. Furthermore, the
+ * counters have no filters, thus adding 2 instances
+ * of an event does not buy anything.
+ */
+ bitmap_zero(active_mask, RAPL_IDX_MAX);
+
+ /*
+ * event is not yet connected with siblings
+ */
+ __set_bit(event->hw.idx, active_mask);
+
+ /*
+ * add leader too
+ */
+ if (__test_and_set_bit(leader->hw.idx, active_mask))
+ return -EINVAL;
+
+ /*
+ * now check existing siblings
+ */
+ list_for_each_entry(e, &leader->sibling_list, group_entry) {
+ if (__test_and_set_bit(e->hw.idx, active_mask))
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+ u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+ int bit, msr, ret = 0;
+
+ /* only look at RAPL events */
+ if (event->attr.type != rapl_pmu_class.type)
+ return -ENOENT;
+
+ /* check only supported bits are set */
+ if (event->attr.config & ~RAPL_EVENT_MASK)
+ return -EINVAL;
+
+ /*
+ * check event is known (determines counter)
+ */
+ switch (cfg) {
+ case INTEL_RAPL_PP0:
+ bit = RAPL_IDX_PP0_NRG_STAT;
+ msr = MSR_PP0_ENERGY_STATUS;
+ break;
+ case INTEL_RAPL_PKG:
+ bit = RAPL_IDX_PKG_NRG_STAT;
+ msr = MSR_PKG_ENERGY_STATUS;
+ break;
+ case INTEL_RAPL_RAM:
+ bit = RAPL_IDX_RAM_NRG_STAT;
+ msr = MSR_DRAM_ENERGY_STATUS;
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* check event supported */
+ if (!(rapl_cntr_mask & (1 << bit)))
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ /* must be done before validate_group */
+ event->hw.event_base = msr;
+ event->hw.idx = bit;
+
+ if (event->group_leader != event)
+ ret = rapl_validate_group(event);
+
+ return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+ rapl_event_update(event);
+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
+
+ buf[n++] = '\n';
+ buf[n] = '\0';
+ return n;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
+
+static struct attribute *rapl_pmu_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_attr_group = {
+ .attrs = rapl_pmu_attrs,
+};
+
+EVENT_ATTR_STR(rapl-energy-cores, rapl_pp0, "event=0x01");
+EVENT_ATTR_STR(rapl-energy-pkg , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(rapl-energy-ram , rapl_ram, "event=0x03");
+
+static struct attribute *rapl_events_srv_attr[] = {
+ EVENT_PTR(rapl_pp0),
+ EVENT_PTR(rapl_pkg),
+ EVENT_PTR(rapl_ram),
+ NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {
+ EVENT_PTR(rapl_pp0),
+ EVENT_PTR(rapl_pkg),
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_events_group = {
+ .name = "events",
+ .attrs = NULL, /* patched at runtime */
+};
+
+DEFINE_RAPL_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *rapl_formats_attr[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_format_group = {
+ .name = "format",
+ .attrs = rapl_formats_attr,
+};
+
+const struct attribute_group *rapl_attr_groups[] = {
+ &rapl_pmu_attr_group,
+ &rapl_pmu_format_group,
+ &rapl_pmu_events_group,
+ NULL,
+};
+
+static struct pmu rapl_pmu_class = {
+ .attr_groups = rapl_attr_groups,
+ .task_ctx_nr = perf_invalid_context, /* system-wide only */
+ .event_init = rapl_pmu_event_init,
+ .add = rapl_pmu_event_add, /* must have */
+ .del = rapl_pmu_event_del, /* must have */
+ .start = rapl_pmu_event_start,
+ .stop = rapl_pmu_event_stop,
+ .read = rapl_pmu_event_read,
+};
+
+static void rapl_exit_cpu(int cpu)
+{
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ /* if CPU not in RAPL mask, nothing to do */
+ if (!cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask))
+ return;
+
+ /* find a new cpu on same package */
+ for_each_online_cpu(i) {
+ if (i == cpu || i == 0)
+ continue;
+ if (phys_id == topology_physical_package_id(i)) {
+ cpumask_set_cpu(i, &rapl_cpu_mask);
+ break;
+ }
+ }
+
+ WARN_ON(cpumask_empty(&rapl_cpu_mask));
+}
+
+static void rapl_init_cpu(int cpu)
+{
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ /* check if phys_is is already covered */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (i == 0)
+ continue;
+ if (phys_id == topology_physical_package_id(i))
+ return;
+ }
+ /* was not found, so add it */
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+}
+
+static int rapl_cpu_prepare(int cpu)
+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ int phys_id = topology_physical_package_id(cpu);
+
+ if (pmu)
+ return 0;
+
+ if (phys_id < 0)
+ return -1;
+
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ if (!pmu)
+ return -1;
+
+ atomic_set(&pmu->refcnt, 1);
+ pmu->phys_id = phys_id;
+ /*
+ * grab power unit as: 1/2^unit Joules
+ *
+ * we cache in local PMU instance
+ */
+ rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
+ pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+
+ /* set RAPL pmu for this cpu for now */
+ per_cpu(rapl_pmu_kfree, cpu) = NULL;
+ per_cpu(rapl_pmu, cpu) = pmu;
+
+ return 0;
+}
+
+static int rapl_cpu_starting(int cpu)
+{
+ struct rapl_pmu *pmu2;
+ struct rapl_pmu *pmu1 = per_cpu(rapl_pmu, cpu);
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ if (pmu1)
+ return 0;
+
+ for_each_online_cpu(i) {
+ pmu2 = per_cpu(rapl_pmu, i);
+
+ if (!pmu2 || i == cpu)
+ continue;
+
+ if (pmu2->phys_id == phys_id) {
+ per_cpu(rapl_pmu, cpu) = pmu2;
+ per_cpu(rapl_pmu_kfree, cpu) = pmu1;
+ atomic_inc(&pmu2->refcnt);
+ break;
+ }
+ }
+ return 0;
+}
+
+static int rapl_cpu_dying(int cpu)
+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ int i;
+
+ if (!pmu)
+ return 0;
+ /*
+ * stop all syswide RAPL events on that CPU
+ * as a consequence also stops the hrtimer
+ */
+ for (i = 0; i < RAPL_IDX_MAX; i++) {
+ if (pmu->events[i])
+ rapl_pmu_event_stop(pmu->events[i], PERF_EF_UPDATE);
+ }
+ per_cpu(rapl_pmu, cpu) = NULL;
+
+ if (atomic_dec_and_test(&pmu->refcnt))
+ kfree(pmu);
+
+ return 0;
+}
+
+static int rapl_cpu_notifier(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ /* allocate/free data structure for uncore box */
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_UP_PREPARE:
+ rapl_cpu_prepare(cpu);
+ break;
+ case CPU_STARTING:
+ rapl_cpu_starting(cpu);
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DYING:
+ rapl_cpu_dying(cpu);
+ break;
+ case CPU_ONLINE:
+ kfree(per_cpu(rapl_pmu_kfree, cpu));
+ per_cpu(rapl_pmu_kfree, cpu) = NULL;
+ break;
+ case CPU_DEAD:
+ per_cpu(rapl_pmu, cpu) = NULL;
+ break;
+ default:
+ break;
+ }
+
+ /* select the cpu that collects uncore events */
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DOWN_FAILED:
+ case CPU_STARTING:
+ rapl_init_cpu(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ rapl_exit_cpu(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static int __init rapl_pmu_init(void)
+{
+ struct rapl_pmu *pmu;
+ int i, cpu, ret;
+
+ /* check supported CPU */
+ switch (boot_cpu_data.x86_model) {
+ case 42: /* Sandy Bridge */
+ case 58: /* Ivy Bridge */
+ case 60: /* Haswell */
+ rapl_cntr_mask = RAPL_IDX_CLN;
+ rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+ break;
+ case 45: /* Sandy Bridge-EP */
+ case 62: /* IvyTown */
+ rapl_cntr_mask = RAPL_IDX_SRV;
+ rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+ break;
+
+ default:
+ /* unsupported */
+ return 0;
+ }
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ int phys_id = topology_physical_package_id(cpu);
+
+ /* save on prepare by only calling prepare for new phys_id */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (phys_id == topology_physical_package_id(i)) {
+ phys_id = -1;
+ break;
+ }
+ }
+ if (phys_id < 0) {
+ pmu = per_cpu(rapl_pmu, i);
+ if (pmu) {
+ per_cpu(rapl_pmu, cpu) = pmu;
+ atomic_inc(&pmu->refcnt);
+ }
+ continue;
+ }
+ rapl_cpu_prepare(cpu);
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ }
+
+ perf_cpu_notifier(rapl_cpu_notifier);
+
+ ret = perf_pmu_register(&rapl_pmu_class, "rapl", -1);
+ WARN_ON(ret);
+
+ pmu = __get_cpu_var(rapl_pmu);
+ pr_info("RAPL PMU detected, hw unit 2^-%d Joules, "
+ " API unit is 2^-32 Joules,"
+ " %d fixed counters\n",
+ pmu->hw_unit,
+ hweight32(rapl_cntr_mask));
+
+ put_online_cpus();
+
+ return 0;
+}
+device_initcall(rapl_pmu_init);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index abe69af..12bfd7d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
if (readn(FD(evsel, cpu, thread),
&count, nv * sizeof(u64)) < 0)
return -errno;
-
aggr->val += count.val;
if (scale) {
aggr->ena += count.ena;

Stephane Eranian

unread,

Oct 7, 2013, 12:10:05 PM10/7/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer internval is calculated at boot time
based on the power unit used by the HW.

Signed-off-by: Stephane Eranian <era...@google.com>
---

arch/x86/kernel/cpu/perf_event_intel_rapl.c | 75 +++++++++++++++++++++++++--
1 file changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index f59dbd4..6294d62 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -54,6 +54,8 @@ struct rapl_pmu {

int hw_unit; /* 1/2^hw_unit Joule */

int phys_id;

int n_active; /* number of active events */

+ ktime_t timer_interval; /* in ktime_t unit */
+ struct hrtimer hrtimer;
unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
struct perf_event *events[RAPL_IDX_MAX];
};
@@ -82,6 +84,18 @@ static inline u64 rapl_scale(u64 v)

return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
}

+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+ __hrtimer_start_range_ns(&pmu->hrtimer,
+ pmu->timer_interval, 0,
+ HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+ hrtimer_cancel(&pmu->hrtimer);
+}
+
static u64 rapl_event_update(struct perf_event *event)
{

struct hw_perf_event *hwc = &event->hw;

@@ -115,6 +129,38 @@ static u64 rapl_event_update(struct perf_event *event)
return new_raw_count;
}

+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
+ unsigned long flags;
+ int i;
+
+ if (!pmu->n_active)
+ return HRTIMER_NORESTART;
+
+ local_irq_save(flags);
+
+ for_each_set_bit(i, pmu->active_mask, RAPL_IDX_MAX) {
+ rapl_event_update(pmu->events[i]);
+ }
+
+ local_irq_restore(flags);
+
+ hrtimer_forward_now(&pmu->hrtimer, pmu->timer_interval);
+
+
+ return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+ hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ pmu->hrtimer.function = rapl_hrtimer_handle;
+}
+
+
+
+

static void rapl_pmu_event_start(struct perf_event *event, int flags)

{

struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);

@@ -127,6 +173,8 @@ static void rapl_pmu_event_start(struct perf_event *event, int flags)
local64_set(&event->hw.prev_count, rapl_read_counter(event));

pmu->n_active++;
+ if (pmu->n_active == 1)
+ rapl_start_hrtimer(pmu);

}

static int rapl_pmu_event_add(struct perf_event *event, int flags)

@@ -158,6 +206,9 @@ static void rapl_pmu_event_stop(struct perf_event *event, int flags)

if (__test_and_clear_bit(hwc->idx, pmu->active_mask)) {

WARN_ON_ONCE(pmu->n_active <= 0);
pmu->n_active--;
+ if (pmu->n_active == 0)
+ rapl_stop_hrtimer(pmu);

+
pmu->events[hwc->idx] = NULL;

WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
hwc->state |= PERF_HES_STOPPED;
@@ -394,6 +445,7 @@ static int rapl_cpu_prepare(int cpu)
{

struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);

int phys_id = topology_physical_package_id(cpu);
+ u64 ms;

if (pmu)
return 0;
@@ -415,6 +467,20 @@ static int rapl_cpu_prepare(int cpu)
rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);

pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;

+ /*
+ * use reference of 200W for scaling the timeout
+ * to avoid missing counter overflows.
+ * 200W = 200 Joules/sec
+ * divide interval by 2 to avoid lockstep (2 * 100)
+ * if hw unit is 32, then we use 2 ms 1/200/2
+ */
+ if (pmu->hw_unit < 32)
+ ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+ else
+ ms = 2;
+
+ pmu->timer_interval = ms_to_ktime(ms);

+
/* set RAPL pmu for this cpu for now */

per_cpu(rapl_pmu_kfree, cpu) = NULL;
per_cpu(rapl_pmu, cpu) = pmu;
@@ -559,6 +625,7 @@ static int __init rapl_pmu_init(void)
}
rapl_cpu_prepare(cpu);
cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ rapl_hrtimer_init(per_cpu(rapl_pmu, cpu));
}

perf_cpu_notifier(rapl_cpu_notifier);
@@ -567,11 +634,13 @@ static int __init rapl_pmu_init(void)
WARN_ON(ret);

pmu = __get_cpu_var(rapl_pmu);
- pr_info("RAPL PMU detected, hw unit 2^-%d Joules, "

+ pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"

" API unit is 2^-32 Joules,"

- " %d fixed counters\n",
+ " %d fixed counters,",
+ " %Lu ms ovfl timer\n",
pmu->hw_unit,
- hweight32(rapl_cntr_mask));
+ hweight32(rapl_cntr_mask),
+ ktime_to_ms(pmu->timer_interval));

put_online_cpus();

Borislav Petkov

unread,

Oct 7, 2013, 12:19:37 PM10/7/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
> The counters all count in the same unit. The perf_events API
> exposes all RAPL counters as 64-bit integers counting in unit
> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
> the counts by multiplying them by 0.23 and divide 10^9 to
> obtain Joules. The reason for this is that the kernel avoids
> doing floating point math whenever possible because it is
> expensive (user floating-point state must be saved). The method
> used avoids kernel floating-point and minimizes the loss of
> precision (bits). Thanks to PeterZ for suggesting this approach.
>
> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

..

> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
> time counts events
> 1.000345931 772 278 493 rapl/rapl-energy-cores/
> 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
> 2.000836387 771 751 936 rapl/rapl-energy-cores/
> 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

So can we do the Watt conversion in perf tool and make that "counts"
output more human-friendly like what those numbers are, to which
core/LLC they belong, etc, etc?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Stephane Eranian

unread,

Oct 7, 2013, 12:24:26 PM10/7/13

to Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

On Mon, Oct 7, 2013 at 6:19 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
>> The counters all count in the same unit. The perf_events API
>> exposes all RAPL counters as 64-bit integers counting in unit
>> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
>> the counts by multiplying them by 0.23 and divide 10^9 to
>> obtain Joules. The reason for this is that the kernel avoids
>> doing floating point math whenever possible because it is
>> expensive (user floating-point state must be saved). The method
>> used avoids kernel floating-point and minimizes the loss of
>> precision (bits). Thanks to PeterZ for suggesting this approach.
>>
>> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)
>

> ...

>
>> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
>> time counts events
>> 1.000345931 772 278 493 rapl/rapl-energy-cores/
>> 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
>> 2.000836387 771 751 936 rapl/rapl-energy-cores/
>> 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/
>
> So can we do the Watt conversion in perf tool and make that "counts"
> output more human-friendly like what those numbers are, to which
> core/LLC they belong, etc, etc?
>

We could but that means we would need to special case the events in perf.
I was trying to avoid that.

Peter Zijlstra

unread,

Oct 7, 2013, 12:29:49 PM10/7/13

to Stephane Eranian, linux-...@vger.kernel.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:

> The counters all count in the same unit. The perf_events API
> exposes all RAPL counters as 64-bit integers counting in unit
> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
> the counts by multiplying them by 0.23 and divide 10^9 to
> obtain Joules. The reason for this is that the kernel avoids
> doing floating point math whenever possible because it is
> expensive (user floating-point state must be saved). The method
> used avoids kernel floating-point and minimizes the loss of
> precision (bits). Thanks to PeterZ for suggesting this approach.
>
> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

Right, so the output is in 32.32 fixed point. So if you want to convert
to double you'd do something like:

double watt = ldexp(counter, -32);

Borislav Petkov

unread,

Oct 7, 2013, 12:42:26 PM10/7/13

to Stephane Eranian, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

On Mon, Oct 07, 2013 at 06:24:18PM +0200, Stephane Eranian wrote:
> We could but that means we would need to special case the events in perf.
> I was trying to avoid that.

Maybe we could use some sort of a post-processing hook on those events'
output which is defined only for that type of events...

Andi Kleen

unread,

Oct 7, 2013, 1:55:49 PM10/7/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

Quick review. Thanks for working on this. It should work
nicely with ucevent -- people already asked for reporting
power numbers there.

> diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
> new file mode 100644
> index 0000000..f59dbd4
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
> @@ -0,0 +1,580 @@

Having a comment at the beginning of each file with two sentences
what the file roughly does and what "RAPL" actually is would be useful.

Also a pointer to the SDM chapters is also useful.

> +static u64 rapl_event_update(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + u64 prev_raw_count, new_raw_count;
> + s64 delta, sdelta;
> + int shift = RAPL_CNTR_WIDTH;
> +
> +again:
> + prev_raw_count = local64_read(&hwc->prev_count);
> + rdmsrl(event->hw.event_base, new_raw_count);
> +
> + if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> + new_raw_count) != prev_raw_count)

Add a cpu_relax()

> + goto again;
> +

> + struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
> +
> + if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
> + return;
> +
> + event->hw.state = 0;
> +
> + local64_set(&event->hw.prev_count, rapl_read_counter(event));
> +
> + pmu->n_active++;

What lock protects this add?

> +}
> +
> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);

Check n here in case it overflowed

> +
> + buf[n++] = '\n';
> + buf[n] = '\0';
> + return n;

> + for_each_online_cpu(i) {
> + pmu2 = per_cpu(rapl_pmu, i);
> +
> + if (!pmu2 || i == cpu)
> + continue;
> +
> + if (pmu2->phys_id == phys_id) {
> + per_cpu(rapl_pmu, cpu) = pmu2;
> + per_cpu(rapl_pmu_kfree, cpu) = pmu1;
> + atomic_inc(&pmu2->refcnt);
> + break;
> + }
> + }

Doesn't this need a lock of some form? AFAIK we can do parallel
CPU startup now.

Similar to the other code walking the CPUs.

> +static int __init rapl_pmu_init(void)
> +{
> + struct rapl_pmu *pmu;
> + int i, cpu, ret;

You need to check for Intel CPU here, as this is called unconditionally.

A more modern way to do this is to use x86_cpu_id.
This would in principle allow making it a module later (if perf ever
supports that)

> +
> + /* check supported CPU */
> + switch (boot_cpu_data.x86_model) {
> + case 42: /* Sandy Bridge */
> + case 58: /* Ivy Bridge */
> + case 60: /* Haswell */

Need more model numbers for Haswell (see the main perf driver)

> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index abe69af..12bfd7d 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
> if (readn(FD(evsel, cpu, thread),
> &count, nv * sizeof(u64)) < 0)
> return -errno;
> -
> aggr->val += count.val;
> if (scale) {
> aggr->ena += count.ena;

Bogus hunk

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Peter Zijlstra

unread,

Oct 7, 2013, 2:08:35 PM10/7/13

to Andi Kleen, Stephane Eranian, linux-...@vger.kernel.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> This would in principle allow making it a module later (if perf ever
> supports that)

IIRC its a few EXPORTs away from being able to do that.

Andi Kleen

unread,

Oct 7, 2013, 3:23:05 PM10/7/13

to Peter Zijlstra, Stephane Eranian, linux-...@vger.kernel.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > This would in principle allow making it a module later (if perf ever
> > supports that)
>
> IIRC its a few EXPORTs away from being able to do that.

Great. With ~700k text that would be a good thing.
After all most users don't develop.

Hopefully we can get there soon.

Is anyone actively working on it?

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Peter Zijlstra

unread,

Oct 7, 2013, 4:34:14 PM10/7/13

to Andi Kleen, Stephane Eranian, linux-...@vger.kernel.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Mon, Oct 07, 2013 at 12:22:58PM -0700, Andi Kleen wrote:
> On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > > This would in principle allow making it a module later (if perf ever
> > > supports that)
> >
> > IIRC its a few EXPORTs away from being able to do that.
>
> Great. With ~700k text that would be a good thing.
> After all most users don't develop.
>
> Hopefully we can get there soon.
>
> Is anyone actively working on it?

All of perf being a module; no and that's not actually going to happen.
PMU driver modules should be fairly simple though.

Dunno if anybody is working on that, I typically consider my .config
broken if its got =m in it.

Stephane Eranian

unread,

Oct 7, 2013, 4:58:51 PM10/7/13

to Andi Kleen, LKML, Peter Zijlstra, mi...@elte.hu, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

On Mon, Oct 7, 2013 at 7:55 PM, Andi Kleen <a...@linux.intel.com> wrote:
> Quick review. Thanks for working on this. It should work
> nicely with ucevent -- people already asked for reporting
> power numbers there.
>

Yes, got some requests myself too. So I implemented this.

>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
>> new file mode 100644
>> index 0000000..f59dbd4
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
>> @@ -0,0 +1,580 @@
>
> Having a comment at the beginning of each file with two sentences
> what the file roughly does and what "RAPL" actually is would be useful.
>
> Also a pointer to the SDM chapters is also useful.
>

Forgot to add that. Will do in V2.

>> +static u64 rapl_event_update(struct perf_event *event)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + u64 prev_raw_count, new_raw_count;
>> + s64 delta, sdelta;
>> + int shift = RAPL_CNTR_WIDTH;
>> +
>> +again:
>> + prev_raw_count = local64_read(&hwc->prev_count);
>> + rdmsrl(event->hw.event_base, new_raw_count);
>> +
>> + if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
>> + new_raw_count) != prev_raw_count)
>
> Add a cpu_relax()
>

But then it should be in perf_event_*.c as well.
It's a verbatim copy of the existing code. Now given that RAPL does not
interrupt, the only risk here would be preemption. I did not verify whether
this function was always called with interrupts disabled. So I left the retry
loop.

>> + goto again;
>> +
>> + struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>> +
>> + if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>> + return;
>> +
>> + event->hw.state = 0;
>> +
>> + local64_set(&event->hw.prev_count, rapl_read_counter(event));
>> +
>> + pmu->n_active++;
>
> What lock protects this add?
>

None. I will add one. Bu then I am wondering about if it is really
necessary given
that RAPL event are system-wide and this pinned to a CPU. If the call came
from another CPU, then it IPI there, and that means that CPU is executing that
code. Any other CPU will need IPI too, and that interrupt will be kept pending.
Am I missing a test case here? Are IPI reentrant?

>> +}
>> +
>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>> + struct device_attribute *attr, char *buf)
>> +{
>> + int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>
> Check n here in case it overflowed
>

But isn't that what the -2 and the below \n\0 are for?

>> +
>> + buf[n++] = '\n';
>> + buf[n] = '\0';
>> + return n;
>
>
>
>> + for_each_online_cpu(i) {
>> + pmu2 = per_cpu(rapl_pmu, i);
>> +
>> + if (!pmu2 || i == cpu)
>> + continue;
>> +
>> + if (pmu2->phys_id == phys_id) {
>> + per_cpu(rapl_pmu, cpu) = pmu2;
>> + per_cpu(rapl_pmu_kfree, cpu) = pmu1;
>> + atomic_inc(&pmu2->refcnt);
>> + break;
>> + }
>> + }
>
> Doesn't this need a lock of some form? AFAIK we can do parallel
> CPU startup now.
>

Did not know about this change? But then that means all the other
perf_event *_starting() and maybe even _*prepare() routines must also
use locks. I can add that to RAPL.

> Similar to the other code walking the CPUs.
>
>> +static int __init rapl_pmu_init(void)
>> +{
>> + struct rapl_pmu *pmu;
>> + int i, cpu, ret;
>
> You need to check for Intel CPU here, as this is called unconditionally.
>
> A more modern way to do this is to use x86_cpu_id.
> This would in principle allow making it a module later (if perf ever
> supports that)
>

Forgot that, will fix it.

>> +
>> + /* check supported CPU */
>> + switch (boot_cpu_data.x86_model) {
>> + case 42: /* Sandy Bridge */
>> + case 58: /* Ivy Bridge */
>> + case 60: /* Haswell */
>
> Need more model numbers for Haswell (see the main perf driver)
>

Don't have all the models to test...

>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index abe69af..12bfd7d 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
>> if (readn(FD(evsel, cpu, thread),
>> &count, nv * sizeof(u64)) < 0)
>> return -errno;
>> -
>> aggr->val += count.val;
>> if (scale) {
>> aggr->ena += count.ena;
>
> Bogus hunk
>

Arg, yes. It should not be here.

Thanks for the review.

Andi Kleen

unread,

Oct 7, 2013, 5:45:50 PM10/7/13

to Stephane Eranian, LKML, Peter Zijlstra, mi...@elte.hu, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

Stephane Eranian <era...@google.com> writes:
>
>>> + goto again;
>>> +
>>> + struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>> +
>>> + if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>> + return;
>>> +
>>> + event->hw.state = 0;
>>> +
>>> + local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>> +
>>> + pmu->n_active++;
>>
>> What lock protects this add?
>>
> None. I will add one. Bu then I am wondering about if it is really
> necessary given
> that RAPL event are system-wide and this pinned to a CPU. If the call came
> from another CPU, then it IPI there, and that means that CPU is executing that
> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
> Am I missing a test case here? Are IPI reentrant?

they can be if interrupts are enabled (likely here)

>
>>> +}
>>> +
>>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>>
>> Check n here in case it overflowed
>>
> But isn't that what the -2 and the below \n\0 are for?

I know it's very unlikely and other stuff would break, but

Assuming you have a system with some many CPUs that they don't fit
into a page. Then the scnprintf would fail, but you would corrupt
random data because you write before the buffer.

>> Doesn't this need a lock of some form? AFAIK we can do parallel
>> CPU startup now.
>>
> Did not know about this change? But then that means all the other
> perf_event *_starting() and maybe even _*prepare() routines must also
> use locks. I can add that to RAPL.

Yes may be broken everywhere.

>>> + /* check supported CPU */
>>> + switch (boot_cpu_data.x86_model) {
>>> + case 42: /* Sandy Bridge */
>>> + case 58: /* Ivy Bridge */
>>> + case 60: /* Haswell */
>>
>> Need more model numbers for Haswell (see the main perf driver)
>>
> Don't have all the models to test...

It should be all the same.

-Andi
--
a...@linux.intel.com -- Speaking for myself only

Stephane Eranian

unread,

Oct 7, 2013, 6:38:41 PM10/7/13

to Andi Kleen, LKML, Peter Zijlstra, mi...@elte.hu, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

On Mon, Oct 7, 2013 at 11:45 PM, Andi Kleen <an...@firstfloor.org> wrote:
> Stephane Eranian <era...@google.com> writes:
>>
>>>> + goto again;
>>>> +
>>>> + struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>>> +
>>>> + if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>>> + return;
>>>> +
>>>> + event->hw.state = 0;
>>>> +
>>>> + local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>>> +
>>>> + pmu->n_active++;
>>>
>>> What lock protects this add?
>>>
>> None. I will add one. Bu then I am wondering about if it is really
>> necessary given
>> that RAPL event are system-wide and this pinned to a CPU. If the call came
>> from another CPU, then it IPI there, and that means that CPU is executing that
>> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
>> Am I missing a test case here? Are IPI reentrant?
>
> they can be if interrupts are enabled (likely here)
>

I will check on that.

>>
>>>> +}
>>>> +
>>>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>>>> + struct device_attribute *attr, char *buf)
>>>> +{
>>>> + int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>>>
>>> Check n here in case it overflowed
>>>
>> But isn't that what the -2 and the below \n\0 are for?
>
> I know it's very unlikely and other stuff would break, but
>
> Assuming you have a system with some many CPUs that they don't fit
> into a page. Then the scnprintf would fail, but you would corrupt
> random data because you write before the buffer.
>

My understanding is that cpulist_scnprintf() behaves like snprintf(). It
generates up to PAGE_SIZE-2 characters in the buffer. So if you
have a very large number of CPUs, the generation of the output string in buf
will stop, i.e., truncated string. The return value is the length of the string.
That n cannot be negative. So how you could write buffer the buffer (buf)?

The part I don't like about the API of rapl_get_attr_cpumask() here is that
it assumes that the buf is PAGE_SIZE. Its size is not passed as an argument.
But maybe this is what you are pointing out to me.

>>> Doesn't this need a lock of some form? AFAIK we can do parallel
>>> CPU startup now.
>>>
>> Did not know about this change? But then that means all the other
>> perf_event *_starting() and maybe even _*prepare() routines must also
>> use locks. I can add that to RAPL.
>
> Yes may be broken everywhere.
>
>>>> + /* check supported CPU */
>>>> + switch (boot_cpu_data.x86_model) {
>>>> + case 42: /* Sandy Bridge */
>>>> + case 58: /* Ivy Bridge */
>>>> + case 60: /* Haswell */
>>>
>>> Need more model numbers for Haswell (see the main perf driver)
>>>
>> Don't have all the models to test...
>
> It should be all the same.
>

Need to know which ones are client vs. servers. Not have the same
number of RAPL events.

Thanks.

Ingo Molnar

unread,

Oct 8, 2013, 3:37:18 AM10/8/13

to Andi Kleen, Peter Zijlstra, Stephane Eranian, linux-...@vger.kernel.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

* Andi Kleen <a...@linux.intel.com> wrote:

> On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > > This would in principle allow making it a module later (if perf ever
> > > supports that)
> >
> > IIRC its a few EXPORTs away from being able to do that.
>
> Great. With ~700k text that would be a good thing.

Nonsense, the real overhead of core perf + PMU drivers on x86-64 is around
150k.

The 700k overhead you claimed is not reproducible, at all, and I tested it
with your own config:

https://lkml.org/lkml/2013/10/8/62

700k in perf is nonsensical - it's an obvious lie really, the code sizes
of the relevant perf .o objects are nowhere even _close_ to that amount:

hubble:~/tip> size $(find . -name '*perf*.o')
text data bss dec hex filename
887 0 32 919 397 ./arch/x86/kernel/cpu/perfctr-watchdog.o
2932 680 96 3708 e7c ./arch/x86/kernel/cpu/perf_event_amd_uncore.o
14117 6361 1 20479 4fff ./arch/x86/kernel/cpu/perf_event_intel.o
23787 19541 264 43592 aa48 ./arch/x86/kernel/cpu/perf_event_intel_uncore.o
4531 1572 0 6103 17d7 ./arch/x86/kernel/cpu/perf_event_p4.o
1728 1121 0 2849 b21 ./arch/x86/kernel/cpu/perf_event_knc.o
4591 811 32 5434 153a ./arch/x86/kernel/cpu/perf_event_amd_ibs.o
13394 6797 116 20307 4f53 ./arch/x86/kernel/cpu/perf_event.o
3156 1483 32 4671 123f ./arch/x86/kernel/cpu/perf_event_amd_iommu.o
5322 4396 0 9718 25f6 ./arch/x86/kernel/cpu/perf_event_intel_ds.o
11383 1 0 11384 2c78 ./arch/x86/kernel/cpu/perf_event_intel_lbr.o
3636 1125 0 4761 1299 ./arch/x86/kernel/cpu/perf_event_amd.o
1282 544 0 1826 722 ./arch/x86/kernel/cpu/perf_event_p6.o
278 1 0 279 117 ./arch/x86/kernel/perf_regs.o
4844 88 12 4944 1350 ./drivers/acpi/processor_perflib.o
77 96 0 173 ad ./drivers/cpufreq/cpufreq_performance.o
232 64 0 296 128 ./drivers/devfreq/governor_performance.o
1972 4 64 2040 7f8 ./kernel/trace/trace_event_perf.o

So stop making that ridiculous claim without posting exact .config's
publicly.

Thanks,

Ingo

Stephane Eranian

unread,

Oct 8, 2013, 11:10:40 AM10/8/13

to Andi Kleen, LKML, Peter Zijlstra, mi...@elte.hu, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

Andi,

On Mon, Oct 7, 2013 at 11:45 PM, Andi Kleen <an...@firstfloor.org> wrote:

> Stephane Eranian <era...@google.com> writes:
>>
>>>> + goto again;
>>>> +
>>>> + struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>>> +
>>>> + if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>>> + return;
>>>> +
>>>> + event->hw.state = 0;
>>>> +
>>>> + local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>>> +
>>>> + pmu->n_active++;
>>>
>>> What lock protects this add?
>>>
>> None. I will add one. Bu then I am wondering about if it is really
>> necessary given
>> that RAPL event are system-wide and this pinned to a CPU. If the call came
>> from another CPU, then it IPI there, and that means that CPU is executing that
>> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
>> Am I missing a test case here? Are IPI reentrant?
>
> they can be if interrupts are enabled (likely here)
>

So, I spent some time trying to figure this out via instrumentation and it seems
it is never the case that this function or in fact __perf_event_enable() for a
syswide event is called with interrupts enabled. Why?

Well, it has to do with cpu_function_call() which is ALWAYS called for a syswide
event on the perf_event_enable() code path.

If you are calling for an event on the same CPU, you end up executing:
smp_call_function_single()
if (cpu == this_cpu) {
local_irq_save(flags);
func(info);
local_irq_restore(flags);

If you are calling a remote CPU, then you end up in the APIC code to send
an IPI. On the receiving side, I could not find the local_irq_save() call, but
I verified that upon entry, __perf_event_enable() has interrupts disabled.
And that's either because I missed the interrupt masking call OR because
the HW does it automatically for us. I could not yet figure this out.

In any case, looks like both the start() and stop() routine are protected
from interrupts and thus preemption, so we may not need a lock to
protect n_active.

Stephane Eranian

unread,

Oct 10, 2013, 10:50:25 AM10/10/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, Haswell. The server skus add a 3rd counter.

The following events are available nd exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on socket
- rapl-energy-pkg: power consumption of all cores + LLc cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in unit
of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
the counts by multiplying them by 0.23 and divide 10^9 to
obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/device/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/uncore/cpumask. It
is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. Hotplug CPU is also supported.

The second patch adds a hrtimer to poll the counters given that
they do no interrupt on overflow. Hardware counters are 32-bit
wide.

In v2, we add the locking necesarry to protect the rapl_pmu
struct. We also add a description at the top of the file.
We check for Intel only processor. We improved the data
layout of the rapl_pmu struct. We also lifted the restriction
of the number of instances of RAPL counters that can be active
at the same time. RAPL is free running counters, so ought to be
able to measure events as many times as necessary in parallel
via multiple tools. There is never multiplexing among RAPL events.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

$ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
time counts events
1.000345931 772 278 493 rapl/rapl-energy-cores/
1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
2.000836387 771 751 936 rapl/rapl-energy-cores/
2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Stephane Eranian (3):
perf: add active_entry list head to struct perf_event

perf,x86: add Intel RAPL PMU support
perf,x86: add RAPL hrtimer support

arch/x86/kernel/cpu/Makefile | 2 +-

arch/x86/kernel/cpu/perf_event_intel_rapl.c | 688 +++++++++++++++++++++++++++
include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +
4 files changed, 691 insertions(+), 1 deletion(-)

create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

--
1.7.9.5

Stephane Eranian

unread,

Oct 10, 2013, 10:50:32 AM10/10/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds a new fields to the struct perf_event.
It is intended to be used to chain events which are
active (enabled). It helps in the hardware layer
for PMU which do not have actual counter restrictions, i.e.,
free running read-only counters. Active events are chained
as opposed to being tracked via the counter they use.

Signed-off-by: Stephane Eranian <era...@google.com>
---

include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +

2 files changed, 2 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..a376384 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -435,6 +435,7 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ struct list_head active_entry;

#endif /* CONFIG_PERF_EVENTS */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c716385..b1dbf79 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6629,6 +6629,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
INIT_LIST_HEAD(&event->rb_entry);
+ INIT_LIST_HEAD(&event->active_entry);

init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);

Stephane Eranian

unread,

Oct 10, 2013, 10:50:36 AM10/10/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

or ldexp(C, -32).

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/device/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

Signed-off-by: Stephane Eranian <era...@google.com>
---

arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 623 +++++++++++++++++++++++++++
2 files changed, 624 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o
endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_rapl.o
endif

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644

index 0000000..abaaf4f
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,623 @@
+/*
+ * perf_event_intel_rapl.c: support Intel RAPL energy consumption counters
+ * Copyright (C) 2013 Google, Inc., Stephane Eranian
+ *
+ * Intel RAPL interface is specified in the IA-32 Manual Vol3b
+ * section 14.7.1 (September 2013)
+ *
+ * RAPL provides more controls than just reporting energy consumption
+ * however here we only expose the 3 energy consumption free running
+ * counters (pp0, pkg, dram).
+ *
+ * Each of those counters increments in a power unit defined by the
+ * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
+ * but it can vary.
+ *
+ * Counter to rapl events mappings:
+ *
+ * pp0 counter: consumption of all physical cores (power plane 0)
+ * event: rapl_energy_cores
+ * perf code: 0x1
+ *
+ * pkg counter: consumption of the whole processor package
+ * event: rapl_energy_pkg
+ * perf code: 0x2
+ *
+ * dram counter: consumption of the dram domain (servers only)
+ * event: rapl_energy_dram
+ * perf code: 0x3
+ *
+ * We manage those counters as free running (read-only). They may be
+ * use simultaneously by other tools, such as turbostat.
+ *
+ * The events only support system-wide mode counting. There is no
+ * sampling support because it does not make sense and is not
+ * supported by the RAPL hardware.
+ *
+ * Because we want to avoid floating-point operations in the kernel,
+ * the events are all reported in fixed point arithmetic (32.32).
+ * Tools must adjust the counts to convert them to Watts using
+ * the duration of the measurement. Tools may use a function such as
+ * ldexp(raw_count, -32);
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "perf_event.h"
+

+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT 0 /* all cores */
+#define INTEL_RAPL_PP0 0x1 /* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT 1 /* entire package */
+#define INTEL_RAPL_PKG 0x2 /* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT 2 /* DRAM */
+#define INTEL_RAPL_RAM 0x3 /* pseudo-encoding */
+

+/* Clients have PP0, PKG */

+#define RAPL_IDX_CLN (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT|\
+ 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*

+ * event code: LSB 8 bits, passed in attr->config

+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK 0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format) \
+static ssize_t __rapl_##_var##_show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *page) \
+{ \
+ BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+ return sprintf(page, _format "\n"); \
+} \
+static struct kobj_attribute format_attr_##_var = \
+ __ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config) \
+{ \
+ .attr = __ATTR(_name, 0444, rapl_event_show, NULL), \
+ .config = _config, \
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {

+ spinlock_t lock;
+ atomic_t refcnt;
+ int hw_unit; /* 1/2^hw_unit Joule */
+ int phys_id;
+ int n_active; /* number of active events */
+ struct list_head active_list;

+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_kfree);
+

+static DEFINE_SPINLOCK(rapl_hotplug_lock);

+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+ u64 raw;
+ rdmsrl(event->hw.event_base, raw);
+ return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+ /*
+ * scale delta to smallest unit (1/2^32)
+ * users must then scale back: count * 1/(1e9*2^32) to get Joules

+ * or use ldexp(count, -32).

+ * Watts = Joules/Time delta
+ */

+ return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+

+static u64 rapl_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev_raw_count, new_raw_count;
+ s64 delta, sdelta;
+ int shift = RAPL_CNTR_WIDTH;
+
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdmsrl(event->hw.event_base, new_raw_count);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,

+ new_raw_count) != prev_raw_count) {
+ cpu_relax();

+ goto again;
+ }
+

+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ */
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+
+ sdelta = rapl_scale(delta);
+
+ local64_add(sdelta, &event->count);
+
+ return new_raw_count;
+}
+

+static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
+ struct perf_event *event)

+{
+ if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+ return;
+
+ event->hw.state = 0;
+

+ list_add_tail(&event->active_entry, &pmu->active_list);

+
+ local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+ pmu->n_active++;

+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int mode)

+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);

+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+ __rapl_pmu_event_start(pmu, event);
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int mode)

+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);

+ struct hw_perf_event *hwc = &event->hw;

+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);

+
+ /* mark event as deactivated and stopped */

+ if (!(hwc->state & PERF_HES_STOPPED)) {

+ WARN_ON_ONCE(pmu->n_active <= 0);
+ pmu->n_active--;
+

+ list_del(&event->active_entry);
+

+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* check if update of sw counter is necessary */

+ if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {

+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ rapl_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+

+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int mode)

+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);

+ struct hw_perf_event *hwc = &event->hw;

+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);

+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+

+ if (mode & PERF_EF_START)
+ __rapl_pmu_event_start(pmu, event);
+
+ spin_unlock_irqrestore(&pmu->lock, flags);

+
+ return 0;
+}
+

+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+

+ event->hw.config = cfg;

+ event->hw.idx = bit;
+

+ return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+ rapl_event_update(event);

+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);

+
+ buf[n++] = '\n';
+ buf[n] = '\0';
+ return n;
+}

+ spin_lock(&rapl_hotplug_lock);

+
+ /* check if phys_is is already covered */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (i == 0)
+ continue;
+ if (phys_id == topology_physical_package_id(i))
+ return;
+ }
+ /* was not found, so add it */
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+

+ spin_unlock(&rapl_hotplug_lock);

+}
+
+static int rapl_cpu_prepare(int cpu)

+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ int phys_id = topology_physical_package_id(cpu);
+

+ if (pmu)
+ return 0;
+
+ if (phys_id < 0)
+ return -1;
+
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ if (!pmu)
+ return -1;
+

+ spin_lock_init(&pmu->lock);

+ atomic_set(&pmu->refcnt, 1);
+

+ INIT_LIST_HEAD(&pmu->active_list);
+

+ pmu->phys_id = phys_id;
+ /*
+ * grab power unit as: 1/2^unit Joules
+ *
+ * we cache in local PMU instance
+ */
+ rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);

+ pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+

+ /* set RAPL pmu for this cpu for now */

+ per_cpu(rapl_pmu_kfree, cpu) = NULL;
+ per_cpu(rapl_pmu, cpu) = pmu;
+
+ return 0;
+}
+
+static int rapl_cpu_starting(int cpu)
+{
+ struct rapl_pmu *pmu2;
+ struct rapl_pmu *pmu1 = per_cpu(rapl_pmu, cpu);
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ if (pmu1)
+ return 0;
+

+ spin_lock(&rapl_hotplug_lock);
+

+ for_each_online_cpu(i) {
+ pmu2 = per_cpu(rapl_pmu, i);
+
+ if (!pmu2 || i == cpu)
+ continue;
+
+ if (pmu2->phys_id == phys_id) {
+ per_cpu(rapl_pmu, cpu) = pmu2;
+ per_cpu(rapl_pmu_kfree, cpu) = pmu1;
+ atomic_inc(&pmu2->refcnt);
+ break;
+ }
+ }

+ spin_unlock(&rapl_hotplug_lock);

+ return 0;
+}
+
+static int rapl_cpu_dying(int cpu)

+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ struct perf_event *event, *tmp;

+
+ if (!pmu)
+ return 0;
+

+ spin_lock(&rapl_hotplug_lock);
+
+ /*

+ * stop all syswide RAPL events on that CPU
+ * as a consequence also stops the hrtimer
+ */

+ list_for_each_entry_safe(event, tmp, &pmu->active_list, active_entry) {

+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+ }
+

+ per_cpu(rapl_pmu, cpu) = NULL;
+
+ if (atomic_dec_and_test(&pmu->refcnt))
+ kfree(pmu);
+

+ spin_unlock(&rapl_hotplug_lock);

+static const struct x86_cpu_id rapl_cpu_match[] = {
+ [0] = { .vendor = X86_VENDOR_INTEL, .family = 6 },
+ [1] = {},

+};
+static int __init rapl_pmu_init(void)

+{

+ struct rapl_pmu *pmu;
+ int i, cpu, ret;

+
+ /*
+ * check for Intel processor family 6
+ */
+ if (!x86_match_cpu(rapl_cpu_match))
+ return 0;
+

+ /* check supported CPU */
+ switch (boot_cpu_data.x86_model) {
+ case 42: /* Sandy Bridge */
+ case 58: /* Ivy Bridge */
+ case 60: /* Haswell */

+ rapl_cntr_mask = RAPL_IDX_CLN;
+ rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+ break;
+ case 45: /* Sandy Bridge-EP */
+ case 62: /* IvyTown */
+ rapl_cntr_mask = RAPL_IDX_SRV;
+ rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+ break;
+
+ default:
+ /* unsupported */
+ return 0;
+ }
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {

+ int phys_id = topology_physical_package_id(cpu);
+

+ pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"

+ " API unit is 2^-32 Joules,"
+ " %d fixed counters\n",
+ pmu->hw_unit,
+ hweight32(rapl_cntr_mask));
+
+ put_online_cpus();
+
+ return 0;
+}
+device_initcall(rapl_pmu_init);

Stephane Eranian

unread,

Oct 10, 2013, 10:50:40 AM10/10/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer internval is calculated at boot time
based on the power unit used by the HW.

Signed-off-by: Stephane Eranian <era...@google.com>
---

arch/x86/kernel/cpu/perf_event_intel_rapl.c | 75 +++++++++++++++++++++++++--
1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index abaaf4f..c5a6f51 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -92,11 +92,13 @@ static struct kobj_attribute format_attr_##_var = \

struct rapl_pmu {
spinlock_t lock;
- atomic_t refcnt;

int hw_unit; /* 1/2^hw_unit Joule */

- int phys_id;
- int n_active; /* number of active events */
+ struct hrtimer hrtimer;
struct list_head active_list;

+ ktime_t timer_interval; /* in ktime_t unit */

+ int n_active; /* number of active events */

+ int phys_id;
+ atomic_t refcnt;
};

static struct pmu rapl_pmu_class;
@@ -161,6 +163,47 @@ static u64 rapl_event_update(struct perf_event *event)
return new_raw_count;
}

+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+ __hrtimer_start_range_ns(&pmu->hrtimer,
+ pmu->timer_interval, 0,
+ HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+ hrtimer_cancel(&pmu->hrtimer);
+}
+

+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)

+{

+ struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);

+ struct perf_event *event;

+ unsigned long flags;
+

+ if (!pmu->n_active)
+ return HRTIMER_NORESTART;

+
+ spin_lock_irqsave(&pmu->lock, flags);
+

+ list_for_each_entry(event, &pmu->active_list, active_entry) {
+ rapl_event_update(event);

+ }
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+

+ hrtimer_forward_now(&pmu->hrtimer, pmu->timer_interval);
+
+ return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+ hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ pmu->hrtimer.function = rapl_hrtimer_handle;
+}
+
+

static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
struct perf_event *event)
{
@@ -174,6 +217,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,

local64_set(&event->hw.prev_count, rapl_read_counter(event));

pmu->n_active++;
+ if (pmu->n_active == 1)
+ rapl_start_hrtimer(pmu);
}

static void rapl_pmu_event_start(struct perf_event *event, int mode)

@@ -198,6 +243,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)

if (!(hwc->state & PERF_HES_STOPPED)) {

WARN_ON_ONCE(pmu->n_active <= 0);
pmu->n_active--;
+ if (pmu->n_active == 0)
+ rapl_stop_hrtimer(pmu);

list_del(&event->active_entry);

@@ -416,6 +463,7 @@ static int rapl_cpu_prepare(int cpu)
{

struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);

int phys_id = topology_physical_package_id(cpu);

+ u64 ms;

if (pmu)
return 0;

@@ -441,6 +489,20 @@ static int rapl_cpu_prepare(int cpu)
rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);

pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;

+ /*
+ * use reference of 200W for scaling the timeout
+ * to avoid missing counter overflows.
+ * 200W = 200 Joules/sec
+ * divide interval by 2 to avoid lockstep (2 * 100)
+ * if hw unit is 32, then we use 2 ms 1/200/2
+ */
+ if (pmu->hw_unit < 32)
+ ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+ else
+ ms = 2;
+
+ pmu->timer_interval = ms_to_ktime(ms);

+
/* set RAPL pmu for this cpu for now */

per_cpu(rapl_pmu_kfree, cpu) = NULL;
per_cpu(rapl_pmu, cpu) = pmu;

@@ -602,6 +664,7 @@ static int __init rapl_pmu_init(void)

}
rapl_cpu_prepare(cpu);
cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ rapl_hrtimer_init(per_cpu(rapl_pmu, cpu));
}

perf_cpu_notifier(rapl_cpu_notifier);

@@ -612,9 +675,11 @@ static int __init rapl_pmu_init(void)
pmu = __get_cpu_var(rapl_pmu);

pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"

" API unit is 2^-32 Joules,"

- " %d fixed counters\n",
+ " %d fixed counters"
+ " %llu ms ovfl timer\n",

pmu->hw_unit,
- hweight32(rapl_cntr_mask));
+ hweight32(rapl_cntr_mask),
+ ktime_to_ms(pmu->timer_interval));

put_online_cpus();

Andi Kleen

unread,

Oct 10, 2013, 1:43:27 PM10/10/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

Looks all good to me now.

Reviewed-by: Andi Kleen <a...@linux.intel.com>

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Borislav Petkov

unread,

Oct 10, 2013, 2:01:01 PM10/10/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
> time counts events
> 1.000345931 772 278 493 rapl/rapl-energy-cores/
> 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
> 2.000836387 771 751 936 rapl/rapl-energy-cores/
> 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Hmm, so I'm looking at builtin-stat.c::print_interval() and since
it gets the perf_evsel counters and you can deduce the counter name
from it, you probably could match the rapl counters and do the Watts
conversion above as a special case.

I dunno, it is much better than having some naked numbers for which
people have to go stare at the sources + CPU vendor docs as to what they
actually mean.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Ingo Molnar

unread,

Oct 16, 2013, 8:46:37 AM10/16/13

to Borislav Petkov, Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com

So, the RAPL patch-set clearly needs more work.

* Borislav Petkov <b...@alien8.de> wrote:

> On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
> > $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
> > time counts events
> > 1.000345931 772 278 493 rapl/rapl-energy-cores/
> > 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
> > 2.000836387 771 751 936 rapl/rapl-energy-cores/
> > 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Why is there the rapl/rapl duplication in the event name? It should be
rapl/energy-cores, rapl/energy-pkg, etc.

I'm also not sure about the Intel-specific naming. Joules per core and
Joules per socket ought to be pretty generic, even if the initial
implementation is Intel-only. I.e.:

power/energy-core
power/energy-pkg

> Hmm, so I'm looking at builtin-stat.c::print_interval() and since it
> gets the perf_evsel counters and you can deduce the counter name from
> it, you probably could match the rapl counters and do the Watts
> conversion above as a special case.
>
> I dunno, it is much better than having some naked numbers for which
> people have to go stare at the sources + CPU vendor docs as to what they
> actually mean.

So what should happen here is to extend the sysfs attributes that tell us
that it's in 32.32 fixed-point format.

We should also tell user-space that the unit of this counter is 'Joule'.

Then things like:

perf stat -a -e power/* sleep 1

would output, without knowing any RAPL details:

0.20619 Joule power/energy-core
2.42151 Joule power/energy-pkg

or so.

Other platforms offering energy measurement facilities will then name
their counters in the same power/* (or energy/*) namespace, with new names
if they do something fundamentally differently.

Tooling can then generalize along these abstractions, as much as the
hardware allows it.

Thanks,

Ingo

Stephane Eranian

unread,

Oct 16, 2013, 9:14:01 AM10/16/13

to Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng

On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <mi...@kernel.org> wrote:
>
> So, the RAPL patch-set clearly needs more work.
>
> * Borislav Petkov <b...@alien8.de> wrote:
>
>> On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
>> > $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
>> > time counts events
>> > 1.000345931 772 278 493 rapl/rapl-energy-cores/
>> > 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
>> > 2.000836387 771 751 936 rapl/rapl-energy-cores/
>> > 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/
>
> Why is there the rapl/rapl duplication in the event name? It should be
> rapl/energy-cores, rapl/energy-pkg, etc.
>

yeah, I thought about doing that too. I will change the names.

> I'm also not sure about the Intel-specific naming. Joules per core and
> Joules per socket ought to be pretty generic, even if the initial
> implementation is Intel-only. I.e.:
>

Joules per cores (with an s)
Joules per package.
Joules per dram, i.e., all the DRAM attached to a socket (I think).

> power/energy-core
> power/energy-pkg
>
Fine with me. Or joules-cores to make the unit explicit

>> Hmm, so I'm looking at builtin-stat.c::print_interval() and since it
>> gets the perf_evsel counters and you can deduce the counter name from
>> it, you probably could match the rapl counters and do the Watts
>> conversion above as a special case.
>>
>> I dunno, it is much better than having some naked numbers for which
>> people have to go stare at the sources + CPU vendor docs as to what they
>> actually mean.
>
> So what should happen here is to extend the sysfs attributes that tell us
> that it's in 32.32 fixed-point format.
>

We could add that in sysfs, but then I am wondering how would the tool realize
it has to use this file. We'd have to create something generic like a scaling
factor. If the file is there, then use it, if not assume 1x. Is that what you
are thinking about?

> We should also tell user-space that the unit of this counter is 'Joule'.
>
> Then things like:
>
> perf stat -a -e power/* sleep 1
>
> would output, without knowing any RAPL details:
>
> 0.20619 Joule power/energy-core
> 2.42151 Joule power/energy-pkg
>

Not sure there is already some support for this in perf stat. Arnaldo?
If not that we need another sysfs file to export the unit. Another
possibility is for perf stat to recognize the power/* and extract the
unit from the event name. In my example power/joules-cores -> joules.

Arnaldo Carvalho de Melo

unread,

Oct 16, 2013, 1:54:01 PM10/16/13

to Stephane Eranian, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <mi...@kernel.org> wrote:
> > We should also tell user-space that the unit of this counter is 'Joule'.
> >
> > Then things like:
> >
> > perf stat -a -e power/* sleep 1
> >
> > would output, without knowing any RAPL details:
> >
> > 0.20619 Joule power/energy-core
> > 2.42151 Joule power/energy-pkg
> >
> Not sure there is already some support for this in perf stat. Arnaldo?

Nope, there is not, we would have to have some table somewhere with
"event-regexp: unit-string"

> If not that we need another sysfs file to export the unit. Another
> possibility is for perf stat to recognize the power/* and extract the
> unit from the event name. In my example power/joules-cores -> joules.

I.e. you would be encoding the counter unit as the suffix, might as well
call it "power/cores.joules" and use the dot as the separator for the
unit, but would be just a compact form to encode the counter->unit
table.

- Arnaldo

Stephane Eranian

unread,

Oct 16, 2013, 2:14:13 PM10/16/13

to Arnaldo Carvalho de Melo, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

On Wed, Oct 16, 2013 at 7:53 PM, Arnaldo Carvalho de Melo
<ac...@redhat.com> wrote:
> Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
>> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <mi...@kernel.org> wrote:
>> > We should also tell user-space that the unit of this counter is 'Joule'.
>> >
>> > Then things like:
>> >
>> > perf stat -a -e power/* sleep 1
>> >
>> > would output, without knowing any RAPL details:
>> >
>> > 0.20619 Joule power/energy-core
>> > 2.42151 Joule power/energy-pkg
>> >
>> Not sure there is already some support for this in perf stat. Arnaldo?
>
> Nope, there is not, we would have to have some table somewhere with
> "event-regexp: unit-string"
>
>> If not that we need another sysfs file to export the unit. Another
>> possibility is for perf stat to recognize the power/* and extract the
>> unit from the event name. In my example power/joules-cores -> joules.
>
> I.e. you would be encoding the counter unit as the suffix, might as well
> call it "power/cores.joules" and use the dot as the separator for the
> unit, but would be just a compact form to encode the counter->unit
> table.
>

May be easier to add a sysfs entry with the unit to display.

Ingo Molnar

unread,

Oct 17, 2013, 4:14:34 AM10/17/13

to Stephane Eranian, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

* Stephane Eranian <era...@google.com> wrote:

> On Wed, Oct 16, 2013 at 7:53 PM, Arnaldo Carvalho de Melo
> <ac...@redhat.com> wrote:
> > Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
> >> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <mi...@kernel.org> wrote:
> >> > We should also tell user-space that the unit of this counter is 'Joule'.
> >> >
> >> > Then things like:
> >> >
> >> > perf stat -a -e power/* sleep 1
> >> >
> >> > would output, without knowing any RAPL details:
> >> >
> >> > 0.20619 Joule power/energy-core
> >> > 2.42151 Joule power/energy-pkg
> >> >
> >> Not sure there is already some support for this in perf stat. Arnaldo?
> >
> > Nope, there is not, we would have to have some table somewhere with
> > "event-regexp: unit-string"
> >
> >> If not that we need another sysfs file to export the unit. Another
> >> possibility is for perf stat to recognize the power/* and extract the
> >> unit from the event name. In my example power/joules-cores -> joules.
> >
> > I.e. you would be encoding the counter unit as the suffix, might as well
> > call it "power/cores.joules" and use the dot as the separator for the
> > unit, but would be just a compact form to encode the counter->unit
> > table.
>
> May be easier to add a sysfs entry with the unit to display.

Yes - with no entry meaning a raw 'count' or such.

Thanks,

Ingo

Peter Zijlstra

unread,

Oct 17, 2013, 5:08:04 AM10/17/13

to Ingo Molnar, Stephane Eranian, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

On Thu, Oct 17, 2013 at 10:14:20AM +0200, Ingo Molnar wrote:
> > > I.e. you would be encoding the counter unit as the suffix, might as well
> > > call it "power/cores.joules" and use the dot as the separator for the
> > > unit, but would be just a compact form to encode the counter->unit
> > > table.
> >
> > May be easier to add a sysfs entry with the unit to display.
>
> Yes - with no entry meaning a raw 'count' or such.

The downside to such a sysfs entry will be the scope. It would either be
pmu wide (unwieldy for many PMUs) or be only per listed event; and we
really don't want exhaustive event lists in the kernel.

Borislav Petkov

unread,

Oct 17, 2013, 5:12:25 AM10/17/13

to Peter Zijlstra, Ingo Molnar, Stephane Eranian, Arnaldo Carvalho de Melo, LKML, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

On Thu, Oct 17, 2013 at 11:07:30AM +0200, Peter Zijlstra wrote:
> The downside to such a sysfs entry will be the scope. It would either
> be pmu wide (unwieldy for many PMUs) or be only per listed event; and
> we really don't want exhaustive event lists in the kernel.

So why not teach perf tool to recognize the PMU instead of adding
anything to the kernel?

It seems much easier to me...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Stephane Eranian

unread,

Oct 17, 2013, 4:09:54 PM10/17/13

to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Peter,

On Thu, Oct 17, 2013 at 11:07 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Oct 17, 2013 at 10:14:20AM +0200, Ingo Molnar wrote:
>> > > I.e. you would be encoding the counter unit as the suffix, might as well
>> > > call it "power/cores.joules" and use the dot as the separator for the
>> > > unit, but would be just a compact form to encode the counter->unit
>> > > table.
>> >
>> > May be easier to add a sysfs entry with the unit to display.
>>
>> Yes - with no entry meaning a raw 'count' or such.
>
> The downside to such a sysfs entry will be the scope. It would either be
> pmu wide (unwieldy for many PMUs) or be only per listed event; and we
> really don't want exhaustive event lists in the kernel.
>

Why not put in the events subdir:

/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-cores.scaling
$ cat energy-core.unit
Joules
$ cat energy-core.scaling
0.00000000023

Perf could easily lookup those files and if they are not present it will print
the event as it does today. If present, then it will print the unit and apply
the scaling factor to the raw cont (already scaled for multiplexing).

Borislav, the scaling factor cannot be hardcoded into perf because it
can change for processor to processor.

Stephane Eranian

unread,

Oct 22, 2013, 12:47:44 PM10/22/13

to Ingo Molnar, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Hi,

I have updated my RAPL patches to implement the suggested changes.
I will post the patch very soon. The new look and feel is as folllows:

# perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I
1000 sleep 1000
# time unit counts events
1.000264953 Joules 2.09 power/energy-cores/
[100.00%]
1.000264953 Joules 5.94 power/energy-pkg/
1.000264953 160,530,320 ref-cycles
2.000640422 Joules 2.07 power/energy-cores/
2.000640422 Joules 5.94 power/energy-pkg/
2.000640422 152,673,056 ref-cycles
3.000964416 Joules 2.08 power/energy-cores/
3.000964416 Joules 5.93 power/energy-pkg/
3.000964416 158,779,184 ref-cycles

# ls -1 /sys/devices/power/events/
energy-cores
energy-cores.scale
energy-cores.unit
energy-pkg
energy-pkg.scale
energy-pkg.unit

# cat /sys/devices/power/events/energy-cores.scale
2.3e-10
# cat /sys/devices/power/events/energy-cores.unit
Joules

Of course, this unit and scaling support is generic and not limited
to the RAPL events. For now, this only works with events exported
by the kernel via sysfs.

Arnaldo Carvalho de Melo

unread,

Oct 22, 2013, 6:18:45 PM10/22/13

to Stephane Eranian, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Em Tue, Oct 22, 2013 at 06:47:38PM +0200, Stephane Eranian escreveu:
> I have updated my RAPL patches to implement the suggested changes.
> I will post the patch very soon. The new look and feel is as folllows:

> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I
> 1000 sleep 1000
> # time unit counts events
> 1.000264953 Joules 2.09 power/energy-cores/
> [100.00%]
> 1.000264953 Joules 5.94 power/energy-pkg/
> 1.000264953 160,530,320 ref-cycles
> 2.000640422 Joules 2.07 power/energy-cores/
> 2.000640422 Joules 5.94 power/energy-pkg/
> 2.000640422 152,673,056 ref-cycles
> 3.000964416 Joules 2.08 power/energy-cores/
> 3.000964416 Joules 5.93 power/energy-pkg/
> 3.000964416 158,779,184 ref-cycles

What about:

# perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000

# time events
1.000264953 2.09 Joules power/energy-cores/
1.000264953 5.94 Joules power/energy-pkg/
1.000264953 160,530,320 ref-cycles
2.000640422 2.07 Joules power/energy-cores/
2.000640422 5.94 Joules power/energy-pkg/
2.000640422 152,673,056 ref-cycles
3.000964416 2.08 Joules power/energy-cores/
3.000964416 5.93 Joules power/energy-pkg/
3.000964416 158,779,184 ref-cycles

?

Or even 2.09J power/energy-cores/?

I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
available.

- Arnaldo

Andi Kleen

unread,

Oct 23, 2013, 3:07:53 AM10/23/13

to Stephane Eranian, Ingo Molnar, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, Jiri Olsa, Yan, Zheng

> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I
> 1000 sleep 1000
> # time unit counts events
> 1.000264953 Joules 2.09 power/energy-cores/
> [100.00%]
> 1.000264953 Joules 5.94 power/energy-pkg/
> 1.000264953 160,530,320 ref-cycles
> 2.000640422 Joules 2.07 power/energy-cores/
> 2.000640422 Joules 5.94 power/energy-pkg/
> 2.000640422 152,673,056 ref-cycles
> 3.000964416 Joules 2.08 power/energy-cores/
> 3.000964416 Joules 5.93 power/energy-pkg/
> 3.000964416 158,779,184 ref-cycles

Can you add some column marker that there is no unit (like -) ?

This is just in case someone wants to parse this with a tool. Yes they
should be using -x, but it is still better to be always parseable.

-Andi

Stephane Eranian

unread,

Oct 23, 2013, 5:24:59 AM10/23/13

to Andi Kleen, Ingo Molnar, Arnaldo Carvalho de Melo, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, Jiri Olsa, Yan, Zheng

Andi,

On Wed, Oct 23, 2013 at 9:07 AM, Andi Kleen <a...@linux.intel.com> wrote:
>> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I
>> 1000 sleep 1000
>> # time unit counts events
>> 1.000264953 Joules 2.09 power/energy-cores/
>> [100.00%]
>> 1.000264953 Joules 5.94 power/energy-pkg/
>> 1.000264953 160,530,320 ref-cycles
>> 2.000640422 Joules 2.07 power/energy-cores/
>> 2.000640422 Joules 5.94 power/energy-pkg/
>> 2.000640422 152,673,056 ref-cycles
>> 3.000964416 Joules 2.08 power/energy-cores/
>> 3.000964416 Joules 5.93 power/energy-pkg/
>> 3.000964416 158,779,184 ref-cycles
>
> Can you add some column marker that there is no unit (like -) ?
>
> This is just in case someone wants to parse this with a tool. Yes they
> should be using -x, but it is still better to be always parseable.
>

It is parseable, it's just that you get an empty field: ,,
But I can add a "?".

Stephane Eranian

unread,

Oct 23, 2013, 5:34:47 AM10/23/13

to Arnaldo Carvalho de Melo, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Arnaldo,

I can try that.

> I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
> available.
>

I don't have this function is my tree yet (tip.git).

Stephane Eranian

unread,

Oct 23, 2013, 8:58:46 AM10/23/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds a new uncore PMU to expose the Intel

RAPL (Running Average Power Limit) energy consumption counters.

Up to 3 counters, each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,

IvyBridge, Haswell. The server skus add a 3rd counter to measure
DRAM power consumption.

The following events are available nd exposed in sysfs:

- power/energy-cores: power consumption of all cores on socket
- power/energy-pkg: power consumption of all cores + LLc cache
- power/energy-dram: power consumption of DRAM (server skus only)

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in unit
of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
the counts by multiplying them by 0.23 and divide 10^9 to
obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

The kernel exposes both the scaling factor (0.23 nJ) and the
unit (Joules) in sysfs:
$ ls -1 /sys/devices/power/events/energy-*
/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.scale
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-pkg
/sys/devices/power/events/energy-pkg.scale
/sys/devices/power/events/energy-pkg.unit

$ cat /sys/devices/power/events/energy-cores.scale
2.3e-10

$ cat cat /sys/devices/power/events/energy-cores.unit
Joules

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/device/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/power/cpumask. It

is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. Hotplug CPU is also supported.

The perf stat infrasrtructure is modified to now show event
unit. It also applies the scaling factor. As such it will print
RAPL events in Joules (and not increments on 0.23 nJ):

# perf stat -a -e power/energy-pkg/,power/energy-cores/,cycles -I 1000 sleep 1000
# time counts unit events
1.000282860 2.51 Joules power/energy-pkg/ [100.00%]
1.000282860 0.31 Joules power/energy-cores/
1.000282860 37765378 ? cycles [100.00%]

The patch adds a hrtimer to poll the counters given that

they do no interrupt on overflow. Hardware counters are 32-bit
wide.

In v2, we add the locking necesarry to protect the rapl_pmu
struct. We also add a description at the top of the file.
We check for Intel only processor. We improved the data
layout of the rapl_pmu struct. We also lifted the restriction
of the number of instances of RAPL counters that can be active
at the same time. RAPL is free running counters, so ought to be
able to measure events as many times as necessary in parallel
via multiple tools. There is never multiplexing among RAPL events.

In v3, we have renamed the event to be more generic power/* instead
of rapl/*. We have modified perf stat to print the event with the
unit and scaling factors.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

Signed-off-by: Stephane Eranian <era...@google.com>

Stephane Eranian (4):

perf: add active_entry list head to struct perf_event

perf stat: add event unit and scale support

perf,x86: add Intel RAPL PMU support
perf,x86: add RAPL hrtimer support

arch/x86/kernel/cpu/Makefile | 2 +-

arch/x86/kernel/cpu/perf_event_intel_rapl.c | 717 +++++++++++++++++++++++++++

include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +

create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

--
1.7.9.5

Stephane Eranian

unread,

Oct 23, 2013, 8:58:51 AM10/23/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds a new fields to the struct perf_event.
It is intended to be used to chain events which are
active (enabled). It helps in the hardware layer
for PMU which do not have actual counter restrictions, i.e.,
free running read-only counters. Active events are chained
as opposed to being tracked via the counter they use.

Signed-off-by: Stephane Eranian <era...@google.com>
---

include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +

2 files changed, 2 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..a376384 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -435,6 +435,7 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ struct list_head active_entry;

#endif /* CONFIG_PERF_EVENTS */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c

index 5bd7fe4..6ef9d19 100644

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6629,6 +6629,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
INIT_LIST_HEAD(&event->rb_entry);
+ INIT_LIST_HEAD(&event->active_entry);

init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);

Stephane Eranian

unread,

Oct 23, 2013, 8:58:54 AM10/23/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds perf stat support fo rhandling event units and
scales as exported by the kernel.

The kernel can export PMU events actual unit and scaling factor
via sysfs:

$ ls -1 /sys/devices/power/events/energy-*
/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.scale
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-pkg
/sys/devices/power/events/energy-pkg.scale
/sys/devices/power/events/energy-pkg.unit
$ cat /sys/devices/power/events/energy-cores.scale
2.3e-10
$ cat cat /sys/devices/power/events/energy-cores.unit
Joules

This patch modifies the pmu event alias code to check
for the presence of the .unit and .scale files to load
the corresponding values. They are then used by perf stat
transparentely:

# perf stat -a -e power/energy-pkg/,power/energy-cores/,cycles -I 1000 sleep 1000
# time counts unit events

1.000214717 3.07 Joules power/energy-pkg/ [100.00%]
1.000214717 0.53 Joules power/energy-cores/
1.000214717 12965028 ? cycles [100.00%]
2.000749289 3.01 Joules power/energy-pkg/
2.000749289 0.52 Joules power/energy-cores/
2.000749289 15817043 ? cycles

Signed-off-by: Stephane Eranian <era...@google.com>
---

tools/perf/builtin-stat.c | 72 ++++++++++++-----

tools/perf/util/evsel.c | 2 +
tools/perf/util/evsel.h | 3 +
tools/perf/util/parse-events.c | 1 +

tools/perf/util/pmu.c | 170 +++++++++++++++++++++++++++++++++++++++-
tools/perf/util/pmu.h | 3 +
6 files changed, 230 insertions(+), 21 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1a9c95d..43dea3b 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -138,6 +138,7 @@ static const char *post_cmd = NULL;
static bool sync_run = false;
static unsigned int interval = 0;
static unsigned int initial_delay = 0;
+static unsigned int unit_width = 4; /* strlen("unit") */
static bool forever = false;
static struct timespec ref_time;
static struct cpu_map *aggr_map;
@@ -462,17 +463,17 @@ static void print_interval(void)
if (num_print_interval == 0 && !csv_output) {
switch (aggr_mode) {
case AGGR_SOCKET:
- fprintf(output, "# time socket cpus counts events\n");
+ fprintf(output, "# time socket cpus counts %*s events\n", unit_width, "unit");
break;
case AGGR_CORE:
- fprintf(output, "# time core cpus counts events\n");
+ fprintf(output, "# time core cpus counts %*s events\n", unit_width, "unit");
break;
case AGGR_NONE:
- fprintf(output, "# time CPU counts events\n");
+ fprintf(output, "# time CPU counts %*s events\n", unit_width, "unit");
break;
case AGGR_GLOBAL:
default:
- fprintf(output, "# time counts events\n");
+ fprintf(output, "# time counts %*s events\n", unit_width, "unit");
}
}

@@ -517,6 +518,7 @@ static int __run_perf_stat(int argc, const char **argv)
unsigned long long t0, t1;
struct perf_evsel *counter;
struct timespec ts;
+ size_t l;
int status = 0;
const bool forks = (argc > 0);

@@ -566,6 +568,10 @@ static int __run_perf_stat(int argc, const char **argv)
return -1;
}
counter->supported = true;
+
+ l = strlen(counter->unit);
+ if (l > unit_width)
+ unit_width = l;
}

if (perf_evlist__apply_filters(evsel_list)) {
@@ -911,19 +917,32 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
double total, ratio = 0.0, total2;
const char *fmt;

- if (csv_output)
- fmt = "%.0f%s%s";
- else if (big_num)
- fmt = "%'18.0f%s%-25s";
- else
- fmt = "%18.0f%s%-25s";
+ if (csv_output) {
+ if (evsel->scale != 1.0)
+ fmt = "%.2f%s%s%s%s";
+ else
+ fmt = "%.0f%s%s%s%s";
+ } else if (big_num)
+ if (evsel->scale != 1.0)
+ fmt = "%'18.2f%s%-*s%s%-25s";
+ else
+ fmt = "%'18.0f%s%-*s%s%-25s";
+ else {
+ if (evsel->scale != 1.0)
+ fmt = "%18.2f%s%-*s%s%-25s";
+ else
+ fmt = "%18.0f%s%-*s%s%-25s";
+ }

aggr_printout(evsel, cpu, nr);

if (aggr_mode == AGGR_GLOBAL)
cpu = 0;

- fprintf(output, fmt, avg, csv_sep, perf_evsel__name(evsel));
+ if (csv_output)
+ fprintf(output, fmt, avg, csv_sep, evsel->unit, csv_sep, perf_evsel__name(evsel));
+ else
+ fprintf(output, fmt, avg, csv_sep, unit_width, evsel->unit, csv_sep, perf_evsel__name(evsel));

if (evsel->cgrp)
fprintf(output, "%s%s", csv_sep, evsel->cgrp->name);
@@ -1062,6 +1081,7 @@ static void print_aggr(char *prefix)
{
struct perf_evsel *counter;
int cpu, cpu2, s, s2, id, nr;
+ double uval;
u64 ena, run, val;

if (!(aggr_map || aggr_get_id))
@@ -1088,9 +1108,13 @@ static void print_aggr(char *prefix)
if (run == 0 || ena == 0) {
aggr_printout(counter, id, nr);

- fprintf(output, "%*s%s%*s",
+ fprintf(output, "%*s%s%*s%s%*s",
csv_output ? 0 : 18,
counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
+
+ csv_sep,
+ csv_output ? 0 : -10,
+ counter->unit,
csv_sep,
csv_output ? 0 : -24,
perf_evsel__name(counter));
@@ -1102,11 +1126,12 @@ static void print_aggr(char *prefix)
fputc('\n', output);
continue;
}
+ uval = val * counter->scale;

if (nsec_counter(counter))
- nsec_printout(id, nr, counter, val);
+ nsec_printout(id, nr, counter, uval);
else
- abs_printout(id, nr, counter, val);
+ abs_printout(id, nr, counter, uval);

if (!csv_output) {
print_noise(counter, 1.0);
@@ -1129,6 +1154,7 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
struct perf_stat *ps = counter->priv;
double avg = avg_stats(&ps->res_stats[0]);
int scaled = counter->counts->scaled;
+ double uval;

if (prefix)
fprintf(output, "%s", prefix);
@@ -1148,10 +1174,12 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
return;
}

+ uval = avg * counter->scale;
+
if (nsec_counter(counter))
- nsec_printout(-1, 0, counter, avg);
+ nsec_printout(-1, 0, counter, uval);
else
- abs_printout(-1, 0, counter, avg);
+ abs_printout(-1, 0, counter, uval);

print_noise(counter, avg);

@@ -1178,6 +1206,7 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
static void print_counter(struct perf_evsel *counter, char *prefix)
{
u64 ena, run, val;
+ double uval;
int cpu;

for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
@@ -1189,12 +1218,15 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
fprintf(output, "%s", prefix);

if (run == 0 || ena == 0) {
- fprintf(output, "CPU%*d%s%*s%s%*s",
+ fprintf(output, "CPU%*d%s%*s%s%*s%s%*s",
csv_output ? 0 : -4,
perf_evsel__cpus(counter)->map[cpu], csv_sep,
csv_output ? 0 : 18,
counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
csv_sep,
+ csv_output ? 0 : -10,
+ counter->unit,
+ csv_sep,
csv_output ? 0 : -24,
perf_evsel__name(counter));

@@ -1206,10 +1238,12 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
continue;
}

+ uval = val * counter->scale;
+
if (nsec_counter(counter))
- nsec_printout(cpu, 0, counter, val);
+ nsec_printout(cpu, 0, counter, uval);
else
- abs_printout(cpu, 0, counter, val);
+ abs_printout(cpu, 0, counter, uval);

if (!csv_output) {
print_noise(counter, 1.0);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 3a334f0..867971b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -162,6 +162,8 @@ void perf_evsel__init(struct perf_evsel *evsel,
evsel->idx = idx;
evsel->attr = *attr;
evsel->leader = evsel;
+ evsel->unit = "?";
+ evsel->scale = 1.0;
INIT_LIST_HEAD(&evsel->node);
hists__init(&evsel->hists);
evsel->sample_size = __perf_evsel__sample_size(attr->sample_type);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 5aa68cd..d5c6606 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -68,6 +68,8 @@ struct perf_evsel {
u32 ids;
struct hists hists;
char *name;
+ double scale;
+ const char *unit;
struct event_format *tp_format;
union {
void *priv;
@@ -130,6 +132,7 @@ extern const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX];
int __perf_evsel__hw_cache_type_op_res_name(u8 type, u8 op, u8 result,
char *bf, size_t size);
const char *perf_evsel__name(struct perf_evsel *evsel);
+
const char *perf_evsel__group_name(struct perf_evsel *evsel);
int perf_evsel__group_desc(struct perf_evsel *evsel, char *buf, size_t size);

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index c90e55c..be7eba8 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -838,6 +838,7 @@ int parse_events_name(struct list_head *list, char *name)
list_for_each_entry(evsel, list, node) {
if (!evsel->name)
evsel->name = strdup(name);
+ pmu_get_event_unit_scale(evsel);
}

return 0;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 64362fe..ae17132 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -4,6 +4,7 @@
#include <unistd.h>
#include <stdio.h>
#include <dirent.h>
+#include <locale.h>
#include "sysfs.h"
#include "util.h"
#include "pmu.h"
@@ -14,6 +15,8 @@ struct perf_pmu_alias {
char *name;
struct list_head terms;
struct list_head list;
+ char *unit;
+ double scale;
};

struct perf_pmu_format {
@@ -95,7 +98,89 @@ static int pmu_format(const char *name, struct list_head *format)
return 0;
}

-static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
+static int perf_pmu__parse_scale(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+ struct stat st;
+ ssize_t sret;
+ char scale[128];
+ int fd, ret = -1;
+ char path[PATH_MAX];
+ char *lc;
+
+ snprintf(path, PATH_MAX, "%s/%s.scale", dir, name);
+
+ fd = open(path, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ if (fstat(fd, &st) < 0)
+ goto error;
+
+ sret = read(fd, scale, sizeof(scale)-1);
+ if (sret < 0)
+ goto error;
+
+ scale[sret] = '\0';
+ /*
+ * save current locale
+ */
+ lc = setlocale(LC_NUMERIC, NULL);
+
+ /*
+ * force to C locale to ensure kernel
+ * scale string is converted correctly.
+ * kernel uses default C locale.
+ */
+ setlocale(LC_NUMERIC, "C");
+
+ alias->scale = strtod(scale, NULL);
+
+ /* restore locale */
+ setlocale(LC_NUMERIC, lc);
+
+ ret = 0;
+error:
+ close(fd);

+ return ret;
+}
+

+static int perf_pmu__parse_unit(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+ struct stat st;
+ ssize_t sret;
+ int fd;
+ char path[PATH_MAX];
+
+ snprintf(path, PATH_MAX, "%s/%s.unit", dir, name);
+
+ fd = open(path, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ if (fstat(fd, &st) < 0)
+ goto error;
+
+ alias->unit = malloc(st.st_size + 1);
+ if (!alias->unit)
+ goto error;
+
+ sret = read(fd, alias->unit, st.st_size);
+ if (sret < 0)
+ goto error;
+
+ close(fd);
+
+ alias->unit[sret] = '\0';
+
+ return 0;
+error:
+ close(fd);
+ free(alias->unit);
+ alias->unit = NULL;

+ return -1;
+}
+

+static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FILE *file)
{
struct perf_pmu_alias *alias;
char buf[256];
@@ -111,6 +196,9 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
return -ENOMEM;

INIT_LIST_HEAD(&alias->terms);
+ alias->scale = 1.0;
+ alias->unit = NULL;
+
ret = parse_events_terms(&alias->terms, buf);
if (ret) {
free(alias);
@@ -118,7 +206,14 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
}

alias->name = strdup(name);
+ /*
+ * load unit name and scale if available
+ */
+ perf_pmu__parse_unit(alias, dir, name);
+ perf_pmu__parse_scale(alias, dir, name);
+
list_add_tail(&alias->list, list);
+
return 0;
}

@@ -130,6 +225,7 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
{
struct dirent *evt_ent;
DIR *event_dir;
+ size_t len;
int ret = 0;

event_dir = opendir(dir);
@@ -144,13 +240,24 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
if (!strcmp(name, ".") || !strcmp(name, ".."))
continue;

+ /*
+ * skip .unit and .scale info files
+ * parsed in perf_pmu__new_alias()
+ */
+ len = strlen(name);
+ if (len > 5 && !strcmp(name + len - 5, ".unit"))
+ continue;
+ if (len > 6 && !strcmp(name + len - 6, ".scale"))
+ continue;
+
snprintf(path, PATH_MAX, "%s/%s", dir, name);

ret = -EINVAL;
file = fopen(path, "r");
if (!file)
break;
- ret = perf_pmu__new_alias(head, name, file);
+
+ ret = perf_pmu__new_alias(head, dir, name, file);
fclose(file);
}

@@ -653,3 +760,62 @@ bool pmu_have_event(const char *pname, const char *name)
}
return false;
}
+
+static const char *pmu_event_unit(struct perf_pmu *pmu, const char *name)
+{
+ struct perf_pmu_alias *alias;
+ char buf[1024];
+ char *fname;
+ const char *unit = "";
+
+ if (!name)
+ return unit;
+
+ list_for_each_entry(alias, &pmu->aliases, list) {
+ fname = format_alias(buf, sizeof(buf), pmu, alias);
+ if (!strcmp(fname, name)) {
+ unit = alias->unit;
+ break;
+ }
+ }
+ return unit;
+}
+
+static double pmu_event_scale(struct perf_pmu *pmu, const char *name)
+{
+ struct perf_pmu_alias *alias;
+ char buf[1024];
+ char *fname;
+ double scale = 1.0;
+
+ if (!name)
+ return 1.0;
+
+ list_for_each_entry(alias, &pmu->aliases, list) {
+ fname = format_alias(buf, sizeof(buf), pmu, alias);
+ if (!strcmp(fname, name)) {
+ scale = alias->scale;
+ break;
+ }
+ }
+ return scale;
+}
+
+int pmu_get_event_unit_scale(struct perf_evsel *evsel)
+{
+ __u32 type = evsel->attr.type;
+ struct perf_pmu *pmu;
+
+ if (!evsel->name)
+ return -1;
+
+ list_for_each_entry(pmu, &pmus, list) {
+ if (pmu->type == type)
+ goto found;
+ }
+ return -1;
+found:
+ evsel->unit = pmu_event_unit(pmu, evsel->name);
+ evsel->scale = pmu_event_scale(pmu, evsel->name);
+ return 0;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 1179b26..6bf23b2 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -4,6 +4,7 @@
#include <linux/bitops.h>
#include <linux/perf_event.h>
#include <stdbool.h>
+#include "util/evsel.h"

enum {
PERF_PMU_FORMAT_VALUE_CONFIG,
@@ -45,4 +46,6 @@ void print_pmu_events(const char *event_glob, bool name_only);
bool pmu_have_event(const char *pname, const char *name);

int perf_pmu__test(void);
+
+int pmu_get_event_unit_scale(struct perf_evsel *evsel);
#endif /* __PMU_H */

Stephane Eranian

unread,

Oct 23, 2013, 8:59:06 AM10/23/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

This patch adds a new uncore PMU to expose the Intel

RAPL energy consumption counters. Up to 3 counters,

each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,

IvyBridge, Haswell. The server skus add a 3rd counter.

The following events are available and exposed in sysfs:

- power/energy-cores: power consumption of all cores on socket
- power/energy-pkg: power consumption of all cores + LLc cache

- power/energy-dram: power consumption of DRAM (servers only)

For each event both the unit (Joules) and scale (0.23 nJ)
is exposed in sysfs for use by perf stat and other tools.
Files are:
/sys/devices/power/events/energy-*.unit
/sys/devices/power/events/energy-*.scale

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only

one CPU per socket is sufficient. The /sys/devices/power/cpumask
file can be used by tools to figure out which CPUs

to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit (exposed via sysfs).

The perf_events API exposes all RAPL counters as 64-bit integers
counting in unit of 1/2^32 Joules (or 0.23 nJ). User level tools
must convert the counts by multiplying them by 0.23 and divide 10^9
to obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count in Watt:
W = C * 0.23 / (1e9 * time)

or ldexp(C, -32).

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is

dynamically allocated and is available from /sys/device/power/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

Signed-off-by: Stephane Eranian <era...@google.com>
---

arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 652 +++++++++++++++++++++++++++
2 files changed, 653 insertions(+), 1 deletion(-)

create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o
endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_rapl.o
endif

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644

index 0000000..c61b411
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,652 @@

+/*
+ * perf_event_intel_rapl.c: support Intel RAPL energy consumption counters
+ * Copyright (C) 2013 Google, Inc., Stephane Eranian
+ *
+ * Intel RAPL interface is specified in the IA-32 Manual Vol3b
+ * section 14.7.1 (September 2013)
+ *
+ * RAPL provides more controls than just reporting energy consumption
+ * however here we only expose the 3 energy consumption free running
+ * counters (pp0, pkg, dram).
+ *
+ * Each of those counters increments in a power unit defined by the
+ * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
+ * but it can vary.
+ *
+ * Counter to rapl events mappings:
+ *
+ * pp0 counter: consumption of all physical cores (power plane 0)

+ * event: power/energy_cores

+ * perf code: 0x1
+ *
+ * pkg counter: consumption of the whole processor package

+ * event: power/energy_pkg

+ * perf code: 0x2
+ *
+ * dram counter: consumption of the dram domain (servers only)

+ * event: power/energy_dram

+ int hw_unit; /* 1/2^hw_unit Joule */
+ int phys_id;

+ int n_active; /* number of active events */

+ struct list_head active_list;
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_kfree);
+
+static DEFINE_SPINLOCK(rapl_hotplug_lock);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+ u64 raw;
+ rdmsrl(event->hw.event_base, raw);
+ return raw;
+}
+
+static inline u64 rapl_scale(u64 v)

+{
+ /*

+ * scale delta to smallest unit (1/2^32)
+ * users must then scale back: count * 1/(1e9*2^32) to get Joules
+ * or use ldexp(count, -32).
+ * Watts = Joules/Time delta
+ */
+ return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev_raw_count, new_raw_count;
+ s64 delta, sdelta;
+ int shift = RAPL_CNTR_WIDTH;
+
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdmsrl(event->hw.event_base, new_raw_count);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count) {
+ cpu_relax();
+ goto again;
+ }

+
+ /*

+static void rapl_pmu_event_start(struct perf_event *event, int mode)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);

+ unsigned long flags;
+

+ spin_lock_irqsave(&pmu->lock, flags);
+ __rapl_pmu_event_start(pmu, event);

+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+

+static void rapl_pmu_event_stop(struct perf_event *event, int mode)
+{

+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;

+ unsigned long flags;
+

+ spin_lock_irqsave(&pmu->lock, flags);
+

+ /* mark event as deactivated and stopped */
+ if (!(hwc->state & PERF_HES_STOPPED)) {
+ WARN_ON_ONCE(pmu->n_active <= 0);
+ pmu->n_active--;
+
+ list_del(&event->active_entry);
+
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* check if update of sw counter is necessary */
+ if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ rapl_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;

+ }
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+

+static int rapl_pmu_event_add(struct perf_event *event, int mode)

+{

+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;

+ unsigned long flags;
+

+ spin_lock_irqsave(&pmu->lock, flags);
+

+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_START)
+ __rapl_pmu_event_start(pmu, event);

+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+

+ return 0;
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+ u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+ int bit, msr, ret = 0;
+
+ /* only look at RAPL events */
+ if (event->attr.type != rapl_pmu_class.type)
+ return -ENOENT;
+
+ /* check only supported bits are set */
+ if (event->attr.config & ~RAPL_EVENT_MASK)
+ return -EINVAL;

+
+ /*

+ return ret;
+}
+

+EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01");
+EVENT_ATTR_STR(energy-pkg , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(energy-ram , rapl_ram, "event=0x03");
+
+EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules");
+EVENT_ATTR_STR(energy-pkg.unit , rapl_pkg_unit, "Joules");
+EVENT_ATTR_STR(energy-ram.unit , rapl_ram_unit, "Joules");
+
+/*
+ * we compute in 0.23 nJ increments regardless of MSR
+ */
+EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3e-10");
+EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3e-10");
+EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3e-10");

+
+static struct attribute *rapl_events_srv_attr[] = {

+ EVENT_PTR(rapl_cores),

+ EVENT_PTR(rapl_pkg),
+ EVENT_PTR(rapl_ram),
+

+ EVENT_PTR(rapl_cores_unit),
+ EVENT_PTR(rapl_pkg_unit),
+ EVENT_PTR(rapl_ram_unit),
+
+ EVENT_PTR(rapl_cores_scale),
+ EVENT_PTR(rapl_pkg_scale),
+ EVENT_PTR(rapl_ram_scale),

+ NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {

+ EVENT_PTR(rapl_cores),
+ EVENT_PTR(rapl_pkg),
+
+ EVENT_PTR(rapl_cores_unit),
+ EVENT_PTR(rapl_pkg_unit),
+
+ EVENT_PTR(rapl_cores_scale),
+ EVENT_PTR(rapl_pkg_scale),

+ break;
+ }
+ }
+

+ WARN_ON(cpumask_empty(&rapl_cpu_mask));
+}
+
+static void rapl_init_cpu(int cpu)
+{
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ spin_lock(&rapl_hotplug_lock);
+
+ /* check if phys_is is already covered */
+ for_each_cpu(i, &rapl_cpu_mask) {

+ if (phys_id == topology_physical_package_id(i))
+ return;
+ }
+ /* was not found, so add it */
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+
+ spin_unlock(&rapl_hotplug_lock);
+}
+
+static int rapl_cpu_prepare(int cpu)

+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ int phys_id = topology_physical_package_id(cpu);
+

+ if (pmu)
+ return 0;
+
+ if (phys_id < 0)

+ return -1;
+

+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ if (!pmu)

+ return -1;
+

+ spin_lock_init(&pmu->lock);
+ atomic_set(&pmu->refcnt, 1);
+
+ INIT_LIST_HEAD(&pmu->active_list);
+
+ pmu->phys_id = phys_id;
+ /*
+ * grab power unit as: 1/2^unit Joules
+ *
+ * we cache in local PMU instance
+ */
+ rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);

+ pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+

+ /* set RAPL pmu for this cpu for now */

+ break;
+ }
+ }

+ spin_unlock(&rapl_hotplug_lock);
+ return 0;
+}
+
+static int rapl_cpu_dying(int cpu)

+{

+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ struct perf_event *event, *tmp;
+
+ if (!pmu)
+ return 0;
+
+ spin_lock(&rapl_hotplug_lock);

+
+ /*

+ break;
+ }
+

+ /* select the cpu that collects uncore events */
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DOWN_FAILED:
+ case CPU_STARTING:
+ rapl_init_cpu(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ rapl_exit_cpu(cpu);
+ break;
+ default:

+ break;
+ }
+

+ return NOTIFY_OK;
+}
+
+static const struct x86_cpu_id rapl_cpu_match[] = {
+ [0] = { .vendor = X86_VENDOR_INTEL, .family = 6 },
+ [1] = {},
+};
+static int __init rapl_pmu_init(void)

+{

+ struct rapl_pmu *pmu;
+ int i, cpu, ret;

+
+ /*

+ int phys_id = topology_physical_package_id(cpu);
+

+ /* save on prepare by only calling prepare for new phys_id */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (phys_id == topology_physical_package_id(i)) {
+ phys_id = -1;

+ break;
+ }
+ }

+ if (phys_id < 0) {
+ pmu = per_cpu(rapl_pmu, i);
+ if (pmu) {
+ per_cpu(rapl_pmu, cpu) = pmu;
+ atomic_inc(&pmu->refcnt);
+ }
+ continue;
+ }
+ rapl_cpu_prepare(cpu);
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ }
+
+ perf_cpu_notifier(rapl_cpu_notifier);
+

+ ret = perf_pmu_register(&rapl_pmu_class, "power", -1);
+ WARN_ON(ret);
+ if (!ret) {
+ pr_info("RAPL PMU detected, registration failed, RAPL PMU disabled\n");
+ put_online_cpus();

+ return -1;
+ }
+

+ pmu = __get_cpu_var(rapl_pmu);
+
+ pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
+ " API unit is 2^-32 Joules,"
+ " %d fixed counters\n",
+ pmu->hw_unit,
+ hweight32(rapl_cntr_mask));
+
+ put_online_cpus();
+
+ return 0;
+}
+device_initcall(rapl_pmu_init);

Stephane Eranian

unread,

Oct 23, 2013, 8:59:09 AM10/23/13

to linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, jo...@redhat.com, zheng...@intel.com, b...@alien8.de

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer internval is calculated at boot time
based on the power unit used by the HW.

Signed-off-by: Stephane Eranian <era...@google.com>
---

arch/x86/kernel/cpu/perf_event_intel_rapl.c | 75 +++++++++++++++++++++++++--
1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c

index 3d71d39..ed0566a 100644

--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -92,11 +92,13 @@ static struct kobj_attribute format_attr_##_var = \

struct rapl_pmu {
spinlock_t lock;
- atomic_t refcnt;

int hw_unit; /* 1/2^hw_unit Joule */

- int phys_id;
- int n_active; /* number of active events */
+ struct hrtimer hrtimer;
struct list_head active_list;
+ ktime_t timer_interval; /* in ktime_t unit */

+ int n_active; /* number of active events */

+ int phys_id;
+ atomic_t refcnt;
};

static struct pmu rapl_pmu_class;
@@ -161,6 +163,47 @@ static u64 rapl_event_update(struct perf_event *event)
return new_raw_count;
}

+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+ __hrtimer_start_range_ns(&pmu->hrtimer,
+ pmu->timer_interval, 0,
+ HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+ hrtimer_cancel(&pmu->hrtimer);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)

+{

+ struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
+ struct perf_event *event;

+ unsigned long flags;
+

+ if (!pmu->n_active)
+ return HRTIMER_NORESTART;

+
+ spin_lock_irqsave(&pmu->lock, flags);
+

+ list_for_each_entry(event, &pmu->active_list, active_entry) {
+ rapl_event_update(event);

+ }
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+

+ hrtimer_forward_now(&pmu->hrtimer, pmu->timer_interval);
+
+ return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+ hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ pmu->hrtimer.function = rapl_hrtimer_handle;
+}
+
+
static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
struct perf_event *event)
{
@@ -174,6 +217,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
local64_set(&event->hw.prev_count, rapl_read_counter(event));

pmu->n_active++;
+ if (pmu->n_active == 1)
+ rapl_start_hrtimer(pmu);
}

static void rapl_pmu_event_start(struct perf_event *event, int mode)

@@ -198,6 +243,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)

if (!(hwc->state & PERF_HES_STOPPED)) {

WARN_ON_ONCE(pmu->n_active <= 0);
pmu->n_active--;
+ if (pmu->n_active == 0)
+ rapl_stop_hrtimer(pmu);

list_del(&event->active_entry);

@@ -439,6 +486,7 @@ static int rapl_cpu_prepare(int cpu)
{

struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);

int phys_id = topology_physical_package_id(cpu);

+ u64 ms;

if (pmu)
return 0;

@@ -464,6 +512,20 @@ static int rapl_cpu_prepare(int cpu)
rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);

pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;

+ /*
+ * use reference of 200W for scaling the timeout
+ * to avoid missing counter overflows.
+ * 200W = 200 Joules/sec
+ * divide interval by 2 to avoid lockstep (2 * 100)
+ * if hw unit is 32, then we use 2 ms 1/200/2
+ */
+ if (pmu->hw_unit < 32)
+ ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+ else
+ ms = 2;
+
+ pmu->timer_interval = ms_to_ktime(ms);

+
/* set RAPL pmu for this cpu for now */

per_cpu(rapl_pmu_kfree, cpu) = NULL;
per_cpu(rapl_pmu, cpu) = pmu;

@@ -625,6 +687,7 @@ static int __init rapl_pmu_init(void)

}
rapl_cpu_prepare(cpu);
cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ rapl_hrtimer_init(per_cpu(rapl_pmu, cpu));
}

perf_cpu_notifier(rapl_cpu_notifier);

@@ -641,9 +704,11 @@ static int __init rapl_pmu_init(void)

pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"

" API unit is 2^-32 Joules,"

- " %d fixed counters\n",
+ " %d fixed counters"
+ " %llu ms ovfl timer\n",
pmu->hw_unit,
- hweight32(rapl_cntr_mask));
+ hweight32(rapl_cntr_mask),
+ ktime_to_ms(pmu->timer_interval));

put_online_cpus();

Arnaldo Carvalho de Melo

unread,

Oct 23, 2013, 10:23:24 AM10/23/13

to Stephane Eranian, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Em Wed, Oct 23, 2013 at 11:34:42AM +0200, Stephane Eranian escreveu:
> On Wed, Oct 23, 2013 at 12:18 AM, Arnaldo Carvalho de Melo

> > What about:

> > # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
> > # time events
> > 1.000264953 2.09 Joules power/energy-cores/
> > 1.000264953 5.94 Joules power/energy-pkg/
> > 1.000264953 160,530,320 ref-cycles
> > 2.000640422 2.07 Joules power/energy-cores/
> > 2.000640422 5.94 Joules power/energy-pkg/
> > 2.000640422 152,673,056 ref-cycles
> > 3.000964416 2.08 Joules power/energy-cores/
> > 3.000964416 5.93 Joules power/energy-pkg/
> > 3.000964416 158,779,184 ref-cycles

> > ?
> > Or even 2.09J power/energy-cores/?

> I can try that.

> > I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
> > available.

> I don't have this function is my tree yet (tip.git).

That would be a new one :-)

At some point I'll study the %pM, etc things in the kernel printk code
to come up with something like perf_evsel__{f,scn}printf that allows us
to use just one string format and then pick things like units as a
modifier, but till then having these fprintf variants seems good enough.

- Arnaldo

Stephane Eranian

unread,

Oct 23, 2013, 10:33:18 AM10/23/13

to Arnaldo Carvalho de Melo, Ingo Molnar, Borislav Petkov, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Jiri Olsa, Yan, Zheng

Arnaldo,

Having the printf() would only be good to print the value but the problem is
that you'd need to synchronize with the column headers and width. So
if you say fprintf_value() print the count + unit, then you need to line up
also with the column header which comes from somwhere else. I am
talking about the interval printing mode here.

Jiri Olsa

unread,

Oct 25, 2013, 7:14:41 AM10/25/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, zheng...@intel.com, b...@alien8.de

On Wed, Oct 23, 2013 at 02:58:04PM +0200, Stephane Eranian wrote:

SNIP

> +
> + perf_cpu_notifier(rapl_cpu_notifier);
> +
> + ret = perf_pmu_register(&rapl_pmu_class, "power", -1);
> + WARN_ON(ret);
> + if (!ret) {
> + pr_info("RAPL PMU detected, registration failed, RAPL PMU disabled\n");
> + put_online_cpus();
> + return -1;
> + }

should above rather be:

if (WARN_ON(ret)) {

pr_info("RAPL PMU detected, registration failed, RAPL PMU disabled\n");

jirka

Jiri Olsa

unread,

Oct 25, 2013, 7:14:56 AM10/25/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, zheng...@intel.com, b...@alien8.de

On Wed, Oct 23, 2013 at 02:58:04PM +0200, Stephane Eranian wrote:

SNIP

> + pmu = per_cpu(rapl_pmu, i);
> + if (pmu) {
> + per_cpu(rapl_pmu, cpu) = pmu;
> + atomic_inc(&pmu->refcnt);
> + }
> + continue;
> + }
> + rapl_cpu_prepare(cpu);
> + cpumask_set_cpu(cpu, &rapl_cpu_mask);
> + }
> +
> + perf_cpu_notifier(rapl_cpu_notifier);

hum, this should be rather called below only if we succeed
with the perf_pmu_register

Jiri Olsa

unread,

Oct 25, 2013, 7:15:26 AM10/25/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, zheng...@intel.com, b...@alien8.de

On Wed, Oct 23, 2013 at 02:58:04PM +0200, Stephane Eranian wrote:

SNIP

> +
> +static void rapl_init_cpu(int cpu)
> +{
> + int i, phys_id = topology_physical_package_id(cpu);
> +
> + spin_lock(&rapl_hotplug_lock);
> +
> + /* check if phys_is is already covered */
> + for_each_cpu(i, &rapl_cpu_mask) {
> + if (phys_id == topology_physical_package_id(i))
> + return;

missing 'spin_unlock(&rapl_hotplug_lock)' above

> + }
> + /* was not found, so add it */
> + cpumask_set_cpu(cpu, &rapl_cpu_mask);
> +
> + spin_unlock(&rapl_hotplug_lock);
> +}
> +

Jiri Olsa

unread,

Oct 25, 2013, 10:57:16 AM10/25/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, zheng...@intel.com, b...@alien8.de

On Wed, Oct 23, 2013 at 02:58:02PM +0200, Stephane Eranian wrote:
> This patch adds a new fields to the struct perf_event.
> It is intended to be used to chain events which are
> active (enabled). It helps in the hardware layer
> for PMU which do not have actual counter restrictions, i.e.,
> free running read-only counters. Active events are chained
> as opposed to being tracked via the counter they use.
>
> Signed-off-by: Stephane Eranian <era...@google.com>
> ---
> include/linux/perf_event.h | 1 +
> kernel/events/core.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2e069d1..a376384 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -435,6 +435,7 @@ struct perf_event {
> struct perf_cgroup *cgrp; /* cgroup event is attach to */
> int cgrp_defer_enabled;
> #endif
> + struct list_head active_entry;

Could this be in union with 'hlist_entry' ? It looks
as 'same purpose' and 'mutualy exclusive stuff.

jirka

Jiri Olsa

unread,

Oct 25, 2013, 1:45:53 PM10/25/13

to Stephane Eranian, linux-...@vger.kernel.org, pet...@infradead.org, mi...@elte.hu, a...@linux.intel.com, ac...@redhat.com, zheng...@intel.com, b...@alien8.de

hi,
I dont fully understand the reason for the timer,
I'm probably missing something..

- the timer calls rapl_event_update for all defined events
- but rapl_pmu_event_read calls rapl_event_update any time the
event is read (sys_read)

The rapl_event_update only read msr and updates
event->count|hw,prev_count.

What's the timer purpose then?

thanks for info,
jirka

Stephane Eranian

unread,

Oct 26, 2013, 12:57:42 PM10/26/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

On Fri, Oct 25, 2013 at 4:56 PM, Jiri Olsa <jo...@redhat.com> wrote:
>
> On Wed, Oct 23, 2013 at 02:58:02PM +0200, Stephane Eranian wrote:
> > This patch adds a new fields to the struct perf_event.
> > It is intended to be used to chain events which are
> > active (enabled). It helps in the hardware layer
> > for PMU which do not have actual counter restrictions, i.e.,
> > free running read-only counters. Active events are chained
> > as opposed to being tracked via the counter they use.
> >
> > Signed-off-by: Stephane Eranian <era...@google.com>
> > ---
> > include/linux/perf_event.h | 1 +
> > kernel/events/core.c | 1 +
> > 2 files changed, 2 insertions(+)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 2e069d1..a376384 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -435,6 +435,7 @@ struct perf_event {
> > struct perf_cgroup *cgrp; /* cgroup event is attach to */
> > int cgrp_defer_enabled;
> > #endif
> > + struct list_head active_entry;
>
> Could this be in union with 'hlist_entry' ? It looks
> as 'same purpose' and 'mutualy exclusive stuff.
>

You're saying that I could use the hlist_entry field because
it is currently only used by the sw events in the generic layer.
But it seems to be a complicated rcu list for the purpose here.

Stephane Eranian

unread,

Oct 26, 2013, 1:00:56 PM10/26/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

On Fri, Oct 25, 2013 at 1:14 PM, Jiri Olsa <jo...@redhat.com> wrote:
> On Wed, Oct 23, 2013 at 02:58:04PM +0200, Stephane Eranian wrote:
>
> SNIP
>
>> +
>> +static void rapl_init_cpu(int cpu)
>> +{
>> + int i, phys_id = topology_physical_package_id(cpu);
>> +
>> + spin_lock(&rapl_hotplug_lock);
>> +
>> + /* check if phys_is is already covered */
>> + for_each_cpu(i, &rapl_cpu_mask) {
>> + if (phys_id == topology_physical_package_id(i))
>> + return;
>
> missing 'spin_unlock(&rapl_hotplug_lock)' above
>

Good catch. I fixed that now.

Stephane Eranian

unread,

Oct 26, 2013, 1:07:21 PM10/26/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

The reason is rather simple and is similar to what happens with uncore.
The counter are narrow, 32-bit and there is no interrupt capability. We
need to poll the counters and accumulate in the sw counter to avoid missing
an overflow.

> - the timer calls rapl_event_update for all defined events

No, only for the defined RAPL events which is what we want.

> - but rapl_pmu_event_read calls rapl_event_update any time the
> event is read (sys_read)
>

Yes, but we want to prevent missing a counter overflow. It may happen
if the counter counts in a unit which increments fast.

> The rapl_event_update only read msr and updates
> event->count|hw,prev_count.

No, it does update the count:
local64_add(sdelta, &event->count);

Jiri Olsa

unread,

Oct 26, 2013, 1:45:36 PM10/26/13

to Stephane Eranian, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

nope, I just meant saving little space like:

union {
struct list_head active_entry;
struct hlist_node hlist_entry;
}

just a nitpick

jirka

Jiri Olsa

unread,

Oct 26, 2013, 1:54:00 PM10/26/13

to Stephane Eranian, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

On Sat, Oct 26, 2013 at 07:07:06PM +0200, Stephane Eranian wrote:
> On Fri, Oct 25, 2013 at 7:44 PM, Jiri Olsa <jo...@redhat.com> wrote:
> > On Wed, Oct 23, 2013 at 02:58:05PM +0200, Stephane Eranian wrote:

SNIP

> >> + list_for_each_entry(event, &pmu->active_list, active_entry) {
> >> + rapl_event_update(event);
> >> + }
> >
> > hi,
> > I dont fully understand the reason for the timer,
> > I'm probably missing something..
> >
> The reason is rather simple and is similar to what happens with uncore.
> The counter are narrow, 32-bit and there is no interrupt capability. We
> need to poll the counters and accumulate in the sw counter to avoid missing
> an overflow.
>
> > - the timer calls rapl_event_update for all defined events
>
> No, only for the defined RAPL events which is what we want.

ok, that's what I meant

>
> > - but rapl_pmu_event_read calls rapl_event_update any time the
> > event is read (sys_read)
> >
> Yes, but we want to prevent missing a counter overflow. It may happen
> if the counter counts in a unit which increments fast.
>
> > The rapl_event_update only read msr and updates
> > event->count|hw,prev_count.
> No, it does update the count:
> local64_add(sdelta, &event->count);

ah, there's the shift that takes care of the
overflowed msr value.. ok

thanks,
jirka

Stephane Eranian

unread,

Oct 28, 2013, 5:55:19 AM10/28/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

No, we try to poll the counter faster than it can possibly overflow.

Stephane Eranian

unread,

Oct 28, 2013, 5:58:09 AM10/28/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

But you are relying on the fact that active_entry and hlist_entry can never be
used at the same time. You're saying this because *so far* hlist_entry is
only used for SW events and what I added is only used by RAPL.

If the goal is the same, then we should add a better description for the field:
"chain the event in the list of active events for a PMU instance on a CPU".
Handled by PMU specific code only (not generic code).

Stephane Eranian

unread,

Oct 28, 2013, 6:33:56 AM10/28/13

to Jiri Olsa, LKML, Peter Zijlstra, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

Hi,

I was thinking about the scaling issue over the week-end.

We agreed that it was necessary to export the
scaling via sysfs per event. The RAPL v3 has
an implementation of that including in the perf tool.

If we have that, then it may not be necessary anymore
to express the raw count in the 1/2^32 J unit like we
are currently doing. This loses a bit of precision. We
could as well expose the actual raw count and export
the actual unit via sysfs. For instance, on SNB/IVB the
unit is 1/2^16, but on Haswell it is 1/2^14.

I see two issues with that approach though:

- the interpretation of the raw count changes from machine to
machine and needs to ALWAYS be used with the scaling.
It is not always possible to just compare raw counts directly.

- we would need a way to express that ratio without actually
calculating it in the kernel. There are 6 ratios possible. So
we either have a lookup table with the floating point values
already computed and encoded as strings. Or we need to
add a calculator style parsing in the perf tool (or any other)
tool to parse a ratio: 1 / 65536 or any basic mathematical
expression. After all, the scaling support has to be generic.
Other events may use a different form of scaling ratios. But
that seems overkill to me.

So in the end, it may be the case that what we have not in RAPLv3
is the simplest approach.

Any opinion?

Peter Zijlstra

unread,

Oct 28, 2013, 8:18:17 AM10/28/13

to Stephane Eranian, Jiri Olsa, LKML, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

On Mon, Oct 28, 2013 at 11:33:50AM +0100, Stephane Eranian wrote:
> If we have that, then it may not be necessary anymore
> to express the raw count in the 1/2^32 J unit like we
> are currently doing. This loses a bit of precision. We
> could as well expose the actual raw count and export
> the actual unit via sysfs. For instance, on SNB/IVB the
> unit is 1/2^16, but on Haswell it is 1/2^14.

2^-32 can losslessly express both 2^-16 and 2^-14.

Notably: 2^18/2^32 = 2^(18-32) = 2^-14.

So no, 2^-32 does not loose precision.

The only side effect of always using 2^-32 is that we can only maximally
represent 2^32 (from 64-32), whereas when using 2^-14 we could maximally
represent 2^50.

That said; 2^32 Joule ~ 4.2 GJ which is a rather large quantity of
energy; one I would hope is out there when measuring package energy
costs over any reasonable amount of time.

So the only reason to switch away from using the 32.32 fixed point would
be if someone can make a reasonable argument for why 4.2 GJ is not
sufficient and they need 1 PJ (yes, peta-joule, as in we need a private
nuclear reactor to power this CPU).

Stephane Eranian

unread,

Oct 28, 2013, 11:54:34 AM10/28/13

to Peter Zijlstra, Jiri Olsa, LKML, mi...@elte.hu, a...@linux.intel.com, Arnaldo Carvalho de Melo, Yan, Zheng, Borislav Petkov

Peter,

On Mon, Oct 28, 2013 at 1:17 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Mon, Oct 28, 2013 at 11:33:50AM +0100, Stephane Eranian wrote:
>> If we have that, then it may not be necessary anymore
>> to express the raw count in the 1/2^32 J unit like we
>> are currently doing. This loses a bit of precision. We
>> could as well expose the actual raw count and export
>> the actual unit via sysfs. For instance, on SNB/IVB the
>> unit is 1/2^16, but on Haswell it is 1/2^14.
>
> 2^-32 can losslessly express both 2^-16 and 2^-14.
>
> Notably: 2^18/2^32 = 2^(18-32) = 2^-14.
>
> So no, 2^-32 does not loose precision.
>

You are correct. No bits are lost.

> The only side effect of always using 2^-32 is that we can only maximally
> represent 2^32 (from 64-32), whereas when using 2^-14 we could maximally
> represent 2^50.
>
> That said; 2^32 Joule ~ 4.2 GJ which is a rather large quantity of
> energy; one I would hope is out there when measuring package energy
> costs over any reasonable amount of time.
>
> So the only reason to switch away from using the 32.32 fixed point would
> be if someone can make a reasonable argument for why 4.2 GJ is not
> sufficient and they need 1 PJ (yes, peta-joule, as in we need a private
> nuclear reactor to power this CPU).

I think we are fine with what we have. Simple, no precision lost, constant
user-visible scaling factor easy to export as a string to user tools. Comparison
of raw count possible directly.

I will post v4 very soon.
Thanks.