Code Review - Akaros perf tool


Davide Libenzi

Dec 2, 2015, 6:57:49 PM
to Akaros
As part of this CL the /prof/kpctl interface has been changed back to split timer+profile.
Documentation has been updated.


https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr_perf


The following changes since commit 2fa42319139e4cc5ca853546363f84443d0ead00:

  Rename 'reallocarray' to 'kreallocarray'. (2015-11-25 18:02:04 -0500)

are available in the git repository at:

  g...@github.com:dlibenzi/akaros devarch_msr_perf

for you to fetch changes up to 63a44bc73435826257ccef2c7174b58764204374:

  Added new perf utility to access CPU counters from userspace (2015-12-02 15:48:17 -0800)

----------------------------------------------------------------
Davide Libenzi (16):
      Added intptr_t definition and made error pointer code use them
      Make disable MTRRs the defaul configuration option
      Added ZERO_DATA() and COUNT_OF() macros to common.h
      Fixed the sys_proc_create API to use const pointers
      Added libpfm4 library support
      Added ability to declare local per CPU variables
      Drop the fake exception tables entries and use weak symbols
      Added error_assert() macro and used branch hinting
      Added API for read/write common integer sizes from/to memory
      Moved MSR read/write APIs out of devarch and into a dedicated file
      STFU useless printk messages
      Reverted /prof/kpctl interface to split start and timer
      Migrated devarch MSR access to new MSR API
      Added perfmon interrupt handling to allow overflow based profiling
      Created the new devarch perf file using the perfmon infrastructure
      Added new perf utility to access CPU counters from userspace

 Documentation/profiling.txt                        |  140 +-
 Makefile                                           |    7 +-
 kern/arch/riscv/types.h                            |    1 +
 kern/arch/riscv/uaccess.h                          |    7 -
 kern/arch/x86/Kbuild                               |    1 +
 kern/arch/x86/Kconfig                              |    2 +-
 kern/arch/x86/devarch.c                            |  337 ++-
 kern/arch/x86/devarch.h                            |   14 +-
 kern/arch/x86/msr.c                                |  130 +
 kern/arch/x86/msr.h                                |  116 +
 kern/arch/x86/perfmon.c                            |  469 ++-
 kern/arch/x86/perfmon.h                            |   70 +-
 kern/arch/x86/ros/perfmon.h                        |   96 +
 kern/arch/x86/trap.c                               |    3 +
 kern/arch/x86/types.h                              |    1 +
 kern/arch/x86/uaccess.h                            |    7 -
 kern/drivers/dev/kprof.c                           |   11 +-
 kern/include/assert.h                              |    7 +-
 kern/include/error.h                               |    6 +-
 kern/include/percpu.h                              |   90 +
 kern/include/ros/common.h                          |    7 +-
 kern/include/ros/memops.h                          |   72 +
 kern/src/Kbuild                                    |    1 +
 kern/src/ex_table.c                                |   71 +-
 kern/src/init.c                                    |    2 +
 kern/src/percpu.c                                  |   42 +
 kern/src/profiler.c                                |    4 -
 .../glibc-2.19-akaros/sysdeps/akaros/serialize.c   |    4 +-
 tools/profile/perf/.gitignore                      |    1 +
 tools/profile/perf/Makefile                        |   42 +
 tools/profile/perf/akaros.c                        |  133 +
 tools/profile/perf/akaros.h                        |   48 +
 tools/profile/perf/perf.c                          |  175 ++
 tools/profile/perf/perf_core.c                     |  618 ++++
 tools/profile/perf/perf_core.h                     |   69 +
 tools/profile/perf/xlib.c                          |  100 +
 tools/profile/perf/xlib.h                          |   42 +
 user/parlib/include/parlib.h                       |    4 +-
 user/parlib/include/serialize.h                    |    4 +-
 user/parlib/syscall.c                              |    4 +-
 user/perfmon/Makefile                              |    4 +
 user/perfmon/events/amd64_events_fam10h.h          | 2418 +++++++++++++++
 user/perfmon/events/amd64_events_fam11h.h          | 1403 +++++++++
 user/perfmon/events/amd64_events_fam12h.h          | 1758 +++++++++++
 user/perfmon/events/amd64_events_fam14h.h          | 1540 ++++++++++
 user/perfmon/events/amd64_events_fam15h.h          | 2298 +++++++++++++++
 user/perfmon/events/amd64_events_fam15h_nb.h       | 2022 +++++++++++++
 user/perfmon/events/amd64_events_k7.h              |  228 ++
 user/perfmon/events/amd64_events_k8.h              | 1307 +++++++++
 user/perfmon/events/intel_atom_events.h            | 1364 +++++++++
 user/perfmon/events/intel_bdw_events.h             | 2485 ++++++++++++++++
 user/perfmon/events/intel_core_events.h            | 1895 ++++++++++++
 user/perfmon/events/intel_coreduo_events.h         | 1229 ++++++++
 user/perfmon/events/intel_hsw_events.h             | 2646 +++++++++++++++++
 user/perfmon/events/intel_ivb_events.h             | 2354 +++++++++++++++
 user/perfmon/events/intel_ivbep_unc_cbo_events.h   |  981 +++++++
 user/perfmon/events/intel_ivbep_unc_ha_events.h    |  925 ++++++
 user/perfmon/events/intel_ivbep_unc_imc_events.h   |  644 ++++
 user/perfmon/events/intel_ivbep_unc_irp_events.h   |  267 ++
 user/perfmon/events/intel_ivbep_unc_pcu_events.h   |  479 +++
 user/perfmon/events/intel_ivbep_unc_qpi_events.h   |  696 +++++
 .../perfmon/events/intel_ivbep_unc_r2pcie_events.h |  253 ++
 user/perfmon/events/intel_ivbep_unc_r3qpi_events.h |  552 ++++
 user/perfmon/events/intel_ivbep_unc_ubo_events.h   |  101 +
 user/perfmon/events/intel_knc_events.h             |  383 +++
 user/perfmon/events/intel_netburst_events.h        | 1549 ++++++++++
 user/perfmon/events/intel_nhm_events.h             | 2530 ++++++++++++++++
 user/perfmon/events/intel_nhm_unc_events.h         | 1247 ++++++++
 user/perfmon/events/intel_p6_events.h              |  716 +++++
 user/perfmon/events/intel_pii_events.h             |  656 +++++
 user/perfmon/events/intel_pm_events.h              |  930 ++++++
 user/perfmon/events/intel_ppro_events.h            |  525 ++++
 user/perfmon/events/intel_slm_events.h             |  896 ++++++
 user/perfmon/events/intel_snb_events.h             | 2445 ++++++++++++++++
 user/perfmon/events/intel_snb_unc_events.h         |  256 ++
 user/perfmon/events/intel_snbep_events.h           | 2466 ++++++++++++++++
 user/perfmon/events/intel_snbep_unc_cbo_events.h   |  801 +++++
 user/perfmon/events/intel_snbep_unc_ha_events.h    |  545 ++++
 user/perfmon/events/intel_snbep_unc_imc_events.h   |  344 +++
 user/perfmon/events/intel_snbep_unc_pcu_events.h   |  307 ++
 user/perfmon/events/intel_snbep_unc_qpi_events.h   |  429 +++
 .../perfmon/events/intel_snbep_unc_r2pcie_events.h |  188 ++
 user/perfmon/events/intel_snbep_unc_r3qpi_events.h |  323 +++
 user/perfmon/events/intel_snbep_unc_ubo_events.h   |   72 +
 user/perfmon/events/intel_wsm_events.h             | 2579 ++++++++++++++++
 user/perfmon/events/intel_wsm_unc_events.h         | 1372 +++++++++
 user/perfmon/events/intel_x86_arch_events.h        |   63 +
 user/perfmon/events/perf_events.h                  |  410 +++
 user/perfmon/examples/Makefile                     |   66 +
 user/perfmon/examples/check_events.c               |  161 +
 user/perfmon/examples/showevtinfo.c                |  913 ++++++
 user/perfmon/include/perfmon/err.h                 |   45 +
 user/perfmon/include/perfmon/perf_event.h          |  504 ++++
 user/perfmon/include/perfmon/pfmlib.h              |  490 ++++
 user/perfmon/include/perfmon/pfmlib_perf_event.h   |   64 +
 user/perfmon/pfmlib_amd64.c                        |  860 ++++++
 user/perfmon/pfmlib_amd64_fam10h.c                 |   63 +
 user/perfmon/pfmlib_amd64_fam11h.c                 |   61 +
 user/perfmon/pfmlib_amd64_fam12h.c                 |   61 +
 user/perfmon/pfmlib_amd64_fam14h.c                 |   60 +
 user/perfmon/pfmlib_amd64_fam15h.c                 |   85 +
 user/perfmon/pfmlib_amd64_k7.c                     |   60 +
 user/perfmon/pfmlib_amd64_k8.c                     |   66 +
 user/perfmon/pfmlib_amd64_priv.h                   |  217 ++
 user/perfmon/pfmlib_common.c                       | 1938 +++++++++++++
 user/perfmon/pfmlib_intel_atom.c                   |   86 +
 user/perfmon/pfmlib_intel_bdw.c                    |   70 +
 user/perfmon/pfmlib_intel_core.c                   |   81 +
 user/perfmon/pfmlib_intel_coreduo.c                |   83 +
 user/perfmon/pfmlib_intel_hsw.c                    |  109 +
 user/perfmon/pfmlib_intel_ivb.c                    |  106 +
 user/perfmon/pfmlib_intel_ivb_unc.c                |   86 +
 user/perfmon/pfmlib_intel_ivbep_unc_cbo.c          |  125 +
 user/perfmon/pfmlib_intel_ivbep_unc_ha.c           |   97 +
 user/perfmon/pfmlib_intel_ivbep_unc_imc.c          |   71 +
 user/perfmon/pfmlib_intel_ivbep_unc_irp.c          |   79 +
 user/perfmon/pfmlib_intel_ivbep_unc_pcu.c          |   97 +
 user/perfmon/pfmlib_intel_ivbep_unc_qpi.c          |   85 +
 user/perfmon/pfmlib_intel_ivbep_unc_r2pcie.c       |   61 +
 user/perfmon/pfmlib_intel_ivbep_unc_r3qpi.c        |   65 +
 user/perfmon/pfmlib_intel_ivbep_unc_ubo.c          |   61 +
 user/perfmon/pfmlib_intel_knc.c                    |   61 +
 user/perfmon/pfmlib_intel_netburst.c               |  493 ++++
 user/perfmon/pfmlib_intel_netburst_priv.h          |  233 ++
 user/perfmon/pfmlib_intel_nhm.c                    |  173 ++
 user/perfmon/pfmlib_intel_nhm_unc.c                |  348 +++
 user/perfmon/pfmlib_intel_p6.c                     |  178 ++
 user/perfmon/pfmlib_intel_rapl.c                   |  158 +
 user/perfmon/pfmlib_intel_slm.c                    |   76 +
 user/perfmon/pfmlib_intel_snb.c                    |  107 +
 user/perfmon/pfmlib_intel_snb_unc.c                |   72 +
 user/perfmon/pfmlib_intel_snbep_unc.c              |  630 ++++
 user/perfmon/pfmlib_intel_snbep_unc_cbo.c          |  107 +
 user/perfmon/pfmlib_intel_snbep_unc_ha.c           |   93 +
 user/perfmon/pfmlib_intel_snbep_unc_imc.c          |   67 +
 user/perfmon/pfmlib_intel_snbep_unc_pcu.c          |   96 +
 user/perfmon/pfmlib_intel_snbep_unc_priv.h         |  296 ++
 user/perfmon/pfmlib_intel_snbep_unc_qpi.c          |   84 +
 user/perfmon/pfmlib_intel_snbep_unc_r2pcie.c       |   61 +
 user/perfmon/pfmlib_intel_snbep_unc_r3qpi.c        |   64 +
 user/perfmon/pfmlib_intel_snbep_unc_ubo.c          |   61 +
 user/perfmon/pfmlib_intel_wsm.c                    |  111 +
 user/perfmon/pfmlib_intel_x86.c                    | 1014 +++++++
 user/perfmon/pfmlib_intel_x86_arch.c               |  224 ++
 user/perfmon/pfmlib_intel_x86_priv.h               |  345 +++
 user/perfmon/pfmlib_priv.h                         |  439 +++
 user/perfmon/tests/Makefile                        |   74 +
 user/perfmon/tests/validate.c                      |  344 +++
 user/perfmon/tests/validate_arm.c                  |  324 +++
 user/perfmon/tests/validate_arm64.c                |  208 ++
 user/perfmon/tests/validate_mips.c                 |  224 ++
 user/perfmon/tests/validate_power.c                |  223 ++
 user/perfmon/tests/validate_x86.c                  | 3064 ++++++++++++++++++++
 153 files changed, 71747 insertions(+), 220 deletions(-)
 create mode 100644 kern/arch/x86/msr.c
 create mode 100644 kern/arch/x86/msr.h
 create mode 100644 kern/arch/x86/ros/perfmon.h
 create mode 100644 kern/include/percpu.h
 create mode 100644 kern/include/ros/memops.h
 create mode 100644 kern/src/percpu.c
 create mode 100644 tools/profile/perf/.gitignore
 create mode 100644 tools/profile/perf/Makefile
 create mode 100644 tools/profile/perf/akaros.c
 create mode 100644 tools/profile/perf/akaros.h
 create mode 100644 tools/profile/perf/perf.c
 create mode 100644 tools/profile/perf/perf_core.c
 create mode 100644 tools/profile/perf/perf_core.h
 create mode 100644 tools/profile/perf/xlib.c
 create mode 100644 tools/profile/perf/xlib.h
 create mode 100644 user/perfmon/Makefile
 create mode 100644 user/perfmon/events/amd64_events_fam10h.h
 create mode 100644 user/perfmon/events/amd64_events_fam11h.h
 create mode 100644 user/perfmon/events/amd64_events_fam12h.h
 create mode 100644 user/perfmon/events/amd64_events_fam14h.h
 create mode 100644 user/perfmon/events/amd64_events_fam15h.h
 create mode 100644 user/perfmon/events/amd64_events_fam15h_nb.h
 create mode 100644 user/perfmon/events/amd64_events_k7.h
 create mode 100644 user/perfmon/events/amd64_events_k8.h
 create mode 100644 user/perfmon/events/intel_atom_events.h
 create mode 100644 user/perfmon/events/intel_bdw_events.h
 create mode 100644 user/perfmon/events/intel_core_events.h
 create mode 100644 user/perfmon/events/intel_coreduo_events.h
 create mode 100644 user/perfmon/events/intel_hsw_events.h
 create mode 100644 user/perfmon/events/intel_ivb_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_cbo_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_ha_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_imc_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_irp_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_pcu_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_qpi_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_r2pcie_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_r3qpi_events.h
 create mode 100644 user/perfmon/events/intel_ivbep_unc_ubo_events.h
 create mode 100644 user/perfmon/events/intel_knc_events.h
 create mode 100644 user/perfmon/events/intel_netburst_events.h
 create mode 100644 user/perfmon/events/intel_nhm_events.h
 create mode 100644 user/perfmon/events/intel_nhm_unc_events.h
 create mode 100644 user/perfmon/events/intel_p6_events.h
 create mode 100644 user/perfmon/events/intel_pii_events.h
 create mode 100644 user/perfmon/events/intel_pm_events.h
 create mode 100644 user/perfmon/events/intel_ppro_events.h
 create mode 100644 user/perfmon/events/intel_slm_events.h
 create mode 100644 user/perfmon/events/intel_snb_events.h
 create mode 100644 user/perfmon/events/intel_snb_unc_events.h
 create mode 100644 user/perfmon/events/intel_snbep_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_cbo_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_ha_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_imc_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_pcu_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_qpi_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_r2pcie_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_r3qpi_events.h
 create mode 100644 user/perfmon/events/intel_snbep_unc_ubo_events.h
 create mode 100644 user/perfmon/events/intel_wsm_events.h
 create mode 100644 user/perfmon/events/intel_wsm_unc_events.h
 create mode 100644 user/perfmon/events/intel_x86_arch_events.h
 create mode 100644 user/perfmon/events/perf_events.h
 create mode 100644 user/perfmon/examples/Makefile
 create mode 100644 user/perfmon/examples/check_events.c
 create mode 100644 user/perfmon/examples/showevtinfo.c
 create mode 100644 user/perfmon/include/perfmon/err.h
 create mode 100644 user/perfmon/include/perfmon/perf_event.h
 create mode 100644 user/perfmon/include/perfmon/pfmlib.h
 create mode 100644 user/perfmon/include/perfmon/pfmlib_perf_event.h
 create mode 100644 user/perfmon/pfmlib_amd64.c
 create mode 100644 user/perfmon/pfmlib_amd64_fam10h.c
 create mode 100644 user/perfmon/pfmlib_amd64_fam11h.c
 create mode 100644 user/perfmon/pfmlib_amd64_fam12h.c
 create mode 100644 user/perfmon/pfmlib_amd64_fam14h.c
 create mode 100644 user/perfmon/pfmlib_amd64_fam15h.c
 create mode 100644 user/perfmon/pfmlib_amd64_k7.c
 create mode 100644 user/perfmon/pfmlib_amd64_k8.c
 create mode 100644 user/perfmon/pfmlib_amd64_priv.h
 create mode 100644 user/perfmon/pfmlib_common.c
 create mode 100644 user/perfmon/pfmlib_intel_atom.c
 create mode 100644 user/perfmon/pfmlib_intel_bdw.c
 create mode 100644 user/perfmon/pfmlib_intel_core.c
 create mode 100644 user/perfmon/pfmlib_intel_coreduo.c
 create mode 100644 user/perfmon/pfmlib_intel_hsw.c
 create mode 100644 user/perfmon/pfmlib_intel_ivb.c
 create mode 100644 user/perfmon/pfmlib_intel_ivb_unc.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_cbo.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_ha.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_imc.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_irp.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_pcu.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_qpi.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_r2pcie.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_r3qpi.c
 create mode 100644 user/perfmon/pfmlib_intel_ivbep_unc_ubo.c
 create mode 100644 user/perfmon/pfmlib_intel_knc.c
 create mode 100644 user/perfmon/pfmlib_intel_netburst.c
 create mode 100644 user/perfmon/pfmlib_intel_netburst_priv.h
 create mode 100644 user/perfmon/pfmlib_intel_nhm.c
 create mode 100644 user/perfmon/pfmlib_intel_nhm_unc.c
 create mode 100644 user/perfmon/pfmlib_intel_p6.c
 create mode 100644 user/perfmon/pfmlib_intel_rapl.c
 create mode 100644 user/perfmon/pfmlib_intel_slm.c
 create mode 100644 user/perfmon/pfmlib_intel_snb.c
 create mode 100644 user/perfmon/pfmlib_intel_snb_unc.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_cbo.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_ha.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_imc.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_pcu.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_priv.h
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_qpi.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_r2pcie.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_r3qpi.c
 create mode 100644 user/perfmon/pfmlib_intel_snbep_unc_ubo.c
 create mode 100644 user/perfmon/pfmlib_intel_wsm.c
 create mode 100644 user/perfmon/pfmlib_intel_x86.c
 create mode 100644 user/perfmon/pfmlib_intel_x86_arch.c
 create mode 100644 user/perfmon/pfmlib_intel_x86_priv.h
 create mode 100644 user/perfmon/pfmlib_priv.h
 create mode 100644 user/perfmon/tests/Makefile
 create mode 100644 user/perfmon/tests/validate.c
 create mode 100644 user/perfmon/tests/validate_arm.c
 create mode 100644 user/perfmon/tests/validate_arm64.c
 create mode 100644 user/perfmon/tests/validate_mips.c
 create mode 100644 user/perfmon/tests/validate_power.c
 create mode 100644 user/perfmon/tests/validate_x86.c

Davide Libenzi

Dec 3, 2015, 10:30:54 PM
to Akaros
I was wrong: Linux perf does have per-process counters, which migrate with the process on context switch 😐
Supporting that is a pretty hefty chunk of change, since it also requires hooking into the process context switch path (and struct proc).
The price of the counter/MSR swap would only be paid by processes running under perf, though.
I would rather do that in a follow-up CL, as this one is already big enough.


barret rhoden

Dec 4, 2015, 8:53:44 AM
to aka...@googlegroups.com
On 2015-12-03 at 19:30 'Davide Libenzi' via Akaros wrote:
> I would rather do that in a following CL, as this one is already big
> enough.

Sounds good to me. Having any sort of perf counter access will be
great for now (e.g. the per-cpu counts). We barely run one program at a
time, let alone needing to *profile* multiple programs at a time. =)

Barret Rhoden

Dec 11, 2015, 12:05:30 PM
to aka...@googlegroups.com
Hi -

Good stuff overall; comments below.

Thanks,
Barret

On 2015-12-02 at 15:57 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> As part of this CL the /prof/kpctl interface has been changed back to
> split timer+profile.
> Documentation has been updated.
>
>
> https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr_perf
>
>
> The following changes since commit
> 2fa42319139e4cc5ca853546363f84443d0ead00:
>
> Rename 'reallocarray' to 'kreallocarray'. (2015-11-25 18:02:04
> -0500)
>
> are available in the git repository at:
>
> g...@github.com:dlibenzi/akaros devarch_msr_perf

> From 69be5b1c189c29e35317dfe37d11b5ae3bc605ae Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Wed, 2 Dec 2015 09:20:41 -0800
> Subject: Fixed the sys_proc_create API to use const pointers
>
> Fixed the sys_proc_create API to use const pointers.

Need (XCC) in the subject, since this requires a toolchain/glibc
rebuild.

> --- a/tools/compilers/gcc-glibc/glibc-2.19-akaros/sysdeps/akaros/serialize.c
> +++ b/tools/compilers/gcc-glibc/glibc-2.19-akaros/sysdeps/akaros/serialize.c
> @@ -8,8 +8,8 @@
> #include <stddef.h>
> #include <stdint.h>
>
> -struct serialized_data* serialize_argv_envp(char* const* argv,
> - char* const* envp)
> +struct serialized_data* serialize_argv_envp(const char * const *argv,
> + const char * const *envp)
The pointer * should be moved to be adjacent to the function name.

There are a few other warnings like this in this patch and in some of
the others. Please run checkpatch to catch them.



> From 3e76d624cf8c4240c39c8db5020401ac14774f4b Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 07:03:55 -0800
> Subject: Added libpfm4 library support
>
> Added libpfm4 library support. The libpfm4 library provides a database
> of counters available on different Intel platforms.

Where did this library come from? (URL, package name, etc).

> diff --git a/Makefile b/Makefile
> index 8defe95de83e..7832ecbef7f3 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -573,7 +573,7 @@ endif #ifeq ($(mixed-targets),1)
> # List all userspace directories here, and state any dependencies between them,
> # such as how pthread depends on parlib.
>
> -user-dirs = parlib pthread benchutil iplib ndblib vmm
> +user-dirs = parlib pthread benchutil iplib ndblib vmm perfmon

Does perfmon have any dependencies, such as on parlib or pthread? If so,
we'll need entries like this:

benchutil: parlib
pthread: parlib benchutil
iplib: parlib
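
If perfmon does turn out to depend on parlib, the entry would presumably follow the same pattern as the existing ones (this exact line is hypothetical until the dependency is confirmed):

```make
perfmon: parlib
```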



> From 031a13e37081495c987b414b4b07902058fc5810 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 14:53:23 -0800
> Subject: Added ability to declare local per CPU variables

> +#define DEFINE_PERCPU(type, var) \
> + type PERCPU_VARNAME(var) __attribute__ ((section (PERCPU_SECTION_STR)))
> +#define DECLARE_PERCPU(type, var) \
> + extern type PERCPU_VARNAME(var) \
> + __attribute__ ((section (PERCPU_SECTION_STR)))

Mostly curious, does the declaration also need the attribute? I tried
with and without on a PERCPU that I dropped in init.c and it didn't seem
to care one way or the other.

> From 115665ce501cc4805cee552858c4cb35986bcbc5 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Tue, 24 Nov 2015 15:56:48 -0800
> Subject: Moved MSR read/write APIs out of devarch and into a dedicated file

> diff --git a/kern/arch/x86/msr.h b/kern/arch/x86/msr.h
> new file mode 100644
> index 000000000000..f7c5ccd094f4
> --- /dev/null
> +++ b/kern/arch/x86/msr.h
> @@ -0,0 +1,116 @@
> +/* Copyright (c) 2015 Google Inc
> + * Davide Libenzi <dlib...@google.com>
> + * See LICENSE for details.
> + */
> +
> +#pragma once
> +
> +#include <sys/types.h>
> +#include <ros/errno.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <core_set.h>
> +
> +struct msr_address {
> + uint32_t addr;
> + const uint32_t *addresses;
> + size_t num_addresses;
> +};

So a struct msr_address has both an overall address and an address per
core (based on its usage later)? When does each core need a different address
for a particular MSR (e.g. MSR_IA32_PERFCTR0)?

Or is this for something else, like "I want to read different MSRs on different
cores, but read them all at the same time." I'm trying to picture how this is
used and what it's used for. If it's cryptic, we can put something in this
header file's comments at the top.



> From fecb0cd575a71d8fa2fe5c19728337b3a6246801 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Fri, 27 Nov 2015 10:57:19 -0800
> Subject: Migrated devarch MSR access to new MSR API

> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)

> + msr_set_values(&msrv, values, num_cores);
> +
> + err = msr_cores_read(&cset, &msra, &msrv);
> +
> + n = -1;
> + if (likely(!err)) {
> + if (n >= num_cores * sizeof(uint64_t)) {
> + if (!memcpy_to_user_errno(current, a, values,
> + num_cores * sizeof(uint64_t)))
> + n = num_cores * sizeof(uint64_t);
> } else {

What is going on here? n is set to -1, and then we check whether it is large
enough to cover the whole range of our output (which is always positive)?


> @@ -458,13 +467,19 @@ static long archwrite(struct chan *c, void *a, long n, int64_t offset)
> if (!address_range_find(msr_wr_wlist, ARRAY_SIZE(msr_wr_wlist),
> (uintptr_t) offset))
> error(EPERM, NULL);
> - core_set_init(&cset);
> - core_set_fill_available(&cset);
> if (n != sizeof(uint64_t))
> error(EINVAL, NULL);
> if (memcpy_from_user_errno(current, &value, a, sizeof(value)))
> return -1;
> - err = msr_cores_write(&cset, (uint32_t) offset, value);
> +
> + core_set_init(&cset);
> + core_set_fill_available(&cset);
> + msr_init_address(&msra);
> + msr_set_address(&msra, (uint32_t) offset);
> + msr_init_value(&msrv);
> + msr_set_value(&msrv, value);

It's a little odd to have to init_value() and then set_value(). I looked at the
functions, so I know that msr_init_value() zeroes the msr_value struct, while
msr_set_value() sets the value field of the msr_value struct.



> From 820e28e014e6e413e6d43c1941d5d45bd91b096f Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 07:13:13 -0800
> Subject: Added perfmon interrupt handling to allow overflow based profiling

> --- a/kern/arch/x86/perfmon.c
> +++ b/kern/arch/x86/perfmon.c

> +static void perfmon_do_cores_alloc(void *opaque)
> +{

> + perfmon_enable_fix_event(i, TRUE);
> +
> + write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> + -(int64_t) pa->ev.trigger_count);
> + write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL, tmp);
> + }

Should the enabling of the event happen before or after setting the event? If
it happens before (as is here) would we collect perf events from the old setting
of the fixed counter? The same goes for the non-fixed counter below.
perfmon_do_cores_free() looks okay (disables the event before changing the
fields).

> + } else {
> + for (i = 0; (i < (int) cpu_caps.counters_x_proc) &&
> + ((cctx->counters[i].u.v != 0) ||
> + !perfmon_event_available(i)); i++)
> + ;

That's pretty nasty. I get the impression we want to find the first free
counter (i = idx). It'd be much clearer with the loop just going over i, and
adding breaks or continues or something.

Is there ever a time that the counters[i] value is 0, but the perfmon_event is
not available? I.e., is that a kernel bug, and thus should be a panic?


> +static void perfmon_do_cores_free(void *opaque)
> +{
> + struct perfmon_alloc *pa = (struct perfmon_alloc *) opaque;
> + struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> + int err = 0, coreno = core_id();
> + counter_t ccno = pa->cores_counters[coreno];
> +
> + spin_lock_irqsave(&counters_lock);

What is the lock protecting? (Both here and in the other perfmon_do_
functions). I imagine we need to disable interrupts, in case a PMI IRQ fires,
but do we need to serialize these accesses across the entire machine?

> + if (perfmon_is_fixed_event(&pa->ev)) {
> + uint64_t fxctrl_value = read_msr(MSR_CORE_PERF_FIXED_CTR_CTRL);
> +
> + if (!(fxctrl_value & (FIXCNTR_MASK << ccno)) ||
> + (ccno >= cpu_caps.fix_counters_x_proc)) {
> + err = -ENOENT;
> + } else {
> + perfmon_init_event(&cctx->fixed_counters[ccno]);
> +
> + perfmon_enable_fix_event((int) ccno, FALSE);
> +
> + write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL,
> + fxctrl_value & ~(FIXCNTR_MASK << ccno));
> + write_msr(MSR_CORE_PERF_FIXED_CTR0 + ccno, 0);

For the (fxctrl_value & (FIXCNTR_MASK << ccno)) checks, should "ccno" be
multiplied by FIXCNTR_NBITS? You're trying to select a 4-bit chunk of the
register, right? (btw, I'm looking at the SDM V3 18.2, Fig 18-2 for the layout
of the register).


> +void perfmon_init(void)
> +{
> + int i;
>
> /* Enable user level access to the performance counters */
> lcr4(rcr4() | CR4_PCE);
> +
> + /* This will be called from every core, no need to execute more than once.
> + */
> + if (cpu_caps.perfmon_version == 0)
> + perfmon_read_cpu_caps(&cpu_caps);

Technically this is racy. We might be okay, since they are all writing the same
values.

> + write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
> +}

This will break when we move to the x2APIC. We can deal with it later. =)

> +void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data)
> +{
> + int i;
> + struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> + uint64_t gctrl, status;
> +
> + profiler_add_hw_sample(hw_tf);

So it looks like we generate a trace whenever any of the counters overflow, but
we lose track of which one overflowed. I guess that's what userland perf
expects?

> + spin_lock_irqsave(&counters_lock);
> + /* We need to save the global control status, because we need to disable
> + * counters in order to be able to reset their values.
> + * We will restore the global control status on exit.
> + */
> + gctrl = read_msr(MSR_CORE_PERF_GLOBAL_CTRL);
> + write_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> + status = read_msr(MSR_CORE_PERF_GLOBAL_STATUS);
> + for (i = 0; i < (int) cpu_caps.counters_x_proc; i++) {
> + if (status & ((uint64_t) 1 << i)) {
> + if (cctx->counters[i].u.v)
> + write_msr(MSR_IA32_PERFCTR0 + i,
> + -(int64_t) cctx->counters[i].trigger_count);
> + }
> + }
> + for (i = 0; i < (int) cpu_caps.fix_counters_x_proc; i++) {
> + if (status & ((uint64_t) 1 << (32 + i))) {
> + if (cctx->fixed_counters[i].u.v)
> + write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> + -(int64_t) cctx->fixed_counters[i].trigger_count);
> + }
> + }
> + write_msr(MSR_CORE_PERF_GLOBAL_OVF_CTRL, status);
> + write_msr(MSR_CORE_PERF_GLOBAL_CTRL, gctrl);
> + spin_unlock_irqsave(&counters_lock);
> +
> + write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);

^This probably deserves a comment. Unlike the other LAPIC IRQs, the perf IRQ
gets masked in the LAPIC on interrupt.


> +static int perfmon_alloc_get(struct perfmon_session *ps, int ped, bool reset,
> + struct perfmon_alloc **ppa)
> +{
> + struct perfmon_alloc *pa;
> +
> + if (unlikely((ped < 0) || (ped >= ARRAY_SIZE(ps->allocs))))
> + return -EBADFD;

What is ped, an FD? If it's just an index that can't be < 0, you could make it
an unsigned type. I see the user of this (devarch) was casting it to an int
anyway.

> + spin_lock(&ps->lock);
> + pa = ps->allocs[ped];
> + if (likely(pa)) {
> + if (reset)
> + ps->allocs[ped] = NULL;
> + else
> + kref_get(&pa->ref, 1);
> + }
> + spin_unlock(&ps->lock);
> + if (unlikely(!pa))
> + return -ENOENT;
> + *ppa = pa;
> +
> + return 0;
> +}
> +
> +int perfmon_close_event(struct perfmon_session *ps, int ped)
> +{
> + int err;
> + struct perfmon_alloc *pa;
> +
> + err = perfmon_alloc_get(ps, ped, TRUE, &pa);
> + if (likely(!err))
> + kref_put(&pa->ref);
> +
> + return err;
> +}
> +
> +struct perfmon_status *perfmon_get_event_status(struct perfmon_session *ps,
> + int ped)
> +{
> + int err;
> + struct core_set cset;
> + struct perfmon_status_env env;
> +
> + err = perfmon_alloc_get(ps, ped, FALSE, &env.pa);
> + if (unlikely(err))
> + return ERR_PTR(err);

If you can avoid it, please don't use ERR_PTR. It makes it a lot harder to tell
how to use a function by looking at its signature. For instance, if I didn't
look into the C code, I'd think the following is fine:

foo = perfmon_get_event_status(ps, ped);
if (!foo)
handle_the_error.

In general, we opt to set errno if possible (which i think is the case here)
instead of using ERR_PTR. ERR_PTR and IS_ERR is pretty much only for linux
compatibility (and ancient memcpy code from 7 years ago). Throwing with error()
is another option.

If you do have to use it here, which could be the case if we're called from IRQ
context, then please comment the function accordingly.

> +     env.pef = perfmon_alloc_status();
> +     perfmon_setup_alloc_core_set(env.pa, &cset);
> +
> +     smp_do_in_cores(&cset, perfmon_do_cores_status, &env);

Are there any weird cases where multiple threads are using the same pa? (env.pa
here).

Is 'pa' basically read-only once we install_session_alloc in perfmon_open_event?
If not, how do we protect from having multiple users?

If that's true and it's the invariant that keeps us from clobbering pa's state,
then please put that in the comments somewhere. Otherwise someone in the future
will modify this and enter a nightmare.

> +struct perfmon_event {
> +     union {
> +             struct {
> +                     uint64_t event:8;
> +                     uint64_t mask:8;
> +                     uint64_t usr:1;
> +                     uint64_t os:1;
> +                     uint64_t edge:1;
> +                     uint64_t pc:1;
> +                     uint64_t inten:1;
> +                     uint64_t anyth:1;
> +                     uint64_t en:1;
> +                     uint64_t invcmsk:1;
> +                     uint64_t cmask:8;
> +                     uint64_t res:32;
> +             } b;
> +             uint64_t v;
> +     } u;
> +     uint64_t flags;
> +     uint64_t trigger_count;
> +};

If you can avoid the bitfields in the future, that'd be nice. We try to avoid
using them where possible, in favor of masks and shifts. If this is hard to
change here, then don't bother.



> From 907da2682383ca581a58352456c30c4dab7cd762 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sat, 28 Nov 2015 18:35:57 -0800
> Subject: Created the new devarch perf file using the perfmon infrastructure
>

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c

> +static void arch_free_perf_context(struct perf_context *pc)
> +{
> +     if (likely(pc)) {

From the usage of this function, it looks like a bug if there is !pc.

> +             perfmon_close_session(pc->ps);
> +             kfree(pc->resp);
> +             kfree(pc);
> +     }
> +}


> +static long arch_perf_write(struct perf_context *pc, const void *udata,
> +                            long usize)
> +{
> +     ERRSTACK(1);
> +     void *kdata;
> +     const uint8_t *kptr, *ktop;
> +
> +     kfree(pc->resp);
> +     pc->resp = NULL;
> +     pc->resp_size = 0;
> +
> +     kdata = user_memdup_errno(current, udata, usize);
> +     if (unlikely(!kdata))
> +             return -1;
> +     if (waserror()) {
> +             kfree(kdata);
> +             nexterror();
> +     }
> +     kptr = kdata;
> +     ktop = kptr + usize;
> +     error_assert(EBADMSG, (kptr + 1) <= ktop);
> +     switch (*kptr) {

Minor thing, but if you read *kptr and incremented it, like you do later with
the get_le_ helpers, you wouldn't need to remember to +1 it the first time you
use it in every case {}.

And if you read *kptr into something like uint8_t cmd, you can use 'cmd' later
in an error message.

> +     case PERFMON_CMD_COUNTER_STATUS: {
> +             uint32_t ped;
> +             struct perfmon_status *pef;
> +
> +             error_assert(EBADMSG, (kptr + 1 + sizeof(uint32_t)) <= ktop);
> +             kptr = get_le_u32(kptr + 1, &ped);
> +
> +             pef = perfmon_get_event_status(pc->ps, (int) ped);
> +             if (!IS_ERR(pef)) {
> +                     int i;
> +                     uint8_t *rptr;
> +                     uint64_t *mvalues = kzmalloc(num_cores * sizeof(mvalues),
> +                                                  KMALLOC_WAIT);
> +
> +                     for (i = 0; i < num_cores; i++)
> +                             mvalues[get_os_coreid(i)] = pef->cores_values[i];

This usage of get_os_coreid is probably wrong. The range of i is the list of
os_coreids already, not hw_coreids. Probably just want mvalues[i].

> +                     pc->resp_size = 3 * sizeof(uint64_t) + sizeof(uint16_t) +
> +                             num_cores * sizeof(uint64_t);
> +                     pc->resp = kmalloc(pc->resp_size, KMALLOC_WAIT);
> +
> +                     rptr = put_le_u64(pc->resp, pef->ev.u.v);
> +                     rptr = put_le_u64(rptr, pef->ev.flags);
> +                     rptr = put_le_u64(rptr, pef->ev.trigger_count);
> +                     rptr = put_le_u16(rptr, num_cores);

Not that we actually support more than 2^16 cores yet, but num_cores is an int.

> +                     for (i = 0; i < num_cores; i++)
> +                             rptr = put_le_u64(rptr, mvalues[i]);
> +                     kfree(mvalues);
> +                     perfmon_free_event_status(pef);
> +             } else {
> +                     error(-PTR_ERR(pef), NULL);


Can we just have perfmon_get_event_status() throw? Same goes for
perfmon_open_event() and all of these functions.


> +     }
> +     default:
> +             error(EINVAL, NULL);

We should take advantage of error() and have a message, like
error(EINVAL, "Bad perfmon command 0x%x", cmd);



> From 14650311e81d001bd6cc08d57424f4bcc24d2336 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 07:05:47 -0800
> Subject: Added new perf utility to access CPU counters from userspace
>
> Added new perf utility to access CPU counters from userspace.

Like the other tools such as snc and kprof2perf, this can't be built by cd-ing
into the directory and running make. I'd like to fix that before it gets out of
hand, though maybe the cure is worse than the symptoms for now.


> diff --git a/tools/profile/perf/perf.c b/tools/profile/perf/perf.c
> new file mode 100644
> index 000000000000..0b5ef5d950b1
> --- /dev/null
> +++ b/tools/profile/perf/perf.c

> +static struct perf_context_config perf_cfg = {
> +     .perf_file = "#arch/perf",
> +     .kpctl_file = "/prof/kpctl",

It's safer to use #kprof instead of /prof. We happen to have it bound in our
current ifconfig, but there's no guarantee that'll always be the case. In
general, if you know the device by #name, use it directly.


> +static void run_process_and_wait(int argc, const char * const *argv,
> +                                 const struct core_set *cores)
> +{
> +     int pid, status;
> +     size_t max_cores = ros_total_cores();
> +     struct core_set pvcores;
> +
> +     pid = sys_proc_create(argv[0], strlen(argv[0]), argv, NULL, 0);
> +     if (pid < 0) {
> +             perror(argv[0]);
> +             exit(1);
> +     }
> +
> +     ros_get_low_latency_core_set(&pvcores);
> +     ros_not_core_set(&pvcores);
> +     ros_and_core_sets(&pvcores, cores);
> +     for (size_t i = 0; i < max_cores; i++) {
> +             if (ros_get_bit(&pvcores, i)) {
> +                     if (sys_provision(pid, RES_CORES, i)) {
> +                             fprintf(stderr,
> +                                     "Unable to provision CPU %lu to PID %d: cmd='%s'\n",
> +                                     i, pid, argv[0]);
> +                             exit(1);
> +                     }
> +             }
> +     }
> +
> +     sys_proc_run(pid);

As an FYI, the process will only run on the provisioned cores once it becomes an
MCP. If you are running an app that is an SCP, then it will never run on those
cores. If it will be an MCP, you won't capture the events from when it was on
core 0 either.


Davide Libenzi
Dec 11, 2015, 3:09:20 PM
to Akaros
On Fri, Dec 11, 2015 at 9:05 AM, Barret Rhoden <br...@cs.berkeley.edu> wrote:
Need (XCC) in the subject, since this requires a toolchain/glibc
rebuild.

Sorry, forgot about that.


> -struct serialized_data* serialize_argv_envp(char* const* argv,
> -                                            char* const* envp)
> +struct serialized_data* serialize_argv_envp(const char * const *argv,
> +                                            const char * const *envp)
                         ->
The pointer * should be moved to be adjacent to the function name.

There are a few other warnings like this in this patch and in some of
the others.  Please run checkpatch to catch them.

I left how it was. But I do have checkpatch in my pre-commit hook, and I wonder why it did not trigger.
This is the pre-commit I have (given to me by Kevin):


WARNING: Do not use whitespace before Signed-off-by:
Hmm, this is git's doing that.
I fixed the others.



> Added libpfm4 library support. The libpfm4 library provides a database
> of counters available on different Intel platforms.

Where did this library come from?  (URL, package name, etc).

Apparently from a guy sitting next to me 😀
Did not explicitly add the links, as there is little confusion:




> --- a/Makefile
> +++ b/Makefile
> @@ -573,7 +573,7 @@ endif #ifeq ($(mixed-targets),1)
>  # List all userspace directories here, and state any dependencies between them,
>  # such as how pthread depends on parlib.
>
> -user-dirs = parlib pthread benchutil iplib ndblib vmm
> +user-dirs = parlib pthread benchutil iplib ndblib vmm perfmon

Does perfmon have any dependencies, such as on parlib or pthead?  If so,
we'll need entries like this:

   benchutil: parlib
   pthread: parlib benchutil
   iplib: parlib

The library has no dependencies.
The code was literally taken (with some Linux perf specific files removed) and added to the perfmon folder with its own Makefile.



> From 031a13e37081495c987b414b4b07902058fc5810 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 14:53:23 -0800
> Subject: Added ability to declare local per CPU variables

> +#define DEFINE_PERCPU(type, var)                                             \
> +     type PERCPU_VARNAME(var) __attribute__ ((section (PERCPU_SECTION_STR)))
> +#define DECLARE_PERCPU(type, var)                                                            \
> +     extern type PERCPU_VARNAME(var)                                                         \
> +             __attribute__ ((section (PERCPU_SECTION_STR)))

Mostly curious, does the declaration also need the attribute?  I tried
with and without on a PERCPU that I dropped in init.c and it didn't seem
to care one way or the other.

Apparently it doesn't.
I would slightly prefer to leave it there, in case GCC might decide to care later on 😀


 
> +struct msr_address {
> +     uint32_t addr;
> +     const uint32_t *addresses;
> +     size_t num_addresses;
> +};

So a struct msr_address has both an overall address, as well as an address per
core (based on its usage later)?  When does each core need a different address
for a particular MSR (e.g. MSR_IA32_PERFCTR0)?

Or is this for something else, like "I want to read different MSRs on different
cores, but read them all at the same time."  I'm trying to picture how this is
used and what it's used for.  If it's cryptic, we can put something in this
header file's comments at the top.

Yes. You can choose to use the same addr for every CPU, or, if you need different addresses per CPU, it allows that as well.
Similarly for values. You can use one value for all, or a dedicated one for each CPU.
This allows one-shot programming instead of multiple ones.
Yes, it's because programming a single logical (from a user POV) counter might involve issuing MSR calls at different addresses.
Added a comment.



> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)

> +                     msr_set_values(&msrv, values, num_cores);
> +
> +                     err = msr_cores_read(&cset, &msra, &msrv);
> +
> +                     n = -1;
> +                     if (likely(!err)) {
> +                             if (n >= num_cores * sizeof(uint64_t)) {
> +                                     if (!memcpy_to_user_errno(current, a, values,
> +                                                                                       num_cores * sizeof(uint64_t)))
> +                                             n = num_cores * sizeof(uint64_t);
>                               } else {

What is going on here?  n == -1, then we see if it is great enough to cover the
whole range of our output (which is always positive)?

A bit cryptic, I admit 😎
Changed.




> @@ -458,13 +467,19 @@ static long archwrite(struct chan *c, void *a, long n, int64_t offset)
>                       if (!address_range_find(msr_wr_wlist, ARRAY_SIZE(msr_wr_wlist),
>                                                                       (uintptr_t) offset))
>                               error(EPERM, NULL);
> -                     core_set_init(&cset);
> -                     core_set_fill_available(&cset);
>                       if (n != sizeof(uint64_t))
>                               error(EINVAL, NULL);
>                       if (memcpy_from_user_errno(current, &value, a, sizeof(value)))
>                               return -1;
> -                     err = msr_cores_write(&cset, (uint32_t) offset, value);
> +
> +                     core_set_init(&cset);
> +                     core_set_fill_available(&cset);
> +                     msr_init_address(&msra);
> +                     msr_set_address(&msra, (uint32_t) offset);
> +                     msr_init_value(&msrv);
> +                     msr_set_value(&msrv, value);

It's a little odd to have to init_value() and then set_value().  I looked at the
functions, so I know that msr_init_value() zero's the msr_value struct, while
msr_set_value() sets the value field of the msr_value struct.

OK, dropping the init, and leaving the set, which will do init+set.



> --- a/kern/arch/x86/perfmon.c
> +++ b/kern/arch/x86/perfmon.c

> +static void perfmon_do_cores_alloc(void *opaque)
> +{

> +                     perfmon_enable_fix_event(i, TRUE);
> +
> +                     write_msr(MSR_CORE_PERF_FIXED_CTR0 + i,
> +                                       -(int64_t) pa->ev.trigger_count);
> +                     write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL, tmp);
> +             }

Should the enabling of the event happen before or after setting the event?  If
it happens before (as is here) would we collect perf events from the old setting
of the fixed counter?  Same goes for down below in the non-fixed counter.
perfmon_do_cores_free() looks okay (disables the event before changing the
fields).

It does not change much either way, as the real programming comes when we write the CTRL register.


 

> +     } else {
> +             for (i = 0; (i < (int) cpu_caps.counters_x_proc) &&
> +                              ((cctx->counters[i].u.v != 0) ||
> +                               !perfmon_event_available(i)); i++)
> +                     ;

That's pretty nasty.  I get the impression we want to find the first free
counter (i = idx).  It'd be much clearer with the loop just going over i, and
adding breaks or continues or something.

OK, moving some break conditions within the loop.


 

Is there ever a time that the counters[i] value is 0, but the perfmon_event is
not available?  I.e., is that a kernel bug, and thus should be a panic?

Not sure we should panic. Added a warn_once().
Which I noticed is broken, if put in an if/else (fixed in a new commit in the stream).




> +static void perfmon_do_cores_free(void *opaque)
> +{
> +     struct perfmon_alloc *pa = (struct perfmon_alloc *) opaque;
> +     struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> +     int err = 0, coreno = core_id();
> +     counter_t ccno = pa->cores_counters[coreno];
> +
> +     spin_lock_irqsave(&counters_lock);

What is the lock protecting?  (Both here and in the other perfmon_do_
functions).  I imagine we need to disable interrupts, in case a PMI IRQ fires,
but do we need to serialize these accesses across the entire machine?

They could be per-CPU locks, but given that this is not an extremely hot path, I chose not to do that.
Turned into per-CPU locks.


 

> +     if (perfmon_is_fixed_event(&pa->ev)) {
> +             uint64_t fxctrl_value = read_msr(MSR_CORE_PERF_FIXED_CTR_CTRL);
> +
> +             if (!(fxctrl_value & (FIXCNTR_MASK << ccno)) ||
> +                     (ccno >= cpu_caps.fix_counters_x_proc)) {
> +                     err = -ENOENT;
> +             } else {
> +                     perfmon_init_event(&cctx->fixed_counters[ccno]);
> +
> +                     perfmon_enable_fix_event((int) ccno, FALSE);
> +
> +                     write_msr(MSR_CORE_PERF_FIXED_CTR_CTRL,
> +                                       fxctrl_value & ~(FIXCNTR_MASK << ccno));
> +                     write_msr(MSR_CORE_PERF_FIXED_CTR0 + ccno, 0);

For the (fxctrl_value & (FIXCNTR_MASK << ccno)) checks, should "ccno" be
multipled by FIXCNTR_NBITS?  You're trying to select a 4-bit chunk of the
register, right?  (btw, I'm looking at the SDM V3 18.2, Fig 18-2 for the layout
of the register).

Duh! Yes.
I was sure I had fixed that, but I think I did the fix only in perfmon_get_fixevent_mask().



> +void perfmon_init(void)
> +{
> +     int i;
>
>       /* Enable user level access to the performance counters */
>       lcr4(rcr4() | CR4_PCE);
> +
> +     /* This will be called from every core, no need to execute more than once.
> +      */
> +     if (cpu_caps.perfmon_version == 0)
> +             perfmon_read_cpu_caps(&cpu_caps);

Technically this is racy.  We might be okay, since they are all writing the same
values.

Yes, in theory, but as you notice, they end up writing the same value.


 

> +     write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
> +}

This will break when we move to the x2APIC.  We can deal with it later.  =)

Yeah, I can't program for things which do not exist yet 😎



> +void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data)
> +{
> +     int i;
> +     struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> +     uint64_t gctrl, status;
> +
> +     profiler_add_hw_sample(hw_tf);

So it looks like we generate a trace whenever any of the counters overflow, but
we lose track of which one overflowed.  I guess that's what userland perf
expects?

I need to add an "ID", and figure out how to pass that to userland perf.
Will do in follow up CL.
Added.




 


> +static int perfmon_alloc_get(struct perfmon_session *ps, int ped, bool reset,
> +                                                      struct perfmon_alloc **ppa)
> +{
> +     struct perfmon_alloc *pa;
> +
> +     if (unlikely((ped < 0) || (ped >= ARRAY_SIZE(ps->allocs))))
> +             return -EBADFD;

What is ped, an FD?  If it's just an index that can't be < 0, you could make it
an unsigned type.  I see the user of this (devarch) was casting it to an int
anyways.

Like an FD: a signed value is needed to signal errors in APIs which return it as a value.


> +struct perfmon_status *perfmon_get_event_status(struct perfmon_session *ps,
> +                                                                                             int ped)
> +{
> +     int err;
> +     struct core_set cset;
> +     struct perfmon_status_env env;
> +
> +     err = perfmon_alloc_get(ps, ped, FALSE, &env.pa);
> +     if (unlikely(err))
> +             return ERR_PTR(err);

If you can avoid it, please don't use ERR_PTR.  It makes it a lot harder to tell
how to use a function by looking at its signature.  For instance, if I didn't
look into the C code, I'd think the following is fine:

        foo = perfmon_get_event_status(ps, ped);
        if (!foo)
                handle_the_error.

In general, we opt to set errno if possible (which i think is the case here)
instead of using ERR_PTR.  ERR_PTR and IS_ERR is pretty much only for linux
compatibility (and ancient memcpy code from 7 years ago).  Throwing with error()
is another option.

If you do have to use it here, which could be the case if we're called from IRQ
context, then please comment the function accordingly.

Yes, setting errno or throwing makes the APIs user-context only.
Even the "does it throw or not?" question is not exactly clear by simply looking at API signatures.
Should we have a PTR tag to signal those?

#define __errptr

struct foo __errptr *my_foo();



 

> +     env.pef = perfmon_alloc_status();
> +     perfmon_setup_alloc_core_set(env.pa, &cset);
> +
> +     smp_do_in_cores(&cset, perfmon_do_cores_status, &env);

Are there any weird cases where multiple threads are using the same pa?  (env.pa
here).

Is 'pa' basically read-only once we install_session_alloc in perfmon_open_event?
If not, how do we protect from having multiple users?

If that's true and it's the invariant that keeps us from clobbering pa's state,
then please put that in the comments somewhere.  Otherwise someone in the future
will modify this and enter a nightmare.

Does not matter. perf allocs (pa) are immutable, and perfmon_alloc_get() gets a reference on them.



 

> +struct perfmon_event {
> +     union {
> +             struct {
> +                     uint64_t event:8;
> +                     uint64_t mask:8;
> +                     uint64_t usr:1;
> +                     uint64_t os:1;
> +                     uint64_t edge:1;
> +                     uint64_t pc:1;
> +                     uint64_t inten:1;
> +                     uint64_t anyth:1;
> +                     uint64_t en:1;
> +                     uint64_t invcmsk:1;
> +                     uint64_t cmask:8;
> +                     uint64_t res:32;
> +             } b;
> +             uint64_t v;
> +     } u;
> +     uint64_t flags;
> +     uint64_t trigger_count;
> +};

If you can avoid the bitfields in the future, that'd be nice.  We try to avoid
using them where possible, in favor of masks and shifts.  If this is hard to
change here, then don't bother.

It makes the code much cleaner IMO.
You would have to create macros (or open code them) for getting/setting every one of them,
which is pretty tedious when you have that many.


 



> From 907da2682383ca581a58352456c30c4dab7cd762 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sat, 28 Nov 2015 18:35:57 -0800
> Subject: Created the new devarch perf file using the perfmon infrastructure
>

> --- a/kern/arch/x86/devarch.c
> +++ b/kern/arch/x86/devarch.c

> +static void arch_free_perf_context(struct perf_context *pc)
> +{
> +     if (likely(pc)) {

From the usage of this function, it looks like a bug if there is !pc.

Typically all the "pointer free" APIs should accept NULL inputs.
I would have loved it if you (and Kevin) had corrected me earlier on doing *ptr++ things 😀
Changed to sane C code now.



> +             case PERFMON_CMD_COUNTER_STATUS: {
> +                     uint32_t ped;
> +                     struct perfmon_status *pef;
> +
> +                     error_assert(EBADMSG, (kptr + 1 + sizeof(uint32_t)) <= ktop);
> +                     kptr = get_le_u32(kptr + 1, &ped);
> +
> +                     pef = perfmon_get_event_status(pc->ps, (int) ped);
> +                     if (!IS_ERR(pef)) {
> +                             int i;
> +                             uint8_t *rptr;
> +                             uint64_t *mvalues = kzmalloc(num_cores * sizeof(mvalues),
> +                                                                                      KMALLOC_WAIT);
> +
> +                             for (i = 0; i < num_cores; i++)
> +                                     mvalues[get_os_coreid(i)] = pef->cores_values[i];

This usage of get_os_coreid is probably wrong.  The range of i is the list of
os_coreids already, not hw_coreids.  Probably just want mvalues[i].

Yes, that is a leftover from a patch which was trying to remap logical to HW core IDs.
But then I realized that send_ipi() does the conversion.
Dropped that get_os_coreid() now.



 

> +                             pc->resp_size = 3 * sizeof(uint64_t) + sizeof(uint16_t) +
> +                                     num_cores * sizeof(uint64_t);
> +                             pc->resp = kmalloc(pc->resp_size, KMALLOC_WAIT);
> +
> +                             rptr = put_le_u64(pc->resp, pef->ev.u.v);
> +                             rptr = put_le_u64(rptr, pef->ev.flags);
> +                             rptr = put_le_u64(rptr, pef->ev.trigger_count);
> +                             rptr = put_le_u16(rptr, num_cores);

Not that we actually support more than 2^16 cores yet, but num_cores is an int.

Yeah, I figured that by the time we hit 65K cores, Intel might have changed the perf API 10 times, so this code might need revising anyway 😎
In any case, turned to u32.


 

> +                             for (i = 0; i < num_cores; i++)
> +                                     rptr = put_le_u64(rptr, mvalues[i]);
> +                             kfree(mvalues);
> +                             perfmon_free_event_status(pef);
> +                     } else {
> +                             error(-PTR_ERR(pef), NULL);


Can we just have perfmon_get_event_status() throw?  Same goes for
perfmon_open_event() and all of these functions.

OK, done.


 


> +             }
> +             default:
> +                     error(EINVAL, NULL);

We should take advantage of error() and have a message, like
        error(EINVAL, "Bad perfmon command 0x%x", cmd);

Done.



> Added new perf utility to access CPU counters from userspace.

Like the other tools such as snc and kprof2perf, this can't be built by cd-ing
into the directory and running make.  I'd like to fix that before it gets out of
hand, though maybe the cure is worse than the symptoms for now.

I asked the question about the revised build system a few times, but got no answers.



 


> diff --git a/tools/profile/perf/perf.c b/tools/profile/perf/perf.c
> new file mode 100644
> index 000000000000..0b5ef5d950b1
> --- /dev/null
> +++ b/tools/profile/perf/perf.c

> +static struct perf_context_config perf_cfg = {
> +     .perf_file = "#arch/perf",
> +     .kpctl_file = "/prof/kpctl",

It's safer to use #kprof instead of /prof.  We happen to have it bound in our
current ifconfig, but there's no guarantee that'll always be the case.  In
general, if you know the device by #name, use it directly.

OK, done.
So, are there better options?

Changes have been posted in the same branch.



Davide Libenzi
Dec 12, 2015, 7:53:13 PM
to Akaros
On Fri, Dec 11, 2015 at 12:09 PM, Davide Libenzi <dlib...@google.com> wrote:
> +void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data)
> +{
> +     int i;
> +     struct perfmon_cpu_context *cctx = PERCPU_VARPTR(counters_env);
> +     uint64_t gctrl, status;
> +
> +     profiler_add_hw_sample(hw_tf);

So it looks like we generate a trace whenever any of the counters overflow, but
we lose track of which one overflowed.  I guess that's what userland perf
expects?

I need to add an "ID", and figure out how to pass that to userland perf.
Will do in follow up CL.

Actually, doing it in this stream, as tail commits.
I already added an "info" field containing the event coordinates.
Need to figure out how to translate it for Linux perf now ...


Davide Libenzi
Dec 13, 2015, 12:43:46 PM
to Akaros
There is a way, and Linux perf has an ID for every trace, an ID which you can bind to any event string you can push into the perf file.
But, in order to do that, I will need to add a libpfm4 dependency to kprof2perf, which is not worth it.
Instead I will go ahead and move the Akaros-to-Linux perf conversion into the akaros perf tool (which already has the libpfm4 dependency), and drop the kprof2perf tool altogether.
In order to do that though, I will need to know, on the Akaros side, the full path within the build directory of the akaros kernel ELF file.
A couple of solutions which come to my mind:

1) Build system create a magic build info file into KFS
2) Build info are stuck in a kernel ELF section, retrievable via system call or #some-dev-file




ron minnich
Dec 13, 2015, 12:49:00 PM
to Akaros
One thing: however we do this, it would be nice to make sure we have reproducible builds, which the
ELF approach might cause trouble with.

ron


Davide Libenzi
Dec 13, 2015, 12:50:45 PM
to Akaros
Did not parse that ☺

ron minnich
Dec 13, 2015, 12:52:41 PM
to Akaros
Sorry, what I mean is that we have to be careful about what we put into the kernel image, so that different people can build kernels
that are byte-for-byte the same given a git ref. This reproducible-build stuff is all the rage now.

I think putting stuff like paths in the kernel binary will defeat reproducible builds.

ron

Davide Libenzi
Dec 13, 2015, 12:58:35 PM
to Akaros
Now our KFS image is stuck in an ELF section as well.
The typical kernel uname info is generally generated (at least the time part) at compile time as well.

ron minnich
Dec 13, 2015, 1:31:57 PM
to Akaros
It's true that non-reproducible bits are typically embedded in a Linux kernel at present. That may change.

But debian and coreboot at least are looking at the problem.

We can try to get it right from the start.

ron

ron minnich
Dec 13, 2015, 1:33:52 PM
to Akaros
From reproducible-builds.org:

"First, the build system needs to be made entirely deterministic: transforming a given source must always create the same result. Typically, the current date and time must not be recorded and output always has to be written in the same order.

Second, the set of tools used to perform the build and more generally the build environment should either be recorded or pre-defined.

Third, users should be given a way to recreate a close enough build environment, perform the build process, and verify that the output matches the original build.

"

Davide Libenzi
Dec 13, 2015, 1:37:05 PM
to Akaros
If this ever becomes an issue, we can have a build option which includes neither KFS (because the same invariant will have to apply to the ELF files it is populated with), nor other build-dependent information.
Without that, you will not be able to use perf to generate traces though.

Davide Libenzi
Dec 13, 2015, 1:39:52 PM
to Akaros
Also, KFS uses CPIO, which in turn embeds file metadata (time as well).
So our builds will have to either point to an invariant KFS, or not have it at all.

Davide Libenzi
Dec 13, 2015, 2:00:38 PM
to Akaros
To get myself unblocked, I added this rule to the Makefile, which fill-kfs depends on.
If we want to change it later on, that's fine.

create-build-file:
ifneq ($(INVARIANT_BUILD),1)
        @echo "Kernel: $(abspath $(KERNEL_OBJ))" > kern/kfs/etc/build.info
endif

Davide Libenzi
Dec 13, 2015, 2:04:40 PM
to Akaros
Added a few more things. We don't have a kernel version, do we?

Kernel: /usr/local/google/home/src/akaros/akaros/obj/kern/akaros-kernel
Date: Sun Dec 13 11:03:28 PST 2015
Host: dlibenzi.mtv.corp.google.com

Davide Libenzi
Dec 14, 2015, 11:55:29 AM
to Akaros
The counter ID propagation up to Linux perf has been done, in the same branch (as well as the CR comment fixes).
Unfortunately, Linux perf supported user defined event string before, so you could emit pretty things like "ICACHE:ALL" as event name, but now it does not support it anymore.
So you will get something like:

# Samples: 3K of event 'raw 0x840000000000ff80:HG'

Then you need to know that 0xff is the mask, and 0x80 is the event.
But Akaros perf tells you that.

Barret Rhoden
Dec 14, 2015, 12:01:43 PM
to aka...@googlegroups.com
On 2015-12-13 at 09:43 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> There is a way, and Linux perf has an ID for every trace, ID which
> you can bind to any event string you can push into the perf file.
> But, in order to do that, I will need to add libpfm4 dependency to
> kprof2perf, which is not worth it.
> Instead I will go ahead to move the akaros to linux perf conversion,
> to the akaros perf tool (which already has the libpfm4 dependency),
> and drop the kprof2perf tool altogether.

sounds great.

> In order to do that though, I will need to know, on the Akaros side,
> the full path within the build directory, of the akaros kernel ELF
> file. A couple of solutions which come to my mind:
>
> 1) Build system create a magic build info file into KFS
> 2) Build info are stuck in a kernel ELF section, retrievable via
> system call or #some-dev-file

just to be clear, you want a way for a program running on Akaros to
know the path to the kernel when it was built,
e.g. /full/path/to/obj/kern/akaros-kernel (or perhaps
akaros-kernel-64b)?

Davide Libenzi

Dec 14, 2015, 12:03:00 PM
to Akaros
On Mon, Dec 14, 2015 at 9:01 AM, Barret Rhoden <br...@cs.berkeley.edu> wrote:
> In order to do that though, I will need to know, on the Akaros side,
> the full path within the build directory, of the akaros kernel ELF
> file. A couple of solutions which come to my mind:
>
> 1) Build system create a magic build info file into KFS
> 2) Build info are stuck in a kernel ELF section, retrievable via
> system call or #some-dev-file

just to be clear, you want a way for a program running on Akaros to
know the path to the kernel when it was built,
e.g. /full/path/to/obj/kern/akaros-kernel (or perhaps
akaros-kernel-64b)?

Yes, I do indeed need the -64b one. So far I have added Makefile machinery to populate $KFS/etc/build.info.


Davide Libenzi

Dec 14, 2015, 12:07:55 PM
to Akaros
I then have an API to fetch it:


BTW, I need the size as well, because I need to populate an MMP record.
An alternative, more involved, is to stick that info into an ELF section.

Barret Rhoden

Dec 14, 2015, 1:23:01 PM
to aka...@googlegroups.com
Hi -

On 2015-12-11 at 12:09 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> I left how it was. But I do have checkpatch in my pre-commit hook,
> and I wonder why it did not trigger.
> This is the pre-commit I have (given to me by Kevin):
>
> https://gist.github.com/dlibenzi/6fbcd64c2e81eb430226
>
> WARNING: Do not use whitespace before Signed-off-by:
>
> Hmm, this is git-s doing that.
> I fixed the others.

Not sure about why the pre-commit hooks don't work. They also fail to
checkpatch the commit message (sometimes they catch typos and lack of
signed-off tags). I have a script that I run to checkpatch on a
revision list:

$ cat ~/bin/git-checkpatch
#!/bin/bash

TMPDIR=/tmp/fp-cp

usage()
{
        echo "$0 rev..list"
        exit -1
}

if [ $# -ne 1 ]
then
        usage
fi

mkdir -p $TMPDIR

git format-patch -o $TMPDIR $1
./scripts/checkpatch.pl $TMPDIR/*.patch
rm $TMPDIR/*patch

> > > Added libpfm4 library support. The libpfm4 library provides a
> > > database of counters available on different Intel platforms.
> >
> > Where did this library come from? (URL, package name, etc).
> >
>
> Apparently from a guy sitting next to me 😀
> Did not explicitly add the links, as there is little room for confusion:
>
> https://www.google.com/search?q=libpfm4

We've been a little lax with adding code and dealing with licenses in
the past. I'd like it to be clear from the commit message where code
came from. I take it then that you added libpfm4-4.6.0 or something
from http://perfmon2.sourceforge.net/, possibly from
git://git.code.sf.net/p/perfmon2/libpfm4.

> The library has no dependencies.
> The code was literally taken (removed some Linux perf specific files)
> and added to the perfmon folder with its own Makefile.

Good. That's good to know too for the future, in case we need to bring
in an updated version of libpfm.

> > Mostly curious, does the declaration also need the attribute? I
> > tried with and without on a PERCPU that I dropped in init.c and it
> > didn't seem to care one way or the other.
> >
>
> Apparently it doesn't.
> I would slightly prefer to leave it there, in case GCC might decide
> to care later on 😀

Sounds good to me. =)

> > Should the enabling of the event happen before or after setting the
> > event? If
> > it happens before (as is here) would we collect perf events from
> > the old setting
> > of the fixed counter? Same goes for down below in the non-fixed
> > counter. perfmon_do_cores_free() looks okay (disables the event
> > before changing the fields).
> >
>
> It does not change much either ways, as the real programming comes
> when we write the CTRL register.

sounds good. this line from the doc cleared it up for me:

"Each enable bit in IA32_PERF_GLOBAL_CTRL is AND’ed with
the enable bits for all privilege levels in the respective
IA32_PERFEVTSELx or IA32_PERF_FIXED_CTR_CTRL MSRs to start/stop
the counting of respective counters."

> > What is the lock protecting? (Both here and in the other
> > perfmon_do_ functions). I imagine we need to disable interrupts,
> > in case a PMI IRQ fires,
> > but do we need to serialize these accesses across the entire
> > machine?
> >
>
> They could be per-CPU locks, but given that this is not an extremely
> hot path, I chose to not do that.
> Turned into per-CPU locks.

That cuts down on the lock contention, but I'm still curious what the
lock is protecting. If you're just trying to protect from an
interrupt causing someone to muck with the state, then a lock isn't
needed.

Still, I'm fine with the lock (per cpu or global). I (and
perhaps whoever reads the code in the future) just want to know if
there's something to be worried about.

> > Is 'pa' basically read-only once we install_session_alloc in
> > perfmon_open_event?
> > If not, how do we protect from having multiple users?
> >
> > If that's true and it's the invariant that keeps us from clobbering
> > pa's state,
> > then please put that in the comments somewhere. Otherwise someone
> > in the future
> > will modify this and enter a nightmare.
> >
>
> Does not matter. perf alloc (pa) are immutable and
> perfmon_alloc_get() gets a reference on them.

OK. So they are immutable after all of the cores run
perfmon_do_cores_alloc() (due to this):

pa->cores_counters[core_id()] = (counter_t) i;

which is before perfmon_install_session_alloc().

That's the sort of thing that could catch someone in the future, even
if it is unlikely. I wasn't asking for the code to change; I was just
saying that a comment would have made it more clear both for me and for
future developers. We ought to be explicit about any of the invariants
about locking or concurrent access.

> > If you can avoid the bitfields in the future, that'd be nice. We
> > try to avoid
> > using them where possible, in favor of masks and shifts. If this
> > is hard to
> > change here, then don't bother.
> >
>
> It makes code much cleaner IMO.
> You had to create macros (or open code them) for getting/setting every
> one of them.
> Which is pretty tedious when you have that many.

I agree that they look nice. My concern with bitfields is that the
compiler might not do what we want with them. I'm not a language
expert here, but here's a sample of some of the issues:

http://stackoverflow.com/questions/6043483/why-bit-endianness-is-an-issue-in-bitfields/6044223#6044223

Again, if this is hard to change here, then don't bother. It's not a
huge deal.

> > Minor thing, but if you read *kptr and incremented it, like you do
> > later with
> > the get_le_ helpers, you wouldn't need to remember to +1 it the
> > first time you
> > use it in every case {}.
> >
>
> I would have loved that, except you (and Kevin) previously corrected me
> for doing *ptr++ things 😀
> Changed to sane-c-code now.

I think our concern was that *ptr++ can be a little confusing for
people not intimately familiar with the order of operations. We had a
bug a while back where *ptr++ was confused with (*ptr)++. Or was it
confused with *(ptr++)? =)

Either way, reading the cmd and advancing the pointer seems sane. =)

> > Not that we actually support more than 2^16 cores yet, but
> > num_cores is an int.
> >
>
> Yeah, I figured, by the time we hit 65K cores, Intel might have
> changed the perf API 10 times, so this code might need revisioning
> anyway 😎

Haha, absolutely.

> > Added new perf utility to access CPU counters from userspace.
> >
> > Like the other tools such as snc and kprof2perf, this can't be
> > built by cd-ing
> > into the directory and running make. I'd like to fix that before
> > it gets out of
> > hand, though maybe the cure is worse than the symptoms for now.
> >
>
> I asked the question a few times, about the revised build system. Got
> no answers.

I guess coming up with the right answer for a build system is a little
more involved. There are ways in which we can use the current system
and still have the app Makefiles know about the CC, but I'll look into
it on my own.

> > As an FYI, the process will only run on the provisioned cores once
> > it becomes an
> > MCP. If you are running an app that is an SCP, then it will never
> > run on those
> > cores. If it will be an MCP, you won't capture the events from
> > when it was on
> > core 0 either.
> >
>
> So, are there better options?

Nope. I just wanted to make sure everyone was aware of that limitation.

> Changes have been posted in the same branch.

Thanks!

Barret

Barret Rhoden

Dec 14, 2015, 1:35:22 PM
to aka...@googlegroups.com
On 2015-12-13 at 11:04 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Added a few more things. We don't have a kernel version, do we?

Not yet. I was thinking of using the commit SHA, and exporting that
through a # device of some sort. Perhaps we can have all of this stuff
put in #version, and then we select that via Kconfig (instead of the
invariant build business).

> Kernel: /usr/local/google/home/src/akaros/akaros/obj/kern/akaros-kernel

I'm a little unclear on why you need to know the full path from within
Akaros. I'd imagine at some point a tool wants to know the original
binary for symbol lookups, but that seems a little different.

What's an MMP record?

Davide Libenzi

Dec 14, 2015, 2:06:38 PM
to Akaros
On Mon, Dec 14, 2015 at 10:22 AM, Barret Rhoden <br...@cs.berkeley.edu> wrote:
> Hmm, this is git-s doing that.
> I fixed the others.

Not sure about why the pre-commit hooks don't work.  They also fail to
checkpatch the commit message (sometimes they catch typos and lack of
signed-off tags).  I have a script that I run to checkpatch on a
revision list:

$ cat ~/bin/git-checkpatch
#!/bin/bash

TMPDIR=/tmp/fp-cp

usage()
{
        echo "$0 rev..list"
        exit -1
}

if [ $# -ne 1 ]
then
        usage
fi

mkdir -p $TMPDIR

git format-patch -o $TMPDIR $1
./scripts/checkpatch.pl $TMPDIR/*.patch
rm $TMPDIR/*patch

I need to fix my pre-commit hook, as otherwise I will surely forget to run it.
Having a big NO at commit time is much better for me.


> Did not explicitly add the links, as there is little room for confusion:
>
> https://www.google.com/search?q=libpfm4

We've been a little lax with adding code and dealing with licenses in
the past.  I'd like it to be clear from the commit message where code
came from.  I take it then that you added libpfm4-4.6.0 or something
from http://perfmon2.sourceforge.net/, possibly from
git://git.code.sf.net/p/perfmon2/libpfm4.

Should I amend the commit, or add a README into the library (or both)?



> They could be per-CPU locks, but given that this is not an extremely
> hot path, I chose to not do that.
> Turned into per-CPU locks.

That cuts down on the lock contention, but I'm still curious what the
lock is protecting.  If you're just trying to protect from an
interrupt causing someone to muck with the state, then a lock isn't
needed.

Still, I'm fine with the lock (per cpu or global).  I (and
perhaps whoever reads the code in the future) just want to know if
there's something to be worried about.

Well, multiple calls from different tasks can come in, and the operations against the per-CPU counters are definitely not atomic.



OK.  So they are immutable after all of the cores run
perfmon_do_cores_alloc() (due to this):

        pa->cores_counters[core_id()] = (counter_t) i;

While that code is running, the "pa" is not exposed anywhere.
Once the "pa" is out there, nobody will be changing it.


 

which is before perfmon_install_session_alloc().

That's the sort of thing that could catch someone in the future, even
if it is unlikely.  I wasn't asking for the code to change; I was just
saying that a comment would have made it more clear both for me and for
future developers.  We ought to be explicit about any of the invariants
about locking or concurrent access.

Should I add?


 

> > If you can avoid the bitfields in the future, that'd be nice.  We
> > try to avoid
> > using them where possible, in favor of masks and shifts.  If this
> > is hard to
> > change here, then don't bother.
> >
>
> It makes code much cleaner IMO.
> You had to create macros (or open code them) for getting/setting every
> one of them.
> Which is pretty tedious when you have that many.

I agree that they look nice.  My concern with bitfields is that the
compiler might not do what we want with them.  I'm not a language
expert here, but here's a sample of some of the issues:

http://stackoverflow.com/questions/6043483/why-bit-endianness-is-an-issue-in-bitfields/6044223#6044223

Again, if this is hard to change here, then don't bother.  It's not a
huge deal.

Let me leave the bitfield, and try to add a static assert to catch it if GCC decides to go sideways.



> > Added new perf utility to access CPU counters from userspace.
> >
> > Like the other tools such as snc and kprof2perf, this can't be
> > built by cd-ing
> > into the directory and running make.  I'd like to fix that before
> > it gets out of
> > hand, though maybe the cure is worse than the symptoms for now.
> >
>
> I asked the question a few times, about the revised build system. Got
> no answers.

I guess coming up with the right answer for a build system is a little
more involved.  There are ways in which we can use the current system
and still have the app Makefiles know about the CC, but I'll look into
it on my own.

I will revise that with the Makefrag-user-app idea, until we decide on the future layout.



Davide Libenzi

Dec 14, 2015, 2:08:03 PM
to Akaros
Once you prepare the Linux perf file, Linux perf needs to find the binary in order to fetch symbols from it.


 

What's an MMP record?

Sorry, MMAP.


Davide Libenzi

Dec 14, 2015, 3:21:15 PM
to Akaros
The static assert is failing, as bitfield manipulation seems to be pushed down in the compiler stack, where "static" no longer applies.
Turning them into macros.

Davide Libenzi

Dec 14, 2015, 5:06:04 PM
to Akaros
Added link to libpfm4 to commit message, and migrated out of bitfields.
Changes in the same branch.

Barret Rhoden

Dec 14, 2015, 5:13:02 PM
to aka...@googlegroups.com
On 2015-12-14 at 11:08 "'Davide Libenzi' via Akaros"
> > I'm a little unclear on why you need to know the full path from
> > within Akaros. I'd imagine at some point a tool wants to know the
> > original binary for symbol lookups, but that seems a little
> > different.
> >
>
> Once you prepare the Linux perf file, Linux perf needs to find the
> binary in order to fetch symbols from it.

That makes sense. So then on Linux, is a given 'perf' binary built to
go along with a specific kernel (the kernel that it runs on)?
Something like /usr/src/linux-4.2/boot/x86/bzimage? Or is the binary
path part of a trace record, and perf expects to find it in the stream?

Since the full path to the kernel binary is unusable from within
Akaros (since that file system doesn't exist), it seemed a little weird
to have the kernel export that info, esp when the analysis happens on
the Linux side on the developer's workstation, where the knowledge of
the binary might exist.

That being said, I'm not opposed to exporting that info. I'd just like
to understand why it's needed.

Barret

Barret Rhoden

Dec 14, 2015, 5:17:25 PM
to aka...@googlegroups.com
On 2015-12-14 at 11:06 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> > We've been a little lax with adding code and dealing with licenses
> > in the past. I'd like it to be clear from the commit message where
> > code came from. I take it then that you added libpfm4-4.6.0 or
> > something from http://perfmon2.sourceforge.net/, possibly from
> > git://git.code.sf.net/p/perfmon2/libpfm4.
> >
>
> Should I amend the commit, or add a README into the library (or both)?

thanks for amending it. i was taking care of some other things and
didn't get to this email in time. =)

> > Still, I'm fine with the lock (per cpu or global). I (and
> > perhaps whoever reads the code in the future) just want to know if
> > there's something to be worried about.
> >
>
> Well, multiple calls from different tasks can come in, and the
> operations against the per-CPU counters are definitely not atomic.

They should be atomic on Akaros, since the kernel messages (which
smp_do uses) aren't interruptible. Let's leave the locks in for now -
they aren't hurting.

> > which is before perfmon_install_session_alloc().
> >
> > That's the sort of thing that could catch someone in the future,
> > even if it is unlikely. I wasn't asking for the code to change; I
> > was just saying that a comment would have made it more clear both
> > for me and for future developers. We ought to be explicit about
> > any of the invariants about locking or concurrent access.
> >
>
> Should I add?

I'd appreciate a one-liner in there that says something like "pa must
be read-only once we publish it"

> I will revise that with the Makefrag-user-app idea, until we decide on
> the future layout.

Sounds great.

Thanks for the fixes btw. =)

Barret


Davide Libenzi

Dec 14, 2015, 5:18:28 PM
to Akaros
Currently the knowledge of the Linux perf file format is kept to a minimum.
Injecting that information on the Linux side would require heavier changes to the Linux perf tool, which I intentionally kept minimal to reduce conflicts when applying to new versions.
Similarly, if I have to crack open the perf file on the Linux side, we will need another tool on the Linux side.


Barret Rhoden

Dec 14, 2015, 5:19:49 PM
to aka...@googlegroups.com
On 2015-12-14 at 14:18 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Currently the knowledge of the Linux perf file format is kept at a
> minimum. Injecting that information at Linux side would require
> heavier changes to the Linux perf tool, which I intentionally kept at
> a minimum, to minimize conflicts in applying to new versions.
> Similarly, if I have to crack open the perf file at Linux side, we
> will have to have another tools on the Linux side.

Makes sense. So it sounds like there is a perf record or something
with the kernel binary location?

Davide Libenzi

Dec 14, 2015, 5:20:34 PM
to Akaros
Yes, it's a Linux perf record that gets injected.
Plus, we would need something if we plan to implement uname.


Davide Libenzi

Dec 14, 2015, 5:22:09 PM
to Akaros
Adding the "pa" thing now. Hold on ...

Davide Libenzi

Dec 14, 2015, 5:27:37 PM
to Akaros
Done.

Barret Rhoden

Dec 14, 2015, 5:29:21 PM
to aka...@googlegroups.com
On 2015-12-14 at 14:20 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Yes, it's a Linux perf record injected.

Cool, thanks for the info.

> Plus, we would need something, if we plan to implement uname.

I totally agree. I would like the ability to cat something to find out
what commit a kernel is on. It's not a huge deal for now, but it'll be
useful especially when debugging a kernel someone else made. Or maybe
that I made a couple weeks ago and forgot to update, etc. Also,
getting the .config would be nice too. I like the sound of a kernel
#device for all of this stuff.

Barret

Davide Libenzi

Dec 14, 2015, 5:32:23 PM
to Akaros
The change would be the makefiles preparing the build info file and sticking it into an ELF section, which the #version device would export.
Or something like that.


Barret Rhoden

Dec 14, 2015, 5:46:52 PM
to aka...@googlegroups.com
On 2015-12-14 at 14:32 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> The change would be the makefiles preparing the build info file, and
> sticking it to an ELF section, which the #version device will export.
> Or something like that.

Yeah. We could also do something like we do for the symbol table, where
there's a C file that gets generated and compiled that contains the
string we want. Those generated C files go in obj/kern/ (e.g.
ksyms-refl.c, which happens to be built by our linker script, since
the values of the C file depend on the link).

Davide Libenzi

Dec 14, 2015, 7:15:42 PM
to Akaros
OK. I am not going to do that in this CL though.
There is enough stuff here already ☺


Barret Rhoden

Dec 15, 2015, 1:53:42 PM
to Akaros
Hi -

Not sure, but your checkpatch stuff still seems to be missing things.
Do the commit hooks run on rebase? Here's ones that don't look like
false positives:

../patches/0013-Reverted-prof-kpctl-interface-to-split-start-and-tim.patch
--------------------------------------------------------------------------
WARNING: 'ouput' may be misspelled - perhaps 'output'?
#149: FILE: Documentation/profiling.txt:132:
+providing the event coordinates (of which, the ouput emitted by the commands

../patches/0017-Added-perfmon-interrupt-handling-to-allow-overflow-b.patch
--------------------------------------------------------------------------
ERROR: trailing whitespace
#31: FILE: kern/arch/x86/perfmon.c:6:
+#include <sys/types.h> $

ERROR: trailing whitespace
#32: FILE: kern/arch/x86/perfmon.c:7:
+#include <arch/ros/msr-index.h> $

ERROR: trailing whitespace
#546: FILE: kern/arch/x86/perfmon.h:8:
+#include <sys/types.h> $

../patches/0019-Added-new-perf-utility-to-access-CPU-counters-from-u.patch
--------------------------------------------------------------------------
ERROR: spaces required around that ':' (ctx:VxW)
#226: FILE: tools/profile/perf/akaros.c:116:
+ size_t nb = (max_cores >= CHAR_BIT) ? CHAR_BIT: max_cores;


../patches/0020-Enable-the-PFM-sampling-to-pass-an-64bit-info-value.patch
-------------------------------------------------------------------------
ERROR: trailing whitespace
#28: FILE: kern/arch/x86/perfmon.c:8:
+#include <arch/ros/membar.h> $

../patches/0021-Implement-flush-capability-for-the-profiler-tracing-.patch
--------------------------------------------------------------------------
WARNING: 'writting' may be misspelled - perhaps 'writing'?
#7:
per CPU buffers to be writting into the main queue, so that data is

../patches/0023-Move-Linux-perf-format-conversion-into-perf-tool-dro.patch
--------------------------------------------------------------------------
ERROR: "foo* bar" should be "foo *bar"
#1461: FILE: tools/profile/perf/xlib.c:218:
+const char* vb_decode_uint64(const char *data, uint64_t *pval)

ERROR: "foo* bar" should be "foo *bar"
#1551: FILE: tools/profile/perf/xlib.h:61:
+const char* vb_decode_uint64(const char *data, uint64_t *pval);

Barret


On 2015-12-14 at 14:27 "'Davide Libenzi' via Akaros"

Davide Libenzi

Dec 15, 2015, 3:29:00 PM
to Akaros
Fixed. The "writting" one seems to be in an old commit message, which is not in this branch, AFAICS.

Barret Rhoden

Dec 15, 2015, 3:36:30 PM
to aka...@googlegroups.com
Hi -

I have a few follow up comments, see below.

Thanks,

Barret



> From 6105697c7a3ed895c06d1265b7db79d9ecf1c015 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Fri, 27 Nov 2015 10:57:19 -0800
> Subject: Migrated devarch MSR access to new MSR API

> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)

> + msr_set_values(&msrv, values, num_cores);
> +
> + err = msr_cores_read(&cset, &msra, &msrv);
> +
> + n = -1;
> + if (likely(!err)) {
> + if (n >= num_cores * sizeof(uint64_t)) {
> + if (!memcpy_to_user_errno(current, a, values,
> + num_cores * sizeof(uint64_t)))
> + n = num_cores * sizeof(uint64_t);
> } else {

This didn't seem to get fixed or clarified. Here's my original comment:

What is going on here? n == -1, then we see if it is great enough to cover the
whole range of our output (which is always positive)?

Oh, I found it later in commit f326ef1adcc. Please apply the
fixups to the originally incorrect patch. Other than
preserving a working git history, it saves your reviewers time.


> From 99ac80e47013f91452d8e5c439ba0937a48576c0 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 14 Dec 2015 13:45:25 -0800
> Subject: Added ROS_STATIC_ASSERT to common.h and using that in assert.h
>
> Added ROS_STATIC_ASSERT to common.h and using that in assert.h.

We don't need this in ros/common.h, since userspace already has their
own. It's parlib_static_assert(), added by Dan in 4402cbebfec6
("Rename static_assert to avoid conflicting with C++'11 (and later)
(XCC)")

> From f5e38ab6bac9077cebedf147758cb3034156106b Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 16 Nov 2015 07:13:13 -0800
> Subject: Added perfmon interrupt handling to allow overflow based profiling

> +static uint64_t perfmon_get_fixevent_mask(const struct perfmon_event *pev,
> + int eventno, uint64_t base)
> +{
> + uint64_t m = 0;
> +
> + if (pev->u.b.inten)
> + m |= 1 << 3;
> + if (pev->u.b.os)
> + m |= (1 << 0);
> + if (pev->u.b.usr)
> + m |= (1 << 1);

This shouldn't compile on this commit. You change perfmon_event to use the
bitfield macros, but didn't change this. I guess you change it in a
later commit, since the final branch builds. There are other places in
this file that need the same fixup.

> From 4d09b83f75fef0da023624d6ba504480511fb646 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sat, 12 Dec 2015 13:35:31 -0800
> Subject: Enable the PFM sampling to pass an 64bit info value


> @@ -55,13 +56,13 @@ static void perfmon_read_cpu_caps(struct perfmon_cpu_caps *pcc)
>
> cpuid(0x0a, 0, &a, &b, &c, &d);
>
> - ZERO_DATA(*pcc);
> - pcc->perfmon_version = a & 0xff;
> pcc->proc_arch_events = a >> 24;
> pcc->bits_x_counter = (a >> 16) & 0xff;
> pcc->counters_x_proc = (a >> 8) & 0xff;
> pcc->bits_x_fix_counter = (d >> 5) & 0xff;
> pcc->fix_counters_x_proc = d & 0x1f;
> + wmb_f();
> + pcc->perfmon_version = a & 0xff;
> }

What is going on here? Is this an attempt to make it race-free? If
anything this should be a fixup for the original commit, but I'm not
even sure what it's doing.

> static void perfmon_enable_event(int event, bool enable)
> @@ -96,11 +97,11 @@ static uint64_t perfmon_get_fixevent_mask(const struct perfmon_event *pev,
> {
> uint64_t m = 0;
>
> - if (pev->u.b.inten)
> + if (PMEV_GET_EN(pev->event))
> m |= 1 << 3;
> - if (pev->u.b.os)
> + if (PMEV_GET_OS(pev->event))
> m |= (1 << 0);
> - if (pev->u.b.usr)
> + if (PMEV_GET_USR(pev->event))
> m |= (1 << 1);
> m <<= eventno * FIXCNTR_NBITS;

Ah, these (and others) should fixup commit #17 that changed the
bitfields.


> --- a/kern/include/ros/profiler_records.h
> +++ b/kern/include/ros/profiler_records.h
> @@ -7,9 +7,22 @@
>
> #include <sys/types.h>
>
> +#define PROF_DOM_SHIFT (8 * sizeof(uint64_t) - 4)
> +#define PROF_INFO_MASK (((uint64_t) 1 << PROF_DOM_SHIFT) - 1)

This is a little confusing. The DOM shift is just 60, right? Is that
value special (hence the 8 * sizeof - 4)?

> struct proftype_kern_trace64 {
> + uint64_t info;
> uint64_t tstamp;
> uint16_t cpu;
> uint16_t num_traces;

Subject for a future patch: are these structures copied in native
endianness, or in an endian-independent format?

> @@ -490,7 +511,7 @@ void profiler_notify_mmap(struct proc *p, uintptr_t addr, size_t size, int prot,
> int flags, struct file *f, size_t offset)
> {
> if (kref_get_not_zero(&profiler_kref, 1)) {
> - if (f && (prot & PROT_EXEC) && profiler_percpu_ctx && tracing) {
> + if (f && (prot & PROT_EXEC) && profiler_percpu_ctx) {
> char path_buf[PROFILER_MAX_PRG_PATH];
> char *path = file_abs_path(f, path_buf, sizeof(path_buf));
>
> @@ -504,7 +525,7 @@ void profiler_notify_mmap(struct proc *p, uintptr_t addr, size_t size, int prot,
> void profiler_notify_new_process(struct proc *p)
> {
> if (kref_get_not_zero(&profiler_kref, 1)) {
> - if (profiler_percpu_ctx && tracing && p->binary_path)
> + if (profiler_percpu_ctx && p->binary_path)
> profiler_push_new_process(p);
> kref_put(&profiler_kref);
> }

So now we're emitting trace info on all mmaps and process creation, even
if the profiler is turned off?

> From d2e4bae5a18d21205bf5de243450794a91674386 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 13 Dec 2015 13:58:20 -0800
> Subject: Emit build information file into KFS /etc/build.info

> +KERNEL_ELF_PATH=$(abspath $(KERNEL_OBJ))-64b

This -64b is arch specific. We'll want to get this variable from the
arch/Makefile.

> +create-build-file:
> +ifneq ($(INVARIANT_BUILD),1)

Can drop the INVARIANT_BUILD, esp since we don't have a CONFIG_ var for
it.

> + @echo "KernelPath: $(KERNEL_ELF_PATH)" > kern/kfs/etc/build.info
> + @echo "KernelSize: $(shell stat -c %s $(KERNEL_ELF_PATH))" >> \
> + kern/kfs/etc/build.info
> + @echo "Date: `date`" >> kern/kfs/etc/build.info
> + @echo "Host: `hostname`" >> kern/kfs/etc/build.info

Use $(FIRST_KFS_PATH) instead of kern/kfs for all of these. (or rather
use $(FIRST_KFS_PATH) to create KFS_BUILD_INFO_PATH or something).

If you're removing this in an upcoming patchset (the #version stuff),
then don't worry about any of this stuff.



On 2015-12-14 at 14:27 "'Davide Libenzi' via Akaros"

Davide Libenzi

Dec 15, 2015, 3:58:13 PM
to Akaros
On Tue, Dec 15, 2015 at 12:36 PM, Barret Rhoden <br...@cs.berkeley.edu> wrote:
> Date: Fri, 27 Nov 2015 10:57:19 -0800
> Subject: Migrated devarch MSR access to new MSR API

> @@ -362,22 +363,28 @@ static long archread(struct chan *c, void *a, long n, int64_t offset)

> +         msr_set_values(&msrv, values, num_cores);
> +
> +         err = msr_cores_read(&cset, &msra, &msrv);
> +
> +         n = -1;
> +         if (likely(!err)) {
> +             if (n >= num_cores * sizeof(uint64_t)) {
> +                 if (!memcpy_to_user_errno(current, a, values,
> +                                           num_cores * sizeof(uint64_t)))
> +                     n = num_cores * sizeof(uint64_t);
>               } else {

This didn't seem to get fixed or clarified.  Here's my original comment:

What is going on here?  n == -1, then we see if it is great enough to cover the
whole range of our output (which is always positive)?

        Oh, I found it later in commit f326ef1adcc.  Please apply the
        fixups to the originally incorrect patch.  Other than
        preserving a working git history, it saves your reviewers time.

Sorry, I applied that to the head of that file.
Now it is kind of a mess to apply only those changes to an older commit, when they have been merged into that one.
Do you have any recipe?


 


> From 99ac80e47013f91452d8e5c439ba0937a48576c0 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Mon, 14 Dec 2015 13:45:25 -0800
> Subject: Added ROS_STATIC_ASSERT to common.h and using that in assert.h
>
> Added ROS_STATIC_ASSERT to common.h and using that in assert.h.

We don't need this in ros/common.h, since userspace already has their
own.  It's parlib_static_assert(), added by Dan in 4402cbebfec6
("Rename static_assert to avoid conflicting with C++'11 (and later)
(XCC)")

Dropped the commit.




> +static uint64_t perfmon_get_fixevent_mask(const struct perfmon_event *pev,
> +                                                                               int eventno, uint64_t base)
> +{
> +     uint64_t m = 0;
> +
> +     if (pev->u.b.inten)
> +             m |= 1 << 3;
> +     if (pev->u.b.os)
> +             m |= (1 << 0);
> +     if (pev->u.b.usr)
> +             m |= (1 << 1);

This shouldn't compile on this commit.  You change perfmon_event to use the
bitfield macros, but didn't change this.  I guess you change it in a
later commit, since the final branch builds.  There are other places in
this file that need the same fixup.

Again, sorry. Is there an easy way to apply only those changes back, when these plus other changes are in a future commit?





> @@ -55,13 +56,13 @@ static void perfmon_read_cpu_caps(struct perfmon_cpu_caps *pcc)
>
>       cpuid(0x0a, 0, &a, &b, &c, &d);
>
> -     ZERO_DATA(*pcc);
> -     pcc->perfmon_version = a & 0xff;
>       pcc->proc_arch_events = a >> 24;
>       pcc->bits_x_counter = (a >> 16) & 0xff;
>       pcc->counters_x_proc = (a >> 8) & 0xff;
>       pcc->bits_x_fix_counter = (d >> 5) & 0xff;
>       pcc->fix_counters_x_proc = d & 0x1f;
> +     wmb_f();
> +     pcc->perfmon_version = a & 0xff;
>  }

What is going on here?  Is this an attempt to make it race-free?  If
anything this should be a fixup for the original commit, but I'm not
even sure what it's doing.

A CPU reading version != 0 could go ahead and use values which have not yet been populated.


 

>  static void perfmon_enable_event(int event, bool enable)
> @@ -96,11 +97,11 @@ static uint64_t perfmon_get_fixevent_mask(const struct perfmon_event *pev,
>  {
>       uint64_t m = 0;
>
> -     if (pev->u.b.inten)
> +     if (PMEV_GET_EN(pev->event))
>               m |= 1 << 3;
> -     if (pev->u.b.os)
> +     if (PMEV_GET_OS(pev->event))
>               m |= (1 << 0);
> -     if (pev->u.b.usr)
> +     if (PMEV_GET_USR(pev->event))
>               m |= (1 << 1);
>       m <<= eventno * FIXCNTR_NBITS;

Ah, these (and others) should fixup commit #17 that changed the
bitfields.

Yeah, sorry about that.

 
>  #include <sys/types.h>
>
> +#define PROF_DOM_SHIFT (8 * sizeof(uint64_t) - 4)
> +#define PROF_INFO_MASK (((uint64_t) 1 << PROF_DOM_SHIFT) - 1)

This is a little confusing.  The DOM shift is just 60, right?  Is that
value special (hence the 8 * sizeof - 4)?

Why confusing? It just leaves 4 bits for domain.



 

>  struct proftype_kern_trace64 {
> +     uint64_t info;
>       uint64_t tstamp;
>       uint16_t cpu;
>       uint16_t num_traces;

Subject for a future patch: are these structures copied in native
endianness, or in an endian-independent format?

They are machine dependent (they assume the machine's native endianness).
We could make them independent, but I am not sure it is worth the effort at this time.


 

> @@ -490,7 +511,7 @@ void profiler_notify_mmap(struct proc *p, uintptr_t addr, size_t size, int prot,
>                                                 int flags, struct file *f, size_t offset)
>  {
>       if (kref_get_not_zero(&profiler_kref, 1)) {
> -             if (f && (prot & PROT_EXEC) && profiler_percpu_ctx && tracing) {
> +             if (f && (prot & PROT_EXEC) && profiler_percpu_ctx) {
>                       char path_buf[PROFILER_MAX_PRG_PATH];
>                       char *path = file_abs_path(f, path_buf, sizeof(path_buf));
>
> @@ -504,7 +525,7 @@ void profiler_notify_mmap(struct proc *p, uintptr_t addr, size_t size, int prot,
>  void profiler_notify_new_process(struct proc *p)
>  {
>       if (kref_get_not_zero(&profiler_kref, 1)) {
> -             if (profiler_percpu_ctx && tracing && p->binary_path)
> +             if (profiler_percpu_ctx && p->binary_path)
>                       profiler_push_new_process(p);
>               kref_put(&profiler_kref);
>       }

So now we're emitting trace info on all mmaps and process creation, even
if the profiler is turned off?

No. If the profiler is not started, percpu_ctx is NULL.


 

> From d2e4bae5a18d21205bf5de243450794a91674386 Mon Sep 17 00:00:00 2001
> From: Davide Libenzi <dlib...@google.com>
> Date: Sun, 13 Dec 2015 13:58:20 -0800
> Subject: Emit build information file into KFS /etc/build.info

> +KERNEL_ELF_PATH=$(abspath $(KERNEL_OBJ))-64b

This -64b is arch specific.  We'll want to get this variable from the
arch/Makefile.

How?


 

> +create-build-file:
> +ifneq ($(INVARIANT_BUILD),1)

Can drop the INVARIANT_BUILD, esp since we don't have a CONFIG_ var for
it.

> +             @echo "KernelPath: $(KERNEL_ELF_PATH)" > kern/kfs/etc/build.info
> +             @echo "KernelSize: $(shell stat -c %s $(KERNEL_ELF_PATH))" >> \
> +                     kern/kfs/etc/build.info
> +             @echo "Date: `date`" >> kern/kfs/etc/build.info
> +             @echo "Host: `hostname`" >> kern/kfs/etc/build.info

Use $(FIRST_KFS_PATH) instead of kern/kfs for all of these.  (or rather
use $(FIRST_KFS_PATH) to create KFS_BUILD_INFO_PATH or something).

If you're removing this in an upcoming patchset (the #version stuff),
then don't worry about any of this stuff.

Yeah, those are gone in #version.
The -64b path issue still remains.



Barret Rhoden

Dec 15, 2015, 4:19:58 PM
to aka...@googlegroups.com
On 2015-12-15 at 12:58 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Sorry, I applied that to the head of that file.
> Now it is kind of a mess to apply only those to an older commit, when
> the changes have been merged into that one.
> Do you have any recipe?

Not a one-liner. What I usually do to split up a commit is this:

$ git rebase -i some-commit-far-back-enough

pick the commit you want to split. change it from "pick" to "edit".
take note of the commit hash (will need it later).

git will stop there, then you can do whatever you want to that commit.

at that point i do:

$ git reset HEAD^

(pops off the commit.)

$ git add -p

(selectively add back in all of the things that should have been in
that original commit, but not the things you don't want)

$ git commit -c HASH_FROM_THE_COMMIT_ORIGINALLY

then add everything else from that commit to a WIP commit:

$ git add -p
$ git commit -m WIP-fixing-whatever

Then finish the git rebase:
$ git rebase --continue

At this point, all that did was split up a commit into two commits.
You could do whatever you want during that 'edit' phase of the rebase,
including making more commits.

Then you do another rebase -i, and move the WIP-fixing-whatever to the
right spot in the history.


If it turns out to be too much of a pain, then don't worry. But try to
catch it next time. =)

> > > #include <sys/types.h>
> > >
> > > +#define PROF_DOM_SHIFT (8 * sizeof(uint64_t) - 4)
> > > +#define PROF_INFO_MASK (((uint64_t) 1 << PROF_DOM_SHIFT) - 1)
> >
> > This is a little confusing. The DOM shift is just 60, right? Is
> > that value special (hence the 8 * sizeof - 4)?
> >
>
> Why confusing? It just leaves 4 bits for domain.

I was a little confused about the math computing the DOM_SHIFT. I see
that it's just 60, with 4 bits for dom. I was wondering if the
sizeof(u64) was trying to save enough space in the INFO_MASK or
something. Feel free to leave it as is.

> > Subject for a future patch: are these structures copied in native
> > endianness, or in an endian-independent format?
> >
>
> They are machine dependent (they assume the machine's native endianness).
> We could make them independent, but I am not sure it is worth the
> effort at this time.

Probably not. =)

> > > void profiler_notify_new_process(struct proc *p)
> > > {
> > > if (kref_get_not_zero(&profiler_kref, 1)) {
> > > - if (profiler_percpu_ctx && tracing &&
> > > p->binary_path)
> > > + if (profiler_percpu_ctx && p->binary_path)
> > > profiler_push_new_process(p);
> > > kref_put(&profiler_kref);
> > > }
> >
> > So now we're emitting trace info on all mmaps and process creation,
> > even if the profiler is turned off?
> >
>
> No. If the profiler is not started, percpu_ctx is NULL.

Ah, whoops. I missed free_cpu_buffers() being called whenever we stop
profiling. =)

> > > From d2e4bae5a18d21205bf5de243450794a91674386 Mon Sep 17 00:00:00
> > > 2001 From: Davide Libenzi <dlib...@google.com>
> > > Date: Sun, 13 Dec 2015 13:58:20 -0800
> > > Subject: Emit build information file into KFS /etc/build.info
> >
> > > +KERNEL_ELF_PATH=$(abspath $(KERNEL_OBJ))-64b
> >
> > This -64b is arch specific. We'll want to get this variable from
> > the arch/Makefile.
> >
>
> How?

KERNEL_ELF_PATH can be defined in each arch's Makefile, which gets
included into the top-level Makefile. The existence of the -64b
version of the file is an x86 thing. At least for now. (It's created
by this:

# Need to change the format to 32 bit, to trick multiboot/grub1 into loading
ARCH_POST_LINK_CMD = cp $@ $@-64b; $(OBJCOPY) -I elf64-x86-64 -O elf32-i386 $@

Barret

Davide Libenzi

Dec 15, 2015, 4:31:52 PM
to Akaros
Please ☺
I know where I fscked up.
I did a `git log --oneline CHANGED_FILE | head -n 1` to get the commit to splice the new changes in (as in, rebase-i-fixup), thinking that the branch only had one commit touching that file.
Hopefully I will remember next time.



> How?

> KERNEL_ELF_PATH can be defined in each arch's Makefile, which gets
> included into the top-level Makefile.  The existence of the -64b
> version of the file is an x86 thing.  At least for now.  (It's created
> by this:
>
> # Need to change the format to 32 bit, to trick multiboot/grub1 into loading
> ARCH_POST_LINK_CMD = cp $@ $@-64b; $(OBJCOPY) -I elf64-x86-64 -O elf32-i386 $@

Doing that in the build_info branch (#version stuff).
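A sketch of what moving KERNEL_ELF_PATH into the per-arch Makefiles could look like (the riscv line is a guess; only the x86 post-link copy is shown in the thread):

```make
# kern/arch/x86/Makefile (sketch): the post-link step keeps a 64-bit
# copy, so x86 points KERNEL_ELF_PATH at it.
KERNEL_ELF_PATH := $(abspath $(KERNEL_OBJ))-64b

# kern/arch/riscv/Makefile (sketch): no post-link copy, plain ELF.
# KERNEL_ELF_PATH := $(abspath $(KERNEL_OBJ))

# The top-level Makefile includes kern/arch/$(ARCH)/Makefile, so rules
# like create-build-file can use $(KERNEL_ELF_PATH) unconditionally.
```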


Barret Rhoden

Dec 15, 2015, 4:34:12 PM
to aka...@googlegroups.com
On 2015-12-15 at 13:31 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> > If it turns out to be too much of a pain, then don't worry. But
> > try to catch it next time. =)
> >
>
> Please ☺

Sounds good. =)

> I know where I fscked up.
> I did a `git log --oneline CHANGED_FILE | head -n 1` to get the
> commit to splice the new changes in (as in, rebase-i-fixup), thinking
> that the branch only had one commit touching that file.
> Hopefully I will remember next time.

Ah, good stuff.

> > # Need to change the format to 32 bit, to trick multiboot/grub1 into
> > loading
> > ARCH_POST_LINK_CMD = cp $@ $@-64b; $(OBJCOPY) -I elf64-x86-64 -O
> > elf32-i386 $@
> >
>
> Doing that in the build_info branch (#version stuff).

Sounds good.

Other than that, let me know if there are any other changes, and I
repull the branch.

Thanks,

Barret


Davide Libenzi

Dec 15, 2015, 4:37:02 PM
to Akaros
That branch is ready AFAIK.
Then, over that, there are the #interrupts and #version ones.





Davide Libenzi

Dec 15, 2015, 5:25:40 PM
to Akaros
Mystery of my pre-commit not firing ... permissions 😐

Barret Rhoden

Dec 15, 2015, 5:35:00 PM
to aka...@googlegroups.com
On 2015-12-15 at 14:25 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Mistery of my pre-commit not firing ... permissions 😐

Noooooooooooooo!

Davide Libenzi

Dec 15, 2015, 5:37:54 PM
to Akaros
Of course, git was not giving any hint that it saw the hook but was not executing it.
I guess it might have done something like:

if (!access("hooks/pre-commit", X_OK)) exec_it();


Barret Rhoden

Dec 16, 2015, 5:45:42 PM
to aka...@googlegroups.com
Hi -

I get a GPF when I try to run this in qemu on the wrmsr in perf_init.

I put together a minor fix for this (pasted below, and also at
origin/perf-b). If it looks good to you, I'll merge this patch set.

Also, I tried using it on my desktop. The non-interrupt use looks fine:

/ $ perf record -e TLB_FLUSH:STLB_ANY,int=0,icount=0 -- sleep 10
Event: TLB_FLUSH:STLB_ANY
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2

With the interrupt:

/ $ perf record -e TLB_FLUSH:STLB_ANY,int=1,icount=20 -- sleep 10
Event: TLB_FLUSH:STLB_ANY
281474976710643 281474976710636 281474976710636 281474976710636
281474976710636 281474976710636 281474976710636 281474976710636
281474976710636 28147497671

Those are close to negative numbers (e.g. 0xfffffffffff3). Whatever
the issue, we can deal with it in a follow-up patch.

Barret


From 64d18dc7359cdda1f2980e19db62f0ec6400e31f Mon Sep 17 00:00:00 2001
From: Barret Rhoden <br...@cs.berkeley.edu>
Date: Wed, 16 Dec 2015 17:14:36 -0500
Subject: x86: Detect and handle missing perf support

If a machine has perf version 0, which is the case for my Qemu, we'll get a
GPF during initialization. The per core initialization and any accesses to
the Qperf file will abort if we don't have the right version.

This assumes that if open of a Qperf fails, that there is no other way for
the user to trigger access to the perf MSRs.

Signed-off-by: Barret Rhoden <br...@cs.berkeley.edu>
---
 kern/arch/x86/devarch.c  |  2 ++
 kern/arch/x86/init.c     |  2 +-
 kern/arch/x86/perfmon.c  | 24 +++++++++++++-----------
 kern/arch/x86/perfmon.h  |  4 +++-
 kern/arch/x86/smp_boot.c |  4 ++--
 5 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/kern/arch/x86/devarch.c b/kern/arch/x86/devarch.c
index 6549891548aa..04dba60d728c 100644
--- a/kern/arch/x86/devarch.c
+++ b/kern/arch/x86/devarch.c
@@ -475,6 +475,8 @@ static struct chan *archopen(struct chan *c, int omode)
 	c = devopen(c, omode, archdir, Qmax, devgen);
 	switch ((uint32_t) c->qid.path) {
 	case Qperf:
+		if (!perfmon_supported())
+			error(ENODEV, "perf is not supported");
 		assert(!c->aux);
 		c->aux = arch_create_perf_context();
 		break;
diff --git a/kern/arch/x86/init.c b/kern/arch/x86/init.c
index fe35276262d1..a2080485938c 100644
--- a/kern/arch/x86/init.c
+++ b/kern/arch/x86/init.c
@@ -80,6 +80,7 @@ void arch_init()
 	save_fp_state(&x86_default_fpu);	/* used in arch/trap.h for fpu init */
 	pci_init();
 	vmm_init();
+	perfmon_global_init();
 	// this returns when all other cores are done and ready to receive IPIs
 #ifdef CONFIG_SINGLE_CORE
 	smp_percpu_init();
@@ -88,7 +89,6 @@ void arch_init()
 #endif
 	proc_init();
 
-	perfmon_init();
 	cons_irq_init();
 	intel_lpc_init();
 #ifdef CONFIG_ENABLE_LEGACY_USB
diff --git a/kern/arch/x86/perfmon.c b/kern/arch/x86/perfmon.c
index 810be3c6e5ae..4ea89d8f3ba8 100644
--- a/kern/arch/x86/perfmon.c
+++ b/kern/arch/x86/perfmon.c
@@ -291,23 +291,25 @@ static void perfmon_arm_irq(void)
 	write_mmreg32(LAPIC_LVT_PERFMON, IdtLAPIC_PCINT);
 }
 
-void perfmon_init(void)
+bool perfmon_supported(void)
+{
+	return cpu_caps.perfmon_version >= 2;
+}
+
+void perfmon_global_init(void)
+{
+	perfmon_read_cpu_caps(&cpu_caps);
+}
+
+void perfmon_pcpu_init(void)
 {
 	int i;
 
+	if (!perfmon_supported())
+		return;
 	/* Enable user level access to the performance counters */
 	lcr4(rcr4() | CR4_PCE);
 
-	/* This will be called from every core, no need to execute more than once.
-	 * All the call to perfmon_init() will be done when the core boots, so
-	 * they will be no perfmon users calling it, while perfmon_read_cpu_caps()
-	 * is executing.
-	 * All the cores will be writing the same values, so even from that POV,
-	 * no serialization is required.
-	 */
-	if (cpu_caps.perfmon_version == 0)
-		perfmon_read_cpu_caps(&cpu_caps);
-
 	/* Reset all the counters and selectors to zero.
 	 */
 	write_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
diff --git a/kern/arch/x86/perfmon.h b/kern/arch/x86/perfmon.h
index a655098df9bb..822b96a9e9cf 100644
--- a/kern/arch/x86/perfmon.h
+++ b/kern/arch/x86/perfmon.h
@@ -49,7 +49,9 @@ struct perfmon_status {
 	uint64_t cores_values[0];
 };
 
-void perfmon_init(void);
+bool perfmon_supported(void);
+void perfmon_global_init(void);
+void perfmon_pcpu_init(void);
 void perfmon_interrupt(struct hw_trapframe *hw_tf, void *data);
 void perfmon_get_cpu_caps(struct perfmon_cpu_caps *pcc);
 int perfmon_open_event(const struct core_set *cset, struct perfmon_session *ps,
diff --git a/kern/arch/x86/smp_boot.c b/kern/arch/x86/smp_boot.c
index 7f6f4d9bf1fb..e248d9edee1f 100644
--- a/kern/arch/x86/smp_boot.c
+++ b/kern/arch/x86/smp_boot.c
@@ -302,7 +302,7 @@ void __arch_pcpu_init(uint32_t coreid)
 	assert(read_msr(MSR_KERN_GS_BASE) == (uint64_t)pcpui);
 	/* Don't try setting up til after setting GS */
 	x86_sysenter_init(x86_get_stacktop_tss(pcpui->tss));
-	/* need to init perfctr before potentiall using it in timer handler */
-	perfmon_init();
+	/* need to init perfctr before potentially using it in timer handler */
+	perfmon_pcpu_init();
 	vmm_pcpu_init();
 }
-- 
2.6.0.rc2.230.g3dd15c0


Davide Libenzi

Dec 16, 2015, 5:52:08 PM
to Akaros
You need to use "-cpu host" in kvm/qemu. Also, those negative numbers are fine.
The IRQ trigger works by programming the counter to -icount and waiting for it to overflow to zero, which raises the PMI IRQ.
The patch is OK, but I don't think we need to worry about PMI version 0, especially when we already committed to not supporting anything which does not have x2apic ☺


Davide Libenzi

Dec 16, 2015, 5:56:56 PM
to Akaros
Also, you do not need to use int=0,icount=0, as they are implicitly so.

Barret Rhoden

Dec 16, 2015, 6:15:46 PM
to aka...@googlegroups.com
On 2015-12-16 at 14:52 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> You need to use "-cpu host" in kvm/qemu. Also, those negative numbers
> are fine.
> The IRQ trigger works by programming the counter to -icount and waiting
> for it to overflow to zero, which raises the PMI IRQ.

Okay, so if we're using int and icount, then we should ignore the
values we get back and just look at the backtrace samples? (similar to
the timer stuff)

> The patch is OK, but I don't think we need to worry about PMI version
> 0, especially when we already committed to not supporting anything
> which does not have x2apic ☺

Qemu has had support for x2apic for a while. My rule for "burning the
bridges" is that for virtualization, we can have whatever requirements
we want (burn all the bridges). For general operation, like just
booting, I'd like a basic qemu to run - and preferably other VMMs.
(maybe Akaros too!).

Anyway, thanks for this patch set. =)

Merged to master at c9d5a6940cf8..6577674d3c3c (from, to]

You can see the entire diff with 'git diff' or at
https://github.com/brho/akaros/compare/c9d5a6940cf8...6577674d3c3c


Barret

Davide Libenzi

Dec 16, 2015, 6:29:23 PM
to Akaros
Yes, perf dumps the status of the counters when it exits, independently of whether they have been configured with int=1 or not.

