Is AutoFDO compile / PMU hardware branch event collection possible on GCE?


Kirill Katsnelson

Feb 22, 2021, 10:19:58 PM
to gce-discussion
I attempted collecting branch statistics with the PMU hardware events for PGO/AutoFDO compiled optimization, to no avail. It appears that the PMU is not exposed to the VMs. Is there any way to enable it?

I attempted a few crazy things, like installing different kernels (from 4.19 to 5.10, all pretty recent), and disabling all mitigations in the kernel command line ('mitigations=off', thinking that the Spectre mitigation may interfere with branch stats collection), but none of them helped.

This is the standard Debian 10 image, and all support for hardware PMU perf is indeed compiled in.

Indeed, dmesg shows (N1 and N2 instances; CPU model varies)

[    0.768027] Performance Events: unsupported p6 CPU model 85 no PMU driver, software events only.

and perf reports that it's unable to collect branch prediction statistics

$ sudo sysctl kernel.perf_event_paranoid=-1
kernel.perf_event_paranoid = -1
$ perf record -b -- sleep .5
Error:
cpu-clock: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
$ perf record -b -e branches -- sleep .5
Error:
The branches event is not supported.

Intel VTune agrees:

[  782.787868] vtsspp: Driver version 1.8.237-613804
[  782.792812] vtsspp: Kernel version 4.19.0-14-cloud-amd64
[  782.798264] vtsspp: Detected 6 CPUs
[  782.801870] vtsspp: CPU family: 0x06, model: 0x55, stepping: 03, HT: yes
[  782.808698] vtsspp: CPU freq: 2000184KHz, timer freq: 1000000KHz
[  782.814862] vtsspp: PMU: fixed counters: 0, general counters: 0
[  782.820961] vtsspp: PMU counters are not detected
[  782.826488] vtsspp: KPTI is enabled
[  782.830858] vtsspp: KASLR is detected
[  782.834670] vtsspp: Use sched tracepoints
[  782.838837] vtsspp: Failed to initialize driver

and the cpu counters symlink is absent entirely from /sys/bus/event_source/devices

$ sudo ls -l /sys/bus/event_source/devices
total 0
lrwxrwxrwx 1 root root 0 Feb 22 18:29 breakpoint -> ../../../devices/breakpoint
lrwxrwxrwx 1 root root 0 Feb 22 18:29 kprobe -> ../../../devices/kprobe
lrwxrwxrwx 1 root root 0 Feb 22 18:29 msr -> ../../../devices/msr
lrwxrwxrwx 1 root root 0 Feb 22 18:44 power -> ../../../devices/power
lrwxrwxrwx 1 root root 0 Feb 22 18:29 software -> ../../../devices/software
lrwxrwxrwx 1 root root 0 Feb 22 18:29 tracepoint -> ../../../devices/tracepoint
lrwxrwxrwx 1 root root 0 Feb 22 18:29 uprobe -> ../../../devices/uprobe

Compare my home base

$ sudo ls -l /sys/bus/event_source/devices/cpu/events
total 0
-r--r--r-- 1 root root 4096 Feb 17 22:36 branch-instructions
-r--r--r-- 1 root root 4096 Feb 17 22:36 branch-misses
. . . .
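
For anyone who wants this comparison in script form, here is a minimal sketch. The function name `has_cpu_pmu` is made up for illustration, and the `cpu` device name is an assumption that holds for non-hybrid x86 machines like the ones in this thread; other architectures register their PMU under different names.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): succeeds only if a hardware
# "cpu" PMU with an events directory is registered under the given
# event_source directory. Defaults to the real sysfs location.
has_cpu_pmu() {
    dir="${1:-/sys/bus/event_source/devices}"
    # -d follows the symlink that sysfs places here
    [ -d "$dir/cpu/events" ]
}

if has_cpu_pmu; then
    echo "hardware PMU present"
else
    echo "no hardware PMU: software events only"
fi
```

The directory argument exists only so that the check can be pointed at a mock tree when testing; on a real machine, call it with no arguments.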

Perf does not have the 'branches' event, which is the essential one for AutoFDO collection:

$ perf list | head -5
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
$ perf list | grep branches
$

Compare my own machine again

$ perf list | head -5
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
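
The same `perf list` test could gate an automated build before any profiling is attempted. This is only a sketch: the function name `lists_branches_event` is hypothetical, and the pattern assumes the `perf list` output format shown above, which may differ in other perf versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `perf list` output on stdin and succeeds
# if the "branches" alias is listed as a hardware event.
lists_branches_event() {
    grep -Eq 'branches[[:space:]]+\[Hardware event\]'
}

if command -v perf >/dev/null 2>&1 &&
   perf list 2>/dev/null | lists_branches_event; then
    echo "branches event available"
else
    echo "branches event missing: cannot collect AutoFDO profiles here"
fi
```

A build job could exit non-zero in the `else` branch so that it fails early instead of partway through the profile run.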

I can run the full perf/AutoFDO build at home, but cannot transmogrify it into an automated Jenkins build on a GCE instance. Help!!! Everything points to this part of the PMU hardware not virtualized into the VM. Is there a magic flag on the VM to enable it? Can't believe it could be impossible: AutoFDO was invented right here. :)

Thanks,

 -kkm

Pedro Moreno

Apr 7, 2021, 6:56:42 AM
to gce-discussion
Unfortunately, it is not possible yet. Please have a look at a similar request in our issue tracker. This feature has already been requested, and the engineering team is working on it.

Kirill Katsnelson

Apr 24, 2021, 4:36:20 PM
to gce-discussion
Thanks much for the issue link! I starred it. It's not a simple problem to solve, I reckon: KVM (the original KVM) doesn't really virtualize PMU, AFAIK; it rather allows pass-through access to it, which may be an isolation issue between clients sharing the same physical server. Hyper-V does PMU virtualization, but I had issues with it, at least on a Windows 10 host. Maybe the server edition of Hyper-V is more sophisticated. But Google SWE are da best and can solve any problem¹: I'm a former one myself. :)

I would not mind if the pass-through PMU were available on sole-tenancy nodes, tho: the c2-node-60-240 is sized about right for pogo² builds and looks pretty affordable—should be $a_cup_of_fancy_latte per build with full stat collection, which can be later reused for a while until code changes substantially. Besides, c2 is the target runtime CPU for the software³, so that properly collected perf stats are the most representative of the real runtime scenario.
 ____
¹ Jeff Dean knows whether or not P=NP. He just doesn't want to reveal a spoiler to this little puzzle.
² Microsoft's term is Profile-Guided Optimization (PGO), but I've always heard it pronounced "pogo" by MS folks. Wondering if its GCC/Clangese equivalent, FDO, is pronounced "Fido." I would! :))
³ Behold in awe the power of Moore's exponent: the size of my '90s '486 desktop PC's hard drive was a tad smaller than this CPU's L3 cache. :)

 -kkm

Ahmad P - Cloud Platform Support

Apr 26, 2021, 10:18:32 AM
to gce-discussion
Thank you for the shared information.

Please add this information and your questions to this public issue tracker, so that they can be followed up with our product team.