On Wed, May 12, 2021 at 2:00 AM Primiano Tucci <prim...@google.com> wrote:
+Ryan Savitski who is working on perfetto_perf (the sampling profiler) and has a better view of future plans for traced_perf (the tracing-integrated sampling profiler).
Summary from my side: as of very recently (~6 months?) we have a new daemon in perfetto (traced_perf) to do callstack sampling (example cfg).
Right now we are quite focused on lowering unwinding times and memory on Android (there are various challenges due to lack of FP and compressed symbols), and it supports only very few counters (these). At some point we'll get to extend that support, but I am pretty sure it's not there right now.
From what I see in the current code, the perf event support in perfetto appears to be focused exclusively on CPU perf events. The type of perf events I'm referring to are not associated with any PID running on the CPU as they are system bus counters. These counters expose things like the following:
1) DRAM interface load percentage
2) Number of bytes written by the 3D GPU to DRAM
3) Number of bytes read by the display controller
4) Combined read/write bytes between ML (machine learning) IP core and DRAM interface
Several ARM SoCs have drivers exposing this type of data via the perf_event framework. The drivers can be found in drivers/perf/... Additionally, for ARM SoCs that utilize the CCI (cache-coherent interconnect), there is also a perf driver that exposes the counters of this IP core.
The idea I have in mind is to expose the counter data from these drivers, as it can provide very useful insight into why things are not behaving as expected. There are many cases with ARM SoCs where the various IP cores such as the GPU and CPU are not at 100% load while performance is still slow, and it turns out that there is no DRAM bandwidth left. Seeing DRAM bandwidth load or any other bus-related HW counters would help in this case.
On Mon, May 17, 2021 at 5:16 AM Chris Healy <cph...@gmail.com> wrote:
Hi Primiano, et al,
[...]
I have to admit my ignorance here (Ryan knows more than me, I am just going by vague memory).
I thought that by design perf_event_open events are either CPU or PID scoped. From https://man7.org/linux/man-pages/man2/perf_event_open.2.html you can't just pass -1 to both.
perf stat -a -e imx8_ddr0/cycles/,imx8_ddr0/read-cycles/,imx8_ddr0/write-cycles/ -I 100 sleep 1
# time counts unit events
0.100275004 80211300 imx8_ddr0/cycles/
0.100275004 12556342 imx8_ddr0/read-cycles/
0.100275004 32476 imx8_ddr0/write-cycles/
0.200964848 80513498 imx8_ddr0/cycles/
0.200964848 12545674 imx8_ddr0/read-cycles/
0.200964848 4771 imx8_ddr0/write-cycles/
How does one tell perf_event_open to capture events that are "globally" scoped? I thought in cases like this they end up looking as if they are CPU scoped, but the CPU buffer they end up on is purely whatever CPU the counter overflow happened on.
In other words, I think that becomes more a UI feature to know that some perf events shouldn't be aggregated by CPU because they are really global. But at capture time I think they should be treated as if they are CPU-bound, just because that is what perf_event_open assumes?
But maybe I am completely misunderstanding how perf_event_open deals with global counters?
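
As a point of reference for the scoping question: system-wide ("uncore") PMU drivers usually publish a `cpumask` file under `/sys/bus/event_source/devices/<pmu>/`, and the perf tool opens such events with pid = -1 on only the CPU(s) listed there, so the event is formally CPU-scoped even though the counter is global. The following is a minimal sketch of reading that mask, not anything taken from perfetto. `parse_cpu_list` and `pmu_cpus` are hypothetical helper names, and the `sysfs_root` parameter is an assumption added so the code can be pointed at a fake directory tree.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pmu_cpus(pmu, sysfs_root="/sys/bus/event_source/devices"):
    """Return the CPUs named in the PMU's cpumask file, or None if it has none.

    Assumption: a missing cpumask file means the PMU is per-CPU, so the
    caller would open the event on every CPU instead.
    """
    path = Path(sysfs_root) / pmu / "cpumask"
    if not path.is_file():
        return None
    return parse_cpu_list(path.read_text())
```

On a board with the driver from the output above, `pmu_cpus("imx8_ddr0")` would give the CPU index (or indices) to pass as the `cpu` argument of perf_event_open, with `pid` set to -1.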
Hello Chris, very sorry for the huge delay. I should be much more available going forward.
As you've found out, the integrated perf profiler currently lacks the configuration options to
specify arbitrary counters. Architecturally, recording such platform PMU counters doesn't require
any reworks, as everything is exposed through `perf_event_open`. It's a matter of setting the
right pmu/event identifiers in the syscall argument struct (usually just `type` and `config`).
Let me try to itemise the lacking features for this, as I see them:
* Definitely necessary: propagating arbitrary raw values into the perf_event_attr struct. I'm
picturing this as an option in the perfetto config, where you directly specify the raw uint64s for
the perf_event_attr.{type,config}.
* Possibly necessary: specifying specific cpu masks to enable the event on. By default we configure
the requested counter on every cpu (same as what "perf -a" should be doing), but depending on how
the PMU driver is implemented, it might expect the non-cpu-specific counters to be configured at
most once.
* Not strictly necessary, but desirable: leader sampling of counters. This is more accurate if
you're comparing or taking ratios of counter values, as it'll snapshot them as a group, instead of
polling each at its own (possibly unaligned) period. Though in the integrated profiler, you can
already do the equivalent of "perf stat -e event1 -e event2" (i.e. two concurrent counters that are
not sampled as a group) by putting a single counter per perfetto DataSource.
* Necessary, but trivial: graphing the counters in the UI, no unknowns here.
* Nice to have, but not necessary for local use-cases: discovering the custom PMU events via
`/sys/bus/event_source/devices/`. Locally you could manually find the right id of the PMU and the
event, but it'd be nice if you could just give the symbolic name (such as imx8_ddr0/read-cycles/)
instead.
We want to do all of the above, but the timeline is a question of priorities within the team.
Barring surprises, I'd expect to see the minimal support for raw events within a month (in the
codebase and therefore the standalone build).
Thank you for reaching out, it's useful to hear about feature requests with particular use-cases. Do
tell me if I'm overlooking anything above, but I think that would cover use-cases like the AXI bus
counters.
-Ryan
Is there a HW platform you are working with that supports these bus HW counters via perf API?
Is there anything I can do to help with this? I do have HW that has bus HW counters accessible via the perf API. (Some public and some not.)