Re: perfetto and bus utilization perf_event

Antonio Caggiano

May 10, 2021, 12:00:27 PM

to Chris Healy, Daniel Stone, perfet...@googlegroups.com

Hi Chris,

Thank you very much for your message!

I am not very experienced with using perf events in perfetto, but I am
sure you can find someone with the answer to your question on this
mailing list: perfet...@googlegroups.com

Best regards,

Antonio

On 08/05/21 22:22, Chris Healy wrote:
> Hi Antonio,
>
> I see you landed perfetto integration in Mesa yesterday, congratulations!
>
> I'm working with some ARM SoCs that have ARM Mali GPUs so am quite
> excited to give this a try.
>
> The SoCs I'm working with also have support for "perf" HW events that
> expose overall DDR bus utilization in bytes as well as per-AXI bus
> master usage in bytes. (Drivers for this typically exist in
> drivers/perf/...)
>
> I see reference to perfetto having support for perf events here:
> https://perfetto.dev/docs/reference/trace-config-proto#PerfEventConfig
>
> It's not clear to me though if this support is just for CPU events or if
> there is sufficient support for HW events that are independent of CPU
> events. In the case of the perf drivers that expose bus utilization
> metrics, these drivers typically set the "PERF_PMU_CAP_NO_EXCLUDE"
> capability.
>
> Do you know if perf events for bus utilization could be visualized in
> perfetto in parallel with GPU counters with the existing codebase?
>
> Regards,
>
> Chris

Primiano Tucci

May 12, 2021, 5:00:52 AM

to Antonio Caggiano, Ryan Savitski, Chris Healy, Daniel Stone, Perfetto Development - www.perfetto.dev

+Ryan Savitski, who is working on perfetto_perf (the sampling profiler) and has a better view of future plans for traced_perf (the tracing-integrated sampling profiler).

Summary from my side:

As of very recently (~6 months?), we have a new daemon in perfetto (traced_perf) to do callstack sampling (example cfg).

Right now we are quite focused on lowering unwinding times and memory on Android (there are various challenges due to lack of FP and compressed symbols), and it supports only very few counters (these).

At some point we'll get to extending that support, but I am pretty sure it's not there right now.

It depends on how involved the support for AXI-related counters is: is it a bunch of extra enums to pass to perf_event_open(), or does it involve some more convoluted ABI/parsing logic of the binary data emitted into the perf ring buffer?

If the former, maybe you folks and rsavitski can figure out a way to collaborate on that and add support upstream?

--

Primiano Tucci
Software Engineer
Google UK Limited

Primiano Tucci

May 17, 2021, 12:00:00 PM

to Chris Healy, Antonio Caggiano, Ryan Savitski, Daniel Stone, Perfetto Development - www.perfetto.dev

On Mon, May 17, 2021 at 5:16 AM Chris Healy <cph...@gmail.com> wrote:

> Hi Primiano, et al,
>
> On Wed, May 12, 2021 at 2:00 AM Primiano Tucci <prim...@google.com> wrote:
> > +Ryan Savitski, who is working on perfetto_perf (the sampling profiler) and has a better view of future plans for traced_perf (the tracing-integrated sampling profiler).
> >
> > Summary from my side:
> > As of very recently (~6 months?), we have a new daemon in perfetto (traced_perf) to do callstack sampling (example cfg).
> > Right now we are quite focused on lowering unwinding times and memory on Android (there are various challenges due to lack of FP and compressed symbols), and it supports only very few counters (these).
> > At some point we'll get to extending that support, but I am pretty sure it's not there right now.
>
> From what I see in the current code, the perf event support in perfetto appears to be focused exclusively on CPU perf events. The type of perf events I'm referring to are not associated with any PID running on the CPU as they are system bus counters. These counters expose things like the following:
>
> 1) DRAM interface load percentage
> 2) Number of bytes written by the 3D GPU to DRAM
> 3) Number of bytes read by the display controller
> 4) Combined read/write bytes between ML (machine learning) IP core and DRAM interface
>
> Several ARM SoCs have drivers exposing this type of data via the perf_event framework. The drivers can be found in drivers/perf/... Additionally, for ARM SoCs that utilize the CCI (cache-coherent interconnect), there is also a perf driver that exposes the counters of this IP core.

I have to admit my ignorance here (Ryan knows more than me, I am just going by vague memory).

I thought that by design perf_event_open events are either CPU or PID scoped. From https://man7.org/linux/man-pages/man2/perf_event_open.2.html you can't just pass -1 to both.

How does one tell perf_event_open to capture events that are "globally" scoped? I thought in cases like this they end up looking as if they are CPU scoped, but the CPU buffer they end up on is purely whichever CPU the counter overflow happened on.

In other words, I think it becomes more of a UI feature to know that some perf events shouldn't be aggregated by CPU because they are really global. But at capture time I think they should be treated as if they are CPU-bound, just because that is what perf_event_open assumes?

But maybe I am completely misunderstanding how perf_event_open deals with global counters?

> The idea I have in mind is to expose the counter data from these drivers, as it can provide some very useful insight into why things are not behaving as expected. There are many cases with ARM SoCs where the various IP cores such as GPU and CPU are not at 100% load while performance is still slow, and it turns out that there is no DRAM bandwidth left. Seeing DRAM bandwidth load or any other bus-related HW counters would help in this case.

Chris Healy

May 25, 2021, 10:18:53 AM

to Primiano Tucci, Antonio Caggiano, Ryan Savitski, Daniel Stone, Perfetto Development - www.perfetto.dev

On Mon, May 17, 2021 at 9:00 AM Primiano Tucci <prim...@google.com> wrote:

> On Mon, May 17, 2021 at 5:16 AM Chris Healy <cph...@gmail.com> wrote:
> > Hi Primiano, et al,
> >
> > On Wed, May 12, 2021 at 2:00 AM Primiano Tucci <prim...@google.com> wrote:
> > > +Ryan Savitski, who is working on perfetto_perf (the sampling profiler) and has a better view of future plans for traced_perf (the tracing-integrated sampling profiler).
> > >
> > > Summary from my side:
> > > As of very recently (~6 months?), we have a new daemon in perfetto (traced_perf) to do callstack sampling (example cfg).
> > > Right now we are quite focused on lowering unwinding times and memory on Android (there are various challenges due to lack of FP and compressed symbols), and it supports only very few counters (these).
> > > At some point we'll get to extending that support, but I am pretty sure it's not there right now.
> >
> > From what I see in the current code, the perf event support in perfetto appears to be focused exclusively on CPU perf events. The type of perf events I'm referring to are not associated with any PID running on the CPU as they are system bus counters. These counters expose things like the following:
> >
> > 1) DRAM interface load percentage
> > 2) Number of bytes written by the 3D GPU to DRAM
> > 3) Number of bytes read by the display controller
> > 4) Combined read/write bytes between ML (machine learning) IP core and DRAM interface
> >
> > Several ARM SoCs have drivers exposing this type of data via the perf_event framework. The drivers can be found in drivers/perf/... Additionally, for ARM SoCs that utilize the CCI (cache-coherent interconnect), there is also a perf driver that exposes the counters of this IP core.
>
> I have to admit my ignorance here (Ryan knows more than me, I am just going by vague memory).
> I thought that by design perf_event_open events are either CPU or PID scoped. From https://man7.org/linux/man-pages/man2/perf_event_open.2.html you can't just pass -1 to both.

I just gave "perf" a run on an NXP i.MX8M which has HW counters exposed via an upstream perf driver for the DRAM controller. The command I ran was the following:

perf stat -a -e imx8_ddr0/cycles/,imx8_ddr0/read-cycles/,imx8_ddr0/write-cycles/ -I 100 sleep 1

#           time             counts unit events
     0.100275004           80211300      imx8_ddr0/cycles/
     0.100275004           12556342      imx8_ddr0/read-cycles/
     0.100275004              32476      imx8_ddr0/write-cycles/
     0.200964848           80513498      imx8_ddr0/cycles/
     0.200964848           12545674      imx8_ddr0/read-cycles/
     0.200964848               4771      imx8_ddr0/write-cycles/
.....

With the above output, you will see that I'm capturing 3 HW counters: cycles, read-cycles, and write-cycles. (cycles = total cycles.) With this, we can compute the busy time as ((read-cycles + write-cycles) / cycles). In the above example, this will compute to ~15% busy time. Additionally, this can be broken down into absolute bytes by using the equation defined in the following file:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/arm64/freescale/imx8mq/sys/metrics.json

From the above file, read-cycles and write-cycles can each be multiplied by 16 to come up with read-bytes or write-bytes.
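To make the arithmetic concrete, here is a rough Python sketch using the counts from the first 100 ms interval above. The function name `ddr_stats` and the dictionary keys are made up for illustration, and the factor of 16 bytes per cycle is simply the multiplier described above for this SoC; other platforms will differ.

```python
# Multiplier described above for the i.MX8M metrics; an assumption for any other SoC.
BYTES_PER_CYCLE = 16


def ddr_stats(cycles, read_cycles, write_cycles, interval_s):
    """Derive busy fraction and byte counts from one perf stat interval."""
    read_bytes = read_cycles * BYTES_PER_CYCLE
    write_bytes = write_cycles * BYTES_PER_CYCLE
    return {
        # Share of DDR cycles spent on reads or writes.
        "busy_fraction": (read_cycles + write_cycles) / cycles,
        "read_bytes": read_bytes,
        "write_bytes": write_bytes,
        # Average throughput over the interval, in bytes per second.
        "read_bytes_per_s": read_bytes / interval_s,
        "write_bytes_per_s": write_bytes / interval_s,
    }


# Counts copied from the first interval of the perf stat output above.
first = ddr_stats(
    cycles=80211300,
    read_cycles=12556342,
    write_cycles=32476,
    interval_s=0.100275004,
)
print(f"busy: {first['busy_fraction']:.1%}")
print(f"read: {first['read_bytes_per_s'] / 1e9:.2f} GB/s")
```

For this interval the busy fraction comes out a little under 16%, and the read traffic is roughly 2 GB/s.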

Other platforms have different numbers and types of HW counters for the bus. Some, for example, report on a per AXI bus-master basis, so we can get exact numbers of read or write bytes for the 3D GPU.

When looking at the strace output from this run, I see the following with perf_event_open:

perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER6, config=0, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 EINVAL (Invalid argument)
perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER6, config=0, ...}, -1, 0, -1, 0) = -1 EINVAL (Invalid argument)
perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER6, config=0, ...}, -1, 0, -1, 0) = 3
perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER6, config=0x2a, ...}, -1, 0, -1, 0) = 4
perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER6, config=0x2b, ...}, -1, 0, -1, 0) = 5

As such, it seems that PID is set to -1 and CPU is set to 0.
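For reference, this is how I read the pid/cpu combinations in the perf_event_open man page, written out as a small Python helper. `perf_scope` is a made-up name for illustration only and is not part of perf or perfetto.

```python
def perf_scope(pid, cpu):
    """Summarise what perf_event_open(attr, pid, cpu, ...) would measure."""
    if pid < -1 or cpu < -1:
        raise ValueError("pid and cpu must be >= -1")
    if pid == -1 and cpu == -1:
        # The man page lists this combination as invalid.
        raise ValueError("pid == -1 and cpu == -1 is not a valid combination")

    if pid == -1:
        who = "all processes/threads"
    elif pid == 0:
        who = "the calling process/thread"
    else:
        who = f"process/thread {pid}"

    where = "any CPU" if cpu == -1 else f"CPU {cpu}"
    return f"{who} on {where}"


# The combination seen in the strace output above.
print(perf_scope(-1, 0))
```

So the calls in the strace output are ordinary system-wide events opened on CPU 0 and not tied to any particular process.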

> How does one tell perf_event_open to capture events that are "globally" scoped? I thought in cases like this they end up looking as if they are CPU scoped, but the CPU buffer they end up on is purely whichever CPU the counter overflow happened on.
> In other words, I think it becomes more of a UI feature to know that some perf events shouldn't be aggregated by CPU because they are really global. But at capture time I think they should be treated as if they are CPU-bound, just because that is what perf_event_open assumes?
>
> But maybe I am completely misunderstanding how perf_event_open deals with global counters?

Not sure if it's relevant, but ARM Streamline has support for non-CPU specific HW counters. For example the CCI-400 (bus interconnect) is supported. Here's a relevant link to the CCI-400 HW perf events: https://github.com/ARM-software/gator/blob/main/daemon/events-CCI-400.xml

Chris Healy

May 26, 2021, 8:50:13 PM

to Ryan Savitski, Primiano Tucci, Antonio Caggiano, Daniel Stone, Perfetto Development - www.perfetto.dev

Hi Ryan,

Thanks for the detailed explanation of what is needed.

Here's one thing that just crossed my mind that may also be needed:

With the example I gave, the counters are simple counters that count static things: overall read cycles, overall write cycles, etc, etc.

Some SoCs have additional counters for read cycles and write cycles that can be configured to only accumulate for specific AXI bus masters. For example, one SoC I work with has 4 general purpose counters and one overall counter. For each of these 4 general purpose counters, a filter can be set which specifies which bus master it is associated with. With this filtering capability, we could configure each of the four counters in the following manner:

1) bus_counter_1 - CPU read/write bytes

2) bus_counter_2 - GPU read/write bytes

3) bus_counter_3 - Video Decoder read/write bytes

4) bus_counter_4 - Display controller read/write bytes

5) bus_counter_overall

With the above configuration, we can then see what's going on with the DRAM bus with all the key AXI bus masters in realtime. To support this would require support for specifying the filter on a per counter basis. With the i.MX8M driver I previously pointed to, the i.MX8M does not have this special filtering support. The same kernel driver grew support for the i.MX8MP which has some enhanced filtering capabilities. The following driver kernel commit is relevant:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/fsl_imx8_ddr_perf.c?h=v5.13-rc3&id=44f8bd014a94ed679ddb77d0b92350d4ac4f23a5

Additionally, when looking at the perf JSON file, you can see with the i.MX8MP how the filtering can be used:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/arm64/freescale/imx8mp/sys/metrics.json?h=v5.13-rc3

Having this support with perfetto would be very useful for SoC performance debugging!

Regards,

Chris

On Wed, May 26, 2021 at 4:54 PM Ryan Savitski <rsav...@google.com> wrote:

Hello Chris, very sorry for the huge delay. I should be much more available going forward.

As you've found out, the integrated perf profiler currently lacks the configuration options to
specify arbitrary counters. Architecturally, recording such platform PMU counters doesn't require
any reworks, as everything is exposed through `perf_event_open`. It's a matter of setting the
right pmu/event identifiers in the syscall argument struct (usually just `type` and `config`).

Let me try to itemise the lacking features for this, as I see them:
* Definitely necessary: propagating arbitrary raw values into the perf_event_attr struct. I'm
picturing this as an option in the perfetto config, where you directly specify the raw uint64s for
the perf_event_attr.{type,config}.
* Possibly necessary: specifying specific cpu masks to enable the event on. By default we configure
the requested counter on every cpu (same as what "perf -a" should be doing), but depending on how
the PMU driver is implemented, it might only expect the non-cpu-specific counters configured at
most once.
* Not strictly necessary, but desirable: leader sampling of counters. This is more accurate if
you're comparing or taking ratios of counter values, as it'll snapshot them as a group, instead of
polling each at its own (possibly unaligned) period. Though in the integrated profiler, you can
already do the equivalent of "perf stat -e event1 -e event2" (i.e. two concurrent counters that
are not sampled as a group) by putting a single counter per perfetto DataSource.
* Necessary, but trivial: graphing the counters in the UI, no unknowns here.
* Nice to have, but not necessary for local use-cases: discovering the custom PMU events via
`/sys/bus/event_source/devices/`. Locally you could manually find the right id of the PMU and the
event, but it'd be nice if you could just give the symbolic name (such as imx8_ddr0/read-cycles/)
instead.
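As a rough illustration of the discovery step in the last item (and of the cpu mask question in the second), here is a Python sketch that resolves a symbolic name via sysfs. It is not perfetto code: `resolve_event` is a hypothetical helper, and the file contents in the demo ("8", "0", "event=0x2a") are assumptions chosen to match the strace output earlier in the thread, written into a temporary fake tree so the snippet runs on any machine.

```python
import tempfile
from pathlib import Path

SYSFS_PMUS = Path("/sys/bus/event_source/devices")


def resolve_event(spec, root=SYSFS_PMUS):
    """Resolve a 'pmu/event/' name into raw values read from sysfs."""
    pmu, event = spec.strip("/").split("/", 1)
    pmu_dir = root / pmu

    # Dynamic PMU id, to be passed as perf_event_attr.type.
    pmu_type = int((pmu_dir / "type").read_text())

    # events/<name> holds comma-separated terms such as "event=0x2a".
    terms = {}
    for term in (pmu_dir / "events" / event).read_text().strip().split(","):
        key, _, value = term.partition("=")
        terms[key] = int(value, 0) if value else 1

    # Some PMUs publish a cpumask naming the CPU(s) to open the event on.
    cpumask_file = pmu_dir / "cpumask"
    cpumask = cpumask_file.read_text().strip() if cpumask_file.exists() else None

    return {"type": pmu_type, "terms": terms, "cpumask": cpumask}


# Demo against a fake sysfs tree; the contents are assumed, not read from hardware.
with tempfile.TemporaryDirectory() as tmp:
    fake_pmu = Path(tmp) / "imx8_ddr0"
    (fake_pmu / "events").mkdir(parents=True)
    (fake_pmu / "type").write_text("8\n")
    (fake_pmu / "cpumask").write_text("0\n")
    (fake_pmu / "events" / "read-cycles").write_text("event=0x2a\n")
    resolved = resolve_event("imx8_ddr0/read-cycles/", root=Path(tmp))

print(resolved)
```

The returned terms still have to be placed into `config`/`config1`/`config2` according to the PMU's `format/` files; the sketch stops short of that step.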

We want to do all of the above, but the timeline is a question of priorities within the team.
Barring surprises, I'd expect to see the minimal support for raw events within a month (in the
codebase and therefore the standalone build).

Thank you for reaching out, it's useful to hear about feature requests with particular use-cases. Do
tell me if I'm overlooking anything above, but I think that would cover use-cases like the AXI bus
counters.

-Ryan

Ryan Savitski

May 26, 2021, 11:10:23 PM

to Chris Healy, Primiano Tucci, Antonio Caggiano, Daniel Stone, Perfetto Development - www.perfetto.dev

Right, my understanding is that these additional counter options get bitpacked into the `config1` and `config2` of the perf_event_attr struct.

Example for the `axi_id` and `axi_mask` above (this mapping should also be discoverable via /sys/bus/... files):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/fsl_imx8_ddr_perf.c#n272

So we'll also need to allow for those fields to be set via the raw config interface. Good point, thanks for bringing it up.
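A minimal Python sketch of that bitpacking, assuming the usual sysfs format syntax of `configN:low-high`. The helper names (`pack_field`, `build_raw_config`), the bit ranges in `ASSUMED_FORMATS`, and the example values are all illustrative; the real layout should be read from the PMU's `format/` directory or the driver source linked above.

```python
import re


def pack_field(fmt, value):
    """Shift value into the bit range named by a spec such as 'config1:16-31'."""
    match = re.fullmatch(r"(config[12]?):(\d+)(?:-(\d+))?", fmt.strip())
    if match is None:
        raise ValueError(f"unsupported format spec: {fmt!r}")
    field = match.group(1)
    low = int(match.group(2))
    high = int(match.group(3)) if match.group(3) else low
    width = high - low + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value:#x} does not fit in {width} bits")
    return field, value << low


def build_raw_config(terms, formats):
    """Combine named terms into raw config/config1/config2 values."""
    attr = {"config": 0, "config1": 0, "config2": 0}
    for name, value in terms.items():
        field, bits = pack_field(formats[name], value)
        attr[field] |= bits
    return attr


# Assumed layout for illustration; check format/ on the target device.
ASSUMED_FORMATS = {
    "event": "config:0-7",
    "axi_id": "config1:0-15",
    "axi_mask": "config1:16-31",
}

# Placeholder values, not taken from any datasheet.
raw = build_raw_config(
    {"event": 0x41, "axi_id": 0x70, "axi_mask": 0xFFFF},
    ASSUMED_FORMATS,
)
print({name: hex(value) for name, value in raw.items()})
```

With a raw config interface, these three numbers plus the PMU `type` would be what gets passed through to perf_event_attr.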

Ryan Savitski

May 27, 2021, 2:25:05 PM

to Chris Healy, Primiano Tucci, Antonio Caggiano, Daniel Stone, Perfetto Development - www.perfetto.dev

I have devices with nontrivial dynamic PMU events, and a decent grasp of how analogous those are to the AXI bus events you quoted.

So the next step is on us - implementing the raw config option.

I'll give an update once that's there, at which point the help I'd ask for is to test that it works for your use-case. But nothing at the moment.

On Thu, 27 May 2021 at 04:48, Chris Healy <cph...@gmail.com> wrote:

Is there a HW platform you are working with that supports these bus HW counters via perf API?

Is there anything I can do to help with this? I do have HW that has bus HW counters accessible via the perf API. (Some public and some not.)

Chris Healy

Jul 22, 2022, 5:47:38 PM

to Perfetto Development - www.perfetto.dev

Hi Ryan,

By any chance was any progress made with the PMU event support? I've got plenty of HW to test with once something testable is available... ;-)
