Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

perf-stat per thread results

1,044 views
Skip to first unread message

Wim Heirman

unread,
May 13, 2011, 11:20:01 AM5/13/11
to
Hi all,

I'm using perf-stat to look at hardware performance counters for a
parallel program. Is there a way to get counter values for each thread
individually, rather than aggregated for the whole process? I know I
can attach to a specific thread using --tid=, but due to the time
required to find the tid and attach/detach this isn't accurate for
short-running programs.
Or, alternatively, can I use perf record --stat and get an exact count
for each performance counter?

Thanks,
Wim Heirman
Ghent University, Belgium
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ingo Molnar

unread,
May 13, 2011, 11:40:01 AM5/13/11
to

* Wim Heirman <w...@heirman.net> wrote:

> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread

> individually, rather than aggregated for the whole process? [...]

Not at the moment, but it would be a useful feature.

> [...] I know I can attach to a specific thread using --tid=, but due to the

> time required to find the tid and attach/detach this isn't accurate for
> short-running programs. Or, alternatively, can I use perf record --stat and
> get an exact count for each performance counter?

Yes perf record --stat should work. 'perf report -T --stdio' is supposed to
print this, but it has regressed i think.

Arnaldo, any ideas?

Thanks,

Ingo

David Ahern

unread,
May 13, 2011, 11:40:01 AM5/13/11
to

On 05/13/11 09:14, Wim Heirman wrote:
> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread
> individually, rather than aggregated for the whole process? I know I
> can attach to a specific thread using --tid=, but due to the time
> required to find the tid and attach/detach this isn't accurate for
> short-running programs.

perf-stat requires changes to dump counters per thread; it currently
sums all threads into a single value.

> Or, alternatively, can I use perf record --stat and get an exact count
> for each performance counter?

perf-record does not read values from hardware counters.

David

David Ahern

unread,
May 13, 2011, 11:50:03 AM5/13/11
to
On 05/13/11 09:30, Ingo Molnar wrote:

> Yes perf record --stat should work. 'perf report -T --stdio' is supposed to
> print this, but it has regressed i think.

There is a known regression in 2.6.39; perf-record and perf-top cannot
profile all threads in a process. They fail with:

Fatal: failed to mmap with 22 (Invalid argument)

Curiously, perf-stat does work - or at least does not fail with the mmap
error.

David

Ingo Molnar

unread,
May 13, 2011, 11:50:03 AM5/13/11
to

* David Ahern <dsa...@gmail.com> wrote:

> On 05/13/11 09:14, Wim Heirman wrote:
> > Hi all,
> >
> > I'm using perf-stat to look at hardware performance counters for a
> > parallel program. Is there a way to get counter values for each thread
> > individually, rather than aggregated for the whole process? I know I
> > can attach to a specific thread using --tid=, but due to the time
> > required to find the tid and attach/detach this isn't accurate for
> > short-running programs.
>
> perf-stat requires changes to dump counters per thread; it currently
> sums all threads into a single value.
>
> > Or, alternatively, can I use perf record --stat and get an exact count
> > for each performance counter?
>
> perf-record does not read values from hardware counters.

It's supposed to do that if --stat is specified, and it used to work - see this
commit:

8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters

and the output there:

# PID TID cache-misses cache-references
4658 4659 495581 3238779
4658 4662 498246 3236823
4658 4663 499531 3243162

which appears to be roughly what Wim is asking for, AFAICT.

But this seems to have regressed meanwhile.

Thanks,

Ingo

Juri Lelli

unread,
May 13, 2011, 4:20:01 PM5/13/11
to
Hi,

On 05/13/2011 05:14 PM, Wim Heirman wrote:
> Hi all,
>
> I'm using perf-stat to look at hardware performance counters for a
> parallel program. Is there a way to get counter values for each thread
> individually, rather than aggregated for the whole process? I know I
> can attach to a specific thread using --tid=, but due to the time
> required to find the tid and attach/detach this isn't accurate for
> short-running programs.
> Or, alternatively, can I use perf record --stat and get an exact count
> for each performance counter?
>

What about using PAPI library (http://icl.cs.utk.edu/papi/)?
It can be built on top of the perf subsystem and allows to get counters
values on a pre-thread basis.
I used it for some experiments and works very well.

Cheers,
Juri

--
Juri Lelli
Via G. Moruzzi 1, 56124 Pisa (PI), Italy
Scuola Superiore Sant'Anna
TeCIP, ReTiS Lab

Web Site: http://retis.sssup.it/~jlelli | Skype: jurile2712
-------------------------------------------------------------
Il male minore non esiste: è sempre il preannuncio di un male peggiore.
(Sylos Labini)

Wim Heirman

unread,
May 13, 2011, 4:40:01 PM5/13/11
to
Hi,

> It's supposed to do that if --stat is specified, and it used to work - see this
> commit:
>
>  8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>
> and the output there:
>
>     #  PID   TID  cache-misses  cache-references
>       4658  4659        495581           3238779
>       4658  4662        498246           3236823
>       4658  4663        499531           3243162
>
> which appears to be roughly what Wim is asking for, AFAICT.

Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
it works, although if I use --pid rather than the -- <command> variant
the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
first thread is missing in both use cases, and I get one column per
processor (which in itself is fine).

Regards,
Wim

Arnaldo Carvalho de Melo

unread,
May 13, 2011, 4:50:01 PM5/13/11
to
Em Fri, May 13, 2011 at 10:32:58PM +0200, Wim Heirman escreveu:
> > It's supposed to do that if --stat is specified, and it used to work - see this
> > commit:

> >  8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters

> > and the output there:
> >
> >     #  PID   TID  cache-misses  cache-references
> >       4658  4659        495581           3238779
> >       4658  4662        498246           3236823
> >       4658  4663        499531           3243162

> > which appears to be roughly what Wim is asking for, AFAICT.

> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
> it works, although if I use --pid rather than the -- <command> variant
> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
> first thread is missing in both use cases, and I get one column per
> processor (which in itself is fine).

Can you try after applying the patches in this message:

http://marc.info/?l=linux-kernel&m=130385067430510&w=2

and report your results?

If it fixes the problems you're experiencing, please provide a:

Tested-by: Wim Heirman <w...@heirman.net>

So that I can add when sending them to Ingo.

Thanks,

- Arnaldo

David Ahern

unread,
May 13, 2011, 7:10:02 PM5/13/11
to

On 05/13/11 14:32, Wim Heirman wrote:
> Hi,
>
>> It's supposed to do that if --stat is specified, and it used to work - see this
>> commit:
>>
>> 8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>>
>> and the output there:
>>
>> # PID TID cache-misses cache-references
>> 4658 4659 495581 3238779
>> 4658 4662 498246 3236823
>> 4658 4663 499531 3243162
>>
>> which appears to be roughly what Wim is asking for, AFAICT.
>
> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
> it works, although if I use --pid rather than the -- <command> variant
> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
> first thread is missing in both use cases, and I get one column per
> processor (which in itself is fine).
>
> Regards,
> Wim

Hmm.... my mileage varies using latest kernel
(446cc6345d3de6571bdd0840f48aca441488a28d)

$ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]

$ /tmp/build-perf/perf report -T -i /tmp/perf.data
# Events: 6 cycles
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..........................
#
97.61% rsyslogd libc-2.13.so [.] __libc_disable_asynccancel
2.39% rsyslogd [kernel.kallsyms] [k] native_write_msr_safe


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
# PID TID


ie., I do not get the counter values. Specifying the counter with -e
(e.g., -e branch-misses) does not help -- still no counter output.

David

Wim Heirman

unread,
May 14, 2011, 8:50:02 AM5/14/11
to

Is rsyslogd multithreaded? (Or at least, do the non-main threads
execute any work during your perf-record measurement) If not, then
what you see is consistent with what I'm getting, i.e. everything but
the main thread is reported.

Regards,
Wim

Wim Heirman

unread,
May 14, 2011, 8:50:02 AM5/14/11
to
On 13 May 2011 22:45, Arnaldo Carvalho de Melo <ac...@redhat.com> wrote:
> Em Fri, May 13, 2011 at 10:32:58PM +0200, Wim Heirman escreveu:
>> > It's supposed to do that if --stat is specified, and it used to work - see this
>> > commit:
>
>> >  8d51327090ac: perf report: Fix and improve the displaying of per-thread event counters
>
>> > and the output there:
>> >
>> >     #  PID   TID  cache-misses  cache-references
>> >       4658  4659        495581           3238779
>> >       4658  4662        498246           3236823
>> >       4658  4663        499531           3243162
>
>> > which appears to be roughly what Wim is asking for, AFAICT.
>
>> Thanks, this is exactly what I'm looking for. In 2.6.32 (Ubuntu 10.04)
>> it works, although if I use --pid rather than the -- <command> variant
>> the first thread seams to be missing. In 2.6.38 (Ubuntu 11.04) the
>> first thread is missing in both use cases, and I get one column per
>> processor (which in itself is fine).
>
> Can you try after applying the patches in this message:
>
> http://marc.info/?l=linux-kernel&m=130385067430510&w=2
>
> and report your results?

Sorry, no improvement.

$ ./perf record -e cycles --stat -- ./fft -p4 -m24 && ./perf report
--thread | tail
[ perf record: Woken up 5 times to write data ]
[ perf record: Captured and wrote 1.198 MB perf.data (~52331 samples) ]
# PID TID cpu-clock
954 958 8067423322
954 957 6761317556
954 956 6006327147

$ ls /proc/$(pidof fft)/task
954 956 957 958

David Ahern

unread,
May 14, 2011, 1:50:02 PM5/14/11
to
On 05/14/11 06:49, Wim Heirman wrote:
>> Hmm.... my mileage varies using latest kernel
>> (446cc6345d3de6571bdd0840f48aca441488a28d)
>>
>> $ /tmp/build-perf/perf record --stat -fo /tmp/perf.data -p $(pidof rsyslogd)
>> ^C[ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.007 MB /tmp/perf.data (~308 samples) ]
>>
>> $ /tmp/build-perf/perf report -T -i /tmp/perf.data
>> # Events: 6 cycles
>> #
>> # Overhead Command Shared Object Symbol
>> # ........ ........ ................. ..........................
>> #
>> 97.61% rsyslogd libc-2.13.so [.] __libc_disable_asynccancel
>> 2.39% rsyslogd [kernel.kallsyms] [k] native_write_msr_safe
>>
>>
>> #
>> # (For a higher level overview, try: perf report --sort comm,dso)
>> #
>> # PID TID
>>
>>
>> ie., I do not get the counter values. Specifying the counter with -e
>> (e.g., -e branch-misses) does not help -- still no counter output.
>
> Is rsyslogd multithreaded? (Or at least, do the non-main threads
> execute any work during your perf-record measurement) If not, then
> what you see is consistent with what I'm getting, i.e. everything but
> the main thread is reported.

It is multithreaded, but my point is that I do not get counter output at
the end -- the PID/TID table is empty. I do not get counters for single
threaded processes nor for commands run by perf record -- e.g.,
/tmp/build-perf/perf record --stat -e instructions -fo /tmp/perf.data --
sleep 1

David

Wim Heirman

unread,
May 14, 2011, 3:00:02 PM5/14/11
to
2011/5/14 David Ahern <dsa...@gmail.com>:

My guess was that none of the threads got scheduled while you did your
perf-record run (rsyslogd usually isn't exactly very CPU intensive).
And the main thread isn't ever reported, at least that's the bug I'm
seeing. Can you try with a compute-intensive, multi-threaded program?

Wim.

David Ahern

unread,
May 14, 2011, 3:00:01 PM5/14/11
to

The reports show data was collected. I have done a number of simple
examples -- all of which execute at least 1 instruction, but the
counters are not displayed (which they should be per the commit
changelog commit).

David

Wim Heirman

unread,
May 14, 2011, 4:30:02 PM5/14/11
to

Looks like perf-report --thread is reading PERF_RECORD_READ events
from perf.data. But these are only emitted by the kernel for child
threads: in kernel/events/core.c, the only call to
perf_event_read_event() is in sync_child_event().
Should perf-record then be adapted to do something like perf-stat does
and use __perf_evsel__read to read the parent counter's final values
and add them to perf.data ? That way perf-report can subtract all
children from the final value and get the main thread's counter
values.

Regards,
Wim

0 new messages