
[PATCH 0/5] perf kmem: Add more functions and show more statistics


Li Zefan

Nov 24, 2009, 12:30:02 AM
List of new things:

- Add option "--raw-ip", to print raw ip instead of symbols.

- Sort the output by fragmentation by default, and support
multi keys.

- Collect and show cross node allocation stats.

- Collect and show alloc/free ping-pong stats.

- Add help file.

---
tools/perf/Documentation/perf-kmem.txt | 44 ++++
tools/perf/builtin-kmem.c | 353 ++++++++++++++++++++++++++------
tools/perf/command-list.txt | 1 +
3 files changed, 331 insertions(+), 67 deletions(-)


Pekka, do you think we can remove kmemtrace now?

With kmem trace events, low-level analyzing can be done using
ftrace, and high-level analyzing can be done using perf-kmem.

And chances are, more people may use and improve perf-kmem, and it
will be well-maintained within the perf infrastructure. On the
other hand, I guess few people use and contribute to kmemtrace-user.

BTW, it seems kmemtrace-user doesn't work with ftrace. I got a segfault:

# ./kmemtraced
Copying /proc/kallsyms...
Logging... Press Control-C to stop.
^CSegmentation fault


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Li Zefan

Nov 24, 2009, 12:30:01 AM
Add Documentation/perf-kmem.txt

Signed-off-by: Li Zefan <li...@cn.fujitsu.com>
---
tools/perf/Documentation/perf-kmem.txt | 44 ++++++++++++++++++++++++++++++++
tools/perf/command-list.txt | 1 +
2 files changed, 45 insertions(+), 0 deletions(-)
create mode 100644 tools/perf/Documentation/perf-kmem.txt

diff --git a/tools/perf/Documentation/perf-kmem.txt b/tools/perf/Documentation/perf-kmem.txt
new file mode 100644
index 0000000..44b0ce3
--- /dev/null
+++ b/tools/perf/Documentation/perf-kmem.txt
@@ -0,0 +1,44 @@
+perf-kmem(1)
+==============
+
+NAME
+----
+perf-kmem - Tool to trace/measure kernel memory(slab) properties
+
+SYNOPSIS
+--------
+[verse]
+'perf kmem' {record} [<options>]
+
+DESCRIPTION
+-----------
+There's two variants of perf kmem:
+
+ 'perf kmem record <command>' to record the kmem events
+ of an arbitrary workload.
+
+ 'perf kmem' to report kernel memory statistics.
+
+OPTIONS
+-------
+-i <file>::
+--input=<file>::
+ Select the input file (default: perf.data)
+
+--stat=<caller|alloc>::
+ Select per callsite or per allocation statistics
+
+-s <key[,key2...]>::
+--sort=<key[,key2...]>::
+ Sort the output (default: frag,hit,bytes)
+
+-l <num>::
+--line=<num>::
+ Print n lines only
+
+--raw-ip::
+ Print raw ip instead of symbol
+
+SEE ALSO
+--------
+linkperf:perf-record[1]
diff --git a/tools/perf/command-list.txt b/tools/perf/command-list.txt
index d3a6e18..02b09ea 100644
--- a/tools/perf/command-list.txt
+++ b/tools/perf/command-list.txt
@@ -14,3 +14,4 @@ perf-timechart mainporcelain common
perf-top mainporcelain common
perf-trace mainporcelain common
perf-probe mainporcelain common
+perf-kmem mainporcelain common
--
1.6.3
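
For readers following along, the multi-key sort described in the help text above (default order frag,hit,bytes) can be sketched roughly as below. This is only an illustration, not the code in builtin-kmem.c: the struct alloc_stat layout, the function names, the descending order and the use of allocated bytes for the "bytes" key are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite record; field names are assumptions. */
struct alloc_stat {
	unsigned long hit;         /* number of allocations seen */
	unsigned long bytes_req;   /* bytes the caller asked for */
	unsigned long bytes_alloc; /* bytes the allocator handed out */
};

typedef int (*stat_cmp_fn)(const struct alloc_stat *, const struct alloc_stat *);

/* Internal fragmentation in percent, 0 when nothing was allocated. */
double stat_frag(const struct alloc_stat *s)
{
	if (s->bytes_alloc == 0)
		return 0.0;
	return 100.0 - 100.0 * (double)s->bytes_req / (double)s->bytes_alloc;
}

/* Each comparator orders larger values first (assumed direction). */
int cmp_frag(const struct alloc_stat *a, const struct alloc_stat *b)
{
	double x = stat_frag(a), y = stat_frag(b);
	return (x < y) - (x > y);
}

int cmp_hit(const struct alloc_stat *a, const struct alloc_stat *b)
{
	return (a->hit < b->hit) - (a->hit > b->hit);
}

int cmp_bytes(const struct alloc_stat *a, const struct alloc_stat *b)
{
	return (a->bytes_alloc < b->bytes_alloc) - (a->bytes_alloc > b->bytes_alloc);
}

/* Key list in priority order, mirroring the documented default. */
static const stat_cmp_fn sort_keys[] = { cmp_frag, cmp_hit, cmp_bytes };

/* Walk the keys in order; the first key that differs decides. */
int cmp_multi(const void *pa, const void *pb)
{
	size_t i;

	for (i = 0; i < sizeof(sort_keys) / sizeof(sort_keys[0]); i++) {
		int r = sort_keys[i](pa, pb);
		if (r)
			return r;
	}
	return 0;
}

void sort_stats(struct alloc_stat *stats, size_t n)
{
	assert(stats != NULL);
	qsort(stats, n, sizeof(*stats), cmp_multi);
}
```

Later keys only matter when all earlier keys compare equal, so two callsites with the same fragmentation are ordered by hit count, and only then by bytes.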

Pekka Enberg

Nov 24, 2009, 2:20:02 AM
Hi Li,

On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
> Pekka, do you think we can remove kmemtrace now?

One more use case I forgot to mention: boot time tracing. Much of the
persistent kernel memory footprint comes from the boot process which
is why it's important to be able to trace memory allocations
immediately after kmem_cache_init() has run. Can we make "perf kmem"
do that? Eduard put most of his efforts into making that work for
kmemtrace.

On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
> With kmem trace events, low-level analyzing can be done using
> ftrace, and high-level analyzing can be done using perf-kmem.
>
> And chance is, more people may use and improve perf-kmem, and it
> will be well-maintained within the perf infrastructure. On the
> other hand, I guess few people use and contribute to kmemtrace-user.

Sure, I think "perf kmem" is the way forward. I'd love to hear
Eduard's comments on this before we remove the code from kernel. Do we
need to do that for 2.6.33 or can we postpone that for 2.6.34?

Pekka

Pekka Enberg

Nov 24, 2009, 2:20:02 AM
On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
> List of new things:
>
> - Add option "--raw-ip", to print raw ip instead of symbols.
>
> - Sort the output by fragmentation by default, and support
> multi keys.
>
> - Collect and show cross node allocation stats.
>
> - Collect and show alloc/free ping-pong stats.
>
> - Add help file.

The series looks good to me!

Acked-by: Pekka Enberg <pen...@cs.helsinki.fi>

Ingo Molnar

Nov 24, 2009, 2:40:02 AM

* Pekka Enberg <pen...@cs.helsinki.fi> wrote:

> Hi Li,
>
> On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
> > Pekka, do you think we can remove kmemtrace now?
>
> One more use case I forgot to mention: boot time tracing. Much of the
> persistent kernel memory footprint comes from the boot process which
> is why it's important to be able to trace memory allocations
> immediately after kmem_cache_init() has run. Can we make "perf kmem"
> do that? Eduard put most of his efforts into making that work for
> kmemtrace.

Would be lovely if someone looked at perf from that angle (and extended
it).

Another interesting area would be to allow a capture session without a
process context running immediately. (i.e. pre-allocate all the buffers,
use them, for a later 'perf save' to pick it up.)

The two are kind of the same thing conceptually: a boot time trace is a
preallocated 'process context less' recording, to be picked up after
bootup.

[ It also brings us 'stability/persistency of event logging' - i.e. a
capture session could be started and guaranteed by the kernel to be
underway, regardless of what user-space does. ]

Btw., Arjan is doing a _lot_ of boot time tracing for Moblin, and he
indicated it in the past that starting a perf recording session from an
initrd is a pretty practical substitute as well. (I've Cc:-ed Arjan.)

> On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
>
> > With kmem trace events, low-level analyzing can be done using
> > ftrace, and high-level analyzing can be done using perf-kmem.
> >
> > And chance is, more people may use and improve perf-kmem, and it
> > will be well-maintained within the perf infrastructure. On the other
> > hand, I guess few people use and contribute to kmemtrace-user.
>
> Sure, I think "perf kmem" is the way forward. I'd love to hear
> Eduard's comments on this before we remove the code from kernel. Do we
> need to do that for 2.6.33 or can we postpone that for 2.6.34?

Certainly we can postpone it, as long as there's rough strategic
consensus on the way forward. I'd hate to have two overlapping core
kernel facilities and friction between the groups pursuing them and
constant distraction from having two targets.

Such situations just rarely end with a good solution for the user - see
security modules for a horror story ...

[ I dont think it will occur here, just wanted to mention it out of
abundance of caution that 1.5 decades of kernel hacking experience
inflicts on me ;-) ]

Ingo

Pekka Enberg

Nov 24, 2009, 2:50:02 AM
Hi Ingo,

On Tue, Nov 24, 2009 at 9:34 AM, Ingo Molnar <mi...@elte.hu> wrote:
> Certainly we can postpone it, as long as there's rough strategic
> consensus on the way forward. I'd hate to have two overlapping core
> kernel facilities and friction between the groups pursuing them and
> constant distraction from having two targets.

Sure, like I said, I think "perf kmem" is the way forward. The only
reason we did kmemtrace userspace out-of-tree was because there was no
perf (or ftrace!) at the time and there wasn't much interest in
putting userspace tools in the tree.

I hope that counts as a "rough strategic consensus" :-)

Ingo Molnar

Nov 24, 2009, 2:50:01 AM

* Pekka Enberg <pen...@cs.helsinki.fi> wrote:

> Hi Ingo,
>
> On Tue, Nov 24, 2009 at 9:34 AM, Ingo Molnar <mi...@elte.hu> wrote:
> > Certainly we can postpone it, as long as there's rough strategic
> > consensus on the way forward. I'd hate to have two overlapping core
> > kernel facilities and friction between the groups pursuing them and
> > constant distraction from having two targets.
>
> Sure, like I said, I think "perf kmem" is the way forward. The only
> reason we did kmemtrace userspace out-of-tree was because there was no
> perf (or ftrace!) at the time and there wasn't much interest in
> putting userspace tools in the tree.
>
> I hope that counts as a "rough strategic consensus" :-)

it does :-)

Ingo

Li Zefan

Nov 24, 2009, 3:10:01 AM

It would be great if perf could be used for boot time tracing. This needs
a fairly big piece of work on the kernel side.

>> On Tue, Nov 24, 2009 at 7:25 AM, Li Zefan <li...@cn.fujitsu.com> wrote:
>>
>>> With kmem trace events, low-level analyzing can be done using
>>> ftrace, and high-level analyzing can be done using perf-kmem.
>>>
>>> And chance is, more people may use and improve perf-kmem, and it
>>> will be well-maintained within the perf infrastructure. On the other
>>> hand, I guess few people use and contribute to kmemtrace-user.
>> Sure, I think "perf kmem" is the way forward. I'd love to hear
>> Eduard's comments on this before we remove the code from kernel. Do we
>> need to do that for 2.6.33 or can we postpone that for 2.6.34?
>
> Certainly we can postpone it, as long as there's rough strategic
> consensus on the way forward. I'd hate to have two overlapping core
> kernel facilities and friction between the groups pursuing them and
> constant distraction from having two targets.
>
> Such situations just rarely end with a good solution for the user - see
> security modules for a horror story ...
>
> [ I dont think it will occur here, just wanted to mention it out of
> abundance of caution that 1.5 decades of kernel hacking experience
> inflicts on me ;-) ]
>

Yeah, so we'd like to remove most of the tracers, but I'm not rushing to
remove kmemtrace for .33.

Ingo Molnar

Nov 24, 2009, 3:40:01 AM

What would be needed is to open per cpu events right after perf events
initializes, and allocate memory for output buffers to them.

They would round-robin after that point, and we could use
perf_event_open() (with a special flag) to 'attach' to them and mmap()
them - at which point they'd turn into regular objects with a lot of
boot time data in them.

Perhaps it should be possible to attach/detach from events in a flexible
way, not just during bootup. (bootup tracing is just a special case of
it.)

For example a 'flight recorder' could be started by creating the events
and then detaching from them. Whenever some exception is flagged the
monitoring context (whatever task that might be at that time - at
whatever point in the future) could attach to it and save it to a file
(or send it over any other transport of choice).

This would IMO mix the best of ftrace with the best of perf, for these
usecases.
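
The "round-robin until someone attaches" idea can be pictured with a tiny user-space model. This is purely illustrative: struct flight_buf, flight_record() and flight_attach() are made-up names, there is no such perf API, and a real implementation would be per-CPU and have to deal with concurrency.

```c
#include <assert.h>
#include <stddef.h>

#define FLIGHT_CAPACITY 4

/* Preallocated buffer that keeps only the newest FLIGHT_CAPACITY events. */
struct flight_buf {
	unsigned long events[FLIGHT_CAPACITY];
	size_t head; /* total number of events ever recorded */
};

/* Record one event, overwriting the oldest slot once the buffer is full. */
void flight_record(struct flight_buf *fb, unsigned long ev)
{
	assert(fb != NULL);
	fb->events[fb->head % FLIGHT_CAPACITY] = ev;
	fb->head++;
}

/*
 * "Attach" later on: copy out what is still in the buffer, oldest first.
 * If the caller's array is too small, the newest events are kept.
 * Returns the number of events copied.
 */
size_t flight_attach(const struct flight_buf *fb, unsigned long *out, size_t out_len)
{
	size_t n, start, i;

	assert(fb != NULL && out != NULL);
	n = fb->head < FLIGHT_CAPACITY ? fb->head : FLIGHT_CAPACITY;
	start = fb->head - n;
	if (n > out_len) {
		start += n - out_len;
		n = out_len;
	}
	for (i = 0; i < n; i++)
		out[i] = fb->events[(start + i) % FLIGHT_CAPACITY];
	return n;
}
```

Nothing has to be running in user space while flight_record() is called; whoever attaches afterwards just gets the tail end of the history.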

[ Certainly not a feature for the faint hearted :-) ]

Ingo

Ingo Molnar

Nov 24, 2009, 4:10:02 AM

a few more UI suggestions for 'perf kmem':

I think it should look similar to how 'perf' and 'perf sched' prints
sub-commands with increasing specificity, which means that we display a
list of subcommands and options when typed:

$ perf sched

usage: perf sched [<options>] {record|latency|map|replay|trace}

-i, --input <file> input file name
-v, --verbose be more verbose (show symbol address, etc)
-D, --dump-raw-trace dump raw trace in ASCII


For 'perf kmem' we could print something like:

$ perf kmem

usage: perf kmem [<options>] {record|report|trace}

-i, --input <file> input file name
-v, --verbose be more verbose (show symbol address, etc)
-D, --dump-raw-trace dump raw trace in ASCII

The advantage is that right now, when a new user sees the subcommand in
'perf' output:

$ perf
...
kmem Tool to trace/measure kernel memory(slab) properties
...

And types 'perf kmem', the following is displayed currently:

$ perf kmem

SUMMARY
=======
Total bytes requested: 0
Total bytes allocated: 0
Total bytes wasted on internal fragmentation: 0
Internal fragmentation: 0.000000%
Cross CPU allocations: 0/0

That's not very useful to someone who tries to figure out how to use
this command. A summary page would be more useful - and that would
advertise all the commands in a really short summary form (shorter than
-h/--help).
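
As a rough sketch of that behaviour: with no subcommand (or an unknown one) the tool would fall back to the short usage line instead of running a report. The names below (enum kmem_action, kmem_pick_action(), kmem_usage()) are hypothetical and do not come from builtin-kmem.c; argv here is assumed to start at the word after "kmem", with options already stripped.

```c
#include <assert.h>
#include <string.h>

enum kmem_action {
	KMEM_SHOW_USAGE,
	KMEM_RECORD,
	KMEM_REPORT,
	KMEM_TRACE
};

/* Short usage line, following the format suggested above. */
const char *kmem_usage(void)
{
	return "usage: perf kmem [<options>] {record|report|trace}\n";
}

/* Decide what to do from the remaining arguments. */
enum kmem_action kmem_pick_action(int argc, const char **argv)
{
	assert(argc >= 0);
	if (argc == 0)
		return KMEM_SHOW_USAGE; /* bare 'perf kmem' */

	assert(argv != NULL && argv[0] != NULL);
	if (strcmp(argv[0], "record") == 0)
		return KMEM_RECORD;
	if (strcmp(argv[0], "report") == 0)
		return KMEM_REPORT;
	if (strcmp(argv[0], "trace") == 0)
		return KMEM_TRACE;

	return KMEM_SHOW_USAGE; /* unknown subcommand */
}
```

The point is only that the empty case is handled explicitly, so a new user gets the summary page rather than an all-zero report.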

The other thing is that if someone types 'perf kmem record', the command
seems 'hung':

$ perf kmem record
<hang>

Now if i Ctrl-C it i see that a recording session was going on:

$ perf kmem record
^C[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 1.327 MB perf.data (~57984 samples) ]

but this was not apparent from the tool output and the user was left
wondering about what is going on.

I think at minimum we should print a:

[ Recording all kmem events in the system, Ctrl-C to stop. ]

line. (on a related note, 'perf sched record' needs such a fix too.)

Another solution would be for 'perf kmem record' to work analogous to
'perf record': it could display a short help page by default, something
like:

$ perf kmem record

usage: perf kmem record [<options>] [<command>]

example: perf kmem record -a sleep 10 # capture all events for 10 seconds
perf kmem record /bin/ls # capture events of this command
perf kmem record -p 1234 # capture events of PID 1234

What do you think?

Also, a handful of mini-bugreports wrt. usability:

1)

running 'perf kmem' without having a perf.data gives:

earth4:~/tip/tools/perf> ./perf kmem
Failed to open file: perf.data (try 'perf record' first)

SUMMARY
=======
Total bytes requested: 0
Total bytes allocated: 0
Total bytes wasted on internal fragmentation: 0
Internal fragmentation: 0.000000%
Cross CPU allocations: 0/0

2)

running 'perf kmem record' on a box without kmem events gives:

earth4:~/tip/tools/perf> ./perf kmem record
invalid or unsupported event: 'kmem:kmalloc'
Run 'perf list' for a list of valid events

i think we want to print something kmem specific - and tell the user how
to enable kmem events or so - 'perf list' is not a solution to him.

3)

it doesnt seem to be working on one of my boxes, which has perf and kmem
events as well:

aldebaran:~/linux/linux/tools/perf> perf kmem record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.050 MB perf.data (~2172 samples) ]

aldebaran:~/linux/linux/tools/perf> perf kmem

SUMMARY
=======
Total bytes requested: 0
Total bytes allocated: 0
Total bytes wasted on internal fragmentation: 0
Internal fragmentation: 0.000000%
Cross CPU allocations: 0/0
aldebaran:~/linux/linux/tools/perf>
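
Mini-bugreport 1) above comes down to the report path carrying on after the open failure and printing an all-zero summary. A minimal sketch of bailing out early, assuming a hypothetical kmem_report() helper (not a function in perf, and the wording of the hint is made up too):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Print the summary only if the input file can be opened.
 * Returns 0 on success, -1 if there is nothing to report on.
 */
int kmem_report(const char *path, FILE *out)
{
	FILE *in;

	assert(path != NULL && out != NULL);

	in = fopen(path, "rb");
	if (!in) {
		fprintf(stderr,
			"Failed to open file: %s (try 'perf kmem record' first)\n",
			path);
		return -1; /* stop here, no empty SUMMARY */
	}

	/* Parsing of the recorded events is elided in this sketch. */
	fclose(in);

	fprintf(out, "SUMMARY\n=======\n");
	return 0;
}
```

The caller can then turn the -1 into a non-zero exit status, which also makes the failure visible to scripts.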

Ingo

Li Zefan

Nov 24, 2009, 4:40:01 AM
Ingo Molnar wrote:
> a few more UI suggestions for 'perf kmem':
>

Thanks for the suggestions!

> I think it should look similar to how 'perf' and 'perf sched' prints
> sub-commands with increasing specificity, which means that we display a
> list of subcommands and options when typed:
>

Yes, I'd like to make the usage and output format similar to perf-sched.

perf-timechart acts similarly - it won't show help page by "perf timechart"

# ./perf timechart
0xbc480 [0x18]: skipping unknown header type: 2
0xbc488 [(nil)]: skipping unknown header type: 238
0xbc490 [(nil)]: skipping unknown header type: 20034
Written 1.0 seconds of trace to output.svg.

But sure, I can change this for perf-kmem. So, do we want to do the same
for perf-timechart too?

> The other thing is that if someone types 'perf kmem record', the command
> seems 'hung':
>
> $ perf kmem record
> <hang>
>
> Now if i Ctrl-C it i see that a recording session was going on:
>
> $ perf kmem record
> ^C[ perf record: Woken up 10 times to write data ]
> [ perf record: Captured and wrote 1.327 MB perf.data (~57984 samples) ]
>
> but this was not apparent from the tool output and the user was left
> wondering about what is going on.
>
> I think at minimum we should print a:
>
> [ Recording all kmem events in the system, Ctrl-C to stop. ]
>
> line. (on a related note, 'perf sched record' needs such a fix too.)
>

Yes, I followed perf-sched and perf-timechart. ;)

I'll fix it for these tools.

> Another solution would be for 'perf kmem record' to work analogous to
> 'perf record': it could display a short help page by default, something
> like:
>
> $ perf kmem record
>
> usage: perf kmem record [<options>] [<command>]
>
> example: perf kmem record -a sleep 10 # capture all events for 10 seconds
> perf kmem record /bin/ls # capture events of this command
> perf kmem record -p 1234 # capture events of PID 1234
>
> What do you think?
>

But I'm not sure I like this, actually I prefer to just print
a line to explain what's going on.

> Also, a handful of mini-bugreports wrt. usability:
>
> 1)
>
> running 'perf kmem' without having a perf.data gives:
>
> earth4:~/tip/tools/perf> ./perf kmem
> Failed to open file: perf.data (try 'perf record' first)
>
> SUMMARY
> =======
> Total bytes requested: 0
> Total bytes allocated: 0
> Total bytes wasted on internal fragmentation: 0
> Internal fragmentation: 0.000000%
> Cross CPU allocations: 0/0
>

Again, this issue exists in perf-sched too..

So we need to fix not only perf-kmem.

> 2)
>
> running 'perf kmem record' on a box without kmem events gives:
>
> earth4:~/tip/tools/perf> ./perf kmem record
> invalid or unsupported event: 'kmem:kmalloc'
> Run 'perf list' for a list of valid events
>
> i think we want to print something kmem specific - and tell the user how
> to enable kmem events or so - 'perf list' is not a solution to him.
>

ditto

> 3)
>
> it doesnt seem to be working on one of my boxes, which has perf and kmem
> events as well:
>
> aldebaran:~/linux/linux/tools/perf> perf kmem record
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.050 MB perf.data (~2172 samples) ]
>

It seems no kmem events were recorded. Not sure what happened here.

Might be that the parameters that perf-kmem passes to perf-record
are not properly selected?

Do perf-sched and perf-timechart work on this box?

> aldebaran:~/linux/linux/tools/perf> perf kmem
>
> SUMMARY
> =======
> Total bytes requested: 0
> Total bytes allocated: 0
> Total bytes wasted on internal fragmentation: 0
> Internal fragmentation: 0.000000%
> Cross CPU allocations: 0/0
> aldebaran:~/linux/linux/tools/perf>
>

Ingo Molnar

Nov 24, 2009, 5:10:02 AM

* Li Zefan <li...@cn.fujitsu.com> wrote:

> > 3)
> >
> > it doesnt seem to be working on one of my boxes, which has perf and kmem
> > events as well:
> >
> > aldebaran:~/linux/linux/tools/perf> perf kmem record
> > ^C[ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.050 MB perf.data (~2172 samples) ]
> >
>
> Seems no kmem event is recorded. No sure what happened here.
>
> Might be that the parameters that perf-kmem passes to perf-record
> are not properly selected?
>
> Do perf-sched and perf-timechart work on this box?

yeah:

aldebaran:~> perf sched record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~758 samples) ]
aldebaran:~> perf trace | tail -5
distccd-20944 [010] 1792.787376: sched_stat_runtime: comm=distccd pid=20944 runtime=11196 [ns] vruntime=696395420043 [ns]
init-0 [009] 1792.914837: sched_stat_wait: comm=x86_64-linux-gc pid=881 delay=10686 [ns]
init-0 [009] 1792.915082: sched_stat_sleep: comm=events/9 pid=44 delay=2183651362 [ns]
as-889 [013] 1793.008008: sched_stat_runtime: comm=as pid=889 runtime=156807 [ns] vruntime=1553569219042 [ns]
init-0 [004] 1793.154400: sched_stat_wait: comm=events/4 pid=39 delay=12155 [ns]

aldebaran:~> perf kmem record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.078 MB perf.data (~3398 samples) ]
aldebaran:~> perf trace | tail -5
aldebaran:~>

the perf.data has mmap and exit events - but no kmem events.

I've attached the config, in case it matters. It runs latest -tip, with
your latest series applied as well.

Ingo

config

Li Zefan

Nov 24, 2009, 6:10:02 AM
>> Do perf-sched and perf-timechart work on this box?
>
> yeah:
>
> aldebaran:~> perf sched record sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB perf.data (~758 samples) ]
> aldebaran:~> perf trace | tail -5
> distccd-20944 [010] 1792.787376: sched_stat_runtime: comm=distccd pid=20944 runtime=11196 [ns] vruntime=696395420043 [ns]
> init-0 [009] 1792.914837: sched_stat_wait: comm=x86_64-linux-gc pid=881 delay=10686 [ns]
> init-0 [009] 1792.915082: sched_stat_sleep: comm=events/9 pid=44 delay=2183651362 [ns]
> as-889 [013] 1793.008008: sched_stat_runtime: comm=as pid=889 runtime=156807 [ns] vruntime=1553569219042 [ns]
> init-0 [004] 1793.154400: sched_stat_wait: comm=events/4 pid=39 delay=12155 [ns]
>
> aldebaran:~> perf kmem record sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.078 MB perf.data (~3398 samples) ]
> aldebaran:~> perf trace | tail -5
> aldebaran:~>
>
> the perf.data has mmap and exit events - but no kmem events.
>

I was using yesterday's -tip tree, and I just updated to the latest -tip,
and I found perf tools are not working:

# ./perf kmem record sleep 3
...
# ./perf trace
perf-1805 [001] 66.239160: kmem_cache_free: ...
perf-1806 [000] 66.403561: kmem_cache_alloc: ...
swapper-0 [000] 66.420099: kmem_cache_free: ...
# ./perf kmem record sleep 3
...
# ./perf trace
# ./perf sched record sleep 3
...
# ./perf trace
perf-1825 [000] 103.543014: sched_stat_runtime: ...
# ./perf sched record sleep 3
...
# ./perf trace
#

So I think some recent updates to the kernel's perf_event code broke this.

Arjan van de Ven

Nov 24, 2009, 10:00:02 AM
On Tue, 24 Nov 2009 09:34:23 +0100
Ingo Molnar <mi...@elte.hu> wrote:
> > It would be great if perf can be used for boot time tracing. This
> > needs pretty big work on kernel side.
>
> What would be needed is to open per cpu events right after perf
> events initializes, and allocate memory for output buffers to them.
>
> They would round-robin after that point, and we could use
> perf_event_open() (with a special flag) to 'attach' to them and
> mmap() them - at which point they'd turn into regular objects with a
> lot of boot time data in them.

I'm not too worried about this btw;
we can start the userland trace early enough in the boot (the kernel is
done after 0.6 seconds after all) to capture the relevant stuff.
The actual kernel mostly gets captured with scripts/bootgraph.pl
already.

Yes it would be nice to do a timechart earlier, but if it's extremely
hard...
Also unless it starts before the drivers (eg the normal driver
initcall level), it is not useful.

tip-bot for Li Zefan

Nov 24, 2009, 12:00:03 PM
Commit-ID: b23d5767a5818caec8547d0bce1588b02bdecd30
Gitweb: http://git.kernel.org/tip/b23d5767a5818caec8547d0bce1588b02bdecd30
Author: Li Zefan <li...@cn.fujitsu.com>
AuthorDate: Tue, 24 Nov 2009 13:27:11 +0800
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Tue, 24 Nov 2009 08:49:51 +0100

perf kmem: Add help file

Add Documentation/perf-kmem.txt

Signed-off-by: Li Zefan <li...@cn.fujitsu.com>
Acked-by: Pekka Enberg <pen...@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard....@linux360.ro>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: linu...@kvack.org <linu...@kvack.org>
LKML-Reference: <4B0B6EA...@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>


---
tools/perf/Documentation/perf-kmem.txt | 44 ++++++++++++++++++++++++++++++++
tools/perf/command-list.txt | 1 +
2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-kmem.txt b/tools/perf/Documentation/perf-kmem.txt

Frederic Weisbecker

Nov 24, 2009, 1:50:01 PM

I think this is a problem external to kmem events. It's about
trace events/perf in general. It looks like we have some losses.
Steve and Arjan have reported similar things.

I'll investigate in that direction.

Thanks.

Frederic Weisbecker

Nov 24, 2009, 2:50:01 PM
Commit 4ed7c92d68a5387ba5f7030dc76eab03558e27f5
(perf_events: Undo some recursion damage) has introduced a bad
reference counting of the recursion context. putting the context
behaves like getting it, dropping every software/trace events
after the first one in a context.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Arjan van de Ven <ar...@infradead.org>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Steven Rostedt <ros...@goodmis.org>
---
kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index accfd7b..35df94e 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3914,7 +3914,7 @@ void perf_swevent_put_recursion_context(int rctx)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
barrier();
- cpuctx->recursion[rctx]++;
+ cpuctx->recursion[rctx]--;
put_cpu_var(perf_cpu_context);
}
EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
--
1.6.2.3
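
To see why the one-character change matters, here is a small user-space model of the get/put pairing. It is a sketch, not the kernel code: struct model_ctx, model_get(), model_put() and the buggy flag are hypothetical names, and the assumption (consistent with the changelog above) is that "get" refuses to enter a context whose counter is already non-zero.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CONTEXTS 4

/* Stand-in for the per-CPU recursion counters. */
struct model_ctx {
	int recursion[NR_CONTEXTS];
};

/* Enter a context; returns rctx on success, -1 if already inside it. */
int model_get(struct model_ctx *c, int rctx)
{
	assert(rctx >= 0 && rctx < NR_CONTEXTS);
	if (c->recursion[rctx])
		return -1;
	c->recursion[rctx]++;
	return rctx;
}

/* Leave a context; 'buggy' reproduces the ++ that the patch turns into --. */
void model_put(struct model_ctx *c, int rctx, bool buggy)
{
	assert(rctx >= 0 && rctx < NR_CONTEXTS);
	if (buggy)
		c->recursion[rctx]++;
	else
		c->recursion[rctx]--;
}

/* Fire nr_events back to back in one context, count how many get through. */
int model_count_delivered(int nr_events, bool buggy)
{
	struct model_ctx c = { {0} };
	int delivered = 0;
	int i;

	for (i = 0; i < nr_events; i++) {
		int rctx = model_get(&c, 0);

		if (rctx < 0)
			continue; /* treated as recursion, event dropped */
		delivered++;
		model_put(&c, rctx, buggy);
	}
	return delivered;
}
```

With the increment in put, the counter never returns to zero, so model_count_delivered() reports a single event no matter how many are fired. With the decrement, every event gets through. That matches the symptom seen earlier in the thread, where perf.data came out with no kmem or sched events.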

tip-bot for Frederic Weisbecker

Nov 24, 2009, 3:40:02 PM
Commit-ID: fe6126722718e51fba4879517c11ac12d9775bcc
Gitweb: http://git.kernel.org/tip/fe6126722718e51fba4879517c11ac12d9775bcc
Author: Frederic Weisbecker <fwei...@gmail.com>
AuthorDate: Tue, 24 Nov 2009 20:38:22 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Tue, 24 Nov 2009 21:34:00 +0100

perf_events: Fix bad software/trace event recursion counting

Commit 4ed7c92d68a5387ba5f7030dc76eab03558e27f5
(perf_events: Undo some recursion damage) has introduced a bad
reference counting of the recursion context. putting the context
behaves like getting it, dropping every software/trace events
after the first one in a context.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Arjan van de Ven <ar...@infradead.org>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Steven Rostedt <ros...@goodmis.org>

LKML-Reference: <1259091502-5171-1-gi...@gmail.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>


---
kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index accfd7b..35df94e 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3914,7 +3914,7 @@ void perf_swevent_put_recursion_context(int rctx)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
barrier();
- cpuctx->recursion[rctx]++;
+ cpuctx->recursion[rctx]--;
put_cpu_var(perf_cpu_context);
}
EXPORT_SYMBOL_GPL(perf_swevent_put_recursion_context);
--

Peter Zijlstra

Nov 24, 2009, 3:50:01 PM
On Tue, 2009-11-24 at 20:38 +0100, Frederic Weisbecker wrote:

> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index accfd7b..35df94e 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -3914,7 +3914,7 @@ void perf_swevent_put_recursion_context(int rctx)
> {
> struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
> barrier();
> - cpuctx->recursion[rctx]++;
> + cpuctx->recursion[rctx]--;
> put_cpu_var(perf_cpu_context);
> }

D'0h!

Ingo Molnar

Nov 24, 2009, 4:50:02 PM

* Li Zefan <li...@cn.fujitsu.com> wrote:

> So I think some new updates on kernel perf_event break.

yeah, you were right. This commit in latest -tip should fix it:

fe61267: perf_events: Fix bad software/trace event recursion counting

Ingo

Ingo Molnar

Nov 24, 2009, 5:40:02 PM

* Ingo Molnar <mi...@elte.hu> wrote:

>
> * Li Zefan <li...@cn.fujitsu.com> wrote:
>
> > So I think some new updates on kernel perf_event break.
>
> yeah, you were right. This commit in latest -tip should fix it:
>
> fe61267: perf_events: Fix bad software/trace event recursion counting

i tested perf kmem and it works fine now:

aldebaran:~> perf kmem

SUMMARY
=======
Total bytes requested: 153166032
Total bytes allocated: 188544080
Total bytes wasted on internal fragmentation: 35378048
Internal fragmentation: 18.763807%
Cross CPU allocations: 61680/451425
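
The summary figures are easy to cross-check by hand: the wasted bytes are allocated minus requested, and the percentage is that difference over the allocated total. A small sketch, with hypothetical function names (kmem_wasted_bytes, kmem_fragmentation_pct) that are not taken from builtin-kmem.c:

```c
#include <assert.h>

/* Bytes handed out by the allocator beyond what callers asked for. */
unsigned long kmem_wasted_bytes(unsigned long requested, unsigned long allocated)
{
	assert(allocated >= requested);
	return allocated - requested;
}

/* Internal fragmentation as a percentage of the allocated bytes. */
double kmem_fragmentation_pct(unsigned long requested, unsigned long allocated)
{
	if (allocated == 0)
		return 0.0; /* nothing recorded, avoid dividing by zero */
	return 100.0 * (double)kmem_wasted_bytes(requested, allocated) /
	       (double)allocated;
}
```

With the numbers above, 188544080 - 153166032 = 35378048 wasted bytes, and 35378048 / 188544080 comes to about 18.7638%, which matches the summary.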
