There are several things happening in this patchset, addressing
different problems:
- remove most of the string copy overhead in the fast path
- open the way for lock class oriented profiling (as opposed
to lock instance profiling; both can be useful in different ways)
- remove the buffer multiplexing (less contention)
- add event injection support
- remove violent lock event recursion (only 2 of the 3 cases; the
remaining one is detailed below)
Some numbers, obtained by running:
perf lock record perf sched pipe -l 100000
Before the patchset:
Total time: 91.015 [sec]
910.157300 usecs/op
1098 ops/sec
With this patchset applied:
Total time: 43.706 [sec]
437.062080 usecs/op
2288 ops/sec
Although it's actually ~50 secs with the very latest patch in this
series applied. That last patch is supposed to bring more scalability
(and I believe it does on a box with more than two cpus, although I
can't test that here).
Multiplexing the counters had a side effect though: perf record has
only one buffer to drain instead of 5 * NR_CPUS, which makes its job
a bit easier (at the cost of cross-cpu contention of course, but on my
Atom the scalability gain is not very visible).
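(As an illustration: on a 64-thread box, non-multiplexed recording means
5 * 64 = 320 per-cpu buffers for perf record to drain, against a single
one when multiplexing.)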
And also, after this odd patch:
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 98fd360..254b3d4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;
- return task_pid_nr_ns(p, event->ns);
+ return p->pid;
}
We get:
Total time: 26.170 [sec]
261.707960 usecs/op
3821 ops/sec
I.e., 2x faster than with this patchset alone, and more than 3x faster
than tip:perf/core.
This is because task_pid_nr_ns() takes a lock and creates lock event
recursion. We really need to fix that.
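For the record, the recursion path looks roughly like this (simplified
from reading the code, not a captured trace):
    perf_event_tid()
        task_pid_nr_ns()
            rcu_read_lock()            (pid namespace lookup is RCU protected)
                lock_acquire()         (lockdep annotation of the rcu read side)
                    trace lock_acquire (yet another lock event, while we are
                                        already in the output path of one)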
You can pull this patchset from:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perf/core
Thanks.
---
Frederic Weisbecker (11):
tracing: Add lock_class_init event
tracing: Introduce TRACE_EVENT_INJECT
tracing: Inject lock_class_init events on registration
tracing: Add lock class id in lock_acquire event
perf: New PERF_EVENT_IOC_INJECT ioctl
perf: Handle injection ioctl with trace events
perf: Handle injection ioctl for tracepoints from perf record
perf/lock: Add support for lock_class_init events
tracing: Remove the lock name from most lock events
tracing/perf: Fix lock events recursions in the fast path
perf lock: Drop the buffers multiplexing dependency
include/linux/ftrace_event.h | 6 +-
include/linux/lockdep.h | 4 +
include/linux/perf_event.h | 6 +
include/linux/tracepoint.h | 3 +
include/trace/define_trace.h | 6 +
include/trace/events/lock.h | 57 ++++--
include/trace/ftrace.h | 31 +++-
kernel/lockdep.c | 16 ++
kernel/perf_event.c | 47 ++++-
kernel/trace/trace_event_profile.c | 46 +++--
kernel/trace/trace_events.c | 3 +
tools/perf/builtin-lock.c | 345 ++++++++++++++++++++++++++++++++----
tools/perf/builtin-record.c | 9 +
13 files changed, 497 insertions(+), 82 deletions(-)
--
Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Steven Rostedt <ros...@goodmis.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Hitoshi Mitake <mit...@dcl.info.waseda.ac.jp>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Lai Jiangshan <la...@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhir...@redhat.com>
Cc: Jens Axboe <jens....@oracle.com>
---
include/trace/events/lock.h | 23 +++++------------------
1 files changed, 5 insertions(+), 18 deletions(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 3c4a5c1..ced4ceb 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -50,23 +50,20 @@ TRACE_EVENT(lock_acquire,
TP_STRUCT__entry(
__field(unsigned int, flags)
- __string(name, lock->name)
__field(void *, lockdep_addr)
__field(void *, class_id)
),
TP_fast_assign(
__entry->flags = (trylock ? 1 : 0) | (read ? 2 : 0);
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
__entry->class_id = (void *)lock->name;
),
- TP_printk("%p %p %s%s%s", __entry->lockdep_addr,
+ TP_printk("%p %p %s%s", __entry->lockdep_addr,
__entry->class_id,
(__entry->flags & 1) ? "try " : "",
- (__entry->flags & 2) ? "read " : "",
- __get_str(name))
+ (__entry->flags & 2) ? "read " : "")
);
TRACE_EVENT(lock_release,
@@ -76,17 +73,14 @@ TRACE_EVENT(lock_release,
TP_ARGS(lock, nested, ip),
TP_STRUCT__entry(
- __string(name, lock->name)
__field(void *, lockdep_addr)
),
TP_fast_assign(
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
),
- TP_printk("%p %s",
- __entry->lockdep_addr, __get_str(name))
+ TP_printk("%p", __entry->lockdep_addr)
);
#ifdef CONFIG_LOCK_STAT
@@ -98,17 +92,14 @@ TRACE_EVENT(lock_contended,
TP_ARGS(lock, ip),
TP_STRUCT__entry(
- __string(name, lock->name)
__field(void *, lockdep_addr)
),
TP_fast_assign(
- __assign_str(name, lock->name);
__entry->lockdep_addr = lock;
),
- TP_printk("%p %s",
- __entry->lockdep_addr, __get_str(name))
+ TP_printk("%p", __entry->lockdep_addr)
);
TRACE_EVENT(lock_acquired,
@@ -117,19 +108,15 @@ TRACE_EVENT(lock_acquired,
TP_ARGS(lock, ip, waittime),
TP_STRUCT__entry(
- __string(name, lock->name)
__field(s64, wait_nsec)
__field(void *, lockdep_addr)
),
TP_fast_assign(
- __assign_str(name, lock->name);
__entry->wait_nsec = waittime;
__entry->lockdep_addr = lock;
),
- TP_printk("%p %s (%llu ns)", __entry->lockdep_addr,
- __get_str(name),
- __entry->wait_nsec)
+ TP_printk("%p (%llu ns)", __entry->lockdep_addr, __entry->wait_nsec)
);
#endif
--
1.6.2.3
One pair is in do_perf_sw_event:
__lock_acquire
|
|--96.11%-- lock_acquire
| |
| |--27.21%-- do_perf_sw_event
| | perf_tp_event
| | |
| | |--49.62%-- ftrace_profile_lock_release
| | | lock_release
| | | |
| | | |--33.85%-- _raw_spin_unlock
Another pair is in perf_output_begin/end:
__lock_acquire
|--23.40%-- perf_output_begin
| | __perf_event_overflow
| | perf_swevent_overflow
| | perf_swevent_add
| | perf_swevent_ctx_event
| | do_perf_sw_event
| | perf_tp_event
| | |
| | |--55.37%-- ftrace_profile_lock_acquire
| | | lock_acquire
| | | |
| | | |--37.31%-- _raw_spin_lock
The problem is not so much the trace recursion itself, as we already
have recursion protection (though recursing is always wasteful).
The real issue is that the trace events sit outside the lockdep
recursion protection, so each lockdep event triggers a lock trace,
which in turn triggers two more lockdep events. The recursive lock
trace events won't be recorded, because of the trace recursion
protection, so the recursion stops there, but lockdep will still
analyse these new events.
To sum up, for each lockdep event we have:
lock_*()
|
trace lock_acquire
|
----- rcu_read_lock()
| |
| lock_acquire()
| |
| trace_lock_acquire() (stopped)
| |
| lockdep analyze
|
----- rcu_read_unlock()
|
lock_release
|
trace_lock_release() (stopped)
|
lockdep analyze
And you can repeat the above twice, as we have two rcu read side
sections when we submit an event.
This is fixed in this patch by using the non-lockdep versions of
rcu_read_(un)lock.
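For reference, this is roughly what the underscored variants skip
(paraphrased from rcupdate.h, not verbatim):
	static inline void rcu_read_lock(void)
	{
		__rcu_read_lock();	/* the real read-side critical section entry */
		rcu_read_acquire();	/* lockdep annotation, i.e. lock_acquire() on
					 * rcu_lock_map, which fires a nested
					 * lock_acquire trace event */
	}
Calling __rcu_read_lock()/__rcu_read_unlock() directly keeps the
read-side critical section but drops the lockdep annotation, hence no
nested lock events from these paths.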
Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Steven Rostedt <ros...@goodmis.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Hitoshi Mitake <mit...@dcl.info.waseda.ac.jp>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Lai Jiangshan <la...@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhir...@redhat.com>
Cc: Jens Axboe <jens....@oracle.com>
---
kernel/perf_event.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 280ae44..98fd360 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2986,7 +2986,7 @@ int perf_output_begin(struct perf_output_handle *handle,
u64 lost;
} lost_event;
- rcu_read_lock();
+ __rcu_read_lock();
/*
* For inherited events we send all the output towards the parent.
*/
@@ -3051,7 +3051,7 @@ fail:
atomic_inc(&data->lost);
perf_output_unlock(handle);
out:
- rcu_read_unlock();
+ __rcu_read_unlock();
return -ENOSPC;
}
@@ -3072,7 +3072,7 @@ void perf_output_end(struct perf_output_handle *handle)
}
perf_output_unlock(handle);
- rcu_read_unlock();
+ __rcu_read_unlock();
}
static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
@@ -4116,7 +4116,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
struct perf_event_context *ctx;
cpuctx = &__get_cpu_var(perf_cpu_context);
- rcu_read_lock();
+ __rcu_read_lock();
perf_swevent_ctx_event(&cpuctx->ctx, type, event_id,
nr, nmi, data, regs);
/*
@@ -4126,7 +4126,7 @@ static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
ctx = rcu_dereference(current->perf_event_ctxp);
if (ctx)
perf_swevent_ctx_event(ctx, type, event_id, nr, nmi, data, regs);
- rcu_read_unlock();
+ __rcu_read_unlock();
}
void __perf_sw_event(u32 event_id, u64 nr, int nmi,
--
1.6.2.3
The class id is the address of the lock name, which is the same
pointer as in the corresponding lock class.
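As a reminder of why the name address identifies the class, here is a
simplified sketch of the two lockdep structures involved (not the exact
definitions):
	struct lockdep_map {
		struct lock_class_key	*key;
		const char		*name;	/* set when the lock is initialized */
		/* ... */
	};
	struct lock_class {
		const char		*name;	/* register_lock_class() copies the
						 * lockdep_map's name pointer here */
		/* ... */
	};
So (void *)lock->name is the same for every instance of a class, which
makes it usable as a class key from user space.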
Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Steven Rostedt <ros...@goodmis.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Hitoshi Mitake <mit...@dcl.info.waseda.ac.jp>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Lai Jiangshan <la...@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhir...@redhat.com>
Cc: Jens Axboe <jens....@oracle.com>
---
include/trace/events/lock.h | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 6999f16..3c4a5c1 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -52,15 +52,18 @@ TRACE_EVENT(lock_acquire,
__field(unsigned int, flags)
__string(name, lock->name)
__field(void *, lockdep_addr)
+ __field(void *, class_id)
),
TP_fast_assign(
__entry->flags = (trylock ? 1 : 0) | (read ? 2 : 0);
__assign_str(name, lock->name);
__entry->lockdep_addr = lock;
+ __entry->class_id = (void *)lock->name;
),
- TP_printk("%p %s%s%s", __entry->lockdep_addr,
+ TP_printk("%p %p %s%s%s", __entry->lockdep_addr,
+ __entry->class_id,
(__entry->flags & 1) ? "try " : "",
(__entry->flags & 2) ? "read " : "",
__get_str(name))
--
1.6.2.3
On trace events, the new ioctl calls the inject callback, usually
reserved for events that need to catch up with the user (i.e. replay
past state).
Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Steven Rostedt <ros...@goodmis.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Hitoshi Mitake <mit...@dcl.info.waseda.ac.jp>
Cc: Li Zefan <li...@cn.fujitsu.com>
Cc: Lai Jiangshan <la...@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhir...@redhat.com>
Cc: Jens Axboe <jens....@oracle.com>
---
include/linux/perf_event.h | 2 ++
kernel/perf_event.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 556b0f4..d2e83f0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -228,6 +228,7 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
+#define PERF_EVENT_IOC_INJECT _IO ('$', 7)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -513,6 +514,7 @@ struct pmu {
void (*disable) (struct perf_event *event);
void (*read) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
+ void (*inject) (struct perf_event *event);
};
/**
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 40f8b07..e4dfd12 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2151,6 +2151,26 @@ unlock:
return ret;
}
+static void __perf_event_inject(void *info)
+{
+ struct perf_event *event = info;
+
+ event->pmu->inject(event);
+}
+
+static int perf_event_inject(struct perf_event *event)
+{
+ struct perf_event_context *ctx = event->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!event->pmu->inject || task)
+ return -EINVAL;
+
+ smp_call_function_single(event->cpu, __perf_event_inject, event, 1);
+
+ return 0;
+}
+
static int perf_event_set_output(struct perf_event *event, int output_fd);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
@@ -2183,6 +2203,9 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_INJECT:
+ return perf_event_inject(event);
+
default:
return -ENOTTY;
}
--
1.6.2.3
Ah, I forgot to say: injection is only supported on cpu-bound events
(not task-bound ones), because if the event is task-bound we can't
ensure the injection happens while the event is scheduled in. I'm not
sure yet how to fix this.
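From user space, the intended usage would be something along these lines
(just a sketch; the perf_event_open() setup is illustrative and not part
of these patches, only the ioctl command is):
	/* Illustrative only: open a cpu-bound event (pid == -1, cpu >= 0),
	 * the only mode injection supports, then ask the kernel to replay
	 * the catch-up events into the event's buffer. */
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/perf_event.h>

	#ifndef PERF_EVENT_IOC_INJECT
	#define PERF_EVENT_IOC_INJECT	_IO('$', 7)	/* as added by this patch */
	#endif

	static int inject_on_cpu(struct perf_event_attr *attr, int cpu)
	{
		int fd = syscall(__NR_perf_event_open, attr, -1, cpu, -1, 0);

		if (fd < 0)
			return -1;
		return ioctl(fd, PERF_EVENT_IOC_INJECT, 0);
	}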
This does a lot better here, even if it isn't exactly stellar
performance. It generates a LOT of data:
root@nehalem:/dev/shm # time perf lock rec -fg ls
perf.data perf.data.old
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 137.224 MB perf.data (~5995421
samples) ]
real 0m3.320s
user 0m0.000s
sys 0m3.220s
Without -g, it has 1.688s real and 1.590s sys time.
So while this is orders of magnitude better than the previous patchset,
it's still not anywhere near lean. But I expect you know that; just
consider this an 'I tested it and this is what happened' report :-)
--
Jens Axboe
> Hi,
>
> There are many things that happen in this patchset, treating
> different problems:
>
> - remove most of the string copy overhead in fast path
> - open the way for lock class oriented profiling (as
> opposed to lock instance profiling. Both can be useful
> in different ways).
> - remove the buffers multiplexing (less contention)
> - event injection support
> - remove violent lock events recursion (only 2 among 3, the remaining
> one is detailed below).
>
> Some differences, by running:
> perf lock record perf sched pipe -l 100000
>
> Before the patchset:
>
> Total time: 91.015 [sec]
>
> 910.157300 usecs/op
> 1098 ops/sec
>
> After this patchset applied:
>
> Total time: 43.706 [sec]
>
> 437.062080 usecs/op
> 2288 ops/sec
Fantastic!
There's one area that needs more thought I think: the dump-all-classes
init-event-injector approach. It is async, hence we could lose events if
there are a lot of lock classes to dump. Plus we eventually want to use
your injector approach for other things as well (such as dumping the
state of a collection of tasks) - so I think we want it to be more
synchronous.
One approach would be to allow a gradual read() to deplete the dump.
Also, I think the 'state dump' events should be separate from regular
init events.
Filters attached to these events will automatically cause the dumping to
be restricted to the filter set. For example, in the case of tasks one
could dump only tasks from a particular UID - by adding a 'uid == 1234'
filter before the dump (on a per-task basis - so the filtering is nicely
task local).
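Something like this, roughly (just a sketch; the 'task' dump event is
hypothetical, and the inject ioctl is the one proposed in this series):
	/* hypothetical: restrict a state dump with a pre-installed filter */
	ioctl(fd, PERF_EVENT_IOC_SET_FILTER, "uid == 1234");	/* existing ioctl */
	ioctl(fd, PERF_EVENT_IOC_INJECT, 0);	/* proposed: replay only matching tasks */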
What do you think?
Ingo
I like the idea, I'm just not sure about the name and API details.
I would like to call it something like collection support, and the API
should have an iterator-like interface.
That is, it should not blindly dump all events from a collection at
once, praying that the output buffer is large enough, but either dump a
specified number and/or stop dumping when the buffer is full, allowing a
second invocation to continue where it left off after the buffer content
has been consumed.
Which brings us to the ioctl() interface: we can do the above using
ioctl()s, but it seems to me we're starting to get ioctl() heavy and
should be looking at alternative ways of extending this.
Anybody any bright ideas?
Doh, 137 MB for a single ls :)
That said, we don't yet have support for callchains in perf lock,
and callchains can fill the buffer quickly, especially on lock
events. You can drop the -g option for now.
>
> real 0m3.320s
> user 0m0.000s
> sys 0m3.220s
>
> Without -g, it has 1.688s real and 1.590s sys time.
Ok.
> So while this is orders of magnitude better than the previous patchset,
> it's still not anywhere near lean. But I expect you know that, just
> consider this a 'I tested it and this is what happened' report :-)
Ok, thanks a lot; the fact that you can test on a 64-thread box is
critically helpful.
I also wonder what happens with this patch applied:
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 98fd360..254b3d4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;
- return task_pid_nr_ns(p, event->ns);
+ return p->pid;
}
On my box it made things about 2x faster than this patchset alone.
I wonder if the tool becomes usable for you with that.
Otherwise, it means we have other things to fix, and
the result of:
perf record -g -f perf lock record sleep 6
perf report
would be very nice to have.
Thanks!
Yeah, the dump itself is purely async.
But the lock_class_init event is used from two sites:
- the injector: purely asynchronous, the "catch up with the past" thing
- register_lock_class(): the synchronous point, hit each time a new
class is created.
When we register a lock_class_init ftrace/perf event, we first activate
the synchronous point; it behaves here as a usual event and hooks into
every class created from then on.
Only afterwards do we call the injector, which is asynchronous and
relies on past events.
Being split into these two parts, it covers every event.
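Roughly, as a sketch (not the actual patch code; the inject callback
shape and names are illustrative):
	/* 1) synchronous site: every class created from now on is traced */
	register_lock_class(...)
	{
		/* ... */
		trace_lock_class_init(class);
	}

	/* 2) asynchronous site: the inject callback replays the past */
	static void lock_class_init_inject(...)
	{
		struct lock_class *class;

		list_for_each_entry(class, &all_lock_classes, lock_entry)
			trace_lock_class_init(class);	/* catch up with pre-existing classes */
	}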
> Plus we eventually want to use your
> injector approach for other things as well (such as to dump the state of a
> collection of tasks) - so i think we want it to be more synchronous.
Yeah, that would also work for tasks, and we can follow the same
pattern for it.
We can set up a synchronous tracepoint in fork and have a secondary
asynchronous point that can dump the task list. That too would cover
every event we want.
> One approach would be to allow a gradual read() deplete the dump. Also, i
> think the 'state dump' events should be separate from regular init events.
> Filters attached to these events will automatically cause the dumping to be
> restricted to the filter set. For example in the case of tasks one could dump
> only tasks from a particular UID - by adding a 'uid == 1234' filter before
> the dump (on a per tasks basis - so the filtering is nicely task local).
But this is what we want, right? If the init event and the dump event are
the same, which is the case currently, the filter will apply to both.
And if we are only interested in tasks for uid == 1234, I guess we
want async and sync events that have the same filter.
Maybe we want to split init events from dump events, say, have an
event class that you can open in either async or sync mode, but I
can't figure out a workflow for which that would be useful.
Yeah, it's pretty crazy. It varies a lot too, I didn't post the other
run which was ~170MB.
> That said we don't have yet support for callchains in perf lock,
> and callchains can fill the buffer quickly, especially on lock
> events. You can drop the -g option for now.
OK
> > So while this is orders of magnitude better than the previous patchset,
> > it's still not anywhere near lean. But I expect you know that, just
> > consider this a 'I tested it and this is what happened' report :-)
>
>
> Ok, thanks a lot, the fact you can test on a 64 threads box is critically
> helpful.
My pleasure, I'd love to have a fast and functional perf lockstat. If my
testing helps getting there, consider me signed up :-)
> I also wonder what happens after this patch applied:
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 98fd360..254b3d4 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> if (event->parent)
> event = event->parent;
>
> - return task_pid_nr_ns(p, event->ns);
> + return p->pid;
> }
>
> In my box it has increased the speed from 2x this patchset.
Cool, I'll give that a spin in the morning, the box is off atm.
> I wonder if the tool becomes usable for you with that.
> Otherwise, it means we have other things to fix, and
> the result of:
>
> perf record -g -f perf lock record sleep 6
> perf report
>
> would be very nice to have.
I'll package that up for you and put it somewhere.
--
Jens Axboe
Yeah, I agree. But my worry is that this scheme induces races, though
they are probably tight enough that we don't care much.
Consider dumping the task list content:
A -> B -> C -> D
You open a "task" event and ask to inject it one entry at a time: you
first dump A, then B disappears, so you'll miss it, but you can still
get C and D if they don't disappear.
As I said, the window is tight enough that we don't care. If B
disappears that early, it won't have played a determinant role in the
profiling anyway (at worst a few isolated events at the beginning).
In fact the async injection will always have such misses by its very
nature: even if you dump the whole list at once, you will still miss
tasks that disappeared just before the injection call, so whatever...
>
> Which brings us to the ioctl() interface, we can do the above using
> ioctl()s, but it seems to me we're starting to get ioctl() heavy and
> should be looking at alternative ways of extending this.
>
> Anybody any bright ideas?
>
Using ioctl, we could basically have a structure like:
struct perf_inject_req {
	int	start;	/* in */
	int	len;	/* in */
	int	*read;	/* out */
};
Using an index will often imply rewalking a whole object list from
the beginning, which is perfectly fine.
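From the tool side the depletion loop could then look like this
(hypothetical, and assuming the inject ioctl grows this struct as its
argument):
	struct perf_inject_req req = { .start = 0, .len = 64 };
	int done;

	req.read = &done;
	do {
		if (ioctl(fd, PERF_EVENT_IOC_INJECT, &req) < 0)
			break;
		/* consume the freshly injected records from the mmap buffer */
		req.start += done;
	} while (done == req.len);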
If you prefer an alternate syscall, I can make one. I've been told
about ioctl drawbacks in the past; personally I think ioctl is not all
drawbacks: it avoids a proliferation of syscalls. But it's however you
prefer.
Great :)
> > I also wonder what happens after this patch applied:
> >
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index 98fd360..254b3d4 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -3094,7 +3094,8 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_pid_nr_ns(p, event->ns);
> > + return p->pid;
> > }
> >
> > In my box it has increased the speed from 2x this patchset.
>
> Cool, I'll give that a spin in the morning, the box is off atm.
Thanks a lot!
> On Wed, Feb 03, 2010 at 11:33:16AM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-02-03 at 10:14 +0100, Frederic Weisbecker wrote:
> > > - event injection support
> >
> > I like the idea, I'm just not sure about the name and API details.
> >
> > I would like to call it something like collection support, and the API
> > should have an iterator like interface.
> >
> > That is, it should not blindly dump all events from a collection at once,
> > praying the output buffer is large enough, but either dump a specified
> > number and/or stop dumping when the buffer is full. Allowing a second
> > invocation to continue where it left off after the buffer content has
> > been consumed.
>
>
> Yeah I agree. But my worry is there are induced races in this scheme. But
> probably tight enough that we don't care much.
>
> Consider dumping the task list content:
>
> A -> B -> C -> D
>
> You open a "task" event. And ask to inject it one by one,
> you first dump A, and B disappear, then you'll miss it
> but you can still get C and D if they don't disappear.
>
> As I said it is tight enough that we don't care. If B disappears
> so early, it means it won't have a determinant role in the profiling
> anyway (at worst few isolated events in the beginning).
We probably don't care - /proc is racy in the same fashion (you can miss
tasks), yet 'top' has been able to cope for a decade.
If we cared, it would be possible to construct a dump-collection-state
sequence along the lines of:
  activate init/deinit events
  initiate dump
Anything that got missed by the dump will be covered by the init/deinit
event flow. In that sense it's important that init/deinit and dumping use
a similar state structure - possibly the same event (as your solution did).
A 'dump' of a collection is basically the artificial replay of all its init
events.
Ingo
Hmmm... Perhaps I should rename __rcu_read_lock() to something more
meaningful if it is to be used outside of the RCU files. In the
meantime:
Reviewed-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
Doesn't seem to change anything, same runtime for a ls.
> I wonder if the tool becomes usable for you with that.
> Otherwise, it means we have other things to fix, and
> the result of:
>
> perf record -g -f perf lock record sleep 6
> perf report
>
> would be very nice to have.
root@nehalem:/dev/shm # perf record -g -f perf lock record sleep 6
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 446.208 MB perf.data (~19495127 samples) ]
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 1.135 MB perf.data (~49609 samples) ]
It's huuuge. Thankfully the output isn't so big, I've attached it.
--
Jens Axboe
Perhaps we can use the existing rcu_read_lock_sched_notrace().
Not related to this patchset, but on RCU & lockdep:
We need to remove lockdep from rcu_read_lock_*().
1) rcu_read_lock() is deadlock-immune,
so we get very little benefit from lockdep.
rcu_read_lock()
lock_acquire(read=2,check=1)
* Values for check:
*
* 0: disabled
* 1: simple checks (freeing, held-at-exit-time, etc.)
* 2: full validation
*/
We can check it by other methods.
2) Popular distributions and some companies enable lockdep in their
kernels. rcu_read_lock_*() are the most frequently taken locks in the
kernel, and lock_acquire() is not fast enough; it is a big function
for RCU.
I'm not at all convinced we need to do any such thing; remember it's
debugging stuff, and performance, while nice, doesn't really count.
> 1) rcu_read_lock() is deadlock-immunity,
> we get very little benefit from lockdep.
Except it detects things like failing to unlock, or going into userspace
while holding an rcu_read_lock().
Also, Paul has been spending lots of effort on getting rcu_dereference()
annotated.
> rcu_read_lock()
> lock_acquire(read=2,check=1)
>
> * Values for check:
> *
> * 0: disabled
> * 1: simple checks (freeing, held-at-exit-time, etc.)
> * 2: full validation
> */
>
> We can check it by other methods.
>
> 2) popular distributions and some companies enable lockdep for their kernel.
> rcu_read_lock_*() are the most frequent lock in kernel.
> lock_acquire() is not fast enough, it is a big function for RCU.
It's debug stuff, get over it; we're not going to limit its coverage
because people do silly things.
That said, I'm not at all happy about removing lockdep annotations to
make the tracer faster; that's really counterproductive.
Are there no dynamic techniques that could be used here?
Lockdep obviously wants maximum instrumentation coverage - performance be
damned.
Lock profiling/tracing/visualization wants the minimum subset of events it is
interested in - everything else is unnecessary overhead.
Ingo
IIRC the reason it's now outside is that you'd lose tracepoints on
lockdep_off() usage, but having the tracer folks help on removing any
such usage is of course a good thing.
The usage thereof in nmi_enter() doesn't seem like a problem, since
you're not supposed to be using locks from nmi context anyway. Moreover,
I'd not be averse to putting BUG_ON(in_nmi()) in every lockdep hook.
Well, they could start by moving the tracepoint inside the lockdep
recursion check.
--
Another nasty side effect is that it (lockdep recursion) isn't IRQ aware,
in that we don't do any tracking for IRQs that hit while we're doing
lockdep. We can fix that using a recursion context like we did for perf;
that would actually improve lockdep itself too.
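Something like the swevent recursion context, i.e. one recursion flag
per context instead of a single one. A sketch (names illustrative, not
a patch):
	struct lockdep_recursion {
		int	count[4];	/* task, softirq, hardirq, nmi */
	};
	static DEFINE_PER_CPU(struct lockdep_recursion, lockdep_recursion_cpu);

	static int lockdep_recursion_context(void)
	{
		if (in_nmi())
			return 3;
		if (in_irq())
			return 2;
		if (in_softirq())
			return 1;
		return 0;
	}

	/* an IRQ hitting in the middle of lockdep work gets its own slot, so
	 * the locks it takes are still tracked instead of silently ignored */
	static int lockdep_recursion_enter(void)
	{
		int *r = &__get_cpu_var(lockdep_recursion_cpu).count[lockdep_recursion_context()];

		if (*r)
			return 0;
		(*r)++;
		return 1;
	}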
Ah.
> > I wonder if the tool becomes usable for you with that.
> > Otherwise, it means we have other things to fix, and
> > the result of:
> >
> > perf record -g -f perf lock record sleep 6
> > perf report
> >
> > would be very nice to have.
>
> root@nehalem:/dev/shm # perf record -g -f perf lock record sleep 6
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 446.208 MB perf.data (~19495127 samples) ]
> [ perf record: Woken up 9 times to write data ]
> [ perf record: Captured and wrote 1.135 MB perf.data (~49609 samples) ]
>
> It's huuuge. Thankfully the output isn't so big, I've attached it.
Yeah, I made a mistake: both perf record and perf lock are writing to
the same perf.data with this cmdline.
Actually we'd have needed:
perf record -o perflock.data -g -f perf lock record sleep 6
perf report -i perflock.data
That's why you don't have callchains in your profile, probably because the
file has been messed up.
Anyway, we should retry after my next iteration, I have various
reviews to address :)
Thanks.
Yep, I agree with you. With the lockdep recursion check made subtle
enough for that, plus the lock events moved under the lockdep recursion
checks, the situation is fixed while keeping lockdep coverage of the
perf tracing path for the other cases.
I will start by addressing this.
That said, I think this is good as a first step, but we can't continue
to force the lock events -> lockdep dependency in the long term. We
can't have serious lock profiling if we are doomed to suffer the
slowness of the lockdep checks at the same time.
Sure, we can continue to support having both, but I think we should
also think about a solution to handle lock events without lockdep in
the future. That will require some minimal lockdep functionality
(keeping the lockdep map and the class hashes).
Thanks.
> That said, I think this is good for a first step, but we can't continue
> to force the lock events -> lockdep dependency in the long term. We
> can't have a serious lock profiling if we are doomed to suffer the
> slowness due to lockdep checks at the same time.
>
> Sure we can continue to support having both, but I think we should also
> think about a solution to handle lock events without it in the future.
> That will require some minimal lockdep functionalities (keeping the
> lockdep map, and class hashes).
You mean like building without CONFIG_PROVE_LOCKING, booting with
lockdep.prove_locking=0, or using
echo 0 > /sys/modules/lockdep/prove_locking ?
That keeps the lock tracking but does away with all the dependency
analysis, and was created for just such a use case as you are looking
at, namely lockstat.
Looks pretty much like what I'm looking for, except that it still
fills and keeps track of the locks held by the current thread, namely
the copies in curr->held_locks.
Which is exactly what you need for that lock hierarchy recording you
wanted :-)
Well, the lock hierarchy should probably be retrieved from the traces,
using state machines.
Otherwise we would need yet more lock events for that, which is going
to add even more overhead.
Right, well you could look at adding a mode that also strips out the
held_lock tracing, but since you really need the register class stuff to
re-generate the class mapping, avoiding the held_lock tracing doesn't
look like it's going to save you much; it's all thread local storage.
One thing we could do is add a PERF_TYPE_COLLECTION and have its read()
method do something other than what regular counters do; that way we
could use the read()/lseek()/stat() fd interface. The only trouble with
that is that it doesn't use the 'regular' output buffer.
/me goes try and come up with yet another solution..
But if we draw a tree based representation (per-instance or per-class)
from perf lock, we'll already need to get the locking scenarios from
post processing, which induces the dependencies.
I don't think it will help much.
Thanks.
But that would be nice too. The drawback is indeed the separate
implementation/treatment compared to other events.
> /me goes try and come up with yet another solution..
Not sure there is another solution than the above or an inject/mmap
thing.
Actually, looking at lock_acquire/release/acquired/contended, they all
perform their job under a raw_local_irq_save() window, so it doesn't
seem we are losing anything.
Something that could be nice though: drop this irq saving from the
main window, add the perf-style recursion protection, and eventually
take raw_local_irq_save() only when we grab internal lockdep locks.
This would let irqs happen during the lockdep checks.
I just suspect the irq-disabled section is also protecting something
else, something I'll probably discover after testing that :)