[RFC PATCH 0/4] perf/core: Small event scheduling changes

Frederic Weisbecker

unread,

Nov 8, 2009, 3:20:02 PM11/8/09

to

Hi,

This is an rfc patchset, only compile tested just to ensure I'm taking
a good direction before going ahead.

This is intended to rework a bit the perf event scheduling to
guarantee a real priority of the pinned events over the volatile
ones. This patchset handles such priority on task tick time
only. But if the idea is agreed, I could expand that to every
task event sched-in calls to guarantee the priority in every
event rescheduling time.

Thanks.

Frederic Weisbecker (4):
perf/core: split context's event group list into pinned and
non-pinned lists
perf/core: Optimize a bit rotate_ctx()
perf/core: Split up pinned and non pinned processing
perf/core: Schedule every pinned events before the the non-pinned

include/linux/perf_event.h | 3 +-
kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
2 files changed, 212 insertions(+), 74 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Frederic Weisbecker

unread,

Nov 8, 2009, 3:20:02 PM11/8/09

to

Split up pinned and non-pinned events processing in two helpers
so that it's more flexible to handle them seperately.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
---
kernel/perf_event.c | 51 +++++++++++++++++++++++++++++++++++----------------
1 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 0432c1c..50f2997 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1246,25 +1246,11 @@ static void perf_event_cpu_sched_out(struct perf_cpu_context *cpuctx)
}

static void
-__perf_event_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx, int cpu)
+__perf_event_sched_in_pinned(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
{
struct perf_event *event;
- int can_add_hw = 1;
-
- spin_lock(&ctx->lock);
- ctx->is_active = 1;
- if (likely(!ctx->nr_events))
- goto out;
-
- ctx->timestamp = perf_clock();
-
- perf_disable();

- /*
- * First go through the list and put on any pinned groups
- * in order to give them the best chance of going on.
- */
list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
if (event->state <= PERF_EVENT_STATE_OFF)
continue;
@@ -1283,6 +1269,14 @@ __perf_event_sched_in(struct perf_event_context *ctx,
event->state = PERF_EVENT_STATE_ERROR;
}
}
+}
+
+static void
+__perf_event_sched_in_volatile(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ int can_add_hw = 1;
+ struct perf_event *event;

list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
/*
@@ -1303,6 +1297,31 @@ __perf_event_sched_in(struct perf_event_context *ctx,
if (group_sched_in(event, cpuctx, ctx, cpu))
can_add_hw = 0;
}
+}
+
+static void
+__perf_event_sched_in(struct perf_event_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ spin_lock(&ctx->lock);
+ ctx->is_active = 1;
+ if (likely(!ctx->nr_events))
+ goto out;
+
+ ctx->timestamp = perf_clock();
+
+ perf_disable();
+
+ /*
+ * First go through the list and put on any pinned groups
+ * in order to give them the best chance of going on.
+ */
+ __perf_event_sched_in_pinned(ctx, cpuctx, cpu);
+
+ /* Then handle the non-pinned groups */
+ __perf_event_sched_in_volatile(ctx, cpuctx, cpu);
+
+
perf_enable();
out:
spin_unlock(&ctx->lock);
--
1.6.2.3

Frederic Weisbecker

unread,

Nov 8, 2009, 3:20:03 PM11/8/09

to

Don't use list_for_each_entry() just to swap the first and
last entry in the list.

Instead, use a direct list->next derefencing. This saves up a
useless prefetch.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
---

kernel/perf_event.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index b3a31c8..0432c1c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1432,7 +1432,7 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx)
*/
static void rotate_ctx(struct perf_event_context *ctx)
{
- struct perf_event *event;
+ struct list_head *list;

if (!ctx->nr_events)
return;
@@ -1442,15 +1442,17 @@ static void rotate_ctx(struct perf_event_context *ctx)
* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();
- list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->pinned_grp_list);
- break;
+
+ if (!list_empty(&ctx->pinned_grp_list)) {
+ list = ctx->pinned_grp_list.next;
+ list_move_tail(list, &ctx->pinned_grp_list);
}

- list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->volatile_grp_list);
- break;
+ if (!list_empty(&ctx->volatile_grp_list)) {
+ list = ctx->volatile_grp_list.next;
+ list_move_tail(list, &ctx->volatile_grp_list);
}
+
perf_enable();

spin_unlock(&ctx->lock);
--
1.6.2.3

Frederic Weisbecker

unread,

Nov 8, 2009, 3:20:03 PM11/8/09

to

Currently, the order to schedule events is as follows:

- cpu context pinned events
- cpu context non-pinned events
- task context pinned events
- task context non-pinned events

What we want instead is to schedule every pinned events first because
those have a higher priority.

This is what does this patch in each task tick. If the approach is
agreed, we may want to expand this to task-only sched in (where the
cpu events are not sched out), fork, exec, etc... So that we guarantee
the pinned priority every time the task is scheduled in.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
---

kernel/perf_event.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 50f2997..f32aec4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1327,6 +1327,41 @@ __perf_event_sched_in(struct perf_event_context *ctx,
spin_unlock(&ctx->lock);
}

+static void
+__perf_event_sched_in_all(struct perf_event_context *ctx,

+ struct perf_cpu_context *cpuctx, int cpu)
+{

+ struct perf_event_context *cpu_ctx = &cpuctx->ctx;
+
+ /* May require different classes between cpu and task lock */
+ spin_lock(&cpu_ctx->lock);
+ spin_lock(&ctx->lock);
+ cpu_ctx->is_active = ctx->is_active = 1;
+
+ ctx->timestamp = cpu_ctx->timestamp = perf_clock();
+
+ perf_disable();
+
+ if (cpu_ctx->nr_events)
+ __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
+
+ if (ctx->nr_events)
+ __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
+
+ if (cpu_ctx->nr_events)
+ __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
+
+ if (ctx->nr_events)
+ __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
+
+ cpuctx->task_ctx = ctx;
+
+ perf_enable();
+
+ spin_unlock(&ctx->lock);
+ spin_lock(&cpu_ctx->lock);
+}
+
/*
* Called from scheduler to add the events of the current task
* with interrupts disabled.
@@ -1477,6 +1512,16 @@ static void rotate_ctx(struct perf_event_context *ctx)
spin_unlock(&ctx->lock);
}

+static void
+perf_event_sched_in_all(struct perf_event_context *ctx,

+ struct perf_cpu_context *cpuctx, int cpu)
+{

+ if (!ctx || ctx == cpuctx->task_ctx)
+ perf_event_cpu_sched_in(cpuctx, cpu);
+ else
+ __perf_event_sched_in_all(ctx, cpuctx, cpu);
+}
+
void perf_event_task_tick(struct task_struct *curr, int cpu)
{
struct perf_cpu_context *cpuctx;
@@ -1500,9 +1545,7 @@ void perf_event_task_tick(struct task_struct *curr, int cpu)
if (ctx)
rotate_ctx(ctx);

- perf_event_cpu_sched_in(cpuctx, cpu);
- if (ctx)
- perf_event_task_sched_in(curr, cpu);
+ perf_event_sched_in_all(ctx, cpuctx, cpu);
}

static void __perf_event_enable_on_exec(struct perf_event *event,
--
1.6.2.3

Frederic Weisbecker

unread,

Nov 8, 2009, 3:20:02 PM11/8/09

to

Split-up struct perf_event_context::group_list into pinned_grp_list
and volatile_grp_list (non-pinned).

This first appears to be useless as it duplicates various loops around
the group list handlings.

But it scales better in the fast-path in perf_sched_in(). We don't
anymore iterate twice through the entire list to separate pinned and
non-pinned scheduling. Instead we interate through two distinct lists.

The another desired effect is that it makes easier the distinct
scheduling rules for both.

Signed-off-by: Frederic Weisbecker <fwei...@gmail.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
---

include/linux/perf_event.h | 3 +-
kernel/perf_event.c | 177 +++++++++++++++++++++++++++++++-------------
2 files changed, 127 insertions(+), 53 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6ff7c3b..659351c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -662,7 +662,8 @@ struct perf_event_context {
*/
struct mutex mutex;

- struct list_head group_list;
+ struct list_head pinned_grp_list;
+ struct list_head volatile_grp_list;
struct list_head event_list;
int nr_events;
int nr_active;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 6f4ed3b..b3a31c8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -259,9 +259,15 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
* add it straight to the context's event list, or to the group
* leader's sibling list:
*/
- if (group_leader == event)
- list_add_tail(&event->group_entry, &ctx->group_list);
- else {
+ if (group_leader == event) {
+ struct list_head *list;
+
+ if (event->attr.pinned)
+ list = &ctx->pinned_grp_list;
+ else
+ list = &ctx->volatile_grp_list;
+ list_add_tail(&event->group_entry, list);
+ } else {
list_add_tail(&event->group_entry, &group_leader->sibling_list);
group_leader->nr_siblings++;
}
@@ -299,8 +305,14 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
* to the context list directly:
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
+ struct list_head *list;
+
+ if (sibling->attr.pinned)
+ list = &ctx->pinned_grp_list;
+ else
+ list = &ctx->volatile_grp_list;

- list_move_tail(&sibling->group_entry, &ctx->group_list);
+ list_move_tail(&sibling->group_entry, list);
sibling->group_leader = sibling;
}
}
@@ -1032,10 +1044,14 @@ void __perf_event_sched_out(struct perf_event_context *ctx,
update_context_time(ctx);

perf_disable();
- if (ctx->nr_active)
- list_for_each_entry(event, &ctx->group_list, group_entry)
+ if (ctx->nr_active) {
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry)
group_sched_out(event, cpuctx, ctx);

+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry)
+ group_sched_out(event, cpuctx, ctx);

+ }
+
perf_enable();
out:
spin_unlock(&ctx->lock);

@@ -1249,9 +1265,8 @@ __perf_event_sched_in(struct perf_event_context *ctx,

* First go through the list and put on any pinned groups

* in order to give them the best chance of going on.

*/
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (event->state <= PERF_EVENT_STATE_OFF ||
- !event->attr.pinned)
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;
if (event->cpu != -1 && event->cpu != cpu)
continue;
@@ -1269,13 +1284,12 @@ __perf_event_sched_in(struct perf_event_context *ctx,
}
}

- list_for_each_entry(event, &ctx->group_list, group_entry) {
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
/*
* Ignore events in OFF or ERROR state, and
* ignore pinned events since we did them already.
*/
- if (event->state <= PERF_EVENT_STATE_OFF ||
- event->attr.pinned)
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;

/*
@@ -1428,8 +1442,13 @@ static void rotate_ctx(struct perf_event_context *ctx)

* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();

- list_for_each_entry(event, &ctx->group_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->group_list);
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->pinned_grp_list);
+ break;
+ }
+
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->volatile_grp_list);
break;
}
perf_enable();
@@ -1465,6 +1484,22 @@ void perf_event_task_tick(struct task_struct *curr, int cpu)
perf_event_task_sched_in(curr, cpu);
}

+static void __perf_event_enable_on_exec(struct perf_event *event,
+ struct perf_event_context *ctx,
+ int *enabled)
+{
+ if (!event->attr.enable_on_exec)
+ return;
+
+ event->attr.enable_on_exec = 0;
+ if (event->state >= PERF_EVENT_STATE_INACTIVE)
+ return;
+
+ __perf_event_mark_enabled(event, ctx);
+
+ *enabled = 1;
+}
+
/*
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
@@ -1485,15 +1520,11 @@ static void perf_event_enable_on_exec(struct task_struct *task)

spin_lock(&ctx->lock);

- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (!event->attr.enable_on_exec)
- continue;
- event->attr.enable_on_exec = 0;
- if (event->state >= PERF_EVENT_STATE_INACTIVE)
- continue;
- __perf_event_mark_enabled(event, ctx);
- enabled = 1;
- }
+ list_for_each_entry(event, &ctx->pinned_grp_list, group_entry)
+ __perf_event_enable_on_exec(event, ctx, &enabled);
+
+ list_for_each_entry(event, &ctx->volatile_grp_list, group_entry)
+ __perf_event_enable_on_exec(event, ctx, &enabled);

/*
* Unclone this context if we enabled any event.
@@ -1562,7 +1593,8 @@ __perf_event_init_context(struct perf_event_context *ctx,
memset(ctx, 0, sizeof(*ctx));
spin_lock_init(&ctx->lock);
mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->group_list);
+ INIT_LIST_HEAD(&ctx->pinned_grp_list);
+ INIT_LIST_HEAD(&ctx->volatile_grp_list);
INIT_LIST_HEAD(&ctx->event_list);
atomic_set(&ctx->refcount, 1);
ctx->task = task;
@@ -4869,7 +4901,11 @@ void perf_event_exit_task(struct task_struct *child)
mutex_lock_nested(&child_ctx->mutex, SINGLE_DEPTH_NESTING);

again:
- list_for_each_entry_safe(child_event, tmp, &child_ctx->group_list,
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->pinned_grp_list,
+ group_entry)
+ __perf_event_exit_task(child_event, child_ctx, child);
+
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->volatile_grp_list,
group_entry)
__perf_event_exit_task(child_event, child_ctx, child);

@@ -4878,7 +4914,8 @@ again:
* its siblings to the list, but we obtained 'tmp' before that which
* will still point to the list head terminating the iteration.
*/
- if (!list_empty(&child_ctx->group_list))
+ if (!list_empty(&child_ctx->pinned_grp_list) ||
+ !list_empty(&child_ctx->volatile_grp_list))
goto again;

mutex_unlock(&child_ctx->mutex);
@@ -4886,6 +4923,24 @@ again:
put_ctx(child_ctx);
}

+static void perf_event_free_event(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *parent = event->parent;
+
+ if (WARN_ON_ONCE(!parent))
+ return;
+
+ mutex_lock(&parent->child_mutex);
+ list_del_init(&event->child_list);
+ mutex_unlock(&parent->child_mutex);
+
+ fput(parent->filp);
+
+ list_del_event(event, ctx);
+ free_event(event);
+}
+
/*
* free an unexposed, unused context as created by inheritance by
* init_task below, used by fork() in case of fail.
@@ -4900,23 +4955,15 @@ void perf_event_free_task(struct task_struct *task)

mutex_lock(&ctx->mutex);
again:
- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry) {
- struct perf_event *parent = event->parent;
-
- if (WARN_ON_ONCE(!parent))
- continue;
-
- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_grp_list, group_entry)
+ perf_event_free_event(event, ctx);

- fput(parent->filp);
-
- list_del_event(event, ctx);
- free_event(event);
- }
+ list_for_each_entry_safe(event, tmp, &ctx->volatile_grp_list,
+ group_entry)
+ perf_event_free_event(event, ctx);

- if (!list_empty(&ctx->group_list))
+ if (!list_empty(&ctx->pinned_grp_list) ||
+ !list_empty(&ctx->volatile_grp_list))
goto again;

mutex_unlock(&ctx->mutex);
@@ -4924,6 +4971,29 @@ again:
put_ctx(ctx);
}

+static int
+perf_event_inherit(struct perf_event *event, struct task_struct *parent,
+ struct perf_event_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_event_context *child_ctx,
+ int *inherited_all)
+{
+ int ret;
+
+ if (!event->attr.inherit) {
+ *inherited_all = 0;
+ return 0;
+ }
+
+ ret = inherit_group(event, parent, parent_ctx,
+ child, child_ctx);
+ if (ret)
+ *inherited_all = 0;
+
+ return ret;
+}
+
+
/*
* Initialize the perf_event context in task_struct
*/
@@ -4981,19 +5051,20 @@ int perf_event_init_task(struct task_struct *child)
* We dont have to disable NMIs - we are only looking at
* the list, not manipulating it:
*/
- list_for_each_entry(event, &parent_ctx->group_list, group_entry) {
+ list_for_each_entry(event, &parent_ctx->pinned_grp_list, group_entry) {

- if (!event->attr.inherit) {
- inherited_all = 0;
- continue;
- }
+ ret = perf_event_inherit(event, parent, parent_ctx, child,
+ child_ctx, &inherited_all);
+ if (ret)
+ break;
+ }
+
+ list_for_each_entry(event, &parent_ctx->volatile_grp_list, group_entry) {

- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
- if (ret) {
- inherited_all = 0;
+ ret = perf_event_inherit(event, parent, parent_ctx, child,
+ child_ctx, &inherited_all);
+ if (ret)
break;
- }
}

if (inherited_all) {
@@ -5044,7 +5115,9 @@ static void __perf_event_exit_cpu(void *info)
struct perf_event_context *ctx = &cpuctx->ctx;
struct perf_event *event, *tmp;

- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry)
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_grp_list, group_entry)
+ __perf_event_remove_from_context(event);
+ list_for_each_entry_safe(event, tmp, &ctx->volatile_grp_list, group_entry)
__perf_event_remove_from_context(event);
}
static void perf_event_exit_cpu(int cpu)
--
1.6.2.3

Ingo Molnar

unread,

Nov 10, 2009, 12:20:01 AM11/10/09

to

* Frederic Weisbecker <fwei...@gmail.com> wrote:

> Hi,
>
> This is an rfc patchset, only compile tested just to ensure I'm taking
> a good direction before going ahead.
>
> This is intended to rework a bit the perf event scheduling to
> guarantee a real priority of the pinned events over the volatile ones.
> This patchset handles such priority on task tick time only. But if the
> idea is agreed, I could expand that to every task event sched-in calls
> to guarantee the priority in every event rescheduling time.
>
> Thanks.
>
> Frederic Weisbecker (4):
> perf/core: split context's event group list into pinned and
> non-pinned lists
> perf/core: Optimize a bit rotate_ctx()
> perf/core: Split up pinned and non pinned processing
> perf/core: Schedule every pinned events before the the non-pinned
>
> include/linux/perf_event.h | 3 +-
> kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
> 2 files changed, 212 insertions(+), 74 deletions(-)

Sans the small naming suggestions i had, the general principle looks
good to me - it's a nice restructuring of the various scheduling rules
we have for events.

Ingo

Ingo Molnar

unread,

Nov 10, 2009, 12:20:02 AM11/10/09

to

* Frederic Weisbecker <fwei...@gmail.com> wrote:

> Split up pinned and non-pinned events processing in two helpers
> so that it's more flexible to handle them seperately.

> +static void

> +__perf_event_sched_in_volatile(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx, int cpu)

Small naming suggestion: 'volatile' is a C keyword and rarely used
outside of that context in the kernel, which makes this function name a
bit confusing.

So instead of pinned/volatile, a pinned/flexible naming would be more
readable, i.e. __perf_event_sched_in_flexible() or so.

Also, most of the static functions in kernel/perf_event.c could lose
their perf_event_ prefix - we already know it's a perf thing, right?
That will shorten quite a few function names there.

These functions would turn into __sched_in_pinned()/__sched_in_flexible().

Agreed?

Ingo

Frederic Weisbecker

unread,

Nov 10, 2009, 4:40:02 AM11/10/09

to

On Tue, Nov 10, 2009 at 06:11:41AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <fwei...@gmail.com> wrote:
>
> > Split up pinned and non-pinned events processing in two helpers
> > so that it's more flexible to handle them seperately.
>
> > +static void
> > +__perf_event_sched_in_volatile(struct perf_event_context *ctx,
> > + struct perf_cpu_context *cpuctx, int cpu)
>
> Small naming suggestion: 'volatile' is a C keyword and rarely used
> outside of that context in the kernel, which makes this function name a
> bit confusing.
>
> So instead of pinned/volatile, a pinned/flexible naming would be more
> readable, i.e. __perf_event_sched_in_flexible() or so.

Right, also that makes it consistent with the hw-breakpoint constraints
naming.

> Also, most of the static functions in kernel/perf_event.c could lose
> their perf_event_ prefix - we already know it's a perf thing, right?
> That will shorten quite a few function names there.
>
> These functions would turn into __sched_in_pinned()/__sched_in_flexible().
>
> Agreed?

Totally.

I'll prepare a new iteration, thanks!

Frederic Weisbecker

unread,

Nov 10, 2009, 4:50:02 AM11/10/09

to

On Tue, Nov 10, 2009 at 10:41:45AM +0100, Frederic Weisbecker wrote:
> Thanks.
>
> With this draft, it makes the pinned priority more consistent
> with its purpose but it doesn't yet bring the full pinned over
> flexible priority determinism.
>
> It does apply the priority in tick time, while we round robin.
> I did that there first so that it covers most of the events
> rescheduling actions and also it doesn't bring much more
> overhead over the previous layout (in theory), it just changes
> the order.
>
> I'll also try to expand the priority constraint each time we
> sched in a task: when we schedule a new task that belongs to
> a new context, we don't schedule out/in the cpu context but
> that will be needed if we want the full priority determinism.

To lower the overhead at non-tick time, we could even just reschedule
the cpu flexible events. Anyway...

>
> Anyway, I'll do that progressively.
>
> Frederic.

Frederic Weisbecker

unread,

Nov 10, 2009, 4:50:03 AM11/10/09

to

On Tue, Nov 10, 2009 at 06:13:21AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <fwei...@gmail.com> wrote:
>
> > Hi,
> >
> > This is an rfc patchset, only compile tested just to ensure I'm taking
> > a good direction before going ahead.
> >
> > This is intended to rework a bit the perf event scheduling to
> > guarantee a real priority of the pinned events over the volatile ones.
> > This patchset handles such priority on task tick time only. But if the
> > idea is agreed, I could expand that to every task event sched-in calls
> > to guarantee the priority in every event rescheduling time.
> >
> > Thanks.
> >
> > Frederic Weisbecker (4):
> > perf/core: split context's event group list into pinned and
> > non-pinned lists
> > perf/core: Optimize a bit rotate_ctx()
> > perf/core: Split up pinned and non pinned processing
> > perf/core: Schedule every pinned events before the the non-pinned
> >
> > include/linux/perf_event.h | 3 +-
> > kernel/perf_event.c | 283 ++++++++++++++++++++++++++++++++-----------
> > 2 files changed, 212 insertions(+), 74 deletions(-)
>
> Sans the small naming suggestions i had, the general principle looks
> good to me - it's a nice restructuring of the various scheduling rules
> we have for events.
>
> Ingo

Thanks.

With this draft, it makes the pinned priority more consistent
with its purpose but it doesn't yet bring the full pinned over
flexible priority determinism.

It does apply the priority in tick time, while we round robin.
I did that there first so that it covers most of the events
rescheduling actions and also it doesn't bring much more
overhead over the previous layout (in theory), it just changes
the order.

I'll also try to expand the priority constraint each time we
sched in a task: when we schedule a new task that belongs to
a new context, we don't schedule out/in the cpu context but
that will be needed if we want the full priority determinism.

Anyway, I'll do that progressively.

Frederic.

--

Peter Zijlstra

unread,

Nov 10, 2009, 5:20:02 AM11/10/09

to

On Sun, 2009-11-08 at 21:13 +0100, Frederic Weisbecker wrote:

> +static void
> +__perf_event_sched_in_all(struct perf_event_context *ctx,
> + struct perf_cpu_context *cpuctx, int cpu)
> +{
> + struct perf_event_context *cpu_ctx = &cpuctx->ctx;
> +
> + /* May require different classes between cpu and task lock */
> + spin_lock(&cpu_ctx->lock);
> + spin_lock(&ctx->lock);

Would be good to know for sure, running with lockdep enabled ought to
tell you that pretty quick ;-)

> + cpu_ctx->is_active = ctx->is_active = 1;
> +
> + ctx->timestamp = cpu_ctx->timestamp = perf_clock();
> +
> + perf_disable();
> +
> + if (cpu_ctx->nr_events)
> + __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
> +
> + if (ctx->nr_events)
> + __perf_event_sched_in_pinned(cpu_ctx, cpuctx, cpu);
> +
> + if (cpu_ctx->nr_events)
> + __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
> +
> + if (ctx->nr_events)
> + __perf_event_sched_in_volatile(cpu_ctx, cpuctx, cpu);
> +
> + cpuctx->task_ctx = ctx;
> +
> + perf_enable();
> +
> + spin_unlock(&ctx->lock);
> + spin_lock(&cpu_ctx->lock);

I'm pretty sure that ought to be spin_unlock() ;-)

> +}

Like Ingo I don't really like the volatile name.

Can't we simply have 2 lists per cpu a pinned and normal list, and first
schedule all the pinned and RR the normal events?

I guess one could implement that by adding the task context events to
the cpu context events on sched_in and removing them on sched_out. That
would clear up a lot of funny scheduling details.

Frederic Weisbecker

unread,

Nov 10, 2009, 5:40:03 AM11/10/09

to

On Tue, Nov 10, 2009 at 11:10:13AM +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-08 at 21:13 +0100, Frederic Weisbecker wrote:
>
> > +static void
> > +__perf_event_sched_in_all(struct perf_event_context *ctx,
> > + struct perf_cpu_context *cpuctx, int cpu)
> > +{
> > + struct perf_event_context *cpu_ctx = &cpuctx->ctx;
> > +
> > + /* May require different classes between cpu and task lock */
> > + spin_lock(&cpu_ctx->lock);
> > + spin_lock(&ctx->lock);
>
> Would be good to know for sure, running with lockdep enabled ought to
> tell you that pretty quick ;-)

That's about sure I guess :)
I just wanted to take care of that after your comments.

Indeed :)

> > +}
>
>
> Like Ingo I don't really like the volatile name.
>
> Can't we simply have 2 lists per cpu a pinned and normal list, and first
> schedule all the pinned and RR the normal events?
>
> I guess one could implement that by adding the task context events to
> the cpu context events on sched_in and removing them on sched_out. That
> would clear up a lot of funny scheduling details.

I thought about doing that, but didn't expand the idea that much,
because of the list manipulation that induces.

But you're right, that would be be indeed more proper.
I can just save the "real" cpu event group tail in the
struct perf_cpu_context so that I can keep track of the real
state and (un)glue the queues easily.

Yeah, I'll try that, thanks!