
[PATCH 1/9] sched: Remove cpu parameter for idle_balance()


Peter Zijlstra

Jan 21, 2014, 6:30:02 AM
daniel_lezcano-1_sched-remove__cpu__parameter_for_idle_balance.patch

Peter Zijlstra

Jan 21, 2014, 6:30:02 AM
Hi,

A set of recent patches by Daniel reminded me of a series I did a while back.
So here's a combined set.

Paul, Ben, could you guys have a careful look at the one optimizing
pick_next_task_fair() (patch 8)? That patch is somewhat big and scary,
although it does appear to do what it says.

I've not yet tested the bandwidth aspect at all.


Peter Zijlstra

Jan 21, 2014, 6:40:01 AM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_2.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_3.patch

Peter Zijlstra

Jan 21, 2014, 6:40:01 AM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_1.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
daniel_lezcano-2_sched-fix_race_in_idle_balance.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_4.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
daniel_lezcano-3_sched-move_idle_stamp_up_to_the_core.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
peter_zijlstra-2_sched-use_idle_task_shortcut.patch

Peter Zijlstra

Jan 21, 2014, 6:40:02 AM
peter_zijlstra-sched-clean_up_idle_task_smp_logic.patch

Vincent Guittot

Jan 21, 2014, 12:30:03 PM
On 21 January 2014 12:17, Peter Zijlstra <pet...@infradead.org> wrote:
> The idle post_schedule hook is just a vile waste of time, fix it proper.
>
> Cc: alex...@linaro.org
> Cc: mi...@kernel.org
> Cc: Vincent Guittot <vincent...@linaro.org>
> Tested-by: Daniel Lezcano <daniel....@linaro.org>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> Link: http://lkml.kernel.org/r/20140117140...@laptop.programming.kicks-ass.net
> ---
> kernel/sched/fair.c | 5 +++--
> kernel/sched/idle_task.c | 21 ++++++---------------
> 2 files changed, 9 insertions(+), 17 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2416,7 +2416,8 @@ void idle_exit_fair(struct rq *this_rq)
> update_rq_runnable_avg(this_rq, 0);
> }
>
> -#else
> +#else /* CONFIG_SMP */
> +
> static inline void update_entity_load_avg(struct sched_entity *se,
> int update_cfs_rq) {}
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
> @@ -2428,7 +2429,7 @@ static inline void dequeue_entity_load_a
> int sleep) {}
> static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> int force_update) {}
> -#endif
> +#endif /* CONFIG_SMP */
>
> static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -13,18 +13,8 @@ select_task_rq_idle(struct task_struct *
> {
> return task_cpu(p); /* IDLE tasks as never migrated */
> }
> -
> -static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
> -{
> - idle_exit_fair(rq);
> - rq_last_tick_reset(rq);
> -}
> -
> -static void post_schedule_idle(struct rq *rq)
> -{
> - idle_enter_fair(rq);
> -}
> #endif /* CONFIG_SMP */
> +
> /*
> * Idle tasks are unconditionally rescheduled:
> */
> @@ -37,8 +27,7 @@ static struct task_struct *pick_next_tas
> {
> schedstat_inc(rq, sched_goidle);
> #ifdef CONFIG_SMP
> - /* Trigger the post schedule to do an idle_enter for CFS */
> - rq->post_schedule = 1;
> + idle_enter_fair(rq);

If I have correctly followed the new function path introduced by the
patchset, idle_enter_fair is called after idle_balance, whereas it must be
called before it in order to update the runnable_avg_sum/period of the rq
before evaluating the interest of pulling a cfs task.

Regards,
vincent

> #endif
> return rq->idle;
> }
> @@ -58,6 +47,10 @@ dequeue_task_idle(struct rq *rq, struct
>
> static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> {
> +#ifdef CONFIG_SMP
> + idle_exit_fair(rq);
> + rq_last_tick_reset(rq);
> +#endif
> }
>
> static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
> @@ -101,8 +94,6 @@ const struct sched_class idle_sched_clas
>
> #ifdef CONFIG_SMP
> .select_task_rq = select_task_rq_idle,
> - .pre_schedule = pre_schedule_idle,
> - .post_schedule = post_schedule_idle,
> #endif
>
> .set_curr_task = set_curr_task_idle,

bse...@google.com

Jan 21, 2014, 2:30:01 PM
Peter Zijlstra <pet...@infradead.org> writes:
> static struct task_struct *
> pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> {
> + struct sched_entity *se, __maybe_unused *pse;
> struct task_struct *p;
> - struct cfs_rq *cfs_rq = &rq->cfs;
> - struct sched_entity *se;
> + struct cfs_rq *cfs_rq;
> +
> +again: __maybe_unused
> + cfs_rq = &rq->cfs;
> +
> + if (prev) {
> + if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
> + (prev->sched_class != &fair_sched_class)) {
> + prev->sched_class->put_prev_task(rq, prev);
> + prev = NULL;
> + }
> + }
>
> if (!cfs_rq->nr_running)
> return NULL;
>
> - if (prev)
> - prev->sched_class->put_prev_task(rq, prev);
> -
> do {
> se = pick_next_entity(cfs_rq);
> - set_next_entity(cfs_rq, se);
> + if (!prev)
> + set_next_entity(cfs_rq, se);
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
> p = task_of(se);
> - if (hrtick_enabled(rq))
> - hrtick_start_fair(rq, p);
>
> - return p;
> -}
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + /*
> + * If we haven't yet done put_prev_entity and the selected task is
> + * a different task than we started out with, try and touch the least
> + * amount of cfs_rq trees.
> + */
> + if (prev) {
> + if (prev != p) {
> + pse = &prev->se;
> +
> + while (!(cfs_rq = is_same_group(se, pse))) {
> + int se_depth = se->depth;
> + int pse_depth = pse->depth;
> +
> + if (se_depth <= pse_depth) {
> + put_prev_entity(cfs_rq_of(pse), pse);
> + pse = parent_entity(pse);
> + }
> + if (se_depth >= pse_depth) {
> + set_next_entity(cfs_rq_of(se), se);
> + se = parent_entity(se);
> + }
> + }
>
> -/*
> - * Account for a descheduled task:
> - */
> -static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
> -{
> - struct sched_entity *se = &prev->se;
> - struct cfs_rq *cfs_rq;
> + put_prev_entity(cfs_rq, pse);
> + set_next_entity(cfs_rq, se);
> + }
>
> - for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> - put_prev_entity(cfs_rq, se);
> + /*
> + * In case the common cfs_rq got throttled, just give up and
> + * put the stack and retry.
> + */
> + if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> + put_prev_task_fair(rq, p);
> + prev = NULL;
> + goto again;
> + }

This double-calls put_prev_entity on any non-common cfs_rqs and ses,
which means double __enqueue_entity, among other things. Just doing the
put_prev loop from se->parent should fix that.

However, any sort of abort means that we may have already done
set_next_entity on some children, which even with the changes to
pick_next_entity will cause problems, up to and including double
__dequeue_entity I think.

Also, this way we never do check_cfs_rq_runtime on any parents of the
common cfs_rq, which could even have been the reason for the resched to
begin with. I'm not sure if there would be any problem doing it on the
way down or not, I don't see any problems at a glance.



> }
> +#endif
> +
> + if (hrtick_enabled(rq))
> + hrtick_start_fair(rq, p);
> +
> + return p;
> }
>
> /*

Peter Zijlstra

Jan 21, 2014, 2:40:02 PM
> > + put_prev_entity(cfs_rq, pse);
> > + set_next_entity(cfs_rq, se);
> > + }

(A)

> > + /*
> > + * In case the common cfs_rq got throttled, just give up and
> > + * put the stack and retry.
> > + */
> > + if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> > + put_prev_task_fair(rq, p);
> > + prev = NULL;
> > + goto again;
> > + }
>
> This double-calls put_prev_entity on any non-common cfs_rqs and ses,
> which means double __enqueue_entity, among other things. Just doing the
> put_prev loop from se->parent should fix that.

I'm not seeing that, so at point (A) we've completely switched over from
@prev to @p, we've put all pse until the common parent and set all se
back to @p.

So if we then do: put_prev_task_fair(rq, p), we simply undo all the
set_next_entity(se) we just did, and continue from the common parent
upwards.
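
To make that walk concrete, here's a tiny stand-alone toy model of it
(illustrative names only, obviously not the kernel code; put_prev()/set_next()
just print what the real put_prev_entity()/set_next_entity() would be called
on):

#include <stdio.h>
#include <string.h>

struct entity {
	const char *name;
	int depth;
	struct entity *parent;	/* entity owning the group above us */
	const char *cfs_rq;	/* name of the queue we are enqueued on */
};

static void put_prev(struct entity *e) { printf("put_prev %s (%s)\n", e->name, e->cfs_rq); }
static void set_next(struct entity *e) { printf("set_next %s (%s)\n", e->name, e->cfs_rq); }

/* same group == enqueued on the same cfs_rq; depth-0 entities all share the root */
static int same_group(struct entity *a, struct entity *b)
{
	return strcmp(a->cfs_rq, b->cfs_rq) == 0;
}

static void switch_entities(struct entity *pse, struct entity *se)
{
	while (!same_group(se, pse)) {
		int se_depth = se->depth, pse_depth = pse->depth;

		if (se_depth <= pse_depth) {	/* prev side deeper (or equal): put it */
			put_prev(pse);
			pse = pse->parent;
		}
		if (se_depth >= pse_depth) {	/* next side deeper (or equal): set it */
			set_next(se);
			se = se->parent;
		}
	}
	/* finally swap the two entities on the common queue */
	put_prev(pse);
	set_next(se);
}

int main(void)
{
	struct entity A  = { "A",  0, NULL, "root_cfs_rq" };	/* group A, root level */
	struct entity B  = { "B",  0, NULL, "root_cfs_rq" };	/* next task, root level */
	struct entity A1 = { "A1", 1, &A,   "A_cfs_rq" };	/* prev task, inside group A */

	switch_entities(&A1, &B);	/* prints: put_prev A1, put_prev A, set_next B */
	return 0;
}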

> However, any sort of abort means that we may have already done
> set_next_entity on some children, which even with the changes to
> pick_next_entity will cause problems, up to and including double
> __dequeue_entity I think.

But the abort is only done after we've completely set up @p as the
current task.

Yes, completely tearing it down again is probably a waste, but bandwidth
enforcement should be rare and I didn't want to complicate things even
further for such rare cases.

> Also, this way we never do check_cfs_rq_runtime on any parents of the
> common cfs_rq, which could even have been the reason for the resched to
> begin with. I'm not sure if there would be any problem doing it on the
> way down or not, I don't see any problems at a glance.

Oh, so we allow a parent to have less runtime than the sum of all its
children?

Indeed, in that case we can miss something... we could try to call
check_cfs_rq_runtime() from the initial top-down selection loop? When
true, just put the entire stack and don't pretend to be smart?

bse...@google.com

Jan 21, 2014, 3:10:03 PM
Ah, I missed that this was p, not prev. That makes a lot more sense, and
I agree that this should be fine.

>
>> Also, this way we never do check_cfs_rq_runtime on any parents of the
>> common cfs_rq, which could even have been the reason for the resched to
>> begin with. I'm not sure if there would be any problem doing it on the
>> way down or not, I don't see any problems at a glance.
>
> Oh, so we allow a parent to have less runtime than the sum of all its
> children?

Yeah, the check is currently max(children) <= parent, and unthrottled
children are also allowed.
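
A concrete illustration (numbers purely made up, cgroup-v1 knobs): each
child's ratio may equal the parent's, so a setup like

  parent:  cpu.cfs_quota_us = 10000   cpu.cfs_period_us = 100000
  child A: cpu.cfs_quota_us = 10000   cpu.cfs_period_us = 100000
  child B: cpu.cfs_quota_us = 10000   cpu.cfs_period_us = 100000

is accepted, yet A and B together can burn 20ms per 100ms period and drain
the parent's 10ms first, throttling the parent while neither child is
throttled itself -- which is exactly the case where only checking the
common cfs_rq misses the reason for the resched.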

>
> Indeed, in that case we can miss something... we could try to call
> check_cfs_rq_runtime() from the initial top-down selection loop? When
> true, just put the entire stack and don't pretend to be smart?

Yeah, I think that should work. I wasn't sure if there could be a problem
with doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but thinking
about it, that has to be ok, because schedule can already do that: if
deactivate throttled the parent, schedule calls update_rq_clock, and then
put_prev throttles the child.

Peter Zijlstra

Jan 21, 2014, 3:50:02 PM
On Tue, Jan 21, 2014 at 12:03:38PM -0800, bse...@google.com wrote:

> > Indeed, in that case we can miss something... we could try to call
> > check_cfs_rq_runtime() from the initial top-down selection loop? When
> > true, just put the entire stack and don't pretend to be smart?
>
> Yeah, I think that should work. I wasn't sure if there could be a
> problem of doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but
> thinking about, that has to be ok, because schedule can do that if
> deactivate throttled the parent, schedule calls update_rq_clock, and
> then put_prev throttled the child.

Maybe something like the below? Completely untested and everything.

There's the obvious XXX fail that was also in the previous version; not
sure how to deal with that yet: either we need to change the interface
to take struct task_struct **prev, or get smarter :-)

Any other obvious fails in here?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2736,17 +2736,36 @@ wakeup_preempt_entity(struct sched_entit
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
-static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+static struct sched_entity *
+pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
- struct sched_entity *se = __pick_first_entity(cfs_rq);
- struct sched_entity *left = se;
+ struct sched_entity *left = __pick_first_entity(cfs_rq);
+ struct sched_entity *se;
+
+ /*
+ * If curr is set we have to see if its left of the leftmost entity
+ * still in the tree, provided there was anything in the tree at all.
+ */
+ if (!left || (curr && entity_before(curr, left)))
+ left = curr;
+
+ se = left; /* ideally we run the leftmost entity */

/*
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
if (cfs_rq->skip == se) {
- struct sched_entity *second = __pick_next_entity(se);
+ struct sched_entity *second;
+
+ if (se == curr) {
+ second = __pick_first_entity(cfs_rq);
+ } else {
+ second = __pick_next_entity(se);
+ if (!second || (curr && entity_before(curr, second)))
+ second = curr;
+ }
+
if (second && wakeup_preempt_entity(second, left) < 1)
se = second;
}
@@ -2768,7 +2787,7 @@ static struct sched_entity *pick_next_en
return se;
}

-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
@@ -3423,22 +3442,23 @@ static void check_enqueue_throttle(struc
}

/* conditionally throttle active cfs_rq's from put_prev_entity() */
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
if (!cfs_bandwidth_used())
- return;
+ return false;

if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
- return;
+ return false;

/*
* it's possible for a throttled entity to be forced into a running
* state (e.g. set_curr_task), in this case we're finished.
*/
if (cfs_rq_throttled(cfs_rq))
- return;
+ return true;

throttle_cfs_rq(cfs_rq);
+ return true;
}

static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -3548,7 +3568,7 @@ static inline u64 cfs_rq_clock_task(stru
}

static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}

@@ -4484,26 +4504,109 @@ static void check_preempt_wakeup(struct
set_last_buddy(se);
}

+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+ struct sched_entity *se = &prev->se;
+ struct cfs_rq *cfs_rq;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ put_prev_entity(cfs_rq, se);
+ }
+}
+
static struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev)
{
- struct task_struct *p;
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
+ struct task_struct *p;

if (!cfs_rq->nr_running)
return NULL;

- if (prev)
+ if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
+ (prev->sched_class != &fair_sched_class)) {
prev->sched_class->put_prev_task(rq, prev);
+ prev = NULL;
+ }
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ if (!prev)
+ goto simple;

do {
- se = pick_next_entity(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+
+ /*
+ * Since we got here without doing put_prev_entity() we also
+ * have to consider cfs_rq->curr. If it is still a runnable
+ * entity, update_curr() will update its vruntime, otherwise
+ * forget we've ever seen it.
+ */
+ if (curr->on_rq)
+ update_curr(cfs_rq);
+ else
+ curr = NULL;
+
+ if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
+ put_prev_task_fair(rq, prev);
+ goto simple;
+ }
+
+ se = pick_next_entity(cfs_rq, curr);
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ p = task_of(se);
+
+ /*
+ * Since we haven't yet done put_prev_entity and if the selected task
+ * is a different task than we started out with, try and touch the
+ * least amount of cfs_rqs.
+ */
+ if (prev != p) {
+ struct sched_entity *pse = &prev->se;
+
+ while (!(cfs_rq = is_same_group(se, pse))) {
+ int se_depth = se->depth;
+ int pse_depth = pse->depth;
+
+ if (se_depth <= pse_depth) {
+ put_prev_entity(cfs_rq_of(pse), pse);
+ pse = parent_entity(pse);
+ }
+ if (se_depth >= pse_depth) {
+ set_next_entity(cfs_rq_of(se), se);
+ se = parent_entity(se);
+ }
+ }
+
+ put_prev_entity(cfs_rq, pse);
+ set_next_entity(cfs_rq, se);
+ }
+
+ return p;
+simple:
+#endif
+ cfs_rq = &rq->cfs;
+
+ if (!cfs_rq->nr_running) {
+ // XXX FAIL we should never return NULL after putting @prev
+ return NULL;
+ }
+
+ do {
+ se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

p = task_of(se);
+
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);

@@ -4511,20 +4614,6 @@ pick_next_task_fair(struct rq *rq, struc
}

/*
- * Account for a descheduled task:
- */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
-{
- struct sched_entity *se = &prev->se;
- struct cfs_rq *cfs_rq;
-
- for_each_sched_entity(se) {
- cfs_rq = cfs_rq_of(se);
- put_prev_entity(cfs_rq, se);
- }
-}
-
-/*
* sched_yield() is very simple
*
* The magic of dealing with the ->skip buddy is in pick_next_entity.

bse...@google.com

Jan 21, 2014, 4:50:03 PM
Peter Zijlstra <pet...@infradead.org> writes:

> In order to avoid having to do put/set on a whole cgroup hierarchy
> when we context switch, push the put into pick_next_task() so that
> both operations are in the same function. Further changes then allow
> us to possibly optimize away redundant work.
>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -23,8 +23,12 @@ static void check_preempt_curr_idle(stru
> resched_task(rq->idle);
> }
>
> -static struct task_struct *pick_next_task_idle(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_idle(struct rq *rq, struct task_struct *prev)
> {
> + if (prev)
> + prev->sched_class->put_prev_task(rq, prev);
> +
> schedstat_inc(rq, sched_goidle);
> #ifdef CONFIG_SMP
> idle_enter_fair(rq);
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
> {
> struct sched_rt_entity *rt_se;
> struct task_struct *p;
> - struct rt_rq *rt_rq;
> -
> - rt_rq = &rq->rt;
> -
> - if (!rt_rq->rt_nr_running)
> - return NULL;
> -
> - if (rt_rq_throttled(rt_rq))
> - return NULL;
> + struct rt_rq *rt_rq = &rq->rt;
>
> do {
> rt_se = pick_next_rt_entity(rq, rt_rq);
> @@ -1332,9 +1324,22 @@ static struct task_struct *_pick_next_ta
> return p;
> }
>
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev)
> {
> - struct task_struct *p = _pick_next_task_rt(rq);
> + struct task_struct *p;
> + struct rt_rq *rt_rq = &rq->rt;
> +
> + if (!rt_rq->rt_nr_running)
> + return NULL;
> +
> + if (rt_rq_throttled(rt_rq))
> + return NULL;
> +
> + if (prev)
> + prev->sched_class->put_prev_task(rq, prev);
> +
> + p = _pick_next_task_rt(rq);
>
> /* The running task is never eligible for pushing */
> if (p)
p is now always non-NULL, so this branch can now go (and it is important
that we can't fail after doing put).
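
Something like (just a sketch; dequeue_pushable_task() being what the old
'if (p)' branch calls, if I remember the current code right):

	p = _pick_next_task_rt(rq);

	/* The running task is never eligible for pushing */
	dequeue_pushable_task(rq, p);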

bse...@google.com

Jan 21, 2014, 5:00:02 PM
Peter Zijlstra <pet...@infradead.org> writes:

> On Tue, Jan 21, 2014 at 12:03:38PM -0800, bse...@google.com wrote:
>
>> > Indeed, in that case we can miss something... we could try to call
>> > check_cfs_rq_runtime() from the initial top-down selection loop? When
>> > true, just put the entire stack and don't pretend to be smart?
>>
>> Yeah, I think that should work. I wasn't sure if there could be a
>> problem of doing throttle_cfs_rq(parent); throttle_cfs_rq(child);, but
>> thinking about, that has to be ok, because schedule can do that if
>> deactivate throttled the parent, schedule calls update_rq_clock, and
>> then put_prev throttled the child.
>
> Maybe something like the below? Completely untested and everything.
>
> There's the obviuos XXX fail that was also in the previous version; not
> sure how to deal with that yet, either we need to change the interface
> to take struct task_struct **prev, or get smarter :-)
>
> Any other obvious fails in here?

prev can be NULL to start with, hrtick should be handled in both paths.
How about this on top of your patch (equally untested) to fix those and
the XXX? The double-check on nr_running is annoying, but necessary when
prev slept.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4528,14 +4528,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
if (!cfs_rq->nr_running)
return NULL;

- if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
- (prev->sched_class != &fair_sched_class)) {
- prev->sched_class->put_prev_task(rq, prev);
- prev = NULL;
- }
-
#ifdef CONFIG_FAIR_GROUP_SCHED
- if (!prev)
+ if (!prev || prev->sched_class != &fair_sched_class)
goto simple;

do {
@@ -4552,10 +4546,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
else
curr = NULL;

- if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
- put_prev_task_fair(rq, prev);
+ if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto simple;
- }

se = pick_next_entity(cfs_rq, curr);
cfs_rq = group_cfs_rq(se);
@@ -4589,15 +4581,19 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
set_next_entity(cfs_rq, se);
}

+ if (hrtick_enabled(rq))
+ hrtick_start_fair(rq, p);
+
return p;
simple:
#endif
cfs_rq = &rq->cfs;

- if (!cfs_rq->nr_running) {
- // XXX FAIL we should never return NULL after putting @prev
+ if (!cfs_rq->nr_running)
return NULL;
- }
+
+ if (prev)
+ prev->sched_class->put_prev_task(rq, prev);

do {
se = pick_next_entity(cfs_rq, NULL);

Peter Zijlstra

Jan 22, 2014, 1:10:02 PM
On Tue, Jan 21, 2014 at 01:43:32PM -0800, bse...@google.com wrote:
> prev can be NULL to start with, hrtick should be handled in both paths.
> How about this on top of your patch (equally untested) to fix those and
> the XXX? The double-check on nr_running is annoying, but necessary when
> prev slept.

Ah yes indeed. Let me build the lot and see if I can wreck it ;-)

Peter Zijlstra

Jan 23, 2014, 6:40:02 AM
On Tue, Jan 21, 2014 at 06:27:11PM +0100, Vincent Guittot wrote:
> On 21 January 2014 12:17, Peter Zijlstra <pet...@infradead.org> wrote:

> If i have correctly followed the new function path that is introduced
> by the patchset, idle_enter_fair is called after idle_balance whereas
> it must be called before in order to update the
> runnable_avg_sum/period of the rq before evaluating the interest of
> pulling cfs task

It's idle_exit_fair that's moved from pre_schedule to put_prev_task and
thus indeed has crossed idle_balance.

Yeah, I can leave that pre_schedule thing in place, however all that has
me thinking.

Ideally we'd do something like the below; but I must admit to still being
slightly confused about the idle_{enter,exit}_fair() calls; their comment
doesn't seem to clarify things.

Steve, I don't think I wrecked rt/deadline by pulling in the
pre_schedule call into pick_next_task(), but then, who knows :-)

The only thing I really don't like is having that double conditional for
the direct fair_sched_class call, but I didn't see a way around that,
other than dropping that entirely. Then again, we cut out a conditional
and indirect call by getting rid of pre_schedule() -- so it might just
balance out.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2146,13 +2146,6 @@ static void finish_task_switch(struct rq

#ifdef CONFIG_SMP

-/* assumes rq->lock is held */
-static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
-{
- if (prev->sched_class->pre_schedule)
- prev->sched_class->pre_schedule(rq, prev);
-}
-
/* rq->lock is NOT held, but preemption is disabled */
static inline void post_schedule(struct rq *rq)
{
@@ -2170,10 +2163,6 @@ static inline void post_schedule(struct

#else

-static inline void pre_schedule(struct rq *rq, struct task_struct *p)
-{
-}
-
static inline void post_schedule(struct rq *rq)
{
}
@@ -2571,7 +2560,8 @@ pick_next_task(struct rq *rq, struct tas
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
*/
- if (likely(rq->nr_running == rq->cfs.h_nr_running))
+ if (likely(prev->sched_class == &fair_sched_class &&
+ rq->nr_running == rq->cfs.h_nr_running))
return fair_sched_class.pick_next_task(rq, prev);

for_each_class(class) {
@@ -2581,14 +2571,6 @@ pick_next_task(struct rq *rq, struct tas
}
}

- /*
- * If there is a task balanced on this cpu, pick the next task,
- * otherwise fall in the optimization by picking the idle task
- * directly.
- */
- if (idle_balance(rq))
- goto again;
-
rq->idle_stamp = rq_clock(rq);

return idle_sched_class.pick_next_task(rq, prev);
@@ -2682,8 +2664,6 @@ static void __sched __schedule(void)
switch_count = &prev->nvcsw;
}

- pre_schedule(rq, prev);
-
if (prev->on_rq || rq->skip_clock_update < 0)
update_rq_clock(rq);

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -997,6 +997,9 @@ struct task_struct *pick_next_task_dl(st

dl_rq = &rq->dl;

+ if (dl_task(prev))
+ pull_dl_task(rq);
+
if (unlikely(!dl_rq->dl_nr_running))
return NULL;

@@ -1430,8 +1433,6 @@ static int pull_dl_task(struct rq *this_
static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
{
/* Try to pull other tasks here */
- if (dl_task(prev))
- pull_dl_task(rq);
}

static void post_schedule_dl(struct rq *rq)
@@ -1626,7 +1627,6 @@ const struct sched_class dl_sched_class
.set_cpus_allowed = set_cpus_allowed_dl,
.rq_online = rq_online_dl,
.rq_offline = rq_offline_dl,
- .pre_schedule = pre_schedule_dl,
.post_schedule = post_schedule_dl,
.task_woken = task_woken_dl,
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2581,7 +2581,8 @@ void idle_exit_fair(struct rq *this_rq)
update_rq_runnable_avg(this_rq, 0);
}

-#else
+#else /* CONFIG_SMP */
+
static inline void update_entity_load_avg(struct sched_entity *se,
int update_cfs_rq) {}
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
@@ -2593,7 +2594,7 @@ static inline void dequeue_entity_load_a
int sleep) {}
static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
int force_update) {}
-#endif
+#endif /* CONFIG_SMP */

static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -4700,10 +4701,11 @@ pick_next_task_fair(struct rq *rq, struc
struct sched_entity *se;
struct task_struct *p;

+again:
+#ifdef CONFIG_FAIR_GROUP_SCHED
if (!cfs_rq->nr_running)
- return NULL;
+ goto idle;

-#ifdef CONFIG_FAIR_GROUP_SCHED
if (prev->sched_class != &fair_sched_class)
goto simple;

@@ -4769,11 +4771,10 @@ pick_next_task_fair(struct rq *rq, struc

return p;
simple:
-#endif
cfs_rq = &rq->cfs;
-
+#endif
if (!cfs_rq->nr_running)
- return NULL;
+ goto idle;

prev->sched_class->put_prev_task(rq, prev);

@@ -4789,6 +4790,13 @@ pick_next_task_fair(struct rq *rq, struc
hrtick_start_fair(rq, p);

return p;
+
+idle:
+ idle_exit_fair();
+ if (idle_balance()) /* drops rq->lock */
+ goto again;
+
+ return NULL;
}

/*
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -13,18 +13,8 @@ select_task_rq_idle(struct task_struct *
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
-
-static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
-{
- idle_exit_fair(rq);
- rq_last_tick_reset(rq);
-}
-
-static void post_schedule_idle(struct rq *rq)
-{
- idle_enter_fair(rq);
-}
#endif /* CONFIG_SMP */
+
/*
* Idle tasks are unconditionally rescheduled:
*/
@@ -40,8 +30,7 @@ pick_next_task_idle(struct rq *rq, struc

schedstat_inc(rq, sched_goidle);
#ifdef CONFIG_SMP
- /* Trigger the post schedule to do an idle_enter for CFS */
- rq->post_schedule = 1;
+ idle_enter_fair(rq);
#endif
return rq->idle;
}
@@ -61,6 +50,10 @@ dequeue_task_idle(struct rq *rq, struct

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+#ifdef CONFIG_SMP
+ idle_exit_fair(rq);
+ rq_last_tick_reset(rq);
+#endif
}

static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
@@ -104,8 +97,6 @@ const struct sched_class idle_sched_clas

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_idle,
- .pre_schedule = pre_schedule_idle,
- .post_schedule = post_schedule_idle,
#endif

.set_curr_task = set_curr_task_idle,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1330,6 +1330,10 @@ pick_next_task_rt(struct rq *rq, struct
struct task_struct *p;
struct rt_rq *rt_rq = &rq->rt;

+ /* Try to pull RT tasks here if we lower this rq's prio */
+ if (rq->rt.highest_prio.curr > prev->prio)
+ pull_rt_task(rq);
+
if (!rt_rq->rt_nr_running)
return NULL;

@@ -1720,13 +1724,6 @@ static int pull_rt_task(struct rq *this_
return ret;
}

-static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
-{
- /* Try to pull RT tasks here if we lower this rq's prio */
- if (rq->rt.highest_prio.curr > prev->prio)
- pull_rt_task(rq);
-}
-
static void post_schedule_rt(struct rq *rq)
{
push_rt_tasks(rq);
@@ -2003,7 +2000,6 @@ const struct sched_class rt_sched_class
.set_cpus_allowed = set_cpus_allowed_rt,
.rq_online = rq_online_rt,
.rq_offline = rq_offline_rt,
- .pre_schedule = pre_schedule_rt,
.post_schedule = post_schedule_rt,
.task_woken = task_woken_rt,
.switched_from = switched_from_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1136,7 +1136,6 @@ struct sched_class {
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int next_cpu);

- void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);

Peter Zijlstra

Jan 23, 2014, 8:00:02 AM
On Tue, Jan 21, 2014 at 12:17:57PM +0100, Peter Zijlstra wrote:
> From: Daniel Lezcano <daniel....@linaro.org>
>
> The idle_balance modifies the idle_stamp field of the rq, making this
> information to be shared across core.c and fair.c. As we can know if the
> cpu is going to idle or not with the previous patch, let's encapsulate the
> idle_stamp information in core.c by moving it up to the caller. The
> idle_balance function returns true in case a balancing occured and the cpu
> won't be idle, false if no balance happened and the cpu is going idle.
>
> Cc: alex...@linaro.org
> Cc: pet...@infradead.org
> Cc: mi...@kernel.org
> Signed-off-by: Daniel Lezcano <daniel....@linaro.org>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-s...@linaro.org
> ---
> kernel/sched/core.c | 2 +-
> kernel/sched/fair.c | 14 ++++++--------
> kernel/sched/sched.h | 2 +-
> 3 files changed, 8 insertions(+), 10 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2680,7 +2680,7 @@ static void __sched __schedule(void)
> pre_schedule(rq, prev);
>
> if (unlikely(!rq->nr_running))
> - idle_balance(rq);
> + rq->idle_stamp = idle_balance(rq) ? 0 : rq_clock(rq);

OK, spotted a problem here..

So previously idle_stamp was set _before_ actually doing idle_balance(),
and that was RIGHT, because that way we include the cost of actually
doing idle_balance() into the idle time.

By not including the cost of idle_balance() you underestimate the 'idle'
time: if idle_balance() filled the entire idle period we account 0 idle
time, even though we had 'plenty' of time to run the entire thing.

This leads to not running idle_balance() even though we have the time
for it.

So we very much want something like:


if (!rq->nr_running)
rq->idle_stamp = rq_clock(rq);

p = pick_next_task(rq, prev);

if (!is_idle_task(p))
rq->idle_stamp = 0;
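
For reference, this matters because of the existing avg_idle gate (a
from-memory sketch of code that predates this series): the wakeup path
folds rq_clock(rq) - rq->idle_stamp into rq->avg_idle, and idle_balance()
starts with an early bail-out along the lines of

	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return;	/* expected idle period too short to bother balancing */

so an under-measured idle period shrinks avg_idle and makes that bail-out
fire even when there would have been time to balance.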

Daniel Lezcano

Jan 23, 2014, 9:40:02 AM
Good catch. How did you notice that?

> So we very much want something like:
>
>
> if (!rq->nr_running)
> rq->idle_stamp = rq_clock(rq);
>
> p = pick_next_task(rq, prev);
>
> if (!is_idle_task(p))
> rq->idle_stamp = 0;

Is this code assuming idle_balance() is in pick_next_task?

For this specific patch 3/9, would the following be ok?

+ if (unlikely(!rq->nr_running)) {
+ rq->idle_stamp = rq_clock(rq);
+ if (idle_balance(rq))
+ rq->idle_stamp = 0;
+ }

So patch 9/9 is also wrong because it does not set idle_stamp before
idle_balance; the fix would be:

kernel/sched/core.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: cpuidle-next/kernel/sched/core.c
===================================================================
--- cpuidle-next.orig/kernel/sched/core.c
+++ cpuidle-next/kernel/sched/core.c
@@ -2579,15 +2579,17 @@ again:
}
}

+ rq->idle_stamp = rq_clock(rq);
+
/*
* If there is a task balanced on this cpu, pick the next task,
* otherwise fall in the optimization by picking the idle task
* directly.
*/
- if (idle_balance(rq))
+ if (idle_balance(rq)) {
+ rq->idle_stamp = 0;
goto again;
-
- rq->idle_stamp = rq_clock(rq);
+ }

return idle_sched_class.pick_next_task(rq, prev);
}





Vincent Guittot

Jan 23, 2014, 10:00:02 AM
On 23 January 2014 12:37, Peter Zijlstra <pet...@infradead.org> wrote:
> On Tue, Jan 21, 2014 at 06:27:11PM +0100, Vincent Guittot wrote:
>> On 21 January 2014 12:17, Peter Zijlstra <pet...@infradead.org> wrote:
>
>> If i have correctly followed the new function path that is introduced
>> by the patchset, idle_enter_fair is called after idle_balance whereas
>> it must be called before in order to update the
>> runnable_avg_sum/period of the rq before evaluating the interest of
>> pulling cfs task
>
> Its idle_exit_fair, that's moved from pre_schedule to put_prev_task and
> thus indeed has crossed idle_balance.

OK, so I probably mixed up the changes introduced by your patches and a
potential issue that was already present before.

idle_enter/exit_fair are used to make sure the run time of a rq is
accounted correctly in its sched_avg field, so they must be called before
entering/leaving idle. Your patch ensures that they will be called
correctly. Now, idle_balance is used to check the interest of pulling a
task onto this newly idle CPU, and it could use the sched_avg field in the
near future given the discussion around an energy-aware scheduler. As a
result, we must update the sched_avg field before running idle_balance
(that's not the case even before your patches).

So one solution is probably to move idle_enter_fair into the idle_balance
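
Roughly (a sketch of that suggestion only, not a tested change):

	int idle_balance(struct rq *this_rq)
	{
		/*
		 * Account the elapsed running time in the rq's sched_avg
		 * before deciding whether pulling a cfs task is worth it.
		 */
		idle_enter_fair(this_rq);

		/* ... existing newidle balancing ... */
	}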

Regards,
Vincent

>
> Yeah, I can leave that pre_schedule thing in place, however all that has
> be thinking.
>
> Ideally we'd do something like the below; but I must admit to still
> being slightly confused about the idle_{enter,exit}_fair() calls, their
> comment doesn't seem to clarify things.
>
> Steve, I don't think I wrecked rt/deadline by pulling in the
> pre_schedule call into pick_next_task(), but then, who knows :-)
>
> The only thing I really don't like is having that double conditional for
> the direct fair_sched_class call, but I didn't see a way around that,
> other than dropping that entirely. Then again, we cut out a conditional
> and indirect call by getting rid of pre_schedule() -- so it might just
> balance out.
>

[snip]

Peter Zijlstra

Jan 23, 2014, 10:30:01 AM
Staring at that code for too long :-)

> >So we very much want something like:
> >
> >
> > if (!rq->nr_running)
> > rq->idle_stamp = rq_clock(rq);
> >
> > p = pick_next_task(rq, prev);
> >
> > if (!is_idle_task(p))
> > rq->idle_stamp = 0;
>
> Is this code assuming idle_balance() is in pick_next_task ?

Yeah, I'm trying to make that work.

Peter Zijlstra

Jan 28, 2014, 12:40:01 PM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_4.patch

Peter Zijlstra

Jan 28, 2014, 12:40:02 PM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_1.patch

Peter Zijlstra

Jan 28, 2014, 12:40:02 PM
peterz-sched-idle-1.patch

Peter Zijlstra

Jan 28, 2014, 12:40:02 PM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_2.patch

Peter Zijlstra

Jan 28, 2014, 12:40:02 PM
daniel_lezcano-2_sched-fix_race_in_idle_balance.patch

Peter Zijlstra

Jan 28, 2014, 12:40:03 PM
daniel_lezcano-3_sched-move_idle_stamp_up_to_the_core.patch

Peter Zijlstra

Jan 28, 2014, 12:40:03 PM
Here's a set that seems to work properly and hopefully addresses all
previously raised issues.

Daniel, I'm sorry, the idle_stamp thing wandered back over to fair.c :-)

I did test the fair bits, including bandwidth (although not very
extensively).

I did not test the dl/rt bits, Steve do you have a testcase for that?

Peter Zijlstra

Jan 28, 2014, 12:40:03 PM
peter_zijlstra-sched-clean_up_idle_task_smp_logic.patch

Peter Zijlstra

Jan 28, 2014, 12:40:04 PM
peter_zijlstra-sched-optimize_cgroup_pick_next_task_fair_3.patch

Peter Zijlstra

Jan 28, 2014, 12:40:04 PM
daniel_lezcano-1_sched-remove__cpu__parameter_for_idle_balance.patch

Steven Rostedt

Jan 28, 2014, 1:10:01 PM
On Tue, 28 Jan 2014 18:16:34 +0100
Peter Zijlstra <pet...@infradead.org> wrote:

> Here's a set that seems to work properly and hopefully adresses all previously
> raised issues.
>
> Daniel, I'm sorry, the idle_stamp thing wandered back over to fair.c :-)
>
> I did test the fair bits, including bandwidth (although not very
> extensively).
>
> I did not test the dl/rt bits, Steve do you have a testcase for that?


My sched tests are usually done by running some scheduling tests while
recording sched switches and then analyzing them under kernelshark. I
don't have many other tests that do pass/fail. Hmm, I did have a couple
of rt migration tests, but that's it.

I'll take a look at these tomorrow.

-- Steve

bse...@google.com

Jan 28, 2014, 1:50:02 PM
Peter Zijlstra <pet...@infradead.org> writes:

> In order to avoid having to do put/set on a whole cgroup hierarchy
> when we context switch, push the put into pick_next_task() so that
> both operations are in the same function. Further changes then allow
> us to possibly optimize away redundant work.
>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> ---
> kernel/sched/core.c | 21 ++++++++-------------
> kernel/sched/deadline.c | 4 +++-
> kernel/sched/fair.c | 5 ++++-
> kernel/sched/idle_task.c | 5 ++++-
> kernel/sched/rt.c | 26 +++++++++++++++-----------
> kernel/sched/sched.h | 8 +++++++-
> kernel/sched/stop_task.c | 14 ++++++++------
> 7 files changed, 49 insertions(+), 34 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2556,18 +2556,11 @@ static inline void schedule_debug(struct
> schedstat_inc(this_rq(), sched_count);
> }
>
> -static void put_prev_task(struct rq *rq, struct task_struct *prev)
> -{
> - if (prev->on_rq || rq->skip_clock_update < 0)
> - update_rq_clock(rq);
> - prev->sched_class->put_prev_task(rq, prev);
> -}
> -
> /*
> * Pick up the highest-prio task:
> */
> static inline struct task_struct *
> -pick_next_task(struct rq *rq)
> +pick_next_task(struct rq *rq, struct task_struct *prev)
> {
> const struct sched_class *class;
> struct task_struct *p;
> @@ -2577,13 +2570,13 @@ pick_next_task(struct rq *rq)
> * the fair class we can call that function directly:
> */
> if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> - p = fair_sched_class.pick_next_task(rq);
> + p = fair_sched_class.pick_next_task(rq, prev);
> if (likely(p))
> return p;
> }
>
> for_each_class(class) {
> - p = class->pick_next_task(rq);
> + p = class->pick_next_task(rq, prev);
> if (p)
> return p;
> }
> @@ -2691,8 +2684,10 @@ static void __sched __schedule(void)
> rq->idle_stamp = 0;
> }
>
> - put_prev_task(rq, prev);
> - next = pick_next_task(rq);
> + if (prev->on_rq || rq->skip_clock_update < 0)
> + update_rq_clock(rq);
> +
> + next = pick_next_task(rq, prev);
> clear_tsk_need_resched(prev);
> clear_preempt_need_resched();
> rq->skip_clock_update = 0;
> @@ -4734,7 +4729,7 @@ static void migrate_tasks(unsigned int d
> if (rq->nr_running == 1)
> break;
>
> - next = pick_next_task(rq);
> + next = pick_next_task(rq, NULL);

This seems to be incorrect without the if (prev) lines in
pick_next_task_foo, since foo_nr_running isn't enough to save us.
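
That is, without the guard the earlier version had, along the lines of

	if (prev)
		prev->sched_class->put_prev_task(rq, prev);

the NULL prev coming from migrate_tasks() gets dereferenced by the now
unconditional put_prev_task() calls.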

> BUG_ON(!next);
> next->sched_class->put_prev_task(rq, next);
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -989,7 +989,7 @@ static struct sched_dl_entity *pick_next
> return rb_entry(left, struct sched_dl_entity, rb_node);
> }
>
> -struct task_struct *pick_next_task_dl(struct rq *rq)
> +struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
> {
> struct sched_dl_entity *dl_se;
> struct task_struct *p;
> @@ -1000,6 +1000,8 @@ struct task_struct *pick_next_task_dl(st
> if (unlikely(!dl_rq->dl_nr_running))
> return NULL;
>
> + prev->sched_class->put_prev_task(rq, prev);
> +
> dl_se = pick_next_dl_entity(rq, dl_rq);
> BUG_ON(!dl_se);
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4659,7 +4659,8 @@ static void check_preempt_wakeup(struct
> set_last_buddy(se);
> }
>
> -static struct task_struct *pick_next_task_fair(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> {
> struct task_struct *p;
> struct cfs_rq *cfs_rq = &rq->cfs;
> @@ -4668,6 +4669,8 @@ static struct task_struct *pick_next_tas
> if (!cfs_rq->nr_running)
> return NULL;
>
> + prev->sched_class->put_prev_task(rq, prev);
> +
> do {
> se = pick_next_entity(cfs_rq);
> set_next_entity(cfs_rq, se);
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -33,8 +33,11 @@ static void check_preempt_curr_idle(stru
> resched_task(rq->idle);
> }
>
> -static struct task_struct *pick_next_task_idle(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_idle(struct rq *rq, struct task_struct *prev)
> {
> + prev->sched_class->put_prev_task(rq, prev);
> +
> schedstat_inc(rq, sched_goidle);
> #ifdef CONFIG_SMP
> /* Trigger the post schedule to do an idle_enter for CFS */
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1310,15 +1310,7 @@ static struct task_struct *_pick_next_ta
> {
> struct sched_rt_entity *rt_se;
> struct task_struct *p;
> - struct rt_rq *rt_rq;
> -
> - rt_rq = &rq->rt;
> -
> - if (!rt_rq->rt_nr_running)
> - return NULL;
> -
> - if (rt_rq_throttled(rt_rq))
> - return NULL;
> + struct rt_rq *rt_rq = &rq->rt;
>
> do {
> rt_se = pick_next_rt_entity(rq, rt_rq);
> @@ -1332,9 +1324,21 @@ static struct task_struct *_pick_next_ta
> return p;
> }
>
> -static struct task_struct *pick_next_task_rt(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev)
> {
> - struct task_struct *p = _pick_next_task_rt(rq);
> + struct task_struct *p;
> + struct rt_rq *rt_rq = &rq->rt;
> +
> + if (!rt_rq->rt_nr_running)
> + return NULL;
> +
> + if (rt_rq_throttled(rt_rq))
> + return NULL;
> +
> + prev->sched_class->put_prev_task(rq, prev);
> +
> + p = _pick_next_task_rt(rq);
>
> /* The running task is never eligible for pushing */
> if (p)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1123,7 +1123,13 @@ struct sched_class {
>
> void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> - struct task_struct * (*pick_next_task) (struct rq *rq);
> + /*
> + * It is the responsibility of the pick_next_task() method that will
> + * return the next task to call put_prev_task() on the @prev task or
> + * something equivalent.
> + */
> + struct task_struct * (*pick_next_task) (struct rq *rq,
> + struct task_struct *prev);
> void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -23,16 +23,18 @@ check_preempt_curr_stop(struct rq *rq, s
> /* we're never preempted */
> }
>
> -static struct task_struct *pick_next_task_stop(struct rq *rq)
> +static struct task_struct *
> +pick_next_task_stop(struct rq *rq, struct task_struct *prev)
> {
> struct task_struct *stop = rq->stop;
>
> - if (stop && stop->on_rq) {
> - stop->se.exec_start = rq_clock_task(rq);
> - return stop;
> - }
> + if (!stop || !stop->on_rq)
> + return NULL;
>
> - return NULL;
> + prev->sched_class->put_prev_task(rq, prev);
> + stop->se.exec_start = rq_clock_task(rq);
> +
> + return stop;
> }
>
> static void

Peter Zijlstra

Jan 28, 2014, 2:20:01 PM
On Tue, Jan 28, 2014 at 10:46:18AM -0800, bse...@google.com wrote:
> > @@ -4734,7 +4729,7 @@ static void migrate_tasks(unsigned int d
> > if (rq->nr_running == 1)
> > break;
> >
> > - next = pick_next_task(rq);
> > + next = pick_next_task(rq, NULL);
>
> This seems to be incorrect without the if (prev) lines in
> pick_next_task_foo, since foo_nr_running isn't enough to save us.

Argh, I knew there must've been a reason I had those :/

Vincent Guittot

Jan 30, 2014, 6:00:02 AM
Hi Peter,

Acked-by: Vincent Guittot <vincent...@linaro.org>

Vincent

On 28 January 2014 18:16, Peter Zijlstra <pet...@infradead.org> wrote:
> The idle post_schedule hook is just a vile waste of time, furthermore
> it appears unneeded, move the idle_enter_fair() call into
> pick_next_task_idle().
>
> Cc: Daniel Lezcano <daniel....@linaro.org>
> Cc: Vincent Guittot <vincent...@linaro.org>
> Cc: alex...@linaro.org
> Cc: mi...@kernel.org
> Cc: Steven Rostedt <ros...@goodmis.org>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> ---
> kernel/sched/idle_task.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -19,11 +19,6 @@ static void pre_schedule_idle(struct rq
> idle_exit_fair(rq);
> rq_last_tick_reset(rq);
> }
> -
> -static void post_schedule_idle(struct rq *rq)
> -{
> - idle_enter_fair(rq);
> -}
> #endif /* CONFIG_SMP */
> /*
> * Idle tasks are unconditionally rescheduled:
> @@ -40,8 +35,7 @@ pick_next_task_idle(struct rq *rq, struc
>
> schedstat_inc(rq, sched_goidle);
> #ifdef CONFIG_SMP
> - /* Trigger the post schedule to do an idle_enter for CFS */
> - rq->post_schedule = 1;
> + idle_enter_fair(rq);
> #endif
> return rq->idle;
> }
> @@ -105,7 +99,6 @@ const struct sched_class idle_sched_clas
> #ifdef CONFIG_SMP
> .select_task_rq = select_task_rq_idle,
> .pre_schedule = pre_schedule_idle,
> - .post_schedule = post_schedule_idle,
> #endif
>
> .set_curr_task = set_curr_task_idle,
>
>

Vincent Guittot

Jan 30, 2014, 7:20:01 AM
On 28 January 2014 18:16, Peter Zijlstra <pet...@infradead.org> wrote:

[snip]

>
> @@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
> static struct task_struct *
> pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> {
> - struct task_struct *p;
> struct cfs_rq *cfs_rq = &rq->cfs;
> struct sched_entity *se;
> + struct task_struct *p;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + if (!cfs_rq->nr_running)
> + return NULL;

Couldn't you move the test above out of the CONFIG_FAIR_GROUP_SCHED block
and remove the same test that is done after the simple label?

Vincent

> +
> + if (prev->sched_class != &fair_sched_class)
> + goto simple;
> +
> + /*
> + * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> + * likely that a next task is from the same cgroup as the current.
> + *
> + * Therefore attempt to avoid putting and setting the entire cgroup
> + * hierarchy, only change the part that actually changes.
> + */
> +
> + do {
> + struct sched_entity *curr = cfs_rq->curr;
> +
> + /*
> + * Since we got here without doing put_prev_entity() we also
> + * have to consider cfs_rq->curr. If it is still a runnable
> + * entity, update_curr() will update its vruntime, otherwise
> + * forget we've ever seen it.
> + */
> + if (curr && curr->on_rq)
> + update_curr(cfs_rq);
> + else
> + curr = NULL;
> +
> + /*
> + * This call to check_cfs_rq_runtime() will do the throttle and
> + * dequeue its entity in the parent(s). Therefore the 'simple'
> + * nr_running test will indeed be correct.
> + */
> + if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> + if (hrtick_enabled(rq))
> + hrtick_start_fair(rq, p);
> +
> + return p;
> +simple:
> + cfs_rq = &rq->cfs;
> +#endif
>
> if (!cfs_rq->nr_running)
> return NULL;
> @@ -4672,12 +4769,13 @@ pick_next_task_fair(struct rq *rq, struc
> prev->sched_class->put_prev_task(rq, prev);
>
> do {
> - se = pick_next_entity(cfs_rq);
> + se = pick_next_entity(cfs_rq, NULL);
> set_next_entity(cfs_rq, se);
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
> p = task_of(se);
> +
> if (hrtick_enabled(rq))
> hrtick_start_fair(rq, p);
>
>
>

Peter Zijlstra

Jan 30, 2014, 7:40:02 AM
On Thu, Jan 30, 2014 at 01:18:09PM +0100, Vincent Guittot wrote:
> On 28 January 2014 18:16, Peter Zijlstra <pet...@infradead.org> wrote:
>
> [snip]
>
> >
> > @@ -4662,9 +4682,86 @@ static void check_preempt_wakeup(struct
> > static struct task_struct *
> > pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> > {
> > - struct task_struct *p;
> > struct cfs_rq *cfs_rq = &rq->cfs;
> > struct sched_entity *se;
> > + struct task_struct *p;
> > +
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > + if (!cfs_rq->nr_running)
> > + return NULL;
>
> Couldn't you move the test above out of the CONFIG_FAIR_GROUP_SCHED
> and remove the same test that is done after the simple label

No, we have to check it twice because, by the time we get here, the
check_cfs_rq_runtime() call described in the comment above could have
modified nr_running, and therefore this test needs to stay.

Vincent Guittot

Jan 30, 2014, 7:50:01 AM
On 28 January 2014 18:16, Peter Zijlstra <pet...@infradead.org> wrote:
> This patch both merged idle_balance() and pre_schedule() and pushes
> both of them into pick_next_task().
>
> Conceptually pre_schedule() and idle_balance() are rather similar,
> both are used to pull more work onto the current CPU.
>
> We cannot however first move idle_balance() into pre_schedule_fair()
> since there is no guarantee the last runnable task is a fair task, and
> thus we would miss newidle balances.
>
> Similarly, the dl and rt pre_schedule calls must be ran before
> idle_balance() since their respective tasks have higher priority and
> it would not do to delay their execution searching for less important
> tasks first.
>
> However, by noticing that pick_next_tasks() already traverses the
> sched_class hierarchy in the right order, we can get the right
> behaviour and do away with both calls.
>
> We must however change the special case optimization to also require
> that prev is of sched_class_fair, otherwise we can miss doing a dl or
> rt pull where we needed one.
>
> Signed-off-by: Peter Zijlstra <pet...@infradead.org>
> ---
> kernel/sched/core.c | 26 ++------------------------
> kernel/sched/deadline.c | 15 +++++++--------
> kernel/sched/fair.c | 24 ++++++++++++++++++++----
> kernel/sched/idle_task.c | 12 +++++-------
> kernel/sched/rt.c | 16 ++++++++--------
> kernel/sched/sched.h | 1 -
> 6 files changed, 42 insertions(+), 52 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2146,13 +2146,6 @@ static void finish_task_switch(struct rq
>
> #ifdef CONFIG_SMP
>
> -/* assumes rq->lock is held */
> -static inline void pre_schedule(struct rq *rq, struct task_struct *prev)
> -{
> - if (prev->sched_class->pre_schedule)
> - prev->sched_class->pre_schedule(rq, prev);
> -}
> -
> /* rq->lock is NOT held, but preemption is disabled */
> static inline void post_schedule(struct rq *rq)
> {
> @@ -2170,10 +2163,6 @@ static inline void post_schedule(struct
>
> #else
>
> -static inline void pre_schedule(struct rq *rq, struct task_struct *p)
> -{
> -}
> -
> static inline void post_schedule(struct rq *rq)
> {
> }
> @@ -2569,7 +2558,8 @@ pick_next_task(struct rq *rq, struct tas
> * Optimization: we know that if all tasks are in
> * the fair class we can call that function directly:
> */
> - if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
> + if (likely(prev->sched_class == &fair_sched_class &&
> + rq->nr_running == rq->cfs.h_nr_running)) {
> p = fair_sched_class.pick_next_task(rq, prev);
> if (likely(p))
> return p;
> @@ -2672,18 +2662,6 @@ static void __sched __schedule(void)
> switch_count = &prev->nvcsw;
> }
>
> - pre_schedule(rq, prev);
> -
> - if (unlikely(!rq->nr_running)) {
> - /*
> - * We must set idle_stamp _before_ calling idle_balance(), such
> - * that we measure the duration of idle_balance() as idle time.
> - */
> - rq->idle_stamp = rq_clock(rq);
> - if (idle_balance(rq))
> - rq->idle_stamp = 0;
> - }
> -
> if (prev->on_rq || rq->skip_clock_update < 0)
> update_rq_clock(rq);
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -989,6 +989,8 @@ static struct sched_dl_entity *pick_next
> return rb_entry(left, struct sched_dl_entity, rb_node);
> }
>
> +static int pull_dl_task(struct rq *this_rq);
> +
> struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
> {
> struct sched_dl_entity *dl_se;
> @@ -997,6 +999,11 @@ struct task_struct *pick_next_task_dl(st
>
> dl_rq = &rq->dl;
>
> +#ifdef CONFIG_SMP
> + if (dl_task(prev))
> + pull_dl_task(rq);
> +#endif
> +
> if (unlikely(!dl_rq->dl_nr_running))
> return NULL;
>
> @@ -1427,13 +1434,6 @@ static int pull_dl_task(struct rq *this_
> return ret;
> }
>
> -static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> -{
> - /* Try to pull other tasks here */
> - if (dl_task(prev))
> - pull_dl_task(rq);
> -}
> -
> static void post_schedule_dl(struct rq *rq)
> {
> push_dl_tasks(rq);
> @@ -1626,7 +1626,6 @@ const struct sched_class dl_sched_class
> .set_cpus_allowed = set_cpus_allowed_dl,
> .rq_online = rq_online_dl,
> .rq_offline = rq_offline_dl,
> - .pre_schedule = pre_schedule_dl,
> .post_schedule = post_schedule_dl,
> .task_woken = task_woken_dl,
> #endif
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2581,7 +2581,8 @@ void idle_exit_fair(struct rq *this_rq)
> update_rq_runnable_avg(this_rq, 0);
> }
>
> -#else
> +#else /* CONFIG_SMP */
> +
> static inline void update_entity_load_avg(struct sched_entity *se,
> int update_cfs_rq) {}
> static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
> @@ -2593,7 +2594,7 @@ static inline void dequeue_entity_load_a
> int sleep) {}
> static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> int force_update) {}
> -#endif
> +#endif /* CONFIG_SMP */
>
> static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> @@ -4686,9 +4687,10 @@ pick_next_task_fair(struct rq *rq, struc
> struct sched_entity *se;
> struct task_struct *p;
>
> +again:
> #ifdef CONFIG_FAIR_GROUP_SCHED
> if (!cfs_rq->nr_running)
> - return NULL;
> + goto idle;
>
> if (prev->sched_class != &fair_sched_class)
> goto simple;
> @@ -4764,7 +4766,7 @@ pick_next_task_fair(struct rq *rq, struc
> #endif
>
> if (!cfs_rq->nr_running)
> - return NULL;
> + goto idle;
>
> prev->sched_class->put_prev_task(rq, prev);
>
> @@ -4780,6 +4782,20 @@ pick_next_task_fair(struct rq *rq, struc
> hrtick_start_fair(rq, p);
>
> return p;
> +
> +idle:
> + idle_exit_fair(rq);

It should be idle_enter_fair.

We want to update the statistics with the running time of classes other
than CFS.

The use case is:

exit idle
put_prev_task_idle
--> idle_exit_fair (account elapsed idle time)
pick_next_task other than fair tasks
switch between "other than fair" tasks
...
no more "other than fair" tasks to schedule
pick_next_task_fair
--> no fair task on the rq
--> jump to simple
--> idle_enter_fair (account elapsed running time of other class
before trying to pull fair task from other CPUs)
--> idle_balance()
...
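
In other words, the idle: path in the hunk above would start with

	idle:
		idle_enter_fair(rq);	/* instead of idle_exit_fair() */

before falling into the idle_stamp / idle_balance() sequence quoted below.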

Vincent

> + /*
> + * We must set idle_stamp _before_ calling idle_balance(), such that we
> + * measure the duration of idle_balance() as idle time.
> + */
> + rq->idle_stamp = rq_clock(rq);
> + if (idle_balance(rq)) { /* drops rq->lock */
> + rq->idle_stamp = 0;
> + goto again;
> + }
> +
> + return NULL;
> }
>
> /*
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -13,13 +13,8 @@ select_task_rq_idle(struct task_struct *
> {
> return task_cpu(p); /* IDLE tasks as never migrated */
> }
> -
> -static void pre_schedule_idle(struct rq *rq, struct task_struct *prev)
> -{
> - idle_exit_fair(rq);
> - rq_last_tick_reset(rq);
> -}
> #endif /* CONFIG_SMP */
> +
> /*
> * Idle tasks are unconditionally rescheduled:
> */
> @@ -55,6 +50,10 @@ dequeue_task_idle(struct rq *rq, struct
>
> static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> {
> +#ifdef CONFIG_SMP
> + idle_exit_fair(rq);
> + rq_last_tick_reset(rq);
> +#endif
> }
>
> static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
> @@ -98,7 +97,6 @@ const struct sched_class idle_sched_clas
>
> #ifdef CONFIG_SMP
> .select_task_rq = select_task_rq_idle,
> - .pre_schedule = pre_schedule_idle,
> #endif
>
> .set_curr_task = set_curr_task_idle,
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1324,12 +1324,20 @@ static struct task_struct *_pick_next_ta
> return p;
> }
>
> +static int pull_rt_task(struct rq *this_rq);
> +
> static struct task_struct *
> pick_next_task_rt(struct rq *rq, struct task_struct *prev)
> {
> struct task_struct *p;
> struct rt_rq *rt_rq = &rq->rt;
>
> +#ifdef CONFIG_SMP
> + /* Try to pull RT tasks here if we lower this rq's prio */
> + if (rq->rt.highest_prio.curr > prev->prio)
> + pull_rt_task(rq);
> +#endif
> +
> if (!rt_rq->rt_nr_running)
> return NULL;
>
> @@ -1720,13 +1728,6 @@ static int pull_rt_task(struct rq *this_
> return ret;
> }
>
> -static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
> -{
> - /* Try to pull RT tasks here if we lower this rq's prio */
> - if (rq->rt.highest_prio.curr > prev->prio)
> - pull_rt_task(rq);
> -}
> -
> static void post_schedule_rt(struct rq *rq)
> {
> push_rt_tasks(rq);
> @@ -2003,7 +2004,6 @@ const struct sched_class rt_sched_class
> .set_cpus_allowed = set_cpus_allowed_rt,
> .rq_online = rq_online_rt,
> .rq_offline = rq_offline_rt,
> - .pre_schedule = pre_schedule_rt,
> .post_schedule = post_schedule_rt,
> .task_woken = task_woken_rt,
> .switched_from = switched_from_rt,
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1136,7 +1136,6 @@ struct sched_class {
> int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
>
> - void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
> void (*post_schedule) (struct rq *this_rq);
> void (*task_waking) (struct task_struct *task);
> void (*task_woken) (struct rq *this_rq, struct task_struct *task);
>
>

Vincent Guittot

Jan 30, 2014, 8:00:02 AM
OK, I missed this point.

Peter Zijlstra

Jan 30, 2014, 10:30:03 AM
On Thu, Jan 30, 2014 at 01:45:07PM +0100, Vincent Guittot wrote:
> On 28 January 2014 18:16, Peter Zijlstra <pet...@infradead.org> wrote:
> > +idle:
> > + idle_exit_fair(rq);
>
> It should be idle_enter_fair.
>
> we want to update the statistic with the running time of other classes
> than CFS.
>
> The use case is:
>
> exit idle
> put_prev_task_idle
> --> idle_exit_fair (account elapsed idle time)
> pick_next_task other than fair tasks
> switch between "other than fair" tasks
> ...
> no more "other than fair" tasks to schedule
> pick_next_task_fair
> --> no fair task on the rq
> --> jump to simple
> --> idle_enter_fair (account elapsed running time of other class
> before trying to pull fair task from other CPUs)
> --> idle_balance()
> ...

Right.. terminal confusion on this topic it seems. Fixed it up.