[RFC][PATCH] sched: avoid huge bonus to sleepers on busy machines

Suresh Jayaraman

Jan 4, 2010, 4:30:02 AM

As I understand it, the idea of sleeper fairness is to treat sleeping tasks
like the ones on the runqueue and credit sleepers so that they get CPU as if
they had been running.

Currently, when fair sleepers are enabled, a task that was sleeping seems to
get a bonus of cfs_rq->min_vruntime - sched_latency (in most cases). While
gentle fair sleepers halve this effect, there still remains a chance that on
busy machines with a large number of tasks, sleepers get a huge undue bonus.

Here's a patch to avoid this. It computes the CPU time the sleeping task was
entitled to during the period, taking into account only the current
cfs_rq->nr_running, and thus tries to make the bonus adaptive.
Compile-tested only.

Signed-off-by: Suresh Jayaraman <sjaya...@suse.de>
---
kernel/sched_fair.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..d81fcb3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -739,6 +739,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	/* sleeps up to a single latency don't count. */
 	if (!initial && sched_feat(FAIR_SLEEPERS)) {
 		unsigned long thresh = sysctl_sched_latency;
+		unsigned long delta_exec = (unsigned long)
+				(rq_of(cfs_rq)->clock - se->exec_start);
+		unsigned long sleeper_bonus;
+
+		/* entitled share of CPU time adapted to current nr_running */
+		if (likely(cfs_rq->nr_running > 1))
+			sleeper_bonus = delta_exec/cfs_rq->nr_running;
+		else
+			sleeper_bonus = delta_exec;
 
 		/*
 		 * Convert the sleeper threshold into virtual time.
@@ -757,7 +766,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		if (sched_feat(GENTLE_FAIR_SLEEPERS))
 			thresh >>= 1;
 
-		vruntime -= thresh;
+		vruntime -= min(thresh, sleeper_bonus);
 	}
 
 	/* ensure we never gain time by being placed backwards. */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Mike Galbraith

Jan 4, 2010, 6:20:03 AM

On Mon, 2010-01-04 at 14:50 +0530, Suresh Jayaraman wrote:
> As I understand the idea of sleeper fairness is to consider sleeping tasks
> similar to the ones on the runqueue and credit the sleepers in a way that it
> would get CPU as if it were running.
>
> Currently, when fair sleepers are enabled, the task that was sleeping seem to
> get a bonus of cfs_rq->min_vruntime - sched_latency (in most cases). While with
> gentle fair sleepers this effect was reduced to half, there still remains a
> chance that on busy machines with more number of tasks, the sleepers might get
> a huge undue bonus.

There is no bonus. Sleepers simply get to keep some of their lag, but
any lag beyond sched_latency is trashed in the interest of reasonable
latency for non-sleepers as the sleeper preempts and tries to catch up.
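
The placement rule described above can be sketched in userspace C. This is a
minimal illustration, not the kernel code: the function name and the plain
uint64_t arithmetic are assumptions made for the example, and wraparound of
the virtual clock is ignored.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick the vruntime a waking entity is placed at.
 * min_vruntime - the runqueue's current minimum virtual runtime
 * se_vruntime  - the virtual runtime the entity had when it went to sleep
 * thresh       - the most lag a sleeper may keep (sched_latency, or half
 *                of it with gentle fair sleepers)
 */
static uint64_t sketch_place_waking(uint64_t min_vruntime,
                                    uint64_t se_vruntime,
                                    uint64_t thresh)
{
    /* Furthest back the entity may be placed; clamp at 0 to avoid underflow. */
    uint64_t placed = min_vruntime > thresh ? min_vruntime - thresh : 0;

    /* Never gain time by being placed backwards: keep the larger value. */
    return se_vruntime > placed ? se_vruntime : placed;
}
```

With min_vruntime = 100 and thresh = 20, an entity that slept at vruntime 95
keeps its lag of 5 and resumes at 95, while one that slept at vruntime 10 is
placed at 80, so its lag beyond the threshold is discarded.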

-Mike

Suresh Jayaraman

Jan 4, 2010, 7:10:02 AM

On 01/04/2010 04:44 PM, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 14:50 +0530, Suresh Jayaraman wrote:
>> As I understand the idea of sleeper fairness is to consider sleeping tasks
>> similar to the ones on the runqueue and credit the sleepers in a way that it
>> would get CPU as if it were running.
>>
>> Currently, when fair sleepers are enabled, the task that was sleeping seem to
>> get a bonus of cfs_rq->min_vruntime - sched_latency (in most cases). While with
>> gentle fair sleepers this effect was reduced to half, there still remains a
>> chance that on busy machines with more number of tasks, the sleepers might get
>> a huge undue bonus.
>
> There is no bonus. Sleepers simply get to keep some of their lag, but
> any lag beyond sched_latency is trashed in the interest of reasonable
> latency for non-sleepers as the sleeper preempts and tries to catch up.
>

Sorry, perhaps it's not a bonus, but the credit sleepers get for their lag
(accrued while sleeping) doesn't appear to take into account the number of
tasks currently on the runqueue. IOW, the credit to sleepers is the same
irrespective of the number of current tasks. Doesn't this mean sleepers get
an edge (since this slows down current tasks) when the number of tasks is
large?

Would it be a good idea to make the threshold dependent on the number of
tasks? I think this can help us achieve sleeper fairness relative to the
current context rather than to the context when the task went to sleep.
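
For reference, the clamp from the patch at the top of this thread can be
written as a standalone function. This is only a sketch of that arithmetic:
the function name is made up for the example, and both inputs are treated as
plain nanosecond counts, whereas in the patch delta_exec comes from the rq
clock and thresh may already have been converted to virtual time.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the patch: the credit is the smaller of
 * the existing threshold and the sleeper's share of the elapsed time,
 * where the share is the elapsed time divided by the current nr_running.
 */
static uint64_t sketch_sleeper_credit(uint64_t thresh,
                                      uint64_t delta_exec,
                                      unsigned int nr_running)
{
    /* With zero or one runnable task the whole interval counts. */
    uint64_t share = nr_running > 1 ? delta_exec / nr_running : delta_exec;

    return thresh < share ? thresh : share;
}
```

For example, with thresh = 10 and delta_exec = 100 the credit stays at 10
when one task is running, but drops to 5 when 20 tasks are running.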

Does this make sense?

Thanks,

--
Suresh Jayaraman

Mike Galbraith

Jan 4, 2010, 7:40:03 AM

On Mon, 2010-01-04 at 17:32 +0530, Suresh Jayaraman wrote:
> On 01/04/2010 04:44 PM, Mike Galbraith wrote:
> > On Mon, 2010-01-04 at 14:50 +0530, Suresh Jayaraman wrote:
> >> As I understand the idea of sleeper fairness is to consider sleeping tasks
> >> similar to the ones on the runqueue and credit the sleepers in a way that it
> >> would get CPU as if it were running.
> >>
> >> Currently, when fair sleepers are enabled, the task that was sleeping seem to
> >> get a bonus of cfs_rq->min_vruntime - sched_latency (in most cases). While with
> >> gentle fair sleepers this effect was reduced to half, there still remains a
> >> chance that on busy machines with more number of tasks, the sleepers might get
> >> a huge undue bonus.
> >
> > There is no bonus. Sleepers simply get to keep some of their lag, but
> > any lag beyond sched_latency is trashed in the interest of reasonable
> > latency for non-sleepers as the sleeper preempts and tries to catch up.
> >
>
> Sorry, perhaps it's not a bonus, but it seems that the credit to
> sleepers due to their lag (when it was sleeping) doesn't appear to take
> in to account the number of tasks in the run_queue currently. IOW, the
> credit to sleepers is same irrespective of the number of current tasks.
> This might mean sleepers are getting an edge (since this will slow down
> current tasks) when the number of tasks is more, isn't?

As load increases, min_vruntime advances slower, so it's already scaled.
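
One way to read this remark, as a rough sketch: assuming nr_running
equal-weight tasks that split the CPU evenly, each task's vruntime (and so
min_vruntime) advances by only 1/nr_running of the wall-clock time. The
function name and the even-split assumption are illustrative, not taken from
the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how far min_vruntime moves while wall_ns of
 * wall-clock time passes, assuming nr_running equal-weight tasks that
 * each get an equal slice of the CPU. This is also the lag a task
 * accrues by sleeping for wall_ns, before any clamping.
 */
static uint64_t sketch_min_vruntime_advance(uint64_t wall_ns,
                                            unsigned int nr_running)
{
    if (nr_running == 0)
        return 0; /* idle runqueue: nothing runs, nothing advances */

    return wall_ns / nr_running;
}
```

Under these assumptions a 20 ms sleep leaves a task 20 ms behind on a
runqueue with one runner but only 5 ms behind with four, so the lag a
sleeper can accumulate already shrinks as load grows.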

> Would it be a good idea to make the threshold dependent on number of
> tasks? This can help us achieve sleeper fairness with respect to the
> current context and not relevant to when the task went to sleep, I think.
>
> Does this make sense?

In one respect it makes some sense to scale. As load climbs, the waker
has to wait longer to get cpu, so sleepers sleep longer. This leads to
increased wakeup preemption as load climbs. However, if you do any kind
of scaling, you harm light threads, not their hog competition. Any
diddling of sleeper fairness would have to be accompanied by a
preemption model change methinks.

-Mike

Peter Zijlstra

Jan 4, 2010, 7:40:03 AM

On Mon, 2010-01-04 at 13:30 +0100, Mike Galbraith wrote:
> Any diddling of sleeper fairness would have to be accompanied with a
> preemption model change methinks.

Just told jays the exact same thing on IRC ;-)

Also, workloads are interesting: the signal test thing is the easiest way to
test the preemption side, while various things like QPID show the downside,
iirc.

Mike Galbraith

Jan 4, 2010, 7:50:02 AM

On Mon, 2010-01-04 at 13:36 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 13:30 +0100, Mike Galbraith wrote:
> > Any diddling of sleeper fairness would have to be accompanied with a
> > preemption model change methinks.
>
> Just told jays the exact same thing on IRC ;-)
>
> Also, workloads are interesting, the signal test thing is the easiest to
> test the preemption side, various things like QPID show the down-side
> iirc.

Best testcase for the downside in my arsenal is vmark. It performs a
_lot_ better with no wakeup preemption. 'Course if you run your box
that way, you quickly find out what a horrible idea that is :)

-Mike