
[Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]


Ingo Molnar

Apr 13, 2007, 4:30:13 PM
[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:

http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.

[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
new scheduler will be active by default and all tasks will default
to the new SCHED_FAIR interactive scheduling class. ]

Highlights are:

- the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
details and are handled by the scheduler core without the core
code making too many assumptions about them.

- sched_fair.c implements the 'CFS desktop scheduler': it is a
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.

i'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!

The CFS patch uses a completely different approach and implementation
from RSDL/SD. My goal was to make CFS's interactivity quality exceed
that of RSDL/SD, which is a high standard to meet :-) Testing
feedback is welcome to decide this one way or another. [ and, in any
case, all of SD's logic could be added via a kernel/sched_sd.c module
as well, if Con is interested in such an approach. ]

CFS's design is quite radical: it does not use runqueues, it uses a
time-ordered rbtree to build a 'timeline' of future task execution,
and thus has no 'array switch' artifacts (by which both the vanilla
scheduler and RSDL/SD are affected). (See the illustrative sketch right
after these highlights.)

CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
central tunable:

/proc/sys/kernel/sched_granularity_ns

which can be used to tune the scheduler from 'desktop' (low
latencies) to 'server' (good batching) workloads. It defaults to a
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.

due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
behavior.

the CFS scheduler has a much stronger handling of nice levels and
SCHED_BATCH: both types of workloads should be isolated much more
aggressively than under the vanilla scheduler.

( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )

- sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
way than the vanilla scheduler does. It uses 100 runqueues (for all
100 RT priority levels, instead of 140 in the vanilla scheduler)
and it needs no expired array.

- reworked/sanitized SMP load-balancing: the runqueue-walking
assumptions are gone from the load-balancing code now, and
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
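
To make the 'timeline' idea above a bit more concrete, here is a minimal
sketch of keeping runnable tasks in a time-ordered rbtree with the
kernel's generic rbtree helpers. The 'struct fair_task' layout and the
'fair_key' field are invented for illustration only; they are not the
data structures the actual patch uses:

#include <linux/rbtree.h>
#include <linux/types.h>

/* illustrative per-task entry - field names are made up, not the patch's */
struct fair_task {
        struct rb_node  run_node;       /* links the task into the timeline */
        u64             fair_key;       /* 'when is this task due to run' */
};

/*
 * Enqueue: walk down the tree comparing keys, then link the new node and
 * let the generic rbtree code rebalance. Earlier keys sort to the left.
 */
static void timeline_enqueue(struct rb_root *timeline, struct fair_task *p)
{
        struct rb_node **link = &timeline->rb_node, *parent = NULL;

        while (*link) {
                struct fair_task *entry;

                parent = *link;
                entry = rb_entry(parent, struct fair_task, run_node);
                if (p->fair_key < entry->fair_key)
                        link = &parent->rb_left;
                else
                        link = &parent->rb_right;
        }
        rb_link_node(&p->run_node, parent, link);
        rb_insert_color(&p->run_node, timeline);
}

/* the next task to run is simply the leftmost node of the timeline */
static struct fair_task *timeline_pick_next(struct rb_root *timeline)
{
        struct rb_node *left = rb_first(timeline);

        return left ? rb_entry(left, struct fair_task, run_node) : NULL;
}

Picking the next task is then just a lookup of the leftmost node (which,
as discussed later in this thread, can additionally be cached).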

the core scheduler got smaller by more than 700 lines:

kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
1 file changed, 372 insertions(+), 1082 deletions(-)

and even adding all the scheduling modules, the total size impact is
relatively small:

18 files changed, 1454 insertions(+), 1133 deletions(-)

most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:

   text    data  bss    dec   hex  filename
  23366    4001   24  27391  6aff  kernel/sched.o.vanilla
  24159    2705   56  26920  6928  kernel/sched.o.CFS

(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)

thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.

as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Bill Huey

Apr 13, 2007, 4:30:16 PM
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
...
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]

Ingo,

Con has been asking for module support for years if I understand your patch
correctly. You'll also need this for -rt as well with regards to bandwidth
scheduling. Good to see that you're moving in this direction.

bill

Ingo Molnar

Apr 13, 2007, 5:00:14 PM

* Bill Huey <bi...@gnuppy.monkey.org> wrote:

> Con has been asking for module support for years if I understand your

> patch correctly. [...]

Yeah. Note that there are some subtle but crucial differences between
PlugSched (which Con used, and which i opposed in the past) and this
approach.

PlugSched cuts the interfaces at a high level in a monolithic way and
introduces kernel/scheduler.c that uses one pluggable scheduler
(represented via the 'scheduler' global template) at a time.

while in this CFS patchset i'm using modularization ('scheduler
classes') to simplify the _existing_ multi-policy implementation of the
scheduler. These 'scheduler classes' are in a hierarchy and are stacked
on top of each other. They are all in use at once. Currently there are
two of them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately
the performance impact is minimal.

So scheduler classes are mainly a simplification of the design of the
scheduler - not just a mere facility to select multiple schedulers.
Their ability to also facilitate easier experimentation with schedulers
is 'just' a happy side-effect. So all in one: it's a fairly different
model from PlugSched (and that's why i didnt reuse PlugSched) - but
there's indeed overlap.

> [...] You'll also need this for -rt as well with regards to bandwidth
> scheduling.

yeah.

scheduler classes are also useful for other purposes like containers and
virtualization, hierarchical/group scheduling, security encapsulation,
etc. - features that can be on-demand layered, and which we dont
necessarily want to have enabled all the time.

> [...] Good to see that you're moving in this direction.

thanks! :)

Ingo

William Lee Irwin III

Apr 13, 2007, 5:30:10 PM
On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> Yeah. Note that there are some subtle but crucial differences between
> PlugSched (which Con used, and which i opposed in the past) and this
> approach.
> PlugSched cuts the interfaces at a high level in a monolithic way and
> introduces kernel/scheduler.c that uses one pluggable scheduler
> (represented via the 'scheduler' global template) at a time.

What I originally did did so for a good reason, which was that it was
intended to support far more radical reorganizations, for instance,
things that changed the per-cpu runqueue affairs for gang scheduling.
I wrote a top-level driver that did support scheduling classes in a
similar fashion, though it didn't survive others maintaining the patches.


-- wli

Bill Huey

Apr 13, 2007, 5:40:10 PM
On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crucial differences between
> > PlugSched (which Con used, and which i opposed in the past) and this
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and
> > introduces kernel/scheduler.c that uses one pluggable scheduler
> > (represented via the 'scheduler' global template) at a time.
>
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.

Also, gang scheduling is needed to solve virtualization issues regarding
spinlocks in a guest image. You could potentially be spinning on a thread
that isn't currently running, which, needless to say, is very bad.

bill

Ingo Molnar

Apr 13, 2007, 5:50:10 PM

* William Lee Irwin III <w...@holomorphy.com> wrote:

> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the
> patches.

yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular.

Ingo

Michal Piotrowski

Apr 13, 2007, 6:00:14 PM
Ingo Molnar wrote:

> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>

Friday the 13th, my lucky day :).

/mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers


make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'. Stop.
make[2]: *** [linux] Error 2
make[1]: *** [headers_check] Error 2
make: *** [vmlinux] Error 2

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

Signed-off-by: Michal Piotrowski <michal.k.k...@gmail.com>

--- linux-msc-cfs-clean/include/linux/Kbuild 2007-04-13 23:52:47.000000000 +0200
+++ linux-msc-cfs/include/linux/Kbuild 2007-04-13 23:49:41.000000000 +0200
@@ -133,6 +133,7 @@ header-y += quotaio_v1.h
header-y += quotaio_v2.h
header-y += radeonfb.h
header-y += raw.h
+header-y += rbtree.h
header-y += resource.h
header-y += rose.h
header-y += smbno.h

Ingo Molnar

Apr 13, 2007, 6:00:13 PM

* Ingo Molnar <mi...@elte.hu> wrote:

> and even adding all the scheduling modules, the total size impact is
> relatively small:
>
> 18 files changed, 1454 insertions(+), 1133 deletions(-)
>
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
>
> text data bss dec hex filename
> 23366 4001 24 27391 6aff kernel/sched.o.vanilla
> 24159 2705 56 26920 6928 kernel/sched.o.CFS

update: these were older numbers, here are the stats redone with the
latest patch:

   text    data  bss    dec   hex  filename
  23366    4001   24  27391  6aff  kernel/sched.o.vanilla
  23671    4548   24  28243  6e53  kernel/sched.o.sd.v40
  23349    2705   24  26078  65de  kernel/sched.o.cfs

so CFS is now a win both for text and for data size :)

Daniel Walker

Apr 13, 2007, 6:20:08 PM
On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.

I'm not in love with the current or other schedulers, so I'm indifferent
to this change. However, I was reviewing your release notes and the
patch and found myself wondering what the algorithmic complexity of this
new scheduler is. I assumed it would also be constant time, but the
__enqueue_task_fair doesn't appear to be constant time (rbtree insert
complexity). Maybe that's not a critical path, but I thought I would
at least comment on it.

Daniel

William Lee Irwin III

Apr 13, 2007, 6:30:13 PM
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> new scheduler will be active by default and all tasks will default
> to the new SCHED_FAIR interactive scheduling class. ]

A pleasant surprise, though I did see it coming.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> Highlights are:
> - the introduction of Scheduling Classes: an extensible hierarchy of
> scheduler modules. These modules encapsulate scheduling policy
> details and are handled by the scheduler core without the core
> code making too many assumptions about them.

It probably needs further clarification that they're things on the order
of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst
the classes is furthermore assumed, and so on. They're not quite
capable of being full-blown alternative policies, though quite a bit
can be crammed into them.

There are issues with the per-scheduling-class data not being very
well-abstracted. A union for per-class data might help, if not a
dynamically allocated scheduling-class-private structure. Getting
an alternative policy floating around that actually clashes a little
with the stock data in the task structure would help clarify what's
needed.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!

Bob Mullens banged out a virtual deadline interactive task scheduler
for Multics back in 1976 or thereabouts. ISTR the name Ferranti in
connection with deadline task scheduling for UNIX in particular. I've
largely seen deadline schedulers as a realtime topic, though. In any
event, it's not so radical as to lack a fair number of precedents.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]
> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).

A binomial heap would likely serve your purposes better than rbtrees.
It's faster to have the next item to dequeue at the root of the tree
structure rather than a leaf, for one. There are, of course, other
priority queue structures (e.g. van Emde Boas) able to exploit the
limited precision of the priority key for faster asymptotics, though
actual performance is an open question.

Another advantage of heaps is that they support decreasing priorities
directly, so that instead of removal and reinsertion, a less invasive
movement within the tree is possible. This nets additional constant
factor improvements beyond those for the next item to dequeue for the
case where a task remains runnable, but is preempted and its priority
decreased while it remains runnable.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever. There is only one
> central tunable:
> /proc/sys/kernel/sched_granularity_ns
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.

I like not relying on timeslices. Timeslices ultimately get you into
2.4.x-like epoch expiry scenarios and therefore introduce a number of
RR-esque artifacts.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.

I'm always suspicious of these claims. A moderately formal regression
test suite needs to be assembled and the testcases rather seriously
cleaned up so they e.g. run for a deterministic period of time, have
their parameters passable via command-line options instead of editing
and recompiling, don't need Lindenting to be legible, and so on. With
that in hand, a battery of regression tests can be run against scheduler
modifications to verify their correctness and to detect any disturbance
in scheduling semantics they might cause.

A very serious concern is that while a fresh scheduler may pass all
these tests, later modifications may later cause failures unnoticed
because no one's doing the regression tests and there's no obvious
test suite for testing types to latch onto. Another is that the
testcases themselves may bitrot if they're not maintainable code.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> aggressively than under the vanilla scheduler.

Speaking of regression tests, let's please at least state intended
nice semantics and get a regression test for CPU bandwidth distribution
by nice levels going.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )

And there's another one. sched_yield() semantics need a regression test
more transparent than VolanoMark or other macrobenchmarks.

At some point we really need to decide what our sched_yield() is
intended to do and get something out there to detect whether it's
behaving as intended.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> - reworked/sanitized SMP load-balancing: the runqueue-walking
> assumptions are gone from the load-balancing code now, and
> iterators of the scheduling modules are used. The balancing code got
> quite a bit simpler as a result.

The SMP load balancing class operations strike me as unusual and likely
to trip over semantic issues in alternative scheduling classes. Getting
some alternative scheduling classes out there to clarify the issues
would help here, too.


A more general question here is what you mean by "completely fair;"
there doesn't appear to be inter-tgrp, inter-pgrp, inter-session,
or inter-user fairness going on, though one might argue those are
relatively obscure notions of fairness. Complete fairness arguably
precludes static prioritization by nice levels, so there is also
that. There is also the issue of what a fair CPU bandwidth distribution
between tasks of varying desired in-isolation CPU utilization might be.
I suppose my thorniest point is where the demonstration of fairness is
as, say, a testcase. Perhaps it's fair now; when will we find out when
that fairness has been disturbed?

What these things mean when there are multiple CPU's to schedule across
may also be of concern.

I propose the following two testcases:
(1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels
Create a number of tasks at varying nice levels. Measure the
CPU bandwidth allocated to each.
Success depends on intent: we decide up-front that a given nice
level should correspond to a given share of CPU bandwidth.
Check to see how far from the intended distribution of CPU
bandwidth according to those decided-up-front shares the actual
distribution of CPU bandwidth is for the test.

(2) CPU bandwidth distribution of tasks with varying CPU demands
Create a number of tasks that would in isolation consume
varying %cpu. Measure the CPU bandwidth allocated to each.
Success depends on intent here, too. Decide up-front that a
given %cpu that would be consumed in isolation should
correspond to a given share of CPU bandwidth and check the
actual distribution of CPU bandwidth vs. what was intended.
Note that the shares need not linearly correspond to the
%cpu; various sorts of things related to interactivity will
make this nonlinear.

A third testcase for sched_yield() should be brewed up.

These testcases are oblivious to SMP. This will demand that a scheduling
policy integrate with load balancing to the extent that load balancing
occurs for the sake of distributing CPU bandwidth according to nice level.
Some explicit decision should be made regarding that.
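
As a starting point, testcase (1) could look something like the rough
user-space sketch below. The nice levels and the 60-second window are
arbitrary choices, the program assumes it has been confined to a single
CPU (e.g. via taskset), and it is not part of any existing test suite:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>

#define DURATION 60     /* seconds the hogs compete for the CPU */

static const int nice_levels[] = { -5, 0, 5, 10, 19 };
#define NTASKS (sizeof(nice_levels) / sizeof(nice_levels[0]))

int main(void)
{
        pid_t pids[NTASKS];
        unsigned int i;

        for (i = 0; i < NTASKS; i++) {
                pids[i] = fork();
                if (pids[i] < 0) {
                        perror("fork");
                        exit(1);
                }
                if (pids[i] == 0) {
                        volatile unsigned long x = 0;

                        errno = 0;
                        if (nice(nice_levels[i]) == -1 && errno)
                                perror("nice"); /* negative nice needs root */
                        for (;;)                /* pure CPU hog */
                                x++;
                }
        }

        sleep(DURATION);
        for (i = 0; i < NTASKS; i++)
                kill(pids[i], SIGKILL);

        for (i = 0; i < NTASKS; i++) {
                struct rusage ru;
                double secs;

                if (wait4(pids[i], NULL, 0, &ru) < 0) {
                        perror("wait4");
                        continue;
                }
                secs = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
                printf("nice %3d: %6.2fs CPU (%5.1f%% of wall clock)\n",
                       nice_levels[i], secs, 100.0 * secs / DURATION);
        }
        return 0;
}

Whether the resulting distribution is "correct" still depends on deciding
up front what share each nice level is supposed to get, as argued above.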


-- wli

Willy Tarreau

Apr 13, 2007, 6:40:08 PM
On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote:

>
> * Daniel Walker <dwa...@mvista.com> wrote:
>
> > I'm not in love with the current or other schedulers, so I'm
> > indifferent to this change. However, I was reviewing your release
> > notes and the patch and found myself wondering what the algorithmic
> > complexity of this new scheduler is .. I assumed it would also be
> > constant time , but the __enqueue_task_fair doesn't appear to be
> > constant time (rbtree insert complexity).. [...]
>
> i've been worried about that myself and i've done extensive measurements
> before choosing this implementation. The rbtree turned out to be a quite
> compact data structure: we get it quite cheaply as part of the task
> structure cachemisses - which have to be touched anyway. For 1000 tasks
> it's a loop of ~10 - that's still very fast and bound in practice.

I'm not worried at all by O(log(n)) algorithms, and generally prefer smart log(n)
than dumb O(1).

In a userland TCP stack I started to write 2 years ago, I used a comparable
scheduler and could reach a sustained rate of 145000 connections/s at 4
millions of concurrent connections. And yes, each time a packet was sent or
received, a task was queued/dequeued (so about 450k/s with 4 million tasks,
on an athlon 1.5 GHz). So that seems much higher than what we currently need.

Regards,
Willy

Willy Tarreau

Apr 13, 2007, 6:40:10 PM
Hi Ingo,

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:

> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

(...)


> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).

I have a high confidence this will work better : I've been using
time-ordered trees in userland projects for several years, and never
found anything better. To be honnest, I never understood the concept
behind the array switch, but as I never felt brave enough to hack
something in this kernel area, I simply preferred to shut up (not
enough knowledge and not enough time).

However, I have been using a very fast struct timeval-ordered RADIX
tree. I found generic rbtree code to generally be slower, certainly
because of the call to a function with arguments on every node. Both
trees are O(log(n)), the rbtree being balanced and the radix tree
being unbalanced. If you're interested, I can try to see how that
would fit (but not this week-end).

Also, I had spent much time in the past doing paper work on how to
improve fairness between interactive tasks and batch tasks. I came
up with the conclusion that for perfectness, tasks should not be
ordered by their expected wakeup time, but by their expected completion
time, which automatically takes account of their allocated and used
timeslice. It would also allow both types of workloads to share equal
CPU time with better responsiveness for the most interactive one through
the reallocation of a "credit" for the tasks which have not consumed
all of their timeslices. I remember we had discussed this with Mike
about one year ago when he fixed lots of problems in mainline scheduler.
The downside is that I never found how to make this algo fit in
O(log(n)). I always ended in something like O(n.log(n)) IIRC.

But maybe this is overkill for real life anyway. Given that a basic
two-array switch (which I never understood) was sufficient for many
people, a basic tree will probably be an order of magnitude better.

> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever. There is only one
> central tunable:
>
> /proc/sys/kernel/sched_granularity_ns
>
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.

I find this useful, but to be fair with Mike and Con, they both have
proposed similar tuning knobs in the past and you said you did not want
to add that complexity for admins. People can sometimes be demotivated
by seeing their proposals finally used by people who first rejected them.
And since both Mike and Con have done a wonderful job in that area,
we need their experience and continued active participation more than ever.

> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.

I'm very pleased to read this. Because as I have already said it, my major
concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the
basic problems for me to the point that I switched the default lilo entry
on my notebook to 2.6 ! I hope that whatever the next scheduler will be,
we'll definitely get rid of any heuristics. Heuristics are good in 95% of
situations and extremely bad in the remaining 5%. I prefer something
reasonably good in 100% of situations.

> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> aggressively than under the vanilla scheduler.
>
> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )
>
> - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
> way than the vanilla scheduler does. It uses 100 runqueues (for all
> 100 RT priority levels, instead of 140 in the vanilla scheduler)
> and it needs no expired array.
>
> - reworked/sanitized SMP load-balancing: the runqueue-walking
> assumptions are gone from the load-balancing code now, and
> iterators of the scheduling modules are used. The balancing code got
> quite a bit simpler as a result.

Will this have any impact on NUMA/HT/multi-core/etc... ?

> the core scheduler got smaller by more than 700 lines:

Well done !

Cheers,
Willy

Ingo Molnar

Apr 13, 2007, 6:40:10 PM

* Daniel Walker <dwa...@mvista.com> wrote:

> I'm not in love with the current or other schedulers, so I'm
> indifferent to this change. However, I was reviewing your release
> notes and the patch and found myself wondering what the algorithmic
> complexity of this new scheduler is .. I assumed it would also be
> constant time , but the __enqueue_task_fair doesn't appear to be

> constant time (rbtree insert complexity).. [...]

i've been worried about that myself and i've done extensive measurements
before choosing this implementation. The rbtree turned out to be a quite
compact data structure: we get it quite cheaply as part of the task
structure cachemisses - which have to be touched anyway. For 1000 tasks
it's a loop of ~10 - that's still very fast and bound in practice.

here's a test i did under CFS. Lets take some ridiculous load: 1000
infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted
into the same rbtree), and lets run lat_ctx:

neptune:~/l> uptime
22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51

neptune:~/l> ./lat_ctx -s 0 2
"size=0k ovr=1.61
2 1.41

lets stop the 1000 tasks and only have ~2 tasks in the runqueue:

neptune:~/l> ./lat_ctx -s 0 2

"size=0k ovr=1.70
2 1.16

so the overhead is 0.25 usecs. Considering the load (1000 tasks trash
the cache like crazy already), this is more than acceptable.

Ingo

Ingo Molnar

Apr 13, 2007, 7:00:13 PM

* William Lee Irwin III <w...@holomorphy.com> wrote:

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
> > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> > new scheduler will be active by default and all tasks will default
> > to the new SCHED_FAIR interactive scheduling class. ]
>
> A pleasant surprise, though I did see it coming.

hey ;)

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > Highlights are:
> > - the introduction of Scheduling Classes: an extensible hierarchy of
> > scheduler modules. These modules encapsulate scheduling policy
> > details and are handled by the scheduler core without the core
> code making too many assumptions about them.
>
> It probably needs further clarification that they're things on the
> order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization

> amongst the classes is furthermore assumed, and so on. [...]

yep - they are linked via sched_ops->next pointer, with NULL delimiting
the last one.

> [...] They're not quite capable of being full-blown alternative

> policies, though quite a bit can be crammed into them.

yeah, they are not full-blown: i extended them on-demand, for the
specific purposes of sched_fair.c and sched_rt.c. More can be done too.

> There are issues with the per- scheduling class data not being very

> well-abstracted. [...]

yes. It's on my TODO list: i'll work more on extending the cleanups to
those fields too.

> A binomial heap would likely serve your purposes better than rbtrees.
> It's faster to have the next item to dequeue at the root of the tree
> structure rather than a leaf, for one. There are, of course, other
> priority queue structures (e.g. van Emde Boas) able to exploit the
> limited precision of the priority key for faster asymptotics, though
> actual performance is an open question.

i'm caching the leftmost leaf, which serves as an alternate, task-pick
centric root in essence.
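
( purely as an illustration of that caching idea, with invented names
  rather than the patch's actual data structures: )

#include <linux/rbtree.h>
#include <linux/types.h>

struct timeline {
        struct rb_root  root;
        struct rb_node  *leftmost;      /* cached rb_first(&root) */
};

struct fair_task {
        struct rb_node  run_node;
        u64             fair_key;
};

static void timeline_insert(struct timeline *tl, struct fair_task *p)
{
        struct rb_node **link = &tl->root.rb_node, *parent = NULL;
        int leftmost = 1;

        while (*link) {
                struct fair_task *entry;

                parent = *link;
                entry = rb_entry(parent, struct fair_task, run_node);
                if (p->fair_key < entry->fair_key) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        leftmost = 0;   /* went right at least once */
                }
        }
        rb_link_node(&p->run_node, parent, link);
        rb_insert_color(&p->run_node, &tl->root);

        /* only descended to the left: the new node is the new leftmost */
        if (leftmost)
                tl->leftmost = &p->run_node;
}

static void timeline_erase(struct timeline *tl, struct fair_task *p)
{
        /* keep the cache valid: the successor becomes the leftmost node */
        if (tl->leftmost == &p->run_node)
                tl->leftmost = rb_next(&p->run_node);
        rb_erase(&p->run_node, &tl->root);
}

/* task picking needs no tree walk at all */
static struct fair_task *timeline_pick(struct timeline *tl)
{
        return tl->leftmost ?
                rb_entry(tl->leftmost, struct fair_task, run_node) : NULL;
}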

> Another advantage of heaps is that they support decreasing priorities
> directly, so that instead of removal and reinsertion, a less invasive
> movement within the tree is possible. This nets additional constant
> factor improvements beyond those for the next item to dequeue for the
> case where a task remains runnable, but is preempted and its priority
> decreased while it remains runnable.

yeah. (Note that in CFS i'm not decreasing priorities anywhere though -
all the priority levels in CFS stay constant, fairness is not achieved
via rotating priorities or similar, it is achieved via the accounting
code.)

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > due to its design, the CFS scheduler is not prone to any of the
> > 'attacks' that exist today against the heuristics of the stock
> > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> > work fine and do not impact interactivity and produce the expected
> > behavior.
>

> I'm always suspicious of these claims. [...]

hey, sure - but please give it a go nevertheless, i _did_ test all these
;)

> A moderately formal regression test suite needs to be assembled [...]

by all means feel free! ;)

> A more general question here is what you mean by "completely fair;"

by that i mean the most common-sense definition: with N tasks running
each gets 1/N CPU time if observed for a reasonable amount of time. Now
extend this to arbitrary scheduling patterns, the end result should
still be completely fair, according to the fundamental 1/N(time) rule
individually applied to all the small scheduling patterns that the
scheduling patterns give. (this assumes that the scheduling patterns are
reasonably independent of each other - if they are not then there's no
reasonable definition of fairness that makes sense, and we might as well
use the 1/N rule for those cases too.)
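
( as a toy way of checking that 1/N rule from measured data - not part of
the patch or of any existing testsuite - one could look at the worst
relative deviation from an even split; the sample numbers below are the
per-thread CPU-time totals of a massive_intr run like the one quoted
elsewhere in this thread: )

#include <stdio.h>

/* worst-case relative deviation from a perfectly even 1/N split */
static double max_unfairness(const double *cpu_time, int n)
{
        double total = 0.0, ideal, worst = 0.0;
        int i;

        for (i = 0; i < n; i++)
                total += cpu_time[i];
        if (n == 0 || total == 0.0)
                return 0.0;

        ideal = total / n;
        for (i = 0; i < n; i++) {
                double dev = (cpu_time[i] - ideal) / ideal;

                if (dev < 0)
                        dev = -dev;
                if (dev > worst)
                        worst = dev;
        }
        return worst;
}

int main(void)
{
        /* e.g. the per-thread CPU-time column of a massive_intr run */
        double sample[] = { 112, 113, 112, 112, 112, 112, 113, 113, 112, 113 };

        printf("worst deviation from 1/N: %.2f%%\n",
               100.0 * max_unfairness(sample, 10));
        return 0;
}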

> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or
> inter-user fairness going on, though one might argue those are

> relatively obscure notions of fairness. [...]

sure, i mainly concentrated on what we have in Linux today. The things
you mention are add-ons that i can see handling via new scheduling
classes: all the CKRM and containers type of CPU time management
facilities.

> What these things mean when there are multiple CPU's to schedule
> across may also be of concern.

that is handled by the existing smp-nice load balancer, that logic is
preserved under CFS.

> These testcases are oblivious to SMP. This will demand that a
> scheduling policy integrate with load balancing to the extent that
> load balancing occurs for the sake of distributing CPU bandwidth
> according to nice level. Some explicit decision should be made
> regarding that.

this should already work reasonably fine with CFS: try massive_intr.c on
an SMP box.

Ingo

Ingo Molnar

Apr 13, 2007, 7:20:07 PM

* Willy Tarreau <w...@1wt.eu> wrote:

> > central tunable:
> >
> > /proc/sys/kernel/sched_granularity_ns
> >
> > which can be used to tune the scheduler from 'desktop' (low
> > latencies) to 'server' (good batching) workloads. It defaults to a
> > setting suitable for desktop workloads. SCHED_BATCH is handled by the
> > CFS scheduler module too.
>
> I find this useful, but to be fair with Mike and Con, they both have
> proposed similar tuning knobs in the past and you said you did not

> want to add that complexity for admins. [...]

yeah. [ Note that what i opposed in the past was mostly the 'export all
the zillion of sched.c knobs to /sys and let people mess with them' kind
of patches which did exist and still exist. A _single_ knob, which
represents basically the totality of parameters within sched_fair.c is
less of a problem. I dont think i ever objected to this knob within
staircase/SD. (If i did then i was dead wrong.) ]

> [...] People can sometimes be demotivated by seeing their proposals

> finally used by people who first rejected them. And since both Mike
> and Con have done a wonderful job in that area, we need their
> experience and continued active participation more than ever.

very much so! Both Con and Mike have contributed regularly to upstream
sched.c:

$ git-log kernel/sched.c | grep 'by: Con Kolivas' | wc -l
19

$ git-log kernel/sched.c | grep 'by: Mike' | wc -l
6

and i'd very much like both counts to increase steadily in the future
too :)

> > - reworked/sanitized SMP load-balancing: the runqueue-walking
> > assumptions are gone from the load-balancing code now, and
> > iterators of the scheduling modules are used. The balancing code
> > got quite a bit simpler as a result.
>
> Will this have any impact on NUMA/HT/multi-core/etc... ?

it will inevitably have some sort of effect - and if it's negative, i'll
try to fix it.

I got rid of the explicit cache-hot tracking code and replaced it with a
more natural pure 'pick the next-to-run task first, that is likely the
most cache-cold one' logic. That just derives naturally from the rbtree
approach.

> > the core scheduler got smaller by more than 700 lines:
>
> Well done !

thanks :)

Ingo

Gabriel C

Apr 13, 2007, 7:20:08 PM
Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
>
[...]

> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,
>

Compile error here.

...

CC kernel/sched.o
kernel/sched.c: In function '__rq_clock':
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c:219: warning: type defaults to 'int' in declaration of
'__ret_warn_once'
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'rq_clock':
kernel/sched.c:230: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'sched_init':
kernel/sched.c:6013: warning: unused variable 'j'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2
==> ERROR: Build Failed. Aborting...

...

There the config :

http://frugalware.org/~crazy/other/kernel/config

> Ingo
> -
>
>

Regards,

Gabriel

Ingo Molnar

Apr 13, 2007, 7:30:11 PM

* Gabriel C <nix.o...@googlemail.com> wrote:

> > as usual, any sort of feedback, bugreports, fixes and suggestions
> > are more than welcome,
>
> Compile error here.

ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also
updated the full patch at the cfs-scheduler URL)

Ingo

----------------------->
From: Ingo Molnar <mi...@elte.hu>
Subject: [cfs] fix !CONFIG_SMP build

fix the !CONFIG_SMP build error reported by Gabriel C

Signed-off-by: Ingo Molnar <mi...@elte.hu>

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl
return rq->rq_clock;
}

-static inline unsigned long long rq_clock(struct rq *rq)
-{
- int this_cpu = smp_processor_id();
-
- if (this_cpu == rq->cpu)
- return __rq_clock(rq);
-
- return rq->rq_clock;
-}
-
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
@@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq)
#endif
}

+static inline unsigned long long rq_clock(struct rq *rq)
+{
+ int this_cpu = smp_processor_id();
+
+ if (this_cpu == cpu_of(rq))
+ return __rq_clock(rq);
+
+ return rq->rq_clock;
+}
+
/*
* The domain tree (rq->sd) is protected by RCU's quiescent state transition.
* See detach_destroy_domains: synchronize_sched for details.

William Lee Irwin III

Apr 13, 2007, 7:40:07 PM
* William Lee Irwin III <w...@holomorphy.com> wrote:
>> A binomial heap would likely serve your purposes better than rbtrees.
>> It's faster to have the next item to dequeue at the root of the tree
>> structure rather than a leaf, for one. There are, of course, other
>> priority queue structures (e.g. van Emde Boas) able to exploit the
>> limited precision of the priority key for faster asymptotics, though
>> actual performance is an open question.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> i'm caching the leftmost leaf, which serves as an alternate, task-pick
> centric root in essence.

I noticed that, yes. It seemed a better idea to me to use a data
structure that has what's needed built-in, but I suppose it's not gospel.


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> Another advantage of heaps is that they support decreasing priorities
>> directly, so that instead of removal and reinsertion, a less invasive
>> movement within the tree is possible. This nets additional constant
>> factor improvements beyond those for the next item to dequeue for the
>> case where a task remains runnable, but is preempted and its priority
>> decreased while it remains runnable.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> yeah. (Note that in CFS i'm not decreasing priorities anywhere though -
> all the priority levels in CFS stay constant, fairness is not achieved
> via rotating priorities or similar, it is achieved via the accounting
> code.)

Sorry, "priority" here would be from the POV of the queue data
structure. From the POV of the scheduler it would be resetting the
deadline or whatever the nomenclature cooked up for things is, most
obviously in requeue_task_fair() and task_tick_fair().


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> I'm always suspicious of these claims. [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> hey, sure - but please give it a go nevertheless, i _did_ test all these
> ;)

The suspicion essentially centers around how long the state of affairs
will hold up because comprehensive re-testing is not noticeably done
upon updates to scheduling code or kernel point releases.


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> A moderately formal regression test suite needs to be assembled [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by all means feel free! ;)

I can only do so much, but I have done work to clean up other testcases
going around. I'm mostly looking at testcases as I go over them or
develop some interest in the subject and rewriting those that already
exist or hammering out new ones as I need them. The main contribution
toward this is that I've sort of made a mental note to stash the results
of the effort somewhere and pass them along to those who do regular
testing on kernels or otherwise import test suites into their collections.


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> A more general question here is what you mean by "completely fair;"

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by that i mean the most common-sense definition: with N tasks running
> each gets 1/N CPU time if observed for a reasonable amount of time. Now
> extend this to arbitrary scheduling patterns, the end result should
> still be completely fair, according to the fundamental 1/N(time) rule
> individually applied to all the small scheduling patterns that the
> scheduling patterns give. (this assumes that the scheduling patterns are
> reasonably independent of each other - if they are not then there's no
> reasonable definition of fairness that makes sense, and we might as well
> use the 1/N rule for those cases too.)

I'd start with identically-behaving CPU-bound tasks here. It's easy
enough to hammer out a testcase that starts up N CPU-bound tasks, runs
them for a few minutes, stops them, collects statistics on their
runtime, and gives us an idea of whether 1/N came out properly. I'll
get around to that at some point.

Where it gets complex is when the behavior patterns vary, e.g. they're
not entirely CPU-bound and their desired in-isolation CPU utilization
varies, or when nice levels vary, or both vary. I went on about
testcases for those in particular in the prior post, though not both
at once. The nice level one in particular needs an up-front goal for
distribution of CPU bandwidth in a mixture of competing tasks with
varying nice levels.

There are different ways to define fairness, but a uniform distribution
of CPU bandwidth across a set of identical competing tasks is a good,
testable definition.


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or
>> inter-user fairness going on, though one might argue those are
>> relatively obscure notions of fairness. [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> sure, i mainly concentrated on what we have in Linux today. The things
> you mention are add-ons that i can see handling via new scheduling
> classes: all the CKRM and containers type of CPU time management
> facilities.

At some point the CKRM and container people should be pinged to see
what (if anything) they need to achieve these sorts of things. It's
not clear to me that the specific cases I cited are considered
relevant to anyone. I presume that if they are, someone will pipe
up with a feature request. It was more a sort of catalogue of different
notions of fairness that could arise than any sort of suggestion.


* William Lee Irwin III <w...@holomorphy.com> wrote:
>> What these things mean when there are multiple CPU's to schedule
>> across may also be of concern.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> that is handled by the existing smp-nice load balancer, that logic is
> preserved under CFS.

Given the things going wrong, I'm curious as to whether that works, and
if so, how well. I'll drop that into my list of testcases that should be
arranged for, though I won't guarantee that I'll get to it myself in any
sort of timely fashion.

What this ultimately needs is specifying the semantics of nice levels
so that we can say that a mixture of competing tasks with varying nice
levels should have an ideal distribution of CPU bandwidth to check for.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> this should already work reasonably fine with CFS: try massive_intr.c on
> an SMP box.

Where is massive_intr.c, BTW?


-- wli

Ingo Molnar

Apr 13, 2007, 7:50:08 PM

* William Lee Irwin III <w...@holomorphy.com> wrote:

> Where it gets complex is when the behavior patterns vary, e.g. they're
> not entirely CPU-bound and their desired in-isolation CPU utilization

> varies, or when nice levels vary, or both vary. [...]

yes. I tested things like 'massive_intr.c' (attached, written by Satoru
Takeuchi) which starts N tasks which each work for 8msec then sleep
1msec:

from its output, the second column is the CPU time each thread got, the
more even, the fairer the scheduling. On vanilla i get:

mercury:~> ./massive_intr 10 10
024873 00000150
024874 00000123
024870 00000069
024868 00000068
024866 00000051
024875 00000206
024872 00000093
024869 00000138
024867 00000078
024871 00000223

on CFS i get:

neptune:~> ./massive_intr 10 10
002266 00000112
002260 00000113
002261 00000112
002267 00000112
002269 00000112
002265 00000112
002262 00000113
002268 00000113
002264 00000112
002263 00000113

so it is quite a bit more even ;)

another related test-utility is one i wrote:

http://people.redhat.com/mingo/scheduler-patches/ring-test.c

this is a ring of 100 tasks each doing work for 100 msecs and then
sleeping for 1 msec. I usually test this by also running a CPU hog in
parallel to it, and checking whether it gets ~50.0% of CPU time under
CFS. (it does)
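
( and for the '~50% of CPU time' check, something as dumb as the
following self-monitoring CPU hog does the job - this is just an
illustrative stand-in, not the actual ring-test.c: )

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static double now_sec(void)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

static double cpu_sec(void)
{
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
               ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main(void)
{
        double wall0 = now_sec(), cpu0 = cpu_sec(), last = wall0;

        for (;;) {              /* spin, printing our CPU share once a second */
                double wall = now_sec();

                if (wall - last >= 1.0) {
                        printf("cpu share: %5.1f%%\n",
                               100.0 * (cpu_sec() - cpu0) / (wall - wall0));
                        fflush(stdout);
                        last = wall;
                }
        }
}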

Ingo

massive_intr.c

Gabriel C

Apr 13, 2007, 7:50:07 PM
Ingo Molnar wrote:
> * Gabriel C <nix.o...@googlemail.com> wrote:
>
>
>>> as usual, any sort of feedback, bugreports, fixes and suggestions
>>> are more than welcome,
>>>
>> Compile error here.
>>
>
> ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also
> updated the full patch at the cfs-scheduler URL)
>

Yes it does , thx :) , only the " warning: unused variable 'j' " left.

> Ingo
>

Regards,

Gabriel

William Lee Irwin III

Apr 13, 2007, 8:00:12 PM
* William Lee Irwin III <w...@holomorphy.com> wrote:
>> Where it gets complex is when the behavior patterns vary, e.g. they're
>> not entirely CPU-bound and their desired in-isolation CPU utilization
>> varies, or when nice levels vary, or both vary. [...]

On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote:
> yes. I tested things like 'massive_intr.c' (attached, written by Satoru
> Takeuchi) which starts N tasks which each work for 8msec then sleep
> 1msec:

[...]


> another related test-utility is one i wrote:
> http://people.redhat.com/mingo/scheduler-patches/ring-test.c
> this is a ring of 100 tasks each doing work for 100 msecs and then
> sleeping for 1 msec. I usually test this by also running a CPU hog in
> parallel to it, and checking whether it gets ~50.0% of CPU time under
> CFS. (it does)

These are both tremendously useful. The code is also in rather good
shape so only minimal modifications (for massive_intr.c I'm not even
sure if any are needed at all) are needed to plug them into the test
harness I'm aware of. I'll queue them both for me to adjust and send
over to testers I don't want to burden with hacking on testcases I
myself am asking them to add to their suites.

Daniel Walker

Apr 13, 2007, 8:10:08 PM

One other thing: what happens in the case of slow, frequency-changing,
and/or inaccurate clocks? Is the old sched_clock behavior still
tolerated?

Daniel

Nick Piggin

Apr 13, 2007, 10:10:07 PM
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

Always good to see another contender ;)

>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> new scheduler will be active by default and all tasks will default
> to the new SCHED_FAIR interactive scheduling class. ]

I don't know why there is such noise about fairness right now... I
thought fairness was one of the fundamental properties of a good CPU
scheduler, and my scheduler definitely always aims for that above most
other things. Why not just keep SCHED_OTHER?


> Highlights are:
>
> - the introduction of Scheduling Classes: an extensible hierarchy of
> scheduler modules. These modules encapsulate scheduling policy
> details and are handled by the scheduler core without the core
> code making too many assumptions about them.

Don't really like this, but anyway...


> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
>
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!

I guess the 2.4 and earlier scheduler kind of did that as well.


> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]

Comment about the code: shouldn't you be requeueing the task in the rbtree
wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick
look so far).


> CFS's design is quite radical: it does not use runqueues, it uses a
> time-ordered rbtree to build a 'timeline' of future task execution,
> and thus has no 'array switch' artifacts (by which both the vanilla
> scheduler and RSDL/SD are affected).
>
> CFS uses nanosecond granularity accounting and does not rely on any
> jiffies or other HZ detail. Thus the CFS scheduler has no notion of
> 'timeslices' and has no heuristics whatsoever.

Well, I guess there is still some mechanism to decide which process is most
eligible to run? ;) Considering that question has no "right" answer for
SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted
they are obviously fairly elegant in contrast to the O(1) scheduler ;)


> There is only one
> central tunable:
>
> /proc/sys/kernel/sched_granularity_ns

Suppose you have 2 CPU hogs running, is sched_granularity_ns the
frequency at which they will context switch?


> ( another rdetail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )

What is better behaviour for sched_yield?

Thanks,
Nick

Ingo Molnar

Apr 14, 2007, 2:40:07 AM

* Nick Piggin <npi...@suse.de> wrote:

> > The CFS patch uses a completely different approach and implementation
> > from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> > that of RSDL/SD, which is a high standard to meet :-) Testing
> > feedback is welcome to decide this one way or another. [ and, in any
> > case, all of SD's logic could be added via a kernel/sched_sd.c module
> > as well, if Con is interested in such an approach. ]
>
> Comment about the code: shouldn't you be requeueing the task in the

> rbtree wherever you change wait_runtime? eg. task_new_fair? [...]

yes: the task's position within the rbtree is updated whenever
wait_runtime is changed. task_new_fair is the method during new
task creation, but indeed i forgot to requeue the parent. I've fixed
this in my tree (see the delta patch below) - thanks!

Ingo

----------->
From: Ingo Molnar <mi...@elte.hu>

Subject: [cfs] fix parent's rbtree position

Nick noticed that upon fork we change parent->wait_runtime but we do not
requeue it within the rbtree.

Signed-off-by: Ingo Molnar <mi...@elte.hu>

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq,

p->wait_runtime = parent->wait_runtime/2;
parent->wait_runtime /= 2;
+ requeue_task_fair(rq, parent);
+
/*
* For the first timeslice we allow child threads
* to move their parent-inherited fairness back

Ingo Molnar

Apr 14, 2007, 2:50:10 AM

* Ingo Molnar <mi...@elte.hu> wrote:

> Nick noticed that upon fork we change parent->wait_runtime but we do
> not requeue it within the rbtree.

this fix is not complete - because the child runqueue is locked here,
not the parent's. I've fixed this properly in my tree and have uploaded
a new sched-modular+cfs.patch. (the effects of the original bug are
mostly harmless, the rbtree position gets corrected the first time the
parent reschedules. The fix might improve heavy forker handling.)
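
( for illustration, a rough sketch of the intended sequence - lock the
  parent's runqueue, not the child's, around the wait_runtime split and
  the requeue. This is only a hedged sketch, not the actual fix in the
  uploaded patch: task_rq_lock()/task_rq_unlock() are the existing
  helpers, requeue_task_fair() is from the delta patch above, and the
  lock ordering against the already-held child runqueue is left out: )

	struct rq *parent_rq;
	unsigned long flags;

	/* lock the *parent's* runqueue before touching its wait_runtime */
	parent_rq = task_rq_lock(parent, &flags);
	p->wait_runtime = parent->wait_runtime/2;
	parent->wait_runtime /= 2;
	requeue_task_fair(parent_rq, parent);
	task_rq_unlock(parent_rq, &flags);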

Ingo

Willy Tarreau
Apr 14, 2007, 4:20:08 AM
On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <mi...@elte.hu> wrote:
>
> > Nick noticed that upon fork we change parent->wait_runtime but we do
> > not requeue it within the rbtree.
>
> this fix is not complete - because the child runqueue is locked here,
> not the parent's. I've fixed this properly in my tree and have uploaded
> a new sched-modular+cfs.patch. (the effects of the original bug are
> mostly harmless, the rbtree position gets corrected the first time the
> parent reschedules. The fix might improve heavy forker handling.)

It looks like it did not reach your public dir yet.

BTW, I've given it a try. It seems pretty usable. I have also tried
the usual meaningless "glxgears" test with 12 of them at the same time,
and they rotate very smoothly, there is absolutely no pause in any of
them. But they don't all run at same speed, and top reports their CPU
load varying from 3.4 to 10.8%, with what looks like more CPU is
assigned to the first processes, and less CPU for the last ones. But
this is just a rough observation on a stupid test, I would not call
that one scientific in any way (and X has its share in the test too).

I'll perform other tests when I can rebuild with your fixed patch.

Cheers,
Willy

Willy Tarreau
Apr 14, 2007, 4:40:11 AM
On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
> >
> > * Ingo Molnar <mi...@elte.hu> wrote:
> >
> > > Nick noticed that upon fork we change parent->wait_runtime but we do
> > > not requeue it within the rbtree.
> >
> > this fix is not complete - because the child runqueue is locked here,
> > not the parent's. I've fixed this properly in my tree and have uploaded
> > a new sched-modular+cfs.patch. (the effects of the original bug are
> > mostly harmless, the rbtree position gets corrected the first time the
> > parent reschedules. The fix might improve heavy forker handling.)
>
> It looks like it did not reach your public dir yet.
>
> BTW, I've given it a try. It seems pretty usable. I have also tried
> the usual meaningless "glxgears" test with 12 of them at the same time,
> and they rotate very smoothly, there is absolutely no pause in any of
> them. But they don't all run at same speed, and top reports their CPU
> load varying from 3.4 to 10.8%, with what looks like more CPU is
> assigned to the first processes, and less CPU for the last ones. But
> this is just a rough observation on a stupid test, I would not call
> that one scientific in any way (and X has its share in the test too).

Follow-up: I think this is mostly X-related. I've started 100 scheddos,
and all get the same CPU percentage. Interestingly, mpg123 in parallel
never skips at all because it needs quite a bit less than 1% CPU and gets
its fair share at a load of 112. Xterms are slow to respond to typing
with the 12 gears and 100 scheddos, and as expected it was X which was
starving. Renicing it to -5 restores the normal feeling with very slow
but smooth gear rotations. Leaving X niced at 0 and killing the gears
also restores normal behaviour.

All in all, it seems logical that processes which serve many others
become a bottleneck for them.

Forking becomes very slow above a load of 100 it seems. Sometimes,
the shell takes 2 or 3 seconds to return to prompt after I run
"scheddos &"

Those are very promising results, I nearly observe the same responsiveness
as I had on a solaris 10 with 10k running processes on a bigger machine.

I would be curious what a mysql test result would look like now.

Regards,

Ingo Molnar
Apr 14, 2007, 6:40:13 AM

* Willy Tarreau <w...@1wt.eu> wrote:

> > this fix is not complete - because the child runqueue is locked
> > here, not the parent's. I've fixed this properly in my tree and have
> > uploaded a new sched-modular+cfs.patch. (the effects of the original
> > bug are mostly harmless, the rbtree position gets corrected the
> > first time the parent reschedules. The fix might improve heavy
> > forker handling.)
>
> It looks like it did not reach your public dir yet.

oops, forgot to do the last step - should be fixed now.

> BTW, I've given it a try. It seems pretty usable. I have also tried
> the usual meaningless "glxgears" test with 12 of them at the same
> time, and they rotate very smoothly, there is absolutely no pause in
> any of them. But they don't all run at same speed, and top reports
> their CPU load varying from 3.4 to 10.8%, with what looks like more
> CPU is assigned to the first processes, and less CPU for the last
> ones. But this is just a rough observation on a stupid test, I would
> not call that one scientific in any way (and X has its share in the
> test too).

ok, i'll try that too - there should be nothing particularly special
about glxgears.

there's another tweak you could try:

echo 500000 > /proc/sys/kernel/sched_granularity_ns

note that this causes preemption to be done as fast as the scheduler can
do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the
best results a CONFIG_HZ of 1000 is useful.)

plus there's an add-on to CFS at:

http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch

this makes the 'CPU usage history cutoff' configurable and sets it to a
default of 100 msecs. This means that CPU hogs (tasks which actively
kept other tasks from running) will be remembered, for up to 100 msecs
of their 'hogness'.

Setting this limit back to 0 gives the 'vanilla' CFS scheduler's
behavior:

echo 0 > /proc/sys/kernel/sched_max_hog_history_ns

(So when trying this you dont have to reboot with this patch
applied/unapplied, just set this value.)
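
( conceptually the add-on is just a clamp - a hedged sketch, not the
  actual sched-fair-hog.patch; the sysctl variable name below is only
  assumed from the /proc path, and the sign convention assumes a CPU
  hog accumulates negative wait_runtime: )

	/* remember at most sched_max_hog_history_ns worth of 'hogness' */
	if (p->wait_runtime < -(s64)sysctl_sched_max_hog_history_ns)
		p->wait_runtime = -(s64)sysctl_sched_max_hog_history_ns;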

> I'll perform other tests when I can rebuild with your fixed patch.

cool, thanks!

Ingo

Ingo Molnar
Apr 14, 2007, 7:00:10 AM

* Daniel Walker <dwa...@mvista.com> wrote:

> One other thing, what happens in the case of slow, frequency changing,
> are/or inaccurate clocks .. Is the old sched_clock behavior still
> tolerated?

yeah, good question. Yesterday i did a quick testboot with that too, and
it seemed to behave pretty OK with the low-res [jiffies based]
sched_clock() too. Although in that case things are much more of an
approximation and rounding/arithmetic artifacts are possible. CFS works
best with a high-resolution cycle counter.

Ingo

Ingo Molnar
Apr 14, 2007, 7:00:12 AM

* Willy Tarreau <w...@1wt.eu> wrote:

> Forking becomes very slow above a load of 100 it seems. Sometimes, the
> shell takes 2 or 3 seconds to return to prompt after I run "scheddos
> &"

this might be changed/impacted by the parent-requeue fix that is in the
updated (for real, promise! ;) patch. Right now on CFS a forking parent
shares its own run stats with the child 50%/50%. This means that heavy
forkers are indeed penalized. Another logical choice would be 100%/0%: a
child has to earn its own right.

i kept the 50%/50% rule from the old scheduler, but maybe it's a more
pristine (and smaller/faster) approach to just not give new children any
stats history to begin with. I've implemented an add-on patch that
implements this, you can find it at:

http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch
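
( the difference between the two policies is tiny - a hedged sketch,
  not the actual patch contents: )

	/* current 50%/50% rule in task_new_fair(): */
	p->wait_runtime = parent->wait_runtime/2;
	parent->wait_runtime /= 2;

	/*
	 * the 100%/0% alternative (sketch only): the child starts with no
	 * inherited stats history and has to earn its own; the parent's
	 * wait_runtime is left untouched, so no requeue is needed either.
	 */
	p->wait_runtime = 0;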

> Those are very promising results, I nearly observe the same
> responsiveness as I had on a solaris 10 with 10k running processes on
> a bigger machine.

cool and thanks for the feedback! (Btw., as another test you could also
try to renice "scheddos" to +19. While that does not push the scheduler
nearly as hard as nice 0, it is perhaps more indicative of how a truly
abusive many-tasks workload would be run in practice.)

Ingo

Willy Tarreau
Apr 14, 2007, 9:30:08 AM
On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
>
> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it.

Ah, I found something nasty.
If I start large batches of processes like this :

$ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done

the ramp up slows down after 700-800 processes, but something very
strange happens. If I'm under X, I can switch the focus to all xterms
(the WM is still alive) but all xterms are frozen. On the console,
after one moment I simply cannot switch to another VT anymore while
I can still start commands locally. But "chvt 2" simply blocks.
SysRq-K killed everything and restored full control. Dmesg shows lots
of :
SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.

I wonder if part of the problem would be too many processes bound to
the same tty :-/

I'll investigate a bit.

Willy

Willy Tarreau
Apr 14, 2007, 9:40:09 AM
On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <w...@1wt.eu> wrote:
>
> > Forking becomes very slow above a load of 100 it seems. Sometimes, the
> > shell takes 2 or 3 seconds to return to prompt after I run "scheddos
> > &"
>
> this might be changed/impacted by the parent-requeue fix that is in the
> updated (for real, promise! ;) patch. Right now on CFS a forking parent
> shares its own run stats with the child 50%/50%. This means that heavy
> forkers are indeed penalized. Another logical choice would be 100%/0%: a
> child has to earn its own right.
>
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more
> pristine (and smaller/faster) approach to just not give new children any
> stats history to begin with. I've implemented an add-on patch that
> implements this, you can find it at:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch

Not tried yet, it already looks better with the update and sched-fair-hog.
Now xterm open "instantly" even with 1000 running processes.

> > Those are very promising results, I nearly observe the same
> > responsiveness as I had on a solaris 10 with 10k running processes on
> > a bigger machine.
>
> cool and thanks for the feedback! (Btw., as another test you could also
> try to renice "scheddos" to +19. While that does not push the scheduler
> nearly as hard as nice 0, it is perhaps more indicative of how a truly
> abusive many-tasks workload would be run in practice.)

Good idea. The machine I'm typing from now has 1000 scheddos running at +19,
and 12 gears at nice 0. Top keeps reporting different cpu usages for all gears,
but I'm pretty sure that it's a top artifact now because the cumulated times
are roughly identical :

14:33:13 up 13 min, 7 users, load average: 900.30, 443.75, 177.70
1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped
CPU0 states: 56.0% user 43.0% system 23.0% nice 0.0% iowait 0.0% idle
CPU1 states: 94.0% user 5.0% system 0.0% nice 0.0% iowait 0.0% idle
Mem: 1034764k av, 223788k used, 810976k free, 0k shrd, 7192k buff
104400k active, 51904k inactive
Swap: 497972k av, 0k used, 497972k free 68020k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
1325 root 20 0 69240 9400 3740 R 27.6 0.9 4:46 1 X
1412 willy 20 0 6284 2552 1740 R 14.2 0.2 1:09 1 glxgears
1419 willy 20 0 6256 2384 1612 R 10.7 0.2 1:09 1 glxgears
1409 willy 20 0 2824 1940 788 R 8.9 0.1 0:25 1 top
1414 willy 20 0 6280 2544 1728 S 8.9 0.2 1:08 0 glxgears
1415 willy 20 0 6256 2376 1600 R 8.9 0.2 1:07 1 glxgears
1417 willy 20 0 6256 2384 1612 S 8.9 0.2 1:05 1 glxgears
1420 willy 20 0 6284 2552 1740 R 8.9 0.2 1:07 1 glxgears
1410 willy 20 0 6256 2372 1600 S 7.1 0.2 1:11 1 glxgears
1413 willy 20 0 6260 2388 1612 S 7.1 0.2 1:08 0 glxgears
1416 willy 20 0 6284 2544 1728 S 6.2 0.2 1:06 0 glxgears
1418 willy 20 0 6252 2384 1612 S 6.2 0.2 1:09 0 glxgears
1411 willy 20 0 6280 2548 1740 S 5.3 0.2 1:15 1 glxgears
1421 willy 20 0 6280 2536 1728 R 5.3 0.2 1:05 1 glxgears

From time to time, one of the 12 aligned gears will quickly perform a full
quarter turn while the others slowly turn by a few degrees. In fact, while
I don't know this process's CPU usage pattern, there's something useful in
it : it allows me to visually see when processes accelerate/decelerate. What
would be best would be just a clock requiring low X resources and eating
vast amounts of CPU between movements. It would help to visually monitor CPU
distribution without being too much impacted by X.

I've just added another 100 scheddos at nice 0, and the system is still
amazingly usable. I just tried exchanging a 1-byte token between 188 "dd"
processes which communicate through circular pipes. The context switch
rate is rather high but this has no impact on the rest :

willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1105 0 1 0 781108 8364 68180 0 0 0 12 5 82187 59 41 0
1114 0 1 0 781108 8364 68180 0 0 0 0 0 81528 58 42 0
1112 0 1 0 781108 8364 68180 0 0 0 0 1 80899 58 42 0
1113 0 1 0 781108 8364 68180 0 0 0 0 26 83466 58 42 0
1106 0 2 0 781108 8376 68168 0 0 0 8 91 83193 58 42 0
1107 0 1 0 781108 8376 68180 0 0 0 4 7 79951 58 42 0
1106 0 1 0 781108 8376 68180 0 0 0 0 46 80939 57 43 0
1114 0 1 0 781108 8376 68180 0 0 0 0 21 82019 56 44 0
1116 0 1 0 781108 8376 68180 0 0 0 0 16 85134 56 44 0
1114 0 3 0 781108 8388 68168 0 0 0 16 20 85871 56 44 0
1112 0 1 0 781108 8388 68168 0 0 0 0 15 80412 57 43 0
1112 0 1 0 781108 8388 68180 0 0 0 0 101 83002 58 42 0
1113 0 1 0 781108 8388 68180 0 0 0 0 25 82230 56 44 0

Playing with the sched_max_hog_history_ns does not seem to change anything.
Maybe it's useful for other workloads. Anyway, I have nothing to complain
about, because it's not common for me to be able to normally type a mail on
a system with more than 1000 running processes ;-)

Also, mixed with this load, I have started injecting HTTP requests between
two local processes. The load is stable at 7700 req/s (11800 when alone),
and what I was interested in is the response time. It's perfectly stable
between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Those were
varying a lot under stock scheduler, with some sessions sometimes pausing
for seconds. (RSDL fixed this though).

Well, I'll stop heating the room for now as I get out of ideas about how
to defeat it. I'm convinced. I'm impatient to read about Mike's feedback
with his workload which behaves strangely on RSDL. If it works OK here,
it will be the proof that heuristics should not be needed.

Congrats !
Willy

Willy Tarreau
Apr 14, 2007, 10:50:20 AM
On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> >
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
>
> Ah, I found something nasty.
> If I start large batches of processes like this :
>
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>
> the ramp up slows down after 700-800 processes, but something very
> strange happens. If I'm under X, I can switch the focus to all xterms
> (the WM is still alive) but all xterms are frozen. On the console,
> after one moment I simply cannot switch to another VT anymore while
> I can still start commands locally. But "chvt 2" simply blocks.
> SysRq-K killed everything and restored full control. Dmesg shows lots
> of :
> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> I wonder if part of the problem would be too many processes bound to
> the same tty :-/

Does not seem easy to reproduce, it looks like some resource pools are
kept pre-allocated after a first run, because if I kill scheddos during
the ramp up then start it again, it can go further. The problem happens
when the parent is forking. Also, I modified scheddos to close(0,1,2)
and to perform the forks itself and it does not cause any problem, even
with 4000 processes running. So I really suspect that the problem I
encountered above was tty-related.

BTW, I've tried your fork patch. It definitely helps forking because it
takes below one second to create 4000 processes, then the load slowly
increases. As you said, the children have to earn their share, and I
find that it makes it easier to conserve control of the whole system's
stability.

Regards,

S.Çağlar Onur
Apr 14, 2007, 11:20:23 AM
On Friday 13 April 2007, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:

I've been using Linus's current git + your extra patches + CFS for a while
now. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I seek
forward/backward while it's playing a video under some workload (checking out
SVN repositories, compiling something). Stopping the other processes didn't
help kaffeine, so it stays frozen until I kill it.

I'm not sure whether it's a xine-lib or kaffeine bug (because mplayer doesn't
have that problem), but I can't reproduce this with mainline or mainline +
sd-0.39.

[1] http://cekirdek.pardus.org.tr/~caglar/psaux
--
S.Çağlar Onur <cag...@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Mark Lord
Apr 14, 2007, 11:20:21 AM
Ingo Molnar wrote:
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more
> pristine (and smaller/faster) approach to just not give new children any
> stats history to begin with. I've implemented an add-on patch that
> implements this, you can find it at:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch

I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper)
with the new CFS for much of this morning now, with the odd switch back to
the stock scheduler for comparison.

Here, CFS really works and feels better than the stock scheduler.

Even with a "make -j2" kernel rebuild happening (no manual renice, either!)
things "just work" about as smoothly as ever. That's something which RSDL
never achieved for me, though I have not retested RSDL beyond v0.34 or so.

Well done, Ingo! I *want* this as my default scheduler.

Things seemed slightly less smooth when I had the CPU hogs
and fair-fork extension patches both applied.

I'm going to try again now with just the fair-fork added on.

Cheers

Mark

Ingo Molnar
Apr 14, 2007, 12:20:08 PM

* S.Çağlar Onur <cag...@pardus.org.tr> wrote:

> > i'm pleased to announce the first release of the "Modular Scheduler
> > Core and Completely Fair Scheduler [CFS]" patchset:
>
> I've been using Linus's current git + your extra patches + CFS for a
> while now. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if
> I seek forward/backward while it's playing a video under some workload
> (checking out SVN repositories, compiling something). Stopping the
> other processes didn't help kaffeine, so it stays frozen until I kill
> it.

hm, could you try to strace it and/or attach gdb to it and figure out
what's wrong? (perhaps involving the Kaffeine developers too?) As long
as it's not a kernel level crash i cannot see how the scheduler could
directly cause this - other than by accident creating a scheduling
pattern that triggers a user-space bug more often than with other
schedulers.

> [1] http://cekirdek.pardus.org.tr/~caglar/psaux

looks quite weird!

Ingo

Ingo Molnar
Apr 14, 2007, 12:20:05 PM

* Willy Tarreau <w...@1wt.eu> wrote:

> BTW, I've tried your fork patch. It definitely helps forking because
> it takes below one second to create 4000 processes, then the load
> slowly increases. As you said, the children have to earn their share,
> and I find that it makes it easier to conserve control of the whole
> system's stability.

ok, thanks for testing this out, i think i'll integrate this one back
into the core. (I'm still unsure about the cpu-hog one.) And it saves
some code-size too:

   text    data     bss     dec     hex filename
  23349    2705      24   26078    65de kernel/sched.o.cfs-v1
  23189    2705      24   25918    653e kernel/sched.o.cfs-before
  23052    2705      24   25781    64b5 kernel/sched.o.cfs-after

  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40

Ingo

Ingo Molnar
Apr 14, 2007, 12:30:12 PM

* Willy Tarreau <w...@1wt.eu> wrote:

> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> >
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
>
> Ah, I found something nasty.
> If I start large batches of processes like this :
>
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>
> the ramp up slows down after 700-800 processes, but something very
> strange happens. If I'm under X, I can switch the focus to all xterms
> (the WM is still alive) but all xterms are frozen. On the console,
> after one moment I simply cannot switch to another VT anymore while I
> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
> killed everything and restored full control. Dmesg shows lots of :

> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> I wonder if part of the problem would be too many processes bound to
> the same tty :-/

hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
maybe this description rings a bell with them?

Ingo

S.Çağlar Onur
Apr 14, 2007, 1:10:13 PM
On Saturday 14 April 2007, Ingo Molnar wrote:
> hm, could you try to strace it and/or attach gdb to it and figure out
> what's wrong? (perhaps involving the Kaffeine developers too?) As long
> as it's not a kernel level crash i cannot see how the scheduler could
> directly cause this - other than by accident creating a scheduling
> pattern that triggers a user-space bug more often than with other
> schedulers.

...
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
+++ killed by SIGINT +++

is where freeze occurs. Full log can be found at [1]

> > [1] http://cekirdek.pardus.org.tr/~caglar/psaux
>
> looks quite weird!

:)

[1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine


Eric W. Biederman
Apr 14, 2007, 1:20:10 PM
Ingo Molnar <mi...@elte.hu> writes:

> * Willy Tarreau <w...@1wt.eu> wrote:
>
>> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
>> >
>> > Well, I'll stop heating the room for now as I get out of ideas about how
>> > to defeat it.
>>
>> Ah, I found something nasty.
>> If I start large batches of processes like this :
>>
>> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>>
>> the ramp up slows down after 700-800 processes, but something very
>> strange happens. If I'm under X, I can switch the focus to all xterms
>> (the WM is still alive) but all xterms are frozen. On the console,
>> after one moment I simply cannot switch to another VT anymore while I
>> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
>> killed everything and restored full control. Dmesg shows lots of :
>
>> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.

This. Yes. SAK is noisy and tells you everything it kills.

>> I wonder if part of the problem would be too many processes bound to
>> the same tty :-/
>
> hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
> maybe this description rings a bell with them?

Is there any swapping going on?

I'm inclined to suspect that it is a problem that has more to do with the
number of processes and has nothing to do with ttys.

Anyway you can easily rule out ttys by having your startup program
detach from a controlling tty before you start everything.

I'm more inclined to guess something is reading /proc a lot, or doing
something that holds the tasklist lock, a lot or something like that,
if the problem isn't that you are being kicked into swap.

Eric

Willy Tarreau
Apr 14, 2007, 1:40:09 PM
Hi Eric,

[...]


> >> the ramp up slows down after 700-800 processes, but something very
> >> strange happens. If I'm under X, I can switch the focus to all xterms
> >> (the WM is still alive) but all xterms are frozen. On the console,
> >> after one moment I simply cannot switch to another VT anymore while I
> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K
> >> killed everything and restored full control. Dmesg shows lots of :
> >
> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>
> This. Yes. SAK is noisy and tells you everything it kills.

OK, that's what I suspected, but I did not know if the fact that it talked
about the session was systematic or related to any particular state when it
killed the task.

> >> I wonder if part of the problem would be too many processes bound to
> >> the same tty :-/
> >
> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan),
> > maybe this description rings a bell with them?
>
> Is there any swapping going on?

Not at all.

> I'm inclined to suspect that it is a problem that has more to do with the
> number of processes and has nothing to do with ttys.

It is clearly possible. What I found strange is that I could still fork
processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
It first happened under X with frozen xterms but a perfectly usable WM,
then I reproduced it on pure console to rule out any potential X problem.

> Anyway you can easily rule out ttys by having your startup program
> detach from a controlling tty before you start everything.
>
> I'm more inclined to guess something is reading /proc a lot, or doing
> something that holds the tasklist lock, a lot or something like that,
> if the problem isn't that you are being kicked into swap.

Oh I'm sorry you were invited into the discussion without a first description
of the context. I was giving a try to Ingo's new scheduler, and trying to
reach corner cases with lots of processes competing for CPU.

I simply used a "for" loop in bash to fork 1000 processes, and this problem
happened between 700-800 children. The program only uses a busy loop and a
pause. I then changed my program to close 0,1,2 and perform the fork itself,
and the problem vanished. So there are two differences here :

- bash not forking anymore
- far less FDs on /dev/tty1

At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
in this area.

I agree that this is not normal usage at all, I'm just trying to attack
Ingo's scheduler to ensure it is more robust than the stock one. But
sometimes brute force methods can make other sleeping problems pop up.

Thinking about it, I don't know if there are calls to schedule() while
switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
simply blocked. It would have been possible that a schedule() call
somewhere got starved due to the load, I don't know.

Thanks,
Willy

Eric W. Biederman
Apr 14, 2007, 1:50:08 PM
Willy Tarreau <w...@1wt.eu> writes:

Yes. But with /dev/tty1 being the controlling terminal in both cases,
as you haven't dropped your session, or disassociated your tty.

The bash problem may have something to do with setpgid or scheduling
effects. Hmm. I just looked and setpgid does grab the tasklist lock for
writing, so we may possibly have some contention there.

> At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
> in this area.
>
> I agree that this is not normal usage at all, I'm just trying to attack
> Ingo's scheduler to ensure it is more robust than the stock one. But
> sometimes brute force methods can make other sleeping problems pop up.

Yep. If we can narrow it down to one that would be interesting. Of course
that also means when we start finding other possible sleeping problems, people
are working in areas of code they don't normally touch, so we must investigate.

> Thinking about it, I don't know if there are calls to schedule() while
> switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
> simply blocked. It would have been possible that a schedule() call
> somewhere got starved due to the load, I don't know.

It looks like there is a call to schedule_work.

There are two pieces of the path. If you are switching in and out of a tty
controlled by something like X. User space has to grant permission before
the operation happens. Where there isn't a gate keeper I know it is cheaper
but I don't know by how much, I suspect there is still a schedule happening
in there.

Eric

Linus Torvalds
Apr 14, 2007, 2:00:41 PM

On Sat, 14 Apr 2007, Willy Tarreau wrote:
>
> It is clearly possible. What I found strange is that I could still fork
> processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.

Considering the patches in question, it's almost definitely just a CPU
scheduling problem with starvation.

The VT switching is obviously done by the kernel, but the kernel will
signal and wait for the "controlling process" for the VT. The most obvious
case of that is X, of course, but even in text mode I think gpm will
have taken control of the VT's it runs on (all of them), which means that
when you initiate a VT switch, the kernel will actually signal the
controlling process (gpm), and wait for it to acknowledge the switch.

If gpm doesn't get a timeslice for some reason (and it sounds like there
may be some serious unfairness after "fork()"), your behaviour is
explainable.

(NOTE! I've never actually looked at gpm sources or what it really does,
so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and
something else is going on, but quite frankly, it sounds like the obvious
candidate for this bug. Explaining it with some non-scheduler-related
thing sounds unlikely, considering the patch in question).

Linus

Ingo Molnar
Apr 14, 2007, 2:01:05 PM

* Eric W. Biederman <ebie...@xmission.com> wrote:

> > Thinking about it, I don't know if there are calls to schedule()
> > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
> > "chvt 2" simply blocked. It would have been possible that a
> > schedule() call somewhere got starved due to the load, I don't know.
>
> It looks like there is a call to schedule_work.

so this goes over keventd, right?

> There are two pieces of the path. If you are switching in and out of a
> tty controlled by something like X. User space has to grant
> permission before the operation happens. Where there isn't a gate
> keeper I know it is cheaper but I don't know by how much, I suspect
> there is still a schedule happening in there.

Could keventd perhaps be starved? Willy, to exclude this possibility,
could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
the command to set it to SCHED_FIFO:50 would be:

chrt -f -p 50 5

but ... events/0 is reniced to -5 by default, so it should definitely
not be starved.

Ingo

Bill Huey
Apr 15, 2007, 11:36:01 AM
On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote:
...
> Now that you're agreeing my direction was correct you've done the usual Linux
> kernel thing - ignore all my previous code and write your own version. Oh
> well, that I've come to expect; at least you get a copyright notice in the
> bootup and somewhere in the comments give me credit for proving it's
> possible. Let's give some other credit here too. William Lee Irwin provided
> the major architecture behind plugsched at my request and I simply finished
> the work and got it working. He is also responsible for many IRC discussions
> I've had about cpu scheduling fairness, designs, programming history and code
> help. Even though he did not contribute code directly to SD, his comments
> have been invaluable.

Hello folks,

I think the main failure I see here is that Con wasn't included in this design
or in a private review process. There could have been better co-ownership of the
code. This could also have been done openly on lkml (since this is kind of what
this medium is about to a significant degree) so that consensus can happen (Con
can be reasoned with). It would have achieved the same thing but probably more
smoothly if folks had just listened, considered an idea and then, in this case,
created something that would allow for experimentation from outsiders in a
fluid fashion.

If these issues aren't fixed, you're going to be stuck with the same kind of
creeping elitism that has gradually killed the FreeBSD project and other BSDs.
I can't comment on the code implementation. I'm focused on other things now that
I'm at NetApp and I can't help out as much as I could. Being formerly of BSDi,
I had a first-hand account of these issues as they played out.

A development process like this is likely to exclude smart people from wanting
to contribute to Linux, and folks should be conscious of these issues. It's
basically a lot of code and concept that at least two individuals have worked
on (wli and con) only to have it be rejected and then suddenly replaced by
code from a community gatekeeper. In this case, this results in both Con and
Bill Irwin being woefully underutilized.

If I were one of these people, I'd be mighty pissed.

bill

Con Kolivas
Apr 15, 2007, 11:37:37 AM
On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.

The casual observer will be completely confused by what on earth has happened
here so let me try to demystify things for them.

1. I tried in vain some time ago to push a working extensible pluggable cpu
scheduler framework (based on wli's work) for the linux kernel. It was
perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as
being absolutely the wrong approach and that we should never do that. Oddly
enough the linux-kernel mailing list was -dead- at the time and the
discussion did not make it to the mailing list. Every time I've tried to
forward it to the mailing list the spam filter decided to drop it so most
people have not even seen this original veto-forever discussion.

2. Since then I've been thinking/working on a cpu scheduler design that takes
all the guesswork out of scheduling and gives very predictable, as fair
as possible, cpu distribution and latency while preserving as solid
interactivity as possible within those confines. For weeks now, Ingo has said
that the interactivity regressions were showstoppers and we should address
them, never mind the fact that the so-called regressions were purely "it
slows down linearly with load" which to me is perfectly desirable behaviour.
While this was not perma-vetoed, I predicted pretty accurately your intent
was to veto it based on this.

People kept claiming scheduling problems were few and far between but what was
really happening is users were terrified of lkml and instead used 1. windows
and 2. 2.4 kernels. The problems were there.

So where are we now? Here is where your latest patch comes in.

As a solution to the many scheduling problems we finally all agree exist, you
propose a patch that adds 1. a limited pluggable framework and 2. a fairness
based cpu scheduler policy... o_O

So I should be happy at last now that the things I was promoting you are also
promoting, right? Well I'll fill in the rest of the gaps and let other people
decide how I should feel.

> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,

In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and
having trips in and out of hospitals for my condition. I appreciate greatly
the sympathy and patience from people in this regard. However at one stage I
virtually begged for support with my attempts and help with the code. Dmitry
Adamushko is the only person who actually helped me with the code in the
interim, while others poked sticks at it. Sure the sticks helped at times but
the sticks always seemed to have their ends kerosene doused and flaming for
reasons I still don't get. No other help was forthcoming.

Now that you're agreeing my direction was correct you've done the usual Linux
kernel thing - ignore all my previous code and write your own version. Oh
well, that I've come to expect; at least you get a copyright notice in the
bootup and somewhere in the comments give me credit for proving it's
possible. Let's give some other credit here too. William Lee Irwin provided
the major architecture behind plugsched at my request and I simply finished
the work and got it working. He is also responsible for many IRC discussions
I've had about cpu scheduling fairness, designs, programming history and code
help. Even though he did not contribute code directly to SD, his comments
have been invaluable.

So let's look at the code.

kernel/sched.c
kernel/sched_fair.c
kernel/sched_rt.c

It turns out this is not a pluggable cpu scheduler framework at all, and I
guess you didn't really promote it as such. It's a "modular scheduler core".
Which means you moved code from sched.c into sched_fair.c and sched_rt.c.
This abstracts out each _scheduling policy's_ functions into struct
sched_class and allows each scheduling policy's functions to be in a separate
file etc.

Ok so what it means is that instead of whole cpu schedulers being able to be
plugged into this framework we can plug in only cpu scheduling policies....
hrm... So let's look on

-#define SCHED_NORMAL 0

Ok once upon a time we rename SCHED_OTHER which every other unix calls the
standard policy 99.9% of applications used into a more meaningful name,
SCHED_NORMAL. That's fine since all it did was change the description
internally for those reading the code. Let's see what you've done now:

+#define SCHED_FAIR 0

You've renamed it again. This is, I don't know what exactly to call it, but an
interesting way of making it look like there is now more choice. Well,
whatever you call it, everything in linux spawned from init without
specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed
SCHED_NORMAL and now SCHED_FAIR.

You encouraged me to create a sched_sd.c to add onto your design as well.
Well, what do I do with that? I need to create another scheduling policy for
that code to even be used. A separate scheduling policy requires a userspace
change to even benefit from it. Even if I make that sched_sd.c patch, people
cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to
read SCHED_SD 0 or similar. The same goes for original staircase cpusched,
nicksched, zaphod, spa_ws, ebs and so on.

So what you've achieved with your patch is - replaced the current scheduler
with another one and moved it into another file. There is no choice, and no
pluggability, just code trumping.

Do I support this? In this form.... no.

It's not that I don't like your new scheduler. Heck it's beautiful like most
of your _serious_ code. It even comes with a catchy name that's bound to give
people hard-ons (even though many schedulers aim to be completely fair, yours
has been named that for maximum selling power). The complaint I have is that
you are not providing quite what you advertise (on the modular front), or
perhaps you're advertising it as such to make it look more appealing; I'm not
sure.

Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything
different, and that this is anything other than your NIH (Not Invented Here)
cpu scheduling policy rewrite which will probably end up taking its position
in mainline after yet another truckload of regression/performance tests and
so on. I haven't seen an awful lot of comparisons with SD yet, just people
jumping on your bandwagon which is fine I guess. Maybe a few tiny tests
showing less than 5% variation in their fairness from what I can see. Either
way, I already feel you've killed off SD... like pretty much everything else
I've done lately. At least I no longer have to try and support my code mostly
by myself.

In the interest of putting aside any ego concerns since this is about linux
and not me...

Because... You are a hair's breadth away from producing something that I
would support, which _does_ do what you say and produces the pluggability
we're all begging for with only tiny changes to the code you've already done.
Make Kconfig let you choose which sched_*.c gets built into the kernel, and
make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig
and even choose one of the alternative built-in ones with boot parameters; your
code has more clout than mine will (ie do exactly what plugsched does). Then
we can have 7 schedulers in the linux kernel within a few weeks. Oh no! This is
the very thing Linus didn't want in specialisation with the cpu schedulers!
Does this mean this idea will be vetoed yet again? In all likelihood, yes.

I guess I have lots to put into -ck still... sigh.

> Ingo

--
-ck

Ingo Molnar
Apr 15, 2007, 11:46:17 AM

* Mike Galbraith <efa...@gmx.de> wrote:

> On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:
>
> > Well, I'll stop heating the room for now as I get out of ideas about
> > how to defeat it. I'm convinced. I'm impatient to read about Mike's
> > feedback with his workload which behaves strangely on RSDL. If it
> > works OK here, it will be the proof that heuristics should not be
> > needed.
>

> You mean the X + mp3 player + audio visualization test? X+Gforce
> visualization have problems getting half of my box in the presence of
> two other heavy cpu using tasks. Behavior is _much_ better than
> RSDL/SD, but the synchronous nature of X/client seems to be a problem.
>
> With this scheduler, renicing X/client does cure it, whereas with SD
> it did not help one bit. [...]

thanks for testing it! I was quite worried about your setup - two tasks
using up 50%/50% of CPU time, pitted against a kernel rebuild workload
seems to be a hard workload to get right.

> [...] (I know a trivial way to cure that, and this framework makes
> that possible without dorking up fairness as a general policy.)

great! Please send patches so i can add them (once you are happy with
the solution) - i think your workload isnt special in any way and could
hit other people too.

Ingo

William Lee Irwin III
Apr 15, 2007, 11:46:54 AM
On Fri, 13 Apr 2007, William Lee Irwin III wrote:
>> A binomial heap would likely serve your purposes better than rbtrees.
[...]

On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contain queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the slot:
> S = Ib + min((T - Tb) / Tg, Ns - 1);
> Items with T longer than Ns*Tg will be scheduled in the relative last slot
> (choosing a proper Ns and Tg can minimize this).
> Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
> suit your needs.

I used a similar sort of queue in the virtual deadline scheduler I
wrote in 2003 or thereabouts. CFS uses queue priorities with too high
a precision to map directly to this (queue priorities are marked as
"key" in the cfs code and should not be confused with task priorities).
The elder virtual deadline scheduler used millisecond resolution and a
rather different calculation for its equivalent of ->key, which
explains how it coped with a limited priority space.

The two basic attacks on such large priority spaces are the near future
vs. far future subdivisions and subdividing the priority space into
(most often regular) intervals. Subdividing the priority space into
intervals is the most obvious; you simply use some O(lg(n)) priority
queue as the bucket discipline in the "time ring," queue by the upper
bits of the queue priority in the time ring, and by the lower bits in
the O(lg(n)) bucket discipline. The near future vs. far future
subdivision is maintaining the first N tasks in a low-constant-overhead
structure like a sorted list and the remainder in some other sort of
queue structure intended to handle large numbers of elements gracefully.
The distribution of queue priorities strongly influences which of the
methods is most potent, though it should be clear the methods can be
used in combination.
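
( to make the time-ring bucketing above concrete, here is a purely
  illustrative sketch; none of these names come from CFS or from
  Davide's code, and RING_SLOTS/granularity correspond to the Ns/Tg
  parameters he describes: )

#include <linux/list.h>
#include <linux/types.h>
#include <linux/bitops.h>
#include <asm/div64.h>

#define RING_SLOTS	256		/* Ns, a power of 2 */

struct time_ring {
	struct list_head slots[RING_SLOTS];	/* per-slot queues */
	unsigned long	 bitmap[RING_SLOTS / BITS_PER_LONG]; /* non-empty slots */
	u64		 base_time;	/* Tb */
	unsigned int	 base_idx;	/* Ib */
	unsigned int	 granularity;	/* Tg, nanoseconds per slot */
};

/* slot for an item due at time t: S = Ib + min((t - Tb) / Tg, Ns - 1) */
static unsigned int time_ring_slot(struct time_ring *ring, u64 t)
{
	u64 delta = t > ring->base_time ? t - ring->base_time : 0;

	do_div(delta, ring->granularity);	/* delta becomes the quotient */
	if (delta > RING_SLOTS - 1)
		delta = RING_SLOTS - 1;

	return (ring->base_idx + (unsigned int)delta) & (RING_SLOTS - 1);
}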


-- wli

Pekka J Enberg
Apr 15, 2007, 12:01:04 PM
On Sun, 15 Apr 2007, Willy Tarreau wrote:
> Ingo could have publicly spoken with them about his ideas of killing
> the O(1) scheduler and replacing it with an rbtree-based one, and using
> part of Bill's work to speed up development.

He did exactly that and he did it with a patch. Nothing new here. This is
how development on LKML proceeds when you have two or more competing
designs. There's absolutely no need to get upset or hurt your feelings
over it. It's not malicious, it's how we do Linux development.

Bill Huey
Apr 15, 2007, 12:04:08 PM
On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote:
> I prefer such early releases to lkml _alot_ more than any private review
> process. I released the CFS code about 6 hours after i thought "okay,
> this looks pretty good" and i spent those final 6 hours on testing it
> (making sure it doesnt blow up on your box, etc.), in the final 2 hours
> i showed it to two folks i could reach on IRC (Arjan and Thomas) and on
> various finishing touches. It doesnt get much faster than that and i
> definitely didnt want to sit on it even one day longer because i very
> much thought that Con and others should definitely see this work!
>
> And i very much credited (and still credit) Con for the whole fairness
> angle:
>
> || i'd like to give credit to Con Kolivas for the general approach here:
> || he has proven via RSDL/SD that 'fair scheduling' is possible and that
> || it results in better desktop scheduling. Kudos Con!
>
> the 'design consultation' phase you are talking about is _NOW_! :)
>
> I got the v1 code out to Con, to Mike and to many others ASAP. That's
> how you are able to comment on this thread and be part of the
> development process to begin with, in a 'private consultation' setup
> you'd not have had any opportunity to see _any_ of this.
>
> In the BSD space there seem to be more 'political' mechanisms for
> development, but Linux is truly about doing things out in the open, and
> doing it immediately.

I can't even begin to talk about how screwed up BSD development is. Maybe
another time privately.

Ok, Linux development and inclusiveness can be improved. I'm not trying
to "call you out" (slang for accusing you with the sole intention to call
you crazy in a highly confrontational manner). This is discussed publicly
here to bring this issue to light, open a communication channel as a means
to resolve it.

> Okay? ;-)

It's cool. We're still getting to know each other professionally and it's
okay to a certain degree to have a communication disconnect but only as
long as it clears. Your productivity is amazing BTW. But here's the
problem, there's this perception that NIH is the default mentality here
in Linux.

Con feels that this kind of action is intentional and has a malicious
quality to it, as a means of "churn squatting" sections of the kernel tree.
The perception here is that there is this expectation that
sections of the Linux kernel are intentionally "churn squatted" to prevent
any other ideas from creeping in other than those of the owner of that subsystem
(VM, scheduling, etc...) because of a lack of modularity in the kernel.
This isn't an API question but possibly a question of general code quality
and of how the maintenance of it can be handled.

This was predicted by folks and then this perception was *realized* when
you wrote the equivalent kind of code that has technical overlap with SDL
(this is just one dry example). To a person that is writing new code for
Linux, having one of the old guard write equivalent code to that of a
newcomer has the effect of displacing that person, both with regard to the
code and to responsibility for it. When this happens over and over again
and folks get annoyed by it, it starts to seem that Linux development
is elitist.

I know this because I heard (read) Con's IRC chats about these matters
all of the time. This is not just his view but a view shared by other
kernel folks with differing perspectives. The closing talk at OLS
2006 was highly disturbing in many ways. It went "Christoph is right,
everybody else is wrong", which sends a highly negative message to new
kernel developers that, say, don't work for RH directly or for any of the
other mainstream Linux companies. After a while, it starts seeming like
this kind of behavior is completely intentional and that Linux is
full of arrogant bastards.

What I would have done here was to contact Peter Williams, Bill Irwin
and Con about what you're doing and reach a common consensus about how
to create something that would be inclusive of all of their ideas.
Discussions can get technically heated but that's ok, the discussion is
happening and it brings down the wall of this perception. Bill and
Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra.
It might be very useful, it might not be. Folks are all stubborn
about their ideas and hold on to them for dear life. Effective
leaders can deconstruct this hostility and animosity. I don't claim
to be one.

Because of past hostility to something like schedplugin, the hostility
and terseness of responses can be perceived simply as "I'm right,
you're wrong", which is condescending. This affects discussion and
outright destroys a constructive process if it happens continually,
since it reinforces that view of "You're an outsider, we don't care
about you". Nobody is listening to each other at that point, folks get
pissed. Then they think "I'm going to NIH this person with patch
X because he/she did the same here", which is dysfunctional.

Oddly enough, sometimes you're the best person to get a new idea
into the tree. What's not happening here is communication. That takes
sensitivity, careful listening which is a difficult skill, and then
understanding of the characters involved to unify creative energies.

That's a very difficult thing to do for folks that are used to working
solo. It takes time to develop trust in those relationships so that
a true collaboration can happen. I know that there is a lot of
creativity in folks like Con and Bill. It would be wise to develop a
dialog with them to see if they can offload some of your work for you
(we all know you're really busy) yet have you be a key facilitator of
their and your ideas. That's a really tough thing to do and it requires
practice. Just imagine (assuming they can follow through) what positive
things could have happened if their collective knowledge had been leveraged
better. It's not all clear and rosy, but I think these people are more
on your side than you might realize, and it might be a good thing to
discover that.

This is tough because I know the personalities involved and I know
kind of how people function and malfunction in this discussion on a
personal basis.

[We can continue privately. This is not just about you but applicable
to open source development in general]

The tone of this email is intellectually critical (not meant as a
personality attack) and calm. If I'm otherwise, then I'm a bastard. :)

Mike Galbraith
Apr 15, 2007, 12:05:46 PM
On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efa...@gmx.de> wrote:

> > [...] (I know a trivial way to cure that, and this framework makes
> > that possible without dorking up fairness as a general policy.)
>
> great! Please send patches so i can add them (once you are happy with
> the solution) - i think your workload isnt special in any way and could
> hit other people too.

I'll give it a shot. (have to read and actually understand your new
code first though, then see if it's really viable)

-Mike

Ingo Molnar
Apr 15, 2007, 12:11:13 PM

* Esben Nielsen <nielse...@googlemail.com> wrote:

> I took a brief look at it. Have you tested priority inheritance?

yeah, you are right, it's broken at the moment, i'll fix it. But the
good news is that i think PI could become cleaner via scheduling
classes.

> As far as I can see rt_mutex_setprio doesn't have much effect on
> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task
> change scheduler class when boosted in rt_mutex_setprio().

i think via scheduling classes we dont have to do the p->policy and
p->prio based gymnastics anymore, we can just have a clean look at
p->sched_class and stack the original scheduling class into
p->real_sched_class. It would probably also make sense to 'privatize'
p->prio into the scheduling class. That way PI would be a pure property
of sched_rt, and the PI scheduler would be driven purely by
p->rt_priority, not by p->prio. That way all the normal_prio() kind of
complications and interactions with SCHED_OTHER/SCHED_FAIR would be
eliminated as well. What do you think?
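
( a hedged sketch of that direction - the real_sched_class field and the
  rt_sched_class name are assumptions for illustration, not something in
  the current patch: )

	if (rt_prio(prio) && p->sched_class != &rt_sched_class) {
		p->real_sched_class = p->sched_class;	/* stack the original */
		p->sched_class = &rt_sched_class;
	} else if (!rt_prio(prio) && p->real_sched_class) {
		p->sched_class = p->real_sched_class;	/* un-boost */
		p->real_sched_class = NULL;
	}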

Ingo

Willy Tarreau
Apr 15, 2007, 12:15:47 PM
On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote:
> On 4/15/07, hui Bill Huey <bi...@gnuppy.monkey.org> wrote:
> >The perception here is that there is this expectation that
> >sections of the Linux kernel are intentionally "churn squatted" to prevent
> >any other ideas from creeping in other than those of the owner of that subsystem
>
> Strangely enough, my perception is that Ingo is simply trying to
> address the issues Mike's testing discovered in RDSL and SD. It's not
> surprising Ingo made it a separate patch set as Con has repeatedly
> stated that the "problems" are in fact by design and won't be fixed.

That's not exactly the problem. There are people who work very hard to
try to improve some areas of the kernel. They progress slowly, and
acquire more and more skills. Sometimes they feel like they need to
change some concepts and propose those changes which are required for
them to go further, or to develop faster. Those are rejected. So they
are constrained to work in a delimited perimeter from which it is
difficult for them to escape.

Then, the same person who rejected their changes comes with something
shiny new, better, and which took him far less time. But he sort of
broke the rules, because what was forbidden to the others is
suddenly permitted. Maybe for very good reasons, I'm not discussing
that. But the good reasons should have been valid the first time too.

The fact is that when changes are rejected, we should not simply say
"no", but explain why and define what would be acceptable. Some people
here have excellent teaching skills for this, but most others do not.
Anyway, the rules should be the same for everybody.

Also, there is what can be perceived as marketing here. Con worked
on his idea with convictions, he took time to write some generous
documentation, but he hit a wall where his concept was suboptimal on
a given workload. But at least, all the work was oriented on a technical
basis: design + code + doc.

Then, Ingo comes in with something looking amazingly better, with
virtually no documentation, an appealing announcement, and a shiny
advertisement at boot. All this implemented without the constraints
other people had to respect. It already looks like definitive work
which will be merged as-is without many changes except a few bugfixes.

If those were two companies, the first one would simply have accused
the second one of not having respected contracts and of having employed
heavy marketing to take the first place.

Many people here do not code for a living; they do it at least in part
because they believe in what they are doing, and some of them want a bit
of gratitude for their work. I've met people who were proud to say they
implemented this or that feature in the kernel, so it is something important for
them. And being cited in an email is nothing compared to advertising
at boot time.

When the discussion got blocked between Con and Mike concerning the
design problems, that is where a new discussion should have taken place.


Ingo could have publicly spoken with them about his ideas of killing
the O(1) scheduler and replacing it with an rbtree-based one, and using
part of Bill's work to speed up development.

It is far easier to resign oneself when people explain what concepts are
wrong and how they intend to proceed than when they suddenly present
something out of nowhere which is already better.

And it's not specific to Ingo (though I think his ability to work that
fast alone makes him tend to practise this more often than others).

Imagine if Con had worked another full week on his scheduler with better
results on Mike's workload, but still not as good as Ingo's, and they both
published at the same time. You certainly can imagine he would have preferred
to be informed first that it was pointless to continue in that direction.

Now I hope he and Bill will get over this and accept to work on improving
this scheduler, because I really find it smarter than a dumb O(1). I even
agree with Mike that we now have a solid basis for future work. But for
this, maybe a good starting point would be to remove the selfish printk
at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
and improve the documentation a bit so that people can work together on
the new design, without feeling like their work will only serve to
promote X or Y.

Regards,
Willy

Gene Heskett

unread,
Apr 15, 2007, 12:22:04 PM4/15/07
to
On Sunday 15 April 2007, Pekka Enberg wrote:
>On 4/15/07, hui Bill Huey <bi...@gnuppy.monkey.org> wrote:
>> The perception here is that there is this expectation that
>> sections of the Linux kernel are intentionally "churn squatted" to prevent
>> any other ideas from creeping in other than those of the owner of that subsystem
>
>Strangely enough, my perception is that Ingo is simply trying to
>address the issues Mike's testing discovered in RSDL and SD. It's not
>surprising Ingo made it a separate patch set as Con has repeatedly
>stated that the "problems" are in fact by design and won't be fixed.

I won't get into the middle of this just yet, not having decided which dog I
should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for about
24 hours, it's been generally usable, but gzip still causes lots of 5 to 10+
second lags when it's running. I'm coming to the conclusion that gzip simply
doesn't play well with others...

Amazing to me, the cpu it's using stays generally below 80%, and often below
60%, even while the kmail composer has a full sentence in its buffer that it
still hasn't shown me when I switch to the htop screen to check, and back to
the kmail screen to see if it's updated yet. The screen switch doesn't seem
to lag, so I don't think renicing X would be helpful. Those are the obvious
lags, and I'll build & reboot to the CFS patch at some point this morning
(what's left of it, that is :). And report in due time, of course.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
knot in cables caused data stream to become twisted and kinked

Davide Libenzi

unread,
Apr 15, 2007, 12:22:26 PM4/15/07
to
On Sat, 14 Apr 2007, Davide Libenzi wrote:

> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contain queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the slot:
>
> S = Ib + min((T - Tb) / Tg, Ns - 1);

... mod Ns, of course ;)


- Davide

Ingo Molnar

unread,
Apr 15, 2007, 12:22:53 PM4/15/07
to

* Bill Huey <bi...@gnuppy.monkey.org> wrote:

> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> >
> > Demystify what? The casual observer need only read either your
> > attempt
>
> Here's the problem. You're a casual observer and obviously not paying
> attention.

guys, please calm down. Judging by the number of contributions to
sched.c the main folks who are not 'observers' here and who thus have an
unalienable right to be involved in a nasty flamewar about scheduler
interactivity are Con, Mike, Nick and me ;-) Everyone else is just a
happy bystander, ok? ;-)

Ingo

Ingo Molnar

unread,
Apr 15, 2007, 12:25:42 PM4/15/07
to

* S.Çağlar Onur <cag...@pardus.org.tr> wrote:

> > hm, could you try to strace it and/or attach gdb to it and figure
> > out what's wrong? (perhaps involving the Kaffeine developers too?)
> > As long as it's not a kernel level crash i cannot see how the
> > scheduler could directly cause this - other than by accident
> > creating a scheduling pattern that triggers a user-space bug more
> > often than with other schedulers.
>
> ...
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
> --- SIGINT (Interrupt) @ 0 (0) ---
> +++ killed by SIGINT +++
>
> is where freeze occurs. Full log can be found at [1]

> [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine

thanks. This does have the appearance of a userspace race condition of
some sort. Can you trigger this hang with the patch below applied to
the vanilla tree as well? (with no CFS patch applied)

if yes then this would suggest that Kaffeine has some sort of
child-runs-first problem. (which CFS changes to parent-runs-first.
Kaffeine starts a couple of threads and the futex calls are a sign of
thread<->thread communication.)
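
( purely to illustrate the kind of ordering assumption i mean - this is
not Kaffeine's code and the flag below is made up - a pattern like the
one below only 'works' if the newly created thread runs before its
creator continues: )

/* illustration only - a hypothetical ordering-dependent app bug */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int ready;       /* deliberately unsynchronized - that is the bug */

static void *worker(void *arg)
{
        ready = 1;      /* the new thread sets the flag ... */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        /*
         * ... and the creator checks it exactly once, assuming the new
         * thread already ran (child-runs-first). Under parent-runs-first
         * the check fails and the app sits there waiting for a wakeup
         * that never arrives.
         */
        if (!ready)
                pause();

        pthread_join(t, NULL);
        printf("done\n");
        return 0;
}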

[ i have also Cc:-ed the Kaffeine folks - maybe your strace gives them
an idea about what the problem could be :) ]

Ingo

---
kernel/sched.c | 70 ++-------------------------------------------------------
1 file changed, 3 insertions(+), 67 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,13 @@ void fastcall sched_fork(struct task_str
*/
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
- struct rq *rq, *this_rq;
unsigned long flags;
- int this_cpu, cpu;
+ struct rq *rq;

rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
- this_cpu = smp_processor_id();
- cpu = task_cpu(p);
-
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive. The parent
- * (current) is done further down, under its lock.
- */
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->prio = effective_prio(p);
-
- if (likely(cpu == this_cpu)) {
- if (!(clone_flags & CLONE_VM)) {
- /*
- * The VM isn't cloned, so we're in a good position to
- * do child-runs-first in anticipation of an exec. This
- * usually avoids a lot of COW overhead.
- */
- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- p->normal_prio = current->normal_prio;
- list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
- p->array->nr_active++;
- inc_nr_running(p, rq);
- }
- set_need_resched();
- } else
- /* Run child last */
- __activate_task(p, rq);
- /*
- * We skip the following code due to cpu == this_cpu
- *
- * task_rq_unlock(rq, &flags);
- * this_rq = task_rq_lock(current, &flags);
- */
- this_rq = rq;
- } else {
- this_rq = cpu_rq(this_cpu);
-
- /*
- * Not the local CPU - must adjust timestamp. This should
- * get optimised away in the !CONFIG_SMP case.
- */
- p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
- + rq->most_recent_timestamp;
- __activate_task(p, rq);
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
-
- /*
- * Parent and child are on different CPUs, now get the
- * parent runqueue to update the parent's ->sleep_avg:
- */
- task_rq_unlock(rq, &flags);
- this_rq = task_rq_lock(current, &flags);
- }
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
- task_rq_unlock(this_rq, &flags);
+ __activate_task(p, rq);
+ task_rq_unlock(rq, &flags);
}

/*

William Lee Irwin III

unread,
Apr 15, 2007, 12:41:08 PM4/15/07
to
On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote:
> Now I hope he and Bill will get over this and accept to work on improving
> this scheduler, because I really find it smarter than a dumb O(1). I even
> agree with Mike that we now have a solid basis for future work. But for
> this, maybe a good starting point would be to remove the selfish printk
> at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
> and improve the documentation a bit so that people can work together on
> the new design, without feeling like their work will only serve to
> promote X or Y.

While I appreciate people coming to my defense, or at least the good
intentions behind such, my only actual interest in pointing out
4-year-old work is getting some acknowledgment of having done something
relevant at all. Sometimes it has "I told you so" value. At other times
it's merely clarifying what went on when people refer to it since in a
number of cases the patches are no longer extant, so they can't
actually look at it to get an idea of what was or wasn't done. At other
times I'm miffed about not being credited, whether I should've been or
whether dead and buried code has an implementation of the same idea
resurfacing without the author(s) having any knowledge of my prior work.

One should note that in this case, the first work of mine this trips
over (scheduling classes) was never publicly posted as it was only a
part of the original plugsched (an alternate scheduler implementation
devised to demonstrate plugsched's flexibility with respect to
scheduling policies), and a part that was dropped by subsequent
maintainers. The second work of mine this trips over, a virtual deadline
scheduler named "vdls," was also never publicly posted. Both are from
around the same time period, which makes them approximately 4 years dead.
Neither of the codebases are extant, having been lost in a transition
between employers, though various people recall having been sent them
privately, and plugsched survives in a mutated form as maintained by
Peter Williams, who's been very good about acknowledging my original
contribution.

If I care to become a direct participant in scheduler work, I can do so
easily enough.

I'm not entirely sure what this is about a basis for future work. By
and large one should alter the APIs and data structures to fit the
policy being implemented. While the array swapping was nice for
algorithmically improving 2.4.x -style epoch expiry, most algorithms
not based on the 2.4.x scheduler (in however mutated a form) should use
a different queue structure, in fact, one designed around their
policy's specific algorithmic needs. IOW, when one alters the scheduler,
one should also alter the queue data structure appropriately. I'd not
expect the priority queue implementation in cfs to continue to be used
unaltered as it matures, nor would I expect any significant modification
of the scheduler to necessarily use a similar one.

By and large I've been mystified as to why there is such a penchant for
preserving the existing queue structures in the various scheduler
patches floating around. I am now every bit as mystified at the point
of view that seems to be emerging that a change of queue structure is
particularly significant. These are all largely internal changes to
sched.c, and as such, rather small changes in and of themselves. While
they do tend to have user-visible effects, from this point of view
even changing out every line of sched.c is effectively a micropatch.

Something more significant might be altering the schedule() API to
take a mandatory description of the intention of the call to it, or
breaking up schedule() into several different functions to distinguish
between different sorts of uses of it to which one would then respond
differently. Also more significant would be adding a new state beyond
TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some
tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE
users to use the new state and handle those fatal signals. While not
quite as ostentatious in their user-visible effects as SCHED_OTHER
policy affairs, they are tremendously more work than switching out the
implementation of a single C file, and so somewhat more respectable.

Even as scheduling semantics go, these are micropatches. So SCHED_OTHER
changes a little. Where are the gang schedulers? Where are the batch
schedulers (SCHED_BATCH is not truly such)? Where are the isochronous
(frame) schedulers? I suppose there is some CKRM work that actually has
a semantic impact despite being largely devoted to SCHED_OTHER, and
there's some spufs gang scheduling going on, though not all that much.
And to reiterate a point from other threads, even as SCHED_OTHER
patches go, I see precious little verification that things like the
semantics of nice numbers or other sorts of CPU bandwidth allocation
between competing tasks of various natures are staying the same while
other things are changed, or at least being consciously modified in
such a fashion as to improve them. I've literally only seen one or two
tests (and rather inflexible ones with respect to sleep and running
time mixtures) with any sort of quantification of how CPU bandwidth is
distributed get run on all this.

So from my point of view, there's a lot of churn and craziness going on
in one tiny corner of the kernel and people don't seem to have a very
solid grip on what effects their changes have or how they might
potentially break userspace. So I've developed a sudden interest in
regression testing of the scheduler in order to ensure that various sorts
of semantics on which userspace relies are not broken, and am trying to
spark more interest in general in nailing down scheduling semantics and
verifying that those semantics are honored and remain honored by whatever
future scheduler implementations might be merged.

Thus far, the laundry list of semantics I'd like to have nailed down
are specifically:

(1) CPU bandwidth allocation according to nice numbers
(2) CPU bandwidth allocation among mixtures of tasks with varying
sleep/wakeup behavior e.g. that consume some percentage of cpu
in isolation, perhaps also varying the granularity of their
sleep/wakeup patterns
(3) sched_yield(), so multitier userspace locking doesn't go haywire
(4) How these work with SMP; most people agree it should be mostly the
same as it works on UP, but it's not being verified, as most
testcases are barely SMP-aware if at all, and corner cases
where proportionality breaks down aren't considered

The sorts of explicit decisions I'd like to see made for these are:
(1) In a mixture of tasks with varying nice numbers, a given nice number
corresponds to some share of CPU bandwidth. Implementations
should not have the freedom to change this arbitrarily according
to some intention.
(2) A given scheduler _implementation_ intends to distribute CPU
bandwidth among mixtures of tasks that would each consume some
percentage of the CPU in isolation varying across tasks in some
particular pattern. For example, maybe some scheduler
implementation assigns a share of 1/%cpu to a task that would
consume %cpu in isolation, for a CPU bandwidth allocation of
(1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks
(this is not to say that such a policy makes sense).
(3) sched_yield() is intended to result in some particular scheduling
pattern in a given scheduler implementation. For instance, an
implementation may intend that a set of CPU hogs calling
sched_yield() between repeatedly printf()'ing their pid's will
see their printf()'s come out in an approximately consistent
order as the scheduler cycles between them.
(4) What an implementation intends to do with respect to SMP CPU
bandwidth allocation when precise emulation of UP behavior is
impossible, considering sched_yield() scheduling patterns when
possible as well. For instance, perhaps an implementation
intends to ensure equal CPU bandwidth among competing CPU-bound
tasks of equal priority at all costs, and so triggers migration
and/or load balancing to make it so. Or perhaps an implementation
intends to ensure precise sched_yield() ordering at all costs
even on SMP. Some sort of specification of the intention, then
a verification that the intention is carried out in a testcase.
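
To make (1) and (2) concrete, the flavor of testcase I have in mind is
roughly the following (purely illustrative, not one of the tests I've
actually written; pin everything to one CPU, e.g. with taskset, to get
clean share numbers):

/* fork one CPU hog per nice level, let them compete, report CPU time */
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/resource.h>

#define NR_HOGS         3
#define RUN_SECONDS     10

static const int nice_level[NR_HOGS] = { 0, 5, 10 };

int main(void)
{
        pid_t pid[NR_HOGS];
        int i;

        for (i = 0; i < NR_HOGS; i++) {
                pid[i] = fork();
                if (pid[i] == 0) {
                        nice(nice_level[i]);
                        for (;;)
                                ;       /* pure CPU hog */
                }
        }
        sleep(RUN_SECONDS);
        for (i = 0; i < NR_HOGS; i++) {
                struct rusage ru;

                kill(pid[i], SIGKILL);
                wait4(pid[i], NULL, 0, &ru);
                printf("nice %2d: %ld.%06ld s of CPU time\n", nice_level[i],
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        }
        return 0;
}

The ratios between those CPU times, compared across scheduler
implementations, are exactly the sort of thing I want specified and
verified rather than left to drift.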

Also, if there's a semantic issue to be resolved, I want it to have
something describing it and verifying it. For instance, characterizing
whatever sort of scheduling artifacts queue-swapping causes in the
mainline scheduler and then a testcase to demonstrate the artifact and
its resolution in a given scheduler rewrite would be a good design
statement and verification. For instance, if someone wants to go back
to queue-swapping or other epoch expiry semantics, it would make them
(and hopefully everyone else) conscious of the semantic issue the
change raises, or possibly serve as a demonstration that the artifacts
can be mitigated in some implementation retaining epoch expiry semantics.

As I become aware of more potential issues I'll add more to my laundry
list, and I'll hammer out testcases as I go. My concern with the
scheduler is that this sort of basic functionality may be significantly
disturbed with no one noticing at all until a distro issues a prerelease
and benchmarks go haywire, and furthermore that changes to this kind of
basic behavior may be signs of things going awry, particularly as more
churn happens.

So now that I've clarified my role in all this to date and my point of
view on it, it should be clear that accepting something and working on
some particular scheduler implementation don't make sense as
suggestions to me.


-- wli

Bill Huey

unread,
Apr 15, 2007, 12:41:28 PM4/15/07
to
On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> [...]
>
> Demystify what? The casual observer need only read either your attempt

Here's the problem. You're a casual observer and obviously not paying
attention.

> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
> Now progress can happen, which was _not_ happening before.

I think that's inaccurate and there are plenty of folks that have that
technical skill and background. The scheduler code isn't a deep mystery
and there are plenty of good kernel hackers out here across many
communities. Ingo isn't the only person on this planet to have deep
scheduler knowledge. Priority heaps are not new and Solaris has had a
pluggable scheduler framework for years.

Con's characterization is something that I'm more prone to believe about
how Linux kernel development works versus your view. I think it's a great
shame to have folks like Bill Irwin and Con waste time trying to
do something right, only to have their ideas attacked, then copied and held
up as the solution to this kind of technical problem, in a complete reversal
of technical opinion as it suits the moment. This is just wrong in so many
ways.

It outlines the problems with Linux kernel development and questionable
elitism regarding ownership of certain sections of the kernel code.

I call it "churn squat" and instances like this only support that view
which I would rather it be completely wrong and inaccurate instead.

bill

Davide Libenzi

unread,
Apr 15, 2007, 12:42:54 PM4/15/07
to
On Fri, 13 Apr 2007, William Lee Irwin III wrote:

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > The CFS patch uses a completely different approach and implementation
> > from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> > that of RSDL/SD, which is a high standard to meet :-) Testing
> > feedback is welcome to decide this one way or another. [ and, in any
> > case, all of SD's logic could be added via a kernel/sched_sd.c module
> > as well, if Con is interested in such an approach. ]
> > CFS's design is quite radical: it does not use runqueues, it uses a
> > time-ordered rbtree to build a 'timeline' of future task execution,
> > and thus has no 'array switch' artifacts (by which both the vanilla
> > scheduler and RSDL/SD are affected).


>
> A binomial heap would likely serve your purposes better than rbtrees.

> It's faster to have the next item to dequeue at the root of the tree
> structure rather than a leaf, for one. There are, of course, other
> priority queue structures (e.g. van Emde Boas) able to exploit the
> limited precision of the priority key for faster asymptotics, though
> actual performance is an open question.

Haven't looked at the scheduler code yet, but for a similar problem I use
a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are
queued - in my case they were some sort of timers), and it has a current
base index (Ib), a current base time (Tb) and a time granularity (Tg). It
also has a bitmap with bits telling you which slots contain queued tasks.
An item (task) that has to be scheduled at time T, will be queued in the slot:

S = Ib + min((T - Tb) / Tg, Ns - 1);

Items with T longer than Ns*Tg will be scheduled in the relative last slot
(choosing a proper Ns and Tg can minimize this).
Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
suit your needs.
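
In code it looks more or less like this (just a sketch with made-up
names, not the actual smart-queue code):

/* sketch only - illustrative names */
#define NS              256     /* number of slots, a power of 2 */
#define LONG_BITS       (8 * sizeof(unsigned long))

struct item {
        struct item *next;
        unsigned long long time;        /* T: when the item should run */
};

struct time_ring {
        struct item *slot[NS];                  /* queued items, per slot */
        unsigned long bitmap[NS / LONG_BITS];   /* marks non-empty slots */
        unsigned int ib;                        /* current base index Ib */
        unsigned long long tb;                  /* current base time Tb */
        unsigned long long tg;                  /* time granularity Tg */
};

/* O(1) enqueue: S = (Ib + min((T - Tb) / Tg, Ns - 1)) mod Ns */
static void ring_enqueue(struct time_ring *r, struct item *it)
{
        unsigned long long d = (it->time - r->tb) / r->tg;
        unsigned int s;

        if (d > NS - 1)
                d = NS - 1;     /* far-future items land in the last slot */
        s = (r->ib + (unsigned int)d) & (NS - 1);

        it->next = r->slot[s];
        r->slot[s] = it;
        r->bitmap[s / LONG_BITS] |= 1UL << (s % LONG_BITS);
}

/* O(Ns) worst-case dequeue: scan forward from Ib for the first busy slot */
static struct item *ring_dequeue(struct time_ring *r)
{
        unsigned int i;

        for (i = 0; i < NS; i++) {
                unsigned int s = (r->ib + i) & (NS - 1);
                struct item *it = r->slot[s];

                if (it) {
                        r->slot[s] = it->next;
                        if (!r->slot[s])
                                r->bitmap[s / LONG_BITS] &=
                                        ~(1UL << (s % LONG_BITS));
                        return it;
                }
        }
        return NULL;
}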

This is a simple bench between time-ring (TR) and CFS queueing:

http://www.xmailserver.org/smart-queue.c

In my box (Dual Opteron 252):

davide@alien:~$ ./smart-queue -n 8
CFS = 142.21 cycles/loop
TR = 72.33 cycles/loop
davide@alien:~$ ./smart-queue -n 16
CFS = 188.74 cycles/loop
TR = 83.79 cycles/loop
davide@alien:~$ ./smart-queue -n 32
CFS = 221.36 cycles/loop
TR = 75.93 cycles/loop
davide@alien:~$ ./smart-queue -n 64
CFS = 242.89 cycles/loop
TR = 81.29 cycles/loop

- Davide

Ingo Molnar

unread,
Apr 15, 2007, 12:43:14 PM4/15/07
to

* Bill Huey <bi...@gnuppy.monkey.org> wrote:

> Hello folks,
>
> I think the main failure I see here is that Con wasn't included in
> this design or privately in review process. There could have been
> better co-ownership of the code. This could also have been done openly

> on lkml [...]

Bill, you come from a BSD background and you are still relatively new to
Linux development, so i dont at all fault you for misunderstanding this
situation, and fortunately i have a really easy resolution for your
worries: i did exactly that! :)

i wrote the first line of code of the CFS patch this week, 8am Wednesday
morning, and released it to lkml 62 hours later, 22pm on Friday. (I've
listed the file timestamps of my backup patches further below, for all
the fine details.)

I prefer such early releases to lkml _a lot_ more than any private review
process. I released the CFS code about 6 hours after i thought "okay,
this looks pretty good" and i spent those final 6 hours on testing it
(making sure it doesnt blow up on your box, etc.); in the final 2 hours
i showed it to two folks i could reach on IRC (Arjan and Thomas) and
worked on various finishing touches. It doesnt get much faster than that and i
definitely didnt want to sit on it even one day longer because i very
much thought that Con and others should definitely see this work!

And i very much credited (and still credit) Con for the whole fairness
angle:

|| i'd like to give credit to Con Kolivas for the general approach here:
|| he has proven via RSDL/SD that 'fair scheduling' is possible and that
|| it results in better desktop scheduling. Kudos Con!

the 'design consultation' phase you are talking about is _NOW_! :)

I got the v1 code out to Con, to Mike and to many others ASAP. That's
how you are able to comment on this thread and be part of the
development process to begin with, in a 'private consultation' setup
you'd not have had any opportunity to see _any_ of this.

In the BSD space there seem to be more 'political' mechanisms for
development, but Linux is truly about doing things out in the open, and
doing it immediately.

Okay? ;-)

Here's the timestamps of all my backups of the patch, from its humble 4K
beginnings to the 100K first-cut v1 result:

-rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch

i hope this helps :)

Ingo

Ingo Molnar

unread,
Apr 15, 2007, 12:46:23 PM4/15/07
to

* Ingo Molnar <mi...@elte.hu> wrote:

> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
>
> thanks. This does have the appearance of a userspace race condition of
> some sort. Can you trigger this hang with the patch below applied to
> the vanilla tree as well? (with no CFS patch applied)

oops, please use the patch below instead.

Ingo

---
kernel/sched.c | 69 ++++-----------------------------------------------------
1 file changed, 5 insertions(+), 64 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,18 @@ void fastcall sched_fork(struct task_str
*/
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
- struct rq *rq, *this_rq;
unsigned long flags;
- int this_cpu, cpu;
+ struct rq *rq;

rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
- this_cpu = smp_processor_id();
- cpu = task_cpu(p);
-
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive. The parent
- * (current) is done further down, under its lock.
- */
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);

p->prio = effective_prio(p);
+ __activate_task(p, rq);
+ if (TASK_PREEMPTS_CURR(p, rq))
+ resched_task(rq->curr);

Ingo Molnar

unread,
Apr 15, 2007, 12:47:50 PM4/15/07
to

* Con Kolivas <ker...@kolivas.org> wrote:

[ i'm quoting this bit out of order: ]

> 2. Since then I've been thinking/working on a cpu scheduler design
> that takes away all the guesswork out of scheduling and gives very
> predictable, as fair as possible, cpu distribution and latency while
> preserving as solid interactivity as possible within those confines.

yeah. I think you were right on target with this call. I've applied the
sched.c change attached at the bottom of this mail to the CFS patch, if
you dont mind. (or feel free to suggest some other text instead.)

> 1. I tried in vain some time ago to push a working extensable
> pluggable cpu scheduler framework (based on wli's work) for the linux
> kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he
> didn't like it) as being absolutely the wrong approach and that we

> should never do that. [...]

i partially replied to that point to Will already, and i'd like to make
it clear again: yes, i rejected plugsched 2-3 years ago (which already
drifted away from wli's original codebase) and i would still reject it
today.

First and foremost, please dont take such rejections too personally - i
had my own share of rejections (and in fact, as i mentioned it in a
previous mail, i had a fair number of complete project throwaways:
4g:4g, in-kernel Tux, irqrate and many others). I know that they can
hurt and can demoralize, but if i dont like something it's my job to
tell that.

Can i sum up your argument as: "you rejected plugsched, but then why on
earth did you modularize portions of the scheduler in CFS? Isnt your
position thus woefully inconsistent?" (i'm sure you would never put it
this impolitely though, but i guess i can flame myself with impunity ;)

While having an inconsistent position isnt a terminal sin in itself,
please realize that the scheduler classes code in CFS is quite different
from plugsched: it was a result of what i saw to be technological
pressure for _internal modularization_. (This internal/policy
modularization aspect is something that Will said was present in his
original plugsched code, but which aspect i didnt see in the plugsched
patches that i reviewed.)

That possibility never even occurred to me until 3 days ago. You never
raised it either AFAIK. No patches to simplify the scheduler that way
were ever sent. Plugsched doesnt even touch the core load-balancer for
example, and most of the time i spent with the modularization was to get
the load-balancing details right. So it's really apples to oranges.

My view about plugsched: first please take a look at the latest
plugsched code:

http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch

26 files changed, 8951 insertions(+), 1495 deletions(-)

As an experiment i've removed all the add-on schedulers (both the core
and the include files, only kept the vanilla one) from the plugsched
patch (and the makefile and kconfig complications, etc), to see the
'infrastructure cost', and it still gave:

12 files changed, 1933 insertions(+), 1479 deletions(-)

that's the extra complication i didnt like 3 years ago and which i still
dont like today. What the current plugsched code does is that it
simplifies the adding of new experimental schedulers, but it doesnt
really do what i wanted: to simplify the _scheduler itself_. Personally
i'm still not primarily interested in having a large selection of
schedulers, i'm mainly interested in a good and maintainable scheduler
that works for people.

so the rejection was on these grounds, and i still very much stand by
that position here and today: i didnt want to see the Linux scheduler
landscape balkanized and i saw no technological reasons for the
complication that external modularization brings.

the new scheduling classes code in the CFS patch was not a result of "oh,
i want to write a new scheduler, lets make schedulers pluggable" kind of
thinking. That result was just a side-effect of it. (and as you
correctly noted it, the CFS related modularization is incomplete).

Btw., the thing that triggered the scheduling classes code wasnt even
plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he
fixed it within the framework of the existing scheduler, and the end
result behaved quite well when i threw various testloads on it.

But i felt a bit uncomfortable that it added another few hundred lines
of code to an already complex sched.c. This felt unnatural so i mailed
Mike that i'd attempt to clean these infrastructure aspects of sched.c
up a bit so that it becomes more hackable to him. Thus 3 days ago,
without having made up my mind about anything, i started this experiment
(which ended up in the modularization and in the CFS scheduler) to
simplify the code and to enable Mike to fix such itches in an easier
way. By your logic Mike should in fact be quite upset about this: if the
new code works out and proves to be useful then it obsoletes a whole lot
of his code!

> For weeks now, Ingo has said that the interactivity regressions were
> showstoppers and we should address them, never mind the fact that the
> so-called regressions were purely "it slows down linearly with load"

> which to me is perfectly desirable behaviour. [...]

yes. For me the first thing when considering a large scheduler patch is:
"does a patch do what it claims" and "does it work". If those goals are
met (and if it's a complete scheduler i actually try it quite
extensively) then i look at the code cleanliness issues. Mike's patch
was the first one that seemed to meet that threshold in my own humble
testing, and CFS was a direct result of that.

note that i tried the same workloads with CFS and while it wasnt as good
as mainline, it handled them better than SD. Mike reported the same, and
Mark Lord (who too reported SD interactivity problems) reported success
yesterday too.

(but .. CFS is a mere 2 days old so we cannot really tell anything with
certainty yet.)

> [...] However at one stage I virtually begged for support with my

> attempts and help with the code. Dmitry Adamushko is the only person
> who actually helped me with the code in the interim, while others
> poked sticks at it. Sure the sticks helped at times but the sticks
> always seemed to have their ends kerosene doused and flaming for
> reasons I still don't get. No other help was forthcoming.

i'm really sorry you got that impression.

in 2004 i had a good look at the staircase scheduler and said:

http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html

"But in general i'm quite positive about the staircase scheduler."

and even tested it and gave you feedback:

http://lwn.net/Articles/96562/

i think i even told Andrew that i dont really like pluggable schedulers
and if there's any replacement for the current scheduler then that would
be a full replacement, and it would be the staircase scheduler.

Hey, i told this to you as recently as 1 month ago as well:

http://lkml.org/lkml/2007/3/8/54

"cool! I like this even more than i liked your original staircase
scheduler from 2 years ago :)"

Ingo

----------->


Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -16,6 +16,7 @@
* by Davide Libenzi, preemptible kernel bits by Robert Love.
* 2003-09-03 Interactivity tuning by Con Kolivas.
* 2004-04-02 Scheduler domains code by Nick Piggin
+ * 2007-04-15 Con Kolivas was dead right: fairness matters! :)
*/

Arjan van de Ven

unread,
Apr 15, 2007, 12:53:49 PM4/15/07
to

> It outlines the problems with Linux kernel development and questionable
> elistism regarding ownership of certain sections of the kernel code.

I have to step in and disagree here....

Linux is not about who writes the code.

Linux is about getting the best solution for a problem. Who wrote which
line of the code is irrelevant in the big picture.

that often means that multiple implementations happen, and that a
darwinistic process decides that the best solution wins.

This darwinistic process often happens in the form of discussion, and
that discussion can happen with words or with code. In this case it
happened with a code proposal.

To make this specific: it has happened many times to me that when I
solved an issue with code, someone else stepped in and wrote a different
solution (although that was usually for smaller pieces). Was I upset
about that? No! I was happy because my *problem got solved* in the best
possible way.

Now this doesn't mean that it's OK for people to be unkind to each other,
to not cooperate or to steal credit, but I don't get the impression that
that is happening here.
proposal for discussion *on the mailing list*. What more do you want??
If you or anyone else can improve it or do better, take part of this
discussion and show what you mean either in words or in code.

Your qualification of the discussion as an elitist takeover... I disagree
with that. It's a *discussion*. Now if you agree that Ingo's patch is
better technically, you and others should be happy about that because
your problem is getting solved better. If you don't agree that his patch
is better technically, take part in the technical discussion.

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

Willy Tarreau

unread,
Apr 15, 2007, 12:54:32 PM4/15/07
to
On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote:
> Willy Tarreau <w...@1wt.eu> writes:

>
> > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
> >>
> >> * Eric W. Biederman <ebie...@xmission.com> wrote:
> >>
> >> > > Thinking about it, I don't know if there are calls to schedule()
> >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and
> >> > > "chvt 2" simply blocked. It would have been possible that a
> >> > > schedule() call somewhere got starved due to the load, I don't know.
> >> >
> >> > It looks like there is a call to schedule_work.
> >>
> >> so this goes over keventd, right?
> >>
> >> > There are two pieces of the path. If you are switching in and out of a
> >> > tty controlled by something like X. User space has to grant
> >> > permission before the operation happens. Where there isn't a gate
> >> > keeper I know it is cheaper but I don't know by how much, I suspect
> >> > there is still a schedule happening in there.
> >>
> >> Could keventd perhaps be starved? Willy, to exclude this possibility,
> >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then
> >> the command to set it to SCHED_FIFO:50 would be:
> >>
> >> chrt -f -p 50 5
> >>
> >> but ... events/0 is reniced to -5 by default, so it should definitely
> >> not be starved.
> >
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> > bash forks 1000 processes, then progressively execs scheddos, but it
> > takes some time). So I'm rebuilding right now. But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know
> > how that translates to frozen xterms, but let's attack the problems
> > one at a time.
>
> I think it is a good clue. However the intention of the mechanism is
> that only processes that change the video mode on a VT are supposed to
> use it. So I really don't think gpm is the culprit. However it easily could
> be something else that has similar characteristics.
>
> I just realized we do have proof that schedule_work is actually working
> because SAK works, and we can't sanely do SAK from interrupt context
> so we call schedule work.

Eric,

I can say that Linus, Ingo and you all got on the right track.
I could reproduce, I got a hung tty around 1400 running processes.
Fortunately, it was the one with the root shell which was reniced
to -19.

I could strace chvt 2 :

20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000>
20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002>
20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000>
20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...>

Then I applied Ingo's suggestion about changing keventd prio :

root@pcw:~# ps auxw|grep event
root 8 0.0 0.0 0 0 ? SW< 20:31 0:00 [events/0]
root 9 0.0 0.0 0 0 ? RW< 20:31 0:00 [events/1]

root@pcw:~# rtprio -s 1 -p 50 8 9 (I don't have chrt but it does the same)

My VT immediately switched as soon as I hit Enter. Everything's
working fine again now. So the good news is that it's not a bug
in the tty code, nor a deadlock.

Now, maybe keventd should get a higher prio? It seems worrying to
me that it may starve when it is so sensitive.

Also, that may explain why I couldn't reproduce with the fork patch.
Since all new processes got no runtime at first, their impact on
existing ones must have been lower. But I think that if I had waited
longer, I would have had the problem again (though I did not see it
even under a load of 7800).

Regards,
Willy

Esben Nielsen

unread,
Apr 15, 2007, 12:55:56 PM4/15/07
to
On Fri, 13 Apr 2007, Ingo Molnar wrote:

> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>

> [...]

I took a brief look at it. Have you tested priority inheritance?

As far as I can see rt_mutex_setprio doesn't have much effect on
SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task changes
scheduler class when boosted in rt_mutex_setprio().

Esben

Willy Tarreau

unread,
Apr 15, 2007, 12:57:30 PM4/15/07
to
On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> > Forking becomes very slow above a load of 100 it seems. Sometimes,
> > the shell takes 2 or 3 seconds to return to prompt after I run
> > "scheddos &"
> > Those are very promising results, I nearly observe the same responsiveness
> > as I had on a solaris 10 with 10k running processes on a bigger machine.
> > I would be curious what a mysql test result would look like now.
>
> Where is scheddos?

I will send it to you off-list. I've been avoiding publishing it for a long
time because the stock scheduler was *very* sensitive to trivial attacks
(freezes larger than 30s, impossible to log in). It's very basic, and I
have no problem sending it to anyone who requests it, it's just that as
long as some distros ship early 2.6 kernels I do not want it to appear on
mailing list archives for anyone to grab it and annoy their admins for free.

Cheers,
Willy

Davide Libenzi

unread,
Apr 15, 2007, 12:57:57 PM4/15/07
to
On Sat, 14 Apr 2007, William Lee Irwin III wrote:

> The two basic attacks on such large priority spaces are the near future
> vs. far future subdivisions and subdividing the priority space into
> (most often regular) intervals. Subdividing the priority space into
> intervals is the most obvious; you simply use some O(lg(n)) priority
> queue as the bucket discipline in the "time ring," queue by the upper
> bits of the queue priority in the time ring, and by the lower bits in
> the O(lg(n)) bucket discipline.

Sure. If you really need sub-millisecond precision, you can replace the
bucket's list_head with an rb_root. It may not be necessary though for a
cpu scheduler (still, didn't read Ingo's code yet).
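
Something like this, say (made-up names again, just to show the
two-level idea, with the upper bits of the key selecting the slot and
the lower bits staying ordered inside the slot's rbtree):

/* sketch only */
#include <linux/rbtree.h>
#include <linux/types.h>

#define RING_SLOTS      256     /* a power of 2 */
#define SLOT_SHIFT      20      /* ~1ms per slot with nanosecond keys */

struct ring {
        struct rb_root slot[RING_SLOTS];
        unsigned int base;      /* current base index */
};

static inline struct rb_root *ring_slot(struct ring *r, u64 key)
{
        unsigned int idx = (r->base + (unsigned int)(key >> SLOT_SHIFT))
                                & (RING_SLOTS - 1);

        return &r->slot[idx];
}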


- Davide

Mike Galbraith

unread,
Apr 15, 2007, 1:04:02 PM4/15/07
to
On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote:
> On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > [CFS]
> >
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> >
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
>
> The casual observer will be completely confused by what on earth has happened
> here so let me try to demystify things for them.

[...]

Demystify what? The casual observer need only read either your attempt
at writing a scheduler, or my attempts at fixing the one we have, to see
that it was high time for someone with the necessary skills to step in.
Now progress can happen, which was _not_ happening before.

-Mike

Bernd Eckenfels

unread,
Apr 15, 2007, 1:07:49 PM4/15/07
to
In article <20070415051...@gnuppy.monkey.org> you wrote:
> A development process like this is likely to exclude smart people from wanting
> to contribute to Linux and folks should be conscious about this issues.

Nobody is excluded, you can always have a next iteration.

Gruss
Bernd

William Lee Irwin III

unread,
Apr 15, 2007, 1:12:08 PM4/15/07
to
* Willy Tarreau <w...@1wt.eu> wrote:
>> [...] and using part of Bill's work to speed up development.

On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote:
> ok, let me make this absolutely clear: i didnt use any bit of plugsched
> - in fact the most difficult bits of the modularization was for areas of
> sched.c that plugsched never even touched AFAIK. (the load-balancer for
> example.)
> Plugsched simply does something else: i modularized scheduling policies
> in essence that have to cooperate with each other, while plugsched
> modularized complete schedulers which are compile-time or boot-time
> selected, with no runtime cooperation between them. (one has to be
> selected at a time)
> (and i have no trouble at all with crediting Will's work either: a few
> years ago i used Will's PID rework concepts for an NPTL related speedup
> and Will is very much credited for it in today's kernel/pid.c and he
> continued to contribute to it later on.)
> (the tree walking bits of sched_fair.c were in fact derived from
> kernel/hrtimer.c, the rbtree code written by Thomas and me :-)

The extant plugsched patches have nothing to do with cfs; I suspect
what everyone else is going on about is terminological confusion. The
4-year-old sample policy with scheduling classes for the original
plugsched is something you had no way of knowing about, as it was never
publicly posted. There isn't really anything all that interesting going
on here, apart from pointing out that it's been done before.


-- wli

Ingo Molnar

unread,
Apr 15, 2007, 1:13:33 PM4/15/07
to

* Willy Tarreau <w...@1wt.eu> wrote:

> Ingo could have publicly spoken with them about his ideas of killing

> the O(1) scheduler and replacing it with an rbtree-based one, [...]

yes, that's precisely what i did, via a patchset :)

[ I can even tell you when it all started: i was thinking about Mike's
throttling patches while watching Manchester United beat the crap out
of AS Roma (7 to 1 end result), Thuesday evening. I started coding it
Wednesday morning and sent the patch Friday evening. I very much
believe in low-latency when it comes to development too ;) ]

(if this had been done via a comittee then today we'd probably still be
trying to find a suitable timeslot for the initial conference call where
we'd discuss the election of a chair who would be tasked with writing up
an initial document of feature requests, on which we'd take a vote,
possibly this year already, because the matter is really urgent you know
;-)

> [...] and using part of Bill's work to speed up development.

ok, let me make this absolutely clear: i didnt use any bit of plugsched
- in fact the most difficult bits of the modularization was for areas of
sched.c that plugsched never even touched AFAIK. (the load-balancer for
example.)

Plugsched simply does something else: i modularized scheduling policies
in essence that have to cooperate with each other, while plugsched
modularized complete schedulers which are compile-time or boot-time
selected, with no runtime cooperation between them. (one has to be
selected at a time)

(and i have no trouble at all with crediting Will's work either: a few
years ago i used Will's PID rework concepts for an NPTL related speedup
and Will is very much credited for it in today's kernel/pid.c and he
continued to contribute to it later on.)

(the tree walking bits of sched_fair.c were in fact derived from
kernel/hrtimer.c, the rbtree code written by Thomas and me :-)

Ingo

Eric W. Biederman

unread,
Apr 15, 2007, 1:13:40 PM4/15/07
to
Willy Tarreau <w...@1wt.eu> writes:

Eric

Pekka Enberg

unread,
Apr 15, 2007, 1:15:46 PM4/15/07
to
On 4/15/07, hui Bill Huey <bi...@gnuppy.monkey.org> wrote:
> The perception here is that there is this expectation that
> sections of the Linux kernel are intentionally "churn squatted" to prevent
> any other ideas from creeping in other than those of the owner of that subsystem

Strangely enough, my perception is that Ingo is simply trying to
address the issues Mike's testing discovered in RSDL and SD. It's not
surprising Ingo made it a separate patch set as Con has repeatedly
stated that the "problems" are in fact by design and won't be fixed.

William Lee Irwin III

unread,
Apr 15, 2007, 1:16:10 PM4/15/07
to
On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> Forking becomes very slow above a load of 100 it seems. Sometimes,
> the shell takes 2 or 3 seconds to return to prompt after I run
> "scheddos &"
> Those are very promising results, I nearly observe the same responsiveness
> as I had on a solaris 10 with 10k running processes on a bigger machine.
> I would be curious what a mysql test result would look like now.

Where is scheddos?


-- wli

Con Kolivas

unread,
Apr 15, 2007, 1:18:28 PM4/15/07
to
On Monday 16 April 2007 01:16, Gene Heskett wrote:
> On Sunday 15 April 2007, Pekka Enberg wrote:
> >On 4/15/07, hui Bill Huey <bi...@gnuppy.monkey.org> wrote:
> >> The perception here is that there is this expectation that
> >> sections of the Linux kernel are intentionally "churn squatted" to
> >> prevent any other ideas from creeping in other than those of the owner of
> >> that subsystem
> >
> >Strangely enough, my perception is that Ingo is simply trying to
> >address the issues Mike's testing discovered in RSDL and SD. It's not
> >surprising Ingo made it a separate patch set as Con has repeatedly
> >stated that the "problems" are in fact by design and won't be fixed.
>
> I won't get into the middle of this just yet, not having decided which dog
> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for
> about 24 hours, its been generally usable, but gzip still causes lots of 5
> to 10+ second lags when its running. I'm coming to the conclusion that
> gzip simply doesn't play well with others...

Actually, Gene, I think you're being bitten here by something I/O bound, since
the CPU usage never tops out. If that's the case and gzip is dumping
truckloads of writes, then you're suffering from something that irks me even
more than the scheduler in Linux, and that's how much writes hurt just about
everything else. Try your testcase with bzip2 instead (since that won't be
I/O bound), or drop your dirty ratio to as low as possible, which helps a
little bit (5% is the minimum):

echo 5 > /proc/sys/vm/dirty_ratio

and finally try the braindead noop I/O scheduler as well:

echo noop > /sys/block/sda/queue/scheduler

(replace sda with your drive obviously).
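
For completeness, here is a rough sequence tying those knobs together - a
minimal sketch only, assuming the drive really is sda and that you want to
restore the previous settings once the test is done:

# save current settings (the scheduler file marks the active one in [brackets])
old_ratio=$(cat /proc/sys/vm/dirty_ratio)
old_sched=$(sed 's/.*\[\(.*\)\].*/\1/' /sys/block/sda/queue/scheduler)

echo 5    > /proc/sys/vm/dirty_ratio
echo noop > /sys/block/sda/queue/scheduler

# ... rerun the gzip testcase here ...

# put things back the way they were
echo "$old_ratio" > /proc/sys/vm/dirty_ratio
echo "$old_sched" > /sys/block/sda/queue/scheduler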

I'd wager a big one that's what causes your gzip pain. If it weren't for the
fact that I've decided to all but give up ever trying to provide code for
mainline again, trying my best to make writes hurt less on Linux would be my
next big thing [tm].

Oh, and for the others watching (points to vm hackers): I found a bug when
playing with the dirty ratio code. If you modify it to allow the ratio to drop
below 5% but still above the minimum in the vm code, stalls happen somewhere
in the vm where nothing much happens for sometimes 20 or 30 seconds in the
worst case. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be
set ultra low because these stalls were gross.

> Amazing to me, the cpu it's using stays generally below 80%, and often below
> 60%, even while the kmail composer has a full sentence in its buffer that
> it still hasn't shown me when I switch to the htop screen to check, and
> back to the kmail screen to see if it's updated yet. The screen switch
> doesn't seem to lag, so I don't think renicing X would be helpful. Those
> are the obvious lags, and I'll build & reboot to the CFS patch at some
> point this morning (what's left of it, that is :). And report in due time of
> course

--
-ck

Mike Galbraith

unread,
Apr 15, 2007, 1:21:29 PM4/15/07
to
On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote:
> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> >
> > Demystify what? The casual observer need only read either your attempt
>
> Here's the problem. You're a casual observer and obviously not paying
> attention.
>
> > at writing a scheduler, or my attempts at fixing the one we have, to see
> > that it was high time for someone with the necessary skills to step in.
> > Now progress can happen, which was _not_ happening before.
>
> I think that's inaccurate and there are plenty of folks that have that
> technical skill and background. The scheduler code isn't a deep mystery
> and there are plenty of good kernel hackers out here across many
> communities. Ingo isn't the only person on this planet to have deep
> scheduler knowledge.

Ok <shrug>, I'm not paying attention, and you can't read. We're even.
Have a nice life.

-Mike

Gene Heskett

unread,
Apr 15, 2007, 1:22:44 PM4/15/07
to

Chuckle, possibly but then I'm not anything even remotely close to an expert
here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 +
sched-mike-5.patch for grins and giggles, or frowns and profanity as the case
may call for.

>Oh and for the others watching, (points to vm hackers) I found a bug when
>playing with the dirty ratio code. If you modify it to allow it drop below
> 5% but still above the minimum in the vm code, stalls happen somewhere in
> the vm where nothing much happens for sometimes 20 or 30 seconds worst case
> scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to
> be set ultra low because these stalls were gross.

I think I'd need a bit of tutoring on how to do that. I recall that one other
time, several weeks back, I thought I would try one of those famous echo this
>/proc/that ideas that went by on this list, but even though I was root,
apparently /proc was read-only AFAIWC.

>> Amazing to me, the cpu it's using stays generally below 80%, and often
>> below 60%, even while the kmail composer has a full sentence in its buffer
>> that it still hasn't shown me when I switch to the htop screen to check,
>> and back to the kmail screen to see if it's updated yet. The screen switch
>> doesn't seem to lag, so I don't think renicing X would be helpful. Those
>> are the obvious lags, and I'll build & reboot to the CFS patch at some
>> point this morning (what's left of it, that is :). And report in due time
>> of course

And now I wonder if I applied the right patch. This one feels good ATM, but I
don't think it's the CFS thingy. No, I'm sure of it now: none of the patches
I've saved say a thing about CFS. Backtrack up the list I guess; ignore
me for the nonce.


--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Microsoft: Re-inventing square wheels

-- From a Slashdot.org post

Bill Huey

unread,
Apr 15, 2007, 1:24:33 PM4/15/07
to
On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike have contributed regularly to upstream
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when an
idea gets proposed, like SchedPlug, only to have people be hostile to it
and then suddenly turn around and adopt the idea. It gives the impression
that you, in this specific case, were more interested in controlling the
situation and the track of development than in actually being inclusive
of the development process, with discussion and serious consideration, etc...

This is how the Linux community can be perceived as elitist. The old guard
would serve the community better if people were more mindful of and sensitive
to developer issues. There was a particular speech at OLS 2006 that I was
turned off by, one that pretty much pandered to the "old guard's" needs over
those of newer developers. Since I'm a somewhat established engineer in -rt
(being the only other person that mapped the lock hierarchy out for full
preemptibility), I had the confidence to pretty much ignore it, while
previously this could have really upset me and been highly discouraging to
a relatively new developer.

As Linux gets larger and larger this is going to be an increasing problem
when folks come into the community with new ideas, and the community will
need to change if it intends to integrate these folks. IMO, a lot of
these flame wars wouldn't need to exist if folks listened to each other
better and permitted co-ownership of code like the scheduler, since it needs
multiple hands in it to adapt to new loads and situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill

Willy Tarreau

unread,
Apr 15, 2007, 1:27:50 PM4/15/07
to

Willy

Mike Galbraith

unread,
Apr 15, 2007, 1:32:18 PM4/15/07
to
On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:

> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it. I'm convinced. I'm impatient to read about Mike's feedback
> with his workload which behaves strangely on RSDL. If it works OK here,
> it will be the proof that heuristics should not be needed.

You mean the X + mp3 player + audio visualization test? X + Gforce
visualization have problems getting half of my box in the presence of
two other heavy CPU-using tasks. Behavior is _much_ better than
RSDL/SD, but the synchronous nature of X/client seems to be a problem.

With this scheduler, renicing X/client does cure it, whereas with SD it
did not help one bit. (I know a trivial way to cure that, and this
framework makes that possible without dorking up fairness as a general
policy.)

-Mike

Mike Galbraith

unread,
Apr 15, 2007, 1:43:17 PM4/15/07
to
On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:

> On Sun, 15 Apr 2007, Willy Tarreau wrote:
> > Ingo could have publicly spoken with them about his ideas of killing
> > the O(1) scheduler and replacing it with an rbtree-based one, and using
> > part of Bill's work to speed up development.
>
> He did exactly that and he did it with a patch. Nothing new here. This is
> how development on LKML proceeds when you have two or more competing
> designs. There's absolutely no need to get upset or hurt your feelings
> over it. It's not malicious, it's how we do Linux development.

Yes. Exactly. This is what it's all about, this is what makes it work.

-Mike

Christoph Pfister

unread,
Apr 15, 2007, 1:48:37 PM4/15/07
to
Hi,

2007/4/15, Ingo Molnar <mi...@elte.hu>:

Could you try xine-ui or gxine? I rather suspect xine-lib for the
freezing issues. In any case, I think a gdb backtrace would be much
nicer - but if you can't reproduce the freeze issue with other xine
based players and want to run kaffeine in gdb, you need to execute
"gdb --args kaffeine --nofork".

> > thanks. This does has the appearance of a userspace race condition of
> > some sorts. Can you trigger this hang with the patch below applied to
> > the vanilla tree as well? (with no CFS patch applied)
>
> oops, please use the patch below instead.
>
> Ingo

<snip>

Christoph

Ingo Molnar

unread,
Apr 15, 2007, 2:01:40 PM4/15/07
to

* Willy Tarreau <w...@1wt.eu> wrote:

> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know how
> that translates to frozen xterms, but let's attack the problems one at
> a time.

to debug this, could you try to apply this add-on as well:

http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch

with this patch applied you should have a /proc/sched_debug file that
prints all runnable tasks and other interesting info from the runqueue.

[ i've refreshed all the patches on the CFS webpage, so if this doesn't
apply cleanly to your current tree then you'll probably have to
refresh one of the patches. ]

The output should look like this:

Sched Debug Version: v0.01
now at 226761724575 nsecs

cpu: 0
.nr_running : 3
.raw_weighted_load : 384
.nr_switches : 13666
.nr_uninterruptible : 0
.next_balance : 4294947416
.curr->pid : 2179
.rq_clock : 241337421233
.fair_clock : 7503791206
.wait_runtime : 2269918379

runnable tasks:
task | PID | tree-key | -delta | waiting | switches
-----------------------------------------------------------------
+ cat 2179 7501930066 -1861140 1861140 2
loop_silent 2149 7503010354 -780852 0 911
loop_silent 2148 7503510048 -281158 280753 918

now for your workload the list should be considerably larger. If there's
starvation going on then the 'switches' field (number of context
switches) of one of the tasks would never increase while you have this
'cannot switch consoles' problem.
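
for example, something like this (just a sketch - it assumes the stuck task
is gpm and that 'switches' is the last column of each task line, as in the
sample output above) would show whether that count moves at all:

grep gpm /proc/sched_debug | awk '{ print $NF }'
sleep 10
grep gpm /proc/sched_debug | awk '{ print $NF }'
# if the number printed is identical both times while the console is stuck,
# that task is being starved.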

maybe you'll have to unapply the fair-fork patch to make it trigger
again. (fair-fork does not fix anything, so it probably just hides a
real bug.)

(i'm meanwhile busy running your scheddos utilities to reproduce it
locally as well :)

Ingo

Willy Tarreau

unread,
Apr 15, 2007, 2:10:32 PM4/15/07
to
Hi Ingo,

On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote:
>
> * Willy Tarreau <w...@1wt.eu> wrote:
>
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> > bash forks 1000 processes, then progressively execs scheddos, but it
> > takes some time). So I'm rebuilding right now. But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know how
> > that translates to frozen xterms, but let's attack the problems one at
> > a time.
>
> to debug this, could you try to apply this add-on as well:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
>
> with this patch applied you should have a /proc/sched_debug file that
> prints all runnable tasks and other interesting info from the runqueue.

I don't know if you have seen my mail from yesterday evening (here). I
found that changing keventd's prio fixed the problem. You may be interested
in the description; I sent it at 21:01 (+0200).
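
For the archive, one way to do that (a sketch only, assuming the workqueue
thread shows up as events/0 on this kernel and that SCHED_FIFO priority 1 is
acceptable on the box):

# find the keventd/workqueue thread and give it a real-time priority
pid=$(ps -eo pid,comm | awk '$2 == "events/0" { print $1 }')
chrt -f -p 1 $pid
chrt -p $pid    # verify the new policy and priority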

> [ i've refreshed all the patches on the CFS webpage, so if this doesn't
> apply cleanly to your current tree then you'll probably have to
> refresh one of the patches.]

Fine, I'll have a look. I already had to rediff the sched-fair-fork
patch last time.

> The output should look like this:
>
> Sched Debug Version: v0.01
> now at 226761724575 nsecs
>
> cpu: 0
> .nr_running : 3
> .raw_weighted_load : 384
> .nr_switches : 13666
> .nr_uninterruptible : 0
> .next_balance : 4294947416
> .curr->pid : 2179
> .rq_clock : 241337421233
> .fair_clock : 7503791206
> .wait_runtime : 2269918379
>
> runnable tasks:
> task | PID | tree-key | -delta | waiting | switches
> -----------------------------------------------------------------
> + cat 2179 7501930066 -1861140 1861140 2
> loop_silent 2149 7503010354 -780852 0 911
> loop_silent 2148 7503510048 -281158 280753 918

Nice.

> now for your workload the list should be considerably larger. If there's
> starvation going on then the 'switches' field (number of context
> switches) of one of the tasks would never increase while you have this
> 'cannot switch consoles' problem.
>
> maybe you'll have to unapply the fair-fork patch to make it trigger
> again. (fair-fork does not fix anything, so it probably just hides a
> real bug.)
>
> (i'm meanwhile busy running your scheddos utilities to reproduce it
> locally as well :)

I discovered I had the frame-buffer enabled (I did not notice it at first
because I do not have the logo and the resolution is the same as text).
It's matroxfb with a G400, if that helps. It may be that it needs some
CPU that it cannot get in order to clear the display before switching; I
don't know.

However I won't try this right now, I'm deep in userland at the moment.

Regards,
Willy

Linus Torvalds

unread,
Apr 15, 2007, 2:10:51 PM4/15/07
to

On Sun, 15 Apr 2007, Mike Galbraith wrote:

> On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:
> >
> > He did exactly that and he did it with a patch. Nothing new here. This is
> > how development on LKML proceeds when you have two or more competing
> > designs. There's absolutely no need to get upset or hurt your feelings
> > over it. It's not malicious, it's how we do Linux development.
>
> Yes. Exactly. This is what it's all about, this is what makes it work.

I obviously agree, but I will also add that one of the most motivating
things there *is* in open source is "personal pride".

It's a really good thing, and it means that if somebody shows that your
code is flawed in some way (by, for example, making a patch that people
claim gets better behaviour or numbers), any *good* programmer that
actually cares about his code will obviously suddenly be very motivated to
out-do the out-doer!

Does this mean that there will be tension and rivalry? Hell yes. But
that's kind of the point. Life is a game, and if you aren't in it to win,
what the heck are you still doing here?

As long as it's reasonably civil (I'm not personally a huge believer in
being too polite or "politically correct", so I think the "reasonably" is
more important than the "civil" part!), and as long as the end result is
judged on TECHNICAL MERIT, it's all good.

We don't want to play politics. But encouraging people's competitive
feelings? Oh, yes.

Linus

Mike Galbraith

unread,
Apr 15, 2007, 2:11:05 PM4/15/07
to
On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote:

> Chuckle, possibly but then I'm not anything even remotely close to an expert
> here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 +
> sched-mike-5.patch for grins and giggles, or frowns and profanity as the case
> may call for.

Erm, that patch is embarrassingly buggy, so profanity should dominate.

-Mike

Tim Tassonis

unread,
Apr 15, 2007, 3:00:12 PM4/15/07
to
> + printk("Fair Scheduler: Copyright (c) 2007 Red Hat, Inc., Ingo Molnar\n");

So that's what all the fuss about the staircase scheduler is about,
then! At last, I see your point.


> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!
>

How pathetic can you get?

Tim, really looking forward to the CL final where Liverpool will beat
the shit out of Scum (and there's a lot to be beaten out).

Jonathan Lundell

unread,
Apr 15, 2007, 3:10:05 PM4/15/07
to
On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:

> It's a really good thing, and it means that if somebody shows that
> your
> code is flawed in some way (by, for example, making a patch that
> people
> claim gets better behaviour or numbers), any *good* programmer that
> actually cares about his code will obviously suddenly be very
> motivated to
> out-do the out-doer!

"No one who cannot rejoice in the discovery of his own mistakes
deserves to be called a scholar."

--Don Foster, "literary sleuth", on retracting his attribution of "A
Funerall Elegye" to Shakespeare (it's more likely John Ford's work).

Ingo Molnar

unread,
Apr 15, 2007, 3:30:09 PM4/15/07
to

* Willy Tarreau <w...@1wt.eu> wrote:

> > to debug this, could you try to apply this add-on as well:
> >
> > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> >
> > with this patch applied you should have a /proc/sched_debug file
> > that prints all runnable tasks and other interesting info from the
> > runqueue.
>
> I don't know if you have seen my mail from yesterday evening (here). I
> found that changing keventd prio fixed the problem. You may be
> interested in the description. I sent it at 21:01 (+200).

ah, indeed i missed that mail - the response to the patches was quite
overwhelming (and i naively thought people don't do Linux hacking over
the weekends anymore ;).

so Linus was right: this was caused by scheduler starvation. I can see
one immediate problem already: the 'nice offset' is not divided by
nr_running as it should be. The patch below should fix this, but i have yet
to test it accurately; this change might well render nice levels
unacceptably ineffective under high loads.

Ingo

--------->
---
kernel/sched_fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
int leftmost = 1;
long long key;

- key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+ key = rq->fair_clock - p->wait_runtime;
+ if (unlikely(p->nice_offset))
+ key += p->nice_offset / rq->nr_running;

p->fair_key = key;

William Lee Irwin III

unread,
Apr 15, 2007, 3:40:10 PM4/15/07
to
On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should be. The patch below should fix this, but i have yet
> to test it accurately; this change might well render nice levels
> unacceptably ineffective under high loads.

I've been suggesting testing CPU bandwidth allocation as influenced by
nice numbers for a while now for a reason.


-- wli

Ingo Molnar

unread,
Apr 15, 2007, 3:40:12 PM4/15/07
to

* Ingo Molnar <mi...@elte.hu> wrote:

> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should be. The patch below should fix this, but i have
> yet to test it accurately; this change might well render nice
> levels unacceptably ineffective under high loads.

erm, use the updated patch below instead if you want to try this on a
32-bit system. But ... i think you should wait until i have all of this
re-tested.

Ingo

---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {

s64 wait_runtime;
u64 exec_runtime, fair_key;
- s64 nice_offset, hog_limit;
+ s32 nice_offset, hog_limit;

unsigned long policy;
cpumask_t cpus_allowed;


Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
int leftmost = 1;
long long key;

- key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+ key = rq->fair_clock - p->wait_runtime;
+ if (unlikely(p->nice_offset))
+ key += p->nice_offset / (rq->nr_running + 1);

Ingo Molnar

unread,
Apr 15, 2007, 4:00:21 PM4/15/07
to

* William Lee Irwin III <w...@holomorphy.com> wrote:

> On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> > so Linus was right: this was caused by scheduler starvation. I can
> > see one immediate problem already: the 'nice offset' is not divided
> > by nr_running as it should be. The patch below should fix this, but i
> > have yet to test it accurately; this change might well render
> > nice levels unacceptably ineffective under high loads.
>
> I've been suggesting testing CPU bandwidth allocation as influenced by
> nice numbers for a while now for a reason.

Oh, I was very much testing "CPU bandwidth allocation as influenced by
nice numbers" - it's one of the basic things i do when modifying the
scheduler. An automated tool, while nice (all automation is nice),
wouldn't necessarily show such bugs though, because here too it needed
thousands of running tasks to trigger in practice. Any volunteers? ;)
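
for the volunteers, a rough sketch of what such a test could look like
(assuming a single-CPU box; a few hundred busy loops per nice level stand in
for the thousands mentioned above, and NR is obviously just an illustrative
number):

# spawn NR busy loops at nice 0 and NR at nice 10, let them run for a
# minute, then compare the CPU share each nice level received:
NR=200
for i in $(seq $NR); do nice -n 0  sh -c 'while :; do :; done' & done
for i in $(seq $NR); do nice -n 10 sh -c 'while :; do :; done' & done
sleep 60
ps -eo ni,pcpu,comm | awk '$3 == "sh" { cpu[$1] += $2 }
    END { for (n in cpu) print "nice", n, "->", cpu[n], "% of CPU" }'
kill $(jobs -p)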

Ingo

Matt Mackall

unread,
Apr 15, 2007, 4:20:09 PM4/15/07
to
On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote:
> so the rejection was on these grounds, and i still very much stand by
> that position here and today: i didnt want to see the Linux scheduler
> landscape balkanized and i saw no technological reasons for the
> complication that external modularization brings.

But "balkanization" is a good thing. "Monoculture" is a bad thing.

Look at what happened with I/O scheduling. Opening things up to some
new ideas by making it possible to select your I/O scheduler took us
from 10 years of stagnation to healthy, competitive development, which
gave us a substantially better I/O scheduler.

Look at what's happening right now with TCP congestion algorithms.
We've had decades of tweaking Reno slightly now turned into a vibrant
research area with lots of radical alternatives. A winner will
eventually emerge and it will probably look quite a bit different than
Reno.

Similar things have gone on since the beginning with filesystems on
Linux. Being able to easily compare filesystems head to head has been
immensely valuable in improving our 'core' Linux filesystems.

And what we've had up to now is a scheduler monoculture. Until Andrew
put RSDL in -mm, if people wanted to experiment with other schedulers,
they had to go well off the beaten path to do it. So all the people
who've been hopelessly frustrated with the mainline scheduler go off to
the -ck ghetto, or worse, stick with 2.4.

Whether your motivations have been protectionist or merely
shortsighted, you've stomped pretty heavily on alternative scheduler
development by completely rejecting the whole plugsched concept. If
we'd opened up mainline to a variety of schedulers _3 years ago_, we'd
probably have gotten to where we are today much sooner.

Hopefully, the next time Rik suggests pluggable page replacement
algorithms, folks will actually seriously consider it.

--
Mathematics is the supreme nostalgia of our time.

Ingo Molnar

unread,
Apr 15, 2007, 4:50:09 PM4/15/07
to

* Matt Mackall <m...@selenic.com> wrote:

> Look at what happened with I/O scheduling. Opening things up to some
> new ideas by making it possible to select your I/O scheduler took us
> from 10 years of stagnation to healthy, competitive development, which
> gave us a substantially better I/O scheduler.

actually, 2-3 years ago we already had I/O schedulers, and my opinion
against plugsched back then (also shared by Nick and Linus) was formed
very much with them in mind. There are at least 4 reasons why I/O
schedulers are different from CPU schedulers:

1) CPUs are a non-persistent resource shared by _all_ tasks and
workloads in the system. Disks are _persistent_ resources very much
attached to specific workloads. (If tasks had to be 'persistent' to
the CPU they were started on we'd have much different scheduling
technology, and there would be much less complexity.) More analogous
to CPU schedulers would perhaps be VM/MM schedulers, and those tend
to be hard to modularize in a technologically sane way too. (and
unlike disks there's no good generic way to attach VM/MM schedulers
to particular workloads.) So it's apples to oranges.

in practice it comes down to having one good scheduler that runs all
workloads on a system reasonably well. And given that a very large
portion of systems run mixed workloads, the demand for one good
scheduler is pretty high. Meanwhile, i can run with mixed I/O schedulers
just fine.

2) plugsched did not allow on-the-fly selection of schedulers, nor did
it allow per-CPU selection of schedulers. I/O schedulers you can
change per disk, on the fly, making them much more useful in
practice. Also, I/O schedulers (while definitely not being slow!) are
a lot less performance-sensitive than CPU schedulers.

3) I/O schedulers are pretty damn clean code, and plugsched, at least
the last version i saw of it, didn't come even close.

4) the good thing that happened to I/O, after years of stagnation, isn't
I/O schedulers. The good thing that happened to I/O is called Jens
Axboe. If you care about the I/O subsystem then print that name out
and hang it on the wall. That and only that is what mattered.

all in all, while there are definitely uses (embedded would like to have
a smaller/different scheduler, etc.), the technical case for
modularization for the sake of selectability is a lot lower for CPU
schedulers than it is for I/O schedulers.

nor was the non-modularity of some piece of code ever an impediment to
competition. May i remind you of the pretty competitive SLAB allocator
landscape, resulting in things like the SLOB allocator, written by
yourself? ;-)

Ingo

Matt Mackall

unread,
Apr 15, 2007, 5:50:06 PM4/15/07
to
On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <m...@selenic.com> wrote:
>
> > Look at what happened with I/O scheduling. Opening things up to some
> > new ideas by making it possible to select your I/O scheduler took us
> > from 10 years of stagnation to healthy, competitive development, which
> > gave us a substantially better I/O scheduler.
>
> actually, 2-3 years ago we already had IO schedulers, and my opinion
> against plugsched back then (also shared by Nick and Linus) was very
> much considering them. There are at least 4 reasons why I/O schedulers
> are different from CPU schedulers:

...

> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
> the last version i saw of it, didn't come even close.

That's irrelevant. Plugsched was an attempt to get alternative
schedulers exposure in mainline. I know, because I remember
encouraging Bill to pursue it. Not only did you veto plugsched (which
may have been a perfectly reasonable thing to do), but you also vetoed
the whole concept of multiple schedulers in the tree too. "We don't
want to balkanize the scheduling landscape".

And that latter part is what I'm claiming has set us back for years.
It's not a technical argument but a strategic one. And it's just not a
good strategy.



> 4) the good thing that happened to I/O, after years of stagnation, isn't
> I/O schedulers. The good thing that happened to I/O is called Jens
> Axboe. If you care about the I/O subsystem then print that name out
> and hang it on the wall. That and only that is what mattered.

Disagree. Things didn't actually get interesting until Nick showed up
with AS and got it in-tree to demonstrate the huge amount of room we
had for improvement. It took several iterations of AS and CFQ (with a
couple complete rewrites) before CFQ began to look like the winner.
The resulting time-sliced CFQ was fairly heavily influenced by the
ideas in AS.

Similarly, things in scheduler land had been pretty damn boring until
Con finally got Andrew to take one of his schedulers for a spin.

> nor was the non-modularity of some piece of code ever an impediment to
> competition. May i remind you of the pretty competitive SLAB allocator
> landscape, resulting in things like the SLOB allocator, written by
> yourself? ;-)

Thankfully no one came out and said "we don't want to balkanize the
allocator landscape" when I submitted it or I probably would have just
dropped it, rather than painfully dragging it along out of tree for
years. I'm not nearly the glutton for punishment that Con is. :-P

--
Mathematics is the supreme nostalgia of our time.

S.Çağlar Onur

unread,
Apr 15, 2007, 6:20:08 PM4/15/07
to
On Sunday, 15 April 2007, Christoph Pfister wrote:
> Could you try xine-ui or gxine? Because I suspect rather xine-lib for
> freezing issues. In any way I think a gdb backtrace would be much
> nicer - but if you can't reproduce the freeze issue with other xine
> based players and want to run kaffeine in gdb, you need to execute
> "gdb --args kaffeine --nofork".

I just tested xine-ui and I can easily reproduce the exact same problem
with it, so you are right: it seems to be a xine-lib problem triggered by
the CFS changes.

> > > thanks. This does has the appearance of a userspace race condition of
> > > some sorts. Can you trigger this hang with the patch below applied to
> > > the vanilla tree as well? (with no CFS patch applied)
> >
> > oops, please use the patch below instead.

Tomorrow I'll test that patch and also try to get a backtrace.

Cheers
--
S.Çağlar Onur <cag...@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!


Ismail Dönmez

unread,
Apr 15, 2007, 6:50:05 PM4/15/07
to
Hi,
On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

Tested this on top of Linus' GIT tree, but the system gets very unresponsive
during high disk I/O using ext3 as the filesystem; even writing a 300MB file
to a USB disk (an iPod, actually) has the same effect.

Regards,
ismail


Con Kolivas

unread,
Apr 15, 2007, 7:00:10 PM4/15/07
to
On Monday 16 April 2007 05:00, Jonathan Lundell wrote:
> On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> > It's a really good thing, and it means that if somebody shows that
> > your
> > code is flawed in some way (by, for example, making a patch that
> > people
> > claim gets better behaviour or numbers), any *good* programmer that
> > actually cares about his code will obviously suddenly be very
> > motivated to
> > out-do the out-doer!
>
> "No one who cannot rejoice in the discovery of his own mistakes
> deserves to be called a scholar."

Lovely comment. I realise this is not truly directed at me, but clearly, in
the context in which it has been said, people will assume it is directed my
way, so while we're all spinning lkml-quality rhetoric, let me have a right
of reply.

One thing I have never tried to do was to ignore bug reports. I'm forever
joking that I keep pulling code out of my arse to improve what I've done.
RSDL/SD was no exception; heck, it had 40 iterations. The reason I could not
reply to bug report A with "Oh, that is problem B, so I'll fix it with code C"
was, as I've said many many times over, health-related. I did indeed try to
fix many of them without spending hours replying to sometimes unpleasant
emails. If health hadn't been an issue there might have been 1000 iterations
of SD.

There was only ever _one_ thing that I was absolutely steadfast on as a
concept and refused to fix, one that people might claim was "a mistake I did
not rejoice in" as a scholar. That was that the _correct_ behaviour for a
scheduler is to be fair, such that proportional slowdown with load is (using
that awful pun) a feature, not a bug. Now there are people who will still
disagree violently with me on that. SD attempted to be a fairness-first
virtual-deadline design. If I failed on that front, then so be it (and at
least one person has certainly said, in lovely warm fuzzy friendly
communication, that I'm a global failure on all fronts with SD). But let me
point out now that Ingo's shiny new scheduler is also a fairness-first
virtual-deadline design, which will have proportional slowdown with load. So
it will have a very similar feature. I dare anyone to claim that proportional
slowdown with load is a bug, because I will no longer feel like I'm standing
alone with a BFG9000 trying to defend my standpoint. Others can take up the
post at last.

--
-ck

Arjan van de Ven

unread,
Apr 15, 2007, 7:30:12 PM4/15/07
to

just to make sure: this exact same workload, but with the stock scheduler,
does not have this effect?

if so, then it could well be that the scheduler is too fair for its own
good (being really fair inevitably ends up not batching as much as one
should, and batching is needed to get any kind of decent performance out
of disks nowadays)


--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
