
[PATCH][plugsched 0/28] Pluggable cpu scheduler framework


Con Kolivas

Oct 30, 2004, 10:34:09 AM
to linux kernel mailing list, Andrew Morton, Ingo Molnar, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin
With the recent interest in varying the cpu schedulers in linux, this
set of patches provides a modular framework for adding multiple
boot-time selectable cpu schedulers. William Lee Irwin III came up with
the original design and I based my patchset on that.

This code was designed to touch the fewest files possible, be completely
arch-independent, and allow extra schedulers to be coded in by only
touching Kconfig, scheduler.c and scheduler.h. It should incur no
overhead when run and will allow you to compile in only the scheduler(s)
you desire. This allows, for example, embedded hardware to have a tiny
new scheduler that takes up minimal code space.

This works by taking all functions that will be common to all scheduler
designs and moving them from kernel/sched.c into kernel/scheduler.c.

Then it adds the scheduler driver struct, a set of pointers to functions
that each have per-scheduler versions; include/linux/scheduler.h holds
the definition of the scheduler driver structure.

kernel/sched.c remains in place as the default cpu scheduler, to
minimise the size of the patch set and keep it portable.

All variables of the task_struct that could be unique to a different
scheduler are now in a private struct held within a union in
task_struct. rt_priority and static_priority are kept global for
userspace interface and for the possibility of adding run-time switching
later on.

The main disadvantage of this design is that there will (initially) be a
lot of code duplication, with different scheduler designs carrying
copies in their own private space. This means that if a new scheduler
uses the same SMP balancing algorithm, it will need to be modified to
stay in sync with changes to the default scheduler. If, for example, you
modified just the dynamic priority component of the current scheduler
and left the runqueue and task_struct the same, you could make it depend
on the default scheduler and point most functions at that one.

However, the same disadvantage can be a major advantage. The fact that
so much of the scheduler is privatised means that wildly different
designs can be plugged in without any reference to the number of
runqueues: frame schedulers could be plugged in, runqueues could be
shared (e.g. on NUMA designs), and we could even keep new balancing code
in a "developing" arm of the scheduler that testers can boot, and so on.

What is left to do is add a per-scheduler entry into /sys which can be
used to modify the unique controls each scheduler has, and write up some
documentation for this and staircase.

Anyway the patches will follow shortly, and then (not surprisingly) a
port of the staircase scheduler to plug into this framework which can
also be used as an example for others wishing to code up or port their
schedulers.

While I have tried to build this on as many configurations as possible,
I am sure breakage will creep in given the type of modification so I
apologise in advance.

Patches for those who want to download them separately here:

http://ck.kolivas.org/patches/plugsched/

Here is a diffstat of the patches rolled up.

 fs/proc/array.c           |    2
 fs/proc/proc_misc.c       |   14
 include/linux/init_task.h |    5
 include/linux/sched.h     |   39 -
 include/linux/scheduler.h |   83 ++
 init/Kconfig              |   37 +
 init/main.c               |   10
 kernel/Makefile           |    3
 kernel/sched.c            | 1313 ++++++++--------------------------------
 kernel/scheduler.c        | 1201 ++++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c             |    2
 11 files changed, 1599 insertions(+), 1110 deletions(-)

Thanks to William Lee Irwin III for design help, Alex Nyberg for testing
and a bootload of others for ideas and help with the coding.

Cheers,
Con


Pavel Machek

Oct 31, 2004, 6:34:29 PM
to Con Kolivas, linux kernel mailing list, Andrew Morton, Ingo Molnar, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin
Hi!

> This code was designed to touch the fewest files possible, be completely
> arch-independent, and allow extra schedulers to be coded in by only
> touching Kconfig, scheduler.c and scheduler.h. It should incur no
> overhead when run and will allow you to compile in only the scheduler(s)
> you desire. This allows, for example, embedded hardware to have a tiny
> new scheduler that takes up minimal code space.

You are changing

some_functions()

into

something->function()

no? I do not think that is 0 overhead...
Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Con Kolivas

Oct 31, 2004, 6:41:30 PM
to Pavel Machek, linux kernel mailing list, Andrew Morton, Ingo Molnar, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin
Pavel Machek wrote:
> Hi!
>
>
>>This code was designed to touch the fewest files possible, be completely
>>arch-independent, and allow extra schedulers to be coded in by only
>>touching Kconfig, scheduler.c and scheduler.h. It should incur no
>>overhead when run and will allow you to compile in only the scheduler(s)
>>you desire. This allows, for example, embedded hardware to have a tiny
>>new scheduler that takes up minimal code space.
>
>
> You are changing
>
> some_functions()
>
> into
>
> something->function()
>
> no? I do not think that is 0 overhead...

Indeed, and I am running microbenchmarks to see what measurable
overhead there is; so far any difference is lost in the noise.

Cheers,
Con


William Lee Irwin III

Oct 31, 2004, 8:49:16 PM
to Pavel Machek, Con Kolivas, linux kernel mailing list, Andrew Morton, Ingo Molnar, Peter Williams, Alexander Nyberg, Nick Piggin
At some point in the past, Con Kolivas wrote:
>> This code was designed to touch the fewest files possible, be completely
>> arch-independent, and allow extra schedulers to be coded in by only
>> touching Kconfig, scheduler.c and scheduler.h. It should incur no
>> overhead when run and will allow you to compile in only the scheduler(s)
>> you desire. This allows, for example, embedded hardware to have a tiny
>> new scheduler that takes up minimal code space.

On Mon, Nov 01, 2004 at 12:33:13AM +0100, Pavel Machek wrote:
> You are changing
> some_functions()
> into
> something->function()
> no? I do not think that is 0 overhead...

It's nonzero, yes. However, it's rather small with modern branch
predictors; older microarchitectures handled this less well, which
is probably why you expect a measurable hit. It may still have
non-negligible performance effects on some legacy architectures,
but I would not let that hold up progress.


-- wli

Ingo Molnar

Nov 1, 2004, 7:02:22 AM
to Pavel Machek, Con Kolivas, linux kernel mailing list, Andrew Morton, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin, Linus Torvalds

* Pavel Machek <pa...@ucw.cz> wrote:

> You are changing
>
> some_functions()
>
> into
>
> something->function()
>
> no? I do not think that is 0 overhead...

my main worry with this approach is not really overhead but the impact
on scheduler development. Right now there is a Linux scheduler that
every developer (small-workload and large-workload people) tries to make
as good as possible. Historically and fundamentally, scheduler
development and feedback has always been a 'scarce resource' - the
feedback cycle is (necessarily) long and there are alot of specialized
cases to take care of, which slowly dribble in with time.

firstly, if someone wants a different or specialized scheduler there's
no problem even under the current model, and it has happened before. We
made the scheduler itself easily 'rip-out-able' in 2.6 by decreasing the
junction points between the scheduler and the rest of the system. Also,
the current scheduler is in no way cast in stone; we could easily end up
having a different interactivity code within the scheduler, as a result
of the various 'get rid of the two arrays' efforts currently underway.
But i very much do not support putting the 'junction points' in the
wrong place.

But more importantly, in the current model, people who care about
'fringe' workloads (embedded and high-end) are 'forced' to improve the
core scheduler if they want to see their problems solved by mainline.
They are forced to think about issues, to generalize problems and to
solve them so that the large picture is still right. This worked pretty
well in the past and works well today. It is painful in terms of getting
stuff integrated but it works.

Scheduler domains was and is a prime example of this concept in the
works: load-balancing was a difficult issue that kept (some of) us
uneasy for years and then a nice generic framework came along that
replaced the old code and made both small and large boxes work well.
As a bonus it also solved the 'HT scheduling' issue almost for free.
Sched-domains is nice for both the low-end and the high-end - it enables
512 CPU single-system-image systems supported by (almost-) vanilla 2.6
kernel. What more can we ask for?

I am 100% sure that we'd not have sched-domains today had we gone for a
'plugin' model say 2-3 years ago. It's always hard to predict 'what if'
scenarios but here's my guess: we'd have a NUMA scheduler, a separate
SMP scheduler, a number of UP schedulers and embedded schedulers, and
say HT would be supported in different ways by the SMP and NUMA
schedulers.

or to give another example: we emphatically do not allow 'dynamic
syscalls' in Linux, although for years we've been hammered with claims
of how much more enterprise-ready Linux would be with them. In reality,
without 'dynamic syscalls' all the 'fringe functionality' people have to
think harder and integrate their stuff into the current
syscalls/drivers/subsystems.

the process scheduler is i think a similar piece of technology: we want
to make it _harder_ for specialized workloads to be handled in some
'specialized' way, because those precise workloads do show up in other
workloads too, in a different manner. A fix made for NUMA or real-time
purposes can easily make a difference for desktop workloads. Often
'specialized' is an excuse for a 'fundamentally broken, limited hack',
especially in the scheduler world.

I believe that by compartmenting in the wrong way [*] we kill the
natural integration effects. We'd end up with 5 (or 20) bad generic
schedulers that happen to work in one precise workload only, but there
would not be enough push to build one good generic scheduler, because
the people who are now forced to care about the Linux scheduler would be
content about their specialized schedulers. Yes, it would be easier to
make a specialized scheduler work well in that precise workload (because
the developer can make the 'this is only for this particular workload'
excuse), and this approach may satisfy the embedded and high-end needs
in a quicker way. So i consider scheduler plugins as the STREAMS
equivalent of scheduling and i am not very positive about it. Just like
STREAMS, i consider 'scheduler plugins' as the easy but deceptive and
wrong way out of current problems, which will create much worse problems
than the ones it tries to solve.

Ingo

( [*] how is this different from say the IO scheduler plugin
architecture? Just compare the two, it's two very different things.
Firstly, the timescale is very different - the process scheduler cares
about microseconds, the IO scheduler's domain is milliseconds. Also, IO
scheduling is fundamentally per-device and often there is good
per-device workload isolation so picking an IO scheduler per queue makes
much more sense than say picking a scheduler per CPU ... There are other
differences too, such as complexity and isolation from the rest of the
system. )

Kasper Sandberg

Nov 1, 2004, 8:22:35 AM
to Ingo Molnar, Pavel Machek, Con Kolivas, LKML Mailinglist, Andrew Morton, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin, Linus Torvalds
On Mon, 2004-11-01 at 12:41 +0100, Ingo Molnar wrote:
<snip>

> I believe that by compartmenting in the wrong way [*] we kill the
> natural integration effects. We'd end up with 5 (or 20) bad generic
> schedulers that happen to work in one precise workload only, but there
> would not be enough push to build one good generic scheduler, because
> the people who are now forced to care about the Linux scheduler would be
> content about their specialized schedulers. Yes, it would be easier to
> make a specialized scheduler work well in that precise workload (because
> the developer can make the 'this is only for this particular workload'
> excuse), and this approach may satisfy the embedded and high-end needs
> in a quicker way. So i consider scheduler plugins as the STREAMS
> equivalent of scheduling and i am not very positive about it. Just like
> STREAMS, i consider 'scheduler plugins' as the easy but deceptive and
> wrong way out of current problems, which will create much worse problems
> than the ones it tries to solve.

i see your point, and i agree it's not very nice to have specialized
schedulers for particular workloads only. however, as i see it,
plugsched doesn't have any direct overhead, and plugsched doesn't remove
the ability to develop one all-round scheduler which tries to handle
all workloads well. plugsched does give the opportunity to create
specialized schedulers, and as i see it now, staircase does a better job
of handling all-round workloads (at least for me). and it certainly does
make stuff easier.

Con Kolivas

Nov 1, 2004, 8:42:39 AM
to Ingo Molnar, Pavel Machek, linux kernel mailing list, Andrew Morton, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin, Linus Torvalds
Ingo Molnar wrote:
> my main worry with this approach is not really overhead but the impact
> on scheduler development.

> no problem even under the current model, and it has happened before. We


> made the scheduler itself easily 'rip-out-able' in 2.6 by decreasing the
> junction points between the scheduler and the rest of the system. Also,
> the current scheduler is in no way cast in stone; we could easily end up
> having a different interactivity code within the scheduler, as a result
> of the various 'get rid of the two arrays' efforts currently underway.

Do you honestly think with the current "2.6 forever" development process
that this is likely, even possible any more?

Given that fact, it means the current scheduler policy mechanism is
effectively set in stone. Do you think we can polish the current
scheduler enough to be, if not perfect, good enough for _every_ situation?

No one said that if we have a plugsched infrastructure we should
instantly accept any scheduler.

Regards,
Con


Nick Piggin

Nov 1, 2004, 9:34:02 AM
to Con Kolivas, Ingo Molnar, Pavel Machek, linux kernel mailing list, Andrew Morton, Peter Williams, William Lee Irwin III, Alexander Nyberg, Linus Torvalds
Con Kolivas wrote:
> Ingo Molnar wrote:
>
>> my main worry with this approach is not really overhead but the impact
>> on scheduler development.
>
>
>> no problem even under the current model, and it has happened before. We
>> made the scheduler itself easily 'rip-out-able' in 2.6 by decreasing the
>> junction points between the scheduler and the rest of the system. Also,
>> the current scheduler is in no way cast in stone; we could easily end up
>> having a different interactivity code within the scheduler, as a result
>> of the various 'get rid of the two arrays' efforts currently underway.
>
>
> Do you honestly think with the current "2.6 forever" development process
> that this is likely, even possible any more?
>

That's a nutty problem... but suppose 2.7 opened tomorrow: how would
you justify putting in a new scheduler even then? And how would you
get enough testing to ensure a repeat of early 2.6 doesn't happen?

I'd be very happy if we could figure out the answer to that question,
but I'm afraid pluggable schedulers probably isn't it (unfortunately).

> Given that fact, it means the current scheduler policy mechanism is
> effectively set in stone. Do you think we can polish the current
> scheduler enough to be, if not perfect, good enough for _every_ situation?
>

Specialised users always have done, and always will do, specialised
modifications; that's no problem. But as much as I hate to say it
(it's a good thing, I'd just like to be able to get nicksched in),
it seems that the current scheduler *is* actually good enough for
a general purpose operating system. At least the lack of complaints
seems to indicate that.

My proposal to the "how to get a new scheduler in" question is a set
of pretty comprehensive controlled (blind of course), subjective tests
with proper statistical analysis, to determine best behaviour... And
I'm only half joking :(

> No one said that if we have a plugsched infrastructure we should
> instantly accept any scheduler.
>

But so long as you don't have a compelling argument to _replace_
the current scheduler, people who want to use other ones may as
well just patch them in. By having multiple schedulers however,
you don't IMO get too many benefits and a lot of downsides that
Ingo pointed out.

That said, if you were able to get a unanimous "yes" from Linus,
Andrew, and Ingo then I wouldn't complain too loudly...

Jesse Barnes

Nov 1, 2004, 1:06:38 PM
to Ingo Molnar, Pavel Machek, Con Kolivas, linux kernel mailing list, Andrew Morton, Peter Williams, William Lee Irwin III, Alexander Nyberg, Nick Piggin, Linus Torvalds
On Monday, November 1, 2004 3:41 am, Ingo Molnar wrote:
> Sched-domains is nice for both the low-end and the high-end - it enables
> 512 CPU single-system-image systems supported by (almost-) vanilla 2.6
> kernel. What more can we ask for?

Minor correction: last I checked, a 512p booted with the vanilla kernel.
We're working hard to keep it that way :). (Witness John Hawkes' recent
patch to fix up the load balancing code.)

Jesse

Matthias Urlichs

Nov 2, 2004, 4:32:29 PM
to linux-...@vger.kernel.org
Hi, Ingo Molnar wrote:

> I believe that by compartmenting in the wrong way [*] we kill the
> natural integration effects. We'd end up with 5 (or 20) bad generic
> schedulers that happen to work in one precise workload only, but there
> would not be enough push to build one good generic scheduler, because
> the people who are now forced to care about the Linux scheduler would be
> content about their specialized schedulers.

I don't think so. There are multiple attempts to build a better
generic scheduler (Con's for one), so there's your counterexample right
here. However, testing a different scheduler currently requires a kernel
recompile and a reboot.

I hate that. Ideally, the scheduler would be hotpluggable... but I can
live with a reboot. I don't think a kernel recompile to switch schedulers
makes sense, though, so I for one am likely not to bother. So far.

You can't actually develop a better scheduler if people need to go too
far out of their way to compare them.

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | sm...@smurf.noris.de

Peter Chubb

Nov 2, 2004, 5:35:39 PM
to Matthias Urlichs, linux-...@vger.kernel.org
>>>>> "Matthias" == Matthias Urlichs <sm...@smurf.noris.de> writes:

Matthias> Hi, Ingo Molnar wrote:
>> I believe that by compartmenting in the wrong way [*] we kill the
>> natural integration effects. We'd end up with 5 (or 20) bad generic
>> schedulers that happen to work in one precise workload only, but
>> there would not be enough push to build one good generic scheduler,
>> because the people who are now forced to care about the Linux
>> scheduler would be content about their specialized schedulers.

Matthias> I hate that. Ideally, the scheduler would be
Matthias> hotpluggable... but I can live with a reboot. I don't think
Matthias> a kernel recompile to switch schedulers makes sense, though,
Matthias> so I for one am likely not to bother. So far.

Matthias> You can't actually develop a better scheduler if people need
Matthias> to go too far out of their way to compare them.

I'd like to go further and be able to have families of schedulers that
work together --- if you're going to vector to a scheduler anyway,
why not do it per process? That way the special cases for SCHED_FIFO and
SCHED_RR can be moved into separate functions (likewise SCHED_ISO,
SCHED_BATCH, SCHED_GANG etc., as and when they're developed), rather
than being controlled by if() or switch() statements in a common
do-everything scheduler.

In general, it's the interactive SCHED_OTHER scheduler that's been the
problem, and the focus of most of the work. We more-or-less know how
to do the basic POSIX schedulers.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
