
[announce] [patch] ultra-scalable O(1) SMP and UP scheduler


Ingo Molnar

Jan 3, 2002, 9:19:10 PM

now that new-year's parties are over things are getting boring again. For
those who want to see and perhaps even try something more complex, i'm
announcing this patch that is a pretty radical rewrite of the Linux
scheduler for 2.5.2-pre6:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-A0.patch

for 2.4.17:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-A0.patch

Goal
====

The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:

- good interactive performance even during high load: if the user
types or clicks then the system must react instantly and must execute
the user tasks smoothly, even during considerable background load.

- good scheduling/wakeup performance with 1-2 runnable processes.

- fairness: no process should stay without any timeslice for any
unreasonable amount of time. No process should get an unjustly high
amount of CPU time.

- priorities: less important tasks can be started with lower priority,
more important tasks with higher priority.

- SMP efficiency: no CPU should stay idle if there is work to do.

- SMP affinity: processes which run on one CPU should stay affine to
that CPU. Processes should not bounce between CPUs too frequently.

- plus additional scheduler features: RT scheduling, CPU binding.

and the goal is also to add a few new things:

- fully O(1) scheduling. Are you tired of the recalculation loop
blowing the L1 cache away every now and then? Do you think the goodness
loop is taking a bit too long to finish if there are lots of runnable
processes? This new scheduler takes no prisoners: wakeup(), schedule(),
the timer interrupt are all O(1) algorithms. There is no recalculation
loop. There is no goodness loop either.

- 'perfect' SMP scalability. With the new scheduler there is no 'big'
runqueue_lock anymore - it's all per-CPU runqueues and locks - two
tasks on two separate CPUs can wake up, schedule and context-switch
completely in parallel, without any interlocking. All
scheduling-relevant data is structured for maximum scalability. (see
the benchmark section later on.)

- better SMP affinity. The old scheduler has a particular weakness that
causes random bouncing of tasks between CPUs when higher
priority/interactive tasks are present - this was observed and reported by many
people. The reason is that the timeslice recalculation loop first needs
every currently running task to consume its timeslice. But when this
happens on eg. an 8-way system, then this property starves an
increasing number of CPUs from executing any process. Once the last
task that has a timeslice left has finished using up that timeslice,
the recalculation loop is triggered and other CPUs can start executing
tasks again - after having idled around for a number of timer ticks.
The more CPUs, the worse this effect.

Furthermore, this same effect causes the bouncing effect as well:
whenever there is such a 'timeslice squeeze' of the global runqueue,
idle processors start executing tasks which are not affine to that CPU.
(because the affine tasks have finished off their timeslices already.)

The new scheduler solves this problem by distributing timeslices on a
per-CPU basis, without having any global synchronization or
recalculation.

- batch scheduling. A significant proportion of computing-intensive tasks
benefit from batch-scheduling, where timeslices are long and processes
are roundrobin scheduled. The new scheduler does such batch-scheduling
of the lowest priority tasks - so nice +19 jobs will get
'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
in essence SCHED_IDLE, from an interactiveness point of view.

- handle extreme loads more smoothly, without breakdown and scheduling
storms.

- O(1) RT scheduling. For those RT folks who are paranoid about the
O(nr_running) property of the goodness loop and the recalculation loop.

- run fork()ed children before the parent. Andrea pointed out the
advantages of this a few months ago, but patches for this feature
do not work with the old scheduler as well as they should,
because idle processes often steal the new child before the fork()ing
CPU gets to execute it.


Design
======

(those who find the following design issues boring can skip to the next,
'Benchmarks' section.)

the core of the new scheduler consists of the following mechanisms:

- *two*, priority-ordered 'priority arrays' per CPU. There is an 'active'
array and an 'expired' array. The active array contains all tasks that
are affine to this CPU and have timeslices left. The expired array
contains all tasks which have used up their timeslices - but this array
is kept sorted as well. The active and expired arrays are not accessed
directly; they are accessed through two pointers in the per-CPU runqueue
structure. If all active tasks are used up then we 'switch' the two
pointers and from now on the ready-to-go (former-) expired array is the
active array - and the empty active array serves as the new collector
for expired tasks.

- there is a 64-bit bitmap cache for array indices. Finding the highest
priority task is thus a matter of two x86 BSFL bit-search instructions.

the split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
accessed through the pointers in the runqueue, switching the two arrays can
be done very quickly.

this is a hybrid priority-list approach coupled with roundrobin
scheduling and the array-switch method of distributing timeslices.
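
for illustration, here is a compact userspace model of the active/expired
arrays and the bitmap lookup described above. The names, the 64-level
priority range and the __builtin_ctzll() stand-in for the BSFL bit-search
are illustrative only and are not the actual patch code:

#include <stdio.h>
#include <string.h>

#define MAX_PRIO 64                        /* illustrative, not the patch's value */

struct prio_array {
	int nr_active;                     /* number of runnable tasks in this array */
	unsigned long long bitmap;         /* bit i set => queue[i] is non-empty */
	int queue_len[MAX_PRIO];           /* stand-in for the per-priority task lists */
};

struct runqueue {                          /* one of these per CPU, with its own lock */
	struct prio_array *active, *expired;
	struct prio_array arrays[2];
};

static void enqueue(struct prio_array *a, int prio)
{
	a->queue_len[prio]++;
	a->bitmap |= 1ULL << prio;
	a->nr_active++;
}

/* O(1): one bit-search replaces the goodness loop; when the active array
 * runs dry the two array pointers are switched instead of recalculating. */
static int pick_next_prio(struct runqueue *rq)
{
	if (!rq->active->nr_active) {
		struct prio_array *tmp = rq->active;
		rq->active = rq->expired;
		rq->expired = tmp;
	}
	if (!rq->active->bitmap)
		return -1;                 /* nothing runnable on this CPU */
	return __builtin_ctzll(rq->active->bitmap);
}

int main(void)
{
	struct runqueue rq;

	memset(&rq, 0, sizeof(rq));
	rq.active = &rq.arrays[0];
	rq.expired = &rq.arrays[1];
	enqueue(rq.active, 10);
	enqueue(rq.active, 3);
	printf("next priority to run: %d\n", pick_next_prio(&rq));  /* prints 3 */
	return 0;
}

in the real patch the queues hold task structures and the bitmap is searched
with the BSFL-based sched_find_first_zero_bit() discussed further down the
thread; the model above only shows the shape of the data structure.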

- there is a per-task 'load estimator'.

one of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants i found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.

to establish the actual 'load' the task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries tell the
scheduler a pretty accurate load-history of the task: has it used up more
CPU time or less during the past N seconds. [the size '4' and the interval
of 4x 1 seconds was found by lots of experimentation - this part is
flexible and can be changed in both directions.]

the penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to the task's static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.
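
for illustration, a small self-contained sketch of such a 4 x 1-second
history ringbuffer and a bounded penalty follows. The slot granularity,
the cap of 5 and all names are invented for this example and differ from
the actual accounting in the patch:

#include <stdio.h>

#define HISTORY_SLOTS	4	/* ~4 seconds of history, one slot per second */
#define MAX_PENALTY	5	/* illustrative cap on the dynamic-priority penalty */

struct load_estimator {
	unsigned int ticks_used[HISTORY_SLOTS];	/* CPU ticks consumed per second */
	unsigned int slot;			/* current ringbuffer position */
};

/* conceptually called once per second from the timer-tick path */
static void advance_second(struct load_estimator *le)
{
	le->slot = (le->slot + 1) % HISTORY_SLOTS;
	le->ticks_used[le->slot] = 0;
}

/* conceptually called for every timer tick the task spends on the CPU */
static void account_tick(struct load_estimator *le)
{
	le->ticks_used[le->slot]++;
}

/* the penalty grows with the fraction of CPU consumed over the last 4
 * seconds, but is clamped, so CPU hogs still respect static priorities */
static int dynamic_penalty(const struct load_estimator *le, unsigned int hz)
{
	unsigned int used = 0, i;

	for (i = 0; i < HISTORY_SLOTS; i++)
		used += le->ticks_used[i];
	return (int)(used * MAX_PENALTY / (HISTORY_SLOTS * hz));
}

int main(void)
{
	struct load_estimator le = { {0}, 0 };
	unsigned int hz = 100, s, i;

	for (s = 0; s < HISTORY_SLOTS; s++) {	/* a fully CPU-bound 4 seconds */
		if (s)
			advance_second(&le);
		for (i = 0; i < hz; i++)
			account_tick(&le);
	}
	printf("penalty: %d\n", dynamic_penalty(&le, hz));	/* prints 5 */
	return 0;
}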

I've separated the RT scheduler into a different codebase, while still
keeping some of the scheduling codebase common. This does not look pretty
in certain places such as __sched_tail() or activate_task(), but i don't
think it can be avoided. RT scheduling is different, it uses a global
runqueue (and global spinlock) and it needs global decisions. To make RT
scheduling more instant, i've added a broadcast-reschedule message as
well, to make it absolutely sure that RT tasks of the right priority are
scheduled appropriately, even on SMP systems. The RT-scheduling part is
O(1) as well.

the SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
can be supported easily by changing the load-balancer. Right now it's
tuned for my SMP systems.

i skipped the prev->mm == next->mm advantage - no workload i know of shows
any sensitivity to this. It can be added back by sacrificing O(1)
schedule() [the current and one-lower priority list can be searched for a
that->mm == current->mm condition], but it costs a fair number of cycles
during a number of important workloads, so i wanted to avoid this as much
as possible.

- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So i streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.

- the patch also cleans up a number of aspects of sched.c - moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one.

(i'm sure there are other details i forgot to explain. I've commented some
of the more important code paths and data constructs. If you think some
aspect of this design is faulty or misses some important issue then please
let me know.)

(the current code is by no means perfect, my main goal right now, besides
fixing bugs, is to make the code cleaner. Any suggestions for
simplifications are welcome.)

Benchmarks
==========

i've performed two major groups of benchmarks: first i've verified the
interactive performance (interactive 'feel') of the new scheduler on UP
and SMP systems as well. While this is a pretty subjective thing, i found
that the new scheduler is at least as good as the old one in all areas,
and in a number of high load workloads it feels visibly smoother. I've
tried a number of workloads, such as make -j background compilation or
1000 background processes. Interactive performance can also be verified
via tracing both schedulers, and i've done that and found no areas of
missed wakeups or imperfect SMP scheduling latencies in either of the two
schedulers.

the other group of benchmarks was the actual performance of the scheduler.
I picked the following ones (some were intentionally picked to load the
scheduler, others were picked to make the benchmark spectrum more
complete):

- compilation benchmarks

- the chat-server workload simulator written by Bill Hartner

- the usual components from the lmbench suite

- a heavily sched_yield()-ing testcode to measure yield() performance.

[ i can test any other workload too that anyone would find interesting. ]

i ran these benchmarks on a 1-CPU box using a UP kernel, and on a 2-CPU and
an 8-CPU box using the SMP kernel.

The chat-server simulator creates a number of processes that are connected
to each other via TCP sockets, the processes send messages to each other
randomly, in a way that simulates actual chat server designs and
workloads.

3 successive runs of './chat_c 127.0.0.1 10 1000' produce the following
message throughput:

vanilla-2.5.2-pre6:

Average throughput : 110619 messages per second
Average throughput : 107813 messages per second
Average throughput : 120558 messages per second

O(1)-schedule-2.5.2-pre6:

Average throughput : 131250 messages per second
Average throughput : 116333 messages per second
Average throughput : 179686 messages per second

this is a roughly 20% improvement.

To get all benefits of the new scheduler, i ran it reniced, which in
essence triggers round-robin batch scheduling for the chat server tasks:

3 successive runs of 'nice -n 19 ./chat_c 127.0.0.1 10 1000' produce the
following throughput:

vanilla-2.5.2-pre6:

Average throughput : 77719 messages per second
Average throughput : 83460 messages per second
Average throughput : 90029 messages per second

O(1)-schedule-2.5.2-pre6:

Average throughput : 609942 messages per second
Average throughput : 610221 messages per second
Average throughput : 609570 messages per second

throughput improved by more than 600%. The UP and 2-way SMP tests show a
similar edge for the new scheduler. Furthermore, during these chatserver
tests, the old scheduler doesn't handle interactive tasks very well, and
the system is very jerky. (which is a side-effect of the overscheduling
situation the scheduler gets into.)

the 1-CPU UP numbers are interesting as well:

vanilla-2.5.2-pre6:

./chat_c 127.0.0.1 10 100
Average throughput : 102885 messages per second
Average throughput : 95319 messages per second
Average throughput : 99076 messages per second

nice -n 19 ./chat_c 127.0.0.1 10 1000
Average throughput : 161205 messages per second
Average throughput : 151785 messages per second
Average throughput : 152951 messages per second

O(1)-schedule-2.5.2-pre6:

./chat_c 127.0.0.1 10 100 # NEW
Average throughput : 128865 messages per second
Average throughput : 115240 messages per second
Average throughput : 99034 messages per second

nice -n 19 ./chat_c 127.0.0.1 10 1000 # NEW
Average throughput : 163112 messages per second
Average throughput : 163012 messages per second
Average throughput : 163652 messages per second

this shows that while on UP we don't have the scalability improvements, the
O(1) scheduler is still slightly ahead.


another benchmark measures sched_yield() performance. (which the pthreads
code relies on pretty heavily.)
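
(the loop_yield source is not included in this mail; the following is only
a rough, illustrative stand-in to show the kind of load being generated:)

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* illustrative stand-in for ./loop_yield - not the actual test program */
int main(int argc, char **argv)
{
	long i, iterations = (argc > 1) ? atol(argv[1]) : 10000000L;

	/* every iteration asks the scheduler to run someone else; several
	 * instances per CPU turn this into a pure context-switch benchmark,
	 * visible in the 'cs' column of vmstat */
	for (i = 0; i < iterations; i++)
		sched_yield();

	printf("done after %ld yields\n", i);
	return 0;
}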

on a 2-way system, starting 4 instances of ./loop_yield gives the
following context-switch throughput:

vanilla-2.5.2-pre6

# vmstat 5 | cut -c57-
system cpu
in cs us sy id
102 241247 6 94 0
101 240977 5 95 0
101 241051 6 94 0
101 241176 7 93 0

O(1)-schedule-2.5.2-pre6

# vmstat 5 | cut -c57-
system cpu
in cs us sy id
101 977530 31 69 0
101 977478 28 72 0
101 977538 27 73 0

the O(1) scheduler is 300% faster, and we do nearly 1 million context
switches per second!

this test is even more interesting on the 8-way system, running 16
instances of loop_yield:

vanilla-2.5.2-pre6:

vmstat 5 | cut -c57-
system cpu
in cs us sy id
106 108498 2 98 0
101 108333 1 99 0
102 108437 1 99 0

100K/sec context switches - the overhead of the global runqueue makes the
scheduler slower than the 2-way box!

O(1)-schedule-2.5.2-pre6:

vmstat 5 | cut -c57-
system cpu
in cs us sy id
102 6120358 34 66 0
101 6117063 33 67 0
101 6117124 34 66 0

this is more than 6 million context switches per second! (i think this is
a first, no Linux box in existence did so many context switches per second
before.) This is one workload where the per-CPU runqueues and scalability
advantages show up big time.

here are the lat_proc and lat_ctx comparisons (the results quoted here are
the best numbers from a series of tests):

vanilla-2.5.2-pre6:

./lat_proc fork
Process fork+exit: 440.0000 microseconds
./lat_proc exec
Process fork+execve: 491.6364 microseconds
./lat_proc shell
Process fork+/bin/sh -c: 3545.0000 microseconds

O(1)-schedule-2.5.2-pre6:

./lat_proc fork
Process fork+exit: 168.6667 microseconds
./lat_proc exec
Process fork+execve: 279.6500 microseconds
./lat_proc shell
Process fork+/bin/sh -c: 2874.0000 microseconds

the difference is pretty dramatic - it's mostly due to avoiding much of
the COW overhead that comes from fork()+execve(). The fork()+exit()
improvement is mostly due to better CPU affinity - parent and child are
running on the same CPU, while the old scheduler pushes the child to
another, idle CPU, which creates heavy interlocking traffic between the MM
structures.

the compilation benchmarks i ran gave very similar results for both
schedulers. The O(1) scheduler has a small 2% advantage in make -j
benchmarks (not accounting for statistical noise - it's hard to produce
reliable compilation benchmarks) - probably due to better SMP affinity
again.

Status
======

i've tested the new scheduler under the aforementioned range of systems
and workloads, but it's still experimental code nevertheless. I've
developed it on SMP systems using the 2.5.2-pre kernels, so it has the
most testing there, but i did a fair number of UP and 2.4.17 tests as
well. NOTE! For the 2.5.2-pre6 kernel to be usable you should apply
Andries' latest 2.5.2pre6-kdev_t patch available at:

http://www.kernel.org/pub/linux/kernel/people/aeb/

i also tested the RT scheduler for various situations such as
sched_yield()-ing of RT tasks, strace-ing RT tasks and other details, and
it's all working as expected. There might be some rough edges though.

Comments, bug reports, suggestions are welcome,

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Oliver Xymoron

Jan 3, 2002, 11:27:31 PM
On Fri, 4 Jan 2002, Ingo Molnar wrote:

> this is more than 6 million context switches per second!

<yawn>

Everyone knows scheduling is boring.

:)

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

Ian Morgan

Jan 4, 2002, 12:34:30 AM
On Fri, 4 Jan 2002, Ingo Molnar wrote:

> for 2.4.17:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-A0.patch

> i did a fair number of UP and 2.4.17 tests as well.

> Comments, bug reports, suggestions are welcome,

Well, tried it out here on 2.4.17 (w/ ide.2.4.16 & freeswan_1.91) (and after
patching md.c and loop.c), and boy did it blow chunks! ;-)

First, loading the software watchdog module caused the thing to lock up
solid during boot. No SysRq response, and nmi_watchdog did nada too.

Next, after disabling the s/w watchdog, I could boot up.. great, I thought it
was all happy. I tried starting up 1 instance of setiathome, and saw how
nicely it (mostly) remained affine to 1 processor, occasionally switching to
the 2nd, and back again, maybe every 5-10 seconds.

Next, though, I couldn't figure out why Mozilla wouldn't load. Then I
noticed I could not open any more xterms. When I went to one open xterm to
do some digging, I noticed very long pauses, where no keyboard input would
be accepted, yet the mouse and the window manager remained responsive. Other
tasks, like my scrolling network usage graph applet would stall as well.
Eventually (after another minute or so), the box locked up solid.

Well, I really liked the sound of this patch. Seemed very well thought out.
Too bad it doesn't like me! :-(

You seem to indicate you've done (some) testing on 2.4.17, so I'm somewhat
surprised that it died on me so quickly here. What more info can I give you
that might help track the problem?

Box: 2x Celeron / ABIT BP6 / 384MB
Gnu C 2.95.3
Gnu make 3.78.1
binutils 2.10.1.0.2
util-linux 2.10s
mount 2.10s
modutils 2.4.0
e2fsprogs 1.19
reiserfsprogs 3.x.0j
pcmcia-cs 3.1.31
PPP 2.4.0
Linux C Library 2.1.3
Dynamic linker (ldd) 2.1.3
Linux C++ Library 2.10.0
Procps 2.0.6
Net-tools 1.54
Console-tools 0.3.3
Sh-utils 2.0
Modules Loaded sb sb_lib uart401 sound soundcore printer hid input
usb-uhci usbcore tulip ipt_mac ipt_state iptable_mangle ipt_LOG ipt_REJECT
ip_nat_ftp ip_conntrack_ftp iptable_nat ip_conntrack iptable_filter
ip_tables

Regards,
Ian Morgan
--
-------------------------------------------------------------------
Ian E. Morgan Vice President & C.O.O. Webcon, Inc.
imo...@webcon.net PGP: #2DA40D07 www.webcon.net
-------------------------------------------------------------------

Dieter Nützel

Jan 4, 2002, 2:42:37 AM
After the loop.c fix (nfs.o and sunrpc.o waiting) I got it running here, too.

2.4.17
sched-O1-2.4.17-A0.patch
00_nanosleep-5 (Andrea)
bootmem-2.4.17-pre6 (at all IBM)
elevator-fix (Andrew, worth it for 2.4.18)

plus ReiserFS fixes
linux-2.4.17rc2-KLMN+exp_trunc+3fixes.patch
corrupt_items_checks.diff
kmalloc_cleanup.diff
O-inode-attrs.patch

Too much trouble during the 10_vm-21 (VM taken from 2.4.17rc2aa2) merge. So I
skipped it this time. Much wanted for 2.4.18 (best I ever had).

All the above (with preempt and 10_vm-21 AA, except sched-O1-2.4.17-A0.patch)
didn't crash before.

One system crash during the first X start up (kdeinit).

ksymoops 2.4.3 on i686 2.4.17-O1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.17-O1/ (default)
-m /boot/System.map (specified)

invalid operand: 0000
CPU: 0
EIP: 0010:[<c01194ae>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: c1a7ca40 ecx: c028e658 edx: dea3002c
esi: dea30000 edi: 00000000 ebp: bffff19c esp: dea31fac
ds: 0018 es: 0018 ss: 0018
Process kdeinit (pid: 961, stackpage=dea31000)
Stack: dea30000 40e1ed40 00000000 c01194ee 00000000 c0106d0b 00000000
00000001
40e208c4 40e1ed40 00000000 bffff19c 00000001 0000002b 0000002b
00000001
40da84dd 00000023 00000287 bffff170 0000002b
Call Trace: [<c01194ee>] [<c0106d0b>]
Code: 0f 0b e9 74 fe ff ff 8d 74 26 00 8d bc 27 00 00 00 00 8b 44

>>EIP; c01194ae <do_exit+1ee/200> <=====
Trace; c01194ee <sys_exit+e/10>
Trace; c0106d0a <system_call+32/38>
Code; c01194ae <do_exit+1ee/200>
00000000 <_EIP>:
Code; c01194ae <do_exit+1ee/200> <=====
0: 0f 0b ud2a <=====
Code; c01194b0 <do_exit+1f0/200>
2: e9 74 fe ff ff jmp fffffe7b <_EIP+0xfffffe7b> c0119328
<do_exit+68/200>
Code; c01194b4 <do_exit+1f4/200>
7: 8d 74 26 00 lea 0x0(%esi,1),%esi
Code; c01194b8 <do_exit+1f8/200>
b: 8d bc 27 00 00 00 00 lea 0x0(%edi,1),%edi
Code; c01194c0 <complete_and_exit+0/20>
12: 8b 44 00 00 mov 0x0(%eax,%eax,1),%eax

System runs relatively smoothly but is under full system load.

7:57am up 1:36, 1 user, load average: 0.18, 0.18, 0.26
91 processes: 90 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.7% user, 99.2% system, 0.0% nice, 0.0% idle
Mem: 643064K av, 464212K used, 178852K free, 0K shrd, 89964K buff
Swap: 1028120K av, 3560K used, 1024560K free 179928K
cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
1263 root 52 0 138M 42M 1728 S 33.0 6.7 16:46 X
1669 nuetzel 62 0 7544 7544 4412 S 16.6 1.1 11:57 artsd
10891 nuetzel 46 0 113M 17M 12600 S 12.0 2.7 2:35 kmail
1756 nuetzel 46 0 105M 9.9M 7288 S 10.8 1.5 7:45 kmix
1455 nuetzel 45 0 109M 12M 10508 S 7.9 2.0 1:55 kdeinit
1467 nuetzel 45 0 107M 10M 8456 S 5.5 1.7 0:55 kdeinit
1414 nuetzel 45 0 105M 8916 7536 S 3.9 1.3 1:59 kdeinit
814 squid 45 0 6856 6848 1280 S 2.3 1.0 0:52 squid
6 root 45 0 0 0 0 SW 2.1 0.0 0:42 kupdated
1458 nuetzel 45 0 109M 13M 9856 S 1.3 2.0 0:44 kdeinit
11099 nuetzel 47 0 1000 1000 776 R 1.3 0.1 0:00 top
556 root 45 0 1136 1136 848 S 1.1 0.1 0:14 smpppd
1678 nuetzel 45 0 7544 7544 4412 S 0.7 1.1 0:12 artsd
494 root 45 0 3072 3072 1776 S 0.3 0.4 0:18 named
1451 root 45 0 6860 6852 1416 S 0.1 1.0 0:14 xperfmon++
10871 nuetzel 45 0 111M 14M 11680 S 0.1 2.3 0:14 kdeinit
1 root 46 0 212 212 176 S 0.0 0.0 0:06 init

/home/nuetzel> procinfo
Linux 2.4.17-O1 (root@SunWave1) (gcc 2.95.3 20010315 ) #1 1CPU [SunWave1.]

Memory: Total Used Free Shared Buffers Cached
Mem: 643064 452988 190076 0 100096 183148
Swap: 1028120 3560 1024560

Bootup: Fri Jan 4 06:21:02 2002 Load average: 0.14 0.32 0.31
1894046082/90 11460

user : 0:16:59.19 14.9% page in : 404106 disk 1: 31792r
70750w
nice : 0:00:00.00 0.0% page out: 2580336 disk 2: 137r
225w
system: 1:19:41.05 70.0% swap in : 2 disk 3: 1r
0w
idle : 0:17:11.67 15.1% swap out: 695 disk 4: 1009r
0w
uptime: 1:53:51.90 context : 2427806

irq 0: 683191 timer irq 8: 154583 rtc
irq 1: 12567 keyboard irq 9: 0 acpi, usb-ohci
irq 2: 0 cascade [4] irq 10: 103402 aic7xxx
irq 5: 9333 eth1 irq 11: 310704 eth0, EMU10K1
irq 7: 115 parport0 [3] irq 12: 136392 PS/2 Mouse

More processes die during second and third boot...
I have some more crash logs if needed.

Preempt plus lock-break is better for now.
latencytest0.42-png crash the system.

What next?
Maybe a combination of O(1) and preempt?

--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter....@hamburg.de

David Lang

Jan 4, 2002, 3:02:08 AM
Ingo,
back in the 2.4.4-2.4.5 days when we experimented with the
child-runs-first scheduling patch we ran into quite a few programs that
died or locked up due to this. (I had a couple myself and heard of others)

try switching this back to the current behaviour and see if the lockups
still happen.

David Lang

On Fri, 4 Jan 2002, Dieter Nützel wrote:

> Date: Fri, 4 Jan 2002 08:42:37 +0100
> From: "Dieter [iso-8859-15] Nützel" <Dieter....@hamburg.de>
> To: Ingo Molnar <mi...@elte.hu>
> Cc: Linux Kernel List <linux-...@vger.kernel.org>
> Subject: Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler

Ingo Molnar

Jan 4, 2002, 6:42:09 AM

On Fri, 4 Jan 2002, Dieter Nützel wrote:

> What next? Maybe a combination of O(1) and preempt?

yes, fast preemption of kernel-mode tasks and the scheduler code are
almost orthogonal. So i agree that to get the best interactive performance
we need both.

Ingo

ps. i'm working on fixing the crashes you saw.

Ingo Molnar

Jan 4, 2002, 6:44:57 AM

On Fri, 4 Jan 2002, David Lang wrote:

> Ingo,
> back in the 2.4.4-2.4.5 days when we experimented with the
> child-runs-first scheduling patch we ran into quite a few programs that
> died or locked up due to this. (I had a couple myself and heard of others)

hm, Andrea said that the only serious issue was in the sysvinit code,
which should be fixed in any recent distro. Andrea?

> try switching this back to the current behaviour and see if the
> lockups still happen.

there must be some other bug as well: the child-runs-first scheduling can
cause lockups, but it shouldn't cause oopses.

Ingo

Anton Blanchard

Jan 4, 2002, 5:30:18 AM

Great stuff Ingo!

> now that new-year's parties are over things are getting boring again. For
> those who want to see and perhaps even try something more complex, i'm
> announcing this patch that is a pretty radical rewrite of the Linux
> scheduler for 2.5.2-pre6:

Good timing :) We were just looking at an application that hit sched_yield
heavily on a large SMP machine (don't have the source so fixing the
symptoms not the cause). Performance was pretty bad with the standard
scheduler.

We managed to get a 4 way ppc64 machine to boot, but unfortunately the
18 way hung somewhere after smp initialisation and before execing
init. More testing to come :)

Is the idle loop optimisation (atomic exchange -1 into need_resched
to avoid IPI) gone forever?

Is my understanding of this right?:

#define BITMAP_SIZE (MAX_RT_PRIO/8+1)

...

char bitmap[BITMAP_SIZE+1];
list_t queue[MAX_RT_PRIO];

You have a bitmap of size MAX_RT_PRIO+1 (actually I think you have one too
many +1 here :) and you zero the last bit to terminate it. You use
find_first_zero_bit to search the entire priority list and
sched_find_first_zero_bit to search only the first MAX_PRIO (64) bits.

> the SMP load-balancer can be extended/switched with additional parallel
> computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
> can be supported easily by changing the load-balancer. Right now it's
> tuned for my SMP systems.

It will be interesting to test this on our HMT hardware.

> - the SMP idle-task startup code was still racy and the new scheduler
> triggered this. So i streamlined the idle-setup code a bit. We do not call
> into schedule() before all processors have started up fully and all idle
> threads are in place.

I like this cleanup, it pushes more stuff out of the arch-specific code
into init_idle().

> another benchmark measures sched_yield() performance. (which the pthreads
> code relies on pretty heavily.)

Can you send me this benchmark and I'll get some more results :)

I don't think pthreads uses sched_yield on spinlocks any more, but there
seems to be enough userspace code that does.

Anton

Ingo Molnar

Jan 4, 2002, 7:53:14 AM

On Fri, 4 Jan 2002, Anton Blanchard wrote:

> Good timing :) We were just looking at an application that hit
> sched_yield heavily on a large SMP machine (don't have the source so
> fixing the symptoms not the cause). Performance was pretty bad with
> the standard scheduler.

hey, great. I mean, what a pity :-| But in any case, sched_yield() is just
a tiny portion of the scheduler spectrum, and it's certainly not the most
important one.

> We managed to get a 4 way ppc64 machine to boot, but unfortunately the
> 18 way hung somewhere after smp initialisation and before execing
> init. More testing to come :)

could this be the child-runs-first problem perhaps? You can disable it by
exchanging wake_up_forked_process() for wake_up_process() in fork.c, and
removing the current->need_resched = 1 line.
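
roughly, the change amounts to the following - assuming wake_up_forked_process()
takes the new task pointer just like wake_up_process() does; the exact
surrounding code in do_fork() may differ, this only illustrates the
substitution:

	wake_up_process(p);		/* was: wake_up_forked_process(p); */
	/* current->need_resched = 1;	   removed to disable child-runs-first */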

> Is the idle loop optimisation (atomic exchange -1 into need_resched to
> avoid IPI) gone forever?

it's just broken temporarily, i will fix it. The two places that need to
set need_resched are the timer interrupt (which triggers the periodic
load_balance()) and the wakeup code - i'll fix this in the next patch.

> Is my understanding of this right?:
>
> #define BITMAP_SIZE (MAX_RT_PRIO/8+1)
>
> ...
>
> char bitmap[BITMAP_SIZE+1];
> list_t queue[MAX_RT_PRIO];
>
> You have a bitmap of size MAX_RT_PRIO+1 (actually I think you have
> one too many +1 here :) [...]

[ yes :-) paranoia. will fix it. ]

> [...] and you zero the last bit to terminate it. You
> use find_first_zero_bit to search the entire priority list and
> sched_find_first_zero_bit to search only the first MAX_PRIO (64) bits.

correct.

the reason for this logic inversion is temporary as well: we'll have to
add find_next_set_bit before doing the more intuitive thing of setting the
bit when the runlist is active. Right now sched_find_first_zero_bit() has
to invert the value (on x86) again to get it right for BSFL.
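
for illustration, the inversion amounts to something like this (assuming a
64-entry bitmap stored in two 32-bit words on x86 - a sketch only, not the
actual macro):

/* sketch: find the first zero bit - a zero bit here means 'this priority
 * level has runnable tasks' - by complementing each word before the
 * BSFL-style bit search */
static inline int find_first_zero_bit_sketch(const unsigned int *bitmap)
{
	if (~bitmap[0])
		return __builtin_ctz(~bitmap[0]);
	return 32 + __builtin_ctz(~bitmap[1]);
}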

> > - the SMP idle-task startup code was still racy and the new scheduler
> > triggered this. So i streamlined the idle-setup code a bit. We do not call
> > into schedule() before all processors have started up fully and all idle
> > threads are in place.
>
> I like this cleanup, it pushes more stuff out the arch specific code
> into init_idle().

the new rule is this: no schedule() call is allowed before all bits in
wait_init_idle are clear. I'd suggest you add this to the top of
schedule():

	if (wait_init_idle)
		BUG();

to debug the SMP startup code.

the other new property is that the init thread wakes itself up and then
later on becomes an idle thread just like the other idle threads. This
makes the idle startup code more symmetric, and the scheduler more
debuggable.

Ingo

Ingo Molnar

Jan 4, 2002, 8:20:49 AM

On Fri, 4 Jan 2002, rwhron wrote:

> The second time I ran it, when I was tailing the output the machine
> seemed to freeze. (Usually the syscall tests complete very quickly).
> I got the oops below on a serial console, it scrolled much longer and
> didn't seem like the call trace would ever complete, so i rebooted.

the oops shows an infinite page fault, the actual point of fault is not
visible. To make it visible, could you please apply the following patch to
your tree and check the serial console? It should now just print a single
line (showing the faulting EIP) and freeze forever:

pagefault at: 12341234. locking up now.

Please look up the EIP via gdb:

gdb ./vmlinux
list *0x12341234

(for this you'll have to add '-g' to CFLAGS in the top level Makefile.)

> I ran it a third time trying to isolate which test triggered the oops,
> but the behavior was different again. The machine got very very slow,
> but tests would eventually print their output. The test that
> triggered the behavior was apparently between pipe11 and the
> setrlimit01 command below.

i'll try these tests. Does the test use RT scheduling as well?

> It looks like you already have an idea where the problem is.
> Looking forward to the next patch. :)

i haven't found the bug yet :-|

Ingo

--- linux/arch/i386/mm/fault.c.orig Fri Jan 4 12:08:41 2002
+++ linux/arch/i386/mm/fault.c Fri Jan 4 12:11:09 2002
@@ -314,6 +314,8 @@
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
*/
+ printk("pagefault at: %08lx. locking up now.\n", regs->eip);
+ for (;;) __cli();

bust_spinlocks(1);

rwh...@earthlink.net

Jan 4, 2002, 6:16:50 AM
Ingo,

I tried the new scheduler with the kdev_t patch you mentioned and it
went very well for a long time. dbench 64, 128, 192 all completed,
3 iterations of bonnie++ were fine too.

When I ran the syscalls test from the http://ltp.sf.net/ I had a
few problems.

The first time the machine rebooted. The last thing in the logfile
was that pipe11 test was running. I believe it got past that point.
(the syscall tests run in alphabetic order).

The second time I ran it, when I was tailing the output the
machine seemed to freeze. (Usually the syscall tests complete
very quickly). I got the oops below on a serial console, it
scrolled much longer and didn't seem like the call trace
would ever complete, so i rebooted.

I ran it a third time trying to isolate which test triggered the oops,
but the behavior was different again. The machine got very very
slow, but tests would eventually print their output. The test that
triggered the behavior was apparently between pipe11 and the setrlimit01
command below.

Here is top in the locked state:

33 processes: 23 sleeping, 4 running, 6 zombie, 0 stopped
CPU states: 0.3% user, 0.0% system, 0.0% nice, 99.6% idle
Mem: 385344K av, 83044K used, 302300K free, 0K shrd, 51976K buff
Swap: 152608K av, 0K used, 152608K free 16564K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND

50 root 55 0 1620 1620 1404 R 0.1 0.4 0:05 sshd
8657 root 47 0 948 948 756 R 0.1 0.2 0:00 top
1 root 46 0 524 524 452 S 0.0 0.1 0:06 init
2 root 63 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 63 19 0 0 0 RWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 63 0 0 0 0 SW 0.0 0.0 0:00 kswapd
5 root 63 0 0 0 0 SW 0.0 0.0 0:00 bdflush
6 root 47 0 0 0 0 SW 0.0 0.0 0:00 kupdated
7 root 45 0 0 0 0 SW 0.0 0.0 0:00 kreiserfsd
28 root 45 0 620 620 516 S 0.0 0.1 0:00 syslogd
31 root 46 0 480 480 404 S 0.0 0.1 0:00 klogd
35 root 47 0 0 0 0 SW 0.0 0.0 0:00 eth0
40 root 49 0 816 816 664 S 0.0 0.2 0:00 iplog
41 root 46 0 816 816 664 S 0.0 0.2 0:00 iplog
42 root 45 0 816 816 664 S 0.0 0.2 0:00 iplog
43 root 45 0 816 816 664 S 0.0 0.2 0:00 iplog
44 root 58 0 816 816 664 S 0.0 0.2 0:01 iplog
46 root 53 0 1276 1276 1156 S 0.0 0.3 0:00 sshd
48 root 46 0 472 472 396 S 0.0 0.1 0:00 agetty
51 hrandoz 49 0 1164 1164 892 S 0.0 0.3 0:00 bash
59 root 46 0 1184 1184 1028 S 0.0 0.3 0:00 bash
702 root 45 0 1224 1224 1028 S 0.0 0.3 0:01 bash
8564 root 63 0 1596 1596 1404 S 0.0 0.4 0:00 sshd
8602 root 46 0 472 472 396 S 0.0 0.1 0:00 agetty
8616 hrandoz 63 0 1164 1164 892 S 0.0 0.3 0:00 bash
8645 root 49 0 1152 1152 888 S 0.0 0.2 0:00 bash
8647 root 46 0 468 468 388 R 0.0 0.1 0:00 setrlimit01
8649 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>
8650 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>
8651 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>
8658 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>
8659 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>
8660 root 46 0 0 0 0 Z 0.0 0.0 0:00 setrlimit01 <defunct>

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
Unable to handle kernel NULL pointer dereference at virtual address 00000000
*pde = c024be44
Oops: 0002
CPU: 0
EIP: 0010:[<c024c06d>] Not tainted


Using defaults from ksymoops -t elf32-i386 -a i386

EFLAGS: 00010046
eax: 00000000 ebx: d6184000 ecx: d6184290 edx: d6184290
esi: c024be3c edi: d6184001 ebp: d6185f98 esp: c024c06c


ds: 0018 es: 0018 ss: 0018

Process (pid: 0, stackpage=c024b000)
Stack: c024c06c c024c06c c024c074 c024c074 c024c07c c024c07c c024c084 c024c084
c024c08c c024c08c c024c094 c024c094 c024c09c c024c09c c024c0a4 c024c0a4
c024c0ac c024c0ac c024c0b4 c024c0b4 c024c0bc c024c0bc c024c0c4 c024c0c4
Code: c0 24 c0 6c c0 24 c0 74 c0 24 c0 74 c0 24 c0 7c c0 24 c0 7c

>>EIP; c024c06c <rt_array+28c/340> <=====
Code; c024c06c <rt_array+28c/340>
00000000 <_EIP>:
Code; c024c06c <rt_array+28c/340> <=====
0: c0 24 c0 6c shlb $0x6c,(%eax,%eax,8) <=====
Code; c024c070 <rt_array+290/340>
4: c0 24 c0 74 shlb $0x74,(%eax,%eax,8)
Code; c024c074 <rt_array+294/340>
8: c0 24 c0 74 shlb $0x74,(%eax,%eax,8)
Code; c024c078 <rt_array+298/340>
c: c0 24 c0 7c shlb $0x7c,(%eax,%eax,8)
Code; c024c07c <rt_array+29c/340>
10: c0 24 c0 7c shlb $0x7c,(%eax,%eax,8)

CPU: 0
EIP: 0010:[<c01110b5>] Not tainted
EFLAGS: 00010086
eax: c0000000 ebx: c0248000 ecx: c024ca58 edx: 01400000
esi: 00000000 edi: c0110c48 ebp: c1400000 esp: c0247fb8


ds: 0018 es: 0018 ss: 0018

Process $.#. (pid: 5373, stackpage=c0247000)
Stack: c0248000 00000000 c0110c48 c1400000 00000000 00000000 c0246000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000
Call Trace: [<c0110c48>]
Code: f6 04 10 81 0f 84 02 fe ff ff 5b 5e 5f 5d 81 c4 8c 00 00 00

>>EIP; c01110b4 <do_page_fault+46c/488> <=====
Trace; c0110c48 <do_page_fault+0/488>
Code; c01110b4 <do_page_fault+46c/488>
00000000 <_EIP>:
Code; c01110b4 <do_page_fault+46c/488> <=====
0: f6 04 10 81 testb $0x81,(%eax,%edx,1) <=====
Code; c01110b8 <do_page_fault+470/488>
4: 0f 84 02 fe ff ff je fffffe0c <_EIP+0xfffffe0c> c0110ec0 <do_page_fault+278/488>
Code; c01110be <do_page_fault+476/488>
a: 5b pop %ebx
Code; c01110be <do_page_fault+476/488>
b: 5e pop %esi
Code; c01110c0 <do_page_fault+478/488>
c: 5f pop %edi
Code; c01110c0 <do_page_fault+478/488>
d: 5d pop %ebp
Code; c01110c2 <do_page_fault+47a/488>
e: 81 c4 8c 00 00 00 add $0x8c,%esp

CPU: 0
EIP: 0010:[<c0180e93>] Not tainted
EFLAGS: 00010002
eax: c1604120 ebx: 00000000 ecx: c0203cc0 edx: 00000000
esi: 00000000 edi: c0108e1c ebp: c1400000 esp: c0247f30


ds: 0018 es: 0018 ss: 0018

Process $.#. (pid: 5373, stackpage=c0247000)
Stack: 0000000f 00000000 c0110bfb 00000000 c0108afe 00000000 c0247f84 c01dcccd
c01dcd1f 00000000 00000001 c0247f84 c0108e70 c01dcd1f c0247f84 00000000
c0246000 00000000 c0108684 c0247f84 00000000 c0248000 c024ca58 01400000
Call Trace: [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0110c48>]
[<c01110b5>] [<c0110c48>]
Code: 80 78 04 00 75 7f 8b 15 a0 90 20 c0 c7 05 44 1c 26 c0 1c 0f

>>EIP; c0180e92 <unblank_screen+4e/d8> <=====
Trace; c0110bfa <bust_spinlocks+22/4c>
Trace; c0108afe <die+42/50>
Trace; c0108e70 <do_double_fault+54/5c>
Trace; c0108684 <error_code+34/40>
Trace; c0110c48 <do_page_fault+0/488>
Trace; c01110b4 <do_page_fault+46c/488>
Trace; c0110c48 <do_page_fault+0/488>
Code; c0180e92 <unblank_screen+4e/d8>
00000000 <_EIP>:
Code; c0180e92 <unblank_screen+4e/d8> <=====
0: 80 78 04 00 cmpb $0x0,0x4(%eax) <=====
Code; c0180e96 <unblank_screen+52/d8>
4: 75 7f jne 85 <_EIP+0x85> c0180f16 <unblank_screen+d2/d8>
Code; c0180e98 <unblank_screen+54/d8>
6: 8b 15 a0 90 20 c0 mov 0xc02090a0,%edx
Code; c0180e9e <unblank_screen+5a/d8>
c: c7 05 44 1c 26 c0 1c movl $0xf1c,0xc0261c44
Code; c0180ea4 <unblank_screen+60/d8>
13: 0f 00 00

CPU: 0
EIP: 0010:[<c0180e93>] Not tainted
EFLAGS: 00010002
eax: c1604120 ebx: 00000000 ecx: c0203cc0 edx: 00000000
esi: 00000000 edi: c0108e1c ebp: c1400000 esp: c0247ea8


ds: 0018 es: 0018 ss: 0018

Process $.#. (pid: 5373, stackpage=c0247000)
Stack: 0000000f 00000000 c0110bfb 00000000 c0108afe 00000000 c0247efc c01dcccd
c01dcd1f 00000000 00000001 c0247efc c0108e70 c01dcd1f c0247efc 00000000
c0246000 00000000 c0108684 c0247efc 00000000 00000000 c0203cc0 00000000
Call Trace: [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0108e1c>]
[<c0180e93>] [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0110c48>]
[<c01110b5>] [<c0110c48>]
Code: 80 78 04 00 75 7f 8b 15 a0 90 20 c0 c7 05 44 1c 26 c0 1c 0f


It looks like you already have an idea where the problem is.
Looking forward to the next patch. :)

--
Randy Hron

David Lang

Jan 4, 2002, 6:33:19 AM
On Fri, 4 Jan 2002, Ingo Molnar wrote:

> On Fri, 4 Jan 2002, David Lang wrote:
>
> > Ingo,
> > back in the 2.4.4-2.4.5 days when we experimented with the
> > child-runs-first scheduling patch we ran into quite a few programs that
> > died or locked up due to this. (I had a couple myself and heard of others)
>
> hm, Andrea said that the only serious issue was in the sysvinit code,
> which should be fixed in any recent distro. Andrea?
>

I remember running into problems with some user apps (not lockups, but the
apps failed) on my 2x400MHz pentium box. I specifically remember the Citrix
client hanging, but I think there were others as well.

I'll try and get a chance to try your patch in the next couple days.

David Lang


> > try switching this back to the current behaviour and see if the
> > lockups still happen.
>
> there must be some other bug as well, the child-runs-first scheduling can
> cause lockups, but it shouldnt cause oopes.
>
> Ingo
>

Ingo Molnar

Jan 4, 2002, 9:04:18 AM

On Fri, 4 Jan 2002, David Lang wrote:

> I remember running into problems with some user apps (not lockups, but
> the apps failed) on my 2x400MHz pentium box. I specificly remember the
> Citrix client hanging, but I think there were others as well.

ok. Generally there is no guarantee that the parent will run first under
the current scheduler, but it's likely to run first. But if eg. a higher
priority process preempts the forking process while it's doing fork() then
the child will run first in 50% of the cases. So this ordering is not
guaranteed by the 2.4 (or 2.2) Linux scheduler in any way.

Ingo Molnar

Jan 4, 2002, 10:44:32 AM

On Fri, 4 Jan 2002, Andrea Arcangeli wrote:

> + {
> + int counter = current->counter;
> + p->counter = (counter + 1) >> 1;
> + current->counter = counter >> 1;
> + p->policy &= ~SCHED_YIELD;
> + current->policy |= SCHED_YIELD;
> current->need_resched = 1;
> + }

yep - good, this means that applications got some fair testing already.

What i mentioned in the previous email is that on SMP this solution is
still not the optimal one under the current scheduler, because the wakeup
of the child process might end up pushing the process to another (idle)
CPU - worsening the COW effect with SMP-interlocking effects. This is why
i introduced wake_up_forked_process() that knows about this distinction
and keeps the child on the current CPU.

Andrea Arcangeli

Jan 4, 2002, 8:39:01 AM
On Fri, Jan 04, 2002 at 03:33:19AM -0800, David Lang wrote:
> On Fri, 4 Jan 2002, Ingo Molnar wrote:
>
> > On Fri, 4 Jan 2002, David Lang wrote:
> >
> > > Ingo,
> > > back in the 2.4.4-2.4.5 days when we experimented with the
> > > child-runs-first scheduling patch we ran into quite a few programs that
> > > died or locked up due to this. (I had a couple myself and heard of others)
> >
> > hm, Andrea said that the only serious issue was in the sysvinit code,
> > which should be fixed in any recent distro. Andrea?
> >
>
> I remember running into problems with some user apps (not lockups, but the
> apps failed) on my 2x400MHz pentium box. I specificly remember the Citrix
> client hanging, but I think there were others as well.

those are broken apps that have to be fixed; they will eventually lock up
even without the patch. userspace can be preempted anytime.

> I'll try and get a chance to try your patch in the next couple days.
>
> David Lang
>
>
> > > try switching this back to the current behaviour and see if the
> > > lockups still happen.
> >
> > there must be some other bug as well, the child-runs-first scheduling can
> > cause lockups, but it shouldnt cause oopes.
> >
> > Ingo
> >


Andrea

Andrea Arcangeli

Jan 4, 2002, 8:36:59 AM
On Fri, Jan 04, 2002 at 12:44:57PM +0100, Ingo Molnar wrote:
>
> On Fri, 4 Jan 2002, David Lang wrote:
>
> > Ingo,
> > back in the 2.4.4-2.4.5 days when we experimented with the
> > child-runs-first scheduling patch we ran into quite a few programs that
> > died or locked up due to this. (I had a couple myself and heard of others)
>
> hm, Andrea said that the only serious issue was in the sysvinit code,
> which should be fixed in any recent distro. Andrea?

correct. I always run child-runs-first on all my boxes. And
those races could trigger before as well, so it's better to make them more
easily reproducible anyways.

I always run with this patch applied:

diff -urN parent-timeslice/include/linux/sched.h child-first/include/linux/sched.h
--- parent-timeslice/include/linux/sched.h Thu May 3 18:17:56 2001
+++ child-first/include/linux/sched.h Thu May 3 18:19:44 2001
@@ -301,7 +301,7 @@
* all fields in a single cacheline that are needed for
* the goodness() loop in schedule().
*/
- int counter;
+ volatile int counter;
int nice;
unsigned int policy;
struct mm_struct *mm;
diff -urN parent-timeslice/kernel/fork.c child-first/kernel/fork.c
--- parent-timeslice/kernel/fork.c Thu May 3 18:18:31 2001
+++ child-first/kernel/fork.c Thu May 3 18:20:40 2001
@@ -665,15 +665,18 @@
p->pdeath_signal = 0;

/*
- * "share" dynamic priority between parent and child, thus the
- * total amount of dynamic priorities in the system doesnt change,
- * more scheduling fairness. This is only important in the first
- * timeslice, on the long run the scheduling behaviour is unchanged.
+ * Scheduling the child first is especially useful in avoiding a
+ * lot of copy-on-write faults if the child for a fork() just wants
+ * to do a few simple things and then exec().
*/
- p->counter = (current->counter + 1) >> 1;
- current->counter >>= 1;
- if (!current->counter)
+ {
+ int counter = current->counter;
+ p->counter = (counter + 1) >> 1;
+ current->counter = counter >> 1;
+ p->policy &= ~SCHED_YIELD;
+ current->policy |= SCHED_YIELD;
current->need_resched = 1;
+ }

/* Tell the parent if it can get back its timeslice when child exits */
p->get_child_timeslice = 1;

>
> > try switching this back to the current behaviour and see if the
> > lockups still happen.
>
> there must be some other bug as well, the child-runs-first scheduling can
> cause lockups, but it shouldnt cause oopes.

definitely. My above implementation works fine.

Thomas Cataldo

Jan 4, 2002, 9:18:29 AM
Patch applied to 2.4.17 vanilla. Oops on startup, mounting partitions.
Without mounting my vfat partition, I can boot up but X froze
completely after one minute.


ksymoops 2.4.3 on i686 2.4.17. Options used


-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)

-o /lib/modules/2.4.17/ (default)
-m /boot/System.map-2.4.17 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

cpu: 0, clocks: 1002224, slice: 334074
cpu: 1, clocks: 1002224, slice: 334074
cpu 1 has done init idle, doing cpu_idle().
Unable to handle kernel NULL pointer dereference at virtual address 00000008
e0852f65
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<e0852f65>] Not tainted


Using defaults from ksymoops -t elf32-i386 -a i386

EFLAGS: 00010286
eax: 00000000 ebx: dfb5bdbc ecx: dfb5be70 edx: 00000000
esi: dfeee200 edi: dfb5be24 ebp: dfb5be88 esp: dfb5bd40


ds: 0018 es: 0018 ss: 0018

Process mount (pid: 82, stackpage=dfb5b000)
Stack: dfb5bdbc dfeee200 dfb5be24 dfb5be88 fffffffa dfbd8da0 00000001
c0194b1f
c1864000 c1849600 e0853af8 df86a000 dfb5bdb4 dfb5bdb8 dfb5be70
dfb5be24
dfb5bdbc dfeee200 dfeee200 df86a000 dfb5be70 dfeee2cc 00000803
00000000
Call Trace: [<c0194b1f>] [<e0853af8>] [<c013a370>] [<e085af7b>]
[<e085b520>]
[<c013904c>] [<e085b560>] [<c0148e83>] [<c013965b>] [<e085b560>]
[<c0149c89>]
[<c0149f0f>] [<c0149d84>] [<c0149fd7>] [<c0106f6b>]
Code: 0f b7 40 08 66 89 41 08 8a 41 14 66 c7 41 0a 00 00 80 61 15

>>EIP; e0852f64 <[fat]parse_options+3c/7fc> <=====
Trace; c0194b1e <sym_queue_command+ae/c0>
Trace; e0853af8 <[fat]fat_read_super+dc/86c>
Trace; c013a370 <blkdev_get+68/78>
Trace; e085af7a <[vfat]vfat_read_super+22/88>
Trace; e085b520 <[vfat]vfat_dir_inode_operations+0/40>
Trace; c013904c <get_sb_bdev+254/30c>
Trace; e085b560 <[vfat]vfat_fs_type+0/1a>
Trace; c0148e82 <set_devname+26/54>
Trace; c013965a <do_kern_mount+aa/150>
Trace; e085b560 <[vfat]vfat_fs_type+0/1a>
Trace; c0149c88 <do_add_mount+20/cc>
Trace; c0149f0e <do_mount+13a/154>
Trace; c0149d84 <copy_mount_options+50/a0>
Trace; c0149fd6 <sys_mount+ae/11c>
Trace; c0106f6a <system_call+32/38>
Code; e0852f64 <[fat]parse_options+3c/7fc>
00000000 <_EIP>:
Code; e0852f64 <[fat]parse_options+3c/7fc> <=====
0: 0f b7 40 08 movzwl 0x8(%eax),%eax <=====
Code; e0852f68 <[fat]parse_options+40/7fc>
4: 66 89 41 08 mov %ax,0x8(%ecx)
Code; e0852f6c <[fat]parse_options+44/7fc>
8: 8a 41 14 mov 0x14(%ecx),%al
Code; e0852f6e <[fat]parse_options+46/7fc>
b: 66 c7 41 0a 00 00 movw $0x0,0xa(%ecx)
Code; e0852f74 <[fat]parse_options+4c/7fc>
11: 80 61 15 00 andb $0x0,0x15(%ecx)


14 warnings issued. Results may not be reliable.


Ingo Molnar wrote:

> now that new-year's parties are over things are getting boring again. For
> those who want to see and perhaps even try something more complex, i'm
> announcing this patch that is a pretty radical rewrite of the Linux
> scheduler for 2.5.2-pre6:
>

> the SMP load-balancer can be extended/switched with additional parallel
> computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
> can be supported easily by changing the load-balancer. Right now it's
> tuned for my SMP systems.
>

> i skipped the prev->mm == next->mm advantage - no workload i know of shows
> any sensitivity to this. It can be added back by sacrificing O(1)
> schedule() [the current and one-lower priority list can be searched for a
> that->mm == current->mm condition], but costs a fair number of cycles
> during a number of important workloads, so i wanted to avoid this as much
> as possible.
>

> - the SMP idle-task startup code was still racy and the new scheduler
> triggered this. So i streamlined the idle-setup code a bit. We do not call
> into schedule() before all processors have started up fully and all idle
> threads are in place.
>

> another benchmark measures sched_yield() performance. (which the pthreads
> code relies on pretty heavily.)
>

> on a 2-way system, starting 4 instances of ./loop_yield gives the
> following context-switch throughput:
>
> vanilla-2.5.2-pre6
>
> # vmstat 5 | cut -c57-
> system cpu
> in cs us sy id
> 102 241247 6 94 0
> 101 240977 5 95 0
> 101 241051 6 94 0
> 101 241176 7 93 0
>
> O(1)-schedule-2.5.2-pre6
>
> # vmstat 5 | cut -c57-
> system cpu
> in cs us sy id
> 101 977530 31 69 0
> 101 977478 28 72 0
> 101 977538 27 73 0
>

> the O(1) scheduler is 300% faster, and we do nearly 1 million context
> switches per second!
>

> Comments, bug reports, suggestions are welcome,
>

> Ingo

Andrea Arcangeli

Jan 4, 2002, 9:45:00 AM
On Fri, Jan 04, 2002 at 04:44:32PM +0100, Ingo Molnar wrote:

>
> On Fri, 4 Jan 2002, Andrea Arcangeli wrote:
>
> > + {
> > + int counter = current->counter;
> > + p->counter = (counter + 1) >> 1;
> > + current->counter = counter >> 1;
> > + p->policy &= ~SCHED_YIELD;
> > + current->policy |= SCHED_YIELD;
> > current->need_resched = 1;
> > + }
>
> yep - good, this means that applications got some fair testing already.

yes :).

> What i mentioned in the previous email is that on SMP this solution is
> still not the optimal one under the current scheduler, because the wakeup
> of the child process might end up pushing the process to another (idle)
> CPU - worsening the COW effect with SMP-interlocking effects. This is why
> i introduced wake_up_forked_process() that knows about this distinction
> and keeps the child on the current CPU.

The right thing to do is not obvious here. If you keep the child on the
current CPU, it will be the parent that gets rescheduled on another
(idle) cpu. But then the parent could be the "real" app, or something
that would better not migrate over and over to all cpus (better to run
the child on another cpu in such a case). Of course we cannot avoid having
child and parent run in parallel if there are idle cpus, or we risk wasting
a whole timeslice worth of cpu cycles. But, at the very least, we should
avoid setting need_resched to 1 in the parent if the child is getting
migrated, so at least that bit can be optimized away with an aware
wake_up_forked_process.

Andrea

Ingo Molnar

Jan 4, 2002, 12:07:15 PM

On Fri, 4 Jan 2002, dan kelley wrote:

> using the 2.4.17 patch against a vanilla 2.4.17 tree, looks like there's a
> problem w/ reiserfs:

please check out the -A1 patch, this should be fixed.

Ingo

Nikita Danilov

Jan 4, 2002, 10:22:04 AM
Ingo Molnar writes:
>
> On Fri, 4 Jan 2002, dan kelley wrote:
>
> > using the 2.4.17 patch against a vanilla 2.4.17 tree, looks like there's a
> > problem w/ reiserfs:
>
> please check out the -A1 patch, this should be fixed.

As a temporary solution you can also compile without reiserfs procinfo
support.

>
> Ingo

Nikita.

dan kelley

Jan 4, 2002, 9:46:30 AM

using the 2.4.17 patch against a vanilla 2.4.17 tree, looks like there's a
problem w/ reiserfs:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.17/include -Wall -Wstrict-prototypes
-Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common
-pipe -mpreferred-stack-boundary=2 -march=i686 -c -o buffer2.o
buffer2.c
buffer2.c: In function `reiserfs_bread':
buffer2.c:54: structure has no member named `context_swtch'
buffer2.c:58: structure has no member named `context_swtch'
make[3]: *** [buffer2.o] Error 1
make[3]: Leaving directory `/usr/src/linux-2.4.17/fs/reiserfs'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux-2.4.17/fs/reiserfs'
make[1]: *** [_subdir_reiserfs] Error 2
make[1]: Leaving directory `/usr/src/linux-2.4.17/fs'
make: *** [_dir_fs] Error 2
