
[SCHED] Totally WRONG priority calculation with specific test-case (since 2.6.10-bk12)


Paolo Ornati

Dec 27, 2005, 1:08:57 PM
to Linux Kernel Mailing List
Hello,

I've found an easy-to-reproduce-for-me test case that shows a totally
wrong priority calculation: basically a CPU-intensive process gets
better priority than a disk-intensive one (dd if=bigfile
of=/dev/null ...).

Seems impossible, doesn't it?

---- THE NUMBERS with 2.6.15-rc7 -----

The test case is the Xvid encoding of a DVD-ripped track with transcode
(using the "dvd::rip" interface). The copied-and-pasted command line is
this:

mkdir -m 0775 -p '/home/paolo/tmp/test/tmp' &&
cd /home/paolo/tmp/test/tmp && dr_exec transcode -H 10 -a 2 -x vob,null
-i /home/paolo/tmp/test/vob/003 -w 1198,50 -b 128,0,0 -s 1.972
--a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 1 -y xvid4,null
-o /dev/null --print_status 20 && echo DVDRIP_SUCCESS mkdir -m 0775 -p
'/home/paolo/tmp/test/tmp' && cd /home/paolo/tmp/test/tmp && dr_exec
transcode -H 10 -a 2 -x vob -i /home/paolo/tmp/test/vob/003 -w 1198,50
-b 128,0,0 -s 1.972 --a52_drc_off -f 25 -Y 52,8,52,8 -B 27,10,8 -R 2 -y
xvid4 -o /home/paolo/tmp/test/avi/003/test-003.avi --print_status 20 &&
echo DVDRIP_SUCCESS


Here is a top snapshot while running it:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5721 paolo 16 0 115m 18m 2428 R 84.4 3.7 0:15.11 transcode
5736 paolo 25 0 50352 4516 1912 R 8.4 0.9 0:01.53 tcdecode
5725 paolo 15 0 115m 18m 2428 S 4.6 3.7 0:00.84 transcode
5738 paolo 18 0 115m 18m 2428 S 0.8 3.7 0:00.15 transcode
5734 paolo 25 0 20356 1140 920 S 0.6 0.2 0:00.12 tcdemux
5731 paolo 25 0 47312 2540 1996 R 0.4 0.5 0:00.08 tcdecode
5319 root 15 0 166m 16m 2584 S 0.2 3.2 0:25.06 X
5444 paolo 16 0 87116 22m 15m R 0.2 4.6 0:04.05 konsole
5716 paolo 16 0 10424 1160 876 R 0.2 0.2 0:00.06 top
5735 paolo 25 0 22364 1436 932 S 0.2 0.3 0:00.01 tcextract


DD running alone:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m4.052s
user 0m0.000s
sys 0m0.209s

DD while transcoding:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m26.121s
user 0m0.001s
sys 0m0.255s

---------------------------------------

I've tried older kernels and found that 2.6.11 is the first one affected.

Going on with testing...

2.6.11-rc[1-5]:
2.6.11-rc3 bad
2.6.11-rc1 bad

2.6.10-bk[1-14]
2.6.10-bk7 good
2.6.10-bk11 good
2.6.10-bk13 bad
2.6.10-bk12 bad

So the problem was introduced with:
>> 2.6.10-bk12 09-Jan-2005 <<

The exact behaviour differs between 2.6.11/12/13/14... for example:
with 2.6.11 the priority of "transcode" is initially set to ~25 and goes
down to 17/18 when running DD.

The problem doesn't seem 100% reproducible with every kernel; sometimes
a "BAD" kernel looks "GOOD"... or maybe I was just confused by too
much compile/install/reboot/test work ;)

Other INFO:
- I'm on x86_64
- preemption ON/OFF doesn't make any difference


Can anyone reproduce this?
IOW: is this affecting only my machine?

--
Paolo Ornati
Linux 2.6.15-rc7-gf89f5948 on x86_64
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Paolo Ornati

Dec 27, 2005, 4:48:42 PM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Hello Con and Ingo... I've found that the above problem goes away
by reverting this:

http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?nav=index.html|src/|src/include|src/include/linux|related/include/linux/sched.h

--------------------------------------------------

[PATCH] sched: remove_interactive_credit

Special casing tasks by interactive credit was helpful for preventing fully
cpu bound tasks from easily rising to interactive status.

However it did not select out tasks that had periods of being fully cpu
bound and then sleeping while waiting on pipes, signals etc. This led to a
more disproportionate share of cpu time.

Backing this out will no longer special case only fully cpu bound tasks,
and prevents the variable behaviour that occurs at startup before tasks
declare themselves interactive or not, and speeds up application startup
slightly under certain circumstances. It does cost in interactivity
slightly as load rises but it is worth it for the fairness gains.

Signed-off-by: Con Kolivas <ker...@kolivas.org>
Acked-by: Ingo Molnar <mi...@elte.hu>
Signed-off-by: Andrew Morton <ak...@osdl.org>
Signed-off-by: Linus Torvalds <torv...@osdl.org>

--------------------------------------------------


Maybe this change has revealed a scheduler weakness?

I'm glad to test any patch or give more data :)

Bye,

--
Paolo Ornati
Linux 2.6.10-bk12-int_credit on x86_64

Con Kolivas

Dec 27, 2005, 6:27:41 PM
to Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar

Looking at your top output I see that the transcode command generates 7
processes, all likely to be using CPU, and your DD slowdown is almost 7
times... I assume it all goes away if you nice transcode up by 3 or more.

> Hello Con and Ingo... I've found that the above problem goes away
> by reverting this:
>
> http://linux.bkbits.net:8080/linux-2.6/cset@41e054c6pwNQXzErMxvfh4IpLPXA5A?
>nav=index.html|src/|src/include|src/include/linux|related/include/linux/sche
>d.h
>
> --------------------------------------------------
>
> [PATCH] sched: remove_interactive_credit

The issue is that the scheduler interactivity estimator is a state machine and
can be fooled to some degree, and a cpu intensive task that just happens to
sleep a little bit gets significantly better priority than one that is fully
cpu bound all the time. Reverting that change is not a solution because the
estimator can still be fooled by the same process sleeping a lot for a few
seconds at startup and then changing to mostly-cpu, slightly-sleeping
behaviour. This "fluctuating" behaviour is in my opinion worse, which is why I
removed it.
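
A deliberately over-simplified toy model of what Con describes may make this
clearer. It is illustrative only: the names and constants below are invented
and this is not the code in kernel/sched.c, although the two effects it mimics
correspond to real ones (sleep credit is scaled by the unused bonus range, as
can be seen in the patch quoted at the end of this thread, and, if memory
serves, run time is charged at a reduced rate to tasks that already hold a
high bonus). For reference, top's PR column for a nice-0 task is roughly
20 - (bonus - 5), so a full bonus shows up as PR 15 and no bonus as PR 25,
matching the snapshots above.

------ toy_estimator.c (illustrative sketch, NOT kernel code) ------
#include <stdio.h>

#define MAX_SLEEP_AVG	1000	/* made-up scale: ms of credited sleep */
#define MAX_BONUS	10

static long sleep_avg = MAX_SLEEP_AVG;	/* the task slept a lot at startup */

static int bonus(void)
{
	return (int)(sleep_avg * MAX_BONUS / MAX_SLEEP_AVG);
}

static void charge_run(long ran_ms)
{
	int b = bonus();

	/* tasks with a high bonus are charged proportionately less run time */
	sleep_avg -= ran_ms / (b ? b : 1);
	if (sleep_avg < 0)
		sleep_avg = 0;
}

static void credit_sleep(long slept_ms)
{
	int mult = MAX_BONUS - bonus();

	/* the lower the bonus, the faster sleep credit accumulates */
	sleep_avg += slept_ms * (mult ? mult : 1);
	if (sleep_avg > MAX_SLEEP_AVG)
		sleep_avg = MAX_SLEEP_AVG;
}

int main(void)
{
	int cycle;

	for (cycle = 1; cycle <= 30; cycle++) {
		charge_run(95);		/* burn the CPU for 95 ms ... */
		credit_sleep(5);	/* ... then sleep 5 ms */
		printf("cycle %2d: sleep_avg=%4ld bonus=%d\n",
		       cycle, sleep_avg, bonus());
	}
	return 0;	/* the bonus settles around 7-8, never near 0 */
}
---------------------------------------------------------------------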

Cheers,
Con

Peter Williams

Dec 27, 2005, 7:01:04 PM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Any chance of you applying the PlugSched patches and seeing how the
other schedulers that it contains handle this situation?

The patch at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>

should apply without problems to the 2.6.15-rc7 kernel.

Very Brief Documentation:

You can select a default scheduler at kernel build time. If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
don't change the default when you build the kernel the default scheduler
will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/

Peter
--
Peter Williams pwil...@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

Paolo Ornati

Dec 28, 2005, 5:21:48 AM
to Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar
On Wed, 28 Dec 2005 10:59:13 +1100
Peter Williams <pwil...@bigpond.net.au> wrote:

> Any chance of you applying the PlugSched patches and seeing how the
> other schedulers that it contains handle this situation?
>
> The patch at:
>
> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
>
> should apply without problems to the 2.6.15-rc7 kernel.
>
> Very Brief Documentation:
>
> You can select a default scheduler at kernel build time. If you wish to
> boot with a scheduler other than the default it can be selected at boot
> time by adding:
>
> cpusched=<scheduler>
>
> to the boot command line where <scheduler> is one of: ingosched,
> nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
> don't change the default when you build the kernel the default scheduler
> will be ingosched (which is the normal scheduler).


First of all, this is the "pstree" structure of transcode and friends:

|-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
| | |-tcdecode]
| | |-tcdemux]
| | `-tcextract]
| `-transcode---5*[transcode]


Results with various schedulers:

------------------------------------------------------------------------

1) nicksched: perfect! This is the behaviour I want.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode
5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode
5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode
5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode
5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux
5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X
5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode
5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux
..

transcode gets recognized for what it is, and I/O-bound processes
don't even notice that it is running :)


2) staircase: bad, as you can see:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5582 paolo 26 0 115m 18m 2428 R 82.7 3.7 0:47.63 transcode
5599 paolo 39 0 50352 4516 1912 R 9.6 0.9 0:05.21 tcdecode
5586 paolo 0 0 115m 18m 2428 S 4.5 3.7 0:02.61 transcode
5622 paolo 39 0 4948 1520 412 R 1.1 0.3 0:00.15 dd
5591 paolo 0 0 115m 18m 2428 S 0.6 3.7 0:00.36 transcode
5575 paolo 0 0 98476 37m 9392 S 0.4 7.5 0:01.44 perl
5597 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.21 tcdemux
5475 paolo 0 0 86556 22m 15m S 0.2 4.5 0:01.24 konsole
5388 root 0 0 167m 17m 3208 S 0.1 3.4 0:03.16 X
5587 paolo 0 0 115m 18m 2428 S 0.1 3.7 0:00.03 transcode
5595 paolo 20 0 47312 2540 1996 S 0.1 0.5 0:00.14 tcdecode
5596 paolo 26 0 22672 1268 1020 S 0.1 0.2 0:00.03 tccat
5598 paolo 28 0 22364 1436 932 S 0.1 0.3 0:00.04 tcextract


And "DD" is affected badly:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m6.341s
user 0m0.002s
sys 0m0.229s

While transcoding:

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m15.793s
user 0m0.001s
sys 0m0.374s


3) spa_no_frills: bad, but this is OK since it is Round Robin :)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5356 paolo 20 0 115m 18m 2428 R 81.1 3.7 0:27.61 transcode
5371 paolo 20 0 50348 4516 1912 R 8.9 0.9 0:02.97 tcdecode
5360 paolo 20 0 115m 18m 2428 S 4.1 3.7 0:01.54 transcode
5378 paolo 20 0 4948 1520 412 D 1.4 0.3 0:00.29 dd
5364 paolo 20 0 20352 1144 920 S 0.9 0.2 0:00.20 tcdemux
5373 paolo 20 0 115m 18m 2428 S 0.7 3.7 0:00.32 transcode
5369 paolo 20 0 20356 1144 920 S 0.5 0.2 0:00.14 tcdemux
5205 root 20 0 165m 15m 2584 R 0.2 3.2 0:01.86 X


4) spa_ws: bad

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5334 paolo 32 0 115m 18m 2428 R 82.7 3.7 0:18.77 transcode
5349 paolo 32 0 50348 4516 1912 R 8.9 0.9 0:02.00 tcdecode
5338 paolo 21 0 115m 18m 2428 S 4.6 3.7 0:01.08 transcode
5356 paolo 32 0 4948 1520 412 D 1.1 0.3 0:00.12 dd
5351 paolo 32 0 115m 18m 2428 S 1.0 3.7 0:00.20 transcode
5199 root 21 0 165m 15m 2584 S 0.4 3.2 0:01.68 X
5347 paolo 32 0 20356 1140 920 S 0.4 0.2 0:00.08 tcdemux
5296 paolo 22 0 98472 37m 9392 S 0.2 7.5 0:01.47 perl
5299 paolo 21 0 86556 22m 15m S 0.2 4.4 0:00.75 konsole
5344 paolo 32 0 47308 2540 1996 S 0.2 0.5 0:00.07 tcdecode
5339 paolo 21 0 115m 18m 2428 S 0.1 3.7 0:00.01 transcode

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m8.112s
user 0m0.001s
sys 0m0.444s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m29.222s
user 0m0.000s
sys 0m0.400s


5) spa_svr: surprise, surprise! Not all that bad. At least DD
gets better priority than transcode... and DD real time is only a bit
affected (8s --> ~9s).


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode
5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode
5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode
5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd
5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux
5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode
5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X
5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode
5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole
5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl
5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat


6) zaphod: more or less like spa_svr

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5308 paolo 34 0 115m 18m 2428 R 52.1 3.7 0:49.77 transcode
5323 paolo 32 0 50352 4516 1912 S 6.0 0.9 0:05.61 tcdecode
5356 paolo 28 0 4952 1520 412 D 3.5 0.3 0:00.28 dd
5312 paolo 28 0 115m 18m 2428 S 2.6 3.7 0:02.71 transcode
5325 paolo 31 0 115m 18m 2428 S 0.7 3.7 0:00.55 transcode
5316 paolo 37 0 20352 1140 920 S 0.4 0.2 0:00.33 tcdemux
5202 root 23 0 165m 15m 2584 S 0.2 3.1 0:01.57 X
5318 paolo 31 0 47312 2540 1996 S 0.2 0.5 0:00.28 tcdecode
5321 paolo 33 0 20356 1144 920 S 0.2 0.2 0:00.26 tcdemux
4760 messageb 25 0 13248 1068 848 S 0.1 0.2 0:00.07 dbus-daemon-1
5264 paolo 24 0 93920 17m 10m S 0.1 3.5 0:00.38 kded
5282 paolo 23 0 92712 19m 12m S 0.1 3.9 0:00.36 kdesktop


7) ingosched: bad, as already said in the original post

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5209 paolo 16 0 115m 18m 2428 R 72.0 3.7 0:22.13 transcode
5224 paolo 22 0 50348 4516 1912 R 8.4 0.9 0:02.44 tcdecode
5213 paolo 15 0 115m 18m 2428 S 4.2 3.7 0:01.24 transcode
5243 paolo 18 0 4948 1520 412 R 1.8 0.3 0:00.14 dd
5217 paolo 19 0 20356 1144 920 R 0.8 0.2 0:00.19 tcdemux
5108 root 15 0 165m 15m 2584 S 0.6 3.1 0:01.44 X
5226 paolo 15 0 115m 18m 2428 S 0.6 3.7 0:00.20 transcode
5216 paolo 18 0 22676 1268 1020 S 0.4 0.2 0:00.03 tccat
5219 paolo 18 0 47312 2540 1996 R 0.4 0.5 0:00.12 tcdecode
5222 paolo 18 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
5195 paolo 16 0 98488 37m 9392 S 0.2 7.5 0:01.41 perl
5198 paolo 16 0 86552 22m 15m R 0.2 4.4 0:00.66 konsole

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m23.393s (instead of 8s)
user 0m0.001s
sys 0m0.418s

------------------------------------------------------------------------


So the winner, by manifest superiority, is "nicksched"; it looks to me
even better than 2.6.10-bk12 (ingosched) with
"remove_interactive_credit" reverted.

:)

--
Paolo Ornati
Linux 2.6.15-rc5-plugsched on x86_64

Paolo Ornati

Dec 28, 2005, 6:01:27 AM
to Con Kolivas, Linux Kernel Mailing List, Ingo Molnar
On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <ker...@kolivas.org> wrote:

> Looking at your top output I see that transcode command generates 7 processes
> all likely to be using cpu, and your DD slowdown is almost 7 times... I
> assume it all goes away if you nice the transcode up by 3 or more.

Yes, if I nice everything to 3 or more (nice -n 3 dvdrip ...) it works,
but I would prefer a less weak scheduler (see the other email, with
results for various schedulers).

Another thing that I've noticed is that things tend to get worse when
"transcode" has been running for a long time (some hours). It has
happened to me a few times (with ingosched, and also with staircase if
I remember correctly):
after some hours of running transcode (with me away from the
machine) I found a totally UNUSABLE system. Transcode was the king
of the machine and everything else got almost no CPU time. Switching to
a text console took something like 10s. When
I was finally logged in as root I reniced transcode and company to +19
and the system was usable again ;)

To make things even more STRANGE: another time this happened I did the
same thing, except that I reniced them to "0" (the same nice level they
were already running at) ---> and the system became usable again (with
the usual slowdown, but still usable).

This is what I remember. Now I think we can agree that there is
something wrong... no?

Con Kolivas

Dec 28, 2005, 6:21:14 AM
to Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar
On Wednesday 28 December 2005 22:01, Paolo Ornati wrote:
> after some hours of running transcode (with me away from the
> machine) I've found a totally UNUSABLE system. Transcode was the king
> of the machine and everything else get almost no CPU time. Switching to
> a Text-Console takes something like 10s (or something like that). When
> I was finally logged in as root I've reniced transcode and companyto +19
> and the system was usable again ;)
>
> To get things even more STRANGE: another time that this happened I've
> done the same thing except that I've reniced them to "0" (the same nice
> level they were running) ---> And the system became usable again (with
> the usual slow down but still usable).
>
> This is what I remember. Now I think we can agree that there is
> something wrong... no?

This latter thing sounds more like your transcode job pushed everything out to
swap... You need to instrument this case better.

Con

Paolo Ornati

Dec 28, 2005, 6:35:30 AM
to Con Kolivas, Linux Kernel Mailing List, Ingo Molnar
On Wed, 28 Dec 2005 22:19:23 +1100
Con Kolivas <ker...@kolivas.org> wrote:

> This latter thing sounds more like your transcode job pushed everything out to
> swap... You need to instrument this case better.
>

I don't know. The combination of swapped-out programs + "normal" priority
strangeness can potentially result in a total disaster... but why does
renicing transcode to "0" get the system usable again?

Next time I'll grab "/proc/meminfo"... what other info could help in
understanding this?

--
Paolo Ornati
Linux 2.6.15-rc5-plugsched on x86_64

Peter Williams

Dec 28, 2005, 8:44:55 AM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar
Paolo Ornati wrote:
> On Wed, 28 Dec 2005 10:59:13 +1100
> Peter Williams <pwil...@bigpond.net.au> wrote:
>
>
>>Any chance of you applying the PlugSched patches and seeing how the
>>other schedulers that it contains handle this situation?
>>
>>The patch at:
>>
>><http://prdownloads.sourceforge.net/cpuse/plugsched-6.1.6-for-2.6.15-rc5.patch?download>
>>
>>should apply without problems to the 2.6.15-rc7 kernel.
>>
>>Very Brief Documentation:
>>
>>You can select a default scheduler at kernel build time. If you wish to
>>boot with a scheduler other than the default it can be selected at boot
>>time by adding:
>>
>>cpusched=<scheduler>
>>
>>to the boot command line where <scheduler> is one of: ingosched,
>>nicksched, staircase, spa_no_frills, spa_ws, spa_svr or zaphod. If you
>>don't change the default when you build the kernel the default scheduler
>>will be ingosched (which is the normal scheduler).
>
>
>
> First of all, this is the "pstree" structure of transcode an friends:
>
> |-kdesktop---perl---sh---transcode-+-2*[sh-+-tccat]
> | | |-tcdecode]
> | | |-tcdemux]
> | | `-tcextract]
> | `-transcode---5*[transcode]
>
>
> Results with various schedulers:

First, thanks for doing this.

>
> ------------------------------------------------------------------------
>
> 1) nicksched: perfect! This is the behaviour I want.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5562 paolo 40 0 115m 18m 2428 R 82.2 3.7 0:22.16 transcode
> 5576 paolo 26 0 50348 4516 1912 S 9.5 0.9 0:02.43 tcdecode
> 5566 paolo 23 0 115m 18m 2428 S 4.6 3.7 0:01.24 transcode
> 5573 paolo 21 0 115m 18m 2428 S 0.9 3.7 0:00.22 transcode
> 5577 paolo 27 0 20356 1140 920 S 0.9 0.2 0:00.21 tcdemux
> 5295 root 20 0 167m 17m 3624 S 0.6 3.5 0:11.02 X
> 5579 paolo 20 0 47308 2540 1996 S 0.5 0.5 0:00.14 tcdecode
> 5574 paolo 20 0 20356 1144 920 S 0.4 0.2 0:00.11 tcdemux

> ...


>
> transcode get recognized for what it is, and I/O bounded processes
> don't even notice that it is running :)

Interesting. This one's more or less a dead scheduler and hasn't had
any development work done on it for some time. I just keep porting the
original version to new kernels.

Yes, no surprises there.

This one is aimed purely at good interactive responsiveness (i.e.
keyboard, mouse, X server and media players such as rhythmbox/xmms) so no
real surprises here either.

>
>
> 5) spa_svr: surprise, surprise! Not all that bad. At least DD
> gets better priority than transcode... and DD real time is only a bit
> affected (8s --> ~9s).
>

This will be the "throughput bonus" in action. Its overall aim is to
reduce the time tasks spend on the runqueue waiting for CPU access,
a.k.a. delay. It does this by using the system load and the average
amount of CPU time that the task uses each scheduling cycle to estimate
the expected delay for the task, and gives it a bonus if the actual
average delays being experienced are bigger than this value.

It's intended for server systems, not interactive systems, as reducing
overall delay isn't necessarily good for interactive systems, where the
aim is to quell the user's impatience by giving good latency to the
interactive tasks. These aims aren't always compatible.
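
A rough sketch of the kind of calculation described above, purely for
illustration (the names, the scaling and the bonus range are invented here
and do not come from the spa_svr source):

------ throughput_bonus_sketch.c (illustrative only) ------
#include <stdio.h>

#define MAX_TPT_BONUS	5	/* made-up bonus range */

struct toy_stats {
	double avg_cpu_ms;	/* average CPU used per scheduling cycle */
	double avg_delay_ms;	/* average time spent waiting on the runqueue */
};

static int throughput_bonus(const struct toy_stats *s, int nr_running)
{
	/* expected delay if each competitor used a burst of similar size */
	double expected = (nr_running > 1 ? nr_running - 1 : 0) * s->avg_cpu_ms;
	int bonus = 0;

	if (expected > 0.0 && s->avg_delay_ms > expected) {
		/* scale the bonus by how far reality exceeds the estimate */
		bonus = (int)((s->avg_delay_ms - expected) *
			      MAX_TPT_BONUS / s->avg_delay_ms);
		if (bonus > MAX_TPT_BONUS)
			bonus = MAX_TPT_BONUS;
	}
	return bonus;
}

int main(void)
{
	/* dd-like task: tiny CPU bursts but long waits behind the encoder */
	struct toy_stats dd = { 0.5, 40.0 };
	/* transcode-like task: long CPU bursts, delay no worse than expected */
	struct toy_stats enc = { 90.0, 100.0 };

	printf("dd-like bonus:        %d\n", throughput_bonus(&dd, 3));
	printf("transcode-like bonus: %d\n", throughput_bonus(&enc, 3));
	return 0;	/* prints 4 for the dd-like task, 0 for the hog */
}
------------------------------------------------------------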

>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5334 paolo 33 0 115m 18m 2428 R 78.1 3.7 0:22.70 transcode
> 5349 paolo 28 0 50352 4516 1912 S 9.0 0.9 0:02.41 tcdecode
> 5338 paolo 25 0 115m 18m 2428 S 4.7 3.7 0:01.29 transcode
> 5363 paolo 27 0 4952 1520 412 R 4.7 0.3 0:00.25 dd
> 5342 paolo 33 0 20352 1140 920 S 1.6 0.2 0:00.21 tcdemux
> 5351 paolo 25 0 115m 18m 2428 S 0.8 3.7 0:00.23 transcode
> 5144 root 22 0 166m 16m 3120 S 0.4 3.3 0:01.85 X
> 5344 paolo 23 0 47308 2540 1996 S 0.4 0.5 0:00.13 tcdecode
> 5347 paolo 27 0 20356 1144 920 S 0.4 0.2 0:00.10 tcdemux
> 5231 paolo 22 0 86660 22m 15m S 0.2 4.5 0:00.95 konsole
> 5271 paolo 25 0 98476 37m 9396 S 0.2 7.5 0:01.54 perl
> 5341 paolo 23 0 22672 1268 1020 S 0.2 0.2 0:00.02 tccat
>
>
> 6) zaphod: more or less like spa_svr

Zaphod includes the throughput bonus in its armoury, which is why it is
similar in performance to spa_svr.

Thanks for this data. It will enable me to make some mods to the
spa_xxx and zaphod schedulers.

Peter
--
Peter Williams pwil...@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

Paolo Ornati

Dec 28, 2005, 12:22:29 PM
to Paolo Ornati, Con Kolivas, Linux Kernel Mailing List, Ingo Molnar
On Wed, 28 Dec 2005 12:35:47 +0100
Paolo Ornati <orn...@fastwebnet.it> wrote:

> > This latter thing sounds more like your transcode job pushed everything out to
> > swap... You need to instrument this case better.
> >
>
> I don't know. The combination Swapped Out Programs + "normal" priority
> strangeness can potentially result in a total disaster... but why
> renicing transcode to "0" gets the system usable again?
>
> Next time I'll grab "/proc/meminfo"... and what other info can help to
> understand?

I'm doing some tests with "ingosched" and a somewhat longer transcoding
session, and I've found a strange thing.

DD running time can change a LOT between one run and another (always
while transcode is running):

----------------------------------------------------------------------


paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m15.754s
user 0m0.000s
sys 0m0.500s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m52.966s
user 0m0.000s
sys 0m0.504s

paolo@tux /mnt $ mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null bs=1M count=256; umount space/
256+0 records in
256+0 records out

real 0m48.928s
user 0m0.004s
sys 0m0.524s
---------------------------------------------------------------------

Looking at top while running these, I've seen that this is related to
how the priority of transcode (the one that eats a lot of CPU) changes.

When only transcoding, its priority is 16, but when the DD test is also
running that value fluctuates between 16 and 18.

DD's priority is always 18 instead.

Usually transcode's prio goes to 17/18 and DD runs in 15/20s, but
sometimes it doesn't fluctuate, staying stuck at 16, and DD runs in ~50s.


PS:
I'm not yet able to reproduce the "totally unusable system" I was
talking about.

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

Paolo Ornati

Dec 28, 2005, 12:38:31 PM
to Paolo Ornati, Con Kolivas, Linux Kernel Mailing List, Ingo Molnar
On Wed, 28 Dec 2005 18:23:23 +0100
Paolo Ornati <orn...@fastwebnet.it> wrote:

> Usually transcode's prio go to 17/18 and DD runs in 15/20s, but
> sometimes it doesn't fluctuate staying stuck to 16 and DD runs in ~50s.

And now I've noticed that when that priority stops fluctuating, it stops
forever. Running the DD test again and again doesn't change anything!

For some reason (running time?) the transcode priority is stuck at 16
and DD always performs very badly: ~50s (normally it should be 8s).

When I noticed this, the transcode test had a real running time of
about 35/40 min.

Paolo Ornati

Dec 28, 2005, 2:46:24 PM
to Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar

Ok, but keep in mind that these numbers are just "snapshots". With
almost all schedulers the priority of the CPU-eater transcode and other
processes fluctuates a bit (an exception here is nicksched, which gives
priority 40 to transcode and never changes it).

It also seems that small variations in priority can seriously affect
my DD test case's running time (especially, I think, with schedulers that
give "transcode" a better-or-equal priority than "dd" -->
ingosched/staircase).

This is another mail, in which you weren't CC'd, that explains it better:

http://www.ussg.iu.edu/hypermail/linux/kernel/0512.3/0647.html

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

Nick Piggin

Dec 28, 2005, 10:15:59 PM
to Peter Williams, Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar
Peter Williams wrote:
> Paolo Ornati wrote:

>> 1) nicksched: perfect! This is the behaviour I want.
>>

..
>>
>> transcode get recognized for what it is, and I/O bounded processes
>> don't even notice that it is running :)
>
>
> Interesting. This one's more or less a dead scheduler and hasn't had
> any development work done on it for some time. I just keep porting the
> original version to new kernels.
>

It isn't a dead scheduler any more than any of the other out of tree
schedulers are (which isn't saying much, unfortunately).

I've probably got a small number of cleanups and microoptimisations
relative to what you have (I can't remember exactly what you sucked up)
.. but other than that there hasn't been much development work done for
some time because there is not much wrong with it.

--
SUSE Labs, Novell Inc.


Peter Williams

Dec 28, 2005, 10:38:23 PM
to Nick Piggin, Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar
Nick Piggin wrote:
> Peter Williams wrote:
>
>> Paolo Ornati wrote:
>
>
>>> 1) nicksched: perfect! This is the behaviour I want.
>>>
> ...

>
>>>
>>> transcode get recognized for what it is, and I/O bounded processes
>>> don't even notice that it is running :)
>>
>>
>>
>> Interesting. This one's more or less a dead scheduler and hasn't had
>> any development work done on it for some time. I just keep porting
>> the original version to new kernels.
>>
>
> It isn't a dead scheduler any more than any of the other out of tree
> schedulers are (which isn't saying much, unfortunately).

Ingosched, staircase and my SPA schedulers are all evolving slowly.
Are there any out there that I don't have in PlugSched that you think
should be?

>
> I've probably got a small number of cleanups and microoptimisations
> relative to what you have (I can't remember exactly what you sucked up)

> ... but other than that there hasn't been much development work done for


> some time because there is not much wrong with it.
>

I was starting to think that you'd lost interest in this which is why I
said it was more or less dead. Sorry.

Peter
--
Peter Williams pwil...@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

Nick Piggin

Dec 29, 2005, 3:11:43 AM
to Peter Williams, Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar
Peter Williams wrote:
> Nick Piggin wrote:

>> It isn't a dead scheduler any more than any of the other out of tree
>> schedulers are (which isn't saying much, unfortunately).
>
>
> Ingosched, staircase and my SPA schedulers are all evolving slowly.
> Are there any out there that I don't have in PlugSched that you think
> should be?
>

Not that I know of...

>>
>> I've probably got a small number of cleanups and microoptimisations
>> relative to what you have (I can't remember exactly what you sucked up)
>> ... but other than that there hasn't been much development work done for
>> some time because there is not much wrong with it.
>>
>
> I was starting to think that you'd lost interest in this which is why I
> said it was more or less dead. Sorry.
>

No worries. I haven't lost interest so much as people seem to be fairly
happy with the current scheduler and at least aren't busting my door down
for updates to nicksched ;)

I'll do a resynch for 2.6.15 though.

--
SUSE Labs, Novell Inc.



Paolo Ornati

Dec 30, 2005, 8:54:09 AM
to Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
WAS: [SCHED] Totally WRONG priority calculation with specific test-case
(since 2.6.10-bk12)
http://lkml.org/lkml/2005/12/27/114/index.html

On Wed, 28 Dec 2005 10:26:58 +1100
Con Kolivas <ker...@kolivas.org> wrote:

> The issue is that the scheduler interactivity estimator is a state machine and
> can be fooled to some degree, and a cpu intensive task that just happens to
> sleep a little bit gets significantly better priority than one that is fully
> cpu bound all the time. Reverting that change is not a solution because it
> can still be fooled by the same process sleeping lots for a few seconds or so
> at startup and then changing to the cpu mostly-sleeping slightly behaviour.
> This "fluctuating" behaviour is in my opinion worse which is why I removed
> it.

Trying to find an "as simple as possible" test case for this problem
(which I consider a BUG in priority calculation), I've come up with this
very simple program:

------ sched_fooler.c -------------------------------
#include <stdlib.h>
#include <unistd.h>

static void burn_cpu(unsigned int x)
{
	static char buf[1024];
	int i;

	for (i = 0; i < x; ++i)
		buf[i % sizeof(buf)] = (x - i) * 3;
}

int main(int argc, char **argv)
{
	unsigned long burn;

	if (argc != 2)
		return 1;
	burn = (unsigned long)atoi(argv[1]);
	while (1) {
		burn_cpu(burn * 1000);
		usleep(1);
	}
	return 0;
}
----------------------------------------------

paolo@tux ~/tmp/sched_fooler $ gcc sched_fooler.c
paolo@tux ~/tmp/sched_fooler $ ./a.out 5000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5633 paolo 15 0 2392 320 252 S 77.7 0.1 0:18.84 a.out
5225 root 15 0 169m 19m 2956 S 2.0 3.9 0:13.17 X
5307 paolo 15 0 100m 22m 13m S 2.0 4.4 0:04.32 kicker


paolo@tux ~/tmp/sched_fooler $ ./a.out 10000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5637 paolo 16 0 2396 320 252 R 87.4 0.1 0:13.38 a.out
5312 paolo 16 0 86636 22m 15m R 0.1 4.5 0:02.02 konsole
1 root 16 0 2552 560 468 S 0.0 0.1 0:00.71 init


If I run just 2 of these together the system becomes anything but
interactive ;)

./a.out 10000 & ./a.out 4000

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out
5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out


DD TEST: the usual DD test (reading 128 MB from a non-cached file =
disk bound) says it all; numbers with 2.6.15-rc7:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m4.052s
user 0m0.004s
sys 0m0.180s

START 2 OF THEM "./a.out 10000 & ./a.out 4000"

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 2m4.578s
user 0m0.000s
sys 0m0.244s


This doesn't surprise me anymore, since DD gets priority 18 and these
two CPU eaters get 15/16.

As usual "nicksched" is NOT affected at all, IOW it gives these CPU
eaters the priority they deserve.

--
Paolo Ornati
Linux 2.6.15-rc7-g3603bc8d on x86_64

Peter Williams

Dec 30, 2005, 9:13:23 PM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
> ./a.out 10000 & ./a.out 4000
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5714 paolo 15 0 2392 320 252 S 59.2 0.1 0:12.28 a.out
> 5713 paolo 16 0 2396 320 252 S 37.1 0.1 0:07.63 a.out
>
>
> DD TEST: the usual DD test (to read 128 MB from a non-cached file =
> disk bounded) says everything, numbers with 2.6.15-rc7:
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 0m4.052s
> user 0m0.004s
> sys 0m0.180s
>
> START 2 OF THEM "./a.out 10000 & ./a.out 4000"
>
> paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
> 128+0 records in
> 128+0 records out
>
> real 2m4.578s
> user 0m0.000s
> sys 0m0.244s
>
>
> This does't surprise me anymore, since DD gets priority 18 and these
> two CPU eaters get 15/16.
>
> As usual "nicksched" is NOT affected at all, IOW it gives to these CPU
> eaters the priority that they deserve.
>

Attached is a testing version of a patch, which I'm working on, that
modifies the scheduler bonus calculations. Although these changes aren't
targeted at the problem you are experiencing, I believe that they may
help. My testing shows that sched_fooler instances don't get any
bonuses, but I would appreciate it if you could try it out.

It is a patch against the 2.6.15-rc7 kernel and includes some other
scheduling patches from the -mm kernels.

Thanks

lial-test.patch

Mike Galbraith

Dec 31, 2005, 3:14:46 AM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
At 02:52 PM 12/30/2005 +0100, Paolo Ornati wrote:
>WAS: [SCHED] Totally WRONG prority calculation with specific test-case
>(since 2.6.10-bk12)
>http://lkml.org/lkml/2005/12/27/114/index.html
>
>On Wed, 28 Dec 2005 10:26:58 +1100
>Con Kolivas <ker...@kolivas.org> wrote:
>
> > The issue is that the scheduler interactivity estimator is a state
> machine and
> > can be fooled to some degree, and a cpu intensive task that just
> happens to
> > sleep a little bit gets significantly better priority than one that is
> fully
> > cpu bound all the time. Reverting that change is not a solution because it
> > can still be fooled by the same process sleeping lots for a few seconds
> or so
> > at startup and then changing to the cpu mostly-sleeping slightly
> behaviour.
> > This "fluctuating" behaviour is in my opinion worse which is why I removed
> > it.
>
>Trying to find a "as simple as possible" test case for this problem
>(that I consider a BUG in priority calculation) I've come up with this
>very simple program:
>
>------ sched_fooler.c -------------------------------

Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
little proggy. Taking a quick peek at the rt scheduler changes, nothing
poked me in the eye, but by golly, I can't get this kernel to act up,
whereas 2.6.14-virgin does.

-Mike (off to stare harder at the rt patch)

Paolo Ornati

Dec 31, 2005, 5:35:17 AM
to Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
On Sat, 31 Dec 2005 13:06:04 +1100
Peter Williams <pwil...@bigpond.net.au> wrote:

> Attached is a testing version of a patch for modifying scheduler bonus
> calculations that I'm working on. Although these changes aren't
> targetted at the problem you are experiencing I believe that they may
> help. My testing shows that sched_fooler instances don't get any
> bonuses but I would appreciate it if you could try it out.
>
> It is a patch against the 2.6.15-rc7 kernel and includes some other
> scheduling patches from the -mm kernels.

Yes, this fixes both of my test cases (transcode & the little program);
they get priority 25 instead of ~16.

But the priority of DD is now ~23, so it still suffers a bit:

paolo@tux /mnt $ mount space/; time dd if=space/bigfile of=/dev/null bs=1M count=128; umount space/
128+0 records in
128+0 records out

real 0m8.549s (instead of 4s)
user 0m0.000s
sys 0m0.209s

--
Paolo Ornati
Linux 2.6.15-rc7-lial on x86_64

Paolo Ornati

Dec 31, 2005, 5:53:05 AM
to Paolo Ornati, Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
On Sat, 31 Dec 2005 11:34:46 +0100
Paolo Ornati <orn...@fastwebnet.it> wrote:

> > It is a patch against the 2.6.15-rc7 kernel and includes some other
> > scheduling patches from the -mm kernels.
>
> Yes, this fixes both my test-case (transcode & little program), they
> get priority 25 instead of ~16.
>
> But the priority of DD is now ~23 and so it still suffers a bit:

I forgot to mention that the other "interactive" processes don't get a
good priority either.

Xorg, for example, while only moving the cursor around, gets priority
23/24. And when the cpu-eaters are running (at priority 25) it isn't happy
at all; the cursor begins to move in jerks and so on...

Paolo Ornati

Dec 31, 2005, 6:01:13 AM
to Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <efa...@gmx.de> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> little proggy. Taking a quick peek at the rt scheduler changes, nothing
> poked me in the eye, but by golly, I can't get this kernel to act up,
> whereas 2.6.14-virgin does.

Mmm... I get an endless stream of init segfaults trying to boot it. I've
tried disabling CONFIG_CC_OPTIMIZE_FOR_SIZE but it didn't help.

I'll try later with a simpler ".config".

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

Con Kolivas

Dec 31, 2005, 6:12:41 AM
to Paolo Ornati, Peter Williams, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin
On Saturday 31 December 2005 21:52, Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
>
> Paolo Ornati <orn...@fastwebnet.it> wrote:
> > > It is a patch against the 2.6.15-rc7 kernel and includes some other
> > > scheduling patches from the -mm kernels.
> >
> > Yes, this fixes both my test-case (transcode & little program), they
> > get priority 25 instead of ~16.
> >
> > But the priority of DD is now ~23 and so it still suffers a bit:
>
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
>
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...

This is why Ingo, Nick and I think that a tweak to the heavily field-tested
current cpu scheduler is best for 2.6, rather than any gutting and
replacement of the interactivity estimator (of which this scheme, even though
it is simple and easy to understand, clearly is an example). Given that we
have a "2.6 forever" policy, it also means any significant cpu scheduler
rewrite, even just of the interactivity estimator, is nigh on impossible to
implement.

Cheers,
Con

Peter Williams

Dec 31, 2005, 8:45:45 AM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
Paolo Ornati wrote:
> On Sat, 31 Dec 2005 11:34:46 +0100
> Paolo Ornati <orn...@fastwebnet.it> wrote:
>
>
>>>It is a patch against the 2.6.15-rc7 kernel and includes some other
>>>scheduling patches from the -mm kernels.
>>
>>Yes, this fixes both my test-case (transcode & little program), they
>>get priority 25 instead of ~16.
>>
>>But the priority of DD is now ~23 and so it still suffers a bit:
>
>
> I forgot to mention that even the others "interactive" processes
> don't get a good priority too.
>
> Xorg for example, while only moving the cursor around, gets priority
> 23/24. And when cpu-eaters are running (at priority 25) it isn't happy
> at all, the cursor begins to move in jerks and so on...
>

OK. This probably means that the parameters that control the mechanism
need tweaking.

There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
contains the latency (in nanoseconds) that the scheduler considers
unacceptable for interactive programs. Try changing that value and see
if things improve? Making it smaller should help but if you make it too
small all the interactive tasks will end up with the same priority and
this could cause them to get in each other's way.

Thanks,


Peter
--
Peter Williams pwil...@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

Paolo Ornati

Dec 31, 2005, 10:12:45 AM
to Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Sat, 31 Dec 2005 09:13:24 +0100
Mike Galbraith <efa...@gmx.de> wrote:

> Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> little proggy. Taking a quick peek at the rt scheduler changes, nothing
> poked me in the eye, but by golly, I can't get this kernel to act up,
> whereas 2.6.14-virgin does.

Ok, I've successfully booted 2.6.15-rc7-rt1 (I think I was
having trouble with Thread Softirqs and/or Thread Hardirqs).

First thing: I have preemption disabled, but it shouldn't matter too much
since we are talking about priority calculation...

1) My program isn't defeated at all. If I start it with the same args
as in the previous examples it "seems" defeated, but it isn't.

Lowering the "cpu burn argument", I can reproduce the problem again:

"./a.out 200 & ./a.out 333"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init


2) Priority fluctuation - very interesting: playing with the only arg
my program has I've found this:

./a.out 200


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5628 paolo 15 0 2392 320 252 R 48.5 0.1 0:18.34 a.out

./a.out 300


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5633 paolo 15 0 2392 324 252 S 50.1 0.1 0:09.42 a.out

./a.out 400


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5634 paolo 15 0 2392 320 252 S 66.7 0.1 0:06.31 a.out

./a.out 500


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5638 paolo 25 0 2396 320 252 R 67.7 0.1 0:14.78 a.out

./a.out 700


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5640 paolo 15 0 2392 320 252 S 80.1 0.1 0:25.88 a.out

./a.out 800


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5644 paolo 17 0 2396 320 252 R 79.6 0.1 0:26.54 a.out


In the "./a.out 500" case, the priority starts at something like 16 and
then slowly go up to 25 _BUT_ if I start my DD test my cpu-eater
priority goes quickly to 16!

The real world test case (transcode) is a bit harder to describe: its
priority usually goes up to 25, sometimes I've seen it fluctuating a
bit (like go to 19 and then back to 25).

When I start my DD test I've seen basically 2 different behaviours:

A) good


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5788 paolo 25 0 114m 18m 2440 R 82.2 3.7 0:10.16 transcode
5804 paolo 15 0 49860 4500 1896 S 8.5 0.9 0:00.99 tcdecode
5808 paolo 18 0 4952 1520 412 D 5.0 0.3 0:00.36 dd

B) bad


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5743 paolo 18 0 114m 18m 2440 R 75.0 3.7 0:26.79 transcode
5759 paolo 15 0 49864 4500 1896 S 7.8 0.9 0:02.71 tcdecode
5750 paolo 16 0 19840 1136 916 S 1.5 0.2 0:00.23 tcdemux
5201 root 15 0 167m 17m 3336 S 0.8 3.5 0:19.38 X
5764 paolo 18 0 4948 1520 412 R 0.7 0.3 0:00.04 dd

Sometimes A happens and sometimes B happens...

PS: these numbers are probably not 100% reproducible... this is what
happens on my PC.

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

Paolo Ornati

Dec 31, 2005, 11:32:40 AM
to Peter Williams, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
On Sun, 01 Jan 2006 00:44:10 +1100
Peter Williams <pwil...@bigpond.net.au> wrote:

> OK. This probably means that the parameters that control the mechanism
> need tweaking.
>
> There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
> contains the latency (in nanoseconds) that the scheduler considers
> unacceptable for interactive programs. Try changing that value and see
> if things improve? Making it smaller should help but if you make it too
> small all the interactive tasks will end up with the same priority and
> this could cause them to get in each other's way.

I've tried different values, and sometimes I've got a good feeling, BUT
the behaviour is too erratic to say anything definite.

Sometimes I get what I want (dd priority ~17 and CPU eaters prio
25), sometimes I get a total disaster (dd priority 17 and CPU eaters
prio 15/16) and sometimes I get something in between, like DD prio 22
and CPU eaters 23/24.

None of this correlates well with the "unacceptable_ia_latency" values.

What I think is that the priority calculation in ingosched and some other
schedulers is in general too weak, while in others it is rock
solid (read: nicksched).

Maybe it's just that the smarter a scheduler wants to be, the more fragile
it will be.

--
Paolo Ornati
Linux 2.6.15-rc7-lial on x86_64

Mike Galbraith

Dec 31, 2005, 11:38:10 AM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
At 04:11 PM 12/31/2005 +0100, Paolo Ornati wrote:
>On Sat, 31 Dec 2005 09:13:24 +0100
>Mike Galbraith <efa...@gmx.de> wrote:
>
> > Ingo seems to have done something in 2.6.15-rc7-rt1 which defeats your
> > little proggy. Taking a quick peek at the rt scheduler changes, nothing
> > poked me in the eye, but by golly, I can't get this kernel to act up,
> > whereas 2.6.14-virgin does.
>
>Ok, I've sucessfully booted 2.6.15-rc7-rt1 (I think that I was
>having troubles with Thread Softirqs and/or Thread Hardirqs).
>
>First thing: I've preemption disabled, but it shouldn't matter too much
>since we are talking about priority calculation...

Mine is fully preemptible.

>1) My program isn't defeated at all. If I start it with the same args
>of the previous examples it "seems" defeated, but it isn't.
>
>Lowering the "cpu burn argument" I can reproduce the problem again:
>
>"./a.out 200 & ./a.out 333"
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
> 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
> 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init

Strange. Using the exact same arguments, I do see some odd bouncing up to
high priorities, but they spend the vast majority of their time down at 25.

-Mike

Paolo Ornati

Dec 31, 2005, 12:24:57 PM
to Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <efa...@gmx.de> wrote:

> >"./a.out 200 & ./a.out 333"
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 5607 paolo 15 0 2396 320 252 R 56.1 0.1 0:06.79 a.out
> > 5606 paolo 15 0 2396 324 252 R 38.7 0.1 0:04.55 a.out
> > 1 root 16 0 2556 552 468 S 0.0 0.1 0:00.28 init
>
> Strange. Using the exact same arguments, I do see some odd bouncing up to
> high priorities, but they spend the vast majority of their time down at 25.

You shouldn't use "the exact same numbers"; you should try different
args and see if you can reproduce the problem. Or maybe preemption
makes some difference... I'll try with PREEMPT enabled and see.

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

Paolo Ornati

Dec 31, 2005, 12:42:47 PM
to Paolo Ornati, Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Sat, 31 Dec 2005 18:24:40 +0100
Paolo Ornati <orn...@fastwebnet.it> wrote:

> You shouldn't use "the same exact numbers", you should try different
> args and see if you can reproduce the problem. Or maybe preemption
> make some difference... I'll try with PREEMPT enabled and see.

Ok, I've just tried with Complete Preemption: I can easily reproduce the
problem.

For example:

"./a.out 1000 & ./a.out 1239"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5433 paolo 16 0 2396 324 252 R 50.3 0.1 0:34.67 a.out
5434 paolo 16 0 2392 320 252 R 47.4 0.1 0:30.76 a.out
265 root -48 -5 0 0 0 S 0.6 0.0 0:00.68 IRQ 12
5261 root 15 0 166m 16m 2844 S 0.6 3.3 0:04.81 X
5349 paolo 15 0 86640 22m 15m S 0.6 4.5 0:01.36 konsole
5344 paolo 15 0 98.8m 20m 12m S 0.3 4.1 0:01.64 kicker
5444 paolo 18 0 4948 1520 412 R 0.3 0.3 0:00.05 dd

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

Peter Williams

Dec 31, 2005, 5:05:34 PM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin
Paolo Ornati wrote:
> On Sun, 01 Jan 2006 00:44:10 +1100
> Peter Williams <pwil...@bigpond.net.au> wrote:
>
>
>>OK. This probably means that the parameters that control the mechanism
>>need tweaking.
>>
>>There should be a file /sys/cpusched/attrs/unacceptable_ia_latency which
>>contains the latency (in nanoseconds) that the scheduler considers
>>unacceptable for interactive programs. Try changing that value and see
>>if things improve? Making it smaller should help but if you make it too
>>small all the interactive tasks will end up with the same priority and
>>this could cause them to get in each other's way.
>
>
> I've tried different values and sometimes I've got a good feeling BUT
> the behaviour is too strange to say something.
>
> Sometimes I get what I want (dd priority ~17 and CPU eaters prio
> 25), sometimes I get a total disaster (dd priority 17 and CPU eaters
> prio 15/16) and sometimes I get something like DD prio 22 and CPU
> eaters 23/24.
>
> All this is not well related to "unacceptable_ia_latency" values.

OK. Thanks for trying it.

The feedback will be helpful in trying to improve the mechanisms.

>
> What I think is that the priority calculation in ingosched and other
> schedulers is in general too weak, while in other schedulers is rock
> solid (read: nicksched).
>
> Maybe is just that the smarter a scheduler want to be, the more fragile
> it will be.
>

Probably, but this one is fairly simple.

I think the remaining problem with interactive responsiveness is that
bonuses increase too slowly when a latency problem is detected. I.e. a
task just gets one extra bonus point when an unacceptable latency is
detected, regardless of how big the latency is. This means that it may
take several cycles for the bonus to be big enough to solve the problem.
I'm going to try making the bonus increment proportional to the size
of the latency w.r.t. the limit.
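
A minimal sketch of that idea, assuming the unacceptable_ia_latency limit
mentioned earlier in this thread; everything else here (names, the bonus cap,
the step rule) is invented for illustration and is not the actual patch:

------ proportional_bonus_sketch.c (illustrative only) ------
#include <stdio.h>

#define MAX_IA_BONUS	10	/* made-up cap */

/*
 * Instead of adding one bonus point per cycle in which an unacceptable
 * latency is seen, jump by roughly (measured latency / limit) points
 * in one go, so a large latency spike is corrected in a single step.
 */
static int grow_ia_bonus(int bonus, unsigned long long measured_ns,
			 unsigned long long unacceptable_ia_latency_ns)
{
	if (measured_ns > unacceptable_ia_latency_ns) {
		bonus += (int)(measured_ns / unacceptable_ia_latency_ns);
		if (bonus > MAX_IA_BONUS)
			bonus = MAX_IA_BONUS;
	}
	return bonus;
}

int main(void)
{
	/* latency 3.5x the limit: the bonus jumps by 3 points at once */
	printf("%d\n", grow_ia_bonus(2, 3500000ULL, 1000000ULL));
	return 0;
}
--------------------------------------------------------------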

Peter
--
Peter Williams pwil...@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

Paolo Ornati

Jan 1, 2006, 6:39:58 AM
to Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Sat, 31 Dec 2005 17:37:11 +0100
Mike Galbraith <efa...@gmx.de> wrote:

> Strange. Using the exact same arguments, I do see some odd bouncing up to
> high priorities, but they spend the vast majority of their time down at 25.

Mmmm... to make it more easily reproducible I've enlarged the sleep time
(1 microsecond is likely to be rounded up too much and to give different
results on different hardware/kernel/config...).

Compile this _without_ optimizations and try again:

------------------------------------------------
#include <stdlib.h>
#include <unistd.h>

char buf[1024];

static void burn_cpu(unsigned int x)
{
	int i;

	for (i = 0; i < x; ++i)
		buf[i % sizeof(buf)] = (x - i) * 3;
}

int main(int argc, char **argv)
{
	unsigned long burn;

	if (argc != 2)
		return 1;
	burn = (unsigned long)atoi(argv[1]);

	if (!burn)
		return 1;
	while (1) {
		burn_cpu(burn * 1000);
		usleep(10000);
	}
	return 0;
}
-----------------------------------------


With "./a.out 3000" (and 2.6.15-rc7-rt1) I get this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5485 paolo 15 0 2396 320 252 R 62.7 0.1 0:09.77 a.out


Try different values: 1000, 2000, 3000 ... are you able to reproduce it
now?


If yes, try to start 2 of them with something like this:

"./a.out 3000 & ./a.out 3161"

so they are NOT synchronized and they use almost all the CPU time:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out
5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out

This is the bad situation I hate: some cpu-eaters that eat all the CPU
time BUT have a really good priority only because they sleep a bit.

--
Paolo Ornati
Linux 2.6.15-rc7-rt1 on x86_64

Mike Galbraith

Jan 2, 2006, 4:16:17 AM
to Paolo Ornati, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
At 12:39 PM 1/1/2006 +0100, Paolo Ornati wrote:

>On Sat, 31 Dec 2005 17:37:11 +0100
>Mike Galbraith <efa...@gmx.de> wrote:
>
> > Strange. Using the exact same arguments, I do see some odd bouncing up to
> > high priorities, but they spend the vast majority of their time down at 25.
>
>Mmmm... to make it more easly reproducible I've enlarged the sleep time
>(1 microsecond is likely to be rounded too much and give different
>results on different hardware/kernel/config...).
>
>Compile this _without_ optimizations and try again:

<snip>

>Try different values: 1000, 2000, 3000 ... are you able to reproduce it
>now?

Yeah. One instance running has to sustain roughly _95%_ cpu before it's
classified as a cpu piggy. Not good.

>If yes, try to start 2 of them with something like this:
>
>"./a.out 3000 & ./a.out 3161"
>
>so they are NOT syncronized and they use almost all the CPU time:
>

> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

> 5582 paolo 16 0 2396 320 252 S 45.7 0.1 0:05.52 a.out
> 5583 paolo 15 0 2392 320 252 S 45.7 0.1 0:05.49 a.out
>
>This is the bad situation I hate: some cpu-eaters that eat all the CPU
>time BUT have a really good priority only because they sleeps a bit.

Yup, your proggy fools the interactivity estimator quite well. This
problem was addressed a long time ago, and thought to be more or less
cured. Guess not.

Paolo Ornati

Jan 2, 2006, 4:53:13 AM
to Mike Galbraith, Linux Kernel Mailing List, Con Kolivas, Ingo Molnar, Nick Piggin, Peter Williams
On Mon, 02 Jan 2006 10:15:43 +0100
Mike Galbraith <efa...@gmx.de> wrote:

> >This is the bad situation I hate: some cpu-eaters that eat all the CPU
> >time BUT have a really good priority only because they sleeps a bit.
>
> Yup, your proggy fools the interactivity estimator quite well. This
> problem was addressed a long time ago, and thought to be more or less
> cured. Guess not.

In my original real-life test case (transcode) I found that the problem
started with the removal of "interactive_credit":

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/6aa5c93c379ae9e1/a9a83db6446edaf7?lnk=st&q=insubject%3Asched+author%3APaolo+author%3AOrnati&rnum=1&hl=en#a9a83db6446edaf7

That is not actually true... in fact that change only exposed the
problem for that particular test case.

With my little proggy I'm now able to reproduce the problem even with
"interactive_credit" still present (for example with a 2.6.10 kernel).

That said, and since "nicksched" doesn't have this problem at all, it
is a problem of ingosched (and others as well).

--
Paolo Ornati
Linux 2.6.15-rc7-plugsched on x86_64

Con Kolivas

unread,
Jan 12, 2006, 8:14:14 PM1/12/06
to Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> (since 2.6.10-bk12)
> http://lkml.org/lkml/2005/12/27/114/index.html
>
> On Wed, 28 Dec 2005 10:26:58 +1100
>
> Con Kolivas <ker...@kolivas.org> wrote:
> > The issue is that the scheduler interactivity estimator is a state
> > machine and can be fooled to some degree, and a cpu intensive task that
> > just happens to sleep a little bit gets significantly better priority
> > than one that is fully cpu bound all the time. Reverting that change is
> > not a solution because it can still be fooled by the same process
> > sleeping a lot for a few seconds at startup and then changing to the
> > mostly-cpu, slightly-sleeping behaviour. This "fluctuating" behaviour is,
> > in my opinion, worse, which is why I removed it.
>
> Trying to find an "as simple as possible" test case for this problem
> (that I consider a BUG in priority calculation) I've come up with this
> very simple program:

Hi Paolo.

Can you try the following patch on 2.6.15 please? I'm interested in how
adversely this affects interactive performance as well as whether it helps
your test case.

Thanks,
Con

---
include/linux/sched.h | 9 +++++-
kernel/sched.c | 72 ++++++++++++++++++++++----------------------------
2 files changed, 41 insertions(+), 40 deletions(-)

Index: linux-2.6.15/include/linux/sched.h
===================================================================
--- linux-2.6.15.orig/include/linux/sched.h
+++ linux-2.6.15/include/linux/sched.h
@@ -683,6 +683,13 @@ static inline void prefetch_stack(struct
struct audit_context; /* See audit.c */
struct mempolicy;

+enum sleep_type {
+ SLEEP_NORMAL,
+ SLEEP_NONINTERACTIVE,
+ SLEEP_INTERACTIVE,
+ SLEEP_INTERRUPTED,
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -704,7 +711,7 @@ struct task_struct {
unsigned long sleep_avg;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
- int activated;
+ enum sleep_type sleep_type;

unsigned long policy;
cpumask_t cpus_allowed;
Index: linux-2.6.15/kernel/sched.c
===================================================================
--- linux-2.6.15.orig/kernel/sched.c
+++ linux-2.6.15/kernel/sched.c
@@ -751,31 +751,22 @@ static int recalc_task_prio(task_t *p, u
* prevent them suddenly becoming cpu hogs and starving
* other processes.
*/
- if (p->mm && p->activated != -1 &&
+ if (p->mm && p->sleep_type != SLEEP_NONINTERACTIVE &&
sleep_time > INTERACTIVE_SLEEP(p)) {
p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
DEF_TIMESLICE);
} else {
+
/*
* The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time.
+ * rapidly it will rise with sleep time. This enables
+ * tasks to rapidly recover to a low latency priority.
+ * If a task was sleeping with the noninteractive
+ * label do not apply this non-linear boost
*/
- sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
- /*
- * Tasks waking from uninterruptible sleep are
- * limited in their sleep_avg rise as they
- * are likely to be waiting on I/O
- */
- if (p->activated == -1 && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
- }
+ if (p->sleep_type != SLEEP_NONINTERACTIVE || p->mm)
+ sleep_time *=
+ (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;

/*
* This code gives a bonus to interactive tasks.
@@ -818,11 +809,7 @@ static void activate_task(task_t *p, run
if (!rt_task(p))
p->prio = recalc_task_prio(p, now);

- /*
- * This checks to make sure it's not an uninterruptible task
- * that is now waking up.
- */
- if (!p->activated) {
+ if (p->sleep_type != SLEEP_NONINTERACTIVE) {
/*
* Tasks which were woken up by interrupts (ie. hw events)
* are most likely of interactive nature. So we give them
@@ -831,13 +818,13 @@ static void activate_task(task_t *p, run
* on a CPU, first time around:
*/
if (in_interrupt())
- p->activated = 2;
+ p->sleep_type = SLEEP_INTERRUPTED;
else {
/*
* Normal first-time wakeups get a credit too for
* on-runqueue time, but it will be weighted down:
*/
- p->activated = 1;
+ p->sleep_type = SLEEP_INTERACTIVE;
}
}
p->timestamp = now;
@@ -1356,22 +1343,23 @@ out_activate:
if (old_state == TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible--;
/*
- * Tasks on involuntary sleep don't earn
- * sleep_avg beyond just interactive state.
+ * Tasks waking from uninterruptible sleep are likely
+ * to be sleeping involuntarily on I/O and are otherwise
+ * cpu bound so label them as noninteractive.
*/
- p->activated = -1;
- }
+ p->sleep_type = SLEEP_NONINTERACTIVE;
+ } else

/*
* Tasks that have marked their sleep as noninteractive get
- * woken up without updating their sleep average. (i.e. their
- * sleep is handled in a priority-neutral manner, no priority
- * boost and no penalty.)
+ * woken up with their sleep average not weighted in an
+ * interactive way.
*/
- if (old_state & TASK_NONINTERACTIVE)
- __activate_task(p, rq);
- else
- activate_task(p, rq, cpu == this_cpu);
+ if (old_state & TASK_NONINTERACTIVE)
+ p->sleep_type = SLEEP_NONINTERACTIVE;
+
+
+ activate_task(p, rq, cpu == this_cpu);
/*
* Sync wakeups (i.e. those types of wakeups where the waker
* has indicated that it will leave the CPU in short order)
@@ -2938,6 +2926,12 @@ EXPORT_SYMBOL(sub_preempt_count);

#endif

+static inline int interactive_sleep(enum sleep_type sleep_type)
+{
+ return (sleep_type == SLEEP_INTERACTIVE ||
+ sleep_type == SLEEP_INTERRUPTED);
+}
+
/*
* schedule() is the main scheduler function.
*/
@@ -3063,12 +3057,12 @@ go_idle:
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

- if (!rt_task(next) && next->activated > 0) {
+ if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
unsigned long long delta = now - next->timestamp;
if (unlikely((long long)(now - next->timestamp) < 0))
delta = 0;

- if (next->activated == 1)
+ if (next->sleep_type == SLEEP_INTERACTIVE)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

array = next->array;
@@ -3081,7 +3075,7 @@ go_idle:
} else
requeue_task(next, array);
}
- next->activated = 0;
+ next->sleep_type = SLEEP_NORMAL;
switch_tasks:
if (next == rq->idle)
schedstat_inc(rq, sched_goidle);
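
(A reading aid, inferred purely from the hunks above rather than stated in
the patch: the old integer "activated" states translate to the new enum
roughly like this:)

-----------------------------------------
/* inferred mapping, not part of the patch itself:
 *
 *   p->activated == -1  ->  SLEEP_NONINTERACTIVE  (woke from uninterruptible sleep)
 *   p->activated ==  0  ->  SLEEP_NORMAL          (cleared in schedule())
 *   p->activated ==  1  ->  SLEEP_INTERACTIVE     (normal first-time wakeup)
 *   p->activated ==  2  ->  SLEEP_INTERRUPTED     (woken from interrupt context)
 *
 * plus one behavioural change: TASK_NONINTERACTIVE wakeups now get the
 * SLEEP_NONINTERACTIVE label and still go through activate_task()
 * instead of being short-circuited via __activate_task().
 */
-----------------------------------------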

Con Kolivas

unread,
Jan 12, 2006, 8:32:41 PM1/12/06
to Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
On Friday 13 January 2006 12:13, Con Kolivas wrote:
> On Saturday 31 December 2005 00:52, Paolo Ornati wrote:
> > WAS: [SCHED] Totally WRONG prority calculation with specific test-case
> > (since 2.6.10-bk12)
> > http://lkml.org/lkml/2005/12/27/114/index.html
> >
> > On Wed, 28 Dec 2005 10:26:58 +1100
> >
> > Con Kolivas <ker...@kolivas.org> wrote:
> > > The issue is that the scheduler interactivity estimator is a state
> > > machine and can be fooled to some degree, and a cpu intensive task that
> > > just happens to sleep a little bit gets significantly better priority
> > > than one that is fully cpu bound all the time. Reverting that change is
> > > not a solution because it can still be fooled by the same process
> > > sleeping a lot for a few seconds at startup and then changing to
> > > the mostly-cpu, slightly-sleeping behaviour. This "fluctuating"
> > > behaviour is, in my opinion, worse, which is why I removed it.
> >
> > Trying to find an "as simple as possible" test case for this problem
> > (that I consider a BUG in priority calculation) I've come up with this
> > very simple program:
>
> Hi Paolo.
>
> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

I should make it clear: this patch _will_ adversely affect interactivity,
because your test case desires that I/O bound tasks get higher priority, and
this patch will do that. This means that I/O bound tasks will be more
noticeable now. The question is how much we trade off one for the other.
We are almost certainly biased a little too much towards the interactive side
in the mainline kernel at the moment.

Cheers,
Con

Paolo Ornati

unread,
Jan 13, 2006, 5:46:00 AM1/13/06
to Con Kolivas, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
On Fri, 13 Jan 2006 12:13:11 +1100
Con Kolivas <ker...@kolivas.org> wrote:

> Can you try the following patch on 2.6.15 please? I'm interested in how
> adversely this affects interactive performance as well as whether it helps
> your test case.

"./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
"mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
bs=1M count=256; umount space/"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd

The DD test takes ~20s (instead of 8s).

As you can see DD's priority is now very good (15), but it still suffers
because my test programs also get good priority (15/16).


Things are BETTER on the real test case (transcode): this is because
transcode usually gets priority 16 and "dd" gets 15... so dd is quite
happy.

BUT what is STRANGE is this: usually transcode is stuck at priority 16
using about 88% of the CPU, but sometimes (I don't know how to reproduce
it) its priority climbs to 25 and then stays at 25.

When transcode's priority is 25 the DD test is obviously happy: in
particular 2 things can happen (this is expected because I've observed
this behaviour before):

1) the priority of transcode stays at 25 (when the file transcode is
reading from, through pipes, IS cached).

2) the CPU usage and priority of transcode go down (the file transcode is
reading from ISN'T cached and DD's massive disk usage interferes with
this reading). When DD finishes, transcode's priority goes back to 25.

--
Paolo Ornati
Linux 2.6.15-kolivasPatch on x86_64

Con Kolivas

unread,
Jan 13, 2006, 5:53:25 AM1/13/06
to Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> On Fri, 13 Jan 2006 12:13:11 +1100
>
> Con Kolivas <ker...@kolivas.org> wrote:
> > Can you try the following patch on 2.6.15 please? I'm interested in how
> > adversely this affects interactive performance as well as whether it
> > helps your test case.
>
> "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> bs=1M count=256; umount space/"
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
> 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
> 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
> 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
> 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd
>
> DD test takes ~20 s (instead of 8s).
>
> As you can see DD's priority is now very good (15), but it still suffers
> because my test programs also get good priority (15/16).
>
>
> Things are BETTER on the real test case (transcode): this is because
> transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> happy.

This seems a reasonable compromise. In your "test app" case you are using
quirky code to reproduce the worst case scenario. Given that your quirky
setup is effectively running 3 cpu hogs, slowing down dd from 8s to
20s seems an appropriate slowdown (as opposed to the many minutes you were
getting previously).

See my followup patches that I have posted following "[PATCH 0/5] sched -
interactivity updates". The first 3 patches are what you tested. These
patches are being put up for testing hopefully in -mm.

> BUT what is STRANGE is this: usually transcode is stuck at priority 16
> using about 88% of the CPU, but sometimes (I don't know how to reproduce
> it) its priority climbs to 25 and then stays at 25.
>
> When transcode's priority is 25 the DD test is obviously happy: in
> particular 2 things can happen (this is expected because I've observed
> this behaviour before):
>
> 1) the priority of transcode stays at 25 (when the file transcode is
> reading from, through pipes, IS cached).
>
> 2) the CPU usage and priority of transcode go down (the file transcode is
> reading from ISN'T cached and DD's massive disk usage interferes with
> this reading). When DD finishes, transcode's priority goes back to 25.

I suspect this is entirely dependent on the balance between time spent reading
on disk, waiting on pipe and so on.

Thanks for your test case and testing!

Cheers,
Con

Mike Galbraith

unread,
Jan 13, 2006, 8:02:50 AM1/13/06
to Con Kolivas, Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
>On Friday 13 January 2006 21:46, Paolo Ornati wrote:
> > On Fri, 13 Jan 2006 12:13:11 +1100
> >
> > Con Kolivas <ker...@kolivas.org> wrote:
> > > Can you try the following patch on 2.6.15 please? I'm interested in how
> > > adversely this affects interactive performance as well as whether it
> > > helps your test case.
> >
> > "./a.out 5000 & ./a.out 5237 & ./a.out 5331 &"
> > "mount space/; sync; sleep 1; time dd if=space/bigfile of=/dev/null
> > bs=1M count=256; umount space/"
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 5445 paolo 16 0 2396 288 228 R 34.8 0.1 0:05.84 a.out
> > 5446 paolo 15 0 2396 288 228 S 32.8 0.1 0:05.53 a.out
> > 5444 paolo 16 0 2392 288 228 R 31.3 0.1 0:05.99 a.out
> > 5443 paolo 16 0 10416 1104 848 R 0.2 0.2 0:00.01 top
> > 5451 paolo 15 0 4948 1468 372 D 0.2 0.3 0:00.01 dd
> >
> > DD test takes ~20 s (instead of 8s).
> >
> > As you can see DD's priority is now very good (15), but it still suffers
> > because my test programs also get good priority (15/16).
> >
> >
> > Things are BETTER on the real test case (transcode): this is because
> > transcode usually gets priority 16 and "dd" gets 15... so dd is quite
> > happy.
>
>This seems a reasonable compromise. In your "test app" case you are using
>quirky code to reproduce the worst case scenario. Given that your quirky
>setup is effectively running 3 cpu hogs, slowing down dd from 8s to
>20s seems an appropriate slowdown (as opposed to the many minutes you were
>getting previously).

I'm sorry, but I heartily disagree. It's not a quirky setup, it's just
code that exposes a weakness, just like thud, starve, irman and
irman2. Selectively bumping dd up into the upper tier won't do the other
tasks that are trivially starved to death one bit of good. On a more positive
note, I agree that dd should not be punished for waiting on disk.

>See my followup patches that I have posted following "[PATCH 0/5] sched -
>interactivity updates". The first 3 patches are what you tested. These
>patches are being put up for testing hopefully in -mm.

Then the (buggy) version of my simple throttling patch will need to come
out. (which is OK, I have a debugged potent++ version)

> > BUT what is STRANGE is this: usually transcode is stuck at priority 16
> > using about 88% of the CPU, but sometimes (I don't know how to reproduce
> > it) its priority climbs to 25 and then stays at 25.
> >
> > When transcode's priority is 25 the DD test is obviously happy: in
> > particular 2 things can happen (this is expected because I've observed
> > this behaviour before):
> >
> > 1) the priority of transcode stays at 25 (when the file transcode is
> > reading from, through pipes, IS cached).
> >
> > 2) the CPU usage and priority of transcode go down (the file transcode is
> > reading from ISN'T cached and DD's massive disk usage interferes with
> > this reading). When DD finishes, transcode's priority goes back to 25.
>

>I suspect this is entirely dependent on the balance between time spent
>reading on disk, waiting on pipe and so on.

Grumble. Pipe sleep. That's another pet peeve of mine. Sleep is sleep
whether it's spelled interruptible_pipe or uninterruptible_semaphore.

-Mike

Con Kolivas

unread,
Jan 13, 2006, 9:35:54 AM1/13/06
to Mike Galbraith, Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> >See my followup patches that I have posted following "[PATCH 0/5] sched -
> >interactivity updates". The first 3 patches are what you tested. These
> >patches are being put up for testing hopefully in -mm.
>
> Then the (buggy) version of my simple throttling patch will need to come
> out. (which is OK, I have a debugged potent++ version)

Your code need not be mutually exclusive with mine. I've simply damped the
current behaviour. Your sanity throttling is a good idea.

Con

Mike Galbraith

unread,
Jan 13, 2006, 11:15:54 AM1/13/06
to Con Kolivas, Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams
At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > >See my followup patches that I have posted following "[PATCH 0/5] sched -
> > >interactivity updates". The first 3 patches are what you tested. These
> > >patches are being put up for testing hopefully in -mm.
> >
> > Then the (buggy) version of my simple throttling patch will need to come
> > out. (which is OK, I have a debugged potent++ version)
>
>Your code need not be mutually exclusive with mine. I've simply damped the
>current behaviour. Your sanity throttling is a good idea.

I didn't mean to imply that they're mutually exclusive, and after doing
some testing, I concluded that it (or something like it) is definitely
still needed. The version that's in mm2 _is_ buggy however, so ripping it
back out wouldn't hurt my delicate little feelings one bit. In fact, it
would give me some more time to instrument and test integration with your
changes. (Which I think are good btw because they remove what I considered
to be warts; the pipe and uninterruptible sleep barriers. Um... try irman2
now... pure evilness)

Con Kolivas

unread,
Jan 13, 2006, 9:06:36 PM1/13/06
to Mike Galbraith, Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams, Andrew Morton
On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
> At 01:34 AM 1/14/2006 +1100, Con Kolivas wrote:
> >On Saturday 14 January 2006 00:01, Mike Galbraith wrote:
> > > At 09:51 PM 1/13/2006 +1100, Con Kolivas wrote:
> > > >See my followup patches that I have posted following "[PATCH 0/5]
> > > > sched - interactivity updates". The first 3 patches are what you
> > > > tested. These patches are being put up for testing hopefully in -mm.
> > >
> > > Then the (buggy) version of my simple throttling patch will need to
> > > come out. (which is OK, I have a debugged potent++ version)
> >
> >Your code need not be mutually exclusive with mine. I've simply damped the
> >current behaviour. Your sanity throttling is a good idea.
>
> I didn't mean to imply that they're mutually exclusive, and after doing
> some testing, I concluded that it (or something like it) is definitely
> still needed. The version that's in mm2 _is_ buggy however, so ripping it
> back out wouldn't hurt my delicate little feelings one bit. In fact, it
> would give me some more time to instrument and test integration with your
> changes.

Ok I've communicated this to Andrew (cc'ed here too) so he should remove your
patch pending a new version from you.

> (Which I think are good btw because they remove what I considered
> to be warts; the pipe and uninterruptible sleep barriers.

Yes, I felt your abuse wrt these in an earlier email...

> Um... try irman2
> now... pure evilness)

Hrm, I've been using staircase (which is immune) for so long that I'd all but
forgotten about this test case. Looking at your code, I assume your changes
should help with this?

Con

Mike Galbraith

unread,
Jan 13, 2006, 9:59:00 PM1/13/06
to Con Kolivas, Paolo Ornati, Linux Kernel Mailing List, Ingo Molnar, Nick Piggin, Peter Williams, Andrew Morton
At 01:05 PM 1/14/2006 +1100, Con Kolivas wrote:
>On Saturday 14 January 2006 03:15, Mike Galbraith wrote:
>
> > Um... try irman2
> > now... pure evilness)
>
>Hrm I've been using staircase which is immune for so long I'd all but
>forgotten about this test case. Looking at your code I assume your changes
>should help this?

Yes. How much it helps depends very much on how strictly I try to enforce
it. In my experimental tree, I have four stages of throttling: 1) a threshold
at which to begin trying to consume the difference between measured slice_avg
and sleep_avg (kid gloves), 2) begin treating all new sleep as noninteractive
(stern talking to), 3) cut off new sleep entirely (you're grounded), and
4) start using slice_avg instead of the out-of-balance sleep_avg for the
priority calculation (um, bitch-slap?). Levels 1 and 2 won't stop irman2,
3 will, and especially 4 will.

These are all /proc settings at the moment, so I can set my starvation
pain threshold from super duper desktop (all off) to just as fair as a
running slice completion time average can possibly make it (all at 1ns
differential), and anywhere in between.
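
(Purely as an illustration of the four-stage escalation described above --
this is not the actual throttling patch; every name and threshold below is
invented:)

-----------------------------------------
/* illustrative sketch of a four-stage throttle keyed on how far a task's
 * credited sleep_avg has drifted above its measured slice_avg */
enum throttle_level {
	THROTTLE_OFF,
	THROTTLE_RECLAIM,	/* 1: bleed off the sleep/slice imbalance */
	THROTTLE_DEMOTE,	/* 2: treat all new sleep as noninteractive */
	THROTTLE_NO_SLEEP,	/* 3: stop crediting new sleep entirely */
	THROTTLE_SLICE_AVG,	/* 4: use slice_avg for the priority calc */
};

struct task_hist {
	unsigned long sleep_avg;	/* credited sleep (drives the bonus) */
	unsigned long slice_avg;	/* measured run-time based average */
};

static enum throttle_level throttle_level(const struct task_hist *t,
					  const unsigned long thresh[4])
{
	unsigned long excess = t->sleep_avg > t->slice_avg ?
			       t->sleep_avg - t->slice_avg : 0;

	if (excess >= thresh[3])
		return THROTTLE_SLICE_AVG;
	if (excess >= thresh[2])
		return THROTTLE_NO_SLEEP;
	if (excess >= thresh[1])
		return THROTTLE_DEMOTE;
	if (excess >= thresh[0])
		return THROTTLE_RECLAIM;
	return THROTTLE_OFF;
}
-----------------------------------------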
