
BFS vs. mainline scheduler benchmarks and measurements


Ingo Molnar

Sep 6, 2009, 5:10:07 PM
hi Con,

I've read your BFS announcement/FAQ with great interest:

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

First and foremost, let me say that i'm happy that you are hacking
the Linux scheduler again. It's perhaps proof that hacking the
scheduler is one of the most addictive things on the planet ;-)

I understand that BFS is still early code and that you are not
targeting BFS for mainline inclusion - but BFS is an interesting
and bold new approach, cutting a _lot_ of code out of
kernel/sched*.c, so it raised my curiosity and interest :-)

In the announcement and on your webpage you have compared BFS to
the mainline scheduler in various workloads - showing various
improvements over it. I have tried and tested BFS and ran a set of
benchmarks - this mail contains the results and my (quick)
findings.

So ... to get to the numbers - i've tested both BFS and the tip of
the latest upstream scheduler tree on a testbox of mine. I
intentionally didnt test BFS on any really large box - because you
described its upper limit like this in the announcement:

-----------------------
|
| How scalable is it?
|
| I don't own the sort of hardware that is likely to suffer from
| using it, so I can't find the upper limit. Based on first
| principles about the overhead of locking, and the way lookups
| occur, I'd guess that a machine with more than 16 CPUS would
| start to have less performance. BIG NUMA machines will probably
| suck a lot with this because it pays no deference to locality of
| the NUMA nodes when deciding what cpu to use. It just keeps them
| all busy. The so-called "light NUMA" that constitutes commodity
| hardware these days seems to really like BFS.
|
-----------------------

I generally agree with you that "light NUMA" is what a Linux
scheduler needs to concentrate on (at most) in terms of
scalability. Big NUMA, 4096 CPUs is not very common and we tune the
Linux scheduler for desktop and small-server workloads mostly.

So the testbox i picked fits into the upper portion of what i
consider a sane range of systems to tune for - and should still fit
into BFS's design bracket as well according to your description:
it's a dual quad core system with hyperthreading. It has twice as
many cores as the quad you tested on but it's not excessive and
certainly does not have 4096 CPUs ;-)

Here are the benchmark results:

kernel build performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

pipe performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

messaging performance (hackbench):
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

OLTP performance (postgresql + sysbench)
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

Alas, as can be seen in the graphs, i can not see any BFS
performance improvements on this box.

Here's a more detailed description of the results:

| Kernel build performance
---------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

In the kbuild test BFS is showing significant weaknesses up to 16
CPUs. On 8 CPUs utilized (half load) it's 27.6% slower. All results
(-j1, -j2 ... -j15) are slower. The peak at 100% utilization at -j16
is slightly stronger under BFS, by 1.5%. The 'absolute best' result
is sched-devel at -j64 with 46.65 seconds - the best BFS result is
47.38 seconds (also at -j64), i.e. sched-devel is 1.5% faster.

| Pipe performance
-------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

Pipe performance is a very simple test, two tasks message to each
other via pipes. I measured 1 million such messages:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

The pipe test ran a number of them in parallel:

for ((i=0;i<$NR;i++)); do ~/sched-tests/pipe-test-1m & done; wait

and measured elapsed time. This tests two things: basic scheduler
performance and also scheduler fairness. (if one of these parallel
jobs is delayed unfairly then the test will finish later.)

[ see further below for a simpler pipe latency benchmark as well. ]

As can be seen in the graph BFS performed very poorly in this test:
at 8 pairs of tasks it had a runtime of 45.42 seconds - while
sched-devel finished them in 3.8 seconds.

I saw really bad interactivity in the BFS test here - the system
was starved for as long as the test ran. I stopped the tests at 8
loops - the system was unusable and i was getting IO timeouts due
to the scheduling lag:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
end_request: I/O error, dev sda, sector 81949243
Aborting journal on device sda2.
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

I measured interactivity during this test:

$ time ssh aldebaran /bin/true
real 2m17.968s
user 0m0.009s
sys 0m0.003s

A single command took more than 2 minutes.

| Messaging performance
------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

Hackbench ran better - but mainline sched-devel is significantly
faster for smaller and larger loads as well. With 20 groups
mainline ran 61.5% faster.

| OLTP performance
--------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

As can be seen in the graph for sysbench OLTP performance
sched-devel outperforms BFS on each of the main stages:

single client load ( 1 client - 6.3% faster )
half load ( 8 clients - 57.6% faster )
peak performance ( 16 clients - 117.6% faster )
overload ( 512 clients - 288.3% faster )

| Other tests
--------------

I also tested a couple of other things, such as lat_tcp:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

General interactivity of BFS seemed good to me - except for the
pipe test when there was significant lag over a minute. I think
it's some starvation bug, not an inherent design property of BFS,
so i'm looking forward to re-test it with the fix.

Test environment: i used latest BFS (205 and then i re-ran under
208 and the numbers are all from 208), and the latest mainline
scheduler development tree from:

http://people.redhat.com/mingo/tip.git/README

Commit 840a065 in particular. It's on a .31-rc8 base while BFS is
on a .30 base - will be able to test BFS on a .31 base as well once
you release it. (but it doesnt matter much to the results - there
werent any heavy core kernel changes impacting these workloads.)

The system had enough RAM to have the workloads cached, and i
repeated all tests to make sure it's all representative.
Nevertheless i'd like to encourage others to repeat these (or
other) tests - the more testing the better.

I also tried to configure the kernel in a BFS friendly way, i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:

http://redhat.com/~mingo/misc/config

( Let me know if you need any more info about any of the tests i
conducted. )

Also, i'd like to outline that i agree with the general goals
described by you in the BFS announcement - small desktop systems
matter more than large systems. We find it critically important
that the mainline Linux scheduler performs well on those systems
too - and if you (or anyone else) can reproduce suboptimal behavior
please let the scheduler folks know so that we can fix/improve it.

I hope to be able to work with you on this, please dont hesitate
sending patches if you wish - and we'll also be following BFS for
good ideas and code to adopt to mainline.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Frans Pop

Sep 6, 2009, 10:10:06 PM
Ingo Molnar wrote:
> So the testbox i picked fits into the upper portion of what i
> consider a sane range of systems to tune for - and should still fit
> into BFS's design bracket as well according to your description:
> it's a dual quad core system with hyperthreading.

Ingo,

Nice that you've looked into this.

Would it be possible for you to run the same tests on e.g. a dual core
and/or a UP system (or maybe just offline some CPUs?)? It would be very
interesting to see whether BFS does better in the lower portion of the
range, or if the differences you show between the two schedulers are
consistent across the range.

Cheers,
FJP

Nikos Chantziaras

Sep 6, 2009, 11:40:07 PM
On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>[...]

> Also, i'd like to outline that i agree with the general goals
> described by you in the BFS announcement - small desktop systems
> matter more than large systems. We find it critically important
> that the mainline Linux scheduler performs well on those systems
> too - and if you (or anyone else) can reproduce suboptimal behavior
> please let the scheduler folks know so that we can fix/improve it.

BFS improved behavior of many applications on my Intel Core 2 box in a
way that can't be benchmarked. Examples:

mplayer using OpenGL renderer doesn't drop frames anymore when dragging
and dropping the video window around in an OpenGL composited desktop
(KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
the moment the move starts and at the moment you drop the window back to
the desktop, there's a big frame skip as if mplayer was frozen for a
bit; around 200 or 300ms.)

Composite desktop effects like zoom and fade out don't stall for
sub-second periods of time while there's CPU load in the background. In
other words, the desktop is more fluid and less skippy even during heavy
CPU load. Moving windows around with CPU load in the background doesn't
result in short skips.

LMMS (a tool utilizing real-time sound synthesis) does not produce
"pops", "crackles" and drops in the sound during real-time playback due
to buffer under-runs. Those problems amplify when there's heavy CPU
load in the background, while with BFS heavy load doesn't produce those
artifacts (though LMMS makes itself run SCHED_ISO with BFS). Also,
hitting a key on the keyboard needs less time for the note to become
audible when using BFS. The same should hold true for other tools that
traditionally benefit from the "-rt" kernel sources.

Games like Doom 3 and such don't "freeze" periodically for small amounts
of time (again for sub-second amounts) when something in the background
grabs CPU time (be it my mailer checking for new mail or a cron job, or
whatever.)

And, the most drastic improvement here, with BFS I can do a "make -j2"
in the kernel tree and the GUI stays fluid. Without BFS, things start
to lag, even with in-RAM builds (like having the whole kernel tree
inside a tmpfs) and gcc running with nice 19 and ionice -c 3.

Unfortunately, I can't come up with any way to somehow benchmark all of
this. There's no benchmark for "fluidity" and "responsiveness".
Running the Doom 3 benchmark, or any other benchmark, doesn't say
anything about responsiveness, it only measures how many frames were
calculated in a specific period of time. How "stable" (with no stalls)
those frames were making it to the screen is not measurable.

If BFS implied small drops in pure performance counted in
instructions per second, that would be a totally acceptable regression
for desktop/multimedia/gaming PCs. Not for server machines, of course.
However, on my machine, BFS is faster in classic workloads. When I
run "make -j2" with BFS and the standard scheduler, BFS always finishes
a bit faster. Not by much, but still. One thing I'm noticing here is
that BFS produces 100% CPU load on each core with "make -j2" while the
normal scheduler stays at about 90-95% with -j2 or higher in at least
one of the cores. There seems to be under-utilization of CPU time.

Also, by searching around the net but also through discussions on
various mailing lists, there seems to be a trend: the problems for some
reason seem to occur more often with Intel CPUs (Core 2 chips and lower;
I can't say anything about Core i7) while people on AMD CPUs mostly are
not affected by most or even all of the above. (And due to this, flame
wars often break out, with one party accusing the other of imagining
things.) Can the integrated memory controller on AMD chips have
something to do with this? Do AMD chips generally offer better
"multithreading" behavior? Unfortunately, you didn't mention on what CPU
you ran your tests. If it was AMD, it might be a good idea to run tests
on Pentium and Core 2 CPUs.

For reference, my system is:

CPU: Intel Core 2 Duo E6600 (2.4GHz)
Mainboard: Asus P5E (Intel X38 chipset)
RAM: 6GB (2+2+1+1) dual channel DDR2 800
GPU: RV770 (Radeon HD4870).

Con Kolivas

Sep 7, 2009, 12:00:16 AM
2009/9/7 Ingo Molnar <mi...@elte.hu>:
> hi Con,

Sigh..

Well hello there.

>
> I've read your BFS announcement/FAQ with great interest:
>
>    http://ck.kolivas.org/patches/bfs/bfs-faq.txt

> I understand that BFS is still early code and that you are not


> targeting BFS for mainline inclusion - but BFS is an interesting
> and bold new approach, cutting a _lot_ of code out of
> kernel/sched*.c, so it raised my curiosity and interest :-)

Hard to keep a project under wraps and get an audience at the same
time, it is. I do realise it was inevitable LKML would invade my
personal space no matter how much I didn't want it to, but it would be
rude of me to not respond.

> In the announcement and on your webpage you have compared BFS to
> the mainline scheduler in various workloads - showing various
> improvements over it. I have tried and tested BFS and ran a set of
> benchmarks - this mail contains the results and my (quick)
> findings.

/me sees Ingo run off to find the right combination of hardware and
benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is
and/or how bad bfs is, along with telling people they should use these
artificial benchmarks to determine how good it is, demonstrating yet
again why benchmarks fail the desktop]

I'm not interested in a long protracted discussion about this since
I'm too busy to live linux the way full time developers do, so I'll
keep it short, and perhaps you'll understand my intent better if the
FAQ wasn't clear enough.


Do you know what a normal desktop PC looks like? No, a more realistic
question based on what you chose to benchmark to prove your point
would be: Do you know what normal people actually do on them?


Feel free to treat the question as rhetorical.

Regards,
-ck

/me checks on his distributed computing client's progress, fires up
his next H264 encode, changes music tracks and prepares to have his
arse whooped on quakelive.

Jens Axboe

Sep 7, 2009, 6:00:10 AM
On Sun, Sep 06 2009, Ingo Molnar wrote:
> So ... to get to the numbers - i've tested both BFS and the tip of
> the latest upstream scheduler tree on a testbox of mine. I
> intentionally didnt test BFS on any really large box - because you
> described its upper limit like this in the announcement:

I ran a simple test as well, since I was curious to see how it performed
wrt interactiveness. One of my pet peeves with the current scheduler is
that I have to nice compile jobs, or my X experience is just awful while
the compile is running.

Now, this test case is something that attempts to see what
interactiveness would be like. It'll run a given command line while at
the same time logging delays. The delays are measured as follows:

- The app creates a pipe, and forks a child that blocks on reading from
that pipe.
- The app sleeps for a random period of time, anywhere between 100ms
and 2s. When it wakes up, it gets the current time and writes that to
the pipe.
- The child then gets woken, checks the time on its own, and logs the
difference between the two.

The idea here being that the delay between writing to the pipe and the
child reading the data and comparing should (in some way) be indicative
of how responsive the system would seem to a user.

The test app was quickly hacked up, so don't put too much into it. The
test run is a simple kernel compile, using -jX where X is the number of
threads in the system. The files are cache hot, so little IO is done.
The -x2 run uses twice as many processes as there are threads,
eg -j128 on a 64 thread box.

And I have to apologize for using a large system to test this on, I
realize it's out of the scope of BFS, but it's just easier to fire one
of these beasts up than it is to sacrifice my notebook or desktop
machine... So it's a 64 thread box. CFS -jX runtime is the baseline at
100, lower number means faster and vice versa. The latency numbers are
in msecs.


Scheduler Runtime Max lat Avg lat Std dev
----------------------------------------------------------------
CFS 100 951 462 267
CFS-x2 100 983 484 308
BFS
BFS-x2

And unfortunately this is where it ends for now, since BFS doesn't boot
on the two boxes I tried. It hard hangs right after disk detection. But
the latency numbers look pretty appalling for CFS, so it's a bit of a
shame that I did not get to compare. I'll try again later with a newer
revision, when available.

--
Jens Axboe

Nikos Chantziaras

Sep 7, 2009, 6:20:11 AM
On 09/07/2009 12:49 PM, Jens Axboe wrote:
> [...]

> And I have to apologize for using a large system to test this on, I
> realize it's out of the scope of BFS, but it's just easier to fire one
> of these beasts up than it is to sacrifice my notebook or desktop
> machine...

How does a kernel rebuild constitute "sacrifice"?


> So it's a 64 thread box. CFS -jX runtime is the baseline at
> 100, lower number means faster and vice versa. The latency numbers are
> in msecs.
>
>
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2
>
> And unfortunately this is where it ends for now, since BFS doesn't boot
> on the two boxes I tried.

Then why post this in the first place?

Jens Axboe

Sep 7, 2009, 6:50:07 AM
On Mon, Sep 07 2009, Nikos Chantziaras wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>> [...]
>> And I have to apologize for using a large system to test this on, I
>> realize it's out of the scope of BFS, but it's just easier to fire one
>> of these beasts up than it is to sacrifice my notebook or desktop
>> machine...
>
> How does a kernel rebuild constitute "sacrifice"?

It's more of a bother since I have to physically be at the notebook,
whereas the server type boxes usually have remote management. The
workstation I use currently, so it'd be very disruptive to do it there.
And as things are apparently very alpha on the bfs side currently, it's
easier to 'sacrifice' an idle test box. That's the keyword, 'test'
boxes. You know, machines used for testing. Not production machines.

Plus the notebook is using btrfs, whose on-disk format isn't
compatible with 2.6.30.

Is there a point to this question?

>> So it's a 64 thread box. CFS -jX runtime is the baseline at
>> 100, lower number means faster and vice versa. The latency numbers are
>> in msecs.
>>
>>
>> Scheduler Runtime Max lat Avg lat Std dev
>> ----------------------------------------------------------------
>> CFS 100 951 462 267
>> CFS-x2 100 983 484 308
>> BFS
>> BFS-x2
>>
>> And unfortunately this is where it ends for now, since BFS doesn't boot
>> on the two boxes I tried.
>
> Then who post this in the first place?

You snipped the relevant part of the conclusion, the part where I make a
comment on the cfs latencies.

Don't bother replying to any of my emails if YOU continue writing emails
in this fashion. I have MUCH better things to do than entertain kiddies.
If you do get your act together and want to reply, follow lkml etiquette
and group reply.

--
Jens Axboe

Frederic Weisbecker

Sep 7, 2009, 7:10:07 AM
On Mon, Sep 07, 2009 at 06:38:36AM +0300, Nikos Chantziaras wrote:
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness". Running
> the Doom 3 benchmark, or any other benchmark, doesn't say anything about
> responsiveness, it only measures how many frames were calculated in a
> specific period of time. How "stable" (with no stalls) those frames were
> making it to the screen is not measurable.

That should actually be benchmarkable. This is about latency.
For example, you could run high-load tasks in the
background and then launch a task that wakes up at middle/large
periods to do something. You could measure the time it takes to wake
it up to perform what it wants.

We have some events tracing infrastructure in the kernel that can
snapshot the wake up and sched switch events.

Having CONFIG_EVENT_TRACING=y should be sufficient for that.

You just need to mount a debugfs point, say in /debug.

Then you can activate these sched events by doing:

echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/sched/sched_switch/enable
echo 1 > /debug/tracing/events/sched/sched_wakeup/enable

#Launch your tasks

echo 1 > /debug/tracing/tracing_on

#Wait for some time

echo 0 > /debug/tracing/tracing_on

That will require some parsing of the result in /debug/tracing/trace
to get the delays between the wakeup events and the switch-in events
for the task that periodically wakes up, and then producing some
statistics such as the average or the maximum latency.

That's a bit of a rough approach to measure such latencies but that
should work.


> If BFS would imply small drops in pure performance counted in
> instructions per seconds, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I run
> "make -j2" with BFS and the standard scheduler, BFS always finishes a bit
> faster. Not by much, but still. One thing I'm noticing here is that BFS
> produces 100% CPU load on each core with "make -j2" while the normal
> scheduler stays at about 90-95% with -j2 or higher in at least one of the
> cores. There seems to be under-utilization of CPU time.

That also could be benchmarked by using the above sched events and
looking at the average time each cpu spends running the idle task.

Jens Axboe

Sep 7, 2009, 8:10:06 AM
On Mon, Sep 07 2009, Jens Axboe wrote:
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2

Those numbers are buggy, btw, it's not nearly as bad. But responsiveness
under compile load IS bad though, the test app just didn't quantify it
correctly. I'll see if I can get it working properly.

Ingo Molnar

Sep 7, 2009, 8:20:07 AM

* Frans Pop <ele...@planet.nl> wrote:

> Ingo Molnar wrote:
> > So the testbox i picked fits into the upper portion of what i
> > consider a sane range of systems to tune for - and should still fit
> > into BFS's design bracket as well according to your description:
> > it's a dual quad core system with hyperthreading.
>
> Ingo,
>
> Nice that you've looked into this.
>
> Would it be possible for you to run the same tests on e.g. a dual
> core and/or a UP system (or maybe just offline some CPUs?)? It
> would be very interesting to see whether BFS does better in the
> lower portion of the range, or if the differences you show between
> the two schedulers are consistent across the range.

Sure!

Note that usually we can extrapolate ballpark-figure quad and dual
socket results from 8 core results. Trends as drastic as the ones
i reported do not get reversed as one shrinks the number of cores.

[ This technique is not universal - for example borderline graphs
cannot be extrapolated down reliably - but the graphs i
posted were far from borderline. ]

Con posted single-socket quad comparisons/graphs so to make it 100%
apples to apples i re-tested with a single-socket (non-NUMA) quad as
well, and have uploaded the new graphs/results to:

kernel build performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg

pipe performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg

messaging performance (hackbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg

OLTP performance (postgresql + sysbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg

It shows similar curves and behavior to the 8-core results i posted
- BFS is slower than mainline in virtually every measurement. The
ratios are different for different parts of the graphs - but the
trend is similar.

I also re-ran a few standalone kernel latency tests with a single
quad:

lat_tcp:

BFS: TCP latency using localhost: 16.9926 microseconds
sched-devel: TCP latency using localhost: 12.4141 microseconds [36.8% faster]

as a comparison, the 8 core lat_tcp result was:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe quad result:

BFS: Pipe latency: 4.6978 microseconds
sched-devel: Pipe latency: 2.6860 microseconds [74.8% faster]

as a comparison, the 8 core lat_pipe result was:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

On the desktop interactivity front, i also still saw that bad
starvation artifact with BFS with multiple copies of CPU-bound
pipe-test-1m.c running in parallel:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

Start up a few copies of them like this:

for ((i=0;i<32;i++)); do ./pipe-test-1m & done

and the quad eventually came to a halt here - until the tasks
finished running.

I also tested a few key data points on dual core and it shows
similar trends as well (as expected from the 8 and 4 core results).

But ... i'd really encourage everyone to test these things yourself
as well and not take anyone's word on this as granted. The more
people provide numbers, the better. The latest BFS patch can be
found at:

http://ck.kolivas.org/patches/bfs/

The mainline sched-devel tree can be found at:

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

Stefan Richter

Sep 7, 2009, 8:40:09 AM
Ingo Molnar wrote:
> i'd really encourage everyone to test these things yourself
> as well and not take anyone's word on this as granted. The more
> people provide numbers, the better.

Besides mean values from bandwidth and latency focused tests, standard
deviations or variance, or e.g. 90th percentiles and perhaps maxima of
latency focused tests might be of interest. Or graphs with error bars.
--
Stefan Richter
-=====-==--= =--= --===
http://arcgraph.de/sr/

Markus Törnqvist

Sep 7, 2009, 9:50:07 AM
Please Cc me as I'm not a subscriber.

(LKML bounced this message once already for 8-bit headers, I'm retrying
now - sorry if someone gets it twice)

On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
>
>Con posted single-socket quad comparisons/graphs so to make it 100%
>apples to apples i re-tested with a single-socket (non-NUMA) quad as
>well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg

[...]


>
>It shows similar curves and behavior to the 8-core results i posted
>- BFS is slower than mainline in virtually every measurement. The
>ratios are different for different parts of the graphs - but the
>trend is similar.

Dude, not cool.

1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores
2. You just proved BFS is better on the job_count == core_count case, as BFS
says it is, if you look at the graph
3. You're comparing an old version of BFS against an unreleased dev kernel

Also, you said on http://article.gmane.org/gmane.linux.kernel/886319


"I also tried to configure the kernel in a BFS friendly way, i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:
http://redhat.com/~mingo/misc/config
"

Quickly looking at the conf you have
CONFIG_HZ_250=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y

And other DEBUG.

--
mjt

Ingo Molnar

Sep 7, 2009, 10:00:13 AM

* Markus Törnqvist <m...@nysv.org> wrote:

> Please Cc me as I'm not a subscriber.
>

> On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
> >
> >Con posted single-socket quad comparisons/graphs so to make it 100%
> >apples to apples i re-tested with a single-socket (non-NUMA) quad as
> >well, and have uploaded the new graphs/results to:
> >
> > kernel build performance on quad:
> > http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
> [...]
> >
> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores

No, it's 4 cores. HyperThreading adds two 'siblings' per core, which
are not 'cores'.

> 2. You just proved BFS is better on the job_count == core_count case, as BFS
> says it is, if you look at the graph

I pointed that out too. I think the graphs speak for themselves:

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

> 3. You're comparing an old version of BFS against an unreleased dev kernel

bfs-208 was 1 day old (and it is a 500K+ kernel patch) when i tested
it against the 2 days old sched-devel tree. Btw., i initially
measured 205 as well and spent one more day on acquiring and
analyzing the 208 results.

There's bfs-209 out there today. These tests take 8+ hours to
complete and validate. I'll re-test BFS in the future too, and as i
said it in the first mail i'll test it on a .31 base as well once
BFS has been ported to it:

> > It's on a .31-rc8 base while BFS is on a .30 base - will be able
> > to test BFS on a .31 base as well once you release it. (but it
> > doesnt matter much to the results - there werent any heavy core
> > kernel changes impacting these workloads.)

> Also, you said on http://article.gmane.org/gmane.linux.kernel/886319


> "I also tried to configure the kernel in a BFS friendly way, i used
> HZ=1000 as recommended, turned off all debug options, etc. The
> kernel config i used can be found here:
> http://redhat.com/~mingo/misc/config
> "
>
> Quickly looking at the conf you have
> CONFIG_HZ_250=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set

Indeed. HZ does not seem to matter according to what i see in my
measurements. Can you measure such sensitivity?

> CONFIG_ARCH_WANT_FRAME_POINTERS=y
> CONFIG_FRAME_POINTER=y
>
> And other DEBUG.

These are the defaults and they dont make a measurable difference to
these results. What other debug options do you mean and do they make
a difference?

Ingo

Ingo Molnar

Sep 7, 2009, 10:20:09 AM

* Jens Axboe <jens....@oracle.com> wrote:

> On Mon, Sep 07 2009, Jens Axboe wrote:
> > Scheduler Runtime Max lat Avg lat Std dev
> > ----------------------------------------------------------------
> > CFS 100 951 462 267
> > CFS-x2 100 983 484 308
> > BFS
> > BFS-x2
>
> Those numbers are buggy, btw, it's not nearly as bad. But
> responsiveness under compile load IS bad though, the test app just
> didn't quantify it correctly. I'll see if I can get it working
> properly.

What's the default latency target on your box:

cat /proc/sys/kernel/sched_latency_ns

?

And yes, it would be wonderful to get a test-app from you that would
express the kind of pain you are seeing during compile jobs.

Ingo

Arjan van de Ven

Sep 7, 2009, 10:40:06 AM
On Mon, 07 Sep 2009 06:38:36 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in
> a way that can't be benchmarked. Examples:

Have you tried to see if latencytop catches such latencies ?

Arjan van de Ven

Sep 7, 2009, 10:50:11 AM
On Mon, 7 Sep 2009 16:41:51 +0300

> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> 8 cores

4 cores, 8 threads. Which is basically the standard desktop CPU going
forward... (4 cores already is the standard today; 8 threads will be any day now)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Michael Buesch

Sep 7, 2009, 11:20:05 AM
Here's a very simple test setup on an embedded single-core bcm47xx machine (WL500GPv2).
It uses iperf for performance testing. The iperf server is run on the
embedded device. The device is so slow that the iperf test is completely
CPU-bound. The network connection is 100MBit on the device, connected
via patch cable to a 1000MBit machine.

The kernel is openwrt-2.6.30.5.

Here are the results:

Mainline CFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35793 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.4 MBytes 23.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35794 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 56147 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec


BFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52489 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.2 MBytes 32.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52490 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52491 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec


--
Greetings, Michael.

Frans Pop

Sep 7, 2009, 11:30:12 AM
On Monday 07 September 2009, Arjan van de Ven wrote:
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Despite that, I'm personally more interested in what I have available here
*now*. And that's various UP Pentium systems, one dual-core Pentium D and
a Core Duo.

I've been running BFS on my laptop today while doing CPU-intensive jobs
(not disk-intensive), and I must say that BFS does seem very responsive.
OTOH, I've also noticed some surprising things, such as processors staying
at lower frequencies while doing CPU-intensive work.

It feels like I have fewer of the mouse-cursor and typing freezes I'm used
to with CFS, even when I'm *not* doing anything special. I've been
blaming those on still running with ordered-mode ext3, but now I'm
starting to wonder.

I'll try to do more structured testing, comparisons and measurements
later. At the very least it's nice to have something to compare _with_.

Cheers,
FJP

Xavier Bestel

Sep 7, 2009, 11:30:13 AM

On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> On Mon, 7 Sep 2009 16:41:51 +0300
> > >It shows similar curves and behavior to the 8-core results i posted
> > >- BFS is slower than mainline in virtually every measurement. The
> > >ratios are different for different parts of the graphs - but the
> > >trend is similar.
> >
> > Dude, not cool.
> >
> > 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> > 8 cores
>
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Except on your typical smartphone, which will run Linux and will probably
vastly outnumber "traditional" Linux desktops.

Xav

Arjan van de Ven

Sep 7, 2009, 11:40:08 AM
On Mon, 07 Sep 2009 17:24:29 +0200
Xavier Bestel <xavier...@free.fr> wrote:

>
> On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> > On Mon, 7 Sep 2009 16:41:51 +0300
> > > >It shows similar curves and behavior to the 8-core results i
> > > >posted
> > > >- BFS is slower than mainline in virtually every measurement.
> > > >The ratios are different for different parts of the graphs - but
> > > >the trend is similar.
> > >
> > > Dude, not cool.
> > >
> > > 1. Quad HT is not the same as a 4-core desktop, you're doing it
> > > with 8 cores
> >
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

yeah, the trend in cellphones is only quad-core without HT, not quad
core WITH HT ;-)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Arjan van de Ven

Sep 7, 2009, 11:40:06 AM
On Mon, 7 Sep 2009 17:20:33 +0200
Frans Pop <ele...@planet.nl> wrote:

> On Monday 07 September 2009, Arjan van de Ven wrote:
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Despite that I'm personally more interested in what I have available
> here *now*. And that's various UP Pentium systems, one dual core
> Pentium D and Core Duo.
>
> I've been running BFS on my laptop today while doing CPU intensive
> jobs (not disk intensive), and I must say that BFS does seem very
> responsive. OTOH, I've also noticed some surprising things, such as
> processors staying on lower frequencies while doing CPU-intensive
> work.
>
> It feels like I have fewer of the mouse-cursor and typing freezes I'm
> used to with CFS, even when I'm *not* doing anything special. I've
> been blaming those on still running with ordered-mode ext3, but now
> I'm starting to wonder.
>
> I'll try to do more structured testing, comparisons and measurements
> later. At the very least it's nice to have something to compare
> _with_.
>

It's a shameless plug since I wrote it, but latencytop will be able to
tell you what your bottleneck is...
and that is very interesting to know, regardless of the "what scheduler
code" discussion.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Nikos Chantziaras

Sep 7, 2009, 11:40:09 AM
On 09/07/2009 03:16 PM, Ingo Molnar wrote:
> [...]
> Note that usually we can extrapolate ballpark-figure quad and dual
> socket results from 8 core results. Trends as drastic as the ones
> i reported do not get reversed as one shrinks the number of cores.
>
> Con posted single-socket quad comparisons/graphs so to make it 100%
> apples to apples i re-tested with a single-socket (non-NUMA) quad as
> well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
>
> pipe performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg
>
> messaging performance (hackbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg
>
> OLTP performance (postgresql + sysbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg
>
> It shows similar curves and behavior to the 8-core results i posted
> - BFS is slower than mainline in virtually every measurement.

Except for numbers, what's your *experience* with BFS when it comes to
composited desktops + games + multimedia apps? (Watching high
definition videos, playing some latest high-tech 3D game, etc.) I
described the exact problems experienced with mainline in a previous reply.

Are you actually using that stuff? Because it would be hard to
tell if your desktop consists mainly of Emacs and an xterm; you even
seem to be using Mutt, so I suspect your desktop probably doesn't look
very Windows Vista/OS X/Compiz-like. Usually, with "multimedia desktop
PC" one doesn't mean:

http://foss.math.aegean.gr/~realnc/pics/desktop2.png

but rather:

http://foss.math.aegean.gr/~realnc/pics/desktop1.png

BFS probably wouldn't offer the former anything, while on the latter it
does make a difference. If your usage of the "desktop" bears a
resemblance to the first example, I'd say you might not be the most
qualified person to judge the "Linux desktop experience." That is not
meant to be offensive or patronizing, just an observation, and I might
even be totally wrong about it.

Frans Pop

Sep 7, 2009, 11:50:04 AM
On Monday 07 September 2009, Arjan van de Ven wrote:
> it's a shameless plug since I wrote it, but latencytop will be able to
> tell you what your bottleneck is...
> and that is very interesting to know, regardless of the "what scheduler
> code" discussion;

I'm very much aware of that and I've tried pinning it down a few times,
but failed to come up with anything conclusive. I plan to make a new
effort in this context as the freezes have increasingly been annoying me.

Unfortunately latencytop only shows a blank screen when used with BFS, but
I guess that's not totally unexpected.

Cheers,
FJP

Diego Calleja

Sep 7, 2009, 12:00:15 PM
On Monday 07 September 2009 17:24:29, Xavier Bestel wrote:

> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

Smartphones will probably start using ARM dual-core CPUs next year;
the embedded world is no longer SMP-free.

Jens Axboe

Sep 7, 2009, 1:40:04 PM
On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens....@oracle.com> wrote:
>
> > On Mon, Sep 07 2009, Jens Axboe wrote:
> > > Scheduler Runtime Max lat Avg lat Std dev
> > > ----------------------------------------------------------------
> > > CFS 100 951 462 267
> > > CFS-x2 100 983 484 308
> > > BFS
> > > BFS-x2
> >
> > Those numbers are buggy, btw, it's not nearly as bad. But
> > responsiveness under compile load IS bad though, the test app just
> > didn't quantify it correctly. I'll see if I can get it working
> > properly.
>
> What's the default latency target on your box:
>
> cat /proc/sys/kernel/sched_latency_ns
>
> ?

The box is off right now, but it's set to whatever the default is. I don't
touch it.

> And yes, it would be wonderful to get a test-app from you that would
> express the kind of pain you are seeing during compile jobs.

I was hoping this one would, but it's not showing anything. I even added
support for doing the ping and wakeup over a socket, to see if the pipe
test was doing well because of the sync wakeup we do there. The net
latency is a little worse, but still good. So no luck with that app
so far.

--
Jens Axboe

Avi Kivity

Sep 7, 2009, 2:00:07 PM
On 09/07/2009 12:49 PM, Jens Axboe wrote:
>
> I ran a simple test as well, since I was curious to see how it performed
> wrt interactiveness. One of my pet peeves with the current scheduler is
> that I have to nice compile jobs, or my X experience is just awful while
> the compile is running.
>

I think the problem is that CFS is optimizing for the wrong thing. It's
trying to be fair to tasks, but tasks are meaningless building blocks of
jobs, which are what the user sees and measures. Your make -j128
dominates your interactive task by two orders of magnitude. If the
scheduler attempts to bridge this gap using heuristics, it will fail
badly when it misdetects, since it will starve the really important
100-thread job for a task that was misdetected as interactive.

I think that bash (and the GUI shell) should put any new job (for bash,
a pipeline; for the GUI, an application launch from the menu) in a
scheduling group of its own. This way it will have equal weight in the
scheduler's eyes with interactive tasks; one will not dominate the
other. Of course if the cpu is free the compile job is welcome to use
all 128 threads.

(Similarly, different login sessions should be placed in different jobs
to prevent a heavily multithreaded screensaver from overwhelming ed.)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

Ingo Molnar

Sep 7, 2009, 2:30:10 PM

That's interesting. I tried to reproduce it on x86, but the profile
does not show any scheduler overhead at all on the server:

$ perf report

#
# Samples: 8369
#
# Overhead Symbol
# ........ ......
#
9.20% [k] copy_user_generic_string
3.80% [k] e1000_clean
3.58% [k] ipt_do_table
2.72% [k] mwait_idle
2.68% [k] nf_iterate
2.28% [k] e1000_intr
2.15% [k] tcp_packet
2.10% [k] __hash_conntrack
1.59% [k] read_tsc
1.52% [k] _local_bh_enable_ip
1.34% [k] eth_type_trans
1.29% [k] __alloc_skb
1.19% [k] tcp_recvmsg
1.19% [k] ip_rcv
1.17% [k] e1000_clean_rx_irq
1.12% [k] apic_timer_interrupt
0.99% [k] vsnprintf
0.96% [k] nf_conntrack_in
0.96% [k] kmem_cache_free
0.93% [k] __kmalloc_track_caller


Could you profile it please? Also, what's the context-switch rate?

Below is the call-graph profile as well - all the overhead is in
networking and SLAB.

Ingo

$ perf report --call-graph fractal,5

#
# Samples: 8947
#
# Overhead Command Shared Object Symbol
# ........ .............. ............................. ......
#
9.06% iperf [kernel] [k] copy_user_generic_string
|
|--98.89%-- skb_copy_datagram_iovec
| |
| |--77.18%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --22.82%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--1.11%-- system_call_fastpath
__GI___libc_nanosleep

3.62% [init] [kernel] [k] e1000_clean
2.96% [init] [kernel] [k] ipt_do_table
2.79% [init] [kernel] [k] mwait_idle
2.22% [init] [kernel] [k] e1000_intr
1.93% [init] [kernel] [k] nf_iterate
1.65% [init] [kernel] [k] __hash_conntrack
1.52% [init] [kernel] [k] tcp_packet
1.29% [init] [kernel] [k] ip_rcv
1.18% [init] [kernel] [k] __alloc_skb
1.15% iperf [kernel] [k] tcp_recvmsg

1.04% [init] [kernel] [k] _local_bh_enable_ip
1.02% [init] [kernel] [k] apic_timer_interrupt
1.02% [init] [kernel] [k] eth_type_trans
1.01% [init] [kernel] [k] tcp_v4_rcv
0.96% iperf [kernel] [k] kfree
|
|--95.35%-- skb_release_data
| __kfree_skb
| |
| |--79.27%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --20.73%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.65%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.96% [init] [kernel] [k] read_tsc
0.92% iperf [kernel] [k] tcp_v4_do_rcv
|
|--95.12%-- tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.88%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.92% [init] [kernel] [k] e1000_clean_rx_irq
0.86% iperf [kernel] [k] tcp_rcv_established
|
|--96.10%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--3.90%-- tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.84% iperf [kernel] [k] kmem_cache_free
|
|--93.33%-- __kfree_skb
| |
| |--71.43%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --28.57%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--4.00%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.67%-- tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.80% [init] [kernel] [k] netif_receive_skb
0.79% iperf [kernel] [k] tcp_event_data_recv
|
|--83.10%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--12.68%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.23%-- tcp_data_queue
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.67% perf [kernel] [k] format_decode
|
|--91.67%-- vsnprintf
| seq_printf
| |
| |--67.27%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--23.64%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--7.27%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.82%-- cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--8.33%-- seq_printf
|
|--60.00%-- proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--40.00%-- show_map_vma
show_map
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.65% [init] [kernel] [k] __kmalloc_track_caller
0.63% [init] [kernel] [k] nf_conntrack_in
0.63% [init] [kernel] [k] ip_route_input
0.58% perf [kernel] [k] vsnprintf
|
|--98.08%-- seq_printf
| |
| |--60.78%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--19.61%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--9.80%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- cpuset_task_status_allowed
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.96%-- render_cap_t
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--1.92%-- snprintf
proc_task_readdir
vfs_readdir
sys_getdents
system_call_fastpath
__getdents64
0x69706565000a3430

0.57% [init] [kernel] [k] ktime_get
0.57% [init] [kernel] [k] nf_nat_fn
0.56% iperf [kernel] [k] tcp_packet
|
|--68.00%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--32.00%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.56% iperf /usr/bin/iperf [.] 0x000000000059f8
|
|--8.00%-- 0x4059f8
|
|--8.00%-- 0x405a16
|
|--8.00%-- 0x4059fd
|
|--4.00%-- 0x409d22
|
|--4.00%-- 0x405871
|
|--4.00%-- 0x406ee1
|
|--4.00%-- 0x405726
|
|--4.00%-- 0x4058db
|
|--4.00%-- 0x406ee8
|
|--2.00%-- 0x405b60
|
|--2.00%-- 0x4058fd
|
|--2.00%-- 0x4058d5
|
|--2.00%-- 0x405490
|
|--2.00%-- 0x4058bb
|
|--2.00%-- 0x405b93
|
|--2.00%-- 0x405b8e
|
|--2.00%-- 0x405903
|
|--2.00%-- 0x405ba8
|
|--2.00%-- 0x406eae
|
|--2.00%-- 0x405545
|
|--2.00%-- 0x405870
|
|--2.00%-- 0x405b67
|
|--2.00%-- 0x4058ce
|
|--2.00%-- 0x40570e
|
|--2.00%-- 0x406ee4
|
|--2.00%-- 0x405a02
|
|--2.00%-- 0x406eec
|
|--2.00%-- 0x405b82
|
|--2.00%-- 0x40556a
|
|--2.00%-- 0x405755
|
|--2.00%-- 0x405a0a
|
|--2.00%-- 0x405498
|
|--2.00%-- 0x409d20
|
|--2.00%-- 0x405b21
|
--2.00%-- 0x405a2c

0.56% [init] [kernel] [k] kmem_cache_alloc
0.56% [init] [kernel] [k] __inet_lookup_established
0.55% perf [kernel] [k] number
|
|--95.92%-- vsnprintf
| |
| |--97.87%-- seq_printf
| | |
| | |--56.52%-- show_map_vma
| | | show_map
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--28.26%-- render_sigset_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--6.52%-- proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--4.35%-- render_cap_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | --4.35%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --2.13%-- scnprintf
| bitmap_scnlistprintf
| seq_bitmap_list
| cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--4.08%-- seq_printf
|
|--50.00%-- show_map_vma
| show_map
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--50.00%-- render_sigset_t
proc_pid_status
proc_single_show
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.55% [init] [kernel] [k] native_sched_clock
0.50% iperf [kernel] [k] e1000_xmit_frame
|
|--71.11%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--28.89%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.50% iperf [kernel] [k] ipt_do_table
|
|--37.78%-- ipt_local_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--58.82%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --41.18%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--31.11%-- ipt_post_routing_hook
| nf_iterate
| nf_hook_slow
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--64.29%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --35.71%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--20.00%-- ipt_local_out_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--88.89%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --11.11%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- nf_iterate
| nf_hook_slow
| |
| |--66.67%-- ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_send_ack
| | tcp_cleanup_rbuf
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --33.33%-- __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--2.22%-- ipt_local_in_hook
| nf_iterate
| nf_hook_slow
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| napi_skb_finish
| napi_gro_receive
| e1000_receive_skb
| e1000_clean_rx_irq
| e1000_clean
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| irq_exit
| do_IRQ
| ret_from_intr
| vgettimeofday
|
--2.22%-- ipt_pre_routing_hook
nf_iterate
nf_hook_slow
ip_rcv
netif_receive_skb
napi_skb_finish
napi_gro_receive
e1000_receive_skb
e1000_clean_rx_irq
e1000_clean
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
__GI___libc_nanosleep

0.50% iperf [kernel] [k] schedule
|
|--57.78%-- do_nanosleep
| hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
|--33.33%-- schedule_timeout
| sk_wait_data
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
--2.22%-- sk_wait_data
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% iperf [kernel] [k] tcp_transmit_skb
|
|--97.73%-- tcp_send_ack
| |
| |--83.72%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | |
| | |--97.22%-- tcp_prequeue_process
| | | tcp_recvmsg
| | | sock_common_recvmsg
| | | __sock_recvmsg
| | | sock_recvmsg
| | | sys_recvfrom
| | | system_call_fastpath
| | | __recv
| | |
| | --2.78%-- release_sock
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --16.28%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.27%-- __tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% [init] [kernel] [k] nf_hook_slow
0.48% iperf [kernel] [k] virt_to_head_page
|
|--53.49%-- kfree
| skb_release_data
| __kfree_skb
| |
| |--65.22%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --34.78%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- skb_release_data
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- kmem_cache_free
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--9.30%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv
...

Jerome Glisse

Sep 7, 2009, 2:40:08 PM
On Mon, 2009-09-07 at 13:50 +1000, Con Kolivas wrote:

> /me checks on his distributed computing client's progress, fires up
> his next H264 encode, changes music tracks and prepares to have his
> arse whooped on quakelive.
> --

For such computer usage I would strongly suggest that you look into
GPU driver development; there is a lot of performance to be won in
this area, and my feeling is that you can improve what you are doing
(games -> OpenGL (so GPU); H264 (encoding is harder to accelerate
with a GPU, but for decoding and displaying it you definitely want
to involve the GPU); and tons of other things you are doing on your
Linux desktop would go faster if the GPU were put to more use). A wild
guess is that you can get a two- or even three-figure percentage
improvement with a better GPU driver. My point is that I don't think
a Linux scheduler improvement (compared to what we have now) will
give a significant boost to the Linux desktop; on the contrary, even
a slight improvement to the GPU driver stack can give you a boost.
Another way of saying that: there is no point in prioritizing X or
desktop apps if the CPU has to do all the drawing by itself (the CPU
is several orders of magnitude slower than the GPU at that kind of task).

Regards,
Jerome Glisse

Daniel Walker

Sep 7, 2009, 2:50:06 PM
On Mon, 2009-09-07 at 20:26 +0200, Ingo Molnar wrote:
> That's interesting. I tried to reproduce it on x86, but the profile
> does not show any scheduler overhead at all on the server:

If the scheduler isn't running the task which causes the lower
throughput, would that even show up in profiling output?

Daniel

Jens Axboe

Sep 7, 2009, 2:50:06 PM
On Mon, Sep 07 2009, Avi Kivity wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>>
>> I ran a simple test as well, since I was curious to see how it performed
>> wrt interactiveness. One of my pet peeves with the current scheduler is
>> that I have to nice compile jobs, or my X experience is just awful while
>> the compile is running.
>>
>
> I think the problem is that CFS is optimizing for the wrong thing. It's
> trying to be fair to tasks, but these are meaningless building blocks of
> jobs, which is what the user sees and measures. Your make -j128
> dominates your interactive task by two orders of magnitude. If the
> scheduler attempts to bridge this gap using heuristics, it will fail
> badly when it misdetects since it will starve the really important
> 100-thread job for a task that was misdetected as interactive.

Agree, I was actually looking into doing joint latency for X number of
tasks for the test app. I'll try and do that and see if we can detect
something from that.

--
Jens Axboe

Michael Buesch

Sep 7, 2009, 3:00:13 PM
On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> Could you profile it please? Also, what's the context-switch rate?

As far as I can tell, the broadcom mips architecture does not have profiling support.
It only has some proprietary profiling registers that nobody has written kernel
support for yet.

--
Greetings, Michael.

Ingo Molnar

Sep 7, 2009, 4:40:08 PM

* Jens Axboe <jens....@oracle.com> wrote:

> Agree, I was actually looking into doing joint latency for X
> number of tasks for the test app. I'll try and do that and see if
> we can detect something from that.

Could you please try latest -tip:

http://people.redhat.com/mingo/tip.git/README

(c26f010 or later)

Does it get any better with make -j128 build jobs? Peter just fixed
a bug in the SMP load-balancer that can cause interactivity problems
on large CPU count systems.

Ingo

Jens Axboe

Sep 7, 2009, 4:50:07 PM
On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens....@oracle.com> wrote:
>
> > Agree, I was actually looking into doing joint latency for X
> > number of tasks for the test app. I'll try and do that and see if
> > we can detect something from that.
>
> Could you please try latest -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> (c26f010 or later)
>
> Does it get any better with make -j128 build jobs? Peter just fixed

The compile 'problem' is on my workstation, which is a dual core Intel
core 2. I use -j4 on that typically. On the bigger boxes, I don't notice
any interactivity problems, largely because I don't run anything latency
sensitive on those :-)

> a bug in the SMP load-balancer that can cause interactivity problems
> on large CPU count systems.

Worth trying on the dual core box?

--
Jens Axboe

Jens Axboe

Sep 7, 2009, 4:50:09 PM
On Mon, Sep 07 2009, Jens Axboe wrote:
> > And yes, it would be wonderful to get a test-app from you that would
> > express the kind of pain you are seeing during compile jobs.
>
> I was hoping this one would, but it's not showing anything. I even added
> support for doing the ping and wakeup over a socket, to see if the pipe
> test was doing well because of the sync wakeup we do there. The net
> latency is a little worse, but still good. So no luck in making that app
> so far.

Here's a version that bounces timestamps between a producer and a number
of consumers (clients). Not really tested much, but perhaps someone can
compare this on a box that boots BFS and see what happens.

To run it, use -cX where X is the number of children that you wait for a
response from. The max delay between these children is logged for each
wakeup. You can invoke it ala:

$ ./latt -c4 'make -j4'

and it'll dump the max/avg/stddev bounce time after make has completed,
or if you just want to play around, start the compile in one xterm and
do:

$ ./latt -c4 'sleep 5'

to just log for a small period of time. Vary the number of clients to
see how that changes the aggregated latency. 1 should be fast, adding
more clients quickly adds up.

Additionally, it has a -f and -t option that controls the window of
sleep time for the parent between each message. The numbers are in
msecs, and it defaults to a minimum of 100msecs and up to 500msecs.

--
Jens Axboe

latt.c

Ingo Molnar

Sep 7, 2009, 5:00:10 PM

* Michael Buesch <m...@bu3sch.de> wrote:

> On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> > Could you profile it please? Also, what's the context-switch rate?
>
> As far as I can tell, the broadcom mips architecture does not have
> profiling support. It does only have some proprietary profiling
> registers that nobody wrote kernel support for, yet.

Well, what does 'vmstat 1' show - how many context switches are
there per second on the iperf server? In theory if it's a truly
saturated box, there shouldn't be many - just a single iperf task
running at 100% CPU utilization or so.

(Also, if there's hrtimer support for that board then perfcounters
could be used to profile it.)

Ingo

Jens Axboe

Sep 7, 2009, 5:10:06 PM
On Mon, Sep 07 2009, Peter Zijlstra wrote:

> On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > a bug in the SMP load-balancer that can cause interactivity problems
> > > on large CPU count systems.
> >
> > Worth trying on the dual core box?
>
> I debugged the issue on a dual core :-)
>
> It should be more pronounced on larger machines, but its present on
> dual-core too.

Alright, I'll upgrade that box to -tip tomorrow and see if it makes
a noticeable difference. At -j4 or higher, I can literally see windows
slowly popping up when switching to a different virtual desktop.

Peter Zijlstra

Sep 7, 2009, 5:10:09 PM
On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > a bug in the SMP load-balancer that can cause interactivity problems
> > on large CPU count systems.
>
> Worth trying on the dual core box?

I debugged the issue on a dual core :-)

It should be more pronounced on larger machines, but its present on
dual-core too.

--

Ingo Molnar

Sep 7, 2009, 6:20:08 PM

* Jens Axboe <jens....@oracle.com> wrote:

> On Mon, Sep 07 2009, Peter Zijlstra wrote:
> > On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > > a bug in the SMP load-balancer that can cause interactivity problems
> > > > on large CPU count systems.
> > >
> > > Worth trying on the dual core box?
> >
> > I debugged the issue on a dual core :-)
> >
> > It should be more pronounced on larger machines, but its present on
> > dual-core too.
>
> Alright, I'll upgrade that box to -tip tomorrow and see if it
> makes a noticeable difference. At -j4 or higher, I can literally
> see windows slowly popping up when switching to a different
> virtual desktop.

btw., if you run -tip and have these enabled:

CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y

cd tools/perf/
make -j install

... then you can use a couple of new perfcounters features to
measure scheduler latencies. For example:

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

Will tell you how many times this workload got delayed by waiting
for CPU time.

You can repeat the workload as well and see the statistical
properties of those metrics:

aldebaran:/home/mingo> perf stat --repeat 10 -e \
sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

Performance counter stats for './hackbench 20' (10 runs):

59826 sched:sched_stat_wait # 0.026 M/sec ( +- 5.540% )
2280.099643 task-clock-msecs # 7.525 CPUs ( +- 1.620% )

0.303013390 seconds time elapsed ( +- 3.189% )

To get scheduling events, do:

# perf list 2>&1 | grep sched:
sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_signal_send [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]

stat_wait/sleep/iowait would be the interesting ones, for latency
analysis.

Or, if you want to see all the specific delays and want to see
min/max/avg, you can do:

perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace

Ingo

Pekka Pietikainen

Sep 7, 2009, 8:00:12 PM
On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldnt be many - just a single iperf task
Yay, finally something that's measurable in this thread \o/

Gigabit Ethernet iperf on an Atom or so might be something that
shows similar effects yet is debuggable. Anyone feel like taking a shot?

That beast doing iperf probably ends up making it go quite close to its
limits (IO, mem bw, cpu). IIRC the routing/bridging performance is
something like 40Mbps (depends a lot on the model, corresponds pretty
well with the MHz of the beast).

Maybe not totally unlike what make -j16 does to a 1-4 core box?

Thomas Fjellstrom

Sep 7, 2009, 8:00:10 PM
On Sun September 6 2009, Nikos Chantziaras wrote:

> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in a
> way that can't be benchmarked. Examples:
>
> mplayer using OpenGL renderer doesn't drop frames anymore when dragging
> and dropping the video window around in an OpenGL composited desktop
> (KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
> the moment the move starts and at the moment you drop the window back to
> the desktop, there's a big frame skip as if mplayer was frozen for a
> bit; around 200 or 300ms.)
>
> Composite desktop effects like zoom and fade out don't stall for
> sub-second periods of time while there's CPU load in the background. In
> other words, the desktop is more fluid and less skippy even during heavy
> CPU load. Moving windows around with CPU load in the background doesn't
> result in short skips.
>
> LMMS (a tool utilizing real-time sound synthesis) does not produce
> "pops", "crackles" and drops in the sound during real-time playback due
> to buffer under-runs. Those problems amplify when there's heavy CPU
> load in the background, while with BFS heavy load doesn't produce those
> artifacts (though LMMS makes itself run SCHED_ISO with BFS). Also,
> hitting a key on the keyboard needs less time for the note to become
> audible when using BFS. Same should hold true for other tools who
> traditionally benefit from the "-rt" kernel sources.
>
> Games like Doom 3 and such don't "freeze" periodically for small amounts
> of time (again for sub-second amounts) when something in the background
> grabs CPU time (be it my mailer checking for new mail or a cron job, or
> whatever.)
>
> And, the most drastic improvement here, with BFS I can do a "make -j2"
> in the kernel tree and the GUI stays fluid. Without BFS, things start
> to lag, even with in-RAM builds (like having the whole kernel tree
> inside a tmpfs) and gcc running with nice 19 and ionice -c 3.
>
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness".
> Running the Doom 3 benchmark, or any other benchmark, doesn't say
> anything about responsiveness, it only measures how many frames were
> calculated in a specific period of time. How "stable" (with no stalls)
> those frames were making it to the screen is not measurable.
>
> If BFS would imply small drops in pure performance counted in
> instructions per seconds, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I
> run "make -j2" with BFS and the standard scheduler, BFS always finishes
> a bit faster. Not by much, but still. One thing I'm noticing here is
> that BFS produces 100% CPU load on each core with "make -j2" while the
> normal scheduler stays at about 90-95% with -j2 or higher in at least
> one of the cores. There seems to be under-utilization of CPU time.
>
> Also, by searching around the net but also through discussions on
> various mailing lists, there seems to be a trend: the problems for some
> reason seem to occur more often with Intel CPUs (Core 2 chips and lower;
> I can't say anything about Core I7) while people on AMD CPUs mostly not
> being affected by most or even all of the above. (And due to this flame
> wars often break out, with one party accusing the other of imagining
> things). Can the integrated memory controller on AMD chips have
> something to do with this? Do AMD chips generally offer better
> "multithrading" behavior? Unfortunately, you didn't mention on what CPU
> you ran your tests. If it was AMD, it might be a good idea to run tests
> on Pentium and Core 2 CPUs.
>
> For reference, my system is:
>
> CPU: Intel Core 2 Duo E6600 (2.4GHz)
> Mainboard: Asus P5E (Intel X38 chipset)
> RAM: 6GB (2+2+1+1) dual channel DDR2 800
> GPU: RV770 (Radeon HD4870).
>

My Phenom 9550 (2.2GHz) whips the pants off my Intel Q6600 (2.6GHz). A
friend of mine and I both get large amounts of stalling when doing a lot of
IO. I haven't seen such horrible desktop interactivity since before the new
schedulers and the -ck patchset came out for 2.4.x. It's a heck of a lot
better on my AMD Phenoms, but some lag is noticeable these days, even though
it wasn't a few kernel releases ago.

Intel Specs:
CPU: Intel Core 2 Quad Q6600 (2.6GHz)
Mainboard: ASUS P5K-SE (Intel P35 iirc)
RAM: 4G 800MHz DDR2 dual channel (4x1G)
GPU: NVidia 8800GTS 320M

AMD Specs:
CPU: AMD Phenom I 9550 (2.2GHz)
Mainboard: Gigabyte MA78GM-S2H
RAM: 4G 800MHz DDR2 dual channel (2x2G)
GPU: Onboard Radeon 3200HD

AMD Specs x2:
CPU: AMD Phenom II 810 (2.6GHz)
Mainboard: Gigabyte MA790FXT-UD5P
RAM: 4G 1066MHz DDR3 dual channel (2x2G)
GPU: NVidia 8800GTS 320M (or currently a 8400GS)

Of course I get better performance out of the Phenom II vs either other box,
but it surprises me that I'd get more out of the budget AMD box over the not
so budget Intel box.

--
Thomas Fjellstrom
tfjel...@shaw.ca

Nikos Chantziaras

Sep 8, 2009, 3:20:05 AM
On 09/07/2009 05:40 PM, Arjan van de Ven wrote:

> On Mon, 07 Sep 2009 06:38:36 +0300
> Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>>> [...]
>>> Also, i'd like to outline that i agree with the general goals
>>> described by you in the BFS announcement - small desktop systems
>>> matter more than large systems. We find it critically important
>>> that the mainline Linux scheduler performs well on those systems
>>> too - and if you (or anyone else) can reproduce suboptimal behavior
>>> please let the scheduler folks know so that we can fix/improve it.
>>
>> BFS improved behavior of many applications on my Intel Core 2 box in
>> a way that can't be benchmarked. Examples:
>
> Have you tried to see if latencytop catches such latencies ?

I've just tried it.

I start latencytop and then mplayer on a video that doesn't max out the
CPU (needs about 20-30% of a single core (out of 2 available)). Then,
while the video is playing, I press Alt+Tab repeatedly which makes the
desktop compositor kick-in and stay active (it lays out all windows as a
"flip-switch", similar to the Microsoft Vista Aero alt+tab effect).
Repeatedly pressing alt+tab results in the compositor (in this case KDE
4.3.1) keep doing processing. With the mainline scheduler, mplayer
starts dropping frames and skip sound like crazy for the whole duration
of this exercise.

latencytop has this to say:

http://foss.math.aegean.gr/~realnc/pics/latop1.png

Though I don't really understand what this tool is trying to tell me, I
hope someone does.

Ingo Molnar

Sep 8, 2009, 3:50:14 AM

* Ingo Molnar <mi...@elte.hu> wrote:

> That's interesting. I tried to reproduce it on x86, but the
> profile does not show any scheduler overhead at all on the server:

I've now simulated a saturated iperf server by adding a
udelay(3000) to e1000_intr() via the patch below.

There's no idle time left that way:

Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 93.2%hi, 4.2%si, 0.0%st
Mem: 1021044k total, 93400k used, 927644k free, 5068k buffers
Swap: 8193140k total, 0k used, 8193140k free, 25404k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1604 mingo 20 0 38300 956 724 S 99.4 0.1 3:15.07 iperf
727 root 15 -5 0 0 0 S 0.2 0.0 0:00.41 kondemand/0
1226 root 20 0 6452 336 240 S 0.2 0.0 0:00.06 irqbalance
1387 mingo 20 0 78872 1988 1300 S 0.2 0.2 0:00.23 sshd
1657 mingo 20 0 12752 1128 800 R 0.2 0.1 0:01.34 top
1 root 20 0 10320 684 572 S 0.0 0.1 0:01.79 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd

And the server is only able to saturate half of the 1 gigabit
bandwidth:

Client connecting to t, TCP port 5001


TCP window size: 16.0 KByte (default)
------------------------------------------------------------

[ 3] local 10.0.1.19 port 50836 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 504 MBytes 423 Mbits/sec
------------------------------------------------------------
Client connecting to t, TCP port 5001


TCP window size: 16.0 KByte (default)
------------------------------------------------------------

[ 3] local 10.0.1.19 port 50837 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 502 MBytes 420 Mbits/sec


perf top is showing:

------------------------------------------------------------------------------
PerfTop: 28517 irqs/sec kernel:99.4% [100000 cycles], (all, 1 CPUs)
------------------------------------------------------------------------------

samples pcnt kernel function
_______ _____ _______________

139553.00 - 93.2% : delay_tsc
2098.00 - 1.4% : hmac_digest
561.00 - 0.4% : ip_call_ra_chain
335.00 - 0.2% : neigh_alloc
279.00 - 0.2% : __hash_conntrack
257.00 - 0.2% : dev_activate
186.00 - 0.1% : proc_tcp_available_congestion_control
178.00 - 0.1% : e1000_get_regs
167.00 - 0.1% : tcp_event_data_recv

delay_tsc() dominates, as expected. Still zero scheduler overhead
and the context-switch rate is well below 1000 per sec.

Then i booted v2.6.30 vanilla, added the udelay(3000) and got:

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47026
[ 5] 0.0-10.0 sec 493 MBytes 412 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47027
[ 4] 0.0-10.0 sec 520 MBytes 436 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47028
[ 5] 0.0-10.0 sec 506 MBytes 424 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47029
[ 4] 0.0-10.0 sec 496 MBytes 415 Mbits/sec

i.e. essentially the same throughput. (and this shows that using .30
versus .31 did not materially impact iperf performance in this test,
under these conditions and with this hardware)

Then i applied the BFS patch to v2.6.30 and used the same
udelay(3000) hack and got:

No measurable change in throughput.

Obviously, this test is not equivalent to your test - but it does
show that even saturated iperf is getting scheduled just fine. (or,
rather, does not get scheduled all that much.)

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38505
[ 5] 0.0-10.1 sec 481 MBytes 401 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38506
[ 4] 0.0-10.0 sec 505 MBytes 423 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38507
[ 5] 0.0-10.0 sec 508 MBytes 426 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38508
[ 4] 0.0-10.0 sec 486 MBytes 406 Mbits/sec

So either your MIPS system has some unexpected dependency on the
scheduler, or there's something weird going on.

Mind poking on this one to figure out whether it's all repeatable
and why that slowdown happens? Multiple attempts to reproduce it
failed here for me.

Ingo

Ingo Molnar

Sep 8, 2009, 4:10:06 AM

* Pekka Pietikainen <p...@ee.oulu.fi> wrote:

> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > > Could you profile it please? Also, what's the context-switch rate?
> > >
> > > As far as I can tell, the broadcom mips architecture does not have
> > > profiling support. It does only have some proprietary profiling
> > > registers that nobody wrote kernel support for, yet.
> > Well, what does 'vmstat 1' show - how many context switches are
> > there per second on the iperf server? In theory if it's a truly
> > saturated box, there shouldnt be many - just a single iperf task
>
> Yay, finally something that's measurable in this thread \o/

My initial posting in this thread contains 6 separate types of
measurements, rather extensive ones. Out of those, 4 measurements
were latency oriented, two were throughput oriented. Plenty of data,
plenty of results, and very good reproducibility.

> Gigabit Ethernet iperf on an Atom or so might be something that
> shows similar effects yet is debuggable. Anyone feel like taking a
> shot?

I tried iperf on x86 and simulated saturation and no, there's no BFS
versus mainline performance difference that i can measure - simply
because a saturated iperf server does not schedule much - it's busy
handling all that networking workload.

I did notice that iperf is somewhat noisy: it can easily have weird
outliers regardless of which scheduler is used. That could be an
effect of queueing/timing: depending on precisely what order packets
arrive in and how they get queued by the networking stack, a
cache-effective pathway for packets may get opened - while with
slightly different timings, that pathway closes and we get much worse
queueing performance. I saw noise on the order of 10%, so iperf has
to be measured carefully before drawing conclusions.

> That beast doing iperf probably ends up making it go quite close
> to it's limits (IO, mem bw, cpu). IIRC the routing/bridging
> performance is something like 40Mbps (depends a lot on the model,
> corresponds pretty well with the Mhz of the beast).
>
> Maybe not totally unlike what make -j16 does to a 1-4 core box?

No, a single iperf session is very different from kbuild make -j16.

Firstly, iperf server is just a single long-lived task - so we
context-switch between that and the idle thread [and perhaps a
kernel thread such as ksoftirqd]. The scheduler essentially has no
leeway what task to schedule and for how long: if there's work going
on the iperf server task will run - if there's none, the idle task
runs. [modulo ksoftirqd - depending on the driver model and
dependent on precise timings.]

kbuild -j16 on the other hand is a complex hierarchy and mixture of
thousands of short-lived and long-lived tasks. The scheduler has a
lot of leeway to decide what to schedule and for how long.

From a scheduler perspective the two workloads could not be any more
different. Kbuild does test scheduler decisions in non-trivial ways
- iperf server does not really.

Ingo

Nikos Chantziaras

Sep 8, 2009, 4:20:13 AM
On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>
> * Pekka Pietikainen<p...@ee.oulu.fi> wrote:
>
>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>
>>>> As far as I can tell, the broadcom mips architecture does not have
>>>> profiling support. It does only have some proprietary profiling
>>>> registers that nobody wrote kernel support for, yet.
>>> Well, what does 'vmstat 1' show - how many context switches are
>>> there per second on the iperf server? In theory if it's a truly
>>> saturated box, there shouldnt be many - just a single iperf task
>>
>> Yay, finally something that's measurable in this thread \o/
>
> My initial posting in this thread contains 6 separate types of
> measurements, rather extensive ones. Out of those, 4 measurements
> were latency oriented, two were throughput oriented. Plenty of data,
>> plenty of results, and very good reproducibility.

None of which involve latency-prone GUI applications running on cheap
commodity hardware though. I listed examples where mainline seems to
behave sub-optimally, and ways to reproduce them, but this doesn't seem to
be an area of interest.

Arjan van de Ven

Sep 8, 2009, 4:30:13 AM
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

unfortunately this is both an older version of latencytop, and it's
incorrectly installed ;-(
Latencytop is supposed to translate those cryptic strings to English,
but due to not being correctly installed, it does not do this ;(

the latest version of latencytop also has a GUI (thanks to Ben)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Arjan van de Ven

Sep 8, 2009, 4:40:04 AM
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

despite the untranslated content, it is clear that you have scheduler
delays (either due to scheduler bugs or cpu contention) of up to 68
msecs... Second in line is your binary AMD graphics driver that is
chewing up 14% of your total latency...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Jens Axboe

Sep 8, 2009, 5:20:08 AM
On Mon, Sep 07 2009, Jens Axboe wrote:
> On Mon, Sep 07 2009, Jens Axboe wrote:
> > > And yes, it would be wonderful to get a test-app from you that would
> > > express the kind of pain you are seeing during compile jobs.
> >
> > I was hoping this one would, but it's not showing anything. I even added
> > support for doing the ping and wakeup over a socket, to see if the pipe
> > test was doing well because of the sync wakeup we do there. The net
> > latency is a little worse, but still good. So no luck in making that app
> > so far.
>
> Here's a version that bounces timestamps between a producer and a number
> of consumers (clients). Not really tested much, but perhaps someone can
> compare this on a box that boots BFS and see what happens.

And here's a newer version. It ensures that clients are running before
sending a timestamp, and it drops the first and last log entry to
eliminate any weird effects there. Accuracy should also be improved.

On an idle box, it'll usually log all zeroes. Sometimes I see 3-4msec
latencies, weird.

--
Jens Axboe

latt.c

Benjamin Herrenschmidt

Sep 8, 2009, 6:00:12 AM
On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> So either your MIPS system has some unexpected dependency on the
> scheduler, or there's something weird going on.
>
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens? Multiple attempts to reproduce it
> failed here for me.

Could it be the scheduler using constructs that don't do well on MIPS?

I remember at some stage we spotted an expensive multiply in there;
maybe there's something similar, or some data structure that is
unaligned or non-cache-friendly vs. the MIPS cache line size, that
sort of thing ...

Is this a SW-loaded TLB? Does it miss on kernel space? That could
also be down to differences in how many pages each scheduler touches,
causing more TLB pressure. This will be mostly invisible on x86.

At this stage, it will be hard to tell without some profile data I
suppose. Maybe next week I can try on a small SW loaded TLB embedded PPC
see if I can reproduce some of that, but no promises here.

Cheers,
Ben.

Nikos Chantziaras

Sep 8, 2009, 6:20:06 AM
On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> On Tue, 08 Sep 2009 10:19:06 +0300
> Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> latencytop has this to say:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>
>> Though I don't really understand what this tool is trying to tell me,
>> I hope someone does.
>
> despite the untranslated content, it is clear that you have scheduler
> delays (either due to scheduler bugs or cpu contention) of upto 68
> msecs... Second in line is your binary AMD graphics driver that is
> chewing up 14% of your total latency...

I've now used a correctly installed and up-to-date version of latencytop
and repeated the test. Also, I got rid of AMD's binary blob and used
kernel DRM drivers for my graphics card to throw fglrx out of the
equation (which btw didn't help; the exact same problems occur).

Here the result:

http://foss.math.aegean.gr/~realnc/pics/latop2.png

Again: this is on an Intel Core 2 Duo CPU.

Ingo Molnar

Sep 8, 2009, 6:20:06 AM

* Nikos Chantziaras <rea...@arcor.de> wrote:

> On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>>
>> * Pekka Pietikainen<p...@ee.oulu.fi> wrote:
>>
>>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>>
>>>>> As far as I can tell, the broadcom mips architecture does not have
>>>>> profiling support. It does only have some proprietary profiling
>>>>> registers that nobody wrote kernel support for, yet.
>>>> Well, what does 'vmstat 1' show - how many context switches are
>>>> there per second on the iperf server? In theory if it's a truly
>>>> saturated box, there shouldnt be many - just a single iperf task
>>>
>>> Yay, finally something that's measurable in this thread \o/
>>
>> My initial posting in this thread contains 6 separate types of
>> measurements, rather extensive ones. Out of those, 4 measurements
>> were latency oriented, two were throughput oriented. Plenty of
>> data, plenty of results, and very good reproducability.
>
> None of which involve latency-prone GUI applications running on

> cheap commodity hardware though. [...]

The lat_tcp, lat_pipe and pipe-test numbers are all benchmarks that
characterise such workloads - they show the latency of context
switches.

I also tested where Con posted numbers that BFS has an edge over
mainline: kbuild performance. Should i not have done that?

Also note the interbench latency measurements that Con posted:

http://ck.kolivas.org/patches/bfs/interbench-bfs-cfs.txt

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.004 +/- 0.00436 0.006 100 100
Video 0.008 +/- 0.00879 0.015 100 100
X 0.006 +/- 0.0067 0.014 100 100
Burn 0.005 +/- 0.00563 0.009 100 100
Write 0.005 +/- 0.00887 0.16 100 100
Read 0.006 +/- 0.00696 0.018 100 100
Compile 0.007 +/- 0.00751 0.019 100 100

Versus the mainline scheduler:

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.005 +/- 0.00562 0.007 100 100
Video 0.003 +/- 0.00333 0.009 100 100
X 0.003 +/- 0.00409 0.01 100 100
Burn 0.004 +/- 0.00415 0.006 100 100
Write 0.005 +/- 0.00592 0.021 100 100
Read 0.004 +/- 0.00463 0.009 100 100
Compile 0.003 +/- 0.00426 0.014 100 100

Look at those standard deviation numbers: their spread is way too
high, often 50% or more - very hard to compare such noisy data.

Furthermore, they happen to show the 2.6.30 mainline scheduler
outperforming BFS in almost every interactivity metric.

Check it for yourself and compare the entries. I haven't made those
measurements, Con did.

For example 'Compile' latencies:

--- Benchmarking simulated cpu of Audio in the presence of simulated Load
Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
v2.6.30: Compile 0.003 +/- 0.00426 0.014 100 100
BFS: Compile 0.007 +/- 0.00751 0.019 100 100

but ... with a near 100% standard deviation that's pretty hard to
judge. The Max Latency went from 14 usecs under v2.6.30 to 19 usecs
on BFS.

> [...] I listed examples where mainline seems to behave
> sub-optimally and ways to reproduce them, but this doesn't seem to be
> an area of interest.

It is an area of interest of course. That's how the interactivity
results above became possible.

Ingo

Nikos Chantziaras

Sep 8, 2009, 6:50:05 AM
On 09/08/2009 01:12 PM, Ingo Molnar wrote:
>
> * Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>>>
>>> * Pekka Pietikainen<p...@ee.oulu.fi> wrote:
>>>
>>>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>>>
>>>>>> As far as I can tell, the broadcom mips architecture does not have
>>>>>> profiling support. It does only have some proprietary profiling
>>>>>> registers that nobody wrote kernel support for, yet.
>>>>> Well, what does 'vmstat 1' show - how many context switches are
>>>>> there per second on the iperf server? In theory if it's a truly
>>>>> saturated box, there shouldn't be many - just a single iperf task
>>>>
>>>> Yay, finally something that's measurable in this thread \o/
>>>
>>> My initial posting in this thread contains 6 separate types of
>>> measurements, rather extensive ones. Out of those, 4 measurements
>>> were latency oriented, two were throughput oriented. Plenty of
>>> data, plenty of results, and very good reproducibility.
>>
>> None of which involve latency-prone GUI applications running on
>> cheap commodity hardware though. [...]
>
> The lat_tcp, lat_pipe and pipe-test numbers are all benchmarks that
> characterise such workloads - they show the latency of context
> switches.
>
> I also tested where Con posted numbers that BFS has an edge over
> mainline: kbuild performance. Should i not have done that?

It's good that you did, of course. However, when someone reports a
problem/issue, the developer usually tries to reproduce the problem; he
needs to see what the user sees. This is how it's usually done, not
only in most other development environments, but also here, from what I
could gather by reading this list. When getting reports about interactivity
issues and with very specific examples of how to reproduce, I would have
expected that most developers interested in identifying the issue would
try to reproduce the same problem and work from there. That would mean
that you (or anyone else with an interest in tracking this down) would
follow the examples given (by me and others, like enabling desktop
compositing, firing up mplayer with a video and generally reproducing
this using the quite detailed steps I posted as a recipe).

However, in this case, instead of the above, raw numbers are posted with
batch jobs and benchmarks that aren't actually reproducing the issue as
described by the reporter(s). That way, the developer doesn't get to
experience the issue first-hand (and due to this possibly missing the
real cause). In most other bug reports or issues, the right thing seems
to happen and the devs try to reproduce it exactly as described. But
not in this case. I suspect this is because most devs don't use the
software components on their machines that are necessary for this, and
therefore reproducing the issue exactly as described would take too
much time?

Nikos Chantziaras

Sep 8, 2009, 7:40:09 AM
On 09/08/2009 02:54 AM, Thomas Fjellstrom wrote:
> On Sun September 6 2009, Nikos Chantziaras wrote:
>> [...]

>> For reference, my system is:
>>
>> CPU: Intel Core 2 Duo E6600 (2.4GHz)
>> Mainboard: Asus P5E (Intel X38 chipset)
>> RAM: 6GB (2+2+1+1) dual channel DDR2 800
>> GPU: RV770 (Radeon HD4870).
>>
>
> My Phenom 9550 (2.2GHz) whips the pants off my Intel Q6600 (2.6GHz). A
> friend of mine and I both get large amounts of stalling when doing a lot
> of IO. I haven't seen such horrible desktop interactivity since before
> the new schedulers and the -ck patchset came out for 2.4.x. It's a heck
> of a lot better on my AMD Phenoms, but some lag is noticeable these
> days, even when it wasn't a few kernel releases ago.

It seems someone tried BFS on much slower hardware: Android. According
to the feedback, the device is much more responsive with BFS:
http://twitter.com/cyanogen

Juergen Beisert

Sep 8, 2009, 7:40:08 AM
On Dienstag, 8. September 2009, Nikos Chantziaras wrote:
> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> > On Tue, 08 Sep 2009 10:19:06 +0300
> >
> > Nikos Chantziaras<rea...@arcor.de> wrote:
> >> latencytop has this to say:
> >>
> >> http://foss.math.aegean.gr/~realnc/pics/latop1.png
> >>
> >> Though I don't really understand what this tool is trying to tell me,
> >> I hope someone does.
> >
> > despite the untranslated content, it is clear that you have scheduler
> > delays (either due to scheduler bugs or cpu contention) of up to 68
> > msecs... Second in line is your binary AMD graphics driver that is
> > chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of latencytop
> and repeated the test. Also, I got rid of AMD's binary blob and used
> kernel DRM drivers for my graphics card to throw fglrx out of the
> equation (which btw didn't help; the exact same problems occur).
>
> Here is the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> Again: this is on an Intel Core 2 Duo CPU.

Just an idea: Maybe some system management code hits you?

jbe

--
Pengutronix e.K. | Juergen Beisert |
Linux Solutions for Science and Industry | Phone: +49-8766-939 228 |
Vertretung Sued/Muenchen, Germany | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de/ |

Ingo Molnar

Sep 8, 2009, 7:40:13 AM

* Nikos Chantziaras <rea...@arcor.de> wrote:

> [...] That would mean that you (or anyone else with an interest in
> tracking this down) would follow the examples given (by me and
> others, like enabling desktop compositing, firing up mplayer with
> a video and generally reproducing this using the quite detailed
> steps I posted as a recipe).

Could you follow up on Frederic's detailed tracing suggestions that
would give us the source of the latency?

( Also, as per lkml etiquette, please try to keep the Cc: list
intact when replying to emails. I missed your first reply
that you un-Cc:-ed. )

A quick look at the latencytop output suggests a scheduling latency.
Could you send me the kernel .config that you are using?

Ingo

el_es

Sep 8, 2009, 8:10:07 AM
Ingo Molnar <mingo <at> elte.hu> writes:


> For example 'Compile' latencies:
>
> --- Benchmarking simulated cpu of Audio in the presence of simulated Load
> Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
> v2.6.30: Compile 0.003 +/- 0.00426 0.014 100 100
> BFS: Compile 0.007 +/- 0.00751 0.019 100 100
>
> but ... with a near 100% standard deviation that's pretty hard to
> judge. The Max Latency went from 14 usecs under v2.6.30 to 19 usecs
> on BFS.
>
[...]

> Ingo
>

This just struck me: maybe what desktop users *feel* is exactly that: the
current approach is too fine-grained, trying to achieve the minimum latency
with the *most* reproducible result (lowest stddev) at all costs? And BFS
just doesn't care? I know this sounds like heresy.


Lukasz

Theodore Tso

Sep 8, 2009, 8:10:08 AM
On Tue, Sep 08, 2009 at 01:13:34PM +0300, Nikos Chantziaras wrote:
>> despite the untranslated content, it is clear that you have scheduler
>> delays (either due to scheduler bugs or cpu contention) of up to 68
>> msecs... Second in line is your binary AMD graphics driver that is
>> chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of latencytop
> and repeated the test. Also, I got rid of AMD's binary blob and used
> kernel DRM drivers for my graphics card to throw fglrx out of the
> equation (which btw didn't help; the exact same problems occur).
>
> Here is the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png

This was with an unmodified 2.6.31-rcX kernel? Does Latencytop do
anything useful on a BFS-patched kernel?

- Ted

Felix Fietkau

Sep 8, 2009, 9:50:04 AM
Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
>> So either your MIPS system has some unexpected dependency on the
>> scheduler, or there's something weird going on.
>>
>> Mind poking on this one to figure out whether it's all repeatable
>> and why that slowdown happens? Multiple attempts to reproduce it
>> failed here for me.
>
> Could it be the scheduler using constructs that don't do well on MIPS ?
>
> I remember at some stage we spotted an expensive multiply in there,
> maybe there's something similar, or some unaligned or non-cache friendly
> vs. the MIPS cache line size data structure, that sort of thing ...
>
> Is this a SW loaded TLB ? Does it misses on kernel space ? That could
> also be some differences in how many pages are touched by each scheduler
> causing more TLB pressure. This will be mostly invisible on x86.
The TLB is SW loaded, yes. However, it should not take any misses on
kernel space, since the whole segment is covered by a wired TLB entry.

- Felix

Arjan van de Ven

Sep 8, 2009, 10:20:05 AM
On Tue, 08 Sep 2009 13:13:34 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> > On Tue, 08 Sep 2009 10:19:06 +0300
> > Nikos Chantziaras<rea...@arcor.de> wrote:
> >
> >> latencytop has this to say:
> >>
> >> http://foss.math.aegean.gr/~realnc/pics/latop1.png
> >>
> >> Though I don't really understand what this tool is trying to tell
> >> me, I hope someone does.
> >
> > despite the untranslated content, it is clear that you have
> > scheduler delays (either due to scheduler bugs or cpu contention)
> > of up to 68 msecs... Second in line is your binary AMD graphics
> > driver that is chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of
> latencytop and repeated the test. Also, I got rid of AMD's binary
> blob and used kernel DRM drivers for my graphics card to throw fglrx
> out of the equation (which btw didn't help; the exact same problems
> occur).
>
> Here is the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> Again: this is on an Intel Core 2 Duo CPU.


so we finally have objective numbers!

now the interesting part is also WHERE the latency hits. Because
fundamentally, if you oversubscribe the CPU, you WILL get scheduling
latency... you simply have more to run than there is CPU time.

Now the scheduler impacts this latency in two ways
* Deciding how long apps run before someone else gets to take over
("time slicing")
* Deciding who gets to run first/more; eg priority between apps

the first one more or less controls the maximum, while the second one
controls which apps get to enjoy this maximum.

latencytop shows you both, but it is interesting to see how much
latency the apps you actually care about are getting...

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Michael Buesch

Sep 8, 2009, 10:50:05 AM
On Tuesday 08 September 2009 09:48:25 Ingo Molnar wrote:
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens?

I repeated the test several times, because I couldn't really believe that
there's such a big difference for me, but the results were the same.
I don't really know what's going on, nor how to find out.

--
Greetings, Michael.

Peter Zijlstra

Sep 8, 2009, 11:30:18 AM
On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> And here's a newer version.

I tinkered a bit with your proglet and finally found the problem.

You used a single pipe per child, this means the loop in run_child()
would consume what it just wrote out until it got force preempted by the
parent which would also get woken.

This results in the child spinning a while (its full quota) and only
reporting the last timestamp to the parent.

Since the consumer (parent) is a single thread, the program basically
measures the worst delay in a thundering herd wakeup of N children.

The below version yields:

idle

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 664 (clients=8)

Averages:
------------------------------
Max 128 usec
Avg 26 usec
Stdev 16 usec


make -j4

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 648 (clients=8)

Averages:
------------------------------
Max 20861 usec
Avg 3763 usec
Stdev 4637 usec


Mike's patch, make -j4

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 648 (clients=8)

Averages:
------------------------------
Max 17854 usec
Avg 6298 usec
Stdev 4735 usec

latt.c

Michael Buesch

Sep 8, 2009, 11:50:10 AM
On Monday 07 September 2009 22:57:01 Ingo Molnar wrote:
>
> * Michael Buesch <m...@bu3sch.de> wrote:

>
> > On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
>
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldn't be many - just a single iperf task
> running at 100% CPU utilization or so.
>
> (Also, if there's hrtimer support for that board then perfcounters
> could be used to profile it.)

CFS:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 15892 1684 5868 0 0 0 0 268 6 31 69 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 2 34 66 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 6 33 67 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 4 37 63 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 6 34 66 0 0
[ 4] local 192.168.1.1 port 5001 connected with 192.168.1.99 port 47278
2 0 0 15756 1684 5868 0 0 0 0 1655 68 26 74 0 0
2 0 0 15756 1684 5868 0 0 0 0 1945 88 20 80 0 0
2 0 0 15756 1684 5868 0 0 0 0 1882 85 20 80 0 0
2 0 0 15756 1684 5868 0 0 0 0 1923 86 18 82 0 0
2 0 0 15756 1684 5868 0 0 0 0 1986 87 23 77 0 0
2 0 0 15756 1684 5868 0 0 0 0 1923 87 17 83 0 0
2 0 0 15756 1684 5868 0 0 0 0 1951 84 19 81 0 0
2 0 0 15756 1684 5868 0 0 0 0 1970 87 18 82 0 0
2 0 0 15756 1684 5868 0 0 0 0 1972 85 23 77 0 0
2 0 0 15756 1684 5868 0 0 0 0 1961 87 18 82 0 0
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 28.6 MBytes 23.9 Mbits/sec
1 0 0 15752 1684 5868 0 0 0 0 599 22 22 78 0 0
1 0 0 15752 1684 5868 0 0 0 0 269 4 32 68 0 0
1 0 0 15752 1684 5868 0 0 0 0 266 4 29 71 0 0
1 0 0 15764 1684 5868 0 0 0 0 267 6 37 63 0 0
1 0 0 15764 1684 5868 0 0 0 0 267 4 31 69 0 0
1 0 0 15768 1684 5868 0 0 0 0 266 4 51 49 0 0


I'm currently unable to test BFS, because the device throws strange flash errors.
Maybe the flash is broken :(

Jesse Brandeburg

Sep 8, 2009, 2:00:11 PM
On Tue, Sep 8, 2009 at 5:57 AM, Serge
Belyshev<bely...@depni.sinp.msu.ru> wrote:
>
> Hi. I've done measurements of time taken by make -j4 kernel build
> on a quadcore box. Results are interesting: mainline kernel
> has regressed since v2.6.23 release by more than 10%.

Is this related to why I now have to double the number of threads X I
pass to make -jX, in order to use all my idle time for a kernel
compile? I had noticed (without measuring exactly) that with each
kernel released in this series I had to increase my number of worker
threads; my common working model now is (cpus * 2) in order to get
zero idle time.

Sorry I haven't tested BFS yet, but am interested to see if it helps
interactivity when playing flash videos on my dual core laptop.

Nikos Chantziaras

Sep 8, 2009, 2:20:05 PM
On 09/07/2009 02:01 PM, Frederic Weisbecker wrote:

> On Mon, Sep 07, 2009 at 06:38:36AM +0300, Nikos Chantziaras wrote:
>> Unfortunately, I can't come up with any way to somehow benchmark all of
>> this. There's no benchmark for "fluidity" and "responsiveness". Running
>> the Doom 3 benchmark, or any other benchmark, doesn't say anything about
>> responsiveness, it only measures how many frames were calculated in a
>> specific period of time. How "stably" (with no stalls) those frames
>> made it to the screen is not measurable.
>
>
> That looks eventually benchmarkable. This is about latency.
> For example, you could try to run high load tasks in the
> background and then launch a task that wakes up at medium/large
> intervals to do something. You could measure the time it takes for it
> to wake up and perform its work.
>
> We have some events tracing infrastructure in the kernel that can
> snapshot the wake up and sched switch events.
>
> Having CONFIG_EVENT_TRACING=y should be sufficient for that.
>
> You just need to mount a debugfs point, say in /debug.
>
> Then you can activate these sched events by doing:
>
> echo 0 > /debug/tracing/tracing_on
> echo 1 > /debug/tracing/events/sched/sched_switch/enable
> echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
>
> # Launch your tasks
>
> echo 1 > /debug/tracing/tracing_on
>
> # Wait for some time
>
> echo 0 > /debug/tracing/tracing_on
>
> That will require some parsing of the result in /debug/tracing/trace
> to get the delays between wakeup events and switch-in events
> for the task that periodically wakes up, and then producing some
> statistics such as the average or the maximum latency.
>
> That's a bit of a rough approach to measure such latencies but that
> should work.

I've tried this with 2.6.31-rc9 while running mplayer and alt+tabbing
repeatedly to the point where mplayer starts to stall and drop frames.
This produced a 4.1MB trace file (132k bzip2'ed):

http://foss.math.aegean.gr/~realnc/kernel/trace1.bz2

Uncompressed for online viewing:

http://foss.math.aegean.gr/~realnc/kernel/trace1

I must admit that I don't know what it is I'm looking at :P

Nikos Chantziaras

Sep 8, 2009, 2:30:05 PM
On 09/08/2009 08:47 PM, Jesse Brandeburg wrote:
> [...]
> Sorry I haven't tested BFS yet, but am interested to see if it helps
> interactivity when playing flash videos on my dual core laptop.

Interactivity: yes (Flash will not result in the rest of the system
lagging).

Flash videos: they will still play as bad as before. BFS has no way to
fix broken code inside Flash :P

Nikos Chantziaras

Sep 8, 2009, 2:40:06 PM
On 09/08/2009 03:57 PM, Serge Belyshev wrote:
>
> Hi. I've done measurements of time taken by make -j4 kernel build
> on a quadcore box. Results are interesting: mainline kernel
> has regressed since v2.6.23 release by more than 10%.

It seems more people are starting to confirm this issue:

http://foldingforum.org/viewtopic.php?f=44&t=11336

IMHO it's not as dramatic as some people there describe it ("Is it
the holy grail?"), but if something makes your desktop "smooth as silk"
just like that, it might seem like a holy grail ;) In any case, there
clearly seems to be a performance problem with the mainline scheduler on
many people's desktops that is being solved by BFS.

Nikos Chantziaras

Sep 8, 2009, 3:10:06 PM
On 09/08/2009 02:35 PM, Ingo Molnar wrote:
>
> * Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> [...] That would mean that you (or anyone else with an interest in
>> tracking this down) would follow the examples given (by me and
>> others, like enabling desktop compositing, firing up mplayer with
>> a video and generally reproducing this using the quite detailed
>> steps I posted as a recipe).
>
> Could you follow up on Frederic's detailed tracing suggestions that
> would give us the source of the latency?

I've set it up and ran the tests now.


> ( Also, as per lkml etiquette, please try to keep the Cc: list
> intact when replying to emails. I missed your first reply
> that you un-Cc:-ed. )

Sorry for that.


> A quick look at the latencytop output suggests a scheduling latency.
> Could you send me the kernel .config that you are using?

That would be this one:

http://foss.math.aegean.gr/~realnc/kernel/config-2.6.31-rc9

Jeff Garzik

Sep 8, 2009, 3:10:07 PM
On 09/08/2009 01:47 PM, Jesse Brandeburg wrote:
> On Tue, Sep 8, 2009 at 5:57 AM, Serge
> Belyshev<bely...@depni.sinp.msu.ru> wrote:
>>
>> Hi. I've done measurements of time taken by make -j4 kernel build
>> on a quadcore box. Results are interesting: mainline kernel
>> has regressed since v2.6.23 release by more than 10%.
>
> Is this related to why I now have to double the amount of threads X I
> pass to make -jX, in order to use all my idle time for a kernel
> compile? I had noticed (without measuring exactly) that it seems with
> each kernel released in this series mentioned, I had to increase my
> number of worker threads, my common working model now is (cpus * 2) in
> order to get zero idle time.

You will almost certainly see idle CPUs/threads with "make -jN_CPUS" due
to processes waiting for I/O.

If you're curious, there is also room for experimenting with make's "-l"
argument, which caps the number of jobs based on load average rather
than a static number of job slots.

Jeff

Jeff Garzik

Sep 8, 2009, 3:30:12 PM
On 09/08/2009 03:20 PM, Serge Belyshev wrote:

> Jeff Garzik<je...@garzik.org> writes:
>
>> You will almost certainly see idle CPUs/threads with "make -jN_CPUS"
>> due to processes waiting for I/O.
>
> Just to clarify: I have excluded all I/O effects in my plots completely
> by building entirely from tmpfs. Also, before each actual measurement
> there was a discarded "pre-caching" run. And my box has 8GB of RAM.

You could always one-up that by using ramfs ;)

Serge Belyshev

Sep 8, 2009, 3:30:14 PM
Jeff Garzik <je...@garzik.org> writes:

> You will almost certainly see idle CPUs/threads with "make -jN_CPUS"
> due to processes waiting for I/O.

Just to clarify: I have excluded all I/O effects in my plots completely
by building entirely from tmpfs. Also, before each actual measurement
there was a discarded "pre-caching" run. And my box has 8GB of RAM.

Frans Pop

Sep 8, 2009, 4:30:05 PM
Arjan van de Ven wrote:
> the latest version of latencytop also has a GUI (thanks to Ben)

That looks nice, but...

I kind of miss the split screen feature where latencytop would show both
the overall figures + the ones for the currently most affected task.
Downside of that last was that I never managed to keep the display on a
specific task.

The graphical display also makes it impossible to simply copy and paste
the results.

Having the freeze button is nice though.

Would it be possible to have a command line switch that allows to start
the old textual mode?

Looks like the man page needs updating too :-)

Cheers,
FJP

Jens Axboe

Sep 8, 2009, 4:40:08 PM
On Tue, Sep 08 2009, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > And here's a newer version.
>
> I tinkered a bit with your proglet and finally found the problem.
>
> You used a single pipe per child, this means the loop in run_child()
> would consume what it just wrote out until it got force preempted by the
> parent which would also get woken.
>
> This results in the child spinning a while (its full quota) and only
> reporting the last timestamp to the parent.

Oh doh, that's not well thought out. Well it was a quick hack :-)
Thanks for the fixup, now it's at least usable to some degree.

> Since the consumer (parent) is a single thread, the program basically
> measures the worst delay in a thundering herd wakeup of N children.

Yes, it's really meant to measure how long it takes to wake a group of
processes, assuming that this is where things fall down in the 'box
loaded, switch desktop' case. Now whether that's useful, or whether
this test app is worth the bits it takes up on the hard drive, is
another question.

--
Jens Axboe

Michal Schmidt

Sep 8, 2009, 5:20:10 PM
Dne Tue, 8 Sep 2009 22:22:43 +0200
Frans Pop <ele...@planet.nl> napsal(a):

> Would it be possible to have a command line switch that allows to
> start the old textual mode?

I use:
DISPLAY= latencytop

:-)
Michal

Frans Pop

Sep 8, 2009, 5:20:14 PM
On Tuesday 08 September 2009, Frans Pop wrote:
> Arjan van de Ven wrote:
> > the latest version of latencytop also has a GUI (thanks to Ben)
>
> That looks nice, but...
>
> I kind of miss the split screen feature where latencytop would show
> both the overall figures + the ones for the currently most affected
> task. Downside of that last was that I never managed to keep the
> display on a specific task.
[...]

> Would it be possible to have a command line switch that allows to start
> the old textual mode?

I got a private reply suggesting that --nogui might work, and it does.
Thanks a lot Nikos!

> Looks like the man page needs updating too :-)

So this definitely needs attention :-P
Support of the standard -h and --help options would be great too.

Nikos Chantziaras

Sep 8, 2009, 5:30:10 PM
On 09/08/2009 03:03 PM, Theodore Tso wrote:
> On Tue, Sep 08, 2009 at 01:13:34PM +0300, Nikos Chantziaras wrote:
>>> despite the untranslated content, it is clear that you have scheduler
>>> delays (either due to scheduler bugs or cpu contention) of up to 68
>>> msecs... Second in line is your binary AMD graphics driver that is
>>> chewing up 14% of your total latency...
>>
>> I've now used a correctly installed and up-to-date version of latencytop
>> and repeated the test. Also, I got rid of AMD's binary blob and used
>> kernel DRM drivers for my graphics card to throw fglrx out of the
>> equation (which btw didn't help; the exact same problems occur).
>>
>> Here is the result:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> This was with an unmodified 2.6.31-rcX kernel?

Yes (-rc9). I also tested with 2.6.30.5 and getting the same results.


> Does Latencytop do anything useful on a BFS-patched kernel?

Nope. BFS does not support any form of tracing yet. latencytop runs
but only shows a blank list. All I can say is that a BFS patched kernel
with the same .config fixes all visible latency issues.

GeunSik Lim

Sep 8, 2009, 5:50:10 PM
On Wed, Sep 9, 2009 at 6:11 AM, Frans Pop<ele...@planet.nl> wrote:
>> Would it be possible to have a command line switch that allows to start
>> the old textual mode?
> I got a private reply suggesting that --nogui might work, and it does.
Um, you mean that you tested with runlevel 3 (multiuser mode), is that
right? Frans, can you share your Linux distribution for this test?
I want to check under the same conditions (e.g. Linux distribution such
as Fedora 11 or Ubuntu 9.04, runlevel, and so on).

> Thanks a lot Nikos!
>> Looks like the man page needs updating too :-)
> So this definitely needs attention :-P
> Support of the standard -h and --help options would be great too.
> Cheers,
> FJP
> --

Thanks,
GeunSik Lim.

--
Regards,
GeunSik Lim ( Samsung Electronics )
Blog : http://blog.naver.com/invain/
e-Mail: geuns...@samsung.com
lee...@gmail.com , lee...@gmail.com

Nikos Chantziaras

Sep 8, 2009, 6:10:09 PM
On 09/08/2009 02:32 PM, Juergen Beisert wrote:
> On Dienstag, 8. September 2009, Nikos Chantziaras wrote:
>> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
>>> On Tue, 08 Sep 2009 10:19:06 +0300
>>>
>>> Nikos Chantziaras<rea...@arcor.de> wrote:
>>>> latencytop has this to say:
>>>>
>>>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>>>
>>>> Though I don't really understand what this tool is trying to tell me,
>>>> I hope someone does.
>>>
>>> despite the untranslated content, it is clear that you have scheduler
>>> delays (either due to scheduler bugs or cpu contention) of up to 68
>>> msecs... Second in line is your binary AMD graphics driver that is
>>> chewing up 14% of your total latency...
>>
>> I've now used a correctly installed and up-to-date version of latencytop
>> and repeated the test. Also, I got rid of AMD's binary blob and used
>> kernel DRM drivers for my graphics card to throw fglrx out of the
>> equation (which btw didn't help; the exact same problems occur).
>>
>> Here is the result:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>>
>> Again: this is on an Intel Core 2 Duo CPU.
>
> Just an idea: Maybe some system management code hits you?

I'm not sure what is meant by "system management code."

Serge Belyshev

Sep 8, 2009, 6:20:10 PM
Serge Belyshev <bely...@depni.sinp.msu.ru> writes:
>[snip]

I've updated the graphs, added kernels 2.6.24..2.6.29:
http://img186.imageshack.us/img186/7029/epicmakej4.png

And added comparison with best-performing 2.6.23 kernel:
http://img34.imageshack.us/img34/7563/epicbfstips.png

>
> Conclusions are
> 1) mainline has severely regressed since v2.6.23
> 2) BFS shows optimal performance at make -jN where N equals number of
> h/w threads, while current mainline scheduler performance is far from
> optimal in this case.

Frans Pop

Sep 8, 2009, 6:40:04 PM
On Tuesday 08 September 2009, you wrote:
> On Wed, Sep 9, 2009 at 6:11 AM, Frans Pop<ele...@planet.nl> wrote:
> >> Would it be possible to have a command line switch that allows to
> >> start the old textual mode?
> >
> > I got a private reply suggesting that --nogui might work, and it
> > does.
>
> Um. You means that you tested with runlevel 3(multiuser mode). Is it
> right? Frans. Can you share me your linux distribution for this test? I
> want to check with same conditions(e.g:linux distribution like fedora
> 11,ubuntu9.04 , runlevel, and so on.).

I ran it from KDE's konsole by just entering 'sudo latencytop --nogui' at
the command prompt.

Distro is Debian stable ("Lenny"), which does not have differences between
runlevels: by default they all start a desktop environment (if a display
manager like xdm/kdm/gdm is installed). But if you really want to know,
the runlevel was 2 ;-)

Cheers,
FJP

Nikos Chantziaras

Sep 8, 2009, 7:00:21 PM

Sounds plausible. However, with mainline this latency is very, very
noticeable. With BFS I need to look really hard to detect it or do
outright silly things, like a "make -j50". (At first I wrote "-j20"
here but then went ahead an tested it just for kicks, and BFS would
still let me use the GUI smoothly, LOL. So then I corrected it to
"-j50"...)

Jiri Kosina

Sep 8, 2009, 7:30:10 PM
On Wed, 9 Sep 2009, Nikos Chantziaras wrote:

> > > Here the result:
> > >
> > > http://foss.math.aegean.gr/~realnc/pics/latop2.png
> > >
> > > Again: this is on an Intel Core 2 Duo CPU.
> >
> > Just an idea: Maybe some system management code hits you?
>
> I'm not sure what is meant with "system management code."

A System Management Interrupt (SMI) happens when firmware/BIOS/HW-debugger
code executes at a privilege level so high that even the OS can't do
anything about it.

It is used in many situations, such as

- memory errors
- ACPI (mostly fan control)
- TPM

The OS has little to no ability to influence SMI/SMM. But if this were
the cause, you would probably see completely different results on a
different hardware configuration (as it is likely to have completely
different SMM behavior).
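One crude way to look for this kind of interference from user space is to busy-poll a high-resolution clock and log the gaps, in the spirit of the kernel's hwlat detector. A rough sketch (the threshold and window here are arbitrary choices, and a large gap only proves *some* invisible interference, not specifically an SMI):

```python
import time

def detect_gaps(duration_s=0.1, threshold_ns=100_000):
    """Busy-poll the monotonic clock and record any gap between two
    successive reads that exceeds threshold_ns. On an otherwise idle
    CPU, large gaps hint at preemption, SMIs, or other interference
    invisible to the process itself."""
    gaps = []
    end = time.monotonic_ns() + int(duration_s * 1e9)
    last = time.monotonic_ns()
    while last < end:
        now = time.monotonic_ns()
        if now - last > threshold_ns:
            gaps.append(now - last)
        last = now
    return gaps

if __name__ == "__main__":
    gaps = detect_gaps()
    print(f"{len(gaps)} gaps above threshold")
```

Running this under both kernels would help separate scheduler-induced stalls from SMI-induced ones, since the latter should show up identically under either scheduler.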

--
Jiri Kosina
SUSE Labs, Novell Inc.

Nikos Chantziaras

Sep 8, 2009, 7:40:07 PM
On 09/09/2009 02:20 AM, Jiri Kosina wrote:
> On Wed, 9 Sep 2009, Nikos Chantziaras wrote:
>
>>>> Here the result:
>>>>
>>>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>>>>
>>>> Again: this is on an Intel Core 2 Duo CPU.
>>>
>>> Just an idea: Maybe some system management code hits you?
>>
>> I'm not sure what is meant with "system management code."
>
> System management interrupt happens when firmware/BIOS/HW-debugger is
> executed in privilege mode so high, that even OS can't do anything about
> that.
>
> It is used in many situations, such as
>
> - memory errors
> - ACPI (mostly fan control)
> - TPM
>
> OS has small to none possibility to influence SMI/SMM. But if this would
> be the cause, you should probably obtain completely different results on
> different hardware configuration (as it is likely to have completely
> different SMM behavior).

Wouldn't that mean that a BFS-patched kernel would suffer from this too?

In any case, of the above, only fan control is active, and I've run with
it disabled on occasion (hot summer days, I wanted to just keep it max
with no fan control) with no change. As far as I can tell, the Asus P5E
doesn't have a TPM (the "Deluxe" and "VM" models seem to have one.) As
for memory errors, I use unbuffered non-ECC RAM which passes a
memtest86+ cycle cleanly (well, at least the last time I ran it through
one, a few months ago.)

Benjamin Herrenschmidt

Sep 8, 2009, 8:30:06 PM
> The TLB is SW loaded, yes. However it should not do any misses on kernel
> space, since the whole segment is in a wired TLB entry.

Including vmalloc space ?

Ben.

David Miller

Sep 8, 2009, 8:40:06 PM
From: Benjamin Herrenschmidt <be...@kernel.crashing.org>
Date: Wed, 09 Sep 2009 10:28:22 +1000

>> The TLB is SW loaded, yes. However it should not do any misses on kernel
>> space, since the whole segment is in a wired TLB entry.
>
> Including vmalloc space ?

No, MIPS does take SW tlb misses on vmalloc space. :-)

Ralf Baechle

Sep 8, 2009, 8:50:09 PM
On Tue, Sep 08, 2009 at 07:50:00PM +1000, Benjamin Herrenschmidt wrote:

> On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> > So either your MIPS system has some unexpected dependency on the
> > scheduler, or there's something weird going on.


> >
> > Mind poking on this one to figure out whether it's all repeatable

> > and why that slowdown happens? Multiple attempts to reproduce it
> > failed here for me.
>
> Could it be the scheduler using constructs that don't do well on MIPS ?

It would surprise me.

I'm wondering if BFS has properties that make it perform better on a very
low memory system; I guess the BCM74xx system will have like 32MB or 64MB
only.

> I remember at some stage we spotted an expensive multiply in there,
> maybe there's something similar, or some unaligned or non-cache friendly
> vs. the MIPS cache line size data structure, that sort of thing ...
>
> Is this a SW loaded TLB ? Does it misses on kernel space ? That could
> also be some differences in how many pages are touched by each scheduler
> causing more TLB pressure. This will be mostly invisible on x86.

Software refilled. No misses ever for kernel space or low-mem; think of
it as low-mem and kernel executable living in a 512MB page that is mapped
by a mechanism outside the TLB. Vmalloc ranges are TLB mapped. Ioremap
address ranges only if above physical address 512MB.

An emulated unaligned load/store is very expensive; one that is encoded
properly by GCC for __attribute__((packed)) is only 1 cycle and 1
instruction ( = 4 bytes) extra.
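For illustration (not from this thread), Python's struct module can show the layout difference that packing creates: it removes the padding that keeps a word-sized field naturally aligned, which is exactly what forces the compiler to emit the extra access sequence for `__attribute__((packed))` fields in C:

```python
import struct

# Native alignment ("@"): the u32 that follows a u8 is padded out to a
# 4-byte boundary, so any word load of it is naturally aligned.
aligned = struct.calcsize("@BI")   # 1 byte + 3 pad + 4 bytes = 8

# Packed ("="): no padding; the u32 starts at offset 1, so a direct
# word load of it would be unaligned.
packed = struct.calcsize("=BI")    # 1 byte + 4 bytes = 5

print(aligned, packed)
```

When GCC knows the field is packed it emits a safe (slightly longer) access sequence up front; the expensive case Ralf describes is an unaligned access the compiler did not know about, which traps and gets emulated.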

> At this stage, it will be hard to tell without some profile data I
> suppose. Maybe next week I can try on a small SW loaded TLB embedded PPC
> see if I can reproduce some of that, but no promises here.

Ralf

Felix Fietkau

Sep 8, 2009, 9:40:05 PM
Ralf Baechle wrote:
>> I remember at some stage we spotted an expensive multiply in there,
>> maybe there's something similar, or some unaligned or non-cache friendly
>> vs. the MIPS cache line size data structure, that sort of thing ...
>>
>> Is this a SW loaded TLB ? Does it misses on kernel space ? That could
>> also be some differences in how many pages are touched by each scheduler
>> causing more TLB pressure. This will be mostly invisible on x86.
>
> Software refilled. No misses ever for kernel space or low-mem; think of
> it as low-mem and kernel executable living in a 512MB page that is mapped
> by a mechanism outside the TLB. Vmalloc ranges are TLB mapped. Ioremap
> address ranges only if above physical address 512MB.
>
> An emulated unaligned load/store is very expensive; one that is encoded
> properly by GCC for __attribute__((packed)) is only 1 cycle and 1
> instruction ( = 4 bytes) extra.
CFS definitely isn't causing any emulated unaligned load/stores on these
devices, we've tested that.

- Felix

Markus Tornqvist

Sep 9, 2009, 2:10:06 AM
Let's test gmane's followup feature ;)

Ingo Molnar <mingo <at> elte.hu> writes:

> > Please Cc me as I'm not a subscriber.
> > > kernel build performance on quad:
> > > http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
> > [...]
> > >
> > >It shows similar curves and behavior to the 8-core results i posted
> > >- BFS is slower than mainline in virtually every measurement. The
> > >ratios are different for different parts of the graphs - but the
> > >trend is similar.
> >
> > Dude, not cool.
> >
> > 1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores
>
> No, it's 4 cores. HyperThreading adds two 'siblings' per core, which
> are not 'cores'.

Like Serge Belyshev says in
http://article.gmane.org/gmane.linux.kernel/886881
and Con thanks you in the FAQ for confirming it:
"h/w threads" - My Sempron II X4 lists four of those, and it seems
to be a common setup.

> > 2. You just proved BFS is better on the job_count == core_count case, as BFS
> > says it is, if you look at the graph
>
> I pointed that out too. I think the graphs speak for themselves:
>
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

Those are in alignment with the FAQ, for the hardware threads.

Mr Belyshev's benchmarks are closer to a common desktop and they rock
over CFS.

That's also something that IMO "we" forgot here: it doesn't really matter!

BFS is not up for merging, it feels way better than CFS on the desktop
and it does not scale.

This thread can be about improving CFS, I do not care personally, and
will stay out of that discussion.

> There's bfs-209 out there today. These tests take 8+ hours to
> complete and validate. I'll re-test BFS in the future too, and as i
> said it in the first mail i'll test it on a .31 base as well once
> BFS has been ported to it:

Apropos your tests, under which circumstances would I have a million
piped messages on my desktop?

Would you care to comment on the relevance of your other tests from
a desktop point of view?

Fortunately you got help from the community as posted on the list.

> > Also, you said on http://article.gmane.org/gmane.linux.kernel/886319
> > "I also tried to configure the kernel in a BFS friendly way, i used
> > HZ=1000 as recommended, turned off all debug options, etc. The
> > kernel config i used can be found here:
> > http://redhat.com/~mingo/misc/config
> > "
> >
> > Quickly looking at the conf you have
> > CONFIG_HZ_250=y
> > CONFIG_PREEMPT_NONE=y
> > # CONFIG_PREEMPT_VOLUNTARY is not set
> > # CONFIG_PREEMPT is not set
>
> Indeed. HZ does not seem to matter according to what i see in my
> measurements. Can you measure such sensitivity?

Hardly the point - you said one thing and did another, which doesn't
project a credible image.

Can I measure it? IANAKH, and I think there are people more passionate
here to run benchmark scripts and endless analyses.

All I can "measure" is that my desktop experience isn't stuttery and jittery
with basic stuff like scrolling over Firefox tabs with my mouse wheel
while watching pr0n.

> > CONFIG_ARCH_WANT_FRAME_POINTERS=y
> > CONFIG_FRAME_POINTER=y
> >
> > And other DEBUG.
>
> These are the defaults and they dont make a measurable difference to
> these results. What other debug options do you mean and do they make
> a difference?

Don't care as long as your kernel comparisons truly were with equivalent
settings to each other.

Köszönöm.

--
mjt

Ingo Molnar

Sep 9, 2009, 2:20:06 AM

* Jens Axboe <jens....@oracle.com> wrote:

> On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > And here's a newer version.
> >
> > I tinkered a bit with your proglet and finally found the
> > problem.
> >
> > You used a single pipe per child, this means the loop in
> > run_child() would consume what it just wrote out until it got
> > force preempted by the parent which would also get woken.
> >
> > This results in the child spinning a while (its full quota) and
> > only reporting the last timestamp to the parent.
>
> Oh doh, that's not well thought out. Well it was a quick hack :-)
> Thanks for the fixup, now it's at least usable to some degree.

What kind of latencies does it report on your box?

Our vanilla scheduler default latency targets are:

single-core: 20 msecs
dual-core: 40 msecs
quad-core: 60 msecs
octo-core: 80 msecs

You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
/proc/sys/kernel/sched_latency_ns:

echo 10000000 > /proc/sys/kernel/sched_latency_ns
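Those per-topology defaults are the 20 msec base scaled by 1 + ilog2(nr_cpus), the factor that sched_init_granularity() applies at boot (the same function Mike's patch elsewhere in this thread removes). A quick sketch of that arithmetic:

```python
def default_latency_ms(nr_cpus, base_ms=20):
    """Scale the base latency target by 1 + ilog2(nr_cpus), mirroring
    sched_init_granularity() in kernel/sched.c. bit_length() - 1 is
    floor(log2(n)) for positive n."""
    factor = 1 + (nr_cpus.bit_length() - 1)
    return base_ms * factor

for cpus in (1, 2, 4, 8):
    print(cpus, default_latency_ms(cpus))  # 20, 40, 60, 80 msecs
```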

Ingo

Nikos Chantziaras

Sep 9, 2009, 4:40:07 AM
On 09/09/2009 09:13 AM, Ingo Molnar wrote:
>
> * Jens Axboe<jens....@oracle.com> wrote:
>
>> On Tue, Sep 08 2009, Peter Zijlstra wrote:
>>> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>>>> And here's a newer version.
>>>
>>> I tinkered a bit with your proglet and finally found the
>>> problem.
>>>
>>> You used a single pipe per child, this means the loop in
>>> run_child() would consume what it just wrote out until it got
>>> force preempted by the parent which would also get woken.
>>>
>>> This results in the child spinning a while (its full quota) and
>>> only reporting the last timestamp to the parent.
>>
>> Oh doh, that's not well thought out. Well it was a quick hack :-)
>> Thanks for the fixup, now it's at least usable to some degree.
>
> What kind of latencies does it report on your box?
>
> Our vanilla scheduler default latency targets are:
>
> single-core: 20 msecs
> dual-core: 40 msecs
> quad-core: 60 msecs
> octo-core: 80 msecs
>
> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> /proc/sys/kernel/sched_latency_ns:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns

I've tried values ranging from 10000000 down to 100000. This results in
the stalls/freezes being a bit shorter, but clearly still there. It
does not eliminate them.

If there's anything else I can try/test, I would be happy to do so.

Mike Galbraith

Sep 9, 2009, 5:00:16 AM
On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> * Jens Axboe <jens....@oracle.com> wrote:
>
> > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > And here's a newer version.
> > >
> > > I tinkered a bit with your proglet and finally found the
> > > problem.
> > >
> > > You used a single pipe per child, this means the loop in
> > > run_child() would consume what it just wrote out until it got
> > > force preempted by the parent which would also get woken.
> > >
> > > This results in the child spinning a while (its full quota) and
> > > only reporting the last timestamp to the parent.
> >
> > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > Thanks for the fixup, now it's at least usable to some degree.
>
> What kind of latencies does it report on your box?
>
> Our vanilla scheduler default latency targets are:
>
> single-core: 20 msecs
> dual-core: 40 msecs
> quad-core: 60 msecs
> octo-core: 80 msecs
>
> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> /proc/sys/kernel/sched_latency_ns:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns

He would also need to lower min_granularity, otherwise, it'd be larger
than the whole latency target.
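The interaction comes from the __sched_period() logic: once there are more runnable tasks than latency/min_granularity, the period stretches to nr_running * min_granularity, so min_granularity dominates the latency target. A rough sketch (using the stock 20 ms / 4 ms defaults of that era, not measurements):

```python
def sched_period(nr_running, latency_ns=20_000_000, min_gran_ns=4_000_000):
    """Sketch of __sched_period(): when runnable tasks exceed
    latency/min_granularity, the period is stretched so each task
    still gets at least min_granularity of CPU time."""
    nr_latency = latency_ns // min_gran_ns
    if nr_running > nr_latency:
        return nr_running * min_gran_ns
    return latency_ns

# Lowering latency to 10 ms without touching min_granularity: a third
# runnable task already stretches the period past the latency target.
print(sched_period(3, latency_ns=10_000_000))  # 12000000 ns > 10 ms target
```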

I'm testing right now, and one thing that is definitely a problem is the
amount of sleeper fairness we're giving. A full latency is just too
much short term fairness in my testing. While sleepers are catching up,
hogs languish. That's the biggest issue going on.

I've also been doing some timings of make -j4 (looking at idle time),
and find that child_runs_first is mildly detrimental to fork/exec load,
as are buddies.

I'm running with the below at the moment. (the kthread/workqueue thing
is just because I don't see any reason for it to exist, so consider it
to be a waste of perfectly good math;)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 6ec4643..a44210e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -16,8 +16,6 @@
 #include <linux/mutex.h>
 #include <trace/events/sched.h>
 
-#define KTHREAD_NICE_LEVEL (-5)
-
 static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
 
@@ -150,7 +148,6 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 		 * The kernel thread should not inherit these properties.
 		 */
 		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
-		set_user_nice(create.result, KTHREAD_NICE_LEVEL);
 		set_cpus_allowed_ptr(create.result, cpu_all_mask);
 	}
 	return create.result;
@@ -226,7 +223,6 @@ int kthreadd(void *unused)
 	/* Setup a clean context for our children to inherit. */
 	set_task_comm(tsk, "kthreadd");
 	ignore_signals(tsk);
-	set_user_nice(tsk, KTHREAD_NICE_LEVEL);
 	set_cpus_allowed_ptr(tsk, cpu_all_mask);
 	set_mems_allowed(node_possible_map);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index c512a02..e68c341 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7124,33 +7124,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
  */
 cpumask_var_t nohz_cpu_mask;
 
-/*
- * Increase the granularity value when there are more CPUs,
- * because with more CPUs the 'effective latency' as visible
- * to users decreases. But the relationship is not linear,
- * so pick a second-best guess by going with the log2 of the
- * number of CPUs.
- *
- * This idea comes from the SD scheduler of Con Kolivas:
- */
-static inline void sched_init_granularity(void)
-{
-	unsigned int factor = 1 + ilog2(num_online_cpus());
-	const unsigned long limit = 200000000;
-
-	sysctl_sched_min_granularity *= factor;
-	if (sysctl_sched_min_granularity > limit)
-		sysctl_sched_min_granularity = limit;
-
-	sysctl_sched_latency *= factor;
-	if (sysctl_sched_latency > limit)
-		sysctl_sched_latency = limit;
-
-	sysctl_sched_wakeup_granularity *= factor;
-
-	sysctl_sched_shares_ratelimit *= factor;
-}
-
 #ifdef CONFIG_SMP
 /*
  * This is how migration works:
@@ -9356,7 +9329,6 @@ void __init sched_init_smp(void)
 	/* Move init over to a non-isolated CPU */
 	if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0)
 		BUG();
-	sched_init_granularity();
 	free_cpumask_var(non_isolated_cpus);
 
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
@@ -9365,7 +9337,6 @@ void __init sched_init_smp(void)
 #else
 void __init sched_init_smp(void)
 {
-	sched_init_granularity();
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e386e5d..ff7fec9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -51,7 +51,7 @@ static unsigned int sched_nr_latency = 5;
  * After fork, child runs first. (default) If set to 0 then
  * parent will (try to) run first.
  */
-const_debug unsigned int sysctl_sched_child_runs_first = 1;
+const_debug unsigned int sysctl_sched_child_runs_first = 0;
 
 /*
  * sys_sched_yield() compat mode
@@ -713,7 +713,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
 		if (sched_feat(NEW_FAIR_SLEEPERS)) {
-			unsigned long thresh = sysctl_sched_latency;
+			unsigned long thresh = sysctl_sched_min_granularity;
 
 			/*
 			 * Convert the sleeper threshold into virtual time.
@@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
 	 */
 	if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
 		set_last_buddy(se);
-	set_next_buddy(pse);
+	if (sched_feat(NEXT_BUDDY))
+		set_next_buddy(pse);
 
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 4569bfa..85d30d1 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -13,5 +13,6 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
 SCHED_FEAT(WAKEUP_OVERLAP, 0)
-SCHED_FEAT(LAST_BUDDY, 1)
+SCHED_FEAT(LAST_BUDDY, 0)
+SCHED_FEAT(NEXT_BUDDY, 0)
 SCHED_FEAT(OWNER_SPIN, 1)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3c44b56..addfe2d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -317,8 +317,6 @@ static int worker_thread(void *__cwq)
 	if (cwq->wq->freezeable)
 		set_freezable();
 
-	set_user_nice(current, -5);
-
 	for (;;) {
 		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
 		if (!freezing(current) &&

Nikos Chantziaras

Sep 9, 2009, 5:10:08 AM

Thank you for mentioning min_granularity. After:

echo 10000000 > /proc/sys/kernel/sched_latency_ns
echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns

I can clearly see an improvement: animations that are supposed to be
fluid "skip" much less now, and in one occasion (simply moving the video
window around) have been eliminated completely. However, there seems to
be a side effect from having CONFIG_SCHED_DEBUG enabled; things seem to
be generally a tad more "jerky" with that option enabled, even when not
even touching the latency and granularity defaults.

I'll try the patch you posted and see if this further improves things.

Peter Zijlstra

Sep 9, 2009, 5:10:09 AM
On Wed, 2009-09-09 at 10:52 +0200, Mike Galbraith wrote:
> @@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
>  	 */
>  	if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
>  		set_last_buddy(se);
> -	set_next_buddy(pse);
> +	if (sched_feat(NEXT_BUDDY))
> +		set_next_buddy(pse);
> 
>  	/*
>  	 * We can come here with TIF_NEED_RESCHED already set from new task

You might want to test stuff like sysbench again, iirc we went on a
cache-trashing rampage without buddies.

Our goal is not to excel at any one load but to not suck at any one
load.

Mike Galbraith

Sep 9, 2009, 5:20:06 AM
On Wed, 2009-09-09 at 11:02 +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 10:52 +0200, Mike Galbraith wrote:
> > @@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
> >  	 */
> >  	if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
> >  		set_last_buddy(se);
> > -	set_next_buddy(pse);
> > +	if (sched_feat(NEXT_BUDDY))
> > +		set_next_buddy(pse);
> > 
> >  	/*
> >  	 * We can come here with TIF_NEED_RESCHED already set from new task
>
> You might want to test stuff like sysbench again, iirc we went on a
> cache-trashing rampage without buddies.
>
> Our goal is not to excel at any one load but to not suck at any one
> load.

Oh absolutely. I wouldn't want buddies disabled by default, I only
added the buddy knob to test effects on fork/exec.

I only posted to patch to give Jens something canned to try out.

-Mike

Jens Axboe

Sep 9, 2009, 5:20:06 AM

Using latt, it seems better than -rc9. The below are entries logged
while running make -j128 on a 64 thread box. I did two runs on each, and
latt is using 8 clients.

-rc9
Max 23772 usec
Avg 1129 usec
Stdev 4328 usec
Stdev mean 117 usec

Max 32709 usec
Avg 1467 usec
Stdev 5095 usec
Stdev mean 136 usec

-rc9 + patch

Max 11561 usec
Avg 1532 usec
Stdev 1994 usec
Stdev mean 48 usec

Max 9590 usec
Avg 1550 usec
Stdev 2051 usec
Stdev mean 50 usec

max latency is way down, and much smaller variation as well.
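For reference, the four summary lines above are ordinary sample statistics; a sketch of how they would be computed from raw latency samples, assuming latt's "Stdev mean" is the standard error of the mean, stdev/sqrt(n) (the sample numbers below are made up, not Jens's raw data):

```python
import statistics

def summarize(samples_usec):
    """Max / Avg / Stdev / Stdev-mean summary in the style of latt's
    output, from a list of per-wakeup latencies in microseconds."""
    n = len(samples_usec)
    stdev = statistics.stdev(samples_usec)
    return {
        "max": max(samples_usec),
        "avg": statistics.mean(samples_usec),
        "stdev": stdev,
        "stdev_mean": stdev / n ** 0.5,  # standard error of the mean
    }

print(summarize([1200, 900, 23000, 1100, 800]))
```

The "Stdev mean" column is the interesting one for comparing runs: it shrinks with sample count, so a lower value means the average itself is better pinned down, not just that the spread is smaller.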


--
Jens Axboe

Peter Zijlstra

Sep 9, 2009, 5:20:08 AM
On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:

> Thank you for mentioning min_granularity. After:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns
> echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns

You might also want to do:

echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

That affects when a newly woken task will preempt an already running
task.
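The check behind that knob can be sketched from wakeup_preempt_entity(): the woken task preempts the running one only when it trails in vruntime by more than the wakeup granularity. This is simplified; the real CFS code converts the granularity into virtual time using the task's load weight:

```python
def should_preempt(curr_vruntime_ns, woken_vruntime_ns,
                   wakeup_gran_ns=2_000_000):
    """Simplified wakeup_preempt_entity(): preempt the current task
    only when the woken task's vruntime is behind (i.e. it is owed
    CPU) by more than the wakeup granularity."""
    vdiff = curr_vruntime_ns - woken_vruntime_ns
    return vdiff > wakeup_gran_ns

# With a lower granularity, wakeups preempt sooner:
print(should_preempt(5_000_000, 2_000_000))                          # True
print(should_preempt(5_000_000, 4_000_000))                          # False
print(should_preempt(5_000_000, 4_000_000, wakeup_gran_ns=500_000))  # True
```

That is why lowering the value favors the waking (interactive) task and raising it favors whatever is already running, matching the trade-off Nikos observes below between the video and the window manager.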

> I can clearly see an improvement: animations that are supposed to be
> fluid "skip" much less now, and in one occasion (simply moving the video
> window around) have been eliminated completely. However, there seems to
> be a side effect from having CONFIG_SCHED_DEBUG enabled; things seem to
> be generally a tad more "jerky" with that option enabled, even when not
> even touching the latency and granularity defaults.

There's more code in the scheduler with that enabled, but unless you've
got a terribly high context-switch rate it really shouldn't affect things.

Anyway, you can always poke at these numbers in the code, and like Mike
did, kill sched_init_granularity().

Nikos Chantziaras

Sep 9, 2009, 5:50:05 AM
On 09/09/2009 12:17 PM, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
>
>> Thank you for mentioning min_granularity. After:
>>
>> echo 10000000 > /proc/sys/kernel/sched_latency_ns
>> echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
>
> You might also want to do:
>
> echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
>
> That affects when a newly woken task will preempt an already running
> task.

Lowering wakeup_granularity seems to make things worse in an interesting
way:

With low wakeup_granularity, the video itself will start skipping if I
move the window around. However, the window manager's effect of moving
a window around is smooth.

With high wakeup_granularity, the video itself will not skip while
moving the window around. But this time, the window manager's effect of
the window move is skippy.

(I should point out that only with the BFS-patched kernel can I have a
smooth video *and* a smooth window-moving effect at the same time.)
Mainline seems to prioritize one of the two according to whether
wakeup_granularity is raised or lowered. However, I have not tested
Mike's patch yet (but will do so ASAP.)

Benjamin Herrenschmidt

Sep 9, 2009, 6:00:17 AM
On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> Arjan van de Ven wrote:
> > the latest version of latencytop also has a GUI (thanks to Ben)
>
> That looks nice, but...
>
> I kind of miss the split screen feature where latencytop would show both
> the overall figures + the ones for the currently most affected task.
> Downside of that last was that I never managed to keep the display on a
> specific task.

Any idea of how to present it? I'm happy to spend 5 minutes improving the
GUI :-)

> The graphical display also makes it impossible to simply copy and paste
> the results.

Ah that's right. I'm not 100% sure how to do that (first experiments
with gtk). I suppose I could try to do some kind of "snapshot" feature
which saves the results in textual form.

> Having the freeze button is nice though.


>
> Would it be possible to have a command line switch that allows to start
> the old textual mode?

It's there iirc. --nogui :-)

Cheers,
Ben.

> Looks like the man page needs updating too :-)
>

> Cheers,
> FJP

Nikos Chantziaras

Sep 9, 2009, 6:20:12 AM
On 09/09/2009 12:40 PM, Nikos Chantziaras wrote:
> On 09/09/2009 12:17 PM, Peter Zijlstra wrote:
>> On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
>>
>>> Thank you for mentioning min_granularity. After:
>>>
>>> echo 10000000 > /proc/sys/kernel/sched_latency_ns
>>> echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
>>
>> You might also want to do:
>>
>> echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> That affects when a newly woken task will preempt an already running
>> task.
>
> Lowering wakeup_granularity seems to make things worse in an interesting
> way:
>
> With low wakeup_granularity, the video itself will start skipping if I
> move the window around. However, the window manager's effect of moving a
> window around is smooth.
>
> With high wakeup_granularity, the video itself will not skip while
> moving the window around. But this time, the window manager's effect of
> the window move is skippy.
>
> (I should point out that only with the BFS-patched kernel can I have a
> smooth video *and* a smooth window-moving effect at the same time.)
> Mainline seems to prioritize one of the two according to whether
> wakeup_granularity is raised or lowered. However, I have not tested
> Mike's patch yet (but will do so ASAP.)

I've tested Mike's patch and it achieves the same effect as raising
sched_min_granularity.

To round it up:

By testing various values for sched_latency_ns, sched_min_granularity_ns
and sched_wakeup_granularity_ns, I can achieve three results:

1. Fluid animations for the foreground app, skippy ones for
the rest (video plays nicely, rest of the desktop lags.)

2. Fluid animations for the background apps, a skippy one for
the one in the foreground (desktop behaves nicely, video lags.)

3. Equally skippy/jerky behavior for all of them.

Unfortunately, a "4. Equally fluid behavior for all of them" cannot be
achieved with mainline, unless I missed some other tweak.

David Newall

Sep 9, 2009, 7:20:07 AM
Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
>
>> Arjan van de Ven wrote:
>>
>>> the latest version of latencytop also has a GUI (thanks to Ben)
>>>
>> That looks nice, but...
>>
>> I kind of miss the split screen feature where latencytop would show both
>> the overall figures + the ones for the currently most affected task.
>> Downside of that last was that I never managed to keep the display on a
>> specific task.
>>
>
> Any idea of how to present it ? I'm happy to spend 5mn improving the
> GUI :-)

Use a second window.

Benjamin Herrenschmidt

Sep 9, 2009, 7:40:08 AM
On Wed, 2009-09-09 at 20:44 +0930, David Newall wrote:
> Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> >
> >> Arjan van de Ven wrote:
> >>
> >>> the latest version of latencytop also has a GUI (thanks to Ben)
> >>>
> >> That looks nice, but...
> >>
> >> I kind of miss the split screen feature where latencytop would show both
> >> the overall figures + the ones for the currently most affected task.
> >> Downside of that last was that I never managed to keep the display on a
> >> specific task.
> >>
> >
> > Any idea of how to present it ? I'm happy to spend 5mn improving the
> > GUI :-)
>
> Use a second window.

I'm not a fan of cluttering the screen with windows... I suppose I
could have a separate pane for the "global" view, but I haven't found a
way to lay it out that doesn't suck :-) I could have done a 3rd column
on the right with the overall view, but it felt like using too much
screen real estate.

I'll experiment a bit; maybe 2 windows is indeed the solution. But then
you get into the problem of what to do if only one of them is closed. Do
I add a menu bar on each of them to re-open the "other" one if closed?
etc...

Don't get me wrong, I have a shitload of experience doing GUIs (back in
the old days when I was hacking on MacOS), though I'm relatively new to
GTK. But GUI design is rather hard in general :-)

Ben.
