BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar, Sep 6, 2009, 5:10:07 PM

hi Con,

I've read your BFS announcement/FAQ with great interest:

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

First and foremost, let me say that i'm happy that you are hacking
the Linux scheduler again. It's perhaps proof that hacking the
scheduler is one of the most addictive things on the planet ;-)

I understand that BFS is still early code and that you are not
targeting BFS for mainline inclusion - but BFS is an interesting
and bold new approach, cutting a _lot_ of code out of
kernel/sched*.c, so it raised my curiosity and interest :-)

In the announcement and on your webpage you have compared BFS to
the mainline scheduler in various workloads - showing various
improvements over it. I have tried and tested BFS and ran a set of
benchmarks - this mail contains the results and my (quick)
findings.

So ... to get to the numbers - i've tested both BFS and the tip of
the latest upstream scheduler tree on a testbox of mine. I
intentionally didn't test BFS on any really large box - because you
described its upper limit like this in the announcement:

-----------------------
|
| How scalable is it?
|
| I don't own the sort of hardware that is likely to suffer from
| using it, so I can't find the upper limit. Based on first
| principles about the overhead of locking, and the way lookups
| occur, I'd guess that a machine with more than 16 CPUS would
| start to have less performance. BIG NUMA machines will probably
| suck a lot with this because it pays no deference to locality of
| the NUMA nodes when deciding what cpu to use. It just keeps them
| all busy. The so-called "light NUMA" that constitutes commodity
| hardware these days seems to really like BFS.
|
-----------------------

I generally agree with you that "light NUMA" is what a Linux
scheduler needs to concentrate on (at most) in terms of
scalability. Big NUMA with 4096 CPUs is not very common, and we tune the
Linux scheduler mostly for desktop and small-server workloads.

So the testbox i picked fits into the upper portion of what i
consider a sane range of systems to tune for - and should still fit
into BFS's design bracket as well according to your description:
it's a dual quad core system with hyperthreading. It has twice as
many cores as the quad you tested on but it's not excessive and
certainly does not have 4096 CPUs ;-)

Here are the benchmark results:

kernel build performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

pipe performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

messaging performance (hackbench):
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

OLTP performance (postgresql + sysbench)
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

Alas, as can be seen in the graphs, i cannot see any BFS
performance improvements on this box.

Here's a more detailed description of the results:

| Kernel build performance
---------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

In the kbuild test BFS is showing significant weaknesses up to 16
CPUs. With 8 CPUs utilized (half load) it's 27.6% slower. All results
from -j1 through -j15 are slower. The peak at 100% utilization at -j16
is slightly stronger under BFS, by 1.5%. The 'absolute best' result
is sched-devel at -j64 with 46.65 seconds - the best BFS result is
47.38 seconds (also at -j64), i.e. sched-devel's best is 1.5% faster.

| Pipe performance
-------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

Pipe performance is a very simple test: two tasks message each
other via pipes. I measured 1 million such messages:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

The pipe test ran a number of them in parallel:

for ((i=0;i<$NR;i++)); do ~/sched-tests/pipe-test-1m & done; wait

and measured elapsed time. This tests two things: basic scheduler
performance and also scheduler fairness. (if one of these parallel
jobs is delayed unfairly then the test will finish later.)
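
For illustration, here is a minimal sketch of this kind of pipe
ping-pong test - it is not Ingo's actual pipe-test-1m.c (that's at the
URL above), just the idea: two tasks bounce one byte back and forth a
million times, so every round trip forces two wakeups and the elapsed
time is dominated by scheduler overhead.

/* pipe-pingpong.c - hedged sketch, not the original pipe-test-1m.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define LOOPS 1000000

int main(void)
{
    int ptc[2], ctp[2];    /* parent->child and child->parent pipes */
    char c = 0;
    int i;

    if (pipe(ptc) || pipe(ctp)) {
        perror("pipe");
        exit(1);
    }

    if (fork() == 0) {
        /* child: wait for a byte, echo it back */
        for (i = 0; i < LOOPS; i++) {
            if (read(ptc[0], &c, 1) != 1 ||
                write(ctp[1], &c, 1) != 1)
                exit(1);
        }
        exit(0);
    }

    /* parent: send a byte, wait for the echo */
    for (i = 0; i < LOOPS; i++) {
        if (write(ptc[1], &c, 1) != 1 ||
            read(ctp[0], &c, 1) != 1)
            exit(1);
    }
    wait(NULL);
    return 0;
}

Build with "gcc -O2 -o pipe-pingpong pipe-pingpong.c" and start several
copies with a loop like the one above to reproduce the parallel test.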

[ see further below for a simpler pipe latency benchmark as well. ]

As can be seen in the graph BFS performed very poorly in this test:
at 8 pairs of tasks it had a runtime of 45.42 seconds - while
sched-devel finished them in 3.8 seconds.

I saw really bad interactivity in the BFS test here - the system
was starved for as long as the test ran. I stopped the tests at 8
loops - the system was unusable and i was getting IO timeouts due
to the scheduling lag:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
end_request: I/O error, dev sda, sector 81949243
Aborting journal on device sda2.
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

I measured interactivity during this test:

$ time ssh aldebaran /bin/true
real 2m17.968s
user 0m0.009s
sys 0m0.003s

A single command took more than 2 minutes.

| Messaging performance
------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

Hackbench ran better - but mainline sched-devel is significantly
faster for smaller and larger loads as well. With 20 groups
mainline ran 61.5% faster.

| OLTP performance
--------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

As can be seen in the graph for sysbench OLTP performance
sched-devel outperforms BFS on each of the main stages:

single client load ( 1 client - 6.3% faster )
half load ( 8 clients - 57.6% faster )
peak performance ( 16 clients - 117.6% faster )
overload ( 512 clients - 288.3% faster )

| Other tests
--------------

I also tested a couple of other things, such as lat_tcp:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

General interactivity of BFS seemed good to me - except for the
pipe test, where there was significant lag of over a minute. I think
it's some starvation bug, not an inherent design property of BFS,
so i'm looking forward to re-testing it with the fix.

Test environment: i used the latest BFS (version 205 at first, then
re-ran under 208 - all the numbers here are from 208), and the latest
mainline scheduler development tree from:

http://people.redhat.com/mingo/tip.git/README

Commit 840a065 in particular. It's on a .31-rc8 base while BFS is
on a .30 base - i'll be able to test BFS on a .31 base as well once
you release it. (but it doesn't matter much to the results - there
weren't any heavy core kernel changes impacting these workloads.)

The system had enough RAM to have the workloads cached, and i
repeated all tests to make sure it's all representative.
Nevertheless i'd like to encourage others to repeat these (or
other) tests - the more testing the better.

I also tried to configure the kernel in a BFS-friendly way: i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:

http://redhat.com/~mingo/misc/config

( Let me know if you need any more info about any of the tests i
conducted. )

Also, i'd like to outline that i agree with the general goals
described by you in the BFS announcement - small desktop systems
matter more than large systems. We find it critically important
that the mainline Linux scheduler performs well on those systems
too - and if you (or anyone else) can reproduce suboptimal behavior
please let the scheduler folks know so that we can fix/improve it.

I hope to be able to work with you on this - please don't hesitate
to send patches if you wish - and we'll also be following BFS for
good ideas and code to adopt into mainline.

Thanks,

Ingo

Frans Pop, Sep 6, 2009, 10:10:06 PM

Ingo Molnar wrote:
> So the testbox i picked fits into the upper portion of what i
> consider a sane range of systems to tune for - and should still fit
> into BFS's design bracket as well according to your description:
> it's a dual quad core system with hyperthreading.

Ingo,

Nice that you've looked into this.

Would it be possible for you to run the same tests on e.g. a dual core
and/or a UP system (or maybe just offline some CPUs?)? It would be very
interesting to see whether BFS does better in the lower portion of the
range, or if the differences you show between the two schedulers are
consistent across the range.

Cheers,
FJP

Nikos Chantziaras, Sep 6, 2009, 11:40:07 PM

On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>[...]

> Also, i'd like to outline that i agree with the general goals
> described by you in the BFS announcement - small desktop systems
> matter more than large systems. We find it critically important
> that the mainline Linux scheduler performs well on those systems
> too - and if you (or anyone else) can reproduce suboptimal behavior
> please let the scheduler folks know so that we can fix/improve it.

BFS improved behavior of many applications on my Intel Core 2 box in a
way that can't be benchmarked. Examples:

mplayer using OpenGL renderer doesn't drop frames anymore when dragging
and dropping the video window around in an OpenGL composited desktop
(KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
the moment the move starts and at the moment you drop the window back to
the desktop, there's a big frame skip as if mplayer was frozen for a
bit; around 200 or 300ms.)

Composite desktop effects like zoom and fade out don't stall for
sub-second periods of time while there's CPU load in the background. In
other words, the desktop is more fluid and less skippy even during heavy
CPU load. Moving windows around with CPU load in the background doesn't
result in short skips.

LMMS (a tool utilizing real-time sound synthesis) does not produce
"pops", "crackles" and drops in the sound during real-time playback due
to buffer under-runs. Those problems amplify when there's heavy CPU
load in the background, while with BFS heavy load doesn't produce those
artifacts (though LMMS makes itself run SCHED_ISO with BFS). Also,
hitting a key on the keyboard needs less time for the note to become
audible when using BFS. The same should hold true for other tools that
traditionally benefit from the "-rt" kernel sources.

Games like Doom 3 and such don't "freeze" periodically for small amounts
of time (again for sub-second amounts) when something in the background
grabs CPU time (be it my mailer checking for new mail or a cron job, or
whatever.)

And, the most drastic improvement here, with BFS I can do a "make -j2"
in the kernel tree and the GUI stays fluid. Without BFS, things start
to lag, even with in-RAM builds (like having the whole kernel tree
inside a tmpfs) and gcc running with nice 19 and ionice -c 3.

Unfortunately, I can't come up with any way to somehow benchmark all of
this. There's no benchmark for "fluidity" and "responsiveness".
Running the Doom 3 benchmark, or any other benchmark, doesn't say
anything about responsiveness, it only measures how many frames were
calculated in a specific period of time. How "stable" (with no stalls)
those frames were making it to the screen is not measurable.

If BFS would imply small drops in pure performance counted in
instructions per seconds, that would be a totally acceptable regression
for desktop/multimedia/gaming PCs. Not for server machines, of course.
However, on my machine, BFS is faster in classic workloads. When I
run "make -j2" with BFS and the standard scheduler, BFS always finishes
a bit faster. Not by much, but still. One thing I'm noticing here is
that BFS produces 100% CPU load on each core with "make -j2" while the
normal scheduler stays at about 90-95% with -j2 or higher in at least
one of the cores. There seems to be under-utilization of CPU time.

Also, by searching around the net and through discussions on
various mailing lists, there seems to be a trend: the problems for some
reason seem to occur more often with Intel CPUs (Core 2 chips and lower;
I can't say anything about Core i7) while people on AMD CPUs are mostly
not affected by most or even all of the above. (And because of this,
flame wars often break out, with one party accusing the other of
imagining things.) Can the integrated memory controller on AMD chips
have something to do with this? Do AMD chips generally offer better
"multithreading" behavior? Unfortunately, you didn't mention what CPU
you ran your tests on. If it was AMD, it might be a good idea to run
tests on Pentium and Core 2 CPUs.

For reference, my system is:

CPU: Intel Core 2 Duo E6600 (2.4GHz)
Mainboard: Asus P5E (Intel X38 chipset)
RAM: 6GB (2+2+1+1) dual channel DDR2 800
GPU: RV770 (Radeon HD4870).

Con Kolivas, Sep 7, 2009, 12:00:16 AM

2009/9/7 Ingo Molnar <mi...@elte.hu>:
> hi Con,

Sigh..

Well hello there.

>
> I've read your BFS announcement/FAQ with great interest:
>
>    http://ck.kolivas.org/patches/bfs/bfs-faq.txt

> I understand that BFS is still early code and that you are not


> targeting BFS for mainline inclusion - but BFS is an interesting
> and bold new approach, cutting a _lot_ of code out of
> kernel/sched*.c, so it raised my curiosity and interest :-)

Hard to keep a project under wraps and get an audience at the same
time, it is. I do realise it was inevitable LKML would invade my
personal space no matter how much I didn't want it to, but it would be
rude of me to not respond.

> In the announcement and on your webpage you have compared BFS to
> the mainline scheduler in various workloads - showing various
> improvements over it. I have tried and tested BFS and ran a set of
> benchmarks - this mail contains the results and my (quick)
> findings.

/me sees Ingo run off to find the right combination of hardware and
benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is
and/or how bad bfs is, along with telling people they should use these
artificial benchmarks to determine how good it is, demonstrating yet
again why benchmarks fail the desktop]

I'm not interested in a long protracted discussion about this since
I'm too busy to live linux the way full time developers do, so I'll
keep it short, and perhaps you'll understand my intent better if the
FAQ wasn't clear enough.


Do you know what a normal desktop PC looks like? No, a more realistic
question based on what you chose to benchmark to prove your point
would be: Do you know what normal people actually do on them?


Feel free to treat the question as rhetorical.

Regards,
-ck

/me checks on his distributed computing client's progress, fires up
his next H264 encode, changes music tracks and prepares to have his
arse whooped on quakelive.

Jens Axboe, Sep 7, 2009, 6:00:10 AM

On Sun, Sep 06 2009, Ingo Molnar wrote:
> So ... to get to the numbers - i've tested both BFS and the tip of
> the latest upstream scheduler tree on a testbox of mine. I
> intentionally didnt test BFS on any really large box - because you
> described its upper limit like this in the announcement:

I ran a simple test as well, since I was curious to see how it performed
wrt interactiveness. One of my pet peeves with the current scheduler is
that I have to nice compile jobs, or my X experience is just awful while
the compile is running.

Now, this test case is something that attempts to see what
interactiveness would be like. It'll run a given command line while at
the same time logging delays. The delays are measured as follows:

- The app creates a pipe, and forks a child that blocks on reading from
that pipe.
- The app sleeps for a random period of time, anywhere between 100ms
and 2s. When it wakes up, it gets the current time and writes that to
the pipe.
- The child then gets woken, checks the time on its own, and logs the
difference between the two.

The idea here being that the delay between writing to the pipe and the
child reading the data and comparing should (in some way) be indicative
of how responsive the system would seem to a user.
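
Below is a rough, hypothetical sketch of that measurement loop - Jens's
actual test app was not posted, so the names and details here are
invented; it only shows the ping/pong idea (the compile job under test
would be started separately).

/* wakeup-lat.c - hedged sketch of the latency logger described above */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int fd[2];
    double sent;

    if (pipe(fd)) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {
        /* child: block on the pipe, log how late the wakeup was */
        while (read(fd[0], &sent, sizeof(sent)) == sizeof(sent))
            printf("delay: %.1f ms\n", (now() - sent) * 1e3);
        return 0;
    }

    /* parent: sleep 100ms..2s, then ping the child with a timestamp */
    srand(getpid());
    for (;;) {
        struct timespec req;
        long us = 100000 + rand() % 1900000;

        req.tv_sec = us / 1000000;
        req.tv_nsec = (us % 1000000) * 1000;
        nanosleep(&req, NULL);

        sent = now();
        if (write(fd[1], &sent, sizeof(sent)) != sizeof(sent))
            break;
    }
    return 0;
}

(Build with "gcc -O2 -o wakeup-lat wakeup-lat.c -lrt"; run it alongside
the kernel compile and post-process the logged delays into max/avg/stddev.)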

The test app was quickly hacked up, so don't put too much into it. The
test run is a simple kernel compile, using -jX where X is the number of
threads in the system. The files are cache hot, so little IO is done.
The -x2 run uses double the number of processes as we have threads,
e.g. -j128 on a 64 thread box.

And I have to apologize for using a large system to test this on, I
realize it's out of the scope of BFS, but it's just easier to fire one
of these beasts up than it is to sacrifice my notebook or desktop
machine... So it's a 64 thread box. CFS -jX runtime is the baseline at
100, lower number means faster and vice versa. The latency numbers are
in msecs.


Scheduler     Runtime    Max lat    Avg lat    Std dev
----------------------------------------------------------------
CFS               100        951        462        267
CFS-x2            100        983        484        308
BFS
BFS-x2

And unfortunately this is where it ends for now, since BFS doesn't boot
on the two boxes I tried. It hard hangs right after disk detection. But
the latency numbers look pretty appalling for CFS, so it's a bit of a
shame that I did not get to compare. I'll try again later with a newer
revision, when available.

--
Jens Axboe

Nikos Chantziaras, Sep 7, 2009, 6:20:11 AM

On 09/07/2009 12:49 PM, Jens Axboe wrote:
> [...]

> And I have to apologize for using a large system to test this on, I
> realize it's out of the scope of BFS, but it's just easier to fire one
> of these beasts up than it is to sacrifice my notebook or desktop
> machine...

How does a kernel rebuild constitute "sacrifice"?


> So it's a 64 thread box. CFS -jX runtime is the baseline at
> 100, lower number means faster and vice versa. The latency numbers are
> in msecs.
>
>
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2
>
> And unfortunately this is where it ends for now, since BFS doesn't boot
> on the two boxes I tried.

Then why post this in the first place?

Jens Axboe, Sep 7, 2009, 6:50:07 AM

On Mon, Sep 07 2009, Nikos Chantziaras wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>> [...]
>> And I have to apologize for using a large system to test this on, I
>> realize it's out of the scope of BFS, but it's just easier to fire one
>> of these beasts up than it is to sacrifice my notebook or desktop
>> machine...
>
> How does a kernel rebuild constitute "sacrifice"?

It's more of a bother since I have to physically be at the notebook,
whereas the server type boxes usually have remote management. The
workstation I'm using right now, so it'd be very disruptive to do it
there. And as things are apparently very alpha on the bfs side currently,
it's easier to 'sacrifice' an idle test box. That's the keyword, 'test'
boxes. You know, machines used for testing. Not production machines.

Plus the notebook is using btrfs, which isn't on-disk format compatible
with 2.6.30.

Is there a point to this question?

>> So it's a 64 thread box. CFS -jX runtime is the baseline at
>> 100, lower number means faster and vice versa. The latency numbers are
>> in msecs.
>>
>>
>> Scheduler Runtime Max lat Avg lat Std dev
>> ----------------------------------------------------------------
>> CFS 100 951 462 267
>> CFS-x2 100 983 484 308
>> BFS
>> BFS-x2
>>
>> And unfortunately this is where it ends for now, since BFS doesn't boot
>> on the two boxes I tried.
>
> Then who post this in the first place?

You snipped the relevant part of the conclusion, the part where I make a
comment on the cfs latencies.

Don't bother replying to any of my emails if YOU continue writing emails
in this fashion. I have MUCH better things to do than entertain kiddies.
If you do get your act together and want to reply, follow lkml etiquette
and group reply.

--
Jens Axboe

Frederic Weisbecker, Sep 7, 2009, 7:10:07 AM

On Mon, Sep 07, 2009 at 06:38:36AM +0300, Nikos Chantziaras wrote:
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness". Running
> the Doom 3 benchmark, or any other benchmark, doesn't say anything about
> responsiveness, it only measures how many frames were calculated in a
> specific period of time. How "stable" (with no stalls) those frames were
> making it to the screen is not measurable.

That actually looks benchmarkable. This is about latency.
For example, you could run high-load tasks in the
background and then launch a task that wakes up at medium/long
intervals to do something. You could measure the time it takes
for it to wake up and perform its work.

We have some event tracing infrastructure in the kernel that can
snapshot the wakeup and sched switch events.

Having CONFIG_EVENT_TRACING=y should be sufficient for that.

You just need to mount a debugfs point, say in /debug.

Then you can activate these sched events by doing:

echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/sched/sched_switch/enable
echo 1 > /debug/tracing/events/sched/sched_wakeup/enable

#Launch your tasks

echo 1 > /debug/tracing/tracing_on

#Wait for some time

echo 0 > /debug/tracing/tracing_on

That will require some parsing of the result in /debug/tracing/trace
to get the delays between the wakeup events and the switch-in events
for the task that periodically wakes up, and then producing some
statistics such as the average or the maximum latency.

That's a bit of a rough approach to measure such latencies but that
should work.
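
As a starting point, here is a hedged sketch of such a parser. The
exact text of the trace lines varies between kernel versions, so the
matching below (task selected by a substring passed on the command
line, "==>" marking the switched-in side of sched_switch) is an
assumption that may need adjusting to your trace format.

/* sched-lat.c - rough sketch: wakeup-to-switch-in delays from a trace */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char line[1024];
    double wake_ts = -1.0, ts;
    char *p;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <task-name> < trace\n", argv[0]);
        return 1;
    }

    while (fgets(line, sizeof(line), stdin)) {
        /* header looks like "comm-pid [cpu] timestamp: event: ..." */
        p = strchr(line, ']');
        if (!p || sscanf(p + 1, " %lf:", &ts) != 1)
            continue;

        if (strstr(p, "sched_wakeup") && strstr(p, argv[1])) {
            wake_ts = ts;                  /* task was woken */
        } else if (wake_ts >= 0 && strstr(p, "sched_switch")) {
            char *next = strstr(p, "==>");

            if (next && strstr(next, argv[1])) {
                printf("latency: %.3f ms\n", (ts - wake_ts) * 1e3);
                wake_ts = -1.0;            /* task is now on a CPU */
            }
        }
    }
    return 0;
}

Usage would be something like "./sched-lat mytask < /debug/tracing/trace";
piping the output through "sort -nk2" puts the worst latencies last.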


> If BFS would imply small drops in pure performance counted in
> instructions per seconds, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I run
> "make -j2" with BFS and the standard scheduler, BFS always finishes a bit
> faster. Not by much, but still. One thing I'm noticing here is that BFS
> produces 100% CPU load on each core with "make -j2" while the normal
> scheduler stays at about 90-95% with -j2 or higher in at least one of the
> cores. There seems to be under-utilization of CPU time.

That could also be benchmarked by using the above sched events and
looking at the average time each CPU spends running the idle task.

Jens Axboe, Sep 7, 2009, 8:10:06 AM

On Mon, Sep 07 2009, Jens Axboe wrote:
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2

Those numbers are buggy, btw, it's not nearly as bad. But responsiveness
under compile load IS bad though, the test app just didn't quantify it
correctly. I'll see if I can get it working properly.

Ingo Molnar, Sep 7, 2009, 8:20:07 AM

* Frans Pop <ele...@planet.nl> wrote:

> Ingo Molnar wrote:
> > So the testbox i picked fits into the upper portion of what i
> > consider a sane range of systems to tune for - and should still fit
> > into BFS's design bracket as well according to your description:
> > it's a dual quad core system with hyperthreading.
>
> Ingo,
>
> Nice that you've looked into this.
>
> Would it be possible for you to run the same tests on e.g. a dual
> core and/or a UP system (or maybe just offline some CPUs?)? It
> would be very interesting to see whether BFS does better in the
> lower portion of the range, or if the differences you show between
> the two schedulers are consistent across the range.

Sure!

Note that usually we can extrapolate ballpark-figure quad and dual
socket results from 8 core results. Trends as drastic as the ones
i reported do not get reversed as one shrinks the number of cores.

[ This technique is not universal - for example borderline graphs
cannot be extrapolated down reliably - but the graphs i
posted were far from borderline. ]

Con posted single-socket quad comparisons/graphs so to make it 100%
apples to apples i re-tested with a single-socket (non-NUMA) quad as
well, and have uploaded the new graphs/results to:

kernel build performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg

pipe performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg

messaging performance (hackbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg

OLTP performance (postgresql + sysbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg

It shows similar curves and behavior to the 8-core results i posted
- BFS is slower than mainline in virtually every measurement. The
ratios are different for different parts of the graphs - but the
trend is similar.

I also re-ran a few standalone kernel latency tests with a single
quad:

lat_tcp:

BFS: TCP latency using localhost: 16.9926 microseconds
sched-devel: TCP latency using localhost: 12.4141 microseconds [36.8% faster]

as a comparison, the 8 core lat_tcp result was:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe quad result:

BFS: Pipe latency: 4.6978 microseconds
sched-devel: Pipe latency: 2.6860 microseconds [74.8% faster]

as a comparison, the 8 core lat_pipe result was:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

On the desktop interactivity front, i also still saw that bad
starvation artifact with BFS with multiple copies of CPU-bound
pipe-test-1m.c running in parallel:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

Start up a few copies of them like this:

for ((i=0;i<32;i++)); do ./pipe-test-1m & done

and the quad eventually came to a halt here - until the tasks
finished running.

I also tested a few key data points on dual core and it shows
similar trends as well (as expected from the 8 and 4 core results).

But ... i'd really encourage everyone to test these things yourself
as well and not take anyone's word on this as granted. The more
people provide numbers, the better. The latest BFS patch can be
found at:

http://ck.kolivas.org/patches/bfs/

The mainline sched-devel tree can be found at:

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

Stefan Richter, Sep 7, 2009, 8:40:09 AM

Ingo Molnar wrote:
> i'd really encourage everyone to test these things yourself
> as well and not take anyone's word on this as granted. The more
> people provide numbers, the better.

Besides mean values from bandwidth and latency focused tests, standard
deviations or variance, or e.g. 90th percentiles and perhaps maxima of
latency focused tests might be of interest. Or graphs with error bars.
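
For instance, a small sketch of the kind of summary statistics Stefan
suggests - mean, standard deviation, 90th percentile and maximum of a
list of latency samples (one value per line on stdin, e.g. the output
of one of the latency loggers above):

/* latstats.c - hedged sketch: summary statistics for latency samples */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;

    return (x > y) - (x < y);
}

int main(void)
{
    double *v = NULL, x, sum = 0.0, sq = 0.0, mean, sd;
    size_t n = 0, cap = 0;

    while (scanf("%lf", &x) == 1) {
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            v = realloc(v, cap * sizeof(*v));
            if (!v)
                return 1;
        }
        v[n++] = x;
        sum += x;
        sq += x * x;
    }
    if (!n)
        return 1;

    qsort(v, n, sizeof(*v), cmp);
    mean = sum / n;
    sd = sqrt(sq / n - mean * mean);
    printf("n=%lu mean=%.3f stddev=%.3f p90=%.3f max=%.3f\n",
           (unsigned long)n, mean, sd,
           v[(size_t)(0.9 * (n - 1))], v[n - 1]);
    free(v);
    return 0;
}

(Build with "gcc -O2 -o latstats latstats.c -lm".)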
--
Stefan Richter
-=====-==--= =--= --===
http://arcgraph.de/sr/

Markus Törnqvist, Sep 7, 2009, 9:50:07 AM

Please Cc me as I'm not a subscriber.

(LKML bounced this message once already for 8-bit headers, I'm retrying
now - sorry if someone gets it twice)

On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
>
>Con posted single-socket quad comparisons/graphs so to make it 100%
>apples to apples i re-tested with a single-socket (non-NUMA) quad as
>well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg

[...]


>
>It shows similar curves and behavior to the 8-core results i posted
>- BFS is slower than mainline in virtually every measurement. The
>ratios are different for different parts of the graphs - but the
>trend is similar.

Dude, not cool.

1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores
2. You just proved BFS is better on the job_count == core_count case, as BFS
says it is, if you look at the graph
3. You're comparing an old version of BFS against an unreleased dev kernel

Also, you said on http://article.gmane.org/gmane.linux.kernel/886319


"I also tried to configure the kernel in a BFS friendly way, i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:
http://redhat.com/~mingo/misc/config
"

Quickly looking at the conf you have
CONFIG_HZ_250=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y

And other DEBUG.

--
mjt

Ingo Molnar, Sep 7, 2009, 10:00:13 AM

* Markus Törnqvist <m...@nysv.org> wrote:

> Please Cc me as I'm not a subscriber.
>

> On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
> >
> >Con posted single-socket quad comparisons/graphs so to make it 100%
> >apples to apples i re-tested with a single-socket (non-NUMA) quad as
> >well, and have uploaded the new graphs/results to:
> >
> > kernel build performance on quad:
> > http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
> [...]
> >
> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores

No, it's 4 cores. HyperThreading adds two 'siblings' per core, which
are not 'cores'.

> 2. You just proved BFS is better on the job_count == core_count case, as BFS
> says it is, if you look at the graph

I pointed that out too. I think the graphs speak for themselves:

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

> 3. You're comparing an old version of BFS against an unreleased dev kernel

bfs-208 was 1 day old (and it is a 500K+ kernel patch) when i tested
it against the 2 days old sched-devel tree. Btw., i initially
measured 205 as well and spent one more day on acquiring and
analyzing the 208 results.

There's bfs-209 out there today. These tests take 8+ hours to
complete and validate. I'll re-test BFS in the future too, and as i
said in the first mail i'll test it on a .31 base as well once
BFS has been ported to it:

> > It's on a .31-rc8 base while BFS is on a .30 base - will be able
> > to test BFS on a .31 base as well once you release it. (but it
> > doesnt matter much to the results - there werent any heavy core
> > kernel changes impacting these workloads.)

> Also, you said on http://article.gmane.org/gmane.linux.kernel/886319


> "I also tried to configure the kernel in a BFS friendly way, i used
> HZ=1000 as recommended, turned off all debug options, etc. The
> kernel config i used can be found here:
> http://redhat.com/~mingo/misc/config
> "
>
> Quickly looking at the conf you have
> CONFIG_HZ_250=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set

Indeed. HZ does not seem to matter according to what i see in my
measurements. Can you measure such sensitivity?

> CONFIG_ARCH_WANT_FRAME_POINTERS=y
> CONFIG_FRAME_POINTER=y
>
> And other DEBUG.

These are the defaults and they don't make a measurable difference to
these results. What other debug options do you mean and do they make
a difference?

Ingo

Ingo Molnar, Sep 7, 2009, 10:20:09 AM

* Jens Axboe <jens....@oracle.com> wrote:

> On Mon, Sep 07 2009, Jens Axboe wrote:
> > Scheduler Runtime Max lat Avg lat Std dev
> > ----------------------------------------------------------------
> > CFS 100 951 462 267
> > CFS-x2 100 983 484 308
> > BFS
> > BFS-x2
>
> Those numbers are buggy, btw, it's not nearly as bad. But
> responsiveness under compile load IS bad though, the test app just
> didn't quantify it correctly. I'll see if I can get it working
> properly.

What's the default latency target on your box:

cat /proc/sys/kernel/sched_latency_ns

?

And yes, it would be wonderful to get a test-app from you that would
express the kind of pain you are seeing during compile jobs.

Ingo

Arjan van de Ven, Sep 7, 2009, 10:40:06 AM

On Mon, 07 Sep 2009 06:38:36 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in
> a way that can't be benchmarked. Examples:

Have you tried to see if latencytop catches such latencies ?

Arjan van de Ven, Sep 7, 2009, 10:50:11 AM

On Mon, 7 Sep 2009 16:41:51 +0300

> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> 8 cores

4 cores, 8 threads. Which is basically the standard desktop cpu going
forward... (4 cores is already standard today, 8 threads will be any day now)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Michael Buesch, Sep 7, 2009, 11:20:05 AM

Here's a very simple test setup on an embedded single-core bcm47xx machine (WL500GPv2).
It uses iperf for performance testing. The iperf server is run on the
embedded device. The device is so slow that the iperf test is completely
CPU bound. The network connection is 100MBit on the device side, connected
via patch cable to a 1000MBit machine.

The kernel is openwrt-2.6.30.5.

Here are the results:

Mainline CFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35793 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.4 MBytes 23.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35794 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 56147 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec


BFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52489 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.2 MBytes 32.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52490 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52491 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec


--
Greetings, Michael.

Frans Pop, Sep 7, 2009, 11:30:12 AM

On Monday 07 September 2009, Arjan van de Ven wrote:
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Despite that, I'm personally more interested in what I have available here
*now*. And that's various UP Pentium systems, one dual core Pentium D and
a Core Duo.

I've been running BFS on my laptop today while doing CPU intensive jobs
(not disk intensive), and I must say that BFS does seem very responsive.
OTOH, I've also noticed some surprising things, such as processors staying
on lower frequencies while doing CPU-intensive work.

It feels like I have fewer of the mouse cursor and typing freezes I'm used
to with CFS, even when I'm *not* doing anything special. I've been
blaming those on still running ext3 in ordered mode, but now I'm
starting to wonder.

I'll try to do more structured testing, comparisons and measurements
later. At the very least it's nice to have something to compare _with_.

Cheers,
FJP

Xavier Bestel, Sep 7, 2009, 11:30:13 AM

On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> On Mon, 7 Sep 2009 16:41:51 +0300
> > >It shows similar curves and behavior to the 8-core results i posted
> > >- BFS is slower than mainline in virtually every measurement. The
> > >ratios are different for different parts of the graphs - but the
> > >trend is similar.
> >
> > Dude, not cool.
> >
> > 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> > 8 cores
>
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Except on your typical smartphone, which will run Linux and will probably
vastly outnumber "traditional" Linux desktops.

Xav

Arjan van de Ven, Sep 7, 2009, 11:40:08 AM

On Mon, 07 Sep 2009 17:24:29 +0200
Xavier Bestel <xavier...@free.fr> wrote:

>
> On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> > On Mon, 7 Sep 2009 16:41:51 +0300
> > > >It shows similar curves and behavior to the 8-core results i
> > > >posted
> > > >- BFS is slower than mainline in virtually every measurement.
> > > >The ratios are different for different parts of the graphs - but
> > > >the trend is similar.
> > >
> > > Dude, not cool.
> > >
> > > 1. Quad HT is not the same as a 4-core desktop, you're doing it
> > > with 8 cores
> >
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

yeah the trend in cellphones is only quad core without HT, not quad
core WITH ht ;-)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Arjan van de Ven, Sep 7, 2009, 11:40:06 AM

On Mon, 7 Sep 2009 17:20:33 +0200
Frans Pop <ele...@planet.nl> wrote:

> On Monday 07 September 2009, Arjan van de Ven wrote:
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Despite that I'm personally more interested in what I have available
> here *now*. And that's various UP Pentium systems, one dual core
> Pentium D and Core Duo.
>
> I've been running BFS on my laptop today while doing CPU intensive
> jobs (not disk intensive), and I must say that BFS does seem very
> responsive. OTOH, I've also noticed some surprising things, such as
> processors staying on lower frequencies while doing CPU-intensive
> work.
>
> I feels like I have less of the mouse cursor and typing freezes I'm
> used to with CFS, even when I'm *not* doing anything special. I've
> been blaming those on still running with ordered mode ext3, but now
> I'm starting to wonder.
>
> I'll try to do more structured testing, comparisons and measurements
> later. At the very least it's nice to have something to compare
> _with_.
>

it's a shameless plug since I wrote it, but latencytop will be able to
tell you what your bottleneck is...
and that is very interesting to know, regardless of the "what scheduler
code" discussion;

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Nikos Chantziaras, Sep 7, 2009, 11:40:09 AM

On 09/07/2009 03:16 PM, Ingo Molnar wrote:
> [...]
> Note that usually we can extrapolate ballpark-figure quad and dual
> socket results from 8 core results. Trends as drastic as the ones
> i reported do not get reversed as one shrinks the number of cores.
>
> Con posted single-socket quad comparisons/graphs so to make it 100%
> apples to apples i re-tested with a single-socket (non-NUMA) quad as
> well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
>
> pipe performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg
>
> messaging performance (hackbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg
>
> OLTP performance (postgresql + sysbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg
>
> It shows similar curves and behavior to the 8-core results i posted
> - BFS is slower than mainline in virtually every measurement.

Except for numbers, what's your *experience* with BFS when it comes to
composited desktops + games + multimedia apps? (Watching high
definition videos, playing some latest high-tech 3D game, etc.) I
described the exact problems experienced with mainline in a previous reply.

Are you even using that stuff actually? Because it would be hard to
tell if your desktop consists mainly of Emacs and an xterm; you even
seem to be using Mutt so I suspect your desktop probably doesn't look
very Windows Vista/OS X/Compiz-like. Usually, with "multimedia desktop
PC" one doesn't mean:

http://foss.math.aegean.gr/~realnc/pics/desktop2.png

but rather:

http://foss.math.aegean.gr/~realnc/pics/desktop1.png

BFS probably wouldn't offer the former anything, while on the latter it
does make a difference. If your usage of the "desktop" bears a
resemblance to the first example, I'd say you might not be the most
qualified person to judge the "Linux desktop experience." That is
not meant to be offensive or patronizing, just an observation - and I
might even be totally wrong about it.

Frans Pop, Sep 7, 2009, 11:50:04 AM

On Monday 07 September 2009, Arjan van de Ven wrote:
> it's a shameless plug since I wrote it, but latencytop will be able to
> tell you what your bottleneck is...
> and that is very interesting to know, regardless of the "what scheduler
> code" discussion;

I'm very much aware of that and I've tried pinning it down a few times,
but failed to come up with anything conclusive. I plan to make a new
effort in this context as the freezes have increasingly been annoying me.

Unfortunately latencytop only shows a blank screen when used with BFS, but
I guess that's not totally unexpected.

Cheers,
FJP

Diego Calleja, Sep 7, 2009, 12:00:15 PM

On Monday 07 September 2009 17:24:29, Xavier Bestel wrote:

> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

Smartphones will probably start using ARM dual-core CPUs next year;
the embedded world is not SMP-free.

Jens Axboe, Sep 7, 2009, 1:40:04 PM

On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens....@oracle.com> wrote:
>
> > On Mon, Sep 07 2009, Jens Axboe wrote:
> > > Scheduler Runtime Max lat Avg lat Std dev
> > > ----------------------------------------------------------------
> > > CFS 100 951 462 267
> > > CFS-x2 100 983 484 308
> > > BFS
> > > BFS-x2
> >
> > Those numbers are buggy, btw, it's not nearly as bad. But
> > responsiveness under compile load IS bad though, the test app just
> > didn't quantify it correctly. I'll see if I can get it working
> > properly.
>
> What's the default latency target on your box:
>
> cat /proc/sys/kernel/sched_latency_ns
>
> ?

It's off right now, but it is set to whatever is the default. I don't
touch it.

> And yes, it would be wonderful to get a test-app from you that would
> express the kind of pain you are seeing during compile jobs.

I was hoping this one would, but it's not showing anything. I even added
support for doing the ping and wakeup over a socket, to see if the pipe
test was doing well because of the sync wakeup we do there. The net
latency is a little worse, but still good. So no luck in making that app
so far.

--
Jens Axboe

Avi Kivity, Sep 7, 2009, 2:00:07 PM

On 09/07/2009 12:49 PM, Jens Axboe wrote:
>
> I ran a simple test as well, since I was curious to see how it performed
> wrt interactiveness. One of my pet peeves with the current scheduler is
> that I have to nice compile jobs, or my X experience is just awful while
> the compile is running.
>

I think the problem is that CFS is optimizing for the wrong thing. It's
trying to be fair to tasks, but these are meaningless building blocks of
jobs, which is what the user sees and measures. Your make -j128
dominates your interactive task by two orders of magnitude. If the
scheduler attempts to bridge this gap using heuristics, it will fail
badly when it misdetects since it will starve the really important
100-thread job for a task that was misdetected as interactive.

I think that bash (and the GUI shell) should put any new job (for bash,
a pipeline; for the GUI, an application launch from the menu) in a
scheduling group of its own. This way it will have equal weight in the
scheduler's eyes with interactive tasks; one will not dominate the
other. Of course if the cpu is free the compile job is welcome to use
all 128 threads.

(similarly, different login sessions should be placed in different jobs
to keep a heavily multithreaded screensaver from overwhelming ed).
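
As a concrete illustration of the per-job grouping idea, here is a
hedged sketch of a tiny wrapper that launches a command inside a fresh
cpu cgroup. It assumes the cpu controller is already mounted at
/dev/cgroup (e.g. "mount -t cgroup -o cpu none /dev/cgroup"); the path
and naming are illustrative, not a proposed interface.

/* cgrun.c - hedged sketch: run a command in its own cpu cgroup */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    char dir[256], tasks[256];
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    /* one group per job, named after the wrapper's pid */
    snprintf(dir, sizeof(dir), "/dev/cgroup/job-%d", (int)getpid());
    snprintf(tasks, sizeof(tasks), "%s/tasks", dir);

    if (mkdir(dir, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* move ourselves into the new group; the exec'd job inherits it */
    f = fopen(tasks, "w");
    if (!f) {
        perror(tasks);
        return 1;
    }
    fprintf(f, "%d\n", (int)getpid());
    fclose(f);

    execvp(argv[1], argv + 1);
    perror("execvp");
    return 1;
}

With something like this, "cgrun make -j128" would compete against an
interactive session as one group rather than as 128 individual tasks.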

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

Ingo Molnar, Sep 7, 2009, 2:30:10 PM

That's interesting. I tried to reproduce it on x86, but the profile
does not show any scheduler overhead at all on the server:

$ perf report

#
# Samples: 8369
#
# Overhead Symbol
# ........ ......
#
9.20% [k] copy_user_generic_string
3.80% [k] e1000_clean
3.58% [k] ipt_do_table
2.72% [k] mwait_idle
2.68% [k] nf_iterate
2.28% [k] e1000_intr
2.15% [k] tcp_packet
2.10% [k] __hash_conntrack
1.59% [k] read_tsc
1.52% [k] _local_bh_enable_ip
1.34% [k] eth_type_trans
1.29% [k] __alloc_skb
1.19% [k] tcp_recvmsg
1.19% [k] ip_rcv
1.17% [k] e1000_clean_rx_irq
1.12% [k] apic_timer_interrupt
0.99% [k] vsnprintf
0.96% [k] nf_conntrack_in
0.96% [k] kmem_cache_free
0.93% [k] __kmalloc_track_caller


Could you profile it please? Also, what's the context-switch rate?

Below is the call-graph profile as well - all the overhead is in
networking and SLAB.

Ingo

$ perf report --call-graph fractal,5

#
# Samples: 8947
#
# Overhead Command Shared Object Symbol
# ........ .............. ............................. ......
#
9.06% iperf [kernel] [k] copy_user_generic_string
|
|--98.89%-- skb_copy_datagram_iovec
| |
| |--77.18%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --22.82%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--1.11%-- system_call_fastpath
__GI___libc_nanosleep

3.62% [init] [kernel] [k] e1000_clean
2.96% [init] [kernel] [k] ipt_do_table
2.79% [init] [kernel] [k] mwait_idle
2.22% [init] [kernel] [k] e1000_intr
1.93% [init] [kernel] [k] nf_iterate
1.65% [init] [kernel] [k] __hash_conntrack
1.52% [init] [kernel] [k] tcp_packet
1.29% [init] [kernel] [k] ip_rcv
1.18% [init] [kernel] [k] __alloc_skb
1.15% iperf [kernel] [k] tcp_recvmsg

1.04% [init] [kernel] [k] _local_bh_enable_ip
1.02% [init] [kernel] [k] apic_timer_interrupt
1.02% [init] [kernel] [k] eth_type_trans
1.01% [init] [kernel] [k] tcp_v4_rcv
0.96% iperf [kernel] [k] kfree
|
|--95.35%-- skb_release_data
| __kfree_skb
| |
| |--79.27%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --20.73%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.65%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.96% [init] [kernel] [k] read_tsc
0.92% iperf [kernel] [k] tcp_v4_do_rcv
|
|--95.12%-- tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.88%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.92% [init] [kernel] [k] e1000_clean_rx_irq
0.86% iperf [kernel] [k] tcp_rcv_established
|
|--96.10%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--3.90%-- tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.84% iperf [kernel] [k] kmem_cache_free
|
|--93.33%-- __kfree_skb
| |
| |--71.43%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --28.57%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--4.00%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.67%-- tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.80% [init] [kernel] [k] netif_receive_skb
0.79% iperf [kernel] [k] tcp_event_data_recv
|
|--83.10%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--12.68%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.23%-- tcp_data_queue
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.67% perf [kernel] [k] format_decode
|
|--91.67%-- vsnprintf
| seq_printf
| |
| |--67.27%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--23.64%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--7.27%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.82%-- cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--8.33%-- seq_printf
|
|--60.00%-- proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--40.00%-- show_map_vma
show_map
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.65% [init] [kernel] [k] __kmalloc_track_caller
0.63% [init] [kernel] [k] nf_conntrack_in
0.63% [init] [kernel] [k] ip_route_input
0.58% perf [kernel] [k] vsnprintf
|
|--98.08%-- seq_printf
| |
| |--60.78%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--19.61%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--9.80%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- cpuset_task_status_allowed
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.96%-- render_cap_t
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--1.92%-- snprintf
proc_task_readdir
vfs_readdir
sys_getdents
system_call_fastpath
__getdents64
0x69706565000a3430

0.57% [init] [kernel] [k] ktime_get
0.57% [init] [kernel] [k] nf_nat_fn
0.56% iperf [kernel] [k] tcp_packet
|
|--68.00%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--32.00%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.56% iperf /usr/bin/iperf [.] 0x000000000059f8
|
|--8.00%-- 0x4059f8
|
|--8.00%-- 0x405a16
|
|--8.00%-- 0x4059fd
|
|--4.00%-- 0x409d22
|
|--4.00%-- 0x405871
|
|--4.00%-- 0x406ee1
|
|--4.00%-- 0x405726
|
|--4.00%-- 0x4058db
|
|--4.00%-- 0x406ee8
|
|--2.00%-- 0x405b60
|
|--2.00%-- 0x4058fd
|
|--2.00%-- 0x4058d5
|
|--2.00%-- 0x405490
|
|--2.00%-- 0x4058bb
|
|--2.00%-- 0x405b93
|
|--2.00%-- 0x405b8e
|
|--2.00%-- 0x405903
|
|--2.00%-- 0x405ba8
|
|--2.00%-- 0x406eae
|
|--2.00%-- 0x405545
|
|--2.00%-- 0x405870
|
|--2.00%-- 0x405b67
|
|--2.00%-- 0x4058ce
|
|--2.00%-- 0x40570e
|
|--2.00%-- 0x406ee4
|
|--2.00%-- 0x405a02
|
|--2.00%-- 0x406eec
|
|--2.00%-- 0x405b82
|
|--2.00%-- 0x40556a
|
|--2.00%-- 0x405755
|
|--2.00%-- 0x405a0a
|
|--2.00%-- 0x405498
|
|--2.00%-- 0x409d20
|
|--2.00%-- 0x405b21
|
--2.00%-- 0x405a2c

0.56% [init] [kernel] [k] kmem_cache_alloc
0.56% [init] [kernel] [k] __inet_lookup_established
0.55% perf [kernel] [k] number
|
|--95.92%-- vsnprintf
| |
| |--97.87%-- seq_printf
| | |
| | |--56.52%-- show_map_vma
| | | show_map
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--28.26%-- render_sigset_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--6.52%-- proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--4.35%-- render_cap_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | --4.35%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --2.13%-- scnprintf
| bitmap_scnlistprintf
| seq_bitmap_list
| cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--4.08%-- seq_printf
|
|--50.00%-- show_map_vma
| show_map
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--50.00%-- render_sigset_t
proc_pid_status
proc_single_show
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.55% [init] [kernel] [k] native_sched_clock
0.50% iperf [kernel] [k] e1000_xmit_frame
|
|--71.11%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--28.89%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.50% iperf [kernel] [k] ipt_do_table
|
|--37.78%-- ipt_local_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--58.82%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --41.18%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--31.11%-- ipt_post_routing_hook
| nf_iterate
| nf_hook_slow
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--64.29%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --35.71%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--20.00%-- ipt_local_out_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--88.89%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --11.11%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- nf_iterate
| nf_hook_slow
| |
| |--66.67%-- ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_send_ack
| | tcp_cleanup_rbuf
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --33.33%-- __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--2.22%-- ipt_local_in_hook
| nf_iterate
| nf_hook_slow
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| napi_skb_finish
| napi_gro_receive
| e1000_receive_skb
| e1000_clean_rx_irq
| e1000_clean
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| irq_exit
| do_IRQ
| ret_from_intr
| vgettimeofday
|
--2.22%-- ipt_pre_routing_hook
nf_iterate
nf_hook_slow
ip_rcv
netif_receive_skb
napi_skb_finish
napi_gro_receive
e1000_receive_skb
e1000_clean_rx_irq
e1000_clean
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
__GI___libc_nanosleep

0.50% iperf [kernel] [k] schedule
|
|--57.78%-- do_nanosleep
| hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
|--33.33%-- schedule_timeout
| sk_wait_data
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
--2.22%-- sk_wait_data
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% iperf [kernel] [k] tcp_transmit_skb
|
|--97.73%-- tcp_send_ack
| |
| |--83.72%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | |
| | |--97.22%-- tcp_prequeue_process
| | | tcp_recvmsg
| | | sock_common_recvmsg
| | | __sock_recvmsg
| | | sock_recvmsg
| | | sys_recvfrom
| | | system_call_fastpath
| | | __recv
| | |
| | --2.78%-- release_sock
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --16.28%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.27%-- __tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% [init] [kernel] [k] nf_hook_slow
0.48% iperf [kernel] [k] virt_to_head_page
|
|--53.49%-- kfree
| skb_release_data
| __kfree_skb
| |
| |--65.22%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --34.78%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- skb_release_data
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- kmem_cache_free
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--9.30%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv
...

Jerome Glisse

unread,
Sep 7, 2009, 2:40:08 PM9/7/09
to
On Mon, 2009-09-07 at 13:50 +1000, Con Kolivas wrote:

> /me checks on his distributed computing client's progress, fires up
> his next H264 encode, changes music tracks and prepares to have his
> arse whooped on quakelive.
> --

For that kind of computer usage i would strongly suggest that you look
into GPU driver development: there is a lot of performance to be won in
this area, and my feeling is that you can improve most of what you are
doing (games -> OpenGL, so GPU; H264 - encoding is harder to accelerate
with a GPU, but for decoding and displaying it you definitely want to
involve the GPU), and tons of other things you are doing on your Linux
desktop would go faster if the GPU were put to more use. A wild guess
is that you can get a two or even three figure percentage improvement
with a better GPU driver. My point is that i don't think a Linux
scheduler improvement (compared to what we have now) will give a
significant boost to the Linux desktop; on the contrary, even a slight
improvement to the GPU driver stack can give you a boost. Another way
of saying it: there is no point in prioritizing X or a desktop app if
the CPU has to do all the drawing by itself (the CPU is several orders
of magnitude slower than the GPU at that kind of task).

Regards,
Jerome Glisse

Daniel Walker

unread,
Sep 7, 2009, 2:50:06 PM9/7/09
to
On Mon, 2009-09-07 at 20:26 +0200, Ingo Molnar wrote:
> That's interesting. I tried to reproduce it on x86, but the profile
> does not show any scheduler overhead at all on the server:

If the scheduler isn't running the task which causes the lower
throughput, would that even show up in the profiling output?

Daniel

Jens Axboe

unread,
Sep 7, 2009, 2:50:06 PM9/7/09
to
On Mon, Sep 07 2009, Avi Kivity wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>>
>> I ran a simple test as well, since I was curious to see how it performed
>> wrt interactiveness. One of my pet peeves with the current scheduler is
>> that I have to nice compile jobs, or my X experience is just awful while
>> the compile is running.
>>
>
> I think the problem is that CFS is optimizing for the wrong thing. It's
> trying to be fair to tasks, but these are meaningless building blocks of
> jobs, which is what the user sees and measures. Your make -j128
> dominates your interactive task by two orders of magnitude. If the
> scheduler attempts to bridge this gap using heuristics, it will fail
> badly when it misdetects since it will starve the really important
> 100-thread job for a task that was misdetected as interactive.

Agree, I was actually looking into doing joint latency for X number of
tasks for the test app. I'll try and do that and see if we can detect
something from that.

--
Jens Axboe

Michael Buesch

unread,
Sep 7, 2009, 3:00:13 PM9/7/09
to
On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> Could you profile it please? Also, what's the context-switch rate?

As far as I can tell, the broadcom mips architecture does not have profiling support.
It does only have some proprietary profiling registers that nobody wrote kernel
support for, yet.

--
Greetings, Michael.

Ingo Molnar

unread,
Sep 7, 2009, 4:40:08 PM9/7/09
to

* Jens Axboe <jens....@oracle.com> wrote:

> Agree, I was actually looking into doing joint latency for X
> number of tasks for the test app. I'll try and do that and see if
> we can detect something from that.

Could you please try latest -tip:

http://people.redhat.com/mingo/tip.git/README

(c26f010 or later)

Does it get any better with make -j128 build jobs? Peter just fixed
a bug in the SMP load-balancer that can cause interactivity problems
on large CPU count systems.

Ingo

Jens Axboe

unread,
Sep 7, 2009, 4:50:07 PM9/7/09
to
On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens....@oracle.com> wrote:
>
> > Agree, I was actually looking into doing joint latency for X
> > number of tasks for the test app. I'll try and do that and see if
> > we can detect something from that.
>
> Could you please try latest -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> (c26f010 or later)
>
> Does it get any better with make -j128 build jobs? Peter just fixed

The compile 'problem' is on my workstation, which is a dual core Intel
Core 2. I use -j4 on that typically. On the bigger boxes, I don't notice
any interactivity problems, largely because I don't run anything latency
sensitive on those :-)

> a bug in the SMP load-balancer that can cause interactivity problems
> on large CPU count systems.

Worth trying on the dual core box?

--
Jens Axboe

Jens Axboe

unread,
Sep 7, 2009, 4:50:09 PM9/7/09
to
On Mon, Sep 07 2009, Jens Axboe wrote:
> > And yes, it would be wonderful to get a test-app from you that would
> > express the kind of pain you are seeing during compile jobs.
>
> I was hoping this one would, but it's not showing anything. I even added
> support for doing the ping and wakeup over a socket, to see if the pipe
> test was doing well because of the sync wakeup we do there. The net
> latency is a little worse, but still good. So no luck in making that app
> so far.

Here's a version that bounces timestamps between a producer and a number
of consumers (clients). Not really tested much, but perhaps someone can
compare this on a box that boots BFS and see what happens.

To run it, use -cX where X is the number of children that you wait for a
response from. The max delay among these children is logged for each
wakeup. You can invoke it like:

$ ./latt -c4 'make -j4'

and it'll dump the max/avg/stddev bounce time after make has completed,
or if you just want to play around, start the compile in one xterm and
do:

$ ./latt -c4 'sleep 5'

to just log for a small period of time. Vary the number of clients to
see how that changes the aggregated latency. 1 should be fast, adding
more clients quickly adds up.

Additionally, it has -f and -t options that control the window of
sleep time for the parent between each message. The numbers are in
msecs, and it defaults to a minimum of 100 msecs and a maximum of 500 msecs.
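
For readers who don't have the attachment handy, here is a minimal,
hedged sketch of what the core of such a timestamp bouncer could look
like. This is NOT the actual latt.c that was attached to the mail - the
constants, the fixed client count and the missing option handling and
statistics are all illustrative assumptions; it only shows the
bounce-and-log idea described above:

/*
 * Hedged, minimal sketch of the timestamp bouncer described above.
 * NOT the actual latt.c.  The parent stamps the current time and
 * writes it to every client over a pipe; each client echoes back how
 * long the wakeup took, and the parent logs the worst delay per round.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/wait.h>

#define NR_CLIENTS	4
#define NR_ROUNDS	20

static unsigned long long now_nsec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	int to_child[NR_CLIENTS][2], from_child[NR_CLIENTS][2];
	int i, r;

	for (i = 0; i < NR_CLIENTS; i++) {
		pipe(to_child[i]);
		pipe(from_child[i]);
		if (fork() == 0) {
			unsigned long long stamp, delay;

			/* client: echo back the observed wakeup latency */
			while (read(to_child[i][0], &stamp, sizeof(stamp)) == sizeof(stamp)) {
				delay = now_nsec() - stamp;
				write(from_child[i][1], &delay, sizeof(delay));
			}
			exit(0);
		}
	}

	for (r = 0; r < NR_ROUNDS; r++) {
		unsigned long long stamp, delay, max_delay = 0;

		stamp = now_nsec();
		for (i = 0; i < NR_CLIENTS; i++)
			write(to_child[i][1], &stamp, sizeof(stamp));
		for (i = 0; i < NR_CLIENTS; i++) {
			read(from_child[i][0], &delay, sizeof(delay));
			if (delay > max_delay)
				max_delay = delay;
		}
		printf("max wakeup delay: %llu usec\n", max_delay / 1000);

		/* like -f/-t: parent sleeps 100-500 msec between messages */
		usleep(100000 + rand() % 400000);
	}

	for (i = 0; i < NR_CLIENTS; i++)
		close(to_child[i][1]);
	for (i = 0; i < NR_CLIENTS; i++)
		wait(NULL);
	return 0;
}

Something along these lines builds with 'gcc -O2 -o latt-sketch
latt-sketch.c -lrt' (the file name is made up); the real latt.c
additionally runs the payload command given on the command line and
reports max/avg/stddev over all wakeups, as described above.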

--
Jens Axboe

latt.c

Ingo Molnar

unread,
Sep 7, 2009, 5:00:10 PM9/7/09
to

* Michael Buesch <m...@bu3sch.de> wrote:

> On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> > Could you profile it please? Also, what's the context-switch rate?
>
> As far as I can tell, the broadcom mips architecture does not have
> profiling support. It does only have some proprietary profiling
> registers that nobody wrote kernel support for, yet.

Well, what does 'vmstat 1' show - how many context switches are
there per second on the iperf server? In theory if it's a truly
saturated box, there shouldn't be many - just a single iperf task
running at 100% CPU utilization or so.

(Also, if there's hrtimer support for that board then perfcounters
could be used to profile it.)

Ingo

Jens Axboe

unread,
Sep 7, 2009, 5:10:06 PM9/7/09
to
On Mon, Sep 07 2009, Peter Zijlstra wrote:

> On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > a bug in the SMP load-balancer that can cause interactivity problems
> > > on large CPU count systems.
> >
> > Worth trying on the dual core box?
>
> I debugged the issue on a dual core :-)
>
> It should be more pronounced on larger machines, but its present on
> dual-core too.

Alright, I'll upgrade that box to -tip tomorrow and see if it makes
a noticeable difference. At -j4 or higher, I can literally see windows
slowly popping up when switching to a different virtual desktop.

Peter Zijlstra

unread,
Sep 7, 2009, 5:10:09 PM9/7/09
to
On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > a bug in the SMP load-balancer that can cause interactivity problems
> > on large CPU count systems.
>
> Worth trying on the dual core box?

I debugged the issue on a dual core :-)

It should be more pronounced on larger machines, but it's present on
dual-core too.

--

Ingo Molnar

unread,
Sep 7, 2009, 6:20:08 PM9/7/09
to

* Jens Axboe <jens....@oracle.com> wrote:

> On Mon, Sep 07 2009, Peter Zijlstra wrote:
> > On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > > a bug in the SMP load-balancer that can cause interactivity problems
> > > > on large CPU count systems.
> > >
> > > Worth trying on the dual core box?
> >
> > I debugged the issue on a dual core :-)
> >
> > It should be more pronounced on larger machines, but its present on
> > dual-core too.
>
> Alright, I'll upgrade that box to -tip tomorrow and see if it
> makes a noticable difference. At -j4 or higher, I can literally
> see windows slowly popping up when switching to a different
> virtual desktop.

btw., if you run -tip and have these enabled:

CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y

cd tools/perf/
make -j install

... then you can use a couple of new perfcounters features to
measure scheduler latencies. For example:

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

Will tell you how many times this workload got delayed by waiting
for CPU time.

You can repeat the workload as well and see the statistical
properties of those metrics:

aldebaran:/home/mingo> perf stat --repeat 10 -e \
sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

Performance counter stats for './hackbench 20' (10 runs):

59826 sched:sched_stat_wait # 0.026 M/sec ( +- 5.540% )
2280.099643 task-clock-msecs # 7.525 CPUs ( +- 1.620% )

0.303013390 seconds time elapsed ( +- 3.189% )

To get scheduling events, do:

# perf list 2>&1 | grep sched:
sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_signal_send [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]

stat_wait/sleep/iowait would be the interesting ones, for latency
analysis.

Or, if you want to see all the specific delays and want to see
min/max/avg, you can do:

perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace

Ingo

Pekka Pietikainen

unread,
Sep 7, 2009, 8:00:12 PM9/7/09
to
On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldnt be many - just a single iperf task
Yay, finally something that's measurable in this thread \o/

Gigabit Ethernet iperf on an Atom or so might be something that
shows similar effects yet is debuggable. Anyone feel like taking a shot?

That beast doing iperf probably ends up running quite close to its
limits (IO, mem bw, CPU). IIRC the routing/bridging performance is
something like 40Mbps (it depends a lot on the model, and corresponds
pretty well with the MHz of the beast).

Maybe not totally unlike what make -j16 does to a 1-4 core box?

Thomas Fjellstrom

unread,
Sep 7, 2009, 8:00:10 PM9/7/09
to
On Sun September 6 2009, Nikos Chantziaras wrote:

> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in a
> way that can't be benchmarked. Examples:
>
> mplayer using OpenGL renderer doesn't drop frames anymore when dragging
> and dropping the video window around in an OpenGL composited desktop
> (KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
> the moment the move starts and at the moment you drop the window back to
> the desktop, there's a big frame skip as if mplayer was frozen for a
> bit; around 200 or 300ms.)
>
> Composite desktop effects like zoom and fade out don't stall for
> sub-second periods of time while there's CPU load in the background. In
> other words, the desktop is more fluid and less skippy even during heavy
> CPU load. Moving windows around with CPU load in the background doesn't
> result in short skips.
>
> LMMS (a tool utilizing real-time sound synthesis) does not produce
> "pops", "crackles" and drops in the sound during real-time playback due
> to buffer under-runs. Those problems amplify when there's heavy CPU
> load in the background, while with BFS heavy load doesn't produce those
> artifacts (though LMMS makes itself run SCHED_ISO with BFS) Also,
> hitting a key on the keyboard needs less time for the note to become
> audible when using BFS. Same should hold true for other tools who
> traditionally benefit from the "-rt" kernel sources.
>
> Games like Doom 3 and such don't "freeze" periodically for small amounts
> of time (again for sub-second amounts) when something in the background
> grabs CPU time (be it my mailer checking for new mail or a cron job, or
> whatever.)
>
> And, the most drastic improvement here, with BFS I can do a "make -j2"
> in the kernel tree and the GUI stays fluid. Without BFS, things start
> to lag, even with in-RAM builds (like having the whole kernel tree
> inside a tmpfs) and gcc running with nice 19 and ionice -c 3.
>
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness".
> Running the Doom 3 benchmark, or any other benchmark, doesn't say
> anything about responsiveness, it only measures how many frames were
> calculated in a specific period of time. How "stable" (with no stalls)
> those frames were making it to the screen is not measurable.
>
> If BFS would imply small drops in pure performance counted in
> instructions per seconds, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I
> run "make -j2" with BFS and the standard scheduler, BFS always finishes
> a bit faster. Not by much, but still. One thing I'm noticing here is
> that BFS produces 100% CPU load on each core with "make -j2" while the
> normal scheduler stays at about 90-95% with -j2 or higher in at least
> one of the cores. There seems to be under-utilization of CPU time.
>
> Also, by searching around the net but also through discussions on
> various mailing lists, there seems to be a trend: the problems for some
> reason seem to occur more often with Intel CPUs (Core 2 chips and lower;
> I can't say anything about Core I7) while people on AMD CPUs mostly not
> being affected by most or even all of the above. (And due to this flame
> wars often break out, with one party accusing the other of imagining
> things). Can the integrated memory controller on AMD chips have
> something to do with this? Do AMD chips generally offer better
> "multithrading" behavior? Unfortunately, you didn't mention on what CPU
> you ran your tests. If it was AMD, it might be a good idea to run tests
> on Pentium and Core 2 CPUs.
>
> For reference, my system is:
>
> CPU: Intel Core 2 Duo E6600 (2.4GHz)
> Mainboard: Asus P5E (Intel X38 chipset)
> RAM: 6GB (2+2+1+1) dual channel DDR2 800
> GPU: RV770 (Radeon HD4870).
>

My Phenom 9550 (2.2GHz) whips the pants off my Intel Q6600 (2.6GHz). A
friend of mine and I both get large amounts of stalling when doing a lot of IO. I
haven't seen such horrible desktop interactivity since before the new
schedulers and the -ck patchset came out for 2.4.x. It's a heck of a lot better
on my AMD Phenoms, but some lag is noticeable these days, even though there
wasn't any a few kernel releases ago.

Intel Specs:
CPU: Intel Core 2 Quad Q6600 (2.6Ghz)
Mainboard: ASUS P5K-SE (Intel p35 iirc)
RAM: 4G 800Mhz DDR2 dual channel (4x1G)
GPU: NVidia 8800GTS 320M

AMD Specs:
CPU: AMD Phenom I 9550 (2.2Ghz)
Mainboard: Gigabyte MA78GM-S2H
RAM: 4G 800Mhz DDR2 dual channel (2x2G)
GPU: Onboard Radeon 3200HD

AMD Specs x2:
CPU: AMD Phenom II 810 (2.6Ghz)
Mainboard: Gigabyte MA790FXT-UD5P
RAM: 4G 1066Mhz DDR3 dual channel (2x2G)
GPU: NVidia 8800GTS 320M (or currently a 8400GS)

Of course I get better performance out of the Phenom II than either other box,
but it surprises me that I'd get more out of the budget AMD box than out of the
not-so-budget Intel box.

--
Thomas Fjellstrom
tfjel...@shaw.ca

Nikos Chantziaras

unread,
Sep 8, 2009, 3:20:05 AM9/8/09
to
On 09/07/2009 05:40 PM, Arjan van de Ven wrote:

> On Mon, 07 Sep 2009 06:38:36 +0300
> Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>>> [...]
>>> Also, i'd like to outline that i agree with the general goals
>>> described by you in the BFS announcement - small desktop systems
>>> matter more than large systems. We find it critically important
>>> that the mainline Linux scheduler performs well on those systems
>>> too - and if you (or anyone else) can reproduce suboptimal behavior
>>> please let the scheduler folks know so that we can fix/improve it.
>>
>> BFS improved behavior of many applications on my Intel Core 2 box in
>> a way that can't be benchmarked. Examples:
>
> Have you tried to see if latencytop catches such latencies ?

I've just tried it.

I start latencytop and then mplayer on a video that doesn't max out the
CPU (it needs about 20-30% of a single core, out of the 2 available). Then,
while the video is playing, I press Alt+Tab repeatedly, which makes the
desktop compositor kick in and stay active (it lays out all windows as a
"flip-switch", similar to the Microsoft Vista Aero alt+tab effect).
Repeatedly pressing alt+tab keeps the compositor (in this case KDE
4.3.1) busy processing. With the mainline scheduler, mplayer
starts dropping frames and skipping sound like crazy for the whole duration
of this exercise.

latencytop has this to say:

http://foss.math.aegean.gr/~realnc/pics/latop1.png

Though I don't really understand what this tool is trying to tell me, I
hope someone does.

Ingo Molnar

unread,
Sep 8, 2009, 3:50:14 AM9/8/09
to

* Ingo Molnar <mi...@elte.hu> wrote:

> That's interesting. I tried to reproduce it on x86, but the
> profile does not show any scheduler overhead at all on the server:

I've now simulated a saturated iperf server by adding a
udelay(3000) to e1000_intr(), via the patch below.
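
(The patch itself is not reproduced in this excerpt. As a purely
illustrative, hedged sketch - the exact insertion point inside
e1000_intr() and the elided remainder of the handler are assumptions,
not the actual patch - the hack amounts to something like:)

/* sketch only, NOT the actual patch; needs <linux/delay.h> */
static irqreturn_t e1000_intr(int irq, void *data)
{
	/* hack: burn ~3ms of CPU per interrupt to simulate a saturated server */
	udelay(3000);

	/* ... the driver's normal interrupt handling continues here ... */
	return IRQ_HANDLED;
}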

There's no idle time left that way:

Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 93.2%hi, 4.2%si, 0.0%st
Mem: 1021044k total, 93400k used, 927644k free, 5068k buffers
Swap: 8193140k total, 0k used, 8193140k free, 25404k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1604 mingo 20 0 38300 956 724 S 99.4 0.1 3:15.07 iperf
727 root 15 -5 0 0 0 S 0.2 0.0 0:00.41 kondemand/0
1226 root 20 0 6452 336 240 S 0.2 0.0 0:00.06 irqbalance
1387 mingo 20 0 78872 1988 1300 S 0.2 0.2 0:00.23 sshd
1657 mingo 20 0 12752 1128 800 R 0.2 0.1 0:01.34 top
1 root 20 0 10320 684 572 S 0.0 0.1 0:01.79 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd

And the server is only able to saturate half of the 1 gigabit
bandwidth:

Client connecting to t, TCP port 5001


TCP window size: 16.0 KByte (default)
------------------------------------------------------------

[ 3] local 10.0.1.19 port 50836 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 504 MBytes 423 Mbits/sec
------------------------------------------------------------
Client connecting to t, TCP port 5001


TCP window size: 16.0 KByte (default)
------------------------------------------------------------

[ 3] local 10.0.1.19 port 50837 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 502 MBytes 420 Mbits/sec


perf top is showing:

------------------------------------------------------------------------------
PerfTop: 28517 irqs/sec kernel:99.4% [100000 cycles], (all, 1 CPUs)
------------------------------------------------------------------------------

samples pcnt kernel function
_______ _____ _______________

139553.00 - 93.2% : delay_tsc
2098.00 - 1.4% : hmac_digest
561.00 - 0.4% : ip_call_ra_chain
335.00 - 0.2% : neigh_alloc
279.00 - 0.2% : __hash_conntrack
257.00 - 0.2% : dev_activate
186.00 - 0.1% : proc_tcp_available_congestion_control
178.00 - 0.1% : e1000_get_regs
167.00 - 0.1% : tcp_event_data_recv

delay_tsc() dominates, as expected. Still zero scheduler overhead
and the context-switch rate is well below 1000 per sec.

Then i booted v2.6.30 vanilla, added the udelay(3000) and got:

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47026
[ 5] 0.0-10.0 sec 493 MBytes 412 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47027
[ 4] 0.0-10.0 sec 520 MBytes 436 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47028
[ 5] 0.0-10.0 sec 506 MBytes 424 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47029
[ 4] 0.0-10.0 sec 496 MBytes 415 Mbits/sec

i.e. essentially the same throughput. (and this shows that using .30
versus .31 did not materially impact iperf performance in this test,
under these conditions and with this hardware)

Then i applied the BFS patch to v2.6.30 and used the same
udelay(3000) hack and got:

No measurable change in throughput.

Obviously, this test is not equivalent to your test - but it does
show that even saturated iperf is getting scheduled just fine. (or,
rather, does not get scheduled all that much.)

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38505
[ 5] 0.0-10.1 sec 481 MBytes 401 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38506
[ 4] 0.0-10.0 sec 505 MBytes 423 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38507
[ 5] 0.0-10.0 sec 508 MBytes 426 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38508
[ 4] 0.0-10.0 sec 486 MBytes 406 Mbits/sec

So either your MIPS system has some unexpected dependency on the
scheduler, or there's something weird going on.

Mind poking on this one to figure out whether it's all repeatable
and why that slowdown happens? Multiple attempts to reproduce it
failed here for me.

Ingo

Ingo Molnar

unread,
Sep 8, 2009, 4:10:06 AM9/8/09
to

* Pekka Pietikainen <p...@ee.oulu.fi> wrote:

> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > > Could you profile it please? Also, what's the context-switch rate?
> > >
> > > As far as I can tell, the broadcom mips architecture does not have
> > > profiling support. It does only have some proprietary profiling
> > > registers that nobody wrote kernel support for, yet.
> > Well, what does 'vmstat 1' show - how many context switches are
> > there per second on the iperf server? In theory if it's a truly
> > saturated box, there shouldnt be many - just a single iperf task
>
> Yay, finally something that's measurable in this thread \o/

My initial posting in this thread contains 6 separate types of
measurements, rather extensive ones. Out of those, 4 measurements
were latency oriented, two were throughput oriented. Plenty of data,
plenty of results, and very good reproducibility.

> Gigabit Ethernet iperf on an Atom or so might be something that
> shows similar effects yet is debuggable. Anyone feel like taking a
> shot?

I tried iperf on x86 and simulated saturation and no, there's no BFS
versus mainline performance difference that i can measure - simply
because a saturated iperf server does not schedule much - it's busy
handling all that networking workload.

I did notice that iperf is somewhat noisy: it can easily have weird
outliers regardless of which scheduler is used. That could be an
effect of queueing/timing: depending on precisely what order packets
arrive in and how they get queued by the networking stack, a
cache-effective pathway for the packets may get opened - while with
slightly different timings, that pathway closes and we get much worse
queueing performance. I saw noise on the order of 10%,
so iperf has to be measured carefully before drawing conclusions.

> That beast doing iperf probably ends up making it go quite close
> to it's limits (IO, mem bw, cpu). IIRC the routing/bridging
> performance is something like 40Mbps (depends a lot on the model,
> corresponds pretty well with the Mhz of the beast).
>
> Maybe not totally unlike what make -j16 does to a 1-4 core box?

No, a single iperf session is very different from kbuild make -j16.

Firstly, the iperf server is just a single long-lived task - so we
context-switch between that and the idle thread, [and perhaps a
kernel thread such as ksoftirqd]. The scheduler essentially has no
leeway in what task to schedule and for how long: if there's work going
on, the iperf server task will run - if there's none, the idle task
runs. [modulo ksoftirqd - depending on the driver model and
on precise timings.]

kbuild -j16 on the other hand is a complex hierarchy and mixture of
thousands of short-lived and long-lived tasks. The scheduler has a
lot of leeway to decide what to schedule and for how long.

From a scheduler perspective the two workloads could not be any more
different. Kbuild does test scheduler decisions in non-trivial ways
- iperf server does not really.

Ingo

Nikos Chantziaras

unread,
Sep 8, 2009, 4:20:13 AM9/8/09
to
On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>
> * Pekka Pietikainen<p...@ee.oulu.fi> wrote:
>
>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>
>>>> As far as I can tell, the broadcom mips architecture does not have
>>>> profiling support. It does only have some proprietary profiling
>>>> registers that nobody wrote kernel support for, yet.
>>> Well, what does 'vmstat 1' show - how many context switches are
>>> there per second on the iperf server? In theory if it's a truly
>>> saturated box, there shouldnt be many - just a single iperf task
>>
>> Yay, finally something that's measurable in this thread \o/
>
> My initial posting in this thread contains 6 separate types of
> measurements, rather extensive ones. Out of those, 4 measurements
> were latency oriented, two were throughput oriented. Plenty of data,
> plenty of results, and very good reproducability.

None of which involve latency-prone GUI applications running on cheap
commodity hardware though. I listed examples where mainline seems to
behave sub-optimally and ways to reproduce them, but this doesn't seem to
be an area of interest.

Arjan van de Ven

unread,
Sep 8, 2009, 4:30:13 AM9/8/09
to
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

unfortunately this is both an older version of latencytop and it's
incorrectly installed ;-(
Latencytop is supposed to translate those cryptic strings to English,
but since it is not correctly installed, it does not do this ;(

the latest version of latencytop also has a GUI (thanks to Ben)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Arjan van de Ven

unread,
Sep 8, 2009, 4:40:04 AM9/8/09
to
On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <rea...@arcor.de> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

despite the untranslated content, it is clear that you have scheduler
delays (either due to scheduler bugs or CPU contention) of up to 68
msecs... Second in line is your binary AMD graphics driver that is
chewing up 14% of your total latency...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Jens Axboe

unread,
Sep 8, 2009, 5:20:08 AM9/8/09
to
On Mon, Sep 07 2009, Jens Axboe wrote:
> On Mon, Sep 07 2009, Jens Axboe wrote:
> > > And yes, it would be wonderful to get a test-app from you that would
> > > express the kind of pain you are seeing during compile jobs.
> >
> > I was hoping this one would, but it's not showing anything. I even added
> > support for doing the ping and wakeup over a socket, to see if the pipe
> > test was doing well because of the sync wakeup we do there. The net
> > latency is a little worse, but still good. So no luck in making that app
> > so far.
>
> Here's a version that bounces timestamps between a producer and a number
> of consumers (clients). Not really tested much, but perhaps someone can
> compare this on a box that boots BFS and see what happens.

And here's a newer version. It ensures that clients are running before
sending a timestamp, and it drops the first and last log entry to
eliminate any weird effects there. Accuracy should also be improved.

On an idle box, it'll usually log all zeroes. Sometimes I see 3-4msec
latencies, weird.

--
Jens Axboe

latt.c

Benjamin Herrenschmidt

unread,
Sep 8, 2009, 6:00:12 AM9/8/09
to
On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> So either your MIPS system has some unexpected dependency on the
> scheduler, or there's something weird going on.
>
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens? Multiple attempts to reproduce it
> failed here for me.

Could it be the scheduler using constructs that don't do well on MIPS?

I remember at some stage we spotted an expensive multiply in there;
maybe there's something similar, or some data structure that is
unaligned or not cache-friendly with respect to the MIPS cache line
size, that sort of thing ...

Is this a SW-loaded TLB? Does it miss on kernel space? That could
also come down to differences in how many pages are touched by each
scheduler, causing more TLB pressure. This will be mostly invisible on x86.

At this stage, it will be hard to tell without some profile data, I
suppose. Maybe next week I can try on a small embedded PPC with a
SW-loaded TLB and see if I can reproduce some of that, but no promises here.

Cheers,
Ben.

Nikos Chantziaras

unread,
Sep 8, 2009, 6:20:06 AM9/8/09
to
On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> On Tue, 08 Sep 2009 10:19:06 +0300
> Nikos Chantziaras<rea...@arcor.de> wrote:
>
>> latencytop has this to say:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>
>> Though I don't really understand what this tool is trying to tell me,
>> I hope someone does.
>
> despite the untranslated content, it is clear that you have scheduler
> delays (either due to scheduler bugs or cpu contention) of upto 68
> msecs... Second in line is your binary AMD graphics driver that is
> chewing up 14% of your total latency...

I've now used a correctly installed and up-to-date version of latencytop
and repeated the test. Also, I got rid of AMD's binary blob and used
kernel DRM drivers for my graphics card to throw fglrx out of the
equation (which btw didn't help; the exact same problems occur).

Here the result:

http://foss.math.aegean.gr/~realnc/pics/latop2.png

Again: this is on an Intel Core 2 Duo CPU.

Ingo Molnar

unread,
Sep 8, 2009, 6:20:06 AM9/8/09