
PREEMPT_RT and I-PIPE: the numbers, take 3


Kristian Benoit

Jun 29, 2005, 6:40:04 PM
This is the 3rd run of our tests.

Here are the changes since last time:

- Modified the IRQ latency measurement code on the logger to do a busy-
wait on the target instead of responding to an interrupt triggered by
the target's "reply". As Ingo had suggested, this very much replicates
what lpptest.c does. In fact, we actually copied Ingo's loop.

- LMbench runs are now conducted 5 times instead of just 3.

- The software versions being used were:
2.6.12 - final
RT-V0.7.50-35
I-pipe v0.7


System load:
------------

As in the first two runs, total system load is measured by LMbench's
execution time under various system loads. This time, however, each test
was run 5 times instead of 3. While we consider these tests more
representative of overall system behavior than previously published
results, measuring the total running time of something like LMbench
still gives only a rough idea of overall system overhead. In fact, the
raw results show that the running times often vary quite a lot. For a
more trustworthy measurement, one must look at the results reported by
LMbench itself.

Note that the total running time of all configurations under the
"plain" and "IRQ test" loads has gone down significantly (by about
20 to 25 seconds). This is likely the result of an optimization within
2.6.12-final. Most of the results for the other loads are nevertheless
in the same range of values found previously, except the "IRQ & hd"
results for PREEMPT_RT, which show an improvement.

LMbench running times:
+--------------------+-------+-------+-------+-------+-------+
| Kernel | plain | IRQ | ping | IRQ & | IRQ & |
| | | test | flood | ping | hd |
+====================+=======+=======+=======+=======+=======+
| Vanilla-2.6.12 | 150 s | 153 s | 183 s | 187 s | 242 s |
+====================+=======+=======+=======+=======+=======+
| with RT-V0.7.50-35 | 155 s | 155 s | 205 s | 206 s | 250 s |
+--------------------+-------+-------+-------+-------+-------+
| % | 3.3 | 1.3 | 12.0 | 10.2 | 3.3 |
+====================+=======+=======+=======+=======+=======+
| with Ipipe-0.7 | 153 s | 154 s | 193 s | 198 s | 259 s |
+--------------------+-------+-------+-------+-------+-------+
| % | 2.0 | 0.6 | 5.5 | 5.9 | 7.0 |
+--------------------+-------+-------+-------+-------+-------+

Legend:
plain = Nothing special
IRQ test = on logger: triggering target every 1ms
ping flood = on host: "sudo ping -f $TARGET_IP_ADDR"
IRQ & ping = combination of the previous two
IRQ & hd = IRQ test with the following being done on the target:
"while [ true ]
do dd if=/dev/zero of=/tmp/dummy count=512 bs=1m
done"

Looking at the LMbench output, which is clearly a more adequate
benchmark, here are some highlights (each figure is averaged over 5 runs
using lmbsum):

"plain" run:

Measurements   | Vanilla     | preempt_rt     | ipipe
---------------+-------------+----------------+-------------
fork | 93us | 157us (+69%) | 95us (+2%)
open/close | 2.3us | 3.7us (+43%) | 2.4us (+4%)
execve | 351us | 446us (+27%) | 363us (+3%)
select 500fd | 12.7us | 25.8us (+103%) | 12.8us (+1%)
mmap | 660us | 2867us (+334%) | 677us (+3%)
pipe | 7.1us | 11.6us (+63%) | 7.4us (+4%)


"IRQ test" run:
Measurements   | Vanilla     | preempt_rt     | ipipe
---------------+-------------+----------------+-------------
fork | 96us | 158us (+65%) | 97us (+1%)
open/close | 2.4us | 3.7us (+54%) | 2.4us (~)
execve | 355us | 453us (+28%) | 365us (+3%)
select 500fd   | 12.8us      | 26.0us (+103%) | 12.8us (~)
mmap | 662us | 2893us (+337%) | 679us (+3%)
pipe | 7.1us | 13.2us (+86%) | 7.5us (+6%)


"ping flood" run:
Measurements   | Vanilla     | preempt_rt     | ipipe
---------------+-------------+----------------+-------------
fork | 137us | 288us (+110%) | 162us (+18%)
open/close | 3.9us | 7.0us (+79%) | 4.0us (+3%)
execve | 562us | 865us (+54%) | 657us (+17%)
select 500fd | 19.3us | 47.4us (+146%) | 21.0us (+4%)
mmap | 987us | 4921us (+399%) | 1056us (+7%)
pipe | 11.0us | 23.7us (+115%) | 13.3us (+20%)


"IRQ & ping" run:
Measurements   | Vanilla     | preempt_rt     | ipipe
---------------+-------------+----------------+-------------
fork | 143us | 291us (+103%) | 163us (+14%)
open/close | 3.9us | 7.1us (+82%) | 4.0us (+3%)
execve | 567us | 859us (+51%) | 648us (+14%)
select 500fd | 19.6us | 52.2us (+166%) | 21.4us (+9%)
mmap | 983us | 5061us (+415%) | 1110us (+13%)
pipe | 12.2us | 28.0us (+130%) | 12.7us (+4%)


"IRQ & hd" run:
Measurements   | Vanilla     | preempt_rt     | ipipe
---------------+-------------+----------------+-------------
fork | 96us | 164us (+71%) | 100us (+4%)
open/close | 2.5us | 3.8us (+52%) | 2.5us (~)
execve | 373us | 479us (+28%) | 382us (+2%)
select 500fd | 13.3us | 27.2us (+105%) | 13.4us (+1%)
mmap | 683us | 3013us (+341%) | 712us (+4%)
pipe | 9.9us | 23.0us (+132%) | 10.6us (+7%)

These results are consistent with those highlighted during the discussion
following the publication of the last test run.


Interrupt response time:
------------------------

Unlike the first two runs, these times are measured using a busy-wait
on the logger. Basically, we disable interrupts, fire the interrupt to
the target, log the time, spin until the input on the parallel port
changes, and then log the time again to obtain the response time. Each
of these runs accumulates over 1,000,000 interrupt latency measurements.

We stand corrected as to the method that was used to collect interrupt
latency measurements. Ingo's suggestion to disable all interrupts on
the logger while collecting the target's response does indeed mostly
eliminate logger-side latencies. However, we sporadically ran into
situations where the logger locked up, which it never did before, when
we measured the response using another interrupt. This happened around 3
times in total over all of our test runs (and that's a lot of test
runs), so it isn't systematic, but it did happen.

+--------------------+------------+------+-------+------+--------+
| Kernel | sys load | Aver | Max | Min | StdDev |
+====================+============+======+=======+======+========+
| | None | 5.7 | 51.8 | 5.6 | 0.3 |
| | Ping | 5.8 | 51.8 | 5.6 | 0.5 |
| Vanilla-2.6.12 | lm. + ping | 6.2 | 83.5 | 5.6 | 1.0 |
| | lmbench | 6.0 | 57.6 | 5.6 | 0.9 |
| | lm. + hd | 6.5 | 177.4 | 5.6 | 4.1 |
| | DoHell | 6.9 | 525.4 | 5.6 | 5.2 |
+--------------------+------------+------+-------+------+--------+
| | None | 5.7 | 47.5 | 5.7 | 0.2 |
| | Ping | 7.0 | 63.4 | 5.7 | 1.6 |
| with RT-V0.7.50-35 | lm. + ping | 7.9 | 66.2 | 5.7 | 1.9 |
| | lmbench | 7.4 | 51.8 | 5.7 | 1.4 |
| | lm. + hd | 7.3 | 53.4 | 5.7 | 1.9 |
| | DoHell | 7.9 | 59.1 | 5.7 | 1.8 |
+--------------------+------------+------+-------+------+--------+
| | None | 7.1 | 50.4 | 5.7 | 0.2 |
| | Ping | 7.3 | 47.6 | 5.7 | 0.4 |
| with Ipipe-0.7     | lm. + ping | 7.7  | 50.4  | 5.7  | 0.8    |
| | lmbench | 7.5 | 50.5 | 5.7 | 0.7 |
| | lm. + hd | 7.5 | 51.8 | 5.7 | 0.7 |
| | DoHell | 7.6 | 50.5 | 5.7 | 0.7 |
+--------------------+------------+------+-------+------+--------+

Legend:
None = nothing special
ping = on host: "sudo ping -f $TARGET_IP_ADDR"
lm. + ping = previous test and "make rerun" in lmbench-2.0.4/src/ on
target
lmbench = "make rerun" in lmbench-2.0.4/src/ on target
lm. + hd = previous test with the following being done on the target:
"while [ true ]
do dd if=/dev/zero of=/tmp/dummy count=512 bs=1m
done"
DoHell = See:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111947618802722&w=2

We don't know whether we've hit the maximums Ingo alluded to, but we
did integrate his dohell script, and the only noticeable difference was
with vanilla Linux, where the maximum jumped to 525.4 microseconds.
Neither PREEMPT_RT nor I-PIPE exhibited such maximums under the same
load.


Overall analysis:
-----------------

We had not intended to redo a 3rd run so early, but we're happy we did
given the doubts expressed by some on the LKML. And as we suspected, these
new results very much corroborate what we had found earlier. As such, our
conclusions remain mostly unchanged:

- Both approaches yield similar results in terms of interrupt response times.
  On most runs I-pipe seems to do slightly better on the maximums, but there
  are cases where PREEMPT_RT does better. The results obtained in this
  3rd round do corroborate the average latency extrapolated from analyzing
  previous runs, but they contradict our earlier analysis of the true
  maximum delays. Instead, Paul McKenney was right: the maximums are
  indeed around 50us. Also, Ingo was right insofar as we found a
  maximum for vanilla Linux that went all the way up to 525us. We stand
  corrected on both these issues. Nevertheless, the above results are
  consistent with previously published ones.

- In terms of system load, PREEMPT_RT typically still costs more than
  I-pipe, as was seen in the two previous runs. Please note that we've
  done our best to use the truly latest version of PREEMPT_RT available
  at this time. We purposely finished all other tests before doing the
  PREEMPT_RT runs, in order to make sure we were using the latest
  possible release, as this was one of the main grievances expressed by
  those supporting PREEMPT_RT.

For a more complete discussion of our conclusions please see our previous
test results publication:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111928813818151&w=2

Again, we hope these results will help those interested in enhancing Linux
for use in real-time applications to better direct their efforts.

Kristian Benoit
Karim Yaghmour
--
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || 1-866-677-4546
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Bill Huey

Jun 29, 2005, 7:00:21 PM
On Wed, Jun 29, 2005 at 06:29:24PM -0400, Kristian Benoit wrote:
> Overall analysis:
...

> We had not intended to redo a 3rd run so early, but we're happy we did
> given the doubts expressed by some on the LKML. And as we suspected, these
> new results very much corroborate what we had found earlier. As such, our
> conclusions remain mostly unchanged:

Did you compile your host Linux kernel with CONFIG_SMP in place? That's
critical, since a UP kernel removes both spinlocks and blocking locks in
critical paths, which makes micro-benchmarks sort of invalid.

The benchmark is sort of confusing two things and merging them into one.
The latency statistics and kernel performance must be kept separate.
Overall kernel performance is a more complicated issue that has to be
analyzed differently, using a more complicated methodology. That's because
an RTOS use of PREEMPT_RT is going to be under different circumstances
than a pure dual-kernel setup of some sort. The functionalities
aren't the same.

I suggest that you compile the dual kernel with SMP turned on and try it
again, otherwise it's not really testing the overhead of any of the locking
for either the PREEMPT_RT or dual-kernel setups. That's really the only
outstanding statistic that I've noticed in that benchmark.

bill

Andrea Arcangeli

Jun 29, 2005, 7:10:07 PM
On Wed, Jun 29, 2005 at 03:57:34PM -0700, Bill Huey wrote:
> Did you compile your host Linux kernel with CONFIG_SMP in place? That's
> critical, since a UP kernel removes both spinlocks and blocking locks in
> critical paths, which makes micro-benchmarks sort of invalid.

Why should he compile with CONFIG_SMP when CONFIG_SMP is absolutely
unnecessary without preempt-RT?

If you're an embedded developer, and you're _not_ using preempt-RT, why
in your right mind would you compile your kernel with CONFIG_SMP
enabled if you've only 1 cpu and no SMP in the hardware?

> I suggest that you compile the dual kernel with SMP turned on and try it
> again, otherwise it's not really testing the overhead of any of the locking
> for either the PREEMPT_RT or dual-kernel setups. That's really the only
> outstanding statistic that I've noticed in that benchmark.

On UP, the overhead of the spinlocks is measurable, but it isn't of such
a huge order of magnitude, so even if you enabled CONFIG_SMP (which makes
absolutely no sense, since embedded developers have 1 CPU to deal with),
you'd still underperform greatly compared to CONFIG_SMP=y alone. So even
if somebody could buy that the benchmark is unfair with CONFIG_SMP=n, I
can just tell you that comparing against CONFIG_SMP=y isn't going to save
preempt-rt.

Bill Huey

Jun 29, 2005, 7:30:10 PM
On Thu, Jun 30, 2005 at 01:03:08AM +0200, Andrea Arcangeli wrote:
> Why should he compile with CONFIG_SMP when CONFIG_SMP is absolutely
> unnecessary without preempt-RT?

Because I'm the Anti-Ingo and say so.

bill

Paul E. McKenney

Jun 29, 2005, 8:00:40 PM
On Wed, Jun 29, 2005 at 06:29:24PM -0400, Kristian Benoit wrote:
> This is the 3rd run of our tests.
>
> Here are the changes since last time:
>
> - Modified the IRQ latency measurement code on the logger to do a busy-
> wait on the target instead of responding to an interrupt triggered by
> the target's "reply". As Ingo had suggested, this very much replicates
> what lpptest.c does. In fact, we actually copied Ingo's loop.
>
> - LMbench runs are now conducted 5 times instead of just 3.
>
> - The software versions being used were:
> 2.6.12 - final
> RT-V0.7.50-35
> I-pipe v0.7

Excellent work, I very much appreciate the effort and the results!!!

Thanx, Paul

Paul E. McKenney

Jun 29, 2005, 8:10:11 PM
On Wed, Jun 29, 2005 at 03:57:34PM -0700, Bill Huey wrote:
> On Wed, Jun 29, 2005 at 06:29:24PM -0400, Kristian Benoit wrote:
> > Overall analysis:
> ...
> > We had not intended to redo a 3rd run so early, but we're happy we did
> > given the doubts expressed by some on the LKML. And as we suspected, these
> > new results very much corroborate what we had found earlier. As such, our
> > conclusions remain mostly unchanged:
>
> Did you compile your host Linux kernel with CONFIG_SMP in place? That's
> critical, since a UP kernel removes both spinlocks and blocking locks in
> critical paths, which makes micro-benchmarks sort of invalid.
>
> The benchmark is sort of confusing two things and merging them into one.
> The latency statistics and kernel performance must be kept separate.
> Overall kernel performance is a more complicated issue that has to be
> analyzed differently, using a more complicated methodology. That's because
> an RTOS use of PREEMPT_RT is going to be under different circumstances
> than a pure dual-kernel setup of some sort. The functionalities
> aren't the same.
>
> I suggest that you compile the dual kernel with SMP turned on and try it
> again, otherwise it's not really testing the overhead of any of the locking
> for either the PREEMPT_RT or dual-kernel setups. That's really the only
> outstanding statistic that I've noticed in that benchmark.

If you were suggesting this to be run on an SMP system, I would agree
with you. I, too, would very much like to see these results run on a
2-CPU or 4-CPU system, although I am most certainly -not- asking Kristian
and Karim to do this work -- it is very much someone else's turn in the
barrel, I would say!

However, on a UP system, I have to agree with Kristian's choice of
configuration. An embedded system developer running on a UP system would
naturally use a UP Linux kernel build, so it makes sense to benchmark
a UP kernel on a UP system.

Thanx, Paul

Bill Huey

Jun 29, 2005, 9:50:06 PM
On Wed, Jun 29, 2005 at 04:54:22PM -0700, Paul E. McKenney wrote:
> If you were suggesting this to be run on an SMP system, I would agree
> with you. I, too, would very much like to see these results run on a
> 2-CPU or 4-CPU system, although I am most certainly -not- asking Kristian
> and Karim to do this work -- it is very much someone else's turn in the
> barrel, I would say!

No, I'm suggesting that you and other folks understand the basic ideas
behind this patch and stop asking unbelievably stupid questions. This has
been covered over and over again, and I shouldn't have to repeat these
positions constantly because folks have both a language comprehension
problem and inability to bug off appropriately.

> However, on a UP system, I have to agree with Kristian's choice of
> configuration. An embedded system developer running on a UP system would
> naturally use a UP Linux kernel build, so it makes sense to benchmark
> a UP kernel on a UP system.

Dual cores are going to be standard in the next few years so RTOSs should
anticipate these things coming down the pipeline.

bill

Nick Piggin

Jun 29, 2005, 10:00:11 PM
Bill Huey (hui) wrote:
> On Wed, Jun 29, 2005 at 04:54:22PM -0700, Paul E. McKenney wrote:
>
>>If you were suggesting this to be run on an SMP system, I would agree
>>with you. I, too, would very much like to see these results run on a
>>2-CPU or 4-CPU system, although I am most certainly -not- asking Kristian
>>and Karim to do this work -- it is very much someone else's turn in the
>>barrel, I would say!
>
>
> No, I'm suggesting that you and other folks understand the basic ideas
> behind this patch and stop asking unbelievably stupid questions. This has
> been covered over and over again, and I shouldn't have to repeat these
> positions constantly because folks have both a language comprehension
> problem and inability to bug off appropriately.
>

Sorry Bill, on this forum numbers talk and bullshit walks. So
please go away or at least drop me from your cc list until you
feel you can be civil to people.


Nicolas Pitre

Jun 29, 2005, 10:10:06 PM
On Wed, 29 Jun 2005, Bill Huey wrote:

> No, I'm suggesting that you and other folks understand the basic ideas
> behind this patch and stop asking unbelievably stupid questions. This has
> been covered over and over again, and I shouldn't have to repeat these
> positions constantly because folks have both a language comprehension
> problem and inability to bug off appropriately.

Given the above statement, I'm forced to believe you might have another
kind of problem yourself. Being harsh with people isn't going to make
you many friends.

You really should improve your tolerance and communication skills or bug
off appropriately yourself.


Nicolas

Nick Piggin

Jun 29, 2005, 10:20:09 PM
Bill Huey (hui) wrote:

> On Thu, Jun 30, 2005 at 11:56:55AM +1000, Nick Piggin wrote:
>
>>Sorry Bill, on this forum numbers talk and bullshit walks. So
>>please go away or at least drop me from your cc list until you
>>feel you can be civil to people.
>
>
> Numbers my ass. This has been covered over and over again. I don't
> need to repeat myself to folks that didn't get it in the first place.
>
> Multiple folks have tried to explain this to you and others and
> you still don't get it.
>

What part of "drop me from your cc list" were you having trouble
understanding?

Bill Huey

Jun 29, 2005, 10:20:11 PM
On Wed, Jun 29, 2005 at 10:01:42PM -0400, Nicolas Pitre wrote:
> Given the above statement I'm forced to believe you might have another
> kind of problem yourself. Being arch with people isn't going to make
> you many friends.
>
> You really should improve your tolerance and communication skills or bug
> off appropriately yourself.

It's justified by having to deal with the amount of ignorance surrounding
this patch. It has been covered over and over again, and I and others lost
patience with the kinds of idiot comments surrounding it.

bill

Bill Huey

Jun 29, 2005, 10:20:10 PM
On Thu, Jun 30, 2005 at 11:56:55AM +1000, Nick Piggin wrote:
> Sorry Bill, on this forum numbers talk and bullshit walks. So
> please go away or at least drop me from your cc list until you
> feel you can be civil to people.

Numbers my ass. This has been covered over and over again. I don't
need to repeat myself to folks that didn't get it in the first place.

Multiple folks have tried to explain this to you and others and
you still don't get it.

bill

Bill Huey

Jun 29, 2005, 10:30:13 PM
On Thu, Jun 30, 2005 at 12:09:04PM +1000, Nick Piggin wrote:
> What part of "drop me from your cc list" were you having trouble
> understanding?

Then don't comment or reply to these emails and use some kind of
condescending tone.

bill

Paul E. McKenney

Jun 29, 2005, 10:30:13 PM
On Wed, Jun 29, 2005 at 06:50:41PM -0700, Bill Huey wrote:
> On Wed, Jun 29, 2005 at 04:54:22PM -0700, Paul E. McKenney wrote:
> > If you were suggesting this to be run on an SMP system, I would agree
> > with you. I, too, would very much like to see these results run on a
> > 2-CPU or 4-CPU system, although I am most certainly -not- asking Kristian
> > and Karim to do this work -- it is very much someone else's turn in the
> > barrel, I would say!
>
> No, I'm suggesting that you and other folks understand the basic ideas
> behind this patch and stop asking unbelievably stupid questions. This has
> been covered over and over again, and I shouldn't have to repeat these
> positions constantly because folks have both a language comprehension
> problem and inability to bug off appropriately.

Sorry to disappoint you, but I stand by my statements (I see no questions
in my earlier email that you quoted).

To repeat, comparing UP kernels on UP systems seems eminently fair
and evenhanded to me. Similarly, comparing SMP kernels running on SMP
systems seems quite fair and evenhanded to me. Running SMP kernels on
UP systems can provide some useful information, but why would such a
benchmark be of interest to someone writing realtime applications that
will run on a UP system?

Keep in mind that performance is only one aspect to consider when
comparing the different approaches.

And why should Kristian and Karim be asked to run an SMP-kernel test?
They have released their framework, so others can do this if they wish.
Besides, didn't someone recently offer to do some testing?

I sympathize with the language-comprehension problem -- I would no
doubt be completely helpless in your native language. I do appreciate
the effort you make to deal with English.

> > However, on a UP system, I have to agree with Kristian's choice of
> > configuration. An embedded system developer running on a UP system would
> > naturally use a UP Linux kernel build, so it makes sense to benchmark
> > a UP kernel on a UP system.
>
> Dual cores are going to be standard in the next few years so RTOSs should
> anticipate these things coming down the pipeline.

Agreed, though single-core CPUs aren't going to disappear any time soon.
People still use 8-bit Z80s, after all, and have been for over 25 years.

But if dual-core CPUs are going to be standard, why did you object to
comparing the two patches on an SMP system?

Thanx, Paul

Ingo Molnar

Jun 30, 2005, 2:10:08 AM

* Kristian Benoit <kbe...@opersys.com> wrote:

> This is the 3rd run of our tests.

thanks for the testing!

> Here are the changes since last time:
>
> - Modified the IRQ latency measurement code on the logger to do a
> busy- wait on the target instead of responding to an interrupt
> triggered by the target's "reply". As Ingo had suggested, this very
> much replicates what lpptest.c does. In fact, we actually copied
> Ingo's loop.

[...]

> We stand corrected as to the method that was used to collect interrupt
> latency measurements. Ingo's suggestion to disable all interrupts on
> the logger to collect the target's response does indeed mostly
> eliminate logger-side latencies. However, we've sporadically ran into
> situations where the logger locked-up, whereas it didn't before when
> we used to measure the response using another interrupt. This happened
> around 3 times in total over all of our test runs (and that's a lot of

> test runs), so it isn't systematic, but it did happen. [...]

are you timing out based on a TSC deadline, like lpptest does? If an
interrupt gets lost, the logger may lock up if it's looping with
interrupts disabled.

> +--------------------+------------+------+-------+------+--------+
> | Kernel | sys load | Aver | Max | Min | StdDev |
> +====================+============+======+=======+======+========+

> +--------------------+------------+------+-------+------+--------+


> | | None | 5.7 | 47.5 | 5.7 | 0.2 |
> | | Ping | 7.0 | 63.4 | 5.7 | 1.6 |
> | with RT-V0.7.50-35 | lm. + ping | 7.9 | 66.2 | 5.7 | 1.9 |
> | | lmbench | 7.4 | 51.8 | 5.7 | 1.4 |
> | | lm. + hd | 7.3 | 53.4 | 5.7 | 1.9 |
> | | DoHell | 7.9 | 59.1 | 5.7 | 1.8 |
> +--------------------+------------+------+-------+------+--------+

> We don't know whether we've hit the maximums Ingo alluded to, but we

> did integrate his dohell script and the only noticeable difference was
> with Linux where the maximum jumped to 525.4 micro-seconds. But that
> was with vanilla only. Neither PREEMPT_RT nor I-PIPE exhibited such
> maximums under the same load.

i'm seeing roughly half of that worst-case IRQ latency on similar
hardware (2GHz Athlon64), so i believe your system has some hardware
latency that masks the capabilities of the underlying RTOS. It would be
interesting to see IRQSOFF_TIMING + LATENCY_TRACE critical path
information from the -RT tree. Just enable those two options in the
.config (on the host side), and do:

echo 0 > /proc/sys/kernel/preempt_max_latency

and the kernel will begin measuring and tracing worst-case latency
paths. Then put some load on the host when you see a 50+ usec latency
reported to the syslog, send me the /proc/latency_trace. It should be a
matter of a few minutes to capture this information.

also, i'm wondering why you tested with only 1,000,000 samples. I
routinely do 100,000,000 sample tests, and i did one overnight test with
more than 1 billion samples, and the latency difference is quite
significant between say 1,000,000 samples and 100,000,000 samples. All
you need to do is to increase the rate of interrupts generated by the
logger - e.g. my testbox can handle 80,000 irqs/sec with only 15% CPU
overhead.

Ingo

Steven Rostedt

Jun 30, 2005, 3:00:15 AM

On Wed, 29 Jun 2005, Bill Huey wrote:

> On Thu, Jun 30, 2005 at 12:09:04PM +1000, Nick Piggin wrote:
> > What part of "drop me from your cc list" were you having trouble
> > understanding?
>
> Then don't comment or reply to these emails and use some kind of
> condescending tone.
>
> bill
>

Bill,

I'm on your side of the issue here, and I have a high stake in going with
Ingo's RT patch. If you want to be listened to, don't insult people or
just get angry in your posts. The best thing to do is let Ingo reply, since
he will do his best to find out why the numbers are the way they are and
fix the situation. The RT patch still has far to go, and these tests help
(even if they cast RT in a bad light), because Ingo and company will
most likely show where the tests have failed, or if it is RT that failed,
we can fix it. This is just like the MS benchmarks against Linux: they
actually showed where Linux had a bottleneck, and although people
complained that MS was setting up the test to hurt Linux, it actually
found a problem with Linux that was quickly fixed. So we can thank MS for
making Linux a more competitive OS.

Kristian and Karim have put a lot of effort into understanding the RT
patch (thanks guys!) and we should appreciate it. I don't really believe
that they are trying to hurt RT (like MS was with Linux), they are just
trying to make real comparisons. In the end, I strongly believe that
their work will help the RT patch.

-- Steve

Ingo Molnar

Jun 30, 2005, 3:20:08 AM

* Paul E. McKenney <pau...@us.ibm.com> wrote:

> However, on a UP system, I have to agree with Kristian's choice of
> configuration. An embedded system developer running on a UP system
> would naturally use a UP Linux kernel build, so it makes sense to
> benchmark a UP kernel on a UP system.

sure.

keeping that in mind, PREEMPT_RT is quite similar to the SMP kernel (it
in fact activates much of the SMP code), so if you want to isolate the
overhead coming from the non-locking portions of PREEMPT_RT, you'd
compare to the SMP kernel. I do that frequently.

another point is that this test is measuring the overhead of PREEMPT_RT
without measuring what that cost buys: RT-task scheduling latencies.
We know since the rtirq patch (to which i-pipe is quite similar) that we
can achieve good irq-service latencies via relatively simple means, but
that's not what PREEMPT_RT attempts to do. (PREEMPT_RT necessarily has
to have good irq-response times too, but much of the focus went to the
other aspects of RT task scheduling.)

were the wakeup latencies of true RT tasks tested, you could see which
technique does what. But all that is being tested here is pure overhead
to non-RT tasks, and the worst-case latency of raw interrupt handling.
While they are important and necessary components of the whole picture,
they are not the whole picture. This is a test that is pretty much
guaranteed to show -RT as having higher costs - in fact i'm surprised it
held up this well :)

so in that sense, this test is like running an SMP kernel on an UP box
and comparing it against the UP kernel (or running an SMP kernel on an
SMP box but only running a single task to measure performance), and
concluding that it has higher costs. It is a technically correct
conclusion, but obviously misses the whole picture, and totally misses
the point behind the SMP kernel.

Ingo

Ingo Molnar

Jun 30, 2005, 6:50:16 AM

* Kristian Benoit <kbe...@opersys.com> wrote:

> This is the 3rd run of our tests.

i'm still having problems reproducing your numbers, even the 'plain'
ones. I cannot even get the same ballpark figures, on 3 separate
systems. To pick one number:

> "plain" run:
>
> Measurements | Vanilla | preemp_rt |

> ---------------+-------------+----------------+
> mmap | 660us | 2867us (+334%) |

i was unable to reproduce this level of lat_mmap degradation. I do
indeed see a slowdown [*], but nowhere near the 4.3x slowdown measured
here. I have tried the very lmbench version you used (2.0.4) on 3
different systems (Athlon64 2GHz, Celeron 466MHz, Xeon 2.4GHz - the last
one should be pretty similar to your 2.8GHz Xeon testbox) and neither
showed this level of slowdown.

i couldnt figure out which precise options were used by your test,
because i only found the summary lmbench page of one of the older tests
- so i did my lat_mmap testing with various sizes: 10MB, 30MB, 70MB,
150MB, 200MB, 500MB. (My best guess would be that since your target box
has 512MB of RAM, lmbench will pick an mmap-file size of 144 MB. Or if
it's the 256MB box, lmbench will pick roughly 70 MB. I covered those
likely sizes too.) Neither size showed this level of slowdown.

so my tentative conclusion would be that the -RT kernel is still
misconfigured somehow. Did you have HIGHMEM64 and HIGHPTE enabled
perhaps? Those i suggested to be turned off in one of my first mails to
you, it is something that will cause bad performance under PREEMPT_RT.
(Highmem64 is unwarranted for an embedded test anyway - it's only needed
to support more than 4 GB of RAM.) Could you send me the test 3 .config
you used on the -RT kernel?

Ingo

[*] fixed in -50-36 and later PREEMPT_RT kernels

Zwane Mwaikambo

Jun 30, 2005, 10:30:19 AM
On Wed, 29 Jun 2005, Bill Huey wrote:

> On Thu, Jun 30, 2005 at 12:09:04PM +1000, Nick Piggin wrote:
> > What part of "drop me from your cc list" were you having trouble
> > understanding?
>
> Then don't comment or reply to these emails and use some kind of
> condescending tone.

That really is precious coming from you... You're doing advocates of the
PREEMPT_RT patchset a great disservice with your constant haranguing of
other participants in the discussion. Please, let's show some common
courtesy.

Thanks,
Zwane

Bill Davidsen

unread,
Jun 30, 2005, 11:10:08 AM6/30/05
to
Bill Huey (hui) wrote:
> On Wed, Jun 29, 2005 at 04:54:22PM -0700, Paul E. McKenney wrote:
>
>>If you were suggesting this to be run on an SMP system, I would agree
>>with you. I, too, would very much like to see these results run on a
>>2-CPU or 4-CPU system, although I am most certainly -not- asking Kristian
>>and Karim to do this work -- it is very much someone else's turn in the
>>barrel, I would say!
>
>
> No, I'm suggesting that you and other folks understand the basic ideas
> behind this patch and stop asking unbelievably stupid questions. This has
> been covered over and over again, and I shouldn't have to repeat these
> positions constantly because folks have both a language comprehension
> problem and inability to bug off appropriately.

The reasons you have to repeat yourself are (a) you lack communication
skills and expect people to read past your insults, and (b) you're just
technically wrong in some cases, such as saying that the results would
be different if the kernel were compiled in an unrealistic way.


>
>
>>However, on a UP system, I have to agree with Kristian's choice of
>>configuration. An embedded system developer running on a UP system would
>>naturally use a UP Linux kernel build, so it makes sense to benchmark
>>a UP kernel on a UP system.
>
>
> Dual cores are going to be standard in the next few years so RTOSs should
> anticipate these things coming down the pipeline.

s/standard/common/

I doubt that single-core CPUs are going to vanish; there are too many
power-critical (heat-critical) embedded applications. In many of them
the response time is important, but the total CPU capability isn't an
issue while battery life or fanless operation is.

Your point that SMP operation is important is true, but I see no reason
to think Ingo has missed that.

--
-bill davidsen (davi...@tmr.com)
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

Paul E. McKenney

unread,
Jun 30, 2005, 11:50:09 AM6/30/05
to
On Thu, Jun 30, 2005 at 09:07:09AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <pau...@us.ibm.com> wrote:
>
> > However, on a UP system, I have to agree with Kristian's choice of
> > configuration. An embedded system developer running on a UP system
> > would naturally use a UP Linux kernel build, so it makes sense to
> > benchmark a UP kernel on a UP system.
>
> sure.
>
> keeping that in mind, PREEMPT_RT is quite similar to the SMP kernel (it
> in fact activates much of the SMP code), so if you want to isolate the
> overhead coming from the non-locking portions of PREEMPT_RT, you'd
> compare to the SMP kernel. I do that frequently.

Agreed! For someone working on making PREEMPT_RT better, comparing to
an SMP kernel running on a UP system is extremely useful, since it gives
an idea of where to focus. But someone wanting to build a realtime
application might not care so much.

Me, I would like to see an SMP-kernel comparison on a 2-CPU or 4-CPU
system, though probably not too many applications want SMP realtime
quite yet. My guess is that SMP realtime will be increasingly
important, though.

> another point is that this test is measuring the overhead of PREEMPT_RT,
> without measuring the benefit of the cost: RT-task scheduling latencies.
> We know since the rtirq patch (to which i-pipe is quite similar) that we
> can achieve good irq-service latencies via relatively simple means, but
> that's not what PREEMPT_RT attempts to do. (PREEMPT_RT necessarily has
> to have good irq-response times too, but much of the focus went to the
> other aspects of RT task scheduling.)

Agreed, a PREEMPT_RT-to-IPIPE comparison will never be an apples-to-apples
comparison. Raw data will never be a substitute for careful thought,
right? ;-)

> were the wakeup latencies of true RT tasks tested, you could see which
> technique does what. But all that is being tested here is pure overhead
> to non-RT tasks, and the worst-case latency of raw interrupt handling.
> While they are important and necessary components of the whole picture,
> they are not the whole picture. This is a test that is pretty much
> guaranteed to show -RT as having higher costs - in fact i'm surprised it
> held up this well :)

Me too! ;-) For me, the real surprise was that I-PIPE's and PREEMPT_RT's
worst-case irq latencies were roughly comparable. I would have guessed
that I-PIPE would have been 2x-3x better than PREEMPT_RT.

And I expect that there are a number of applications where it is worth
paying the extra system-call cost in order to gain the better latencies,
particularly those that spend most of their time executing in user mode.
As you continue your work reducing the costs, more and more applications
would see the benefit.

Other applications might need the low-cost system calls badly enough
that they want to deal with the greater complexity of the non-unified
I-PIPE programming model.

Still others might be satisfied with the less-good realtime latency of
a straight CONFIG_PREEMPT kernel. Or even of a non-CONFIG_PREEMPT
kernel.

> so in that sense, this test is like running an SMP kernel on an UP box
> and comparing it against the UP kernel (or running an SMP kernel on an
> SMP box but only running a single task to measure performance), and
> concluding that it has higher costs. It is a technically correct
> conclusion, but obviously misses the whole picture, and totally misses
> the point behind the SMP kernel.

Agreed, this experiment would not be useful to a user -- after all,
if the user has a single-threaded application, they should just buy
a UP box, run a UP kernel on it, and not bother with the experiment.
This experiment -might- be useful to a developer who is working on
either the SMP kernel or on the hardware, and who wants to measure the
overhead of SMP -- in fact, I did this sort of experiment (among others,
of course!) in my RCU dissertation.

I also agree that Kristian's and Karim's benchmark does not show the
full picture. No set of benchmark data, no matter how carefully designed
and collected, will ever be a substitute for careful consideration of
all aspects of the problem. The realtime latency and the added overhead
are important, but they are not the only important things.

But, for the moment, I need to get back to an RCU implementation that
I owe you. It is at least limping, which is better than I would have
expected, but it needs quite a bit more work. ;-)

Thanx, Paul

Ingo Molnar

unread,
Jun 30, 2005, 12:30:17 PM6/30/05
to

* Paul E. McKenney <pau...@us.ibm.com> wrote:

> > another point is that this test is measuring the overhead of PREEMPT_RT,
> > without measuring the benefit of the cost: RT-task scheduling latencies.
> > We know since the rtirq patch (to which i-pipe is quite similar) that we
> > can achieve good irq-service latencies via relatively simple means, but
> > that's not what PREEMPT_RT attempts to do. (PREEMPT_RT necessarily has
> > to have good irq-response times too, but much of the focus went to the
> > other aspects of RT task scheduling.)
>
> Agreed, a PREEMPT_RT-to-IPIPE comparison will never be an
> apples-to-apples comparison. Raw data will never be a substitute for
> careful thought, right? ;-)

well, it could still be tested, since it's so easy: the dohell script is
already doing all of that as it runs rtc_wakeup - which runs a
SCHED_FIFO task and carefully measures wakeup latencies. It is used at
1024 Hz (the default), and it can be used in every test without
impacting the system load in any noticeable way.

Ingo

Sven-Thorsten Dietrich

unread,
Jun 30, 2005, 1:00:11 PM6/30/05
to
On Thu, 2005-06-30 at 18:17 +0200, Ingo Molnar wrote:
> * Paul E. McKenney <pau...@us.ibm.com> wrote:
>
> > > another point is that this test is measuring the overhead of PREEMPT_RT,
> > > without measuring the benefit of the cost: RT-task scheduling latencies.
> > > We know since the rtirq patch (to which i-pipe is quite similar) that we
> > > can achieve good irq-service latencies via relatively simple means, but
> > > that's not what PREEMPT_RT attempts to do. (PREEMPT_RT necessarily has
> > > to have good irq-response times too, but much of the focus went to the
> > > other aspects of RT task scheduling.)
> >
> > Agreed, a PREEMPT_RT-to-IPIPE comparison will never be an
> > apples-to-apples comparison. Raw data will never be a substitute for
> > careful thought, right? ;-)
>
> well, it could still be tested, since it's so easy: the dohell script is
> already doing all of that as it runs rtc_wakeup - which runs a
> SCHED_FIFO task and carefully measures wakeup latencies. It is used at
> 1024 Hz (the default), and it can be used in every test without
> impacting the system load in any noticeable way.
>

I use a parallel implementation that has acquired the name FRD
(Fast Real Time Domain).

It triggers off any IRQ, and measures time to get RT task(s) running.

The objective is to measure periodic task performance for one or more
tasks of equal, ascending, or descending priorities.

The first task is woken by the IRQ; the other tasks wake each other and
either yield or preempt, depending on ascending or descending priority.

Especially when one RT task wakes an RT task of higher priority,
interesting things happen.

Average and Worst-case Histograms are produced in /proc, for sleep time,
run time, task wake-up latency (preemption), inter-task switch, and
absolute latency from IRQ assertion (IRQ latency + preemption) if the
IRQ assertion time is available.

On many archs, a spare auto-resetting timer can be used for the IRQ
source. With the auto-reset timer, the rollover count is available
a priori.

This allows getting the absolute latency since IRQ assertion, i.e. time
since timer rollover.

It is nice to get a feel for the combined impact of IRQ disable and
preemption on task response.

For portability, I have a hook into do_timer, and I acknowledge the
blind spot this creates, but like I said, you can use any IRQ and just
hook up your own way to get the IRQ-assert time stamps.

For a really scientific test, you can write an IRQ handler and a driver
to hook up an external signal generator, Cesium or GPS, and GPIB, or
what have you. Anything to drive the external time stamps into the
program, but that is an exercise for the developer.

If anyone is interested, I can update it for Ingo's latest RT tree and
send it out.

Sven

Ingo Molnar

unread,
Jun 30, 2005, 1:10:09 PM6/30/05
to

* Kristian Benoit <kbe...@opersys.com> wrote:

> "plain" run:
>
> Measurements | Vanilla | preempt_rt |

> ---------------+-------------+----------------+
> fork | 93us | 157us (+69%) |
> open/close | 2.3us | 3.7us (+43%) |
> execve | 351us | 446us (+27%) |


> select 500fd | 12.7us | 25.8us (+103%) |

> mmap | 660us | 2867us (+334%) |
> pipe | 7.1us | 11.6us (+63%) |

update: these tests should perform significantly better on the freshly
released -50-37 (or later) PREEMPT_RT kernels. (fork, execve, mmap [*]
was improved in -50-36, the others in -50-37)

Ingo

[*] the extent of the above fork/execve/mmap costs is still unexplained,
unless the test was run with HIGHMEM64 and HIGHPTE enabled.

Bill Huey

unread,
Jun 30, 2005, 3:00:16 PM6/30/05
to
On Thu, Jun 30, 2005 at 10:59:57AM -0400, Bill Davidsen wrote:
> The reasons you have to repeat yourself are (a) you lack communication
> skills and expect people to read past your insults, and (b) you're just
> technically wrong in some cases, such as saying that the results would
> be different if the kernel were compiled in an unrealistic way.

I'm going to stick to my original points that I've already discussed in
a number of earlier threads. SMP is needed for this patch to really
work. What was not understood by folks, and still isn't, is that the
functionality of this system is more extensive than a simple dual-kernel
setup. This is not a personality or communication skills issue. It's
folks not understanding, out of ignorance and irrational fear, how the
patch really works and what advantages it gives over dual kernels.

A lot of it is just flat-out ignorance about RTOSes in the Linux
community. They confuse theoretical and practical issues and assume that
they are right when they don't write RT apps and don't understand the
patch or the direction it's going.

This material has been covered over and over again in previous threads,
but a disconnected, persistent and hysterical group still FUDs this patch
continuously, which is why I'm losing patience with this. If you're coming
into the middle of this story it can certainly seem that way, but the
reverse is true. If I have to insult folks as a preventative for negative
rumors about this patch, then so be it.

> Your point that SMP operation is important is true, but I see no reason
> to think Ingo has missed that.

No negative comments were directed at Ingo of any sort. The nature of RTOS
changes as hardware advances. That's all I'm saying.

bill

Bill Huey

unread,
Jun 30, 2005, 3:10:10 PM6/30/05
to
On Thu, Jun 30, 2005 at 08:15:20AM -0600, Zwane Mwaikambo wrote:
> That really is precious coming from you... You're doing advocates of the
> PREEMPT_RT patchset a great disservice with your constant haranguing of
> other participants in the discussion. Please, let's show some common
> courtesy.

It's a reaction to stubbornness from the other folks and nothing more. If
folks were technically articulating something that was genuinely reasonable
and about a legitimate concern, then I'd have a different reaction.

bill

Paul E. McKenney

unread,
Jun 30, 2005, 7:20:07 PM6/30/05
to
On Thu, Jun 30, 2005 at 06:17:26PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <pau...@us.ibm.com> wrote:
>
> > > another point is that this test is measuring the overhead of PREEMPT_RT,
> > > without measuring the benefit of the cost: RT-task scheduling latencies.
> > > We know since the rtirq patch (to which i-pipe is quite similar) that we
> > > can achieve good irq-service latencies via relatively simple means, but
> > > that's not what PREEMPT_RT attempts to do. (PREEMPT_RT necessarily has
> > > to have good irq-response times too, but much of the focus went to the
> > > other aspects of RT task scheduling.)
> >
> > Agreed, a PREEMPT_RT-to-IPIPE comparison will never be an
> > apples-to-apples comparison. Raw data will never be a substitute for
> > careful thought, right? ;-)
>
> well, it could still be tested, since it's so easy: the dohell script is
> already doing all of that as it runs rtc_wakeup - which runs a
> SCHED_FIFO task and carefully measures wakeup latencies. It is used at
> 1024 Hz (the default), and it can be used in every test without
> impacting the system load in any noticeable way.

OK, I think that I finally understand what you are getting at -- and I
agree that it would be interesting to get latency measurements during the
actual lmbench runs. However, if I understand correctly, you would want
roughly 1,000,000 latency measurements per lmbench run segment, which,
at 1024 Hz, would mean that each segment would take about 20 minutes.
A single lmbench run would then take many hours.

Is this really what you are getting at, or are you instead thinking
in terms of a single maximum-latency measurement covering the entire
lmbench run?

Thanx, Paul
