The setup is exactly the same as the one previously described. If
you missed our earlier posting, it will probably be easier to go
read that one first for background info:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111846495403131&w=2
We would of course have liked to post the complete packages and
configs that were used. Given the time spent getting the tests to
work again with the new configs, however, we were unable to put
together a decent package. Without promising any firm dates, we
will make those available shortly.
Also, note that both approaches under test went through significant
changes between the versions we used last time and the versions
published in the current test set.
For PREEMPT_RT, word from Ingo is that a lot of things were fixed
since the version we were using in our last tests. Hence, we used
a more recent version and relied on the .config provided to us
by Ingo. And indeed the numbers seem to confirm Ingo's analysis.
As we will see below, PREEMPT_RT behaves quite well.
For Adeos ... hmm Adeos? what's that ... I mean I-pipe ... well
that's the news about it isn't it. While we were working on
benchmarking his stuff, Philippe refactored the core component
of the Adeos patch in order to isolate the interrupt pipeline
therein implemented. Philippe's recent posting of these patches
should provide a better explanation than we can put here. But
basically, what we used and measured in Adeos is very much the
same thing we are doing with the I-pipe here. Though the version
Philippe posted to the LKML on Friday is more recent than the one we
used for our tests, ours should still be good enough for comparison.
Another change since the last test run is that in this case,
both setups are compared to the same kernel, 2.6.12-rc6. Needless
to say, we are much happier comparing two approaches against the
exact same kernel version than trying to correlate results from
two different kernels.
In terms of the tests themselves, a few minor things have changed.
Most importantly, the hd test now correctly contains a "bs=1m",
and hence results in the creation of a 512MB file on disk. Despite
the wishes we expressed earlier, we haven't added any tests beyond
those we had already carried out. It was enough work already to
repeat the same tests while following the recommendations of the
proponents of each method. The additional tests that were
mentioned remain relevant nonetheless, especially hackbench and
dbench.
So here are the results, and an attempted analysis. As we said
earlier, we don't believe any single test run will ever
definitively rule in favor of or against either approach. Only
continued benchmarking will, we hope, help steer discussions in
the right direction.
Total system load:
------------------
Like last time, total system load is measured by LMbench's total
execution time under various system loads. Again, these are
averages of 3 runs each, so the previous caveats still apply: with
so few runs, the numbers should be read as a general tendency;
more definitive numbers would require many more repeats.
LMbench running times:
+--------------------+-------+-------+-------+-------+-------+
| Kernel | plain | IRQ | ping | IRQ & | IRQ & |
| | | test | flood | ping | hd |
+====================+=======+=======+=======+=======+=======+
| Vanilla-2.6.12-rc6 | 175 s | 176 s | 185 s | 187 s | 246 s |
+====================+=======+=======+=======+=======+=======+
| with RT-V0.7.48-25 | 182 s | 184 s | 201 s | 202 s | 278 s |
+--------------------+-------+-------+-------+-------+-------+
| % | 4.0 | 4.5 | 8.6 | 8.0 | 13.0 |
+====================+=======+=======+=======+=======+=======+
| with Ipipe-0.4 | 176 s | 179 s | 189 s | 190 s | 260 s |
+--------------------+-------+-------+-------+-------+-------+
| % | 0.5 | 1.7 | 2.2 | 1.6 | 5.7 |
+--------------------+-------+-------+-------+-------+-------+
Legend:
plain = Nothing special
IRQ test = on logger: triggering target every 1ms
ping flood = on host: "sudo ping -f $TARGET_IP_ADDR"
IRQ & ping = combination of the previous two
IRQ & hd = IRQ test with the following being done on the target:
"while [ true ]
do dd if=/dev/zero of=/tmp/dummy count=512 bs=1m
done"
In general, it seems that rc6 provides results similar to those
rc2 and rc4 gave in our earlier tests, which is a good indication
of the validity of those numbers. The only number that is
significantly different is the IRQ & hd test, but that is easily
explained by the fact that in our earlier test we did not have the
"bs=1m" in the "dd" command. Hence, only a tiny file (512 of dd's
default 512-byte blocks, i.e. 256KB) was generated instead of the
intended 512MB.
For PREEMPT_RT the numbers are also similar to those we found last
time, which again helps confirm the tendency. However, as expected,
the HD test is significantly different, for the same reasons as
above (namely "bs=1m"). Without speculating too much, it seems
fair to say that the overhead generally observed in the various
PREEMPT_RT runs, in comparison to the vanilla runs, is likely due
to the changes the patch introduces in some of the kernel's
critical mechanisms.
For the Ipipe, the numbers are better than those observed for Adeos
in the last run for the light loads, but not as good for the heavy
loads. While the I-pipe's impact remains much lower than PREEMPT_RT,
as was the case for Adeos in the last test run, it is rather
difficult to compare both test results as the I-pipe is not exactly
Adeos. One important modification is that Adeos domains each ran
using a separate stack and Adeos switched to the appropriate stack
when switching domains. The current I-pipe has had this
functionality stripped out and, as far as we can make out,
interrupts are handled on whatever stack is current at the time.
This has the advantage of reducing the size of the patch, but if
the I-pipe can be compared to Adeos, then it appears that removing
this functionality has cost some performance. It is also possible
that the slowdown is due to subtle bugs introduced by the
refactoring.
More testing would need to be carried out to better determine the
cause.
In general, it appears that I-pipe's impact on general system
performance is lower than PREEMPT_RT's.
Interrupt response time:
------------------------
Like last time, interrupt response time is measured by the delay it
takes for the target to respond to interrupts from the logger. Given
that somewhere between 500,000 and 650,000 interrupts are generated
by the logger for each test run, we believe that these results
illustrate fairly well the behavior of the measured approaches.
Time in micro-seconds:
+--------------------+------------+------+-------+------+--------+
| Kernel | sys load | Aver | Max | Min | StdDev |
+====================+============+======+=======+======+========+
| | None | 13.9 | 55.5 | 13.4 | 0.4 |
| | Ping | 14.0 | 57.9 | 13.3 | 0.4 |
| Vanilla-2.6.12-rc6 | lm. + ping | 14.3 | 171.6 | 13.4 | 1.0 |
| | lmbench | 14.2 | 150.2 | 13.4 | 1.0 |
| | lm. + hd | 14.7 | 191.7 | 13.3 | 4.0 |
+--------------------+------------+------+-------+------+--------+
| | None | 13.9 | 53.1 | 13.4 | 0.4 |
| | Ping | 14.4 | 56.2 | 13.4 | 0.9 |
| with RT-V0.7.48-25 | lm. + ping | 14.7 | 56.9 | 13.4 | 1.1 |
| | lmbench | 14.3 | 57.0 | 13.4 | 0.7 |
| | lm. + hd | 14.3 | 58.9 | 13.4 | 0.8 |
+--------------------+------------+------+-------+------+--------+
| | None | 13.9 | 53.3 | 13.5 | 0.8 |
| | Ping | 14.2 | 57.2 | 13.6 | 0.9 |
| with Ipipe-0.4     | lm. + ping | 14.5 | 56.5  | 13.5 | 0.9    |
| | lmbench | 14.3 | 55.6 | 13.4 | 0.9 |
| | lm. + hd | 14.4 | 55.5 | 13.4 | 0.9 |
+--------------------+------------+------+-------+------+--------+
Legend:
None = nothing special
ping = on host: "sudo ping -f $TARGET_IP_ADDR"
lm. + ping = previous test and "make rerun" in lmbench-2.0.4/src/ on
target
lmbench = "make rerun" in lmbench-2.0.4/src/ on target
lm. + hd = previous test with the following being done on the
target:
"while [ true ]
do dd if=/dev/zero of=/tmp/dummy count=512 bs=1m
done"
In last week's test run it may have seemed that there was a distinct
pattern for each setup and that average response times were slightly
different (give or take a microsecond here or there.) In the current
test run it appears that the average response time in all
configurations is identical, that the same can be said about the
minimum response times, and that vanilla clearly has much larger
maximum response times when compared to either real-time extension.
For PREEMPT_RT, the results are clearly much better than last time.
Indeed it appears that, as Ingo predicted, a combination of the
proper configuration options and the most recent additions gives
PREEMPT_RT important gains. In comparison to last week's results,
all measures are lower: average response time, maximum response
time, minimum response time, and standard deviation. This is very
good. But that's not all. PREEMPT_RT also comes down virtually
neck-and-neck with the I-pipe (and the previous numbers from
Adeos) in terms of maximum interrupt response time. Certainly
those backing PREEMPT_RT, and others we hope, will find this quite
positive.
The I-pipe, for its part, has yielded results essentially identical
to Adeos' results from last week. In doing so, it confirms its
claim of inheriting Adeos' most important feature: the ability to
obtain deterministic interrupt response times.
Overall analysis:
-----------------
The guiding principle in carrying out these tests has been to help
us, and we hope others, understand the impact the proposed real-time
additions have on Linux. As such, these numbers are far from being
the entire story. They only provide additional hints to those
studying Linux's progression towards real-time responsiveness.
Clearly the approaches analyzed here take different paths to
enabling real-time in Linux. And while both are being submitted
by separate groups to the general Linux community, it is
important, as was made clear in earlier threads, to highlight that
these approaches can be, and have already been, used together --
some have outright labeled them orthogonal. It follows that a
comparison between these approaches should not necessarily be used
to try to determine a "winner".
Instead, we suggest that the above tests will more likely be used
to help those needing real-time responsiveness decide which
approach is best for them. As can be expected, such choices are
often guided by compromise. We would have liked to offer some
general guidelines, but after thinking long and hard, we thought
it'd be best if those were laid out through the _constructive_
criticism of the community. Nonetheless, here are some general
_comments_ to start things off.
On the face of it, one would be tempted to conclude that if you
are looking for a lightweight mechanism for obtaining deterministic
interrupt response times, then the I-pipe seems to be a pretty safe
bet. After all, its impact on general system performance seems less
than that of PREEMPT_RT.
However, the I-pipe alone doesn't replace the functionality offered
by PREEMPT_RT, most importantly with regard to the rt scheduling
of user-space processes.
So at this point one would be tempted to conclude that if you're
looking for a fully integrated rt solution for Linux, PREEMPT_RT
would be the best candidate. After all, it seems to provide
interrupt response times as good as the I-pipe's while providing
rt scheduling of user-space processes, albeit at a price in
general system performance.
But as was stated earlier in these rt threads -- the argument was
then made for Adeos, but it still applies to the I-pipe -- the
I-pipe is also an enabler for running an rt executive side-by-side
with Linux. With Fusion, for example, user-space processes can be
provided with rt scheduling and other services using the
transparent migration of service requests between Linux and the
RT-executive.
Again, as we said earlier, we do not believe that the approaches
are mutually exclusive. In fact, given that the I-pipe can be
easily integrated with PREEMPT_RT, it should be fairly simple to
obtain a single Linux tree that supports both the PREEMPT_RT
approach and the I-pipe/Fusion approach.
If a choice must be made between PREEMPT_RT and the I-pipe/Fusion,
then further tests would need to be carried out, in particular
with regard to scheduling latencies. Based on guidance provided by
previous discussion, however, such tests may well reveal that, as
in this case, the choice comes down to a features vs. performance
compromise.
Which brings us back to Paul's earlier summary: it appears that
there is no single approach that solves all real-time problems
in Linux. Instead, it seems that a clever combination of many
approaches is the best way forward. In that regard, we hope that
the above tests will help guide this analysis in a fruitful
direction.
As usual, your feedback is appreciated.
Best regards,
Kristian Benoit
Karim Yaghmour
--
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || 1-866-677-4546
You might want to try the overall times numbers with voluntary
preemption instead. That option doesn't convert spinlocks and
still uses interrupt threads. I'd be surprised if the spinlocks
contributed that much overhead. Nevertheless, I'll be curious
about those results.
bill
Again, good stuff!
It looks to me that I-PIPE is an example of a "nested OS", with
Linux nested within the I-PIPE functionality. One could take
the RTAI-Fusion approach, but this measurement is of I-PIPE
rather than RTAI-Fusion, right? (And use of RTAI-Fusion might
or might not change these results significantly, just trying to
make sure I understand what these specific tests apply to.)
Also, if I understand correctly, the interrupt latency measured
is to the Linux kernel running within I-PIPE, rather than to I-PIPE
itself. Is this the case, or am I confused?
Thanx, Paul
Sorry, the I-pipe is likely in the "none-of-the-above" category. It's
actually not much of a category itself. For one thing, it's clearly
not an RTOS in any sense of the word.
The I-pipe is just a layer that allows multiple pieces of code to
share an interrupt stream in a prioritized fashion. It doesn't
schedule anything or provide any sort of abstraction whatsoever.
Your piece of code just gets a spot in the pipeline and receives
interrupts accordingly. Not much nesting there. It's just a new
feature in Linux.
Have a look at the patches and description posted by Philippe last
Friday for more detail.
> One could take
> the RTAI-Fusion approach, but this measurement is of I-PIPE
> rather than RTAI-Fusion, right? (And use of RTAI-Fusion might
> or might not change these results significantly, just trying to
> make sure I understand what these specific tests apply to.)
That's inconsequential. Whether Fusion is loaded or not doesn't
preclude a loaded driver from having a higher priority than
Fusion itself and therefore continuing to receive interrupts even
if Fusion itself has disabled interrupts ...
The loading of Fusion would change nothing in these measurements.
> Also, if I understand correctly, the interrupt latency measured
> is to the Linux kernel running within I-PIPE, rather than to I-PIPE
> itself. Is this the case, or am I confused?
What's being measured here is a loadable module that allocates a
spot in the ipipe that has higher priority than Linux and puts
itself there. Therefore, regardless of what other piece of code
in the kernel disables interrupts, that specific driver still
has its registered ipipe handler called deterministically ...
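For illustration only, here is a rough sketch of what such a measurement
module might look like. The registration calls follow the Adeos-style API
from memory (ipipe_init_attr(), ipipe_register_domain(),
ipipe_virtualize_irq()); the exact names, signatures, flags and header in
I-pipe 0.4 may differ, and the IRQ line and port address are placeholders:

/* Sketch only -- not the LRTBF code.  A pipeline domain registered
 * ahead of Linux whose handler echoes the logger's parallel-port
 * interrupt straight back.  API names follow the Adeos conventions
 * and have not been checked against I-pipe 0.4. */
#include <linux/module.h>
#include <linux/ipipe.h>        /* header name assumed */
#include <asm/io.h>

#define ECHO_IRQ   7            /* placeholder: parallel port IRQ line */
#define ECHO_PORT  0x378        /* placeholder: parallel port data reg */

static struct ipipe_domain echo_domain;

static void echo_irq_handler(unsigned irq)
{
        /* Runs ahead of Linux in the pipeline, even while the kernel
         * has "disabled" interrupts, so the echo is deterministic. */
        outb(0xff, ECHO_PORT);
        outb(0x00, ECHO_PORT);
}

static int __init echo_init(void)
{
        struct ipipe_domain_attr attr;

        ipipe_init_attr(&attr);
        attr.name = "irq-echo";
        attr.priority = IPIPE_ROOT_PRIO + 1;    /* ahead of the Linux domain */
        ipipe_register_domain(&echo_domain, &attr);

        /* Ask the pipeline to deliver ECHO_IRQ to our handler first. */
        ipipe_virtualize_irq(&echo_domain, ECHO_IRQ, echo_irq_handler,
                             NULL, IPIPE_HANDLE_MASK);
        return 0;
}
module_init(echo_init);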
Don't know, but from the looks of it we're not transmitting on
the same frequency ...
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
It is a bit of an edge case for any of the categories.
> > One could take
> > the RTAI-Fusion approach, but this measurement is of I-PIPE
> > rather than RTAI-Fusion, right? (And use of RTAI-Fusion might
> > or might not change these results significantly, just trying to
> > make sure I understand what these specific tests apply to.)
>
> That's inconsequential. Whether Fusion is loaded or not doesn't
> preclude a loaded driver from having a higher priority than
> Fusion itself and therefore continuing to receive interrupts even
> if Fusion itself has disabled interrupts ...
>
> The loading of Fusion would change nothing in these measurements.
OK...
> > Also, if I understand correctly, the interrupt latency measured
> > is to the Linux kernel running within I-PIPE, rather than to I-PIPE
> > itself. Is this the case, or am I confused?
>
> What's being measured here is a loadable module that allocates a
> spot in the ipipe that has higher priority than Linux and puts
> itself there. Therefore, regardless of what other piece of code
> in the kernel disables interrupts, that specific driver still
> has its registered ipipe handler called deterministically ...
>
> Don't know, but from the looks of it we're not transmitting on
> the same frequency ...
Probably just my not fully understanding I-PIPE (to say nothing of
not fully understanding your test setup!), but I would have expected
I-PIPE to be able to get somewhere in the handfuls of microseconds of
interrupt latency. Looks like it prevents Linux from ever disabling
real interrupts -- my first guess after reading your email was that
Linux was disabling real interrupts and keeping I-PIPE from getting
there in time.
Thanx, Paul
Ingo, what's the status of putting irq 0 back in a thread with
PREEMPT_RT? IIRC this had some adverse (maybe unfixable?) effects so it
was disabled a few months ago.
I don't think there's much point in comparing i-pipe to PREEMPT_RT if we
know that 21usec pipeline effect from the timer IRQ (see list archives)
is still there.
Lee
You're welcome. Thanks to Karim and Opersys that showed me the
path !! :)
> > Nevertheless, maybe it's worth that I clarify the setup further.
> > Here's what we had:
> >
> > +----------+
> > | HOST |
> > +----------+
> > |
> > |
> > | Ethernet LAN
> > |
> > / \
> > / \
> > / \
> > / \
> > / \
> > / \
> > / \
> > +--------+ SERIAL +--------+
> > | LOGGER |----------| TARGET |
> > +--------+ +--------+
> >
> > The logger sends an interrupt to the target every 1ms. Here's the
> > path travelled by this interrupt (L for logger, T for target):
> >
> > 1- L:adeos-registered handler is called at timer interrupt
> > 2- L:record TSC for transmission
> > 3- L:write out to parallel port
> > 4- T:ipipe-registered handler called to receive interrupt
> > 5- T:write out to parallel port
> > 6- L:adeos-registered handler called to receive interrupt
> > 7- L:record TSC for receipt
> >
> > The response times obtained include all that goes on from 2 to
> > 7, including all hardware-related delays. The target's true
> > response time is from 3.5 to 5.5 (the .5 being the actual
> > time it takes for the signal to reach the pins on the actual
> > physical parallel port outside the computer.)
> >
> > The time from 2 to 3.5 includes the execution time for a few
> > instructions (record TSC value to RAM and outb()) and the delay
> > for the hardware to put the value out on the parallel port.
> >
> > The time from 5.5 to 7 includes an additional copy of adeos'
> > interrupt response time. IOW, in all cases, we're recording
> > adeos' interrupt response time at least once. Like
> > we explained in our first posting (and as backed up by the
> > data found in both postings) the adeos-to-adeos setup shows
> > that this delay is bound. In fact, we can safely assume that
> > 2*max_ipipe_delay ~= 55us and that 2*average_ipipe_delay
> > ~= 14us. And therefore:
> >
> > max_ipipe_delay = 27.5us
> > average_ipipe_delay = 7us
> > max_preempt_delay = 55us - max_ipipe_delay = 27.5us
> > average_preempt_delay = 14 us - average_ipipe_delay = 7us
> >
> > Presumably the 7us above should fit the "handful" you refer
> > to. At least I hope.
>
> I have big hands, so 7us could indeed qualify as a "handful".
>
> Any insights as to what leads to the larger maximum delay? Some guesses
> include worst-case cache-miss patterns and interrupt disabling that I
> missed in my quick scan of the patch.
>
> If I understand your analysis correctly (hah!!!), your breakdown
> of the maximum delay assumes that the maximum delays for the logger
> and the target are correlated. What causes this correlation?
> My (probably hopelessly naive) assumption would be that there would
> be no such correlation. In absence of correlation, one might
> approximate the maximum ipipe delay by subtracting the -average-
> ipipe delay from the maximum preemption delay, for 55us - 7us = 48us.
> Is this the case, or am I missing something here?
Your analysis is correct, but with 600,000 samples, it is possible that
we got 2 peaks (perhaps not the maximum), one on the logger and one on
the target. So from my point of view, the maximum value is probably
somewhere between 55us / 2 (27.5us) and 55us - 7us (48us), and probably
closer to 55us / 2.
> Of course, in the case of the -average- preemption measurements, dividing
> by two to get the average ipipe delay makes perfect sense.
>
> Whatever the answer to my maximum-delay question, the same breakdown of
> the raw latency figures would apply to the CONFIG_PREEMPT_RT case, right?
>
> Thanx, Paul
>
Kristian
A link to the archives somewhere would be greatly appreciated; this
is a very high-traffic list, after all...
Thanks,
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
> Your analysis is correct, but with 600,000 samples, it is possible
> that we got 2 peaks (perhaps not the maximum), one on the logger and
> one on the target. So from my point of view, the maximum value is
> probably somewhere between 55us / 2 (27.5us) and 55us - 7us (48us),
> and probably closer to 55us / 2.
you could try the LPPTEST kernel driver and testlpp utility i integrated
into the -RT patchset. It avoids target-side latencies almost
completely. Especially since you had problems with parallel interrupts
you should give it a go and compare the results.
Ingo
>
> * Kristian Benoit <kbe...@opersys.com> wrote:
>
> > Your analysis is correct, but with 600,000 samples, it is possible
> > that we got 2 peaks (perhaps not the maximum), one on the logger and
> > one on the target. So from my point of view, the maximum value is
> > probably somewhere between 55us / 2 (27.5us) and 55us - 7us (48us),
> > and probably closer to 55us / 2.
>
> you could try the LPPTEST kernel driver and testlpp utility i
> integrated into the -RT patchset. It avoids target-side latencies
> almost completely. Especially since you had problems with parallel
> interrupts you should give it a go and compare the results.
correction: logger-side latencies are avoided.
:)
> Any insights as to what leads to the larger maximum delay? Some guesses
> include worst-case cache-miss patterns and interrupt disabling that I
> missed in my quick scan of the patch.
Beats me. Given that PREEMPT_RT and the I-pipe get to the same maximum
by using two entirely different approaches, I'm guessing this has more
to do with hardware-related contention than anything inside the patches
themselves.
> If I understand your analysis correctly (hah!!!), your breakdown
> of the maximum delay assumes that the maximum delays for the logger
> and the target are correlated. What causes this correlation?
No it doesn't. I'm just inferring the maximum and average using the
data obtained in the ipipe-to-ipipe setup. In that specific case,
I'm assuming that the interrupt latency on both systems for the
same type of interrupt is identical (after all, these machines are
physically identical, albeit one has 512MB of RAM and the other
256.)
There is no correlation. Just the assumption that what's actually
being measured is twice the latency of the ipipe in that specific
setup.
Given that the interrupt latency of preempt_rt is measured using one
machine running adeos (read ipipe) and the other preempt_rt, I'm
deducing the latency of preempt_rt based on the numbers obtained
for the ipipe by looking at the ipipe-to-ipipe setup.
> My (probably hopelessly naive) assumption would be that there would
> be no such correlation. In absence of correlation, one might
> approximate the maximum ipipe delay by subtracting the -average-
> ipipe delay from the maximum preemption delay, for 55us - 7us = 48us.
> Is this the case, or am I missing something here?
Not directly. You'd have to start by saying that the true maximum ipipe
delay is obtained by subtracting the average ipipe delay from the
measured maximum ipipe delay (to play it safe you could even subtract
the minimum.)
However, such a maximum isn't corroborated by the data. If indeed there
was a difference between the maximums, averages and minimums of the
ipipe and preempt_rt, the sheer quantity of measurements would not
have shown such latency similarities. IOW, it is expected that at
least once in a blue moon we'll hit that case where both the target
and the logger demonstrate their highest possible latency. That's
what we can safely assume 55us is, again given the number of samples.
Remember that on the first run, we sometimes observed a maximum
ipipe-to-ipipe response time of 21us. That's because in those runs
the blue-moon scenario didn't materialize.
> Of course, in the case of the -average- preemption measurements, dividing
> by two to get the average ipipe delay makes perfect sense.
There's no correlation, so I don't see this one.
> Whatever the answer to my maximum-delay question, the same breakdown of
> the raw latency figures would apply to the CONFIG_PREEMPT_RT case, right?
Sure, but again see the above caveats.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Sorry, I don't see this. I've just looked at lpptest.c and it does
practically the same thing LRTBF is doing; have a look for yourself
at the code in LRTBF.
In fact lpptest.c is probably running at a higher cost on the logger
since it executes a copy_to_user() for every single data point
collected. In the case of the LRTBF, we just buffer the results in a
preallocated buffer and then read them all at once after the testrun.
Unless I'm missing something, there is nothing done in lpptest that
we aren't already doing on either side, logger-side latencies
included.
As for the interrupt problems, they were pilot error. They
disappeared once the APIC was enabled. That's therefore a non-issue.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
>
> Ingo Molnar wrote:
> >>you could try the LPPTEST kernel driver and testlpp utility i
> >>integrated into the -RT patchset. It avoids target-side latencies
> >>almost completely. Especially since you had problems with parallel
> >>interrupts you should give it a go and compare the results.
> >
> >
> > correction: logger-side latencies are avoided.
>
> Sorry, I don't see this. I've just looked at lpptest.c and it does
> practically the same thing LRTBF is doing, have a look for yourself
> at the code in LRTBF.
you should take another look. The crucial difference is that AFAICS
lrtbf is using interrupts on _both_ the logger and the target side.
lpptest only uses interrupts on the target side (that is what we are
measuring), but uses polling _with all interrupts disabled_ on the
logger side. This makes things much more reliable, as it's not some
complex mix of two worst-case latencies, but a small constant overhead
on the logger side and the worst-case latency on the target side. This
also means i can run whatever lpptest version on the logger side; i
dont have to worry about its latencies because there are none that
are variable.
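For illustration, the logger-side loop of such a polling scheme might look
roughly like the sketch below. This is not lpptest's actual code: the port
addresses, the status bit and the use of get_cycles() are assumptions about
the setup.

/* Sketch of a logger-side polling measurement: with local interrupts
 * disabled, the only variable latency left is the target's. */
#include <linux/interrupt.h>
#include <asm/io.h>
#include <asm/processor.h>
#include <asm/timex.h>

#define LPT_DATA    0x378       /* parallel port data register (assumed)   */
#define LPT_STATUS  0x379       /* parallel port status register (assumed) */
#define ACK_BIT     0x40        /* line the target toggles back (assumed)  */

static cycles_t measure_one_sample(void)
{
        unsigned long flags;
        cycles_t t0, t1;

        local_irq_save(flags);          /* no logger-side latency from here on */

        t0 = get_cycles();
        outb(0xff, LPT_DATA);           /* raise the line: interrupt the target */

        while (!(inb(LPT_STATUS) & ACK_BIT))
                cpu_relax();            /* busy-wait for the target's echo */

        t1 = get_cycles();
        outb(0x00, LPT_DATA);           /* drop the line for the next sample */

        local_irq_restore(flags);
        return t1 - t0;                 /* in cycles; convert via cpu_khz */
}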
> In fact lpptest.c is probably running at a higher cost on the logger
> since it executes a copy_to_user() for every single data point
> collected. [...]
logger-side overhead does not matter at all, and the 8 bytes copy is not
measured in the overhead. (it is also insignificant.)
> [...] In the case of the LRTBF, we just buffer the results in a
> preallocated buffer and then read them all at once after the testrun.
>
> Unless I'm missing something, there is nothing done in lpptest that we
> aren't already doing on either side, logger-side latencies included.
>
> As for the interrupt problems, they were pilot error. They disappeared
> once the APIC was enabled. That's therefore a non-issue.
well, LPPTEST works just fine with the i8259A PIC too. (which is much
more common in embedded setups than IO-APICs)
Ingo
I see that now, cool!!! And thank you and Kristian for putting this
together!
> Nevertheless, maybe it's worth that I clarify the setup further.
I have big hands, so 7us could indeed qualify as a "handful".
Any insights as to what leads to the larger maximum delay? Some guesses
include worst-case cache-miss patterns and interrupt disabling that I
missed in my quick scan of the patch.
If I understand your analysis correctly (hah!!!), your breakdown
of the maximum delay assumes that the maximum delays for the logger
and the target are correlated. What causes this correlation?
My (probably hopelessly naive) assumption would be that there would
be no such correlation. In absence of correlation, one might
approximate the maximum ipipe delay by subtracting the -average-
ipipe delay from the maximum preemption delay, for 55us - 7us = 48us.
Is this the case, or am I missing something here?
Of course, in the case of the -average- preemption measurements, dividing
by two to get the average ipipe delay makes perfect sense.
Whatever the answer to my maximum-delay question, the same breakdown of
the raw latency figures would apply to the CONFIG_PREEMPT_RT case, right?
Thanx, Paul
Karim meant:
+----------+
| HOST |
+----------+
|
|
| Ethernet LAN
|
/ \
/ \
/ \
/ \
/ \
/ \
/ \
+--------+ PARALLEL +--------+
| LOGGER |----------| TARGET |
+--------+ +--------+
Kristian
Have a look at the announcement just made by Kristian about the LRTBF.
There's a tarball with all the code for the drivers, scripts and
configs we used.
Nevertheless, maybe it's worth that I clarify the setup further.
Here's what we had:
+----------+
| HOST |
+----------+
|
|
| Ethernet LAN
|
/ \
/ \
/ \
/ \
/ \
/ \
/ \
+--------+ SERIAL +--------+
| LOGGER |----------| TARGET |
+--------+ +--------+
The logger sends an interrupt to the target every 1ms; the path
travelled by this interrupt is the seven-step sequence quoted above.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
To tell you the truth, we've spent a considerable amount of time as
it is on this and we need to move on to other things. So while we
may do a repeat, we think the latest results are good enough to
feed the discussion for some time. Also, we've finally released the
LRTBF, so we hope you'll try it out yourself and share what you find.
We'll continue making LRTBF releases with the contributions we get.
In regards to the performance of PREEMPT_RT, we've also made
available the comparative output of LMbench for the various runs
as part of the LRTBF release. Have a look at the bottom of this
section:
http://www.opersys.com/lrtbf/#latest-results
I don't know what part of PREEMPT_RT causes this, but looking at
some of the numbers from this output one is tempted to conclude that
PREEMPT_RT causes a very significant impact on system load. And I
don't say this lightly. Have a look for example at the local
communication latencies between vanilla and PREEMPT_RT when the
target is subject to the HD test. For a pipe, this goes from 9.4us
to 21.6us. For UDP this goes from 22us to 1070us !!! Even on a
system without any load, the numbers are similar. Ouch.
I have no desire to start a flamefest, but to be fair I must
mention that the output for the same runs for I-pipe is much more
reasonable. In fact it's much closer to vanilla performance. Again,
please no flames ... just read the numbers.
Notice that that part of the test (LMbench) does not require a
repeat of our complete setup. A simple LMbench comparison even
on an isolated machine should yield similar results.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Quite possible, perhaps worst-case cache state.
Quite possible, depending on what the raw distribution of times looks
like. If there are a smallish number of 55us events (as there would
have to be given an average of 7us), the blue-moon scenario would lead
one to expect a much larger number of ~30us events (27.5us + 3.5us).
In absence of a ~30us bulge, there would still be the possibility that
one might see an even bluer (violet?) moon that might stack up to ~100us.
Heck, there might be that possibility anyway, but such is life when
measuring latencies. :-/
(And, yes, there are other CDFs lacking a 30us bulge that would be
consistent with a 55us "blue-moon" bulge -- so I guess I am asking
if you have the CDF or the raw latency measurements -- though the
data set might be a bit large... And I would have to think about
how one goes about deriving individual-latency CDF(s) given a single
dual-latency CDF, assuming that this is even possible...)
> > Of course, in the case of the -average- preemption measurements, dividing
> > by two to get the average ipipe delay makes perfect sense.
>
> There's no correlation, so I don't see this one.
You are right that there might not be a correlation, and that it might
be OK to just divide the maximum latency by two, but I can imagine
cases where dividing by two was not appropriate.
> > Whatever the answer to my maximum-delay question, the same breakdown of
> > the raw latency figures would apply to the CONFIG_PREEMPT_RT case, right?
>
> Sure, but again see the above caveats.
Thanks for the info!
Thanx, Paul
Understood.
> I don't know what part of PREEMPT_RT causes this, but looking at
> some of the numbers from this output one is tempted to conclude that
> PREEMPT_RT causes a very significant impact on system load. And I
> don't say this lightly. Have a look for example at the local
> communication latencies between vanilla and PREEMPT_RT when the
> target is subject to the HD test. For a pipe, this goes from 9.4us
> to 21.6. For UDP this goes from 22us to 1070us !!! Even on a
> system without any load, the numbers are similar. Ouch.
I'm involved in other things now, but I wouldn't be surprised if it
was some kind of scheduler bug + whacked softirq interaction. softirqs,
from the last time I looked, were pretty raw in RT. Another thing to do
is to subtract the number of irq-thread context switches from the total
system context switches to see if there are any oddities with the
spinlock conversion. I doubt there's going to be a ton of
'overscheduling', nevertheless it would be a valuable number. This is
such a new patch that weird things like this are likely, but it's going
to take an investigation to see what the real cause is.
FreeBSD went through some slowdown when they moved to SMPng, but not
the kind of numbers you show for things surrounding the network stack.
Something clearly bad happened.
bill
This is a bandwidth issue. The compressed archive containing the
interrupt latencies of all our test runs is 100MB. I could provide
a URL _privately_ to a handful of individuals, but beyond that
someone's going to have to host it.
Let me know if you want this.
Of course, now that LRTBF is out there, others can generate their
own data sets.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Note that the numbers are not freak accidents; they are consistent
across the various setups. So in total, that's 15 LMbench runs,
all showing a consistent _severe_ cost for preempt_rt. And this
despite the fact that it comes down neck-and-neck with the ipipe on
interrupt response time in those same tests. I would highly suggest
setting up an automated benchmark for automatically running LMbench
on every preempt_rt release and compare that to the vanilla kernel.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
I see. Granted, this is different. We will redo a limited testset
with either lpptest or a modified version of LRTBF that does
exactly what you describe. Specifically, we will redo the test run
that is the most painful to vanilla Linux in terms of interrupt
latency, the HD test, for both preempt_rt and ipipe.
However, I have very serious doubts that this will make any
difference whatsoever. Granted, the numbers will be slightly lower,
but it won't invalidate the conclusions previously obtained and it
still won't allow any of us to isolate the exact hardware-
specific overhead of interrupt delivery. IOW, I want to make
sure that it is clear that we're not doing this because we doubt
our results. To the contrary, we're doing it to ensure that any
doubts regarding our results are dissipated.
> logger-side overhead does not matter at all, and the 8 bytes copy is not
> measured in the overhead. (it is also insignificant.)
Maybe, but your user-space application does a printf on every data
point it gets ... not the best that could be done. The clean thing
to do here is to accumulate the stuff in a buffer and dump it all
postmortem.
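A minimal sketch of that buffer-then-dump idea (purely illustrative: the
names, the fixed buffer size and the read() plumbing are assumptions, and
locking and error handling are omitted):

/* Sketch: accumulate samples in a preallocated buffer during the run
 * and copy them out in one go afterwards, so that logging never
 * perturbs the measurement itself. */
#include <linux/kernel.h>
#include <asm/uaccess.h>

#define MAX_SAMPLES (1 << 20)

static unsigned long samples[MAX_SAMPLES];
static unsigned long nr_samples;

static inline void record_sample(unsigned long delta)
{
        if (nr_samples < MAX_SAMPLES)
                samples[nr_samples++] = delta;  /* no printf, no copy_to_user */
}

/* Called once after the run, e.g. from a read() or /proc handler: */
static ssize_t dump_samples(char __user *buf, size_t len)
{
        size_t bytes = min_t(size_t, len, nr_samples * sizeof(samples[0]));

        if (copy_to_user(buf, samples, bytes))
                return -EFAULT;
        return bytes;
}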
> well, LPPTEST works just fine with the i8259A PIC too. (which is much
> more common in embedded setups than IO-APICs)
LRTBF doesn't have a problem with the i8259a; it's the hardware
we were using that didn't behave properly under high interrupt
load. This is a system-specific problem. I haven't run lpptest on
the actual target we used, but I have no reason to believe it
wouldn't exhibit the same types of problems we saw with LRTBF.
There is no difference on the target-side between LRTBF and
lpptest.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Like I said to Ingo in my other response, we're going to use a
technique similar to his lpptest (i.e. disable all interrupts and
actively wait for a response from the target) on the logger to
settle the issue. I very seriously doubt the results will be any
different, but we want to make sure that there are no doubts
remaining.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Possible, but it could also be a large peak and a small one.
Any way of getting the logger's latency separately? Or the target's?
Thanx, Paul
> Any way of getting the logger's latency separately? Or the target's?
with lpptest (PREEMPT_RT's built-in parallel-port latency driver) that's
possible, as it polls the target with interrupts disabled, eliminating
much of the logger-side latencies. The main effect is that it's now only
a single worst-case latency that is measured, instead of having to have
two worst-cases meet.
Here's a rough calculation to show what the stakes are: if there's a
1:100000 chance to trigger a worst-case irq handling latency, and you
have 600000 samples, then with lpptest you'll see an average of 6 events
during the measurement. With lrtbf (the one Karim used) the chance to
see both of these worst-case latencies on both sides of the measurement
is 1:10000000000, and you'd see 0.00006 of them during the measurement.
I.e. the chances of seeing the true max latency are pretty slim.
So if you want to reliably measure worst-case latencies in your expected
lifetime, you truly never want to serially couple the probabilities of
worst-case latencies on the target and the logger side.
Ingo
> > target is subject to the HD test. For a pipe, this goes from 9.4us
> > to 21.6. For UDP this goes from 22us to 1070us !!! Even on a
> > system without any load, the numbers are similar. Ouch.
>
> I'm involved in other things now, but I wouldn't be surprised if it
> was some kind of scheduler bug + softirq wacked interaction. [...]
the UDP-over-localhost latency was a softirq processing bug that is
fixed in current PREEMPT_RT patches. (real over-the-network latency was
never impacted, that's why it wasnt noticed before.)
Ingo
If indeed there are 6 events on a single side which are worst-case,
then you would have to also factor in the probability of obtaining
an average or below-average result on the other side. So again, if
all runs were measuring average on each side, one would expect that
at least one of the runs would have a bump over the 55us mark. Yet,
they all have the same maximum.
Here's an overview of the results spread in the case of IRQ latency
measurements in the HD case (this is just a view of one case, a
true study would require drawing graphs showing the spread for all
tests):
Of 833,000 results for PREEMPT_RT:
- 36 values are above 50us (0.0045% or 4.5/100,000)
- 860 values are 19us and above
Of 781,000 results for IPIPE:
- 213 values are above 50us (0.0273% or 27/100,000)
- 311 values are 19us and above
Contrary to the illustration you make above, it would seem that,
given that both machines are running the same mechanism, the
blue-moon effect multiplies upward instead of downward. This,
though, is but a preliminary analysis.
Notes:
- Below 19us, the number of measurement points increases for both
setups as we get closer to the 14us average mark.
- There are more data points for PREEMPT_RT than ipipe because
LMbench takes much more time to complete on the former.
> So if you want to reliably measure worst-case latencies in your expected
> lifetime, you truly never want to serially couple the probabilities of
> worst-case latencies on the target and the logger side.
Like I said, we're going to settle this one to avoid any further
doubts.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Yeah, I looked at the numbers more carefully and they clearly show
some problems. There is a significant context switch penalty with
irq-threads and I wouldn't be surprised if this is what's going on.
bill
> Ingo Molnar wrote:
> > with lpptest (PREEMPT_RT's built-in parallel-port latency driver) that's
> > possible, as it polls the target with interrupts disabled, eliminating
> > much of the logger-side latencies. The main effect is that it's now only
> > a single worst-case latency that is measured, instead of having to have
> > two worst-cases meet.
> >
> > Here's a rough calculation to show what the stakes are: if there's a
> > 1:100000 chance to trigger a worst-case irq handling latency, and you
> > have 600000 samples, then with lpptest you'll see an average of 6 events
> > during the measurement. With lrtfb (the one Karim used) the chance to
> > see both of these worst-case latencies on both sides of the measurement
> > is 1:10000000000, and you'd see 0.00006 of them during the measurement.
> > I.e. the chances of seeing the true max latency are pretty slim.
>
> If indeed there are 6 events on a single-side which are worst-case,
> then you would have to also factor in the probability of obtaining an
> average or below average result on the other side. So again, if all
> runs were measuring average on each side, one would expect that at
> least one of the runs would have a bump over the 55us mark. Yet, they
> all have the same maximum.
if your likelihood of getting a 'combo max' event is 1:10000000000 then
you'll basically never see the max! What you will see are combinations
of lower-order critical paths - i.e. a worst-case path of 35 usecs
combined with another, more likely critical path of 20 usecs. You'll
still have the statistical appearance of having found a 'max'.
your only hope to have valid results would be if the likelihood of the
maximum path is much higher than the one in my example. But even then,
you've significantly reduced the likelihood of seeing an actual
worst-case latency total.
From all the tests i've done, 600,000 samples are not enough to trigger
the worst-case latency - even with the polling method! Also, your tests
dont really load the system, so you have a fundamentally lower chance of
seeing worst-case latencies. My tests do a dd test, a flood ping, an
LTP-40-copies test, an rtc_wakeup 8192 Hz test and an infinite loop of
hackbench test all in parallel, and even in such circumstances and with
a polling approach i need above 1 million samples to hit the worst-case
path! (which i cannot know for sure to be the worst-case path, but which
i'm reasonably confident about, based on the distribution of the
latencies and having done tens of millions of samples in overnight
tests.) Obviously it's a much bigger constraint on the IRQ subsystem if
_all_ interrupt _and_ DMA sources in the system are as active as
possible.
so ... give the -50-12 -RT tree a try and report back the lpptest
results you are getting. [ I know the results i am seeing, but i wont
post them as a counter-point because i'm obviously biased :-) I'll let
people without an axe to grind do the measurements. ]
Ingo
That's good to hear, but here are some random stats from the
idle run:
Measurements | Vanilla | preemp_rt | ipipe
---------------+-------------+----------------+-------------
fork | 95us | 157us (+65%) | 97us (+2%)
open/close | 1.4us | 2.1us (+50%) | 1.4us (~)
execve | 355us | 452us (+27%) | 365us (+3%)
select 500fd | 12.7us | 27.7us (+118%) | 12.7us (~)
mmap | 660us | 2886us (+337%) | 673us (+2%)
... | ... | ... | ...
Here's the same under ping flood conditions:
Measurements | Vanilla | preemp_rt | ipipe
---------------+-------------+----------------+-------------
fork | 112us | 223us (+99%) | 121us (+8%)
open/close | 1.7us | 3.0us (+76%) | 1.8us (+6%)
execve | 421us | 652us (+55%) | 467us (+11%)
select 500fd | 14.6us | 38.1us (+161%) | 15.5us (+6%)
mmap | 760us | 3936us (+418%) | 819us (+8%)
... | ... | ... | ...
Etc. It's like this across the board.
Unless that fix you mention above takes care of these, there's
something seriously wrong going on here.
Are there existing LMbench results posted of PREEMPT_RT which
one can go back to for comparison?
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Hmm... well, I can't say I'm uninterested. Any chance we can get a
copy of the scripts you use to do the MOAILT (Mother Of All Irq
Latency Tests)?
> so ... give the -50-12 -RT tree a try and report back the lpptest
> results you are getting.
First things first, we want to report back that our setup is validated
before we go on to this one. So we've modified LRTBF to do the busy-wait
thing.
> [ I know the results i am seeing, but i wont
> post them as a counter-point because i'm obviously biased :-) I'll let
> people without an axe to grind do the measurements. ]
That's an extra reason for giving us a copy (or pointing us to one) of
the script you use to run your tests :)
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
I remembered your description, but it's always nice to see exactly
what's being done. Thanks very much for sending this, we'll integrate
it into the LRTBF.
> Btw., what happened to adeos
> irq latency testing?)
It got obsoleted by the ipipe testing. It's basically the same thing.
What we were testing in the first released testbench was the usage
of the interrupt pipeline in adeos. Now that Philippe has forked it
out to make it more straightforward for people to look at (instead of
thinking they are looking at a true full nanokernel), it was just
the appropriate thing to do to use the I-pipe patch instead. The
mechanism being measured is exactly the same thing.
The approach of measuring the target's and the logger's latencies
separately is a -much- better approach than using strange mathematical
techniques with strange mathematical assumptions. So please don't
waste any further time on my misguided request for the full data set!
> Of course, now that LRTBF is out there, others can generate their
> own data sets.
True enough!
Thanx, Paul
> Hmm... well, I can't say I'm uninterested. Any chances we can get a
> copy of the scripts you use to do the MOAILT (Mother Of All Irq
> Latency Tests).
my 'dohell' script is embarrassingly simple:
while true; do dd if=/dev/zero of=bigfile bs=1024000 count=1024; done &
while true; do killall hackbench; sleep 5; done &
while true; do ./hackbench 20; done &
( cd ltp-full-20040707; su mingo -c ./run40; ) &
ping -l 100000 -q -s 10 -f v &
du / &
./dortc &
and i also start a preloaded flood-ping externally. ./dortc does:
chrt -f 98 -p `pidof 'IRQ 8'`
cd rtc_wakeup
./rtc_wakeup -f 8192 -t 100000
and ./run40 does:
while true; do ./runalltests.sh -x 40; done
it's not rocket science - i'm just starting a sensible mix of latency
generators, without letting any of them dominate the landscape.
> > [ I know the results i am seeing, but i wont
> > post them as a counter-point because i'm obviously biased :-) I'll let
> > people without an axe to grind do the measurements. ]
>
> That's an extra reason for giving us a copy (or pointing us to one) of
> the script you use to run your tests :)
see above. (It's no secret, i described components of this workload in
one of my first mails to the adeos thread. Btw., what happened to adeos
irq latency testing?)
Ingo
My concern exactly!
> So if you want to reliably measure worst-case latencies in your expected
> lifetime, you truly never want to serially couple the probabilities of
> worst-case latencies on the target and the logger side.
Sounds more practical than the analytical approach! (Take the Laplace
transform of the density function, square root, and then take the inverse
Laplace transform, if I remember correctly... Which would end up
showing a small probability of the maximum latency being the full
amount, which ends up really not telling you anything.)
Thanx, Paul
> Ingo Molnar wrote:
> > the UDP-over-localhost latency was a softirq processing bug that is
> > fixed in current PREEMPT_RT patches. (real over-the-network latency was
> > never impacted, that's why it wasnt noticed before.)
>
> That's good to hear, but here are some random stats from the idle run:
please retest using recent (i.e. today's) -RT kernels. There were a
whole bunch of fixes that could affect these numbers. (But i'm sure you
know very well that you cannot expect a fully-preemptible kernel to have
zero runtime cost. In that sense, if you want to be fair, you should
compare it to the SMP kernel, as total preemptability is a similar
technological feat and has very similar parallelism constraints.)
Ingo
> > so ... give the -50-12 -RT tree a try and report back the lpptest
> > results you are getting.
>
> First things first, we want to report back that our setup is validated
> before we go onto this one. So we've modified LRTBF to do the
> busy-wait thing.
here's another bug in the way you are testing PREEMPT_RT irq latencies.
Right now you are doing this in lrtbf-0.1a/drivers/par-test.c:
        if (request_irq ( PAR_TEST_IRQ,
                          &par_test_irq_handler,
        #if CONFIG_PREEMPT_RT
                          SA_NODELAY,
        #else //!CONFIG_PREEMPT_RT
                          SA_INTERRUPT,
        #endif //PREEMPT_RT
you should set the SA_INTERRUPT flag in the PREEMPT_RT case too! I.e.
the relevant line above should be:
SA_NODELAY | SA_INTERRUPT,
otherwise par_test_irq_handler will run with interrupts enabled, opening
the window for other interrupts to be injected and increasing the
worst-case latency! Take a look at drivers/char/lpptest.c how to do this
properly. Also, double-check that there is no IRQ 7 thread running on
the PREEMPT_RT kernel, to make sure you are measuring irq latencies.
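For reference, the complete call with both flags set, as suggested, might
look like the sketch below (the handler and IRQ names are the LRTBF ones
quoted above; the "par_test" device string and the NULL dev_id are
assumptions):

/* Sketch: 2.6-era request_irq() with the suggested flags.
 * SA_NODELAY (PREEMPT_RT) keeps the handler out of an IRQ thread;
 * SA_INTERRUPT keeps local interrupts disabled while it runs. */
#ifdef CONFIG_PREEMPT_RT
# define PAR_TEST_IRQFLAGS (SA_NODELAY | SA_INTERRUPT)
#else
# define PAR_TEST_IRQFLAGS SA_INTERRUPT
#endif

        if (request_irq(PAR_TEST_IRQ, &par_test_irq_handler,
                        PAR_TEST_IRQFLAGS, "par_test", NULL)) {
                printk(KERN_ERR "par_test: could not get IRQ %d\n",
                       PAR_TEST_IRQ);
                return -EBUSY;
        }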
Ingo
We'll check on this also. Thanks for pointing it out.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Well, if you want to be even more fair, you could hold off on publishing
benchmark results that compare an experimental, not fully debugged
feature with a mature technology.
Lee
At this point, we're bound to rerun some of the tests. But there's
only so many times that one can claim that such and such test isn't
good enough because it doesn't have all the latest bells and whistles.
Surely there's more to this overhead than just rudimentary bugfixes.
Sorry, but it's just kind of frustrating to put so much time into
something like this and have the results offhandedly dismissed just
because it isn't the truly bleeding edge. These results are
similar to our previous testset, which was on another version of
preempt_rt, which we were told then had had a bunch of fixes ...
Like I suggested earlier, there should be an automated test by
which each preempt_rt release is lmbenched against vanilla.
> (But i'm sure you
> know very well that you cannot expect a fully-preemptible kernel to have
> zero runtime cost. In that sense, if you want to be fair, you should
> compare it to the SMP kernel, as total preemptability is a similar
> technological feat and has very similar parallelism constraints.)
With this line of defense I sense things can get hairy fairly
rapidly. So I'll try to tread carefully.
Bear in mind here that what we're trying to find out with such
tests is the bare minimum cost of the proposed rt enhancements
to Linux, and how well they perform in their rt duties, the most
basic of which being interrupt latency.
We understand that none of these approaches have zero cost, and
we also understand that not all approaches provide the same
mechanisms. However, a critical question must be answered:
What is the least-intrusive change, or set of changes, that can
be applied to the Linux kernel to obtain rt, and what mechanisms
can, or cannot, be built on top of it (them)? (the unknown here
being that rt is defined differently by different people.)
One answer is what we postulated as being a combination of
PREEMPT_RT and the Ipipe, each serving a separate problem space.
Speaking just for myself here:
To be honest, however, I have a very hard time, as a user,
convincing myself that I should enable preempt_rt under any but
the most dire situations given the results I now have in front
of me. Surely there's more of an argument than "this will cost
you as much as SMP" for someone deploying UP systems, which
apparently is the main target of preempt_rt with things like
audio and embedded systems.
I want to believe, but accepting more than 50% overhead over
regular system calls requires more than just religion ...
Any automated test, as suggested above, that would show some
sort of performance impact decrease over release iterations
would be helpful for sure. But that 50%+ is going to have to
melt significantly along the way ... for me at least.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Would you have applied similar logic had the results been inverted?
Surely the nature of scientific improvement is somewhere along
the lines of experiment, compare, enhance, and retry.
If PREEMPT_RT should not be studied, then what good is it to
continue talking about it on the LKML or even to continue posting
the patches there?
Surely the goal in doing that is to make it better and more
acceptable to the larger crowd. And if that is so, then isn't
it to everyone's advantage therefore to make a strong case for
its adoption?
Did you really expect that no one was going to start running
performance tests on preempt_rt somewhere along the way until
its developers gave an "official" ok? Isn't it better to know
about such results sooner rather than later?
... I'm sorry, I'm somewhat lost here. I can only guess that
you're expressing your disappointment at the results, and
that's something I can understand very well. But shouldn't
these results encourage you to try even harder? Unless you are
telling me that that's as good as it gets ... ?
As a side note about the I-pipe (formerly Adeos), it should
be noted that, as far as I can recall, its latency response
and performance impact have not varied a lot since its first
introduction over 3 years ago. The mechanism's simplicity
makes it unlikely to introduce any sort of significant
overhead.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Agreed. I think it makes more sense to keep comparing it against the UP
kernel with preempt disabled, since most embedded devices aren't SMP.
> To be honest, however, I have a very hard time, as a user, to
> convince myself that I should enable preempt_rt under any but
> the most dire situations given the results I now have in front
Same here.
> Ingo Molnar wrote:
> > please retest using recent (i.e. today's) -RT kernels. There were a
> > whole bunch of fixes that could affect these numbers.
>
> At this point, we're bound to rerun some of the tests. But there's
> only so many times that one can claim that such and such test isn't
> good enough because it doesn't have all the latest bells and whistles.
> Surely there's more to this overhead than just rudimentary bugfixes.
well, it was your choice to benchmark ADEOS against PREEMPT_RT, right?
You posted numbers that showed your project in a favorable light while
the PREEMPT_RT numbers were more than 100% off. Your second batch of
numbers showed a tie, but we still don't know the true PREEMPT_RT
irq latency values on your hardware, because your testing still had
bugs. So a minimum requirement would be to post accurate numbers - you
have started this after all.
this thread showcases one of the many reasons why 'vendor sponsored
benchmarking' is such a bad idea. I won't post benchmark numbers
comparing PREEMPT_RT to 'other' realtime projects. I'm obviously biased,
everyone else sees me as biased, so what's the point? Should i pretend
i'm not biased towards the stuff i wrote? That would be hypocritical
beyond recognition. I don't benchmark PREEMPT_RT against other projects
because i know perfectly well that it is the best thing since sliced
bread ;)
Ingo
> To be honest, however, I have a very hard time, as a user, to
> convince myself that I should enable preempt_rt under any but
> the most dire situations given the results I now have in front
> of me. Surely there's more of an argument than "this will cost
> you as much as SMP" for someone deploying UP systems, which
> apparently is the main target of preempt_rt with things like
> audio and embedded systems.
What situation do you consider dire?
I do appreciate the testing that you've done and I hope you do more in the
future. Remember that PREEMPT_RT is a fast moving object, we need those
kinds of tests every day ..
Daniel
This is an unwarranted personal attack.
Should I simply refrain from conducting tests because, 4+ years ago,
I made a suggestion on how to obtain rt performance in Linux?
Heck, I didn't even write a single line of its code; someone
else did.
If I wanted to show "my" project in such a good light, would I have
gone back and redone tests, and then published them even if those
numbers now showed that the "other" project was as good as "mine"?
Would I have even listened to any of your suggestions and gone
back and had the tests changed to fit your requirements? Would I
still be telling that we're going to further fix the tests based
on your feedback?
If I benchmarked adeos and preempt_rt it's simply because these
are the two patches on top of vanilla Linux that are actively
developed and claim to provide true rt on Linux. Let me ask you
a simple question: would a benchmark of adeos against nothing
but itself have been at all relevant?
As for the accuracy of the numbers, they are correct as far
as I'm concerned. They just aren't to your liking; that's different.
You complained about the way we measured irq latency and we
promised to repeat with your suggestions. But in this part of
the thread, I was simply asking about the over-the-top impact
seen in the LMbench results. Do we need to fix LMbench too?
I repeat, the software we used is available for you to download.
If we've made a mistake, we'll more than gladly acknowledge it.
As for any hint that we're somehow rigging or presenting these
results to better "sell" anything: the scripts, drivers and configs
we used are all out there; you're more than welcome to show how
evil we are.
FWIW, Opersys has no engineering cycles to spare. This current
testset and all related human and hardware costs are actually
coming straight out of my personal pocket. I have no client paying
for this. And, FWIW, I had absolutely no idea what we were going
to find when I started this. I certainly didn't expect that
preempt_rt would be able to do as well as the ipipe in terms of
interrupt latency, and that's to your credit.
And if you still disagree, then please go ahead and publish your
own test results or have others do so and show how wrong we are.
We've given you everything we used on our side. If there's
anything else you need, let us know.
After all, showing how much of a fraud we are shouldn't be that
difficult; you're a very competent developer. And for that very
reason, I have a hard time holding this ad-hominem attack against
you. I am disappointed though.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Which is exactly what I was suggesting.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
What you found with this is the very limit of your patience :)
> preempt_rt would be able to do as good as the ipipe in terms of
> interrupt latency, and that's to your credit.
...
> After all, showing how much of a fraud we are shouldn't be that
> difficult, you're a very competent developer. And because of
> that last reason, I have a hard time holding this ad-hominem
> attack against you. I am dissapointed though.
He's probably confusing you with the real FUDers. I don't see you
as a FUDer.
He's just resentful fighting with you over attention from the same
batch of strippers at last years OLS. :)
bill
Thanks, I appreciate the vote of confidence.
> He's just resentful fighting with you over attention from the same
> batch of strippers at last years OLS. :)
But I don't want to "fight" Ingo. There would just be no point
whatsoever in "fighting" with one of the best developers Linux
has. I started my involvement in these recent threads with a
very clear statement that I was open to being shown wrong in
having exclusively championed the nanokernel approach in the
past. I set out to show myself wrong with these tests and,
beside some vague expectations, I truly didn't know what I
was going to find. I certainly wouldn't have bet a hot dog on
preempt_rt coming neck and neck with the ipipe on interrupt
latency ... So yes, some of the results I've found in doing so
aren't that nice. But, hell, I didn't invent those results. They
are there for anyone to reproduce or contradict. I have no
monopoly over LMbench, PC hardware, the Linux kernel, or
anything else used to get those numbers.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
I understand that Karim doesn't have time to keep running these tests, but
I'm wondering if others would be able to scrape up surplus hardware, set
up the test config, and accept configs/kernels from a couple of people to
test.
I know that I have a large number of slow (<200 MHz) Pentiums just
sitting around at home that could be used for this, but I don't know if
they would be considered fast enough for any portions of this test (I have
a much smaller number of faster machines that could possibly be used).
Even if none of the big-name testing facilities want to do this, we should
be able to get something set up.
David Lang
--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare
Yeah, but so what? Don't freak out and take all of this so seriously.
It's not like nanokernels are going to disappear when this patch gets
broader acceptance. And who cares if you're wrong? You? :) Really,
get a grip man. :)
And, of course, DUH, making a kernel fully preemptible makes it (near)
real time. These aren't unexpected results.
> that nice. But, hell, I didn't invent those results. They are
> there for anyone to reproduce or contradict. I have no
> monopoly over LMbench, PC hardware, the Linux kernel, or
> anything else used to get those numbers.
Thanks for the numbers, really. I do expect some kind of performance
degradation, but something seems to be triggering oddities with the
patch that aren't consistent with some of our expectations.
Be patient. :)
bill
I don't think that there should be any limitation on the type of
hardware being tested. In fact, I would think that having as
diverse a set of test hardware as possible would be a good thing.
Many of the embedded platforms are in fact not that far different
from those slow pentiums you have lying around.
My 0.02$,
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Bill, it's not like Opersys is VC-funded or anything of that sort.
We're a self-sufficient operation, and as such we cannot afford to
waste any time. So when I dedicate close to six weeks of man-time
to a given project, you bet I'm going to take it seriously.
I have no objection to letting things cool off, but it's a lot to
ask that I not take seriously attacks on our ability to
produce objective results.
We can leave it at that if you want.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
> If I wanted to show "my" project in such a good light, would I have
> gone back and redone tests, and then published them even if those
> numbers now showed that the "other" project was as good as "mine"?
> Would I have even listened to any of your suggestions and gone back
> and had the tests changed to fit your requirements? Would I still be
> telling that we're going to further fix the tests based on your
> feedback?
if anything i wrote offended you i'd like to apologize for it. I feel
pretty strongly about the stuff i do, but i always regret 99.9% of the
flames in the next morning :) Also, i only realized when reading your
reply that you took my "vendor sponsored benchmarking" remark literally
(and that's my bad too). I never thought of you as a 'vendor' or having
any commercial interest in this benchmarking - it was just a stupid
analogy from me. I should have written "supporter driven benchmarking"
or so - that would still have been a pretty nice flame ;)
also please consider the other side of the coin. You posted numbers that
initially put PREEMPT_RT in a pretty bad light. Those numbers are still
being linked to from your website, without any indication to suggest
that they are incorrect. Even in your above paragraph you are not
talking about flawed numbers, you are talking about 'changing the tests
to fit my requirements'. Heck i have no 'requirements' other than to see
fair numbers. And if adeos/ipipe happens to beat PREEMPT_RT in a fair
irq latency test you wont hear a complaint from me. (you might see a
patch quite soon though ;)
And i know what irq latencies to expect from PREEMPT_RT. It takes me 5
minutes to do a 10 million samples irq test using LPPTEST, the histogram
takes only 200 bytes on the screen, and the numbers i'm getting differ
from your numbers - but obviously i cannot run it on your hardware. The
rtc_wakeup and built-in latency-tracer numbers differ too. They could be
all wrong though, so i'm curious what your independent method will
yield.
your lmbench results look accurate and fair, the slowdown during
irq-load is a known and expected effect of IRQ threading. If you flood
ping a box and generate context-switches instead of plain interrupts,
there will be noticeable overhead. I checked some of the lmbench numbers
today on my testbox, and while there's overhead, it's significantly less
than the 90% degradation you were seeing. That's why i suggested to you
to retest using the current base - but you of course dont 'have to'.
There were a number of bugs fixed in the past few dozen iterations of
patches that affected various components of lmbench :)
Ingo
> Subject: Re: PREEMPT_RT vs I-PIPE: the numbers, part 2
>
>
> David Lang wrote:
>> I know that I have a large number of slow (<200MHz) pentiums that are just
>> sitting around at home and could be used for this, but I don't know if
>> they would be considered fast enough for any portions of this test (I have
>> a much smaller number of faster machines that could possibly be used)
>
> I don't think that there should be any limitation on the type of
> hardware being tested. In fact, I would think that having as
> diverse a test hardware as possible would be a good thing.
> Many of the embedded platforms are in fact not that far different
> from those slow pentiums you have lying around.
Ok, I'll dig them out and see about getting them set up.
What pinout do I need to connect the printer ports?
I'm thinking that the best approach for this would be to set up a static
logger and host and then one (or more) target machines; then we can set up
a small website on the host that will allow Ingo (and others) to submit
kernels for testing, queue those kernels, and then run the tests on each
one in turn (and if it runs out of kernels to test, it re-tests the last
one with a longer run).
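Roughly, the queue end of that could look something like the sketch
below (only a sketch; the spool paths are made up and the actual
build/boot/measure step is just a stub):

#!/usr/bin/env python3
# Very rough sketch of the queue runner described above: submissions
# from the website land as files in a spool directory; the host tests
# them oldest-first, and when the queue is empty it re-tests the most
# recent one with a longer run. run_tests() is only a stub - the real
# work (building, booting the target, collecting logs) goes there.
import os, shutil, time

SPOOL = "/var/spool/rt-bench/queue"   # submissions land here (placeholder)
DONE = "/var/spool/rt-bench/done"     # processed submissions kept here

def run_tests(kernel, long_run=False):
    # Stub: build/boot the target with this kernel, run lmbench and the
    # irq-latency test, then archive the results for the website.
    print(f"testing {kernel} (long run: {long_run})")

def main():
    os.makedirs(DONE, exist_ok=True)
    last = None
    while True:
        queued = sorted(os.listdir(SPOOL),
                        key=lambda f: os.path.getmtime(os.path.join(SPOOL, f)))
        if queued:
            last = os.path.join(DONE, queued[0])
            shutil.move(os.path.join(SPOOL, queued[0]), last)
            run_tests(last)
        elif last:
            run_tests(last, long_run=True)   # nothing queued: longer re-run
        else:
            time.sleep(60)

if __name__ == "__main__":
    main()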
How much needs to change in userspace between the various tests? I would
assume that between the plain, preempt, and RT kernels no userspace
changes are needed; what about the other options?
Given the slow speed of these systems, it would seem to make more sense to
have a full kernel downloaded to them rather than having the local box
compile it.
Does this sound reasonable?
David Lang
>
> My 0.02$,
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
>
--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare
For LRTBF you'll find the pinout in the README of the package.
> I'm thinking that the best approach for this would be to setup a static
> logger and host and then one (or more) target machines, then we can setup
> a small website on the host that will allow Ingo (and others) to submit
> kernels for testing, queue those kernels and then run the tests on each
> one in turn (and if it runs out of kernels to test it re-tests the last
> one with a longer run)
Thing is, you're going to need one logger per target. As for a
small website, that sounds good enough. Don't know how feasible
it would be, but it may be desirable to also have a background
task that automatically checks for new releases and kicks off
the tests.
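That background task wouldn't need to be much more than this (again
just a sketch; the index URL and the filename pattern are placeholders
for wherever the -rt patches are actually published):

#!/usr/bin/env python3
# Rough sketch of the background task: poll a release index page, and
# whenever a patch version we haven't seen before shows up, download
# it into the test queue used by the queue runner.
import os, re, time, urllib.request

INDEX_URL = "http://example.org/realtime-preempt/"        # placeholder
PATTERN = re.compile(r"patch-2\.6\.\d+[\w.-]*rt[\w.-]*")  # placeholder
SPOOL = "/var/spool/rt-bench/queue"                       # test queue
seen = set()

def poll_once():
    page = urllib.request.urlopen(INDEX_URL).read().decode("ascii", "replace")
    for name in sorted(set(PATTERN.findall(page)) - seen):
        with open(os.path.join(SPOOL, name), "wb") as out:
            out.write(urllib.request.urlopen(INDEX_URL + name).read())
        seen.add(name)
        print("queued", name)

if __name__ == "__main__":
    os.makedirs(SPOOL, exist_ok=True)
    while True:
        poll_once()
        time.sleep(3600)   # check once an hour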
> how much needs to change in userspace between the various tests? I would
> assume that between the plain, preempt, and RT kernels no userspace
> changes are needed, what about the other options?
There are no user-space changes needed, but you may need to
install a few things that aren't there (LMbench, LTP, hackbench,
etc.)
> given the slow speed of these systems it would seem to make more sense to
> have a full kernel downloaded to them rather then having the local box
> compile it.
It's your choice really, but if the tests are to be automated,
then local compile shouldn't be a problem since you won't be
waiting on it personally.
> does this sound reasonable?
For me at least.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
Thanks for the clarification; sorry if I jumped to the wrong conclusions
regarding what you meant to say.
> also please consider the other side of the coin. You posted numbers that
> initially put PREEMPT_RT in a pretty bad light. Those numbers are still
> being linked to from your website, without any indication to suggest
> that they are incorrect. Even in your above paragraph you are not
> talking about flawed numbers, you are talking about 'changing the tests
> to fit my requirements'. Heck i have no 'requirements' other than to see
> fair numbers. And if adeos/ipipe happens to beat PREEMPT_RT in a fair
> irq latency test you wont hear a complaint from me. (you might see a
> patch quite soon though ;)
OK, please recheck the webpage; I've now added a warning specifically
to the effect that the numbers need to be rerun. Hopefully this clears
things up.
> And i know what irq latencies to expect from PREEMPT_RT. It takes me 5
> minutes to do a 10 million samples irq test using LPPTEST, the histogram
> takes only 200 bytes on the screen, and the numbers i'm getting differ
> from your numbers - but obviously i cannot run it on your hardware. The
> rtc_wakeup and built-in latency-tracer numbers differ too. They could be
> all wrong though, so i'm curious what your independent method will
> yield.
Well, the method we're using is certainly not absolute; that's why
we're providing the scripts. There's no telling whether others (outside
yourself and us) will find some other outlandish results. But hopefully
the more we study these things, the more consistently we can
characterize them.
> your lmbench results look accurate and fair, the slowdown during
> irq-load is a known and expected effect of IRQ threading. If you flood
> ping a box and generate context-switches instead of plain interrupts,
> there will be noticeable overhead. I checked some of the lmbench numbers
> today on my testbox, and while there's overhead, it's significantly less
> than the 90% degradation you were seeing. That's why i suggested to you
> to retest using the current base - but you of course dont 'have to'.
> There were a number of bugs fixed in the past few dozen iterations of
> patches that affected various components of lmbench :)
I certainly welcome this. Thanks for partly confirming our results,
and for pointing out that newer versions are better. Like I said
earlier, we're bound to repeat our tests with everything that's been
suggested ... so we will redo them with the version you mentioned
earlier. Hopefully this time we'll get it right.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
> David Lang wrote:
>> what pinout do I need to connect the printer ports
>
> For LRTBF you'll find the pinout in the README of the package.
>
>> I'm thinking that the best approach for this would be to setup a static
>> logger and host and then one (or more) target machines, then we can setup
>> a small website on the host that will allow Ingo (and others) to submit
>> kernels for testing, queue those kernels and then run the tests on each
>> one in turn (and if it runs out of kernels to test it re-tests the last
>> one with a longer run)
>
> Things is you're going to need one logger per target. As for a
> small website, that sounds good enough. Don't know how feasible
> it would be but it may be desirable to also have a background
> task that automatically checks for new releases and conducts
> the tests automatically.
The only problem with that would be the need for these low-powered boxes
to compile the kernel.
>> how much needs to change in userspace between the various tests? I would
>> assume that between the plain, preempt, and RT kernels no userspace
>> changes are needed, what about the other options?
>
> There are no user-space changes needed, but you may need to
> install a few things that aren't there (LMbench, LTP, hackbench,
> etc.)
>
>> given the slow speed of these systems it would seem to make more sense to
>> have a full kernel downloaded to them rather then having the local box
>> compile it.
>
> It's your choice really, but if the tests are to be automated,
> then local compile shouldn't be a problem since you won't be
> waiting on it personally.
That depends on how quickly Ingo releases updates; it would be nice to
have a system fast enough to run the tests on each version before the next
is released :-)
>> does this sound reasonable?
>
> For me at least.
>
> Karim
> --
> Author, Speaker, Developer, Consultant
> Pushing Embedded and Real-Time Linux Systems Beyond the Limits
> http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
>
--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare
Are you talking about the first run, where you left all those expensive
PREEMPT_RT debug options enabled?
IMHO those numbers should be taken down; they're completely meaningless.
Lee
That's just flamebait. Anyone who's ever read an LKML thread knows
better than to just trust the topmost parent. As for those who
don't read LKML very often, the "Latest Results" section is the one
they'll be most interested in, and that specific section starts
with a big fat warning.
You can label the results meaningless if you wish; suit yourself.
Karim
--
Author, Speaker, Developer, Consultant
Pushing Embedded and Real-Time Linux Systems Beyond the Limits
http://www.opersys.com || ka...@opersys.com || 1-866-677-4546
I could help here by providing the schematics and firmware
for having a microcontroller do the pulse-timing part. The schematics
should be extremely simple, and easy to build on a breadboard (no
soldering required) with standard parts from electronics resellers.
With a hardware solution we could measure the *actual* target latency
with sub-microsecond accuracy, and do some fun stuff too, like
triggering the pulse at random intervals within a given range, etc.
The microcontroller would then connect to the logger (or the HOST in
your setup, and avoid an extra computer) through a serial port to report
the measurements.
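On the host side, collecting those measurements could be as simple as
the sketch below (assuming pyserial and a firmware that prints one
decimal microsecond value per line; the device name, baud rate and wire
format are all placeholders):

#!/usr/bin/env python3
# Rough host-side sketch: read the latency values the microcontroller
# reports over the serial port and print a simple histogram.
from collections import Counter
import serial  # pyserial

PORT = "/dev/ttyS0"   # placeholder
BAUD = 115200         # placeholder
BUCKET_US = 5         # histogram bucket width, in microseconds
SAMPLES = 100000      # pulses to collect before printing the histogram

def main():
    histogram = Counter()
    with serial.Serial(PORT, BAUD, timeout=5) as link:
        collected = 0
        while collected < SAMPLES:
            line = link.readline().strip()
            if not line:
                continue          # timeout, keep waiting
            latency_us = float(line.decode("ascii", "replace"))
            histogram[int(latency_us // BUCKET_US) * BUCKET_US] += 1
            collected += 1
    for bucket in sorted(histogram):
        print(f"{bucket:6d} - {bucket + BUCKET_US - 1:6d} us: {histogram[bucket]:8d}")

if __name__ == "__main__":
    main()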
Is this something that could be useful, or do you think this is just
overkill?
--
Paulo Marques - www.grupopie.com
It is a mistake to think you can solve any major problems
just with potatoes.
Douglas Adams
> Ingo, what's the status of putting irq 0 back in a thread with
> PREEMPT_RT? IIRC this had some adverse (maybe unfixable?) effects so
> it was disabled a few months ago.
the jury is still out on that one - but right now it seems it's too much
complexity for a handful of usecs of latency improvement. Especially
with things like high-resolution timer support, threading IRQ0
doesn't seem to be worth it.
Ingo