Re: Pin-pointing the root of unusual application latencies

Ingo Molnar

unread,

Jul 23, 2007, 5:54:37 AM7/23/07

to John Sigler, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

* John Sigler <linux....@free.fr> wrote:

> Here's a /proc/latency_trace dump. What is there to understand?
>
> # cat /proc/latency_trace
> preemption latency trace v1.1.5 on 2.6.20.7-rt8
> --------------------------------------------------------------------
> latency: 26 us, #2/2, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)
> -----------------
> | task: softirq-timer/0-4 (uid:0 nice:0 policy:1 rt_prio:50)
> -----------------
>
> _------=> CPU#
> / _-----=> irqs-off
> | / _----=> need-resched
> || / _---=> hardirq/softirq
> ||| / _--=> preempt-depth
> |||| /
> ||||| delay
> cmd pid ||||| time | caller
> \ / ||||| \ | /
> <...>-4 0D..1 26us : trace_stop_sched_switched
> (__sched_text_start)

could you try:

http://redhat.com/~mingo/latency-tracing-patches/trace-it.c

and run it like this:

./trace-it 1 > trace.txt

does it produce lots of trace entries? If not then
CONFIG_FUNCTION_TRACING is not enabled. Once you see lots of output in
the file, the tracer is up and running and you can start tracing the
latency in your app.

your above wakeup-tracing output suggests that your app is probably not
delayed by scheduling latencies, but by some other, higher-level
latencies.

To track it down, use the method that trace-it.c uses to start/stop
tracing, i.e. put the prctl(0,1); / prctl(0,0); calls into your app to
start/stop tracing and the kernel will do the rest once you've set
/proc/sys/kernel/preempt_max_latency back to 0: /proc/latency_trace will
always contain the longest latency that your app triggered, of the
critical path you programmed into it.

also check the cyclictest source of how to do app-driven latency
tracing. (And please post any latency traces to this list, we might be
able to pinpoint the source of the latencies.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

John Sigler

unread,

Jul 23, 2007, 10:14:14 AM7/23/07

to Ingo Molnar, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

Ingo Molnar wrote:

# ./trace-it 1 >/tmp/trace.txt
# wc /tmp/trace.txt
65555 393277 4096317 /tmp/trace.txt

preemption latency trace v1.1.5 on 2.6.20.7-rt8
--------------------------------------------------------------------

latency: 1020019 us, #65536/76272, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)
-----------------
| task: trace-it-939 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /

<...>-939 0D... 0us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 3us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 3us : read_tsc (get_monotonic_cycles)
[SNIP]
<...>-939 0D... 19763us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19763us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19763us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19764us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19764us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19764us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-939 0D... 19765us : read_tsc (get_monotonic_cycles)

vim:ft=help

(I snipped ~65500 lines where only the number of "us" changes.)

Shouldn't there be other functions in the output?

Am I reading correctly that some read_tsc calls take 20 ms?

dmesg reports:
( trace-it-939 |#0): new 1020019 us user-latency.

> To track it down, use the method that trace-it.c uses to start/stop
> tracing, i.e. put the prctl(0,1); / prctl(0,0); calls into your app to
> start/stop tracing and the kernel will do the rest once you've set
> /proc/sys/kernel/preempt_max_latency back to 0: /proc/latency_trace will
> always contain the longest latency that your app triggered, of the
> critical path you programmed into it.

Here's what I came up with:
http://linux.kernel.free.fr/latency/check_dektec_input3.cxx
(I enable tracing only 1% of the time.)

The output looks very much like the one I got when I ran trace-it

preemption latency trace v1.1.5 on 2.6.20.7-rt8
--------------------------------------------------------------------

latency: 19996 us, #65536/65815, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)
-----------------
| task: check_dektec_in-1151 (uid:0 nice:0 policy:2 rt_prio:80)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /

<...>-1151 0D... 0us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 1us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 2us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 3us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 3us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 3us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 4us : read_tsc (get_monotonic_cycles)
[...]
<...>-1151 0D... 19764us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19764us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19765us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19766us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19766us : read_tsc (get_monotonic_cycles)
<...>-1151 0D... 19766us : read_tsc (get_monotonic_cycles)

Remarks:

1. Again, shouldn't there be other functions in the output?

2. How much overhead do the prctl calls incur? Is it possible that they
are somehow masking my problem? (I'll let the program run all night to
maximize the chances of capturing the anomalous latency.)

> also check the cyclictest source of how to do app-driven latency
> tracing.

Are you talking about this code:
http://git.kernel.org/?p=linux/kernel/git/tglx/rt-tests.git;a=blob_plain;f=cyclictest/cyclictest.c;hb=HEAD
I will study it.

Regards.

Ingo Molnar

unread,

Jul 23, 2007, 12:08:30 PM7/23/07

to John Sigler, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

* John Sigler <linux....@free.fr> wrote:

> > ./trace-it 1 > trace.txt
> >
> > does it produce lots of trace entries? If not then
> > CONFIG_FUNCTION_TRACING is not enabled. Once you see lots of output
> > in the file, the tracer is up and running and you can start tracing
> > the latency in your app.
>
> # ./trace-it 1 >/tmp/trace.txt
> # wc /tmp/trace.txt
> 65555 393277 4096317 /tmp/trace.txt
>
> preemption latency trace v1.1.5 on 2.6.20.7-rt8

> <...>-939 0D... 0us : read_tsc (get_monotonic_cycles)

add 'notrace' to the definition of read_tsc in arch/i386/kernel/tsc.c.
(check recent -rt's read_tsc() definition) (or do echo 1 >
/proc/sys/kernel/trace_use_raw_cycles - if you are using recent enough
-rt)

> >To track it down, use the method that trace-it.c uses to start/stop
> >tracing, i.e. put the prctl(0,1); / prctl(0,0); calls into your app
> >to start/stop tracing and the kernel will do the rest once you've set
> >/proc/sys/kernel/preempt_max_latency back to 0: /proc/latency_trace
> >will always contain the longest latency that your app triggered, of
> >the critical path you programmed into it.
>
> Here's what I came up with:
> http://linux.kernel.free.fr/latency/check_dektec_input3.cxx
> (I enable tracing only 1% of the time.)
>
> The output looks very much like the one I got when I ran trace-it

> 1. Again, shouldn't there be other functions in the output?

there should be, once you fix read_tsc().

> 2. How much overhead do the prctl calls incur? Is it possible that
> they are somehow masking my problem? (I'll let the program run all
> night to maximize the chances of capturing the anomalous latency.)

it's not the prctl calls that matter but the fact that every kernel
function will be traced if you have FUNCTION_TRACING on. It mask any
races - but latencies are usually not masked (they are enlarged).

Ingo

John Sigler

unread,

Jul 23, 2007, 12:44:15 PM7/23/07

to Ingo Molnar, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

Ingo Molnar wrote:

> add 'notrace' to the definition of read_tsc in arch/i386/kernel/tsc.c

OK.

> (or do echo 1 > /proc/sys/kernel/trace_use_raw_cycles

> if you are using recent enough -rt)

Is patch-2.6.20-rt8 recent enough?

# ./trace-it 1 >trace
# cat trace

preemption latency trace v1.1.5 on 2.6.20.7-rt8

--------------------------------------------------------------------
latency: 1000034 us, #10083/10083, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)

-----------------
| task: trace-it-939 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /

trace-it-939 0D... 0us : user_trace_start (sys_prctl)
trace-it-939 0.... 0us : rt_up (user_trace_start)
trace-it-939 0...1 1us : rt_mutex_unlock (rt_up)
trace-it-939 0D... 1us+< (0)
trace-it-939 0.... 4us > sys_rt_sigprocmask (00000000 bfadff4c
000000d8)
trace-it-939 0.... 5us : sys_rt_sigprocmask (sysenter_past_esp)
trace-it-939 0.... 5us : copy_from_user (sys_rt_sigprocmask)
trace-it-939 0.... 5us : __copy_from_user_ll (copy_from_user)
trace-it-939 0.... 5us : sigprocmask (sys_rt_sigprocmask)
trace-it-939 0.... 5us : __lock_text_start (sigprocmask)
trace-it-939 0.... 6us : recalc_sigpending (sigprocmask)
[...]
<idle>-0 0DN.. 1000027us : __sched_text_start (cpu_idle)
<idle>-0 0DN.1 1000027us : __sched_text_start (c0100f30 0 0)
<idle>-0 0DN.1 1000027us : sched_clock (__sched_text_start)
<idle>-0 0D..2 1000028us : __switch_to (__sched_text_start)
trace-it-939 0D..2 1000028us : __sched_text_start <<idle>-0> (20 -2)
trace-it-939 0D..1 1000029us : trace_stop_sched_switched
(__sched_text_start)
trace-it-939 0.... 1000029us : hrtimer_cancel (do_nanosleep)
trace-it-939 0.... 1000030us : hrtimer_try_to_cancel (hrtimer_cancel)
trace-it-939 0D... 1000031us < (0)
trace-it-939 0.... 1000032us > sys_prctl (00000000 00000000 000000d8)
trace-it-939 0.... 1000032us : sys_prctl (sysenter_past_esp)
trace-it-939 0.... 1000032us : user_trace_stop (sys_prctl)
trace-it-939 0D... 1000033us : user_trace_stop (sys_prctl)

This looks better.

I will let my test program run overnight.

Regards.

John Sigler

unread,

Jul 24, 2007, 4:31:28 AM7/24/07

to Ingo Molnar, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

Ingo Molnar wrote:

> add 'notrace' to the definition of read_tsc in arch/i386/kernel/tsc.c

( check_dektec_in-1095 |#0): new 271 us user-latency.
( check_dektec_in-1095 |#0): new 275 us user-latency.
( check_dektec_in-1095 |#0): new 290 us user-latency.
( check_dektec_in-1095 |#0): new 297 us user-latency.
( check_dektec_in-1095 |#0): new 345 us user-latency.
( check_dektec_in-1095 |#0): new 358 us user-latency.
( check_dektec_in-1095 |#0): new 384 us user-latency.
( check_dektec_in-1095 |#0): new 392 us user-latency.
( check_dektec_in-1095 |#0): new 395 us user-latency.
( check_dektec_in-1095 |#0): new 396 us user-latency.
( check_dektec_in-1095 |#0): new 1031 us user-latency.
( check_dektec_in-1095 |#0): new 1100 us user-latency.
( check_dektec_in-1095 |#0): new 1105 us user-latency.
( check_dektec_in-1095 |#0): new 1106 us user-latency.

Here's the function trace for the 1106-盜 latency:

http://linux.kernel.free.fr/latency/1106-us-trace.txt

These two lines repeat ~2000 times for ~800 盜:

softirq--4 0.... 272us : __lock_text_start (rt_run_flush)
softirq--4 0.... 272us : rt_spin_unlock (rt_run_flush)

With a pair of the following in the middle:

softirq--4 0.... 670us : call_rcu (rt_run_flush)
softirq--4 0D..1 670us : __rcu_advance_callbacks (call_rcu)

PID 4 is [softirq-timer/0] and has priority 50 in SCHED_FIFO.
My process has priority 80 in SCHED_RR. It is waiting for IRQ10.

# cat /proc/interrupts
CPU0
0: 37 XT-PIC-XT timer
1: 2 XT-PIC-XT i8042
2: 0 XT-PIC-XT cascade
7: 0 XT-PIC-XT acpi
10: 151250933 XT-PIC-XT eth2, Dta1xx
11: 12435 XT-PIC-XT eth0
12: 4 XT-PIC-XT eth1
14: 17154 XT-PIC-XT ide0
NMI: 0
LOC: 5786548
ERR: 0
MIS: 0

> or do echo 1 > /proc/sys/kernel/trace_use_raw_cycles

> if you are using recent enough -rt

http://people.redhat.com/mingo/realtime-preempt/older/patch-2.6.20-rt8

+int trace_use_raw_cycles = 0;
+
+#ifdef CONFIG_EVENT_TRACE
+/*
+ * Convert raw cycles to usecs.
+ * Note: this is not the 'clocksource cycles' value, it's the raw
+ * cycle counter cycles. We use GTOD to timestamp latency start/end
+ * points, but the trace entries inbetween are timestamped with
+ * get_cycles().
+ */
+static unsigned long notrace cycles_to_us(cycle_t delta)
+{
+ if (!trace_use_raw_cycles)
+ return cycles_to_usecs(delta);
+#ifdef CONFIG_X86
+ do_div(delta, cpu_khz/1000+1);
+#elif defined(CONFIG_PPC)
+ delta = mulhwu(tb_to_us, delta);
+#elif defined(CONFIG_ARM)
+ delta = mach_cycles_to_usecs(delta);
+#else
+ #error Implement cycles_to_usecs.
+#endif
+
+ return (unsigned long) delta;
+}
+#endif

# cat /proc/sys/kernel/trace_use_raw_cycles
0

Should I set trace_use_raw_cycles=1 even if I notrace read_tsc?

Regards.

John Sigler

unread,

Jul 24, 2007, 5:20:51 AM7/24/07

to Ingo Molnar, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

John Sigler wrote:

> ( check_dektec_in-1095 |#0): new 271 us user-latency.
> ( check_dektec_in-1095 |#0): new 275 us user-latency.
> ( check_dektec_in-1095 |#0): new 290 us user-latency.
> ( check_dektec_in-1095 |#0): new 297 us user-latency.
> ( check_dektec_in-1095 |#0): new 345 us user-latency.
> ( check_dektec_in-1095 |#0): new 358 us user-latency.
> ( check_dektec_in-1095 |#0): new 384 us user-latency.
> ( check_dektec_in-1095 |#0): new 392 us user-latency.
> ( check_dektec_in-1095 |#0): new 395 us user-latency.
> ( check_dektec_in-1095 |#0): new 396 us user-latency.
> ( check_dektec_in-1095 |#0): new 1031 us user-latency.
> ( check_dektec_in-1095 |#0): new 1100 us user-latency.
> ( check_dektec_in-1095 |#0): new 1105 us user-latency.
> ( check_dektec_in-1095 |#0): new 1106 us user-latency.
>
> Here's the function trace for the 1106-盜 latency:
>
> http://linux.kernel.free.fr/latency/1106-us-trace.txt

The function trace for 400-盜 latencies is different:

( check_dektec_in-1145 |#0): new 275 us user-latency.
( check_dektec_in-1145 |#0): new 276 us user-latency.
( check_dektec_in-1145 |#0): new 288 us user-latency.
( check_dektec_in-1145 |#0): new 289 us user-latency.
( check_dektec_in-1145 |#0): new 289 us user-latency.
( check_dektec_in-1145 |#0): new 290 us user-latency.
( check_dektec_in-1145 |#0): new 297 us user-latency.
( check_dektec_in-1145 |#0): new 345 us user-latency.
( check_dektec_in-1145 |#0): new 354 us user-latency.
( check_dektec_in-1145 |#0): new 377 us user-latency.
( check_dektec_in-1145 |#0): new 393 us user-latency.

http://linux.kernel.free.fr/latency/393-us-trace.txt

There are ~200 calls to ioread32 from mdio_read from speedo_timer.

http://lxr.linux.no/source/drivers/net/eepro100.c#L1159
http://lxr.linux.no/source/drivers/net/eepro100.c#L928

In this case, and as far as I understand, the culprit is the eepro100
driver talking to one of the NICs (which one?). Is that correct?

What is the consequence of IRQ10 being shared by eth2 and
by my I/O board?

How can I force Linux to assign different IRQs to every peripheral
if I have free IRQs lines?

John Sigler

unread,

Jul 25, 2007, 9:04:51 AM7/25/07

to linux-r...@vger.kernel.org, Ingo Molnar, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

John Sigler wrote:

Any idea what went wrong in the above function trace?
Why is the kernel spinning in circles that way?

Ingo Molnar

unread,

Jul 25, 2007, 9:06:00 AM7/25/07

to John Sigler, linux-r...@vger.kernel.org, oprofi...@lists.sourceforge.net, linux-...@vger.kernel.org

* John Sigler <linux....@free.fr> wrote:

> >With a pair of the following in the middle:
> >
> >softirq--4 0.... 670us : call_rcu (rt_run_flush)
> >softirq--4 0D..1 670us : __rcu_advance_callbacks (call_rcu)

> Any idea what went wrong in the above function trace? Why is the

> kernel spinning in circles that way?

does your test-app have higher priority than softirq--4 ?

Ingo

John Sigler

unread,

Jul 25, 2007, 9:20:37 AM7/25/07

to Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

(Dropping oprofi...@lists.sourceforge.net)

Ingo Molnar wrote:

> John Sigler wrote:
>
>>> With a pair of the following in the middle:
>>>
>>> softirq--4 0.... 670us : call_rcu (rt_run_flush)
>>> softirq--4 0D..1 670us : __rcu_advance_callbacks (call_rcu)
>
>> Any idea what went wrong in the above function trace? Why is the
>> kernel spinning in circles that way?
>
> does your test-app have higher priority than softirq--4 ?

PID 4 is [softirq-timer/0] and has priority 50 in SCHED_FIFO.

My process has priority 80 in SCHED_RR. It is waiting for IRQ10.

My user-space app has higher priority than everything except
PID 2 which is [posix_cpu_timer]

Is this a priority inversion issue?

Regards.

Ingo Molnar

unread,

Jul 25, 2007, 9:39:05 AM7/25/07

to John Sigler, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

* John Sigler <linux....@free.fr> wrote:

> > does your test-app have higher priority than softirq--4 ?
>
> PID 4 is [softirq-timer/0] and has priority 50 in SCHED_FIFO. My
> process has priority 80 in SCHED_RR. It is waiting for IRQ10.
>
> My user-space app has higher priority than everything except PID 2
> which is [posix_cpu_timer]

well what priority does the IRQ 10 kernel thread have? It should be prio
80 too if it's in your critical path.

Ingo

John Sigler

unread,

Jul 25, 2007, 10:06:14 AM7/25/07

to Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

Ingo Molnar wrote:

> John Sigler wrote:

>
>> Ingo Molnar wrote:
>>
>>> does your test-app have higher priority than softirq--4 ?
>>
>> PID 4 is [softirq-timer/0] and has priority 50 in SCHED_FIFO. My
>> process has priority 80 in SCHED_RR. It is waiting for IRQ10.
>>
>> My user-space app has higher priority than everything except PID 2
>> which is [posix_cpu_timer]
>
> well what priority does the IRQ 10 kernel thread have? It should be prio
> 80 too if it's in your critical path.

Is there some form of priority inheritance? Does the IRQ handler get a
priority boost if a high priority task is waiting for it?

I didn't dare mess with the default kernel thread priorities.

# ps -eo pid,class,rtprio,ni,pri,stat,comm
PID CLS RTPRIO NI PRI STAT COMMAND
1 TS - 0 24 Ss init
2 FF 99 - 139 S posix_cpu_timer
3 FF 50 - 90 S softirq-high/0
4 FF 50 - 90 S softirq-timer/0
5 FF 50 - 90 S softirq-net-tx/
6 FF 50 - 90 S softirq-net-rx/
7 FF 50 - 90 S softirq-block/0
8 FF 50 - 90 S softirq-tasklet
9 FF 50 - 90 S softirq-sched/0
10 FF 50 - 90 S softirq-hrtimer
11 FF 50 - 90 S softirq-rcu/0
12 TS - -10 34 S< desched/0
13 FF 1 - 41 S< events/0
14 TS - -5 19 S< khelper
15 TS - -5 19 S< kthread
34 TS - -5 29 S< kblockd/0
35 TS - -5 19 S< kacpid
36 FF 50 - 90 S< IRQ-7
82 TS - -5 29 S< kseriod
101 TS - 0 16 S pdflush
102 TS - 0 24 S pdflush
103 TS - -5 21 S< kswapd0
104 TS - -5 29 S< flush_filesd/0
105 TS - -5 21 S< aio/0
704 FF 50 - 90 S< IRQ-14
720 FF 50 - 90 S< IRQ-12
721 FF 50 - 90 S< IRQ-1
848 FF 50 - 90 S< IRQ-11
913 FF 50 - 90 S< IRQ-10
922 TS - 0 21 Ss sshd
925 TS - 0 24 Ss+ agetty
926 TS - 0 21 Ss+ agetty
931 TS - 0 24 Rs sshd
933 TS - 0 24 Rs bash

# cat /proc/interrupts
CPU0
0: 37 XT-PIC-XT timer
1: 2 XT-PIC-XT i8042
2: 0 XT-PIC-XT cascade
7: 0 XT-PIC-XT acpi

10: 175 XT-PIC-XT eth2, Dta1xx
11: 1129 XT-PIC-XT eth0
12: 4 XT-PIC-XT eth1
14: 21482 XT-PIC-XT ide0
NMI: 0
LOC: 161632
ERR: 0
MIS: 0

IRQ 10 is shared between a NIC and an I/O board.

For eth2, the kernel said:
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKC]
-> GSI 10 (level, low) -> IRQ 10

For Dta1xx, the kernel said:
ACPI: PCI Interrupt 0000:02:0e.0[A] -> Link [LNKC]
-> GSI 10 (level, low) -> IRQ 10

Is it possible to avoid the two boards sharing IRQ 10?

Regards.

Karsten Wiese

unread,

Jul 25, 2007, 11:27:30 AM7/25/07

to John Sigler, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

Am Mittwoch, 25. Juli 2007 schrieb John Sigler:
>
> Is there some form of priority inheritance? Does the IRQ handler get a
> priority boost if a high priority task is waiting for it?

No. But that would be "nice to have".

Karsten

John Sigler

unread,

Jul 25, 2007, 11:46:56 AM7/25/07

to Karsten Wiese, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

Karsten Wiese wrote:

> John Sigler wrote:
>
>> Is there some form of priority inheritance? Does the IRQ handler get a
>> priority boost if a high priority task is waiting for it?
>
> No. But that would be "nice to have".

No to the first question? to the second question? or to both? :-)

In kernel/futex.c does "PI" stand for Priority Inheritance?

e.g.

/*
* Priority Inheritance state:
*/
struct futex_pi_state {
/*
* list of 'owned' pi_state instances - these have to be
* cleaned up in do_exit() if the task exits prematurely:
*/
struct list_head list;

/*
* The PI object:
*/
struct rt_mutex pi_mutex;

struct task_struct *owner;
atomic_t refcount;

union futex_key key;
};

Karsten Wiese

unread,

Jul 25, 2007, 12:31:00 PM7/25/07

to John Sigler, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

Am Mittwoch, 25. Juli 2007 schrieb John Sigler:

> Karsten Wiese wrote:
>
> > John Sigler wrote:
> >
> >> Is there some form of priority inheritance? Does the IRQ handler get a
> >> priority boost if a high priority task is waiting for it?
> >
> > No. But that would be "nice to have".
>
> No to the first question? to the second question? or to both? :-)
>

To the second.
I checked some soundcard drivers some time ago in -rt kernels, looking for
that "priority boosted by consuming process threaded interrupt handler".
Then there was no such.

> In kernel/futex.c does "PI" stand for Priority Inheritance?
>

I guess so, yes.

Len Brown

unread,

Jul 25, 2007, 1:10:27 PM7/25/07

to John Sigler, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

On Wednesday 25 July 2007 10:05, John Sigler wrote:

> # cat /proc/interrupts
> CPU0
> 0: 37 XT-PIC-XT timer
> 1: 2 XT-PIC-XT i8042
> 2: 0 XT-PIC-XT cascade
> 7: 0 XT-PIC-XT acpi
> 10: 175 XT-PIC-XT eth2, Dta1xx
> 11: 1129 XT-PIC-XT eth0
> 12: 4 XT-PIC-XT eth1
> 14: 21482 XT-PIC-XT ide0
> NMI: 0
> LOC: 161632
> ERR: 0
> MIS: 0
>
> IRQ 10 is shared between a NIC and an I/O board.
>
> For eth2, the kernel said:
> ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKC]
> -> GSI 10 (level, low) -> IRQ 10
>
> For Dta1xx, the kernel said:
> ACPI: PCI Interrupt 0000:02:0e.0[A] -> Link [LNKC]
> -> GSI 10 (level, low) -> IRQ 10
>
> Is it possible to avoid the two boards sharing IRQ 10?

Maybe. In this configuration, INTA of the two devices
is physically connected to the same wire on the device-side
of the interrupt re-mapper -- so you'd have to change the configuration.
If you have an IOAPIC and can enable it, that will not hurt --
though unless something else changes, these devices are still
tied together on the device-side of the mapper.
So if you can physically move one of the devices to another slot
that is your best bet.

I'd need a bunch of info from your system to tell you what
you can do ahead of time, including full dmesg, lspci -vv
and acpidump.

-Len

John Sigler

unread,

Jul 26, 2007, 4:37:58 AM7/26/07

to Len Brown, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

Len Brown wrote:

I believe this board does not provide an IO-APIC.
Even the LAPIC is disabled in the BIOS.
(Why would they do that??)

> though unless something else changes, these devices are still
> tied together on the device-side of the mapper.
> So if you can physically move one of the devices to another slot
> that is your best bet.

I will try.

> I'd need a bunch of info from your system to tell you what
> you can do ahead of time, including full dmesg, lspci -vv
> and acpidump.

The motherboard is an Adlink EBC-2000T with 3 on-board Intel 82559 NICs.
http://www.adlinktech.com/PD/web/PD_detail.php?pid=213

VIA Pro133T (VT82C694T + VT82C686B) chipset.
http://www.viaarena.com/default.aspx?PageID=40&STypeID=12

Kernel config:
http://linux.kernel.free.fr/latency/config-2.6.20.7-rt8-adlink-latency

dmesg:
http://linux.kernel.free.fr/latency/dmesg.adlink

lspci:
http://linux.kernel.free.fr/latency/lspci.adlink

acpidump:
http://linux.kernel.free.fr/latency/acpidump.adlink

Regards.

John Sigler

unread,

Jul 26, 2007, 6:46:09 AM7/26/07

to Len Brown, Ingo Molnar, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

John Sigler wrote:

The NICs are on-board, therefore it is not possible to move them.

The motherboard only has one PCI slot, so the manufacturer includes
a backplane (is that what it's called?) to provide two PCI slots.

The results I've given so far were with the I/O board inserted in
the bottom slot. If it is inserted in the top slot, the results are
different indeed.

# cat /proc/interrupts
CPU0
0: 37 XT-PIC-XT timer
1: 2 XT-PIC-XT i8042
2: 0 XT-PIC-XT cascade

5: 20270 XT-PIC-XT Dta1xx
7: 0 XT-PIC-XT acpi
10: 4 XT-PIC-XT eth2
11: 2639 XT-PIC-XT eth0
12: 4 XT-PIC-XT eth1
14: 13984 XT-PIC-XT ide0
NMI: 0
LOC: 518501
ERR: 0
MIS: 0

$ diff dmesg.adlink dmesg2.adlink
208c208,210

< ACPI: PCI Interrupt 0000:02:0e.0[A] -> Link [LNKC] -> GSI 10 (level,
low) -> IRQ 10

---
> ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 5
> PCI: setting IRQ 5 as level-triggered
> ACPI: PCI Interrupt 0000:02:0f.0[A] -> Link [LNKD] -> GSI 5 (level,
low) -> IRQ 5

$ diff lspci.adlink lspci2.adlink
121c121
< 02:0e.0 Multimedia controller: PLX Technology, Inc. 9056 PCI I/O
Accelerator
---
> 02:0f.0 Multimedia controller: PLX Technology, Inc. 9056 PCI I/O
Accelerator
126c126
< Interrupt: pin A routed to IRQ 10
---
> Interrupt: pin A routed to IRQ 5

I'll give a 4-port PCI NIC a spin.

John Sigler

unread,

Jul 26, 2007, 8:03:42 AM7/26/07

to Len Brown, linux-r...@vger.kernel.org, linux-...@vger.kernel.org

With a 4-port NIC (instead of the I/O board) inserted in the top slot.

# cat /proc/interrupts
CPU0
0: 37 XT-PIC-XT timer
1: 2 XT-PIC-XT i8042
2: 0 XT-PIC-XT cascade

5: 0 XT-PIC-XT eth3
7: 0 XT-PIC-XT acpi
10: 4 XT-PIC-XT eth1, eth5
11: 4 XT-PIC-XT eth2, eth6
12: 265 XT-PIC-XT eth0, eth4
14: 12726 XT-PIC-XT ide0
NMI: 0
LOC: 12704
ERR: 0
MIS: 0

With the 4-port NIC inserted in the bottom slot.

# cat /proc/interrupts
CPU0
0: 37 XT-PIC-XT timer
1: 2 XT-PIC-XT i8042
2: 0 XT-PIC-XT cascade

5: 4 XT-PIC-XT eth2, eth3
7: 0 XT-PIC-XT acpi
10: 99 XT-PIC-XT eth0, eth5
11: 4 XT-PIC-XT eth1, eth6
12: 0 XT-PIC-XT eth4
14: 12138 XT-PIC-XT ide0
NMI: 0
LOC: 3554
ERR: 0
MIS: 0

As far as I understand, this board has 4 interrupt lines, and they are
mapped (by the BIOS? by Linux?) to ISA IRQs 5, 10, 11, and 12.

If I insert another 4-port NIC in the remaining slot, eth7 through eth10
will spread to these IRQs. Correct?

Suppose 3 NICs share the same interrupt line, and each receives 1500
packets per second. (I suppose that translates to 1500 interrupt
requests per second.) Will the system cope with 4500 interrupt requests
from 3 different devices on the same interrupt line? How different is it
from 4500 interrupt requests from a single device not sharing the line?

John Sigler

unread,

Jul 26, 2007, 11:17:03 AM7/26/07

to linux-r...@vger.kernel.org, Ingo Molnar, linux-...@vger.kernel.org, linu...@vger.kernel.org

[ Adding linux-net to the mix ]

John Sigler wrote:

> ( check_dektec_in-1095 |#0): new 271 us user-latency.
> ( check_dektec_in-1095 |#0): new 275 us user-latency.
> ( check_dektec_in-1095 |#0): new 290 us user-latency.
> ( check_dektec_in-1095 |#0): new 297 us user-latency.
> ( check_dektec_in-1095 |#0): new 345 us user-latency.
> ( check_dektec_in-1095 |#0): new 358 us user-latency.
> ( check_dektec_in-1095 |#0): new 384 us user-latency.
> ( check_dektec_in-1095 |#0): new 392 us user-latency.
> ( check_dektec_in-1095 |#0): new 395 us user-latency.
> ( check_dektec_in-1095 |#0): new 396 us user-latency.
> ( check_dektec_in-1095 |#0): new 1031 us user-latency.
> ( check_dektec_in-1095 |#0): new 1100 us user-latency.
> ( check_dektec_in-1095 |#0): new 1105 us user-latency.
> ( check_dektec_in-1095 |#0): new 1106 us user-latency.
>
> Here's the function trace for the 1106-盜 latency:
>
> http://linux.kernel.free.fr/latency/1106-us-trace.txt
>
> These two lines repeat ~2000 times for ~800 盜:
>
> softirq--4 0.... 272us : __lock_text_start (rt_run_flush)
> softirq--4 0.... 272us : rt_spin_unlock (rt_run_flush)
>
> With a pair of the following in the middle:
>
> softirq--4 0.... 670us : call_rcu (rt_run_flush)
> softirq--4 0D..1 670us : __rcu_advance_callbacks (call_rcu)

Could someone explain why the kernel is "spinning" in rt_run_flush.
http://lxr.linux.no/source/net/ipv4/route.c#L692
What is the kernel trying to do?

Is the kernel in the following loop?

701 for (i = rt_hash_mask; i >= 0; i--) {
702 spin_lock_bh(rt_hash_lock_addr(i));
703 rth = rt_hash_table[i].chain;
704 if (rth)
705 rt_hash_table[i].chain = NULL;
706 spin_unlock_bh(rt_hash_lock_addr(i));

Line 702 would be __lock_text_start (rt_run_flush)
Line 706 would be rt_spin_unlock
Is that correct?

And this would be done 2000 times because rt_hash_mask=2000?