A few minutes after upgrading from -rc5 to -rc6, I got:
[ 1310.670986] =============================================
[ 1310.671690] [ INFO: possible recursive locking detected ]
[ 1310.672097] 2.6.24-rc6 #1
[ 1310.672421] ---------------------------------------------
[ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock:
[ 1310.673238] (&q->lock){++..}, at: [<c011544b>] __wake_up+0x1b/0x50
[ 1310.673869]
[ 1310.673870] but task is already holding lock:
[ 1310.674567] (&q->lock){++..}, at: [<c011544b>] __wake_up+0x1b/0x50
[ 1310.675267]
[ 1310.675268] other info that might help us debug this:
[ 1310.675952] 5 locks held by FahCore_a0.exe/3692:
[ 1310.676334] #0: (rcu_read_lock){..--}, at: [<c038b620>] net_rx_action+0x60/0x1b0
[ 1310.677251] #1: (rcu_read_lock){..--}, at: [<c0388d60>] netif_receive_skb+0x100/0x470
[ 1310.677924] #2: (rcu_read_lock){..--}, at: [<c03a7fb2>] ip_local_deliver_finish+0x32/0x210
[ 1310.678460] #3: (clock-AF_INET){-.-?}, at: [<c038164e>] sock_def_readable+0x1e/0x80
[ 1310.679250] #4: (&q->lock){++..}, at: [<c011544b>] __wake_up+0x1b/0x50
[ 1310.680151]
[ 1310.680152] stack backtrace:
[ 1310.680772] Pid: 3692, comm: FahCore_a0.exe Not tainted 2.6.24-rc6 #1
[ 1310.681209] [<c01038aa>] show_trace_log_lvl+0x1a/0x30
[ 1310.681659] [<c0104322>] show_trace+0x12/0x20
[ 1310.682085] [<c0104cba>] dump_stack+0x6a/0x70
[ 1310.682512] [<c0138ec1>] __lock_acquire+0x971/0x10c0
[ 1310.682961] [<c013966e>] lock_acquire+0x5e/0x80
[ 1310.683392] [<c0419b78>] _spin_lock_irqsave+0x38/0x50
[ 1310.683914] [<c011544b>] __wake_up+0x1b/0x50
[ 1310.684337] [<c018e2ba>] ep_poll_safewake+0x9a/0xc0
[ 1310.684822] [<c018f11b>] ep_poll_callback+0x8b/0xe0
[ 1310.685265] [<c0114418>] __wake_up_common+0x48/0x70
[ 1310.685712] [<c0115467>] __wake_up+0x37/0x50
[ 1310.686136] [<c03816aa>] sock_def_readable+0x7a/0x80
[ 1310.686579] [<c0381c2b>] sock_queue_rcv_skb+0xeb/0x150
[ 1310.687028] [<c03c7d99>] udp_queue_rcv_skb+0x139/0x2a0
[ 1310.687554] [<c03c81f1>] __udp4_lib_rcv+0x2f1/0x7e0
[ 1310.687996] [<c03c86f2>] udp_rcv+0x12/0x20
[ 1310.688415] [<c03a80a5>] ip_local_deliver_finish+0x125/0x210
[ 1310.688881] [<c03a84ed>] ip_local_deliver+0x2d/0x90
[ 1310.689323] [<c03a7d6b>] ip_rcv_finish+0xeb/0x300
[ 1310.689760] [<c03a8425>] ip_rcv+0x195/0x230
[ 1310.690182] [<c0388fdc>] netif_receive_skb+0x37c/0x470
[ 1310.690632] [<c038ba39>] process_backlog+0x69/0xc0
[ 1310.691175] [<c038b6f7>] net_rx_action+0x137/0x1b0
[ 1310.691681] [<c011e5c2>] __do_softirq+0x52/0xb0
[ 1310.692006] [<c0104e94>] do_softirq+0x94/0xe0
[ 1310.692301] =======================
This is a single-CPU machine, and the box was quite busy due to disk I/O
(load 6-8). The machine continues to run and all is well now. Even the
application mentioned above (FahCore_a0.exe) is running fine
("Folding@Home", CPU bound). The binary is located on a JFS filesystem,
which was also under heavy I/O. Can someone tell me why the backtrace
shows so much net* stuff? There was not much net I/O...
More details and .config: http://nerdbynature.de/bits/2.6.24-rc6
Thanks,
Christian.
[ Added even more CCs :-) Seems eventpoll & net related. ]
The net part might just be a red herring, since the problem is that
__wake_up is somehow reentering itself.
Cheers,
--
Herbert Xu <her...@gondor.apana.org.au>
The comment above ep_poll_safewake() in fs/eventpoll.c reads:

/*
* Perform a safe wake up of the poll wait list. The problem is that
* with the new callback'd wake up system, it is possible that the
* poll callback is reentered from inside the call to wake_up() done
* on the poll wait queue head. The rule is that we cannot reenter the
* wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
* and we cannot reenter the same wait queue head at all. This will
* enable to have a hierarchy of epoll file descriptor of no more than
* EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
* because this one gets called by the poll callback, that in turn is called
* from inside a wake_up(), that might be called from irq context.
*/
Seems to suggest that the epoll code can indeed recurse into wakeup.
Davide, Johannes, any ideas?
Since EP_MAX_POLLWAKE_NESTS < MAX_LOCKDEP_SUBCLASSES, we could perhaps do
something like:

	wake_up_nested(..., wake_nests);
although I'm not quite sure that is correct, my understanding of this
code is still fragile at best.
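For reference, one possible shape for such a helper (a sketch only, not an
actual implementation; it assumes the existing wake_up_locked() variant,
which performs the wakeup without taking the queue lock itself):

	/*
	 * Wake up a waitqueue while telling lockdep that its lock is
	 * taken with subclass 's', so nesting inside another waitqueue
	 * lock of the same class is not reported as recursion.
	 */
	#define wake_up_nested(x, s)					\
	do {								\
		unsigned long flags;					\
									\
		spin_lock_irqsave_nested(&(x)->lock, flags, (s));	\
		wake_up_locked(x);					\
		spin_unlock_irqrestore(&(x)->lock, flags);		\
	} while (0)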
I remember I talked with Arjan about this some time ago. Basically, since 1)
you can drop an epoll fd inside another epoll fd and 2) callback-based wakeups
are used, you can see a wake_up() from inside another wake_up(), but they
will never refer to the same lock instance.
Think about:

	dfd = socket(...);
	efd1 = epoll_create();
	efd2 = epoll_create();

	epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
When a packet arrives at the device underneath "dfd", the net code will
issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
callback wakeup entry on that queue, and the wake_up() performed by the
"dfd" net code will end up in ep_poll_callback(). At this point epoll
(efd1) notices that it may have some event ready, so it needs to wake up
the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
that ends up in another wake_up(), after having checked the recursion
constraints. Those are: go no deeper than EP_MAX_POLLWAKE_NESTS, to avoid
stack blasting, and never hit the same queue, to avoid loops like:
	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
	epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
	epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
	epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
The code "if (tncur->wq == wq || ..." prevents re-entering the same
queue/lock.
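To make the scenario concrete, here is a minimal userspace sketch of that
two-level nesting (illustration only; error handling omitted):

	#include <sys/epoll.h>
	#include <sys/socket.h>

	int main(void)
	{
		struct epoll_event ev = { .events = EPOLLIN };
		int dfd  = socket(AF_INET, SOCK_DGRAM, 0);	/* data source */
		int efd1 = epoll_create(1);
		int efd2 = epoll_create(1);

		/* efd1 watches the socket, efd2 watches efd1 */
		ev.data.fd = dfd;
		epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, &ev);
		ev.data.fd = efd1;
		epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, &ev);

		/*
		 * A packet on dfd now wakes dfd's queue, which runs
		 * ep_poll_callback() for efd1, which in turn wakes efd1's
		 * queue: a nested __wake_up() on a different q->lock
		 * instance of the same lock class -- exactly what lockdep
		 * reports above.
		 */
		return 0;
	}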
I don't know how the lockdep code works, so I can't say about
wake_up_nested(), although I have a feeling it's not enough in this case.
A solution may be to move the call to ep_poll_safewake() (that'd become a
simple wake_up()) inside a tasklet or whatever is today trendy for delayed
work. But this kinda scares me, to be honest, since epoll already has a
bunch of places where it could be asynchronously hit (plus possible
performance regressions would need to be verified).
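For illustration, the deferred version could look roughly like this (a
sketch only; schedule_deferred_wake() is a hypothetical name, and as
written it can lose wakeups if rescheduled before the tasklet runs -- part
of why this is scary):

	#include <linux/interrupt.h>
	#include <linux/wait.h>

	/* Run the nested wakeup from softirq context instead of from
	 * inside the outer __wake_up(). */
	static void deferred_wake_fn(unsigned long data)
	{
		wake_up((wait_queue_head_t *)data);
	}

	static DECLARE_TASKLET(deferred_wake, deferred_wake_fn, 0);

	static void schedule_deferred_wake(wait_queue_head_t *wq)
	{
		deferred_wake.data = (unsigned long)wq;
		tasklet_schedule(&deferred_wake);
	}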
- Davide
Although I'm not able to reproduce this one right now, I'm happy to test
any patches you guys come up with.
Thanks for your time,
Christian.
It's quite possible that I'm wrong, but just interested... why in
ep_poll_safewake() is the assignment

	struct list_head *lsthead = &psw->wake_task_list;

not protected by a spinlock?
- Cyrill -
It was a completely stupid question... please drop ;)
I think this would be the most correct change. But I wonder if it is possible
to do something more generic (but OTOH more stupid/hackish and less safe).
Consider this "just for illustration" patch:
--- t/kernel/lockdep.c	2007-11-09 12:57:31.000000000 +0300
+++ t/kernel/lockdep.c	2008-01-07 19:43:50.000000000 +0300
@@ -1266,10 +1266,13 @@ check_deadlock(struct task_struct *curr,
 	struct held_lock *prev;
 	int i;

-	for (i = 0; i < curr->lockdep_depth; i++) {
+	for (i = curr->lockdep_depth; --i >= 0; ) {
 		prev = curr->held_locks + i;
 		if (prev->class != next->class)
 			continue;
+
+		if (prev->trylock == -1)
+			return 2;
 		/*
 		 * Allow read-after-read recursion of the same
 		 * lock class (i.e. read_lock(lock)+read_lock(lock)):
-------------------------------------------------------------------------
Now,

	// trylock == -1
	#define spin_mark_nested(l) \
		lock_acquire(&(l)->dep_map, 0, -1, 0, 2, _THIS_IP_)
	#define spin_unmark_nested(l) \
		lock_release(&(l)->dep_map, 1, _THIS_IP_)
and ep_poll_safewake() can do:

	/* Do really wake up now */
	spin_mark_nested(&wq->lock);
	wake_up(wq);
	spin_unmark_nested(&wq->lock);
Possible?
Oleg.
I tested the patch above with the following code:

	wait_queue_head_t w1, w2, w3;

	init_waitqueue_head(&w1);
	init_waitqueue_head(&w2);
	init_waitqueue_head(&w3);

	local_irq_disable();
	spin_lock(&w1.lock);

	spin_mark_nested(&w2.lock);
	spin_lock(&w2.lock);

	spin_mark_nested(&w3.lock);
	wake_up(&w3);
	spin_unmark_nested(&w3.lock);

	spin_unlock(&w2.lock);
	spin_unmark_nested(&w2.lock);

	spin_unlock(&w1.lock);
	local_irq_enable();

Seems to work. What do you think?
> On Sat, 5 Jan 2008, Davide Libenzi wrote:
> > A solution may be to move the call to ep_poll_safewake() (that'd become a
> > simple wake_up()) inside a tasklet or whatever is today trendy for delayed
> > work. But this kinda scares me, to be honest, since epoll already has a
> > bunch of places where it could be asynchronously hit (plus possible
> > performance regressions would need to be verified).
>
> Although I'm not able to reproduce this one right now, I'm happy to test any
> patches you guys come up with.
There's no need to reproduce it. It's there, it's among us ;)
- Davide
I've been pondering this for a while, and some days I really like it,
some days I don't.
The problem I have with it is that it becomes very easy to falsely
annotate problems away - it's a very powerful annotation. That said, it's
almost powerful enough to annotate the device semaphore/mutex problem.
I think I'll do wake_up_nested() for now and keep this around.
Thanks for this very nice idea though.
Also, I don't like the overloading of ->trylock, this is really hackish.
> I think I'll do wake_up_nested() for now and keep this around.
Agreed.
Perhaps it is a bit easier to use spin_lock_nested() + __wake_up_common()
directly (we have a lot of wake_up_xxx helpers), but this is up to you.
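Roughly, a hypothetical __wake_up_nested() could be (a sketch; note that
__wake_up_common() is static to kernel/sched.c, so this would have to live
there as yet another helper):

	void __wake_up_nested(wait_queue_head_t *q, unsigned int mode,
			      int nr_exclusive, int subclass)
	{
		unsigned long flags;

		spin_lock_irqsave_nested(&q->lock, flags, subclass);
		__wake_up_common(q, mode, nr_exclusive, 0, NULL);
		spin_unlock_irqrestore(&q->lock, flags);
	}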
Offtopic question: why do we have so much lockdep stuff in timer.c and hrtimer.c?
We never lock two bases at the same time, except in migrate_timers(). We can kill
double_spin_lock() and base_lock_keys[] and just use spin_lock_nested() in
migrate_timers(), no?
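For illustration, the locking in migrate_timers() could then look something
like this (a rough sketch; the struct and function names here are only
placeholders for the real timer.c types):

	/* The only place two timer bases are held at once, so an
	 * explicit subclass replaces the per-CPU lock keys. */
	static void migrate_timers_sketch(struct tvec_base *old_base,
					  struct tvec_base *new_base)
	{
		spin_lock_irq(&new_base->lock);
		spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

		/* ... move pending timers from old_base to new_base ... */

		spin_unlock(&old_base->lock);
		spin_unlock_irq(&new_base->lock);
	}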
Oleg.
Sorry for the delay, this message seems to have gotten lost in my
inbox :-/
On Tue, 2008-01-15 at 00:27 +0300, Oleg Nesterov wrote:
> On 01/13, Peter Zijlstra wrote:
> Offtopic question: why do we have so much lockdep stuff in timer.c and hrtimer.c?
> We never lock two bases at the same time, except in migrate_timers(). We can kill
> double_spin_lock() and base_lock_keys[] and just use spin_lock_nested() in
> migrate_timers(), no?
Let's ask Thomas.. :-)
> Let's ask Thomas.. :-)
No objections as long as it all works :)
tglx