Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
Dave
======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
3.5.0+ #122 Not tainted
------------------------------------------------------
trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
and this task is already holding:
blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
which would create a new lock dependency:
(&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&(&new_timer->it_lock)->rlock){-.-...}
... which became HARDIRQ-irq-safe at:
[<ffffffff810d8e31>] __lock_acquire+0x7e1/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
[<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
[<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
[<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
[<ffffffff816953a9>] smp_apic_timer_interrupt+0x69/0x99
[<ffffffff8169402f>] apic_timer_interrupt+0x6f/0x80
[<ffffffff811644e9>] __generic_file_aio_write+0x239/0x450
[<ffffffff81164773>] generic_file_aio_write+0x73/0xe0
[<ffffffff81258832>] ext4_file_write+0xc2/0x280
[<ffffffff811cfb76>] do_sync_write+0xe6/0x120
[<ffffffff811d04af>] vfs_write+0xaf/0x190
[<ffffffff811d07ed>] sys_write+0x4d/0x90
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f
to a HARDIRQ-irq-unsafe lock:
(&(&p->alloc_lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
... [<ffffffff810d8c37>] __lock_acquire+0x5e7/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff811d8882>] set_task_comm+0x32/0x180
[<ffffffff81095848>] kthreadd+0x38/0x2e0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
> > Dave
> > ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> ------------------------------------------------------
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> > and this task is already holding:
> blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
> (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> > but this new dependency connects a HARDIRQ-irq-safe lock:
> (&(&new_timer->it_lock)->rlock){-.-...}
> ... which became HARDIRQ-irq-safe at:
Shall I start bisecting this ? I can trigger it very easily, but if you
can give me a set of commits to narrow down, it'll speed up the bisection.
On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <da...@redhat.com> wrote:
> On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
> > Dave
> > ======================================================
> > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > 3.5.0+ #122 Not tainted
> > ------------------------------------------------------
> > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> > and this task is already holding:
> > blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> > which would create a new lock dependency:
> > (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> > but this new dependency connects a HARDIRQ-irq-safe lock:
> > (&(&new_timer->it_lock)->rlock){-.-...}
> > ... which became HARDIRQ-irq-safe at:
> Shall I start bisecting this ? I can trigger it very easily, but if you
> can give me a set of commits to narrow down, it'll speed up the bisection.
It should a real possible deadlock, could you test the below patch to
see if it can fix the warning?
--
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..29f6a8e 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -420,7 +420,7 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
int ret = 0;
if (likely(p != NULL)) {
- read_lock(&tasklist_lock);
+ /* tasklist_lock held already in timer_delete */
if (unlikely(p->sighand == NULL)) {
/*
* We raced with the reaping of the task.
@@ -435,7 +435,6 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
list_del(&timer->it.cpu.entry);
spin_unlock(&p->sighand->siglock);
}
- read_unlock(&tasklist_lock);
if (!ret)
put_task_struct(p);
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 69185ae..222d24c 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -884,15 +884,31 @@ SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
struct k_itimer *timer;
unsigned long flags;
+ /*
+ * hold tasklist_lock to protect tsk->sighand which might be
+ * accessed inside ->timer_del. It should be held before
+ * timer->it_lock to avoid the below deadlock:
+ * CPU0 CPU1
+ * lock(tasklist_lock)
+ * timer_delete()
+ * lock(timer->it_lock)
+ * lock(tasklist_lock)
+ * <timer interrupt>
+ * lock(timer->it_lock)
+ */
+ read_lock(&tasklist_lock);
retry_delete:
timer = lock_timer(timer_id, &flags);
- if (!timer)
+ if (!timer) {
+ read_unlock(&tasklist_lock);
return -EINVAL;
+ }
On Thu, Aug 16, 2012 at 08:54:31PM +0800, Ming Lei wrote:
> On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <da...@redhat.com> wrote:
> > On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> > > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
> > >
> > > Dave
> > >
> > > ======================================================
> > > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > > 3.5.0+ #122 Not tainted
> > > ------------------------------------------------------
> > > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> > >
> > > and this task is already holding:
> > > blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> > > which would create a new lock dependency:
> > > (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> > >
> > > but this new dependency connects a HARDIRQ-irq-safe lock:
> > > (&(&new_timer->it_lock)->rlock){-.-...}
> > > ... which became HARDIRQ-irq-safe at:
> >
> > Shall I start bisecting this ? I can trigger it very easily, but if you
> > can give me a set of commits to narrow down, it'll speed up the bisection.
> > It should a real possible deadlock, could you test the below patch to
> see if it can fix the warning?
I've not managed to hit it in a while. It seems very dependant upon
specific builds for some reason. Very strange.
On Tue, 2012-07-24 at 16:36 -0400, Dave Jones wrote:
> ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> ------------------------------------------------------
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> and this task is already holding:
> blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
> (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> but this new dependency connects a HARDIRQ-irq-safe lock:
> to a HARDIRQ-irq-unsafe lock:
> (&(&p->alloc_lock)->rlock){+.+...}
> other info that might help us debug this:
And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
inversion deadlock reported.
The task_lock() in keyctl_session_to_parent() comes from Al who didn't
think it necessary to write a changelog in d35abdb2.
David, Al, anybody want to have a go at fixing this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
+#define TWORK_EXITED ((struct callback_head *)1)
+
int
task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
{
struct callback_head *last, *first;
unsigned long flags;
+ int err = -ESRCH;
/*
- * Not inserting the new work if the task has already passed
- * exit_task_work() is the responisbility of callers.
+ * We must not insert the new work if the exiting task has already
+ * passed task_work_run().
*/
raw_spin_lock_irqsave(&task->pi_lock, flags);
- last = task->task_works;
- first = last ? last->next : twork;
- twork->next = first;
- if (last)
- last->next = twork;
- task->task_works = twork;
+ if (likely(task->task_works != TWORK_EXITED) {
+ last = task->task_works;
+ first = last ? last->next : twork;
+ twork->next = first;
+ if (last)
+ last->next = twork;
+ task->task_works = twork;
+ err = 0;
+ }
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
- if (notify)
+ if (!err && notify)
set_notify_resume(task);
- return 0;
+ return err;
}
raw_spin_lock_irqsave(&task->pi_lock, flags);
last = task->task_works;
- if (last) {
+ if (last && last != TWORK_EXITED) {
struct callback_head *q = last, *p = q->next;
while (1) {
if (p->func == func) {
@@ -63,7 +69,12 @@ void task_work_run(void)
while (1) {
raw_spin_lock_irq(&task->pi_lock);
p = task->task_works;
- task->task_works = NULL;
+ /*
+ * twork->func() can do task_work_add(), do not
+ * set TWORK_EXITED until the list becomes empty.
+ */
+ task->task_works = (!p && (task->flags & PF_EXITING))
+ ? TWORK_EXITED : NULL;
raw_spin_unlock_irq(&task->pi_lock);
On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> I still think that task_work_add() should synhronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)
I'm not at all sure how that relates to needing task_lock() in the
keyctl stuff.
Also, can't task_work use llist stuff? That would also avoid using
->pi_lock.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Mon, 2012-08-20 at 09:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synhronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.
On Mon, 2012-08-20 at 08:19 -0400, Steven Rostedt wrote:
> > > + for (;;) {
> > > + if (entry == &dead)
> But your compiler likes "entry == &dead"? ;-)
Yes, fancy as GCC is these days, it doesn't quite bother with the
meta-physical questions just yet.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synhronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.
To serialize with exit_mm() which clears ->mm.
We shouldn't do task_work_add(task) if the exiting task has already
passed exit_task_work(). There is no way to do this after ed3e694d7
(and I think this is wrong), so keyctl relies on the fact that
exit_task_work() is called after exit_mm().
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.
Not sure.... task_work_add/run can use cmpxchg, but _cancel can't.
> Yes, we can add the explicit argument to __task_work_run(), but it can
> check PF_EXITING instead, this looks simpler to me.
I guess we could.. but I thought the explicit callback was simpler ;-)
> Note also your patch breaks fifo, but this is fixable.
Why do you care about the order? Iterating a single linked queue in fifo
seems more expensive than useful.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > > On 08/20, Peter Zijlstra wrote:
On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
> > I guess we could steal the entire list and requeue it afterwards and
> > lift TIF_NOTIFY_RESUME along with it..
> We can't. This can race with exit_task_work(). And this can break
> fput(), the task can return to the userspace without __fput().
So we could put that spinlock back around cancel and run and leave add
lockless. That'd solve my immediate problem but its not something I'm
proud of.
All schemes I can come up with end up being broken or effectively the
same as the above proposal.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> BTW, do you think it is really important to try to avoid ->pi_lock?
It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
fudge around and delay, but not having to would be nice.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
> On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> > Yes, we can add the explicit argument to __task_work_run(), but it can
> > check PF_EXITING instead, this looks simpler to me.
> I guess we could.. but I thought the explicit callback was simpler ;-)
I won't insist. The patch I sent uses PF_EXITING and the fake
"struct callback_head* TWORK_EXITED", but this looks almost the same.
> > Note also your patch breaks fifo, but this is fixable.
> Why do you care about the order?
IMHO, this is just more natural.
For example. keyctl_session_to_parent() does _cancel only to protect
from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
loop. It could simply do task_work_add(), but in this case we need
fifo for correctness.
> Iterating a single linked queue in fifo
> seems more expensive than useful.
Currently the list is fifo (we add to the last element), this is O(1).
But the list should be short, we can reverse it in _run() if we change
task_work_add() to add to the head.