lockdep trace from posix timers

Dave Jones

unread,

Jul 24, 2012, 4:40:02 PM7/24/12

to

Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a

Dave

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
3.5.0+ #122 Not tainted
------------------------------------------------------
trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0

and this task is already holding:
blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
which would create a new lock dependency:
(&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}

but this new dependency connects a HARDIRQ-irq-safe lock:
(&(&new_timer->it_lock)->rlock){-.-...}
... which became HARDIRQ-irq-safe at:
[<ffffffff810d8e31>] __lock_acquire+0x7e1/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
[<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
[<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
[<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
[<ffffffff816953a9>] smp_apic_timer_interrupt+0x69/0x99
[<ffffffff8169402f>] apic_timer_interrupt+0x6f/0x80
[<ffffffff811644e9>] __generic_file_aio_write+0x239/0x450
[<ffffffff81164773>] generic_file_aio_write+0x73/0xe0
[<ffffffff81258832>] ext4_file_write+0xc2/0x280
[<ffffffff811cfb76>] do_sync_write+0xe6/0x120
[<ffffffff811d04af>] vfs_write+0xaf/0x190
[<ffffffff811d07ed>] sys_write+0x4d/0x90
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

to a HARDIRQ-irq-unsafe lock:
(&(&p->alloc_lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
... [<ffffffff810d8c37>] __lock_acquire+0x5e7/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff811d8882>] set_task_comm+0x32/0x180
[<ffffffff81095848>] kthreadd+0x38/0x2e0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

Chain exists of:
&(&new_timer->it_lock)->rlock --> tasklist_lock --> &(&p->alloc_lock)->rlock

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&(&p->alloc_lock)->rlock);
local_irq_disable();
lock(&(&new_timer->it_lock)->rlock);
lock(tasklist_lock);
<Interrupt>
lock(&(&new_timer->it_lock)->rlock);

*** DEADLOCK ***

1 lock on stack by trinity-child2/5327:
#0: blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0

the dependencies between HARDIRQ-irq-safe lock and the holding lock:
-> (&(&new_timer->it_lock)->rlock){-.-...} ops: 584 {
IN-HARDIRQ-W at:
[<ffffffff810d8e31>] __lock_acquire+0x7e1/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
[<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
[<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
[<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
[<ffffffff816953a9>] smp_apic_timer_interrupt+0x69/0x99
[<ffffffff8169402f>] apic_timer_interrupt+0x6f/0x80
[<ffffffff811644e9>] __generic_file_aio_write+0x239/0x450
[<ffffffff81164773>] generic_file_aio_write+0x73/0xe0
[<ffffffff81258832>] ext4_file_write+0xc2/0x280
[<ffffffff811cfb76>] do_sync_write+0xe6/0x120
[<ffffffff811d04af>] vfs_write+0xaf/0x190
[<ffffffff811d07ed>] sys_write+0x4d/0x90
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f
IN-SOFTIRQ-W at:
[<ffffffff810d8bfe>] __lock_acquire+0x5ae/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
[<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
[<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
[<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
[<ffffffff8109b2bb>] __hrtimer_peek_ahead_timers.part.27+0x2b/0x30
[<ffffffff8109b309>] hrtimer_peek_ahead_timers+0x49/0xa0
[<ffffffff8109b391>] run_hrtimer_softirq+0x31/0x40
[<ffffffff810746e0>] __do_softirq+0xf0/0x400
[<ffffffff81074b2c>] run_ksoftirqd+0x13c/0x230
[<ffffffff81095517>] kthread+0xb7/0xc0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
INITIAL USE at:
[<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
[<ffffffff81093d49>] __lock_timer+0x89/0x1f0
[<ffffffff810947a4>] sys_timer_delete+0x44/0x100
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f
}
... key at: [<ffffffff820ac8a0>] __key.29841+0x0/0x8
... acquired at:
[<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
[<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff81689b59>] _raw_read_lock+0x49/0x80
[<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
[<ffffffff81094786>] sys_timer_delete+0x26/0x100
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
-> (&(&p->alloc_lock)->rlock){+.+...} ops: 1189942 {
HARDIRQ-ON-W at:
[<ffffffff810d8c37>] __lock_acquire+0x5e7/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff811d8882>] set_task_comm+0x32/0x180
[<ffffffff81095848>] kthreadd+0x38/0x2e0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
SOFTIRQ-ON-W at:
[<ffffffff810d8c6c>] __lock_acquire+0x61c/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff811d8882>] set_task_comm+0x32/0x180
[<ffffffff81095848>] kthreadd+0x38/0x2e0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
INITIAL USE at:
[<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff811d8882>] set_task_comm+0x32/0x180
[<ffffffff81095848>] kthreadd+0x38/0x2e0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
}
... key at: [<ffffffff820883a0>] __key.45832+0x0/0x8
... acquired at:
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
[<ffffffff812d5f2e>] keyctl_session_to_parent+0xde/0x490
[<ffffffff812d634d>] sys_keyctl+0x6d/0x1a0
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

-> (tasklist_lock){.+.+..} ops: 35423 {
HARDIRQ-ON-R at:
[<ffffffff810d8b1e>] __lock_acquire+0x4ce/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff81689b59>] _raw_read_lock+0x49/0x80
[<ffffffff810701a2>] do_wait+0xb2/0x360
[<ffffffff81072095>] sys_wait4+0x75/0xf0
[<ffffffff8108a482>] wait_for_helper+0x82/0xb0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
SOFTIRQ-ON-R at:
[<ffffffff810d8c6c>] __lock_acquire+0x61c/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff81689b59>] _raw_read_lock+0x49/0x80
[<ffffffff810701a2>] do_wait+0xb2/0x360
[<ffffffff81072095>] sys_wait4+0x75/0xf0
[<ffffffff8108a482>] wait_for_helper+0x82/0xb0
[<ffffffff81694934>] kernel_thread_helper+0x4/0x10
INITIAL USE at:
[<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff81689f0c>] _raw_write_lock_irq+0x5c/0xa0
[<ffffffff81068a41>] copy_process.part.22+0x1051/0x1750
[<ffffffff810692d0>] do_fork+0x140/0x4e0
[<ffffffff81024606>] kernel_thread+0x76/0x80
[<ffffffff81664602>] rest_init+0x26/0x154
[<ffffffff81efecb3>] start_kernel+0x3f8/0x405
[<ffffffff81efe356>] x86_64_start_reservations+0x131/0x135
[<ffffffff81efe4a2>] x86_64_start_kernel+0x148/0x157
}
... key at: [<ffffffff81c05098>] tasklist_lock+0x18/0x80
... acquired at:
[<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
[<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff81689b59>] _raw_read_lock+0x49/0x80
[<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
[<ffffffff81094786>] sys_timer_delete+0x26/0x100
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

stack backtrace:
Pid: 5327, comm: trinity-child2 Not tainted 3.5.0+ #122
Call Trace:
[<ffffffff810d8194>] check_usage+0x4e4/0x500
[<ffffffff81023729>] ? native_sched_clock+0x19/0x80
[<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
[<ffffffff81023729>] ? native_sched_clock+0x19/0x80
[<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
[<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
[<ffffffff810d8956>] ? __lock_acquire+0x306/0x1ae0
[<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
[<ffffffff810da2a5>] ? lock_release_non_nested+0x175/0x320
[<ffffffff810da83d>] lock_acquire+0xad/0x220
[<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
[<ffffffff81689b59>] _raw_read_lock+0x49/0x80
[<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
[<ffffffff81093d95>] ? __lock_timer+0xd5/0x1f0
[<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
[<ffffffff81094786>] sys_timer_delete+0x26/0x100
[<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Dave Jones

unread,

Jul 27, 2012, 12:30:01 PM7/27/12

to

On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
>
> Dave
>
> ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> ------------------------------------------------------
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
>
> and this task is already holding:
> blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
> (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
>
> but this new dependency connects a HARDIRQ-irq-safe lock:
> (&(&new_timer->it_lock)->rlock){-.-...}
> ... which became HARDIRQ-irq-safe at:

Shall I start bisecting this ? I can trigger it very easily, but if you
can give me a set of commits to narrow down, it'll speed up the bisection.

Dave

Ming Lei

unread,

Aug 16, 2012, 9:00:02 AM8/16/12

to

On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <da...@redhat.com> wrote:
> On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
> >
> > Dave
> >
> > ======================================================
> > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > 3.5.0+ #122 Not tainted
> > ------------------------------------------------------
> > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> >
> > and this task is already holding:
> > blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> > which would create a new lock dependency:
> > (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> >
> > but this new dependency connects a HARDIRQ-irq-safe lock:
> > (&(&new_timer->it_lock)->rlock){-.-...}
> > ... which became HARDIRQ-irq-safe at:
>
> Shall I start bisecting this ? I can trigger it very easily, but if you
> can give me a set of commits to narrow down, it'll speed up the bisection.

It should a real possible deadlock, could you test the below patch to
see if it can fix the warning?

--
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..29f6a8e 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -420,7 +420,7 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
int ret = 0;

if (likely(p != NULL)) {
- read_lock(&tasklist_lock);
+ /* tasklist_lock held already in timer_delete */
if (unlikely(p->sighand == NULL)) {
/*
* We raced with the reaping of the task.
@@ -435,7 +435,6 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
list_del(&timer->it.cpu.entry);
spin_unlock(&p->sighand->siglock);
}
- read_unlock(&tasklist_lock);

if (!ret)
put_task_struct(p);
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 69185ae..222d24c 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -884,15 +884,31 @@ SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
struct k_itimer *timer;
unsigned long flags;

+ /*
+ * hold tasklist_lock to protect tsk->sighand which might be
+ * accessed inside ->timer_del. It should be held before
+ * timer->it_lock to avoid the below deadlock:
+ * CPU0 CPU1
+ * lock(tasklist_lock)
+ * timer_delete()
+ * lock(timer->it_lock)
+ * lock(tasklist_lock)
+ * <timer interrupt>
+ * lock(timer->it_lock)
+ */
+ read_lock(&tasklist_lock);
retry_delete:
timer = lock_timer(timer_id, &flags);
- if (!timer)
+ if (!timer) {
+ read_unlock(&tasklist_lock);
return -EINVAL;
+ }

if (timer_delete_hook(timer) == TIMER_RETRY) {
unlock_timer(timer, flags);
goto retry_delete;
}
+ read_unlock(&tasklist_lock);

spin_lock(&current->sighand->siglock);
list_del(&timer->list);

Thanks,
--
Ming Lei

Dave Jones

unread,

Aug 16, 2012, 10:10:02 AM8/16/12

to

On Thu, Aug 16, 2012 at 08:54:31PM +0800, Ming Lei wrote:
> On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <da...@redhat.com> wrote:
> > On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
> > > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
> > >
> > > Dave
> > >
> > > ======================================================
> > > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> > > 3.5.0+ #122 Not tainted
> > > ------------------------------------------------------
> > > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > > blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> > >
> > > and this task is already holding:
> > > blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> > > which would create a new lock dependency:
> > > (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> > >
> > > but this new dependency connects a HARDIRQ-irq-safe lock:
> > > (&(&new_timer->it_lock)->rlock){-.-...}
> > > ... which became HARDIRQ-irq-safe at:
> >
> > Shall I start bisecting this ? I can trigger it very easily, but if you
> > can give me a set of commits to narrow down, it'll speed up the bisection.
>
> It should a real possible deadlock, could you test the below patch to
> see if it can fix the warning?

I've not managed to hit it in a while. It seems very dependant upon
specific builds for some reason. Very strange.

Dave

Peter Zijlstra

unread,

Aug 16, 2012, 2:10:02 PM8/16/12

to

On Tue, 2012-07-24 at 16:36 -0400, Dave Jones wrote:

> ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> ------------------------------------------------------
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked: (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
>
> and this task is already holding:
> blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
> (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
>
> but this new dependency connects a HARDIRQ-irq-safe lock:

> to a HARDIRQ-irq-unsafe lock:
> (&(&p->alloc_lock)->rlock){+.+...}

> other info that might help us debug this:
>
> Chain exists of:
> &(&new_timer->it_lock)->rlock --> tasklist_lock --> &(&p->alloc_lock)->rlock
>
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&(&p->alloc_lock)->rlock);
> local_irq_disable();
> lock(&(&new_timer->it_lock)->rlock);
> lock(tasklist_lock);
> <Interrupt>
> lock(&(&new_timer->it_lock)->rlock);
>
> *** DEADLOCK ***
>
> 1 lock on stack by trinity-child2/5327:
> #0: blocked: (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0

> the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:

> [<ffffffff810da83d>] lock_acquire+0xad/0x220
> [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
> [<ffffffff812d5f2e>] keyctl_session_to_parent+0xde/0x490
> [<ffffffff812d634d>] sys_keyctl+0x6d/0x1a0
> [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

> stack backtrace:
> Pid: 5327, comm: trinity-child2 Not tainted 3.5.0+ #122
> Call Trace:
> [<ffffffff810d8194>] check_usage+0x4e4/0x500
> [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
> [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
> [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
> [<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
> [<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
> [<ffffffff810d8956>] ? __lock_acquire+0x306/0x1ae0
> [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
> [<ffffffff810da2a5>] ? lock_release_non_nested+0x175/0x320
> [<ffffffff810da83d>] lock_acquire+0xad/0x220
> [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
> [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
> [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
> [<ffffffff81093d95>] ? __lock_timer+0xd5/0x1f0
> [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> [<ffffffff81094786>] sys_timer_delete+0x26/0x100
> [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

So we have:

sys_keyctl()
keyctl_session_to_parent()
write_lock_irq(&tasklist_lock)
task_lock(parent) parent->alloc_lock

VS

sys_timer_delete()
lock_timer() timer->it_lock
posix_cpu_timer_del()
read_lock(&tasklist_lock)

Creating:

timer->it_lock -> tasklist_lock -> task->alloc_lock

And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
inversion deadlock reported.

The task_lock() in keyctl_session_to_parent() comes from Al who didn't
think it necessary to write a changelog in d35abdb2.

David, Al, anybody want to have a go at fixing this?

Oleg Nesterov

unread,

Aug 17, 2012, 11:20:02 AM8/17/12

to

On 08/16, Peter Zijlstra wrote:
>
> write_lock_irq(&tasklist_lock)
> task_lock(parent) parent->alloc_lock

And this is already wrong. See the comment above task_lock().

> And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> inversion deadlock reported.

Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
with alloc_lock.

> David, Al, anybody want to have a go at fixing this?

I still think that task_work_add() should synhronize with exit_task_work()
itself and fail if necessary. But I wasn't able to convince Al ;)

Oleg.

Oleg Nesterov

unread,

Aug 17, 2012, 11:30:02 AM8/17/12

to

On 08/17, Oleg Nesterov wrote:
>
> On 08/16, Peter Zijlstra wrote:
> >
> > write_lock_irq(&tasklist_lock)
> > task_lock(parent) parent->alloc_lock
>
> And this is already wrong. See the comment above task_lock().
>
> > And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> > inversion deadlock reported.
>
> Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
> with alloc_lock.
>
> > David, Al, anybody want to have a go at fixing this?
>
> I still think that task_work_add() should synhronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

And this is my old patch: http://marc.info/?l=linux-kernel&m=134082268721700
It should be re-diffed of course.

Oleg Nesterov

unread,

Aug 17, 2012, 12:50:01 PM8/17/12

to

On 08/17, Oleg Nesterov wrote:
>
> On 08/17, Oleg Nesterov wrote:
> >
> > On 08/16, Peter Zijlstra wrote:
> > >
> > > write_lock_irq(&tasklist_lock)
> > > task_lock(parent) parent->alloc_lock
> >
> > And this is already wrong. See the comment above task_lock().
> >
> > > And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> > > inversion deadlock reported.
> >
> > Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
> > with alloc_lock.
> >
> > > David, Al, anybody want to have a go at fixing this?
> >
> > I still think that task_work_add() should synhronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> And this is my old patch: http://marc.info/?l=linux-kernel&m=134082268721700
> It should be re-diffed of course.

Something like below. Uncompiled/untested, I need to re-check and test.
Now we can remove that task_lock() and rely on task_work_add().

Al, what do you think?

Oleg.

--- x/include/linux/task_work.h
+++ x/include/linux/task_work.h
@@ -18,8 +18,7 @@ void task_work_run(void);

static inline void exit_task_work(struct task_struct *task)
{
- if (unlikely(task->task_works))
- task_work_run();
+ task_work_run();
}

#endif /* _LINUX_TASK_WORK_H */
--- x/kernel/task_work.c
+++ x/kernel/task_work.c
@@ -2,29 +2,35 @@
#include <linux/task_work.h>
#include <linux/tracehook.h>

+#define TWORK_EXITED ((struct callback_head *)1)
+
int
task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
{
struct callback_head *last, *first;
unsigned long flags;
+ int err = -ESRCH;

/*
- * Not inserting the new work if the task has already passed
- * exit_task_work() is the responisbility of callers.
+ * We must not insert the new work if the exiting task has already
+ * passed task_work_run().
*/
raw_spin_lock_irqsave(&task->pi_lock, flags);
- last = task->task_works;
- first = last ? last->next : twork;
- twork->next = first;
- if (last)
- last->next = twork;
- task->task_works = twork;
+ if (likely(task->task_works != TWORK_EXITED) {
+ last = task->task_works;
+ first = last ? last->next : twork;
+ twork->next = first;
+ if (last)
+ last->next = twork;
+ task->task_works = twork;
+ err = 0;
+ }
raw_spin_unlock_irqrestore(&task->pi_lock, flags);

/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
- if (notify)
+ if (!err && notify)
set_notify_resume(task);
- return 0;
+ return err;
}

struct callback_head *
@@ -35,7 +41,7 @@ task_work_cancel(struct task_struct *tas

raw_spin_lock_irqsave(&task->pi_lock, flags);
last = task->task_works;
- if (last) {
+ if (last && last != TWORK_EXITED) {
struct callback_head *q = last, *p = q->next;
while (1) {
if (p->func == func) {
@@ -63,7 +69,12 @@ void task_work_run(void)
while (1) {
raw_spin_lock_irq(&task->pi_lock);
p = task->task_works;
- task->task_works = NULL;
+ /*
+ * twork->func() can do task_work_add(), do not
+ * set TWORK_EXITED until the list becomes empty.
+ */
+ task->task_works = (!p && (task->flags & PF_EXITING))
+ ? TWORK_EXITED : NULL;
raw_spin_unlock_irq(&task->pi_lock);

if (unlikely(!p))

Peter Zijlstra

unread,

Aug 20, 2012, 3:20:02 AM8/20/12

to

On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> I still think that task_work_add() should synhronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

I'm not at all sure how that relates to needing task_lock() in the
keyctl stuff.

Also, can't task_work use llist stuff? That would also avoid using
->pi_lock.

Peter Zijlstra

unread,

Aug 20, 2012, 7:50:02 AM8/20/12

to

On Mon, 2012-08-20 at 09:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synhronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.
>
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

How about something like the below?

---
include/linux/task_work.h | 7 +--
kernel/exit.c | 2 +-
kernel/task_work.c | 120 ++++++++++++++++++++++++----------------------
3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{

- if (unlikely(task->task_works))
- task_work_run();

-}
+void task_work_exit(void);

#endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
- exit_task_work(tsk);
+ task_work_exit();
check_stack_usage();
exit_thread();

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..e5eac14 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,85 @@
#include <linux/task_work.h>
#include <linux/tracehook.h>

+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = {
+ .next = NULL,
+ .func = task_work_nop,
+};
+
int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
- struct callback_head *last, *first;
- unsigned long flags;
-
- /*

- * Not inserting the new work if the task has already passed
- * exit_task_work() is the responisbility of callers.

- */
- raw_spin_lock_irqsave(&task->pi_lock, flags);

- last = task->task_works;
- first = last ? last->next : twork;
- twork->next = first;
- if (last)
- last->next = twork;
- task->task_works = twork;

- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ struct callback_head **head = &task->task_works;
+ struct callback_head *entry, *old_entry;
+
+ entry = &head;
+ for (;;) {
+ if (entry == &dead)
+ return -ESRCH;
+
+ old_entry = entry;
+ work->next = entry;
+ entry = cmpxchg(head, old_entry, work);
+ if (entry == old_entry)
+ break;
+ }

/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */

if (notify)
set_notify_resume(task);
+
return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
- unsigned long flags;
- struct callback_head *last, *res = NULL;
-
- raw_spin_lock_irqsave(&task->pi_lock, flags);

- last = task->task_works;

- if (last) {
- struct callback_head *q = last, *p = q->next;
- while (1) {
- if (p->func == func) {
- q->next = p->next;
- if (p == last)
- task->task_works = q == p ? NULL : q;
- res = p;
- break;
- }
- if (p == last)
- break;
- q = p;
- p = q->next;
+ struct callback_head **workp, *work;
+
+again:
+ workp = &task->task_works;
+ work = *workp;
+ while (work) {
+ if (work->func == func) {
+ if (cmpxchg(workp, work, work->next) == work)
+ return work;
+ goto again;
}
+
+ workp = &work->next;
+ work = *workp;
}
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
- return res;
+
+ return NULL;
}

-void task_work_run(void)
+static void __task_work_run(struct callback_head *tail)
{
- struct task_struct *task = current;
- struct callback_head *p, *q;
-
- while (1) {
- raw_spin_lock_irq(&task->pi_lock);
- p = task->task_works;

- task->task_works = NULL;

- raw_spin_unlock_irq(&task->pi_lock);
-
- if (unlikely(!p))
- return;
-
- q = p->next; /* head */
- p->next = NULL; /* cut it */
- while (q) {
- p = q->next;
- q->func(q);
- q = p;
+ struct callback_head **head = &current->task_works;
+
+ do {
+ struct callback_head *work = xchg(head, NULL);
+ while (work) {
+ struct callback_head *next = ACCESS_ONCE(work->next);
+
+ WARN_ON_ONCE(work == &dead);
+
+ work->func(work);
+ work = next;
}
- }
+ } while (cmpxchg(head, NULL, tail) != NULL);
+}
+
+void task_work_run(void)
+{
+ __task_work_run(NULL);
+}
+
+void task_work_exit(void)
+{
+ __task_work_run(&dead);

Peter Zijlstra

unread,

Aug 20, 2012, 7:50:02 AM8/20/12

to

On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> > I'm not at all sure how that relates to needing task_lock() in the
> > keyctl stuff.

Ah, is it used to serialize against exit_mm() so that the !->mm check
will suffice to avoid queuing works past exit_task_work() ?

Peter Zijlstra

unread,

Aug 20, 2012, 8:00:02 AM8/20/12

to

On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:

> int

> +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
> {

> + struct callback_head **head = &task->task_works;
> + struct callback_head *entry, *old_entry;
> +
> + entry = &head;

My compiler just alerted me that:

entry = *head;

works a lot better..

> + for (;;) {
> + if (entry == &dead)
> + return -ESRCH;
> +
> + old_entry = entry;
> + work->next = entry;
> + entry = cmpxchg(head, old_entry, work);
> + if (entry == old_entry)
> + break;
> + }
>
> /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
> if (notify)
> set_notify_resume(task);
> +
> return 0;
> }

Steven Rostedt

unread,

Aug 20, 2012, 8:20:02 AM8/20/12

to

On Mon, 2012-08-20 at 13:50 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> > int
> > +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
> > {
> > + struct callback_head **head = &task->task_works;
> > + struct callback_head *entry, *old_entry;
> > +
> > + entry = &head;
>
> My compiler just alerted me that:
>
> entry = *head;
>
> works a lot better..
>
> > + for (;;) {
> > + if (entry == &dead)

But your compiler likes "entry == &dead"? ;-)

-- Steve

Peter Zijlstra

unread,

Aug 20, 2012, 8:30:02 AM8/20/12

to

On Mon, 2012-08-20 at 08:19 -0400, Steven Rostedt wrote:
>
> > > + for (;;) {
> > > + if (entry == &dead)
>
> But your compiler likes "entry == &dead"? ;-)
>

Yes, fancy as GCC is these days, it doesn't quite bother with the
meta-physical questions just yet.

Oleg Nesterov

unread,

Aug 20, 2012, 11:00:02 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synhronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.

To serialize with exit_mm() which clears ->mm.

We shouldn't do task_work_add(task) if the exiting task has already
passed exit_task_work(). There is no way to do this after ed3e694d7
(and I think this is wrong), so keyctl relies on the fact that
exit_task_work() is called after exit_mm().

> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

Not sure.... task_work_add/run can use cmpxchg, but _cancel can't.

Oleg.

Oleg Nesterov

unread,

Aug 20, 2012, 11:10:02 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> task_work_cancel(struct task_struct *task, task_work_func_t func)
> {

> ...

> +
> +again:
> + workp = &task->task_works;
> + work = *workp;
> + while (work) {
> + if (work->func == func) {

But this all can race with task_work_run() if task != current.

This "work" can be freed/reused. And it should only return !NULL
if twork->func() was not called.

Oleg.

Oleg Nesterov

unread,

Aug 20, 2012, 11:10:04 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

Yes, we can add the explicit argument to __task_work_run(), but it can
check PF_EXITING instead, this looks simpler to me.

Note also your patch breaks fifo, but this is fixable.

Oleg.

Peter Zijlstra

unread,

Aug 20, 2012, 11:20:02 AM8/20/12

to

I guess we could.. but I thought the explicit callback was simpler ;-)

> Note also your patch breaks fifo, but this is fixable.

Why do you care about the order? Iterating a single linked queue in fifo
seems more expensive than useful.

Peter Zijlstra

unread,

Aug 20, 2012, 11:20:03 AM8/20/12

to

On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > task_work_cancel(struct task_struct *task, task_work_func_t func)
> > {
> > ...
> > +
> > +again:
> > + workp = &task->task_works;
> > + work = *workp;
> > + while (work) {
> > + if (work->func == func) {
>
> But this all can race with task_work_run() if task != current.
>
> This "work" can be freed/reused. And it should only return !NULL
> if twork->func() was not called.

Ah, because we could be iterating the list after the xchg done by run.

Hmm..

Peter Zijlstra

unread,

Aug 20, 2012, 11:30:03 AM8/20/12

to

On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > On 08/20, Peter Zijlstra wrote:
> > >
> > > task_work_cancel(struct task_struct *task, task_work_func_t func)
> > > {
> > > ...
> > > +
> > > +again:
> > > + workp = &task->task_works;
> > > + work = *workp;
> > > + while (work) {
> > > + if (work->func == func) {
> >
> > But this all can race with task_work_run() if task != current.
> >
> > This "work" can be freed/reused. And it should only return !NULL
> > if twork->func() was not called.
>
> Ah, because we could be iterating the list after the xchg done by run.

I guess we could steal the entire list and requeue it afterwards and
lift TIF_NOTIFY_RESUME along with it.. but I can't say that's pretty.

Oleg Nesterov

unread,

Aug 20, 2012, 11:40:03 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > > On 08/20, Peter Zijlstra wrote:
> > > >
> > > > task_work_cancel(struct task_struct *task, task_work_func_t func)
> > > > {
> > > > ...
> > > > +
> > > > +again:
> > > > + workp = &task->task_works;
> > > > + work = *workp;
> > > > + while (work) {
> > > > + if (work->func == func) {
> > >
> > > But this all can race with task_work_run() if task != current.
> > >
> > > This "work" can be freed/reused. And it should only return !NULL
> > > if twork->func() was not called.
> >
> > Ah, because we could be iterating the list after the xchg done by run.
>
> I guess we could steal the entire list and requeue it afterwards and
> lift TIF_NOTIFY_RESUME along with it..

We can't. This can race with exit_task_work(). And this can break
fput(), the task can return to the userspace without __fput().

> but I can't say that's pretty.

Yes ;)

Oleg.

Peter Zijlstra

unread,

Aug 20, 2012, 11:50:01 AM8/20/12

to

On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:

> > I guess we could steal the entire list and requeue it afterwards and
> > lift TIF_NOTIFY_RESUME along with it..
>
> We can't. This can race with exit_task_work(). And this can break
> fput(), the task can return to the userspace without __fput().

So we could put that spinlock back around cancel and run and leave add
lockless. That'd solve my immediate problem but its not something I'm
proud of.

All schemes I can come up with end up being broken or effectively the
same as the above proposal.

Peter Zijlstra

unread,

Aug 20, 2012, 11:50:01 AM8/20/12

to

On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> BTW, do you think it is really important to try to avoid ->pi_lock?

It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
fudge around and delay, but not having to would be nice.

Oleg Nesterov

unread,

Aug 20, 2012, 11:50:02 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

BTW, do you think it is really important to try to avoid ->pi_lock?

Oleg.

Oleg Nesterov

unread,

Aug 20, 2012, 11:50:02 AM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> >
> > Yes, we can add the explicit argument to __task_work_run(), but it can
> > check PF_EXITING instead, this looks simpler to me.
>
> I guess we could.. but I thought the explicit callback was simpler ;-)

I won't insist. The patch I sent uses PF_EXITING and the fake
"struct callback_head* TWORK_EXITED", but this looks almost the same.

> > Note also your patch breaks fifo, but this is fixable.
>
> Why do you care about the order?

IMHO, this is just more natural.

For example. keyctl_session_to_parent() does _cancel only to protect
from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
loop. It could simply do task_work_add(), but in this case we need
fifo for correctness.

> Iterating a single linked queue in fifo
> seems more expensive than useful.

Currently the list is fifo (we add to the last element), this is O(1).

But the list should be short, we can reverse it in _run() if we change
task_work_add() to add to the head.

Oleg.

Peter Zijlstra

unread,

Aug 20, 2012, 12:00:03 PM8/20/12

to

On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:

> I won't insist. The patch I sent uses PF_EXITING and the fake
> "struct callback_head* TWORK_EXITED", but this looks almost the same.

Right, I used a fake callback_head because it avoided a few special
cases since its a dereferencable pointer.

> > > Note also your patch breaks fifo, but this is fixable.
> >
> > Why do you care about the order?
>
> IMHO, this is just more natural.

Depends on what you're used to I guess ;-) Both RCU and irq_work are
filo, this seems to be the natural way for single linked lists.

> For example. keyctl_session_to_parent() does _cancel only to protect
> from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> loop. It could simply do task_work_add(), but in this case we need
> fifo for correctness.

I'm not entirely sure I see, not doing the cancel would delay the free
until the executing of key_change_session_keyring()? doing that keyctl()
in an indefinite loop involves going back to userspace, so where's the
resource issue?

Also, I'm not seeing where the FIFO requirement comes from.

> > Iterating a single linked queue in fifo
> > seems more expensive than useful.
>
> Currently the list is fifo (we add to the last element), this is O(1).

depends on what way you look at the list I guess, with a single linked
list there's only one end you can add to in O(1), so we're calling that
the tail?

> But the list should be short, we can reverse it in _run() if we change
> task_work_add() to add to the head.

Reversing a (single linked) list is O(n^2).. which is indeed doable for
short lists, but why assume its short?

Oleg Nesterov

unread,

Aug 20, 2012, 12:10:01 PM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
>
> > > I guess we could steal the entire list and requeue it afterwards and
> > > lift TIF_NOTIFY_RESUME along with it..
> >
> > We can't. This can race with exit_task_work(). And this can break
> > fput(), the task can return to the userspace without __fput().
>
> So we could put that spinlock back around cancel and run and leave add
> lockless. That'd solve my immediate problem but its not something I'm
> proud of.

Which problem?

We can probably use bit_spin_lock() and avoid ->pi_lock.

Of course, we can add the new lock into task_struct, but this is not nice.

Oleg.

Peter Zijlstra

unread,

Aug 20, 2012, 12:10:01 PM8/20/12

to

On Mon, 2012-08-20 at 17:58 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
> >
> > > > I guess we could steal the entire list and requeue it afterwards and
> > > > lift TIF_NOTIFY_RESUME along with it..
> > >
> > > We can't. This can race with exit_task_work(). And this can break
> > > fput(), the task can return to the userspace without __fput().
> >
> > So we could put that spinlock back around cancel and run and leave add
> > lockless. That'd solve my immediate problem but its not something I'm
> > proud of.
>
> Which problem?

/me doing task_work_add() from under rq->lock..

> We can probably use bit_spin_lock() and avoid ->pi_lock.

tglx will kill us both for even thinking of bit-spinlocks.

> Of course, we can add the new lock into task_struct, but this is not nice.

If we can limit the lock to cancel/run I'm ok.

Oleg Nesterov

unread,

Aug 20, 2012, 12:10:02 PM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> > BTW, do you think it is really important to try to avoid ->pi_lock?
>
> It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
> fudge around and delay, but not having to would be nice.

Aha, got it.

At first glance, bit_spin_lock() can work. I'll try to make the patches
tomorrow.

Oleg.

Oleg Nesterov

unread,

Aug 20, 2012, 12:20:01 PM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:
>
> > IMHO, this is just more natural.
>
> Depends on what you're used to I guess ;-)

I have to agree ;)

> > For example. keyctl_session_to_parent() does _cancel only to protect
> > from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> > loop. It could simply do task_work_add(), but in this case we need
> > fifo for correctness.
>
> I'm not entirely sure I see, not doing the cancel would delay the free
> until the executing of key_change_session_keyring()? doing that keyctl()
> in an indefinite loop involves going back to userspace, so where's the
> resource issue?

But the child does task_work_add(current->parent). IOW, there are 2
different tasks. Now suppose that ->parent sleeps.

> Also, I'm not seeing where the FIFO requirement comes from.

Again, suppose that ->parent sleeps. The last KEYCTL_SESSION_TO_PARENT
request should "win". Although I agree, this is not that important.

> > > Iterating a single linked queue in fifo
> > > seems more expensive than useful.
> >
> > Currently the list is fifo (we add to the last element), this is O(1).
>
> depends on what way you look at the list I guess, with a single linked
> list there's only one end you can add to in O(1), so we're calling that
> the tail?

Sorry, can't understand...

task->task_works points to the last node in the circular single-linked list,
task_work_add() adds the new element after the last one and updates
task->task_works. This is O(1).

> > But the list should be short, we can reverse it in _run() if we change
> > task_work_add() to add to the head.
>
> Reversing a (single linked) list is O(n^2)..

Hmm. This is O(n). You can simply iterate over this list once, changing
the ->next pointer to point back.

> which is indeed doable for
> short lists, but why assume its short?

I agree, it is better to not do this.

Oleg.

Oleg Nesterov

unread,

Aug 20, 2012, 12:30:02 PM8/20/12

to

On 08/20, Peter Zijlstra wrote:
>

> Anyway, would taking ->pi_lock over _cancel and _run suffice?

I thinks yes... But I can't think properly today.

Peter Zijlstra

unread,

Aug 20, 2012, 12:30:03 PM8/20/12

to

On Mon, 2012-08-20 at 18:10 +0200, Oleg Nesterov wrote:
> task->task_works points to the last node in the circular single-linked list,
> task_work_add() adds the new element after the last one and updates
> task->task_works. This is O(1).

Agreed, the way I was looking at that is: ->task_works points to the
head and we put a new one in front, that too is O(1) ;-)

> > > But the list should be short, we can reverse it in _run() if we change
> > > task_work_add() to add to the head.
> >
> > Reversing a (single linked) list is O(n^2)..
>
> Hmm. This is O(n). You can simply iterate over this list once, changing
> the ->next pointer to point back.

OK, I'm going to stop and step away from the computer now.. clearly I
more than useless today :/

But yeah.. that could be done.

Anyway, would taking ->pi_lock over _cancel and _run suffice?

Oleg Nesterov

unread,

Aug 21, 2012, 2:40:01 PM8/21/12

to

On 08/20, Oleg Nesterov wrote:
>
> On 08/20, Peter Zijlstra wrote:
> >
> > Anyway, would taking ->pi_lock over _cancel and _run suffice?
>
> I thinks yes... But I can't think properly today.

OK, how about the code below?

Oleg.

#define TWORK_EXITED ((struct callback_head*)1)

static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
{
return cmpxchg(ptr, old, new) == old;
}

int
task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
{
do {
twork->next = ACCESS_ONCE(task->task_works);
if (unlikely(twork->next == TWORK_EXITED))
return -ESRCH;
} while (cmp_xchg(&task->task_works, twork->next, twork));

/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);

return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{

struct callback_head *twork = NULL, **pprev = &task->task_works;
unsigned long flags;

raw_spin_lock_irqsave(&task->pi_lock, flags);
if (likely(*pprev != TWORK_EXITED)) {
for (; (twork = *pprev); pprev = &twork) {
read_barrier_depends();
if (twork->func != func)
continue;

while (!cmp_xchg(pprev, twork, twork->next))
// COMMENT ABOUT THE RACE WITH _add()
pprev = &(*pprev)->next;
break;
}
}
raw_spin_unlock_irqrestore(&task->pi_lock, flags);

return twork;
}

void task_work_run(void)
{

struct task_struct *task = current;

struct callback_head *twork, *marker;

for (;;) {
raw_spin_lock_irq(&task->pi_lock);
do {
twork = ACCESS_ONCE(task->task_works);
marker = (!twork && (task->flags & PF_EXITING)) ?
TWORK_EXITED : NULL;
} while (cmp_xchg(&task->task_works, twork, marker));
raw_spin_unlock_irq(&task->pi_lock);

if (!twork)
break;

...run works...

Oleg Nesterov

unread,

Aug 21, 2012, 2:40:01 PM8/21/12

to

On 08/21, Oleg Nesterov wrote:
>
> static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
> {
> return cmpxchg(ptr, old, new) == old;
> }
>
> int
> task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
> {
> do {
> twork->next = ACCESS_ONCE(task->task_works);
> if (unlikely(twork->next == TWORK_EXITED))
> return -ESRCH;
> } while (cmp_xchg(&task->task_works, twork->next, twork));

^^^^^^^^^^^^^^^

while (!cmp_xchg(...))

Oleg.

Oleg Nesterov

unread,

Aug 24, 2012, 3:00:01 PM8/24/12

to

On 08/21, Oleg Nesterov wrote:
>
> > task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
> > {
> > do {
> > twork->next = ACCESS_ONCE(task->task_works);
> > if (unlikely(twork->next == TWORK_EXITED))
> > return -ESRCH;
> > } while (cmp_xchg(&task->task_works, twork->next, twork));
> ^^^^^^^^^^^^^^^
>
> while (!cmp_xchg(...))

and _cancel can be simplified a bit.

OK, I actually tested the code below, seems to work.

Peter, if you think it can work for you and if you agree with
the implementation I will be happy to send the patch.

The only change outside of task_work.c is that exit_task_work()
should call task_work_run() unconditionally.

As for cmp_xchg below... Of course it is not strictly needed.
But otoh, personally I'd like to have this helper (probably
renamed) somewhere in include/linux. Perhaps this is just me,
but cmpxchg() always confuses me, and most users only need
success_or_not from cmpxchg.

Oleg.

#define cmp_xchg(ptr, _old, new) \
({ \
__typeof__(_old) old = (_old); \
cmpxchg(ptr, old, new) == old; \
})

#define TWORK_EXITED ((struct callback_head*)1)

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
do {

work->next = ACCESS_ONCE(task->task_works);

if (unlikely(work->next == TWORK_EXITED))
return -ESRCH;
} while (!cmp_xchg(&task->task_works, work->next, work));

/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);
return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{

struct callback_head **pprev = &task->task_works;
struct callback_head *work = NULL;

unsigned long flags;

raw_spin_lock_irqsave(&task->pi_lock, flags);
if (likely(*pprev != TWORK_EXITED)) {

while ((work = *pprev)) {
read_barrier_depends();
if (work->func != func)
pprev = &work->next;
else if (cmp_xchg(pprev, work, work->next))
break;
}
}
raw_spin_unlock_irqrestore(&task->pi_lock, flags);

return work;

}

void task_work_run(void)
{
struct task_struct *task = current;

struct callback_head *work, *head, *next;

for (;;) {
raw_spin_lock_irq(&task->pi_lock);
do {

work = ACCESS_ONCE(task->task_works);
head = !work && (task->flags & PF_EXITING) ?
TWORK_EXITED : NULL;
} while (!cmp_xchg(&task->task_works, work, head));
raw_spin_unlock_irq(&task->pi_lock);

if (!work)
break;

#if PETERZ_WONT_ARGUE_AGAINST_FIFO_TOO_MUCH
head = NULL;
do {
next = work->next;
work->next = head;
head = work;
work = next;
} while (work);

work = head;
#endif
do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Oleg Nesterov

unread,

Aug 26, 2012, 3:20:01 PM8/26/12

to

On 08/24, Oleg Nesterov wrote:
>
> Peter, if you think it can work for you and if you agree with
> the implementation I will be happy to send the patch.

I think I should try anyway ;)

To simplify the review, I attached the resulting code below.

Changes:

- Comments.

- Not sure this is really better, but task_work_run()
does not need to actually take pi_lock, unlock_wait
is enough.

However, in this case the dummy entry is better than
the fake pointer.

Oleg.

#include <linux/spinlock.h>
#include <linux/task_work.h>
#include <linux/tracehook.h>

static struct callback_head work_exited; /* all we need is ->next == NULL */

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{

struct callback_head *head;

do {
head = ACCESS_ONCE(task->task_works);
if (unlikely(head == &work_exited))
return -ESRCH;
work->next = head;
} while (cmpxchg(&task->task_works, head, work) != head);

if (notify)
set_notify_resume(task);
return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
struct callback_head **pprev = &task->task_works;
struct callback_head *work = NULL;
unsigned long flags;

/*
* If cmpxchg() fails we continue without updating pprev.
* Either we raced with task_work_add() which added the
* new entry before this work, we will find it again. Or
* we raced with task_work_run(), *pprev == NULL/exited.
*/
raw_spin_lock_irqsave(&task->pi_lock, flags);
while ((work = ACCESS_ONCE(*pprev))) {

read_barrier_depends();
if (work->func != func)
pprev = &work->next;

else if (cmpxchg(pprev, work, work->next) == work)

break;
}
raw_spin_unlock_irqrestore(&task->pi_lock, flags);

return work;
}

void task_work_run(void)
{
struct task_struct *task = current;
struct callback_head *work, *head, *next;

for (;;) {

/*
* work->func() can do task_work_add(), do not set
* work_exited unless the list is empty.
*/

do {
work = ACCESS_ONCE(task->task_works);
head = !work && (task->flags & PF_EXITING) ?

&work_exited : NULL;
} while (cmpxchg(&task->task_works, work, head) != work);

if (!work)
break;
/*
* Synchronize with task_work_cancel(). It can't remove
* the first entry == work, cmpxchg(task_works) should
* fail, but it can play with *work and other entries.
*/
raw_spin_unlock_wait(&task->pi_lock);
smp_mb();

/* Reverse the list to run the works in fifo order */

head = NULL;
do {
next = work->next;
work->next = head;
head = work;
work = next;
} while (work);

work = head;

Oleg Nesterov

unread,

Aug 26, 2012, 3:20:01 PM8/26/12

to

ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.

However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.

Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
include/linux/task_work.h | 3 +--
kernel/task_work.c | 22 ++++++++++++++++------
2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..ca5a1cf 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -18,8 +18,7 @@ void task_work_run(void);

static inline void exit_task_work(struct task_struct *task)
{
- if (unlikely(task->task_works))
- task_work_run();
+ task_work_run();
}

#endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index f13ec0b..65bd3c9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,16 +2,17 @@
#include <linux/task_work.h>
#include <linux/tracehook.h>

+static struct callback_head work_exited; /* all we need is ->next == NULL */
+

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
struct callback_head *head;

- /*
- * Not inserting the new work if the task has already passed
- * exit_task_work() is the responisbility of callers.
- */
+

do {
head = ACCESS_ONCE(task->task_works);

+ if (unlikely(head == &work_exited))
+ return -ESRCH;

work->next = head;
} while (cmpxchg(&task->task_works, head, work) != head);

@@ -30,7 +31,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)

* If cmpxchg() fails we continue without updating pprev.
* Either we raced with task_work_add() which added the
* new entry before this work, we will find it again. Or

- * we raced with task_work_run(), *pprev == NULL.
+ * we raced with task_work_run(), *pprev == NULL/exited.

*/
raw_spin_lock_irqsave(&task->pi_lock, flags);
while ((work = ACCESS_ONCE(*pprev))) {

@@ -51,7 +52,16 @@ void task_work_run(void)

struct callback_head *work, *head, *next;

for (;;) {

- work = xchg(&task->task_works, NULL);
+ /*
+ * work->func() can do task_work_add(), do not set
+ * work_exited unless the list is empty.
+ */
+ do {
+ work = ACCESS_ONCE(task->task_works);
+ head = !work && (task->flags & PF_EXITING) ?
+ &work_exited : NULL;
+ } while (cmpxchg(&task->task_works, work, head) != work);
+
if (!work)
break;
/*
--
1.5.5.1

Oleg Nesterov

unread,

Aug 26, 2012, 3:20:02 PM8/26/12

to

This reverts commit d35abdb28824cf74f0a106a0f9c6f3ff700a35bf.

task_lock() was added to ensure exit_mm() and thus exit_task_work() is
not possible before task_work_add().

This is wrong, task_lock() must not be nested with write_lock(tasklist).
And this is no longer needed, task_work_add() fails if it is called
after exit_task_work().

Reported-by: Dave Jones <da...@redhat.com>

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

security/keys/keyctl.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 3364fbf..6cfc647 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1486,7 +1486,6 @@ long keyctl_session_to_parent(void)
oldwork = NULL;
parent = me->real_parent;

- task_lock(parent);
/* the parent mustn't be init and mustn't be a kernel thread */
if (parent->pid <= 1 || !parent->mm)
goto unlock;
@@ -1530,7 +1529,6 @@ long keyctl_session_to_parent(void)
if (!ret)
newwork = NULL;
unlock:
- task_unlock(parent);
write_unlock_irq(&tasklist_lock);
rcu_read_unlock();
if (oldwork)
--
1.5.5.1

Peter Zijlstra

unread,

Aug 28, 2012, 12:30:02 PM8/28/12

to

On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
>
> Peter, if you think it can work for you and if you agree with
> the implementation I will be happy to send the patch.

Yeah I think it would work, but I'm not sure why you're introducing the
cmp_xchg helper just for this..

Anyway, how about something like the below, it pops the works one by one
when running, that way when the cancel will only return NULL when the
work is either being executed or already executed.

( And yeah, I know, its not FIFO ;-)

---
include/linux/task_work.h | 7 +--
kernel/exit.c | 2 +-
kernel/task_work.c | 130 +++++++++++++++++++++++++--------------------
3 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{

- if (unlikely(task->task_works))
- task_work_run();

-}
+void task_work_exit(void);

#endif /* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
- exit_task_work(tsk);
+ task_work_exit();
check_stack_usage();
exit_thread();

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..7767924 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,95 @@
#include <linux/task_work.h>
#include <linux/tracehook.h>

+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = {
+ .next = NULL,
+ .func = task_work_nop,
+};
+
int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
- struct callback_head *last, *first;
- unsigned long flags;
-

- /*
- * Not inserting the new work if the task has already passed
- * exit_task_work() is the responisbility of callers.
- */

- raw_spin_lock_irqsave(&task->pi_lock, flags);
- last = task->task_works;
- first = last ? last->next : twork;
- twork->next = first;
- if (last)
- last->next = twork;
- task->task_works = twork;
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+ struct callback_head **head = &task->task_works;
+ struct callback_head *entry, *old_entry;
+
+ entry = *head;

+ for (;;) {
+ if (entry == &dead)

+ return -ESRCH;
+
+ old_entry = entry;
+ work->next = entry;
+ entry = cmpxchg(head, old_entry, work);
+ if (entry == old_entry)
+ break;
+ }

/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
if (notify)
set_notify_resume(task);

+

return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{

- unsigned long flags;
- struct callback_head *last, *res = NULL;
-
- raw_spin_lock_irqsave(&task->pi_lock, flags);
- last = task->task_works;
- if (last) {
- struct callback_head *q = last, *p = q->next;
- while (1) {
- if (p->func == func) {
- q->next = p->next;
- if (p == last)
- task->task_works = q == p ? NULL : q;
- res = p;
- break;
- }
- if (p == last)
- break;
- q = p;

- p = q->next;

+ struct callback_head **workp, *work;

+
+again:
+ workp = &task->task_works;
+ work = *workp;
+ while (work) {
+ if (work->func == func) {

+ if (cmpxchg(workp, work, work->next) == work)
+ return work;
+ goto again;
}
+
+ workp = &work->next;
+ work = *workp;
}
- raw_spin_unlock_irqrestore(&task->pi_lock, flags);
- return res;
+
+ return NULL;
}

-void task_work_run(void)
+static callback_head *task_work_pop(void)

{
- struct task_struct *task = current;
- struct callback_head *p, *q;
-
- while (1) {
- raw_spin_lock_irq(&task->pi_lock);
- p = task->task_works;
- task->task_works = NULL;
- raw_spin_unlock_irq(&task->pi_lock);
-
- if (unlikely(!p))
- return;
-
- q = p->next; /* head */
- p->next = NULL; /* cut it */
- while (q) {
- p = q->next;
- q->func(q);
- q = p;

- }
+ struct callback_head **head = &current->task_work;
+ struct callback_head *entry, *old_entry;
+
+ entry = *head;
+ for (;;) {
+ if (!entry || entry == &dead)
+ return NULL;
+
+ old_entry = entry;
+ entry = cmpxchg(head, entry, entry->next);
+ if (entry == old_entry)
+ break;
}
+
+ return entry;
+}
+
+void task_work_run(void)
+{
+ struct callback_head *work;
+
+ for (work = task_work_pop(); work; )
+ work->func(work);
+}
+
+void task_work_exit(void)
+{

+ struct callback_head **head = &current->task_works;
+

+again:
+ task_work_run();
+ if (cmpxchg(head, NULL, &dead) != NULL)
+ goto again;

Oleg Nesterov

unread,

Aug 28, 2012, 1:00:02 PM8/28/12

to

On 08/28, Peter Zijlstra wrote:
>
> On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
> >
> > Peter, if you think it can work for you and if you agree with
> > the implementation I will be happy to send the patch.
>
> Yeah I think it would work, but I'm not sure why you're introducing the
> cmp_xchg helper just for this..

Please look at 1-4 the patches I sent (only 1-2 are relevant), I removed
this helper. Although I still think it makes sense, but of course not in
task_work.c.

But you can't dereference this pointer. Without some locking this
can race with another task_work_cancel() or task_work_run(), this
work can be free/unmapped/reused.

> + if (cmpxchg(workp, work, work->next) == work)
> + return work;

Or this can race with task_work_cancel(work) + task_work_add(work).
cmpxchg() can succeed even if work->func is already different.

Well, this obviously means cmpxchg() for each entry...

> ( And yeah, I know, its not FIFO ;-)

Cough. akpm didn't like fifo, Linus disliked it too...

And now you! Whats going on??? ;)

Oleg.

Oleg Nesterov

unread,

Aug 28, 2012, 1:20:02 PM8/28/12

to

On 08/28, Oleg Nesterov wrote:
>
> On 08/28, Peter Zijlstra wrote:
> >
> > +again:
> > + workp = &task->task_works;
> > + work = *workp;
> > + while (work) {
> > + if (work->func == func) {
>
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be free/unmapped/reused.
>
> > + if (cmpxchg(workp, work, work->next) == work)
> > + return work;
>
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different.

Even simpler, this can race with another task_work_cancel() which
is going to remove work->next.

Peter Zijlstra

unread,

Aug 28, 2012, 1:30:02 PM8/28/12

to

On Tue, 2012-08-28 at 19:01 +0200, Oleg Nesterov wrote:
> > struct callback_head *
> > task_work_cancel(struct task_struct *task, task_work_func_t func)
> > {

> > + struct callback_head **workp, *work;
> > +
> > +again:
> > + workp = &task->task_works;
> > + work = *workp;
> > + while (work) {
> > + if (work->func == func) {
>
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be free/unmapped/reused.
>
> > + if (cmpxchg(workp, work, work->next) == work)
> > + return work;
>
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different.

Bah.. you and your races ;-)

Surely we can do this locklessly.. I'll go try harder still.

Oleg Nesterov

unread,

Aug 29, 2012, 11:30:02 AM8/29/12

to

On 08/28, Peter Zijlstra wrote:
>

> Surely we can do this locklessly.. I'll go try harder still.

I doubt...

Even ignore work->func check, somehow you need to ensure that
work->next == new can't be changed durung cmpxchg(..., new).

Anyway, if this is possible, can't you do this on top of 1-4
I sent? There are simple, and solve the problems we discusssed.

Off-topic. This is really minor, bur can't we simplify llist_add()?

static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
struct llist_node *old;

do {
old = ACCESS_ONCE(head->first);
new->next = old;
} while (cmpxchg(&head->first, old, new) != old);

return old == NULL;
}

looks simpler and saves a couple of insns. The likely case should
assume that cmpxchg() succeeds after the 1st attempt. If it fails,
another LOAD from head->first should be very cheap.

And note this ACCESS_ONCE(head->first) above. I think that (in theory)
the current code needs it too. But only in theory, I guess.

Oleg.

Oleg Nesterov

unread,

Sep 6, 2012, 2:00:03 PM9/6/12

to

Ping...

Al, will you agree with these changes?

Peter, do you think you can do your make-it-lockless patch (hehe, I
think this is not possible ;) on top?

Peter Zijlstra

unread,

Sep 6, 2012, 2:40:02 PM9/6/12

to

On Thu, 2012-09-06 at 20:01 +0200, Oleg Nesterov wrote:
> Ping...

Right, email backlog :-)

> Peter, do you think you can do your make-it-lockless patch (hehe, I
> think this is not possible ;) on top?

Sure, I was trying to see if I could play games with the _cancel
semantics that would satisfy the two callsites and be possible. No joy
yet though.

Anyway, do you want me to take these patches through -tip?

Oleg Nesterov

unread,

Sep 7, 2012, 9:20:01 AM9/7/12

to

On 09/06, Peter Zijlstra wrote:
>
> Anyway, do you want me to take these patches through -tip?

This would be great, thanks.

Oleg.

tip-bot for Oleg Nesterov

unread,

Sep 14, 2012, 2:10:02 AM9/14/12

to

Commit-ID: 9da33de62431c7839f98156720862262272a8380
Gitweb: http://git.kernel.org/tip/9da33de62431c7839f98156720862262272a8380
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:11 +0200
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:34 +0200

task_work: task_work_add() should not succeed after exit_task_work()

ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.

However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.

Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>

Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Al Viro <vi...@zeniv.linux.org.uk>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Link: http://lkml.kernel.org/r/2012082619...@redhat.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

tip-bot for Oleg Nesterov

unread,

Sep 14, 2012, 2:20:01 AM9/14/12

to

Commit-ID: b3f68f16dbcde6fcdf0fd27695391ff7e9d41233
Gitweb: http://git.kernel.org/tip/b3f68f16dbcde6fcdf0fd27695391ff7e9d41233
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:14 +0200
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:36 +0200

task_work: Revert "hold task_lock around checks in keyctl"

This reverts commit d35abdb28824cf74f0a106a0f9c6f3ff700a35bf.

task_lock() was added to ensure exit_mm() and thus exit_task_work() is
not possible before task_work_add().

This is wrong, task_lock() must not be nested with write_lock(tasklist).

And this is no longer needed, task_work_add() now fails if it is called

after exit_task_work().

Reported-by: Dave Jones <da...@redhat.com>
Signed-off-by: Oleg Nesterov <ol...@redhat.com>

Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Al Viro <vi...@zeniv.linux.org.uk>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Andrew Morton <ak...@linux-foundation.org>

Link: http://lkml.kernel.org/r/2012082619...@redhat.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

---
security/keys/keyctl.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 3364fbf..6cfc647 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1486,7 +1486,6 @@ long keyctl_session_to_parent(void)
oldwork = NULL;
parent = me->real_parent;

- task_lock(parent);
/* the parent mustn't be init and mustn't be a kernel thread */
if (parent->pid <= 1 || !parent->mm)
goto unlock;
@@ -1530,7 +1529,6 @@ long keyctl_session_to_parent(void)
if (!ret)
newwork = NULL;
unlock:
- task_unlock(parent);
write_unlock_irq(&tasklist_lock);
rcu_read_unlock();
if (oldwork)
--