Transitive visibility in kernel memory model


Dmitry Vyukov

Sep 18, 2015, 6:00:59 AM
to Paul McKenney, kt...@googlegroups.com
Hi Paul,

I've got the following race report (no need to study it in detail; an
explanation follows):

==================================================================
ThreadSanitizer: data-race in __change_pid

Write at 0xffff880481892a48 of size 8 by thread 2962 on CPU 8:
[< inline >] __hlist_del include/linux/list.h:640
[< inline >] hlist_del_rcu include/linux/rculist.h:345
[<ffffffff810b81e5>] __change_pid+0x85/0x120 kernel/pid.c:410
[<ffffffff810b8bc3>] detach_pid+0x23/0x30 kernel/pid.c:422
[< inline >] __unhash_process kernel/exit.c:67
[< inline >] __exit_signal kernel/exit.c:140
[<ffffffff8108ccc2>] release_task+0x5f2/0xaa0 kernel/exit.c:184
[< inline >] wait_task_zombie kernel/exit.c:1111
[<ffffffff8108ef41>] wait_consider_task+0x9f1/0x18a0 kernel/exit.c:1354
[< inline >] do_wait_thread kernel/exit.c:1417
[<ffffffff8108ff9f>] do_wait+0x1af/0x3b0 kernel/exit.c:1488
[< inline >] SYSC_wait4 kernel/exit.c:1615
[<ffffffff810906d5>] SyS_wait4+0xe5/0x160 kernel/exit.c:1584
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Previous atomic read at 0xffff880481892a48 of size 8 by thread 2961 on CPU 5:
[<ffffffff810b7b19>] pid_task+0x29/0x60 kernel/pid.c:445
[<ffffffff810a072d>] kill_pid_info+0x2d/0x90 kernel/signal.c:1341
[< inline >] kill_something_info kernel/signal.c:1426
[<ffffffff810a0903>] SYSC_kill+0x123/0x2f0 kernel/signal.c:2906
[<ffffffff810a3520>] SyS_kill+0x20/0x30 kernel/signal.c:2896
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Mutexes locked by thread 2962:
Mutex 195 is locked here:
[< inline >] __raw_write_lock_irq include/linux/rwlock_api_smp.h:217
[<ffffffff81ee38d4>] _raw_write_lock_irq+0x64/0x70
kernel/locking/spinlock.c:311
[<ffffffff8108c745>] release_task+0x75/0xaa0 kernel/exit.c:182
[< inline >] wait_task_zombie kernel/exit.c:1111
[<ffffffff8108ef41>] wait_consider_task+0x9f1/0x18a0 kernel/exit.c:1354
[< inline >] do_wait_thread kernel/exit.c:1417
[<ffffffff8108ff9f>] do_wait+0x1af/0x3b0 kernel/exit.c:1488
[< inline >] SYSC_wait4 kernel/exit.c:1615
[<ffffffff810906d5>] SyS_wait4+0xe5/0x160 kernel/exit.c:1584
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Mutex 517812 is locked here:
[< inline >] spin_lock include/linux/spinlock.h:312
[< inline >] __exit_signal kernel/exit.c:93
[<ffffffff8108c7e4>] release_task+0x114/0xaa0 kernel/exit.c:184
[< inline >] wait_task_zombie kernel/exit.c:1111
[<ffffffff8108ef41>] wait_consider_task+0x9f1/0x18a0 kernel/exit.c:1354
[< inline >] do_wait_thread kernel/exit.c:1417
[<ffffffff8108ff9f>] do_wait+0x1af/0x3b0 kernel/exit.c:1488
[< inline >] SYSC_wait4 kernel/exit.c:1615
[<ffffffff810906d5>] SyS_wait4+0xe5/0x160 kernel/exit.c:1584
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Mutex 521594 is locked here:
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:158
[<ffffffff81ee37d0>] _raw_spin_lock+0x50/0x70 kernel/locking/spinlock.c:151
[< inline >] spin_lock include/linux/spinlock.h:312
[< inline >] write_seqlock include/linux/seqlock.h:470
[< inline >] __exit_signal kernel/exit.c:127
[<ffffffff8108c871>] release_task+0x1a1/0xaa0 kernel/exit.c:184
[< inline >] wait_task_zombie kernel/exit.c:1111
[<ffffffff8108ef41>] wait_consider_task+0x9f1/0x18a0 kernel/exit.c:1354
[< inline >] do_wait_thread kernel/exit.c:1417
[<ffffffff8108ff9f>] do_wait+0x1af/0x3b0 kernel/exit.c:1488
[< inline >] SYSC_wait4 kernel/exit.c:1615
[<ffffffff810906d5>] SyS_wait4+0xe5/0x160 kernel/exit.c:1584
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

==================================================================

The race is on an "rculist". One thread does an rcu_dereference() of the list head:

struct task_struct *pid_task(struct pid *pid, enum pid_type type)
{
	struct task_struct *result = NULL;
	if (pid) {
		struct hlist_node *first;
		first = rcu_dereference_check(hlist_first_rcu(&pid->tasks[type]),
					      lockdep_tasklist_lock_is_held());
		if (first)
			result = hlist_entry(first, struct task_struct,
					     pids[(type)].node);
	}
	return result;
}
EXPORT_SYMBOL(pid_task);

While another thread deletes the first element of the list:

static inline void __hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;
	*pprev = next;
	if (next)
		next->pprev = pprev;
}

static inline void hlist_del_rcu(struct hlist_node *n)
{
	__hlist_del(n);
	n->pprev = LIST_POISON2;
}

KTSAN does not like the plain assignment "*pprev = next;" in __hlist_del.

As far as I understand, the intention behind it is that when you remove
an element you do not make any new elements available to lock-free
readers, so you don't need a release store (rcu_assign_pointer). Is
that correct?

This is essentially a transitive pointer passage:

// global data
T *g1, *g2;

// thread 1 (the thread that allocated the node)
T *p = alloc_t();
rcu_assign_pointer(g1, p); // release

// thread 2 (the thread that does hlist_del_rcu, moving the node
// from head->next to head)
T *p = g1;
g2 = p;

// thread 3
T *p = rcu_dereference(g2); // acquire
use(p);

The C/C++ memory model requires the intermediate thread to use
acquire/release to pass the pointer, not just the first and last
threads in the chain. However, I am not sure whether that rule
reflects real hardware behavior or is merely abstraction overhead.
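
For concreteness, here is roughly what the C11 model would have the
intermediate thread do (a minimal sketch using stdatomic.h; the type T
is carried over from the pseudocode above, and the exact orderings are
illustrative):

#include <stdatomic.h>

typedef struct T T;		/* stand-in for the node type above */
extern _Atomic(T *) g1, g2;

void thread2(void)
{
	/* What C11 formally requires of the middle of the chain:
	 * an acquire (or consume) load paired with a release store,
	 * even though thread 2 never dereferences the pointer. */
	T *p = atomic_load_explicit(&g1, memory_order_acquire);
	atomic_store_explicit(&g2, p, memory_order_release);
}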

Can you think of any architectures where such transitive pointer
passage w/o barriers can cause issues?

Thank you

--
Dmitry Vyukov, Software Engineer, dvy...@google.com
Google Germany GmbH, Dienerstraße 12, 80331, München
Geschäftsführer: Graham Law, Christine Elizabeth Flores
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Wenn Sie nicht der richtige Adressat
sind, leiten Sie diese bitte nicht weiter, informieren Sie den
Absender und löschen Sie die E-Mail und alle Anhänge. Vielen Dank.
This e-mail is confidential. If you are not the right addressee please
do not forward it, please inform the sender, and please erase this
e-mail including any attachments. Thanks.

Paul E. McKenney

Sep 18, 2015, 11:53:00 AM
to Dmitry Vyukov, kt...@googlegroups.com
The "*pprev = next;" assignment above needs at least a WRITE_ONCE(),
good catch. Please see below for a patch for this class of issue.
Peter Zijlstra beat me to the one in RCU_INIT_POINTER(). ;-)

> 	if (next)
> 		next->pprev = pprev;

Readers should not be traversing backwards, so this can be as is.

> }
>
> static inline void hlist_del_rcu(struct hlist_node *n)
> {
> 	__hlist_del(n);
> 	n->pprev = LIST_POISON2;
> }
>
> KTSAN does not like the plain assignment "*pprev = next;" in __hlist_del.

As well it should not.
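
To see why: the plain store leaves the compiler free to tear the write.
A hypothetical (but standard-conforming) transformation of
"*pprev = next;" on a 64-bit little-endian target:

	/* The compiler may legally split the plain store into pieces,
	 * so a concurrent lockless reader can observe a half-updated
	 * pointer (illustrative compiler output, not real code): */
	((u32 *)pprev)[0] = (u32)(unsigned long)next;		/* low half */
	((u32 *)pprev)[1] = (u32)((unsigned long)next >> 32);	/* high half */

	/* WRITE_ONCE() instead forces a single volatile store of the
	 * full word, which readers see either entirely or not at all: */
	WRITE_ONCE(*pprev, next);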

> As far as I understand, the intention behind it is that when you remove
> an element you do not make any new elements available to lock-free
> readers, so you don't need a release store (rcu_assign_pointer). Is
> that correct?

Exactly!

> This is essentially a transitive pointer passage:
>
> // global data
> T *g1, *g2;
>
> // thread 1 (the thread that allocated the node)
> T *p = alloc_t();
> rcu_assign_pointer(g1, p); // release
>
> // thread 2 (the thread that does hlist_del_rcu, moving the node
> // from head->next to head)
> T *p = g1;
> g2 = p;
>
> // thread 3
> T *p = rcu_dereference(g2); // acquire
> use(p);
>
> The C/C++ memory model requires the intermediate thread to use
> acquire/release to pass the pointer, not just the first and last
> threads in the chain. However, I am not sure whether that rule
> reflects real hardware behavior or is merely abstraction overhead.
>
> Can you think of any architectures where such transitive pointer
> passage w/o barriers can cause issues?

Well, normally the insertions and deletions would be protected by
locks, which would provide some additional ordering. But even ignoring
that, I believe that this should work.

Let's start with TSO architectures (x86, s390, SPARC, ...). The pattern
in question has no store followed by a load to a different location,
which is the only reordering TSO permits, so we should get full ordering.

On Alpha, we have a full barrier on the rcu_dereference(), which will
give the needed ordering.

On PowerPC and ARM, the combination of coherence order and address
dependencies forces the needed ordering. On PowerPC, the following
litmus test covers it:

------------------------------------------------------------------------

PPC hlist-del
""
(* Dmitry Vyukov's hlist_del() question: Can hlist_del() omit release
barrier? This test ignores locking. *)
{
x=x; y=41; z=42;
0:r1=1; 0:r2=x; 0:r4=y; 0:r6=z;
1:r2=x; 1:r4=y; 1:r6=z;
2:r2=x; 2:r4=y; 2:r6=z;
}
 P0           | P1           | P2           ;
 stw r1,0(r6) | lwz r3,0(r2) | lwz r3,0(r2) ;
 lwsync       | stw r6,0(r2) | lwz r5,0(r3) ;
 stw r6,0(r2) |              |              ;
 stw r6,0(r4) |              |              ;
 lwsync       |              |              ;
 stw r4,0(r2) |              |              ;

exists
(1:r3=y /\ 2:r5=42 /\ 2:r3=2:r3)

------------------------------------------------------------------------

And here is what herd (http://lwn.net/Articles/608550/) says:

Test hlist-del Allowed
States 16
1:r3=x; 2:r3=x; 2:r5=x;
1:r3=x; 2:r3=x; 2:r5=y;
1:r3=x; 2:r3=x; 2:r5=z;
1:r3=x; 2:r3=y; 2:r5=z;
1:r3=x; 2:r3=z; 2:r5=1;
1:r3=x; 2:r3=z; 2:r5=42;
1:r3=y; 2:r3=x; 2:r5=x;
1:r3=y; 2:r3=x; 2:r5=y;
1:r3=y; 2:r3=x; 2:r5=z;
1:r3=y; 2:r3=y; 2:r5=z;
1:r3=y; 2:r3=z; 2:r5=1;
1:r3=z; 2:r3=x; 2:r5=x;
1:r3=z; 2:r3=x; 2:r5=y;
1:r3=z; 2:r3=x; 2:r5=z;
1:r3=z; 2:r3=y; 2:r5=z;
1:r3=z; 2:r3=z; 2:r5=1;
No
Witnesses
Positive: 0 Negative: 45
Condition exists (1:r3=y /\ 2:r5=42 /\ 2:r3=2:r3)
Observation hlist-del Never 0 45
Hash=ece02a524b2678ee6459d90490c75056

The "Never" says that the PowerPC architecture does not allow the
misordering. I get similar results with another litmus test that does
two deletions from a four-element linked list. This of course does not
prove anything, but it does lend credence to the analysis saying that it
cannot happen. And there has been discussion of making release sequences
less strict.

Thanx, Paul

------------------------------------------------------------------------

rculist: Use WRITE_ONCE() when deleting from reader-visible list

The various RCU list-deletion macros (list_del_rcu(),
hlist_del_init_rcu(), hlist_del_rcu(), hlist_bl_del_init_rcu(),
hlist_bl_del_rcu(), hlist_nulls_del_init_rcu(), and hlist_nulls_del_rcu())
do plain stores into the ->next pointer of the preceding list element.
Unfortunately, the compiler is within its rights to (for example) use
byte-at-a-time writes to update the pointer, which would fatally confuse
concurrent readers. This patch therefore adds the needed WRITE_ONCE()
macros.

KernelThreadSanitizer (KTSAN) reported the __hlist_del() issue, which
is a problem when __hlist_del() is invoked by hlist_del_rcu().

Reported-by: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

diff --git a/include/linux/list.h b/include/linux/list.h
index 3e3e64a61002..993395a2e55c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -87,7 +87,7 @@ static inline void list_add_tail(struct list_head *new, struct list_head *head)
 static inline void __list_del(struct list_head * prev, struct list_head * next)
 {
 	next->prev = prev;
-	prev->next = next;
+	WRITE_ONCE(prev->next, next);
 }

/**
@@ -615,7 +615,8 @@ static inline void __hlist_del(struct hlist_node *n)
 {
 	struct hlist_node *next = n->next;
 	struct hlist_node **pprev = n->pprev;
-	*pprev = next;
+
+	WRITE_ONCE(*pprev, next);
 	if (next)
 		next->pprev = pprev;
 }
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 2eb88556c5c5..6cc6e7230e30 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -93,7 +93,8 @@ static inline void __hlist_bl_del(struct hlist_bl_node *n)
 	LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);

 	/* pprev may be `first`, so be careful not to lose the lock bit */
-	*pprev = (struct hlist_bl_node *)
-			((unsigned long)next |
-			 ((unsigned long)*pprev & LIST_BL_LOCKMASK));
+	WRITE_ONCE(*pprev,
+		   (struct hlist_bl_node *)
+			((unsigned long)next |
+			 ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
 	if (next)
diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index f266661d2666..444d2b1313bd 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -76,7 +76,8 @@ static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
 {
 	struct hlist_nulls_node *next = n->next;
 	struct hlist_nulls_node **pprev = n->pprev;
-	*pprev = next;
+
+	WRITE_ONCE(*pprev, next);
 	if (!is_a_nulls(next))
 		next->pprev = pprev;
 }

Dmitry Vyukov

Sep 19, 2015, 7:38:18 AM
to Paul McKenney, kt...@googlegroups.com
Thanks for the detailed explanation.

Regarding the patch, I see that you added WRITE_ONCE to __list_del,
which is a plain single-threaded primitive. If that's considered OK,
then I also see lots of races between list_del/add and list_empty. For
example:

if (!list_empty(list)) {
	lock(list_lock);
	if (!list_empty(list)) {
		/* do something with the list */
	}
	unlock(list_lock);
}

This is mostly OK in practice, because the list_empty() predicate only
inspects the list head's own pointer and does not dereference any
element. But it is still disturbing to see all these data race reports.
What I ended up with is a parallel set of APIs (list_del_once(),
list_add_once(), list_empty_once()) that use READ_ONCE()/WRITE_ONCE().
But it is somewhat clumsy and doubles the list API.
Can we extend your patch to add READ_ONCE()/WRITE_ONCE() to
list_add()/list_empty() as well?

Paul E. McKenney

Sep 20, 2015, 12:25:25 AM
to Dmitry Vyukov, kt...@googlegroups.com
Like what is shown below, you mean? Given that llist_empty() already uses
ACCESS_ONCE(), it seems like it should be OK. ;-)
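
(For reference, llist_empty() in include/linux/llist.h is already
defined along these lines:)

static inline bool llist_empty(const struct llist_head *head)
{
	/* The head is read with ACCESS_ONCE() precisely because this
	 * check is meant to be usable without holding any lock. */
	return ACCESS_ONCE(head->first) == NULL;
}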

> --
> Dmitry Vyukov, Software Engineer, dvy...@google.com
> Google Germany GmbH, Dienerstraße 12, 80331, München

Ah, you moved?


------------------------------------------------------------------------

diff --git a/include/linux/list.h b/include/linux/list.h
index 993395a2e55c..122c225f9104 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -186,7 +186,7 @@ static inline int list_is_last(const struct list_head *list,
  */
 static inline int list_empty(const struct list_head *head)
 {
-	return head->next == head;
+	return READ_ONCE(head->next) == head;
 }

/**
@@ -608,7 +608,7 @@ static inline int hlist_unhashed(const struct hlist_node *h)

 static inline int hlist_empty(const struct hlist_head *h)
 {
-	return !h->first;
+	return !READ_ONCE(h->first);
 }

static inline void __hlist_del(struct hlist_node *n)
diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 8132214e8efd..ee7229a6c06a 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -70,7 +70,7 @@ static inline void hlist_bl_set_first(struct hlist_bl_head *h,

 static inline int hlist_bl_empty(const struct hlist_bl_head *h)
 {
-	return !((unsigned long)h->first & ~LIST_BL_LOCKMASK);
+	return !((unsigned long)READ_ONCE(h->first) & ~LIST_BL_LOCKMASK);
 }

static inline void hlist_bl_add_head(struct hlist_bl_node *n,
diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index 444d2b1313bd..b01fe1009084 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -57,7 +57,7 @@ static inline int hlist_nulls_unhashed(const struct hlist_nulls_node *h)

 static inline int hlist_nulls_empty(const struct hlist_nulls_head *h)
 {
-	return is_a_nulls(h->first);
+	return is_a_nulls(READ_ONCE(h->first));
 }

static inline void hlist_nulls_add_head(struct hlist_nulls_node *n,

Dmitry Vyukov

Sep 20, 2015, 6:08:22 AM
to Paul McKenney, kt...@googlegroups.com
On Sun, Sep 20, 2015 at 6:25 AM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
No, I mean races on plain list, not llist. Like this one:
https://github.com/google/ktsan/commit/def7229cbb4e0d697789a4582b0881f09e3b9d7c
I've seen 5 or more similar ones.


>> --
>> Dmitry Vyukov, Software Engineer, dvy...@google.com
>> Google Germany GmbH, Dienerstraße 12, 80331, München
>
> Ah, you moved?


Yes, I had to move from Moscow, Russia to Munich, Germany in June due
to the closure of Google's Moscow engineering office. I enjoy it so far
(tons of paperwork aside). Oktoberfest started just yesterday; I plan to
visit it today with family.
--
Dmitry Vyukov, Software Engineer, dvy...@google.com
Google Germany GmbH, Dienerstraße 12, 80331, München

Paul E. McKenney

Sep 20, 2015, 1:07:25 PM
to Dmitry Vyukov, kt...@googlegroups.com
On Sun, Sep 20, 2015 at 12:08:02PM +0200, Dmitry Vyukov wrote:
> On Sun, Sep 20, 2015 at 6:25 AM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> > On Sat, Sep 19, 2015 at 01:37:58PM +0200, Dmitry Vyukov wrote:
> >> On Fri, Sep 18, 2015 at 5:52 PM, Paul E. McKenney
> >> <pau...@linux.vnet.ibm.com> wrote:
> >> > On Fri, Sep 18, 2015 at 12:00:39PM +0200, Dmitry Vyukov wrote:

[ . . . ]

> >> > The "Never" says that the PowerPC architecture does not allow the
> >> > misordering. I get similar results with another litmus test that does
> >> > two deletions from a four-element linked list. This of course does not
> >> > prove anything, but it does lend credence to the analysis saying that it
> >> > cannot happen. And there has been discussion of making release sequences
> >> > less strict.
> >>
> >>
> >> Thanks for the detailed explanation.
> >>
> >> Regarding the patch, I see that you added WRITE_ONCE to __list_del,
> >> which is a plain single-threaded primitive. If that's considered OK,
> >> then I also see lots of races between list_del/add and list_empty. For
> >> example:
> >>
> >> if (!list_empty(list)) {
> >> lock(list_lock);
> >> if (!list_empty(list)) {
> >> do something with the list
> >> }
> >> unlock(list_lock);
> >> }
> >>
> >> This is kind-of mostly OK, because list_empty predicate only checks
> >> list head with NULL, but does not do any dereferences. But it is still
> >> disturbing to see all these data race reports.
> >> What I ended up with is a parallel set of APIs like list_del_once,
> >> list_add_once, list_empty_once which use READ/WRITE_ONCE. But it is
> >> somewhat clumsy and doubles the list API.
> >> Can we extend your patch to add READ/WRITE_ONCE to list_add/empty as well?

Yes, I had that patch in the earlier email, but it was late and I wasn't
being very clear in describing it. Please see the end of this email.
Or the earlier one, for that matter. Though I was thinking in terms of
RCU-protected lists, so it is incomplete. More patch later.

Understood. My point was that llist_empty() already has an ACCESS_ONCE(),
so there should not be objections to adding READ_ONCE() to the various
other *list_empty() primitives, which the patch below does.

That said, I was thinking in terms of RCU-protected lists, and your
patch above shows that WRITE_ONCE() is needed in all of the various list
mutators, not just __list_del(). So I will put together another patch
for that.

> >> --
> >> Dmitry Vyukov, Software Engineer, dvy...@google.com
> >> Google Germany GmbH, Dienerstraße 12, 80331, München
> >
> > Ah, you moved?
>
> Yes, I had to move from Moscow, Russia to Munich, Germany in June due
> to Moscow Google engineering office closure. I enjoy it so far (tons
> of paperwork aside). Oktoberfest started just yesterday, plan to visit
> it today with family.

Sounds good!

I have only been to Munich once, but in early December rather than
during Oktoberfest. Nevertheless, the old town was quite interesting
with lots of street vendors. Including one that had some puzzles that
my wife found to be quite challenging. ;-)

Me, I am in Bellevue (near Seattle) for a conference. Today is getting
my presentation in shape, plus a hike out near Issaquah.

Thanx, Paul

Paul E. McKenney

Sep 21, 2015, 10:37:08 AM
to Dmitry Vyukov, kt...@googlegroups.com
On Sun, Sep 20, 2015 at 10:07:21AM -0700, Paul E. McKenney wrote:
> On Sun, Sep 20, 2015 at 12:08:02PM +0200, Dmitry Vyukov wrote:
> > On Sun, Sep 20, 2015 at 6:25 AM, Paul E. McKenney
> > <pau...@linux.vnet.ibm.com> wrote:
> > > On Sat, Sep 19, 2015 at 01:37:58PM +0200, Dmitry Vyukov wrote:
> > >> On Fri, Sep 18, 2015 at 5:52 PM, Paul E. McKenney
> > >> <pau...@linux.vnet.ibm.com> wrote:
> > >> > On Fri, Sep 18, 2015 at 12:00:39PM +0200, Dmitry Vyukov wrote:

[ . . . ]

> > > Like shown below, you mean? Given that llist_empty() already uses
> > > ACCESS_ONCE(), seems like it should be OK. ;-)
> >
> > No, I mean races on plain list, not llist. Like this one:
> > https://github.com/google/ktsan/commit/def7229cbb4e0d697789a4582b0881f09e3b9d7c
> > I've seen 5 or more similar ones.
>
> Understood. My point was that llist_empty() already has an ACCESS_ONCE(),
> so there should not be objections to adding READ_ONCE() to the various
> other *list_empty() primitives, which the patch below does.
>
> That said, I was thinking in terms of RCU-protected lists, and your
> patch above shows that WRITE_ONCE() is needed in all of the various list
> mutators, not just __list_del(). So I will put together another patch
> for that.

And here is an untested patch for list and hlist additions. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

list: Use WRITE_ONCE() when adding to lists and hlists

Code that does lockless emptiness testing of non-RCU lists is relying
on the list-addition code to write the list head's ->next pointer
atomically. This commit therefore adds WRITE_ONCE() to list-addition
pointer stores that could affect the head's ->next pointer.

Reported-by: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

diff --git a/include/linux/list.h b/include/linux/list.h
index 993395a2e55c..d7e31fe398b3 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -42,7 +42,7 @@ static inline void __list_add(struct list_head *new,
 	next->prev = new;
 	new->next = next;
 	new->prev = prev;
-	prev->next = new;
+	WRITE_ONCE(prev->next, new);
 }
#else
extern void __list_add(struct list_head *new,
@@ -642,7 +642,7 @@ static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
 	n->next = first;
 	if (first)
 		first->pprev = &n->next;
-	h->first = n;
+	WRITE_ONCE(h->first, n);
 	n->pprev = &h->first;
 }

@@ -653,14 +653,14 @@ static inline void hlist_add_before(struct hlist_node *n,
 	n->pprev = next->pprev;
 	n->next = next;
 	next->pprev = &n->next;
-	*(n->pprev) = n;
+	WRITE_ONCE(*(n->pprev), n);
 }

 static inline void hlist_add_behind(struct hlist_node *n,
 				    struct hlist_node *prev)
 {
 	n->next = prev->next;
-	prev->next = n;
+	WRITE_ONCE(prev->next, n);
 	n->pprev = &prev->next;

 	if (n->next)
diff --git a/lib/list_debug.c b/lib/list_debug.c
index c24c2f7e296f..3859bf63561c 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -37,7 +37,7 @@ void __list_add(struct list_head *new,
 	next->prev = new;
 	new->next = next;
 	new->prev = prev;
-	prev->next = new;
+	WRITE_ONCE(prev->next, new);
 }
 EXPORT_SYMBOL(__list_add);


Dmitry Vyukov

Sep 21, 2015, 10:40:07 AM
to Paul McKenney, kt...@googlegroups.com
On Mon, Sep 21, 2015 at 7:08 AM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
> On Sun, Sep 20, 2015 at 10:07:21AM -0700, Paul E. McKenney wrote:
>> On Sun, Sep 20, 2015 at 12:08:02PM +0200, Dmitry Vyukov wrote:
>> > On Sun, Sep 20, 2015 at 6:25 AM, Paul E. McKenney
>> > <pau...@linux.vnet.ibm.com> wrote:
>> > > On Sat, Sep 19, 2015 at 01:37:58PM +0200, Dmitry Vyukov wrote:
>> > >> On Fri, Sep 18, 2015 at 5:52 PM, Paul E. McKenney
>> > >> <pau...@linux.vnet.ibm.com> wrote:
>> > >> > On Fri, Sep 18, 2015 at 12:00:39PM +0200, Dmitry Vyukov wrote:
>
> [ . . . ]
>
>> > > Like shown below, you mean? Given that llist_empty() already uses
>> > > ACCESS_ONCE(), seems like it should be OK. ;-)
>> >
>> > No, I mean races on plain list, not llist. Like this one:
>> > https://github.com/google/ktsan/commit/def7229cbb4e0d697789a4582b0881f09e3b9d7c
>> > I've seen 5 or more similar ones.
>>
>> Understood. My point was that llist_empty() already has an ACCESS_ONCE(),
>> so there should not be objections to adding READ_ONCE() to the various
>> other *list_empty() primitives, which the patch below does.
>>
>> That said, I was thinking in terms of RCU-protected lists, and your
>> patch above shows that WRITE_ONCE() is needed in all of the various list
>> mutators, not just __list_del(). So I will put together another patch
>> for that.
>
> And here is an untested patch for list and hlist additions. Thoughts?

Looks good to me

Dmitry Vyukov

Sep 22, 2015, 4:35:46 AM
to Paul McKenney, kt...@googlegroups.com
On Sun, Sep 20, 2015 at 7:07 PM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
Yeah, the city is very nice. And at the same time it is the capital of
the largest Land (state) in the most developed EU country.


> Me, I am in Bellevue (near Seattle) for a conference. Today is getting
> my presentation in shape, plus a hike out near Issaquah.

Pictures of Issaquah look very beautiful... if I found the right
Issaquah :) Good luck there

Dmitry Vyukov

Oct 6, 2015, 11:20:10 AM
to ktsan, dvy...@google.com, pau...@linux.vnet.ibm.com
Paul, what is the status of the patch?
I can't find any other references to it on the internet. We badly need it to overcome "benign" races with KTSAN.

Paul E. McKenney

Oct 6, 2015, 11:32:55 AM
to Dmitry Vyukov, ktsan
> > Reported-by: Dmitry Vyukov <dvy...@google.com>
> > Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
I will be pushing this into the next merge window, assuming testing continues
to go well.

Thanx, Paul

Dmitry Vyukov

Oct 6, 2015, 11:34:06 AM
to Paul McKenney, ktsan
Thanks!
Looking forward to it!

Andrey Konovalov

Oct 6, 2015, 11:43:26 AM
to Dmitry Vyukov, Paul McKenney, ktsan
Hi Paul,

I'm also seeing reports with racy plist_node_empty() checks; one of them
is below. In this particular case plist_node_empty() calls list_empty(),
so if the latter is fixed, that should be enough, but perhaps plist
might require some other fixes.
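
(For reference, plist_node_empty() in include/linux/plist.h reduces
directly to list_empty(), roughly as follows:)

static inline int plist_node_empty(const struct plist_node *node)
{
	/* A racy caller of plist_node_empty() is therefore covered by
	 * the READ_ONCE() fix to list_empty() itself. */
	return list_empty(&node->node_list);
}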

==================================================================
ThreadSanitizer: data-race in plist_del

Write at 0xffff88028844fd10 of size 8 by thread 2415 on CPU 3:
[< inline >] INIT_LIST_HEAD include/linux/list.h:27
[< inline >] list_del_init include/linux/list.h:158
[<ffffffff8153e6cd>] plist_del+0xed/0x160 lib/plist.c:131
[<ffffffff8112afe7>] __unqueue_futex+0x67/0xc0 kernel/futex.c:1087
[<ffffffff8112b0c6>] mark_wake_futex+0x86/0xb0 kernel/futex.c:1109
[<ffffffff8112c46e>] futex_wake+0x20e/0x270 kernel/futex.c:1257
[<ffffffff8112f45c>] do_futex+0x14c/0x1090 kernel/futex.c:2970
[< inline >] SYSC_futex kernel/futex.c:3024
[<ffffffff81130423>] SyS_futex+0x83/0x190 kernel/futex.c:2994
[<ffffffff81ea89d1>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Previous read at 0xffff88028844fd10 of size 8 by thread 2413 on CPU 0:
[< inline >] list_empty include/linux/list.h:201
[< inline >] plist_node_empty include/linux/plist.h:223
[<ffffffff8112bbea>] futex_wait_queue_me+0x21a/0x240 kernel/futex.c:2087
[<ffffffff8112c84a>] futex_wait+0x18a/0x3b0 kernel/futex.c:2209
[<ffffffff8112f489>] do_futex+0x179/0x1090 kernel/futex.c:2966
[< inline >] SYSC_futex kernel/futex.c:3024
[<ffffffff81130423>] SyS_futex+0x83/0x190 kernel/futex.c:2994
[<ffffffff81ea89d1>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

Mutexes locked by thread 2415:
Mutex 382300 is locked here:
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:158
[<ffffffff81ea8350>] _raw_spin_lock+0x50/0x70 kernel/locking/spinlock.c:151
[< inline >] spin_lock include/linux/spinlock.h:312
[<ffffffff8112c352>] futex_wake+0xf2/0x270 kernel/futex.c:1244
[<ffffffff8112f45c>] do_futex+0x14c/0x1090 kernel/futex.c:2970
[< inline >] SYSC_futex kernel/futex.c:3024
[<ffffffff81130423>] SyS_futex+0x83/0x190 kernel/futex.c:2994
[<ffffffff81ea89d1>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:188

==================================================================

Thanks!

Paul E. McKenney

Oct 8, 2015, 7:54:23 PM
to Andrey Konovalov, Dmitry Vyukov, ktsan
On Tue, Oct 06, 2015 at 05:43:25PM +0200, Andrey Konovalov wrote:
> Hi Paul,
>
> I'm also seeing reports with racy plist_node_empty checks, one of them is below.
> In this very case plist_node_empty calls list_empty and if the latter
> is fixed, it should be enough, but perhaps plist might require some
> other fixes.

Hmmmm... Does your .config file have CONFIG_DEBUG_LIST set?

Thanx, Paul

Andrey Konovalov

Oct 9, 2015, 6:32:06 AM
to Paul McKenney, Dmitry Vyukov, ktsan
On Fri, Oct 9, 2015 at 1:54 AM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
> On Tue, Oct 06, 2015 at 05:43:25PM +0200, Andrey Konovalov wrote:
>> Hi Paul,
>>
>> I'm also seeing reports with racy plist_node_empty checks, one of them is below.
>> In this very case plist_node_empty calls list_empty and if the latter
>> is fixed, it should be enough, but perhaps plist might require some
>> other fixes.
>
> Hmmmm... Does your .config file have CONFIG_DEBUG_LIST set?

No, it doesn't.

Thanks.

Paul E. McKenney

Oct 12, 2015, 8:13:09 PM
to Andrey Konovalov, Dmitry Vyukov, ktsan
On Fri, Oct 09, 2015 at 12:32:05PM +0200, Andrey Konovalov wrote:
> On Fri, Oct 9, 2015 at 1:54 AM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> > On Tue, Oct 06, 2015 at 05:43:25PM +0200, Andrey Konovalov wrote:
> >> Hi Paul,
> >>
> >> I'm also seeing reports with racy plist_node_empty checks, one of them is below.
> >> In this very case plist_node_empty calls list_empty and if the latter
> >> is fixed, it should be enough, but perhaps plist might require some
> >> other fixes.
> >
> > Hmmmm... Does your .config file have CONFIG_DEBUG_LIST set?
>
> No, it doesn't.

Ah, I get it. We have INIT_LIST_HEAD() racing against list_empty(),
and INIT_LIST_HEAD() is not marked. I was initially a bit reluctant
to mark INIT_LIST_HEAD(), but the invocation from within list_del_init()
is pretty legit.
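
(For context, list_del_init() reinitializes the removed entry in place,
roughly as follows, so a lockless list_empty() on that entry races with
INIT_LIST_HEAD()'s store to ->next:)

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);	/* unlink from the old list */
	INIT_LIST_HEAD(entry);		/* ->next store must be atomic */
}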

So, does the patch below help?

Thanx, Paul

------------------------------------------------------------------------

commit d5f0e318f27f9e952cf72f5d9dabf729864a226a
Author: Paul E. McKenney <pau...@linux.vnet.ibm.com>
Date: Mon Oct 12 16:56:42 2015 -0700

list: Use WRITE_ONCE() when initializing list_head structures

Code that does lockless emptiness testing of non-RCU lists is relying
on INIT_LIST_HEAD() to write the list head's ->next pointer atomically,
particularly when INIT_LIST_HEAD() is invoked from list_del_init().
This commit therefore adds WRITE_ONCE() to this function's pointer stores
that could affect the head's ->next pointer.

Reported-by: Andrey Konovalov <andre...@google.com>
Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

diff --git a/include/linux/list.h b/include/linux/list.h
index 06c2d887a918..5356f4d661a7 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -24,7 +24,7 @@

 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
-	list->next = list;
+	WRITE_ONCE(list->next, list);
 	list->prev = list;
 }


Dmitry Vyukov

Oct 13, 2015, 2:02:02 AM
to Paul McKenney, Andrey Konovalov, ktsan
On Tue, Oct 13, 2015 at 2:13 AM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
> On Fri, Oct 09, 2015 at 12:32:05PM +0200, Andrey Konovalov wrote:
>> On Fri, Oct 9, 2015 at 1:54 AM, Paul E. McKenney
>> <pau...@linux.vnet.ibm.com> wrote:
>> > On Tue, Oct 06, 2015 at 05:43:25PM +0200, Andrey Konovalov wrote:
>> >> Hi Paul,
>> >>
>> >> I'm also seeing reports with racy plist_node_empty checks, one of them is below.
>> >> In this very case plist_node_empty calls list_empty and if the latter
>> >> is fixed, it should be enough, but perhaps plist might require some
>> >> other fixes.
>> >
>> > Hmmmm... Does your .config file have CONFIG_DEBUG_LIST set?
>>
>> No, it doesn't.
>
> Ah, I get it. We have INIT_LIST_HEAD() racing against list_empty(),
> and INIT_LIST_HEAD() is not marked. I was initially a bit reluctant
> to mark INIT_LIST_HEAD(), but the invocation from within list_del_init()
> is pretty legit.
>
> So, does the patch below help?

Yes, it helps.

Thank you.

Paul E. McKenney

Oct 13, 2015, 10:55:32 AM
to Dmitry Vyukov, Andrey Konovalov, ktsan
On Tue, Oct 13, 2015 at 08:01:42AM +0200, Dmitry Vyukov wrote:
> On Tue, Oct 13, 2015 at 2:13 AM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> > On Fri, Oct 09, 2015 at 12:32:05PM +0200, Andrey Konovalov wrote:
> >> On Fri, Oct 9, 2015 at 1:54 AM, Paul E. McKenney
> >> <pau...@linux.vnet.ibm.com> wrote:
> >> > On Tue, Oct 06, 2015 at 05:43:25PM +0200, Andrey Konovalov wrote:
> >> >> Hi Paul,
> >> >>
> >> >> I'm also seeing reports with racy plist_node_empty checks, one of them is below.
> >> >> In this very case plist_node_empty calls list_empty and if the latter
> >> >> is fixed, it should be enough, but perhaps plist might require some
> >> >> other fixes.
> >> >
> >> > Hmmmm... Does your .config file have CONFIG_DEBUG_LIST set?
> >>
> >> No, it doesn't.
> >
> > Ah, I get it. We have INIT_LIST_HEAD() racing against list_empty(),
> > and INIT_LIST_HEAD() is not marked. I was initially a bit reluctant
> > to mark INIT_LIST_HEAD(), but the invocation from within list_del_init()
> > is pretty legit.
> >
> > So, does the patch below help?
>
> Yes, it helps.
>
> Thank you.

Very good. If testing and review go well, I will push it into the
4.5 merge window (the one after this coming one).