[PATCH] lib: fix data race in rhashtable_rehash_one

Dmitry Vyukov

Sep 21, 2015, 4:08:54 AM

to tg...@suug.ch, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com, Dmitry Vyukov

rhashtable_rehash_one() uses plain writes to update entry->next,
while it is being concurrently accessed by readers.
Unfortunately, the compiler is within its rights to (for example) use
byte-at-a-time writes to update the pointer, which would fatally confuse
concurrent readers.

Use WRITE_ONCE to update entry->next in rhashtable_rehash_one().

The data race was found with KernelThreadSanitizer (KTSAN).

Signed-off-by: Dmitry Vyukov <dvy...@google.com>
---
KTSAN report for the record:

ThreadSanitizer: data-race in netlink_lookup

Atomic read at 0xffff880480443bd0 of size 8 by thread 2747 on CPU 11:
[< inline >] rhashtable_lookup_fast include/linux/rhashtable.h:543
[< inline >] __netlink_lookup net/netlink/af_netlink.c:1026
[<ffffffff81bd9a84>] netlink_lookup+0x134/0x1c0 net/netlink/af_netlink.c:1046
[< inline >] netlink_getsockbyportid net/netlink/af_netlink.c:1616
[<ffffffff81bdc701>] netlink_unicast+0x111/0x300 net/netlink/af_netlink.c:1812
[<ffffffff81bdcdb9>] netlink_sendmsg+0x4c9/0x5f0 net/netlink/af_netlink.c:2443
[< inline >] sock_sendmsg_nosec net/socket.c:610
[<ffffffff81b5d6f3>] sock_sendmsg+0x83/0x90 net/socket.c:620
[<ffffffff81b5e59f>] ___sys_sendmsg+0x3cf/0x3e0 net/socket.c:1952
[<ffffffff81b5f6ac>] __sys_sendmsg+0x4c/0xb0 net/socket.c:1986
[< inline >] SYSC_sendmsg net/socket.c:1997
[<ffffffff81b5f740>] SyS_sendmsg+0x30/0x50 net/socket.c:1993
[<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188

Previous write at 0xffff880480443bd0 of size 8 by thread 213 on CPU 4:
[< inline >] rhashtable_rehash_one lib/rhashtable.c:193
[< inline >] rhashtable_rehash_chain lib/rhashtable.c:213
[< inline >] rhashtable_rehash_table lib/rhashtable.c:257
[<ffffffff8156f7e0>] rht_deferred_worker+0x3b0/0x6d0 lib/rhashtable.c:373
[<ffffffff810b1d6e>] process_one_work+0x47e/0x930 kernel/workqueue.c:2036
[<ffffffff810b22d0>] worker_thread+0xb0/0x900 kernel/workqueue.c:2170
[<ffffffff810bba40>] kthread+0x150/0x170 kernel/kthread.c:209
[<ffffffff81ee420f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:529

Mutexes locked by thread 213:
Mutex 217217 is locked here:
[<ffffffff81ee0407>] mutex_lock+0x57/0x70 kernel/locking/mutex.c:108
[<ffffffff8156f475>] rht_deferred_worker+0x45/0x6d0 lib/rhashtable.c:363
[<ffffffff810b1d6e>] process_one_work+0x47e/0x930 kernel/workqueue.c:2036
[<ffffffff810b22d0>] worker_thread+0xb0/0x900 kernel/workqueue.c:2170
[<ffffffff810bba40>] kthread+0x150/0x170 kernel/kthread.c:209
[<ffffffff81ee420f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:529

Mutex 431216 is locked here:
[< inline >] __raw_spin_lock_bh include/linux/spinlock_api_smp.h:149
[<ffffffff81ee3195>] _raw_spin_lock_bh+0x65/0x80 kernel/locking/spinlock.c:175
[< inline >] spin_lock_bh include/linux/spinlock.h:317
[< inline >] rhashtable_rehash_chain lib/rhashtable.c:212
[< inline >] rhashtable_rehash_table lib/rhashtable.c:257
[<ffffffff8156f616>] rht_deferred_worker+0x1e6/0x6d0 lib/rhashtable.c:373
[<ffffffff810b1d6e>] process_one_work+0x47e/0x930 kernel/workqueue.c:2036
[<ffffffff810b22d0>] worker_thread+0xb0/0x900 kernel/workqueue.c:2170
[<ffffffff810bba40>] kthread+0x150/0x170 kernel/kthread.c:209
[<ffffffff81ee420f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:529

Mutex 432766 is locked here:
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:158
[<ffffffff81ee37d0>] _raw_spin_lock+0x50/0x70 kernel/locking/spinlock.c:151
[< inline >] rhashtable_rehash_one lib/rhashtable.c:186
[< inline >] rhashtable_rehash_chain lib/rhashtable.c:213
[< inline >] rhashtable_rehash_table lib/rhashtable.c:257
[<ffffffff8156f79b>] rht_deferred_worker+0x36b/0x6d0 lib/rhashtable.c:373
[<ffffffff810b1d6e>] process_one_work+0x47e/0x930 kernel/workqueue.c:2036
[<ffffffff810b22d0>] worker_thread+0xb0/0x900 kernel/workqueue.c:2170
[<ffffffff810bba40>] kthread+0x150/0x170 kernel/kthread.c:209
[<ffffffff81ee420f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:529
---
lib/rhashtable.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index cc0c697..978624d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -188,9 +188,12 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 					      new_tbl, new_hash);
 
 	if (rht_is_a_nulls(head))
-		INIT_RHT_NULLS_HEAD(entry->next, ht, new_hash);
-	else
-		RCU_INIT_POINTER(entry->next, head);
+		head = (struct rhash_head *)rht_marker(ht, new_hash);
+	/* We don't insert any new nodes that were not previously accessible
+	 * to readers, so we don't need to use rcu_assign_pointer().
+	 * But entry is being concurrently accessed by readers, so we need to
+	 * use at least WRITE_ONCE. */
+	WRITE_ONCE(entry->next, head);
 
 	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
 	spin_unlock(new_bucket_lock);
--
2.6.0.rc0.131.gf624c3d

Eric Dumazet

Sep 21, 2015, 9:31:57 AM

to Dmitry Vyukov, tg...@suug.ch, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

This is bogus.

1) Linux is certainly not working if some arch or compiler is not doing
single word writes. WRITE_ONCE() would not help at all to enforce this.

2) If new node is not yet visible, we don't care if we write
entry->next using any kind of operation.

So the WRITE_ONCE() is not needed at all.

> + WRITE_ONCE(entry->next, head);

The rcu_assign_pointer() immediately following is enough in this case.

We have hundreds of similar cases in the kernel.
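
For readers unfamiliar with the pattern referred to in point 2, here is a minimal userspace sketch using C11 atomics as a stand-in for rcu_assign_pointer(). The names `struct node`, `list_head`, `publish()` and `first()` are hypothetical and not taken from the kernel; the acquire load is a stronger substitute for rcu_dereference(), and a single writer is assumed. The argument only holds while the node is not yet reachable by readers.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int key;
};

/* Head of a singly linked list shared with lock-free readers. */
static _Atomic(struct node *) list_head;

/* Single-writer insert at the head of the list. */
static void publish(struct node *n, int key)
{
	/* n is still private here, so plain stores are sufficient. */
	n->key = key;
	n->next = atomic_load_explicit(&list_head, memory_order_relaxed);

	/* Release store: the fields above become visible before n does. */
	atomic_store_explicit(&list_head, n, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static struct node *first(void)
{
	return atomic_load_explicit(&list_head, memory_order_acquire);
}
```

A reader that obtains a node through `first()` sees fully initialized `key` and `next` fields, whatever instructions the compiler chose for the plain stores.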

Eric Dumazet

Sep 21, 2015, 10:51:50 AM

to Dmitry Vyukov, tg...@suug.ch, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

On Mon, 2015-09-21 at 06:31 -0700, Eric Dumazet wrote:
> On Mon, 2015-09-21 at 10:08 +0200, Dmitry Vyukov wrote:
> > rhashtable_rehash_one() uses plain writes to update entry->next,
> > while it is being concurrently accessed by readers.
> > Unfortunately, the compiler is within its rights to (for example) use
> > byte-at-a-time writes to update the pointer, which would fatally confuse
> > concurrent readers.
> >

> This is bogus.
>
> 1) Linux is certainly not working if some arch or compiler is not doing
> single word writes. WRITE_ONCE() would not help at all to enforce this.
>
> 2) If new node is not yet visible, we don't care if we write
> entry->next using any kind of operation.
>
> So the WRITE_ONCE() is not needed at all.
>
>
>
> > + WRITE_ONCE(entry->next, head);
>
>
> The rcu_assign_pointer() immediately following is enough in this case.
>
> We have hundreds of similar cases in the kernel.
>
>

The changelog and comment are totally confusing.

Please remove the bogus parts in them, and/or rephrase.

The important part here is that we rehash an item that readers can
already reach, so we need to keep its ->next field consistent at all
times and prevent the compiler from using ->next as a temporary variable.

ptr->next = 1UL | ((base + offset) << 1);

is dangerous because the compiler could issue:

ptr->next = (base + offset);
ptr->next <<= 1;
ptr->next += 1UL;
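
The hazard can be made concrete with a small userspace sketch. It is only an illustration: `nulls_marker()`, `is_a_nulls()` and `split_store_states()` are hypothetical helpers that imitate the encoding quoted above, and the function simply records the three values a concurrent reader could observe if the store were split that way.

```c
#include <stdint.h>

/* Imitates the encoding above: low bit set means "end of chain". */
static uintptr_t nulls_marker(uintptr_t value)
{
	return (uintptr_t)1 | (value << 1);
}

static int is_a_nulls(uintptr_t p)
{
	return (p & 1) != 0;
}

/*
 * Record each intermediate value of ->next when the store is split
 * into three read-modify-write steps instead of one final store.
 */
static void split_store_states(uintptr_t base, uintptr_t off,
			       uintptr_t seen[3])
{
	uintptr_t next;

	next = base + off;
	seen[0] = next;		/* raw sum: not a valid marker */
	next <<= 1;
	seen[1] = next;		/* shifted: low bit still clear */
	next += 1;
	seen[2] = next;		/* only now the intended marker */
}
```

With `base = 0` and `off = 4`, the first two recorded values have the low bit clear, so a reader walking the chain would treat them as node pointers and dereference them.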

Frankly, all this looks like an oversight in this code.

Not sure why the NULLS value is even recomputed.

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index cc0c69710dcf..0a29f07ba45a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -187,10 +187,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
 				      new_tbl, new_hash);
 
-	if (rht_is_a_nulls(head))
-		INIT_RHT_NULLS_HEAD(entry->next, ht, new_hash);
-	else
-		RCU_INIT_POINTER(entry->next, head);
+	RCU_INIT_POINTER(entry->next, head);

Dmitry Vyukov

Sep 21, 2015, 11:10:28 AM

to Eric Dumazet, tg...@suug.ch, net...@vger.kernel.org, LKML, Kostya Serebryany, Andrey Konovalov, Alexander Potapenko, kt...@googlegroups.com, Paul McKenney

I have not looked in detail yet, but the NULLS recomputation uses
new_hash, which obviously wasn't available when the value was
previously computed. Don't know yet whether it is important or not.

>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index cc0c69710dcf..0a29f07ba45a 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -187,10 +187,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
>  	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
>  				      new_tbl, new_hash);
>
> -	if (rht_is_a_nulls(head))
> -		INIT_RHT_NULLS_HEAD(entry->next, ht, new_hash);
> -	else
> -		RCU_INIT_POINTER(entry->next, head);
> +	RCU_INIT_POINTER(entry->next, head);
>
>  	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
>  	spin_unlock(new_bucket_lock);
>
>


--
Dmitry Vyukov, Software Engineer, dvy...@google.com

Eric Dumazet

Sep 21, 2015, 11:15:33 AM

to Dmitry Vyukov, tg...@suug.ch, net...@vger.kernel.org, LKML, Kostya Serebryany, Andrey Konovalov, Alexander Potapenko, kt...@googlegroups.com, Paul McKenney

Well, head already contains the right value, set in bucket_table_alloc():

	for (i = 0; i < nbuckets; i++)
		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);

Think of this nulls value as a special NULL pointer.

If the hash table is properly allocated/initialized, all the chains
correctly end with a proper NULL pointer.
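
The point can be sketched in userspace as follows. This is an illustration under assumptions, not the rhashtable code: `struct node`, `nulls_head()`, `table_init()`, `push_head()` and `chain_len()` are hypothetical names, and the marker encoding is the one discussed earlier in the thread.

```c
#include <stdint.h>
#include <stddef.h>

#define NBUCKETS 4	/* hypothetical table size */

struct node {
	struct node *next;
	int key;
};

/* End-of-chain marker for a bucket: low bit set, hash in upper bits. */
static struct node *nulls_head(uintptr_t base, unsigned int hash)
{
	return (struct node *)((uintptr_t)1 | ((base + hash) << 1));
}

static int is_a_nulls(const struct node *p)
{
	return ((uintptr_t)p & 1) != 0;
}

/* Every empty bucket starts out holding its own marker. */
static void table_init(struct node *buckets[], size_t n, uintptr_t base)
{
	size_t i;

	for (i = 0; i < n; i++)
		buckets[i] = nulls_head(base, (unsigned int)i);
}

/* Whatever the bucket holds, node or marker, is copied into ->next. */
static void push_head(struct node **bucket, struct node *n)
{
	n->next = *bucket;
	*bucket = n;
}

/* Walk a chain until the marker is reached. */
static size_t chain_len(const struct node *head)
{
	size_t len = 0;

	while (!is_a_nulls(head)) {
		len++;
		head = head->next;
	}
	return len;
}
```

Because the marker is installed once at initialization, every later insert inherits it by copying the old head, and no insert path has to rebuild it.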

Thomas Graf

Sep 21, 2015, 6:25:40 PM

to Eric Dumazet, Dmitry Vyukov, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

On 09/21/15 at 07:51am, Eric Dumazet wrote:
> The important part here is that we rehash an item, so we need to make
> sure to maintain consistent ->next field, and need to prevent compiler
> from using ->next as a temporary variable.
>
> ptr->next = 1UL | ((base + offset) << 1);
>
> Is dangerous because compiler could issue :
>
> ptr->next = (base + offset);
>
> ptr->next <<= 1;
>
> ptr->next += 1UL;
>
> Frankly, all this looks like an oversight in this code.
>
> Not sure why the NULLS value is even recomputed.

The hash of the chain is part of the NULLS value. Since the
entry might have been moved to a different chain, the NULLS
value must be recalculated to contain the proper hash.

However, nobody is using the hash today as far as I can
see so we could as well just remove it and use the base
value only for the nulls marker.
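
A small sketch of what "the hash is part of the NULLS value" means, assuming the `1 | ((base + hash) << 1)` encoding quoted above. The helper names `nulls_marker()`, `is_a_nulls()` and `nulls_hash()` are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Build the end-of-chain marker for a given bucket hash. */
static uintptr_t nulls_marker(uintptr_t base, unsigned int hash)
{
	return (uintptr_t)1 | ((base + hash) << 1);
}

static int is_a_nulls(uintptr_t p)
{
	return (p & 1) != 0;
}

/* Recover the bucket hash that was stored in a marker. */
static unsigned int nulls_hash(uintptr_t marker, uintptr_t base)
{
	return (unsigned int)((marker >> 1) - base);
}
```

An entry moved from bucket 5 to bucket 13 must end its new chain with the marker for 13. A reader that decoded the hash from the marker would otherwise conclude it had finished on the wrong chain.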

Eric Dumazet

Sep 21, 2015, 7:03:41 PM

to Thomas Graf, Dmitry Vyukov, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

What I said is :

In @head you already have the correct nulls value, from hash table.

You do not need to recompute this value, and/or test if hash table chain
is empty.

If hash bucket is empty, it contains the appropriate NULLS value.

If you are paranoiac add this debugging check :

	if (rht_is_a_nulls(head))
		BUG_ON(head != (struct rhash_head *)rht_marker(ht, new_hash));
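
The check can be exercised in a simplified, single-threaded userspace sketch. It is not the kernel function: `rehash_first()` is a hypothetical helper that moves the first entry of a chain (the kernel picks its entry and orders the steps differently), there is no locking or RCU, and the check returns an error code instead of calling BUG_ON().

```c
#include <stdint.h>
#include <stddef.h>

struct node {
	struct node *next;
};

static struct node *marker(uintptr_t base, unsigned int hash)
{
	return (struct node *)((uintptr_t)1 | ((base + hash) << 1));
}

static int is_a_nulls(const struct node *p)
{
	return ((uintptr_t)p & 1) != 0;
}

/*
 * Move the first entry of *old_bucket to new_buckets[new_hash].
 * Returns 1 if an entry moved, 0 if the old chain was empty, and
 * -1 if the new bucket ends in a marker for a different hash.
 */
static int rehash_first(struct node **old_bucket,
			struct node *new_buckets[],
			uintptr_t base, unsigned int new_hash)
{
	struct node *entry = *old_bucket;
	struct node *head = new_buckets[new_hash];

	if (is_a_nulls(entry))
		return 0;

	/* The debugging check from the message above. */
	if (is_a_nulls(head) && head != marker(base, new_hash))
		return -1;

	*old_bucket = entry->next;	/* unlink from the old chain */
	entry->next = head;		/* one store, nothing recomputed */
	new_buckets[new_hash] = entry;	/* link into the new chain */
	return 1;
}
```

When the new table was initialized with the right markers, the check never fires and the moved entry ends up terminated by the marker of its new bucket without any extra computation.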

Therefore, simply fix the bug and unnecessary code with :

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index cc0c69710dcf..a54ff8949f91 100644

Thomas Graf

Sep 22, 2015, 4:20:00 AM

to Eric Dumazet, Dmitry Vyukov, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

On 09/21/15 at 04:03pm, Eric Dumazet wrote:
> What I said is :
>
> In @head you already have the correct nulls value, from hash table.
>
> You do not need to recompute this value, and/or test if hash table chain
> is empty.
>
> If hash bucket is empty, it contains the appropriate NULLS value.
>
> If you are paranoiac add this debugging check :
>
> 	if (rht_is_a_nulls(head))
> 		BUG_ON(head != (struct rhash_head *)rht_marker(ht, new_hash));
>
>
> Therefore, simply fix the bug and unnecessary code with :

You are absolutely right, Eric. Do you want to revise your patch, Dmitry?
Eric's proposed fix is absolutely the best way to fix this.

Dmitry Vyukov

Sep 22, 2015, 4:51:56 AM

to eric.d...@gmail.com, net...@vger.kernel.org, linux-...@vger.kernel.org, tg...@suug.ch, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com, Dmitry Vyukov

rhashtable_rehash_one() uses complex logic to update entry->next field,
after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:

entry->next = 1 | ((base + off) << 1)

This can be compiled along the lines of:

entry->next = base + off
entry->next <<= 1
entry->next |= 1

Which will break concurrent readers.

NULLS value recomputation is not needed here, so just remove
the complex logic.

The data race was found with KernelThreadSanitizer (KTSAN).

Signed-off-by: Dmitry Vyukov <dvy...@google.com>
---

v2: Remove NULLS values recomputation as it is not needed.
Update commit description to clarify that the problem
is not with racy reads/writes per se but rather with
the complex update logic.

lib/rhashtable.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index cc0c697..a54ff89 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -187,10 +187,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
 				      new_tbl, new_hash);
 
-	if (rht_is_a_nulls(head))
-		INIT_RHT_NULLS_HEAD(entry->next, ht, new_hash);
-	else
-		RCU_INIT_POINTER(entry->next, head);
+	RCU_INIT_POINTER(entry->next, head);
 
 	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
 	spin_unlock(new_bucket_lock);

--
2.6.0.rc0.131.gf624c3d

Dmitry Vyukov

Sep 22, 2015, 4:53:08 AM

to Thomas Graf, Eric Dumazet, net...@vger.kernel.org, LKML, Kostya Serebryany, Andrey Konovalov, Alexander Potapenko, kt...@googlegroups.com, Paul McKenney

Mailed v2 of the patch.

Eric Dumazet

Sep 22, 2015, 5:05:12 AM

to Dmitry Vyukov, net...@vger.kernel.org, linux-...@vger.kernel.org, tg...@suug.ch, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

On Tue, 2015-09-22 at 10:51 +0200, Dmitry Vyukov wrote:
> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
>
> entry->next = 1 | ((base + off) << 1)
>
> This can be compiled along the lines of:
>
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
>
> Which will break concurrent readers.
>
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
>
> The data race was found with KernelThreadSanitizer (KTSAN).
>
> Signed-off-by: Dmitry Vyukov <dvy...@google.com>
> ---

Thanks Dmitry

Acked-by: Eric Dumazet <edum...@google.com>

Thomas Graf

Sep 22, 2015, 5:17:40 AM

to Dmitry Vyukov, eric.d...@gmail.com, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

On 09/22/15 at 10:51am, Dmitry Vyukov wrote:
> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
>
> entry->next = 1 | ((base + off) << 1)
>
> This can be compiled along the lines of:
>
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
>
> Which will break concurrent readers.
>
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
>
> The data race was found with KernelThreadSanitizer (KTSAN).
>
> Signed-off-by: Dmitry Vyukov <dvy...@google.com>

Acked-by: Thomas Graf <tg...@suug.ch>

Herbert Xu

Sep 22, 2015, 11:19:11 AM

to Eric Dumazet, tg...@suug.ch, dvy...@google.com, net...@vger.kernel.org, linux-...@vger.kernel.org, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

Eric Dumazet <eric.d...@gmail.com> wrote:
>
> What I said is :
>
> In @head you already have the correct nulls value, from hash table.
>
> You do not need to recompute this value, and/or test if hash table chain
> is empty.
>
> If hash bucket is empty, it contains the appropriate NULLS value.
>
> If you are paranoiac add this debugging check :
>
> 	if (rht_is_a_nulls(head))
> 		BUG_ON(head != (struct rhash_head *)rht_marker(ht, new_hash));
>
>
> Therefore, simply fix the bug and unnecessary code with :

Ack. I remember seeing this when I was working on it but never
got around to removing this bogosity.

Thanks,
--
Email: Herbert Xu <her...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert Xu

Sep 22, 2015, 11:20:07 AM

to Dmitry Vyukov, eric.d...@gmail.com, net...@vger.kernel.org, linux-...@vger.kernel.org, tg...@suug.ch, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com, dvy...@google.com

Dmitry Vyukov <dvy...@google.com> wrote:
> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
>
> entry->next = 1 | ((base + off) << 1)
>
> This can be compiled along the lines of:
>
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
>
> Which will break concurrent readers.
>
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
>
> The data race was found with KernelThreadSanitizer (KTSAN).
>
> Signed-off-by: Dmitry Vyukov <dvy...@google.com>

Acked-by: Herbert Xu <her...@gondor.apana.org.au>

David Miller

Sep 22, 2015, 8:36:31 PM

to dvy...@google.com, eric.d...@gmail.com, net...@vger.kernel.org, linux-...@vger.kernel.org, tg...@suug.ch, k...@google.com, andre...@google.com, gli...@google.com, kt...@googlegroups.com, pau...@linux.vnet.ibm.com

From: Dmitry Vyukov <dvy...@google.com>
Date: Tue, 22 Sep 2015 10:51:52 +0200

> rhashtable_rehash_one() uses complex logic to update entry->next field,
> after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion:
>
> entry->next = 1 | ((base + off) << 1)
>
> This can be compiled along the lines of:
>
> entry->next = base + off
> entry->next <<= 1
> entry->next |= 1
>
> Which will break concurrent readers.
>
> NULLS value recomputation is not needed here, so just remove
> the complex logic.
>
> The data race was found with KernelThreadSanitizer (KTSAN).
>
> Signed-off-by: Dmitry Vyukov <dvy...@google.com>

Applied, thanks.
