
[PATCH 3/5] futex: Larger hash table


Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
Currently, the futex global hash table suffers from its fixed, smallish
(by today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems using many futexes can see high collision
rates, where distinct futexes hash to the same bucket and contend on
the same hb->lock. Furthermore, cacheline bouncing is a reality when
multiple hb->locks reside on the same cacheline and different futexes
hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems,
and otherwise uses 256 * ncpus (rounded up to a power of 2). Note that
this CPU count accounts for all CPUs that can ever be available in the
system, taking things like hotplugging into consideration. While we do
impose extra overhead at bootup by making the hash table larger, this is
a one-time cost that does not overshadow the benefits of this patch.

Also, as with other core kernel components (pid, dcache, tcp), using
alloc_large_system_hash() gives us its NUMA awareness, so the table is
distributed among the nodes instead of concentrated on a single one. We
impose this function's minimum limit of 256 entries, so that in the
worst case we still end up with the current amount.

For a custom microbenchmark that pounds on the uaddr hashing -- making the
wait path fail at futex_wait_setup() with -EWOULDBLOCK for large numbers of
futexes -- we see the following benefits on an 80-core, 8-socket, 1 TB server:

+---------+----------------------------------+------------------------------------------+----------+
| threads | baseline (ops/sec) [insns/cycle] | large hash table (ops/sec) [insns/cycle] | increase |
+---------+----------------------------------+------------------------------------------+----------+
| 512 | 34429 [0.07] | 255274 [0.48] | +641.45% |
| 256 | 65452 [0.07] | 443563 [0.41] | +577.69% |
| 128 | 125111 [0.07] | 742613 [0.33] | +493.56% |
| 80 | 203642 [0.09] | 1028147 [0.29] | +404.87% |
| 64 | 262944 [0.09] | 997300 [0.28] | +279.28% |
| 32 | 642390 [0.24] | 965996 [0.27] | +50.37% |
+---------+----------------------------------+------------------------------------------+----------+

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Darren Hart <dvh...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Jeff Mahoney <je...@suse.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Scott Norton <scott....@hp.com>
Cc: Tom Vaden <tom....@hp.com>
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Signed-off-by: Waiman Long <Waima...@hp.com>
Signed-off-by: Jason Low <jason...@hp.com>
Signed-off-by: Davidlohr Bueso <davi...@hp.com>
---
kernel/futex.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0768c68..5fa9eb0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
#include <linux/sched/rt.h>
#include <linux/hugetlb.h>
#include <linux/freezer.h>
+#include <linux/bootmem.h>

#include <asm/futex.h>

@@ -70,7 +71,11 @@

int __read_mostly futex_cmpxchg_enabled;

-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
+#if CONFIG_BASE_SMALL
+static unsigned long futex_hashsize = 16;
+#else
+static unsigned long futex_hashsize;
+#endif

/*
* Futex flags used to encode options to functions and preserve them across
@@ -151,7 +156,11 @@ struct futex_hash_bucket {
struct plist_head chain;
};

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if CONFIG_BASE_SMALL
+static struct futex_hash_bucket futex_queues[futex_hashsize];
+#else
+static struct futex_hash_bucket *futex_queues;
+#endif

/*
* We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +170,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & (futex_hashsize - 1)];
}

/*
@@ -2715,7 +2724,14 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
static int __init futex_init(void)
{
u32 curval;
- int i;
+ unsigned long i;
+
+#if !CONFIG_BASE_SMALL
+ futex_hashsize = roundup_pow_of_two((256 * num_possible_cpus()));
+ futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+ futex_hashsize, 0, 0, NULL, NULL,
+ 256, futex_hashsize);
+#endif

/*
* This will fail and we want it. Some arch implementations do
@@ -2730,7 +2746,7 @@ static int __init futex_init(void)
if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
futex_cmpxchg_enabled = 1;

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i < futex_hashsize; i++) {
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
}
--
1.8.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
We have been dealing with a customer database workload on a large
12 TB, 240-core, 16-socket NUMA system that exhibits high amounts
of contention on some of the locks that serialize internal futex
data structures. This workload especially suffers in the wakeup
paths, where waiting on the corresponding hb->lock can account for
up to ~60% of the time. The result of such calls can mostly be
classified as (i) nothing to wake up and (ii) waking up a large
number of tasks.

Before these patches are applied, we can see this pathological behavior:

37.12% 826174 xxx [kernel.kallsyms] [k] _raw_spin_lock
--- _raw_spin_lock
|
|--97.14%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| |
| |--99.70%-- 0x7f383fbdea1f
| | yyy

43.71% 762296 xxx [kernel.kallsyms] [k] _raw_spin_lock
--- _raw_spin_lock
|
|--53.74%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| |
| |--99.40%-- 0x7fe7d44a4c05
| | zzz
|--45.90%-- futex_wait_setup
| futex_wait
| do_futex
| sys_futex
| system_call_fastpath
| 0x7fe7ba315789
| syscall


With these patches, contention is practically nonexistent:

0.10% 49 xxx [kernel.kallsyms] [k] _raw_spin_lock
--- _raw_spin_lock
|
|--76.06%-- futex_wait_setup
| futex_wait
| do_futex
| sys_futex
| system_call_fastpath
| |
| |--99.90%-- 0x7f3165e63789
| | syscall
...
|--6.27%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| |
| |--54.56%-- 0x7f317fff2c05
...

Patches 1 & 2 are cleanups and micro-optimizations.

Patch 3 addresses the well-known issue of the global hash table.
By creating a larger, NUMA-aware table, we reduce false sharing and
collisions, and thus the chance of different futexes contending on
the same hb->lock.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly deals with point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls to return 0, indicating that no
tasks were woken.

Patch 5 resurrects a two-year-old idea from Peter Zijlstra to delay
waking the blocked tasks until after the hb->lock has been released:
https://lkml.org/lkml/2011/9/14/118

This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid the futex's internal spinlock contention by
delaying the wakeups until we've released the hb->lock.
This particularly deals with point (ii) above, where we can observe wake
calls that on occasion end up waking 125 to 200 waiters in what we believe
are RW locks in the application.

This patchset has also been tested on smaller systems for a variety of
benchmarks, including Java workloads, kernel builds and custom
bang-the-hell-out-of-hb-locks programs. So far, no functional or performance
regressions have been seen. Furthermore, no issues were found when running
the different tests in the futextest suite:
http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc1.

Special thanks to Scott Norton, Tom Vaden and Mark Ray for help presenting,
debugging and analyzing the data.

futex: Misc cleanups
futex: Check for pi futex_q only once
futex: Larger hash table
futex: Avoid taking hb lock if nothing to wakeup
sched,futex: Provide delayed wakeup list

include/linux/sched.h | 41 ++++++++++++++++++
kernel/futex.c | 113 +++++++++++++++++++++++++++-----------------------
kernel/sched/core.c | 19 +++++++++
3 files changed, 122 insertions(+), 51 deletions(-)

Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
From: Jason Low <jason...@hp.com>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Darren Hart <dvh...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Jeff Mahoney <je...@suse.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Scott Norton <scott....@hp.com>
Cc: Tom Vaden <tom....@hp.com>
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Cc: Waiman Long <Waima...@hp.com>
Signed-off-by: Jason Low <jason...@hp.com>
Signed-off-by: Davidlohr Bueso <davi...@hp.com>
---
kernel/futex.c | 39 ++++++++++++---------------------------
1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086..e6ffe73 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -597,13 +597,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
{
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
- struct plist_head *head;
struct task_struct *p;
pid_t pid = uval & FUTEX_TID_MASK;

- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex(&this->key, key)) {
/*
* Another waiter already exists - bump up
@@ -985,7 +982,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
int ret;

@@ -998,9 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)

hb = hash_futex(&key);
spin_lock(&hb->lock);
- head = &hb->chain;

- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1033,7 +1028,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret;

@@ -1081,9 +1075,7 @@ retry_private:
goto retry;
}

- head = &hb1->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (match_futex (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1096,10 +1088,8 @@ retry_private:
}

if (op_ret > 0) {
- head = &hb2->chain;
-
op_ret = 0;
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
if (match_futex (&this->key, &key2)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1269,7 +1259,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
int drop_count = 0, task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head1;
struct futex_q *this, *next;
u32 curval2;

@@ -1392,8 +1381,7 @@ retry_private:
}
}

- head1 = &hb1->chain;
- plist_for_each_entry_safe(this, next, head1, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (task_count - nr_wake >= nr_requeue)
break;

@@ -1493,7 +1481,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
}

static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
spin_unlock(&hb->lock);
@@ -1866,7 +1854,7 @@ retry_private:
ret = get_futex_value_locked(&uval, uaddr);

if (ret) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);

ret = get_user(uval, uaddr);
if (ret)
@@ -1880,7 +1868,7 @@ retry_private:
}

if (uval != val) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);
ret = -EWOULDBLOCK;
}

@@ -2028,7 +2016,7 @@ retry_private:
* Task is exiting and we just wait for the
* exit to complete.
*/
- queue_unlock(&q, hb);
+ queue_unlock(hb);
put_futex_key(&q.key);
cond_resched();
goto retry;
@@ -2080,7 +2068,7 @@ retry_private:
goto out_put_key;

out_unlock_put_key:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

out_put_key:
put_futex_key(&q.key);
@@ -2090,7 +2078,7 @@ out:
return ret != -EINTR ? ret : -ERESTARTNOINTR;

uaddr_faulted:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

ret = fault_in_user_writeable(uaddr);
if (ret)
@@ -2112,7 +2100,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
u32 uval, vpid = task_pid_vnr(current);
int ret;
@@ -2152,9 +2139,7 @@ retry:
* Ok, other tasks may need to be woken up - check waiters
* and do the wakeup if necessary:
*/
- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (!match_futex (&this->key, &key))
continue;
ret = wake_futex_pi(uaddr, uval, this);

Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
From: Peter Zijlstra <pet...@infradead.org>

Original patchset: https://lkml.org/lkml/2011/9/14/118

This is useful for locking primitives that can effect multiple
wakeups per operation and want to avoid lock-internal contention
by delaying the wakeups until we've released the lock-internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

This patch adds the needed infrastructure into the scheduler code
and uses the new wake_list to delay the futex wakeups until
after we've released the hash bucket locks. This prevents the newly
woken tasks from immediately getting stuck on the hb->lock.

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Darren Hart <dvh...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Jeff Mahoney <je...@suse.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Scott Norton <scott....@hp.com>
Cc: Tom Vaden <tom....@hp.com>
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Cc: Waiman Long <Waima...@hp.com>
Tested-by: Jason Low <jason...@hp.com>
[forward ported]
Signed-off-by: Davidlohr Bueso <davi...@hp.com>
---
Please note that in the original thread there was some debate
about spurious wakeups (https://lkml.org/lkml/2011/9/17/31), so
you can consider this more of an RFC patch if folks believe that
this functionality is incomplete/buggy.

include/linux/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/futex.c | 31 +++++++++++++++----------------
kernel/sched/core.c | 19 +++++++++++++++++++
3 files changed, 75 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e35d4b..679aabb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,20 @@ enum cpu_idle_type {

extern int __weak arch_sd_sibiling_asym_packing(void);

+struct wake_list_head {
+ struct wake_list_node *first;
+};
+
+struct wake_list_node {
+ struct wake_list_node *next;
+};
+
+#define WAKE_LIST_TAIL ((struct wake_list_node *) 0x01)
+
+#define WAKE_LIST(name) \
+ struct wake_list_head name = { WAKE_LIST_TAIL }
+
+
struct sched_domain_attr {
int relax_domain_level;
};
@@ -1078,6 +1092,8 @@ struct task_struct {
unsigned int btrace_seq;
#endif

+ struct wake_list_node wake_list;
+
unsigned int policy;
int nr_cpus_allowed;
cpumask_t cpus_allowed;
@@ -2044,6 +2060,31 @@ extern void wake_up_new_task(struct task_struct *tsk);
extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
extern void sched_dead(struct task_struct *p);

+static inline void
+wake_list_add(struct wake_list_head *head, struct task_struct *p)
+{
+ struct wake_list_node *n = &p->wake_list;
+
+ /*
+ * Atomically grab the task, if ->wake_list is !0 already it means
+ * its already queued (either by us or someone else) and will get the
+ * wakeup due to that.
+ *
+ * This cmpxchg() implies a full barrier, which pairs with the write
+ * barrier implied by the wakeup in wake_up_list().
+ */
+ if (cmpxchg(&n->next, NULL, head->first))
+ return;
+
+ /*
+ * The head is context local, there can be no concurrency.
+ */
+ get_task_struct(p);
+ head->first = n;
+}
+
+extern void wake_up_list(struct wake_list_head *head, unsigned int state);
+
extern void proc_caches_init(void);
extern void flush_signals(struct task_struct *);
extern void __flush_signals(struct task_struct *);
diff --git a/kernel/futex.c b/kernel/futex.c
index 2f9dd5d..3d60a3d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -851,20 +851,17 @@ static void __unqueue_futex(struct futex_q *q)
* The hash bucket lock must be held when this is called.
* Afterwards, the futex_q must not be accessed.
*/
-static void wake_futex(struct futex_q *q)
+static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q)
{
struct task_struct *p = q->task;

/*
- * We set q->lock_ptr = NULL _before_ we wake up the task. If
- * a non-futex wake up happens on another CPU then the task
- * might exit and p would dereference a non-existing task
- * struct. Prevent this by holding a reference on p across the
- * wake up.
+ * Queue up and delay the futex wakeups until the hb lock
+ * has been released.
*/
- get_task_struct(p);
-
+ wake_list_add(wake_list, p);
__unqueue_futex(q);
+
/*
* The waiting task can free the futex_q as soon as
* q->lock_ptr = NULL is written, without taking any locks. A
@@ -873,9 +870,6 @@ static void wake_futex(struct futex_q *q)
*/
smp_wmb();
q->lock_ptr = NULL;
-
- wake_up_state(p, TASK_NORMAL);
- put_task_struct(p);
}

static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
@@ -992,6 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
int ret;
+ WAKE_LIST(wake_list);

if (!bitset)
return -EINVAL;
@@ -1018,7 +1013,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!(this->bitset & bitset))
continue;

- wake_futex(this);
+ wake_futex(&wake_list, this);
if (++ret >= nr_wake)
break;
}
@@ -1027,6 +1022,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
spin_unlock(&hb->lock);
out_put_key:
put_futex_key(&key);
+ wake_up_list(&wake_list, TASK_NORMAL);
out:
return ret;
}
@@ -1043,7 +1039,7 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
int ret, op_ret;
-
+ WAKE_LIST(wake_list);
retry:
ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
if (unlikely(ret != 0))
@@ -1094,7 +1090,7 @@ retry_private:
ret = -EINVAL;
goto out_unlock;
}
- wake_futex(this);
+ wake_futex(&wake_list, this);
if (++ret >= nr_wake)
break;
}
@@ -1108,7 +1104,7 @@ retry_private:
ret = -EINVAL;
goto out_unlock;
}
- wake_futex(this);
+ wake_futex(&wake_list, this);
if (++op_ret >= nr_wake2)
break;
}
@@ -1122,6 +1118,7 @@ out_put_keys:
put_futex_key(&key2);
out_put_key1:
put_futex_key(&key1);
+ wake_up_list(&wake_list, TASK_NORMAL);
out:
return ret;
}
@@ -1276,6 +1273,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
u32 curval2;
+ WAKE_LIST(wake_list);

if (requeue_pi) {
/*
@@ -1423,7 +1421,7 @@ retry_private:
* woken by futex_unlock_pi().
*/
if (++task_count <= nr_wake && !requeue_pi) {
- wake_futex(this);
+ wake_futex(&wake_list, this);
continue;
}

@@ -1476,6 +1474,7 @@ out_put_keys:
put_futex_key(&key2);
out_put_key1:
put_futex_key(&key1);
+ wake_up_list(&wake_list, TASK_NORMAL);
out:
if (pi_state != NULL)
free_pi_state(pi_state);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c180860..1e1b1ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1695,6 +1695,25 @@ int wake_up_state(struct task_struct *p, unsigned int state)
return try_to_wake_up(p, state, 0);
}

+void wake_up_list(struct wake_list_head *head, unsigned int state)
+{
+ struct wake_list_node *n = head->first;
+ struct task_struct *p;
+
+ while (n != WAKE_LIST_TAIL) {
+ p = container_of(n, struct task_struct, wake_list);
+ n = n->next;
+
+ p->wake_list.next = NULL;
+ /*
+ * wake_up_state() implies a wmb() to pair with the queueing
+ * in wake_list_add() so as not to miss wakeups.
+ */
+ wake_up_state(p, state);
+ put_task_struct(p);
+ }
+}
+
/*
* Perform scheduler related setup for a newly forked process p.
* p is forked by current.

Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
In futex_wake() there is clearly no point in taking the hb->lock if
we know beforehand that there are no tasks to be woken. This comes
at the small cost of doing some atomic operations to keep track of
the list's size. Specifically, increment the counter when an element is
added to the list, and decrement when it is removed. Of course, if the
counter is 0, then there are no tasks blocked on a futex. Some special
considerations:

- increment the counter at queue_lock(), as we always end up calling
  queue_me(), which adds the element to the list. Upon any error,
  queue_unlock() is called for housekeeping, for which we decrement
  to match the increment done in queue_lock().

- decrement the counter at __unqueue_futex() to reflect when an element is
  removed from the queue for wakeup-related purposes.

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Darren Hart <dvh...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Jeff Mahoney <je...@suse.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Scott Norton <scott....@hp.com>
Cc: Tom Vaden <tom....@hp.com>
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Signed-off-by: Waiman Long <Waima...@hp.com>
Signed-off-by: Jason Low <jason...@hp.com>
Signed-off-by: Davidlohr Bueso <davi...@hp.com>
---
kernel/futex.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5fa9eb0..2f9dd5d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -152,6 +152,7 @@ static const struct futex_q futex_q_init = {
* waiting on a futex.
*/
struct futex_hash_bucket {
+ atomic_t len;
spinlock_t lock;
struct plist_head chain;
};
@@ -843,6 +844,7 @@ static void __unqueue_futex(struct futex_q *q)

hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
plist_del(&q->list, &hb->chain);
+ atomic_dec(&hb->len);
}

/*
@@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
goto out;

hb = hash_futex(&key);
+ /* make sure we really have tasks to wakeup */
+ if (atomic_read(&hb->len) == 0)
+ goto out_put_key;
+
spin_lock(&hb->lock);

plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1019,6 +1025,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}

spin_unlock(&hb->lock);
+out_put_key:
put_futex_key(&key);
out:
return ret;
@@ -1137,7 +1144,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
*/
if (likely(&hb1->chain != &hb2->chain)) {
plist_del(&q->list, &hb1->chain);
+ atomic_dec(&hb1->len);
plist_add(&q->list, &hb2->chain);
+ atomic_inc(&hb2->len);
q->lock_ptr = &hb2->lock;
}
get_futex_key_refs(key2);
@@ -1480,6 +1489,8 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
struct futex_hash_bucket *hb;

hb = hash_futex(&q->key);
+ atomic_inc(&hb->len);
+
q->lock_ptr = &hb->lock;

spin_lock(&hb->lock);
@@ -1491,6 +1502,7 @@ queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
spin_unlock(&hb->lock);
+ atomic_dec(&hb->len);
}

/**
@@ -2222,6 +2234,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
* Unqueue the futex_q and determine which it was.
*/
plist_del(&q->list, &hb->chain);
+ atomic_dec(&hb->len);

/* Handle spurious wakeups gracefully */
ret = -EWOULDBLOCK;
@@ -2749,6 +2762,7 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++) {
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
+ atomic_set(&futex_queues[i].len, 0);
}

return 0;

Davidlohr Bueso

Nov 22, 2013, 8:00:03 PM
All wake_futex() callers already verify that we are not dealing with
a pi futex_q, so we can remove the redundant WARN() check, as it never
triggers anyway.

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Darren Hart <dvh...@linux.intel.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Jeff Mahoney <je...@suse.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Scott Norton <scott....@hp.com>
Cc: Tom Vaden <tom....@hp.com>
Cc: Aswin Chandramouleeswaran <as...@hp.com>
Cc: Waiman Long <Waima...@hp.com>
Cc: Jason Low <jason...@hp.com>
Signed-off-by: Davidlohr Bueso <davi...@hp.com>
---
kernel/futex.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e6ffe73..0768c68 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -844,9 +844,6 @@ static void wake_futex(struct futex_q *q)
{
struct task_struct *p = q->task;

- if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
- return;
-
/*
* We set q->lock_ptr = NULL _before_ we wake up the task. If
* a non-futex wake up happens on another CPU then the task

Linus Torvalds

Nov 22, 2013, 8:30:02 PM
On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davi...@hp.com> wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size.

Hmm. Why? Afaik, you only care about "empty or not". And if you don't
need the serialization from locking, then afaik you can just do a
"plist_head_empty()" without holding the lock.

NOTE!

The "list_empty()" function is very much designed to work even without
holding a lock (as long as the head itself exists reliably, of course)
BUT you have to then guarantee yourself that your algorithm doesn't
have any races wrt other CPU's adding an entry to the list at the same
time. Not holding a lock obviously means that you are not serialized
against that.. We've had problems with people doing

	if (!list_empty(waiters))
		wake_up_list(..)

because they wouldn't wake people up who just got added.

But considering that your atomic counter checking has the same lack of
serialization, at least the plist_head_empty() check shouldn't be any
worse than that counter thing.. And doesn't need any steenking atomic
ops or a new counter field.

Linus

Jason Low

Nov 22, 2013, 10:10:02 PM
On Fri, Nov 22, 2013 at 5:25 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davi...@hp.com> wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
>
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
> if (!list_empty(waiters))
> wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.

Hi Linus,

In this patch, since we do the atomic increment before taking the
hb->lock in situations where we may potentially add a task to the
list, we would only skip attempting the wakeup if the list is empty
and no other thread has already taken the hb->lock and is in the
process of adding a task to the list. Would this be enough to protect
it from any race condition?

Thanks,
Jason

Davidlohr Bueso

Nov 22, 2013, 10:30:01 PM
On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davi...@hp.com> wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size.
>
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.

I remember this being the original approach, but after noticing some
strange behavior we quickly decided it wasn't the right path. And sure
enough, I just double checked and tried the patch without the atomic
ops, and I can see things being off: one of the futextest performance
cases gets stuck blocked on a futex, and I couldn't reboot the machine
either -- nothing apparent in dmesg, just not 100% functional. The
thing is, we can only avoid taking the lock if nobody else is trying
to add itself to the list.

Thanks,
Davidlohr

Waiman Long

Nov 22, 2013, 11:10:04 PM
On 11/22/2013 08:25 PM, Linus Torvalds wrote:
> On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso<davi...@hp.com> wrote:
>> In futex_wake() there is clearly no point in taking the hb->lock if
>> we know beforehand that there are no tasks to be woken. This comes
>> at the smaller cost of doing some atomic operations to keep track of
>> the list's size.
> Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> need the serialization from locking, then afaik you can just do a
> "plist_head_empty()" without holding the lock.
>
> NOTE!
>
> The "list_empty()" function is very much designed to work even without
> holding a lock (as long as the head itself exists reliably, of course)
> BUT you have to then guarantee yourself that your algorithm doesn't
> have any races wrt other CPU's adding an entry to the list at the same
> time. Not holding a lock obviously means that you are not serialized
> against that.. We've had problems with people doing
>
> if (!list_empty(waiters))
> wake_up_list(..)
>
> because they wouldn't wake people up who just got added.
>
> But considering that your atomic counter checking has the same lack of
> serialization, at least the plist_head_empty() check shouldn't be any
> worse than that counter thing.. And doesn't need any steenking atomic
> ops or a new counter field.
>
> Linus

Before the patch, a waker will wake up one or more waiters who called
spin_lock() before the waker did. If we just use plist_head_empty(), we
will miss waiters who have called spin_lock() but have not yet acquired
the lock or inserted themselves into the wait list. By adding an atomic
counter for checking purposes, we again have a single point where any
waiter who passes that point (incrementing the atomic counter) can be
woken up by a waker who checks the atomic counter after the increment.
This acts sort of like serialization without using a lock.
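The counter-as-serialization scheme described above can be sketched with
userspace C11 atomics. All names here (bucket, bucket_queue,
bucket_maybe_wake) are made up for illustration; this is a model of the
idea, not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative model of a futex hash bucket with a waiter counter. */
struct bucket {
	atomic_int waiters;	/* models the proposed hb->waiters */
	/* ...the hb->lock and plist would live here in the kernel... */
};

/* Waiter side: bump the counter up front, mirroring the patch doing the
 * increment in queue_lock() before queue_me() adds to the plist. A
 * seq_cst fetch_add is a fully ordered RMW in C11. */
static void bucket_queue(struct bucket *b)
{
	atomic_fetch_add(&b->waiters, 1);
}

static void bucket_unqueue(struct bucket *b)
{
	atomic_fetch_sub(&b->waiters, 1);
}

/* Waker side: skip taking the bucket lock when nobody can be queued. */
static bool bucket_maybe_wake(struct bucket *b)
{
	if (atomic_load(&b->waiters) == 0)
		return false;	/* nothing to wake, lock avoided */
	/* ...take hb->lock and walk the plist in the real code... */
	return true;
}
```

The point of incrementing before the futex-value check, rather than at
plist insertion, is that a concurrent waker reading a nonzero counter
falls back to the locked path even for a waiter that is still between
the check and the plist_add().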

-Longman

Hart, Darren

Nov 23, 2013, 12:50:02 AM
On Fri, 2013-11-22 at 21:40 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > In futex_wake() there is clearly no point in taking the hb->lock if
> > we know beforehand that there are no tasks to be woken. This comes
> > at the smaller cost of doing some atomic operations to keep track of
> > the list's size. Specifically, increment the counter when an element is
> > added to the list, and decrement when it is removed. Of course, if the
> > counter is 0, then there are no tasks blocked on a futex. Some special
> > considerations:
> >
> > - increment the counter at queue_lock() as we always end up calling
> > queue_me() which adds the element to the list. Upon any error,
> > queue_unlock() is called for housekeeping, for which we decrement
> > to mach the increment done in queue_lock().
> >
> > - decrement the counter at __unqueue_me() to reflect when an element is
> > removed from the queue for wakeup related purposes.
>
> What is the problem you are trying to solve here?

Apologies, too quick on the trigger. I see plenty of detail in 0/5. Will
spend some time reviewing that.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

Darren Hart

Nov 23, 2013, 12:50:02 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
>
> - increment the counter at queue_lock() as we always end up calling
> queue_me() which adds the element to the list. Upon any error,
> queue_unlock() is called for housekeeping, for which we decrement
> to mach the increment done in queue_lock().
>
> - decrement the counter at __unqueue_me() to reflect when an element is
> removed from the queue for wakeup related purposes.

What is the problem you are trying to solve here?

The fast-path in futexes is not calling into the kernel at all, if
you've made the syscall, you've already lost. What kind of improvement
are you seeing by adding this optimization? Is it worth the introduction
of atomic operations into a file known to the state of California to
increase the risk of liver failure?

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart

Nov 23, 2013, 1:00:02 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> We have been dealing with a customer database workload on large
> 12Tb, 240 core 16 socket NUMA system that exhibits high amounts
> of contention on some of the locks that serialize internal futex
> data structures. This workload specially suffers in the wakeup
> paths, where waiting on the corresponding hb->lock can account for
> up to ~60% of the time. The result of such calls can mostly be
> classified as (i) nothing to wake up and (ii) wakeup large amount
> of tasks.

With as many cores as you have, have you done any analysis of how
effective the hashing algorithm is, and would more buckets relieve some
of the contention.... ah, I see below that you did. Nice work.

> Before these patches are applied, we can see this pathological behavior:
>
> 37.12% 826174 xxx [kernel.kallsyms] [k] _raw_spin_lock
> --- _raw_spin_lock
> |
> |--97.14%-- futex_wake
> | do_futex
> | sys_futex
> | system_call_fastpath
> | |
> | |--99.70%-- 0x7f383fbdea1f
> | | yyy
>
> 43.71% 762296 xxx [kernel.kallsyms] [k] _raw_spin_lock
> --- _raw_spin_lock
> |
> |--53.74%-- futex_wake
> | do_futex
> | sys_futex
> | system_call_fastpath
> | |
> | |--99.40%-- 0x7fe7d44a4c05
> | | zzz
> |--45.90%-- futex_wait_setup
> | futex_wait
> | do_futex
> | sys_futex
> | system_call_fastpath
> | 0x7fe7ba315789
> | syscall
>

Sorry to be dense, can you spell out how 60% falls out of these numbers?
Can you determine how much benefit comes from 3 and how much additional
benefit comes from 4?

>
> Patch 5 resurrects a two year old idea from Peter Zijlstra to delay
> the waking of the blocked tasks to be done without holding the hb->lock:
> https://lkml.org/lkml/2011/9/14/118
>
> This is useful for locking primitives that can effect multiple wakeups
> per operation and want to avoid the futex's internal spinlock contention by
> delaying the wakeups until we've released the hb->lock.
> This particularly deals with point (ii) above, where we can observe that
> in occasions the wake calls end up waking 125 to 200 waiters in what we believe
> are RW locks in the application.
>
> This patchset has also been tested on smaller systems for a variety of
> benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> hb locks programs. So far, no functional or performance regressions have been seen.
> Furthermore, no issues were found when running the different tests in the futextest
> suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

Excellent. Would you be able to contribute any of these (C only please)
to the stress test group?

>
> This patchset applies on top of Linus' tree as of v3.13-rc1.
>
> Special thanks to Scott Norton, Tom Vanden and Mark Ray for help presenting,
> debugging and analyzing the data.
>
> futex: Misc cleanups
> futex: Check for pi futex_q only once
> futex: Larger hash table
> futex: Avoid taking hb lock if nothing to wakeup
> sched,futex: Provide delayed wakeup list
>
> include/linux/sched.h | 41 ++++++++++++++++++
> kernel/futex.c | 113 +++++++++++++++++++++++++++-----------------------
> kernel/sched/core.c | 19 +++++++++
> 3 files changed, 122 insertions(+), 51 deletions(-)
>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Mike Galbraith

Nov 23, 2013, 1:40:01 AM
On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:

> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> > hb locks programs. So far, no functional or performance regressions have been seen.
> > Furthermore, no issues were found when running the different tests in the futextest
> > suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
>
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?

FWIW, I plugged this series into an rt kernel (extra raciness) and beat
it up a bit on a 64 core box too. Nothing fell out, nor did futextest
numbers change outside variance (poor box has 8 whole gig ram, single
numa node, so kinda crippled/wimpy, and not good box for benchmarking).

What concerned me most about the series was 5/5.. looks like a great
idea to me, but the original thread did not have a happy ending.

-Mike

Davidlohr Bueso

Nov 23, 2013, 1:40:01 AM
Hi Darren,

On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > We have been dealing with a customer database workload on large
> > 12Tb, 240 core 16 socket NUMA system that exhibits high amounts
> > of contention on some of the locks that serialize internal futex
> > data structures. This workload specially suffers in the wakeup
> > paths, where waiting on the corresponding hb->lock can account for
> > up to ~60% of the time. The result of such calls can mostly be
> > classified as (i) nothing to wake up and (ii) wakeup large amount
> > of tasks.
>
> With as many cores as you have, have you done any analysis of how
> effective the hashing algorithm is, and would more buckets relieve some
> of the contention.... ah, I see below that you did. Nice work.

> Sorry to be dense, can you spell out how 60% falls out of these numbers?

By adding the respective percentages of the futex_wake() _raw_spin_lock
calls.

> Can you determine how much benefit comes from 3 and how much additional
> benefit comes from 4?

While I don't have specific per-patch data, there are indications that
the workload mostly deals with a handful of futexes. So it's pretty safe
to assume that patch 4 is the one with the most benefit for _this_
particular workload.

>
> >
> > Patch 5 resurrects a two year old idea from Peter Zijlstra to delay
> > the waking of the blocked tasks to be done without holding the hb->lock:
> > https://lkml.org/lkml/2011/9/14/118
> >
> > This is useful for locking primitives that can effect multiple wakeups
> > per operation and want to avoid the futex's internal spinlock contention by
> > delaying the wakeups until we've released the hb->lock.
> > This particularly deals with point (ii) above, where we can observe that
> > in occasions the wake calls end up waking 125 to 200 waiters in what we believe
> > are RW locks in the application.
> >
> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including java workloads, kernel builds and custom bang-the-hell-out-of
> > hb locks programs. So far, no functional or performance regressions have been seen.
> > Furthermore, no issues were found when running the different tests in the futextest
> > suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
>
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?
>

Sure.

Thanks,
Davidlohr

Darren Hart

Nov 23, 2013, 1:40:01 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
This was added deliberately after adding said checks to the callers...
admittedly after a very long debug session I didn't ever want to repeat.
Sometimes warnings are added to make sure we caught everything and later
removed.... sometimes they are added to make sure nothing new ever
breaks this again. Since the failure scenario is non-obvious, unless
this is causing some significant performance issues for you, I'd prefer
this stays.

See commit aa10990e028cac3d5e255711fb9fb47e00700e35 for details.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart

Nov 23, 2013, 2:00:02 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> From: Jason Low <jason...@hp.com>
>
> - Remove unnecessary head variables.
> - Delete unused parameter in queue_unlock().
>
> Cc: Ingo Molnar <mi...@kernel.org>
> Cc: Darren Hart <dvh...@linux.intel.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <tg...@linutronix.de>
> Cc: Mike Galbraith <efa...@gmx.de>
> Cc: Jeff Mahoney <je...@suse.com>
> Cc: Linus Torvalds <torv...@linux-foundation.org>
> Cc: Scott Norton <scott....@hp.com>
> Cc: Tom Vaden <tom....@hp.com>
> Cc: Aswin Chandramouleeswaran <as...@hp.com>
> Cc: Waiman Long <Waima...@hp.com>
> Signed-off-by: Jason Low <jason...@hp.com>
> Signed-off-by: Davidlohr Bueso <davi...@hp.com>

My concern was that we saved off head for a reason, but the hb lock is
held in these cases, and the call is safe against removal, so I don't
see a problem. Thanks.

Reviewed-by: Darren Hart <dvh...@linux.intel.com>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart

Nov 23, 2013, 2:00:02 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
Something in me squirms at the #if/#else here, but I'll leave that to
the tip maintainers to call out if they are so inclined. Same below.
Otherwise, looks reasonable to me.

Reviewed-by: Darren Hart <dvh...@linux.intel.com>

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart

Nov 23, 2013, 2:30:01 AM
On Fri, 2013-11-22 at 19:19 -0800, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davi...@hp.com> wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> >
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
>
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock only if nobody else is trying to add itself to the
> list.

In your usage, the worst case scenario is that you detect 0 waiters when
taking the lock would instead have found a waiter. Correct?

In this case, you return 0, instead of 1 (or more).

This suggests to me a bug in the futextest testcase. Which test
specifically hung up waiting?

Futex hangs are almost always bad userspace code (my bad userspace code
in this case ;-)

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart

Nov 23, 2013, 2:30:02 AM
On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> In futex_wake() there is clearly no point in taking the hb->lock if
> we know beforehand that there are no tasks to be woken. This comes
> at the smaller cost of doing some atomic operations to keep track of
> the list's size. Specifically, increment the counter when an element is
> added to the list, and decrement when it is removed. Of course, if the
> counter is 0, then there are no tasks blocked on a futex. Some special
> considerations:
>
> - increment the counter at queue_lock() as we always end up calling
> queue_me() which adds the element to the list. Upon any error,
> queue_unlock() is called for housekeeping, for which we decrement
> to mach the increment done in queue_lock().

^match

>
> - decrement the counter at __unqueue_me() to reflect when an element is
> removed from the queue for wakeup related purposes.

__unqueue_futex (not __unqueue_me)


> @@ -999,6 +1001,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
> goto out;
>
> hb = hash_futex(&key);
> + /* make sure we really have tasks to wakeup */

Nit, but please use proper sentence formatting for consistency with the
rest of the comments in futex.c (most of them anyway).

/* Make sure we really have tasks to wake up. */

Now... I'm not thrilled with adding atomics if we don't need to,
especially for an optimization since the atomics themselves cause enough
problems, especially across a large number of CPUs... I'll respond to
Linus's thread though.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Peter Zijlstra

Nov 23, 2013, 6:50:01 AM
Right, from what I remember, this patch can cause spurious wakeups, and
while all our regular sleeping lock / wait thingies can deal with this,
not all creative schedule() usage in the tree can deal with this.

There's about ~1400 (or there were that many 2 years ago, might be more
by now) schedule() calls, many of which are open coded wait constructs
of which most are buggy in one way or another.

So we first need to audit / fix all those before we can do this one.

I used to have a patch to schedule() that would always immediately fall
through and only actually block on the second call; it illustrated the
problem really well, in fact so well the kernels fails to boot most
times.

Peter Zijlstra

Nov 23, 2013, 7:10:01 AM
> I used to have a patch to schedule() that would always immediately fall
> through and only actually block on the second call; it illustrated the
> problem really well, in fact so well the kernels fails to boot most
> times.

I found the below on my filesystem -- making it apply shouldn't be hard.
Making it work is the same effort as that patch you sent, we need to
guarantee all schedule() callers can deal with not actually sleeping --
aka. spurious wakeups.

I don't think anybody ever got that thing to run reliable enough to see
if the idea proposed in the patch made any difference to actual
workloads though.

---
Subject:
From: Peter Zijlstra <a.p.zi...@chello.nl>
Date: Thu Dec 09 17:51:09 CET 2010


Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Link: http://lkml.kernel.org/n/tip-v17vshx6ua...@git.kernel.org
---
include/linux/sched.h | 5 +++--
kernel/sched/core.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -153,9 +153,10 @@ print_cfs_rq(struct seq_file *m, int cpu
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
-#define TASK_STATE_MAX 512
+#define TASK_YIELD 512
+#define TASK_STATE_MAX 1024

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKW"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWY"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -931,6 +931,7 @@ void set_task_cpu(struct task_struct *p,
* ttwu() will sort out the placement.
*/
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
+ !(p->state & TASK_YIELD) &&
!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));

#ifdef CONFIG_LOCKDEP
@@ -2864,6 +2865,22 @@ static void __sched __schedule(void)
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
+ /*
+ * Provide an auto-yield feature on schedule().
+ *
+ * The thought is to avoid a sleep+wakeup cycle
+ * if simply yielding the cpu will suffice to
+ * satisfy the required condition.
+ *
+ * Assumes the calling schedule() site can deal
+ * with spurious wakeups.
+ */
+ if (prev->state & TASK_YIELD) {
+ prev->state &= ~TASK_YIELD;
+ if (rq->nr_running > 1)
+ goto no_deactivate;
+ }
+
deactivate_task(rq, prev, DEQUEUE_SLEEP);
prev->on_rq = 0;

@@ -2880,6 +2897,7 @@ static void __sched __schedule(void)
try_to_wake_up_local(to_wakeup);
}
}
+no_deactivate:
switch_count = &prev->nvcsw;

Thomas Gleixner

Nov 23, 2013, 8:20:02 AM
On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> On Fri, 2013-11-22 at 17:25 -0800, Linus Torvalds wrote:
> > On Fri, Nov 22, 2013 at 4:56 PM, Davidlohr Bueso <davi...@hp.com> wrote:
> > > In futex_wake() there is clearly no point in taking the hb->lock if
> > > we know beforehand that there are no tasks to be woken. This comes
> > > at the smaller cost of doing some atomic operations to keep track of
> > > the list's size.
> >
> > Hmm. Why? Afaik, you only care about "empty or not". And if you don't
> > need the serialization from locking, then afaik you can just do a
> > "plist_head_empty()" without holding the lock.
>
> I remember this being the original approach, but after noticing some
> strange behavior we quickly decided it wasn't the path. And sure enough,
> I just double checked and tried the patch without atomic ops and can see
> things being off: one of the futextest performance cases is stuck
> blocked on a futex and I couldn't reboot the machine either -- nothing
> apparent in dmesg, just not 100% functional. The thing is, we can only
> avoid taking the lock only if nobody else is trying to add itself to the
> list.

Right. The reason is:

CPU 0 CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

uval = *futex;
*futex = newval;

if (uval != val) { futex_wake();
spin_unlock(&hb->lock); if (plist_empty(hb))
return; return;
}

plist_add(hb, self);
spin_unlock(&hb->lock);
schedule();
...

So the waiter on CPU0 gets queued after the user space value changed
and the wakee on CPU1 returns because plist is empty. The waiter
therefore waits forever.

So with the atomic ops you are changing that to:

CPU 0 CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

atomic_inc(&hb->waiters);
uval = *futex;
*futex = newval;

if (uval != val) { futex_wake();
spin_unlock(&hb->lock); if (!atomic_read(&hb->waiters))
return; return;
}
spin_lock((&hb->lock);
plist_add(hb, self);
spin_unlock(&hb->lock);
schedule(); wakeup_waiters(hb);
...

which restores the ordering guarantee, which the hash bucket lock
provided so far.

Now the question is why we queue the waiter _AFTER_ reading the user
space value. The comment in the code is pretty nonsensical:

* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.

There is no reason why we cannot queue _BEFORE_ reading the user space
value. We just have to dequeue in all the error handling cases, but
for the fast path it does not matter at all.

CPU 0 CPU 1

val = *futex;
futex_wait(futex, val);

spin_lock(&hb->lock);

plist_add(hb, self);
smp_wmb();

uval = *futex;
*futex = newval;
futex_wake();

smp_rmb();
if (plist_empty(hb))
return;

spin_lock((&hb->lock);
if (uval != val) {
plist_del(self);
spin_unlock(&hb->lock);
return;
}

spin_unlock(&hb->lock);
schedule(); wakeup_waiters(hb);
...

Thanks,

tglx

Linus Torvalds

Nov 23, 2013, 10:50:02 PM
On Sat, Nov 23, 2013 at 5:16 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> Now the question is why we queue the waiter _AFTER_ reading the user
> space value. The comment in the code is pretty non sensical:
>
> * On the other hand, we insert q and release the hash-bucket only
> * after testing *uaddr. This guarantees that futex_wait() will NOT
> * absorb a wakeup if *uaddr does not match the desired values
> * while the syscall executes.
>
> There is no reason why we cannot queue _BEFORE_ reading the user space
> value. We just have to dequeue in all the error handling cases, but
> for the fast path it does not matter at all.
>
> CPU 0 CPU 1
>
> val = *futex;
> futex_wait(futex, val);
>
> spin_lock(&hb->lock);
>
> plist_add(hb, self);
> smp_wmb();
>
> uval = *futex;
> *futex = newval;
> futex_wake();
>
> smp_rmb();
> if (plist_empty(hb))
> return;
> ...

This would seem to be a nicer approach indeed, without needing the
extra atomics.

Davidlohr, mind trying Thomas' approach?

Linus

Davidlohr Bueso

Nov 24, 2013, 12:20:01 AM
I think a strong comment in wake_futex() warning about pi futex_qs would
do nowadays, but fair enough, I was kind of expecting this reply.

Thanks,
Davidlohr

Davidlohr Bueso

Nov 24, 2013, 12:20:01 AM
Yep, I think we can all agree that doing this optimization without atomic
ops is a big plus.

>
> Davidlohr, mind trying Thomas' approach?

I just took a quick look and it seems pretty straightforward, but not
without some details to consider. We basically have to redo/reorder
futex_wait_setup(), which checks that uval == val, and
futex_wait_queue_me(), which adds the task to the list and blocks. Now,
both futex_wait() and futex_wait_requeue_pi() have this logic, but since
we don't use futex_wake() to wake up tasks on pi futex_qs, I believe it's
ok to only change futex_wait(), while the order of the uval checking
doesn't matter for futex_wait_requeue_pi() so it can stay as is.

The following is the general skeleton of what Thomas is probably
referring to (yes, it needs comments, cleanups, refactoring, etc, etc).
Futexes are easy to screw up but at least this passes a kernel build and
all futextests on a large box.

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
int prio, ret = 0;
u32 uval;

if (!bitset)
return -EINVAL;
q.bitset = bitset;

if (abs_time) {
to = &timeout;

hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
CLOCK_REALTIME : CLOCK_MONOTONIC,
HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
hrtimer_set_expires_range_ns(&to->timer, *abs_time,
current->timer_slack_ns);
}
retry:
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_READ);
if (unlikely(ret != 0))
goto out;

retry_private:
hb = queue_lock(&q);
prio = min(current->normal_prio, MAX_RT_PRIO);

plist_node_init(&q.list, prio);
plist_add(&q.list, &hb->chain);
q.task = current;
/* do NOT drop the lock here */
smp_wmb();

ret = get_futex_value_locked(&uval, uaddr);
if (ret) {
plist_del(&q.list, &hb->chain);
spin_unlock(&hb->lock);

ret = get_user(uval, uaddr);
if (ret)
goto out_put_key;

if (!(flags & FLAGS_SHARED))
goto retry_private;

put_futex_key(&q.key);
goto retry;

}

if (uval != val) {
plist_del(&q.list, &hb->chain);
spin_unlock(&hb->lock);
ret = -EWOULDBLOCK;
goto out_put_key;
}

set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&hb->lock);

/* Arm the timer */
if (to) {
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
if (!hrtimer_active(&to->timer))
to->task = NULL;
}

/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q.list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!to || to->task)
freezable_schedule();
}
__set_current_state(TASK_RUNNING);


/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
/* unqueue_me() drops q.key ref */
if (!unqueue_me(&q))
goto out;
ret = -ETIMEDOUT;
if (to && !to->task)
goto out;

/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;

ret = -ERESTARTSYS;
if (!abs_time)
goto out;

restart = &current_thread_info()->restart_block;
restart->fn = futex_wait_restart;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = abs_time->tv64;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

ret = -ERESTART_RESTARTBLOCK;

out_put_key:
if (ret)
put_futex_key(&q.key);
out:
if (to) {
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
}
return ret;
}

Thanks,
Davidlohr

Davidlohr Bueso

Nov 24, 2013, 12:30:01 AM
On Sat, 2013-11-23 at 13:01 +0100, Peter Zijlstra wrote:
> > I used to have a patch to schedule() that would always immediately fall
> > through and only actually block on the second call; it illustrated the
> > problem really well, in fact so well the kernels fails to boot most
> > times.
>
> I found the below on my filesystem -- making it apply shouldn't be hard.
> Making it work is the same effort as that patch you sent, we need to
> guarantee all schedule() callers can deal with not actually sleeping --
> aka. spurious wakeups.

Thanks, I'll definitely try the patch and see what comes up.

>
> I don't think anybody ever got that thing to run reliable enough to see
> if the idea proposed in the patch made any difference to actual
> workloads though.

Since your idea can also be applied to sysv sems (patch 3/3 back then),
I can definitely do some Oracle runs which IIRC, also likes doing
multiple wakeups at once. In any case this patch deals very nicely with
our customer workload, which is why I believe it's particularly good
here.

Thanks,
Davidlohr

Thomas Gleixner

Nov 25, 2013, 7:10:03 AM
There is no mechanism which prevents a futex_wake() call on the inner
futex of the wait_requeue_pi mechanism. So no, we have to change both.

Futexes are no place for belief. Either you understand them
completely or you just leave them alone.

Thanks,

tglx

Thomas Gleixner

Nov 25, 2013, 11:30:02 AM
I went through the issue with Peter and he noticed, that we need
smp_mb() in both places. That's what we have right now with the
spin_lock() and it is required as we need to guarantee that

The waiter observes the change to the uaddr value after it added
itself to the plist

The waker observes plist not empty if the change to uaddr was made
after the waiter checked the value.


write(plist) | write(futex_uaddr)
mb() | mb()
read(futex_uaddr) | read(plist)

The spin_lock mb() on the waiter side does not help here because it
happens before the write(plist) and not after it.

Thanks,

tglx

Peter Zijlstra

Nov 25, 2013, 11:40:02 AM
Ah, note that spin_lock() is only a smp_mb() on x86, in general its an
ACQUIRE barrier which is weaker than a full mb and will not suffice in
this case even it if were in the right place.

Thomas Gleixner

Nov 25, 2013, 12:40:01 PM
So now the question is whether this lockless empty check optimization
which seems to be quite nice on x86 for a particular workload will
have any negative side effects on other architectures.

If the smp_mb() is heavyweight, then it will hurt massively in the
case where the hash bucket is not empty, because we add the price of
the smp_mb() for no gain.

In that context it would also be helpful to measure the overhead on
x86 for the !empty case.

Thanks,

tglx

Peter Zijlstra

Nov 25, 2013, 12:40:02 PM
On Mon, Nov 25, 2013 at 06:32:55PM +0100, Thomas Gleixner wrote:
> In that context it would also be helpful to measure the overhead on
> x86 for the !empty case.

Yes, because mfence is by no means a cheap instruction on x86.

Davidlohr Bueso

Nov 25, 2013, 2:00:02 PM
Absolutely, I will add these comparisons. If we do notice that we end up
hurting the !empty case, would the current patch using atomic ops still
be considered? We have made sure that none of the changes in this set
affects performance on other workloads/smaller systems.

Thanks,
Davidlohr

Thomas Gleixner

Nov 25, 2013, 2:50:03 PM
On Sat, 23 Nov 2013, Thomas Gleixner wrote:
> On Fri, 22 Nov 2013, Davidlohr Bueso wrote:
> So with the atomic ops you are changing that to:
>
> CPU 0 CPU 1
>
> val = *futex;
> futex_wait(futex, val);
>
> spin_lock(&hb->lock);
>
> atomic_inc(&hb->waiters);
> uval = *futex;
> *futex = newval;
>
> if (uval != val) { futex_wake();
> spin_unlock(&hb->lock); if (!atomic_read(&hb->waiters))
> return; return;
> }
> spin_lock((&hb->lock);
> plist_add(hb, self);
> spin_unlock(&hb->lock);
> schedule(); wakeup_waiters(hb);
> ...
>
> which restores the ordering guarantee, which the hash bucket lock
> provided so far.

Actually that's not true by design, it just happens to work.

atomic_inc() on x86 is a "lock incl".

The LOCK prefix just guarantees that the cache line which is affected
by the INCL is locked. And it guarantees that locked operations
serialize all outstanding load and store operations.

But Documentation/atomic_ops.txt says about atomic_inc():

"One very important aspect of these two routines is that they DO NOT
require any explicit memory barriers. They need only perform the
atomic_t counter update in an SMP safe manner."

So while this has a barrier on x86, it's not guaranteed.

atomic_read() is a simple read.

This does not guarantee anything. And if you read
Documentation/atomic_ops.txt it's well documented:

"*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"


So now your code melts down to:

write(hb->waiters) | write(uaddr)
mb | read(hb->waiters)
read(uaddr)

I fear you simply managed to make the window small enough that your
testing was no longer able to expose it.

Thomas Gleixner

Nov 25, 2013, 3:00:02 PM
Please read my last reply to the atomic ops approach.

Aside of that we need numbers for a significant range of !x86.

Thanks,

tglx

Darren Hart

Nov 25, 2013, 3:10:01 PM
But it is guaranteed to be "in an SMP safe manner"... which I guess just
means that two writes will not intermix bytes, but no guarantee that the
value will be visible to other CPUs unless some kind of barrier is
explicitly imposed.

Correct?


> atomic_read() is a simple read.
>
> This does not guarantee anything. And if you read
> Documentation/atomic_ops.txt it's well documented:
>
> "*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***"
>
>
> So now your code melts down to:
>
> write(hb->waiters) | write(uaddr)
> mb | read(hb->waiters)
> read(uaddr)
>
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.

Does seem to be the case.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Thomas Gleixner

Nov 25, 2013, 3:30:02 PM
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > > which restores the ordering guarantee, which the hash bucket lock
> > > provided so far.
> >
> > Actually that's not true by design, it just happens to work.
> >
> > atomic_inc() on x86 is a "lock incl".
> >
> > The LOCK prefix just guarantees that the cache line which is affected
> > by the INCL is locked. And it guarantees that locked operations
> > serialize all outstanding load and store operations.
> >
> > But Documentation/atomic_ops.txt says about atomic_inc():
> >
> > "One very important aspect of these two routines is that they DO NOT
> > require any explicit memory barriers. They need only perform the
> > atomic_t counter update in an SMP safe manner."
> >
> > So while this has a barrier on x86, it's not guaranteed.
>
>
> But it is guaranteed to be "in an SMP safe manner"... which I guess just
> means that two writes will not intermix bytes, but no guarantee that the
> value will be visible to other CPUs unless some kind of barrier is
> explicitly imposed.
>
> Correct?

Yep, that's what it says.

> > So now your code melts down to:
> >
> > write(hb->waiters) | write(uaddr)
> > mb | read(hb->waiters)
> > read(uaddr)
> >
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
>
> Does seem to be the case.

You must be aware that between the write(uaddr) and the
read(hb->waiters) there is a syscall, i.e. a user/kernel space
transition.

sysenter/syscall have no documented barriers inside, but we don't know
whether the actual hw implementation provides one, or whether the
memory ops between modifying uaddr and reaching the read(hb->waiters)
point, including the real atomic op on the waiter side, are good
enough to paper over the issue.

Thanks,

tglx

Thomas Gleixner

Nov 26, 2013, 9:00:01 AM
On Mon, 25 Nov 2013, Darren Hart wrote:
> On Mon, 2013-11-25 at 20:47 +0100, Thomas Gleixner wrote:
> > So now your code melts down to:
> >
> > write(hb->waiters) | write(uaddr)
> > mb | read(hb->waiters)
> > read(uaddr)
> >
> > I fear you simply managed to make the window small enough that your
> > testing was no longer able to expose it.
>
> Does seem to be the case.

Actually not. It's protected by an atomic_inc() between the
write(uaddr) and read(hb->waiters) on the waker side.

It took me a while to realize that get_futex_key_refs() is providing
the protection inadvertently. But as I explained, we cannot rely on
that for all architectures.

Thanks,

tglx