
RE: [PATCH] smp_call_function_many SMP race


Anton Blanchard

Jan 11, 2011, 11:07:55 PM
to xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, pet...@infradead.org, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, mil...@bga.com, be...@kernel.crashing.org, linux-...@vger.kernel.org

Hi,

I managed to forget all about this bug, probably because of how much it
makes my brain hurt.

The issue is not that we use RCU, but that we use RCU on a static data
structure that gets reused without waiting for an RCU grace period.
Another way to solve this bug would be to dynamically allocate the
structure, assuming we are OK with the overhead.
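
To make that alternative concrete, here is a minimal sketch (not the fix
pursued in this thread) of a per-call, RCU-freed block. The fields mirror
call_function_data as discussed below; the rcu_head member, the cfd_dyn_*
names and the GFP_ATOMIC choice are assumptions of this sketch, and the
queue handling is elided:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

/* Sketch only: per-call data block, freed via RCU instead of being reused. */
struct cfd_dyn {
        struct call_single_data csd;            /* func, info, list entry */
        atomic_t                refs;
        cpumask_var_t           cpumask;
        struct rcu_head         rcu;            /* added for the deferred free */
};

static void cfd_dyn_free(struct rcu_head *head)
{
        struct cfd_dyn *data = container_of(head, struct cfd_dyn, rcu);

        free_cpumask_var(data->cpumask);
        kfree(data);
}

/* Sender: allocate a fresh block for every call instead of reusing cfd_data. */
static struct cfd_dyn *cfd_dyn_alloc(void)
{
        struct cfd_dyn *data = kzalloc(sizeof(*data), GFP_ATOMIC);

        if (data && !zalloc_cpumask_var(&data->cpumask, GFP_ATOMIC)) {
                kfree(data);
                data = NULL;
        }
        return data;
}

/*
 * Interrupt handler, last cpu to drop a reference: defer the free for an
 * RCU grace period, so a cpu still walking a stale call_function.queue can
 * never see this block being reinitialised for a new call.  (Removal from
 * the queue under call_function.lock is omitted from this sketch.)
 */
static void cfd_dyn_put(struct cfd_dyn *data)
{
        if (atomic_dec_return(&data->refs) == 0)
                call_rcu(&data->rcu, cfd_dyn_free);
}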

Anton


From: Anton Blanchard <an...@samba.org>

I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
continue;

data->csd.func(data->csd.info);

refs = atomic_dec_return(&data->refs);
WARN_ON(refs < 0); <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the
number of cpus left (ie refs) was 0. How can this be?

It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
(generic-ipi: make struct call_function_data lockless)
is at fault. It removes locking from smp_call_function_many and in
doing so creates a rather complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
smp_call_function_many.

Imagine a scenario where CPU A does two smp_call_functions back to
back, and CPU B does an smp_call_function in between. We concentrate on
how CPU C handles the calls:


CPU A: smp_call_function (first call)

CPU C: smp_call_function_interrupt
         walks call_function.queue, sees CPU A's data on the list

CPU B: smp_call_function

CPU C: smp_call_function_interrupt
         walks call_function.queue, sees (stale) CPU A data on the list

CPU A: smp_call_function (second call)
         reuses the percpu *data, sets data->cpumask

CPU C: sees and clears its bit in data->cpumask!
         sees data->refs is 0!

CPU A: sets data->refs (too late!)


The important thing to note is that since the interrupt handler walks a
potentially stale call_function.queue without any locking, another
cpu can view the percpu *data structure at any time, even while the
owner is in the process of initialising it.
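
(For readers without the source handy, the data being raced on looks roughly
like this; the sketch paraphrases kernel/smp.c of this era and is
illustrative only -- the field names are the ones used throughout this
thread, but see the real file for the authoritative definitions.)

/* Illustrative paraphrase only -- see kernel/smp.c for the real code. */
struct call_single_data {
        struct list_head        list;           /* entry on call_function.queue */
        void                    (*func)(void *info);
        void                    *info;
        u16                     flags;          /* CSD_FLAG_LOCK lives here */
};

struct call_function_data {
        struct call_single_data csd;
        atomic_t                refs;           /* cpus that still have to run func */
        cpumask_var_t           cpumask;        /* cpus asked to run func */
};

/* One block per cpu, reused by every smp_call_function_many() it issues. */
static DEFINE_PER_CPU(struct call_function_data, cfd_data);

/* Global queue, walked locklessly (RCU list walk) by the IPI handler. */
static struct {
        struct list_head        queue;
        raw_spinlock_t          lock;
} call_function;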

The following test case hits the WARN_ON 100% of the time on my PowerPC
box (having 128 threads does help :)


#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
        int i;

        for (i = 0; i < ITERATIONS; i++)
                smp_call_function(do_nothing_ipi, NULL, 1);

        printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                INIT_WORK(&work[cpu], do_ipis);
                schedule_work_on(cpu, &work[cpu]);
        }

        return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");


I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case but Paul McKenney was able
to spot my bug thankfully :) To ensure we aren't viewing previous
iterations the interrupt handler needs to read ->refs then ->cpumask
then ->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this
issue.

---

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2010-12-22 17:19:11.262835785 +1100
+++ linux-2.6/kernel/smp.c 2011-01-12 15:03:08.793324402 +1100
@@ -194,6 +194,31 @@ void generic_smp_call_function_interrupt
list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
int refs;

+ /*
+ * Since we walk the list without any locks, we might
+ * see an entry that was completed, removed from the
+ * list and is in the process of being reused.
+ *
+ * Just checking data->refs then data->cpumask is not good
+ * enough because we could see a non zero data->refs from a
+ * previous iteration. We need to check data->refs, then
+ * data->cpumask then data->refs again. Talk about
+ * complicated!
+ */
+
+ if (atomic_read(&data->refs) == 0)
+ continue;
+
+ smp_rmb();
+
+ if (!cpumask_test_cpu(cpu, data->cpumask))
+ continue;
+
+ smp_rmb();
+
+ if (atomic_read(&data->refs) == 0)
+ continue;
+
if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
continue;

@@ -458,6 +483,14 @@ void smp_call_function_many(const struct
data->csd.info = info;
cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);
+
+ /*
+ * To ensure the interrupt handler gets an up to date view
+ * we order the cpumask and refs writes and order the
+ * read of them in the interrupt handler.
+ */
+ smp_wmb();
+
atomic_set(&data->refs, cpumask_weight(data->cpumask));

raw_spin_lock_irqsave(&call_function.lock, flags);

Peter Zijlstra

Jan 17, 2011, 1:17:39 PM
to Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, mil...@bga.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Wed, 2011-01-12 at 15:07 +1100, Anton Blanchard wrote:

> I managed to forget all about this bug, probably because of how much it
> makes my brain hurt.

Agreed.


> I tried to fix it by ordering the read and the write of ->cpumask and
> ->refs. In doing so I missed a critical case but Paul McKenney was able
> to spot my bug thankfully :) To ensure we aren't viewing previous
> iterations the interrupt handler needs to read ->refs then ->cpumask
> then ->refs _again_.

> ---


>
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c 2010-12-22 17:19:11.262835785 +1100
> +++ linux-2.6/kernel/smp.c 2011-01-12 15:03:08.793324402 +1100
> @@ -194,6 +194,31 @@ void generic_smp_call_function_interrupt
> list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> int refs;
>
> + /*
> + * Since we walk the list without any locks, we might
> + * see an entry that was completed, removed from the
> + * list and is in the process of being reused.
> + *
> + * Just checking data->refs then data->cpumask is not good
> + * enough because we could see a non zero data->refs from a
> + * previous iteration. We need to check data->refs, then
> + * data->cpumask then data->refs again. Talk about
> + * complicated!
> + */
> +
> + if (atomic_read(&data->refs) == 0)
> + continue;
> +

So here we might see the old ref

> + smp_rmb();
> +
> + if (!cpumask_test_cpu(cpu, data->cpumask))
> + continue;

Here we might see the new cpumask

> + smp_rmb();
> +
> + if (atomic_read(&data->refs) == 0)
> + continue;
> +

But then still see a 0 ref, at which point we skip this entry and rely
on the fact that arch_send_call_function_ipi_mask() will simply latch
our IPI line and cause a back-to-back IPI such that we can process the
data on the second go-round?

> if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
> continue;

And finally, once we observe a valid ->refs, do we test the ->cpumask
again so we cross with the store order (->cpumask first, then ->refs).

> @@ -458,6 +483,14 @@ void smp_call_function_many(const struct
> data->csd.info = info;
> cpumask_and(data->cpumask, mask, cpu_online_mask);
> cpumask_clear_cpu(this_cpu, data->cpumask);
> +
> + /*
> + * To ensure the interrupt handler gets an up to date view
> + * we order the cpumask and refs writes and order the
> + * read of them in the interrupt handler.
> + */
> + smp_wmb();
> +
> atomic_set(&data->refs, cpumask_weight(data->cpumask));
>
> raw_spin_lock_irqsave(&call_function.lock, flags);

Read side:                        Write side:

list_for_each_rcu()
  !->refs, continue               ->cpumask =
  rmb                             wmb
  !->cpumask, continue            ->refs =
  rmb                             wmb
  !->refs, continue               list_add_rcu()
  mb
  !->cpumask, continue

Wouldn't something like:

Read side:                        Write side:

list_for_each_rcu()
  !->cpumask, continue            ->refs =
  rmb                             wmb
  !->refs, continue               ->cpumask =
  mb                              wmb
  !->cpumask, continue            list_add_rcu()


Suffice? There we can observe the old ->cpumask, new ->refs and new
->cpumask in crossed order, so we filter out the old, and cross the new,
and have one rmb and conditional less.

Or am I totally missing something here... like I said, this stuff hurts
brains.

Milton Miller

Jan 18, 2011, 4:05:45 PM
to Peter Zijlstra, Anton Blanchard, Paul E. McKenney, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, mil...@bga.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, 17 Jan 2011 around 19:17:33 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-12 at 15:07 +1100, Anton Blanchard wrote:
>
> > I managed to forget all about this bug, probably because of how much it
> > makes my brain hurt.
>
> Agreed.
>
>
> > I tried to fix it by ordering the read and the write of ->cpumask and
> > ->refs. In doing so I missed a critical case but Paul McKenney was able
> > to spot my bug thankfully :) To ensure we aren't viewing previous
> > iterations the interrupt handler needs to read ->refs then ->cpumask
> > then ->refs _again_.
>
> > ---
> >
> > Index: linux-2.6/kernel/smp.c
> > =====================================================================

Yes, I believe it does. Paul found the race case after I went home,
and I found the resulting patch with the extra calls posted the next
morning. When I tried to raise the issue, Paul said he wanted me to
try it on hardware before he did the analysis again. I finally got a
machine to do that yesterday afternoon.


>
> Suffice? There we can observe the old ->cpumask, new ->refs and new
> ->cpumask in crossed order, so we filter out the old, and cross the new,
> and have one rmb and conditional less.
>
> Or am I totally missing something here,.. like said, this stuff hurts
> brains.
>
>

In fact, if we assert that the called function is not allowed to enable
interrupts, then we can consolidate both writes to be after the function
is executed instead of doing one atomic read/write before (for the
cpumask bit) and a second one after (for the ref count). I ran the
timings on Anton's test case on a 4-node 256 thread power7 box.

What follows is a two-patch set. I am keeping the second
separate in case some weird function is enabling interrupts in the
middle of this.

milton

Milton Miller

Jan 18, 2011, 4:06:29 PM
to Anton Blanchard, Peter Zijlstra, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, mil...@bga.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
We have to test the cpu mask in the interrupt handler before checking the
refs, otherwise we can start to follow an entry before it is deleted and
find it partially initialized for the next trip. Presently we also clear
the cpumask bit before executing the called function, which implies
getting write access to the line. After the function is called we then
decrement refs, and if they go to zero we then unlock the structure.

However, this implies getting write access to the call function data
before and after the function is called. If we can assert
that no smp_call_function execution function is allowed to enable
interrupts, then we can move both writes to after the function is
called, hopefully allowing both writes with one cache line bounce.

On a 256 thread system with a kernel compiled for 1024 threads, the
time to execute the test case in the "smp_call_function_many race"
changelog was reduced by about 30-40ms out of about 545 ms.

Signed-off-by: Milton Miller <mil...@bga.com>
---

I decided to keep this as WARN because it's now a buggy function, even
though the stack trace is of no value -- a simple printk would give
us the information needed.

Raw data:

without patch
ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
ipi_test startup took 21245824ns complete 530280180ns total 551526004ns

with patch
ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
ipi_test startup took 6789954ns complete 493388112ns total 500178066ns

#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>        /* sched clock */

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
        int i;

        for (i = 0; i < ITERATIONS; i++)
                smp_call_function(do_nothing_ipi, NULL, 1);

        printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
        int cpu;
        u64 start, started, done;

        start = local_clock();

        for_each_online_cpu(cpu) {
                INIT_WORK(&work[cpu], do_ipis);
                schedule_work_on(cpu, &work[cpu]);
        }

        started = local_clock();
        for_each_online_cpu(cpu)
                flush_work(&work[cpu]);
        done = local_clock();

        pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
                started-start, done-started, done-start);

        return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

---
kernel/smp.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-01-17 20:16:18.000000000 -0600
+++ common/kernel/smp.c 2011-01-17 20:17:50.000000000 -0600
@@ -193,6 +193,7 @@ void generic_smp_call_function_interrupt
*/


list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
int refs;

+ void (*func) (void *info);

/*


* Since we walk the list without any locks, we might

@@ -212,24 +213,32 @@ void generic_smp_call_function_interrupt


if (atomic_read(&data->refs) == 0)

continue;

- if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
- continue;
-
+ func = data->csd.func; /* for later warn */
data->csd.func(data->csd.info);

+ /*
+ * If the cpu mask is not still set then it enabled interrupts,
+ * we took another smp interrupt, and executed the function
+ * twice on this cpu. In theory that copy decremented refs.
+ */
+ if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) {
+ WARN(1, "%pS enabled interrupts and double executed\n",
+ func);
+ continue;
+ }
+


refs = atomic_dec_return(&data->refs);
WARN_ON(refs < 0);

- if (!refs) {
- WARN_ON(!cpumask_empty(data->cpumask));
-
- raw_spin_lock(&call_function.lock);
- list_del_rcu(&data->csd.list);
- raw_spin_unlock(&call_function.lock);
- }

if (refs)
continue;

+ WARN_ON(!cpumask_empty(data->cpumask));
+
+ raw_spin_lock(&call_function.lock);
+ list_del_rcu(&data->csd.list);
+ raw_spin_unlock(&call_function.lock);
+
csd_unlock(&data->csd);

Milton Miller

Jan 18, 2011, 4:07:31 PM
to Anton Blanchard, Peter Zijlstra, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, mil...@bga.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
From: Anton Blanchard <an...@samba.org>

I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
continue;

data->csd.func(data->csd.info);

refs = atomic_dec_return(&data->refs);
WARN_ON(refs < 0); <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the
number of cpus left (ie refs) was 0. How can this be?

It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
(generic-ipi: make struct call_function_data lockless)
is at fault. It removes locking from smp_call_function_many and in
doing so creates a rather complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
smp_call_function_many.

Imagine a scenario where CPU A does two smp_call_functions back to
back, and CPU B does an smp_call_function in between. We concentrate on
how CPU C handles the calls:


CPU A: smp_call_function (first call)

CPU C: smp_call_function_interrupt
         walks call_function.queue, sees data from CPU A on the list

CPU B: smp_call_function

CPU C: smp_call_function_interrupt
         walks call_function.queue, sees (stale) CPU A data on the list

CPU D: smp_call_function_interrupt
         clears the last ref on CPU A's data
         list_del_rcu, unlock

CPU A: smp_call_function (second call)
         reuses percpu *data A

CPU C: sees and clears its bit in data->cpumask
         might be using old or new fn!
         decrements refs below 0


#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

return 0;
}

static void __exit testcase_exit(void)
{
}

I tried to fix it by ordering the read and the write of ->cpumask and
->refs. In doing so I missed a critical case but Paul McKenney was able
to spot my bug thankfully :) To ensure we aren't viewing previous
iterations the interrupt handler needs to read ->refs then ->cpumask
then ->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this
issue.

[Milton Miller: add WARN_ON and BUG_ON, remove extra read of refs before
initial read of mask that doesn't help (also noted by Peter Zijlstra),
adjust comments, hopefully clarify scenario]

Signed-off-by: Anton Blanchard <an...@samba.org>
Revised-by: Milton Miller <mil...@bga.com> [ removed excess tests ]
Signed-off-by: Milton Miller <mil...@bga.com>
Cc: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>
Cc: sta...@kernel.org # 2.6.32 and later

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-01-17 20:15:54.000000000 -0600
+++ common/kernel/smp.c 2011-01-17 20:16:18.000000000 -0600
@@ -194,6 +194,24 @@ void generic_smp_call_function_interrupt


list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
int refs;

+ /*
+ * Since we walk the list without any locks, we might
+ * see an entry that was completed, removed from the
+ * list and is in the process of being reused.
+ *

+ * We must check that the cpu is in the cpumask before
+ * checking the refs, and both must be set before
+ * executing the callback on this cpu.
+ */
+


+ if (!cpumask_test_cpu(cpu, data->cpumask))
+ continue;

+


+ smp_rmb();
+
+ if (atomic_read(&data->refs) == 0)
+ continue;
+

if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
continue;

@@ -202,6 +220,8 @@ void generic_smp_call_function_interrupt


refs = atomic_dec_return(&data->refs);
WARN_ON(refs < 0);

if (!refs) {
+ WARN_ON(!cpumask_empty(data->cpumask));
+
raw_spin_lock(&call_function.lock);
list_del_rcu(&data->csd.list);
raw_spin_unlock(&call_function.lock);
@@ -453,11 +473,21 @@ void smp_call_function_many(const struct

data = &__get_cpu_var(cfd_data);
csd_lock(&data->csd);
+ BUG_ON(atomic_read(&data->refs) || !cpumask_empty(data->cpumask));

data->csd.func = func;


data->csd.info = info;
cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);
+
+ /*

+ * To ensure the interrupt handler gets an complete view
+ * we order the cpumask and refs writes and order the read
+ * of them in the interrupt handler. In addition we may
+ * only clear our own cpu bit from the mask.


+ */
+ smp_wmb();
+
atomic_set(&data->refs, cpumask_weight(data->cpumask));

raw_spin_lock_irqsave(&call_function.lock, flags);

Andrew Morton

Jan 19, 2011, 7:43:23 PM
to Milton Miller, Anton Blanchard, Peter Zijlstra, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 18 Jan 2011 15:07:25 -0600
Milton Miller <mil...@bga.com> wrote:

> I noticed a failure where we hit the following WARN_ON in
> generic_smp_call_function_interrupt:
>
> if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
> continue;
>
> data->csd.func(data->csd.info);
>
> refs = atomic_dec_return(&data->refs);
> WARN_ON(refs < 0); <-------------------------
>
> We atomically tested and cleared our bit in the cpumask, and yet the
> number of cpus left (ie refs) was 0. How can this be?
>
> It turns out commit 54fdade1c3332391948ec43530c02c4794a38172
> (generic-ipi: make struct call_function_data lockless)
> is at fault. It removes locking from smp_call_function_many and in
> doing so creates a rather complicated race.

I've been waving https://bugzilla.kernel.org/show_bug.cgi?id=23042 at
the x86 guys for a while now, to no avail. Do you think you just fixed
it?

Peter Zijlstra

Jan 27, 2011, 11:22:14 AM
to Milton Miller, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-01-18 at 15:06 -0600, Milton Miller wrote:

> Index: common/kernel/smp.c
> ===================================================================
> --- common.orig/kernel/smp.c 2011-01-17 20:16:18.000000000 -0600
> +++ common/kernel/smp.c 2011-01-17 20:17:50.000000000 -0600
> @@ -193,6 +193,7 @@ void generic_smp_call_function_interrupt
> */
> list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
> int refs;
> + void (*func) (void *info);
>
> /*
> * Since we walk the list without any locks, we might
> @@ -212,24 +213,32 @@ void generic_smp_call_function_interrupt
> if (atomic_read(&data->refs) == 0)
> continue;
>

> + func = data->csd.func; /* for later warn */
> data->csd.func(data->csd.info);
>
> + /*
> + * If the cpu mask is not still set then it enabled interrupts,
> + * we took another smp interrupt, and executed the function
> + * twice on this cpu. In theory that copy decremented refs.
> + */
> + if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) {
> + WARN(1, "%pS enabled interrupts and double executed\n",
> + func);
> + continue;
> + }
> +
> refs = atomic_dec_return(&data->refs);
> WARN_ON(refs < 0);
>

> if (refs)
> continue;
>
> + WARN_ON(!cpumask_empty(data->cpumask));
> +
> + raw_spin_lock(&call_function.lock);
> + list_del_rcu(&data->csd.list);
> + raw_spin_unlock(&call_function.lock);
> +
> csd_unlock(&data->csd);
> }
>


So after this we have:

Read side:                        Write side:

list_for_each_entry_rcu()
  !->cpumask                      ->cpumask =
  rmb                             wmb
  !->refs                         ->refs =
  ->func()                        wmb
  mb                              list_add_rcu()
  ->refs--
  if (!->refs)
    list_del_rcu()

So even if we see it via an old ref, when we see a valid cpumask and a valid
ref, we execute the function, clear our cpumask bit, decrement the ref
and delete the entry, even though it might not yet be added?


Write side:                       Read side (stale walker, old ref):

                                  (old-ref)
->cpumask =
                                  if (!->cpumask)
->refs =
                                  if (!->refs)
                                  ->func()
                                  ->refs--
                                  if (!->refs)
                                    list_del_rcu()
list_add_rcu()

Then what happens?

Milton Miller

Jan 27, 2011, 4:59:38 PM
to Peter Zijlstra, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Thu, 27 Jan 2011 about 17:22:40 +0100, Peter Zijlstra wrote:
> On Tue, 2011-01-18 at 15:06 -0600, Milton Miller wrote:
> > Index: common/kernel/smp.c
> > =====================================================================

The item remains on the queue unlocked. If we are lucky, all entries
after it are serviced before we try to reuse it. If not, when we
do the list_add_rcu while it is still on the list, all entries after it
get lost from the service queue and will hang.

This is a further scenario exposed by the lock removal. Good spotting.

A simple fix would be to add to the queue before setting refs. I can
try to work on that tomorrow, hopefully the test machine is available.

The next question is can/should we use the lock & unlock to put the
entry on the queue in place of the wmb() between setting cpu mask and
setting refs? We still need the rmb() in _interrupt.

milton

Milton Miller

Jan 28, 2011, 7:20:36 PM
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org

Peter pointed out there was nothing preventing the list_del_rcu in
smp_call_function_interrupt from running before the list_add_rcu in
smp_call_function_many. Fix this by not setting refs until we have put
the entry on the list. We can use the lock acquire and release instead
of a wmb.

Reported-by: Peter Zijlstra <pet...@infradead.org>


Signed-off-by: Milton Miller <mil...@bga.com>
Cc: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>

---

I tried to force this race with a udelay before the lock & list_add and
by mixing all 64 online cpus with just 3 random cpus in the mask, but
was unsuccessful. Still, it seems to be a valid race, and the fix
is a simple change to the current code.
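
(Purely for illustration, the forcing hack described above amounts to
something like the following fragment of the unpatched
smp_call_function_many(), with udelay() from <linux/delay.h>; the delay
value and exact placement here are guesses, not part of any posted patch.)

        smp_wmb();
        atomic_set(&data->refs, cpumask_weight(data->cpumask));

        /*
         * Debug only: widen the window in which another cpu, still walking a
         * stale call_function.queue, can see the new refs, run the function,
         * drop the last ref and list_del our entry before we list_add it.
         */
        udelay(100);

        raw_spin_lock_irqsave(&call_function.lock, flags);
        list_add_rcu(&data->csd.list, &call_function.queue);
        raw_spin_unlock_irqrestore(&call_function.lock, flags);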

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-01-28 16:23:15.000000000 -0600
+++ common/kernel/smp.c 2011-01-28 16:55:01.000000000 -0600
@@ -491,15 +491,17 @@ void smp_call_function_many(const struct
cpumask_clear_cpu(this_cpu, data->cpumask);

/*
- * To ensure the interrupt handler gets an complete view
- * we order the cpumask and refs writes and order the read
- * of them in the interrupt handler. In addition we may
- * only clear our own cpu bit from the mask.
+ * We reuse the call function data without waiting for any grace
+ * period after some other cpu removes it from the global queue.
+ * This means a cpu might find our data block as it is writen.
+ * The interrupt handler waits until it sees refs filled out
+ * while its cpu mask bit is set; here we may only clear our
+ * own cpu mask bit, and must wait to set refs until we are sure
+ * previous writes are complete and we have obtained the lock to
+ * add the element to the queue. We use the acquire and release
+ * of the lock as a wmb() -- acquire prevents write moving up and
+ * release requires old writes are visible.
*/
- smp_wmb();
-
- atomic_set(&data->refs, cpumask_weight(data->cpumask));
-
raw_spin_lock_irqsave(&call_function.lock, flags);
/*
* Place entry at the _HEAD_ of the list, so that any cpu still
@@ -509,6 +511,8 @@ void smp_call_function_many(const struct
list_add_rcu(&data->csd.list, &call_function.queue);
raw_spin_unlock_irqrestore(&call_function.lock, flags);



+ atomic_set(&data->refs, cpumask_weight(data->cpumask));

+
/*
* Make the list addition visible before sending the ipi.
* (IPIs must obey or appear to obey normal Linux cache

Mike Galbraith

Jan 31, 2011, 2:21:44 AM
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Fri, 2011-01-28 at 18:20 -0600, Milton Miller wrote:
> Peter pointed out there was nothing preventing the list_del_rcu in
> smp_call_function_interrupt from running before the list_add_rcu in
> smp_call_function_many. Fix this by not setting refs until we have put
> the entry on the list. We can use the lock acquire and release instead
> of a wmb.

Wondering if a final sanity check makes sense. I've got a perma-spin
bug where what the comment below describes apparently happened: another
CPU diddling the mask may make this CPU do horrible things to itself as
it's setting up to IPI others with that mask.

---
kernel/smp.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux-2.6.38.git/kernel/smp.c
===================================================================
--- linux-2.6.38.git.orig/kernel/smp.c
+++ linux-2.6.38.git/kernel/smp.c
@@ -490,6 +490,9 @@ void smp_call_function_many(const struct


cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);

+ /* Did you pass me a mask that can be changed/emptied under me? */
+ BUG_ON(cpumask_empty(data->cpumask));
+
/*


* We reuse the call function data without waiting for any grace

* period after some other cpu removes it from the global queue.

--

Peter Zijlstra

Jan 31, 2011, 5:27:02 AM
to Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Fri, 2011-01-28 at 18:20 -0600, Milton Miller wrote:
> Peter pointed out there was nothing preventing the list_del_rcu in
> smp_call_function_interrupt from running before the list_add_rcu in
> smp_call_function_many. Fix this by not setting refs until we have put
> the entry on the list. We can use the lock acquire and release instead
> of a wmb.
>
> Reported-by: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Milton Miller <mil...@bga.com>
> Cc: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>
> ---
>
> I tried to force this race with a udelay before the lock & list_add and
> by mixing all 64 online cpus with just 3 random cpus in the mask, but
> was unsuccessful. Still, it seems to be a valid race, and the fix
> is a simple change to the current code.

Yes, I think this will fix it, I think simply putting that assignment
under the lock is sufficient, because then the list removal will
serialize against the list add. But placing it after the list add does
also seem sufficient.

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Milton Miller

Jan 31, 2011, 3:26:18 PM
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, 31 Jan 2011 about 11:27:45 +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-28 at 18:20 -0600, Milton Miller wrote:
> > Peter pointed out there was nothing preventing the list_del_rcu in
> > smp_call_function_interrupt from running before the list_add_rcu in
> > smp_call_function_many. Fix this by not setting refs until we have put
> > the entry on the list. We can use the lock acquire and release instead
> > of a wmb.
> >
> > Reported-by: Peter Zijlstra <pet...@infradead.org>
> > Signed-off-by: Milton Miller <mil...@bga.com>
> > Cc: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>
> > ---
> >
> > I tried to force this race with a udelay before the lock & list_add and
> > by mixing all 64 online cpus with just 3 random cpus in the mask, but
> > was unsuccessful. Still, it seems to be a valid race, and the fix
> > is a simple change to the current code.
>
> Yes, I think this will fix it, I think simply putting that assignment
> under the lock is sufficient, because then the list removal will
> serialize against the list add. But placing it after the list add does
> also seem sufficient.
>
> Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
>

I was worried some architectures would allow a write before the spinlock
to drop into the spinlock region, in which case the data or function
pointer could be found stale with the cpu mask bit set. The unlock
must flush all prior writes and therefore the new function and data
will be seen before refs is set.

milton

Milton Miller

Jan 31, 2011, 3:26:24 PM
to Mike Galbraith, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, 31 Jan 2011 about 08:21:22 +0100, Mike Galbraith wrote:
> Wondering if a final sanity check makes sense. I've got a perma-spin
> bug where comment apparently happened. Another CPU's diddle the mask
> IPI may make this CPU do horrible things to itself as it's setting up to
> IPI others with that mask.
>
> ---
> kernel/smp.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6.38.git/kernel/smp.c
> ===================================================================
> --- linux-2.6.38.git.orig/kernel/smp.c
> +++ linux-2.6.38.git/kernel/smp.c
> @@ -490,6 +490,9 @@ void smp_call_function_many(const struct
> cpumask_and(data->cpumask, mask, cpu_online_mask);
> cpumask_clear_cpu(this_cpu, data->cpumask);
>
> + /* Did you pass me a mask that can be changed/emptied under me? */
> + BUG_ON(cpumask_empty(data->cpumask));
> +

I was thinking of this as "the ipi cpumask was cleared", but I realize now
you are saying the caller passed in a cpumask, and that between the cpu_first/
cpu_next calls above and the cpumask_and, another cpu cleared all the cpus?

I could see how that could happen on say a mask of cpus that might have a
translation context, or cpus that need a push to complete an rcu window.
Instead of the BUG_ON, we can handle the mask being cleared.

The arch code to send the IPI must handle an empty mask, as the other
cpus are racing to clear their bit while its trying to send the IPI.
In fact that expected race is the cause of the x86 warning in bz 23042
https://bugzilla.kernel.org/show_bug.cgi?id=23042 that Andrew pointed
out.


How about this [untested] patch?

Mike Galbraith reported finding a lockup where apparently the passed-in
cpumask was cleared on other cpu(s) while this cpu was preparing its
smp_call_function_many block. Detect this race and unlock the call
data block. Note: arch_send_call_function_ipi_mask must still handle an
empty mask because the element is globally visible before it is called.
And obviously there are no guarantees as to which cpus are notified if the
mask is changed during the call.

Reported-by: Mike Galbraith <efa...@gmx.de>


Signed-off-by: Milton Miller <mil...@bga.com>
---

On top of "call_function_many: fix list delete vs add race"

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2011-01-31 13:51:56.294049313 -0600
+++ linux-2.6/kernel/smp.c 2011-01-31 13:49:12.737064069 -0600
@@ -450,7 +450,7 @@
{
struct call_function_data *data;
unsigned long flags;
- int cpu, next_cpu, this_cpu = smp_processor_id();
+ int refs, cpu, next_cpu, this_cpu = smp_processor_id();

/*
* Can deadlock when called with interrupts disabled.
@@ -489,6 +489,13 @@
data->csd.info = info;


cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);

+ refs = cpumask_weight(data->cpumask);
+
+ /* Some uses might have cleared all cpus in the mask before the copy. */
+ if (unlikely(!refs)) {
+ csd_unlock(&data->csd);
+ return;


+ }

/*
* We reuse the call function data without waiting for any grace

@@ -511,7 +518,7 @@


list_add_rcu(&data->csd.list, &call_function.queue);
raw_spin_unlock_irqrestore(&call_function.lock, flags);

- atomic_set(&data->refs, cpumask_weight(data->cpumask));

+ atomic_set(&data->refs, refs);



/*
* Make the list addition visible before sending the ipi.

Peter Zijlstra

Jan 31, 2011, 3:39:14 PM
to Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, 2011-01-31 at 14:26 -0600, Milton Miller wrote:
> On Mon, 31 Jan 2011 about 11:27:45 +0100, Peter Zijlstra wrote:
> > On Fri, 2011-01-28 at 18:20 -0600, Milton Miller wrote:
> > > Peter pointed out there was nothing preventing the list_del_rcu in
> > > smp_call_function_interrupt from running before the list_add_rcu in
> > > smp_call_function_many. Fix this by not setting refs until we have put
> > > the entry on the list. We can use the lock acquire and release instead
> > > of a wmb.
> > >
> > > Reported-by: Peter Zijlstra <pet...@infradead.org>
> > > Signed-off-by: Milton Miller <mil...@bga.com>
> > > Cc: "Paul E. McKenney" <pau...@linux.vnet.ibm.com>
> > > ---
> > >
> > > I tried to force this race with a udelay before the lock & list_add and
> > > by mixing all 64 online cpus with just 3 random cpus in the mask, but
> > > was unsuccessful. Still, it seems to be a valid race, and the fix
> > > is a simple change to the current code.
> >
> > Yes, I think this will fix it, I think simply putting that assignment
> > under the lock is sufficient, because then the list removal will
> > serialize against the list add. But placing it after the list add does
> > also seem sufficient.
> >
> > Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
> >
>
> I was worried some architectures would allow a write before the spinlock
> to drop into the spinlock region,

That is indeed allowed to happen

> in which case the data or function
> pointer could be found stale with the cpu mask bit set.

But that is ok, right? the atomic_read(->refs) test will fail and we'll
continue.

> The unlock
> must flush all prior writes and

and reads

> therefore the new function and data
> will be seen before refs is set.


Which again should be just fine, given the interrupt does:

if (!cpumask_test_cpu())
continue

rmb

if (!atomic_read())
continue

and thus we'll be on our happy merry way. If we do however observe the
new ->refs value we have already acquired the lock on the sending end
and the spinlock before the list_del_rcu() will serialize against it
such that we'll always finish the list_add_rcu() before executing the
del.

Or am I not quite understanding things?

Peter Zijlstra

Jan 31, 2011, 4:17:18 PM
to Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Fri, 2011-01-28 at 18:20 -0600, Milton Miller wrote:
> @@ -491,15 +491,17 @@ void smp_call_function_many(const struct
> cpumask_clear_cpu(this_cpu, data->cpumask);
>
> /*
> - * To ensure the interrupt handler gets an complete view
> - * we order the cpumask and refs writes and order the read
> - * of them in the interrupt handler. In addition we may
> - * only clear our own cpu bit from the mask.
> + * We reuse the call function data without waiting for any grace
> + * period after some other cpu removes it from the global queue.
> + * This means a cpu might find our data block as it is writen.
> + * The interrupt handler waits until it sees refs filled out
> + * while its cpu mask bit is set; here we may only clear our
> + * own cpu mask bit, and must wait to set refs until we are sure
> + * previous writes are complete and we have obtained the lock to
> + * add the element to the queue. We use the acquire and release
> + * of the lock as a wmb() -- acquire prevents write moving up and
> + * release requires old writes are visible.

That's wrong:

->foo =
LOCK
UNLOCK
->bar =

can be re-ordered as:

LOCK
->bar =
->foo =
UNLOCK

Only a UNLOCK+LOCK sequence can be considered an MB.

However, I think the code is still OK, because list_add_rcu() implies a
wmb(), so in that respect its an improvement since we fix a race and
avoid an extra wmb. But the comment needs an update.
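
(A self-contained toy of the reordering being described, with made-up names;
on a weakly ordered architecture both stores may drift into the critical
section and become visible in either order.)

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);
static int demo_foo, demo_bar;

/* Sketch: LOCK followed by UNLOCK is not a full memory barrier. */
static void demo_writer(void)
{
        demo_foo = 1;           /* may sink below the spin_lock() acquire ...   */
        spin_lock(&demo_lock);
        spin_unlock(&demo_lock);
        demo_bar = 1;           /* ... and this may rise above the spin_unlock()
                                 * release, so another cpu can observe
                                 * demo_bar == 1 while demo_foo is still 0.
                                 * An UNLOCK followed by a LOCK, by contrast,
                                 * does order the two stores. */
}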

> */
> - smp_wmb();
> -
> - atomic_set(&data->refs, cpumask_weight(data->cpumask));
> -
> raw_spin_lock_irqsave(&call_function.lock, flags);
> /*
> * Place entry at the _HEAD_ of the list, so that any cpu still
> @@ -509,6 +511,8 @@ void smp_call_function_many(const struct
> list_add_rcu(&data->csd.list, &call_function.queue);
> raw_spin_unlock_irqrestore(&call_function.lock, flags);

And this wants to grow a comment that it relies on the wmb implied by
list_add_rcu()

Milton Miller

Jan 31, 2011, 4:36:36 PM
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, efa...@gmx.de, linux-...@vger.kernel.org

I knew I was copying people for a reason.

That's a problem because there is nothing preventing the stores of func and
data from being the ->foo, which, if they went after ->bar (setting refs),
would mean that we executed the wrong function and/or call data.

>
> However, I think the code is still OK, because list_add_rcu() implies a
> wmb(), so in that respect its an improvement since we fix a race and
> avoid an extra wmb. But the comment needs an update.
>
> > */
> > - smp_wmb();
> > -
> > - atomic_set(&data->refs, cpumask_weight(data->cpumask));
> > -
> > raw_spin_lock_irqsave(&call_function.lock, flags);
> > /*
> > * Place entry at the _HEAD_ of the list, so that any cpu still
> > @@ -509,6 +511,8 @@ void smp_call_function_many(const struct
> > list_add_rcu(&data->csd.list, &call_function.queue);
> > raw_spin_unlock_irqrestore(&call_function.lock, flags);
>
> And this wants to grow a comment that it relies on the wmb implied by
> list_add_rcu()
>
> > + atomic_set(&data->refs, cpumask_weight(data->cpumask));
> > +

Yeah, I'll make an updated patch and move the set into the lock,
but after some food (and meetings).

milton

Benjamin Herrenschmidt

Jan 31, 2011, 7:23:45 PM
to Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, linux-...@vger.kernel.org
On Mon, 2011-01-31 at 22:17 +0100, Peter Zijlstra wrote:
> That's wrong:
>
> ->foo =
> LOCK
> UNLOCK
> ->bar =
>
> can be re-ordered as:
>
> LOCK
> ->bar =
> ->foo =
> UNLOCK

Can it? I thought UNLOCK had write barrier semantics? It does on power
at least :-) It should, since it must prevent stores inside the
lock region from passing the store of the unlock itself anyway.

So yes, ->bar = can leak into the lock, as can ->foo =, but they can't
be re-ordered vs. each other because the implied barrier will keep ->foo
= in the same "domain" as the unlock itself.

Or do other archs do something really nasty here and not provide this
guarantee?

Cheers,
Ben.

Linus Torvalds

Jan 31, 2011, 8:39:57 PM
to Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, pau...@linux.vnet.ibm.com, linux-...@vger.kernel.org
On Tue, Feb 1, 2011 at 10:22 AM, Benjamin Herrenschmidt
<be...@kernel.crashing.org> wrote:
> On Mon, 2011-01-31 at 22:17 +0100, Peter Zijlstra wrote:
>> That's wrong:
>>
>>  ->foo =
>>  LOCK
>>  UNLOCK
>>  ->bar =
>>
>> can be re-ordered as:
>>
>>  LOCK
>>  ->bar =
>>  ->foo =
>>  UNLOCK
>
> Can it ? I though UNLOCK had a write barrier semantic ?

The re-ordering of ->foo and UNLOCK that Peter claims is definitely
possible, and the unlock is not guaranteed to be a write barrier. The
unlock just guarantees "release" consistency, which means that no
previous reads or writes will be migrated down from within the locked
region, but it is very much a one-way barrier, so a subsequent write -
or more commonly a read - could easily migrate up past the unlock.

(Same for lock, except the permeability is obviously the other way
around - "acquire" consistency)

> So yes, ->bar = can leak into the lock, as can ->foo =, but they can't
> be re-ordered vs. each other because the implied barrier will keep ->foo
> = in the same "domain" as the unlock itself.
>
> Or do other archs do something really nasty here that don't provide this
> guarantee ?

I think we actually allow the accesses to ->bar and ->foo to be
re-ordered wrt each other, exactly because they can *both* get
re-ordered first into the locked region, and then after that they can
get re-ordered wrt each other (because there is no other memory
barrier).

So a "unlock+lock" is guaranteed to be equivalent to a full memory
barrier (because an operation before the unlock cannot pass the
unlock, and an access after the lock cannot percolate up before it).
But the regular "lock+unlock" sequence is not, exactly because
accesses outside of it are allowed to first leak in, and then not have
ordering constraints within the locked region.

That said, we may have some confusion there, and I would *STRONGLY*
suggest that architectures should have stronger lock consistency
guarantees than the theoretical ones. Especially since x86 has such
strong rules, and locking is a full memory barrier. Anybody with very
weak ordering is likely to just hit more bugs.

And so while I'm not sure it's ever been documented, I do think it is
likely a good idea to just make sure that "lock+unlock" is a full
memory barrier, the same way "unlock+lock" is. I think it's
practically true anyway on all architectures (with the possible
exception of ia64, which I think actually implements real
acquire/release semantics)

(Practically speaking, there really aren't many reasons to allow
writes to be re-ordered wrt lock/unlock operations, and there is no
reason to ever move reads later, only earlier. Writes are _not_
performance-sensitive the way reads are, so there is no real reason to
really take advantage of the one-way permeability for them. It's reads
that you really want to speculate and do early, not writes, and so
allowing a read after a UNLOCK to percolate into the critical region
is really the only relevant case from a performance perspective).

Linus

Paul E. McKenney

Jan 31, 2011, 9:18:47 PM
to Linus Torvalds, Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org

Documentation/memory-barriers.txt specifies this.

Therefore, from (1), (2) and (4) an UNLOCK followed by an
unconditional LOCK is equivalent to a full barrier, but a LOCK
followed by an UNLOCK is not.

Thanx, Paul

Linus Torvalds

Jan 31, 2011, 9:51:35 PM
to pau...@linux.vnet.ibm.com, Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, Feb 1, 2011 at 12:18 PM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
>
> Documentation/memory-barriers.txt specifies this.

Good, so it really is documented, with both cases explicitly mentioned.

That said, I do think that if your memory ordering is much weaker than
x86, you are going to see bugs that most testers don't see, and it
simply might not be worth it.

Mike Galbraith

Jan 31, 2011, 10:16:02 PM
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, 2011-01-31 at 14:26 -0600, Milton Miller wrote:

Yes, that would work. In my case, it was passed mm_cpumask(mm). What
is unclear is whether the mask at call time was what the programmer needed
action on, i.e. the mask changing may be an intolerable information loss/gain.
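
(A hypothetical caller of the shape Mike describes -- the names below are
invented for the sketch. The mask handed to smp_call_function_many() is
mm_cpumask(mm), which other cpus may clear concurrently, for example as they
stop using the mm, so it can shrink or even empty under the sender.)

#include <linux/mm.h>
#include <linux/preempt.h>
#include <linux/smp.h>

static void demo_flush_one(void *info)
{
        /* per-cpu work for the mm passed via info */
}

static void demo_flush_mm_users(struct mm_struct *mm)
{
        preempt_disable();
        /*
         * mm_cpumask(mm) is not a private snapshot: other cpus clear their
         * bit as they stop using the mm, so by the time the IPIs go out the
         * mask may have changed -- possibly to empty.
         */
        smp_call_function_many(mm_cpumask(mm), demo_flush_one, mm, 1);
        preempt_enable();
}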

-Mike

Paul E. McKenney

Jan 31, 2011, 11:45:29 PM
to Linus Torvalds, Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, Feb 01, 2011 at 12:43:56PM +1000, Linus Torvalds wrote:
> On Tue, Feb 1, 2011 at 12:18 PM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> >
> > Documentation/memory-barriers.txt specifies this.
>
> Good, so it really is documented, with both cases explicitly mentioned.
>
> That said, I do think that if your memory ordering is much weaker than
> x86, you are going to see bugs that most testers don't see, and it
> simply might not be worth it.

IBM's CPUs do split the difference, with s390 having somewhat stronger
ordering than x86, and with powerpc being rather weaker (along with arm,
ia64, and some flavors of mips, but stronger than alpha). But yes, this
does mean that there are bugs that don't show up on x86 and s390, but
that do on powerpc, arm, ia64, and some mips, to say nothing of alpha.
Similarly, there are bugs that show up on x86 due to unsynchronized
timestamp counters that powerpc and s390 avoid due to the guaranteed
synchronization on those platforms.

Whether the weaker ordering provides worthwhile benefits is an interesting
debate that I do not believe we will be able to resolve here. ;-)

Thanx, Paul

Linus Torvalds

Feb 1, 2011, 12:47:43 AM
to pau...@linux.vnet.ibm.com, Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, Feb 1, 2011 at 2:45 PM, Paul E. McKenney
<pau...@linux.vnet.ibm.com> wrote:
> On Tue, Feb 01, 2011 at 12:43:56PM +1000, Linus Torvalds wrote:
>>
>> That said, I do think that if your memory ordering is much weaker than
>> x86, you are going to see bugs that most testers don't see, and it
>> simply might not be worth it.
>
> IBM's CPUs do split the difference, with s390 having somewhat stronger
> ordering than x86, and with powerpc being rather weaker

I'm not talking about memory ordering as done by the cpu, but as done
by the spinlock operations. They can be arbitrarily strong, even if
the CPU architecture itself might be weakly ordered.

So what I suggest is that even if you have a CPU that can implement
acquire/release operations on the locks, you should at least think
about making a lock/unlock sequence be at least a full memory barrier,
simply because it is on x86.

And on at least alpha, with its famously weak memory ordering, you
actually have to do that, because the weak memory ordering of the
normal memory operations are coupled with rather bad memory ordering
primitives.

So on alpha, both LOCK and UNLOCK are full memory barriers
_individually_, not even paired. So despite being known for being
weakly ordered, alpha is actually the MOST STRONGLY ordered of all
when it comes to spinlock primitives, and doesn't have any notion at
all of "inside" vs "outside" the lock or anything like that. A
spinlock/unlock sequence contains TWO full memory barriers, not just
one.

So "weak memory ordering" in no way means "weak lock ordering". Quite
often it means the reverse. POWER, as you say, has weak memory
ordering on the actual memory ops (like alpha), but like alpha has
traditionally really quite crappy and bad synchronization primitives.

So don't confuse the two. When talking about spinlocks, don't confuse
the memory ordering for non-locked operations with the memory ordering
for the locks themselves. The two are totally independent, and often
weak ordering on the normal memory operations actually is associated
with just total idiotic crap on the synchronization primitives. Most
of them were designed when threading and locking wasn't seen as a big
issue, and before people understood that serialization is a big deal.

(You can tell how ia64 was designed later, for example, by looking at
its fancy serialization primitives.)

Quite frankly, the POWER case is made worse by the fact that the
synchronization primitives have this total confusion about "pipeline"
synchronization due to historical implementation oddities etc. Talk
about crazy. The whole "isync" vs "sync" vs "lwsync" thing is just an
embarrassment. But this is, after all, a company that ran out of
vowels in instruction naming, so I guess their misleading and confused
naming for memory ordering is just par for the course.

Linus

Benjamin Herrenschmidt

Feb 1, 2011, 1:17:36 AM
to Linus Torvalds, pau...@linux.vnet.ibm.com, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, 2011-02-01 at 12:43 +1000, Linus Torvalds wrote:
> On Tue, Feb 1, 2011 at 12:18 PM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> >
> > Documentation/memory-barriers.txt specifies this.
>
> Good, so it really is documented, with both cases explicitly mentioned.
>
> That said, I do think that if your memory ordering is much weaker than
> x86, you are going to see bugs that most testers don't see, and it
> simply might not be worth it.

All right. The way we do it on power is that stores won't pass the unlock
either way (lwsync). Some loads might migrate up, though.

Cheers,
Ben.

Benjamin Herrenschmidt

Feb 1, 2011, 1:20:42 AM
to Linus Torvalds, pau...@linux.vnet.ibm.com, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, 2011-02-01 at 15:46 +1000, Linus Torvalds wrote:
>
> Quite frankly, the POWER case is made worse by the fact that the
> synchronization primitives have this total confusion about "pipeline"
> synchronization due to historical implementation oddities etc. Talk
> about crazy. The whole "isync" vs "sync" vs "lwsync" thing is just an
> embarrassment. But this is, after all, a company that ran out of
> vowels in instruction naming, so I guess their misleading and confused
> naming for memory ordering is just par for the course.

You forgot eieio, that's where all the vowels go !

Cheers,
Ben.

Milton Miller

Feb 1, 2011, 2:12:36 AM
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Peter pointed out there was nothing preventing the list_del_rcu in
smp_call_function_interrupt from running before the list_add_rcu in
smp_call_function_many. Fix this by not setting refs until we
have gotten the lock for the list. Take advantage of the wmb in
list_add_rcu to save an explicit additional one.

Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Milton Miller <mil...@bga.com>

---
v2: rely on the wmb in list_add_rcu, not the combined partial ordering of
spin lock and unlock, which may not provide the needed guarantees.

I tried to force this race with a udelay before the lock & list_add
and by mixing all 64 online cpus with just 3 random cpus in the mask,
but was unsuccessful. Still, inspection shows a valid race, and the
fix is an extension of the existing protection window in the current code.

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2011-01-31 17:44:47.182756513 -0600
+++ linux-2.6/kernel/smp.c 2011-01-31 18:25:47.266755387 -0600
@@ -491,14 +491,15 @@ void smp_call_function_many(const struct


cpumask_clear_cpu(this_cpu, data->cpumask);

/*
- * To ensure the interrupt handler gets an complete view
- * we order the cpumask and refs writes and order the read
- * of them in the interrupt handler. In addition we may
- * only clear our own cpu bit from the mask.
+ * We reuse the call function data without waiting for any grace
+ * period after some other cpu removes it from the global queue.
+ * This means a cpu might find our data block as it is writen.
+ * The interrupt handler waits until it sees refs filled out
+ * while its cpu mask bit is set; here we may only clear our
+ * own cpu mask bit, and must wait to set refs until we are sure
+ * previous writes are complete and we have obtained the lock to
+ * add the element to the queue.

*/
- smp_wmb();
-
- atomic_set(&data->refs, cpumask_weight(data->cpumask));

raw_spin_lock_irqsave(&call_function.lock, flags);
/*
@@ -507,6 +508,11 @@ void smp_call_function_many(const struct
* will not miss any other list entries:
*/
list_add_rcu(&data->csd.list, &call_function.queue);
+ /*
+ * We rely on the wmb() in list_add_rcu to order the writes
+ * to func, data, and cpumask before this write to refs.
+ */


+ atomic_set(&data->refs, cpumask_weight(data->cpumask));

raw_spin_unlock_irqrestore(&call_function.lock, flags);

/*

Milton Miller

unread,
Feb 1, 2011, 2:12:41 AM2/1/11
to ak...@linux-foundation.org, Mike Galbraith, Peter Zijlstra, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Mike Galbraith reported finding a lockup ("perma-spin bug") where the
cpumask passed to smp_call_function_many was cleared by other cpu(s)
while a cpu was preparing its call_data block, resulting in no cpu to
clear the last ref and unlock the block.

Having cpus clear their bit asynchronously could be useful on a mask of
cpus that might have a translation context, or cpus that need a push to
complete an rcu window.

Instead of adding a BUG_ON and requiring yet another cpumask copy, just
detect the race and handle it.

Note: arch_send_call_function_ipi_mask must still handle an empty cpumask
because the data block is globally visible before that arch callback
is made. And (obviously) there are no guarantees as to which cpus are
notified if the mask is changed during the call; only cpus that were online
and had their mask bit set during the whole call are guaranteed to be called.

Reported-by: Mike Galbraith <efa...@gmx.de>


Signed-off-by: Milton Miller <mil...@bga.com>
---

v2: rediff for v2 of call_function_many: fix list delete vs add race

The arch code not expecting the race to empty the mask is the cause
of https://bugzilla.kernel.org/show_bug.cgi?id=23042 that Andrew pointed
out.


Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2011-01-31 18:25:47.266755387 -0600
+++ linux-2.6/kernel/smp.c 2011-01-31 18:46:37.848099236 -0600
@@ -450,7 +450,7 @@ void smp_call_function_many(const struct


{
struct call_function_data *data;
unsigned long flags;
- int cpu, next_cpu, this_cpu = smp_processor_id();
+ int refs, cpu, next_cpu, this_cpu = smp_processor_id();

/*
* Can deadlock when called with interrupts disabled.

@@ -489,6 +489,13 @@ void smp_call_function_many(const struct
data->csd.info = info;


cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);

+ refs = cpumask_weight(data->cpumask);
+

+ /* some callers might race with other cpus changing the mask */


+ if (unlikely(!refs)) {
+ csd_unlock(&data->csd);
+ return;

+ }

/*
* We reuse the call function data without waiting for any grace

@@ -512,7 +519,7 @@ void smp_call_function_many(const struct


* We rely on the wmb() in list_add_rcu to order the writes

* to func, data, and cpumask before this write to refs.

*/


- atomic_set(&data->refs, cpumask_weight(data->cpumask));

+ atomic_set(&data->refs, refs);
raw_spin_unlock_irqrestore(&call_function.lock, flags);

/*

Paul E. McKenney

unread,
Feb 1, 2011, 9:13:17 AM2/1/11
to Linus Torvalds, Benjamin Herrenschmidt, Peter Zijlstra, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, linux-...@vger.kernel.org
On Tue, Feb 01, 2011 at 03:46:26PM +1000, Linus Torvalds wrote:
> On Tue, Feb 1, 2011 at 2:45 PM, Paul E. McKenney
> <pau...@linux.vnet.ibm.com> wrote:
> > On Tue, Feb 01, 2011 at 12:43:56PM +1000, Linus Torvalds wrote:
> >>
> >> That said, I do think that if your memory ordering is much weaker than
> >> x86, you are going to see bugs that most testers don't see, and it
> >> simply might not be worth it.
> >
> > IBM's CPUs do split the difference, with s390 having somewhat stronger
> > ordering than x86, and with powerpc being rather weaker
>
> I'm not talking about memory ordering as done by the cpu, but as done
> by the spinlock operations. They can be arbitrarily strong, even if
> the CPU architecture itself might be weakly ordered.

Got it.

[ . . . ]

> Quite frankly, the POWER case is made worse by the fact that the
> synchronization primitives have this total confusion about "pipeline"
> synchronization due to historical implementation oddities etc. Talk
> about crazy. The whole "isync" vs "sync" vs "lwsync" thing is just an
> embarrassment.

I am always ready to exploit embarrassing parallelism!

Thanx, Paul

Paul E. McKenney

unread,
Feb 1, 2011, 5:01:08 PM2/1/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, Feb 01, 2011 at 01:12:18AM -0600, Milton Miller wrote:
> Peter pointed out there was nothing preventing the list_del_rcu in
> smp_call_function_interrupt from running before the list_add_rcu in
> smp_call_function_many. Fix this by not setting refs until we
> have gotten the lock for the list. Take advantage of the wmb in
> list_add_rcu to save an explicit additional one.

Finally getting a chance to give this full attention...

I don't know how to do this patch-by-patch, so I applied these
three patches to mainline and looked at the result. This probably
gets some comments that are irrelevant to these patches, but so it
goes.

Starting with smp_call_function_many():

o The check for refs is redundant:

/* some callers might race with other cpus changing the mask */

if (unlikely(!refs)) {
csd_unlock(&data->csd);
return;
}

The memory barriers and atomic functions in
generic_smp_call_function_interrupt() prevent the callback from
being reused before the cpumask bits have all been cleared, right?
Furthermore, csd_lock() contains a full memory barrier that pairs
with the full memory barrier in csd_unlock(), so if csd_lock()
returns, we are guaranteed to see the effects of all accesses
to the prior incarnation of this structure.

Yes, there might be another CPU that got a pointer to this
callback just as another CPU removed it. That CPU will see
the callback as having their bit of the CPU mask zero until
we set it, and they will further see ->refs as being zero
until we set it. And we don't set ->refs until after we
re-insert the callback.

So I think we can drop this "if" statement entirely.

o The smp_mb() below looks extraneous. The comment in
generic_exec_single() agrees -- it says that the IPI
code is required to impose ordering. Besides,
there is a lock in arch_send_call_function_ipi_mask(),
and in conjunction with the unlock below, this makes
a full memory barrier. So I believe that we can drop
the smp_mb() shown below. (But not the comment!!!)

raw_spin_unlock_irqrestore(&call_function.lock, flags);

/*
* Make the list addition visible before sending the ipi.
* (IPIs must obey or appear to obey normal Linux cache

* coherency rules -- see comment in generic_exec_single).
*/
smp_mb();

/* Send a message to all CPUs in the map */
arch_send_call_function_ipi_mask(data->cpumask);

Next up: generic_smp_call_function_interrupt()

o Shouldn't the smp_mb() at the top of the function be supplied
in arch-specific code for those architectures that require it?

o So, the check for the current CPU's bit in the mask...

o If the bit is clear, then we are looking at a callback
that this CPU already processed or that was not intended
for this CPU in the first place.

Of course, this callback might be reused immediately
after we do the check. In that case, there should
be an IPI waiting for us shortly. If the IPI beat
the callback to us, that seems to me to be a bug
in the architecture rather than in this code.

o If the bit is set, then we need to process this callback.
IRQs are disabled, so we cannot race with ourselves
-- our bit will remain set until we clear it.
The list_add_rcu() in smp_call_function_many()
in conjunction with the list_for_each_entry_rcu()
in generic_smp_call_function_interrupt() guarantees
that all of the fields except for ->refs will be seen as
initialized in the common case where we are looking at
a callback that has just been enqueued.

In the uncommon case where we picked up the pointer
in list_for_each_entry_rcu() just before the last
CPU removed the callback and when someone else
immediately recycled it, all bets are off. We must
ensure that we see all initialization via some other
means.

OK, so where is the memory barrier that pairs with the
smp_rmb() between the ->cpumask and ->refs checks?
It must be before the assignment to ->cpumask. One
candidate is the smp_mb() in csd_lock(), but that does
not make much sense. What we need to do is to ensure
that if we see our bit in ->cpumask, that we also see
the atomic decrement that previously zeroed ->refs.
But some third CPU did that atomic decrement, which
brings transitivity into play, which leads me to believe
that this smp_rmb() might need to be upgraded to a
full smp_mb().

o After we verify that the ->refs field is non-zero, we pick
up the ->csd.func and ->csd.info fields, but with no intervening
memory barrier. There needs to be at least an smp_rmb()
following the test of the ->refs field. Since there is
no transitivity required, smp_rmb() should do the trick.

If this seems unnecessary, please keep in mind that there
is nothing stopping the compiler and the CPU from reordering
the statements in smp_call_function_many() that initialize
->csd.func, ->csd.info, and ->cpumask.

In contrast, given the suggested smp_rmb(), we are guaranteed
of the identity of the callback from the time we test ->refs
until the time someone calls csd_unlock(), which cannot precede
the time that we atomically decrement ->refs.

o It seems silly to pick up the ->csd.func field twice. Why
not use the local variable?

o The cpumask_test_and_clear_cpu() and the atomic_dec_return()
are both atomic operations that return values, so they act
as full memory barriers.

The cpumask_test_and_clear_cpu() needs to be atomic despite
the identity guarantee because CPUs might be concurrently
trying to clear bits in the same word.
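
For reference, here is the shape of the per-entry handling these bullets walk
through, reconstructed from the diffs in this thread -- a schematic only, not
a verbatim copy of kernel/smp.c (cpu is the local cpu id, and the barriers
whose strength is in question are marked):

        list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
                int refs;

                if (!cpumask_test_cpu(cpu, data->cpumask))
                        continue;       /* not (or no longer) for us */

                smp_rmb();              /* orders the ->cpumask read against the
                                         * ->refs read below -- or smp_mb()? */

                if (atomic_read(&data->refs) == 0)
                        continue;       /* owner has not finished filling it in */

                /* (suggested) smp_rmb() here, unless the owner orders its
                 * ->csd.func/->csd.info writes before ->cpumask instead */

                data->csd.func(data->csd.info);

                if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                        continue;       /* already clear: func double-executed */

                refs = atomic_dec_return(&data->refs);
                WARN_ON(refs < 0);
                if (refs)
                        continue;

                /* the last cpu out removes the entry and releases it for reuse;
                 * both atomics above return values and so are full barriers */
                raw_spin_lock(&call_function.lock);
                list_del_rcu(&data->csd.list);
                raw_spin_unlock(&call_function.lock);
                csd_unlock(&data->csd); /* full barrier, pairs with csd_lock() */
        }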

Here is the corresponding (untested) patch.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

smp_call_function: additional memory-order tightening.

The csd_lock() and csd_unlock() interaction guarantees that the
smp_call_function_many() function sees the results of interactions
with prior incarnations of the callback, so the check is not needed.
Instead, tighter memory ordering is required in the companion
generic_smp_call_function_interrupt() function to ensure proper
interaction with partially initialized callbacks.

Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

diff --git a/kernel/smp.c b/kernel/smp.c
index 064bb6e..4c8b005 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -209,13 +209,19 @@ void generic_smp_call_function_interrupt(void)
if (!cpumask_test_cpu(cpu, data->cpumask))
continue;

- smp_rmb();
+ smp_mb(); /* If we see our bit set above, we need to see */
+ /* all the processing associated with the prior */
+ /* incarnation of this callback. */



if (atomic_read(&data->refs) == 0)
continue;

+ smp_rmb(); /* If we see non-zero ->refs, we need to see all */
+ /* other initialization for this incarnation of */
+ /* this callback. */


+
func = data->csd.func; /* for later warn */

- data->csd.func(data->csd.info);
+ func(data->csd.info);

/*
* If the cpu mask is not still set then func enabled
@@ -492,12 +498,6 @@ void smp_call_function_many(const struct cpumask *mask,
cpumask_clear_cpu(this_cpu, data->cpumask);
refs = cpumask_weight(data->cpumask);

- /* some callers might race with other cpus changing the mask */
- if (unlikely(!refs)) {
- csd_unlock(&data->csd);
- return;
- }
-
/*


* We reuse the call function data without waiting for any grace

* period after some other cpu removes it from the global queue.

@@ -527,8 +527,9 @@ void smp_call_function_many(const struct cpumask *mask,


* Make the list addition visible before sending the ipi.
* (IPIs must obey or appear to obey normal Linux cache

* coherency rules -- see comment in generic_exec_single).
+ * The unlock above combined with the lock in the IPI
+ * code covers this requirement.
*/
- smp_mb();

/* Send a message to all CPUs in the map */
arch_send_call_function_ipi_mask(data->cpumask);

Milton Miller

unread,
Feb 1, 2011, 10:19:08 PM2/1/11
to Paul E. McKenney, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, efa...@gmx.de, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 1 Feb 2011 about 14:00:26 -0800, "Paul E. McKenney" wrote:
> On Tue, Feb 01, 2011 at 01:12:18AM -0600, Milton Miller wrote:
> > Peter pointed out there was nothing preventing the list_del_rcu in
> > smp_call_function_interrupt from running before the list_add_rcu in
> > smp_call_function_many. Fix this by not setting refs until we
> > have gotten the lock for the list. Take advantage of the wmb in
> > list_add_rcu to save an explicit additional one.
>
> Finally getting a chance to give this full attention...
>
> I don't know how to do this patch-by-patch, so I applied these
> three patches to mainline and looked at the result. This probably
> gets some comments that are irrelevant to these patches, but so it
> goes.
>
> Starting with smp_call_function_many():
>
> o The check for refs is redundant:
>
> /* some callers might race with other cpus changing the mask */
> if (unlikely(!refs)) {
> csd_unlock(&data->csd);
> return;
> }
>
> The memory barriers and atomic functions in
> generic_smp_call_function_interrupt() prevent the callback from
> being reused before the cpumask bits have all been cleared, right?

The issue is not the cpumask in the csd, but the mask passed in from the
caller. If other cpus clear the mask between the cpumask_first and
cpumask_next above (where we established there were at least two cpus not
ourself) and the cpumask_copy, then this can happen. Both Mike Galbraith
and Jan Beulich saw this in practice (Mike's case was mm_cpumask(mm)).

This was explained in the changelog for the second patch.
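
As a minimal userspace sketch of that caller-side race (illustrative only:
pthreads, a plain integer bitmap and __builtin_popcount stand in for cpus,
the kernel cpumask helpers and cpumask_weight; this is not kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for a live caller mask such as mm_cpumask(mm). */
static atomic_uint caller_mask = 0x6;   /* "cpus" 1 and 2 set; we are "cpu" 0 */

static void *other_cpu_clears_its_bit(void *arg)
{
        unsigned int bit = (unsigned int)(unsigned long)arg;

        atomic_fetch_and(&caller_mask, ~bit);   /* concurrent clear by another cpu */
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        unsigned int early, copy, refs;

        /* "cpumask_first_and"/"cpumask_next_and": sees at least two targets */
        early = atomic_load(&caller_mask);

        pthread_create(&t1, NULL, other_cpu_clears_its_bit, (void *)0x2UL);
        pthread_create(&t2, NULL, other_cpu_clears_its_bit, (void *)0x4UL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* the later "cpumask_and(data->cpumask, mask, cpu_online_mask)" */
        copy = atomic_load(&caller_mask);
        refs = (unsigned int)__builtin_popcount(copy);  /* "cpumask_weight()" */

        printf("targets seen early: %d, refs at copy time: %u\n",
               __builtin_popcount(early), refs);
        if (refs == 0)
                printf("must csd_unlock() and return, not queue with refs == 0\n");
        return 0;
}

The joins make the empty result deterministic here; in the kernel the window
is simply the gap between the fast-path checks and the cpumask_and, which is
why detecting refs == 0 and bailing out is the fix rather than a BUG_ON.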

> Furthermore, csd_lock() contains a full memory barrier that pairs
> with the full memory barrier in csd_unlock(), so if csd_lock()
> returns, we are guaranteed to see the effects of all accesses
> to the prior incarnation of this structure.
>
> Yes, there might be another CPU that got a pointer to this
> callback just as another CPU removed it. That CPU will see
> the callback as having their bit of the CPU mask zero until
> we set it, and they will further see ->refs as being zero
> until we set it. And we don't set ->refs until after we
> re-insert the callback.
>
> > So I think we can drop this "if" statement entirely.

Disagree, see above.

>
> o The smp_mb() below look extraneous. The comment in
> generic_exec_single() agrees -- it says that the IPI
> code is required to impose ordering. Besides,
> there is a lock in arch_send_call_function_ipi_mask(),
> and in conjunction with the unlock below, this makes
> a full memory barrier. So I believe that we can drop
> the smp_mb() shown below. (But not the comment!!!)
>
> raw_spin_unlock_irqrestore(&call_function.lock, flags);
>
> /*
> * Make the list addition visible before sending the ipi.
> * (IPIs must obey or appear to obey normal Linux cache
> * coherency rules -- see comment in generic_exec_single).
> */
> smp_mb();


Well, this says the generic code guarantees mb before calling IPI ...
this has been here since 2.6.30 and would need at least an arch audit
so I vote to leave this here for now.

After we are confident in the locking, we can remove the internal BUG_ON
and WARN checks on the cpumask and refs, leaving the one for mask clear and
the ones for cpu online and calling context.

>
> /* Send a message to all CPUs in the map */
> arch_send_call_function_ipi_mask(data->cpumask);
>
> Next up: generic_smp_call_function_interrupt()
>
> o Shouldn't the smp_mb() at the top of the function be supplied
> in arch-specific code for those architectures that require it?
>

Again, this would require an arch audit, I vote to leave for now.
This says the generic code will guarantee that an IPI is received after
the list is visible with a mb on each side (as long as the arch makes
IPI appear as ordered by smp_mb()).

> o So, the check for the current CPU's bit in the mask...
>
> o If the bit is clear, then we are looking at an callback
> that this CPU already processed or that was not intended
> for this CPU in the first place.
>
> Of course, this callback might be reused immediately
> after we do the check. In that case, there should
> be an IPI waiting for us shortly. If the IPI beat
> the callback to us, that seems to me to be a bug
> in the architecture rather than in this code.

agreed

>
> o If the bit is set, then we need to process this callback.
> IRQs are disabled, so we cannot race with ourselves
> -- our bit will remain set until we clear it.
> The list_add_rcu() in smp_call_function_many()
> in conjunction with the list_for_each_entry_rcu()
> in generic_smp_call_function_interrupt() guarantees
> that all of the field except for ->refs will be seen as
> initialized in the common case where we are looking at
> an callback that has just been enqueued.
>
> In the uncommon case where we picked up the pointer
> in list_for_each_entry_rcu() just before the last
> CPU removed the callback and when someone else
> immediately recycled it, all bets are off. We must
> ensure that we see all initialization via some other
> means.
>
> OK, so where is the memory barrier that pairs with the
> smp_rmb() between the ->cpumask and ->refs checks?
> It must be before the assignment to ->cpumask. One
> candidate is the smp_mb() in csd_lock(), but that does
> not make much sense. What we need to do is to ensure
> that if we see our bit in ->cpumask, that we also see
> the atomic decrement that previously zeroed ->refs.

We have a full mb in csd_unlock on the cpu that zeroed refs and a full
mb in csd_lock on the cpu that sets mask and later refs.

We rely on the atomic returns to order the two atomics, and the
atomic_dec_return to establish a single cpu as the last. After
that atomic is performed we do a full mb in unlock. At this
point all cpus must have visibility to all this prior processing.
On the owning cpu we then do a full mb in lock.

How can any of the second party writes after the paired mb in lock be
visible and not all of the prior third party writes?

How can any cpu see the old refs after the new mask is set? And if they
could, how can we guarantee the ordering between the cpumask clearing
and the unlock?

We rely on the statement that the cpumask memory will not change to an
intermediate value that has bits set that will be later cleared, that is we
rely on the and of the incoming mask and online mask to occur before the
store. We are broken if we copy the online mask first and then AND in the source mask.

The multiple non-atomic writes to data, func, and mask must occur
after the mb in lock and before the wmb in list_add_rcu. This wmb
must force all processing except for the set of refs, and we require
atomic_set to be atomic wrt atomic_dec_return.

> But some third CPU did that atomic decrement, which
> brings transitivity into play, which leads me to believe
> that this smp_rmb() might need to be upgraded to a
> full smp_mb().

I'm not sure I understand the need for this upgrade. We are only
performing reads on this cpu, how is smp_mb stronger than smp_rmb?

All the prior writes were ordered by atomic returns or a full mb.


>
> o After we verify that the ->refs field is non-zero, we pick
> up the ->csd.func and ->csd.info fields, but with no intervening
> memory barrier. There needs to be at least an smp_rmb()
> following the test of the ->refs field. Since there is
> no transitivity required, smp_rmb() should do the trick.
>
>
> If this seems unnecessary, please keep in mind that there
> is nothing stopping the compiler and the CPU from reordering
> the statements in smp_call_function_many() that initialize
> ->csd.func, ->csd.info, and ->cpumask.

Ok I agree, we need either a wmb before setting cpumask or an rmb before
reading data & func after refs is set.

>
> In contrast, given the suggested smp_rmb(), we are guaranteed
> of the identity of the callback from the time we test ->refs
> until the time someone calls csd_unlock(), which cannot precede
> the time that we atomically decrement ->refs.
>
> o It seems silly to pick up the ->csd.func field twice. Why
> not use the local variable?

I added the local variable for the warn and just left the compiler to
recognise the duplicate load. But yes, we could use the local to make
the call too. At the time I chose fewer loc changed.

>
> o The cpumask_test_and_clear_cpu() and the atomic_dec_return()
> are both atomic operations that return values, so they act
> as full memory barriers.
>
> The cpumask_test_and_clear_cpu() needs to be atomic despite
> the identity guarantee because CPUs might be concurrently
> trying to clear bits in the same word.

Ok and in addition, since these are necessarily ordered then we
can assert that the test_and_clear_cpu must execute before the
decrement of refs, which is required for total ordering to say
that all cpus are done clearing their mask bit and decrementing
refs before we do the csd_unlock, which allows the reuse.

>
> Here is the corresponding (untested) patch.
>
> Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> smp_call_function: additional memory-order tightening.
>
> The csd_lock() and csd_unlock() interaction guarantees that the
> smp_call_function_many() function sees the results of interactions
> with prior incarnations of the callback, so the check is not needed.
> Instead, tighter memory ordering is required in the companion
> generic_smp_call_function_interrupt() function to ensure proper
> interaction with partially initialized callbacks.
>
> Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 064bb6e..4c8b005 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -209,13 +209,19 @@ void generic_smp_call_function_interrupt(void)
> if (!cpumask_test_cpu(cpu, data->cpumask))
> continue;
>
> - smp_rmb();
> + smp_mb(); /* If we see our bit set above, we need to see */
> + /* all the processing associated with the prior */
> + /* incarnation of this callback. */

Again, I want more justification. Why is mb required over rmb?
We are ordering two reads. If we could see an old value from before the one
stored by another cpu, then what does the mb in lock and unlock mean?

>
> if (atomic_read(&data->refs) == 0)
> continue;
>
> + smp_rmb(); /* If we see non-zero ->refs, we need to see all */
> + /* other initialization for this incarnation of */
> + /* this callback. */
> +

/* Need to read func and info after refs to see the new values, pairs
with the wmb before setting refs. */


> func = data->csd.func; /* for later warn */
> - data->csd.func(data->csd.info);
> + func(data->csd.info);

Ok, and can merge with the add of rmb.

>
> /*
> * If the cpu mask is not still set then func enabled
> @@ -492,12 +498,6 @@ void smp_call_function_many(const struct cpumask *mask,
> cpumask_clear_cpu(this_cpu, data->cpumask);
> refs = cpumask_weight(data->cpumask);
>
> - /* some callers might race with other cpus changing the mask */
> - if (unlikely(!refs)) {
> - csd_unlock(&data->csd);
> - return;
> - }
> -

Again, this is needed. You were considering the csd mask, but this
is protecting against the cpumask argument (the source of the
cpumask_copy) changing.

> /*
> * We reuse the call function data without waiting for any grace
> * period after some other cpu removes it from the global queue.
> @@ -527,8 +527,9 @@ void smp_call_function_many(const struct cpumask *mask,
> * Make the list addition visible before sending the ipi.
> * (IPIs must obey or appear to obey normal Linux cache
> * coherency rules -- see comment in generic_exec_single).
> + * The unlock above combined with the lock in the IPI
> + * code covers this requirement.
> */
> - smp_mb();

again, this requires the arch audit.

>
> /* Send a message to all CPUs in the map */
> arch_send_call_function_ipi_mask(data->cpumask);

milton

Paul E. McKenney

unread,
Feb 1, 2011, 11:18:11 PM2/1/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, efa...@gmx.de, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org

Urgh. OK, I never would have had the guts to invoke smp_call_function()
with a cpumask that was being concurrently updated, nor the breadth
of imagination to think that someone else might do so, but fair point
nevertheless. I restored this code.

A case of "that can't possibly be what the meant!"

Maybe the comment should say -which- mask? ;-)

> > Furthermore, csd_lock() contains a full memory barrier that pairs
> > with the full memory barrier in csd_unlock(), so if csd_lock()
> > returns, we are guaranteed to see the effects of all accesses
> > to the prior incarnation of this structure.
> >
> > Yes, there might be another CPU that got a pointer to this
> > callback just as another CPU removed it. That CPU will see
> > the callback as having their bit of the CPU mask zero until
> > we set it, and they will further see ->refs as being zero
> > until we set it. And we don't set ->refs until after we
> > re-insert the callback.
> >
> > So I think we can drop this "if" statement entirely.
>
> Disagree, see above.

K.

> > o The smp_mb() below look extraneous. The comment in
> > generic_exec_single() agrees -- it says that the IPI
> > code is required to impose ordering. Besides,
> > there is a lock in arch_send_call_function_ipi_mask(),
> > and in conjunction with the unlock below, this makes
> > a full memory barrier. So I believe that we can drop
> > the smp_mb() shown below. (But not the comment!!!)
> >
> > raw_spin_unlock_irqrestore(&call_function.lock, flags);
> >
> > /*
> > * Make the list addition visible before sending the ipi.
> > * (IPIs must obey or appear to obey normal Linux cache
> > * coherency rules -- see comment in generic_exec_single).
> > */
> > smp_mb();
>
>
> Well, this says the generic code guarantees mb before calling IPI ...
> this has been here since 2.6.30 and would need at least an arch audit
> so I vote to leave this here for now.
>
> After we are confident in the locking, we can remove the internal bugs
> and warn on the cpumask and refs, leaving the one for mask clear and
> the ones for cpu online and calling context.

Fair enough. It does not affect scalability, and probably does not
affect performance all that much.

> > /* Send a message to all CPUs in the map */
> > arch_send_call_function_ipi_mask(data->cpumask);
> >
> > Next up: generic_smp_call_function_interrupt()
> >
> > o Shouldn't the smp_mb() at the top of the function be supplied
> > in arch-specific code for those architectures that require it?
>
> Again, this would require an arch audit, I vote to leave for now.
> This says the generic code will guarantee that an IPI is received after
> the list is visiable with a mb on each side (as long as the arch makes
> IPI appear as ordered by smp_mb()).

Yep. I did leave this one.

Because smp_rmb() is not required to order prior writes against
subsequent reads. The prior third-party writes are writes, right?
When you want transitivity (observing n-th party writes that
n-1-th party observed before n-1-th party's memory barrier), then
you need a full memory barrier -- smp_mb().

> How can any cpu see the old refs after the new mask is set? And if they
> could, how can we guarantee the ordering between the cpumask clearing
> and the unlock?
>
> We rely on the statment that the cpumask memory will not change to an
> intermediate value has bits set that will be later cleared, that is we
> rely on the and of the incoming mask and online mask to occur before the
> store. We are broken if we copy online mask then and in the source mask..
>
> The multiple non-atomic writes to data, func, and mask must be occur
> after the mb in lock and before the wmb in list_add_rcu. This wmb
> must force all processing except for the set of refs, and we require
> atomic_set to be atomic wrt atomic_dec_return.
>
>
>
> > But some third CPU did that atomic decrement, which
> > brings transitivity into play, which leads me to believe
> > that this smp_rmb() might need to be upgraded to a
> > full smp_mb().
>
> I'm not sure I understand the need for this upgrade. We are only
> performing reads on this cpu, how is smp_mb stronger than smp_rmb?

The smp_mb() is stronger in that it orders prior writes against
subsequent reads, which smp_rmb() is not required to do.

> All the prior writes were ordered by atomic returns or a full mb.
>
>
> >
> > o After we verify that the ->refs field is non-zero, we pick
> > up the ->csd.func and ->csd.info fields, but with no intervening
> > memory barrier. There needs to be at least an smp_rmb()
> > following the test of the ->refs field. Since there is
> > no transitivity required, smp_rmb() should do the trick.
> >
> >
> > If this seems unnecessary, please keep in mind that there
> > is nothing stopping the compiler and the CPU from reordering
> > the statements in smp_call_function_many() that initialize
> > ->csd.func, ->csd.info, and ->cpumask.
>
> Ok I agree, we need either wmb before setting cpumask or rmb before
> reading data & func after wmb is set.

The smp_rmb() is required between testing ->refs and reading ->csd.data
and ->csd.func regardless. Otherwise the compiler or the CPU can
cause ->csd.data and ->csd.func to be read before ->refs is tested,
which can get you garbage.

At the other end, there is already plenty of ordering between setting
->cpumask and setting ->refs, which are the relevant accesses for this
memory barrier's pair.

> > In contrast, given the suggested smp_rmb(), we are guaranteed
> > of the identity of the callback from the time we test ->refs
> > until the time someone calls csd_unlock(), which cannot precede
> > the time that we atomically decrement ->refs.
> >
> > o It seems silly to pick up the ->csd.func field twice. Why
> > not use the local variable?
>
> I added the local variable for the warn and just left the compiler to
> recognise the duplicate load. But yes, we could use the local to make
> the call too. At the time I choose fewer loc changed.

Fair enough.

> > o The cpumask_test_and_clear_cpu() and the atomic_dec_return()
> > are both atomic operations that return values, so they act
> > as full memory barriers.
> >
> > The cpumask_test_and_clear_cpu() needs to be atomic despite
> > the identity guarantee because CPUs might be concurrently
> > trying to clear bits in the same word.
>
> Ok and in addition, since these are necessarly ordered then we
> can assert that the test_and_clear_cpu must execute before the
> decrement of refs, which is requried for total ordering to say
> that all cpus are done clearing their mask bit and decrementing
> refs before we do the csd_unlock, which allows the reuse.

Except for the need for transitivity.

It means that the other two CPUs were in agreement about ordering.
But we are running on a third CPU.

> > if (atomic_read(&data->refs) == 0)
> > continue;
> >
> > + smp_rmb(); /* If we see non-zero ->refs, we need to see all */
> > + /* other initialization for this incarnation of */
> > + /* this callback. */
> > +
>
> /* Need to read func and info after refs to see the new values, pairs
> with the wmb before setting refs. */

Is this a suggested addition or a suggested replacement? Assuming the
latter, updated accordingly.

> > func = data->csd.func; /* for later warn */
> > - data->csd.func(data->csd.info);
> > + func(data->csd.info);
>
> Ok, and can merge with the add of rmb.

K

> > /*
> > * If the cpu mask is not still set then func enabled
> > @@ -492,12 +498,6 @@ void smp_call_function_many(const struct cpumask *mask,
> > cpumask_clear_cpu(this_cpu, data->cpumask);
> > refs = cpumask_weight(data->cpumask);
> >
> > - /* some callers might race with other cpus changing the mask */
> > - if (unlikely(!refs)) {
> > - csd_unlock(&data->csd);
> > - return;
> > - }
> > -
>
> Again, this is needed, you were considering the csd mask but this
> is protecting against the cpumask argument is the source of the
> cpumask_copy changing.

Yep, restored.

> > /*
> > * We reuse the call function data without waiting for any grace
> > * period after some other cpu removes it from the global queue.
> > @@ -527,8 +527,9 @@ void smp_call_function_many(const struct cpumask *mask,
> > * Make the list addition visible before sending the ipi.
> > * (IPIs must obey or appear to obey normal Linux cache
> > * coherency rules -- see comment in generic_exec_single).
> > + * The unlock above combined with the lock in the IPI
> > + * code covers this requirement.
> > */
> > - smp_mb();
>
> again, this requires the arch audit.

Yep, restored.

> > /* Send a message to all CPUs in the map */
> > arch_send_call_function_ipi_mask(data->cpumask);

So how about the following?

Thanx, Paul

------------------------------------------------------------------------

smp_call_function: additional memory-order tightening.

The csd_lock() and csd_unlock() interaction guarantees that the
smp_call_function_many() function sees the results of interactions
with prior incarnations of the callback, so the check is not needed.
Instead, tighter memory ordering is required in the companion
generic_smp_call_function_interrupt() function to ensure proper
interaction with partially initialized callbacks.

Signed-off-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

diff --git a/kernel/smp.c b/kernel/smp.c
index 064bb6e..e091905 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -182,7 +182,7 @@ void generic_smp_call_function_interrupt(void)

/*
* Ensure entry is visible on call_function_queue after we have
- * entered the IPI. See comment in smp_call_function_many.
+ * entered the IPI. See comment in smp_call_function_many
* If we don't have this, then we may miss an entry on the list
* and never get another IPI to process it.
*/
@@ -209,13 +209,18 @@ void generic_smp_call_function_interrupt(void)


if (!cpumask_test_cpu(cpu, data->cpumask))
continue;

- smp_rmb();
+ smp_mb(); /* If we see our bit set above, we need to see */
+ /* all the processing associated with the prior */
+ /* incarnation of this callback. */

if (atomic_read(&data->refs) == 0)
continue;

+ smp_rmb(); /* We need to read ->refs before we read either */
+ /* ->csd.func and ->csd.info. */
+


func = data->csd.func; /* for later warn */
- data->csd.func(data->csd.info);
+ func(data->csd.info);

/*
* If the cpu mask is not still set then func enabled

Mike Galbraith

unread,
Feb 2, 2011, 1:22:16 AM2/2/11
to Milton Miller, Paul E. McKenney, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-02-01 at 14:00 -0800, Milton Miller wrote:
> On Tue, 1 Feb 2011 about 14:00:26 -0800, "Paul E. McKenney" wrote:

> > Starting with smp_call_function_many():
> >
> > o The check for refs is redundant:
> >
> > /* some callers might race with other cpus changing the mask */
> > if (unlikely(!refs)) {
> > csd_unlock(&data->csd);
> > return;
> > }
> >
> > The memory barriers and atomic functions in
> > generic_smp_call_function_interrupt() prevent the callback from
> > being reused before the cpumask bits have all been cleared, right?
>
> The issue is not the cpumask in the csd, but the mask passed in from the
> caller. If other cpus clear the mask between the cpumask_first and
> cpumask_next above (where we established there were at least two cpus not
> ourself) and the cpumask_copy, then this can happen. Both Mike Galbraith
> and Jan Beulich saw this in practice (Mike's case was mm_cpumask(mm)).

Mine (and Jan's) is a flavor of the one that was hit and fixed via a cpumask copy in ia64.

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=75c1c91cb92806f960fcd6e53d2a0c21f343081c


-Mike

Paul E. McKenney

unread,
Feb 6, 2011, 7:32:38 PM2/6/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, efa...@gmx.de, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, Feb 01, 2011 at 08:17:40PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 01, 2011 at 02:00:26PM -0800, Milton Miller wrote:
> > On Tue, 1 Feb 2011 about 14:00:26 -0800, "Paul E. McKenney" wrote:
> > > On Tue, Feb 01, 2011 at 01:12:18AM -0600, Milton Miller wrote:

[ . . . ]

FYI, for an example showing the need for smp_mb() to gain transitivity,
please see the following:

o http://paulmck.livejournal.com/20061.html
o http://paulmck.livejournal.com/20312.html
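
The same pattern can be written as a small userspace model (purely
illustrative: relaxed C11 atomics plus fences stand in for the kernel
accesses and barriers, and a single run will almost never show the
reordering -- the point is only what is and is not promised):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X, Y;         /* both start at 0 */
static int r2;

static void *cpu0(void *u)      /* the third party: does the first store */
{
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        return NULL;
}

static void *cpu1(void *u)      /* sees X == 1, then publishes Y */
{
        while (!atomic_load_explicit(&X, memory_order_relaxed))
                ;
        atomic_thread_fence(memory_order_acquire);      /* roughly smp_rmb() */
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        return NULL;
}

static void *cpu2(void *u)      /* sees Y == 1, then reads X */
{
        while (!atomic_load_explicit(&Y, memory_order_relaxed))
                ;
        atomic_thread_fence(memory_order_acquire);      /* roughly smp_rmb() */
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t t[3];

        pthread_create(&t[0], NULL, cpu0, NULL);
        pthread_create(&t[1], NULL, cpu1, NULL);
        pthread_create(&t[2], NULL, cpu2, NULL);
        for (int i = 0; i < 3; i++)
                pthread_join(t[i], NULL);

        /* With read barriers only, nothing above rules out r2 == 0: cpu2 is
         * not guaranteed to see the third party's store to X merely because
         * cpu1 saw it before publishing Y.  memory-barriers.txt documents
         * that only general barriers (smp_mb()) provide that transitivity. */
        printf("r2 = %d\n", r2);
        return 0;
}

Milton's counter-argument in this thread is that the prior writes are already
chained through value-returning atomics and the csd_lock()/csd_unlock() full
barriers; his later patch sidesteps the question entirely by having the owning
cpu rewrite refs to 0 itself, turning the third-party write into a first-party
one.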

Thanx, Paul

Mike Galbraith

unread,
Feb 7, 2011, 3:20:20 AM2/7/11
to pau...@linux.vnet.ibm.com, Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-02-01 at 20:17 -0800, Paul E. McKenney wrote:

FWIW, my red headed stepchild (.32 xen cluster beast) with..
smp-smp_call_function_many-fix-SMP-race (6dc1989)
smp-consolidate-writes-in-smp_call_function_interrupt (225c8e0)
smp-smp_call_function_many-fix-list-delete-vs-add-race (V2)
smp-smp_call_function_many-handle-concurrent-clearing-of-mask (V2)
smp-generic_smp_call_function_interrupt-additional-memory-order-tightening (below)
..has not experienced any IPI problems lately, nor have I been able to
trigger anything beating up my box running normal x86_64 kernels.

That's not saying much since my IPI woes were only the concurrent
clearing variety, just letting you know that these patches have received
(and are receiving) hefty thumpage.

Paul E. McKenney

unread,
Feb 8, 2011, 2:37:06 PM2/8/11
to Mike Galbraith, Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Mon, Feb 07, 2011 at 09:12:54AM +0100, Mike Galbraith wrote:
> On Tue, 2011-02-01 at 20:17 -0800, Paul E. McKenney wrote:
>
> FWIW, my red headed stepchild (.32 xen cluster beast) with..
> smp-smp_call_function_many-fix-SMP-race (6dc1989)
> smp-consolidate-writes-in-smp_call_function_interrupt (225c8e0)
> smp-smp_call_function_many-fix-list-delete-vs-add-race (V2)
> smp-smp_call_function_many-handle-concurrent-clearing-of-mask (V2)
> smp-generic_smp_call_function_interrupt-additional-memory-order-tightening (below)
> ..has not experienced any IPI problems lately, nor have I been able to

> trigger anything beating up my box running normal x86_64 kernels.
>
> That's not saying much since my IPI woes were only the concurrent
> clearing variety, just letting you know that these patches have received
> (and are receiving) hefty thumpage.

Very good, I have added your Tested-by to my patch.

Thanx, Paul

Milton Miller

unread,
Mar 15, 2011, 3:27:27 PM3/15/11
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Paul McKenney's review pointed out two problems with the barriers
in the 2.6.38 update to the smp call function many code.

First, a barrier that would force the func and info members of data
to be visible before their consumption in the interrupt handler
was missing. This can be solved by adding an smp_wmb between
setting the func and info members and setting the cpumask;
this will pair with the existing and required smp_rmb ordering
the cpumask read before the read of refs. This placement avoids
the need for a second smp_rmb in the interrupt handler which would
be executed on each of the N cpus executing the call request.
(I was thinking this barrier was present but it was not).
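
Schematically, that pairing is the usual publish/consume pattern. A small
userspace model of it (again only an illustration with stand-in names, not
kernel code; release/acquire fences play the roles of smp_wmb()/smp_rmb()):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int csd_payload;                 /* stands in for ->csd.func / ->csd.info */
static atomic_int my_cpumask_bit;       /* stands in for this cpu's bit in data->cpumask */

static void *owner_cpu(void *u)
{
        csd_payload = 42;                               /* plain writes: func, info */
        atomic_thread_fence(memory_order_release);      /* the added smp_wmb() */
        atomic_store_explicit(&my_cpumask_bit, 1, memory_order_relaxed);
        return NULL;                                    /* (refs is set later still) */
}

static void *handler_cpu(void *u)
{
        while (!atomic_load_explicit(&my_cpumask_bit, memory_order_relaxed))
                ;                                       /* wait until our bit appears */
        atomic_thread_fence(memory_order_acquire);      /* the existing smp_rmb() */
        printf("payload = %d\n", csd_payload);          /* guaranteed to print 42 */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, owner_cpu, NULL);
        pthread_create(&b, NULL, handler_cpu, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Because the wmb now sits before the cpumask store, the one rmb the handler
already executes between its cpumask test and its refs test also covers func
and info, so no second rmb is needed on the (hot) interrupt side.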

Second, the previous write to refs (establishing the zero that
the interrupt handler was testing from all cpus) was performed
by a third party cpu. This would invoke transitivity which, as
a recent or concurrent addition to memory-barriers.txt now
explicitly states, would require a full smp_mb().

However, we know the cpumask will only be set by one cpu (the
data owner) and any previous iteration of the mask would have
been cleared by the reading cpu. By redundantly writing refs to 0 on
the owning cpu before the smp_wmb, the write to refs will follow
the same path as the writes that set the cpumask, which in turn
allows us to keep the barrier in the interrupt handler an smp_rmb
instead of promoting it to an smp_mb (which would be executed
by N cpus for each of the possible M elements on the list).

I moved and expanded the comment about our (ab)use of the rcu list
primitives for the concurrent walk earlier into this function.
I considered moving the first two paragraphs to the queue list
head and lock, but felt it would have been too disconnected from
the code.

Cc: Paul McKenney <pau...@linux.vnet.ibm.com>
Cc: stable (2.6.32 and later)
Signed-off-by: Milton Miller <mil...@bga.com>

Paul, please review this alternative to your patch at either of

http://marc.info/?l=linux-kernel&m=129662029916241&w=2
https://patchwork.kernel.org/patch/525891/

In contrast to your patch, this proposal keeps the additional
barriers in the (interrupt enabled) requester instead of in the
execution interrupt, where they would be executed by all N cpus
in the system, triggered concurrently, for each of the N possible
elements on the list.

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-03-15 05:21:41.000000000 -0500
+++ common/kernel/smp.c 2011-03-15 05:22:26.000000000 -0500
@@ -483,23 +483,42 @@ void smp_call_function_many(const struct

data = &__get_cpu_var(cfd_data);
csd_lock(&data->csd);
- BUG_ON(atomic_read(&data->refs) || !cpumask_empty(data->cpumask));

- data->csd.func = func;
- data->csd.info = info;
- cpumask_and(data->cpumask, mask, cpu_online_mask);
- cpumask_clear_cpu(this_cpu, data->cpumask);
+ /* This BUG_ON verifies our reuse assertions and can be removed */
+ BUG_ON(atomic_read(&data->refs) || !cpumask_empty(data->cpumask));

/*
+ * The global call function queue list add and delete are protected
+ * by a lock, but the list is traversed without any lock, relying
+ * on the rcu list add and delete to allow safe concurrent traversal.


* We reuse the call function data without waiting for any grace
* period after some other cpu removes it from the global queue.

- * This means a cpu might find our data block as it is writen.
- * The interrupt handler waits until it sees refs filled out
- * while its cpu mask bit is set; here we may only clear our
- * own cpu mask bit, and must wait to set refs until we are sure
- * previous writes are complete and we have obtained the lock to
- * add the element to the queue.
+ * This means a cpu might find our data block as it is being
+ * filled out.
+ *
+ * We hold off the interrupt handler on the other cpu by
+ * ordering our writes to the cpu mask vs our setting of the
+ * refs counter. We assert only the cpu owning the data block
+ * will set a bit in cpumask, and each bit will only be cleared
+ * by the subject cpu. Each cpu must first find its bit is
+ * set and then check that refs is set indicating the element is
+ * ready to be processed, otherwise it must skip the entry.
+ *
+ * On the previous iteration refs was set to 0 by another cpu.
+ * To avoid the use of transitivity, set the counter to 0 here
+ * so the wmb will pair with the rmb in the interrupt handler.
*/
+ atomic_set(&data->refs, 0); /* convert 3rd to 1st party write */
+
+ data->csd.func = func;
+ data->csd.info = info;
+
+ /* Ensure 0 refs is visible before mask. Also orders func and info */
+ smp_wmb();
+
+ /* We rely on the "and" being processed before the store */
+ cpumask_and(data->cpumask, mask, cpu_online_mask);
+ cpumask_clear_cpu(this_cpu, data->cpumask);

raw_spin_lock_irqsave(&call_function.lock, flags);
/*
@@ -509,8 +528,9 @@ void smp_call_function_many(const struct
*/
list_add_rcu(&data->csd.list, &call_function.queue);
/*
- * We rely on the wmb() in list_add_rcu to order the writes
- * to func, data, and cpumask before this write to refs.
+ * We rely on the wmb() in list_add_rcu to complete our writes
+ * to the cpumask before this write to refs, which indicates
+ * data is on the list and is ready to be processed.
*/


atomic_set(&data->refs, cpumask_weight(data->cpumask));
raw_spin_unlock_irqrestore(&call_function.lock, flags);

Milton Miller

unread,
Mar 15, 2011, 3:27:31 PM3/15/11
to ak...@linux-foundation.org, Peter Zijlstra, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, Jan Beulich, Dimitri Sivanich, Tony Luck, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, efa...@gmx.de, be...@kernel.crashing.org, linux-...@vger.kernel.org
Use the newly added smp_call_func_t in smp_call_function_interrupt
for the func variable, and make the comment above the WARN more assertive
and explicit. Also, func is a function pointer and does not need an
offset, so use %pf not %pS.

Signed-off-by: Milton Miller <mil...@bga.com>
---

v3: use func to call the function too, as suggested, instead of making
the compiler notice the repeated access.

Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-03-15 06:22:26.000000000 -0500
+++ common/kernel/smp.c 2011-03-15 06:55:07.000000000 -0500
@@ -194,7 +194,7 @@ void generic_smp_call_function_interrupt
*/
list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
int refs;
- void (*func) (void *info);
+ smp_call_func_t func;

/*
* Since we walk the list without any locks, we might
@@ -214,16 +214,17 @@ void generic_smp_call_function_interrupt


if (atomic_read(&data->refs) == 0)
continue;

- func = data->csd.func; /* for later warn */
- data->csd.func(data->csd.info);
+ func = data->csd.func; /* save for later warn */
+ func(data->csd.info);

/*
- * If the cpu mask is not still set then it enabled interrupts,
- * we took another smp interrupt, and executed the function
- * twice on this cpu. In theory that copy decremented refs.
+ * If the cpu mask is not still set then func enabled
+ * interrupts (BUG), and this cpu took another smp call
+ * function interrupt and executed func(info) twice
+ * on this cpu. That nested execution decremented refs.
*/
if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) {
- WARN(1, "%pS enabled interrupts and double executed\n",
+ WARN(1, "%pf enabled interrupts and double executed\n",
func);
continue;

Milton Miller

unread,
Mar 15, 2011, 3:27:43 PM3/15/11
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Picking up this thread from the beginning of February, I updated
the comments in 3 and 4 based on Paul's review. I have inserted
my proposed alternative to Paul's additional patch as patch 2,
and tweaked the changelog on 1.

[PATCH 1/4 v3] call_function_many: fix list delete vs add race
[PATCH 2/4 v3] call_function_many: add missing ordering
[PATCH 3/4 v3] smp_call_function_many: handle concurrent clearing of mask
[PATCH 4/4 v3] smp_call_function_interrupt: use typedef and %pf

Peter Z acked a prior version of patch 1, and Mike Galbraith has
done some stress testing on the combination of 1, 3, 4, and Paul's patch.

The prior posting of this series is available here:
https://patchwork.kernel.org/patch/522021/
https://patchwork.kernel.org/patch/522031/
http://marc.info/?l=linux-kernel&m=129654439817236&w=2

And Paul's additional patch from review was here
https://patchwork.kernel.org/patch/525891/

Looking forward, I would suggest 1 and 2 are required for stable, 3 may
be suitable as it fixes races that otherwise require cpumask copies
as in 75c1c91cb92806f960fcd6e53d2a0c21f343081c ([IA64] eliminate race
condition in smp_flush_tlb_mm). By contrast 4 is just comments except
for the %pS to %pf change.

Dimitri or Tony, let me know if you would like a revert patch for
75c1c91c to be added after patch 3.


milton

Milton Miller

unread,
Mar 15, 2011, 3:27:49 PM3/15/11
to Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Peter pointed out there was nothing preventing the list_del_rcu in
smp_call_function_interrupt from running before the list_add_rcu in
smp_call_function_many. Fix this by not setting refs until we
have gotten the lock for the list. Take advantage of the wmb in
list_add_rcu to save an explicit additional one.

I tried to force this race with a udelay before the lock & list_add
and by mixing all 64 online cpus with just 3 random cpus in the mask,
but was unsuccessful. Still, inspection shows a valid race, and the
fix is an extension of the existing protection window in the current code.

Cc: stable (v2.6.32 and later)
Reported-by: Peter Zijlstra <pet...@infradead.org>


Signed-off-by: Milton Miller <mil...@bga.com>
---

v2: rely on the wmb in list_add_rcu rather than the combined partial ordering
of spin lock and unlock, which does not provide the needed guarantees.

Index: linux-2.6/kernel/smp.c
===================================================================


--- linux-2.6.orig/kernel/smp.c 2011-01-31 17:44:47.182756513 -0600
+++ linux-2.6/kernel/smp.c 2011-01-31 18:25:47.266755387 -0600
@@ -491,14 +491,15 @@ void smp_call_function_many(const struct
cpumask_clear_cpu(this_cpu, data->cpumask);

/*
- * To ensure the interrupt handler gets an complete view
- * we order the cpumask and refs writes and order the read
- * of them in the interrupt handler. In addition we may
- * only clear our own cpu bit from the mask.

+ * We reuse the call function data without waiting for any grace
+ * period after some other cpu removes it from the global queue.
+ * This means a cpu might find our data block as it is writen.
+ * The interrupt handler waits until it sees refs filled out
+ * while its cpu mask bit is set; here we may only clear our
+ * own cpu mask bit, and must wait to set refs until we are sure
+ * previous writes are complete and we have obtained the lock to
+ * add the element to the queue.
*/
- smp_wmb();
-


- atomic_set(&data->refs, cpumask_weight(data->cpumask));

raw_spin_lock_irqsave(&call_function.lock, flags);
/*
@@ -507,6 +508,11 @@ void smp_call_function_many(const struct
* will not miss any other list entries:
*/
list_add_rcu(&data->csd.list, &call_function.queue);
+ /*

+ * We rely on the wmb() in list_add_rcu to order the writes
+ * to func, data, and cpumask before this write to refs.
+ */
+ atomic_set(&data->refs, cpumask_weight(data->cpumask));
raw_spin_unlock_irqrestore(&call_function.lock, flags);

/*

Milton Miller

unread,
Mar 15, 2011, 3:28:06 PM3/15/11
to ak...@linux-foundation.org, Mike Galbraith, Dimitri Sivanich, Tony Luck, Jan Beulich, Peter Zijlstra, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
Mike Galbraith reported finding a lockup ("perma-spin bug") where the
cpumask passed to smp_call_function_many was cleared by other cpu(s)
while a cpu was preparing its call_data block, resulting in no cpu to
clear the last ref and unlock the block.

Having cpus clear their bit asynchronously could be useful on a mask of
cpus that might have a translation context, or cpus that need a push to
complete an rcu window.

Instead of adding a BUG_ON and requiring yet another cpumask copy, just
detect the race and handle it.

Note: arch_send_call_function_ipi_mask must still handle an empty
cpumask because the data block is globally visible before that
arch callback is made. And (obviously) there are no guarantees as to
which cpus are notified if the mask is changed during the call;
only cpus that were online and had their mask bit set during the
whole call are guaranteed to be called.

Reported-by: Mike Galbraith <efa...@gmx.de>
Reported-by: Jan Beulich <JBeu...@novell.com>


Signed-off-by: Milton Miller <mil...@bga.com>
---

v3: try to clarify which mask in comment


v2: rediff for v2 of call_function_many: fix list delete vs add race

The arch code not expecting the race to empty the mask is the cause
of https://bugzilla.kernel.org/show_bug.cgi?id=23042 that Andrew pointed
out.


Index: common/kernel/smp.c
===================================================================
--- common.orig/kernel/smp.c 2011-03-15 05:22:26.000000000 -0500
+++ common/kernel/smp.c 2011-03-15 06:22:26.000000000 -0500


@@ -450,7 +450,7 @@ void smp_call_function_many(const struct
{
struct call_function_data *data;
unsigned long flags;
- int cpu, next_cpu, this_cpu = smp_processor_id();
+ int refs, cpu, next_cpu, this_cpu = smp_processor_id();

/*
* Can deadlock when called with interrupts disabled.

@@ -461,7 +461,7 @@ void smp_call_function_many(const struct
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);

- /* So, what's a CPU they want? Ignoring this one. */
+ /* Try to fastpath. So, what's a CPU they want? Ignoring this one. */
cpu = cpumask_first_and(mask, cpu_online_mask);
if (cpu == this_cpu)
cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
@@ -519,6 +519,13 @@ void smp_call_function_many(const struct
/* We rely on the "and" being processed before the store */
cpumask_and(data->cpumask, mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, data->cpumask);
+ refs = cpumask_weight(data->cpumask);
+
+ /* Some callers race with other cpus changing the passed mask */
+ if (unlikely(!refs)) {
+ csd_unlock(&data->csd);
+ return;
+ }

raw_spin_lock_irqsave(&call_function.lock, flags);
/*
@@ -532,7 +539,7 @@ void smp_call_function_many(const struct
* to the cpumask before this write to refs, which indicates
* data is on the list and is ready to be processed.
*/
- atomic_set(&data->refs, cpumask_weight(data->cpumask));
+ atomic_set(&data->refs, refs);

Luck, Tony

unread,
Mar 15, 2011, 4:22:40 PM3/15/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
> Dimitri or Tony, let me know if you would like a revert patch for
> 75c1c91c to be added after patch 3.

It appears that we won't need this anymore, so sure - revert it.

-Tony

Dimitri Sivanich

unread,
Mar 15, 2011, 4:33:13 PM3/15/11
to Luck, Tony, Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, Mar 15, 2011 at 01:22:31PM -0700, Luck, Tony wrote:
> > Dimitri or Tony, let me know if you would like a revert patch for
> > 75c1c91c to be added after patch 3.
>
> It appears that we won't need this anymore, so sure - revert it.
>
Reverting it seems OK to me.

Peter Zijlstra

unread,
Mar 15, 2011, 4:37:51 PM3/15/11
to Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-03-15 at 13:27 -0600, Milton Miller wrote:
> Picking up this thread from the beginning of February, I updated
> the comments in 3 and 4 based on Paul's review. I have inserted
> my proposed alternative to Paul's additional patch as patch 2,
> and tweaked the changelog on 1.
>
> [PATCH 1/4 v3] call_function_many: fix list delete vs add race
> [PATCH 2/4 v3] call_function_many: add missing ordering
> [PATCH 3/4 v3] smp_call_function_many: handle concurrent clearing of mask
> [PATCH 4/4 v3] smp_call_function_interrupt: use typedef and %pf
>
> Peter Z acked a prior version of patch 1, and Mike Galbraith has
> done some stress testing on the combination of 1, 3, 4, and Paul's patch.
>
> The prior posting of this series is available here:
> https://patchwork.kernel.org/patch/522021/
> https://patchwork.kernel.org/patch/522031/
> http://marc.info/?l=linux-kernel&m=129654439817236&w=2
>
> And Paul's additional patch from review was here
> https://patchwork.kernel.org/patch/525891/
>
> Looking forward, I would suggest 1 and 2 are required for stable, 3 may
> be suitable as it fixes races that otherwise requires cpumask copies
> as in 75c1c91cb92806f960fcd6e53d2a0c21f343081c ([IA64] eliminate race
> condition in smp_flush_tlb_mm). By contrast 4 is just comments except
> for the %pS to %pf change.

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Very interesting that patch #2.

Catalin Marinas

unread,
Mar 15, 2011, 6:32:31 PM3/15/11
to Milton Miller, ak...@linux-foundation.org, Mike Galbraith, Dimitri Sivanich, Tony Luck, Jan Beulich, Peter Zijlstra, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tuesday, 15 March 2011, Milton Miller <mil...@bga.com> wrote:
> Mike Galbraith reported finding a lockup ("perma-spin bug") where the
> cpumask passed to smp_call_function_many was cleared by other cpu(s)
> while a cpu was preparing its call_data block, resulting in no cpu to
> clear the last ref and unlock the block.

I missed your patch and just posted something similar. I'll leave it
to you then ;)

https://lkml.org/lkml/2011/3/15/296

Catalin

--
Catalin

Jan Beulich

unread,
Mar 16, 2011, 3:51:52 AM3/16/11
to Milton Miller, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, Mike Galbraith, Peter Zijlstra, Tony Luck, be...@kernel.crashing.org, ak...@linux-foundation.org, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, ru...@rustcorp.com.au, Anton Blanchard, Dimitri Sivanich, linux-...@vger.kernel.org
>>> On 15.03.11 at 20:27, Milton Miller <mil...@bga.com> wrote:
> Mike Galbraith reported finding a lockup ("perma-spin bug") where the
> cpumask passed to smp_call_function_many was cleared by other cpu(s)
> while a cpu was preparing its call_data block, resulting in no cpu to
> clear the last ref and unlock the block.
>
> Having cpus clear their bit asynchronously could be useful on a mask of
> cpus that might have a translation context, or cpus that need a push to
> complete an rcu window.
>
> Instead of adding a BUG_ON and requiring yet another cpumask copy, just
> detect the race and handle it.
>
> Note: arch_send_call_function_ipi_mask must still handle an empty
> cpumask because the data block is globally visible before that
> arch callback is made. And (obviously) there are no guarantees as to
> which cpus are notified if the mask is changed during the call;
> only cpus that were online and had their mask bit set during the
> whole call are guaranteed to be called.
>
> Reported-by: Mike Galbraith <efa...@gmx.de>
> Reported-by: Jan Beulich <JBeu...@novell.com>
> Signed-off-by: Milton Miller <mil...@bga.com>

Acked-by: Jan Beulich <jbeu...@novell.com>

Paul E. McKenney

unread,
Mar 16, 2011, 8:06:48 AM3/16/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org

Hello, Milton,

At first glance, this looks promising, but it will take me a few
more days to wrap my head fully around it. Fair enough?

Thanx, Paul

Linus Torvalds

unread,
Mar 16, 2011, 1:56:38 PM3/16/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, Mar 15, 2011 at 12:27 PM, Milton Miller <mil...@bga.com> wrote:
>
> Looking forward, I would suggest 1 and 2 are required for stable, 3 may
> be suitable as it fixes races that otherwise requires cpumask copies
> as in 75c1c91cb92806f960fcd6e53d2a0c21f343081c ([IA64] eliminate race
> condition in smp_flush_tlb_mm).  By contrast 4 is just comments except
> for the %pS to %pf change.

Ok, who should I expect to take this series from? I think the last
batch came through Andrew. The kernel/smp.c file seems to be one of
those "unclear maintenance rules" ones. The git statistics for the last
12 months seem to be

Andrew Morton <ak...@linux-foundation.org> (commit_signer:4/9=44%)
Peter Zijlstra <pet...@infradead.org> (commit_signer:2/9=22%)
Milton Miller <mil...@bga.com> (commit_signer:2/9=22%)
Tejun Heo <t...@kernel.org> (commit_signer:2/9=22%)
Ingo Molnar <mi...@elte.hu> (commit_signer:2/9=22%)

according to get_maintainer.pl. Should I just take these directly?

Linus

Peter Zijlstra

unread,
Mar 16, 2011, 2:14:16 PM3/16/11
to Linus Torvalds, Milton Miller, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, efa...@gmx.de, Jan Beulich, Dimitri Sivanich, Tony Luck, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Wed, 2011-03-16 at 10:55 -0700, Linus Torvalds wrote:
> On Tue, Mar 15, 2011 at 12:27 PM, Milton Miller <mil...@bga.com> wrote:
> >
> > Looking forward, I would suggest 1 and 2 are required for stable, 3 may
> > be suitable as it fixes races that otherwise requires cpumask copies
> > as in 75c1c91cb92806f960fcd6e53d2a0c21f343081c ([IA64] eliminate race
> > condition in smp_flush_tlb_mm). By contrast 4 is just comments except
> > for the %pS to %pf change.
>
> Ok, who should I expect to take this series from? I think the last
> batch came through Andrew. The kernel/smp.c file seems to be one of
> those "unclear maintenance rules" one. The git statistics for the last
> 12 months seem to be
>
> Andrew Morton <ak...@linux-foundation.org> (commit_signer:4/9=44%)
> Peter Zijlstra <pet...@infradead.org> (commit_signer:2/9=22%)
> Milton Miller <mil...@bga.com> (commit_signer:2/9=22%)
> Tejun Heo <t...@kernel.org> (commit_signer:2/9=22%)
> Ingo Molnar <mi...@elte.hu> (commit_signer:2/9=22%)
>
> according to get_maintainer.pl. Should I just take these directly?

I'm fine with that, if people want a maintainer I can volunteer, but as
Jens wrote most of it I'd prefer him to sign off on that ;-)

Mike Galbraith

unread,
Mar 16, 2011, 11:16:01 PM3/16/11
to Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, ru...@rustcorp.com.au, Jan Beulich, Dimitri Sivanich, Tony Luck, torv...@linux-foundation.org, pau...@linux.vnet.ibm.com, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-03-15 at 13:27 -0600, Milton Miller wrote:
> Picking up this thread from the beginning of February, I updated
> the comments in 3 and 4 based on Paul's review. I have inserted
> my proposed alternative to Paul's additional patch as patch 2,
> and tweaked the changelog on 1.
>
> [PATCH 1/4 v3] call_function_many: fix list delete vs add race
> [PATCH 2/4 v3] call_function_many: add missing ordering
> [PATCH 3/4 v3] smp_call_function_many: handle concurrent clearing of mask
> [PATCH 4/4 v3] smp_call_function_interrupt: use typedef and %pf
>
> Peter Z acked a prior version of patch 1, and Mike Galbraith has
> done some stress testing on the combination of 1, 3, 4, and Paul's patch.

I beat hell out of all v2 + Paul's ordering patch. These are verified
to have fixed a nasty enterprise cluster problem.

> The prior posting of this series is available here:
> https://patchwork.kernel.org/patch/522021/
> https://patchwork.kernel.org/patch/522031/
> http://marc.info/?l=linux-kernel&m=129654439817236&w=2
>
> And Paul's additional patch from review was here
> https://patchwork.kernel.org/patch/525891/
>
> Looking forward, I would suggest 1 and 2 are required for stable, 3 may

> be suitable..

Problematic clusters say 3 is most excellent stable material.

-Mike

Mike Galbraith

unread,
Aug 21, 2011, 2:17:37 AM8/21/11
to pau...@linux.vnet.ibm.com, Milton Miller, Peter Zijlstra, ak...@linux-foundation.org, Anton Blanchard, xiaogu...@cn.fujitsu.com, mi...@elte.hu, jax...@fusionio.com, npi...@gmail.com, JBeu...@novell.com, ru...@rustcorp.com.au, torv...@linux-foundation.org, be...@kernel.crashing.org, linux-...@vger.kernel.org
On Tue, 2011-02-08 at 11:36 -0800, Paul E. McKenney wrote:
> On Mon, Feb 07, 2011 at 09:12:54AM +0100, Mike Galbraith wrote:
> > On Tue, 2011-02-01 at 20:17 -0800, Paul E. McKenney wrote:
> >
> > FWIW, my red headed stepchild (.32 xen cluster beast) with..
> > smp-smp_call_function_many-fix-SMP-race (6dc1989)
> > smp-consolidate-writes-in-smp_call_function_interrupt (225c8e0)
> > smp-smp_call_function_many-fix-list-delete-vs-add-race (V2)
> > smp-smp_call_function_many-handle-concurrent-clearing-of-mask (V2)
> > smp-generic_smp_call_function_interrupt-additional-memory-order-tightening (below)
> > ..has not experienced any IPI problems lately, nor have I been able to
> > trigger anything beating up my box running normal x64_64 kernels.
> >
> > That's not saying much since my IPI woes were only the concurrent
> > clearing variety, just letting you know that these patches have received
> > (and are receiving) hefty thumpage.
>
> Very good, I have added your Tested-by to my patch.

I'm still using your barrier changes. Is it safe to drop them? HA
clusters falling over is _not_ something I want to risk resurrecting by
dropping them just because they didn't go anywhere.
