	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
		continue;

	data->csd.func(data->csd.info);

	refs = atomic_dec_return(&data->refs);
	WARN_ON(refs < 0);	<-------------------------
We atomically tested and cleared our bit in the cpumask, and yet the number
of cpus left (i.e. refs) was 0. How can this be?
It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
locking from smp_call_function_many and in doing so creates a rather
complicated race.
The problem comes about because:
- The smp_call_function_many interrupt handler walks call_function.queue
without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
smp_call_function_many.
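For reference, the per-cpu structure being reused looked roughly like this
(abridged from the kernel/smp.c of that era; treat it as a sketch rather than
an exact copy):

/*
 * Sketch of the per-cpu state: there is one of these per cpu, and
 * smp_call_function_many() reuses the calling cpu's copy every time.
 */
struct call_function_data {
	struct call_single_data	csd;		/* func, info and the RCU list linkage */
	atomic_t		refs;		/* cpus that still have to run the callback */
	cpumask_var_t		cpumask;	/* cpus that should run the callback */
};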
Imagine a scenario where CPU A does two smp_call_functions back to back, and
CPU B does an smp_call_function in between. We concentrate on how CPU C handles
the calls:
CPU A                     CPU B                     CPU C

smp_call_function

                                                    smp_call_function_interrupt
                                                        walks call_function.queue
                                                        sees CPU A on list

                          smp_call_function

                                                    smp_call_function_interrupt
                                                        walks call_function.queue
                                                        sees (stale) CPU A on list
smp_call_function
reuses percpu *data
set data->cpumask
                                                        sees and clears bit in cpumask!
                                                        sees data->refs is 0!
set data->refs (too late!)
The important thing to note is that since the interrupt handler walks a
potentially stale call_function.queue without any locking, another cpu can
view the percpu *data structure at any time, even while its owner is in the
process of initialising it.
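To make the reuse concrete, the path in smp_call_function_many() that grabs
and reinitialises the per-cpu structure looked roughly like this (abridged
sketch from the kernel/smp.c of that era, with unrelated lines elided):

	data = &__get_cpu_var(cfd_data);
	csd_lock(&data->csd);	/* waits for the previous call to complete,
				 * but not for other cpus to stop looking at
				 * the stale list entry */

	data->csd.func = func;
	data->csd.info = info;
	cpumask_and(data->cpumask, mask, cpu_online_mask);
	cpumask_clear_cpu(this_cpu, data->cpumask);
	atomic_set(&data->refs, cpumask_weight(data->cpumask));

	raw_spin_lock_irqsave(&call_function.lock, flags);
	list_add_rcu(&data->csd.list, &call_function.queue);
	raw_spin_unlock_irqrestore(&call_function.lock, flags);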
The following test case hits the WARN_ON 100% of the time on my PowerPC box
(having 128 threads does help :)
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

/* Hammer smp_call_function() from every online cpu at once. */
static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init);
module_exit(testcase_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");
I tried to fix it by ordering the read and the write of ->cpumask and ->refs.
In doing so I missed a critical case, but Paul McKenney was able to spot
my bug, thankfully :) To ensure we aren't viewing previous iterations, the
interrupt handler needs to read ->refs, then ->cpumask, then ->refs _again_.
Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
---
My head hurts. This needs some serious analysis before we can be sure it
fixes all the races. With all these memory barriers, maybe the previous
spinlocks weren't so bad after all :)
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2010-03-23 05:09:08.000000000 -0500
+++ linux-2.6/kernel/smp.c 2010-03-23 06:12:40.000000000 -0500
@@ -193,6 +193,31 @@ void generic_smp_call_function_interrupt
 	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
+		/*
+		 * Since we walk the list without any locks, we might
+		 * see an entry that was completed, removed from the
+		 * list and is in the process of being reused.
+		 *
+		 * Just checking data->refs then data->cpumask is not good
+		 * enough because we could see a non zero data->refs from a
+		 * previous iteration. We need to check data->refs, then
+		 * data->cpumask then data->refs again. Talk about
+		 * complicated!
+		 */
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
+		smp_rmb();
+
+		if (!cpumask_test_cpu(cpu, data->cpumask))
+			continue;
+
+		smp_rmb();
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
 		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
 			continue;
 
@@ -446,6 +471,14 @@ void smp_call_function_many(const struct
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
+
+	/*
+	 * To ensure the interrupt handler gets an up to date view
+	 * we order the cpumask and refs writes and order the
+	 * read of them in the interrupt handler.
+	 */
+	smp_wmb();
+
 	atomic_set(&data->refs, cpumask_weight(data->cpumask));
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);
--
A rather simple question, since my brain isn't quite ready to process the
content here..
Isn't reverting that one patch a simpler solution than adding all that
extra logic? If not, then the above statement seems false and we had a
bug even with that preempt_enable/disable() pair.
Just wondering.. :-)
If I understand correctly, if you want to fix it by reverting patches,
you have to revert back to simple locking (up to and including
54fdade1c3332391948ec43530c02c4794a38172). And I believe that the poor
performance of simple locking was the whole reason for the series of patches.
Thanx, Paul
Right, then c0f68c2 did not in fact cause this bug..
;-)
Does this patch appear to have fixed things, or do you still have a
failure rate? In other words, should I be working on a proof of
(in)correctness, or should I be looking for further bugs?
Thanx, Paul
> A rather simple question, since my brain isn't quite ready to process the
> content here..
After working my way through that bug my brain wasn't in good shape
either because:
> Isn't reverting that one patch a simpler solution than adding all that
> extra logic? If not, then the above statement seems false and we had a
> bug even with that preempt_enable/disable() pair.
I screwed up the explanation :( The commit that causes it is
54fdade1c3332391948ec43530c02c4794a38172 (generic-ipi: make struct
call_function_data lockless), and backing it out fixes the issue.
Anton
But the atomic_dec_return() implies a full mb, which comes before the
list_del_rcu(); also, the next enqueue will have a wmb in list_add_rcu().
So it seems to me that if we issue an rmb it would be impossible to see a
!zero ref from the previous enlisting.
We could make this an actual atomic instruction of course..
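For reference, the completion path in question looked roughly like this in
generic_smp_call_function_interrupt() at the time (abridged sketch; the wmb
on the enqueue side comes from rcu_assign_pointer() inside list_add_rcu()):

	data->csd.func(data->csd.info);

	/* Full barrier implied here, before the list_del_rcu() below. */
	refs = atomic_dec_return(&data->refs);
	WARN_ON(refs < 0);
	if (!refs) {
		raw_spin_lock(&call_function.lock);
		list_del_rcu(&data->csd.list);
		raw_spin_unlock(&call_function.lock);
	}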