
[PATCH] percpu_counter: fix CPU unplug race in percpu_counter_destroy()


Eric Dumazet

Dec 3, 2008, 1:50:06 PM
Hi Andrew

While working on percpu_counter on net-next-2.6, I found
a CPU unplug race in percpu_counter_destroy()

(Very unlikely of course)

Thank you

[PATCH] percpu_counter: fix CPU unplug race in percpu_counter_destroy()

We should first delete the counter from percpu_counters list
before freeing memory, or a percpu_counter_hotcpu_callback()
could dereference a NULL pointer.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
---
lib/percpu_counter.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

percpu_counter_destroy.patch
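
The fix itself is in the attached diff; as a minimal sketch of the ordering the changelog describes (assuming the 2.6.28-era lib/percpu_counter.c, where percpu_counters_lock is a mutex), the destroy path would unhook the counter before freeing it:

void percpu_counter_destroy(struct percpu_counter *fbc)
{
        if (!fbc->counters)
                return;

#ifdef CONFIG_HOTPLUG_CPU
        /* Unhook from the global percpu_counters list first, so that
         * percpu_counter_hotcpu_callback() can no longer find us... */
        mutex_lock(&percpu_counters_lock);
        list_del(&fbc->list);
        mutex_unlock(&percpu_counters_lock);
#endif
        /* ...and only then release the per-cpu storage. */
        free_percpu(fbc->counters);
        fbc->counters = NULL;
}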

Eric Dumazet

Dec 3, 2008, 3:40:23 PM
Eric Dumazet wrote:

Well, this percpu_counter stuff is simply not working at all.

We added some percpu_counters to the network tree for 2.6.29, and we get
drift bugs when calling __percpu_counter_sum() while some heavy-duty
benchmarks are running on an 8-CPU machine.

1) __percpu_counter_sum() is buggy: it should not write
to per_cpu_ptr(fbc->counters, cpu), or another CPU
could get its changes lost.

__percpu_counter_sum() should be read-only (const struct percpu_counter *fbc),
with no locking needed.


2) Unneeded lock in percpu_counter_set()
This won't block another CPU doing an _add anyway.
Not a bug, but disturbing, giving a false feeling of protection.
percpu_counters are not precise; we cannot reliably set them
or read them. Period.
In fact, percpu_counter_set() callers should use
percpu_counter_add(). (It is only used from lib/proportions.c.)


Thank you

[PATCH] percpu_counter: Fix __percpu_counter_sum()

This function should not write into per-cpu local storage
without proper locking, or some changes done on other CPUs
might be lost.

Adding proper locking would require atomic
operations in the fast path and would be expensive.

Results of __percpu_counter_sum() can be wrong; this is a
known fact.

We also don't need to acquire the lock: holding it gives
no better results.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
---

lib/percpu_counter.c | 5 -----
1 files changed, 5 deletions(-)

__percpu_counter_sum.patch
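
For reference, __percpu_counter_sum() as it reads with this patch applied (reconstructed from the diff quoted in the reply below): only the summation remains, with no writes to the per-cpu slots or to fbc->count, and no lock taken:

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
        s64 ret;
        int cpu;

        ret = fbc->count;
        for_each_online_cpu(cpu) {
                s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                /* read-only: other CPUs keep their pending deltas */
                ret += *pcount;
        }
        return ret;
}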

Peter Zijlstra

Dec 3, 2008, 3:50:16 PM

Yeah, I see the race, and should have seen it much earlier. Thanks for
spotting it.

ext4 added this, and somehow relies on it (none of the other users
care), Mingming?

> [PATCH] percpu_counter: Fix __percpu_counter_sum()
>
> This function should not write into percpu local storage,
> without proper locking, or some changes done on other cpus
> might be lost.
>
> Adding proper locking would need to use atomic
> operations in fast path and would be expensive.
>
> Results of __percpu_counter_sum() can be wrong, this is a
> known fact.
>
> We also dont need to acquire the lock, this gives
> no better results.
>
> Signed-off-by: Eric Dumazet <da...@cosmosbay.com>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

> ---
> lib/percpu_counter.c | 5 -----
> 1 files changed, 5 deletions(-)

> plain text document attachment (__percpu_counter_sum.patch)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index a866389..e79bbae 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -57,16 +57,11 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
> s64 ret;
> int cpu;
>
> - spin_lock(&fbc->lock);
> ret = fbc->count;
> for_each_online_cpu(cpu) {
> s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> ret += *pcount;
> - *pcount = 0;
> }
> - fbc->count = ret;
> -
> - spin_unlock(&fbc->lock);
> return ret;
> }
> EXPORT_SYMBOL(__percpu_counter_sum);


David Miller

Dec 4, 2008, 1:20:09 AM
From: Eric Dumazet <da...@cosmosbay.com>
Date: Wed, 03 Dec 2008 19:40:07 +0100

> While working on percpu_counter on net-next-2.6, I found
> a CPU unplug race in percpu_counter_destroy()
>
> (Very unlikely of course)
>
> Thank you
>
> [PATCH] percpu_counter: fix CPU unplug race in percpu_counter_destroy()
>
> We should first delete the counter from percpu_counters list
> before freeing memory, or a percpu_counter_hotcpu_callback()
> could dereference a NULL pointer.
>
> Signed-off-by: Eric Dumazet <da...@cosmosbay.com>

Acked-by: David S. Miller <da...@davemloft.net>

David Miller

Dec 4, 2008, 1:20:09 AM
From: Eric Dumazet <da...@cosmosbay.com>
Date: Wed, 03 Dec 2008 21:24:36 +0100

> [PATCH] percpu_counter: Fix __percpu_counter_sum()
>
> This function should not write into percpu local storage,
> without proper locking, or some changes done on other cpus
> might be lost.
>
> Adding proper locking would need to use atomic
> operations in fast path and would be expensive.
>
> Results of __percpu_counter_sum() can be wrong, this is a
> known fact.
>
> We also dont need to acquire the lock, this gives
> no better results.
>
> Signed-off-by: Eric Dumazet <da...@cosmosbay.com>

Acked-by: David S. Miller <da...@davemloft.net>

Andrew Morton

Dec 6, 2008, 11:30:11 PM
On Wed, 03 Dec 2008 21:24:36 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:

> Eric Dumazet wrote:


> > Hi Andrew
> >
> > While working on percpu_counter on net-next-2.6, I found
> > a CPU unplug race in percpu_counter_destroy()
> >
> > (Very unlikely of course)
> >
> > Thank you
> >
> > [PATCH] percpu_counter: fix CPU unplug race in percpu_counter_destroy()
> >
> > We should first delete the counter from percpu_counters list
> > before freeing memory, or a percpu_counter_hotcpu_callback()
> > could dereference a NULL pointer.
> >
> > Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
> > ---
> > lib/percpu_counter.c | 4 ++--
> > 1 files changed, 2 insertions(+), 2 deletions(-)
> >
>
> Well, this percpu_counter stuff is simply not working at all.
>
> We added some percpu_counters to network tree for 2.6.29 and we get
> drift bugs if calling __percpu_counter_sum() while some heavy duty
> benches are running, on a 8 cpus machine
>
> 1) __percpu_counter_sum() is buggy, it should not write
> on per_cpu_ptr(fbc->counters, cpu), or another cpu
> could get its changes lost.
>
> __percpu_counter_sum should be read only (const struct percpu_counter *fbc),
> and no locking needed.

No, we can't do this - it will break ext4.

Take a closer look at 1f7c14c62ce63805f9574664a6c6de3633d4a354 and at
e8ced39d5e8911c662d4d69a342b9d053eaaac4e.

I suggest that what we do is to revert both those changes. We can
worry about the possibly-unneeded spin_lock later, in a separate patch.

It should have been a separate patch anyway. It's conceptually
unrelated and is not a bugfix, but it was mixed in with a bugfix.

Mingming, this needs urgent consideration, please. Note that I had to
make additional changes to ext4 due to the subsequent introduction of
the dirty_blocks counter.


Please read the below changelogs carefully and check that I have got my
head around this correctly - I may not have done.

What a mess.


From: Andrew Morton <ak...@linux-foundation.org>

Revert

commit 1f7c14c62ce63805f9574664a6c6de3633d4a354
Author: Mingming Cao <c...@us.ibm.com>
Date: Thu Oct 9 12:50:59 2008 -0400

percpu counter: clean up percpu_counter_sum_and_set()

Before this patch we had the following:

percpu_counter_sum(): return the percpu_counter's value

percpu_counter_sum_and_set(): return the percpu_counter's value, copying
that value into the central value and zeroing the per-cpu counters before
returning.

After this patch, percpu_counter_sum_and_set() has gone, and
percpu_counter_sum() gets the old percpu_counter_sum_and_set()
functionality.

Problem is, as Eric points out, the old percpu_counter_sum_and_set()
functionality was racy and wrong. It zeroes out counters on "other" CPUs
without holding any locks which would prevent races against updates from
those other CPUs.

This patch reverts 1f7c14c62ce63805f9574664a6c6de3633d4a354. This means
that percpu_counter_sum_and_set() still has the race, but
percpu_counter_sum() does not.

Note that this is not a simple revert - ext4 has since started using
percpu_counter_sum() for its dirty_blocks counter as well.


Note that this revert patch changes percpu_counter_sum() semantics.

Before the patch, a call to percpu_counter_sum() will bring the counter's
central counter mostly up-to-date, so a following percpu_counter_read()
will return a close value.

After this patch, a call to percpu_counter_sum() will leave the counter's
central accumulator unaltered, so a subsequent call to
percpu_counter_read() can now return a significantly inaccurate result.

If there is any code in the tree which was introduced after
e8ced39d5e8911c662d4d69a342b9d053eaaac4e was merged, and which depends
upon the new percpu_counter_sum() semantics, that code will break.


Reported-by: Eric Dumazet <da...@cosmosbay.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Mingming Cao <c...@us.ibm.com>
Cc: <linux...@vger.kernel.org>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
---

fs/ext4/balloc.c | 4 ++--
include/linux/percpu_counter.h | 12 +++++++++---
lib/percpu_counter.c | 8 +++++---
3 files changed, 16 insertions(+), 8 deletions(-)

diff -puN fs/ext4/balloc.c~revert-percpu-counter-clean-up-percpu_counter_sum_and_set fs/ext4/balloc.c
--- a/fs/ext4/balloc.c~revert-percpu-counter-clean-up-percpu_counter_sum_and_set
+++ a/fs/ext4/balloc.c
@@ -609,8 +609,8 @@ int ext4_has_free_blocks(struct ext4_sb_

if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
- free_blocks = percpu_counter_sum(fbc);
- dirty_blocks = percpu_counter_sum(dbc);
+ free_blocks = percpu_counter_sum_and_set(fbc);
+ dirty_blocks = percpu_counter_sum_and_set(dbc);
if (dirty_blocks < 0) {
printk(KERN_CRIT "Dirty block accounting "
"went wrong %lld\n",
diff -puN include/linux/percpu_counter.h~revert-percpu-counter-clean-up-percpu_counter_sum_and_set include/linux/percpu_counter.h
--- a/include/linux/percpu_counter.h~revert-percpu-counter-clean-up-percpu_counter_sum_and_set
+++ a/include/linux/percpu_counter.h
@@ -35,7 +35,7 @@ int percpu_counter_init_irq(struct percp
void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 __percpu_counter_sum(struct percpu_counter *fbc);
+s64 __percpu_counter_sum(struct percpu_counter *fbc, int set);

static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
@@ -44,13 +44,19 @@ static inline void percpu_counter_add(st

static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
{
- s64 ret = __percpu_counter_sum(fbc);
+ s64 ret = __percpu_counter_sum(fbc, 0);
return ret < 0 ? 0 : ret;
}

+static inline s64 percpu_counter_sum_and_set(struct percpu_counter *fbc)
+{
+ return __percpu_counter_sum(fbc, 1);
+}
+
+
static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
{
- return __percpu_counter_sum(fbc);
+ return __percpu_counter_sum(fbc, 0);
}

static inline s64 percpu_counter_read(struct percpu_counter *fbc)
diff -puN lib/percpu_counter.c~revert-percpu-counter-clean-up-percpu_counter_sum_and_set lib/percpu_counter.c
--- a/lib/percpu_counter.c~revert-percpu-counter-clean-up-percpu_counter_sum_and_set
+++ a/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
*/
-s64 __percpu_counter_sum(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc, int set)
{
s64 ret;
int cpu;
@@ -62,9 +62,11 @@ s64 __percpu_counter_sum(struct percpu_c
 for_each_online_cpu(cpu) {
 s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 ret += *pcount;
- *pcount = 0;
+ if (set)
+ *pcount = 0;
 }
- fbc->count = ret;
+ if (set)
+ fbc->count = ret;
 spin_unlock(&fbc->lock);
 return ret;
 }
_


From: Andrew Morton <ak...@linux-foundation.org>

Revert

commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e
Author: Mingming Cao <c...@us.ibm.com>
Date: Fri Jul 11 19:27:31 2008 -0400

percpu_counter: new function percpu_counter_sum_and_set


As described in

revert "percpu counter: clean up percpu_counter_sum_and_set()"

the new percpu_counter_sum_and_set() is racy against updates to the
cpu-local accumulators on other CPUs. Revert that change.

This means that ext4 will be slow again. But correct.

Reported-by: Eric Dumazet <da...@cosmosbay.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Mingming Cao <c...@us.ibm.com>
Cc: <linux...@vger.kernel.org>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
---

fs/ext4/balloc.c | 4 ++--
include/linux/percpu_counter.h | 12 +++---------
lib/percpu_counter.c | 7 +------
3 files changed, 6 insertions(+), 17 deletions(-)

diff -puN fs/ext4/balloc.c~revert-percpu_counter-new-function-percpu_counter_sum_and_set fs/ext4/balloc.c
--- a/fs/ext4/balloc.c~revert-percpu_counter-new-function-percpu_counter_sum_and_set
+++ a/fs/ext4/balloc.c
@@ -609,8 +609,8 @@ int ext4_has_free_blocks(struct ext4_sb_

if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
- free_blocks = percpu_counter_sum_and_set(fbc);
- dirty_blocks = percpu_counter_sum_and_set(dbc);
+ free_blocks = percpu_counter_sum_positive(fbc);
+ dirty_blocks = percpu_counter_sum_positive(dbc);
if (dirty_blocks < 0) {
printk(KERN_CRIT "Dirty block accounting "
"went wrong %lld\n",
diff -puN include/linux/percpu_counter.h~revert-percpu_counter-new-function-percpu_counter_sum_and_set include/linux/percpu_counter.h
--- a/include/linux/percpu_counter.h~revert-percpu_counter-new-function-percpu_counter_sum_and_set
+++ a/include/linux/percpu_counter.h
@@ -35,7 +35,7 @@ int percpu_counter_init_irq(struct percp
void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 __percpu_counter_sum(struct percpu_counter *fbc, int set);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);

static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
@@ -44,19 +44,13 @@ static inline void percpu_counter_add(st

static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
{
- s64 ret = __percpu_counter_sum(fbc, 0);
+ s64 ret = __percpu_counter_sum(fbc);
return ret < 0 ? 0 : ret;
}

-static inline s64 percpu_counter_sum_and_set(struct percpu_counter *fbc)
-{
- return __percpu_counter_sum(fbc, 1);
-}
-
-
static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
{
- return __percpu_counter_sum(fbc, 0);
+ return __percpu_counter_sum(fbc);
}

static inline s64 percpu_counter_read(struct percpu_counter *fbc)
diff -puN lib/percpu_counter.c~revert-percpu_counter-new-function-percpu_counter_sum_and_set lib/percpu_counter.c
--- a/lib/percpu_counter.c~revert-percpu_counter-new-function-percpu_counter_sum_and_set
+++ a/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
*/
-s64 __percpu_counter_sum(struct percpu_counter *fbc, int set)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
s64 ret;
int cpu;
@@ -62,12 +62,7 @@ s64 __percpu_counter_sum(struct percpu_c
 for_each_online_cpu(cpu) {
 s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 ret += *pcount;
- if (set)
- *pcount = 0;
 }
- if (set)
- fbc->count = ret;
-
 spin_unlock(&fbc->lock);
 return ret;
 }
_

Peter Zijlstra

Dec 7, 2008, 5:30:13 AM

Thing is, as Eric pointed out, the sum_and_set() thing is fundamentally
broken. It's impossible to _set a percpu_counter.

Eric Dumazet

Dec 7, 2008, 8:30:10 AM
Andrew Morton wrote:

> On Wed, 03 Dec 2008 21:24:36 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:
>
>> Eric Dumazet wrote:
>>
>> 1) __percpu_counter_sum() is buggy, it should not write
>> on per_cpu_ptr(fbc->counters, cpu), or another cpu
>> could get its changes lost.
>>
>> __percpu_counter_sum should be read only (const struct percpu_counter *fbc),
>> and no locking needed.
>
> No, we can't do this - it will break ext4.
>
> Take a closer look at 1f7c14c62ce63805f9574664a6c6de3633d4a354 and at
> e8ced39d5e8911c662d4d69a342b9d053eaaac4e.
>
> I suggest that what we do is to revert both those changes. We can
> worry about the possibly-unneeded spin_lock later, in a separate patch.
>
> It should have been a separate patch anyway. It's conceptually
> unrelated and is not a bugfix, but it was mixed in with a bugfix.
>
> Mingming, this needs urgent consideration, please. Note that I had to
> make additional changes to ext4 due to the subsequent introduction of
> the dirty_blocks counter.
>
>
> Please read the below changelogs carefully and check that I have got my
> head around this correctly - I may not have done.
>


Hum... e8ced39d5e8911c662d4d69a342b9d053eaaac4e is probably following
the wrong path, but I see the intent. Even in the 'nr_files' case, it could
help to reduce excessive calls to percpu_counter_sum()

What we can do is to use two s64 counters (only in SMP):

s64 reference_count
s64 shadow_count

One that is guaranteed to be touched with appropriate locking
in __percpu_counter_add()

Another one that might be changed by percpu_counter_sum(), without
any locking, acting as a shadow.

Thanks

[PATCH] percpu_counter: Fix __percpu_counter_sum()

commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e (percpu_counter:
new function percpu_counter_sum_and_set) was meant to make __percpu_counter_sum()
able to recompute the estimate of a percpu_counter value.

Problem is that we cannot write to other CPUs' counters without racing.

What we can do is use two s64 counters: one acting as a reference
that we should not change in __percpu_counter_sum(), and another one shadowing
the reference.

percpu_counter_read() reads the shadow.
percpu_counter_sum() reads the reference and recomputes the shadow.

If a given percpu_counter is never 'summed', then its shadow_count
is always equal to its reference.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
---

include/linux/percpu_counter.h | 9 +++++----
lib/percpu_counter.c | 27 +++++++++++++++++----------
2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 9007ccd..71b5c5d 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -17,7 +17,8 @@

struct percpu_counter {
spinlock_t lock;
- s64 count;
+ s64 reference_count;
+ s64 shadow_count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
@@ -55,7 +56,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
- return fbc->count;
+ return fbc->shadow_count;
 }

/*
@@ -65,9 +66,9 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);

- barrier(); /* Prevent reloads of fbc->count */
+ barrier(); /* Prevent reloads of fbc->shadow_count */
if (ret >= 0)
return ret;
return 1;
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index a866389..44ec857 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -14,6 +14,9 @@ static LIST_HEAD(percpu_counters);
static DEFINE_MUTEX(percpu_counters_lock);
#endif

+/*
+ * Note : This function is racy
+ */
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 int cpu;
@@ -23,7 +26,8 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 *pcount = 0;
 }
- fbc->count = amount;
+ fbc->reference_count = amount;
+ fbc->shadow_count = amount;
 spin_unlock(&fbc->lock);
 }
 EXPORT_SYMBOL(percpu_counter_set);
@@ -38,7 +42,8 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
count = *pcount + amount;
if (count >= batch || count <= -batch) {
spin_lock(&fbc->lock);
- fbc->count += count;
+ fbc->reference_count += count;
+ fbc->shadow_count += count;
*pcount = 0;
spin_unlock(&fbc->lock);
} else {
@@ -57,16 +62,16 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 s64 ret;
 int cpu;
 
- spin_lock(&fbc->lock);
- ret = fbc->count;
+ ret = fbc->reference_count;
 for_each_online_cpu(cpu) {
 s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 ret += *pcount;
- *pcount = 0;
 }
- fbc->count = ret;
-
- spin_unlock(&fbc->lock);
+ /*
+ * Update fbc->shadow_count so that percpu_counter_read()
+ * can have a better idea of this counter 'value'
+ */
+ fbc->shadow_count = ret;
 return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
@@ -76,7 +81,8 @@ static struct lock_class_key percpu_counter_irqsafe;
int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
{
spin_lock_init(&fbc->lock);
- fbc->count = amount;
+ fbc->shadow_count = amount;
+ fbc->reference_count = amount;
fbc->counters = alloc_percpu(s32);
if (!fbc->counters)
return -ENOMEM;
@@ -132,7 +138,8 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,

spin_lock_irqsave(&fbc->lock, flags);
pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
+ fbc->reference_count += *pcount;
+ fbc->shadow_count += *pcount;
*pcount = 0;
spin_unlock_irqrestore(&fbc->lock, flags);

Andrew Morton

Dec 7, 2008, 12:30:17 PM
On Sun, 07 Dec 2008 14:28:00 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:

> Andrew Morton wrote:


> > On Wed, 03 Dec 2008 21:24:36 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:
> >
> >> Eric Dumazet wrote:
> >>
> >> 1) __percpu_counter_sum() is buggy, it should not write
> >> on per_cpu_ptr(fbc->counters, cpu), or another cpu
> >> could get its changes lost.
> >>
> >> __percpu_counter_sum should be read only (const struct percpu_counter *fbc),
> >> and no locking needed.
> >
> > No, we can't do this - it will break ext4.
> >
> > Take a closer look at 1f7c14c62ce63805f9574664a6c6de3633d4a354 and at
> > e8ced39d5e8911c662d4d69a342b9d053eaaac4e.
> >
> > I suggest that what we do is to revert both those changes. We can
> > worry about the possibly-unneeded spin_lock later, in a separate patch.
> >
> > It should have been a separate patch anyway. It's conceptually
> > unrelated and is not a bugfix, but it was mixed in with a bugfix.
> >
> > Mingming, this needs urgent consideration, please. Note that I had to
> > make additional changes to ext4 due to the subsequent introduction of
> > the dirty_blocks counter.
> >
> >
> > Please read the below changelogs carefully and check that I have got my
> > head around this correctly - I may not have done.
> >
>
>
> Hum... e8ced39d5e8911c662d4d69a342b9d053eaaac4e is probably following
> the wrong path, but I see the intent. Even in the 'nr_files' case, it could
> help to reduce excessive calls to percpu_counter_sum()
>

We should fix this in 2.6.28 - right now percpu_counter_sum() is subtly
corrupting the counter's value.

I sent two revert patches which I hope to merge into 2.6.28. Could you
guys please read/review/maybe-test them?

They will make ext4 as slow as it was in 2.6.26, but presumably that's
not a catastrophe.

> What we can do is to use two s64 counters (only in SMP):

We can do lots of things in 2.6.29. Including just making ->counters
an array of atomic_t.

Eric Dumazet

Dec 7, 2008, 1:10:07 PM
Andrew Morton wrote:

Your revert patches have the same effect as my first patch: no writes
in percpu_counter_sum().

I am lost here Andrew...

Theodore Tso

Dec 7, 2008, 3:40:07 PM
On Sat, Dec 06, 2008 at 08:22:33PM -0800, Andrew Morton wrote:
>
> I suggest that what we do is to revert both those changes. We can
> worry about the possibly-unneeded spin_lock later, in a separate patch.
>
> It should have been a separate patch anyway. It's conceptually
> unrelated and is not a bugfix, but it was mixed in with a bugfix.
>
> Mingming, this needs urgent consideration, please. Note that I had to
> make additional changes to ext4 due to the subsequent introduction of
> the dirty_blocks counter.

I've looked at the two patches which you've queued in the -mm branch, and
they look correct to me.

The bugs fixed by these patches can potentially lead to filesystem
corruption, since we ultimately use these fields to set the superblock
values. This in my mind makes them stable candidates at the very
least, and if we weren't so late in the 2.6.28 cycle, I'd be strongly
tempted to push them to Linus as a bugfix before the merge window.

Andrew, any strong objections to me grabbing them for the ext4 tree?
Or would you rather carry them? I would prefer that they get pushed
to Linus as soon as the merge window opens, which is one reason why
I'd prefer to carry them, but we can do this either way.

- Ted

Eric Dumazet

Dec 7, 2008, 5:30:11 PM
Andrew Morton wrote:

> We can do lots of things in 2.6.29. Including just making ->counters
> an array of atomic_t.

I was playing with this idea this week (but using atomic_long_t),
and this is why I bothered you with all this stuff: cleanup patches
before the work for 2.6.29.

My idea was using atomic_long_t so that 64-bit arches don't need
a spinlock any more, and a new percpu_lcounter would be OK for
counters that don't need s64 but a plain long (nr_files, network
counters...)

struct percpu_lcounter {
        atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
        struct list_head list;  /* All percpu_counters are on a list */
#endif
        atomic_long_t *counters;
#endif
};

void __percpu_lcounter_add(struct percpu_lcounter *fbc, long amount, s32 batch)
{
        long count;
        atomic_long_t *pcount;

        pcount = per_cpu_ptr(fbc->counters, get_cpu());
        count = atomic_long_add_return(amount, pcount);
        if (unlikely(count >= batch || count <= -batch)) {
                atomic_long_add(count, &fbc->count);
                atomic_long_sub(count, pcount);
        }
        put_cpu();
}

long percpu_lcounter_sum(struct percpu_lcounter *fbc)
{
        long accumulator = 0, val;
        int cpu;

        for_each_online_cpu(cpu) {
                atomic_long_t *pcount = per_cpu_ptr(fbc->counters, cpu);

                val = atomic_long_xchg(pcount, 0);
                accumulator += val;
        }
        return atomic_long_add_return(accumulator, &fbc->count);
}

Then I found the following typo:

[PATCH] atomic: fix a typo in atomic_long_xchg()

atomic_long_xchg() is not correctly defined for 32-bit arches.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
---

diff --git a/include/asm-generic/atomic.h b/include/asm-generic/atomic.h
index 4ec0a29..7abdaa9 100644
--- a/include/asm-generic/atomic.h
+++ b/include/asm-generic/atomic.h
@@ -251,7 +251,7 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
#define atomic_long_cmpxchg(l, old, new) \
(atomic_cmpxchg((atomic_t *)(l), (old), (new)))
#define atomic_long_xchg(v, new) \
- (atomic_xchg((atomic_t *)(l), (new)))
+ (atomic_xchg((atomic_t *)(v), (new)))

#endif /* BITS_PER_LONG == 64 */

Andrew Morton

Dec 7, 2008, 11:50:06 PM
(cc stable)

On Sun, 7 Dec 2008 10:28:21 -0500 Theodore Tso <ty...@mit.edu> wrote:

> On Sat, Dec 06, 2008 at 08:22:33PM -0800, Andrew Morton wrote:
> >
> > I suggest that what we do is to revert both those changes. We can
> > worry about the possibly-unneeded spin_lock later, in a separate patch.
> >
> > It should have been a separate patch anyway. It's conceptually
> > unrelated and is not a bugfix, but it was mixed in with a bugfix.
> >
> > Mingming, this needs urgent consideration, please. Note that I had to
> > make additional changes to ext4 due to the subsequent introduction of
> > the dirty_blocks counter.
>
> I've looked the two patches which you've queued in the -mm branch, and
> they look correct to me.
>
> The bugs fixed by these patches can potentially lead to filesystem
> corruption, since we ultimately use these fields to set the superblock
> values. This in my mind makes them stable candidates at the very
> least, and if we weren't so late in the 2.6.28 cycle, I'd be strongly
> tempted to push them to Linus as a bugfix before the merge window.
>
> Andrew, any strong objections for me to grab them for the ext4 tree?
> Or would you rather carry them? I would prefer that they get pushed
> to Linus as soon as the merge window opens, which is one reason why
> I'd prefer carry them, but we can do this either way.
>

I'm planning on sending them off to Linus for 2.6.28 this week,
assuming nobody can think of a plausible reason to not do that.

Now I didn't look _very_ closely at the chronology, but I think that
revert-percpu-counter-clean-up-percpu_counter_sum_and_set.patch reverts
a post-2.6.27 change, and is not needed in stable.

revert-percpu_counter-new-function-percpu_counter_sum_and_set.patch
however reverts a pre-2.6.27 change, and should be merged into 2.6.27.
This patch reverts the addition and use of
percpu_counter_sum_and_set(), which is racy and can corrupt the
counters.

However
revert-percpu_counter-new-function-percpu_counter_sum_and_set.patch
won't apply to 2.6.27 because the dirty_blocks stuff was added and
generates rejects.

So if all the above is correct, I'd propose that if and when
revert-percpu_counter-new-function-percpu_counter_sum_and_set.patch
hits mainline, we should ask the -stable guys to directly revert

commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e
Author: Mingming Cao <c...@us.ibm.com>
Date: Fri Jul 11 19:27:31 2008 -0400

percpu_counter: new function percpu_counter_sum_and_set

which should be all that 2.6.27.x needs.

Agree? If so, can you please take care of getting that patch over to
sta...@kernel.org? (I added the cc:stable to the diff, so there's
probably nothing which you need to do..)

Andrew Morton

Dec 8, 2008, 12:00:14 AM

heh. Here's the problem...

The first patch which was added (pre-2.6.27) was "percpu_counter: new
function percpu_counter_sum_and_set". This added the broken-by-design
percpu_counter_sum_and_set() function, **and used it in ext4**.

Later, during 2.6.28 development came the "percpu counter: clean up
percpu_counter_sum_and_set()" which propagated the
percpu_counter_sum_and_set() brokenness into percpu_counter_sum() as
well.

If we were to now merge your simple dont-modify-the-percpu-counters fix
then this would break ext4, because of the **and used it in ext4**,
above.

You see, ext4 stopped using the accurate/slow percpu_counter_sum() and
switched to percpu_counter_sum_and_set() because this new function
increases the accuracy of percpu_counter_read() in other parts of ext4.

Also, e8ced39d5e8911c662d4d69a342b9d053eaaac4e ("percpu_counter: new
function percpu_counter_sum_and_set") replaced a call to
percpu_counter_sum_positive() with a call to
percpu_counter_sum_and_set(), but there's nothing which prevents
percpu_counter_sum_and_set() from returning negative values, afaict.

Mingming Cao

Dec 8, 2008, 12:50:13 PM
On Sat, 2008-12-06 at 20:22 -0800, Andrew Morton wrote:
> On Wed, 03 Dec 2008 21:24:36 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:
>
> > Eric Dumazet wrote:
> > > Hi Andrew
> > >
> > > While working on percpu_counter on net-next-2.6, I found
> > > a CPU unplug race in percpu_counter_destroy()
> > >
> > > (Very unlikely of course)
> > >
> > > Thank you
> > >
> > > [PATCH] percpu_counter: fix CPU unplug race in percpu_counter_destroy()
> > >
> > > We should first delete the counter from percpu_counters list
> > > before freeing memory, or a percpu_counter_hotcpu_callback()
> > > could dereference a NULL pointer.
> > >
> > > Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
> > > ---
> > > lib/percpu_counter.c | 4 ++--
> > > 1 files changed, 2 insertions(+), 2 deletions(-)
> > >
> >
> > Well, this percpu_counter stuff is simply not working at all.
> >
> > We added some percpu_counters to network tree for 2.6.29 and we get
> > drift bugs if calling __percpu_counter_sum() while some heavy duty
> > benches are running, on a 8 cpus machine
> >
> > 1) __percpu_counter_sum() is buggy, it should not write
> > on per_cpu_ptr(fbc->counters, cpu), or another cpu
> > could get its changes lost.
> >

Oh, you are right, I missed that, thanks for pointing this out.

>
> > __percpu_counter_sum should be read only (const struct percpu_counter *fbc),
> > and no locking needed.
>
> No, we can't do this - it will break ext4.
>

Yes, the need comes from ext4 delayed allocation, which needs a more
accurate free-blocks counter to prevent too-late ENOSPC issues. The
intention was to get percpu_counter_read_positive() to be more
accurate so that ext4 could avoid going to the slow path very often.

But I overlooked the race in updating the local counters. Sorry
about it!

> Take a closer look at 1f7c14c62ce63805f9574664a6c6de3633d4a354 and at
> e8ced39d5e8911c662d4d69a342b9d053eaaac4e.
>
> I suggest that what we do is to revert both those changes. We can
> worry about the possibly-unneeded spin_lock later, in a separate patch.
>
> It should have been a separate patch anyway. It's conceptually
> unrelated and is not a bugfix, but it was mixed in with a bugfix.
>
> Mingming, this needs urgent consideration, please. Note that I had to
> make additional changes to ext4 due to the subsequent introduction of
> the dirty_blocks counter.
>
>


>
> Please read the below changelogs carefully and check that I have got my
> head around this correctly - I may not have done.
>
> What a mess.
>
>

I looked at those two revert patches; they look correct to me.

Thanks a lot for taking care of the mess.

Mingming

Acked-by: Mingming Cao <c...@us.ibm.com>

Acked-by: Mingming Cao <c...@us.ibm.com>


Mingming Cao

Dec 8, 2008, 1:00:10 PM
On Sun, 2008-12-07 at 20:42 -0800, Andrew Morton wrote:

Agreed.

I checked 2.6.27.8; the above is correct, and
revert-percpu-counter-clean-up-percpu_counter_sum_and_set.patch is not
needed for the 2.6.27.x stable tree.

Thanks again.

Mingming


> commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e
> Author: Mingming Cao <c...@us.ibm.com>
> Date: Fri Jul 11 19:27:31 2008 -0400
>
> percpu_counter: new function percpu_counter_sum_and_set
>
> which should be all that 2.6.27.x needs.
>
> Agree? If so, can you please take care of getting that patch over to
> sta...@kernel.org? (I added the cc:stable to the diff, so there's
> probably nothing which you need to do..)

Theodore Tso

Dec 8, 2008, 5:20:08 PM
On Sun, Dec 07, 2008 at 08:52:50PM -0800, Andrew Morton wrote:
>
> The first patch which was added (pre-2.6.27) was "percpu_counter: new
> function percpu_counter_sum_and_set". This added the broken-by-design
> percpu_counter_sum_and_set() function, **and used it in ext4**.
>

Mea culpa, I was the one who reviewed Mingming's patch, and missed
this. Part of the problem was that percpu_counter.c isn't well
documented, and so I saw the spinlock but didn't realize it only
protected the reference counter, and not the per-cpu array. I should have
read through the code more thoroughly before approving the patch.

I suppose if we wanted we could add a rw spinlock which mediates
access to a "foreign" cpu counter (i.e., percpu_counter_add gets a
shared lock, and percpu_counter_set needs an exclusive lock) but it's
probably not worth it.
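
A rough illustration of that mediation scheme (purely hypothetical, with invented names, and note that Peter argues against it below): updates to a CPU's own slot would share the lock, while anything that touches other CPUs' slots would take it exclusively:

struct percpu_counter_rw {
        rwlock_t foreign_lock;
        s64 count;
        s32 *counters;          /* per-cpu, as from alloc_percpu(s32) */
};

static void pcrw_add(struct percpu_counter_rw *fbc, s32 amount)
{
        int cpu = get_cpu();

        /* shared: we only touch our own per-cpu slot */
        read_lock(&fbc->foreign_lock);
        *per_cpu_ptr(fbc->counters, cpu) += amount;
        read_unlock(&fbc->foreign_lock);
        put_cpu();
}

static void pcrw_set(struct percpu_counter_rw *fbc, s64 amount)
{
        int cpu;

        /* exclusive: we write every CPU's slot */
        write_lock(&fbc->foreign_lock);
        for_each_possible_cpu(cpu)
                *per_cpu_ptr(fbc->counters, cpu) = 0;
        fbc->count = amount;
        write_unlock(&fbc->foreign_lock);
}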

Actually, if all popular architectures had a hardware-implemented
atomic_t, I wonder how much ext4 really needs the percpu counter,
especially given ext4's multiblock allocator; with ext3, given that
each block allocation required taking a per-filesystem spin lock,
optimizing away that spinlock was far more important for improving
ext3's scalability. But with the multiblock allocator, it may be that
we're going through a lot more effort than is truly necessary.

- Ted

Andrew Morton

Dec 8, 2008, 5:30:09 PM
On Mon, 8 Dec 2008 17:12:41 -0500
Theodore Tso <ty...@mit.edu> wrote:

> Actually, if all popular architectures had a hardware-implemented
> atomic_t, I wonder how much ext4 really needs the percpu counter,
> especially given ext4's multiblock allocator; with ext3, given that
> each block allocation required taking a per-filesystem spin lock,
> optimizing away that spinlock was far more important for improving
> ext3's scalability. But with the multiblock allocator, it may that
> we're going through a lot more effort than what is truly necessary.

I expect that the performance numbers for the percpu counters in the
superblock are buried away in the historical git changelogs somewhere.
I don't recall how much difference it made.

An atomic_inc() of an fs-wide counter will have similar cost to
spin_lock() of an fs-wide lock.

If the multiblock allocator can avoid doing one atomic_inc() for each
block and can instead do atomic_add(large_value, &counter) then yes,
I'm sure that an fs-wide atomic_long_t would be OK.

Of course, similar changes should be made in truncate, etc.
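
A hypothetical illustration of that point (invented names, not actual ext4 code): the allocator charges a whole extent with a single fs-wide atomic operation instead of one per block:

struct myfs_sb_info {
        atomic_long_t s_free_blocks;    /* invented stand-in for the fs-wide counter */
};

/* One shared-cacheline RMW per multi-block allocation... */
static void myfs_charge_extent(struct myfs_sb_info *sbi, long nr_blocks)
{
        atomic_long_sub(nr_blocks, &sbi->s_free_blocks);
}

/* ...instead of the per-block pattern it would replace:
 *
 *      for (i = 0; i < nr_blocks; i++)
 *              atomic_long_dec(&sbi->s_free_blocks);
 */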

Peter Zijlstra

Dec 8, 2008, 5:30:11 PM
On Mon, 2008-12-08 at 17:12 -0500, Theodore Tso wrote:
> On Sun, Dec 07, 2008 at 08:52:50PM -0800, Andrew Morton wrote:
> >
> > The first patch which was added (pre-2.6.27) was "percpu_counter: new
> > function percpu_counter_sum_and_set". This added the broken-by-design
> > percpu_counter_sum_and_set() function, **and used it in ext4**.
> >
>
> Mea culpa, I was the one who reviewed Mingming's patch, and missed
> this. Part of the problem was that percpu_counter.c isn't well
> documented, and I so saw the spinlock, but didn't realize it only
> protected reference counter, and not the per-cpu array. I should have
> read through code more thoroughly before approving the patch.
>
> I suppose if we wanted we could add a rw spinlock which mediates
> access to a "foreign" cpu counter (i.e., percpu_counter_add gets a
> shared lock, and percpu_counter_set needs an exclusive lock) but it's
> probably not worth it.

rwlocks utterly suck and should be banished from the kernel - adding
one would destroy the whole purpose of the code.

> Actually, if all popular architectures had a hardware-implemented
> atomic_t, I wonder how much ext4 really needs the percpu counter,
> especially given ext4's multiblock allocator; with ext3, given that
> each block allocation required taking a per-filesystem spin lock,
> optimizing away that spinlock was far more important for improving
> ext3's scalability. But with the multiblock allocator, it may that
> we're going through a lot more effort than what is truly necessary.

atomic_t is pretty good on all archs, but you get to keep the cacheline
ping-pong.

Mingming Cao

Dec 8, 2008, 5:50:14 PM

On Mon, 2008-12-08 at 17:12 -0500, Theodore Tso wrote:

> On Sun, Dec 07, 2008 at 08:52:50PM -0800, Andrew Morton wrote:
> >
> > The first patch which was added (pre-2.6.27) was "percpu_counter: new
> > function percpu_counter_sum_and_set". This added the broken-by-design
> > percpu_counter_sum_and_set() function, **and used it in ext4**.
> >
>
> Mea culpa, I was the one who reviewed Mingming's patch, and missed
> this. Part of the problem was that percpu_counter.c isn't well
> documented, and I so saw the spinlock, but didn't realize it only
> protected reference counter, and not the per-cpu array. I should have
> read through code more thoroughly before approving the patch.
>
> I suppose if we wanted we could add a rw spinlock which mediates
> access to a "foreign" cpu counter (i.e., percpu_counter_add gets a
> shared lock, and percpu_counter_set needs an exclusive lock) but it's
> probably not worth it.
>
> Actually, if all popular architectures had a hardware-implemented
> atomic_t, I wonder how much ext4 really needs the percpu counter,
> especially given ext4's multiblock allocator;

Delayed allocation will make multiple-block allocation possible for
buffered IO.

However, we still need to check the percpu counter at write_begin() time
for every single possible block allocation (this is to make sure the fs is
not overbooked), unless write_begin() could cluster the write requests
and map multiple blocks in a single shot. So in reality, in ext4 the
free-blocks percpu_counter check and the s_dirty_blocks one (a percpu
counter too, for delayed blocks) only take 1 block at a time :(


Mingming

Theodore Tso

Dec 8, 2008, 6:10:07 PM
On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
>
> atomic_t is pretty good on all archs, but you get to keep the cacheline
> ping-pong.
>

Stupid question --- if you're worried about cacheline ping-pongs, why
aren't each CPU's delta counters cacheline-aligned? With a 64-byte
cache line and a 32-bit counters entry, with fewer than 16 CPUs we're
going to be getting cache ping-pong effects with percpu_counters,
right? Or am I missing something?

- Ted

Andrew Morton

Dec 8, 2008, 6:10:08 PM
On Mon, 8 Dec 2008 18:00:47 -0500
Theodore Tso <ty...@mit.edu> wrote:

> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
> >
> > atomic_t is pretty good on all archs, but you get to keep the cacheline
> > ping-pong.
> >
>
> Stupid question --- if you're worried about cacheline ping-pongs, why
> aren't each cpu's delta counter cacheline aligned?

They are allocated with alloc_percpu(), so each CPU's counter lives
in a per-cpu area. If you chase through seventeen layers of Rustyness
you end up at mm/allocpercpu.c:percpu_populate() which is where that
little s32 ends up getting allocated.

Peter Zijlstra

Dec 8, 2008, 6:10:07 PM
On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
> >
> > atomic_t is pretty good on all archs, but you get to keep the cacheline
> > ping-pong.
> >
>
> Stupid question --- if you're worried about cacheline ping-pongs, why
> aren't each cpu's delta counter cacheline aligned? With a 64-byte
> cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
> going to be getting cache ping-pong effects with percpu_counter's,
> right? Or am I missing something?

sorta - a new per-cpu allocator is in the works, but we do cacheline
align the per-cpu allocations (or used to), also, the allocations are
node affine.

Peter Zijlstra

Dec 8, 2008, 6:10:09 PM
On Tue, 2008-12-09 at 00:05 +0100, Peter Zijlstra wrote:
> On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
> > On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
> > >
> > > atomic_t is pretty good on all archs, but you get to keep the cacheline
> > > ping-pong.
> > >
> >
> > Stupid question --- if you're worried about cacheline ping-pongs, why
> > aren't each cpu's delta counter cacheline aligned? With a 64-byte
> > cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
> > going to be getting cache ping-pong effects with percpu_counter's,
> > right? Or am I missing something?
>
> sorta - a new per-cpu allocator is in the works, but we do cacheline
> align the per-cpu allocations (or used to), also, the allocations are
> node affine.

Indeed we still (or again) do, see mm/allocpercpu.c:percpu_populate().

Theodore Tso

Dec 8, 2008, 7:00:16 PM
On Mon, Dec 08, 2008 at 03:07:24PM -0800, Andrew Morton wrote:
> > Stupid question --- if you're worried about cacheline ping-pongs, why
> > aren't each cpu's delta counter cacheline aligned?
>
> They are allocated with alloc_percpu(), so each CPU's counter lives
> in a per-cpu area. If you chase through seventeen layers of Rustyness
> you end up at mm/allocpercpu.c:percpu_populate() which is where that
> little s32 ends up getting allocated.

Ah, OK. So the answer is "Ted was stupid and didn't understand n
layers of percpu abstractions".... whoever wrote that code seems to
have been a great believer of the saying, "if the code was hard to
*write*, it should be hard to *understand*". :-)

- Ted

Eric Dumazet

Dec 9, 2008, 3:20:09 AM
Peter Zijlstra wrote:

> On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
>> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
>>> atomic_t is pretty good on all archs, but you get to keep the cacheline
>>> ping-pong.
>>>
>> Stupid question --- if you're worried about cacheline ping-pongs, why
>> aren't each cpu's delta counter cacheline aligned? With a 64-byte
>> cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
>> going to be getting cache ping-pong effects with percpu_counter's,
>> right? Or am I missing something?
>
> sorta - a new per-cpu allocator is in the works, but we do cacheline
> align the per-cpu allocations (or used to), also, the allocations are
> node affine.
>

I did work on a 'lightweight percpu counter', aka percpu_lcounter, for
all metrics that don't need a full 64 bits, but just a plain 'long'
(network, nr_files, nr_dentry, nr_inodes, ...)

struct percpu_lcounter {
        atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
        struct list_head list;  /* All percpu_counters are on a list */
#endif
        long *counters;
#endif
};

(No more spinlock)

Then I tried to have atomic_t (or atomic_long_t) for 'counters', but got a
10% slowdown of __percpu_lcounter_add(), even when never hitting the 'slow
path'. atomic_long_add_return() is really expensive, even on a non-contended
cache line.

struct percpu_lcounter {
        atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
        struct list_head list;  /* All percpu_counters are on a list */
#endif
        atomic_long_t *counters;
#endif
};

So I believe the percpu_lcounter_sum() that tries to reset all CPU-local
counts to 0 would really be too expensive, if it slows down _add() so much.

long percpu_lcounter_sum(struct percpu_lcounter *fblc)
{
        long acc = 0;
        int cpu;

        for_each_online_cpu(cpu)
                acc += atomic_long_xchg(per_cpu_ptr(fblc->counters, cpu), 0);
        return atomic_long_add_return(acc, &fblc->count);
}

void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
{
        long count;
        atomic_long_t *pcount;

        pcount = per_cpu_ptr(flbc->counters, get_cpu());
        count = atomic_long_add_return(amount, pcount); /* way too expensive !!! */
        if (unlikely(count >= batch || count <= -batch)) {
                atomic_long_add(count, &flbc->count);
                atomic_long_sub(count, pcount);
        }
        put_cpu();
}

Just forget about it and let percpu_lcounter_sum() only read the values, and
let percpu_lcounter_add() not use atomic ops in the fast path.

void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
{
        long count;
        long *pcount;

        pcount = per_cpu_ptr(flbc->counters, get_cpu());
        count = *pcount + amount;
        if (unlikely(count >= batch || count <= -batch)) {
                atomic_long_add(count, &flbc->count);
                count = 0;
        }
        *pcount = count;
        put_cpu();
}
EXPORT_SYMBOL(__percpu_lcounter_add);


Also, with the upcoming NR_CPUS=4096, it may be time to design a hierarchical percpu_counter,
to avoid hitting one shared "fbc->count" every time a local counter overflows.
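
A rough sketch of what such a hierarchy could look like (purely illustrative; the per-node layer, the names and the batching policy are invented here, not something posted in this thread): per-CPU deltas spill into a per-node accumulator, and only large node totals touch the one shared count:

struct percpu_counter2 {
        atomic_long_t count;            /* global: written rarely */
        atomic_long_t *node_count;      /* one accumulator per NUMA node */
        s32 *counters;                  /* per-cpu deltas (alloc_percpu) */
};

void percpu_counter2_add(struct percpu_counter2 *fbc, long amount,
                         s32 batch, long node_batch)
{
        s32 *pcount = per_cpu_ptr(fbc->counters, get_cpu());
        long count = *pcount + amount;

        if (count >= batch || count <= -batch) {
                atomic_long_t *nc = &fbc->node_count[numa_node_id()];
                long ncount = atomic_long_add_return(count, nc);

                /* Spill to the shared counter only when the node total
                 * itself gets large; the matched sub/add pair keeps the
                 * overall sum exact even if two CPUs race here. */
                if (ncount >= node_batch || ncount <= -node_batch) {
                        atomic_long_sub(ncount, nc);
                        atomic_long_add(ncount, &fbc->count);
                }
                count = 0;
        }
        *pcount = count;
        put_cpu();
}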

Peter Zijlstra

Dec 9, 2008, 3:40:08 AM

Yeah, it's an extra LOCK insn where there wasn't one before.

> if (unlikely(count >= batch || count <= -batch)) {
> atomic_long_add(count, &flbc->count);
> atomic_long_sub(count, pcount);

Also, these are two LOCKs where, with the spinlock, you'd likely only
have 1.

So yes, having the per-cpu variable be an atomic seems like a way too
expensive idea. That xchg-based _sum is cool though.

> }
> put_cpu();
> }
>
> Just forget about it and let percpu_lcounter_sum() only read the values, and
> let percpu_lcounter_add() not using atomic ops in fast path.
>
> void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
> {
> long count;
> long *pcount;
>
> pcount = per_cpu_ptr(flbc->counters, get_cpu());
> count = *pcount + amount;
> if (unlikely(count >= batch || count <= -batch)) {
> atomic_long_add(count, &flbc->count);
> count = 0;
> }
> *pcount = count;
> put_cpu();
> }
> EXPORT_SYMBOL(__percpu_lcounter_add);
>
>
> Also, with upcoming NR_CPUS=4096, it may be time to design a hierarchical percpu_counter,
> to avoid hitting one shared "fbc->count" all the time a local counter overflows.

So we'd normally write to the shared cacheline every cpus/batch.
Cascading this you'd get ln(cpus)/(batch^ln(cpus)) or something like
that, right? Won't just increasing batch give the same result - or are
we going to play funny games with the topology information?

Eric Dumazet

Dec 10, 2008, 12:20:08 AM
Now percpu_counter_sum() is 'fixed', what about "percpu_counter_add()" ?

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        s64 count;
        s32 *pcount;
        int cpu = get_cpu();

        pcount = per_cpu_ptr(fbc->counters, cpu);
        count = *pcount + amount;
        if (count >= batch || count <= -batch) {
                spin_lock(&fbc->lock);
                fbc->count += count;
                *pcount = 0;
                spin_unlock(&fbc->lock);
        } else {
                *pcount = count;
        }
        put_cpu();
}


If I read this well, this is not IRQ safe.

get_cpu() only disables preemption IMHO

For nr_files, nr_dentry, nr_inodes, it should not be a problem.

But for network counters (only in net-next-2.6)
and lib/proportions.c, we have a problem?

Using local_t instead of s32 for the CPU-local
counter here is possible, so that the fast path doesn't have
to disable interrupts

(use a local_t instead of s32 for fbc->counters)

void __percpu_counter_add_irqsafe(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        long count;
        local_t *pcount;
        unsigned long flags;

        /* following code only matters on 32bit arches */
        if (sizeof(amount) != sizeof(local_t)) {
                if (unlikely(amount >= batch || amount <= -batch)) {
                        spin_lock_irqsave(&fbc->lock, flags);
                        fbc->count += amount;
                        spin_unlock_irqrestore(&fbc->lock, flags);
                        return;
                }
        }
        pcount = per_cpu_ptr(fbc->counters, get_cpu());
        count = local_add_return((long)amount, pcount);
        if (unlikely(count >= batch || count <= -batch)) {
                local_sub(count, pcount);
                spin_lock_irqsave(&fbc->lock, flags);
                fbc->count += count;
                spin_unlock_irqrestore(&fbc->lock, flags);
        }
        put_cpu();
}

Andrew Morton

Dec 10, 2008, 1:00:10 AM
On Wed, 10 Dec 2008 06:09:08 +0100 Eric Dumazet <da...@cosmosbay.com> wrote:

> Now percpu_counter_sum() is 'fixed', what about "percpu_counter_add()" ?
>
> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> s64 count;
> s32 *pcount;
> int cpu = get_cpu();
>
> pcount = per_cpu_ptr(fbc->counters, cpu);
> count = *pcount + amount;
> if (count >= batch || count <= -batch) {
> spin_lock(&fbc->lock);
> fbc->count += count;
> *pcount = 0;
> spin_unlock(&fbc->lock);
> } else {
> *pcount = count;
> }
> put_cpu();
> }
>
>
> If I read this well, this is not IRQ safe.

Sure. It's racy against interrupts on this cpu, it'll deadlock over
the non-irq-safe spinlock and lockdep will have a coronary over it.

> get_cpu() only disables preemption IMHO

yes

> For nr_files, nr_dentry, nr_inodes, it should not be a problem.

yes

> But for network counters (only in net-next-2.6)
> and lib/proportions.c, we have a problem ?

yes

> Using local_t instead of s32 for cpu
> local counter here is possible, so that fast path doesnt have
> to disable interrupts
>
> (use a local_t instead of s32 for fbc->counters)
>
> void __percpu_counter_add_irqsafe(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> long count;
> local_t *pcount;
>
> /* following code only matters on 32bit arches */
> if (sizeof(amount) != sizeof(local_t)) {
> if (unlikely(amount >= batch || amount <= -batch))) {
> spin_lock_irqsave(&fbc->lock, flags);
> fbc->count += amount;
> spin_unlock_irqrestore(&fbc->lock, flags);
> return;
> }
> }
> pcount = per_cpu_ptr(fbc->counters, get_cpu());
> count = local_add_return((long)amount, pcount);
> if (unlikely(count >= batch || count <= -batch)) {
> unsigned long flags;
>
> local_sub(count, pcount);
> spin_lock_irqsave(&fbc->lock, flags);
> fbc->count += count;
> spin_unlock_irqrestore(&fbc->lock, flags);
> }
> put_cpu();
> }


I think it's reasonable. If the batching is working as intended, the
increased cost of s/spin_lock/spin_lock_irqsave/ should be
insignificant.

In fact, if *at all* possible it would be best to make percpu_counters
irq-safe under all circumstances and avoid fattening and complicating the
interface.

But before adding more dependencies on local_t I do think we should
refresh ourselves on Christoph's objections to them - I remember
finding them fairly convincing at the time, but I don't recall the
details.

<searches for a long time>

Here, I think:
http://lkml.indiana.edu/hypermail/linux/kernel/0805.3/2482.html

Rusty, Christoph: talk to me. If we add a new user of local_t in core
kernel, will we regret it?

Thanks.

Peter Zijlstra

Dec 10, 2008, 2:20:08 AM
On Wed, 2008-12-10 at 06:09 +0100, Eric Dumazet wrote:
> Now percpu_counter_sum() is 'fixed', what about "percpu_counter_add()" ?
>
> void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> s64 count;
> s32 *pcount;
> int cpu = get_cpu();
>
> pcount = per_cpu_ptr(fbc->counters, cpu);
> count = *pcount + amount;
> if (count >= batch || count <= -batch) {
> spin_lock(&fbc->lock);
> fbc->count += count;
> *pcount = 0;
> spin_unlock(&fbc->lock);
> } else {
> *pcount = count;
> }
> put_cpu();
> }
>
>
> If I read this well, this is not IRQ safe.
>
> get_cpu() only disables preemption IMHO
>
> For nr_files, nr_dentry, nr_inodes, it should not be a problem.
>
> But for network counters (only in net-next-2.6)
> and lib/proportions.c, we have a problem ?

None of percpu_counter is irqsafe; for lib/proportions I disabled irqs by
hand when needed - I don't think we ought to bother with local_t, especially as
it basically sucks chunks on anything !x86.

Eric Dumazet

Dec 10, 2008, 6:00:19 PM
Andrew Morton wrote:

__percpu_counter_add() already disables preemption (calling get_cpu()).

But then, some (all but x86 ;) ) arches don't have a true local_t and we fall back
to plain atomic_long_t, and this is wrong because it would add a LOCKED
instruction in the fast path.

I remember Christoph added FAST_CMPXCHG_LOCAL, but there are no more uses of it
in the current tree.

I.e.: using local_t only if CONFIG_FAST_CMPXCHG_LOCAL, else something like:

void __percpu_counter_add_irqsafe(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        s64 count;
        s32 *pcount = per_cpu_ptr(fbc->counters, get_cpu());
        unsigned long flags;

        local_irq_save(flags);
        count = *pcount + amount;
        if (unlikely(count >= batch || count <= -batch)) {
                spin_lock(&fbc->lock);
                fbc->count += count;
                spin_unlock(&fbc->lock);
                count = 0;
        }
        *pcount = count;
        local_irq_restore(flags);
        put_cpu();
}
EXPORT_SYMBOL(__percpu_counter_add_irqsafe);

Greg KH

Dec 11, 2008, 12:50:05 PM

Thanks for letting me know, I'll not include it in the 2.6.27-stable
tree then.

greg k-h

Rusty Russell

Dec 12, 2008, 3:20:08 AM
On Thursday 11 December 2008 09:26:37 Eric Dumazet wrote:
> But then, some (all but x86 ;) ) arches dont have true local_t and we fallback
> to plain atomic_long_t, and this is wrong because it would add a LOCKED
> instruction in fast path.
>
> I remember Christoph added FAST_CMPXCHG_LOCAL, but no more uses of it in current
> tree.
>
> Ie : using local_t only if CONFIG_FAST_CMPXCHG_LOCAL, else something like :
>
> void __percpu_counter_add_irqsafe(struct percpu_counter *fbc, s64 amount, s32 batch)
> {
> s64 count;
> s32 *pcount = per_cpu_ptr(fbc->counters, get_cpu());
> unsigned long flags;
>
> local_irq_save(flags);
> count = *pcount + amount;

This is dumb though. If local_irq_save(), add, local_irq_restore() is faster
than atomic_long_add on some arch, *that* is what that arch's local_add()
should do!

Open-coding it like this is obviously wrong.

Now, arch local.h implementations need attention (x86-32 can be optimized today, for
example), but that's not directly related.

Hope that clarifies,
Rusty.
PS. Yes, I should produce a documentation patch and fix the x86 version.
Added to TODO list.
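
To make that concrete, a hypothetical arch-side fallback (assuming the asm-generic local_t layout, which wraps an atomic_long_t): if disabling interrupts around a plain add is the cheapest way to get a CPU-locally atomic add on some architecture, that belongs inside the arch's local_add(), and percpu_counter would simply call local_add():

/* hypothetical <asm/local.h> on an arch where this beats a LOCKed RMW */
static inline void local_add(long i, local_t *l)
{
        unsigned long flags;

        local_irq_save(flags);
        l->a.counter += i;
        local_irq_restore(flags);
}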

Eric Dumazet

Dec 12, 2008, 3:30:14 AM
Rusty Russell wrote:

> On Thursday 11 December 2008 09:26:37 Eric Dumazet wrote:
>> But then, some arches (all but x86 ;) ) don't have a true local_t, and we fall back
>> to plain atomic_long_t, which is wrong because it would add a LOCKED
>> instruction in the fast path.
>>
>> I remember Christoph added FAST_CMPXCHG_LOCAL, but there are no remaining uses
>> of it in the current tree.
>>
>> I.e.: use local_t only if CONFIG_FAST_CMPXCHG_LOCAL, else something like:
>>
>> void __percpu_counter_add_irqsafe(struct percpu_counter *fbc, s64 amount, s32 batch)
>> {
>> 	s64 count;
>> 	s32 *pcount = per_cpu_ptr(fbc->counters, get_cpu());
>> 	unsigned long flags;
>>
>> 	local_irq_save(flags);
>> 	count = *pcount + amount;
>
> This is dumb though. If local_irq_save(), add, local_irq_restore() is faster
> than atomic_long_add on some arch, *that* is what that arch's local_add()
> should do!
>
> Open coding it like this is obviously wrong.

Hmm... so you vote for using local_t instead of s32, then?

>
> Now, the arch local.h implementations need attention (x86-32 can be optimized
> today, for example), but that's not directly related.
>
> Hope that clarifies,
> Rusty.
> PS. Yes, I should produce a documentation patch and fix the x86 version.
> Added to TODO list.
>

Thanks

Eric Dumazet

unread,
Dec 12, 2008, 6:10:11 AM12/12/08
to
After the discussions on the percpu_counter subject, I cooked up the following patch.

My goals were:

- IRQ-safe percpu_counter (needed for net-next-2.6)
- 64bit platforms can avoid the spin_lock and reduce the size of percpu_counter
- no increase of the API

Final result: on x86_64, __percpu_counter_add() is really fast and irq safe:

0000000000000000 <__percpu_counter_add>:
0: 55 push %rbp
1: 48 89 d1 mov %rdx,%rcx
4: 48 8b 47 18 mov 0x18(%rdi),%rax
8: 48 89 e5 mov %rsp,%rbp
b: 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
12: 00
13: 48 f7 d0 not %rax
16: 89 d2 mov %edx,%edx
18: 48 8b 14 d0 mov (%rax,%rdx,8),%rdx
1c: 48 89 f0 mov %rsi,%rax
1f: 48 0f c1 02 xadd %rax,(%rdx)
23: 48 01 f0 add %rsi,%rax
26: 48 39 c1 cmp %rax,%rcx
29: 7e 0d jle 38 <__percpu_counter_add+0x38>
2b: 48 f7 d9 neg %rcx
2e: 48 39 c1 cmp %rax,%rcx
31: 7d 05 jge 38 <__percpu_counter_add+0x38>
33: c9 leaveq
34: c3 retq
35: 0f 1f 00 nopl (%rax)
38: 48 29 02 sub %rax,(%rdx)
3b: f0 48 01 07 lock add %rax,(%rdi)
3f: c9 leaveq
40: c3 retq

The changes are:

We use local_t instead of s32 for the per-cpu local storage.

We use atomic_long_t instead of s64 on 64bit arches, to avoid the spin_lock.

On 32bit arches, we guard the shared s64 value with an irqsafe spin_lock.
As this spin_lock is not taken in the fast path, it should not make a real
difference.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>
---

include/linux/percpu_counter.h | 38 +++++++++--
lib/percpu_counter.c | 104 ++++++++++++++++++-------------
2 files changed, 94 insertions(+), 48 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 9007ccd..f5133ce 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -12,16 +12,42 @@
#include <linux/threads.h>
#include <linux/percpu.h>
#include <linux/types.h>
+#include <asm/local.h>

#ifdef CONFIG_SMP

+#ifdef CONFIG_64BIT
+struct s64_counter {
+ atomic_long_t val;
+};
+
+static inline s64 s64c_read(struct s64_counter *c)
+{
+ return atomic_long_read(&c->val);
+}
+#else
+struct s64_counter {
+ spinlock_t lock;
+ s64 val;
+};
+
+static inline s64 s64c_read(struct s64_counter *c)
+{
+ /*
+ * Previous percpu_counter implementation used to
+ * read s64 without locking. That's racy.
+ */
+ return c->val;
+}
+
+#endif
+
struct percpu_counter {
- spinlock_t lock;
- s64 count;
+ struct s64_counter counter;


#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif

- s32 *counters;
+ local_t *counters;
};

#if NR_CPUS >= 16
@@ -34,7 +60,7 @@ int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);


void percpu_counter_destroy(struct percpu_counter *fbc);
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);

-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch);


s64 __percpu_counter_sum(struct percpu_counter *fbc);

static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)

@@ -55,7 +81,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)



static inline s64 percpu_counter_read(struct percpu_counter *fbc)

{
- return fbc->count;
+ return s64c_read(&fbc->counter);
}

/*
@@ -65,7 +91,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
*/
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
- s64 ret = fbc->count;
+ s64 ret = percpu_counter_read(fbc);

barrier(); /* Prevent reloads of fbc->count */
if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index b255b93..6ef4a44 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -14,35 +14,58 @@ static LIST_HEAD(percpu_counters);
static DEFINE_MUTEX(percpu_counters_lock);
#endif

+#ifdef CONFIG_64BIT
+static inline void s64c_add(s64 amount, struct s64_counter *c)
+{
+ atomic_long_add(amount, &c->val);
+}
+
+static inline void s64c_set(struct s64_counter *c, s64 amount)
+{
+ atomic_long_set(&c->val, amount);
+}
+
+#else
+
+static inline void s64c_add(s64 amount, struct s64_counter *c)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&c->lock, flags);
+ c->val += amount;
+ spin_unlock_irqrestore(&c->lock, flags);
+}
+
+static inline void s64c_set(struct s64_counter *c, s64 amount)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&c->lock, flags);
+ c->val = amount;
+ spin_unlock_irqrestore(&c->lock, flags);
+}
+#endif /* CONFIG_64BIT */
+


void percpu_counter_set(struct percpu_counter *fbc, s64 amount)

{
int cpu;

- spin_lock(&fbc->lock);
- for_each_possible_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);


- *pcount = 0;
- }

- fbc->count = amount;
- spin_unlock(&fbc->lock);
+ for_each_possible_cpu(cpu)
+ local_set(per_cpu_ptr(fbc->counters, cpu), 0);
+ s64c_set(&fbc->counter, amount);
}
EXPORT_SYMBOL(percpu_counter_set);

-void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch)
{
- s64 count;
- s32 *pcount;
- int cpu = get_cpu();
-
- pcount = per_cpu_ptr(fbc->counters, cpu);
- count = *pcount + amount;
- if (count >= batch || count <= -batch) {
- spin_lock(&fbc->lock);
- fbc->count += count;
- *pcount = 0;
- spin_unlock(&fbc->lock);
- } else {
- *pcount = count;
+ long count;
+ local_t *pcount;
+
+ pcount = per_cpu_ptr(fbc->counters, get_cpu());
+ count = local_add_return(amount, pcount);
+ if (unlikely(count >= batch || count <= -batch)) {
+ local_sub(count, pcount);
+ s64c_add(count, &fbc->counter);
}
put_cpu();
}
@@ -57,24 +80,21 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
s64 ret;
int cpu;

- spin_lock(&fbc->lock);
- ret = fbc->count;
- for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
- ret += *pcount;
- }
- spin_unlock(&fbc->lock);
+ ret = s64c_read(&fbc->counter);
+ for_each_online_cpu(cpu)
+ ret += local_read(per_cpu_ptr(fbc->counters, cpu));
return ret;
}
EXPORT_SYMBOL(__percpu_counter_sum);

-static struct lock_class_key percpu_counter_irqsafe;

int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
{
- spin_lock_init(&fbc->lock);
- fbc->count = amount;
- fbc->counters = alloc_percpu(s32);
+#ifndef CONFIG_64BIT
+ spin_lock_init(&fbc->counter.lock);
+#endif
+ s64c_set(&fbc->counter, amount);
+ fbc->counters = alloc_percpu(local_t);
if (!fbc->counters)
return -ENOMEM;
#ifdef CONFIG_HOTPLUG_CPU
@@ -91,8 +111,13 @@ int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
int err;

err = percpu_counter_init(fbc, amount);
- if (!err)
- lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
+#ifndef CONFIG_64BIT
+ if (!err) {
+ static struct lock_class_key percpu_counter_irqsafe;
+
+ lockdep_set_class(&fbc->counter.lock, &percpu_counter_irqsafe);
+ }
+#endif
return err;
}

@@ -124,14 +149,9 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
cpu = (unsigned long)hcpu;
mutex_lock(&percpu_counters_lock);
list_for_each_entry(fbc, &percpu_counters, list) {
- s32 *pcount;
- unsigned long flags;
-
- spin_lock_irqsave(&fbc->lock, flags);
- pcount = per_cpu_ptr(fbc->counters, cpu);
- fbc->count += *pcount;
- *pcount = 0;
- spin_unlock_irqrestore(&fbc->lock, flags);
+ long count = local_xchg(per_cpu_ptr(fbc->counters, cpu), 0);
+
+ s64c_add(count, &fbc->counter);
}
mutex_unlock(&percpu_counters_lock);
return NOTIFY_OK;

Peter Zijlstra

unread,
Dec 12, 2008, 6:30:13 AM12/12/08
to
On Fri, 2008-12-12 at 12:08 +0100, Eric Dumazet wrote:
> After the discussions on the percpu_counter subject, I cooked up the following patch.
>
> My goals were:
>
> - IRQ-safe percpu_counter (needed for net-next-2.6)
> - 64bit platforms can avoid the spin_lock and reduce the size of percpu_counter
> - no increase of the API
>
> Final result: on x86_64, __percpu_counter_add() is really fast and irq safe:

> The changes are:
>
> We use local_t instead of s32 for the per-cpu local storage.

do enough arches have a sane enough local_t implementation so this
doesn't make things worse for them?

> We use atomic_long_t instead of s64 on 64bit arches, to avoid the spin_lock.
>
> On 32bit arches, we guard the shared s64 value with an irqsafe spin_lock.
> As this spin_lock is not taken in the fast path, it should not make a real
> difference.

Cycles are cycles, and spin_lock_irqsave is more expensive than spin_lock_irq,
which is more expensive than spin_lock, but sure, it looks good.

I really like the code; my worry, however, is that we don't regress weird
archs too much.

Does this comment have any value beyond the archaeological? But yeah, that
was a known issue; there were some seqlock patches floating around
trying to address this.

Here I'd suggest taking that lock and fixing that race.

> + return c->val;
> +}
> +
> +#endif
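
Taking the lock in the 32bit read path, as suggested above, would look roughly
like the sketch below (written against the s64_counter helpers introduced by
this patch; it is not code from the thread):

static inline s64 s64c_read(struct s64_counter *c)
{
	unsigned long flags;
	s64 val;

	/* Serialise against s64c_add()/s64c_set() so that both halves of
	 * the 64bit value are read consistently on 32bit. */
	spin_lock_irqsave(&c->lock, flags);
	val = c->val;
	spin_unlock_irqrestore(&c->lock, flags);
	return val;
}

Whether percpu_counter_read() can afford that locking on 32bit is exactly the
trade-off being discussed here.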

> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c

Since they're inlines anyway, does it look better to stick them in the
header along with s64c_read()?

> void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
> {
> int cpu;
>

> + for_each_possible_cpu(cpu)
> + local_set(per_cpu_ptr(fbc->counters, cpu), 0);
> + s64c_set(&fbc->counter, amount);
> }

Did we document somewhere that this function is racy and only meant as
initialization?
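
One way to settle that would be a comment on the function itself; the wording
below is only a suggestion and is not part of any posted patch:

/*
 * percpu_counter_set() - set the counter to an absolute value.
 *
 * Note: this is inherently racy against concurrent __percpu_counter_add()
 * on other CPUs.  It is only meaningful at initialisation time, or when
 * the caller otherwise guarantees that nobody else is updating the counter.
 */
void percpu_counter_set(struct percpu_counter *fbc, s64 amount);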

> +void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, long batch)

> {


> + long count;
> + local_t *pcount;
> +
> + pcount = per_cpu_ptr(fbc->counters, get_cpu());
> + count = local_add_return(amount, pcount);
> + if (unlikely(count >= batch || count <= -batch)) {
> + local_sub(count, pcount);
> + s64c_add(count, &fbc->counter);
> }
> put_cpu();
> }

very neat.


> @@ -91,8 +111,13 @@ int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
> int err;
>
> err = percpu_counter_init(fbc, amount);
> - if (!err)
> - lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
> +#ifndef CONFIG_64BIT
> + if (!err) {
> + static struct lock_class_key percpu_counter_irqsafe;
> +
> + lockdep_set_class(&fbc->counter.lock, &percpu_counter_irqsafe);
> + }
> +#endif

Since they're all irqsafe, can this be removed?

> return err;

Rusty Russell

unread,
Dec 15, 2008, 8:00:09 AM12/15/08
to
On Wednesday 10 December 2008 16:19:21 Andrew Morton wrote:
> Rusty, Christoph: talk to me. If we add a new user of local_t in core
> kernel, will we regret it?

Interesting. There's new local_t infrastructure, but it's not used.

Here's a benchmark patch showing some results. x86 is pretty close to
optimal already, though my results on Power show atomic_long is a bad choice
there.

I'll do an audit of the users, then send out some local_t cleanup patches etc.

Benchmarks for local_t variants

(This patch also fixes the x86 cpu_local_* macros, which are obviously
unused).

I chose a large array (1M longs) for the inc/add/add_return tests so
the trivalue case would show some cache pressure.

The cpu_local_inc case is always cache-hot, so it's not comparable to
the others.

Time in ns per iteration (bracketed values are with CONFIG_PREEMPT=y):

              inc   add   add_return   cpu_local_inc   read
x86-32: 2.13 GHz Core Duo 2
atomic_long   118   118   115           17              17
irqsave/rest   77    78    77           23              16
trivalue       45    45   127            3(6)           21
local_t        36    36    36            1(5)           17

x86-64: 2.6 GHz Dual-Core AMD Opteron 2218
atomic_long    55    60     -            6              19
irqsave/rest   54    54     -           11              19
trivalue       47    47     -            5              28
local_t        47    46     -            1              19

Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
---
arch/x86/include/asm/local.h | 20 ++--
init/main.c | 198 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 208 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -220,16 +220,16 @@ static inline long local_sub_return(long
preempt_enable(); \
}) \

-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
+#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
+#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
+#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
+#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
+#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
+#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))

-#define __cpu_local_inc(l) cpu_local_inc((l))
-#define __cpu_local_dec(l) cpu_local_dec((l))
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
+#define __cpu_local_inc(l) cpu_local_inc(l)
+#define __cpu_local_dec(l) cpu_local_dec(l)
+#define __cpu_local_add(i, l) cpu_local_add((i), l)
+#define __cpu_local_sub(i, l) cpu_local_sub((i), l)

#endif /* _ASM_X86_LOCAL_H */
diff --git a/init/main.c b/init/main.c
--- a/init/main.c
+++ b/init/main.c
@@ -539,6 +539,200 @@ void __init __weak thread_info_cache_ini
{
}

+/* There are three obvious ways to implement local_t on an arch which
+ * can't do single-instruction inc/dec etc.
+ * 1) atomic_long
+ * 2) irq_save/irq_restore
+ * 3) multiple counters.
+ *
+ * This does a very rough benchmark on each one.
+ */
+struct local1 {
+ atomic_long_t v;
+};
+struct local2 {
+ unsigned long v;
+};
+struct local3 {
+ unsigned long v[3];
+};
+
+/* Enough to put some pressure on the caches. */
+#define NUM_LOCAL_TEST (1024*1024)
+#define NUM_LOCAL_RUNS (NUM_LOCAL_TEST*32)
+/* This will make it jump around looking random */
+#define STRIDE 514001
+
+static void *test_local_variants_mem;
+
+static void init_test_local_variants(void)
+{
+ unsigned long size;
+ size = max(sizeof(struct local1),
+ max(sizeof(struct local2),
+ max(sizeof(struct local3), sizeof(local_t))))
+ * NUM_LOCAL_TEST;
+ /* Assume this works in early boot. */
+ test_local_variants_mem = alloc_bootmem_nopanic(size);
+
+ if (!test_local_variants_mem) {
+ printk("test_local_variants: failed to allocate %lu bytes\n",
+ size);
+ return;
+ }
+}
+
+static void print_result(const char *str,
+ struct timespec start, struct timespec end)
+{
+ s64 diff;
+
+ diff = ktime_to_ns(ktime_sub(timespec_to_ktime(end), timespec_to_ktime(start)));
+ printk("%s=%lli/%lli ",
+ str, diff, diff/NUM_LOCAL_RUNS);
+}
+
+static unsigned int warm_local_test_cache(const void *mem, size_t len)
+{
+ unsigned int i, total = 0;
+ for (i = 0; i < len; i++)
+ total += ((char *)mem)[i];
+ return total;
+}
+
+#define TEST_LOOP(expr) \
+ n = 0; \
+ getnstimeofday(&start); \
+ for (i = 0; i < NUM_LOCAL_RUNS; i++) { \
+ expr; \
+ n += STRIDE; \
+ n %= NUM_LOCAL_TEST; \
+ } \
+ getnstimeofday(&end);
+
+/* This doesn't test cache effects at all */
+#define NUM_PERCPU_VARS 16
+DEFINE_PER_CPU(struct local1[NUM_PERCPU_VARS], local1_test);
+DEFINE_PER_CPU(struct local2[NUM_PERCPU_VARS], local2_test);
+DEFINE_PER_CPU(struct local3[NUM_PERCPU_VARS], local3_test);
+DEFINE_PER_CPU(local_t[NUM_PERCPU_VARS], local4_test);
+
+static void test_local_variants(void)
+{
+ struct timespec start, end;
+ unsigned int i, n;
+ unsigned long total, warm_total = 0;
+ struct local1 *l1;
+ struct local2 *l2;
+ struct local3 *l3;
+ local_t *l4;
+
+ if (!test_local_variants_mem)
+ return;
+
+ printk("Running local_t variant benchmarks\n");
+ l1 = test_local_variants_mem;
+ l2 = test_local_variants_mem;
+ l3 = test_local_variants_mem;
+ l4 = test_local_variants_mem;
+
+ printk("atomic_long: ");
+ memset(l1, 0, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_inc(&l1[n].v));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_add(1234, &l1[n].v));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ TEST_LOOP(atomic_long_inc(&__get_cpu_var(local1_test)[n%NUM_PERCPU_VARS].v));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l1, sizeof(*l1)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += atomic_long_read(&l1[n].v));
+ print_result("local_read", start, end);
+
+ printk("(total was %lu)\n", total);
+
+ printk("irqsave/restore: ");
+ memset(l2, 0, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l2[n].v++;
+ local_irq_restore(flags));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ l2[n].v += 1234;
+ local_irq_restore(flags));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned long flags;
+ local_irq_save(flags);
+ __get_cpu_var(local2_test)[n%NUM_PERCPU_VARS].v++;
+ local_irq_restore(flags));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l2, sizeof(*l2)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += l2[n].v);
+ print_result("local_read", start, end);
+ printk("(total was %lu)\n", total);
+
+ printk("trivalue: ");
+ memset(l3, 0, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ l3[n].v[idx]++);
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ l3[n].v[idx] += 1234);
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ TEST_LOOP(unsigned int idx
+ = !(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) +
+ !(preempt_count() & HARDIRQ_MASK);
+ get_cpu_var(local3_test)[n%NUM_PERCPU_VARS].v[idx]++;
+ put_cpu_var());
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l3, sizeof(*l3)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += l3[n].v[0] + l3[n].v[1] + l3[n].v[2]);
+ print_result("local_read", start, end);
+ printk("(total was %lu)\n", total);
+
+ printk("local_t: ");
+ memset(l4, 0, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(local_inc(&l4[n]));
+ print_result("local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(local_add(1234, &l4[n]));
+ print_result("local_add", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ TEST_LOOP(cpu_local_inc(local4_test[n%NUM_PERCPU_VARS]));
+ print_result("cpu_local_inc", start, end);
+
+ warm_total += warm_local_test_cache(l4, sizeof(*l4)*NUM_LOCAL_TEST);
+ total = 0;
+ TEST_LOOP(total += local_read(&l4[n]));
+ print_result("local_read", start, end);
+ printk("(total was %lu, warm_total %lu)\n", total, warm_total);
+}
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -635,6 +829,8 @@ asmlinkage void __init start_kernel(void
*/
locking_selftest();

+ init_test_local_variants();
+
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
@@ -692,6 +888,8 @@ asmlinkage void __init start_kernel(void
acpi_early_init(); /* before LAPIC and SMP init */

ftrace_init();
+
+ test_local_variants();

/* Do the rest non-__init'ed, we're now alive */
rest_init();

Ingo Molnar

unread,
Dec 16, 2008, 3:20:12 PM12/16/08
to

coolness!

>
> Signed-off-by: Rusty Russell <ru...@rustcorp.com.au>
> ---
> arch/x86/include/asm/local.h | 20 ++--
> init/main.c | 198 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 208 insertions(+), 10 deletions(-)

it's a gem - but init/main.c is an arguably pretty sucky place for it.
Stick it somewhere in lib/*, to be moved into testing/* later on?

Ingo

Peter Zijlstra

unread,
Dec 23, 2008, 6:50:07 AM12/23/08
to
One thing that sprang to mind...

IFF we're going to implement percpu_counter with local_t and make
local_t this funny tri-counter thing which has its own error, you need
to fix up bdi_stat_error() - it is expected to be an upper bound on the
counter error, and getting that wrong _will_ cause deadlocks.
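
The bound in question comes straight from the batching: each CPU holds at most
one batch worth of un-folded delta, so the global value can be off by at most
nr_cpu_ids * batch. From memory, backing-dev.h encodes that assumption roughly
as follows (treat this as an approximation rather than a verbatim quote):

static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
{
#ifdef CONFIG_SMP
	/* Each CPU's local delta stays within +/- BDI_STAT_BATCH before it
	 * is folded into the global count, so this bounds the total drift.
	 * A tri-value local_t with its own per-context error would need a
	 * larger bound here. */
	return nr_cpu_ids * BDI_STAT_BATCH;
#else
	return 1;
#endif
}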

Rusty Russell

unread,
Dec 25, 2008, 8:30:12 AM12/25/08
to
On Tuesday 23 December 2008 22:13:21 Peter Zijlstra wrote:
> One thing that sprang to mind...
>
> IFF we're going to implement percpu_counter with local_t and make
> local_t this funny tri-counter thing which has its own error, you need
> to fix up bdi_stat_error() - it is expected to be an upper bound on the
> counter error, and getting that wrong _will_ cause deadlocks.

It's OK: local_add_return() can't have any error; that's why the trivial
implementation disables interrupts around it.

Hope that helps,
Rusty.
