====================
Subject: [PATCH] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime caching
impact: small performance improvement
cpuacct_update_stats() is called on every tick update and uses a percpu_counter
to avoid performance degradation.
Unfortunately, this doesn't work properly when VIRT_CPU_ACCOUNTING=y: with that
option every tick adds much more than 1000 cputime, so every percpu_counter_add()
grabs the spinlock and updates the non-percpu variable.
This patch changes the batch rule. Now each cpu can cache
"percpu_counter_batch * jiffies_to_cputime(1)" cputime in its percpu counter.
In other words, this patch does not change behavior when VIRT_CPU_ACCOUNTING=n, but
it also works well when VIRT_CPU_ACCOUNTING=y.
Cc: Bharata B Rao <bha...@linux.vnet.ibm.com>
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-04-28 14:18:36.000000000 +0900
+++ b/kernel/sched.c 2009-04-28 15:18:07.000000000 +0900
@@ -10117,6 +10117,7 @@ struct cpuacct {
};
struct cgroup_subsys cpuacct_subsys;
+static s32 cpuacct_batch;
/* return cpu accounting group corresponding to this container */
static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
@@ -10146,6 +10147,9 @@ static struct cgroup_subsys_state *cpuac
if (!ca->cpuusage)
goto out_free_ca;
+ if (!cpuacct_batch)
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
if (percpu_counter_init(&ca->cpustat[i], 0))
goto out_free_counters;
@@ -10342,7 +10346,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
--
Correct.
>
> If my above reading of the problem is correct, please look at my below comments.
>
> > Index: b/kernel/sched.c
> > ===================================================================
> > --- a/kernel/sched.c 2009-04-28 14:18:36.000000000 +0900
> > +++ b/kernel/sched.c 2009-04-28 15:18:07.000000000 +0900
> > @@ -10117,6 +10117,7 @@ struct cpuacct {
> > };
> >
> > struct cgroup_subsys cpuacct_subsys;
> > +static s32 cpuacct_batch;
> >
> > /* return cpu accounting group corresponding to this container */
> > static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> > @@ -10146,6 +10147,9 @@ static struct cgroup_subsys_state *cpuac
> > if (!ca->cpuusage)
> > goto out_free_ca;
> >
> > + if (!cpuacct_batch)
> > + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
>
> You essentially end up increasing the batch value from the default value
> of max(32, nr_cpus*2).
>
> > +
> > for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> > if (percpu_counter_init(&ca->cpustat[i], 0))
> > goto out_free_counters;
> > @@ -10342,7 +10346,7 @@ static void cpuacct_update_stats(struct
> > ca = task_ca(tsk);
> >
> > do {
> > - percpu_counter_add(&ca->cpustat[idx], val);
> > + __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
>
> And you do this unconditionally which will affect all archs ? So you make
> this behaviour default for archs which have VIRT_CPU_ACCOUNTING=n.
If VIRT_CPU_ACCOUNTING=n, jiffies_to_cputime(n) returns n;
IOW, jiffies_to_cputime() doesn't change the value,
and percpu_counter_add() keeps using percpu_counter_batch as the batch.
So with VIRT_CPU_ACCOUNTING=n this patch doesn't change behavior.
> BTW, did you observe any real problem with the percpu counter spinlock ?
No.
I reviewed the percpu_counter callers recently, and this one looked like strange usage.
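For reference, the add path being discussed looks roughly like the sketch below, a simplified rendering of __percpu_counter_add() from lib/percpu_counter.c of about this vintage rather than a verbatim copy. It shows why a per-call val that already reaches the batch takes fbc->lock every time, and why scaling the batch by jiffies_to_cputime() restores the per-cpu fast path:
--------------------------------------------------------
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	s32 *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* slow path: fold the per-cpu delta into the shared count */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		/* fast path: stay within the per-cpu cache */
		*pcount = count;
	}
	put_cpu();
}
--------------------------------------------------------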
Let me try to understand what you are saying...
For archs which define VIRT_CPU_ACCOUNTING, every tick would result
in around 1000 units of cputime updates, and since this is much, much greater
than percpu_counter_batch, we end up taking the spinlock on every tick.
If my above reading of the problem is correct, please look at my below comments.
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c 2009-04-28 14:18:36.000000000 +0900
> +++ b/kernel/sched.c 2009-04-28 15:18:07.000000000 +0900
> @@ -10117,6 +10117,7 @@ struct cpuacct {
> };
>
> struct cgroup_subsys cpuacct_subsys;
> +static s32 cpuacct_batch;
>
> /* return cpu accounting group corresponding to this container */
> static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> @@ -10146,6 +10147,9 @@ static struct cgroup_subsys_state *cpuac
> if (!ca->cpuusage)
> goto out_free_ca;
>
> + if (!cpuacct_batch)
> + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
You essentially end up increasing the batch value from the default value
of max(32, nr_cpus*2).
> +
> for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> if (percpu_counter_init(&ca->cpustat[i], 0))
> goto out_free_counters;
> @@ -10342,7 +10346,7 @@ static void cpuacct_update_stats(struct
> ca = task_ca(tsk);
>
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
And you do this unconditionally which will affect all archs ? So you make
this behaviour default for archs which have VIRT_CPU_ACCOUNTING=n.
BTW, did you observe any real problem with the percpu counter spinlock ?
Regards,
Bharata.
I should have phrased the question better ...
So have you found any performance degradation with any benchmarks/workloads
on archs which define VIRT_CPU_ACCOUNTING due to percpu_counter spinlock
being taken on every tick ? If the answer is no, don't you think we could
wait before making the kind of change you are proposing ?
Regards,
Bharata.
Maybe I don't understand your point.
I don't oppose you waiting for something :)
My point is, let us not make this change if it is not a real problem that
has been observed on archs which define VIRT_CPU_ACCOUNTING.
Regards,
Bharata.
I expect cpuacct_batch to be a large number
> for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> if (percpu_counter_init(&ca->cpustat[i], 0))
> goto out_free_counters;
> @@ -10342,7 +10346,7 @@ static void cpuacct_update_stats(struct
> ca = task_ca(tsk);
>
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
This will make the end result quite far off the real value due to the large
per-cpu batch value. If we are going to go this route, we should
probably consider using __percpu_counter_sum so that the reported data
is not way off.
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
>
>
--
Balbir
Thanks for reviewing.
> > /* return cpu accounting group corresponding to this container */
> > static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> > @@ -10146,6 +10147,9 @@ static struct cgroup_subsys_state *cpuac
> > if (!ca->cpuusage)
> > goto out_free_ca;
> >
> > + if (!cpuacct_batch)
> > + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > +
>
> I expect cpuacct_batch to be a large number
Yes.
>
> > for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> > if (percpu_counter_init(&ca->cpustat[i], 0))
> > goto out_free_counters;
> > @@ -10342,7 +10346,7 @@ static void cpuacct_update_stats(struct
> > ca = task_ca(tsk);
> >
> > do {
> > - percpu_counter_add(&ca->cpustat[idx], val);
> > + __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
>
> This will make the end result very off the real value due to large
> batch value per cpu. If we are going to go this route, we should
> probably consider using __percpu_counter_sum so that the batch value
> does not show data that is way off.
No problem.
End-users don't see the raw cputime; they see the converted time.
cpuacct_stats_show() uses cputime64_to_clock_t(), which means
values of less than 10 msec aren't displayed anyway.
--------------------------------------------------------
static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct cpuacct *ca = cgroup_ca(cgrp);
int i;
for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
s64 val = percpu_counter_read(&ca->cpustat[i]);
val = cputime64_to_clock_t(val);
cb->fill(cb, cpuacct_stat_desc[i], val);
}
return 0;
}
--------------------------------------------------------
That's a nice joke, but not constructive.
Another idea or another patch is always welcome, though.
But __percpu_counter_sum takes the fbc->lock spinlock, and that is the cause of
the problem Kosaki mentions, I believe.
Regards,
Bharata.
I was only asking you if you have seen any real-life problem with this,
and you said no. In that context, if I re-read my above point, I think
I would need a pretty good sense of humour to consider it a joke :)
Regards,
Bharata.
Yes, I know, I reviewed Bharata's patch and suggested converting to
clock_t for consistency with other metrics.
> --------------------------------------------------------
> static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
> struct cgroup_map_cb *cb)
> {
> struct cpuacct *ca = cgroup_ca(cgrp);
> int i;
>
> for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> s64 val = percpu_counter_read(&ca->cpustat[i]);
My point is, this should probably be percpu_counter_sum(), but that
can be expensive and we were willing to tolerate some inaccuracy due
to the batch value. I think your patch adds to the inaccuracy even more,
even though it fixes a genuine problem.
> val = cputime64_to_clock_t(val);
> cb->fill(cb, cpuacct_stat_desc[i], val);
> }
> return 0;
> }
> --------------------------------------------------------
>
>
>
>
>
--
Balbir
> On Wed, Apr 29, 2009 at 03:38:54AM +0530, Balbir Singh wrote:
> > * KOSAKI Motohiro <kosaki....@jp.fujitsu.com> [2009-04-28 15:53:32]:
> > >
> > > do {
> > > - percpu_counter_add(&ca->cpustat[idx], val);
> > > + __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
> >
> > This will make the end result very off the real value due to large
> > batch value per cpu. If we are going to go this route, we should
> > probably consider using __percpu_counter_sum so that the batch value
> > does not show data that is way off.
>
> But __percpu_counter_sum takes fbc->lock spinlock and that is the cause of
> the problem Kosaki mentions I believe.
But this is on the user space read side (show stats), I thought
Kosaki's problem was with percpu_counter_add called from update
stats.
--
Balbir
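For reference, the read-side summation under discussion looks roughly like the sketch below, again a simplified rendering of the era's lib/percpu_counter.c rather than a verbatim copy. It holds fbc->lock while walking every online cpu, which is what makes it accurate but comparatively expensive:
--------------------------------------------------------
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;

	spin_lock(&fbc->lock);
	ret = fbc->count;
	/* add in whatever is still cached per cpu */
	for_each_online_cpu(cpu) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		ret += *pcount;
	}
	spin_unlock(&fbc->lock);
	return ret;
}
--------------------------------------------------------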
Oh, sorry.
I didn't know this.
> > --------------------------------------------------------
> > static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
> > struct cgroup_map_cb *cb)
> > {
> > struct cpuacct *ca = cgroup_ca(cgrp);
> > int i;
> >
> > for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > s64 val = percpu_counter_read(&ca->cpustat[i]);
>
> My point is, this should probably be percpu_counter_sum(), but that
> can be expensive and we were willing to tollerate some inaccuracy due
> to batch value, I think your patch adds to the inaccuracy even more,
> even though it fixes a genuine problem.
It's not expensive.
cpuacct_stats_show() is only called when the stat file is read;
it is definitely a slow path.
So I think we can use percpu_counter_sum().
However, I doubt it's worth it.
Before my patch:
  VIRT_CPU_ACCOUNTING=y: accurate but slow
  VIRT_CPU_ACCOUNTING=n: inaccurate by a few ticks but fast
With my patch:
  VIRT_CPU_ACCOUNTING=y: inaccurate by a few ticks but fast
  VIRT_CPU_ACCOUNTING=n: inaccurate by a few ticks but fast
If the added inaccuracy is unacceptable, the current code is also broken when VIRT_CPU_ACCOUNTING=n.
I only make the accuracy consistent between VIRT_CPU_ACCOUNTING=y and =n.
Thoughts?
That said, if you still think percpu_counter_sum() is better, I can do that.
>
>
> > val = cputime64_to_clock_t(val);
> > cb->fill(cb, cpuacct_stat_desc[i], val);
> > }
> > return 0;
> > }
> > --------------------------------------------------------
--
Right. Agreed.
Regards,
Bharata.
-------------------------------------
Subject: [PATCH v2] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
cpuacct_update_stats() is called on every tick update and uses a percpu_counter
to avoid performance degradation.
For archs which define VIRT_CPU_ACCOUNTING, every tick results
in >1000 units of cputime updates, and since this is much greater
than percpu_counter_batch, we end up taking the spinlock on every tick.
This patch changes the batch rule: now each cpu can cache
"percpu_counter_batch * jiffies_to_cputime(1)" cputime in its per-cpu counter.
In other words, this patch does not change behavior when VIRT_CPU_ACCOUNTING=n.
Cc: Bharata B Rao <bha...@linux.vnet.ibm.com>
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-04-30 11:37:47.000000000 +0900
+++ b/kernel/sched.c 2009-04-30 14:17:00.000000000 +0900
@@ -10221,6 +10221,7 @@ struct cpuacct {
};
struct cgroup_subsys cpuacct_subsys;
+static s32 cpuacct_batch;
/* return cpu accounting group corresponding to this container */
static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
@@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
if (!ca->cpuusage)
goto out_free_ca;
+ if (!cpuacct_batch)
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
if (percpu_counter_init(&ca->cpustat[i], 0))
goto out_free_counters;
@@ -10376,7 +10380,7 @@ static int cpuacct_stats_show(struct cgr
int i;
for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
- s64 val = percpu_counter_read(&ca->cpustat[i]);
+ s64 val = percpu_counter_sum(&ca->cpustat[i]);
val = cputime64_to_clock_t(val);
cb->fill(cb, cpuacct_stat_desc[i], val);
}
@@ -10446,7 +10450,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
What do the test results look like with this? I'll see if I can find
some time to test this patch. On a patch read level this seems much better
to me, Peter?
Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
--
Balbir
This patch decreases the risk of slowdowns on large servers, but it
has no functional change, so there is no functional test to run.
AFAIK, percpu_counter_sum() doesn't cause any performance degradation,
but if you have a good stress test, please point me to it.
I don't really fancy percpu_counter_sum() usage. I'm thinking it's ok to
degrade accuracy on larger machines and simply use
percpu_counter_read().
I have the same opinion as Peter. Balbir, what do you think?
yes - and the values will converge anyway, right? So it's just a
small delay, not even any genuine loss of accuracy.
Ingo
> > > > for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > > > - s64 val = percpu_counter_read(&ca->cpustat[i]);
> > > > + s64 val = percpu_counter_sum(&ca->cpustat[i]);
> > > > val = cputime64_to_clock_t(val);
> > > > cb->fill(cb, cpuacct_stat_desc[i], val);
> > > > }
> > >
> > > What do the test results look like with this? I'll see if I can find
> > > some time to test this patch. On a patch read level this seems much better
> > > to me, Peter?
> >
> > I don't really fancy percpu_counter_sum() usage. I'm thinking its ok to
> > degrate accuracy on larger machines and simply use
> > percpu_counter_read().
>
> I have same opinion with peter. Balbir, What do you think?
>
Sure, but the larger the delta gets, the less useful the metric gets
:) I am OK with going back to percpu_counter_read() if that is the
consensus.
--
Balbir
>
> Changelog:
> since v1
> - use percpu_counter_sum() instead percpu_counter_read()
>
>
> -------------------------------------
> Subject: [PATCH v2] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
>
> cpuacct_update_stats() is called at every tick updating. and it use percpu_counter
> for avoiding performance degression.
>
> For archs which define VIRT_CPU_ACCOUNTING, every tick would result
> in >1000 units of cputime updates and since this is much much greater
> than percpu_batch_counter, we end up taking spinlock on every tick.
>
> This patch change batch rule. now, any cpu can store "percpu_counter_bach * jiffies"
> cputime in per-cpu cache.
> it mean this patch don't have behavior change if VIRT_CPU_ACCOUNTING=n.
Does this actually matter?
If we're calling cpuacct_update_stats() with large values of `cputime'
then presumably we're also calling cpuacct_update_stats() at a low
frequency, so the common lock-taking won't cause performance problems?
VIRT_CPU_ACCOUNTING changes the meaning of cputime_t, but it doesn't change
how often the tick accounting is called.
For example, on
ia64 with HZ=1000 and VIRT_CPU_ACCOUNTING=y (1 cputime == 1ns, i.e. 1 jiffy
== 1000000 cputime),
every tick update adds 1000000 cputime (see jiffies_to_cputime()).
-----------------------------------------------------------------------------------------
void account_process_tick(struct task_struct *p, int user_tick)
{
cputime_t one_jiffy = jiffies_to_cputime(1);
cputime_t one_jiffy_scaled = cputime_to_scaled(one_jiffy);
struct rq *rq = this_rq();
if (user_tick)
account_user_time(p, one_jiffy, one_jiffy_scaled);
else if (p != rq->idle)
account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
one_jiffy_scaled);
else
account_idle_time(one_jiffy);
}
-----------------------------------------------------------------------------------------
But the frequency of tick updates is unchanged.
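To put rough numbers on the above (illustrative figures for the ia64 example, not measurements):
--------------------------------------------------------
/*
 * ia64, HZ=1000, VIRT_CPU_ACCOUNTING=y, 1 cputime == 1 ns:
 *
 *   val passed per tick        = jiffies_to_cputime(1)  = 1,000,000
 *   default batch              = max(32, 2 * nr_cpus)   = e.g. 64 on 32 cpus
 *   scaled batch (this patch)  = jiffies_to_cputime(64) = 64,000,000
 *
 * With the default batch, val >= batch on every tick, so every
 * __percpu_counter_add() takes fbc->lock; with the scaled batch, each
 * cpu again caches up to ~64 ticks worth of cputime locally.
 */
--------------------------------------------------------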
since v1
- use percpu_counter_sum() instead percpu_counter_read()
-------------------------------------
Subject: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
cpuacct_update_stats() is called on every tick update and uses a percpu_counter
to avoid performance degradation.
For archs which define VIRT_CPU_ACCOUNTING, every tick results
in >1000 units of cputime updates, and since this is much greater
than percpu_counter_batch, we end up taking the spinlock on every tick.
This patch changes the batch rule: now each cpu can cache
"percpu_counter_batch * jiffies_to_cputime(1)" cputime in its per-cpu counter.
In other words, this patch does not change behavior when VIRT_CPU_ACCOUNTING=n.
Cc: Bharata B Rao <bha...@linux.vnet.ibm.com>
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-04-30 11:37:47.000000000 +0900
+++ b/kernel/sched.c 2009-05-07 16:34:00.000000000 +0900
@@ -10221,6 +10221,7 @@ struct cpuacct {
};
struct cgroup_subsys cpuacct_subsys;
+static s32 cpuacct_batch;
/* return cpu accounting group corresponding to this container */
static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
@@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
if (!ca->cpuusage)
goto out_free_ca;
+ if (!cpuacct_batch)
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
if (percpu_counter_init(&ca->cpustat[i], 0))
goto out_free_counters;
@@ -10446,7 +10450,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
should be __read_mostly i guess.
>
> /* return cpu accounting group corresponding to this container */
> static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> @@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
> if (!ca->cpuusage)
> goto out_free_ca;
>
> + if (!cpuacct_batch)
> + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
So this is a once per boot condition? Why not initialize it in
sched_init() or so? (instead of incurring this ugly check all the
time)
Ingo
Thanks! fixed.
Changelog:
since V3
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init()
since v2
- revert using percpu_counter_sum()
since v1
- use percpu_counter_sum() instead percpu_counter_read()
---------------------------------------------------------
Subject: [PATCH v4] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-05-09 16:48:08.000000000 +0900
+++ b/kernel/sched.c 2009-05-09 18:16:00.000000000 +0900
@@ -824,8 +824,12 @@ static struct file_operations sched_feat
.release = single_release,
};
+static __read_mostly s32 cpuacct_batch;
+
static __init int sched_init_debug(void)
{
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
debugfs_create_file("sched_features", 0644, NULL, NULL,
&sched_feat_fops);
@@ -10457,7 +10461,8 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
ca = ca->parent;
} while (ca);
rcu_read_unlock();
I'm pretty sure you can select cpu accounting without SCHED_DEBUG...
Grr, yes, that was stupid of me ;)
I'll fix it soon.
Done.
Changelog:
since V4
- move cpuacct_batch initialization into sched_init()
since V3
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init_debug()
since v2
- revert using percpu_counter_sum()
since v1
- use percpu_counter_sum() instead percpu_counter_read()
---------------------------------------------------------
Subject: [PATCH v5] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-05-09 16:48:08.000000000 +0900
+++ b/kernel/sched.c 2009-05-09 18:57:35.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,8 @@ void __init sched_init(void)
perf_counter_init();
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
scheduler_running = 1;
}
@@ -10457,7 +10461,8 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
ca = ca->parent;
} while (ca);
rcu_read_unlock();
sched: cpuacct: Use bigger percpu counter batch values for stats counters
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
v5:
- move cpuacct_batch initialization into sched_init()
v4:
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init_debug()
v3:
- revert using percpu_counter_sum()
v2:
- use percpu_counter_sum() instead percpu_counter_read()
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
LKML-Reference: <20090509191430...@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/sched.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8908d19..beadb82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -872,6 +872,8 @@ static __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9181,6 +9183,8 @@ void __init sched_init(void)
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
scheduler_running = 1;
}
@@ -10354,7 +10358,8 @@ static void cpuacct_update_stats(struct task_struct *tsk,
sched: cpuacct: Use bigger percpu counter batch values for stats counters
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
v5:
- move cpuacct_batch initialization into sched_init()
v4:
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init_debug()
v3:
- revert using percpu_counter_sum()
v2:
- use percpu_counter_sum() instead percpu_counter_read()
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
LKML-Reference: <20090509191430...@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/sched.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8908d19..873bc3b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -872,6 +872,8 @@ static __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9171,6 +9173,8 @@ void __init sched_init(void)
*/
current->sched_class = &fair_sched_class;
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
/* Allocate the nohz_cpu_mask if CONFIG_CPUMASK_OFFSTACK */
alloc_bootmem_cpumask_var(&nohz_cpu_mask);
#ifdef CONFIG_SMP
@@ -10354,7 +10358,8 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> Commit-ID: 50fbed3bb72cbe38a665512771a91a96d028de46
> Gitweb: http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
> Author: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> AuthorDate: Sat, 9 May 2009 19:14:58 +0900
> Committer: Ingo Molnar <mi...@elte.hu>
> CommitDate: Mon, 11 May 2009 14:32:57 +0200
>
> sched: cpuacct: Use bigger percpu counter batch values for stats counters
-tip testing found a build failure with this patch:
kernel/sched.c:9278: error: ‘percpu_counter_batch’ undeclared (first use in this function)
kernel/sched.c:9278: error: (Each undeclared identifier is reported only once
kernel/sched.c:9278: error: for each function it appears in.)
Ingo
typo ?
yes, this is my memorably stupid week ;)
Oh, my fault again. Good catch!
Changelog:
since V5
- fix build error when UP
since V4
- move cpuacct_batch initialization into sched_init()
since V3
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init()
since v2
- revert using percpu_counter_sum()
since v1
- use percpu_counter_sum() instead percpu_counter_read()
---------------------------------------------------------
Subject: [PATCH v6] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c 2009-05-12 13:28:53.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
perf_counter_init();
+#ifdef CONFIGCONFIG_SMP
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
scheduler_running = 1;
}
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
ca = ca->parent;
} while (ca);
rcu_read_unlock();
Unfortunately, this mistake passed testing successfully ;)
That's because cpuacct_batch=0 still works on SMP - it just degenerates to the old always-locked behaviour.
> >
> > * tip-bot for KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:
> >
> > > Commit-ID: 50fbed3bb72cbe38a665512771a91a96d028de46
> > > Gitweb: http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
> > > Author: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> > > AuthorDate: Sat, 9 May 2009 19:14:58 +0900
> > > Committer: Ingo Molnar <mi...@elte.hu>
> > > CommitDate: Mon, 11 May 2009 14:32:57 +0200
> > >
> > > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> >
> > -tip testing found a build failure with this patch:
> >
> > kernel/sched.c:9278: error: ‘percpu_counter_batch’ undeclared (first use in this function)
Slow down and compile patches before sending them out.. please. That
is a basic expectation if you expect it to be merged.
--
Balbir
> > > On Tue, 2009-05-12 at 19:01 +0900, KOSAKI Motohiro wrote:
> > > > +#ifdef CONFIGCONFIG_SMP
> > >
> > > typo ?
> >
> > yes, this week is my memorial stupid one ;)
>
> ok, assemble list indicate current patch have no typo ;)
>
> ffffffff814ae920: 48 c7 40 30 10 eb 2a movq $0xffffffff812aeb10,0x30(%rax)
> ffffffff814ae927: 81
> ffffffff814ae928: c7 05 3e 37 fe ff 01 movl $0x1,-0x1c8c2(%rip) # ffffffff81492070 <scheduler_running>
> ffffffff814ae92f: 00 00 00
> ffffffff814ae932: 8b 05 28 52 fe ff mov -0x1add8(%rip),%eax # ffffffff81493b60 <percpu_counter_batch>
> ffffffff814ae938: 89 05 36 37 fe ff mov %eax,-0x1c8ca(%rip) # ffffffff81492074 <cpuacct_batch>
> ffffffff814ae93e: 41 5c pop %r12
>
>
>
> ---------------------------------------------------------
> Subject: [PATCH] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
>
> percpu counters used to accumulate statistics in cpuacct controller use
> the default batch value [max(2*nr_cpus, 32)] which can be too small for
> archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
> cputime updates in the range of thousands. As a result, cpuacct_update_stats()
> would end up acquiring the percpu counter spinlock on every tick which
> is not good for performance.
>
> Let those architectures to have a bigger batch threshold so that percpu counter
> spinlock isn't taken on every tick. This change doesn't affect the archs which
> don't define VIRT_CPU_ACCOUNTING and they continue to have the default
> percpu counter batch value.
>
> Cc: Balaji Rao <balaj...@gmail.com>
> Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> Cc: Peter Zijlstra <a.p.zi...@chello.nl>
> Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mi...@elte.hu>
> Cc: Martin Schwidefsky <schwi...@de.ibm.com>
> Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> kernel/sched.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
Please reuse the commit log formatting fixes i did when i applied
your patch. (you should still have that commit notification email)
Ingo
> > > +#ifdef CONFIGCONFIG_SMP
> > > + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > +#endif
> >
> > Slow down and compile patches before sending them out.. please. That
> > is a basic expectation if you expect it to be merged.
>
> Unfortunately, this mistake pass test successfully ;)
> it because cpuacct_batch=0 works even SMP.
>
OK, BTW, using an #ifdef right in the middle of a function makes the
code harder to read, can't we use an inline function to abstract out
SMP?
--
Balbir
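One possible shape of the helper Balbir is asking for might be the sketch below; cpuacct_default_batch() is a hypothetical name invented here, not anything in the kernel. sched_init() could then assign cpuacct_batch unconditionally, with no #ifdef in the function body:
--------------------------------------------------------
/* Hypothetical helper, sketching Balbir's suggestion */
static inline s32 cpuacct_default_batch(void)
{
#ifdef CONFIG_SMP
	return jiffies_to_cputime(percpu_counter_batch);
#else
	/* UP: percpu_counter_add() ignores the batch anyway */
	return 0;
#endif
}
--------------------------------------------------------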
OK, the assembly listing indicates the current patch has no typo ;)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c 2009-05-12 19:04:49.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
perf_counter_init();
+#ifdef CONFIG_SMP
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
scheduler_running = 1;
}
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
ca = ca->parent;
} while (ca);
rcu_read_unlock();
> * KOSAKI Motohiro <kosaki....@jp.fujitsu.com> [2009-05-12 19:13:42]:
>
> > > > +#ifdef CONFIGCONFIG_SMP
> > > > + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > > +#endif
> > >
> > > Slow down and compile patches before sending them out.. please. That
> > > is a basic expectation if you expect it to be merged.
> >
> > Unfortunately, this mistake pass test successfully ;)
> > it because cpuacct_batch=0 works even SMP.
> >
>
> OK, BTW, using an #ifdef right in the middle of a function makes
> the code harder to read, can't we use an inline function to
> abstract out SMP?
or rather, to make cpuacct_batch have a sane value on UP too. (1?
0?)
Ingo
Oh, I didn't notice this change. I'll resend it. thanks
Umm..
I've reviewed my patch again,
but sched_init() already has multiple #ifdef SMP blocks, so I don't think
this cosmetic change would improve readability much.
------------------------------------
Subject: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.
Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING and they continue to have the default
percpu counter batch value.
v7:
- fix typo and changelog
v6:
- fix build error when UP
v5:
- move cpuacct_batch initialization into sched_init()
v4:
- rewrite patch description (thanks Bharata!)
- append read_mostly to cpuacct_batch
- cpuacct_batch is initialized by sched_init_debug()
v3:
- revert using percpu_counter_sum()
v2:
- use percpu_counter_sum() instead percpu_counter_read()
Cc: Balaji Rao <balaj...@gmail.com>
Cc: Dhaval Giani <dha...@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Balbir Singh <bal...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Martin Schwidefsky <schwi...@de.ibm.com>
Signed-off-by: Bharata B Rao <bha...@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
kernel/sched.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c 2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c 2009-05-12 19:04:49.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
*/
int sysctl_sched_rt_runtime = 950000;
+static __read_mostly s32 cpuacct_batch;
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
perf_counter_init();
+#ifdef CONFIG_SMP
+ cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
scheduler_running = 1;
}
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
ca = ca->parent;
} while (ca);
rcu_read_unlock();
> > or rather, to make cpuacct_batch have a sane value on UP too.
> > (1? 0?)
>
> umm..
>
> I've reviewed my patch again.
> but sched_init() already has multiple #ifdef SMP. [...]
Patches are welcome to remove more of them.
> [...] Thus I don't think cosmetic changing improve readability
> largely.
an avoidable #ifdef should always be avoided.
Ingo
> 2009/5/12 Ingo Molnar <mi...@elte.hu>:
> >
> > * KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:
> >
> >> > or rather, to make cpuacct_batch have a sane value on UP too.
> >> > (1? 0?)
> >>
> >> umm..
> >>
> >> I've reviewed my patch again.
> >> but sched_init() already has multiple #ifdef SMP. [...]
> >
> > Patches are welcome to remove more of them.
> >
> >> [...] Thus I don't think cosmetic changing improve readability
> >> largely.
> >
> > an avoidable #ifdef should aways be avoided.
>
> ok. I'll fix it tommorow.
With the caveat that you should do it only if i'm right and if it
results in better code in general.
OK. I'll fix it tomorrow.
On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
later in time_init(). Because of this I see that cpuacct_batch will always
be zero, effectively negating what this patch is trying to do.
As explained by you earlier, we too are finding the default batch value to
be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
if this patch is taken in (of course with the above issue fixed), it will
benefit ppc64 also.
Regards,
Bharata.
> On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
> jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
> later in time_init(). Because of this I see that cpuacct_batch will always
> be zero effectively negating what this patch is trying to do.
>
> As explained by you earlier, we too are finding the default batch value to
> be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
> if this patch is taken in (ofcourse with the above issue fixed), it will
> benefit ppc64 also.
I created this patch earlier today when I hit the problem. Thoughts?
Anton
--
When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
with values much larger than percpu_counter_batch. This means the
call to percpu_counter_add will always add to the global count which is
protected by a spinlock.
Since reading of the CPU accounting cgroup counters is not performance
critical, we can use a maximum size batch of INT_MAX and use
percpu_counter_sum on the read side which will add all the percpu
counters.
With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
397k/sec to 3.9M/sec, a 10x improvement.
Signed-off-by: Anton Blanchard <an...@samba.org>
---
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2009-07-16 10:11:02.000000000 +1000
+++ linux.trees.git/kernel/sched.c 2009-07-16 10:16:41.000000000 +1000
@@ -10551,7 +10551,7 @@
int i;
for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
- s64 val = percpu_counter_read(&ca->cpustat[i]);
+ s64 val = percpu_counter_sum(&ca->cpustat[i]);
val = cputime64_to_clock_t(val);
cb->fill(cb, cpuacct_stat_desc[i], val);
}
@@ -10621,7 +10621,7 @@
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
Looks like this issue is still present. I tested on a 32 core box and
the patch improved the maximum context switch rate from 76k/sec to 9.5M/sec.
That's over 100x faster, or 50x per line of code. That's got to be some sort of
record :)
Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
box so I can break the 200x mark :)
>
> Hi,
>
> Looks like this issue is still present. I tested on a 32 core box and
> the patch improved maximum context switch rate from from 76k/sec to 9.5M/sec.
> Thats over 100x faster, or 50x per line of code. That's got to be some sort of
> record :)
>
> Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
> box so I can break the 200x mark :)
>
I like the patch itself; I think it's a correct fix.
Hmm, but IIRC there was a discussion along the lines of "what happens if
this counter is read once per 1ms?".
What is the read overhead on a big system? Can you show it?
Thanks,
-Kame
>
> Hi,
>
> Looks like this issue is still present. I tested on a 32 core box and
> the patch improved maximum context switch rate from from 76k/sec to 9.5M/sec.
> Thats over 100x faster, or 50x per line of code. That's got to be some sort of
> record :)
>
> Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
> box so I can break the 200x mark :)
>
> Anton
>
> > --
> >
> > When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
> > with values much larger than percpu_counter_batch. This means the
> > call to percpu_counter_add will always add to the global count which is
> > protected by a spinlock.
> >
> > Since reading of the CPU accounting cgroup counters is not performance
> > critical, we can use a maximum size batch of INT_MAX and use
> > percpu_counter_sum on the read side which will add all the percpu
> > counters.
> >
> > With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
> > 397k/sec to 3.9M/sec, a 10x improvement.
> >
Looks good overall, why not keep the batch size conditional on
CONFIG_VIRT_CPU_ACCOUNTING? I'd still like to stick with
percpu_counter_read() on the read side because my concern is that a
bad user space application can read cpuacct.stat and bring the kernel
to its knees.
> > Signed-off-by: Anton Blanchard <an...@samba.org>
> > ---
> >
> > Index: linux.trees.git/kernel/sched.c
> > ===================================================================
> > --- linux.trees.git.orig/kernel/sched.c 2009-07-16 10:11:02.000000000 +1000
> > +++ linux.trees.git/kernel/sched.c 2009-07-16 10:16:41.000000000 +1000
> > @@ -10551,7 +10551,7 @@
> > int i;
> >
> > for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > - s64 val = percpu_counter_read(&ca->cpustat[i]);
> > + s64 val = percpu_counter_sum(&ca->cpustat[i]);
> > val = cputime64_to_clock_t(val);
> > cb->fill(cb, cpuacct_stat_desc[i], val);
> > }
> > @@ -10621,7 +10621,7 @@
> > ca = task_ca(tsk);
> >
> > do {
> > - percpu_counter_add(&ca->cpustat[idx], val);
> > + __percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
> > ca = ca->parent;
> > } while (ca);
> > rcu_read_unlock();
--
Balbir
> Looks good overall, why not keep the batch size conditional on
> CONFIG_VIRT_CPU_ACCOUNTING? I'd still like to stick with
> percpu_counter_read() on the read side because My concern is that a
> bad user space application can read cpuacct.stat and bring the kernel
> to its knees.
Agreed, the first hunk is dangerous. On the second hunk, I'm not sure
about the INT_MAX thing; that's a 4s gap per cpu, which might be a tad
much.
> > > Signed-off-by: Anton Blanchard <an...@samba.org>
> > > ---
> > >
> > > Index: linux.trees.git/kernel/sched.c
> > > ===================================================================
> > > --- linux.trees.git.orig/kernel/sched.c 2009-07-16 10:11:02.000000000 +1000
> > > +++ linux.trees.git/kernel/sched.c 2009-07-16 10:16:41.000000000 +1000
> > > @@ -10551,7 +10551,7 @@
> > > int i;
> > >
> > > for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > > - s64 val = percpu_counter_read(&ca->cpustat[i]);
> > > + s64 val = percpu_counter_sum(&ca->cpustat[i]);
> > > val = cputime64_to_clock_t(val);
> > > cb->fill(cb, cpuacct_stat_desc[i], val);
> > > }
> > > @@ -10621,7 +10621,7 @@
> > > ca = task_ca(tsk);
> > >
> > > do {
> > > - percpu_counter_add(&ca->cpustat[idx], val);
> > > + __percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
> > > ca = ca->parent;
> > > } while (ca);
> > > rcu_read_unlock();
>
--
> Agreed, the first hunk is dangerous, on the second hunk, I'm not sure
> about the INT_MAX thing, that's a 4s gap per cpu, that might be a tad
> much.
Yeah I somehow missed the fact we read it every timer tick. I was hoping
for an easy fix but I guess we just need to set a reasonable batch size
at runtime.
Anton
>
> Hi,
>
> > Agreed, the first hunk is dangerous, on the second hunk, I'm not sure
> > about the INT_MAX thing, that's a 4s gap per cpu, that might be a tad
> > much.
>
> Yeah I somehow missed the fact we read it every timer tick. I was hoping
> for an easy fix but I guess we just need to set a reasonable batch size
> at runtime.
>
Could you share the context-switch test program?
I'd like to play with it to find out what I can do about the percpu counter.
Thanks,
-Kame
> Could you share contex-switch-test program ?
> I'd like to play with it to find out what I can do against percpu counter.
Sure:
http://ozlabs.org/~anton/junkcode/context_switch.c
Very simple, just run it once per core:
for i in `seq 0 31`
do
taskset -c $i ./context_switch &
done
Then look at the context switch rates in vmstat.
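For readers who can't fetch that URL, a pipe-based ping-pong load generator in the same spirit looks roughly like the sketch below; this is an illustration only, not Anton's actual context_switch.c. Each instance keeps two processes switching back and forth, so pinning one per core with the taskset loop above drives the per-cpu cpuacct update path as hard as possible:
--------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int p1[2], p2[2];
	char c = 0;

	if (pipe(p1) || pipe(p2)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: wait for a byte, echo it back */
		for (;;) {
			if (read(p1[0], &c, 1) != 1 ||
			    write(p2[1], &c, 1) != 1)
				exit(1);
		}
	}

	/* parent: send a byte, wait for the echo */
	for (;;) {
		if (write(p1[1], &c, 1) != 1 ||
		    read(p2[0], &c, 1) != 1)
			break;
	}
	return 0;
}
--------------------------------------------------------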
>
> Hi,
>
> > Could you share contex-switch-test program ?
> > I'd like to play with it to find out what I can do against percpu counter.
>
> Sure:
>
> http://ozlabs.org/~anton/junkcode/context_switch.c
>
> Very simple, just run it once per core:
>
> for i in `seq 0 31`
> do
> taskset -c $i ./context_switch &
> done
>
> Then look at the context switch rates in vmstat.
>
Thank you for the test program.
Before adjusting the batch value (I think you should modify it),
could you try this?
I only have an 8cpu (2 socket) host, but it works well there.
(But... my host is x86-64 and doesn't have VIRT_CPU_ACCOUNTING.)
With your program:
before patch:
  cpuacct off : 414000-416000 ctsw per sec.
  cpuacct on  : 401000-404000 ctsw per sec.
after patch:
  cpuacct on  : 412000-413000 ctsw per sec.
Maybe I should check the cache-miss rate later ;)
==
It's bad to place the pointer to the per-cpu data array on the
same cache line as the spinlock. This patch moves it to its own
cacheline in percpu_counter to reduce false sharing.
Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
---
include/linux/percpu_counter.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6.31-rc6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.31-rc6.orig/include/linux/percpu_counter.h 2009-08-20 12:09:27.000000000 +0900
+++ linux-2.6.31-rc6/include/linux/percpu_counter.h 2009-08-20 17:31:13.000000000 +0900
@@ -14,14 +14,24 @@
#include <linux/types.h>
#ifdef CONFIG_SMP
+struct __percpu_counter_padding {
+ char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define CACHELINE_PADDING(name) struct __percpu_counter_padding name
struct percpu_counter {
+ /*
+ * This pointer is persistent and accessed firstly.
+ * Then, should not be purged by locking in other cpus.
+ */
+ s32 *counters;
+ CACHELINE_PADDING(pad);
spinlock_t lock;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
+ /* rarely accessed field */
struct list_head list; /* All percpu_counters are on a list */
#endif
- s32 *counters;
};
extern int percpu_counter_batch;
Btw., in latest upstream you can do that via:
cd tools/perf/
make -j install
perf stat --repeat 5 -- taskset -c 1 ./context_switch
there will be cache-miss and other stats, error bars so
you can compare the before/after better, etc.
Ingo
Do you have any data on what level of batching is good for
VIRT_CPU_ACCOUNTING? If it is going to be as high as INT_MAX, then
your patch seems to be the best way to go forward. We'll need to herd
away bad users who read cpuacct.stat frequently enough to create a
problem.
Balbir Singh.
Tried it (on an 8cpu/2socket host). It seems the cache-miss count decreases.
But the IPC..?
==
/root/bin/perf stat --repeat 5 -a -e cycles,instructions,cache-misses,L1-dcache-load-misses,L1-dcache-store-misses -- ./ctxt_sw.sh
[Before] patch
Performance counter stats for './ctxt_sw.sh' (5 runs):
1511260148530 cycles ( +- 0.025% ) (scaled from 63.49% )
470690642181 instructions # 0.311 IPC ( +- 0.098% )(scaled from 79.49%)
1210051728 cache-misses ( +- 0.629% ) (scaled from 79.00% )
3202978828 L1-dcache-load-misses ( +- 1.118% ) (scaled from 78.00% )
1803963907 L1-dcache-store-misses ( +- 0.728% ) (scaled from 42.99% )
60.161941918 seconds time elapsed ( +- 0.029% )
[After] patch
Performance counter stats for './ctxt_sw.sh' (5 runs):
1511961867506 cycles ( +- 0.018% ) (scaled from 71.50%)
448724406149 instructions # 0.297 IPC ( +- 0.133% ) (scaled from 75.49%)
1184548041 cache-misses ( +- 0.136% ) (scaled from 75.50%)
3086357048 L1-dcache-load-misses ( +- 0.822% ) (scaled from 77.50%)
1708375493 L1-dcache-store-misses ( +- 0.328% ) (scaled from 47.00%)
60.179814774 seconds time elapsed ( +- 0.052% )
Thanks,
-Kame
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> ---
> include/linux/percpu_counter.h | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.31-rc6/include/linux/percpu_counter.h
> ===================================================================
> --- linux-2.6.31-rc6.orig/include/linux/percpu_counter.h 2009-08-20 12:09:27.000000000 +0900
> +++ linux-2.6.31-rc6/include/linux/percpu_counter.h 2009-08-20 17:31:13.000000000 +0900
> @@ -14,14 +14,24 @@
> #include <linux/types.h>
>
> #ifdef CONFIG_SMP
> +struct __percpu_counter_padding {
> + char x[0];
> +} ____cacheline_internodealigned_in_smp;
The above alignment is too big... please ignore this patch :(
-Kame
>
> Hi,
>
> > Could you share contex-switch-test program ?
> > I'd like to play with it to find out what I can do against percpu counter.
>
> Sure:
>
> http://ozlabs.org/~anton/junkcode/context_switch.c
>
> Very simple, just run it once per core:
>
> for i in `seq 0 31`
> do
> taskset -c $i ./context_switch &
> done
>
> Then look at the context switch rates in vmstat.
>
OK, I reproduced it on an ia64 box (ia64 supports VIRT_CPU_ACCOUNTING).
On 8 cpus:
-- config=on--
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
11 0 0 7185152 66048 284416 0 0 0 0 5918 12354 0 100 0 0 0
--config=off--
11 0 0 7194496 66304 283392 0 0 0 20 8023 1424022 32 68 0 0 0
Wow, yes. A 100x regression.
I'll dig into this more.
IIUC, if CONFIG_VIRT_CPU_ACCOUNTING=y, a process's stime/utime is recorded in nanoseconds.
Then batch=32 is far too small. How about batching once per msec?
Thanks,
-Kame
> On Thu, 20 Aug 2009 12:04:03 +0200
> Ingo Molnar <mi...@elte.hu> wrote:
> > * KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> wrote:
> > > with your program
> > > before patch.
> > > cpuacct off : 414000-416000 ctsw per sec.
> > > cpuacct on : 401000-404000 ctsw per sec.
> > >
> > > after patch
> > > cpuacct on : 412000-413000 ctsw per sec.
> > >
> > > Maybe I should check cache-miss late ;)
> >
> > Btw., in latest upstream you can do that via:
> >
> > cd tools/perf/
> > make -j install
> >
> > perf stat --repeat 5 -- taskset -c 1 ./context_switch
> >
>
> Tried it (on an 8cpu/2socket host). It seems cache-misses decrease. But
> IPC...?
All the numbers have gone down - about the same amount of cycles but
fewer instructions executed, and fewer cache-misses. That's good.
The Instructions Per Cycle metric got worse only because cycles stayed
roughly constant while instructions dropped. One thing is that you have triggered counter-over-commit
(the 'scaled from' messages) - this means that more counters are
used than the hardware has space for - so we round-robin schedule
them.
If you want to get to the bottom of that, to get the most precise
result try something like:
perf stat --repeat 5 -a -e \
cycles,instructions,L1-dcache-load-misses,L1-dcache-store-misses \
-- ./ctxt_sw.sh
( this is almost the same as the command line you used, but without
the 'cache-misses' counter. Your CPU should be able to
simultaneously activate all these counters and they should count
100% of the events. )
Ingo
Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
--
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.
Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.
For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <an...@samba.org>
---
Note: ccing ia64 and s390 who have not yet added code to statically
initialise cputime_one_jiffy at boot.
See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
jiffies_to_cputime(1)) for details. Adding this would help optimise not only
this patch but many other areas of the scheduler when
CONFIG_VIRT_CPU_ACCOUNTING is enabled.
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
+++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
@@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
rcu_read_lock();
ca = task_ca(tsk);
+ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
Seems like a good idea, but isn't that batch number rather static? If
so, computing it in some init path would save that multiply on the
actual accounting path.
> Seems like a good idea, but isn't that batch number rather static? If
> so, computing it in some init path would save that multiply on the
> actual accounting path.
It is mostly static, but percpu_counter_batch does change with hotplug
operations. Adding a hotplug notifier felt like a lot of work, but I was
worried we would have issues if we didn't rescale after hotplug add operations.
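For the record, a notifier based variant would have looked something like the
sketch below (hypothetical, not part of this patch; cpuacct_batch is an
invented static here, and it would still have to run after percpu_counter's
own hotplug callback has recomputed percpu_counter_batch):

static s32 cpuacct_batch;

static int cpuacct_batch_notify(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_ONLINE:
	case CPU_DEAD:
		/* rescale whenever the number of online cpus changes */
		cpuacct_batch = min_t(long,
				      percpu_counter_batch * cputime_one_jiffy,
				      INT_MAX);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block cpuacct_batch_nb = {
	.notifier_call = cpuacct_batch_notify,
};

/* in the cpuacct init path: register_hotcpu_notifier(&cpuacct_batch_nb); */

The runtime min_t() in the accounting path avoids all of that ordering, at
the cost of one multiply per call.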
Anton
Looks good to me, but I'll test it as well and report back. I think we
might also need to look at the read side, where we do percpu_counter_read().
Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Balbir
OK, trading a notifier for a mult might be worth it, we'll see if
someone complains ;-)
Let's go with this for now.
On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <an...@samba.org> wrote:
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1)) for details. Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
For s390 the jiffies_to_cputime is a compile time constant. No need to
initialize it at runtime, no?
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
The patch itself trades some accuracy (larger cpu accounting values are
stored per-cpu) against runtime overhead (the spinlock used to transfer the
value to the global variable in __percpu_counter_add). Did you
calculate how big the loss in accuracy is?
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
> For s390 the jiffies_to_cputime is a compile time constant. No need to
> initialize it at runtime, no?
Indeed it is, I didn't look closely enough. Same with ia64 so no work to
do on either arch :)
> The patch itself trades some accuracy (larger cpu accounting value that
> are stored per-cpu) against runtime overhead (spinlock to transfer the
> value to the global variable in __percpu_counter_add). Did you
> calculate how big the loss in accuracy is?
The idea is we are already batching percpu_counter_batch jiffies, so
with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
cputime.
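To put rough, illustrative numbers on it (assuming HZ=100 and the default
batch of max(32, nr_cpus*2)): each per-cpu counter can accumulate up to one
batch worth of cputime before folding into the global count, i.e. somewhere
around 32-128 jiffies (0.3-1.3 seconds) of cputime per cpu, so a reader of
cpuacct.stat can at worst see the totals lag by nr_cpus times that. That is
the same order of skew the existing batching already allows with
CONFIG_VIRT_CPU_ACCOUNTING disabled.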
Anton
This patch worked well for me.
Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Tested-by: Balbir Singh <bal...@linux.vnet.ibm.com>
Balbir Singh.
Hi, Peter, Ingo
Could we please pick up the patch for -tip?
--
Three Cheers,
Balbir Singh
> Could we please pick up the patch for -tip?
OK, will queue it up if Ingo doesn't beat me to it.
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
When one looks at the end result:
: static void cpuacct_update_stats(struct task_struct *tsk,
: enum cpuacct_stat_index idx, cputime_t val)
: {
: struct cpuacct *ca;
: int batch;
:
: if (unlikely(!cpuacct_subsys.active))
: return;
:
: rcu_read_lock();
: ca = task_ca(tsk);
:
: batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
: do {
: __percpu_counter_add(&ca->cpustat[idx], val, batch);
: ca = ca->parent;
: } while (ca);
: rcu_read_unlock();
: }
the code (which used to be quite obvious) becomes pretty unobvious. In
fact it looks quite wrong.
Shouldn't there be a comment there explaining wtf is going on?
We have about a 32 jiffy batch both with and without CONFIG_VIRT_CPU_ACCOUNTING,
so the end user may see the counters update a few jiffies later.
Andrew,
I guess a lot of the changelog and comments are in the email history,
but your point on the comment is valid. Why does it look quite wrong to you?
cputime_one_jiffy tells us how many cputime_t units there are in one
jiffy. If virtual accounting is enabled this number is quite large; if it
is not, it is 1. So the batch stays at percpu_counter_batch (typically 32)
on systems without virtual accounting, and becomes 32*cputime_one_jiffy on
systems that support virtual accounting. Either way the per cpu counter
syncs up roughly once every 32 jiffies, assuming cpuacct_update_stats is
called about once per jiffy (as it is on non-virtual-accounting systems).
If the above comment pleases you, I'll polish it and send it across.
Anton, could you please confirm what I've said above is indeed correct.
--
Three Cheers,
Balbir Singh
Not a very useful location for it!
> Why does it look quite wrong to you?
Because it computes the correct value and then if it's larger than
INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
Does that code actually work, btw? percpu_counter_batch has type `int'
and cputime_one_jiffy has type `int' so their product has type `int'.
So by the time min_t performs its comparison, the upper 32 bits of the
product are already lost.
> > I guess a lot of the changelog and comments are in the email history,
>
> Not a very useful location for it!
Good point, I'll work on a useful comment.
> > Why does it look quite wrong to you?
>
> Because it computes the correct value and then if it's larger than
> INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
>
> Does that code actually work, btw? percpu_counter_batch has type `int'
> and cputime_one_jiffy has type `int' so their product has type `int'.
> So by the time min_t performs its comparison, the upper 32 bits of the
> product are already lost.
On ppc64, s390 and ia64 cputime_one_jiffy is 64bit and I want to prevent us
creating too large a batch value:
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
If we pass in something bigger than INT_MAX we could end up with 0 after
truncation, which would turn the percpu counter into a single
spinlock-protected global counter.
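To make the truncation concrete, here is a tiny standalone illustration
(the numbers are invented, and the exact result of an out-of-range signed
conversion is implementation-defined, but common compilers just drop the
high bits):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t percpu_counter_batch = 32;		/* placeholder		*/
	int64_t cputime_one_jiffy = 0x10000000;		/* invented 64bit value	*/
	int64_t product = percpu_counter_batch * cputime_one_jiffy;
	int32_t batch = (int32_t)product;	/* what an s32 parameter sees	*/

	/* typically prints: product=8589934592 batch=0, i.e. every add
	 * would fall through to the spinlocked slow path */
	printf("product=%lld batch=%d\n", (long long)product, batch);
	return 0;
}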
Anton
sched: cpuacct: Use bigger percpu counter batch values for stats counters
Tested with:
Signed-off-by: Anton Blanchard <an...@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <20100118044142.GS12666@kryten>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/sched.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..8f94138 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
IIRC, Andrew picked up this patch as well and applied some checkpatch
fixes too..
Balbir
> On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
> <an...@samba.org> wrote:
> > Commit-ID: __43f85eb1411905afe5db510fbf9841b516e7e6a
> > Gitweb: __ __ http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> > Author: __ __ Anton Blanchard <an...@samba.org>
> > AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> > Committer: __Ingo Molnar <mi...@elte.hu>
> > CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
> >
> > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> >
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> > can call cpuacct_update_stats with values much larger than
> > percpu_counter_batch. This means the call to percpu_counter_add will
> > always add to the global count which is protected by a spinlock and we
> > end up with a global spinlock in the scheduler.
> >
> > Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> > cputime_one_jiffy such that we have the same batch limit as we would if
> > CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> > but that initialisation happened too early on PowerPC (before time_init)
> > and it was never updated at runtime as a result of a hotplug cpu
> > add/remove.
> >
> > This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> > runtime, which keeps the batch correct even after cpu hotplug operations.
> > We cap it at INT_MAX in case of overflow.
> >
> > For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> > cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> > min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> > least on x86 and PowerPC. So there is no need to add an #ifdef.
> >
> > On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> > faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> >
> > CONFIG_CGROUP_CPUACCT disabled: __ __ __ __ 16906698 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT enabled: __ __ __ __ __ __ 61720 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT + patch: __ __ __ __ __16663217 ctx switches/sec
> >
> > Tested with:
> >
> > __wget http://ozlabs.org/~anton/junkcode/context_switch.c
> > __make context_switch
> > __for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> > __vmstat 1
> >
> > Signed-off-by: Anton Blanchard <an...@samba.org>
> > Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> > LKML-Reference: <20100118044142.GS12666@kryten>
> > Signed-off-by: Ingo Molnar <mi...@elte.hu>
> > ---
> > __kernel/sched.c | __ __4 +++-
> > __1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..8f94138 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __ __ __ __ __enum cpuacct_stat_index idx, cputime_t val)
> > __{
> > __ __ __ __struct cpuacct *ca;
> > + __ __ __ int batch;
> >
> > __ __ __ __if (unlikely(!cpuacct_subsys.active))
> > __ __ __ __ __ __ __ __return;
> > @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __rcu_read_lock();
> > __ __ __ __ca = task_ca(tsk);
> >
> > + __ __ __ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > __ __ __ __do {
> > - __ __ __ __ __ __ __ percpu_counter_add(&ca->cpustat[idx], val);
> > + __ __ __ __ __ __ __ __percpu_counter_add(&ca->cpustat[idx], val, batch);
> > __ __ __ __ __ __ __ __ca = ca->parent;
> > __ __ __ __} while (ca);
> > __ __ __ __rcu_read_unlock();
^^ your email client inexplicably fills emails with 0xa0
> IIRC, Andrew picked up this patch as well and applied some checkpatch
> fixes too..
No, I have no changes.
Last I heard, Anton was "working on a useful comment" and will be
redoing the patch.
> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.
If Anton is working on a new version, would Anton then also care to post
one that compiles on UP? :-)
I am going to be changing my email client soon, back to the good old
text client.
>> IIRC, Andrew picked up this patch as well and applied some checkpatch
>> fixes too..
>
> No, I have no changes.
>
Sorry, I was confused, the changes you had were for getdelays.c
> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.
Yep, let's wait on him.
Three Cheers,
Balbir
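(The hunk below converts the UP stub of __percpu_counter_add from a macro
into a static inline, presumably so that a batch value computed at the call
site is still evaluated and type-checked on UP builds instead of being
silently dropped by the macro:)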
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index a7684a5..794662b 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -98,9 +98,6 @@ static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
fbc->count = amount;
}
-#define __percpu_counter_add(fbc, amount, batch) \
- percpu_counter_add(fbc, amount)
-
static inline void
percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
@@ -109,6 +106,12 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
preempt_enable();
}
+static inline void
+__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+ percpu_counter_add(fbc, amount);
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.
For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <an...@samba.org>
---
v2: Added a comment and fixed the UP build.
Index: linux-cpumask/kernel/sched.c
===================================================================
--- linux-cpumask.orig/kernel/sched.c 2010-01-22 18:23:43.377212514 +1100
+++ linux-cpumask/kernel/sched.c 2010-01-28 23:24:02.677233753 +1100
@@ -10885,12 +10885,30 @@ static void cpuacct_charge(struct task_s
}
/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH \
+ min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH 0
+#endif
+
+/*
* Charge the system/user time to the task's accounting group.
*/
static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch = CPUACCT_BATCH;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10899,7 +10917,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();