volano ~30% regression with 2.6.33-rc1 & -rc2

Lin Ming

Jan 4, 2010, 3:40:01 AM

Mike & Peter,

Compared with 2.6.32, volano shows a ~30% regression with 2.6.33-rc1 and -rc2.
Test machine: Tigerton Xeon, 16 CPUs (4P/4Core), 16G memory

I bisected it to the commit below:

commit a1f84a3ab8e002159498814eaa7e48c33752b04b
Author: Mike Galbraith <efa...@gmx.de>
Date: Tue Oct 27 15:35:38 2009 +0100

    sched: Check for an idle shared cache in select_task_rq_fair()

    When waking affine, check for an idle shared cache, and if
    found, wake to that CPU/sibling instead of the waker's CPU.

    This improves pgsql+oltp ramp up by roughly 8%. Possibly more
    for other loads, depending on overlap. The trade-off is a
    roughly 1% peak downturn if tasks are truly synchronous.

    Signed-off-by: Mike Galbraith <efa...@gmx.de>
    Cc: Arjan van de Ven <ar...@infradead.org>
    Cc: Peter Zijlstra <pet...@infradead.org>
    Cc: <sta...@kernel.org>
    LKML-Reference: <1256654138.1...@marge.simson.net>
    Signed-off-by: Ingo Molnar <mi...@elte.hu>

This commit can't be reverted cleanly because of conflicts, so I reverted
the 4 idle-shared-cache commits below in 2.6.33-rc2, and performance was
then restored to the 2.6.32 level.

fe3bcfe (sched: More generic WAKE_AFFINE vs select_idle_sibling())
a50bde5 (sched: Cleanup select_task_rq_fair())
fd21073 (sched: Fix affinity logic in select_task_rq_fair())
a1f84a3 (sched: Check for an idle shared cache in select_task_rq_fair())

This regression seems to be caused by cache misses when accessing per-CPU
data in the loop below (see the perf top cache-misses data further down for
details).

select_idle_sibling(...)
{
	....
	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
		if (!cpu_rq(i)->cfs.nr_running) {
			target = i;
			break;
		}
	}
	....
}

Performance is also restored to the 2.6.32 level if SD_PREFER_SIBLING
is not set, so that select_idle_sibling() is not called.

The perf top data follows.

2.6.33-rc1 cache-misses data (note 11.8% select_task_rq_fair)
------------------------------------------------------------------------------------
PerfTop: 12262 irqs/sec kernel:90.6% [1000Hz cache-misses], (all, 16 CPUs)
------------------------------------------------------------------------------------

 samples   pcnt  function             DSO
 ________  _____ ___________________  _________________

 18272.00  11.8% select_task_rq_fair  [kernel.kallsyms]
 15499.00  10.0% schedule             [kernel.kallsyms]
  9447.00   6.1% update_curr          [kernel.kallsyms]
  9255.00   6.0% _raw_spin_lock       [kernel.kallsyms]
  5161.00   3.3% tcp_sendmsg          [kernel.kallsyms]

2.6.32 cache-misses data
--------------------------------------------------------------------------------------
PerfTop: 11749 irqs/sec kernel:88.2% [1000Hz cache-misses], (all, 16 CPUs)
--------------------------------------------------------------------------------------

 samples   pcnt  function             DSO
 ________  _____ ___________________  _________________

 11974.00  11.5% schedule             [kernel.kallsyms]
  6656.00   6.4% _spin_lock           [kernel.kallsyms]
  5852.00   5.6% update_curr          [kernel.kallsyms]
  3140.00   3.0% enqueue_entity       [kernel.kallsyms]
  2846.00   2.7% tcp_sendmsg          [kernel.kallsyms]

2.6.33-rc1 cycles data (note 6.5% select_task_rq_fair)
-------------------------------------------------------------------------------
PerfTop: 11106 irqs/sec kernel:99.7% [1000Hz cycles], (all, 16 CPUs)
-------------------------------------------------------------------------------

 samples   pcnt  function             DSO
 ________  _____ ___________________  _________________

 11658.00  10.0% schedule             [kernel.kallsyms]
 10870.00   9.4% _raw_spin_lock       [kernel.kallsyms]
  7576.00   6.5% select_task_rq_fair  [kernel.kallsyms]
  3696.00   3.2% tcp_sendmsg          [kernel.kallsyms]
  3000.00   2.6% update_curr          [kernel.kallsyms]

2.6.32 cycles data
------------------------------------------------------------------------------------
PerfTop: 10462 irqs/sec kernel:99.8% [1000Hz cycles], (all, 16 CPUs)
------------------------------------------------------------------------------------

 samples   pcnt  function             DSO
 ________  _____ ___________________  _________________

 13364.00   9.9% schedule             [kernel.kallsyms]
 13140.00   9.8% _spin_lock           [kernel.kallsyms]
  4903.00   3.6% tcp_sendmsg          [kernel.kallsyms]
  4017.00   3.0% update_curr          [kernel.kallsyms]
  3395.00   2.5% _spin_lock_bh        [kernel.kallsyms]

Lin Ming


Arjan van de Ven

Jan 4, 2010, 7:40:01 AM

On Mon, 04 Jan 2010 16:15:58 +0800
Lin Ming <ming....@intel.com> wrote:

> Mike & Peter,
>
> Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory

did this show up only on this cpu?
(since this is a multi-core-without-shared-cache cpu, it could be that
we get the topology wrong and think cores share cache where they don't)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Mike Galbraith

Jan 4, 2010, 8:00:02 AM

On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> On Mon, 04 Jan 2010 16:15:58 +0800
> Lin Ming <ming....@intel.com> wrote:
>
> > Mike & Peter,
> >
> > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
>
> did this show up only on this cpu?
> (since this is a multi-core-without-shared-cache cpu, it could be that
> we get the topology wrong and think cores share cache where they don't)

My fault for using PREFER_SIBLING I guess. However, I do wonder why in
the heck we set that at the CPU domain level. Siblings lie northward.

-Mike

Peter Zijlstra

Jan 4, 2010, 8:10:02 AM

On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > On Mon, 04 Jan 2010 16:15:58 +0800
> > Lin Ming <ming....@intel.com> wrote:
> >
> > > Mike & Peter,
> > >
> > > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
> >
> > did this show up only on this cpu?
> > (since this is a multi-core-without-shared-cache cpu, it could be that
> > we get the topology wrong and think cores share cache where they don't)
>
> My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> the heck we set that at the CPU domain level. Siblings lie northward.

Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
set at the CPU (really socket) level to make tasks spread over sockets
first, so that there is no competition for the socket-wide resources.

Your change is sane, but we really want a more extensive sched domain
tree in the near future, reflecting the full machine topology.

Mike Galbraith

Jan 4, 2010, 8:20:01 AM

On Mon, 2010-01-04 at 14:02 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> > On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > > On Mon, 04 Jan 2010 16:15:58 +0800
> > > Lin Ming <ming....@intel.com> wrote:
> > >
> > > > Mike & Peter,
> > > >
> > > > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > > > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
> > >
> > > did this show up only on this cpu?
> > > (since this is a multi-core-without-shared-cache cpu, it could be that
> > > we get the topology wrong and think cores share cache where they don't)
> >
> > My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> > the heck we set that at the CPU domain level. Siblings lie northward.
>
> Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
> set at the CPU (really socket) level to make tasks spread over sockets
> first, so that there is no competition for the socket-wide resources.

WRT the regression, would you prefer only the sched_fair.c hunk, and
maybe plunking the topology hunk in sched_devel, or both lines in one
patch, since ramp-up gain remains unrealized half of the time on Nehalem
and ilk.

> Your change is sane, but we really want a more extensive sched domain
> tree in the near future, reflecting the full machine topology.

Yeah.

-Mike

Peter Zijlstra

Jan 4, 2010, 8:30:01 AM

On Mon, 2010-01-04 at 14:15 +0100, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 14:02 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-04 at 13:57 +0100, Mike Galbraith wrote:
> > > On Mon, 2010-01-04 at 04:40 -0800, Arjan van de Ven wrote:
> > > > On Mon, 04 Jan 2010 16:15:58 +0800
> > > > Lin Ming <ming....@intel.com> wrote:
> > > >
> > > > > Mike & Peter,
> > > > >
> > > > > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > > > > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
> > > >
> > > > did this show up only on this cpu?
> > > > (since this is a multi-core-without-shared-cache cpu, it could be that
> > > > we get the topology wrong and think cores share cache where they don't)
> > >
> > > My fault for using PREFER_SIBLING I guess. However, I do wonder why in
> > > the heck we set that at the CPU domain level. Siblings lie northward.
> >
> > Ah, PREFER_SIBLING means prefer sibling domain, not sibling thread. It's
> > set at the CPU (really socket) level to make tasks spread over sockets
> > first, so that there is no competition for the socket-wide resources.
>
> WRT the regression, would you prefer only the sched_fair.c hunk, and
> maybe plunking the topology hunk in sched_devel, or both lines in one
> patch, since ramp-up gain remains unrealized half of the time on Nehalem
> and ilk.

Both bits seem sane I guess, you change SD_SIBLING_INIT(), right?
Threads really do share package resources so it makes sense to set it.

I guess it's back to poking at Nehalem to see what makes it tick..

Mike Galbraith

Jan 4, 2010, 8:50:03 AM

On Mon, 2010-01-04 at 14:26 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 14:15 +0100, Mike Galbraith wrote:

> > WRT the regression, would you prefer only the sched_fair.c hunk, and
> > maybe plunking the topology hunk in sched_devel, or both lines in one
> > patch, since ramp-up gain remains unrealized half of the time on Nehalem
> > and ilk.
>
> Both bits seem sane I guess, you change SD_SIBLING_INIT(), right?

Right.

> Threads really do share package resources so it makes sense to set it.
>
> I guess it's back to poking at Nehalem to see what makes it tick..

I asked Santa for a quad socket Nehalem and a portable nuclear reactor
to power it, but the stingy old fart let me down ;-)

sched: fix vmark regression on big machines

SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled,
leading to many cache misses on large machines as we traverse looking for an
idle shared cache to wake to. Change the enabler of select_idle_sibling() to
SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level.

Signed-off-by: Mike Galbraith <efa...@gmx.de>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Reported-by: Lin Ming <ming....@intel.com>
LKML-Reference: <new-submission>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
+				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..8fe7ee8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1508,7 +1508,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_PREFER_SIBLING)
+			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
 				target = select_idle_sibling(p, tmp, target);
 
 			if (target >= 0) {

Lin Ming

Jan 4, 2010, 8:10:02 PM

On Mon, 2010-01-04 at 20:40 +0800, Arjan van de Ven wrote:
> On Mon, 04 Jan 2010 16:15:58 +0800
> Lin Ming <ming....@intel.com> wrote:
>
> > Mike & Peter,
> >
> > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
>
> did this show up only on this cpu?
> (since this is a multi-core-without-shared-cache cpu, it could be that
> we get the topology wrong and think cores share cache where they don't)

A Tulsa machine (16 CPUs, 4P/2Core/HT) also shows a ~8% regression,
and it's now fixed by Mike's patch.

Lin Ming

Mike Galbraith

Jan 4, 2010, 9:50:02 PM

On Tue, 2010-01-05 at 08:44 +0800, Lin Ming wrote:
> On Mon, 2010-01-04 at 20:40 +0800, Arjan van de Ven wrote:
> > On Mon, 04 Jan 2010 16:15:58 +0800
> > Lin Ming <ming....@intel.com> wrote:
> >
> > > Mike & Peter,
> > >
> > > Compared with 2.6.32, volano has ~30% regression with 2.6.33-rc1 &
> > > -rc2. Testing machine: Tigerton Xeon, 16cpus(4P/4Core), 16G memory
> >
> > did this show up only on this cpu?
> > (since this is a multi-core-without-shared-cache cpu, it could be that
> > we get the topology wrong and think cores share cache where they don't)
>
> Tulsa machine(16cpus, 4P/2Core/HT) also has ~8% regression
> and it's now fixed by Mike's patch.

Excellent. Thanks for testing.

-Mike

tip-bot for Mike Galbraith

Jan 21, 2010, 9:00:03 AM

Commit-ID: 50b926e439620c469565e8be0f28be78f5fca1ce
Gitweb: http://git.kernel.org/tip/50b926e439620c469565e8be0f28be78f5fca1ce
Author: Mike Galbraith <efa...@gmx.de>
AuthorDate: Mon, 4 Jan 2010 14:44:56 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Thu, 21 Jan 2010 13:39:03 +0100

sched: Fix vmark regression on big machines

SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.

Reported-by: Lin Ming <ming....@intel.com>
Signed-off-by: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <1262612696.1...@marge.simson.net>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
 include/linux/topology.h |    2 +-
 kernel/sched_fair.c      |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)