[PATCH 0/3] smp: reduce stack requirements for smp_call_function

Mike Travis

unread,

Sep 5, 2008, 5:50:12 PM9/5/08

to

* Cleanup cpumask_t usages in smp_call_function_mask function chain
to prevent stack overflow problem when NR_CPUS=4096.

* Reduce the number of passed cpumask_t variables in the following
call chain for x86_64:

smp_call_function_mask -->
arch_send_call_function_ipi->
smp_ops.send_call_func_ipi -->
genapic->send_IPI_mask

Since the smp_call_function_mask() is an EXPORTED function, we
cannot change it's calling interface for a patch to 2.6.27.

The smp_ops.send_call_func_ipi interface is internal only and
has two arch provided functions:

arch/x86/kernel/smp.c: .send_call_func_ipi = native_send_call_func_ipi
arch/x86/xen/smp.c: .send_call_func_ipi = xen_smp_send_call_function_ipi
arch/x86/mach-voyager/voyager_smp.c: (uses native_send_call_func_ipi)

Therefore modifying the internal interface to use a cpumask_t pointer
is straight-forward.

The changes to genapic are much more extensive and are affected by the
recent additions of the x2apic modes, so they will be done for 2.6.28 only.

Based on 2.6.27-rc5-git6.

Applies to linux-2.6.tip/master (with FUZZ).

Signed-off-by: Mike Travis <tra...@sgi.com>
---

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Mike Travis

unread,

Sep 5, 2008, 5:50:11 PM9/5/08

to

smp_call_function_mask

Ingo Molnar

unread,

Sep 6, 2008, 9:40:14 AM9/6/08

to

* Mike Travis <tra...@sgi.com> wrote:

> * Cleanup cpumask_t usages in smp_call_function_mask function chain
> to prevent stack overflow problem when NR_CPUS=4096.
>
> * Reduce the number of passed cpumask_t variables in the following
> call chain for x86_64:
>
> smp_call_function_mask -->
> arch_send_call_function_ipi->
> smp_ops.send_call_func_ipi -->
> genapic->send_IPI_mask
>
> Since the smp_call_function_mask() is an EXPORTED function, we
> cannot change it's calling interface for a patch to 2.6.27.
>
> The smp_ops.send_call_func_ipi interface is internal only and
> has two arch provided functions:
>
> arch/x86/kernel/smp.c: .send_call_func_ipi = native_send_call_func_ipi
> arch/x86/xen/smp.c: .send_call_func_ipi = xen_smp_send_call_function_ipi
> arch/x86/mach-voyager/voyager_smp.c: (uses native_send_call_func_ipi)
>
> Therefore modifying the internal interface to use a cpumask_t pointer
> is straight-forward.
>
> The changes to genapic are much more extensive and are affected by the
> recent additions of the x2apic modes, so they will be done for 2.6.28 only.
>
> Based on 2.6.27-rc5-git6.
>
> Applies to linux-2.6.tip/master (with FUZZ).

applied to tip/cpus4096, thanks Mike.

I'm still wondering whether we should get rid of non-reference based
cpumask_t altogether ...

Did you have a chance to look at the ftrace/stacktrace tracer in latest
tip/master, which will show the maximum stack footprint that can occur?

Also, i've applied the patch below as well to restore MAXSMP in a muted
form - with big warning signs added as well.

Ingo

-------------->
From 363a5e3d7b4b69371f21bcafd7fc76e68c73733a Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mi...@elte.hu>
Date: Sat, 6 Sep 2008 15:24:52 +0200
Subject: [PATCH] x86: add MAXSMP

restore MAXSMP, it's a nice debugging helper to trigger various crashes
and problems with maximum sized x86 systems.

Make it depend on EXPERIMENTAL and DEBUG_KERNEL, and inform the user
about the effects (stacksize, overhead, memory usage) of this flag.

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
arch/x86/Kconfig | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ed97f2b..91212c1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -580,10 +580,15 @@ config IOMMU_HELPER

config MAXSMP
bool "Configure Maximum number of SMP Processors and NUMA Nodes"
- depends on X86_64 && SMP && BROKEN
- default n
+ depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL
help
- Configure maximum number of CPUS and NUMA Nodes for this architecture.
+ Configure maximum number of CPUS and NUMA Nodes for this
+ architecture (up to 4096!).
+
+ This can increase memory usage, bigger stack footprint and can
+ add some runtime overhead as well so unless you want a generic
+ distro kernel you likely want to say N.
+
If unsure, say N.

config NR_CPUS

Mike Travis

unread,

Sep 6, 2008, 2:20:14 PM9/6/08

to

Ingo Molnar wrote:
> * Mike Travis <tra...@sgi.com> wrote:
>
>> * Cleanup cpumask_t usages in smp_call_function_mask function chain
>> to prevent stack overflow problem when NR_CPUS=4096.
>>
>> * Reduce the number of passed cpumask_t variables in the following
>> call chain for x86_64:
>>
>> smp_call_function_mask -->
>> arch_send_call_function_ipi->
>> smp_ops.send_call_func_ipi -->
>> genapic->send_IPI_mask
>>
>> Since the smp_call_function_mask() is an EXPORTED function, we
>> cannot change it's calling interface for a patch to 2.6.27.
>>
>> The smp_ops.send_call_func_ipi interface is internal only and
>> has two arch provided functions:
>>
>> arch/x86/kernel/smp.c: .send_call_func_ipi = native_send_call_func_ipi
>> arch/x86/xen/smp.c: .send_call_func_ipi = xen_smp_send_call_function_ipi
>> arch/x86/mach-voyager/voyager_smp.c: (uses native_send_call_func_ipi)
>>
>> Therefore modifying the internal interface to use a cpumask_t pointer
>> is straight-forward.
>>
>> The changes to genapic are much more extensive and are affected by the
>> recent additions of the x2apic modes, so they will be done for 2.6.28 only.
>>
>> Based on 2.6.27-rc5-git6.
>>
>> Applies to linux-2.6.tip/master (with FUZZ).
>
> applied to tip/cpus4096, thanks Mike.

Thanks Ingo! Could you send me the git id for the merge?

>
> I'm still wondering whether we should get rid of non-reference based
> cpumask_t altogether ...

I've got a whole slew of "get-ready-to-remove-cpumask_t's" coming soon.
There are two phases, one completely within the x86 arch and the 2nd hits
the generic smp_call_function_mask ABI (won't be doable as a back-ported
patch to 2.6.27.)

>
> Did you have a chance to look at the ftrace/stacktrace tracer in latest
> tip/master, which will show the maximum stack footprint that can occur?

Hmm, no. I'm using a default config right now as I can boot that pretty
easily. I'll turn on the ftrace thing and check it out.

>
> Also, i've applied the patch below as well to restore MAXSMP in a muted
> form - with big warning signs added as well.

The main thing is to allow the distros to set it manually for their QA
testing of 2.6.27. I'm sure I'll get back bugs because of just that.

(Is there a way to have them know to assign bugzilla's to me if NR_CPUS=4k
is the root of the problem? This is an extremely serious issue for SGI
and I'd like to avoid any delays in me finding out about problems.)

Thanks again,
Mike

Ingo Molnar

unread,

Sep 6, 2008, 2:30:10 PM9/6/08

to

the commits are:

363a5e3: x86: add MAXSMP
01f569c: x86: restore 4096 limit for NR_CPUS
ae74da3: x86: reduce stack requirements for send_call_func_ipi
562d8c2: smp: reduce stack requirements for smp_call_function_mask

the merge into tip/master is:

That merge commit will go away on the next integration run though.

your changes seem to be largely problem-free so far - with two dozen
MAXSMP=y random bootups already.

> > I'm still wondering whether we should get rid of non-reference based
> > cpumask_t altogether ...
>
> I've got a whole slew of "get-ready-to-remove-cpumask_t's" coming
> soon. There are two phases, one completely within the x86 arch and the
> 2nd hits the generic smp_call_function_mask ABI (won't be doable as a
> back-ported patch to 2.6.27.)

ok. None of this can go into v2.6.27 obviously - the stack corruptions
were rather nasty. But it's looking good for v2.6.28 - especially if you
are removing cpumask_t.

> > Did you have a chance to look at the ftrace/stacktrace tracer in
> > latest tip/master, which will show the maximum stack footprint that
> > can occur?
>
> Hmm, no. I'm using a default config right now as I can boot that
> pretty easily. I'll turn on the ftrace thing and check it out.

it's CONFIG_STACK_TRACER=y and rather nifty.

> > Also, i've applied the patch below as well to restore MAXSMP in a
> > muted form - with big warning signs added as well.
>
> The main thing is to allow the distros to set it manually for their QA
> testing of 2.6.27. I'm sure I'll get back bugs because of just that.
>
> (Is there a way to have them know to assign bugzilla's to me if
> NR_CPUS=4k is the root of the problem? This is an extremely serious
> issue for SGI and I'd like to avoid any delays in me finding out about
> problems.)

i dont think there's any easy mapping.

Ingo

Jes Sorensen

unread,

Sep 8, 2008, 5:50:09 AM9/8/08

to

Ingo Molnar wrote:
>> Applies to linux-2.6.tip/master (with FUZZ).
>
> applied to tip/cpus4096, thanks Mike.
>
> I'm still wondering whether we should get rid of non-reference based
> cpumask_t altogether ...

Cool,

I think we should, it's like a ticking bomb waiting to explode on us
eventually. IMHO it was a big mistake to allow cpumask_t being passed
by value in the first place.

Cheers,
Jes

Nick Piggin

unread,

Sep 8, 2008, 6:40:11 AM9/8/08

to

Considering that, unless I'm mistaken, you want to run production systems
with 4096 CPUs at some point, then I would say you should really consider
increasing NR_CPUS _further_ than that in QA efforts, so that we might be
a bit more confident of running production kernels with 4096.

Is that being tried? Setting it to 8192 or even higher during QA seems
like a good idea to me.

Mike Travis

unread,

Sep 8, 2008, 11:50:17 AM9/8/08

to

Jes Sorensen wrote:
> Ingo Molnar wrote:
>>> Applies to linux-2.6.tip/master (with FUZZ).
>>
>> applied to tip/cpus4096, thanks Mike.
>>
>> I'm still wondering whether we should get rid of non-reference based
>> cpumask_t altogether ...
>
> Cool,
>
> I think we should, it's like a ticking bomb waiting to explode on us
> eventually. IMHO it was a big mistake to allow cpumask_t being passed
> by value in the first place.
>
> Cheers,
> Jes

Linus's idea of defining cpumask_t to be a simple long[1] or a pointer to
a cpumask is a good one. Unfortunately, the amount (and breadth) of the
code changes required is daunting, to say the least. In my source tree
there are 892 references to cpumask_t.

But I'll start looking into it asap. I don't know however if "NR_CPUS >
BITS_PER_LONG" is the correct metric to decide when to use pointers. There
must be a better "pain" indicator... ;-)

Thanks,
Mike

Mike Travis

unread,

Sep 8, 2008, 11:50:16 AM9/8/08

to

That's a good idea. I do occasionally set it to 16k (and 64k) for experimental
reasons (and to really highlight where cpumask_t space hogs reside), but I
hadn't thought to do it in the QA environment.

Thanks,
Mike

David Miller

unread,

Sep 8, 2008, 4:00:14 PM9/8/08

to

From: Nick Piggin <nickp...@yahoo.com.au>
Date: Mon, 8 Sep 2008 20:30:41 +1000

> Is that being tried? Setting it to 8192 or even higher during QA seems
> like a good idea to me.

This is a great idea, especially since it will make it even more
painfully obvious that essentially any function local cpumask_t
variable is a bug.

Really, it seems sensible to do something like:

1) Make cpumask_t a pointer.

2) Add cpumask_data_t which is what cpumask_t is now. This gets
used when for the actual storage, and will only get applied to
datastructures that are dynamically allocated. For example, for
the cpu_vm_mask in mm_struct.

3) Type make and fix build failures until they are all gone.

Mike Travis

unread,

Sep 8, 2008, 4:20:12 PM9/8/08

to

David Miller wrote:
> From: Nick Piggin <nickp...@yahoo.com.au>
> Date: Mon, 8 Sep 2008 20:30:41 +1000
>
>> Is that being tried? Setting it to 8192 or even higher during QA seems
>> like a good idea to me.
>
> This is a great idea, especially since it will make it even more
> painfully obvious that essentially any function local cpumask_t
> variable is a bug.

Yes, that's what I have done in the past ... but putting it into the QA
testing would really trigger those stack overflow problems... ;-)

>
> Really, it seems sensible to do something like:
>
> 1) Make cpumask_t a pointer.
>
> 2) Add cpumask_data_t which is what cpumask_t is now. This gets
> used when for the actual storage, and will only get applied to
> datastructures that are dynamically allocated. For example, for
> the cpu_vm_mask in mm_struct.
>
> 3) Type make and fix build failures until they are all gone.

I was wondering if we'd need to be able to default a cpumask_t pointer
argument to be a const and then use a different method for those cases
where it shouldn't be? This would strengthen the compiler type checking
of functions calls.

For example:

proto(cpumask_t mask)

would imply that *mask is a const, whereas

proto(cpumask_var mask)

would indicate it to be non-const?

But then we couldn't use "cpumask_t" as a local declarator... So perhaps
we need something completely different for declaring cpumask arguments?

(I'm trying to figure out how to structure this with the least amount of
source editing.)

Thanks!
Mike

David Miller

unread,

Sep 8, 2008, 4:50:14 PM9/8/08

to

From: Mike Travis <tra...@sgi.com>
Date: Mon, 08 Sep 2008 13:11:59 -0700

> I was wondering if we'd need to be able to default a cpumask_t pointer
> argument to be a const and then use a different method for those cases
> where it shouldn't be? This would strengthen the compiler type checking
> of functions calls.

Yes, of course, the pointer should be const.

[PATCH 0/3] smp: reduce stack requirements for smp_call_function_mask

Mike Travis

Mike Travis

Ingo Molnar

Mike Travis

Ingo Molnar

Jes Sorensen

Nick Piggin

Mike Travis

Mike Travis

David Miller

Mike Travis

David Miller