[PATCH] crypto: ecc - Unbreak the build on arm with CONFIG_KASAN_STACK=y


Lukas Wunner

Apr 8, 2026, 2:16:09 AM
to Herbert Xu, David S. Miller, Andrew Morton, Arnd Bergmann, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Andy Shevchenko
Andrew reports the following build breakage of arm allmodconfig,
reproducible with gcc 14.2.0 and 15.2.0:

crypto/ecc.c: In function 'ecc_point_mult':
crypto/ecc.c:1380:1: error: the frame size of 1360 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]

gcc excessively inlines functions called by ecc_point_mult() (without
there being any explicit inline declarations) and doesn't seem smart
enough to stay below CONFIG_FRAME_WARN.

clang does not exhibit the issue.

The issue only occurs with CONFIG_KASAN_STACK=y because it enlarges the
frame size. This has been a controversial topic a couple of times:

https://lore.kernel.org/r/CAK8P3a3_Tdc-XVPXrJ69j3S9...@mail.gmail.com/

Prevent gcc from going overboard with inlining to unbreak the build.
The maximum inline limit to avoid the error is 101. Use 100 to get a
nice round number per Andrew's preference.

Reported-by: Andrew Morton <ak...@linux-foundation.org> # off-list
Signed-off-by: Lukas Wunner <lu...@wunner.de>
---
crypto/Makefile | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/crypto/Makefile b/crypto/Makefile
index 04e269117589..b3ac7f29153e 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -181,6 +181,11 @@ obj-$(CONFIG_CRYPTO_ZSTD) += zstd.o
obj-$(CONFIG_CRYPTO_ECC) += ecc.o
obj-$(CONFIG_CRYPTO_ESSIV) += essiv.o

+# Avoid exceeding stack frame due to excessive gcc inlining in ecc_point_mult()
+ifeq ($(ARCH)$(CONFIG_KASAN_STACK)$(LLVM),army)
+CFLAGS_ecc.o += $(call cc-option,-finline-limit=100)
+endif
+
ecdh_generic-y += ecdh.o
ecdh_generic-y += ecdh_helper.o
obj-$(CONFIG_CRYPTO_ECDH) += ecdh_generic.o
--
2.51.0

Andy Shevchenko

Apr 8, 2026, 7:31:29 AM
to Lukas Wunner, Herbert Xu, David S. Miller, Andrew Morton, Arnd Bergmann, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Wed, Apr 08, 2026 at 08:15:49AM +0200, Lukas Wunner wrote:
> Andrew reports the following build breakage of arm allmodconfig,
> reproducible with gcc 14.2.0 and 15.2.0:
>
> crypto/ecc.c: In function 'ecc_point_mult':
> crypto/ecc.c:1380:1: error: the frame size of 1360 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
>
> gcc excessively inlines functions called by ecc_point_mult() (without
> there being any explicit inline declarations) and doesn't seem smart
> enough to stay below CONFIG_FRAME_WARN.
>
> clang does not exhibit the issue.
>
> The issue only occurs with CONFIG_KASAN_STACK=y because it enlarges the
> frame size. This has been a controversial topic a couple of times:
>
> https://lore.kernel.org/r/CAK8P3a3_Tdc-XVPXrJ69j3S9...@mail.gmail.com/
>
> Prevent gcc from going overboard with inlining to unbreak the build.
> The maximum inline limit to avoid the error is 101. Use 100 to get a
> nice round number per Andrew's preference.

I think this is not the best solution. We can still refactor the code and avoid
being dependent on the (useful) kernel options.

--
With Best Regards,
Andy Shevchenko


Lukas Wunner

Apr 8, 2026, 9:36:50 AM
to Andy Shevchenko, Herbert Xu, David S. Miller, Andrew Morton, Arnd Bergmann, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
Refactor how? Mark functions "noinline"? That may negatively impact
performance for everyone.

Note that this is a different kind of stack frame exhaustion than the one
in drivers/mtd/chips/cfi_cmdset_0001.c:do_write_buffer(): The latter
is a single function with lots of large local variables, whereas
ecc_point_mult() itself has a reasonable number of variables on the stack,
but gcc inlines numerous function calls that each increase the stack frame.
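
To illustrate the distinction drawn above, here is a minimal C sketch. It is
not taken from crypto/ecc.c: helper_sum(), caller() and NDIGITS are
hypothetical names. It shows a helper with a sizeable local buffer that is
kept out of line with the compiler's noinline attribute, so that its buffer
need not be merged into the caller's stack frame. Whether a given compiler
would otherwise inline the helper, and by how much the caller's frame would
grow, depends on the compiler version and options.

```c
#include <stdint.h>
#include <string.h>

#define NDIGITS 8	/* hypothetical limb count, not taken from crypto/ecc.c */

/*
 * Hypothetical helper with a 128-byte local buffer. The noinline
 * attribute (accepted by gcc and clang) asks the compiler to keep it
 * as a separate function, so the buffer lives in the helper's own
 * frame instead of being merged into the caller's frame.
 */
static __attribute__((noinline)) uint64_t helper_sum(const uint64_t *in)
{
	uint64_t tmp[2 * NDIGITS];
	uint64_t sum = 0;

	/* Fill the buffer with two copies of the input. */
	memcpy(tmp, in, NDIGITS * sizeof(*in));
	memcpy(tmp + NDIGITS, in, NDIGITS * sizeof(*in));

	for (int i = 0; i < 2 * NDIGITS; i++)
		sum += tmp[i];
	return sum;
}

/* The caller has no large locals of its own; it only calls the helper. */
uint64_t caller(const uint64_t *in)
{
	return helper_sum(in) + helper_sum(in);
}
```

The trade-off is the one mentioned above: each out-of-line call costs a branch
and a return, which is why blanket noinline annotations may hurt performance.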

And gcc isn't smart enough to stop inlining when it reaches the maximum
stack frame size allowed by CONFIG_FRAME_WARN.

It's apparently a compiler bug. Why should we work around compiler bugs
by refactoring the code? The proposed patch instructs gcc to limit
inlining and we can easily remove that once the bug is fixed.

As Arnd explains in the above-linked message, stack frame exhaustion
in crypto/ tends to be caused by compiler bugs. There are already two
other workarounds for compiler bugs in crypto/Makefile, one for wp512.o
and another for serpent_generic.o. Amending CFLAGS is how we've dealt
with these issues in the past, not by refactoring code.

Thanks,

Lukas

Andy Shevchenko

Apr 8, 2026, 10:32:54 AM
to Lukas Wunner, Herbert Xu, David S. Miller, Andrew Morton, Arnd Bergmann, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
Ah, that makes the difference, thanks for elaborating!

> And gcc isn't smart enough to stop inlining when it reaches the maximum
> stack frame size allowed by CONFIG_FRAME_WARN.
>
> It's apparently a compiler bug. Why should we work around compiler bugs
> by refactoring the code? The proposed patch instructs gcc to limit
> inlining and we can easily remove that once the bug is fixed.
>
> As Arnd explains in the above-linked message, stack frame exhaustion
> in crypto/ tends to be caused by compiler bugs. There are already two
> other workarounds for compiler bugs in crypto/Makefile, one for wp512.o
> and another for serpent_generic.o. Amending CFLAGS is how we've dealt
> with these issues in the past, not by refactoring code.

Yeah, that's the way we may deal with the issue.

Acked-by: Andy Shevchenko <andriy.s...@linux.intel.com>

Nathan Chancellor

Apr 8, 2026, 4:57:54 PM
to Lukas Wunner, Herbert Xu, David S. Miller, Andrew Morton, Arnd Bergmann, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Andy Shevchenko
On Wed, Apr 08, 2026 at 08:15:49AM +0200, Lukas Wunner wrote:
> +ifeq ($(ARCH)$(CONFIG_KASAN_STACK)$(LLVM),army)

Please use proper Kconfig variables here.

ifeq ($(CONFIG_ARM)$(CONFIG_KASAN_STACK)$(CONFIG_CC_IS_GCC),yyy)

This is both more robust, as $(LLVM) may not be set even though CC=clang
is, and clearer (in my opinion). If all supported versions of GCC
support this flag, you could drop the cc-option at that point.

Arnd Bergmann

Apr 13, 2026, 11:43:04 AM
to Lukas Wunner, Andy Shevchenko, Herbert Xu, David S . Miller, Andrew Morton, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Wed, Apr 8, 2026, at 15:36, Lukas Wunner wrote:
> On Wed, Apr 08, 2026 at 02:31:21PM +0300, Andy Shevchenko wrote:
>> On Wed, Apr 08, 2026 at 08:15:49AM +0200, Lukas Wunner wrote:
>> > Prevent gcc from going overboard with inlining to unbreak the build.
>> > The maximum inline limit to avoid the error is 101. Use 100 to get a
>> > nice round number per Andrew's preference.

Have you checked if the total call chain gets a lower stack usage this
way? Usually the high stack usage is a sign of absolutely awful
code generation when the compiler runs into a corner case that
spills variables onto the stack instead of keeping them in registers.

The question is whether the lower inline limit causes the compiler
to not get into this state at all and produce the expected object
code, or if it just ends up producing multiple functions that
stay under the limit individually but have the same problems with
stack usage and performance as before.

I think your patch can be merged either way, but it would be
good to describe what type of problem we are hitting here.

>> I think this is not the best solution. We can still refactor the code
>> and avoid being dependent on the (useful) kernel options.
>
> Refactor how? Mark functions "noinline"? That may negatively impact
> performance for everyone.

I ran into the same issue last year and worked around it by
turning off kasan for this file, which of course is problematic
for other reasons, and I never submitted my hack for inclusion:

--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -176,6 +176,7 @@ obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
obj-$(CONFIG_CRYPTO_ZSTD) += zstd.o
obj-$(CONFIG_CRYPTO_ECC) += ecc.o
+KASAN_SANITIZE_ecc.o = n
obj-$(CONFIG_CRYPTO_ESSIV) += essiv.o

ecdh_generic-y += ecdh.o

In principle this could be done on a per-function basis.
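
As a rough illustration of the per-function approach: gcc and clang accept a
function attribute that excludes a single function from address sanitizing.
The sketch below uses the raw compiler attribute in plain C. The NO_ASAN macro
and sum_limbs_unsanitized() are hypothetical names, and whether annotating
functions this way would bring ecc_point_mult() under the frame limit is
untested here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the compiler is gcc or clang, which accept this attribute. */
#define NO_ASAN __attribute__((no_sanitize_address))

/*
 * Hypothetical function excluded from address sanitizing. Other
 * functions in the same file remain instrumented when the file is
 * built with the sanitizer enabled.
 */
NO_ASAN uint64_t sum_limbs_unsanitized(const uint64_t *v, size_t n)
{
	uint64_t s = 0;

	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s;
}
```

Like the per-file switch, this gives up sanitizer coverage for the annotated
code, only on a smaller scale.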

Arnd

Lukas Wunner

Apr 13, 2026, 3:46:41 PM
to Arnd Bergmann, Andy Shevchenko, Herbert Xu, David S . Miller, Andrew Morton, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Mon, Apr 13, 2026 at 05:42:39PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 8, 2026, at 15:36, Lukas Wunner wrote:
> > On Wed, Apr 08, 2026 at 02:31:21PM +0300, Andy Shevchenko wrote:
> > > On Wed, Apr 08, 2026 at 08:15:49AM +0200, Lukas Wunner wrote:
> > > > Prevent gcc from going overboard with inlining to unbreak the build.
> > > > The maximum inline limit to avoid the error is 101. Use 100 to get a
> > > > nice round number per Andrew's preference.
>
> Have you checked if the total call chain gets a lower stack usage this
> way? Usually the high stack usage is a sign of absolutely awful
> code generation when the compiler runs into a corner case that
> spills variables onto the stack instead of keeping them in registers.
>
> The question is whether the lower inline limit causes the compiler
> to not get into this state at all and produce the expected object
> code, or if it just ends up producing multiple functions that
> stay under the limit individually but have the same problems with
> stack usage and performance as before.

Attached please find the Assembler output created by gcc -save-temps,
both the original version and the one with limited inlining.

The former requires a 1360 bytes stack frame, the latter 1232 bytes.
E.g. xycz_initial_double() is not inlined into ecc_point_mult(),
together with all its recursive baggage, so the latter version
contains two branch instructions to that function which the former
(original) version does not contain.

At the beginning of the function, it looks like the same register values
are stored to multiple locations on the stack. I assume that's what you
mean by awful code generation? This odd behavior seems more subdued in
the version with limited inlining.

> I think your patch can be merged either way, but it would be
> good to describe what type of problem we are hitting here.

I will respin and I will also take Nathan's suggestion into account.

Thanks,

Lukas
ecc_point_mult_orig.s
ecc_point_mult_limited_inlining.s

Arnd Bergmann

Apr 13, 2026, 4:32:47 PM
to Lukas Wunner, Andy Shevchenko, Herbert Xu, David S . Miller, Andrew Morton, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Mon, Apr 13, 2026, at 21:46, Lukas Wunner wrote:
> On Mon, Apr 13, 2026 at 05:42:39PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 8, 2026, at 15:36, Lukas Wunner wrote:
>
> Attached please find the Assembler output created by gcc -save-temps,
> both the original version and the one with limited inlining.
>
> The former requires a 1360 bytes stack frame, the latter 1232 bytes.
> E.g. xycz_initial_double() is not inlined into ecc_point_mult(),
> together with all its recursive baggage, so the latter version
> contains two branch instructions to that function which the former
> (original) version does not contain.

Thanks!

So it indeed appears that the problem does not go away but only
stays below the arbitrary threshold of 1280 bytes (which was
recently raised). I would not trust that to actually be the
case across all architectures then, as some targets
like mips or parisc tend to use even more stack space than
arm. With your current patch, that means there is a good chance
the problem will come back later.

> At the beginning of the function, it looks like the same register values
> are stored to multiple locations on the stack. I assume that's what you
> mean by awful code generation? This odd behavior seems more subdued in
> the version with limited inlining.

Right. As far as I can tell, the source code is heavily optimized
for performance, but with the sanitizer active this would likely
be several times slower, both from the actual sanitizing and
from the register spilling. I can see how the use of 'u64'
arrays makes this harder for a 32-bit target with limited
available registers.

Arnd

Lukas Wunner

Apr 14, 2026, 12:57:16 AM
to Arnd Bergmann, Andy Shevchenko, Herbert Xu, David S . Miller, Andrew Morton, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Mon, Apr 13, 2026 at 10:32:24PM +0200, Arnd Bergmann wrote:
> On Mon, Apr 13, 2026, at 21:46, Lukas Wunner wrote:
> > On Mon, Apr 13, 2026 at 05:42:39PM +0200, Arnd Bergmann wrote:
> > > On Wed, Apr 8, 2026, at 15:36, Lukas Wunner wrote:
> > Attached please find the Assembler output created by gcc -save-temps,
> > both the original version and the one with limited inlining.
> >
> > The former requires a 1360 bytes stack frame, the latter 1232 bytes.
> > E.g. xycz_initial_double() is not inlined into ecc_point_mult(),
> > together with all its recursive baggage, so the latter version
> > contains two branch instructions to that function which the former
> > (original) version does not contain.
>
> So it indeed appears that the problem does not go away but only
> stays below the arbitrary threshold of 1280 bytes (which was
> recently raised). I would not trust that to actually be the
> case across all architectures then, as some targets
> like mips or parisc tend to use even more stack space than
> arm. With your current patch, that means there is a good chance
> the problem will come back later.

The only 32-bit architectures with HAVE_ARCH_KASAN are:
arm powerpc xtensa

I've cross-compiled ecc.o successfully in an allmodconfig build for
powerpc and xtensa, so arm seems to be the only architecture affected
by the large stack frame issue.

Maybe mips and parisc will see the issue as well but they'd have to
support KASAN first.

The problem is that gcc *knows* that it should warn when the stack
goes above CONFIG_FRAME_WARN and that warning is even promoted to
an error, but gcc happily keeps inlining stuff and goes beyond that
limit. My expectation is it should stop inlining before that happens.
clang doesn't have the same problem.

Completely disabling KASAN for this file doesn't seem like a good option
as this is security-relevant code. On the other hand, disabling inlining
for this file isn't great either: I recall Google is dogfooding KASAN on
internally used phones, and I imagine it would ruin performance for such
use cases (granted, those are likely arm64 devices).

*Limiting* inlining strikes a middle ground between those two extremes.

And I don't want to annotate individual functions as noinline only
because gcc does stupid things on a single architecture.

Thanks,

Lukas

David Laight

Apr 14, 2026, 6:26:06 AM
to Arnd Bergmann, Lukas Wunner, Andy Shevchenko, Herbert Xu, David S . Miller, Andrew Morton, Andrey Ryabinin, Ignat Korchagin, Stefan Berger, linux-...@vger.kernel.org, linux-...@vger.kernel.org, kasa...@googlegroups.com, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino
On Mon, 13 Apr 2026 22:32:24 +0200
"Arnd Bergmann" <ar...@arndb.de> wrote:

> On Mon, Apr 13, 2026, at 21:46, Lukas Wunner wrote:
> > On Mon, Apr 13, 2026 at 05:42:39PM +0200, Arnd Bergmann wrote:
> >> On Wed, Apr 8, 2026, at 15:36, Lukas Wunner wrote:
> >
> > Attached please find the Assembler output created by gcc -save-temps,
> > both the original version and the one with limited inlining.
> >
> > The former requires a 1360 bytes stack frame, the latter 1232 bytes.
> > E.g. xycz_initial_double() is not inlined into ecc_point_mult(),
> > together with all its recursive baggage, so the latter version
> > contains two branch instructions to that function which the former
> > (original) version does not contain.
>
> Thanks!
>
> So it indeed appears that the problem does not go away but only
> stays below the arbitrary threshold of 1280 bytes (which was
> recently raised). I would not trust that to actually be the
> case across all architectures then, as some targets
> like mips or parisc tend to use even more stack space than
> arm. With your current patch, that means there is a good chance
> the problem will come back later.

Not only that, the 'stack frame size' is just a proxy for total
stack use - which is a lot harder to calculate.
I've a cunning plan to use clang's function prototype hashing
to do a static stack calculation that includes indirect calls.
(I did one many years ago for some embedded code that had none.)
I suspect it will find all sorts of code paths that 'blow' the
kernel stack out of the water.
A good bet will be snprintf() calls in unusual error paths
(even after ignoring recursive snprintf() calls and all the %px
modifiers).
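
The core of such a static calculation can be sketched in C as a walk over a
call graph that adds each function's own frame size to the deepest of its
callees. This is only a toy under stated assumptions: struct func_node,
worst_stack() and MAX_FUNCS are hypothetical, the graph is assumed to be
acyclic (no recursion), and resolving indirect calls, the hard part mentioned
above, is not attempted.

```c
#include <stddef.h>

#define MAX_FUNCS 8	/* hypothetical upper bound on callees per function */

struct func_node {
	unsigned int frame;		/* bytes used by this function's own frame */
	int callees[MAX_FUNCS];		/* indices of callees, terminated by -1 */
};

/*
 * Worst-case stack use of the call chain starting at function f:
 * its own frame plus the deepest chain among its callees.
 * Assumes the call graph has no cycles.
 */
static unsigned int worst_stack(const struct func_node *g, int f)
{
	unsigned int deepest = 0;

	for (size_t i = 0; i < MAX_FUNCS && g[f].callees[i] >= 0; i++) {
		unsigned int d = worst_stack(g, g[f].callees[i]);

		if (d > deepest)
			deepest = d;
	}
	return g[f].frame + deepest;
}
```

With such a walk, a function whose own frame stays under the per-frame warning
limit can still sit on top of a call chain that is far deeper in total.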

> > At the beginning of the function, it looks like the same register values
> > are stored to multiple locations on the stack. I assume that's what you
> > mean by awful code generation? This odd behavior seems more subdued in
> > the version with limited inlining.
>
> Right. As far as I can tell, the source code is heavily optimized
> for performance, but with the sanitizer active this would likely
> be several times slower, both from the actual sanitizing and
> from the register spilling. I can see how the use of 'u64'
> arrays makes this harder for a 32-bit target with limited
> available registers.

gcc makes a right 'pig's breakfast' of handling u64 items on 32bit.
It gets really horrid on x86 (which has 8 registers including %sp
and %bp).
I got the impression it sometimes treats a u64 as being two 32bit
values, and other times as a 64bit value held in two registers.
The former tends to generate better code, but the latter happens
if an asm() block (or probably anything else) ends up with an 'A'
constraint for a value in %edx:%eax.
It will spill constant zero words to stack, and do multiplies by
values that are constant zero.
(I think the code generated for a single call to mul_64_64()
will show it all.)

I've just looked at that source.
It seems to be doing 'very wide' arithmetic using u64[].
That will be really horrid on 32bit - it needs to use u32[].
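
As a sketch of the difference, here is multi-limb addition written once over
64-bit limbs and once over 32-bit limbs. Both functions are hypothetical
illustrations rather than the code in crypto/ecc.c, and no claim is made here
about the code any particular gcc version generates for them. With 32-bit
limbs, each step is a single 32-bit add plus carry, which fits the registers
of a 32-bit target more naturally.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n little-endian 64-bit limbs; returns the carry out. */
static uint64_t add_u64_limbs(uint64_t *r, const uint64_t *a,
			      const uint64_t *b, size_t n)
{
	uint64_t carry = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t sum = a[i] + b[i] + carry;

		/* If sum == a[i], b[i] + carry wrapped to 0: carry is unchanged. */
		if (sum != a[i])
			carry = sum < a[i];
		r[i] = sum;
	}
	return carry;
}

/* r = a + b over n little-endian 32-bit limbs; returns the carry out. */
static uint32_t add_u32_limbs(uint32_t *r, const uint32_t *a,
			      const uint32_t *b, size_t n)
{
	uint32_t carry = 0;

	for (size_t i = 0; i < n; i++) {
		/* The 64-bit temporary only holds one 33-bit intermediate sum. */
		uint64_t sum = (uint64_t)a[i] + b[i] + carry;

		r[i] = (uint32_t)sum;
		carry = (uint32_t)(sum >> 32);
	}
	return carry;
}
```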

Stopping some of those functions from being inlined will help.
Even on 64bit I doubt it'll make that much difference to
overall performance.

David

>
> Arnd
>
