I was wrong about gcc's inlining behavior

Juha Nieminen

Oct 21, 2022, 7:44:59 AM

(I don't know if this applies to other compilers like clang. Could be.)

For the longest time I thought that if gcc sees a function implementation
at the place where the function is being called, then it doesn't matter
if the keyword 'inline' appears before that function implementation or
not. I thought that gcc just ignores that keyword when making inlining
decisions. (After all, why wouldn't it?)

Just now, however, I got a concrete counter-example to this.

I had a number-crunching class declared and implemented locally in a
source file (i.e. inside an unnamed namespace). While I separated
the class definition from the function implementations, I didn't use
any 'inline' keywords.

After a while, when the class grew a bit too large, I decided to move its
function implementations to another source file and, thus, make the class
global (well, inside a namespace, but still global). I figured that this
wouldn't affect its speed because all the number-crunching happened
inside the class, with its functions calling other functions of the
same class.

To my puzzlement, the class became almost twice as slow. To my logic it
shouldn't have, because all the function implementations were visible
where they needed to be as fast as possible.

After wondering about it for a while I decided to try to add the 'inline'
keyword in front of all the function implementations (except the couple
that were called from the outside).

And what do you know, the class became as fast as before.

So it appears that I have been wrong: When it comes to functions visible
at the global scope, the 'inline' keyword can make a huge difference,
even when the implementations are seen by the compiler at the call
places.

It appears that gcc uses a different inlining strategy when dealing
with local and global functions (that have no 'inline' keyword).
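
For readers who want to reproduce this, here is a minimal sketch of the two layouts described above. It is not the original code: `LocalCruncher`, `crunch::Cruncher`, `step`, `run` and `run_local` are made-up names, and the arithmetic is a placeholder. Both layouts are shown in one file so the sketch compiles on its own; in the real case the second class would be declared in a header. Whether gcc inlines `step` in either layout depends on the gcc version, the optimization level and the size of the function.

```cpp
#include <cstddef>

// Layout 1: the class is local to this source file (internal linkage).
namespace {
class LocalCruncher {
public:
    double run(std::size_t n);
private:
    double step(double x);
};

double LocalCruncher::step(double x) { return x * 0.5 + 1.0; }

double LocalCruncher::run(std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc = step(acc);
    return acc;
}
} // unnamed namespace

// Layout 2: the class has external linkage (it would normally live in a header).
namespace crunch {
class Cruncher {
public:
    double run(std::size_t n);  // called from other source files
private:
    double step(double x);      // assumption: only ever called from this file
};

// 'inline' added to the out-of-class definition, as described in the post.
inline double Cruncher::step(double x) { return x * 0.5 + 1.0; }

double Cruncher::run(std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc = step(acc);
    return acc;
}
} // namespace crunch

// Small wrapper so the file-local class can be exercised from outside.
double run_local(std::size_t n) { return LocalCruncher{}.run(n); }
```

Note that marking a function 'inline' only in the source file relies on it never being called from another translation unit. Strictly speaking, the standard expects a function with external linkage that is declared 'inline' somewhere to be declared 'inline' in every translation unit that declares it.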

Michael S

Oct 21, 2022, 10:01:40 AM

Did you compare which functions were inlined before and after the change?
Is it possible that the function(s) that made the difference for performance
are called once?
Assuming -O2, a 'static' function ('static' in the C sense rather than in the C++ sense)
that is called once will be inlined even when it does not pass the other heuristics,
most importantly the size heuristics.
I recommend reading this section of the manual:
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html#Optimize-Options
It will not make things clear, because the matter is *not* simple, but
hopefully it will give you a better picture.
In particular, read about -finline-functions-called-once and -finline-limit.

If your program is not big or release builds are not frequent, you can try -flto.
It reduces the difference between local and global things.
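
One way to answer the "which functions were inlined" question is to ask gcc for its inlining report. The file below is a small sketch with hypothetical names (`square`, `cube`, `twice`, `combine`) and a hypothetical file name; the compile lines in the comment use gcc's -fopt-info and -Winline options. The exact wording of the report differs between gcc versions, so treat this as a starting point rather than a recipe.

```cpp
// inline_report.cpp (hypothetical file name)
//
// Possible ways to compile it:
//   g++ -O2 -c inline_report.cpp -fopt-info-inline-optimized  // report calls that were inlined
//   g++ -O2 -c inline_report.cpp -fopt-info-inline-missed     // report calls that were not
//   g++ -O2 -c inline_report.cpp -Winline                     // warn if an 'inline' function is not inlined

// Internal linkage: all callers are visible in this file.
static int square(int x) { return x * x; }

// External linkage, no 'inline' keyword.
int cube(int x) { return x * x * x; }

// External linkage with the 'inline' keyword.
inline int twice(int x) { return 2 * x; }

// Calls all three, so the report has something to say about each of them.
int combine(int x) { return square(x) + cube(x) + twice(x); }
```

Running the same commands on the fast and the slow variant of the real code and comparing the two reports should show which calls changed.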

Bo Persson

Oct 21, 2022, 10:42:03 AM

On 2022-10-21 at 13:44, Juha Nieminen wrote:
> (I don't know if this applies to other compilers like clang. Could be.)
>
> For the longest time I thought that if gcc sees a function implementation
> at the place where the function is being called, then it doesn't matter
> if the keyword 'inline' appears before that function implementation or
> not. I thought that gcc just ignores that keyword when making inlining
> decisions. (After all, why wouldn't it?)
>
> Just now, however, I got a concrete counter-example to this.
>
> So it appears that I have been wrong: When it comes to functions visible
> at the global scope, the 'inline' keyword can make a huge difference,
> even when the implementations are seen by the compiler at the call
> places.
>

It probably doesn't make a huge difference in general, but if your
function is 'close' to getting inlined, an actual 'inline' keyword might
add a few points to the heuristics, and tip it over the decision point.

Andrey Tarasevich

Oct 21, 2022, 11:47:31 AM

On 10/21/2022 4:44 AM, Juha Nieminen wrote:
> (I don't know if this applies to other compilers like clang. Could be.)
>
> For the longest time I thought that if gcc sees a function implementation
> at the place where the function is being called, then it doesn't matter
> if the keyword 'inline' appears before that function implementation or
> not. I thought that gcc just ignores that keyword when making inlining
> decisions. (After all, why wouldn't it?)

GCC's behavior in this regard is difficult to predict and it seems to be
rather fluid.

I can easily provide an example for you in which GCC will honor the `inline`
keyword when making decisions about inlining a call:

#include <iostream>

void hello()
{
    std::cout << "Hello World" << std::endl;
}

int main()
{
    hello();
}

GCC refuses to inline the `hello()` call even in -O3 mode. But if you
add the `inline` keyword to the declaration of `hello()`, the call will
be inlined starting from -O1.

However, make two calls to `hello()` from `main()`

int main()
{
    hello();
    hello();
}

and the calls will not be inlined even if you slap `static inline` onto
`hello()`.

--
Best regards,
Andrey.

Scott Lurndal

Oct 21, 2022, 12:04:34 PM

That's because of the awful iostream crap.

This version inlines both calls:

$ cat /tmp/c.cpp
#include <stdio.h>

inline void hello()
{
    printf("Hello World\n");
}

int main()
{
    hello();
    hello();
}

0000000000400500 <main>:
400500: 48 83 ec 08 sub $0x8,%rsp
400504: bf a0 06 40 00 mov $0x4006a0,%edi
400509: e8 d2 ff ff ff callq 4004e0 <puts@plt>
40050e: bf a0 06 40 00 mov $0x4006a0,%edi
400513: e8 c8 ff ff ff callq 4004e0 <puts@plt>
400518: 31 c0 xor %eax,%eax
40051a: 48 83 c4 08 add $0x8,%rsp
40051e: c3 retq
40051f: 90 nop

As opposed to the iostream version, which it doesn't make sense to inline.

0000000000400990 <hello()>:
400990: 53 push %rbx
400991: ba 0b 00 00 00 mov $0xb,%edx
400996: be 90 0a 40 00 mov $0x400a90,%esi
40099b: bf 80 10 60 00 mov $0x601080,%edi
4009a0: e8 7b fe ff ff callq 400820 <std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)@plt>
4009a5: 48 8b 05 d4 06 20 00 mov 0x2006d4(%rip),%rax # 601080 <std::cout@@GLIBCXX_3.4>
4009ac: 48 8b 40 e8 mov -0x18(%rax),%rax
4009b0: 48 8b 98 70 11 60 00 mov 0x601170(%rax),%rbx
4009b7: 48 85 db test %rbx,%rbx
4009ba: 74 3c je 4009f8 <hello()+0x68>
4009bc: 80 7b 38 00 cmpb $0x0,0x38(%rbx)
4009c0: 74 1e je 4009e0 <hello()+0x50>
4009c2: 0f b6 43 43 movzbl 0x43(%rbx),%eax
4009c6: bf 80 10 60 00 mov $0x601080,%edi
4009cb: 0f be f0 movsbl %al,%esi
4009ce: e8 6d fe ff ff callq 400840 <std::ostream::put(char)@plt>
4009d3: 5b pop %rbx
4009d4: 48 89 c7 mov %rax,%rdi
4009d7: e9 54 fe ff ff jmpq 400830 <std::ostream::flush()@plt>
4009dc: 0f 1f 40 00 nopl 0x0(%rax)
4009e0: 48 89 df mov %rbx,%rdi
4009e3: e8 e8 fd ff ff callq 4007d0 <std::ctype<char>::_M_widen_init() const@plt>
4009e8: 48 8b 03 mov (%rbx),%rax
4009eb: be 0a 00 00 00 mov $0xa,%esi
4009f0: 48 89 df mov %rbx,%rdi
4009f3: ff 50 30 callq *0x30(%rax)
4009f6: eb ce jmp 4009c6 <hello()+0x36>
4009f8: e8 b3 fd ff ff callq 4007b0 <std::__throw_bad_cast()@plt>
4009fd: 0f 1f 00 nopl (%rax)

Marcel Mueller

Oct 22, 2022, 3:22:03 AM

Am 21.10.22 um 13:44 schrieb Juha Nieminen:

> For the longest time I thought that if gcc sees a function implementation
> at the place where the function is being called, then it doesn't matter
> if the keyword 'inline' appears before that function implementation or
> not. I thought that gcc just ignores that keyword when making inlining
> decisions. (After all, why wouldn't it?)

Well, excessive inlining could be counterproductive. It causes the code
to blow up and this impacts the efficiency of the CPU's instruction cache
and, probably more importantly, the branch target buffer.

I have observed a similar effect at the data cache level. I had an
in-memory database application. At some point we decided to deduplicate
strings to conserve memory. We expected this to significantly reduce the
memory footprint and slightly decrease performance due to the
deduplication effort. Well, that was not true. It reduced the memory
footprint and the application was /faster/ afterwards. The impact of the
large B-tree for deduplication was less than the benefit of the
increased cache efficiency.

So it could be relevant how many callers exist. For an externally
visible function this is unpredictable. So the size limit for automatic
inlining is probably smaller.

In fact the effect of inlining is only that large because the calling
convention is inefficient at least on x86/x64. I observed noticeable
improvements in speed and executable size in the past by choosing mainly
register based function calls w/o standard stack frames.
Unfortunately it is not always possible to do so. While this works well
with Windows (and OS/2), by just passing an appropriate compiler option,
it is unusable on Linux because the standard header files do not declare
the calling convention of the standard library functions. So nothing
works anymore if you change the default.

> It appears that gcc uses a different inlining strategy when dealing
> with local and global functions (that have no 'inline' keyword).

I guess it does for good reasons.

Marcel

Scott Lurndal

Oct 22, 2022, 11:06:09 AM

Marcel Mueller <news.5...@spamgourmet.org> writes:
>Am 21.10.22 um 13:44 schrieb Juha Nieminen:
>> For the longest time I thought that if gcc sees a function implementation
>> at the place where the function is being called, then it doesn't matter
>> if the keyword 'inline' appears before that function implementation or
>> not. I thought that gcc just ignores that keyword when making inlining
>> decisions. (After all, why wouldn't it?)
>
>Well, excessive inlining could be counterproductive. It causes the code
>to blow up and this impacts the efficiency of the CPU's instruction cache
>and, probably more importantly, the branch target buffer.

Indeed, and that is exactly why GCC refused to inline the function.
(Note, using the gcc attribute always_inline will always inline
the function, regardless of internal GCC heuristics).

>
>In fact the effect of inlining is only that large because the calling
>convention is inefficient at least on x86/x64. I observed noticeable
>improvements in speed and executable size in the past by choosing mainly
>register based function calls w/o standard stack frames.
>Unfortunately it is not always possible to do so. While this works well
>with Windows (and OS/2), by just passing an appropriate compiler option,
>it is unusable on Linux because the standard header files do not declare
>the calling convention of the standard library functions. So nothing
>works anymore if you change the default.

Do note that the standard Unix/Linux ABI for x86_64 always passes the
first six integer or pointer arguments in registers when they fit. There is no need
to 'declare calling conventions' because there is only one.
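
The always_inline attribute mentioned above can be written as follows. This is a minimal sketch with made-up function names (`add_scaled`, `sum_scaled`); the attribute is a GCC extension (clang accepts it too), not standard C++.

```cpp
// GCC/clang extension: inline this function regardless of the size heuristics.
__attribute__((always_inline)) inline int add_scaled(int acc, int x) {
    return acc + 3 * x;
}

// The call in the loop body is expected to be expanded in place,
// even in builds where the heuristics alone would have declined.
int sum_scaled(const int* data, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) acc = add_scaled(acc, data[i]);
    return acc;
}
```

The opposite request is `__attribute__((noinline))`, which can be handy when benchmarking the effect of a single inlining decision.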

Juha Nieminen

Oct 24, 2022, 2:45:51 AM

Michael S <already...@yahoo.com> wrote:
> If your program is not big or release builds are not frequent, you can try -flto.
> It reduces difference between local and global things.

I don't think you understand. I didn't move the function implementations to
the header and make them 'inline'. They were still in their own source file.
The only thing I did was add the word 'inline' in front of the functions,
and the code became as fast as previously.

By them being in the global scope they weren't being inlined within each
other. By adding 'inline' it made gcc inline them, making the code fast
once again.

Juha Nieminen

Oct 24, 2022, 3:00:31 AM

Marcel Mueller <news.5...@spamgourmet.org> wrote:
> In fact the effect of inlining is only that large because the calling
> convention is inefficient at least on x86/x64.

The original "classical" reason for inlining in the distant past was that
it would avoid an extra function call, which adds many clock cycles.
However, that's not the reason why inlining makes code faster nowadays.
Function call overhead is extremely small, even inconsequential in most
cases.

By far the main reason why inlining makes code faster nowadays, especially
in number-crunching code, is that it allows the compiler to optimize the
calling code with things like auto-vectorization. A function call (that
doesn't get inlined) in a tight inner loop acts effectively as an
optimization barrier: The compiler can't see what the function is doing
and can't do autovectorization, move things around, etc. (For example,
if the function reads some value from an array, if the compiler can
inline that read, it will help it optimize those array accesses.)
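
The barrier effect can be sketched like this. All names are hypothetical: `load_opaque` is forced out of line with gcc's noinline attribute to imitate a call that does not get inlined, while `load_visible` is an ordinary small function. Comparing the assembly of the two loops (for example with g++ -O3 -S) should show the difference, although the exact outcome depends on the compiler version and the target.

```cpp
#include <cstddef>

// Forced out of line (GCC/clang attribute) to imitate a call that is not inlined.
__attribute__((noinline)) float load_opaque(const float* data, std::size_t i) {
    return data[i];
}

// Ordinary small function that the compiler is free to inline.
inline float load_visible(const float* data, std::size_t i) {
    return data[i];
}

// Every iteration makes a real call, so the loop is unlikely to be vectorized.
void scale_opaque(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) out[i] = k * load_opaque(in, i);
}

// Once load_visible is inlined, this is a plain array traversal that the
// auto-vectorizer can work on.
void scale_visible(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) out[i] = k * load_visible(in, i);
}
```

Both functions compute the same result; only the code the compiler is able to generate for the loop differs.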

>> It appears that gcc uses a different inlining strategy when dealing
>> with local and global functions (that have no 'inline' keyword).
>
> I guess it does for good reasons.

Well, it's enormously inconvenient. Having your number-crunching class
become almost twice as slow just because you did some refactoring and
cleaning up (that shouldn't affect its performance) is annoying. (Good
thing I actually had a benchmark and noticed.)

I really had to go out of my way to make as many functions 'inline' as I
reasonably could, moving some of the function implementations to the
header, making others private (so that they could be declared 'inline'
in the source file), and so on. Even then I had to compromise by
having the class become about 20% slower (in order to avoid
excessive amounts of code in the header file). Really annoying.

Michael S

Oct 24, 2022, 1:12:06 PM

Or maybe you don't understand.
I am guessing that the decisive difference between your former
variant and the slow one is that in the former variant the compiler
was able to figure out that the function is called exactly
once.
For gcc that happens to be the strongest criterion for inlining,
a criterion that beats almost anything short of -O0 and maybe,
in some situations, -Og.
When in your last variant you added the inline keyword, gcc did
the inlining due to some other, weaker heuristic. Something
like: "It's possibly called more than once and it's not
particularly short, but it's not outrageously long either,
so, while I think that inlining here is not a very good idea,
I am willing to respect the explicit wish of the programmer".

With -flto, on the other hand, it's possible that the compiler
will be able to figure out, again, that the function in question
is called once.
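
A minimal sketch of the called-once case, with hypothetical names. `mix_local` has internal linkage and a single caller, which is the situation -finline-functions-called-once is documented to target. `mix_global` has the same body but external linkage, so within one translation unit gcc cannot know how many other callers exist and has to rely on its other heuristics. This only illustrates the guess above; it is not a measurement.

```cpp
#include <cstdint>

namespace {
// Internal linkage: every caller is visible in this file, and there is only one.
std::uint64_t mix_local(std::uint64_t x) {
    std::uint64_t h = x;
    for (int i = 0; i < 8; ++i) h = h * 31u + static_cast<std::uint64_t>(i);
    return h;
}
} // unnamed namespace

// Same body, external linkage: other translation units might call it too.
std::uint64_t mix_global(std::uint64_t x) {
    std::uint64_t h = x;
    for (int i = 0; i < 8; ++i) h = h * 31u + static_cast<std::uint64_t>(i);
    return h;
}

// The only caller of mix_local.
std::uint64_t run_local(std::uint64_t x) { return mix_local(x) ^ x; }

// One caller in this file, but gcc cannot rule out others without -flto.
std::uint64_t run_global(std::uint64_t x) { return mix_global(x) ^ x; }
```

Compiling this with -O2 and looking at whether `mix_local` still exists as a separate function in the object file is one way to see whether the call was integrated.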

Scott Lurndal

Oct 24, 2022, 1:54:37 PM

Michael S <already...@yahoo.com> writes:
>On Monday, October 24, 2022 at 9:45:51 AM UTC+3, Juha Nieminen wrote:
>> Michael S <already...@yahoo.com> wrote:
>> > If your program is not big or release builds are not frequent, you can try -flto.
>> > It reduces difference between local and global things.
>> I don't think you understand. I didn't move the function implementations to
>> the header and make them 'inline'. They were still in their own source file.
>> The only thing I did was add the word 'inline' in front of the functions,
>> and the code became as fast as previously.
>>
>> By them being in the global scope they weren't being inlined within each
>> other. By adding 'inline' it made gcc inline them, making the code fast
>> once again.
>
>Or maybe you don't understand.
>I am guessing that the decisive difference between your former
>variant and the slow one is that in the former variant the compiler
>was able to figure out that the function is called exactly
>once.

>For gcc that happens to be the strongest criterion for inlining,
>a criterion that beats almost anything short of -O0 and maybe,
>in some situations, -Og.

GCC's inlining criteria are fundamentally based on the
size of the function being inlined, as shown in an earlier post.
It certainly has no problem inlining the same function call
several times in a single function, so long as the code
footprint isn't too large.

And if you really, really want to blow up the function's icache
footprint, GCC allows overriding its internal heuristics with the
"always_inline" attribute.