Juha Nieminen <nos...@thanks.invalid> wrote:
> In comp.lang.c++ William Ahern <
wil...@25thandclement.com> wrote:
>> Unsurprising that inlining can improve performance.
>
> The performance boost doesn't come from inlining. One single function call
> wouldn't make a difference when it comes to sorting. It doesn't matter if
> the std::sort() function gets inlined into the calling code or whether
> the compiler decides to make it a separate function. That would only be
> a difference of a few clock cycles at most.
I posted two sets of benchmarks, one for qsort-inline and another for
qsort-extern. qsort-inline clearly has a substantially different performance
profile, and that could only happen if the comparison callback itself was
inlined into the sort routine. See below for the disassembly.
> The difference comes from what could perhaps be called "reverse inlining"
> (which happens thanks to std::sort() being a template). In other words,
> rather than std::sort() being embedded into the calling code, data from
> the calling code gets embedded into the std::sort() implementation.
> (In other words, we are doing the "inlining" kind of in reverse.)
>
> This means that when std::sort() needs to, for example, compare two
> elements, it knows exactly the type of the element and can do the
> comparison "inlined" in its own code, rather than having to call some
> external function provided by the caller.
This is exactly what is happening in the qsort-inline trial I posted: the
comparison callback function is inlined into the qsort implementation. The
reason I couldn't just use -flto with the C runtime qsort is that -flto in
clang and GCC require the intermediate code for both the comparison callback
*and* the qsort implementation itself, and that information isn't included
in Alpine's C runtime--neither the dynamic libc.so nor static libc.a. In
theory the intermediate code *could* be included, just like a system C
runtime can ship with debug symbols.
Below is the dissassembly of some relevant portions of qsort-inline proving
inlining. After recompiling qsort-inline with debug symbols (-g), the
fragment was generated using `objdump -drwC -S -l`, which interleaves
assembly blocks with the originating source block.
This first fragment begins inside qsort (at openbsd-qsort.c line 147, inside
the inner introsort routine) and ends with code from the comparator lambda
in the post C++ benchmark code (at qsort.cc line 51, +/- a couple lines from
the original posted code), without any interveaning calls (only jumps),
showing that the comparator function is inlined into the sorting
implementation itself:
/tmp/comp.lang.c-qsort/openbsd-qsort.c:147
if (n > 7) {
1e40: 48 83 fb 07 cmp $0x7,%rbx
1e44: 74 2c je 1e72 <introsort.constprop.0+0x72>
introsort.constprop.0():
/tmp/comp.lang.c-qsort/qsort.cc:51
return l < r ? -1 : l > r ? 1 : 0;
1e46: 8b 08 mov (%rax),%ecx
1e48: 8b 32 mov (%rdx),%esi
This second fragment begins inside the timedSort lambda, showing that
qsort itself is inlined into the lambda:
operator()<main()::<lambda(unsigned int*, unsigned int*)> >():
/tmp/comp.lang.c-qsort/qsort.cc:35
while( rounds-- )
1315: 4c 8d 78 ff lea -0x1(%rax),%r15
1319: 4c 89 fd mov %r15,%rbp
131c: 0f 1f 40 00 nopl 0x0(%rax)
unsigned int* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<unsigned int>(unsigned int const*, unsigned int const*, unsigned int*):
/usr/include/c++/10.3.1/bits/stl_algobase.h:426
__builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
1320: 48 8b 7c 24 30 mov 0x30(%rsp),%rdi
1325: 48 89 da mov %rbx,%rdx
1328: e8 03 fd ff ff call 1030 <memmove@plt>
__gnu_cxx::__normal_iterator<unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> > >::__normal_iterator(unsigned int* const&):
/usr/include/c++/10.3.1/bits/stl_iterator.h:979
: _M_current(__i) { }
132d: 48 8b 7c 24 30 mov 0x30(%rsp),%rdi
openbsd_qsort():
/tmp/comp.lang.c-qsort/openbsd-qsort.c:225
{
size_t i, maxdepth = 0;
int swaptype;
/* Approximate 2*ceil(lg(n + 1)) */
for (i = n; i > 0; i >>= 1)
1332: 4c 89 e8 mov %r13,%rax
The OpenBSD source code used can be found at
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/qsort.c?annotate=1.18
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/heapsort.c?annotate=1.11