Optimization when local variables are not used

Paul

unread,

Nov 16, 2018, 5:23:16 AM11/16/18

to

I posted the below code on another thread.
A respondent said that the code should be "optimized away" by a good compiler.
Presumably, this means that because basicPopcount has no side effects,
and because main() doesn't do anything with the local variables it
generates by calling basicPopcount, many of the basicPopcount calls should
simply be skipped because the compiler has worked out that the computation
would be pointless.
I attempted to test this theory with Godbolt's compiler explorer.
I have no formal education in assembly or computing more generally,
so the feedback is cryptic.
Complete code and compiler explorer data are copy-pasted below.
Compiler is x86-64 gcc 8.2
What does the compiler feedback say about whether the computations
are skipped in the way described above?
Many thanks for your comments, and for all help previously given.
Paul

#include<iostream>
int basicPopcount(int x)
{
int count = 0;
int place = 0;
// An overflow bug can arise in this method for large values of x
// So make sure types are large enough
while(1ull << place <= x)
if(1ull << place++ & x)
++count;

return count;
}

int main()
{
for(int i = 0; i < 10 * 1000 * 1000; ++i)
basicPopcount(i);
}

basicPopcount(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-4], 0
mov DWORD PTR [rbp-8], 0
.L4:
mov eax, DWORD PTR [rbp-8]
mov edx, 1
mov ecx, eax
sal rdx, cl
mov eax, DWORD PTR [rbp-20]
cdqe
cmp rdx, rax
ja .L2
mov eax, DWORD PTR [rbp-20]
movsx rsi, eax
mov eax, DWORD PTR [rbp-8]
lea edx, [rax+1]
mov DWORD PTR [rbp-8], edx
mov ecx, eax
shr rsi, cl
mov rax, rsi
and eax, 1
test rax, rax
setne al
test al, al
je .L4
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 0
.L8:
cmp DWORD PTR [rbp-4], 9999999
jg .L7
mov eax, DWORD PTR [rbp-4]
mov edi, eax
call basicPopcount(int)
add DWORD PTR [rbp-4], 1
jmp .L8
.L7:
mov eax, 0
leave
ret
__static_initialization_and_destruction_0(int, int):
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
cmp DWORD PTR [rbp-4], 1
jne .L12
cmp DWORD PTR [rbp-8], 65535
jne .L12
mov edi, OFFSET FLAT:_ZStL8__ioinit
call std::ios_base::Init::Init() [complete object constructor]
mov edx, OFFSET FLAT:__dso_handle
mov esi, OFFSET FLAT:_ZStL8__ioinit
mov edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
call __cxa_atexit
.L12:
nop
leave
ret
_GLOBAL__sub_I_basicPopcount(int):
push rbp
mov rbp, rsp
mov esi, 65535
mov edi, 1
call __static_initialization_and_destruction_0(int, int)
pop rbp
ret

Ben Bacarisse

unread,

Nov 16, 2018, 6:43:15 AM11/16/18

to

Paul <peps...@gmail.com> writes:

> I posted the below code on another thread.
> A respondent said that the code should be "optimized away" by a good compiler.
> Presumably, this means that because basicPopcount has no side effects,
> and because main() doesn't do anything with the local variables it
> generates by calling basicPopcount, many of the basicPopcount calls should
> simply be skipped because the compiler has worked out that the computation
> would be pointless.

Yup. That's what happens with, for example, with g++ 8.2.0 with -O1 and
above.

> I attempted to test this theory with Godbolt's compiler explorer.
> I have no formal education in assembly or computing more generally,
> so the feedback is cryptic.
> Complete code and compiler explorer data are copy-pasted below.
> Compiler is x86-64 gcc 8.2
> What does the compiler feedback say about whether the computations
> are skipped in the way described above?

No. What you show has a main function that loops calling your
basicPopcount function:

> main:
> push rbp
> mov rbp, rsp
> sub rsp, 16
> mov DWORD PTR [rbp-4], 0
> .L8:
> cmp DWORD PTR [rbp-4], 9999999
> jg .L7
> mov eax, DWORD PTR [rbp-4]
> mov edi, eax
> call basicPopcount(int)
> add DWORD PTR [rbp-4], 1
> jmp .L8
> .L7:
> mov eax, 0
> leave
> ret

Maybe you did not request optimisation?

A couple of points about the function itself...

> int basicPopcount(int x)
> {
> int count = 0;
> int place = 0;
> // An overflow bug can arise in this method for large values of x
> // So make sure types are large enough
> while(1ull << place <= x)
> if(1ull << place++ & x)
> ++count;
>
> return count;
> }

The term "overflow" is a little odd here. There are two things to worry
about. The first is shifting a 1 bit into the sign position and you've
dealt with that but using 1u. The second thing is not really "overflow"
per se. Shifting by the width of the (promoted) left-hand operand is
undefined so the only way to give your loop defined behaviour is use a
type that is wider than int. Unfortunately, unsigned long long int is
not guaranteed to be wider -- you might find some odd implementation
that has decided to make all the int types the same width.

There are various ways round this but I won't give any right now since I
get the impression you like to work things out for yourself.

Finally, you *do* have a theoretical overflow problem here:

> int main()
> {
> for(int i = 0; i < 10 * 1000 * 1000; ++i)
> basicPopcount(i);
> }

10 * 1000 * 1000 is larger than the smallest permitted value for INT_MAX
(or, in more C++ terms, std::numeric_limits<int>::max()). You may never
come across an implementation with that narrow an int, but you could
impress an interviewer by showing that you are aware of the issue.

--
Ben.

David Brown

unread,

Nov 16, 2018, 7:44:11 AM11/16/18

to

On 16/11/18 11:23, Paul wrote:
> I posted the below code on another thread.
> A respondent said that the code should be "optimized away" by a good compiler.
> Presumably, this means that because basicPopcount has no side effects,
> and because main() doesn't do anything with the local variables it
> generates by calling basicPopcount, many of the basicPopcount calls should
> simply be skipped because the compiler has worked out that the computation
> would be pointless.
> I attempted to test this theory with Godbolt's compiler explorer.
> I have no formal education in assembly or computing more generally,
> so the feedback is cryptic.

The code below looks like you did not have optimisations enabled. You
can put whatever flags you want in the command line on godbolt.org.
Useful ones include :

-x c // For C compilation rather than C++
-std=gnu++17 // For gcc extended C++17
-std=c++17 // For standard C++17
-std=gnu11 // For gcc extended C11 (with -x c)

-Wpedantic // For warnings about non-standard C or C++

-O1 or -O2 // Basic optimisation
-Wall -Wextra // Common warnings

-m32 // Generate 32-bit code instead of 64-bit

Another good trick when looking at the code is to avoid complicated
library code. For example, if you want to see the code for calculating
"foo(x) + bar(x)", don't try to pass the result to std::cout or printf.
The important stuff will get lost in the details. Rather, write a
simple function "int foobar(x) { return foo(x) + bar(x); }" so that you
can look at the code for that function.

Öö Tiib

unread,

Nov 16, 2018, 10:04:34 AM11/16/18

to

On Friday, 16 November 2018 12:23:16 UTC+2, Paul wrote:
> I posted the below code on another thread.
> A respondent said that the code should be "optimized away" by a good compiler.
> Presumably, this means that because basicPopcount has no side effects,
> and because main() doesn't do anything with the local variables it
> generates by calling basicPopcount, many of the basicPopcount calls should
> simply be skipped because the compiler has worked out that the computation
> would be pointless.
> I attempted to test this theory with Godbolt's compiler explorer.
> I have no formal education in assembly or computing more generally,
> so the feedback is cryptic.
> Complete code and compiler explorer data are copy-pasted below.
> Compiler is x86-64 gcc 8.2
> What does the compiler feedback say about whether the computations
> are skipped in the way described above?
> Many thanks for your comments, and for all help previously given.

Unoptimized code is specially generated bad performing
one-to-one translation. That is for to make it possible to step
through our algorithm logic with debugger. Additionally the
compiler or library implementation may instrument code that
is built for debugging with run-time checks that standard
does not require. Such thing does not perform in any way
similarly to actual product so there are no point to measure
performance of debug builds.

Optimizer however tries to achieve maximally well performing
results. So part of the calculations done with constant values
it can do compile time. Calculations done in loop with same,
unchanged arguments it may do once and reuse the result.
Calculations whose results cause no side effects it may leave
not done and so on. So naive performance tests of optimized
builds can also give nonsensical results.

I see such optimized main of yours in godbolt.org:

main:
xor eax, eax
ret