
tdm gcc 5.1 slower than 4.7


Prroffessorr Fir Kenobi
Aug 26, 2015, 3:41:23 PM
I'm using the TDM-GCC compilers to build my WinAPI projects. When I test some simple Mandelbrot SSE
code (with other projects it's probably much the same, but I haven't tested everything), 5.1 generates a larger executable, 330 kB against 270 kB (though I didn't recompile everything, only the hot-loop module, which I linked with modules already compiled in 4.7),
and also noticeably slower code: 23.5 ms/frame against 20 ms in 4.7.

It is scary.. what can I do about it? (The rest of the settings etc. are the same; I only renamed the compiler folder from one version to the other.)

(If so, maybe I should download the whole variety of ~10 versions and check which one is best :C,
as I said, this is scary to me)

Bo Persson
Aug 26, 2015, 4:16:35 PM
In this case I would recompile everything, and not blame the speed on
the new compiler if some code is compiled with the old compiler.

Perhaps the optimizer doesn't work well with the mix?


Bo Persson

Prroffessorr Fir Kenobi
Aug 26, 2015, 4:44:52 PM
Well, alright, I can recompile (I've got a large number of separate compile scripts, but I can do it in a few minutes), though I doubt it will change anything..

Prroffessorr Fir Kenobi
Aug 26, 2015, 4:53:16 PM
Alright, checked.. nothing changes.. The main module with the Mandelbrot code is the fundamental one: of those 23.5 ms it consumes 23 ms.

I also notice that when compiling modules with 5.1, the .o files each seem about 1 kB bigger.

I could maybe disassemble both versions and inspect them, though I'm a bit tired now, so maybe tomorrow.
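
(One hedged way to do that, assuming the MinGW binutils are on the path and using the file name given later in the thread: compile with -S to get an assembly listing, or run objdump on the object file.)

c:\mingw\bin\g++ -O2 -S test7.c -o test7.s
c:\mingw\bin\objdump -d test7.o > test7.dis.txt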

Prroffessorr Fir Kenobi
Aug 26, 2015, 5:07:09 PM
PS: I made another quick test,
since I can compile the loop module with 5.1 and link everything with 4.7 (and vice versa):

compile loop 4.7, link 4.7: size 270 kB, speed 20 ms

compile loop 5.1, link 5.1: size 330 kB, speed 23.5 ms

compile loop 4.7, link 5.1: size 330 kB, speed 20 ms

compile loop 5.1, link 4.7: size 270 kB, speed 23.5 ms

It shows that the speed drop comes from compiling with 5.1,
and the size bloat comes from linking with 5.1.
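
For anyone reproducing the mixed combinations, a minimal sketch of how one might drive them (the folder and file names here are assumptions; the thread only says the compiler folders were renamed):

c:\tdm-gcc-5.1\bin\g++ -O2 -c mandel_loop.cpp -o mandel_loop.o    (compile the hot loop with 5.1)
c:\tdm-gcc-4.7\bin\g++ main.o mandel_loop.o -o test.exe           (link with 4.7)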

Prroffessorr Fir Kenobi
Aug 27, 2015, 2:20:13 PM
I also tested other versions of TDM-GCC:

5.1 - slower
4.9 - slower
4.8 - slower
4.7 - faster
4.6 - faster
4.5 - compiles, links, runtime exception in my code
4.4 - compiles, links, runtime exception in my code
3.3 - compiles but doesn't link because of some unresolved symbols (primarily the Raw Input API I use, also the __cxa_guard_acquire / __cxa_guard_release calls)

Vir Campestris
Aug 27, 2015, 4:21:23 PM
What optimisation settings are you using?

Andy

Prroffessorr Fir Kenobi
Aug 27, 2015, 5:02:26 PM
c:\mingw\bin\g++ -std=c++98 -O2 -c -w test7.c -fno-rtti -fno-exceptions -fno-threadsafe-statics -march=core2 -mtune=generic -mfpmath=both -msse2

(-std=c++98 was used to check whether it would maybe speed things up, but it made no difference)
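
As a side note (not something from the thread), one way to see whether the two releases enable a different set of optimisation passes at -O2 is to dump and compare their reported settings; the folder names below are assumptions:

c:\tdm-gcc-4.7\bin\g++ -Q -O2 --help=optimizers > o2-47.txt
c:\tdm-gcc-5.1\bin\g++ -Q -O2 --help=optimizers > o2-51.txt
fc o2-47.txt o2-51.txt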

Prroffessorr Fir Kenobi
Aug 28, 2015, 3:16:43 AM
PS: I also checked the disassembly that gets generated (this is probably where the slowdown comes from, though I haven't read through it yet, as I'm not very skilled at this).

I recently moved from TDM-GCC 4.7 to TDM-GCC 5.1
(by renaming the compiler folders; not even a character of the source or build scripts was changed), and it showed
my test Mandelbrot SSE routine app slowing down from 20 ms per frame to nearly 24 ms per frame.

Why is that? This is the asm from the disassembly (I'm presently too tired to read and compare it):

4.7

.p2align 4,,15
.globl __Z16mandelbrot_n_sseU8__vectorfS_i
.def __Z16mandelbrot_n_sseU8__vectorfS_i; .scl 2; .type 32; .endef
__Z16mandelbrot_n_sseU8__vectorfS_i:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl 8(%ebp), %ecx
movaps %xmm0, (%esp)
testl %ecx, %ecx
js L12
xorps %xmm0, %xmm0
xorl %eax, %eax
movaps %xmm0, %xmm2
movaps %xmm0, %xmm4
jmp L11
.p2align 4,,7
L19:
mulps %xmm4, %xmm2
addl $1, %eax
subps %xmm5, %xmm6
movaps (%esp), %xmm4
cmpl %eax, %ecx
addps %xmm6, %xmm4
addps %xmm2, %xmm2
addps %xmm1, %xmm2
jl L10
L11:
movaps %xmm4, %xmm6
movaps %xmm2, %xmm5
movaps LC5, %xmm7
mulps %xmm4, %xmm6
mulps %xmm2, %xmm5
movaps %xmm6, %xmm3
addps %xmm5, %xmm3
cmpltps LC4, %xmm3
andps %xmm3, %xmm7
movmskps %xmm3, %edx
testl %edx, %edx
addps %xmm7, %xmm0
jne L19
L10:
cvtps2dq %xmm0, %xmm0
leave
ret
L12:
xorps %xmm0, %xmm0
jmp L10
.globl __Z16mandelbrot_n_sseDv4_fS_i
.def __Z16mandelbrot_n_sseDv4_fS_i; .scl 2; .type 32; .endef
.set __Z16mandelbrot_n_sseDv4_fS_i,__Z16mandelbrot_n_sseU8__vectorfS_i

5.1

LHOTB8:
.p2align 4,,15
.globl __Z16mandelbrot_n_sseDv4_fS_i
.def __Z16mandelbrot_n_sseDv4_fS_i; .scl 2; .type 32; .endef
__Z16mandelbrot_n_sseDv4_fS_i:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl 8(%ebp), %ecx
movaps %xmm0, (%esp)
testl %ecx, %ecx
js L11
pxor %xmm0, %xmm0
xorl %edx, %edx
movaps %xmm0, %xmm5
movaps %xmm0, %xmm2
jmp L10
.p2align 4,,10
L18:
mulps %xmm2, %xmm5
addl $1, %edx
subps %xmm6, %xmm4
cmpl %edx, %ecx
addps %xmm5, %xmm5
addps (%esp), %xmm4
addps %xmm1, %xmm5
jl L9
movaps %xmm4, %xmm2
L10:
movaps %xmm2, %xmm4
movaps %xmm5, %xmm6
movaps LC7, %xmm7
mulps %xmm2, %xmm4
mulps %xmm5, %xmm6
movaps %xmm4, %xmm3
addps %xmm6, %xmm3
cmpltps LC6, %xmm3
andps %xmm3, %xmm7
movmskps %xmm3, %eax
testl %eax, %eax
addps %xmm7, %xmm0
jne L18
L9:
cvtps2dq %xmm0, %xmm0
leave
ret
L11:
pxor %xmm0, %xmm0
jmp L9
.section .text.unlikely,"x"
LCOLDE8:
.text
LHOTE8:
.globl __Z16mandelbrot_n_sseU8__vectorfS_i
.def __Z16mandelbrot_n_sseU8__vectorfS_i; .scl 2; .type 32; .endef
.set __Z16mandelbrot_n_sseU8__vectorfS_i,__Z16mandelbrot_n_sseDv4_fS_i
.section .rdata,"dr"
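
(For readers following along, a hedged reconstruction of what the routine above appears to compute, written with SSE intrinsics rather than the poster's actual source, which per the mangled name uses GCC vector types: an escape-time count for four Mandelbrot points at once, accumulated as floats and converted to integers at the end, matching the cmpltps / andps / movmskps / cvtps2dq pattern in both listings.)

#include <emmintrin.h>  /* SSE2 intrinsics: __m128, __m128i, _mm_cvtps_epi32 */

__m128i mandelbrot_n_sse(__m128 cr, __m128 ci, int n)
{
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 one  = _mm_set1_ps(1.0f);
    __m128 x = _mm_setzero_ps();
    __m128 y = _mm_setzero_ps();
    __m128 count = _mm_setzero_ps();

    for (int i = 0; i <= n; ++i)
    {
        __m128 x2 = _mm_mul_ps(x, x);
        __m128 y2 = _mm_mul_ps(y, y);
        /* lanes still inside the radius-2 circle: x^2 + y^2 < 4 (cmpltps) */
        __m128 inside = _mm_cmplt_ps(_mm_add_ps(x2, y2), four);
        /* each surviving lane adds 1.0f to its count (andps with 1.0f, addps) */
        count = _mm_add_ps(count, _mm_and_ps(inside, one));
        if (!_mm_movemask_ps(inside))   /* movmskps: every lane has escaped */
            break;
        __m128 xy = _mm_mul_ps(x, y);
        y = _mm_add_ps(_mm_add_ps(xy, xy), ci);    /* y = 2*x*y + ci */
        x = _mm_add_ps(_mm_sub_ps(x2, y2), cr);    /* x = x^2 - y^2 + cr */
    }
    return _mm_cvtps_epi32(count);                 /* cvtps2dq */
}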

Prroffessorr Fir Kenobi
Aug 28, 2015, 11:20:20 AM
It seems it's very nearly the same, with only some slight changes such as instruction order, and a different variable being spilled to RAM.

It seems like this 5.1 version was just unlucky and came out worse.

Vir Campestris
Aug 29, 2015, 12:57:11 PM
On 27/08/2015 22:02, Prroffessorr Fir Kenobi wrote:
> On Thursday, 27 August 2015 at 22:21:23 UTC+2, Vir Campestris wrote:
>> On 26/08/2015 20:40, Prroffessorr Fir Kenobi wrote:
>> What optimisation settings are you using?
>>
>
> c:\mingw\bin\g++ -std=c++98 -O2 -c -w test7.c -fno-rtti -fno-exceptions -fno-threadsafe-statics -march=core2 -mtune=generic -mfpmath=both -msse2
>
> (-std=c++98 was used to check whether it would maybe speed things up, but it made no difference)
>

OK, I'm not familiar with that compiler. But is O2 really the highest
level? GCC and Clang both have O3.

No point in complaining that the compiler produces slow code when you
haven't told it to do its best.

Andy

Prroffessorr Fir Kenobi
Aug 29, 2015, 1:17:18 PM
-O3 may produce unstable code which has risky optimisations and sometimes even crashes.. I tested it here and -O3 makes slower code

David Brown
Aug 30, 2015, 10:22:29 AM
On 29/08/15 19:17, Prroffessorr Fir Kenobi wrote:
> On Saturday, 29 August 2015 at 18:57:11 UTC+2, Vir Campestris wrote:
>> On 27/08/2015 22:02, Prroffessorr Fir Kenobi wrote:
>>> On Thursday, 27 August 2015 at 22:21:23 UTC+2, Vir Campestris wrote:
>>>> On 26/08/2015 20:40, Prroffessorr Fir Kenobi wrote: What
>>>> optimisation settings are you using?
>>>>
>>>
>>> c:\mingw\bin\g++ -std=c++98 -O2 -c -w test7.c -fno-rtti
>>> -fno-exceptions -fno-threadsafe-statics -march=core2
>>> -mtune=generic -mfpmath=both -msse2
>>>

Why are you using "-mtune=generic" ? Here you are telling gcc that the
code will always run on a "core2" machine (with -march=core2), yet it
should (with "-mtune=generic") make code that runs reasonably on any x86.

Instead, you want "-march=native" for best performance - this tells the
compiler that it can use anything supported by the current processor,
and should optimise code to run on exactly that system. You only need
other flags if you are trying to make binaries that run fast on a range
of different x86 systems - but here your main aim is speed on your own
system.

And you don't want to use flags like "-mfpmath=both" or "-msse2" - use
"-march=native" and let the compiler do the best job, as it already
knows what extensions your cpu supports.

If your code has any floating point, and you don't need absolute
conformance to the IEEE rules (few programs need that), then use
"-ffast-math" to allow the compiler greater freedom in maths optimisations.

>>> (-std=c++98 was used to check whether it would maybe speed things
>>> up, but it made no difference)
>>>
>>
>> OK, I'm not familiar with that compiler. But is O2 really the
>> highest level? GCC and Clang both have O3.

MinGW is GCC on Windows.

>>
>> No point in complaining the compiler is too slow when you haven't
>> told it to do its best.
>>
>> Andy
>
> -O3 may produce unstable code which has risky optimisations and
> sometimes even crashes.. I tested it here and -O3 makes slower code
>

-O3 should not ever produce "risky" optimisations or code that crashes
because of optimisations. If your code crashes with -O3 but not with
-O2, then either your code has bugs or the compiler has bugs. Almost
certainly it is the former, especially when you are mixing in inline
assembly. But if you think it is a real gcc bug, then the gcc folk
would love to know about it.

There are sometimes optimisations that are experimental or known to
cause problems - these are clearly marked in the documentation for that
release of gcc, and always require specific flags (they are never part
of any -O flag).

On the other hand, it is well-known that -O3 does not necessarily lead
to faster code than -O2. It depends very much on the circumstances and
the code in question, as well as the processor in use. For example, -O3
is more aggressive about loop unrolling - cache effects and a good
branch predictor may mean that an unrolled loop is actually slower than
the rolled loop. Some trial and error here is worth the effort for best
code, and you might have fun playing with other options in:

<https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>

You can even change the optimisation levels on different functions, at
least in newer gcc.
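
For example, GCC's "optimize" function attribute lets you build just the hot routine at -O3 while the rest of the file stays at -O2; the function below is only illustrative, not code from this thread:

/* built at -O3 even if the translation unit is compiled with -O2 */
__attribute__((optimize("O3")))
void scale_squares(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * src[i];
}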

