
optimizers are overrated


copx

Apr 22, 2008, 7:38:18 PM
Optimizers are overrated

I started learning ASM not long ago to improve my understanding of the
hardware architecture and my ability to optimize C code. The results of my
first experiment were surprising, to say the least. After reading the
chapter on loops in my ASM book I wanted to test whether modern C compilers
are actually as smart as commonly claimed. I chose a most simple loop:
calling putchar 100 times. The first function (foo) uses a typical C-style
loop to test the assumption that "the compiler will optimize that better
than any human could". The second function (bar) is based on my newly
gained knowledge; the loop is basically ASM written in C. Now I am certain
that if I asked here which one is more efficient, you would all reply "the
compiler will most likely generate the same code in both cases" (I have
read such claims countless times here). Well, look at the ASM output below
to see how wrong that assumption is.

/* C Code */

#include <stdio.h>

void foo(void)
{
    int i;

    for (i = 0; i < 100; i++) putchar('a');
}


void bar(void)
{
    int i = 100;

    do {
        putchar('a');
    } while (--i);
}

As I said, damn simple. No nasty side effects, no access to global
variables, etc. The optimizer has no excuses. It should generate optimal
code in both cases. But see the result:


/* GCC 4.3.0 on x86/Windows, -O2 */

/* foo */
L7:
subl $12, %esp
pushl $97
call _putchar

incl %ebx
addl $16, %esp
cmpl $100, %ebx
jne L7


/* bar */
L2:
subl $12, %esp
pushl $97
call _putchar
addl $16, %esp

decl %ebx
jne L2


Comment: See, even the most recent version of probably the most widely used
compiler cannot correctly optimize the simplest of loops! At least GCC
understood the bar loop, so my "write C like ASM" optimization worked.

At this point you might wonder what horrible things an average C compiler
will do when GCC already fails so badly. Here is the gruesome result:


/* lccwin32, optimize on */

/* foo */
_$4:
pushl $97
call _putchar
popl %ecx

incl %edi
cmpl $100,%edi
jl _$4


/* bar */
_$10:
pushl $97
call _putchar
popl %ecx

movl %edi,%eax
decl %eax
movl %eax,%edi
or %eax,%eax
jne _$10


Comment: lcc is unable to optimize the loop, just like GCC, but it adds
insult to injury by actually generating worse code for the ASM-style loop!
So you cannot even optimize the loop yourself!


/* MS Visual C++ 6 /O2 */

For this compiler I had to replace the putchar call with a call to a custom
my_putchar function, because otherwise the compiler replaces the putchar
calls with direct OS API stuff. While this is a good optimization, it is
not the subject of this test and only makes the resulting ASM harder to
read, so I suppressed it.


/* foo */

jmp SHORT $L833
$L834:
mov eax, DWORD PTR _i$[ebp]
add eax, 1
mov DWORD PTR _i$[ebp], eax
$L833:
cmp DWORD PTR _i$[ebp], 100
jge SHORT $L835

push 97
call _my_putchar
add esp, 4

jmp SHORT $L834
$L835:


/* bar */

$L840:
push 97
call _my_putchar
add esp, 4

mov eax, DWORD PTR _i$[ebp]
sub eax, 1
mov DWORD PTR _i$[ebp], eax
cmp DWORD PTR _i$[ebp], 0
jne SHORT $L840


Comment: Amazingly enough, this compiler has found yet another way to screw
up. Would you have thought that each compiler generates different code for
such a simple construct?
I hope you agree that the compiler of the beast deserves the "Worst of
Show" award for this mess. Are MS compilers still this bad?


Ian Collins

Apr 22, 2008, 7:50:01 PM
copx wrote:
> /* C Code */
>
> void foo(void)
> {
> int i;
>
> for (i = 0; i < 100; i++) putchar('a');
> }
>
>
> void bar(void)
> {
> int i = 100;
>
> do {
> putchar('a');
> } while (--i);
> }
>
> As I said, damn simple. No nasty side effects, no access to global
> variables, etc. The optimizer has no excuses. It should generate optimal
> code in both cases. But see the result:
>
You didn't say what you expected to see. Did you optimise for speed or
space, or use a default?

The first compiler I tried partly unrolled both loops generating near
identical code for both, which is what I'd expect for a default
optimisation.

The only difference was the starting condition and test.

--
Ian Collins.

copx

Apr 22, 2008, 8:37:46 PM

"Ian Collins" <ian-...@hotmail.com> schrieb im Newsbeitrag
news:677bt8F...@mid.individual.net...

> copx wrote:
>> /* C Code */
>>
>> void foo(void)
>> {
>> int i;
>>
>> for (i = 0; i < 100; i++) putchar('a');
>> }
>>
>>
>> void bar(void)
>> {
>> int i = 100;
>>
>> do {
>> putchar('a');
>> } while (--i);
>> }
>>
>> As I said, damn simple. No nasty side effects, no access to global
>> variables, etc. The optimizer has no excuses. It should generate optimal
>> code in both cases. But see the result:
>>
> You didn't say what you expected to see.

I wasn't sure what to expect, that's why I tested it.

> Did you optimise for speed or
> space, or use a default?

Read the post again; I specified the compiler flags I used (e.g. -O2 for
GCC). When there was a choice (lcc's UI doesn't offer one) I chose
"optimize for speed".

> The first compiler I tried partly unrolled both loops generating near
> identical code for both, which is what I'd expect for a default
> optimisation.

And which compiler was that? Just curious.

> The only difference was the starting condition and test.

The test is the whole point of optimizing this loop on x86. You do not need
a test (cmp instruction) in the loop if you decrement towards zero instead
of incrementing towards 100. This saves one instruction in the loop body.
The other obvious optimizations are using a register to hold the counter,
and skipping the first check because it is known at compile time that the
loop will always execute at least once. Not one compiler managed to do all
that when fed the "the optimizer will understand" version (foo).

Loop unrolling is a trickier optimization. You sacrifice code size for
speed. Or let's say the hope of speed, because the increased code size
might end up making the code slower in the end.


Ian Collins

Apr 22, 2008, 8:56:05 PM
copx wrote:
> "Ian Collins" <ian-...@hotmail.com> schrieb im Newsbeitrag
> news:677bt8F...@mid.individual.net...
>> copx wrote:
>>> /* C Code */
>>>
>>> void foo(void)
>>> {
>>> int i;
>>>
>>> for (i = 0; i < 100; i++) putchar('a');
>>> }
>>>
>>>
>>> void bar(void)
>>> {
>>> int i = 100;
>>>
>>> do {
>>> putchar('a');
>>> } while (--i);
>>> }
>>>
>>> As I said, damn simple. No nasty side effects, no access to global
>>> variables, etc. The optimizer has no excuses. It should generate optimal
>>> code in both cases. But see the result:
>>>
>> You didn't say what you expected to see.
>
> I wasn't sure what to expect, that's why I tested it.
>
>> Did you optimise for speed or
>> space, or use a default?
>
> Read the post again, I specified the used compiler flags (e.g. -O2 for GCC).
> When there was a choice (lcc's UI doesn't offer one) I chose "optimize for
> speed".
>
Now everyone knows what a specific compiler's flags do.

>> The first compiler I tried partly unrolled both loops generating near
>> identical code for both, which is what I'd expect for a default
>> optimisation.
>
> And which compiler was that? Just curious.
>

Sun c99.

>> The only difference was the starting condition and test.
>
> The test is the whole point of optimizing this loop on x86. You do not need
> a test (cmp instruction) in the loop if you decrement towards zero instead
> of incrementing towards 100. This saves one instruction in the body of loop.

You still have to test for 0, which may or may not be faster.

> The other obvious optimizations are using a register to hold the counter,
> and skipping the first check because it is known at compile time that the
> loop will always execute at least once. Not one compiler managed to do all
> that when fed the "the optimizer will understand" version (foo).
>

c99 appears to, generated

movl $100,%ebx ;/ line : 18
leaq __iob+128(%rip),%r12 ;/ line : 18
.align 16
.CG6.21:
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
movl $97,%edi ;/ line : 18
movq %r12,%rsi ;/ line : 18
call putc ;/ line : 18
addl $-5,%ebx ;/ line : 19
.LU7.69:
testl %ebx,%ebx ;/ line : 19
jne .CG6.21 ;/ line : 19

--
Ian Collins.

robert...@yahoo.com

Apr 22, 2008, 9:11:06 PM
On Apr 22, 6:38 pm, "copx" <c...@gazeta.pl> wrote:
> Optimizers are overrated
>
> I started learning ASM not long ago to improve my understanding of the
> hardware architecture and my ability to optimize C code. The results of my
> first experiment were surprising, to say the least. After reading the chapter
> on loops in my ASM book I wanted to test whether modern C compilers are
> actually as smart as commonly claimed. I chose a most simple loop: calling
> putchar 100 times. The first function (foo) uses a typical C style loop to
> test the assumption that "the compiler will optimize that better than any
> human could". The second function (bar) is based my newly gained knowledge,
> the loop is basically ASM written in C. Now I am certain if I asked here
> which one is more efficient all you guys would reply "the compiler will most
> likely generate the same code in both cases" (I have read such claims
> countless times here). Well, look at the ASM output below to see how wrong
> your assumption is.
>
> /* C Code */
>
> void foo(void)
> {
>  int i;
>
>  for (i = 0; i < 100; i++) putchar('a');
>
> }
>
> void bar(void)
> {
>  int i = 100;
>
>  do {
>   putchar('a');
>  } while (--i);
>
> }

> ...(GCC and LccWin stuff snipped)...


So just how much abuse do *you* think you deserve for testing, and
complaining about, a compiler over a *decade* old? Especially when
you can download a current version for free.

But be that as it may, you've clearly not managed to run the compiler
correctly, because my copy of VC6 generates the following when run
with -O2 or -Ox. In fact, the output you included appears to be the
default, non-optimized output for VC6.


MSVC6 ("cl -Ox -c -Fa test48.c"):

_foo PROC NEAR
; File test48.c
; Line 2
push esi
; Line 6
mov esi, 100 ; 00000064H
$L90:
push 97 ; 00000061H
call _putchar
add esp, 4
dec esi
jne SHORT $L90
pop esi
; Line 10
ret 0
_foo ENDP
_TEXT ENDS
PUBLIC _bar
_TEXT SEGMENT
_bar PROC NEAR
; Line 14
push esi
; Line 15
mov esi, 100 ; 00000064H
$L98:
; Line 18
push 97 ; 00000061H
call _putchar
add esp, 4
; Line 19
dec esi
jne SHORT $L98
pop esi
; Line 23
ret 0
_bar ENDP

A. Sinan Unur

Apr 22, 2008, 10:07:28 PM
"copx" <co...@gazeta.pl> wrote in news:fulste$336$1...@inews.gazeta.pl:

> Optimizers are overrated
>
> I started learning ASM not long ago to improve my understanding of the
> hardware architecture and my ability to optimize C code. The results
> of my first experiment were surprising, to say the least. After reading
> the chapter on loops in my ASM book I wanted to test whether modern C
> compilers are actually as smart as commonly claimed. I chose a most
> simple loop: calling putchar 100 times.

And how many programs have you written which just print 100 a's?

The argument for relying on the compiler has never been that compilers are
perfect. Rather, in most cases, given a clean algorithm, you will get
good enough results.

Then, if there are performance sensitive areas of the code that are
being executed too slowly given your criteria, you think hard about the
algorithm and see if you can improve performance where it matters most
through algorithmic changes.

Maybe then it is time to figure out if there are any remaining
bottlenecks that can be solved through hand tuning.

Sure, you can hand-tune every single part of a large application, but
the world will have moved on by then.

An example of this is seen in the example you chose. Any optimizations
you make in decrementing/incrementing loop counters will be dwarfed by
the fact that you have to make 100 IO calls.

Constructing the string once and making one puts call should improve
things more than fiddling with the loop counter especially if this
function is called repeatedly.

Look at the tests below. It takes about 3 seconds to run your version
(t1.c) 1000 times. Whereas the version with just a single IO call per
invocation (t2.c) (albeit with a longer string) takes only 0.2 seconds
to finish the 1000 calls.

As another check, I changed this second version to randomize the
contents of the string. Even including the calls to rand, 1000
invocations still took about 0.2 seconds.

E:\Test> cat t1.c
#include <stdio.h>

void foo(void) {
    int i;
    for (i = 0; i < 100; i++) {
        putchar('a');
    }

    putchar('\n');
}

int main(void) {
    int i;
    for ( i = 0; i < 1000; ++i ) {
        foo();
    }
    return 0;
}


E:\Test> cl /O2 /nologo t1.c
t1.c

TimeThis : Command Line : t1
TimeThis : Start Time : Tue Apr 22 21:46:28 2008
TimeThis : End Time : Tue Apr 22 21:46:31 2008
TimeThis : Elapsed Time : 00:00:03.047

E:\Test> cat t2.c
#include <stdio.h>
#include <string.h>

void foo(void) {
    char x[101];
    memset( x, 'a', 100 );
    x[100] = 0;
    puts( x );
    return;
}

int main(void) {
    int i;
    for ( i = 0; i < 1000; ++i ) {
        foo();
    }
    return 0;
}


E:\Test> cl /O2 /nologo t2.c
t2.c

TimeThis : Command Line : t2
TimeThis : Start Time : Tue Apr 22 21:50:20 2008
TimeThis : End Time : Tue Apr 22 21:50:20 2008
TimeThis : Elapsed Time : 00:00:00.187

For control, here is what I get with a dummy function:

E:\Test> cat t3.c
void foo(void) { return; }

int main(void) {
    int i = 0;
    for ( i = 0; i < 1000; ++i ) {
        foo();
    }
    return 0;
}

E:Test> cl /O2 /nologo t3.c
t3.c

TimeThis : Command Line : t3
TimeThis : Start Time : Tue Apr 22 21:52:39 2008
TimeThis : End Time : Tue Apr 22 21:52:39 2008
TimeThis : Elapsed Time : 00:00:00.125


In between each invocation, I ran the following program to combat any
kind of cache effects:

E:\Test> cat flushmem.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


int main(void) {
    char *p;
    size_t bufsize = 1024;

    while ( (p = malloc( bufsize )) != NULL ) {
        memset( p, 0xda, bufsize );
        free( p );
        bufsize *= 2;
        printf("%lu\n", (unsigned long)(bufsize / 1024));
    }

    return 0;
}

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

copx

Apr 22, 2008, 10:34:12 PM

"A. Sinan Unur" <1u...@llenroc.ude.invalid> schrieb im Newsbeitrag
news:Xns9A88E10F459F...@127.0.0.1...

> "copx" <co...@gazeta.pl> wrote in news:fulste$336$1...@inews.gazeta.pl:
>
>> Optimizers are overrated
>>
>> I started learning ASM not long ago to improve my understanding of the
>> hardware architecture and my ability to optimize C code. The results
>> of my first experiment were surprising to say at least. After reading
>> the chapter on loops in my ASM book I wanted to test whether modern C
>> compilers are actually as smart as commonly claimed. I chose a most
>> simple loop: calling putchar 100 times.
>
> And how many programs have you written which just print 100 a's?

Come on! Obviously the point of this test was to figure out whether the
compilers are smart enough to optimize a for (i = 0; i < x; i++) loop into
an i = x; do { ... } while (--i); loop. That is a simple micro-optimization
you can do (on x86 and any other platform where the dec/jump trick works)
and exactly the kind of stuff most regulars here would claim "is done by
the compiler anyway". I just proved that this common assumption is wrong.

[snip stuff about IO]

As I said the point of this was not printing characters. Maybe I should have
made it clearer by calling a function with no specified purpose in the loop.


copx

Apr 22, 2008, 10:56:34 PM

"Ian Collins" <ian-...@hotmail.com> schrieb im Newsbeitrag
news:677fp5F...@mid.individual.net...

>> The test is the whole point of optimizing this loop on x86. You do not
>> need
>> a test (cmp instruction) in the loop if you decrement towards zero
>> instead
>> of incrementing towards 100. This saves one instruction in the body of
>> loop.
>
> You still have to test for 0, which may or may not be faster.

No you don't, and that's the point. Look at the output of GCC for the "bar"
function, for example; see any cmp? The dec instruction sets the necessary
flags to terminate the loop when the counter reaches zero. That's a feature
of the x86 instruction set (and others) that a compiler could/should
exploit to create more efficient loops.

Yet another way to translate this simple construct. Compilers certainly have
personality!

A. Sinan Unur

Apr 22, 2008, 11:00:07 PM
"copx" <co...@gazeta.pl> wrote in news:fum77a$qjc$1...@inews.gazeta.pl:

>
> "A. Sinan Unur" <1u...@llenroc.ude.invalid> schrieb im Newsbeitrag
> news:Xns9A88E10F459F...@127.0.0.1...
>> "copx" <co...@gazeta.pl> wrote in news:fulste$336$1...@inews.gazeta.pl:
>>
>>> Optimizers are overrated
>>>
>>> I started learning ASM not long ago to improve my understanding of
>>> the hardware architecture and my ability to optimize C code. The
>>> results of my first experiment were surprising, to say the least.
>>> After reading the chapter on loops in my ASM book I wanted to test
>>> whether modern C compilers are actually as smart as commonly
>>> claimed. I chose a most simple loop: calling putchar 100 times.
>>
>> And how many programs have you written which just print 100 a's?
>
> Come on! Obviously the point of this test was to figure out whether
> the compilers are smart enough to optimize

And the point of my post is that the programmer ought to be smart enough to
understand when the effort spent beating the compiler is worth it.

> loop to a i = x do {}while (--i) loop. That is a simple
> micro-optimization you can do (on x86 and any other platform where the
> dec/jump trick works) and exactly the kind of stuff most regulars here
> would claim "is done by the compiler anyway". I just proved that this
> common assumption is wrong.

And I don't see why anyone ought to care even if you are correct.

Sinan

copx

Apr 22, 2008, 11:10:35 PM

<robert...@yahoo.com> wrote in message
news:104814e3-8638-4180...@f36g2000hsa.googlegroups.com...
[snip]

>So just how much abuse do *you* think you deserve for testing, and
>complaining about, a compiler over a *decade* old?

Eh, none? Since when is testing old software offensive?
I did not try to claim that this reflects the current performance of the MS
compiler. In fact, I explicitly asked "Are MS compilers still this bad?"

>Especially when you can download a current version for free.

The current version is not simply better than the old one. AFAIK software
built with VS2008 won't run on older versions of Windows, or so I have
been told.

>But be that as it may, you've clearly not managed to run the compiler
>correctly

Maybe, I have never used the command line version before and rarely use VC
anyway.

>because my copy of VC6 generates the following when run
>with -O2 or -Ox. In fact, the output you included appears to be the
>default, non-optimized output for VC6.

I compiled with /O2. So -O2 is the correct form? Strange, I could swear cl
/help suggested the /O2 form... But maybe I confused something there.

> MSVC6 ("cl -Ox -c -Fa test48.c"):

[snip]

If that is true, VC6 moves from the bottom to the top: the first compiler
which actually manages to properly optimize the for loop!

Eric Sosman

Apr 22, 2008, 11:14:07 PM
copx wrote:
> "A. Sinan Unur" <1u...@llenroc.ude.invalid> schrieb im Newsbeitrag
> news:Xns9A88E10F459F...@127.0.0.1...
>> "copx" <co...@gazeta.pl> wrote in news:fulste$336$1...@inews.gazeta.pl:
>>
>>> Optimizers are overrated
>>>
>>> I started learning ASM not long ago to improve my understanding of the
>>> hardware architecture and my ability to optimize C code. The results
>>> of my first experiment were surprising to say at least. After reading
>>> the chapter on loops in my ASM book I wanted to test whether modern C
>>> compilers are actually as smart as commonly claimed. I chose a most
>>> simple loop: calling putchar 100 times.
>> And how many programs have you written which just print 100 a's?
>
> Come on! Obviously the point of this test was to figure out whether the
> compilers are smart enough to optimize a for (i = 0; i < x; i++) loop to a i
> = x do {}while (--i) loop. That is a simple micro-optimization you can do
> (on x86 and any other platform where the dec/jump trick works) and exactly
> the kind of stuff most regulars here would claim "is done by the compiler
> anyway". I just proved that this common assumption in wrong.

If you give your car a nice fresh coat of wax, do you
improve its fuel efficiency by reducing air drag or hurt
the efficiency by increasing weight?

Compiler writers do not stay up nights trying to figure
out how to optimize silly loops; they give their attention
to getting more "serious" programs to run well. They worry
about how to map many local variables to a few CPU registers;
you don't have enough variables to notice. They worry about
eliminating common sub-expressions; you don't have any and
again can't see any optimization. They worry about strength
reductions; your sample has no strength to be reduced. They
worry about cache lines, about prefetching, about filling the
various instruction pipelines, about branch prediction ...
and you are oblivious to all of these.

You'd probably call Bach overrated because he never wrote
any good kazoo concertos.

--
Eric Sosman
eso...@ieee-dot-org.invalid

copx

Apr 22, 2008, 11:41:10 PM

"A. Sinan Unur" <1u...@llenroc.ude.invalid> schrieb im Newsbeitrag
news:Xns9A88E9FC7166...@127.0.0.1...
[snip]

> And the point of my post is that the programmer ought to be smart enough
> to understand when the effort spent beating the compiler is worth it.

I wasn't trying to beat the compiler. I just measured its performance.

I disagree with your point that optimization should be limited to profiled
bottlenecks and choosing the right algorithm. In fact, I suspect that this
common belief (which really isn't news to me - you hear it from > 90% of
all programmers these days) is responsible for software disasters like
Microsoft Vista. If you ignore efficiency issues completely while writing
your program, you will not be able to just shave away all the wasted RAM
and CPU cycles at the end by rewriting a single central algorithm after
profiling. Whatever; when to optimize or not is not the topic of this
thread. In the professional world the answer to that question is
determined by market forces anyway, I guess.

What I am trying to discuss here is what you can expect a C compiler to
optimize and what not. I posted my results to counter the misinformation
which has been spread here in the past (without bad intent most of the
time, for sure).

[snip]

robert...@yahoo.com

Apr 22, 2008, 11:58:48 PM
On Apr 22, 10:41 pm, "copx" <c...@gazeta.pl> wrote:
> I wasn't trying to beat the compiler. I just measured its performance.


Of course, you didn't do that either - just counting generated
instructions is hardly definitive on modern processors. For example,
dec/jne is faster on many processors than the single LOOP instruction.
Also, current versions of MSVC generate "sub esi,1" rather than
"dec esi", since the former is faster on many CPUs.

Modern CPUs are complex enough that actually measuring performance is
the only way to tell if a particular optimization is successful.

copx

Apr 23, 2008, 12:00:35 AM

"Eric Sosman" <eso...@ieee-dot-org.invalid> schrieb im Newsbeitrag
news:8PCdne2DTcUcNJPV...@comcast.com...
[snip]

> If you give your car a nice fresh coat of wax, do you
> improve its fuel efficiency by reducing air drag or hurt
> the efficiency by increasing weight?
>
> Compiler writers do not stay up nights trying to figure
> out how to optimize silly loops; they give their attention
> to getting more "serious" programs to run well.

"Serious" programs probably contain many of such "silly loops". And what
exactly is "silly" about loops based on incrementing/decrementing an integer
value counting towards a maximum/minimum? Have you written many "serious"
programs without them?
A loop like this executed lets say a million times means one million wasted
instructions. Of course, that won't matter most the the time, I am not
trying to argue with that.

> They worry about how to map many local variables to a few CPU registers;
> you don't have enough variables to notice. They worry about
> eliminating common sub-expressions; you don't have any and
> again can't see any optimization. They worry about strength
> reductions; your sample has no strength to be reduced. They
> worry about cache lines, about prefetching, about filling the
> various instruction pipelines, about branch prediction ...
> and you are oblivious to all of these.

Thanks for giving me ideas for what to test next!

> You'd probably call Bach overrated because he never wrote
> any good kazoo concertos.

Error: analogy mismatch

copx

unread,
Apr 23, 2008, 12:09:46 AM4/23/08
to

<robert...@yahoo.com> wrote in message
news:dea1a1cb-9269-45dd...@z72g2000hsb.googlegroups.com...

On Apr 22, 10:41 pm, "copx" <c...@gazeta.pl> wrote:
> I wasn't trying to beat the compiler. I just measured its performance.
>
>Of course, you didn't do that either - just counting generated
>instructions is hardly definitive on modern processors.

Ok, good point. I will measure execution time next time, too.

vipp...@gmail.com

Apr 23, 2008, 1:53:10 AM
On Apr 23, 6:41 am, "copx" <c...@gazeta.pl> wrote:
> "A. Sinan Unur" <1...@llenroc.ude.invalid> schrieb im Newsbeitragnews:Xns9A88E9FC7166...@127.0.0.1...

> [snip]
>
> > And the point of my post is that the programmer ought to be smart
> > enough to understand when the effort spent beating the compiler is
> > worth it.
>
> I wasn't trying to beat the compiler. I just measured its performance.
>
> I disagree with your point that optimization should be limited to profiled
> bottlenecks and choosing the right algorithm. In fact, I suspect that this
> common belief (which really isn't news to me - you hear that from > 90% of
> all programmers these days) is responsible for software disasters like
> Microsoft Vista.
So, you read a few pages of an assembly book, and suddenly you know
better than compiler and OS devs?
Optimization is hardly about the smallest output possible.
Your remark about the 'common belief' that leads to 'software
disasters' is just stupid, to the point I'd claim you are either a
troll or very new to programming.
My suggestion is that you finish the book you are reading, then read
your processor's manuals, then read about various compiler optimization
methods, and then start this discussion again.
If you want a compiler for Intel that optimizes very well, try icc
(the Intel C++ compiler). It's developed by Intel, and I'd expect it
to be the best compiler for Intel products.

user923005

Apr 23, 2008, 1:56:12 AM

From this:

#include <stdio.h>


/* C Code */
void foo(void)
{
    int i;

    for (i = 0; i < 100; i++)
        putchar('a');
}

void bar(void)
{
    int i = 100;
    do {
        putchar('a');
    } while (--i);
}

int main(void)
{
    foo();
    bar();
    return 0;
}

I got this:
; Listing generated by Microsoft (R) Optimizing Compiler Version
14.00.50727.762

TITLE c:\tmp\foo.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

EXTRN __imp__putchar:PROC
PUBLIC @bar@0
; Function compile flags: /Ogtpy
; File c:\tmp\foo.c
; COMDAT @bar@0
_TEXT SEGMENT
@bar@0 PROC ; COMDAT

; 12 : {

00000 56 push esi

; 13 : int i = 100;

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL3@:

; 14 : do {
; 15 : putchar('a');

00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4

; 16 : } while (--i);

00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL3@
0001c 5f pop edi
0001d 5e pop esi

; 17 : }

0001e c3 ret 0
@bar@0 ENDP
_TEXT ENDS
PUBLIC @foo@0
; Function compile flags: /Ogtpy
; COMDAT @foo@0
_TEXT SEGMENT
@foo@0 PROC ; COMDAT

; 4 : {

00000 56 push esi

; 5 : int i;
; 6 :
; 7 : for (i = 0; i < 100; i++)

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL3@:

; 8 : putchar('a');

00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4
00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL3@
0001c 5f pop edi
0001d 5e pop esi

; 9 : }

0001e c3 ret 0
@foo@0 ENDP
_TEXT ENDS
PUBLIC _main
; Function compile flags: /Ogtpy
; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 20 : {

00000 56 push esi

; 21 : foo();

00001 8b 35 00 00 00
00 mov esi, DWORD PTR __imp__putchar
00007 57 push edi
00008 bf 64 00 00 00 mov edi, 100 ; 00000064H
0000d 8d 49 00 npad 3
$LL5@main:
00010 6a 61 push 97 ; 00000061H
00012 ff d6 call esi
00014 83 c4 04 add esp, 4
00017 83 ef 01 sub edi, 1
0001a 75 f4 jne SHORT $LL5@main

; 22 : bar();

0001c bf 64 00 00 00 mov edi, 100 ; 00000064H
$LL10@main:
00021 6a 61 push 97 ; 00000061H
00023 ff d6 call esi
00025 83 c4 04 add esp, 4
00028 83 ef 01 sub edi, 1
0002b 75 f4 jne SHORT $LL10@main
0002d 5f pop edi

; 23 : return 0;

0002e 33 c0 xor eax, eax
00030 5e pop esi

; 24 : }

00031 c3 ret 0
_main ENDP
_TEXT ENDS
END

I would be interested to see an example of a C algorithm, then the
same algorithm in assembly, and then a demonstration that yours is
actually measurably faster.

Paul Brettschneider

Apr 23, 2008, 4:20:30 AM
copx wrote:

gcc tends to produce horrible code (I could show you examples that are
downright silly) and x86 is a horrible architecture. But curiously, gcc
3.4.1 *does* "optimise" your two functions to the same code on PA-RISC
(various assembler directives and stack-frame setup code snipped):

foo
[...]
ldi 100,%r3
ldo -1(%r3),%r3
L$0011
bl putchar,%r2
ldi 97,%r26
comib,<>,n 0,%r3,L$0011
ldo -1(%r3),%r3
[...]

bar
[...]
ldi 100,%r3
ldo -1(%r3),%r3
L$0018
bl putchar,%r2
ldi 97,%r26
comib,<>,n 0,%r3,L$0018
ldo -1(%r3),%r3
[...]

Note: This code looks strange due to branch delay slots.

So one has to wonder why the same optimisation is not done on x86. Maybe it
simply doesn't matter nowadays.

Bartc

Apr 23, 2008, 6:34:13 AM
"copx" <co...@gazeta.pl> wrote in message
news:fulste$336$1...@inews.gazeta.pl...
> Optimizers are overrated


> void foo(void)
> {
> int i;
> for (i = 0; i < 100; i++) putchar('a');
> }

> void bar(void)
> {
> int i = 100;
>
> do {
> putchar('a');
> } while (--i);
> }
>
> As I said, damn simple. No nasty side effects, no access to global
> variables, etc. The optimizer has no excuses. It should generate optimial
> code in both cases. But see the result:

This is not very interesting code to optimise; there's nothing much for the
compiler to get its teeth into. So perhaps the attempts were half-hearted.

I never believed compilers were getting as good as hand-coded asm, but
recently I almost came to believe the hype.

Just yesterday I finished porting the key module of a byte-code
interpreter to C. Testing this on only one input program, the results with
the two compilers I tried (x86 target) were:

lccwin32 3000ms with -O
gcc 3.4.5 1600ms with no optimisation (any -O switches made much slower)
original 800ms, using tight mix of asm and compiled code (not C).

And these times were /after/ considerable tweaking of the C code.

Have some avenues to explore yet, but so far disappointing.

--
Bart

CBFalconer

Apr 23, 2008, 7:21:12 AM
copx wrote:
>
... snip ...

>
> Come on! Obviously the point of this test was to figure out whether
> the compilers are smart enough to optimize a for (i = 0; i < x; i++)
> loop to a i = x do {}while (--i) loop. That is a simple micro-
> optimization you can do (on x86 and any other platform where the
> dec/jump trick works) and exactly the kind of stuff most regulars
> here would claim "is done by the compiler anyway". I just proved
> that this common assumption in wrong.

No you didn't. You proved that makeing the termination depend on
the variable reaching 0 differs from having the variable reach 100.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.


** Posted from http://www.teranews.com **

Eric Sosman

Apr 23, 2008, 8:41:38 AM
copx wrote:
> [...]

> A loop like this executed lets say a million times means one million wasted
> instructions. Of course, that won't matter most the the time, I am not
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> trying to argue with that.

Well put.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Walter Banks

Apr 23, 2008, 9:08:21 AM

copx wrote:

> Optimizers are overrated
>


> The first function (foo) uses a typical C style loop to
> test the assumption that "the compiler will optimize that better than any
> human could". The second function (bar) is based my newly gained knowledge,
> the loop is basically ASM written in C. Now I am certain if I asked here
> which one is more efficient all you guys would reply "the compiler will most
> likely generate the same code in both cases"

You made the assumption that the two pieces of code are equivalent;
they are not.

Regards,

--
Walter Banks
Byte Craft Limited
Tel. (519) 888-6911
http://www.bytecraft.com
wal...@bytecraft.com


Willem

Apr 23, 2008, 11:49:45 AM
Walter wrote:
) copx wrote:
)> Optimizers are overrated
)>
)> The first function (foo) uses a typical C style loop to
)> test the assumption that "the compiler will optimize that better than any
)> human could". The second function (bar) is based my newly gained knowledge,
)> the loop is basically ASM written in C. Now I am certain if I asked here
)> which one is more efficient all you guys would reply "the compiler will most
)> likely generate the same code in both cases"
)
) You made the assumption that the two pieces of code are equivalent
) they are not.

They are too. The variable 'i' is not used for anything but
counting the loop. I've seen GCC produce the same code for
the two snippets and that was several years back. Before 3.0.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Keith Thompson

Apr 23, 2008, 12:54:37 PM
CBFalconer <cbfal...@yahoo.com> writes:
> copx wrote:
> ... snip ...
>>
>> Come on! Obviously the point of this test was to figure out whether
>> the compilers are smart enough to optimize a for (i = 0; i < x; i++)
>> loop to a i = x do {}while (--i) loop. That is a simple micro-
>> optimization you can do (on x86 and any other platform where the
>> dec/jump trick works) and exactly the kind of stuff most regulars
>> here would claim "is done by the compiler anyway". I just proved
>> that this common assumption in wrong.
>
> No you didn't. You proved that makeing the termination depend on
> the variable reaching 0 differs from having the variable reach 100.

But it doesn't have to. Since i is used only to control the number of
times the loop is executed, a sufficiently clever optimizer *could*
cause it to iterate in either order; as long as the loop executes the
expected number of times, it's a legitimate transformation.

On the other hand, copx is assuming that a test against 0 is going to
be better than a test against 100. Even if the test against 0 is
implicit (the decrement instruction implicitly sets a flag), that may
or may not be the case. Without either measurement (which copx has
not done) or intimate knowledge of how the x86 CPU behaves (which I
lack, and which might vary across models anyway), we can't tell
whether the optimization would be worthwhile.

The latter raises a (possibly) interesting point. The value of some
optimizations might depend on which model of the x86 family the code
is going to run on.

In any case, without *measuring* the performance it doesn't make much
sense to talk about how good or bad the optimizations are.

--
Keith Thompson (The_Other_Keith) <ks...@mib.org>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Flash Gordon

Apr 23, 2008, 2:33:55 PM
Bartc wrote, On 23/04/08 11:34:

<snip>

> I never believed compilers were getting as good as hand-coded asm, but
> recently I almost came to believe the hype.
>
> Just yesterday I finished porting to C, the key module of a byte-code
> interpreter. Testing this on only one input program, the results on two
> compilers I tried (x86 target) were:
>
> lccwin32 3000ms with -O
> gcc 3.4.5 1600ms with no optimisation (any -O switches made much slower)
> original 800ms, using tight mix of asm and compiled code (not C).
>
> And these times were /after/ considerable tweaking of the C code.
>
> Have some avenues to explore yet, but so far disappointing.

Why do you think comparing your original code against using a compiler
*without* optimisation has any bearing on how good the compiler is if you
tell it to produce good code? Try actually telling the compiler to
optimise and telling it what type of processor to optimise for and what
the minimum processor it can assume is (this defaults to an 80386 for
Intel I believe, do you really need to support processors that old?).

Anonymous

Apr 23, 2008, 3:05:14 PM
On Wed, 23 Apr 2008 01:38:18 +0200, copx wrote:

> Optimizers are overrated

In some ways yes. In others, no.

For example, in the code you put, GCC inserts various alignment
instructions. In larger segments of code, it'll reorder instructions to
make pipelining more efficient (non-x86 will benefit more from this,
especially embedded processors that don't do their own instruction
reordering at runtime). I doubt that any human could write tens of
millions of lines of assembly to carefully consider where to insert nops
to align loop boundaries, or how to reorder instructions to prevent data
dependency stalls in the pipeline.

On the flip side of the coin, for a small function, a human will almost
always be able to do a better job of optimizing than the compiler,
especially since the semantics of the code are in the user's mind, and
they can ignore the corner cases that the C compiler has to get right
when compiling C code, and doing certain optimizations.

Finally, the loop optimization you mentioned is quite unimportant. Why?
Because a single cmp instruction is more or less lost in the noise when
you benchmark it. For a billion iterations of an _empty_ loop (which I
had to write by hand, or GCC's optimizer would remove the loop entirely),
it leads to a 15% difference. For any code which did a single memory
access within the loop (writing to a volatile int), there was on average
a 3% performance difference.

Conclusion:
Draw your own! (ok, ok... I think both assembly and C have their
place. I won't spend my time writing lots of dumb assembly, but I have no
problems optimizing a hot-spot in my code by dropping into it for an
inner loop.)

Anonymous

Apr 23, 2008, 3:09:36 PM
On Wed, 23 Apr 2008 09:54:37 -0700, Keith Thompson wrote:

> The latter raises a (possibly) interesting point. The value of some
> optimizations might depend on which model of the x86 family the code is
> going to run on.

Of course. On the Pentium 4, for example, the 'inc' and 'dec'
instructions are slower than 'add $1,%reg', and 'sub $1,%reg'
respectively. However, on the Core architecture, the inc and dec are
preferred, as they aren't any slower than add/sub, but they are smaller,
and thus reduce code size.

user923005

Apr 23, 2008, 3:49:48 PM
On Apr 23, 9:54 am, Keith Thompson <ks...@mib.org> wrote:

The right way to optimize this is to puts() a row of the chars instead
of looping. Unrolling the loop is just silly.
Optimization is only important if you can measure the difference
anyway, and even in that case it does not matter if the project
already meets the performance specifications.
The right way to write code is to write clear, optimal algorithms.
The little tweaky things you can do in assembly will sometimes make a
difference (e.g. I often use assembly for bit manipulation in chess
programs) but for the most part it will be a complete waste of time
and money. If someone has not performed a profile and yet they
already have begun to optimize, it shows that they are not thinking
clearly.

IMO-YMMV

Knight

Apr 23, 2008, 3:59:42 PM
Hey, I am still waiting for the Super Compiler that would understand
that

for (i = 0; i < 100; i++) putchar('a');
is the same as
putchar('a');
Is that too much to hope for?

Flash Gordon

Apr 23, 2008, 3:15:47 PM
copx wrote, On 23/04/08 00:38:

> Optimizers are overrated
>
> I started learning ASM not long ago to improve my understanding of the
> hardware architecture and my ability to optimize C code. The results of my
> first experiment were surprising to say at least. After reading the chapter
> on loops in my ASM book I wanted to test whether modern C compilers are
> actually as smart as commonly claimed. I chose a most simple loop: calling
> putchar 100 times. The first function (foo) uses a typical C style loop to
> test the assumption that "the compiler will optimize that better than any
> human could". The second function (bar) is based my newly gained knowledge,
> the loop is basically ASM written in C. Now I am certain if I asked here
> which one is more efficient all you guys would reply "the compiler will most
> likely generate the same code in both cases" (I have read such claims
> countless times here). Well, look at the ASM output below to see how wrong
> your assumption is.
>
> /* C Code */
>
> void foo(void)
> {
> int i;
>
> for (i = 0; i < 100; i++) putchar('a');
> }
>
>
> void bar(void)
> {
> int i = 100;
>
> do {
> putchar('a');
> } while (--i);
> }
>
> As I said, damn simple. No nasty side effects, no access to global
> variables, etc. The optimizer has no excuses. It should generate optimial
> code in both cases. But see the result:
>
>
> /* GCC 4.3.0 on x86/Windows, -O2 */

<snip>

Add -funroll-loops and gcc will do some loop unrolling for you
potentially speeding it up. This is likely to have a bigger impact than
the optimisation you thought of.

The larger the function/module/program the more likely it is for the
compiler to beet hand coded assembler and the more it will cost to
produce hand coded assembler. I have beaten a compiler where it mattered
(admittedly 10 years or so ago) but I've also speeded up code by more
than a factor of 2 by removing stupidities in what people had done in a
high level language (and I know others who have also done this).

Generally writing large chunks of assembler code is simply too expensive
and error prone. There are exceptions where you really do have to get
the last cycle, but I believe they are rare when you move away from the
really small embedded processors.
--
Flash Gordon

Knight

Apr 23, 2008, 4:02:18 PM

D'oh!
It's not.
I'll throw some tomatoes on myself now.

To be fair, substitute
for (i = 0; i < 100; i++) a = b;
And find a compiler that would substitute this with
a = b;

user923005

Apr 23, 2008, 4:10:17 PM

Instead of unrolling loops, how about memset with the char of choice
for 'count' characters, null termination with a zero byte, and puts()?
I guess that the choice of the right optimization is better than all
the silly tweaks in the world.

lawrenc...@siemens.com

Apr 23, 2008, 3:25:04 PM
Keith Thompson <ks...@mib.org> wrote:
>
> The latter raises a (possibly) interesting point. The value of some
> optimizations might depend on which model of the x86 family the code
> is going to run on.

In fact, the x86 family has been notorious for having optimization
techniques for one generation of processor performing horribly on the
next generation of processor.

-Larry Jones

I like maxims that don't encourage behavior modification. -- Calvin

Anonymous

Apr 23, 2008, 4:18:44 PM

gcc 4.2.3 does it for me.
actually, if I do:

void foo()
{
int i, a, b;
b = 3;


for (i = 0; i < 100; i++)
a = b;
}

it optimizes it to:

void foo()
{
}

since none of the code in the function actually affects program state.

Richard

Apr 23, 2008, 4:33:30 PM
user923005 <dco...@connx.com> writes:

This depends on the definition of "difference" and performance
specifications. One should always strive for the tightest code if you
can do so without losing reliability and maintainability where they are
a factor. Too many people write sloppy code because their machine eats
it up. It's why office suites etc. are as slow as they were 10 years ago -
not enough effort in streamlining.

I would consider running any moderately intensive app with the profiler
on just to be sure to be sure.

Richard Heathfield

Apr 23, 2008, 4:50:25 PM
Knight said:

Presumably you mean:

i = 100;
a = b;

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Dann Corbit

Apr 23, 2008, 4:59:10 PM
"Richard Heathfield" <r...@see.sig.invalid> wrote in message
news:Q_WdnaOyh_CUPZLV...@bt.com...

> Knight said:
>
>> On Apr 23, 12:59 pm, Knight <knightt...@yahoo.com> wrote:
>>> Hey, I am still waiting for the Super Compiler that would understand
>>> that
>>> for (i = 0; i < 100; i++) putchar('a');
>>> is the same as
>>> putchar('a');
>>> Is that too much to hope for?
>>
>> D'oh!
>> It's not.
>> I'll throw some tomatoes on myself now.
>>
>> To be fair, substitute
>> for (i = 0; i < 100; i++) a = b;
>> And find a compiler that would substitute this with
>> a = b;
>
> Presumably you mean:
>
> i = 100;
> a = b;

I guess it would look something like this:

#include <stdio.h>
int main (void)
{
int i, a=0, b;
b = 31415;


for (i = 0; i < 100; i++)
a = b;

printf ("a=%d,i=%d\n", a, i);
return 0;
}

; Listing generated by Microsoft (R) Optimizing Compiler Version
14.00.50727.762

TITLE c:\tmp\foo.c
.686P
.XMM
include listing.inc
.model flat

INCLUDELIB OLDNAMES

EXTRN __imp__printf:PROC
$SG-4 DB 'a=%d,i=%d', 0aH, 00H


PUBLIC _main
; Function compile flags: /Ogtpy

; File c:\tmp\foo.c


; COMDAT _main
_TEXT SEGMENT
_main PROC ; COMDAT

; 4 : int i, a=0, b;
; 5 : b = 31415;
; 6 : for (i = 0; i < 100; i++)
; 7 : a = b;
; 8 : printf ("a=%d,i=%d\n", a, i);

00000 6a 64 push 100 ; 00000064H
00002 68 b7 7a 00 00 push 31415 ; 00007ab7H
00007 68 00 00 00 00 push OFFSET $SG-4
0000c ff 15 00 00 00
00 call DWORD PTR __imp__printf
00012 83 c4 0c add esp, 12 ; 0000000cH

; 9 : return 0;

00015 33 c0 xor eax, eax

; 10 : }

00017 c3 ret 0


_main ENDP
_TEXT ENDS
END

Mark McIntyre

Apr 23, 2008, 5:23:49 PM
copx wrote:
>
> The test is the whole point of optimizing this loop on x86. You do not need
> a test (cmp instruction) in the loop if you decrement towards zero instead
> of incrementing towards 100. This saves one instruction in the body of loop.

You're assuming that this and your other optimisations are actually
faster. Did you test this? Its entirely possible that there's some other
h/w magic which you don't know about but which the compiler writer does.

Also, not meaning to be funny, but you admit you just started ASM
programming recently. Do you /really/ think you know more already than
people who write compilers for a living?

--
Mark McIntyre

CLC FAQ <http://c-faq.com/>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

Mark McIntyre

Apr 23, 2008, 5:28:08 PM
copx wrote:
> "Eric Sosman" <eso...@ieee-dot-org.invalid> schrieb im Newsbeitrag
> news:8PCdne2DTcUcNJPV...@comcast.com...
> [snip]
>> If you give your car a nice fresh coat of wax, do you
>> improve its fuel efficiency by reducing air drag or hurt
>> the efficiency by increasing weight?
>>
>> Compiler writers do not stay up nights trying to figure
>> out how to optimize silly loops; they give their attention
>> to getting more "serious" programs to run well.
>
> "Serious" programs probably contain many of such "silly loops".

And how much experience do you have of serious programmes, compared to
Eric?

Please stop embarrassing yourself.

>And what
> exactly is "silly" about loops based on incrementing/decrementing an integer
> value counting towards a maximum/minimum? Have you written many "serious"
> programs without them?

Yes, but it's irrelevant. I never wrote a programme that did nothing but
count down to zero and print a meaningless statement....

Eric Sosman

Apr 23, 2008, 5:52:50 PM
Mark McIntyre wrote:
> copx wrote:
>> "Eric Sosman" <eso...@ieee-dot-org.invalid> schrieb im Newsbeitrag
>> news:8PCdne2DTcUcNJPV...@comcast.com...
>> [snip]
>>> If you give your car a nice fresh coat of wax, do you
>>> improve its fuel efficiency by reducing air drag or hurt
>>> the efficiency by increasing weight?
>>>
>>> Compiler writers do not stay up nights trying to figure
>>> out how to optimize silly loops; they give their attention
>>> to getting more "serious" programs to run well.
>>
>> "Serious" programs probably contain many of such "silly loops".
>
> And how much experience do you have of serious programmes, compared to
> Eric?

Even when it's made on my behalf I find the "Experience
counts" argument only mildly persuasive. We all know people
of vast experience and half-vast ability ...

The point, still, is that one observation (not even a
measurement!) of one tiny program is insufficient basis for
a sweeping statement like "optimizers are overrated." The
sample is too small and the observation too inexact to support
such a conclusion. Using similar methodology I could show
that penicillin is overrated: Give one dose to one healthy
person, ask him how much better he feels, and draw^H^H^H^H
leap to your conclusion. (I think election pollsters use
this general approach ...)

--
Eric....@sun.com

christian.bau

Apr 23, 2008, 6:35:24 PM
On Apr 23, 5:54 pm, Keith Thompson <ks...@mib.org> wrote:

> On the other hand, copx is assuming that a test against 0 is going to
> be better than a test against 100.  Even if the test against 0 is
> implicit (the decrement instruction implicitly sets a flag), that may
> or may not be the case.  Without either measurement (which copx has
> not done) or intimate knowledge of how the x86 CPU behaves (which I
> lack, and which might vary across models anyway), we can't tell
> whether the optimization would be worthwhile.

We can get a hint from the fact that the decrement and increment
instructions have been removed from the IA-64 instruction set, so both
Intel and AMD seem to think that having this instruction is not
worthwhile. Also, the increment and decrement instructions don't
affect the carry flag, unlike an add or subtract instruction, which
means inc and dec force a partial condition code register update,
which means the instruction will run slower.

robert...@yahoo.com

Apr 23, 2008, 6:54:02 PM
On Apr 23, 5:35 pm, "christian.bau" <christian....@cbau.wanadoo.co.uk>
wrote:


That's not really correct.

First, you almost certainly don't mean IA-64 (especially since you
mention AMD). IA-64 is the IPF or Itanium ISA. What you want is
x86-64, which is the most common name for what AMD calls AMD64 and
Intel calls EM64T or Intel 64, which are all names for the 64 bit
version of good old x86.

Anyway, AMD removed the one byte forms of the inc and dec
instructions, since they needed a range of single byte opcodes so they
could encode the new REX prefix. The longer forms of inc and dec,
which are functionally identical, are still there (as they have since
the 8086). And given that Intel copied AMD64, it doesn't really mater
what they thought of the one byte forms of those instructions - they
were stuck with AMD's decision.

Anyway, the performance of incs and decs is back up to par with the
Core generations of CPUs.

rzed

Apr 23, 2008, 9:07:36 PM
Eric Sosman <eso...@ieee-dot-org.invalid> wrote in
news:8PCdne2DTcUcNJPV...@comcast.com:

> You'd probably call Bach overrated because he never wrote
> any good kazoo concertos.

See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
can't seriously doubt the ... the, um ... the QUALITY of his
compositions, now can you?

--
rzed
... also never wrote a good kazoo concerto.

A. Sinan Unur

Apr 23, 2008, 10:06:45 PM
user923005 <dco...@connx.com> wrote in news:3554078d-425c-4da3-b98d-
eac16c...@34g2000hsh.googlegroups.com:

> Instead of unrolling loops, how about memset with the char of choice
> for 'count' characters, null termination with a zero byte, and puts()?
> I guess that the choice of the right optimization is better than all
> the silly tweaks in the world.

See

http://groups.google.com/group/comp.lang.c/msg/dc5762477174dde4

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

TonyMc

Apr 24, 2008, 6:36:09 AM
CBFalconer <cbfal...@yahoo.com> writes:

> No you didn't. You proved that makeing the termination depend on
> the variable reaching 0 differs from having the variable reach 100.

You have used the word "makeing". No such word exists in the language
standard. Perhaps you were looking for the word "petard".

--
Tony

Bartc

Apr 24, 2008, 7:17:50 AM

"Flash Gordon" <sp...@flash-gordon.me.uk> wrote in message
news:l1c4e5x...@news.flash-gordon.me.uk...

I've tried extra options, with little effect.

I've since looked in more detail at gcc code output. And decided to write my
main dispatcher in ASM! Which will also take over some of the simplest
operations (this is a byte-code interpreter).

I will keep the C dispatcher (for portability reasons), but see no reason to
have to run it at half-speed (or one-third speed on another machine) because
the compiler can't make the global strategic decisions that I can make in
ASM.

(BTW I don't know what's the matter with lccwin32; I also have a version in
my own language+compiler (no asm), that does no optimisation /whatsoever/,
yet performs between gcc and lccwin32.)


--
Bart

Nick Keighley

Apr 24, 2008, 7:53:53 AM
On 23 Apr, 20:15, Flash Gordon <s...@flash-gordon.me.uk> wrote:

> The larger the function/module/program the more likely it is for the
> compiler to beet hand coded assembler

club it to death with a root vegetable?


--
Nick Keighley

Walter Banks

Apr 24, 2008, 9:54:22 AM

rzed wrote:

> Eric Sosman <eso...@ieee-dot-org.invalid> wrote in
> news:8PCdne2DTcUcNJPV...@comcast.com:
>
> > You'd probably call Bach overrated because he never wrote
> > any good kazoo concertos.
>
> See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
> can't seriously doubt the ... the, um ... the QUALITY of his
> compositions, now can you?

I once went to a P.D.Q. Bach concert and the two old ladies who
sat behind me failed to see the humour until after the intermission
drinks had taken effect. They expected . . . . Bach.

w..


Eric Sosman

Apr 24, 2008, 9:57:45 AM
rzed wrote:
> Eric Sosman <eso...@ieee-dot-org.invalid> wrote in
> news:8PCdne2DTcUcNJPV...@comcast.com:
>
>> You'd probably call Bach overrated because he never wrote
>> any good kazoo concertos.
>
> See here, now! P.D.Q. Bach notoriously used kazoos in his works. You
> can't seriously doubt the ... the, um ... the QUALITY of his
> compositions, now can you?

<off-topic>

Long years ago my high-school calculus teacher (who had
many interests outside mathematics) played for his class a
tape of a mini-concert given by a singing group to which he
belonged, on the occasion of their annual party. A speaker
soberly introduced the program as the first fruits of a piece
of ground-breaking musical scholarship, which had recently
discovered that J.S. Bach's "Die Kunst der Fuge" ("The Art
of the Fugue") had been mis-identified for all this time,
the word "Fuge" being a mis-transcription of Bach's intended
"Ruge," which the speaker explained was an Old German word
meaning "kazoo." A quartet of kazooists, my teacher among
them, then gave the first "original instruments" performance
of this long-misunderstood masterpiece of, er, casuistry.

</off-topic>

--
Eric....@sun.com

Walter Banks

Apr 24, 2008, 10:01:34 AM

Willem wrote:

> Walter wrote:
>
> ) You made the assumption that the two pieces of code are equivalent
> ) they are not.
>
> They are too. The variable 'i' is not used for anything but
> counting the loop. I've seen GCC produce the same code for
> the two snippets and that was several years back. Before 3.0.

I understand about i and the results are the same. I also understand
that it could be optimized to be the same and many compilers would.

The for and while implementations have different meaning. It is a subtle
point that I am making, essentially about the care that is needed in
creating benchmarks like this.

w..


santosh

Apr 24, 2008, 10:25:39 AM
Bartc wrote:

If you want heavy optimisations the choice is between gcc and intel (or
perhaps another compiler like DigitalMars). jacob navia will be (I'm
sure) the first to admit the lcc-win32 doesn't do all the optimisations
under the Sun. He has admitted in the past to do peephole optimisation
though.

In my experience for Intel x86 chips and their clones, executables
produced by the Intel compilers usually outperform others.

But yes, if it's feasible, a manual design can always perform high-level
optimisations that compilers will struggle to match. I suppose your
interpreter is a case study in this. It's still surprising to hear that
the C version is about twice as slow as your assembler one. Is the
speed gain of the assembler code spread out uniformly over the source
code or is it confined to one or more specific constructs/routines?

Willem

Apr 24, 2008, 11:00:00 AM
Walter wrote:
) I understand about i and the results are the same. I also understand
) that it could be optimized to be the same and many compilers would.
)
) The for and while implementations have different meaning. It is a subtle
) point that I am making essentially about the care that is needed in
) creating benchmarks like this.

I don't get your point. Could you elaborate ?

The C standard allows that a conforming compiler creates
exactly the same machine code for the two code snippets.
To me, that implies that the two have exactly the same meaning.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Charlton Wilbur

Apr 24, 2008, 11:30:54 AM
>>>>> "ES" == Eric Sosman <Eric....@sun.com> writes:

ES> <off-topic>

ES> A quartet of kazooists, my teacher among
ES> them, then gave the first "original instruments" performance
ES> of this long-misunderstood masterpiece of, er, casuistry.

Did they seriously perform something from Die Kunst der Fuge on kazoo?
That takes an incredible amount of skill to get reasonably right.

Charlton

(who played in a jazz combo once that was known for doing whimsical
things like playing a verse on kazoo and slide whistle and thus knows
how difficult it is to actually play the kazoo in such a context)


ES> </off-topic>

--
Charlton Wilbur
cwi...@chromatico.net

Richard Tobin

Apr 24, 2008, 12:08:33 PM
In article <481092BE...@bytecraft.com>,
Walter Banks <wal...@bytecraft.com> wrote:

>The for and while implementations have different meaning.

No, they have the same meaning. Used in other contexts - in
particular, where the loop variable is used afterwards - they
would have different meanings.

-- Richard
--
:wq

Eric Sosman

Apr 24, 2008, 12:08:44 PM
Charlton Wilbur wrote:
>>>>>> "ES" == Eric Sosman <Eric....@sun.com> writes:
>
> ES> <off-topic>
>
> ES> A quartet of kazooists, my teacher among
> ES> them, then gave the first "original instruments" performance
> ES> of this long-misunderstood masterpiece of, er, casuistry.
>
> Did they seriously perform something from Der Kunst der Fuge on kazoo?

They did, indeed. "Daaaah, daaaah, daaaah, daaaah, daaaah,
daah daah daaaaaah, dah dah dah daaaaaah," and so on.

> That takes an incredible amount of skill to get reasonably right.

"Reasonably right?" Where did *that* notion come from?
They were about as right as (violent wrench towards topicality)
`void main()' -- recognizable, but not of the highest Standard.

--
Eric....@sun.com

Walter Banks

Apr 24, 2008, 12:42:22 PM

Willem wrote:

> Walter wrote:
> ) I understand about i and the results are the same. I also understand
> ) that it could be optimized to be the same and many compilers would.
> )
> ) The for and while implementations have different meaning. It is a suttle
> ) point that I am making essentially about the care that is needed in
> ) creating benchmarks like this.
>
> I don't get your point. Could you elaborate ?
>
> The C standard allows that a conforming compiler creates
> exactly the same machine code for the two code snippets.
> To me, that implies that the two have exactly the same meaning.

I would have been a lot more comfortable comparing

for (i =100 ; i > 0; i--) putchar('a');

int i = 100;
do {
putchar('a');
} while (--i);

My point is this thread is on the line between
optimization and algorithm rewriting and was a simple
example encompassing both. for and while loops have
different meanings and in this specific example the
for can be rewritten as a while and the count sequence
can be changed from an up count to a down count
and produce the same results.

The change I made is an optimization test of
for loop implementation.

w..

Bartc

Apr 24, 2008, 12:54:07 PM

"santosh" <santo...@gmail.com> wrote in message
news:fuq59g$ai8$1...@registered.motzarella.org...
> Bartc wrote:

>>>> Just yesterday I finished porting to C, the key module of a
>>>> byte-code interpreter. Testing this on only one input program, the
>>>> results on two compilers I tried (x86 target) were:
>>>>
>>>> lccwin32 3000ms with -O
>>>> gcc 3.4.5 1600ms with no optimisation (any -O switches made much
>>>> slower)
>>>> original 800ms, using tight mix of asm and compiled code (not
>>>> C).

>>>> Have some avenues to explore yet, but so far disappointing.

>> I've tried extra options, with little effect.


>>
>> I've since looked in more detail at gcc code output. And decided to
>> write my main dispatcher in ASM! Which will also take over some of the
>> simplest operations (this is a byte-code interpreter).

> If you want heavy optimisations the choice is between gcc and intel (or


> perhaps another compiler like DigitalMars).

I think Intel would know their own CPU pretty well.

Can't use DigitalMars (DMC) yet because my project is still in hybrid stage
and it doesn't like my NASM/OBJ files. But it picked up a few errors missed
by gcc and lccwin, e.g. typing:

p; instead of p();

was missed by these two but picked up by DMC.

> jacob navia will be (I'm
> sure) the first to admit the lcc-win32 doesn't do all the optimisations
> under the Sun. He has admitted in the past to do peephole optimisation
> though.

I'm curious enough to do more specific tests so that I can see what the
problem is. On another input file, lccwin was 40% slower than gcc. My own
compiler (for a language similar to C) performed pretty much as well as gcc,
and my code generator is simplistic (it doesn't even convert a*16 to a<<4
for example). So something strange going on I think, in my program or in
lccwin.

>... It's still surprising to hear that


> the C version is about twice as slow as your assembler one. Is the
> speed gain of the assembler code spread out uniformly over the source
> code or is it confined to one or more specific constructs/routines?

I've concentrated on integer code. On a more general input program there
would be more non-asm code being executed and the differences will be
smaller. But fast integer code is important so that the input language can
be used more widely.

But, the gcc-compiled version is still 2-3 times faster than the latest
Python, when executing the same input program.

-- Bartc


jacob navia

Apr 24, 2008, 12:57:48 PM

If you send me the assembler output I will take a look at it.
Or you can send me the object files. Please tell me where are the
important points so that I can see where the optimizer goes wrong.

Thanks

--
jacob navia
jacob at jacob point remcomp point fr
logiciels/informatique
http://www.cs.virginia.edu/~lcc-win32

Ben Bacarisse

Apr 24, 2008, 1:06:14 PM
"Bartc" <b...@freeuk.com> writes:
<snip>

> Can't use DigitalMars (DMC) yet because my project is still in hybrid stage
> and it doesn't like my NASM/OBJ files. But it picked up a few errors missed
> by gcc and lccwin, eg. typing:
>
> p; instead of p();

My gcc warns me about a "statement with no effect" with only the most
modest warning settings (-W).

--
Ben.

Flash Gordon

Apr 24, 2008, 1:46:11 PM
Nick Keighley wrote, On 24/04/08 12:53:

Yes, just as soon as the program dereferences a null pointer at run
time. You know how dangerous undefined behaviour can be.
--
Flash Gordon
