
Gforth and gcc "progress"


Anton Ertl

Jun 23, 2007, 4:11:17 PM
Today I played around with optimizations for Gforth in the context of
various gcc versions. With 0.6.2, the situation was:

gcc-2.95: nice and fast, although a little buggy.

gcc-3.0: slower, thanks to some GCSE misbehaviour

gcc-3.x, x>1 (certainly for 3.3 and 3.4): very slow: they fixed the
GCSE problem, but on the way destroyed gforth's dynamic code
generation and the branch prediction advantage of using threaded
code (PR15242).

gcc-4.x: also very slow: Thanks to anal-retentive syntax checking that
was introduced without prior warning, dynamic code generation is
turned off even though it would work in principle (PR15242 is mostly
fixed, with the exception of PR25285, and they ignore that).

So in the meantime we introduced workarounds for PR15242, but the
performance suffered for the other compilers, too. As a result, a few
days ago the performance picture looked like this (on a 2.2GHz Athlon
64 X2):

sieve  bubble matrix fib
0.248  0.340  0.112  0.388  0.6.9, gcc-2.95.4 --enable-force-reg
0.188  0.292  0.128  0.308  0.6.2, gcc-2.95.1 --enable-force-reg

Well, quite a bit of slowdown compared to 0.6.2. So today I worked on
getting some of the gcc-2.95 speed back and improving the gcc-4.x
speed. So the current CVS has the following speeds (all configured
with --enable-force-reg):

sieve  bubble matrix fib
0.208  0.296  0.108  0.328  gcc 2.95.4 20011002 (Debian prerelease)
0.264  0.344  0.120  0.360  gcc 3.4.6 (Debian 3.4.6-5)
0.384  0.432  0.296  0.520  gcc 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Nice progress by the gcc maintainers, eh?-(
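
For scale, dividing each gcc 4.1.2 time by the matching gcc 2.95.4 time
gives slowdown factors between roughly 1.5 and 2.7. A minimal C sketch
of that arithmetic, using only the numbers from the table above (the
array and function names are illustrative, not part of Gforth):

```c
#define NBENCH 4   /* sieve, bubble, matrix, fib */

/* Times in seconds, copied from the table above. */
static const double gcc_2_95_4[NBENCH] = { 0.208, 0.296, 0.108, 0.328 };
static const double gcc_4_1_2[NBENCH]  = { 0.384, 0.432, 0.296, 0.520 };

/* Store newer[i] / baseline[i] in ratio[i]; values above 1 mean slower. */
void slowdown_factors(const double *baseline, const double *newer,
                      double *ratio)
{
    for (int i = 0; i < NBENCH; i++)
        ratio[i] = newer[i] / baseline[i];
}
```

Calling slowdown_factors(gcc_2_95_4, gcc_4_1_2, ratio) fills ratio with
one factor per benchmark, in the column order of the table.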

The slowdown between 2.95 and 3.4 can be explained with PR15242; our
workaround helps a lot, but some slowness cannot be worked around (in
particular, not the part that I got back for 2.x and 4.x today).

The slowdown of gcc-4.1 seems to come from bad register allocation and
a failure of copy propagation. I actually had to reduce what
--enable-force-reg does on this compiler, otherwise the compiler would
not compile, or produce wrong code. As an example of the low quality
of the resulting code, consider this:

0.6.9, gcc 4.1       0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1   optimal on K7,K8
Code +               Code +              Code +              Code +
mov edi, 21C [esp]   mov eax, 4 [esi]    mov eax, 4 [esi]    add ecx, 4 [esi]
mov edx, ebp         add esi, # 4        add esi, # 4        add ebx, # 4
add ebx, # 4         add ecx, eax        add ebx, # 4        add esi, # 4
mov ecx, 4 [edi]     add ebx, # 4        add ecx, eax        jmp -4 [ebx]
add edi, # 4         mov eax, -4 [ebx]   jmp -4 [ebx]        end-code
add edx, ecx         jmp eax             end-code
mov 21C [esp], edi   end-code
mov ebp, edx
mov esi, -4 [ebx]
mov eax, esi
jmp eax
end-code

The difference between 0.6.9 and 0.6.2 on gcc-2.95 is due to a
workaround for PR15242 that I have not (yet?) made
gcc-version-specific.

- anton

--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/

sl...@jedit.org

Jun 23, 2007, 5:31:32 PM
On Jun 23, 4:11 pm, a...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> 0.6.9, gcc 4.1       0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1   optimal on K7,K8
> Code +               Code +              Code +              Code +
> mov edi, 21C [esp]   mov eax, 4 [esi]    mov eax, 4 [esi]    add ecx, 4 [esi]
> mov edx, ebp         add esi, # 4        add esi, # 4        add ebx, # 4
> add ebx, # 4         add ecx, eax        add ebx, # 4        add esi, # 4
> mov ecx, 4 [edi]     add ebx, # 4        add ecx, eax        jmp -4 [ebx]
> add edi, # 4         mov eax, -4 [ebx]   jmp -4 [ebx]        end-code
> add edx, ecx         jmp eax             end-code
> mov 21C [esp], edi   end-code
> mov ebp, edx
> mov esi, -4 [ebx]
> mov eax, esi
> jmp eax
> end-code

Interesting, I didn't realize gcc has regressed so much since 2.95.
Why does gforth rely on gcc at all? Maybe it's time for a rewrite? :)

Slava

Andrew Haley

Jun 24, 2007, 11:06:40 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Today I played around with optimizations for Gforth in the context of
> various gcc versions. With 0.6.2, the situation was:

> gcc-2.95: nice and fast, although a little buggy.

> gcc-3.0: slower, thanks to some GCSE misbehaviour

> gcc-3.x, x>1 (certainly for 3.3 and 3.4): very slow: they fixed the
> GCSE problem, but on the way destroyed gforth's dynamic code
> generation and the branch prediction advantage of using threaded
> code (PR15242).

> gcc-4.x: also very slow: Thanks to anal-retentive syntax checking that
> was introduced without prior warning, dynamic code generation is
> turned off even though it would work in principle

What syntax does this refer to?

There seem to be a lot of different issues here, and it's quite hard
for me to disentangle them. Do you mean that with gcc 4.1, you can't
force all the Forth system pointers you need into registers, because
the compiler runs out of registers, but you could do this with earlier
compilers?

Andrew.

Anton Ertl

Jun 24, 2007, 7:47:22 AM
"sl...@jedit.org" <sl...@jedit.org> writes:
>Why does gforth rely on gcc at all?

We want portability and speed. Gcc used to have outstanding
advantages there:

- Labels-as-values allowed us to do threaded code (factor of 2 over
switch dispatch) and dynamic superinstructions (another factor of 2).

- Explicit register allocation allowed us to get decent register
allocation.

- Long long allowed us to do doubles (they broke that many years ago
on the Alpha and "fixed" it by changing the documentation, so we
have had workarounds for BUGGY_LONG_LONG for a long time).

Unfortunately, the gcc maintainers are working hard at eliminating
these advantages, so I would love not to have to rely on gcc.
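
As an illustration of the first point, here is a minimal sketch of
dispatch through label addresses in GNU C. It is not Gforth's engine:
the three opcodes, the function name, and the fixed 16-entry stack are
made up for the example, and it looks each label up in a table, whereas
a direct-threaded system stores the label addresses in the threaded
code itself.

```c
/* Hypothetical three-instruction VM, for illustration only. */
enum { OP_LIT, OP_ADD, OP_HALT };

long run_threaded(const long *prog)
{
    /* GNU C labels-as-values: &&label yields the address of a label. */
    static void *const dispatch[] = { &&do_lit, &&do_add, &&do_halt };
    long stack[16];            /* no overflow checks in this sketch */
    long *sp = stack;          /* points one past the top of stack */
    const long *ip = prog;     /* instruction pointer */

    /* Every instruction body ends with its own indirect jump. */
#define NEXT goto *dispatch[*ip++]
    NEXT;

do_lit:                        /* push the inline operand */
    *sp++ = *ip++;
    NEXT;
do_add:                        /* replace the top two items by their sum */
    sp--;
    sp[-1] += sp[0];
    NEXT;
do_halt:                       /* return the top of stack */
    return sp[-1];
#undef NEXT
}
```

With switch dispatch, all instructions would go through one shared
indirect jump; here each instruction has its own, which is the property
that the branch prediction argument depends on. For example, the
program { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_HALT } leaves 5 on the
stack.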

> Maybe it's time for a rewrite? :)

Yes, I have been thinking about compiling definitions to C source
code, then compiling it and dynamically linking it in. Thus the gcc
maintainers would have succeeded in their goal of weaning us off GNU C
extensions, and, thanks to gcc's long compile times, would have
successfully eliminated themselves from the competition.

Anton Ertl

Jun 24, 2007, 11:24:12 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Today I played around with optimizations for Gforth in the context of
>> various gcc versions. With 0.6.2, the situation was:
>
>> gcc-2.95: nice and fast, although a little buggy.
>
>> gcc-3.0: slower, thanks to some GCSE misbehaviour
>
>> gcc-3.x, x>1 (certainly for 3.3 and 3.4): very slow: they fixed the
>> GCSE problem, but on the way destroyed gforth's dynamic code
>> generation and the branch prediction advantage of using threaded
>> code (PR15242).
>
>> gcc-4.x: also very slow: Thanks to anal-retentive syntax checking that
>> was introduced without prior warning, dynamic code generation is
>> turned off even though it would work in principle
>
>What syntax does this refer to?

IIRC it could no longer compile this function:

int foo(int x)
{
  if (x) {
  label1:
    asm(".skip 16");
  label2:
  }
  return (&&label2)-(&&label1);
}

IIRC "label2:" must have a statement following it in gcc-4.x.
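
If that recollection is right, the usual fix is a null statement after
the label, since C (before C23) requires every label to be attached to
a statement. A small sketch of the rule, unrelated to Gforth's source
(the function name is made up for the example):

```c
/* Sum the non-negative elements of v[0..n-1]. */
int sum_skipping_negatives(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] < 0)
            goto next;
        sum += v[i];
    next:
        ;   /* null statement: gives the label something to label */
    }
    return sum;
}
```

Applied to foo above, writing "label2: ;" would make the function
syntactically valid again.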

>> The slowdown of gcc-4.1 seems to come from bad register allocation and
>> a failure of copy propagation. I actually had to reduce what
>> --enable-force-reg does on this compiler, otherwise the compiler would
>> not compile, or produce wrong code. As an example of the low quality
>> of the resulting code, consider this:
>
>> 0.6.9, gcc 4.1       0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1   optimal on K7,K8
>> Code +               Code +              Code +              Code +
>> mov edi, 21C [esp]   mov eax, 4 [esi]    mov eax, 4 [esi]    add ecx, 4 [esi]
>> mov edx, ebp         add esi, # 4        add esi, # 4        add ebx, # 4
>> add ebx, # 4         add ecx, eax        add ebx, # 4        add esi, # 4
>> mov ecx, 4 [edi]     add ebx, # 4        add ecx, eax        jmp -4 [ebx]
>> add edi, # 4         mov eax, -4 [ebx]   jmp -4 [ebx]        end-code
>> add edx, ecx         jmp eax             end-code
>> mov 21C [esp], edi   end-code
>> mov ebp, edx
>> mov esi, -4 [ebx]
>> mov eax, esi
>> jmp eax
>> end-code

>There seem to be a lot of different issues here, and it's quite hard
>for me to disentangle them. Do you mean that with gcc 4.1, you can't
>force all the Forth system pointers you need into registers, because
>the compiler runs out of registers, but you could do this with earlier
>compilers?

Yes, the explicit register allocations that worked for gcc-2.95 and
gcc-3.4 do not work for gcc-4.1, so I used a different one, which
resulted in spilling the data stack pointer in the example above. I
find the many superfluous register-register moves more worrying,
though.

Anton Ertl

Jun 24, 2007, 12:30:39 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>The slowdown of gcc-4.1 seems to come from bad register allocation and
>a failure of copy propagation. I actually had to reduce what
>--enable-force-reg does on this compiler, otherwise the compiler would
>not compile, or produce wrong code. As an example of the low quality
>of the resulting code, consider this:
>
>0.6.9, gcc 4.1       0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1   optimal on K7,K8
>Code +               Code +              Code +              Code +
>mov edi, 21C [esp]   mov eax, 4 [esi]    mov eax, 4 [esi]    add ecx, 4 [esi]
>mov edx, ebp         add esi, # 4        add esi, # 4        add ebx, # 4
>add ebx, # 4         add ecx, eax        add ebx, # 4        add esi, # 4
>mov ecx, 4 [edi]     add ebx, # 4        add ecx, eax        jmp -4 [ebx]
>add edi, # 4         mov eax, -4 [ebx]   jmp -4 [ebx]        end-code
>add edx, ecx         jmp eax             end-code
>mov 21C [esp], edi   end-code
>mov ebp, edx
>mov esi, -4 [ebx]
>mov eax, esi
>jmp eax
>end-code

Ok, in order to get something better for this, I turned off caching
the TOS by default, resulting in the following +:

Code +
( $804C502 ) mov esi , dword ptr 8 [ebp] \ $8B $75 $8
( $804C505 ) mov eax , dword ptr 4 [ebp] \ $8B $45 $4
( $804C508 ) add esi , eax \ $1 $C6
( $804C50A ) add ebx , # 4 \ $83 $C3 $4
( $804C50D ) mov dword ptr 8 [ebp] , esi \ $89 $75 $8
( $804C510 ) add ebp , # 4 \ $83 $C5 $4
( $804C513 ) mov esi , dword ptr FC [ebx] \ $8B $73 $FC
( $804C516 ) mov edx , esi \ $89 $F2
( $804C518 ) jmp 804BAF6 \ $E9 $D9 $F5 $FF $FF
end-code

That looks better but the times are slower (see below). On a hunch I
checked ?BRANCH, and found:

Code ?branch
( $804BCDD ) add ebp , # 4 \ $83 $C5 $4
( $804BCE0 ) mov eax , dword ptr [ebx] \ $8B $3
( $804BCE2 ) mov edi , dword ptr 0 [ebp] \ $8B $7D $0
( $804BCE5 ) test edi , edi \ $85 $FF
( $804BCE7 ) jne 804BCF5 \ $75 $C
( $804BCE9 ) mov edx , dword ptr [eax] \ $8B $10
( $804BCEB ) lea ebx , dword ptr 4 [eax] \ $8D $58 $4
( $804BCEE ) mov esi , edx \ $89 $D6
( $804BCF0 ) jmp 804BAF6 \ $E9 $1 $FE $FF $FF
( $804BCF5 ) add ebx , # 8 \ $83 $C3 $8
( $804BCF8 ) mov esi , dword ptr FC [ebx] \ $8B $73 $FC
( $804BCFB ) mov edx , esi \ $89 $F2
( $804BCFD ) jmp 804BAF6 \ $E9 $F4 $FD $FF $FF
end-code

Yes, PR25285 strikes here (instead of the "jmp 804BAF6", a better
compiler would write "jmp esi"); that's the one the gcc maintainers
prefer to ignore. Ok, turn on the workaround for that (that's what
causes the slowdown between 2.95 and 3.4), and we get some mixed
results:

sieve  bubble matrix fib
0.208  0.296  0.108  0.328  gcc 2.95.4 20011002 (Debian prerelease)
0.264  0.344  0.120  0.360  gcc 3.4.6 (Debian 3.4.6-5)

0.384  0.432  0.296  0.520  gcc 4.1.2 (default configuration)
0.476  0.748  0.280  0.476  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0
0.364  0.524  0.288  0.472  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0

condbranch_opt=0 is one of the workarounds for PR15242 and PR25285.
STACK_CACHE_DEFAULT_FAST=0 turns off caching the TOS by default.

In conclusion, no matter what we do, gcc-4.1 sucks. I have heard that
gcc-4.2.0 is similar (it does not build with 32-bit support on the
Debian boxes I have here, so I cannot check this myself).

Krishna Myneni

Jun 24, 2007, 1:36:50 PM
Anton Ertl wrote:
> "sl...@jedit.org" <sl...@jedit.org> writes:
>
>>Why does gforth rely on gcc at all?
>
>
> We want portability and speed. Gcc used to have outstanding
> advantages there:
>
> - Labels-as-values allowed us to do threaded code (factor of 2 over
> switch dispatch) and dynamic superinstructions (another factor of 2).
>
> - Explicit register allocation allowed us to get decent register
> allocation.
>
> - Long long allowed us to do doubles (they broke that many years ago
> on the Alpha and "fixed" it by changing the documentation, so we
> have had workarounds for BUGGY_LONG_LONG for a long time).
>
> Unfortunately, the gcc maintainers are working hard at eliminating
> these advantages, so I would love not to have to rely on gcc.
>
>
>>Maybe it's time for a rewrite? :)
>
>
> Yes, I have been thinking about compiling definitions to C source
> code, then compiling it and dynamically linking it in. Thus the gcc
> maintainers would have succeeded in their goal of weaning us off GNU C
> extensions, and, thanks to gcc's long compile times, would have
> successfully eliminated themselves from the competition.
>
> - anton

Ok. I'll throw in my two cents worth of suggestions here. Eliminating gcc is a
drastic option. However, it may be possible to isolate the critical parts of the
code into a separate assembly source file. The critical code may then be tweaked
without restrictions or compiler interference. This is the model we use in
kForth, which is a mix of assembler, C, and C++ source. Of course, this means
much more work in producing a system which can run on many platforms (David
Williams can back me up on this statement!), since all of the source is no
longer portable. But that seems to be the case anyway, now, for gforth's
dependence on the gcc version.

Krishna

Anton Ertl

Jun 24, 2007, 4:37:39 PM
Krishna Myneni <krishn...@bellsouth.net> writes:

>Anton Ertl wrote:
>Ok. I'll throw in my two cents worth of suggestions here. Eliminating gcc is a
>drastic option. However, it may be possible to isolate the critical parts of the
>code into a separate assembly source file. The critical code may then be tweaked
>without restrictions or compiler interference. This is the model we use in
>kForth, which is a mix of assembler, C, and C++ source. Of course, this means
>much more work in producing a system which can run on many platforms

Yes, too much work.

> (David
>Williams can back me up on this statement!), since all of the source is no
>longer portable. But that seems to be the case anyway, now, for gforth's
>dependence on the gcc version.

No, all gcc versions still work (for now), the newer ones are just
slow.

Andrew Haley

Jun 26, 2007, 5:48:12 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> "sl...@jedit.org" <sl...@jedit.org> writes:
>>Why does gforth rely on gcc at all?

> We want portability and speed. Gcc used to have outstanding
> advantages there:

> - Labels-as-values allowed us to do threaded code (factor of 2 over
> switch dispatch) and dynamic superinstructions (another factor of 2).

> - Explicit register allocation allowed us to get decent register
> allocation.

> - Long long allowed us to do doubles (they broke that many years ago
> on the Alpha and "fixed" it by changing the documentation, so we
> have had workarounds for BUGGY_LONG_LONG for a long time).

Huh? I don't understand this remark. What broke "long long" on
Alpha? I don't imagine it was gcc.

> Unfortunately, the gcc maintainers are working hard at eliminating
> these advantages, so I would love not to have to rely on gcc.

I can assure you that's not deliberate. We know that stack slot
allocation and reuse has been poor in gcc, but it is getting better.
We had some performance regressions as a result of the tree-SSA
rewrite, but we've managed to claw most of that back. It seems that
gforth has suffered more than most programs as a result of these
changes.

Andrew.

Anton Ertl

Jun 26, 2007, 8:17:39 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> "sl...@jedit.org" <sl...@jedit.org> writes:
>>>Why does gforth rely on gcc at all?
>
>> We want portability and speed. Gcc used to have outstanding
>> advantages there:
>
>> - Labels-as-values allowed us to do threaded code (factor of 2 over
>> switch dispatch) and dynamic superinstructions (another factor of 2).
>
>> - Explicit register allocation allowed us to get decent register
>> allocation.
>
>> - Long long allowed us to do doubles (they broke that many years ago
>> on the Alpha and "fixed" it by changing the documentation, so we
>> have had workarounds for BUGGY_LONG_LONG for a long time).
>
>Huh? I don't understand this remark. What broke "long long" on
>Alpha? I don't imagine it was gcc.

Long long was documented as being twice as long as "long int". On
Alpha it was not.

>> Unfortunately, the gcc maintainers are working hard at eliminating
>> these advantages, so I would love not to have to rely on gcc.
>
>I can assure you that's not deliberate.

Well, I think it's an attitude problem. My impression is that any
code that's not ANSI C is considered unworthy (including code using
documented GNU C extensions), and reporters of bugs in this area are
made to feel unwelcome, e.g., by "resolving" a bug report as "invalid"
in less time than it took to prepare the bug report (PR25285);
comments like

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c7
<121j5pa...@news.supernews.com> ff.

also contribute to my impression.

>We know that stack slot
>allocation and reuse has been poor in gcc, but it is getting better.

We don't care much for that (might change when callbacks are used
more, though), and in any case, gcc never was great in that respect.

>We had some performance regressions as a result of the tree-SSA
>rewrite, but we've managed to claw most of that back. It seems that
>gforth has suffered more than most programs as a result of these
>changes.

Well, it's not all doom and gloom. Bernd has found an explicit
register allocation that works well with gcc-4.2.0 (but not 4.1);
scaling the results to our 2.2GHz Athlon 64X2 gives:

sieve  bubble matrix fib
0.208  0.296  0.108  0.328  gcc 2.95.4 20011002 (Debian prerelease)
0.264  0.344  0.120  0.360  gcc 3.4.6 (Debian 3.4.6-5)

0.384  0.432  0.296  0.520  gcc 4.1.2 (default configuration)
0.476  0.748  0.280  0.476  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0
0.364  0.524  0.288  0.472  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0

0.240  0.331  0.113  0.342  gcc 4.2.0

Still not quite in the 2.95 league, but better than 3.4.

Andrew Haley

Jun 26, 2007, 9:40:26 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> "sl...@jedit.org" <sl...@jedit.org> writes:
>>>>Why does gforth rely on gcc at all?
>>
>>> We want portability and speed. Gcc used to have outstanding
>>> advantages there:
>>
>>> - Labels-as-values allowed us to do threaded code (factor of 2 over
>>> switch dispatch) and dynamic superinstructions (another factor of 2).
>>
>>> - Explicit register allocation allowed us to get decent register
>>> allocation.
>>
>>> - Long long allowed us to do doubles (they broke that many years ago
>>> on the Alpha and "fixed" it by changing the documentation, so we
>>> have had workarounds for BUGGY_LONG_LONG for a long time).
>>
>>Huh? I don't understand this remark. What broke "long long" on
>>Alpha? I don't imagine it was gcc.

> Long long was documented as being twice as long as "long int".

How odd.

> On Alpha it was not.

OK, but that has nothing at all to do with gcc: the sizes of types are
determined by the ABI. gcc implements the ABI of whatever platform it
runs on.

>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>> these advantages, so I would love not to have to rely on gcc.
>>
>>I can assure you that's not deliberate.

> Well, I think it's an attitude problem.

My take on this is that we are, or should be, playing for the same
team. So, please, let's go ahead with this exchange in the spirit of
co-operation...

> My impression is that any code that's not ANSI C is considered
> unworthy (including code using documented GNU C extensions), and
> reporters of bugs in this area are made to feel unwelcome, e.g., by
> "resolving" a bug report as "invalid" in less time than it took to
> prepare the bug report (PR25285); comments like

> also contribute to my impression.

The thing is that for you this seems to be a bad attitude, but for me
it seems perfectly reasonable! What does it mean to jump out of a
statement expression, anyway?

The core issue here is that many of the gcc extensions never were
properly defined, so the corner cases don't work correctly. As a
result of that there is a strong opinion against any new extensions,
and some pressure to deprecate old ones. In the main that pressure
has been resisted, though.

>>We know that stack slot allocation and reuse has been poor in gcc,
>>but it is getting better.

> We don't care much for that (might change when callbacks are used
> more, though), and in any case, gcc never was great in that respect.

>>We had some performance regressions as a result of the tree-SSA
>>rewrite, but we've managed to claw most of that back. It seems that
>>gforth has suffered more than most programs as a result of these
>>changes.

> Well, it's not all doom and gloom. Bernd has found an explicit
> register allocation that works well with gcc-4.2.0 (but not 4.1);
> scaling the results to our 2.2GHz Athlon 64X2 gives:

> sieve  bubble matrix fib
> 0.208  0.296  0.108  0.328  gcc 2.95.4 20011002 (Debian prerelease)
> 0.264  0.344  0.120  0.360  gcc 3.4.6 (Debian 3.4.6-5)
> 0.384  0.432  0.296  0.520  gcc 4.1.2 (default configuration)
> 0.476  0.748  0.280  0.476  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0
> 0.364  0.524  0.288  0.472  gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0
> 0.240  0.331  0.113  0.342  gcc 4.2.0

> Still not quite in the 2.95 league, but better than 3.4.

OK, so that's progress. This is for 32-bit code, is it? I guess it
must be, given that gcc 2.95 didn't support 64-bit code.

Andrew.

Bernd Paysan

Jun 26, 2007, 9:54:31 AM
Andrew Haley wrote:
>> - Long long allowed us to do doubles (they broke that many years ago
>> on the Alpha and "fixed" it by changing the documentation, so we
>> have had workarounds for BUGGY_LONG_LONG for a long time).
>
> Huh? I don't understand this remark. What broke "long long" on
> Alpha? I don't imagine it was gcc.

"long long" was specified to be twice as long as "long". It makes perfect
sense to have one data type which is twice as long as the longest "native"
data type, especially to encode multiplication and division operations;
this was the original (RMS) motivation to create that data type after all.
Alpha didn't "break" that, because Alpha has a fast 64x64->128
multiplication (done in two operations, one delivering the low, one the
high result). Actually, the "right" specification of long long should
be "twice as long as intptr_t", because on a 16 bit platform, long could
already be twice as long as the largest native format.

The first GCC port to a 64-bit target came earlier, for MIPS. There, they
initially had a command line switch to select whether long long is 128 or
64 bits, the latter for backward compatibility with people who assume long
long=64 bits. The switch was dropped soon afterwards (I don't know if it
ever worked, because by the time we had a 64-bit MIPS station to port
Gforth to, it had already stopped working).

The Alpha porting team decided to make long long=64 bits, and after we
reported that this is a bug (because it's not according to the
specification of "long long"), they changed the documentation instead
(making it "twice as long as int").

Recently, the situation improved slightly, because AMD put in a "backdoor"
to access a 128 bit data type for amd64. There, you can typedef

typedef int int128_t __attribute__((__mode__(TI)));
typedef unsigned int uint128_t __attribute__((__mode__(TI)));

and apart from converting to and from FP, it actually works (i.e. there are
TI instruction patterns; I haven't checked the FP conversions in GCC 4.2.0,
because the impact is minimal). However, I consider this more as an "easter
egg" than a real feature, as we can't depend on it - we can test if it
works, but that's all.
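
A short sketch of how such a typedef might be used for a 64x64->128 bit
multiply, assuming a 64-bit target and a GCC-compatible compiler. The
names u128 and umul_64x64 are made up for this example:

```c
#include <stdint.h>

/* Same __mode__(TI) technique as in the typedefs above. */
typedef unsigned int u128 __attribute__((__mode__(TI)));

/* Unsigned 64x64 -> 128 multiply, result split into two 64-bit halves. */
void umul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    u128 product = (u128)a * b;       /* widen before multiplying */
    *lo = (uint64_t)product;          /* low 64 bits */
    *hi = (uint64_t)(product >> 64);  /* high 64 bits */
}
```

On a 64-bit system this is the shape that a primitive like UM* needs: a
double-cell product of two single cells.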

>> Unfortunately, the gcc maintainers are working hard at eliminating
>> these advantages, so I would love not to have to rely on gcc.
>
> I can assure you that's not deliberate. We know that stack slot
> allocation and reuse has been poor in gcc, but it is getting better.
> We had some performance regressions as a result of the tree-SSA
> rewrite, but we've managed to claw most of that back. It seems that
> gforth has suffered more than most programs as a result of these
> changes.

Yes, but that means that you should use Gforth as regression test, because
it's one of the rare programs which actually use a significant amount of
special GCC features, has a sufficiently complex control flow (with all the
indirect branches), and other corner cases which make it a good regression
test target. Yes, I know, people like "unit tests", where each test tests
one feature in isolation, but while you ought to have these tests, you also
need the system tests. And for the system tests, you need a sufficiently
complex system with a well-defined benchmark. Using SPEC as benchmark is
not complete, because SPEC sources don't use GCC extensions.

Until 4.2.0, it was hard to see any "progress" (different from the 2.x line,
where 2.95 definitely was the best). And most of what 4.2.0 does better
than before is that the explicit register allocation doesn't conflict with
instructions using this register when it's actually dead.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bernd Paysan

Jun 26, 2007, 10:01:35 AM
Andrew Haley wrote:

>> On Alpha it was not.
>
> OK, but that has nothing at all to do with gcc: the sizes of types are
> determined by the ABI.  gcc implements the ABI of whatever platform it
> runs on.

long long wasn't part of the ABI. long long back then was a GCC extension,
not supported by any other compiler. C99 put long long into the standard,
and therefore now, it's part of the ABI, and you have to take C99's weasel
wording (courtesy of other vendors which have rendered the original C
typing system completely useless by now, like having no C90 integer type
which can hold a pointer - the IL32P64 model).

I'm ok when int128_t is implemented in a useful way; I don't care about how
the pointer type is called, as long as I can check it with autoconf.

Anton Ertl

Jun 26, 2007, 11:05:20 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>> "sl...@jedit.org" <sl...@jedit.org> writes:
>>>Huh? I don't understand this remark. What broke "long long" on
>>>Alpha? I don't imagine it was gcc.
>
>> Long long was documented as being twice as long as "long int".
>
>How odd.

What's odd about that? Long long was a GNU C extension, so gcc was
free to define what it meant.

>> On Alpha it was not.
>
>OK, but that has nothing at all to do with gcc: the sizes of types are
>determined by the ABI. gcc implements the ABI of whatever platform it
>runs on.

Not when it does not occur in library functions and structures (e.g.,
gcc did not follow the ABI when passing structures as arguments); I
doubt that long long occurred in any functions or structures, certainly
not in functions that we used.

Deviating from GCC's defined API broke Gforth, and cost us quite a bit
of time to work around (and is still costing us time, as the two cases
have to be programmed and tested for every primitive that involves a
double-cell type).
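
To illustrate the second of the two cases: without a usable double-width
integer type, a double-cell primitive has to handle the carry by hand.
A minimal sketch, with made-up type and function names and assuming a
cell is pointer-sized (this is not Gforth's actual code):

```c
#include <stdint.h>

/* Hypothetical names; a cell is taken to be pointer-sized. */
typedef uintptr_t ucell;
typedef struct { ucell lo; ucell hi; } udcell;

/* Double-cell addition built from two single-cell additions. */
udcell udcell_add(udcell a, udcell b)
{
    udcell r;
    r.lo = a.lo + b.lo;                  /* wraps around on overflow */
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* add the carry from the low cell */
    return r;
}
```

Where long long really is twice the cell size, the same primitive is a
single addition on that type, which is why both variants have to be
written and tested.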

>>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>>> these advantages, so I would love not to have to rely on gcc.
>>>
>>>I can assure you that's not deliberate.
>
>> Well, I think it's an attitude problem.

...
>> My impression is that any code that's not ANSI C is considered
>> unworthy (including code using documented GNU C extensions), and
>> reporters of bugs in this area are made to feel unwelcome, e.g., by
>> "resolving" a bug report as "invalid" in less time than it took to
>> prepare the bug report (PR25285); comments like
>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c2
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c7
>> <121j5pa...@news.supernews.com> ff.
>
>> also contribute to my impression.
>
>The thing is that for you this seems to be a bad attitude,

I am sorry that I was unclear in the description of the attitude. If
you want a label for it, let's call it the "ANSI C blinders" attitude;
those suffering from that attitude prefer not to even think about C
code that's outside ANSI C. This attitude is exemplified here:

>What does it mean to jump out of a
>statement expression, anyway?

I'm sure that, if you dared to think what it means, you would find a
reasonable interpretation, but as long as you have the blinders on,
you will only see that this is not ANSI C, and will therefore not
think of a meaning.

If you need help, here's a hint: What does it mean to call longjmp()
or exit() in an expression; that can occur even in an ANSI C program,
no?
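
A minimal sketch of the hint, in plain ANSI C: the call to bail()
abandons the half-evaluated addition, and control resumes at the
setjmp(). The names are made up for the example.

```c
#include <setjmp.h>

static jmp_buf escape;

/* Never returns normally: control goes back to the setjmp() below. */
static int bail(void)
{
    longjmp(escape, 1);
    return 0;               /* not reached */
}

/* Returns x + 1 for non-negative x, and -1 if evaluation was abandoned. */
int checked_increment(int x)
{
    if (setjmp(escape) != 0)
        return -1;          /* arrived here via longjmp() */
    return x + (x < 0 ? bail() : 1);
}
```

For a negative x the addition never completes, yet the program's
behaviour is well defined.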

>The core issue here is that many of the gcc extensions never were
>properly defined, so the corner cases don't work correctly.

The case mentioned above is where the meaning of the extension is
pretty straightforward, and the non-working comes from a bug in the
compiler (I guess it does not reset the stack depth before performing
the jump). But of course with ANSI C blinders on you only see a
non-conforming program, and a compiler that therefore has the right
not to work.

If you think that the extensions are not defined well enough, get
someone to specify them well enough. I could specify them in language
like that used in the ANSI C document, if I had the time.

> As a
>result of that there is a strong opinion against any new extensions,
>and some pressure to deprecate old ones. In the main that pressure
>has been resisted, though.

On the surface, maybe. But when it comes to dealing with bug reports,
the attitude shines through.

Is that attitude bad? Well, if you are working on a program that
relies on GNU C extensions, it's bad for you. We are in that
situation.

>OK, so that's progress. This is for 32-bit code, is it?

Yes.

> I guess it
>must be, given that gcc 2.95 didn't support 64-bit code.

It does not support AMD64 (it does support, e.g., Alpha). Hmm, maybe
it would be less work to maintain gcc 2.95, and add a few bug fixes
and an AMD64 port rather than jumping through the hoops that the more
recent gcc versions have put up for us. But every hoop is not that
much work, and I always hope that it will be the last one; but I am
beginning to fear that, with the attitude problem, there will be many
more hoops to come.

Andrew Haley

Jun 26, 2007, 11:56:11 AM
Bernd Paysan <bernd....@gmx.de> wrote:
> Andrew Haley wrote:
>>> - Long long allowed us to do doubles (they broke that many years ago
>>> on the Alpha and "fixed" it by changing the documentation, so we
>>> have had workarounds for BUGGY_LONG_LONG for a long time).
>>
>> Huh? I don't understand this remark. What broke "long long" on
>> Alpha? I don't imagine it was gcc.

> "long long" was specified to be twice as long as "long".

Ahh, I see, fair enough. Either I never knew that or I'd forgotten.

So, did DEC's Alpha compiler not have a 64-bit long long type?

>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>> these advantages, so I would love not to have to rely on gcc.
>>
>> I can assure you that's not deliberate. We know that stack slot
>> allocation and reuse has been poor in gcc, but it is getting better.
>> We had some performance regressions as a result of the tree-SSA
>> rewrite, but we've managed to claw most of that back. It seems that
>> gforth has suffered more than most programs as a result of these
>> changes.

> Yes, but that means that you should use Gforth as regression test,
> because it's one of the rare programs which actually use a
> significant amount of special GCC features, has a sufficiently
> complex control flow (with all the indirect branches), and other
> corner cases which make it a good regression test target.

Indeed so, but it's also sufficiently unusual that it's hard to make a
good case for it as a general-purpose test.

> Until 4.2.0, it was hard to see a "progress" (different from the 2.x
> line, where 2.95 definitely was the best). And most of what 4.2.0
> does better than before is that the explicit register allocation
> doesn't conflict with instructions using this register when it's
> actually dead.

I know it's been hard to see the progress from the outside, but most
of the changes were infrastructure re-engineering to make more advanced
optimization possible. Inevitably, there have been some degradations
along the way but we had to do the work to make moving forward
possible. It is, I admit, a great shame that gforth has suffered
fallout from this.

Andrew.

Andrew Haley

unread,
Jun 26, 2007, 12:34:14 PM6/26/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>>> "sl...@jedit.org" <sl...@jedit.org> writes:
>>>>Huh? I don't understand this remark. What broke "long long" on
>>>>Alpha? I don't imagine it was gcc.
>>
>>> Long long was documented as being twice as long as "long int".
>>
>>How odd.

> What's odd about that? Long long was a GNU C extension, so gcc was
> free to define what it meant.

Sure, but as I said to Bernd I had forgotten that it was ever defined
that way. In hindsight it was surely a mistake to define it in such a
way, but that's all history now.

Please, cut the sarcasm. It's not helping.

> If you need help, here's a hint: What does it mean to call longjmp()
> or exit() in an expression; that can occur even in an ANSI C program,
> no?

exit() and longjmp() return void, so it doesn't make any sense to use
them in a non-void context. The question here is, therefore, whether
it ever makes sense to have a statement expression whose value is
void. I suspect it probably does, since the spec says

------------------------------------
The last thing in the compound statement should be an expression
followed by a semicolon; the value of this subexpression serves as the
value of the entire construct. (If you use some other kind of statement
last within the braces, the construct has type `void', and thus
effectively no value.)
------------------------------------

so what should happen is that the spec is tightened and the
implementation corrected, if indeed it's still broken.
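
To make the quoted rule concrete, here is a minimal sketch in GNU C (statement expressions are a gcc extension, so this is not ISO C). The names MAX_ONCE, max_once and bump are made up for illustration. The first construct ends in an expression and so has a value; the second ends in an if statement and so has type void.

```c
#include <assert.h>

/* GNU C only: ({ ... }) is a statement expression.
   The value of the construct is the value of its last expression statement. */
#define MAX_ONCE(a, b) ({ int a_ = (a); int b_ = (b); a_ > b_ ? a_ : b_; })

/* Hypothetical helper: each argument is evaluated exactly once. */
int max_once(int x, int y)
{
    return MAX_ONCE(x, y);
}

/* The last statement inside the braces is an if statement, not an expression,
   so the whole construct has type void and can only be used where no value
   is needed. */
void bump(int *counter)
{
    assert(counter);
    ({ (*counter)++; if (*counter < 0) *counter = 0; });
}
```

The void-typed form is the one relevant to the question of whether a statement expression may end in something other than a value.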

>>The core issue here is that many of the gcc extensions never were
>>properly defined, so the corner cases don't work correctly.

> The case mentioned above is where the meaning of the extension is
> pretty straightforward, and the non-working comes from a bug in the
> compiler (I guess it does not reset the stack depth before performing
> the jump). But of course with ANSI C blinders on you only see a
> non-conforming program, and a compiler that therefore has the right
> not to work.

What we need in order not to have these kinds of arguments is a
language that conforms to ISO C + a bunch of well-defined extensions.
If the extension in question had been well-defined, the kind of
arguments you've had would not have been possible.

> If you think that the extensions are not defined well enough, get
> someone to specify them well enough. I could specify them in
> language like that used in the ANSI C document, if I had the time.

>>As a result of that there is a strong opinion against any new
>>extensions, and some pressure to deprecate old ones. In the main
>>that pressure has been resisted, though.

> On the surface, maybe. But when it comes to dealing with bug reports,
> the attitude shines through.

OK, but that's my point: to you it's a bad attitude, to me it's a
perfectly reasonable response.

> Is that attitude bad? Well, if you are working on a program that
> relies on GNU C extensions, it's bad for you. We are in that
> situation.

>>OK, so that's progress. This is for 32-bit code, is it?

> Yes.

>> I guess it
>>must be, given that gcc 2.95 didn't support 64-bit code.

> It does not support AMD64 (it does support, e.g., Alpha). Hmm,
> maybe it would be less work to maintain gcc 2.95, and add a few bug
> fixes and an AMD64 port rather than jumping through the hoops that
> the more recent gcc versions have put up for us. But every hoop is
> not that much work, and I always hope that it will be the last one;
> but I am beginning to fear that, with the attitude problem, there
> will be many more hoops to come.

Well, maybe. But you are so absurdly rude when you post on the
subject that it's hard to have a sensible discussion. I really would
like to have a sensible discussion about how to fix some of the
problems you're having with gcc.

Andrew.

Bernd Paysan

unread,
Jun 27, 2007, 3:56:40 AM6/27/07
to
Andrew Haley wrote:
> Sure, but as I said to Bernd I had forgotten that it was ever defined
> that way.  In hindsight it was surely a mistake to define it in such a
> way, but that's all history now.

IMHO the definition was right, and if the Alpha people had asked RMS about
his reasoning, they probably wouldn't have changed it. It's the "ANSI C"
blinder attitude shining through here again: long long is a GCC extension,
so "it can mean whatever we like it". That's not correct, the GCC
extensions are all there for good reasons.

This is all history now, for sure, and C99 has a better way to select types
with a specific size, so GCC should support this scheme - e.g. a way to
access int128_t on all 64 bit platforms without GCC-specific attributes
would be nice to have.

Andrew Haley

unread,
Jun 27, 2007, 6:35:16 AM6/27/07
to
Bernd Paysan <bernd....@gmx.de> wrote:
> Andrew Haley wrote:

>> Sure, but as I said to Bernd I had forgotten that it was ever
>> defined that way. In hindsight it was surely a mistake to define
>> it in such a way, but that's all history now.

> IMHO the definition was right, and if the Alpha people had asked RMS
> about his reasoning, they probably wouldn't have changed it. It's
> the "ANSI C" blinder attitude shining through here again: long long
> is a GCC extension, so "it can mean whatever we like it".

No, long long wasn't only a gcc-specific extension: several compiler
vendors also supported it or were about to, so it wasn't possible for
gcc simply to define what it meant without regard to anyone else.
There's an excruciating discussion at
http://yarchive.net/comp/longlong.html

> This is all history now, for sure, and C99 has a better way to
> select types with a specific size, so GCC should support this scheme
> - e.g. a way to access int128_t on all 64 bit platforms without
> GCC-specific attributes would be nice to have.

AFAIAA intN_t is provided by the C library, not the compiler, so it's
up to the library vendors. The split between library and compiler is
specified by the standard.
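
As an illustration of what the fixed-width types buy a Forth implementer, here is a small sketch that assumes a 32-bit cell. The typedef names and the functions m_star, dcell_high and dcell_low are hypothetical and not taken from Gforth; the only library facility used is <stdint.h>.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  Cell;    /* assumed 32-bit Forth cell */
typedef uint32_t UCell;
typedef int64_t  DCell;   /* double cell */

/* The property the thread argues about: a double cell is exactly twice
   as wide as a cell, independent of what "long long" happens to be. */
void check_widths(void)
{
    assert(sizeof(DCell) == 2 * sizeof(Cell));
}

/* Mixed multiplication in the style of Forth's M*: cell * cell -> double cell. */
DCell m_star(Cell a, Cell b)
{
    return (DCell)a * (DCell)b;
}

/* Split a double cell into its high and low halves.
   The shift is done on the unsigned representation to stay well-defined. */
UCell dcell_high(DCell d) { return (UCell)((uint64_t)d >> 32); }
UCell dcell_low(DCell d)  { return (UCell)((uint64_t)d & 0xFFFFFFFFu); }
```

With these types the width relation is stated by the type names rather than inferred from the platform's "long" and "long long".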

Andrew.

Anton Ertl

unread,
Jun 27, 2007, 7:15:35 AM6/27/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>What does it mean to jump out of a statement expression, anyway?
...

>> If you need help, here's a hint: What does it mean to call longjmp()
>> or exit() in an expression; that can occur even in an ANSI C program,
>> no?
>
>exit() and longjmp() return void, so it doesn't make any sense to use
>them in a non-void context.

To be pedantic, exit() and longjmp() don't return.

Still, to satisfy the type checker, one has to use them in a context
where a void "result" is ok. An example similar to the one discussed
in

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c8

would be

f("%d\n", (longjmp(buf, 1), 0))

Can you find a meaning for that? Now compare with

f("%d\n", ({goto a; 0;}))

Can you now find a meaning for that?
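
The ISO C half of the comparison can be tried out directly. The following is a minimal sketch with made-up names (consume, run, longjmp_demo): control leaves in the middle of evaluating an argument, and the function that was waiting for the value is simply never called.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf buf;
static int consume_calls;   /* how often consume() actually ran */

/* Hypothetical consumer of the expression's value. */
static int consume(int value)
{
    consume_calls++;
    return value;
}

/* Returns consume()'s result, or -1 if control arrived via longjmp(). */
int run(int escape)
{
    if (setjmp(buf) != 0)
        return -1;                              /* consume() was skipped */
    if (escape)
        return consume((longjmp(buf, 1), 0));   /* leaves mid-expression */
    return consume(7);
}

/* Usage: the second call escapes, so the call count stays at 1. */
void longjmp_demo(void)
{
    consume_calls = 0;
    assert(run(0) == 7 && consume_calls == 1);
    assert(run(1) == -1 && consume_calls == 1);
}
```

The statement-expression version with goto cannot be written portably, which is the point under dispute, so it is not shown here.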

>>>The core issue here is that many of the gcc extensions never were
>>>properly defined, so the corner cases don't work correctly.
>
>> The case mentioned above is where the meaning of the extension is
>> pretty straightforward, and the non-working comes from a bug in the
>> compiler (I guess it does not reset the stack depth before performing
>> the jump). But of course with ANSI C blinders on you only see a
>> non-conforming program, and a compiler that therefore has the right
>> not to work.
>
>What we need in order not to have these kinds of arguments is a
>language that conforms to ISO C + a bunch of well-defined extensions.
>If the extension in question had been well-defined, the kind of
>arguments you've had would not have been possible.

So if ANSI C was well-defined, it would be impossible for us to argue
what it means to call longjmp() or exit() in an expression, right?

In any case, if you want a tighter specification of the extension, you
could commission such a spec.

>I really would
>like to have a sensible discussion about how to fix some of the
>problems you're having with gcc.

So do you have any ideas how they could be fixed? I used to write bug
reports for gcc. But the reaction to PR25285 convinced me that this
is a waste of time.

Andrew Haley

unread,
Jun 27, 2007, 8:10:08 AM6/27/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>What does it mean to jump out of a statement expression, anyway?
> ...
>>> If you need help, here's a hint: What does it mean to call longjmp()
>>> or exit() in an expression; that can occur even in an ANSI C program,
>>> no?
>>
>>exit() and longjmp() return void, so it doesn't make any sense to use
>>them in a non-void context.

> To be pedantic, exit() and longjmp() don't return.

Oh, for goodness' sake! Let's move on.

>>>>The core issue here is that many of the gcc extensions never were
>>>>properly defined, so the corner cases don't work correctly.
>>
>>> The case mentioned above is where the meaning of the extension is
>>> pretty straightforward, and the non-working comes from a bug in the
>>> compiler (I guess it does not reset the stack depth before performing
>>> the jump). But of course with ANSI C blinders on you only see a
>>> non-conforming program, and a compiler that therefore has the right
>>> not to work.
>>
>>What we need in order not to have these kinds of arguments is a
>>language that conforms to ISO C + a bunch of well-defined extensions.
>>If the extension in question had been well-defined, the kind of
>>arguments you've had would not have been possible.

> So if ANSI C was well-defined, it would be impossible for us to argue
> what it means to call longjmp() or exit() in an expression, right?

Given sufficient bloody-mindedness it's possible to argue about
anything, as you amply demonstrate.

> In any case, if you want a tighter specification of the extension, you
> could commission such a spec.

>>I really would like to have a sensible discussion about how to fix
>>some of the problems you're having with gcc.

> So do you have any ideas how they could be fixed?

That depends on the specific problem. Some things, for example
register allocation, are very hard, and in any case are being actively
worked on. Other things might be easier.

What would be interesting to me is to know the most important gcc
deficiencies from the point of view of GForth and your estimate of how
significant these deficiencies are. There might well be some push-back
from gcc developers, with the claim that GForth "isn't
representative". But I'm not convinced of that, as I suspect that some
of the problems you've seen might have an impact on other code bases.
There is more to the world than SPECint.

> I used to write bug reports for gcc. But the reaction to PR25285
> convinced me that this is a waste of time.

Well, I agree that Andrew Pinski's Comment #3 seems to be very
unhelpful. However, it's not just a matter of writing bug reports,
but of finding a gcc maintainer who understands the problem and is
motivated to work on it.

Andrew.

Bernd Paysan

unread,
Jun 28, 2007, 4:28:17 AM6/28/07
to
Andrew Haley wrote:
> What would be interesting to me is to know the most important gcc
> deficiencies from the point of view of GForth and your estimate of how
> significant these deficiencies are. There might well be some push-back
> from gcc developers, with the claim that GForth "isn't
> representative". But I'm not convinced of that, as I suspect that some
> of the problems you've seen might have an impact on other code bases.
> There is more to the world than SPECint.

One obvious deficiency in the current GCC is that it deals poorly with copy
propagation, and that must impact every program. The code GCC 4.1.2
compiles on x86_64 for the + primitive is a good example (and 4.2.0
compiles almost the same code):

# +
#NO_APP
leaq 8(%r15), %rax ; create AGU bubble
addq $8, %rbx ; create another AGU bubble
addq (%rax), %r14
movq %rax, %r15
.L261:
movq -8(%rbx), %rbp
.L262:
movq %rbp, %rax
jmp *%rax

The update of the address before use is a "no-no" in modern pipelined CPUs,
as it introduces a pipeline bubble. The two movq reg,reg in this example
are completely superfluous.

What I would write is:

# +
#NO_APP
addq 8(%r15), %r14
movq (%rbx), %rax ; avoid AGU bubble
addq $8, %r15
addq $8, %rbx
.L261:
jmp *%rax

Actually, we could change our source code so that GCC has more opportunity
to avoid the second AGU bubble (it's optimized for register pressure to
make it possible to combine the jump into

jmp *-8(%rbx)

, but the way GCC 4.x handles indirect jumps prevents such an optimization,
anyway).
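
For readers who have not seen the engine structure that produces code like the above, here is a minimal sketch of threaded-code dispatch using the GNU C labels-as-values extension. It is not Gforth's engine: the opcodes, the run_threaded name and the label table are assumptions for illustration, and it indexes a table by opcode, whereas Gforth stores code addresses directly in the threaded code. There is no stack overflow check.

```c
#include <assert.h>
#include <stddef.h>

typedef long Cell;

enum { OP_LIT, OP_ADD, OP_HALT };   /* hypothetical instruction set */

Cell run_threaded(const Cell *program)
{
    /* One label per VM instruction; &&label is a GNU C extension. */
    static void *const labels[] = { &&do_lit, &&do_add, &&do_halt };
    Cell stack[16];
    Cell *sp = stack + 16;          /* stack grows downwards */
    const Cell *ip = program;

    assert(program != NULL);

    /* Every primitive ends with its own indirect jump, which is what gives
       threaded code its branch prediction advantage. */
#define NEXT goto *labels[*ip++]

    NEXT;

do_lit:                             /* push the inline literal */
    *--sp = *ip++;
    NEXT;

do_add: {                           /* the "+" primitive */
        Cell n2 = sp[0];
        Cell n1 = sp[1];
        sp += 1;
        sp[0] = n1 + n2;
    }
    NEXT;

do_halt:
    return sp[0];
#undef NEXT
}
```

The do_add block corresponds to the "+" primitive whose generated code is being discussed; a compiler that merges all the NEXT jumps into one shared indirect jump removes the per-primitive branch.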

Andrew Haley

unread,
Jun 28, 2007, 6:06:59 AM6/28/07
to

Hmm, OK. Just to be clear that we are talking about the same block of
code, I've included what I think is the corresponding source code. Is
this right?

J_plus: asm(""); I_plus:

{ saved_ip=ip; asm(""); }
{
Label ca;
Cell n1;
Cell n2;
Cell n;
({cfa1=cfa; cfa=*ip;});
((n1)=(Cell)(sp[1]));
((n2)=(Cell)((sp[0])));

sp += 1;
{
# 681 "./prim"
n = n1+n2;
# 2883 "prim.i"
}

({ip++; ca=*cfa;});
(((sp[0]))=(Cell)(n));
K_plus:
({goto *ca;});
}

I'm rather worried that this may turn out to be the reload pass
performing poorly, but let's see.

Andrew.

Bernd Paysan

unread,
Jun 28, 2007, 7:26:40 AM6/28/07
to
Andrew Haley wrote:

Not exactly, should look like this (fast with TOS cached, from the latest
snapshot):

H_plus: asm(""); I_plus:
asm("# " "+");
{

Cell n1;
Cell n2;
Cell n;

;
((n1)=(Cell)(sp[1]));
((n2)=(Cell)(spTOS));
sp += 1;
{
n = n1+n2;
}
(ip++);
((spTOS)=(Cell)(n));
K_plus: asm("");
do {asm("":"=X"(cfa)); do {(real_ca=(*(ip-1)));} while(0);} while(0);
J_plus: asm("");
goto *real_ca;
}

> I'm rather worried that this may turn out to be the reload pass
> performing poorly, but let's see.

Anton Ertl

unread,
Jun 28, 2007, 8:38:25 AM6/28/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>I really would like to have a sensible discussion about how to fix
>>>some of the problems you're having with gcc.
>
>> So do you have any ideas how they could be fixed?
>
>That depends on the specific problem. Some things, for example
>register allocation, are very hard, and in any case are being actively
>worked on. Other things might be easier.
>
>What would be interesting to me is to know the most important gcc
>deficiencies from the point of view of GForth and your estimate of how
>significant these deficiencies are.

1) Register allocation is one issue, but that is probably not easy to
solve. One thing that I can imagine something can be done about is
that, even on platforms with many registers, like Alpha, MIPS, or
AMD64, gcc allocates registers badly for the Gforth engine: we can
barely allocate the virtual machine registers into real registers, but
then we don't have any registers left for stack caching.

The reason for this seems to be that these machines have few
callee-saved registers, and gcc seems to allocate our virtual machine
registers only into these registers (probably because they survive a
number of calls). What's worse, even with explicit register
allocation we cannot get around that, because we can only use
callee-saved registers there, too (at least last time I tried). The
only architecture where I am happy about the registers is PPC, because
it has many callee-saved registers.

It would be great if gcc would make better use of the registers by
itself, but if not, I would at least like to do it myself using
explicit register allocation.

2) Fixing PR25285. This one strikes at unpredictable times, as the
example with gcc-4.1.2 without and with STACK_CACHE_DEFAULT_FAST=0
shows. We do have a workaround for that, but that workaround has a
negative performance impact even if the compiler does not exhibit
PR25285. Speed impacts:

Difference between raw PR25285 and the workaround:

sieve bubble matrix fib
0.476 0.748 0.280 0.476 PR25285 without workaround
0.364 0.524 0.288 0.472 with workaround enabled

The cost of the workaround on a compiler without PR25285:

sieve bubble matrix fib
0.208 0.296 0.108 0.336 no PR25285 without workaround
0.248 0.340 0.120 0.384 with workaround enabled

Other programs are affected even more by this. In particular, in our
work on using dynamic superinstructions in the Cacao JVM interpreter
<http://www.complang.tuwien.ac.at/papers/ertl+06dotnet.ps.gz>, we
found slowdowns by up to a factor of 2 by enabling or disabling the
"throw" feature that is affected by PR25285. I found a workaround for
that, but it is brittle; it worked if we did not add static
superinstructions, but when I did add static superinstructions, it no
longer worked.

3) Code arrangement. Not a problem with current gcc versions (apart
from PR25285), but it has been in the past, so maybe one should add a
test case or some other way to remind the maintainers of this issue.
We generate code dynamically by taking code fragments (between two
labels (as values)) that gcc has generated and copying the fragments
elsewhere. In order for this technique to work, there is one
requirement: If a piece of source code is between two labels, the
corresponding executable code must be between the addresses
corresponding to these labels. This property should be guaranteed, at
least through a compiler option like -fno-reorder-blocks (PR25285
breaks it).

The performance impact of dynamic code generation is typically a
factor of 2, but sometimes much higher:

sieve bubble matrix fib
0.212 0.292 0.108 0.336 dynamic code generation
0.420 0.540 0.704 0.696 no dynamic code generation
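
The "factor of 2, but sometimes much higher" can be read off the table by dividing the rows. A small sketch that does the arithmetic (the array and function names are made up; the numbers are copied from the table above):

```c
#include <assert.h>

/* Times in seconds, in the column order sieve, bubble, matrix, fib. */
static const double with_dynamic[4]    = { 0.212, 0.292, 0.108, 0.336 };
static const double without_dynamic[4] = { 0.420, 0.540, 0.704, 0.696 };

/* Speedup of the fast configuration over the slow one. */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}

/* Speedup from dynamic code generation for benchmark column i (0..3). */
double dynamic_speedup(int i)
{
    assert(i >= 0 && i < 4);
    return speedup(without_dynamic[i], with_dynamic[i]);
}
```

This gives roughly 1.98 for sieve, 1.85 for bubble, 6.52 for matrix and 2.07 for fib.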

>There might well be some push-back
>from gcc developers, with the claim that GForth "isn't
>representative". But I'm not convinced of that, as I suspect that some
>of the problems you've seen might have an impact on other code bases.

1) Register allocation affects everyone. Most interpreters will have
register liveness characteristics similar to Gforth.

2,3) The code arrangement issue and PR25285 affect systems that use
similar techniques to Gforth. Apart from Gforth, SableVM, and the
Cacao interpreter, two other projects I know that use similar
techniques are Qemu and the Tempo partial evaluator. I have read
somewhere that Qemu has big problems with recent gccs thanks to a code
generation issue that is somewhat similar to PR25285, except that it
involves returns instead of general indirect jumps.

>> I used to write bug reports for gcc. But the reaction to PR25285
>> convinced me that this is a waste of time.
>
>Well, I agree that Andrew Pinski's Comment #3 seems to be very
>unhelpful. However, it's not just a matter of writing bug reports,
>but of finding a gcc maintainer who understands the problem and is
>motivated to work on it.

Yes. So how do we find one?

Andrew Haley

unread,
Jun 28, 2007, 10:48:41 AM6/28/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>I really would like to have a sensible discussion about how to fix
>>>>some of the problems you're having with gcc.
>>
>>> So do you have any ideas how they could be fixed?
>>
>>That depends on the specific problem. Some things, for example
>>register allocation, are very hard, and in any case are being
>>actively worked on. Other things might be easier.
>>
>>What would be interesting to me is to know the most important gcc
>>deficiencies from the point of view of GForth and your estimate of how
>>significant these deficiencies are.

> 2) Fixing PR25285.

OK, I might be able to find time to look at this.

>>> I used to write bug reports for gcc. But the reaction to PR25285
>>> convinced me that this is a waste of time.
>>
>>Well, I agree that Andrew Pinski's Comment #3 seems to be very
>>unhelpful. However, it's not just a matter of writing bug reports,
>>but of finding a gcc maintainer who understands the problem and is
>>motivated to work on it.

> Yes. So how do we find one?

I'm here. However, some of the problems you mention are already being
actively worked on and in any case are almost research-grade problems.
Vladimir Makarov has written various register allocators, and there
are rumours of a new one for gcc that might improve things.

Some of the other things you mention are likely to be viewed by gcc
maintainers as not a bug, in particular the practice of taking code
fragments that gcc has generated and copying them elsewhere. I would
find it very hard to defend a patch for that.

Andrew.

Bernd Paysan

unread,
Jun 28, 2007, 11:57:50 AM6/28/07
to
Andrew Haley wrote:
> I'm here. However, some of the problems you mention are already being
> actively worked on and in any case are almost research-grade problems.
> Vladimir Makarov has written various register allocators, and there
> are rumours of a new one for gcc that might improve things.

One particular gripe I have with typical register allocators and the usage
pattern of registers in code like Gforth is that when such an allocator
fails at some point, the register ends up in memory all the time. That's
clearly not optimal. One comp.arch poster had a footer that said that most
performance problems resort to caching, and register allocation can be
treated just the same. Caching means that a variable that's loaded into a
register should live there as long as possible, but when no longer
possible, writing it to memory and later reloading it is perfectly ok. The
metric to minimize is the number of dynamic loads and stores. So if a value
sits in a register, and needs to be stored in memory for function calls
(caller-save), this is ok, as long as the function calls aren't more
frequent than accesses to the particular value.
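
A back-of-the-envelope version of that metric, as a sketch only: the function names are hypothetical, and the cost of one store plus one load per call is an assumption of this model, not something measured.

```c
#include <assert.h>

/* Memory operations if the value lives in memory all the time:
   every dynamic access is a load or a store. */
long cost_always_in_memory(long accesses)
{
    assert(accesses >= 0);
    return accesses;
}

/* Memory operations if the value lives in a caller-saved register and is
   stored before and reloaded after each call (assumed: 2 per call). */
long cost_save_around_calls(long calls)
{
    assert(calls >= 0);
    return 2 * calls;
}

/* Nonzero if keeping the value in a register and saving it only around
   calls needs fewer dynamic loads and stores. */
int split_lifetime_wins(long accesses, long calls)
{
    return cost_save_around_calls(calls) < cost_always_in_memory(accesses);
}
```

Under this model a value that is touched 1000 times between 10 calls costs 20 memory operations instead of 1000.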

This also helps for the x86 problem with its irregular register file - if
you find out that you need CX for an occasional shift operation, you then
still can use it for global values, which have to be pushed out just when
the shift operation needs CX (same for AX:DX for multiplications or
SI/DI/CX for string instructions).

There are a number of complications with this sort of split lifetime
register allocators, especially since debuggers have problems to localize
values when they change their position during the program flow. One interim
solution is to disable this sort of optimization when debugging.

Andrew Haley

unread,
Jun 28, 2007, 1:03:02 PM6/28/07
to
Bernd Paysan <bernd....@gmx.de> wrote:

> There are a number of complications with this sort of split lifetime
> register allocators, especially since debuggers have problems to
> localize values when they change their position during the program
> flow. One interim solution is to disable this sort of optimization
> when debugging.

With the move to static single assignment form (aka SSA), gcc
effectively does this anyway. A program like this:

int poo (int N, int M, int I)
{
  int a, b;

  a = N;
  a = a + M;
  a = a * 7;
  b = (I - a);
  b = b * b;
  return a ^ b;
}

gets translated to:

poo (N, M, I)
{
  int a.26;
  int b;
  int a;

  a = N + M;
  a.26 = a * 7;
  b = I - a.26;
  return a.26 ^ b * b;
}

where every assignment creates a separate temporary variable.

Andrew.

Andrew Haley

unread,
Jun 28, 2007, 1:29:06 PM6/28/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> wrote:
> every assignment creates a separate temporary variable.

Just to follow up on my own posting, I realized afterwards that I was
being slightly simplistic, to the point of misleading. Before code
generation, gcc converts out of SSA back into standard form, and this
can result in the separate SSA variables being coalesced back into a
single temporary variable.

Andrew.

Anton Ertl

unread,
Jun 29, 2007, 8:42:43 AM6/29/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Some of the other things you mention are likely to be viewed by gcc
>maintainers as not a bug, in particular the practice of taking code
>fragments that gcc has generated and copying them elsewhere. I would
>find it very hard to defend a patch for that.

You mean that gcc maintainers would not consider it a bug if they
broke this technique? I think they should. It's a practice that
worked in earlier gccs (in particular, gcc-2.95), and that still works
well enough for Gforth. That practice provides speedups by a factor
of 2 or more for Gforth and the Cacao interpreter, and AFAIK it's also
very important for Qemu. I expect (but have not checked) that it is
also used in Squeak.

And the cost of supporting that practice does not seem too expensive:
at every optimization that might reorder the code, add a bypass, and
use a command-line flag (if not -fno-reorder-blocks, then something
else) to decide between using the optimization and the bypass.

Compare the cost of that and the speedup by a factor of 2 in Gforth
and similar applications to the cost and benefit of other
optimizations you do; how many do you have that give a speedup by a
factor of 2 on real-world uses of some applications? And are these as
simple to implement as the one I suggest?

Would it be considered important enough if it gave a nice speedup for
Python or Ruby?

Anton Ertl

unread,
Jun 29, 2007, 9:17:35 AM6/29/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Bernd Paysan <bernd....@gmx.de> wrote:
>
>> There are a number of complications with this sort of split lifetime
>> register allocators, especially since debuggers have problems to
>> localize values when they change their position during the program
>> flow. One interim solution is to disable this sort of optimization
>> when debugging.
>
>With the move to static single assignment form (aka SSA), gcc
>effectively does this anyway.

I don't think so (even if the coalescing back does not happen).

In SSA form our engine looks like this:

indirect_jump: #introduced by gcc for all the separate "goto *"s we have
sp.0 = phi(sp.1, ...., sp.n);
ip.0 = phi(ip.1, ...., ip.n);
goto *target;

I_free:
free(...);
ip.1 = ip.0+1;
target = ip.1[-1];
goto indirect_jump;

As you can see, in the case of FREE, sp does not change, and ip is
only changed after the call to free(). So both ip.0 and sp.0 survive
the call. With a gcc-style register allocator, if it does not put
them into a callee-saved register, it will spill them, and not just
around the call to free(), but everywhere.

Bernd Paysan

unread,
Jun 29, 2007, 10:46:07 AM6/29/07
to
Anton Ertl wrote:
> Would it be considered important enough if it gave a nice speedup for
> Python or Ruby?

IIRC, there was or is a project to use vmgen for Ruby (carbone), but
apparently that project isn't followed. There's also YARV, which is a
similar stack-based approach, but doesn't use vmgen directly.

Bernd Paysan

unread,
Jun 29, 2007, 10:57:58 AM6/29/07
to
Anton Ertl wrote:

> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Bernd Paysan <bernd....@gmx.de> wrote:
>>
>>> There are a number of complications with this sort of split lifetime
>>> register allocators, especially since debuggers have problems to
>>> localize values when they change their position during the program
>>> flow. One interim solution is to disable this sort of optimization
>>> when debugging.
>>
>>With the move to static single assignment form (aka SSA), gcc
>>effectively does this anyway.
>
> I don't think so (even if the coalescing back does not happen).

I think the argument is different: The main roadblock for using a split
lifetime register allocator, when I last proposed one (must be 10 years
ago) was that gdb has problems with that. Now since GCC may have "walking"
variables, anyway (through the SSA optimizations), the roadblock of not
being able to debug this stuff is already gone. So it's possible to replace
GCC's current allocator with such an allocator without giving up debugging.

I don't know how important the coalescing is, because stack frame packing is
also better achieved by treating the stack frame as (hypothetically
infinitely large) register file. The algorithm should free "dead" slots
(like in a register file), and try to put new values into the slots with
the smallest offset (smallest, because that way the "hot" stuff gets in as
few cache lines as possible, and on variable length architectures like x86,
8 bit offsets give better code size). The same thing is also useful for
register files in some architectures which have a save-multiple-register
instruction (like ARM), where caller-saved registers should live in
consecutive registers.
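
The slot-packing rule described above (free dead slots, reuse the lowest free offset first) can be sketched in a few lines. The Frame type, NUM_SLOTS and the function names are assumptions for illustration; this is not how gcc allocates stack slots.

```c
#include <assert.h>

#define NUM_SLOTS 32   /* hypothetical frame size, in slots */

typedef struct {
    unsigned char used[NUM_SLOTS];   /* 1 while a live value occupies the slot */
} Frame;

void frame_init(Frame *f)
{
    int i;
    for (i = 0; i < NUM_SLOTS; i++)
        f->used[i] = 0;
}

/* Put a new value into the free slot with the smallest offset, so that the
   hot values stay packed together. Returns -1 if the frame is full. */
int frame_alloc(Frame *f)
{
    int i;
    for (i = 0; i < NUM_SLOTS; i++) {
        if (!f->used[i]) {
            f->used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called when the value held in the slot is dead. */
void frame_free(Frame *f, int slot)
{
    assert(slot >= 0 && slot < NUM_SLOTS && f->used[slot]);
    f->used[slot] = 0;
}
```

A slot freed in the middle of the frame is handed out again before any higher offset is touched, which keeps the occupied part of the frame dense.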

Andrew Haley

unread,
Jun 29, 2007, 11:22:29 AM6/29/07
to
Bernd Paysan <bernd....@gmx.de> wrote:
> Anton Ertl wrote:

>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Bernd Paysan <bernd....@gmx.de> wrote:
>>>
>>>> There are a number of complications with this sort of split lifetime
>>>> register allocators, especially since debuggers have problems to
>>>> localize values when they change their position during the program
>>>> flow. One interim solution is to disable this sort of optimization
>>>> when debugging.
>>>
>>>With the move to static single assignment form (aka SSA), gcc
>>>effectively does this anyway.
>>
>> I don't think so (even if the coalescing back does not happen).

> I think the argument is different: The main roadblock for using a split
> lifetime register allocator, when I last proposed one (must be 10 years
> ago) was that gdb has problems with that.

I'm very surprised that was the case. I'm pretty sure that if anyone
today were to make that objection it would not be viewed as a blocker.
Besides, debug formats have come a long way and can now express the
idea of variables that don't always live in the same slot.

There were two killer problems with replacing the register allocator.
The first was political rather than technical: graph colouring
allocators were covered by patents. The second was perhaps rather
surprising: when new register allocators were created for gcc they
didn't perform better on a mix of code.

Andrew.

Andrew Haley

unread,
Jun 29, 2007, 11:31:31 AM6/29/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Some of the other things you mention are likely to be viewed by gcc
>>maintainers as not a bug, in particular the practice of taking code
>>fragments that gcc has generated and copying them elsewhere. I would
>>find it very hard to defend a patch for that.

> You mean that gcc maintainers would not consider it a bug if they
> broke this technique?

That's right.

> I think they should. It's a practice that worked in earlier gccs
> (in particular, gcc-2.95), and that still works well enough for
> Gforth. That practice provides speedups by a factor of 2 or more
> for Gforth and the Cacao interpreter, and AFAIK it's also very
> important for Qemu. I expect (but have not checked) that it is also
> used in Squeak.

In general it's hard to support. For example, PC-relative references
to constant data will break. Some targets need such references
to load large integer constants, some don't.

I think I've made my position clear: a compiler should support the
language as defined by a standard and documented extensions.
Everything else is forbidden.

So, if there were to be a proper definition of what the compiler needs
to do to support this technique, then it could be made into a
documented extension, and then gcc would have to support it. I am
very strongly opposed to gcc informally supporting an undocumented
extension.

> And the cost of supporting that practice does not seem too
> expensive: at every optimization that might reorder the code, add a
> bypass, and use a command-line flag (if not -fno-reorder-blocks,
> then something else) to decide between using the optimization and
> the bypass.

> Compare the cost of that and the speedup by a factor of 2 in Gforth
> and similar applications to the cost and benefit of other
> optimizations you do; how many do you have that give a speedup by a
> factor of 2 on real-world uses of some applications? And are these
> as simple to implement as the one I suggest?

You're right. This is a very strong argument.

> Would it be considered important enough if it gave a nice speedup
> for Python or Ruby?

I don't think it would make any difference.

Andrew.

Anton Ertl

unread,
Jun 29, 2007, 12:09:37 PM6/29/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Some of the other things you mention are likely to be viewed by gcc
>>>maintainers as not a bug, in particular the practice of taking code
>>>fragments that gcc has generated and copying them elsewhere. I would
>>>find it very hard to defend a patch for that.
...

>In general it's hard to support. For example, PC-relative references
>to constant data will break.

I am not asking the gcc maintainers to help us about PC-relative
references (that would be cool, too, but that's a different issue),
just to avoid reordering basic blocks.

>Some targets need such references
>to load large integer constants, some don't.

I have never encountered a platform where this is the case. Many
platforms do position-independent code using a global pointer, and use
that to load large integer constants (Alpha, PPC, MIPS); AMD64 uses
PC-relative references to access global variables, but has an inline
instruction for large integer constants (so no need to use PC-relative
references for large integer constants).

So, as long as we avoid global variables and relative calls and other
relative branches to code outside the unit of copying, we are fine in
my experience. Global variables can be worked around (although help
from the compiler would be appreciated); calls are not a problem for
us and there's a workaround for that, too; relative branches we can
mostly control through source code, if the compiler keeps the code in
the same order, and that's why we want that feature.
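
For readers unfamiliar with the technique under discussion, here is a minimal sketch of threaded-code dispatch using GNU C's labels-as-values extension. It is not Gforth's actual engine: the opcodes, the `run` function and the `add_len` parameter are hypothetical names chosen for illustration. The label subtraction shows why block order matters: the distance between two labels is only usable as "the length of this primitive's code" if the compiler emits the fragments contiguously and in source order.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack VM (names are illustrative). */
enum { OP_LIT, OP_ADD, OP_HALT };

/* Runs `program` with threaded dispatch and returns the top of stack.
   If `add_len` is non-NULL, it receives the byte distance between the
   labels that bracket the ADD primitive in the generated code. */
static long run(const long *program, ptrdiff_t *add_len)
{
    /* GNU C extension: &&label is the address of a label. */
    static void *const dispatch[] = { &&do_lit, &&do_add, &&do_halt };
    long stack[16];
    long *sp = stack;            /* next free stack slot */
    const long *ip = program;    /* VM instruction pointer */

    /* Only meaningful as a length if do_add's code is emitted
       contiguously and in source order, directly before do_halt. */
    if (add_len)
        *add_len = (char *)&&do_halt - (char *)&&do_add;

#define NEXT goto *dispatch[*ip++]
    NEXT;

do_lit:                          /* push the inline literal */
    *sp++ = *ip++;
    NEXT;
do_add:                          /* replace the top two items by their sum */
    sp--;
    sp[-1] += sp[0];
    NEXT;
do_halt:
    return sp[-1];
#undef NEXT
}
```

Running the program `{ OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_HALT }` through `run` leaves 5 on the stack. The value stored in `add_len` depends on the compiler and its options; if basic blocks are reordered it may be negative or may span unrelated code, which is exactly the failure mode a copying code generator has to guard against.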

>So, if there were to be a proper definition of what the compiler needs
>to do to support this technique, then it could be made into a
>documented extension, and then gcc would have to support it. I am
>very strongly opposed to gcc informally supporting an undocumented
>extension.

So, do you want me and maybe some other concerned parties to write up
a proper definition?

Andrew Haley

unread,
Jun 29, 2007, 1:35:59 PM6/29/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>Some of the other things you mention are likely to be viewed by gcc
>>>>maintainers as not a bug, in particular the practice of taking code
>>>>fragments that gcc has generated and copying them elsewhere. I would
>>>>find it very hard to defend a patch for that.
> ...
>>In general it's hard to support. For example, PC-relative references
>>to constant data will break.

> I am not asking the gcc maintainers to help us about PC-relative
> references (that would be cool, too, but that's a different issue),
> just to avoid reordering basic blocks.

OK, that is reasonable.

>>Some targets need such references
>>to load large integer constants, some don't.

> I have never encountered a platform where this is the case.

I have. It was quite tricky too, because the PC-relative load range
was only a few kilobytes, so we couldn't simply put all the constants at
the end of the function. Instead, we had to look for a nearby
unconditional branch after which to put the data. If there wasn't one
we had to insert one.
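
A rough sketch of the search described above, under simplifying assumptions that are not from the post: instructions are fixed-size, the load can only reach forward, and `range` is counted in instructions rather than bytes. The function name `find_pool_slot` and the flag array are hypothetical.

```c
#include <stddef.h>

/* Looks for an unconditional branch within `range` instructions after
   the load at index `load`. Returns the index of that branch (the
   constant pool can be placed directly after it, where execution never
   falls through), or -1 if there is none, in which case the caller
   would have to insert a branch around a new pool. */
static long find_pool_slot(const unsigned char *is_uncond_branch,
                           size_t n, size_t load, size_t range)
{
    if (load >= n)
        return -1;
    for (size_t i = load + 1; i < n && i - load <= range; i++)
        if (is_uncond_branch[i])
            return (long)i;
    return -1;
}
```

With flags `{0,0,0,1,0,0}`, a load at index 1 and a range of 3 finds the branch at index 3; with a range of 1 the search fails and a branch would have to be inserted.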

> So, as long as we avoid global variables and relative calls and
> other relative branches to code outside the unit of copying, we are
> fine in my experience. Global variables can be worked around
> (although help from the compiler would be appreciated); calls are
> not a problem for us and there's a workaround for that, too;
> relative branches we can mostly control through source code, if the
> compiler keeps the code in the same order, and that's why we want
> that feature.

Yes, I see.

>>So, if there were to be a proper definition of what the compiler needs
>>to do to support this technique, then it could be made into a
>>documented extension, and then gcc would have to support it. I am
>>very strongly opposed to gcc informally supporting an undocumented
>>extension.

> So, do you want me and maybe some other concerned parties to write
> up a proper definition?

No, because I do not think it would be accepted.

Andrew.

Bernd Paysan

unread,
Jun 24, 2007, 6:00:42 PM6/24/07
to
Anton Ertl wrote:
> In conclusion, no matter what we do, gcc-4.1 sucks. I have heard that
> gcc-4.2.0 is similar (it does not build with 32-bit support on the
> Debian boxes I have here, so I cannot check this myself).

It's better, as it again allows us to fit four of the Forth virtual
registers in real ones (they finally detected that using a dead variable as
target register is not really a problem). The inability to remove
superfluous register and memory moves is still worrying.

I don't know what's the easy way out (easy in terms of "not much work to
do"), but the hard way is: Writing our own compiler backend and porting all
relevant and alive platforms GCC supports to it (x86, x86_64, MIPS, SPARC,
PowerPC, ARM), and make it good enough so that other people can use it as
a JIT in their projects, too (sort of "vmgen on steroids" then).

Anton Ertl

unread,
Jul 1, 2007, 4:03:34 AM7/1/07
to
Bernd Paysan <bernd....@gmx.de> writes:
>I don't know what's the easy way out (easy in terms of "not much work to
>do"), but the hard way is: Writing our own compiler backend and porting all
>relevant and alive platforms GCC supports to it (x86, x86_64, MIPS, SPARC,
>PowerPC, ARM), and make it good enough so that other people can use it as
>a JIT in their projects, too (sort of "vmgen on steroids" then).

I think there are several easier ways:

- The approach based on dynamically generating, compiling and linking
C code.

- Maybe reviving gcc-2.95, i.e., adding an AMD64 (and maybe IA-64)
port. Not necessarily easier than doing a native-code compiler
ourselves, but since we are not the only ones in this situation, maybe
others would help.

- Use GNU Lightning for the native code compiler, and add more targets
for that. Also not necessarily easier than doing a native-code
compiler ourselves, but again, a higher chance of getting help from
others.

However, I am not sure that our manpower is sufficient for any
approach apart from the first.

Andrew Haley

unread,
Jul 2, 2007, 1:31:48 PM7/2/07
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

>>> - Long long allowed us to do doubles (they broke that many years ago
>>> on the Alpha and "fixed" it by changing the documentation, so we
>>> have had workarounds for BUGGY_LONG_LONG for a long time).
>>
>>Huh? I don't understand this remark. What broke "long long" on
>>Alpha? I don't imagine it was gcc.

> Long long was documented as being twice as long as "long int". On
> Alpha it was not.

Incidentally, I can't find anything to support this, so it seems my
memory was not faulty at all. I can find nowhere in the documentation
log that 'long long int' was ever documented to be twice as long as
'long int'. On 23 Sep 1996 the gcc documentation said:

GNU C supports data types for integers that are twice as long as
@code{int}. Simply write @code{long long int} for a signed integer,
or @code{unsigned long long int} for an unsigned integer. To make an
integer constant of type @code{long long int}, add the suffix
@code{LL} to the integer. To make an integer constant of type
@code{unsigned long long int}, add the suffix @code{ULL} to the
integer.

It could be that there is some earlier documentation than this, but I
can't find it.

Andrew.

Bernd Paysan

unread,
Jul 3, 2007, 4:00:16 AM7/3/07
to
Andrew Haley wrote:

We filed the bug report somewhere around 1992 or 1993, soon after we started
Gforth. No wonder that this 1996 documentation is already "fixed".

I've just downloaded gcc-2.0.tar.bz2 from ftp.gnu.org/old-gnu/gcc (a small
3.0MB piece), and grep'd through the .info files:

gcc.info-4:
"Double-Word Integers
====================

GNU C supports data types for integers that are twice as long as
`long int'. Simply write `long long int' for a signed integer, or
`unsigned long long int' for an unsigned integer."

GCC 2.0 is of February 1992.
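
To make the practical consequence concrete: a Forth double-cell type needs to be twice as wide as a cell, and on LP64 targets (Alpha, AMD64 Linux) `long` and `long long` are both 64 bits, so `long long` no longer qualifies. The sketch below is not Gforth's BUGGY_LONG_LONG workaround. It assumes a compiler that defines `__SIZEOF_INT128__` and provides `__int128` on 64-bit targets (a later GCC extension); `Cell`, `DCell`, `long_long_is_twice_long` and `m_star` are illustrative names.

```c
#include <stdint.h>

typedef intptr_t Cell;           /* one Forth cell: pointer-sized integer */

/* Pick a double-cell type that really is twice the cell width. */
#if defined(__SIZEOF_INT128__) && (INTPTR_MAX > 0x7fffffff)
typedef __int128 DCell;          /* 64-bit cell: long long is too narrow */
#else
typedef long long DCell;         /* 32-bit cell: long long is 64 bits */
#endif

/* Nonzero if the compiler still keeps the promise of the old manual
   text, i.e. long long is twice as long as long int. */
static int long_long_is_twice_long(void)
{
    return sizeof(long long) == 2 * sizeof(long);
}

/* Forth M*: signed cell-by-cell multiply with a double-cell result. */
static DCell m_star(Cell a, Cell b)
{
    return (DCell)a * (DCell)b;
}
```

On an LP64 system `long_long_is_twice_long()` returns 0, which is the situation the bug report was about, while `m_star(INTPTR_MAX, 2)` still yields the full product because `DCell` is chosen independently of `long long`.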

Bernd Paysan

unread,
Jul 3, 2007, 7:28:45 AM7/3/07
to
Bernd Paysan wrote:
> GCC 2.0 is of February 1992.

I should also note that GCC 2.0 is the primary reason why we started this
endeavour, Gforth, 15 years ago: it provided us with enough features to do
so, especially long long (as defined back then) and labels as values.

Andrew Haley

unread,
Jul 3, 2007, 9:50:55 AM7/3/07
to

> gcc.info-4:
> "Double-Word Integers
> ====================

Ah, OK, that explains it. I didn't remember it because I wasn't there
at the time. Bizarrely, gcc.info-4 isn't in the source repository,
although it is in the tarballs. Go figure...

Andrew.

Anton Ertl

unread,
Jul 4, 2007, 4:58:00 PM7/4/07
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> We filed the bug report somewhere around 1992 or 1993, soon after we started
>> Gforth.

I don't think we reported this before 1995, when we first had contact
with 64-bit systems. Ah, I found the bug report:

http://groups.google.com/group/gnu.gcc.bug/msg/d6f3bcaf15a77814?dmode=source

and it cites the gcc 2.7.0 Manual:

| GNU C supports data types for integers that are twice as long as
|`long int'.

>Bizarrely, gcc.info-4 isn't in the source repository,
>although it is in the tarballs. Go figure...

Well, of course it is not in the source repository, because it is not
a source file. Look for a .texinfo or .texi file.
