gcc-2.95: nice and fast, although a little buggy.
gcc-3.0: slower, thanks to some GCSE misbehaviour
gcc-3.x, x>1 (certainly for 3.3 and 3.4): very slow: they fixed the
  GCSE problem, but on the way destroyed gforth's dynamic code
  generation and the branch prediction advantage of using threaded
  code (PR15242).
gcc-4.x: also very slow: Thanks to anal-retentive syntax checking that
  was introduced without prior warning, dynamic code generation is
  turned off even though it would work in principle (PR15242 is mostly
  fixed, with the exception of PR25285, and they ignore that).
In the meantime we introduced workarounds for PR15242, but they cost
performance with the other compilers, too.  As a result, a few days
ago the performance picture looked like this (on a 2.2GHz Athlon 64
X2):
sieve bubble matrix  fib
 0.248 0.340  0.112 0.388 0.6.9, gcc-2.95.4 --enable-force-reg
 0.188 0.292  0.128 0.308 0.6.2, gcc-2.95.1 --enable-force-reg
Well, quite a bit of slowdown compared to 0.6.2.  So today I worked on
getting some of the gcc-2.95 speed back and improving the gcc-4.x
speed.  So the current CVS has the following speeds (all configured
with --enable-force-reg):
sieve bubble matrix  fib
 0.208 0.296  0.108 0.328 gcc 2.95.4 20011002 (Debian prerelease)
 0.264 0.344  0.120 0.360 gcc 3.4.6 (Debian 3.4.6-5)
 0.384 0.432  0.296 0.520 gcc 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Nice progress by the gcc maintainers, eh?-(
The slowdown between 2.95 and 3.4 can be explained with PR15242; our
workaround helps a lot, but some slowness cannot be worked around (in
particular, not the part that I got back for 2.x and 4.x today).
The slowdown of gcc-4.1 seems to come from bad register allocation and
a failure of copy propagation.  I actually had to reduce what
--enable-force-reg does on this compiler; otherwise the compiler would
either fail to compile the code or produce wrong code.  As an example
of the low quality of the resulting code, consider this:
0.6.9, gcc 4.1        0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1  optimal on K7,K8
Code +                Code +              Code +             Code +           
mov  edi, 21C [esp]   mov  eax, 4 [esi]   mov  eax, 4 [esi]  add  ecx, 4 [esi]
mov  edx, ebp         add  esi, # 4       add  esi, # 4      add  ebx, # 4    
add  ebx, # 4         add  ecx, eax       add  ebx, # 4      add  esi, # 4    
mov  ecx, 4 [edi]     add  ebx, # 4       add  ecx, eax      jmp  -4 [ebx]
add  edi, # 4         mov  eax, -4 [ebx]  jmp  -4 [ebx]      end-code         
add  edx, ecx         jmp  eax            end-code           
mov  21C [esp], edi   end-code          
mov  ebp, edx
mov  esi, -4 [ebx]
mov  eax, esi
jmp  eax
end-code
The difference between 0.6.9 and 0.6.2 on gcc-2.95 is due to a
workaround for PR15242 that I have not (yet?) made
gcc-version-specific.
- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/
Interesting, I didn't realize gcc has regressed so much since 2.95.
Why does gforth rely on gcc at all? Maybe it's time for a rewrite? :)
Slava
> gcc-2.95: nice and fast, although a little buggy.
> gcc-3.0: slower, thanks to some GCSE misbehaviour
> gcc-3.x, x>1 (certainly for 3.3 and 3.4): very slow: they fixed the
>   GCSE problem, but on the way destroyed gforth's dynamic code
>   generation and the branch prediction advantage of using threaded
>   code (PR15242).
> gcc-4.x: also very slow: Thanks to anal-retentive syntax checking that
>   was introduced without prior warning, dynamic code generation is
>   turned off even though it would work in principle 
What syntax does this refer to?
There seem to be a lot of different issues here, and it's quite hard
for me to disentangle them.  Do you mean that with gcc 4.1, you can't
force all the Forth system pointers you need into registers, because
the compiler runs out of registers, but you could do this with earlier
compilers?
Andrew.
We want portability and speed.  Gcc used to have outstanding
advantages there:
- Labels-as-values allowed us to do threaded code (factor of 2 over
  switch dispatch) and dynamic superinstructions (another factor of 2).
- Explicit register allocation allowed us to get decent register
  allocation.
- Long long allowed us to do doubles (they broke that many years ago
  on the Alpha and "fixed" it by changing the documentation, so we
  have had workarounds for BUGGY_LONG_LONG for a long time).
Unfortunately, the gcc maintainers are working hard at eliminating
these advantages, so I would love not to have to rely on gcc.
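For readers who have not seen the first point in code, here is a minimal sketch of switch dispatch versus threaded code with labels-as-values.  It is a toy interpreter, not Gforth's engine: the opcodes, the Cell union, demo_program and both function names are made up for illustration, and it needs GNU C (gcc or a compatible compiler).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy VM with three primitives; not Gforth's instruction set. */
enum { OP_LIT, OP_ADD, OP_HALT };

/* Sample program: push 2, push 3, add, halt. */
const long demo_program[] = { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_HALT };

/* Switch dispatch: every primitive returns to one shared indirect branch. */
long run_switch(const long *code)
{
    long stack[16];
    long *sp = stack;                 /* points at the next free slot */
    const long *ip = code;
    for (;;) {
        switch (*ip++) {
        case OP_LIT:  *sp++ = *ip++; break;
        case OP_ADD:  sp--; sp[-1] += sp[0]; break;
        case OP_HALT: return sp[-1];
        default:      return -1;      /* unknown opcode */
        }
    }
}

/* A threaded-code cell is either a label address or an inline operand. */
typedef union { void *label; long value; } Cell;

/* Threaded code: the program is translated once into label addresses,
   and each primitive ends with its own indirect jump (goto *). */
long run_threaded(const long *code, size_t n)
{
    static void *labels[] = { &&do_lit, &&do_add, &&do_halt };
    Cell prog[64];
    long stack[16];
    long *sp = stack;
    const Cell *ip = prog;
    size_t i;

    assert(n <= 64);
    for (i = 0; i < n; i++) {
        assert(code[i] >= OP_LIT && code[i] <= OP_HALT);
        prog[i].label = labels[code[i]];
        if (code[i] == OP_LIT) {      /* copy the inline operand as-is */
            i++;
            prog[i].value = code[i];
        }
    }
    goto *(ip++)->label;

do_lit:
    *sp++ = (ip++)->value;
    goto *(ip++)->label;
do_add:
    sp--;
    sp[-1] += sp[0];
    goto *(ip++)->label;
do_halt:
    return sp[-1];
}
```

In run_threaded each primitive has its own indirect branch, so the branch predictor keeps a separate history per primitive; run_switch funnels every dispatch through a single branch.  A compiler that merges the separate "goto *" statements into one shared jump takes that advantage away again.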
> Maybe it's time for a rewrite? :)
Yes, I have been thinking about compiling definitions to C source
code, then compiling it and dynamically linking it in.  Thus the gcc
maintainers would have succeeded in their goal of weaning us off GNU C
extensions, and, thanks to gcc's long compile times, would have
successfully eliminated themselves from the competition.
IIRC it could no longer compile this function:
int foo(int x)
{
  if (x) {
  label1:
    asm(".skip 16");
  label2:
  }
  return (&&label2)-(&&label1);
}
IIRC "label2:" must have a statement following it in gcc-4.x.
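A sketch of the obvious source-level fix, assuming the only complaint is the missing statement after the label: put a null statement after label2.  The pointer casts and the ptrdiff_t return type are my additions for cleanliness; the value returned depends on how the compiler lays out the code, so nothing here guarantees it is 16.

```c
#include <stddef.h>

/* Same shape as foo() above.  The block is only there to be measured;
   it must never be executed, because .skip emits filler bytes, not
   valid instructions.  So only call this with x == 0. */
ptrdiff_t foo(int x)
{
    if (x) {
    label1:
        __asm__(".skip 16");
    label2:
        ;                   /* null statement: the only real change */
    }
    /* GNU C label addresses are void *; cast before subtracting. */
    return (char *)&&label2 - (char *)&&label1;
}
```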
>> The slowdown of gcc-4.1 seems to come from bad register allocation and
>> a failure of copy propagation.  I actually had to reduce what
>> --enable-force-reg does on this compiler, otherwise the compiler would
>> not compile, or produce wrong code.  As an example of the low quality
>> of the resulting code, consider this:
>
>> 0.6.9, gcc 4.1        0.6.9, gcc 2.95.4   0.6.2, gcc 2.95.1  optimal on K7,K8
>> Code +                Code +              Code +             Code +           
>> mov  edi, 21C [esp]   mov  eax, 4 [esi]   mov  eax, 4 [esi]  add  ecx, 4 [esi]
>> mov  edx, ebp         add  esi, # 4       add  esi, # 4      add  ebx, # 4    
>> add  ebx, # 4         add  ecx, eax       add  ebx, # 4      add  esi, # 4    
>> mov  ecx, 4 [edi]     add  ebx, # 4       add  ecx, eax      jmp  -4 [ebx]
>> add  edi, # 4         mov  eax, -4 [ebx]  jmp  -4 [ebx]      end-code         
>> add  edx, ecx         jmp  eax            end-code           
>> mov  21C [esp], edi   end-code          
>> mov  ebp, edx
>> mov  esi, -4 [ebx]
>> mov  eax, esi
>> jmp  eax
>> end-code
>There seem to be a lot of different issues here, and it's quite hard
>for me to disentangle them.  Do you mean that with gcc 4.1, you can't
>force all the Forth system pointers you need into registers, because
>the compiler runs out of registers, but you could do this with earlier
>compilers?
Yes, the explicit register allocations that worked for gcc-2.95 and
gcc-3.4 do not work for gcc-4.1, so I used a different one, which
resulted in spilling the data stack pointer in the example above.  I
find the many superfluous register-register moves more worrying,
though.
Ok, in order to get something better for this, I turned off caching
the TOS by default, resulting in the following +:
Code +  
( $804C502 )  mov     esi , dword ptr 8 [ebp]  \ $8B $75 $8 
( $804C505 )  mov     eax , dword ptr 4 [ebp]  \ $8B $45 $4 
( $804C508 )  add     esi , eax  \ $1 $C6 
( $804C50A )  add     ebx , # 4  \ $83 $C3 $4 
( $804C50D )  mov     dword ptr 8 [ebp] , esi  \ $89 $75 $8 
( $804C510 )  add     ebp , # 4  \ $83 $C5 $4 
( $804C513 )  mov     esi , dword ptr FC [ebx]  \ $8B $73 $FC 
( $804C516 )  mov     edx , esi  \ $89 $F2 
( $804C518 )  jmp     804BAF6  \ $E9 $D9 $F5 $FF $FF 
end-code
That looks better but the times are slower (see below).  On a hunch I
checked ?BRANCH, and found:
Code ?branch  
( $804BCDD )  add     ebp , # 4  \ $83 $C5 $4 
( $804BCE0 )  mov     eax , dword ptr [ebx]  \ $8B $3 
( $804BCE2 )  mov     edi , dword ptr 0 [ebp]  \ $8B $7D $0 
( $804BCE5 )  test    edi , edi  \ $85 $FF 
( $804BCE7 )  jne     804BCF5  \ $75 $C 
( $804BCE9 )  mov     edx , dword ptr [eax]  \ $8B $10 
( $804BCEB )  lea     ebx , dword ptr 4 [eax]  \ $8D $58 $4 
( $804BCEE )  mov     esi , edx  \ $89 $D6 
( $804BCF0 )  jmp     804BAF6  \ $E9 $1 $FE $FF $FF 
( $804BCF5 )  add     ebx , # 8  \ $83 $C3 $8 
( $804BCF8 )  mov     esi , dword ptr FC [ebx]  \ $8B $73 $FC 
( $804BCFB )  mov     edx , esi  \ $89 $F2 
( $804BCFD )  jmp     804BAF6  \ $E9 $F4 $FD $FF $FF 
end-code
Yes, PR25285 strikes here (instead of the "jmp 804BAF6", a better
compiler would write "jmp esi"); that's the one the gcc maintainers
prefer to ignore.  Ok, turn on the workaround for that (that's what
causes the slowdown between 2.95 and 3.4), and we get some mixed
results:
sieve bubble matrix  fib
 0.208 0.296  0.108 0.328 gcc 2.95.4 20011002 (Debian prerelease)
 0.264 0.344  0.120 0.360 gcc 3.4.6 (Debian 3.4.6-5)
 0.384 0.432  0.296 0.520 gcc 4.1.2 (default configuration)
 0.476 0.748  0.280 0.476 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 
 0.364 0.524  0.288 0.472 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0
condbranch_opt=0 is one of the workarounds for PR15242 and PR25285.
STACK_CACHE_DEFAULT_FAST=0 turns off caching the TOS by default.
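For those wondering what caching the TOS means at the C level, here is a minimal sketch of "+" both ways.  It assumes a data stack that grows towards lower addresses, as in the listings above; Cell, CachedStack and all function names are hypothetical and not taken from the Gforth sources.

```c
typedef long Cell;

/* No TOS caching: every stack item lives in memory, so "+" needs two
   loads and one store.  sp points at the top item. */
Cell *plus_uncached(Cell *sp)
{
    Cell n2 = sp[0];
    Cell n1 = sp[1];
    sp[1] = n1 + n2;
    return sp + 1;          /* pop one cell */
}

/* TOS cached: the top item is held in a variable that the compiler is
   expected to keep in a register; sp points at the second item. */
typedef struct { Cell *sp; Cell tos; } CachedStack;

CachedStack plus_cached(CachedStack s)
{
    s.tos += s.sp[0];       /* one load, no store */
    s.sp += 1;
    return s;
}

/* Small drivers: both compute 3 + 4. */
Cell demo_uncached(void)
{
    Cell mem[3] = { 3, 4, 0 };          /* mem[0] is the top item */
    Cell *sp = plus_uncached(&mem[0]);
    return sp[0];
}

Cell demo_cached(void)
{
    Cell mem[2] = { 4, 0 };             /* second item, in memory */
    CachedStack s = { &mem[0], 3 };     /* top item, cached */
    s = plus_cached(s);
    return s.tos;
}
```

The cached variant only pays off if the compiler really keeps tos and sp in registers, which is where explicit register allocation comes in.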
In conclusion, no matter what we do, gcc-4.1 sucks.  I have heard that
gcc-4.2.0 is similar (it does not build with 32-bit support on the
Debian boxes I have here, so I cannot check this myself).
Ok. I'll throw in my two cents worth of suggestions here. Eliminating gcc is a 
drastic option. However, it may be possible to isolate the critical parts of the 
code into a separate assembly source file. The critical code may then be tweaked 
without restrictions or compiler interference. This is the model we use in 
kForth, which is a mix of assembler, C, and C++ source. Of course, this means 
much more work in producing a system which can run on many platforms (David 
Williams can back me up on this statement!), since all of the source is no 
longer portable. But that seems to be the case anyway, now, for gforth's 
dependence on the gcc version.
Krishna
Yes, too much work.
> (David 
>Williams can back me up on this statement!), since all of the source is no 
>longer portable. But that seems to be the case anyway, now, for gforth's 
>dependence on the gcc version.
No, all gcc versions still work (for now), the newer ones are just
slow.
> We want portability and speed.  Gcc used to have outstanding
> advantages there:
> - Labels-as-values allowed us to do threaded code (factor of 2 over
>   switch dispatch) and dynamic superinstructions (another factor of 2).
> - Explicit register allocation allowed us to get decent register
>   allocation.
> - Long long allowed us to do doubles (they broke that many years ago
>   on the Alpha and "fixed" it by changing the documentation, so we
>   have had workarounds for BUGGY_LONG_LONG for a long time).
Huh?  I don't understand this remark.  What broke "long long" on
Alpha?  I don't imagine it was gcc.
> Unfortunately, the gcc maintainers are working hard at eliminating
> these advantages, so I would love not to have to rely on gcc.
I can assure you that's not deliberate.  We know that stack slot
allocation and reuse has been poor in gcc, but it is getting better.
We had some performance regressions as a result of the tree-SSA
rewrite, but we've managed to claw most of that back.  It seems that
gforth has suffered more than most programs as a result of these
changes.
Andrew.
Long long was documented as being twice as long as "long int".  On
Alpha it was not.
>> Unfortunately, the gcc maintainers are working hard at eliminating
>> these advantages, so I would love not to have to rely on gcc.
>
>I can assure you that's not deliberate.
Well, I think it's an attitude problem.  My impression is that any
code that's not ANSI C is considered unworthy (including code using
documented GNU C extensions), and reporters of bugs in this area are
made to feel unwelcome, e.g., by "resolving" a bug report as "invalid"
in less time than it took to prepare the bug report (PR25285);
comments like
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c7
<121j5pa...@news.supernews.com> ff.
also contribute to my impression.
>We know that stack slot
>allocation and reuse has been poor in gcc, but it is getting better.
We don't care much for that (might change when callbacks are used
more, though), and in any case, gcc never was great in that respect.
>We had some performance regressions as a result of the tree-SSA
>rewrite, but we've managed to claw most of that back.  It seems that
>gforth has suffered more than most programs as a result of these
>changes.
Well, it's not all doom and gloom.  Bernd has found an explicit
register allocation that works well with gcc-4.2.0 (but not 4.1);
scaling the results to our 2.2GHz Athlon 64 X2 gives:
sieve bubble matrix  fib
 0.208 0.296  0.108 0.328 gcc 2.95.4 20011002 (Debian prerelease)
 0.264 0.344  0.120 0.360 gcc 3.4.6 (Debian 3.4.6-5)
 0.384 0.432  0.296 0.520 gcc 4.1.2 (default configuration)
 0.476 0.748  0.280 0.476 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 
 0.364 0.524  0.288 0.472 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0
 0.240 0.331  0.113 0.342 gcc 4.2.0
Still not quite in the 2.95 league, but better than 3.4.
> Long long was documented as being twice as long as "long int".
How odd.
> On Alpha it was not.
OK, but that has nothing at all to do with gcc: the sizes of types are
determined by the ABI.  gcc implements the ABI of whatever platform it
runs on.
>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>> these advantages, so I would love not to have to rely on gcc.
>>
>>I can assure you that's not deliberate.
> Well, I think it's an attitude problem.
My take on this is that we are, or should be, playing for the same
team.  So, please, let's go ahead with this exchange in the spirit of
co-operation...
> My impression is that any code that's not ANSI C is considered
> unworthy (including code using documented GNU C extensions), and
> reporters of bugs in this area are made to feel unwelcome, e.g., by
> "resolving" a bug report as "invalid" in less time than it took to
> prepare the bug report (PR25285); comments like
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c2
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c7
> <121j5pa...@news.supernews.com> ff.
> also contribute to my impression.
The thing is that for you this seems to be a bad attitude, but for me
it seems perfectly reasonable!  What does it mean to jump out of a
statement expression, anyway?
The core issue here is that many of the gcc extensions never were
properly defined, so the corner cases don't work correctly.  As a
result of that there is a strong opinion against any new extensions,
and some pressure to deprecate old ones.  In the main that pressure
has been resisted, though.
>>We know that stack slot allocation and reuse has been poor in gcc,
>>but it is getting better.
> We don't care much for that (might change when callbacks are used
> more, though), and in any case, gcc never was great in that respect.
>>We had some performance regressions as a result of the tree-SSA
>>rewrite, but we've managed to claw most of that back.  It seems that
>>gforth has suffered more than most programs as a result of these
>>changes.
> Well, it's not all doom and gloom.  Bernd has found an explicit
> register allocation that works well with gcc-4.2.0 (but not 4.1);
> scaling the results to our 2.2GHz Athlon 64 X2 gives:
> sieve bubble matrix  fib
>  0.208 0.296  0.108 0.328 gcc 2.95.4 20011002 (Debian prerelease)
>  0.264 0.344  0.120 0.360 gcc 3.4.6 (Debian 3.4.6-5)
>  0.384 0.432  0.296 0.520 gcc 4.1.2 (default configuration)
>  0.476 0.748  0.280 0.476 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 
>  0.364 0.524  0.288 0.472 gcc 4.1.2 STACK_CACHE_DEFAULT_FAST=0 condbranch_opt=0
>  0.240 0.331  0.113 0.342 gcc 4.2.0
> Still not quite in the 2.95 league, but better than 3.4.
OK, so that's progress.  This is for 32-bit code, is it?  I guess it
must be, given that gcc 2.95 didn't support 64-bit code.
Andrew.
"long long" was specified to be twice as long as "long". It makes perfect
sense to have one data type which is twice as long as the longest "native"
data type, especially to encode multiplication and division operations;
this was the original (RMS) motivation to create that data type after all.
Alpha didn't "break" that, because Alpha has a fast 64x64->128
multiplication (done in two operations, one delivering the low, one the
high result). Actually, the "right" specification of long long should
be "twice as long as intptr_t", because on a 16 bit platform, long could
already be twice as long as the largest native format.
The first 64-bit GCC port came earlier, for MIPS. There, they initially had a
command-line switch for selecting whether long long is 128 or 64 bits,
the latter for backward compatibility with code that assumes long long=64
bits. The switch was dropped soon afterwards (I don't know if it ever worked,
because by the time we had a 64-bit MIPS workstation to port Gforth to, it
had already stopped working).
The Alpha porting team decided to make long long=64 bits, and after we
reported that this is a bug (because it's not according to the
specification of "long long"), they changed the documentation instead
(making it "twice as long as int").
Recently, the situation improved slightly, because AMD put in a "backdoor"
to access a 128 bit data type for amd64. There, you can typedef
typedef int int128_t __attribute__((__mode__(TI)));
typedef unsigned int uint128_t __attribute__((__mode__(TI)));
and apart from converting to and from FP, it actually works (i.e. there are
TI instruction patterns; I haven't checked the FP conversions in GCC 4.2.0,
because the impact is minimal). However, I consider this more of an "easter
egg" than a real feature, as we can't depend on it - we can test if it
works, but that's all.
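A sketch of how such a TI-mode type can be used for a double-cell multiply (64x64->128, what Forth calls UM*), with a portable fallback for when it is missing.  UDouble and um_star are hypothetical names, not Gforth code, and using the predefined macro __SIZEOF_INT128__ as the availability check is my assumption; compilers of the 4.2.0 era do not define it, so there one would test with autoconf instead.

```c
#include <stdint.h>

/* Hypothetical double-cell result: high and low 64-bit halves. */
typedef struct { uint64_t hi, lo; } UDouble;

#if defined(__SIZEOF_INT128__)
/* 128-bit integer via the TI machine mode, as in the typedefs above. */
typedef unsigned int uint128_t __attribute__((__mode__(TI)));

UDouble um_star(uint64_t a, uint64_t b)
{
    uint128_t p = (uint128_t)a * b;
    UDouble r = { (uint64_t)(p >> 64), (uint64_t)p };
    return r;
}
#else
/* Fallback: build the 128-bit product from 32-bit halves. */
UDouble um_star(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Sum of three 32-bit quantities; cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    UDouble r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
#endif
```

Having to carry both branches for every primitive that touches a double-cell value is the kind of duplicated work this subthread is about.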
>> Unfortunately, the gcc maintainers are working hard at eliminating
>> these advantages, so I would love not to have to rely on gcc.
> 
> I can assure you that's not deliberate.  We know that stack slot
> allocation and reuse has been poor in gcc, but it is getting better.
> We had some performance regressions as a result of the tree-SSA
> rewrite, but we've managed to claw most of that back.  It seems that
> gforth has suffered more than most programs as a result of these
> changes.
Yes, but that means that you should use Gforth as a regression test, because
it's one of the rare programs which actually use a significant amount of
special GCC features, has a sufficiently complex control flow (with all the
indirect branches), and has other corner cases which make it a good regression
test target. Yes, I know, people like "unit tests", where each test tests
one feature in isolation, but while you ought to have these tests, you also
need the system tests. And for the system tests, you need a sufficiently
complex system with a well-defined benchmark. Using SPEC as a benchmark is
not sufficient, because SPEC sources don't use GCC extensions.
Until 4.2.0, it was hard to see any "progress" (unlike in the 2.x line,
where 2.95 definitely was the best). And most of what 4.2.0 does better
than before is that the explicit register allocation no longer conflicts with
instructions using this register when it's actually dead.
-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
>> On Alpha it was not.
> 
> OK, but that has nothing at all to do with gcc: the sizes of types are
> determined by the ABI.  gcc implements the ABI of whatever platform it
> runs on.
long long wasn't part of the ABI. long long back then was a GCC extension,
not supported by any other compiler. C99 put long long into the standard,
and therefore now, it's part of the ABI, and you have to accept C99's weasel
wording (courtesy of other vendors which have rendered the original C
typing system completely useless by now, like having no C90 integer type
which can hold a pointer - the IL32P64 model).
I'm ok when int128_t is implemented in a useful way; I don't care what
the pointer type is called, as long as I can check it with autoconf.
What's odd about that?  Long long was a GNU C extension, so gcc was
free to define what it meant.
>> On Alpha it was not.
>
>OK, but that has nothing at all to do with gcc: the sizes of types are
>determined by the ABI.  gcc implements the ABI of whatever platform it
>runs on.
Not when it does not occur in library functions and structures (e.g.,
gcc did not follow the ABI when passing structures as arguments); I
doubt that long long occurred in any functions or structures, certainly
not in functions that we used.
Deviating from GCC's defined API broke Gforth, and cost us quite a bit
of time to work around (and is still costing us time, as the two cases
have to be programmed and tested for every primitive that involves a
double-cell type).
>>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>>> these advantages, so I would love not to have to rely on gcc.
>>>
>>>I can assure you that's not deliberate.
>
>> Well, I think it's an attitude problem.  
...
>> My impression is that any code that's not ANSI C is considered
>> unworthy (including code using documented GNU C extensions), and
>> reporters of bugs in this area are made to feel unwelcome, e.g., by
>> "resolving" a bug report as "invalid" in less time than it took to
>> prepare the bug report (PR25285); comments like
>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c2
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c7
>> <121j5pa...@news.supernews.com> ff.
>
>> also contribute to my impression.
>
>The thing is that for you this seems to be a bad attitude,
I am sorry that I was unclear in the description of the attitude.  If
you want a label for it, let's call it the "ANSI C blinders" attitude;
those suffering from that attitude prefer not to even think about C
code that's outside ANSI C.  This attitude is exemplified here:
>What does it mean to jump out of a
>statement expression, anyway?
I'm sure that, if you dared to think what it means, you would find a
reasonable interpretation, but as long as you have the blinders on,
you will only see that this is not ANSI C, and will therefore not
think of a meaning.
If you need help, here's a hint: What does it mean to call longjmp()
or exit() in an expression; that can occur even in an ANSI C program,
no?
>The core issue here is that many of the gcc extensions never were
>properly defined, so the corner cases don't work correctly.
The case mentioned above is where the meaning of the extension is
pretty straightforward, and the non-working comes from a bug in the
compiler (I guess it does not reset the stack depth before performing
the jump).  But of course with ANSI C blinders on you only see a
non-conforming program, and a compiler that therefore has the right
not to work.
If you think that the extensions are not defined well enough, get
someone to specify them well enough.  I could specify them in language
like that used in the ANSI C document, if I had the time.
>  As a
>result of that there is a strong opinion against any new extensions,
>and some pressure to deprecate old ones.  In the main that pressure
>has been resisted, though.
On the surface, maybe.  But when it comes to dealing with bug reports,
the attitude shines through.
Is that attitude bad?  Well, if you are working on a program that
relies on GNU C extensions, it's bad for you.  We are in that
situation.
>OK, so that's progress. This is for 32-bit code, is it?
Yes.
>  I guess it
>must be, given that gcc 2.95 didn't support 64-bit code.
It does not support AMD64 (it does support, e.g., Alpha).  Hmm, maybe
it would be less work to maintain gcc 2.95, and add a few bug fixes
and an AMD64 port rather than jumping through the hoops that the more
recent gcc versions have put up for us.  But every hoop is not that
much work, and I always hope that it will be the last one; but I am
beginning to fear that, with the attitude problem, there will be many
more hoops to come.
> "long long" was specified to be twice as long as "long".
Ahh, I see, fair enough. Either I never knew that or I'd forgotten.
So, did DEC's Alpha compiler not have a 64-bit long long type?
>>> Unfortunately, the gcc maintainers are working hard at eliminating
>>> these advantages, so I would love not to have to rely on gcc.
>> 
>> I can assure you that's not deliberate.  We know that stack slot
>> allocation and reuse has been poor in gcc, but it is getting better.
>> We had some performance regressions as a result of the tree-SSA
>> rewrite, but we've managed to claw most of that back.  It seems that
>> gforth has suffered more than most programs as a result of these
>> changes.
> Yes, but that means that you should use Gforth as a regression test,
> because it's one of the rare programs which actually use a
> significant amount of special GCC features, has a sufficiently
> complex control flow (with all the indirect branches), and other
> corner cases which make it a good regression test target.
Indeed so, but it's also sufficiently unusual that it's hard to make a
good case for it as a general-purpose test.
> Until 4.2.0, it was hard to see a "progress" (different from the 2.x
> line, where 2.95 definitely was the best). And most of what 4.2.0
> does better than before is that the explicit register allocation
> doesn't conflict with instructions using this register when it's
> actually dead.
I know it's been hard to see the progress from the outside, but most
of the changes were infrastructure re-engineering to make more advanced
optimization possible.  Inevitably, there have been some degradations
along the way but we had to do the work to make moving forward
possible.  It is, I admit, a great shame that gforth has suffered
fallout from this.
Andrew.
> What's odd about that?  Long long was a GNU C extension, so gcc was
> free to define what it meant.
Sure, but as I said to Bernd I had forgotten that it was ever defined
that way.  In hindsight it was surely a mistake to define it in such a
way, but that's all history now.
Please, cut the sarcasm. It's not helping.
> If you need help, here's a hint: What does it mean to call longjmp()
> or exit() in an expression; that can occur even in an ANSI C program,
> no?
exit() and longjmp() return void, so it doesn't make any sense to use
them in a non-void context.  The question here is, therefore, whether
it ever makes sense to have a statement expression whose value is
void.  I suspect it probably does, since the spec says
------------------------------------
The last thing in the compound statement should be an expression
followed by a semicolon; the value of this subexpression serves as the
value of the entire construct.  (If you use some other kind of statement
last within the braces, the construct has type `void', and thus
effectively no value.)
------------------------------------
so what should happen is that the spec is tightened and the
implementation corrected, if indeed it's still broken.
>>The core issue here is that many of the gcc extensions never were
>>properly defined, so the corner cases don't work correctly.
> The case mentioned above is where the meaning of the extension is
> pretty straightforward, and the non-working comes from a bug in the
> compiler (I guess it does not reset the stack depth before performing
> the jump).  But of course with ANSI C blinders on you only see a
> non-conforming program, and a compiler that therefore has the right
> not to work.
What we need in order not to have these kinds of arguments is a
language that conforms to ISO C + a bunch of well-defined extensions.
If the extension in question had been well-defined, the kind of
arguments you've had would not have been possible.
> If you think that the extensions are not defined well enough, get
> someone to specify them well enough.  I could specify them in
> language like that used in the ANSI C document, if I had the time.
>>As a result of that there is a strong opinion against any new
>>extensions, and some pressure to deprecate old ones. In the main
>>that pressure has been resisted, though.
> On the surface, maybe.  But when it comes to dealing with bug reports,
> the attitude shines through.
OK, but that's my point: to you it's a bad attitude, to me it's a
perfectly reasonable response.
> Is that attitude bad?  Well, if you are working on a program that
> relies on GNU C extensions, it's bad for you.  We are in that
> situation.
>>OK, so that's progress. This is for 32-bit code, is it?
> Yes.
>>  I guess it
>>must be, given that gcc 2.95 didn't support 64-bit code.
> It does not support AMD64 (it does support, e.g., Alpha).  Hmm,
> maybe it would be less work to maintain gcc 2.95, and add a few bug
> fixes and an AMD64 port rather than jumping through the hoops that
> the more recent gcc versions have put up for us.  But every hoop is
> not that much work, and I always hope that it will be the last one;
> but I am beginning to fear that, with the attitude problem, there
> will be many more hoops to come.
Well, maybe.  But you are so absurdly rude when you post on the
subject that it's hard to have a sensible discussion.  I really would
like to have a sensible discussion about how to fix some of the
problems you're having with gcc.
Andrew.
IMHO the definition was right, and if the Alpha people had asked RMS about
his reasoning, they probably wouldn't have changed it. It's the "ANSI C"
blinder attitude shining through here again: long long is a GCC extension,
so "it can mean whatever we like". That's not correct; the GCC
extensions are all there for good reasons.
This is all history now, for sure, and C99 has a better way to select types
with a specific size, so GCC should support this scheme - e.g. a way to
access int128_t on all 64 bit platforms without GCC-specific attributes
would be nice to have.
>> Sure, but as I said to Bernd I had forgotten that it was ever
>> defined that way.  In hindsight it was surely a mistake to define
>> it in such a way, but that's all history now.
> IMHO the definition was right, and if the Alpha people had asked RMS
> about his reasoning, they probably wouldn't have changed it. It's
> the "ANSI C" blinder attitude shining through here again: long long
> is a GCC extension, so "it can mean whatever we like".
No, long long wasn't only a gcc-specific extension: several compiler
vendors also supported it or were about to, so it wasn't possible for
gcc simply to define what it meant without regard to anyone else.
There's an excruciating discussion at
http://yarchive.net/comp/longlong.html
> This is all history now, for sure, and C99 has a better way to
> select types with a specific size, so GCC should support this scheme
> - e.g. a way to access int128_t on all 64 bit platforms without
> GCC-specific attributes would be nice to have.
AFAIAA intN_t is provided by the C library, not the compiler, so it's
up to the library vendors.  The split between library and compiler is
specified by the standard.
Andrew.
To be pedantic, exit() and longjmp() don't return.
Still, to satisfy the type checker, one has to use them in a context
where a void "result" is ok.  An example similar to the one discussed
in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15242#c8
would be
f("%d\n", (longjmp(buf, 1), 0))
Can you find a meaning for that? Now compare with
f("%d\n", ({goto a; 0;}))
Can you now find a meaning for that?
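To make the comparison concrete, here is a small sketch of both cases in compilable form.  The counter, the stand-in function f and the two wrapper functions are hypothetical names added for illustration; the second case needs GNU C statement expressions.  In both cases control leaves the argument expression before f is called, so under the straightforward reading f never runs.

```c
#include <setjmp.h>

static jmp_buf buf;
static int calls;                 /* how often f() was actually entered */

static int f(int x)
{
    calls++;
    return x;
}

/* ISO C: longjmp() inside an argument expression abandons the call. */
int via_longjmp(void)
{
    calls = 0;
    if (setjmp(buf) == 0)
        f((longjmp(buf, 1), 0));  /* never reaches f */
    return calls;
}

/* GNU C: goto out of a statement expression in the same position. */
int via_goto(void)
{
    calls = 0;
    f(({ goto a; 0; }));          /* never reaches f either */
a:
    return calls;
}
```

Both functions return 0 with a compiler that implements this reading.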
>>>The core issue here is that many of the gcc extensions never were
>>>properly defined, so the corner cases don't work correctly.
>
>> The case mentioned above is where the meaning of the extension is
>> pretty straightforward, and the non-working comes from a bug in the
>> compiler (I guess it does not reset the stack depth before performing
>> the jump).  But of course with ANSI C blinders on you only see a
>> non-conforming program, and a compiler that therefore has the right
>> not to work.
>
>What we need in order not to have these kinds of arguments is a
>language that conforms to ISO C + a bunch of well-defined extensions.
>If the extension in question had been well-defined, the kind of
>arguments you've had would not have been possible.
So if ANSI C was well-defined, it would be impossible for us to argue
what it means to call longjmp() or exit() in an expression, right?
In any case, if you want a tighter specification of the extension, you
could commission such a spec.
>I really would
>like to have a sensible discussion about how to fix some of the
>problems you're having with gcc.
So do you have any ideas how they could be fixed?  I used to write bug
reports for gcc.  But the reaction to PR25285 convinced me that this
is a waste of time.
> To be pedantic, exit() and longjmp() don't return.
Oh, for goodness' sake! Let's move on.
>>>>The core issue here is that many of the gcc extensions never were
>>>>properly defined, so the corner cases don't work correctly.
>>
>>> The case mentioned above is where the meaning of the extension is
>>> pretty straightforward, and the non-working comes from a bug in the
>>> compiler (I guess it does not reset the stack depth before performing
>>> the jump).  But of course with ANSI C blinders on you only see a
>>> non-conforming program, and a compiler that therefore has the right
>>> not to work.
>>
>>What we need in order not to have these kinds of arguments is a
>>language that conforms to ISO C + a bunch of well-defined extensions.
>>If the extension in question had been well-defined, the kind of
>>arguments you've had would not have been possible.
> So if ANSI C was well-defined, it would be impossible for us to argue
> what it means to call longjmp() or exit() in an expression, right?
Given sufficient bloody-mindedness it's possible to argue about
anything, as you amply demonstrate.
> In any case, if you want a tighter specification of the extension, you
> could commission such a spec.
>>I really would like to have a sensible discussion about how to fix
>>some of the problems you're having with gcc.
> So do you have any ideas how they could be fixed?
That depends on the specific problem.  Some things, for example
register allocation, are very hard, and in any case are being actively
worked on.  Other things might be easier.
What would be interesting to me is to know the most important gcc
deficiencies from the point of view of GForth and your estimate of how
significant these deficiencies are.  There might well be some push-back
from gcc developers, with the claim that GForth "isn't
representative".  But I'm not convinced of that, as I suspect that some
of the problems you've seen might have an impact on other code bases.
There is more to the world than SPECint.
> I used to write bug reports for gcc.  But the reaction to PR25285
> convinced me that this is a waste of time.
Well, I agree that Andrew Pinski's Comment #3 seems to be very
unhelpful.  However, it's not just a matter of writing bug reports,
but of finding a gcc maintainer who understands the problem and is
motivated to work on it.
Andrew.
One obvious deficiency in the current GCC is that it deals poorly with copy
propagation, and that must impact every program. The code GCC 4.1.2
compiles on x86_64 for the + primitive is a good example (and 4.2.0
compiles almost the same code):
        # +
#NO_APP
        leaq    8(%r15), %rax ; create AGU bubble
        addq    $8, %rbx ; create another AGU bubble
        addq    (%rax), %r14
        movq    %rax, %r15
.L261:
        movq    -8(%rbx), %rbp
.L262:
        movq    %rbp, %rax
        jmp     *%rax
The update of the address before use is a "no-no" in modern pipelined CPUs,
as it introduces a pipeline bubble. The two movq reg,reg in this example
are completely superfluous.
What I would write is:
        # +
#NO_APP
        addq    8(%r15), %r14
        movq    (%rbx), %rax ; avoid AGU bubble
        addq    $8, %r15
        addq    $8, %rbx
.L261:
        jmp     *%rax
Actually, we could change our source code so that GCC has more opportunity
to avoid the second AGU bubble (it's optimized for register pressure to
make it possible to combine the jump into
jmp *-8(%rbx)
, but the way GCC 4.x handles indirect jumps prevents such an optimization,
anyway).
Hmm, OK.  Just to be clear that we are talking about the same block of
code, I've included what I think is the corresponding source code.  Is
this right?
J_plus: asm(""); I_plus:
    { saved_ip=ip; asm(""); }
    {
      Label ca;
      Cell n1;
      Cell n2;
      Cell n;
      ({cfa1=cfa; cfa=*ip;});
      ((n1)=(Cell)(sp[1]));
      ((n2)=(Cell)((sp[0])));
      sp += 1;
      {
# 681 "./prim"
	n = n1+n2;
# 2883 "prim.i"
      }
      ({ip++; ca=*cfa;});
      (((sp[0]))=(Cell)(n));
    K_plus:
      ({goto *ca;});
    }
I'm rather worried that this may turn out to be the reload pass
performing poorly, but let's see.
Andrew.
Not exactly, should look like this (fast with TOS cached, from the latest
snapshot):
H_plus: asm(""); I_plus:
asm("# " "+");
{
Cell n1;
Cell n2;
Cell n;
;
((n1)=(Cell)(sp[1]));
((n2)=(Cell)(spTOS));
sp += 1;
{
n = n1+n2;
}
(ip++);
((spTOS)=(Cell)(n));
K_plus: asm("");
do {asm("":"=X"(cfa)); do {(real_ca=(*(ip-1)));} while(0);} while(0);
J_plus: asm("");
goto *real_ca;
}
> I'm rather worried that this may turn out to be the reload pass
> performing poorly, but let's see.
1) Register allocation is one issue, but that is probably not easy to
solve.  One thing that I can imagine something can be done about is
that, even on platforms with many registers, like Alpha, MIPS, or
AMD64, gcc allocates registers badly for the Gforth engine: we can
barely allocate the virtual machine registers into real registers, but
then we don't have any registers left for stack caching.
The reason for this seems to be that these machines have few
callee-saved registers, and gcc seems to allocate our virtual machine
registers only into these registers (probably because they survive a
number of calls).  What's worse, even with explicit register
allocation we cannot get around that, because we can only use
callee-saved registers there, too (at least last time I tried).  The
only architecture where I am happy about the registers is PPC, because
it has many callee-saved registers.
It would be great if gcc would make better use of the registers by
itself, but if not, I would at least like to do it myself using
explicit register allocation.
2) Fixing PR25285.  This one strikes at unpredictable times, as the
example with gcc-4.1.2 without and with STACK_CACHE_DEFAULT_FAST=0
shows.  We do have a workaround for that, but that workaround has a
negative performance impact even if the compiler does not exhibit
PR25285.  Speed impacts:
Difference between raw PR25285 and workaround:
sieve bubble matrix  fib
 0.476 0.748  0.280 0.476 PR25285 without workaround
 0.364 0.524  0.288 0.472 with workaround enabled
The cost of the workaround on a compiler without PR25285:
sieve bubble matrix  fib
 0.208 0.296  0.108 0.336 no PR25285 without workaround
 0.248 0.340  0.120 0.384 with workaround enabled
Other programs are affected even more by this.  In particular, in our
work on using dynamic superinstructions in the Cacao JVM interpreter
<http://www.complang.tuwien.ac.at/papers/ertl+06dotnet.ps.gz>, we
found slowdowns by up to a factor of 2 by enabling or disabling the
"throw" feature that is affected by PR25285.  I found a workaround for
that, but it is brittle; it worked if we did not add static
superinstructions, but when I did add static superinstructions, it no
longer worked.
3) Code arrangement.  Not a problem with current gcc versions (apart
from PR25285), but it has been in the past, so maybe one should add a
test case or some other way to remind the maintainers of this issue.
We generate code dynamically by taking code fragments (between two
labels (as values)) that gcc has generated and copying the fragments
elsewhere.  In order for this technique to work, there is one
requirement: If a piece of source code is between two labels, the
corresponding executable code must be between the addresses
corresponding to these labels.  This property should be guaranteed, at
least through a compiler option like -fno-reorder-blocks (PR25285
breaks it).
The performance impact of dynamic code generation is typically a
factor of 2, but sometimes much higher:
sieve bubble matrix  fib
 0.212 0.292  0.108 0.336 dynamic code generation
 0.420 0.540  0.704 0.696 no dynamic code generation
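To make point 3 concrete, here is a minimal sketch of a direct-threaded engine built on GNU C labels as values. It is not Gforth's engine: the label names I_plus and K_plus follow the thread, but engine, add_threaded and plus_fragment_len are hypothetical names, and the sketch only measures the fragment between the two labels instead of copying it. The measured length is meaningful only if the compiler keeps the code for + between I_plus and K_plus, which is exactly the property asked for above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef __GNUC__
#error "labels as values (&&label, goto *) are a GNU C extension"
#endif

typedef intptr_t Cell;
typedef void *Label;

/* Bytes between I_plus and K_plus as reported by the label addresses.
   Only meaningful if the compiler kept that code between the labels. */
ptrdiff_t plus_fragment_len;

/* With ip == NULL: publish the primitive table and return 0.
   Otherwise: run the threaded code at ip with stack pointer sp. */
Cell engine(Label *ip, Cell *sp, Label **prims_out)
{
    static Label prims[] = { &&I_lit, &&I_plus, &&I_halt };

    if (ip == NULL) {
        *prims_out = prims;
        plus_fragment_len = (char *)&&K_plus - (char *)&&I_plus;
        return 0;
    }
    goto **ip++;            /* dispatch the first primitive */

I_lit:                      /* push the literal stored inline */
    *--sp = (Cell)*ip++;
    goto **ip++;
I_plus:                     /* add the two top stack cells */
    sp[1] += sp[0];
    sp++;
K_plus:                     /* end of the copyable part of + */
    goto **ip++;
I_halt:
    return sp[0];
}

/* Builds the threaded program "lit a lit b + halt" and runs it. */
Cell add_threaded(Cell a, Cell b)
{
    Label *prims = NULL;
    Cell stack[8];
    Label code[6];

    engine(NULL, NULL, &prims);
    assert(prims != NULL);
    code[0] = prims[0]; code[1] = (Label)a;
    code[2] = prims[0]; code[3] = (Label)b;
    code[4] = prims[1];
    code[5] = prims[2];
    return engine(code, stack + 8, &prims);
}
```

A dynamic code generator in the style described would copy plus_fragment_len bytes starting at the address of I_plus into an executable buffer; that step is left out here because it is platform-specific.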
>There might well be some push-back
>from gcc developers, with the claim that GForth "isn't
>representative".  But I'm not convinced of that, as I suspect that some
>of the problems you've seen might have an impact on other code bases.
1) Register allocation affects everyone.  Most interpreters will have
register liveness characteristics similar to Gforth.
2,3) The code arrangement issue and PR25285 affect systems that use
similar techniques to Gforth.  Apart from Gforth, SableVM, and the
Cacao interpreter, two other projects I know that use similar
techniques are Qemu and the Tempo partial evaluator.  I have read
somewhere that Qemu has big problems with recent gccs thanks to a code
generation issue that is somewhat similar to PR25285, except that it
involves returns instead of general indirect jumps.
>> I used to write bug reports for gcc.  But the reaction to PR25285
>> convinced me that this is a waste of time.
>
>Well, I agree that Andrew Pinski's Comment #3 seems to be very
>unhelpful.  However, it's not just a matter of writing bug reports,
>but of finding a gcc maintainer who understands the problem and is
>motivated to work on it.
Yes. So how do we find one?
> 2) Fixing PR25285.
OK, I might be able to find time to look at this.
>>> I used to write bug reports for gcc.  But the reaction to PR25285
>>> convinced me that this is a waste of time.
>>
>>Well, I agree that Andrew Pinski's Comment #3 seems to be very
>>unhelpful.  However, it's not just a matter of writing bug reports,
>>but of finding a gcc maintainer who understands the problem and is
>>motivated to work on it.
> Yes. So how do we find one?
I'm here.  However, some of the problems you mention are already being
actively worked on and in any case are almost research-grade problems.
Vladimir Makarov has written various register allocators, and there
are rumours of a new one for gcc that might improve things.
Some of the other things you mention are likely to be viewed by gcc
maintainers as not a bug, in particular the practice of taking code
fragments that gcc has generated and copying them elsewhere.  I would
find it very hard to defend a patch for that.
Andrew.
One particular gripe I have with typical register allocators and the usage
pattern of registers in code like Gforth is that when such an allocator
fails at some point, the register ends up in memory all the time. That's
clearly not optimal. One comp.arch poster had a footer that said that most
performance problems come down to caching, and register allocation can be
treated just the same. Caching means that a variable that's loaded into a
register should live there as long as possible, but when no longer
possible, writing it to memory and later reloading it is perfectly ok. The
metric to minimize is the number of dynamic loads and stores. So if a value
sits in a register, and needs to be stored in memory for function calls
(caller-save), this is ok, as long as the function calls aren't more
frequent than accesses to the particular value.
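The metric can be written down as a toy cost model. This is only a sketch under stated assumptions: every access to a spilled value costs one memory operation, saving a register around a call costs one store plus one load, and all function names are invented. Under these assumptions the break-even point is two accesses per call, a slightly stricter condition than the loose statement above.

```c
#include <assert.h>

/* Dynamic loads + stores if the value lives in memory all the time:
   every access touches memory (assumed: one memory operation each). */
long cost_spill_everywhere(long accesses, long calls)
{
    (void)calls;
    return accesses;
}

/* Dynamic loads + stores if the value lives in a caller-saved register
   and is saved before and reloaded after each call. */
long cost_save_around_calls(long accesses, long calls)
{
    (void)accesses;
    return 2 * calls;
}

/* 1 if keeping the value in a register costs no more memory traffic. */
int register_wins(long accesses, long calls)
{
    assert(accesses >= 0 && calls >= 0);
    return cost_save_around_calls(accesses, calls)
        <= cost_spill_everywhere(accesses, calls);
}
```

For instance, register_wins(100, 10) is 1 (20 memory operations instead of 100), while register_wins(10, 100) is 0.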
This also helps for the x86 problem with its irregular register file - if
you find out that you need CX for an occasional shift operation, you then
still can use it for global values, which have to be pushed out just when
the shift operation needs CX (same for AX:DX for multiplications or
SI/DI/CX for string instructions).
There are a number of complications with this sort of split lifetime
register allocators, especially since debuggers have problems to localize
values when they change their position during the program flow. One interim
solution is to disable this sort of optimization when debugging.
> There are a number of complications with this sort of split lifetime
> register allocators, especially since debuggers have problems to
> localize values when they change their position during the program
> flow. One interim solution is to disable this sort of optimization
> when debugging.
With the move to static single assignment form (aka SSA), gcc
effectively does this anyway.  A program like this:
int poo (int N, int M, int I)
{
  int a, b;
  a = N;
  a = a + M ;
  a = a * 7;
  b = (I - a);
  b = b * b;
  return a ^ b;
}
gets translated to:
poo (N, M, I)
{
  int a.26;
  int b;
  int a;
  a = N + M;
  a.26 = a * 7;
  b = I - a.26;
  return a.26 ^ b * b;
}
where every assignment creates a separate temporary variable.
Andrew.
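The two listings above can be checked against each other with a hand transcription into plain C. The dump's name a.26 is not a valid C identifier, so it is written a_26 here; poo_renamed and check_equivalence are names made up for this sketch, which shows only that the renaming preserves the result and says nothing about what gcc does internally.

```c
#include <assert.h>

/* The original function from the post. */
int poo(int N, int M, int I)
{
    int a, b;
    a = N;
    a = a + M;
    a = a * 7;
    b = (I - a);
    b = b * b;
    return a ^ b;
}

/* Hand transcription of the dump, with a.26 spelled a_26. */
int poo_renamed(int N, int M, int I)
{
    int a_26;
    int b;
    int a;
    a = N + M;
    a_26 = a * 7;
    b = I - a_26;
    return a_26 ^ b * b;   /* '*' binds tighter than '^' */
}

/* Compares both versions over a small range of inputs; returns 1. */
int check_equivalence(void)
{
    for (int n = -3; n <= 3; n++)
        for (int m = -3; m <= 3; m++)
            for (int i = -3; i <= 3; i++)
                assert(poo(n, m, i) == poo_renamed(n, m, i));
    return 1;
}
```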
Just to follow up on my own posting, I realized afterwards that I was
being slightly simplistic, to the point of misleading.  Before code
generation, gcc converts out of SSA back into standard form, and this
can result in the separate SSA variables being coalesced back into a
single temporary variable.
Andrew.
You mean that gcc maintainers would not consider it a bug if they
broke this technique?  I think they should.  It's a practice that
worked in earlier gccs (in particular, gcc-2.95), and that still works
well enough for Gforth.  That practice provides speedups by a factor
of 2 or more for Gforth and the Cacao interpreter, and AFAIK it's also
very important for Qemu.  I expect (but have not checked) that it is
also used in Squeak.
And the cost of supporting that practice does not seem too expensive:
at every optimization that might reorder the code, add a bypass, and
use a command-line flag (if not -fno-reorder-blocks, then something
else) to decide between using the optimization and the bypass.
Compare the cost of that and the speedup by a factor of 2 in Gforth
and similar applications to the cost and benefit of other
optimizations you do; how many do you have that give a speedup by a
factor of 2 on real-world uses of some applications?  And are these as
simple to implement as the one I suggest?
Would it be considered important enough if it gave a nice speedup for
Python or Ruby?
I don't think so (even if the coalescing back does not happen).
In SSA form our engine looks like this:
indirect_jump: #introduced by gcc for all the separate "goto *"s we have
  sp.0 = phi(sp.1, ...., sp.n);
  ip.0 = phi(ip.1, ...., ip.n);
  goto *target;
I_free:
  free(...);
  ip.1 = ip.0+1;
  target = ip.1[-1];
  goto indirect_jump;
As you can see, in the case of FREE, sp does not change, and ip is
only changed after the call to free().  So both ip.0 and sp.0 survive
the call.  With a gcc-style register allocator, if it does not put
them into a callee-saved register, it will spill them, and not just
around the call to free(), but everywhere.
IIRC, there was or is a project to use vmgen for Ruby (carbone), but
apparently that project isn't being pursued. There's also YARV, which is a
similar stack-based approach, but doesn't use vmgen directly.
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Bernd Paysan <bernd....@gmx.de> wrote:
>>
>>> There are a number of complications with this sort of split lifetime
>>> register allocators, especially since debuggers have problems to
>>> localize values when they change their position during the program
>>> flow. One interim solution is to disable this sort of optimization
>>> when debugging.
>>
>>With the move to static single assignment form (aka SSA), gcc
>>effectively does this anyway.
> 
> I don't think so (even if the coalescing back does not happen).
I think the argument is different: The main roadblock for using a split
lifetime register allocator, when I last proposed one (must be 10 years
ago) was that gdb has problems with that. Now since GCC may have "walking"
variables, anyway (through the SSA optimizations), the roadblock of not
being able to debug this stuff is already gone. So it's possible to replace
GCC's current allocator with such an allocator without giving up debugging.
I don't know how important the coalescing is, because stack frame packing is
also better achieved by treating the stack frame as (hypothetically
infinitely large) register file. The algorithm should free "dead" slots
(like in a register file), and try to put new values into the slots with
the smallest offset (smallest, because that way the "hot" stuff gets in as
few cache lines as possible, and on variable length architectures like x86,
8 bit offsets give better code size). The same thing is also useful for
register files in some architectures which have a save-multiple-register
instruction (like ARM), where caller-saved registers should live in
consecutive registers.
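The slot policy described above (free dead slots, hand out the free slot with the smallest offset) can be sketched as follows. This is an illustration only, not code from gcc or Gforth; Frame, MAX_SLOTS and the function names are hypothetical, and a real allocator would track slot sizes and alignment as well.

```c
#include <assert.h>

#define MAX_SLOTS 64

typedef struct {
    unsigned char live[MAX_SLOTS];   /* 1 while a value occupies the slot */
} Frame;

void frame_init(Frame *f)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        f->live[i] = 0;
}

/* Returns the lowest-numbered free slot, or -1 if the frame is full.
   Lowest first keeps the hot values in as few cache lines as possible. */
int slot_alloc(Frame *f)
{
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (!f->live[i]) {
            f->live[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Marks a slot as dead so that a later value can reuse it. */
void slot_free(Frame *f, int slot)
{
    assert(slot >= 0 && slot < MAX_SLOTS && f->live[slot]);
    f->live[slot] = 0;
}

/* Allocates three slots, kills the middle one, allocates again. */
int reuse_demo(void)
{
    Frame f;
    frame_init(&f);
    int a = slot_alloc(&f);
    int b = slot_alloc(&f);
    int c = slot_alloc(&f);
    (void)a;
    (void)c;
    slot_free(&f, b);
    return slot_alloc(&f);   /* reuses the freed middle slot */
}
```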
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Bernd Paysan <bernd....@gmx.de> wrote:
>>>
>>>> There are a number of complications with this sort of split lifetime
>>>> register allocators, especially since debuggers have problems to
>>>> localize values when they change their position during the program
>>>> flow. One interim solution is to disable this sort of optimization
>>>> when debugging.
>>>
>>>With the move to static single assignment form (aka SSA), gcc
>>>effectively does this anyway.
>> 
>> I don't think so (even if the coalescing back does not happen).
> I think the argument is different: The main roadblock for using a split
> lifetime register allocator, when I last proposed one (must be 10 years
> ago) was that gdb has problems with that.
I'm very surprised that was the case.  I'm pretty sure that if anyone
today were to make that objection it would not be viewed as a blocker.
Besides, debug formats have come a long way and can now express the
idea of variables that don't always live in the same slot.
There were two killer problems with replacing the register allocator.
The first was political rather than technical: graph colouring
allocators were covered by patents.  The second was perhaps rather
surprising: when new register allocators were created for gcc they
didn't perform better on a mix of code.
Andrew.
> You mean that gcc maintainers would not consider it a bug if they
> broke this technique?
That's right.
> I think they should.  It's a practice that worked in earlier gccs
> (in particular, gcc-2.95), and that still works well enough for
> Gforth.  That practice provides speedups by a factor of 2 or more
> for Gforth and the Cacao interpreter, and AFAIK it's also very
> important for Qemu.  I expect (but have not checked) that it is also
> used in Squeak.
In general it's hard to support.  For example, PC-relative references
to constant data will break.  Some targets need such references
to load large integer constants, some don't.
I think I've made my position clear: a compiler should support the
language as defined by a standard and documented extensions.
Everything else is forbidden.  
So, if there were to be a proper definition of what the compiler needs
to do to support this technique, then it could be made into a
documented extension, and then gcc would have to support it.  I am
very strongly opposed to gcc informally supporting an undocumented
extension.
> And the cost of supporting that practice does not seem too
> expensive: at every optimization that might reorder the code, add a
> bypass, and use a command-line flag (if not -fno-reorder-blocks,
> then something else) to decide between using the optimization and
> the bypass.
> Compare the cost of that and the speedup by a factor of 2 in Gforth
> and similar applications to the cost and benefit of other
> optimizations you do; how many do you have that give a speedup by a
> factor of 2 on real-world uses of some applications?  And are these
> as simple to implement as the one I suggest?
You're right. This is a very strong argument.
> Would it be considered important enough if it gave a nice speedup
> for Python or Ruby?
I don't think it would make any difference.
Andrew.
I am not asking the gcc maintainers to help us about PC-relative
references (that would be cool, too, but that's a different issue),
just to avoid reordering basic blocks.
>Some targets need such references
>to load large integer constants, some don't.
I have never encountered a platform where this is the case.  Many
platforms do position-independent code using a global pointer, and use
that to load large integer constants (Alpha, PPC, MIPS); AMD64 uses
PC-relative references to access global variables, but has an inline
instruction for large integer constants (so no need to use PC-relative
references for large integer constants).
So, as long as we avoid global variables and relative calls and other
relative branches to code outside the unit of copying, we are fine in
my experience.  Global variables can be worked around (although help
from the compiler would be appreciated); calls are not a problem for
us and there's a workaround for that, too; relative branches we can
mostly control through source code, if the compiler keeps the code in
the same order, and that's why we want that feature.
>So, if there were to be a proper definition of what the compiler needs
>to do to support this technique, then it could be made into a
>documented extension, and then gcc would have to support it.  I am
>very strongly opposed to gcc informally supporting an undocumented
>extension.
So, do you want me and maybe some other concerned parties to write up
a proper definition?
> I am not asking the gcc maintainers to help us about PC-relative
> references (that would be cool, too, but that's a different issue),
> just to avoid reordering basic blocks.
OK, that is reasonable.
>>Some targets need such references
>>to load large integer constants, some don't.
> I have never encountered a platform where this is the case.
I have.  It was quite tricky too, because the PC-relative load range
was only a few kilobytes, so we couldn't simply put all the constants at
the end of the function.  Instead, we had to look for a nearby
unconditional branch after which to put the data.  If there wasn't one
we had to insert one.
> So, as long as we avoid global variables and relative calls and
> other relative branches to code outside the unit of copying, we are
> fine in my experience. Global variables can be worked around
> (although help from the compiler would be appreciated); calls are
> not a problem for us and there's a workaround for that, too;
> relative branches we can mostly control through source code, if the
> compiler keeps the code in the same order, and that's why we want
> that feature.
Yes, I see.
>>So, if there were to be a proper definition of what the compiler needs
>>to do to support this technique, then it could be made into a
>>documented extension, and then gcc would have to support it.  I am
>>very strongly opposed to gcc informally supporting an undocumented
>>extension.
> So, do you want me and maybe some other concerned parties to write
> up a proper definition?
No, because I do not think it would be accepted.
Andrew.
It's better, as it again allows us to fit four of the Forth virtual
registers in real ones (they finally detected that using a dead variable as
target register is not really a problem). The inability to remove
superfluous register and memory moves still is worrying.
I don't know what's the easy way out (easy in terms of "not much work to
do"), but the hard way is: Writing our own compiler backend and porting all
relevant and alive platforms GCC supports to it (x86, x86_64, MIPS, SPARC,
PowerPC, ARM), and make it good enough so that other people can use it at
JIT in their project, too (sort of "vmgen on steroids" then).
I think there are several easier ways:
- The approach based on dynamically generating, compiling and linking
C code.
- Maybe reviving gcc-2.95, i.e., adding an AMD64 (and maybe IA-64)
port.  Not necessarily easier than doing a native-code compiler
ourselves, but since we are not the only ones in this situation, maybe
others would help.
- Use GNU Lightning for the native code compiler, and add more targets
for that.  Also not necessarily easier than doing a native-code
compiler ourselves, but again, a higher chance of getting help from
others.
However, I am not sure that our manpower is sufficient for any
approach apart from the first.
>>> - Long long allowed us to do doubles (they broke that many years ago
>>>   on the Alpha and "fixed" it by changing the documentation, so we
>>>   have had workarounds for BUGGY_LONG_LONG for a long time).
>>
>>Huh?  I don't understand this remark.  What broke "long long" on
>>Alpha?  I don't imagine it was gcc.
> Long long was documented as being twice as long as "long int".  On
> Alpha it was not.
Incidentally, I can't find anything to support this, so it seems my
memory was not faulty at all.  I can find nowhere in the documentation
log that 'long long int' was ever documented to be twice as long as
'long int'.  On 23 Sep 1996 the gcc documentation said:
  GNU C supports data types for integers that are twice as long as
  @code{int}.  Simply write @code{long long int} for a signed integer,
  or @code{unsigned long long int} for an unsigned integer.  To make an
  integer constant of type @code{long long int}, add the suffix
  @code{LL} to the integer.  To make an integer constant of type
  @code{unsigned long long int}, add the suffix @code{ULL} to the
  integer.
It could be that there is some earlier documentation than this, but I
can't find it.
Andrew.
We filed the bug report somewhere around 1992 or 1993, soon after we started
Gforth. No wonder that this 1996 documentation is already "fixed".
I've just downloaded gcc-2.0.tar.bz2 from ftp.gnu.org/old-gnu/gcc (a small
3.0MB piece), and grep'd through the .info files:
gcc.info-4:
"Double-Word Integers
====================
   GNU C supports data types for integers that are twice as long as
`long int'.  Simply write `long long int' for a signed integer, or
`unsigned long long int' for an unsigned integer."
GCC 2.0 is of February 1992.
I should also note that GCC 2.0 is the primary reason why we started this
endeavour, Gforth, 15 years ago: It provided us enough features to do so.
Especially long long (as defined back then) and labels as values.
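To illustrate why a type twice as long as a cell matters for Forth doubles, here is a sketch of double-cell addition done by hand for the case where no such type is available. It is not Gforth's actual BUGGY_LONG_LONG workaround; UCell, DCell and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t UCell;              /* one unsigned cell (assumed) */

typedef struct {
    UCell lo;                         /* low cell of the double */
    UCell hi;                         /* high cell of the double */
} DCell;

/* Adds two double-cell numbers, propagating the carry by hand. */
DCell dcell_add(DCell a, DCell b)
{
    DCell r;
    r.lo = a.lo + b.lo;                  /* wraps modulo 2^N */
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low cell */
    return r;
}

/* Adds 1 to the largest single-cell value; returns 1 if the carry
   arrived in the high cell. */
int dcell_carry_demo(void)
{
    DCell a = { UINTPTR_MAX, 0 };
    DCell b = { 1, 0 };
    DCell r = dcell_add(a, b);
    assert(r.lo == 0);
    return r.hi == 1;
}
```

With a long long that really is twice as long as a cell, the same addition is a single C expression, which is the convenience the 1992 definition offered.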
> gcc.info-4:
> "Double-Word Integers
> ====================
Ah, OK, that explains it.  I didn't remember it because I wasn't there
at the time.  Bizarrely, gcc.info-4 isn't in the source repository,
although it is in the tarballs.  Go figure...
Andrew.
I don't think we reported this before 1995, when we first had contact
with 64-bit systems.  Ah, I found the bug report:
http://groups.google.com/group/gnu.gcc.bug/msg/d6f3bcaf15a77814?dmode=source
and it cites the gcc 2.7.0 Manual:
|   GNU C supports data types for integers that are twice as long as
|`long int'.
>Bizarrely, gcc.info-4 isn't in the source repository,
>although it is in the tarballs.  Go figure...
Well, of course it is not in the source repository, because it is not
a source file.  Look for a .texinfo or .texi file.