
On the pitfalls of C and UB: C23 seems to be even worse. :-(


Terje Mathisen

Apr 3, 2023, 2:39:14 PM
https://queue.acm.org/detail.cfm?id=3588242

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Apr 3, 2023, 5:22:17 PM
On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
> https://queue.acm.org/detail.cfm?id=3588242
<
In my 38 years of writing C code, I, personally, have never used realloc()
(or calloc() BTW).
<
In all my future years of programming, I do not expect to compare pointers.
I haven't compared a pointer for the first 53 years of programming, why start
now ? Apparently I have a knack of writing code using pointers that does not
need pointer comparisons.
<
Also note: For the last 52 years I have not programmed a machine where
comparing pointers was "tricky"--I never used the segment stuff in x86-32.
{and thus remain un-brain-damaged or is that un-dame-bramaged.}
<
For the foreseeable future; I will not use machines that have tricky pointers.
The set of such machines are at least {x86-64, ARM, RISC-V, MIPS, VAX,
PDP-11, Nova, SPARC, IBM 360-3090, Mc 68K, Mc 88K, .....} It is not like you
are hurting your access to machines and architectures by making pointer
comparisons "easy".
<
Nor do I care if the C standards committee persists in seeing that C remains
compatible with architectures that are fundamentally against the large flat
linear address space.
<
Can we face facts: The large flat linear address space model has won--
Just like Little Endian has won--and that comparing pointers in a large
flat linear address space is "easy" and there is no reason to make the
language "special" in this way. Better to make the non LFLAS models
pay the cost (are there any left?? If there are, why can't they continue
to use C11 ??)
<
Now, I understand that languages have to evolve forward, and I, for one,
do not care that C23 is migrating even farther from C++; I might even
consider that a GOOD THING !!!
<
I still think in terms of K&R C, but have migrated to using the uinit46_t
spellings for typing.

Scott Lurndal

Apr 3, 2023, 5:36:27 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>> https://queue.acm.org/detail.cfm?id=3588242
><
>In my 38 years of writing C code, I, personally, have never used realloc()
>(or calloc() BTW).

Likewise, albeit in 43 years of C programming.

><
>In all my future years of programming, I do not expect to compare pointers.

Except when comparing them to the null pointer (NULL), which technically
is 'comparing pointers' given the definition of NULL is usually (void *)0.


>I haven't compared a pointer for the first 53 years of programming, why start
>now ? Apparently I have a knack of writing code using pointers that does not
>need pointer comparisons.
><
>Also note: For the last 52 years I have not programmed a machine where
>comparing pointers was "tricky"--I never used the segment stuff in x86-32.
>{and thus remain un-brain-damaged or is that un-dame-bramaged.}

Unfortunately, the Burroughs mainframes I programmed had an
odd segmentation scheme (designed to allow backward compatibility
with 1,000,000 digit address space limits from the original early
60's design, so pointers were rather tricky).

MitchAlsup

Apr 3, 2023, 5:45:06 PM
On Monday, April 3, 2023 at 4:36:27 PM UTC-5, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
> >> https://queue.acm.org/detail.cfm?id=3588242
> ><
> >In my 38 years of writing C code, I, personally, have never used realloc()
> >(or calloc() BTW).
> Likewise, albeit in 43 years of C programming.
> ><
> >In all my future years of programming, I do not expect to compare pointers.
> Except when comparing them to the null pointer (NULL), which technically
> is 'comparing pointers' given the definition of NULL is usually (void *)0.
<
Yes, but when written::
<
if( p ) { dereference( p ); }
or
if( !p ) { avoid_dereferencing( p ); }
<
it is not a comparison !!
<
{{ I fully agree that it is TECHNICALLY and PEDANTICALLY a comparison;
but it does not have a comparison operator in the source...........................}}
>
>
> >I haven't compared a pointer for the first 53 years of programming, why start
> >now ? Apparently I have a knack of writing code using pointers that does not
> >need pointer comparisons.
> ><
> >Also note: For the last 52 years I have not programmed a machine where
> >comparing pointers was "tricky"--I never used the segment stuff in x86-32.
> >{and thus remain un-brain-damaged or is that un-dame-bramaged.}
<
> Unfortunately, the Burroughs mainframes I programmed had an
> odd segmentation scheme (designed to allow backward compatibility
> with 1,000,000 digit address space limits from the original early
> 60's design, so pointers were rather tricky).
<
Never programmed on one (except in algol60, which likely does not count).

Bill Findlay

Apr 3, 2023, 6:16:07 PM
On 3 Apr 2023, MitchAlsup wrote
(in article<a533d5c4-8dca-44eb...@googlegroups.com>):

> I still think in terms of K&R C, but have migrated to using the
> uinit46_t
> spellings for typing.

Is that one of Quadibloc's types? 8-)

--
Bill Findlay


MitchAlsup

Apr 3, 2023, 8:21:54 PM
yes and Woops !!
>
> --
> Bill Findlay

Andy Valencia

Apr 3, 2023, 11:24:41 PM
MitchAlsup <Mitch...@aol.com> writes:
> In my 38 years of writing C code, I, personally, have never used realloc()
> (or calloc() BTW).

So if you have a list of, I don't know, int's, and you add one more to the
list. What do you do instead of realloc() it to make room for another
addition?

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html

Thomas Koenig

Apr 4, 2023, 1:31:14 AM
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> https://queue.acm.org/detail.cfm?id=3588242

While I agree C could be better (who doesn't?), there are some
points in the article that I disagree with, and others that I agree with.

Paywalls - yes, that's a nuisance. Fortunately, as the article
implies, draft copies exist. (Not of the IEEE 754 standard,
it appears).

New keywords outside of the reserved space (leading underscores)
in a language like C, which reserves keywords, are not a good
idea in general because they tend to break existing code. I like
Fortran's approach better in such a case.

unreachable - I don't see the problem here. Removing code that
is not reachable where one branch has unreachable and all other
branches before also do not return is fine. It seems the authors
do not expect programmers to read or understand the control flow
of their programs.

realloc of zero-sized objects: Arguments on both sides. Breaking
existing code is bad, but having zero-size objects is something
I like (cf. Fortran, where these are first-class citizens).
I'd probably come down on the "don't change" side.

BGB

Apr 4, 2023, 1:53:55 AM
On 4/3/2023 10:23 PM, Andy Valencia wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> In my 38 years of writing C code, I, personally, have never used realloc()
>> (or calloc() BTW).
>
> So if you have a list of, I don't know, int's, and you add one more to the
> list. What do you do instead of realloc() it to make room for another
> addition?
>

I use realloc() and pointer comparison fairly often, among other things.

Of course, I haven't been writing code for quite that long...


I didn't do a whole lot of coding in my early years, not really getting
much into programming until after I was in elementary school (originally
in QBasic during the MS-DOS and Win 3.x era; started learning C by late
elem...). Much before that, it was a chaotic time that I don't really
remember.

But, yeah, didn't really entirely switch over to C until roughly middle
school, then mostly using a computer set to dual boot Linux and Windows
NT4, using Cygwin on the latter (before then switching over to Win2K and
using this for pretty much all of high school; during the era where for
most people Win98 started to give way to WinXP).

Prior to Linux+NT4, had been using Win95, but a great HDD crash lost
everything I had prior to this point. I think I went over to NT4 mostly
because I didn't want to reinstall Win95, and Win98's instability didn't
exactly appeal to me much either (at the time, another PC in the house
had been running Win98, so was already familiar with it).


But, all this was many decades ago... (and I am much older now...).

So, say, maybe 25 years of C coding, +/- a little...

robf...@gmail.com

Apr 4, 2023, 2:10:29 AM
I do not use pointer comparison at all AFAIK. I do check for NULL pointers though.
Quite often I use comparisons of objects pointed to, as in p1->object == p2->object.
Because one does not generally know where in the heap an allocated object is
going to end up, it is not that useful to compare pointers for most code. I think I
have seen some code for garbage collection that compares pointers, but that is
a rare case.



Anton Ertl

Apr 4, 2023, 2:29:19 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>> https://queue.acm.org/detail.cfm?id=3588242
><
>In my 38 years of writing C code, I, personally, have never used realloc()
>(or calloc() BTW).

I have.

>In all my future years of programming, I do not expect to compare pointers.
>I haven't compared a pointer for the first 53 years of programming, why start
>now ?

One reason for comparing pointers is if you want to implement
memmove(), as discussed in the article.

Another one (this one within standard-C confines) is walking through
an array with an idiom like this:

for (p=start; p<end; p++)
... *p ...

Another one was when unifying two free logic variables in a Prolog
implementation. We do this by letting one logic variable point to the
other, but which should point to which? The newer one should point to
the older, because the newer one may be deallocated earlier. With
stack allocation, we can find out by pointer comparison which one may
be deallocated earlier. If the stack was a single object in standard
C terms, this might be within its confines, but of course the Prolog
implementation (like pretty much all other production programs)
contained other things that have not been standardized in C, but which
the compilers of the day used to compile as intended.
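
For illustration, a minimal sketch of that binding rule (my code, not the
actual Prolog engine's; it assumes both variable cells live in one stack
array that grows toward higher addresses, so a lower address means an
older cell):

typedef struct cell { struct cell *ref; } cell;   /* ref == NULL: unbound */

/* Unify two unbound variables: the newer (higher-address) cell is made
   to point at the older one, so the reference cannot outlive its target
   when the newer end of the stack is popped first. */
static void bind_unbound(cell *a, cell *b)
{
    if (a < b)
        b->ref = a;   /* b is newer: point it at the older a */
    else
        a->ref = b;
}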

In another case I needed an arbitrary total order of the objects (I
don't remember for what purpose), and the addresses of the objects
conveniently provided an order while the contents didn't. I remember
that this was in a Forth program, and in standard Forth you can
compare addresses.

>For the foreseeable future; I will not use machines that have tricky pointers.
>The set of such machines are at least {x86-64, ARM, RISC-V, MIPS, VAX,
>PDP-11, Nova, SPARC, IBM 360-3090, Mc 68K, Mc 88K, .....}

I think you mean that these are machines that do not have tricky
pointers.

You do not mention IA-32. Is there any OS that exposes the potential
tricky pointers of IA-32 to user-level code?

One interesting aspect of the way IA-32 was designed was that they had
the segments as potentially overlapping extents in a 32-bit linear
address space.

My current take would be to treat the segments like ASIDs, i.e.,
extend the 32-bit address with 13 additional bits from the segment and
have a paging system that maps these 45-bit addresses to physical
memory; i.e., segments would be non-overlapping in this 45-bit address
space (paging could result in several segments sharing pages, and a
Unix would use this to implement, e.g., COW). This might have made it
unnecessary to add the PCID later. I guess it would not have
precluded PAE, unless they had catered for longer physical addresses
from the start. I have not worked out the ramifications of this idea,
so I might change my opinion once I have.

>Can we face facts: The large flat linear address space model has won--
>Just like Little Endian has won--and that comparing pointers in a large
>flat linear address space is "easy" and there is no reason to make the
>language "special" in this way. Better to make the non LFLAS models
>pay the cost (are there any left?? If there are, why can't they continue
>to use C11 ??)

C11, C99, and C89 are not any better wrt. comparing pointers. If you
want a language that has a flat address space, use Forth.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

Apr 4, 2023, 2:45:08 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>> https://queue.acm.org/detail.cfm?id=3588242
><
> In my 38 years of writing C code, I, personally, have never used realloc()
> (or calloc() BTW).

Reading a line of arbitrary length (getline) is the canonical example,
I think.
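
A minimal sketch of that canonical use, assuming no POSIX getline() is
available (the growth factor and names here are mine):

#include <stdio.h>
#include <stdlib.h>

/* Read one line of arbitrary length; returns a malloc'd string the
   caller must free, or NULL on EOF or allocation failure. */
char *read_line(FILE *f)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;
    while ((c = getc(f)) != EOF && c != '\n') {
        if (len + 2 > cap) {            /* room for c plus the NUL */
            char *tmp = realloc(buf, cap += cap / 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
        }
        buf[len++] = (char)c;
    }
    if (len == 0 && c == EOF) { free(buf); return NULL; }
    buf[len] = '\0';
    return buf;
}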

[...]

> Can we face facts: The large flat linear address space model has won--
> Just like Little Endian has won--

With the difference that there are no longer segmented machines on the
market, and big-endian machines are still being sold. Less than before,
sure (and I don't think we will ever see a little-endian zSystem).

[...]

> Now, I understand that languages have to evolve forward, and I, for one,
> do not care that C23 is migrating even farther from C++; I might even
> consider that a GOOD THING !!!

The common subset isn't bad. C++ has been suffering from feature
stampede for decades...

Niklas Holsti

Apr 4, 2023, 5:18:32 AM
On 2023-04-04 8:39, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>>> https://queue.acm.org/detail.cfm?id=3588242


[ snip ]


> C11, C99, and C89 are not any better wrt. comparing pointers. If you
> want a language that has a flat address space, use Forth.


Or use Ada. While you can compare Ada pointers ("access values") only
for equality, you can take the address of any object and compare
addresses in any way you like (=, <, >, ...). You can also convert an
address to a (dedicated predefined) integer type, do arithmetic on it,
and then convert the result back to an address or a pointer.

Michael S

Apr 4, 2023, 8:21:47 AM
Does the Ada language guarantee persistence of address comparison?
I.e., if the address of object A is smaller than the address of object B at
one point in the program, does the same hold for the whole duration of the
lifetimes of A and B?
If so, it effectively disallows any form of GC that does memory compaction.

Michael S

Apr 4, 2023, 8:40:20 AM
Well, thinking about it, the statement above is incorrect.
Compaction without exchanging the order of objects in memory is perfectly possible.
Also, arbitrary compaction is possible if 'address of object'
returns the address of the object's descriptor, which (descriptor) does not have
to be adjacent in memory to the object's bulk.
Sometimes, before posting, I should remind myself that there
is more than one way to skin a cat.

Scott Lurndal

Apr 4, 2023, 9:41:09 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, April 3, 2023 at 4:36:27 PM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>> >> https://queue.acm.org/detail.cfm?id=3588242
>> ><
>> >In my 38 years of writing C code, I, personally, have never used realloc()
>> >(or calloc() BTW).
>> Likewise, albeit in 43 years of C programming.
>> ><
>> >In all my future years of programming, I do not expect to compare pointers.
>> Except when comparing them to the null pointer (NULL), which technically
>> is 'comparing pointers' given the definition of NULL is usually (void *)0.
><
>Yes, but when written::
><
> if( p ) { dereference( p ); }
>or
> if( !p ) { avoid_dereferencing( p ); }

I personally never use that style, reserving it
for boolean (particularly in C++) comparisons; I always
explicitly check for "== NULL".

><
>> Unfortunately, the Burroughs mainframes I programmed had an
>> odd segmentation scheme (designed to allow backward compatibility
>> with 1,000,000 digit address space limits from the original early
>> 60's design, so pointers were rather tricky).
><
>Never programmed on one (except in algol60 which likely does not count ).

ALGOL was on the large systems family (48-bit, stack architecture). I
worked on the medium systems family (addressable to the digit, no
binary arithmetic).

Scott Lurndal

Apr 4, 2023, 9:42:40 AM
Andy Valencia <van...@vsta.org> writes:
>MitchAlsup <Mitch...@aol.com> writes:
>> In my 38 years of writing C code, I, personally, have never used realloc()
>> (or calloc() BTW).
>
>So if you have a list of, I don't know, int's, and you add one more to the
>list. What do you do instead of realloc() it to make room for another
>addition?

I have never needed to have a variable sized table of ints in 43 years of C programming.

Niklas Holsti

Apr 4, 2023, 10:40:23 AM
Those are good questions that I don't remember ever seeing discussed for
Ada. I'll post a question on comp.lang.ada and report back.


> If it's true it effectively disallows any form of GC that does
> memory compaction.

I don't know of any current Ada programming system that provides garbage
collection based on pointer tracing. The defunct port of the GNAT Ada
compiler to the Java Virtual Machine surely did that, but I don't know
in what form and how it would affect this question. There are reports
that the Boehm conservative collector has been used successfully for Ada
programs, but it is not clear to me if that collector does compaction.
It is said to have generational collection, which I believe does usually
involve copying (moving) objects, and that would change addresses.

Niklas Holsti

Apr 4, 2023, 10:48:47 AM
On 2023-04-04 15:40, Michael S wrote:
> On Tuesday, April 4, 2023 at 3:21:47 PM UTC+3, Michael S wrote:
>> On Tuesday, April 4, 2023 at 12:18:32 PM UTC+3, Niklas Holsti wrote:
>>> On 2023-04-04 8:39, Anton Ertl wrote:
>>>> MitchAlsup <Mitch...@aol.com> writes:
>>>>> On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>>>>>> https://queue.acm.org/detail.cfm?id=3588242
>>> [ snip ]
>>>> C11, C99, and C89 are not any better wrt. comparing pointers. If you
>>>> want a language that has a flat address space, use Forth.
>>> Or use Ada. While you can compare Ada pointers ("access values") only
>>> for equality, you can take the address of any object and compare
>>> addresses in any way you like (=, <, >, ...). You can also convert an
>>> address to a (dedicated predefined) integer type, do arithmetic on it,
>>> and then convert the result back to an address or a pointer.
>> Does the Ada language guarantee persistence of address comparison?
>> I.e., if the address of object A is smaller than the address of object B at
>> one point in the program, does the same hold for the whole duration of the
>> lifetimes of A and B?
>> If so, it effectively disallows any form of GC that does memory compaction.
>
> Well, thinking about it, the statement above is incorrect.
> Compaction without exchanging the order of objects in memory is perfectly possible.


Yes, but if you have generational collection that moves objects between
generations and sometimes swaps the generation spaces, it seems
impossible to preserve the order in all cases.


> Also, arbitrary compaction is possible if 'address of object'
> returns the address of the object's descriptor, which (descriptor) does not have
> to be adjacent in memory to the object's bulk.


Right, but as one can take the address of any element of an array --
A(I)'Address -- that would require separate descriptors for each
element, which is not very practical. In fact, the Ada standard strongly
recommends that the address of an array -- A'Address -- should equal the
address of the first array element -- A(A'First)'Address) rather than
the address of the "array descriptor" (if there is one).

Jean-Marc Bourguet

Apr 4, 2023, 11:50:16 AM
Niklas Holsti <niklas...@tidorum.invalid> writes:

> There are reports that the
> Boehm conservative collector has been used successfully for Ada programs,
> but it is not clear to me if that collector does compaction.

A conservative collector can't do compaction; it doesn't know for sure whether a
bit pattern is a pointer or not.

Yours,

--
Jean-Marc

EricP

Apr 4, 2023, 12:07:05 PM
> Terje

I did not think the author's use of the !!num idiom would produce optimal
code, and was curious whether the optimizer would figure it out anyway,
so I ran it through godbolt x86-64 gcc 12.2 -O3 (it did not).

// Author's version using !!num
typedef unsigned long Num_t;
Num_t Foo (Num_t ndogs, Num_t ncats, Num_t nfish)
{
    Num_t petFee;

    petFee =
        !!ndogs * (10 + (ndogs-1) * 5)
      + !!ncats * ( 7 + (ncats-1) * 3)
      + !!nfish * (47 + (nfish-1) * 1);

    return petFee;
}

// My no-!!num version
Num_t Bar (Num_t ndogs, Num_t ncats, Num_t nfish)
{
    Num_t petFee;

    petFee =
        (ndogs ? ndogs*5 + 5 : 0) +
        (ncats ? ncats*3 + 4 : 0) +
        (nfish ? nfish + 46 : 0);

    return petFee;
}

# Compilation provided by Compiler Explorer at https://godbolt.org/
Foo(unsigned long, unsigned long, unsigned long):
xor ecx, ecx
test rsi, rsi
lea rax, [rsi+4+rsi*2]
setne cl
imul rax, rcx
xor ecx, ecx
test rdx, rdx
setne cl
add rdx, 46
imul rdx, rcx
xor ecx, ecx
add rax, rdx
test rdi, rdi
lea rdx, [rdi+1]
setne cl
imul rdx, rcx
lea rdx, [rdx+rdx*4]
add rax, rdx
ret

Bar(unsigned long, unsigned long, unsigned long):
mov rax, rdi
test rdi, rdi
je .L4
lea rax, [rdi+5+rdi*4]
.L4:
test rsi, rsi
je .L5
lea rcx, [rsi+4+rsi*2]
add rax, rcx
.L5:
lea rcx, [rax+46+rdx]
test rdx, rdx
cmovne rax, rcx
ret

I could not get gcc to generate a non-branching
version using CMOVs, for example:

// My non-branching version
Bar(unsigned long, unsigned long, unsigned long):
// ndogs: rdi ncats: rsi nfish: rdx
lea rax, [rdi+rdi*4+5]
test rdi, rdi
cmove rax, rdi // if rdi==0 we can use it to source 0
lea rcx, [rsi+rsi*2+4]
add rcx, rax
test rsi, rsi
cmovne rax, rcx
lea rcx, [rax+rdx+46]
test rdx, rdx
cmovne rax, rcx
ret




BGB

Apr 4, 2023, 1:12:52 PM
Sometimes they are a better option than a linked list or tree structure...

So, I consider "realloc()" to be a reasonably useful feature.

I also consider relative comparison between arbitrary unrelated pointers
as a "this needs to work as usually expected" feature (comparison based
on address). Albeit in BGBCC with bounds-checks or similar enabled, this
does mean that internally the pointers need to have their tag bits
stripped off for the comparison to work correctly (though, could maybe
consider a "SUBA" instruction or similar, which subtracts two pointers
and also sign-extends over the tag bits as-if both pointers had their
tag bits set to zero).

...


In some contexts, a tree-like structure for arrays does have the
advantage that it can support creation and destruction of potentially
dynamically-sized arrays without leading to excessive memory
fragmentation, which is one potential downside of resizing things with
realloc.

Granted, in cases where I had used realloc() for dynamically sized
arrays, they had usually followed an "sz1=sz0+(sz0>>1);" pattern.

For some other contexts, I have used microfloats. Also my memory
managers often use microfloats for representing object sizes internally
(often E5.F3 or similar).


Albeit with the allocation strategy varying some based on object size, say:
Under 512B: Series of 16-byte cells managed by a bitmap;
512B..128K: Use variable-sized memory blocks within a larger region;
128K+: Use memory pages for the allocation.

...


Scott Lurndal

Apr 4, 2023, 1:18:36 PM
BGB <cr8...@gmail.com> writes:
>On 4/4/2023 8:42 AM, Scott Lurndal wrote:
>> Andy Valencia <van...@vsta.org> writes:
>>> MitchAlsup <Mitch...@aol.com> writes:
>>>> In my 38 years of writing C code, I, personally, have never used realloc()
>>>> (or calloc() BTW).
>>>
>>> So if you have a list of, I don't know, int's, and you add one more to the
>>> list. What do you do instead of realloc() it to make room for another
>>> addition?
>>
>> I have never needed to have a variable sized table of ints in 43 years of C programming.
>
>Sometimes they are a better option than a linked list or tree structure...

Examples would be useful, other than academic exercises.

>
>So, I consider "realloc()" to be a reasonably useful feature.

It is used under-the-covers in C code generated by lex(1), I've noticed
(albeit wrapped by yyrealloc) when parsing.

>
>I also consider relative comparison between arbitrary unrelated pointers
>as a "this needs to work as usually expected" feature (comparison based
>on address).

Seems prone to error. If your pointers reference elements in a contiguous array, that's
one thing. If they're allocated via some allocator, that's a case prone to
fail eventually. Comparing for equality or NULL is the only nonproblematic usage.

MitchAlsup

Apr 4, 2023, 1:23:01 PM
On Monday, April 3, 2023 at 10:24:41 PM UTC-5, Andy Valencia wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> > In my 38 years of writing C code, I, personally, have never used realloc()
> > (or calloc() BTW).
> So if you have a list of, I don't know, int's, and you add one more to the
> list. What do you do instead of realloc() it to make room for another
> addition?
<
Malloc and linked lists.
<
I never use vectors of structs unless the length of the vector is fixed
at compile time. When unbounded or unknown, I use pointer-linked
lists.
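
As a sketch of that approach (illustrative code, not Mitch's): each
element is its own malloc() allocation pushed onto a list head, so
nothing already in the list is ever resized or moved:

#include <stdlib.h>

typedef struct Node {
    struct Node *next;
    int payload;              /* stand-in for a real struct body */
} Node;

/* Push a new element; returns the new head, or NULL on allocation
   failure (in which case the old head is still valid). */
static Node *push(Node *head, int value)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->payload = value;
    n->next = head;
    return n;
}

The trade-off raised later in the thread is real: one allocation and one
extra pointer per element, and no cache-friendly contiguity.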

MitchAlsup

Apr 4, 2023, 1:33:37 PM
On Tuesday, April 4, 2023 at 1:29:19 AM UTC-5, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
> >> https://queue.acm.org/detail.cfm?id=3588242
> ><
> >In my 38 years of writing C code, I, personally, have never used realloc()
> >(or calloc() BTW).
> I have.
> >In all my future years of programming, I do not expect to compare pointers.
> >I haven't compared a pointer for the first 53 years of programming, why start
> >now ?
>
> One reason for comparing pointers is if you want to implement
> memmove(), as discussed in the article.
<
Memmove() is free to implement pointer comparison, and I can use
memmove (but mainly I use struct assignment and let the compiler
figure it out). I did not write memmove(), but in my ISA there is
an MM instruction where HW does the pointer compares.
<
In my uses of memmove() it is structurally equal to memcpy().
>
> Another one (this one within standard-C confines) is walking through
> an array with an idiom like this:
>
> for (p=start; p<end; p++)
> ... *p ...
<
for( i = 0; i<max; i++ )
array[i] ...
<
I actually write array code that looks like array code.
>
> Another one was when unifying two free logic variables in a Prolog
> implementation. We do this by letting one logic variable point to the
> other, but which should point to which? The newer one should point to
> the older, because the newer one may be deallocated earlier. With
> stack allocation, we can find out by pointer comparison which one may
> be deallocated earlier. If the stack was a single object in standard
> C terms, this might be within its confines, but of course the Prolog
> implementation (like pretty much all other production programs)
> contained other things that have not been standardized in C, but which
> the compilers of the day used to compile as intended.
>
> In another case I needed an arbitrary total order of the objects (I
> don't remember for what purpose), and the addresses of the objects
> conveniently provided an order while the contents didn't. I remember
> that this was in a Forth program, and in standard Forth you can
> compare addresses.
<
Remember, I do not ever expect to use a machine where pointer
comparisons are tricky, nor do I even expect to create, sell, or give
away code that will be run on such a machine.
>
> >For the foreseeable future; I will not use machines that have tricky pointers.
> >The set of such machines are at least {x86-64, ARM, RISC-V, MIPS, VAX,
> >PDP-11, Nova, SPARC, IBM 360-3090, Mc 68K, Mc 88K, .....}
> I think you mean that these are machines that do not have tricky
> pointers.
>
> You do not mention IA-32. Is there any OS that exposes the potential
> tricky pointers of IA-32 to user-level code?
<
Not that I would be using.
>
> One interesting aspect of the way IA-32 was designed was that they had
> the segments as potentially overlapping extents in a 32-bit linear
> address space.
<
A poor feature that causes dame-branage.
>
> My current take would be to treat the segments like ASIDs, i.e.,
> extend the 32-bit address with 13 additional bits from the segment and
> have a paging system that maps these 45-bit addresses to physical
> memory; i.e., segments would be non-overlapping in this 45-bit address
> space (paging could result in several segments sharing pages, and a
> Unix would use this to implement, e.g., COW). This might have made it
> unnecessary to add the PCID later. I guess it would not have
> precluded PAE, unless they had catered for longer physical addresses
> from the start. I have not worked out the ramifications of this idea,
> so I might change my opinion once I have.
<
> >Can we face facts: The large flat linear address space model has won--
> >Just like Little Endian has won--and that comparing pointers in a large
> >flat linear address space is "easy" and there is no reason to make the
> >language "special" in this way. Better to make the non LFLAS models
> >pay the cost (are there any left?? If there are, why can't they continue
> >to use C11 ??)
>
> C11, C99, and C89 are not any better wrt. comparing pointers. If you
> want a language that has a flat address space, use Forth.
<
Yeah, let's see you write 8-wide BGOoO simulators in Forth.

MitchAlsup

Apr 4, 2023, 1:42:09 PM
On Tuesday, April 4, 2023 at 1:45:08 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
> >> https://queue.acm.org/detail.cfm?id=3588242
> ><
> > In my 38 years of writing C code, I, personally, have never used realloc()
> > (or calloc() BTW).
> Reading a line of arbitrary length (getline) is the canonical example,
> I think.
<
while( (c = getchar()) != EOF )
{
    if( table[ c ] & NewLine ) break;
    // still in the same line
}
<
Doing it 1 character at a time eliminates arbitrary length problems.
>
> [...]
> > Can we face facts: The large flat linear address space model has won--
> > Just like Little Endian has won--
<
> With the difference that there are no longer segmented machines on the
> market, and big-endian machines are still being sold. Less than before,
> sure (and I don't think we will ever see a little-endian zSystem).
<
I don't have a problem with BE machines, I only have a problem with
not large flat address space machines.
>
> [...]
> > Now, I understand that languages have to evolve forward, and I, for one,
> > do not care that C23 is migrating even farther from C++; I might even
> > consider that a GOOD THING !!!
<
> The common subset isn't bad. C++ has been suffering from feature
> stampede for decades...
<
C++ has gotten a little too big for its britches.

BGB

Apr 4, 2023, 1:48:46 PM
On 4/4/2023 12:18 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 4/4/2023 8:42 AM, Scott Lurndal wrote:
>>> Andy Valencia <van...@vsta.org> writes:
>>>> MitchAlsup <Mitch...@aol.com> writes:
>>>>> In my 38 years of writing C code, I, personally, have never used realloc()
>>>>> (or calloc() BTW).
>>>>
>>>> So if you have a list of, I don't know, int's, and you add one more to the
>>>> list. What do you do instead of realloc() it to make room for another
>>>> addition?
>>>
>>> I have never needed to have a variable sized table of ints in 43 years of C programming.
>>
>> Sometimes they are a better option than a linked list or tree structure...
>
> Examples would be useful, other than academic exercises.
>

Say:
// O(1): flat array
v=arr[idx];

// O(n): linked-list walk
while(obj && (obj->key!=idx))
    obj=obj->next;

// O(log n): radix tree, one 256-way level per byte of the index
l=3; p=root; v=0;
while(p && (l>0))
    { p=p[(idx>>(l*8))&255]; l--; }
if(p)
    v=((int *)p)[idx&255];

Main merit of the latter being mostly that the "array" can be expanded
without leading to excessive fragmentation.

But, this may come at a potentially significant performance impact.


>>
>> So, I consider "realloc()" to be a reasonably useful feature.
>
> It is used under-the-covers in C code generated by lex(1), I've noticed
> (albeit wrapped by yyrealloc) when parsing.
>
>>
>> I also consider relative comparison between arbitrary unrelated pointers
>> as a "this needs to work as usually expected" feature (comparison based
>> on address).
>
> Seems prone to error. If your pointers reference elements in a contiguous array, that's
> one thing. If they're allocated via some allocator, that's a case prone to
> fail eventually. Comparing for equality or NULL is the only nonproblematic usage.

As I see it, pointer comparison should be expected to work in the usual
way. As for the relative location of things in the heap or similar, this
is pretty much arbitrary (and may vary between runs based on things like
ASLR or similar).


MitchAlsup

Apr 4, 2023, 1:51:20 PM
Come to think of it, I don't need/use garbage collection, nor do I end up free()ing
allocations. I tend to use fork-exec and shmat when I need to build an arbitrary
<something> that will get evaluated and reduced to something small-ish. The
forked process gets all free()ed at exit() without the possibility of memory leaks.
<
I have found this faster than attempting to do GC (i.e., free()) myself.
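
A minimal sketch of that pattern (my code; it uses POSIX
mmap(MAP_SHARED|MAP_ANONYMOUS) where the post mentions shmat(), but the
idea is the same): the child allocates freely while building, writes only
the reduced result into shared memory, and exit() reclaims everything at
once:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Small shared area for the reduced result. */
    long *result = mmap(NULL, sizeof *result, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (result == MAP_FAILED) return 1;

    pid_t pid = fork();
    if (pid == 0) {             /* child: build freely, never free() */
        long sum = 0;
        for (int i = 0; i < 1000; i++) {
            int *scratch = malloc(4096 * sizeof *scratch);
            if (scratch) { scratch[0] = i; sum += scratch[0]; }
        }
        *result = sum;          /* the small-ish reduced value */
        _exit(0);               /* all child allocations vanish here */
    }
    waitpid(pid, NULL, 0);
    printf("reduced result: %ld\n", *result);
    return 0;
}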

Scott Lurndal

Apr 4, 2023, 1:54:31 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Tuesday, April 4, 2023 at 1:29:19 AM UTC-5, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Monday, April 3, 2023 at 1:39:14 PM UTC-5, Terje Mathisen wrote:
>> >> https://queue.acm.org/detail.cfm?id=3588242
>> ><
>> >In my 38 years of writing C code, I, personally, have never used realloc()
>> >(or calloc() BTW).
>> I have.
>> >In all my future years of programming, I do not expect to compare pointers.
>> >I haven't compared a pointer for the first 53 years of programming, why start
>> >now ?
>>
>> One reason for comparing pointers is if you want to implement
>> memmove(), as discussed in the article.
><
>Memmove() is free to implement pointer comparison, and I can use
>memmove (but mainly I use struct assignment and let the compiler
>figure it out.) But I did not write memmove(), but in my ISA there is
>an MM instruction where HW does the pointer compares.

Most implementations of memmove I'm familiar with are written
in assembler for speed, or fall back to a simple reverse bytewise
copy loop when the source is after the destination - and they
don't actually compare pointers, but rather subtract the
source pointer from the destination pointer and take the
reverse path if the difference is less than the length
being moved.

(from GLIBC 2.18):

rettype
inhibit_loop_to_libcall
MEMMOVE (a1, a2, len)
a1const void *a1;
a2const void *a2;
size_t len;
{
unsigned long int dstp = (long int) dest;
unsigned long int srcp = (long int) src;

/* This test makes the forward copying code be used whenever possible.
Reduces the working set. */
if (dstp - srcp >= len) /* *Unsigned* compare! */


In GLIBC 2.3:

void bcopy PARAMS ((const void*, void*, size_t));

PTR
memmove (s1, s2, n)
PTR s1;
const PTR s2;
size_t n;
{
bcopy (s2, s1, n);
return s1;
}

In GLIBC 2.34 (specific for 32-bit x86):

ENTRY (memmove)

pushl %edi
cfi_adjust_cfa_offset (4)

movl LEN(%esp), %ecx
movl DEST(%esp), %edi
cfi_rel_offset (edi, 0)
movl %esi, %edx
movl SRC(%esp), %esi
cfi_register (esi, edx)

movl %edi, %eax
subl %esi, %eax
cmpl %eax, %ecx
ja 3f

cld
shrl $1, %ecx
jnc 1f
movsb
1: shrl $1, %ecx
jnc 2f
movsw
2: rep
movsl
movl %edx, %esi
cfi_restore (esi)
#ifndef USE_AS_BCOPY
movl DEST(%esp), %eax
#endif

popl %edi
cfi_adjust_cfa_offset (-4)
cfi_restore (edi)

ret

cfi_adjust_cfa_offset (4)
cfi_rel_offset (edi, 0)
cfi_register (esi, edx)

/* Backward copying. */
3: std
leal -1(%edi, %ecx), %edi
leal -1(%esi, %ecx), %esi
shrl $1, %ecx
jnc 1f
movsb
1: subl $1, %edi
subl $1, %esi
shrl $1, %ecx
jnc 2f
movsw
2: subl $2, %edi
subl $2, %esi
rep
movsl
movl %edx, %esi
cfi_restore (esi)
#ifndef USE_AS_BCOPY
movl DEST(%esp), %eax
#endif

cld
popl %edi
cfi_adjust_cfa_offset (-4)
cfi_restore (edi)

ret
END (memmove)
#ifndef USE_AS_BCOPY
libc_hidden_builtin_def (memmove)
#endif

For X86_64:

/* memmove/memcpy/mempcpy is implemented as:
1. Use overlapping load and store to avoid branch.
2. Load all sources into registers and store them together to avoid
possible address overlap between source and destination.
3. If size is 8 * VEC_SIZE or less, load all sources into registers
and store them together.
4. If address of destination > address of source, backward copy
4 * VEC_SIZE at a time with unaligned load and aligned store.
Load the first 4 * VEC and last VEC before the loop and store
them after the loop to support overlapping addresses.
5. Otherwise, forward copy 4 * VEC_SIZE at a time with unaligned
load and aligned store. Load the last 4 * VEC and first VEC
before the loop and store them after the loop to support
overlapping addresses.
6. On machines with ERMS feature, if size greater than equal or to
__x86_rep_movsb_threshold and less than
__x86_rep_movsb_stop_threshold, then REP MOVSB will be used.
7. If size >= __x86_shared_non_temporal_threshold and there is no
overlap between destination and source, use non-temporal store
instead of aligned store copying from either 2 or 4 pages at
once.
8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold
and source and destination do not page alias, copy from 2 pages
at once using non-temporal stores. Page aliasing in this case is
considered true if destination's page alignment - sources' page
alignment is less than 8 * VEC_SIZE.
9. If size >= 16 * __x86_shared_non_temporal_threshold or source
and destination do page alias copy from 4 pages at once using
non-temporal stores. */

For RISCV:

ENTRY(__memmove)
WEAK(memmove)
move t0, a0
move t1, a1

beq a0, a1, exit_memcpy
beqz a2, exit_memcpy
srli t2, a2, 0x2

slt t3, a0, a1
beqz t3, do_reverse

andi a2, a2, 0x3
li t4, 1
beqz t2, byte_copy

word_copy:
lw t3, 0(a1)
addi t2, t2, -1
addi a1, a1, 4
sw t3, 0(a0)
addi a0, a0, 4
bnez t2, word_copy
beqz a2, exit_memcpy
j byte_copy

do_reverse:
add a0, a0, a2
add a1, a1, a2
andi a2, a2, 0x3
li t4, -1
beqz t2, reverse_byte_copy

reverse_word_copy:
addi a1, a1, -4
addi t2, t2, -1
lw t3, 0(a1)
addi a0, a0, -4
sw t3, 0(a0)
bnez t2, reverse_word_copy
beqz a2, exit_memcpy

reverse_byte_copy:
addi a0, a0, -1
addi a1, a1, -1

byte_copy:
lb t3, 0(a1)
addi a2, a2, -1
sb t3, 0(a0)
add a1, a1, t4
add a0, a0, t4
bnez a2, byte_copy

exit_memcpy:
move a0, t0
move a1, t1
ret
END(__memmove)

Scott Lurndal

Apr 4, 2023, 1:59:47 PM
That's not what I asked for. Examples showing the usefulness
of a variable array of ints. What kind of problem is it attempting
to solve?



>As I see it, pointer comparison should be expected to work in the usual
>way. As for the relative location of things in the heap or similar, this
>is pretty much arbitrary (and may vary between runs based on things like
>ASLR or similar).

They should be considered unpredictable, and thus can't be used meaningfully
other than the case where both are known to point within a contiguous vector.

What does it _mean_ that one pointer is less than another to you in the
general case? How is it useful for application code to know that (leaving
aside memmove, which is a special case as a system interface).

Andy Valencia

Apr 4, 2023, 2:22:37 PM
sc...@slp53.sl.home (Scott Lurndal) writes:
> I have never needed to have a variable sized table of ints in 43 years of C
> programming.

(I can't tell if you're being ironic or not.)

stdin has (guaranteed well-formed) int's, read them to EOF and then
note values and number of times the value was encountered. I've also
skipped out-of-memory and such.

(Interestingly, once compiled, it ran correctly on the first try.)

How would you do it?

=====================

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    char *s, buf[16];
    int x, val;
    int nval = 0, *vals = NULL, *vcount = NULL;

    for (;;) {
        s = fgets(buf, sizeof(buf), stdin);
        if (s == NULL) {
            break;
        }
        val = atoi(s);
        for (x = 0; x < nval; ++x) {
            if (vals[x] == val) {
                vcount[x] += 1;
                break;
            }
        }
        if (x == nval) {
            nval += 1;
            vals = realloc(vals, sizeof(int) * nval);
            vcount = realloc(vcount, sizeof(int) * nval);
            vals[nval-1] = val;
            vcount[nval-1] = 1;
        }
    }
    for (x = 0; x < nval; ++x) {
        printf("%d\t%d\n", vals[x], vcount[x]);
    }
    return(0);
}

Scott Lurndal

Apr 4, 2023, 2:33:27 PM
Andy Valencia <van...@vsta.org> writes:
>sc...@slp53.sl.home (Scott Lurndal) writes:
>> I have never needed to have a variable sized table of ints in 43 years of C
>> programming.
>
>(I can't tell if you're being ironic or not.)
>
>stdin has (guaranteed well-formed) int's, read them to EOF and then
>note values and number of times the value was encountered. I've also
>skipped out-of-memory and such.
>
>(Interestingly, once compiled, it ran correctly on the first try.)
>
>How would you do it?

That's not what I said. I said I've never had a need to do it. No
application, operating system, hypervisor, or other C or C++ code
that I've written has ever required reading a sequence of integers
from stdin.

I will note that I never use atoi, but rather use strtoul/strtoull
for unsigned (which by far is most common for me) or strtol/strtoll
for signed (very uncommon for me to use any signed integers in my
programming).

I particularly like that strtoul, et alia, can be used in parsing
where the value may have a suffix (e.g. 10g or 10m or 10k) because
it returns an optional pointer to the character following the
integer, and it also supports any base (and can automatically
detect the base using the standard unix rules, i.e. 0 prefix
for octal and 0x prefix for hex).

e.g.

// Usage:
// control CC/0 line LINE# serial DEVICE BAUD-RATE {PS|P2P}
// control CC/0 line LINE# listen PORT [INTERFACE]
// Examples:
// control 17/0 line 0 listen 10017
// control 17/0 line 1 serial /dev/ttyUSB0 38400
// control 17/0 line 2 listen 10117 localhost
//
    if (strcasecmp(argv[1], "line") == 0) {
        char *cp;
        ulong line;

        if (unit != 0) {
            if (unit == INVALID_UNIT) {
                lp->log("%s Unit # must be supplied\n", t_dlp_name);
            } else {
                lp->log("%s Invalid Unit\n", t_dlp_name);
            }
            return true;
        }

        if (argc < 3) {
            lp->log("%s Line # must follow line keyword\n", t_dlp_name);
            return true;
        }

        line = strtoul(argv[2], &cp, 0);
        if ((cp == argv[2])                 /* Argument is not numeric */
            || (*cp != '\0')                /* Suffix on numeric value not allowed */
            || (line >= MAX_DCDLP_LINES)) { /* Line number is out of range */
            lp->log("%s Illegal line number '%s' specified\n",
                    t_dlp_name, argv[2]);
            return true;
        }

Terje Mathisen

Apr 4, 2023, 2:53:09 PM
That's pretty good. :-)

Minimum latency, given a 1-cycle CMOVcc, would be 5-6 cycles, but 8-10
seems more probable.

The only obvious way to beat that would be to set it all up for SIMD, so
that we could do the 3 MULs simultaneously, 3 tests for nonzero and 3
ANDs of the base cost into those masks, before adding them and a final
horizontal add. Still probably 5-6 cycles...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Andy Valencia

Apr 4, 2023, 2:53:48 PM
sc...@slp53.sl.home (Scott Lurndal) writes:
> That's not what I said. I said I've never had a need to do it. No
> application, operating system, hypervisor, or other C or C++ code
> that I've written has ever required reading a sequence of integers
> from stdin.

Ah, well then. I've never had to write a program from scratch in C++
which used the letter "i". (Having moved this conversation thus far down the
field, I will now retire.)

Terje Mathisen

Apr 4, 2023, 2:55:12 PM
MitchAlsup wrote:
> On Monday, April 3, 2023 at 10:24:41 PM UTC-5, Andy Valencia wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> In my 38 years of writing C code, I, personally, have never used realloc()
>>> (or calloc() BTW).
>> So if you have a list of, I don't know, int's, and you add one more to the
>> list. What do you do instead of realloc() it to make room for another
>> addition?
> <
> Malloc and linked lists.
> <
> I never use vectors of structs unless the length of the vector is fixed
> at compile time. When unbounded or unknown, I use pointer-linked
> lists.

That means that you are often giving up 50% performance and use up to
100% more memory...

MitchAlsup

Apr 4, 2023, 3:52:26 PM
On Tuesday, April 4, 2023 at 1:55:12 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, April 3, 2023 at 10:24:41 PM UTC-5, Andy Valencia wrote:
> >> MitchAlsup <Mitch...@aol.com> writes:
> >>> In my 38 years of writing C code, I, personally, have never used realloc()
> >>> (or calloc() BTW).
> >> So if you have a list of, I don't know, int's, and you add one more to the
> >> list. What do you do instead of realloc() it to make room for another
> >> addition?
> > <
> > Malloc and linked lists.
> > <
> > I never use vectors of structs unless the length of the vector is fixed
> > at compile time. When unbounded or unknown, I use pointer-linked
> > lists.
> That means that you are often giving up 50% performance and use up to
> 100% more memory...
<
But I don't make linked lists of scalar ints, nor do I need variable vectors of
scalar ints.
<
When I need a variable length vector of structs, I use linked lists.
<
And these never seem to dominate execution time.

BGB

Apr 4, 2023, 9:22:36 PM
Say, consider you have a symbol table in a compiler:
Do you have a symbol table that can expand as needed;
Or, do you just allocate the maximum number and have the compiler die if
the limit is hit?
...

Say, compiler writer is like "64k symbols should be enough for anyone",
and then compiler craps itself if one tries to include too many headers.

Or, say, the buffers for sections in a linker.
Section buffers that can expand?...
Or, just up-front allocate the largest possible section, and possibly
waste a bunch of memory doing so?...


Granted, neither of these necessarily requires realloc(), but it is a
moderately simple and effective way to allow for this sort of thing.
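
For illustration, the expandable option with realloc(), using the
roughly 1.5x growth pattern mentioned earlier in the thread (the names
and sizes are made up, not from any real compiler):

#include <stdio.h>
#include <stdlib.h>

typedef struct { char name[64]; int type; } Sym;
typedef struct { Sym *tab; size_t count, cap; } SymTable;

/* Append a symbol, growing the table by ~1.5x when full.  On
   allocation failure the old table is left intact and -1 is returned. */
static int symtab_add(SymTable *st, const char *name, int type)
{
    if (st->count == st->cap) {
        size_t ncap = st->cap ? st->cap + (st->cap >> 1) : 64;
        Sym *ntab = realloc(st->tab, ncap * sizeof *ntab);
        if (ntab == NULL)
            return -1;
        st->tab = ntab;
        st->cap = ncap;
    }
    snprintf(st->tab[st->count].name, sizeof st->tab[st->count].name,
             "%s", name);
    st->tab[st->count].type = type;
    st->count++;
    return 0;
}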


>
>
>> As I see it, pointer comparison should be expected to work as expected.
>> As for the relative location of things in the heap or similar, this is
>> pretty much arbitrary (and may vary between runs based on things like
>> ASLR or similar).
>
> They should be considered unpredictable, and thus can't be used meaningfully
> other than the case where both are known to point within a contiguous vector.
>
> What does it _mean_ that one pointer is less than another to you in the
> general case? How is it useful for application code to know that (leaving
> aside memmove, which is a special case as a system interface).

Consider you want to have an array of objects and then look up the
object in an array given a pointer somewhere into the interior of the
object.

Being able to compare arbitrary pointers is plenty useful in this case,
even if one does not necessarily know (in advance) where those objects
will be within the address space.
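
A sketch of that lookup (my code; it stays within standard C's rules for
relational pointer comparison as long as the objects really are elements
of one contiguous array):

#include <stddef.h>

typedef struct { char payload[128]; } Obj;

/* Given a pointer somewhere inside one of the array's objects, return
   the object that contains it, or NULL if p is outside the array. */
static Obj *find_owner(Obj *arr, size_t n, const void *p)
{
    const char *c = p;
    if (c >= (const char *)arr && c < (const char *)(arr + n))
        return arr + ((c - (const char *)arr) / sizeof(Obj));
    return NULL;
}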



Granted, as I see it, programs should ideally not have behavior that
depends on the relative order of allocations in memory; this much is given.


So, it is more a case of:
ptra=malloc(100);
ptrb=malloc(100);
printf("%d\n", ptra>ptrb);

Printing a value that is either 0 or 1, and not being able to say in
advance whether the program will print 0 or 1, rather than, say, giving
the compiler license to "nuke the program from orbit".


Similarly, in BGBCC:
static int arra[256];
static int arrb[256];

Will tend to have the two arrays in randomized locations relative to
each other (the compiler defaults to using a pseudo-random shuffling of
all of the program contents), followed by a partial re-sort based on
access probabilities (and alignment packing-density for global
variables). The re-sort is a little lax, as a strict ranking would
defeat the effect of the shuffling (so, it is more like, "swap A and B
if for priority pA and pB, (pB*2.65)>pA").

Where, idea is to have "more commonly used stuff" closer to the start of
the sections, and the far ends for "less commonly used stuff" (with
stuff that is unreachable being culled from the program binary).

Granted, one might also want things like called functions to be
clustered close to their callers, but optimizing for this case would be
computationally expensive.


Does not means any of this stuff can't be compared, only that (based on
the source code), one has no idea where it will be in the final binary.

Though, this is intended as a sort of passive security feature (if
programs are recompiled occasionally, this lessens the potential damage
that can be done by buffer-overflow exploits, assuming they get past
some of the other checks; and is "reasonably cheap").

...

robf...@gmail.com

Apr 5, 2023, 12:15:14 AM
I guess I could be in for trouble, but I have not run into limits yet. I just
statically allocate most things, although calloc() and free() are also used.
I think it is faster to have things statically allocated when the program is
loaded rather than making calls to calloc() and free() at run time. Many blocks
are allocated at once with one large calloc(), then the block is treated as an
array.

I think it makes more sense to calloc() and free() with smaller memory
systems. With today's machines a static 4MB allocation is not a lot.
My compiler is designed for my personal use, so maybe it would not work
so great in large batch builds.

class Compiler
{
public:
    int typenum;
    int symnum;
    short int funcnum;
    SYM symbolTable[32768];
    Function functionTable[3000];
    TYP typeTable[32768];
    OperandFactory of;
    FunctionFactory ff;
    ExpressionFactory ef;
    StatementFactory sf;

BGB

Apr 5, 2023, 2:10:33 AM
It depends.

For example, even if my compiler is pretty lightweight by modern
standards, it would still be pretty heavyweight if dealing with, say,
needing to be able to run on my BJX2 core...

Generally uses expandable arrays for many of its core tables.


Similarly, section buffers start at a few K, but can potentially reach
MB, ...


> class Compiler
> {
> public:
> int typenum;
> int symnum;
> short int funcnum;
> SYM symbolTable[32768];
> Function functionTable[3000];
> TYP typeTable[32768];
> OperandFactory of;
> FunctionFactory ff;
> ExpressionFactory ef;
> StatementFactory sf;
>

In BGBCC, structure is more like:
  Global Table
    Functions, Variables, etc.
  Literal Table
    Structures, etc.
  Type-Overflow Table
    Types which can't fit into a bit-packed 32-bit format.
  ...


Where, the default internal representation for types is roughly:
(27: 0): Depends on type
(31:28): Type-Tag Type
With a few variants which shuffle the fields around some (larger
base-type, or larger array size).


Variant 0: Default
  (11: 0): Base Type ID
  (15:12): Pointer/Array Indirection Level
  (27:16): Array Size
  (31:28): Type-Tag Type (0)
Variant 2: Big Array, Primitive Type
  ( 5: 0): Base Type ID
  ( 7: 6): Pointer/Array Indirection Level
  (27: 8): Array Size
  (31:28): Type-Tag Type (1)
Variant 3: Big Type, Small Array
  (19: 0): Base Type ID
  (21:20): Pointer/Array Indirection Level
  (27:22): Array Size (0-63)
  (31:28): Type-Tag Type (3)
Variant 4: Type Overflow
  (19: 0): Type-Overflow Index
  (27:20): Reserved
  (31:28): Type-Tag Type (2)
...

With a type-overflow structure then holding the various parameters, and
some extra fields for special modifiers.

In other cases, types are represented offline using an ASCII based notation.

In this scheme, Base-Types 000..0FF are used for primitive types, and
100..FFF are an index into the Literal Table (presumably giving an index
to a struct or similar).
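
As a hedged sketch, the Variant-0 word above could be packed and
unpacked like this (helper names are mine, not BGBCC's actual API):

#include <stdint.h>

/* Variant 0: (11:0) base type ID, (15:12) indirection level,
   (27:16) array size, (31:28) tag == 0. */
static uint32_t type_pack0(uint32_t base, uint32_t ind, uint32_t asz)
{
    return (base & 0xFFFu) | ((ind & 0xFu) << 12) | ((asz & 0xFFFu) << 16);
}

static uint32_t type_base(uint32_t ty) { return ty & 0xFFFu; }
static uint32_t type_ind(uint32_t ty)  { return (ty >> 12) & 0xFu; }
static uint32_t type_asz(uint32_t ty)  { return (ty >> 16) & 0xFFFu; }
static uint32_t type_tag(uint32_t ty)  { return (ty >> 28) & 0xFu; }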


Global and Literal tables consist of pointers to structs which describe
the object in question (name, type, contents, ...).

Variable and Constant references have a similar structure, but are
bit-packed 64-bit format (often containing the type-field as part of the
bit-packed structure).

For local variables:
  (11: 0): Variable ID
  (23:12): Sequence Number
  (55:24): Access Type
  (63:56): Variable-Tag-Type
For global variables:
  (23: 0): Global ID (index into Global Table)
  (55:24): Access Type
  (63:56): Variable-Tag-Type
Others would include things like integer literals, string literals (an
offset into a string table), ...
...


Granted, the tables are potentially a little bigger, given BGBCC
compiles a full program at once rather than having separate compilation
and linking (for static libraries, it runs the compiler front-end but
then spits out a stack-oriented bytecode which can be loaded back into
the compiler; generally before any new contents are compiled).


In C mode, the ASTs and similar are discarded after each translation
unit is compiled by the front-end.

Where ASTs are organized as a tree of key-value nodes:
{
    u16  tag;          //tag for node
    byte nattr;        //number of attributes in node
    byte nalvl;        //node level
    byte ztag;         //zone tag
    u16  attr_key[8];
    u64  attr_val[8];
}

Where, say:
nalvl==0: keys held directly in node.
nalvl>0: keys held indirectly within sub-nodes.
key holds the starting key of the child node;
value refers to the sub-node.

Where, keys are bit-packed:
(11: 0): Index for key name
000..0FF: Names
100..FFF: Index Number (reversed, 0->FFF, 1->FFE, ...).
(15:12): Value type (String, Integer, Double, Node, ...).

This avoids any variable-length arrays, instead using multiple nodes to
encode their contents if they can't fit in a single node (essentially, a
type of B-Tree). This works OK mostly as most nodes have fewer than 8 keys.

...
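
For illustration, a key lookup over such nodes might go as below; this
is my reading of the description (the descent rule and storing child
pointers in attr_val are assumptions, not BGBCC's code):

#include <stdint.h>

typedef struct ASTNode ASTNode;
struct ASTNode {
    uint16_t tag;          /* tag for node */
    uint8_t  nattr;        /* number of attributes in node */
    uint8_t  nalvl;        /* node level */
    uint8_t  ztag;         /* zone tag */
    uint16_t attr_key[8];
    uint64_t attr_val[8];
};

/* Descend interior nodes (whose keys are the starting keys of their
   children) to the leaf, then scan the leaf for an exact key match. */
static int ast_find(ASTNode *n, uint16_t key, uint64_t *out)
{
    while (n->nalvl > 0) {
        int i = n->nattr - 1;
        while (i > 0 && n->attr_key[i] > key)
            i--;                       /* last child starting at <= key */
        n = (ASTNode *)(uintptr_t)n->attr_val[i];
    }
    for (int i = 0; i < n->nattr; i++)
        if (n->attr_key[i] == key) { *out = n->attr_val[i]; return 1; }
    return 0;
}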

In some sense, the nodes will represent code with a structure like:
{tag=#if, cond: {...}, then: {...}, else: {...}}

But, then for legacy reasons, may represent it externally in a textual
form as:
<if>
<cond>...</cond>
<then>...</then>
<else>...</else>
</if>

Though, a few of my other compilers have used other structures, such as
cons-cell lists, ...

Terje Mathisen

Apr 5, 2023, 3:48:57 AM
MitchAlsup wrote:
> On Tuesday, April 4, 2023 at 1:55:12 PM UTC-5, Terje Mathisen wrote:
>> MitchAlsup wrote:
>>> On Monday, April 3, 2023 at 10:24:41 PM UTC-5, Andy Valencia wrote:
>>>> MitchAlsup <Mitch...@aol.com> writes:
>>>>> In my 38 years of writing C code, I, personally, have never used realloc()
>>>>> (or calloc() BTW).
>>>> So if you have a list of, I don't know, int's, and you add one more to the
>>>> list. What do you do instead of realloc() it to make room for another
>>>> addition?
>>> <
>>> Malloc and linked lists.
>>> <
>>> I never use vectors of structs unless the length of the vector is fixed
>>> at compile time. When unbounded or unknown, I use pointer-linked
>>> lists.
>> That means that you are often giving up 50% performance and use up to
>> 100% more memory...
> <
> But I don't make linked lists of scalar ints, nor do I need variable vectors of
> scalar ints.
> <
> When I need a variable length vector of structs, I use linked lists.

OK.
> <
> And these never seem to dominate execution time.

I should have known/assumed that you've measured it! :-)

Terje

Scott Lurndal

Apr 5, 2023, 10:39:51 AM
I use a double-linked list, of course, such as this from a mainframe assembler:

/**
* Symbol table entry
*/
class c_symbol: public c_dlist {
public:
static const size_t MAX_SYMSIZE = 6; //< For now (matching 1975 semantics)
static const uint64 DEFAULT_VERSION = 1;

enum e_datatype {
SYM_UN = 0, //< Unsigned Numeric
SYM_SN = 1, //< Signed Numeric
SYM_UA = 2, //< Unsigned Alphanumeric
SYM_IA = 3, //< Indirect Address
SYM_NONE = 4, //< None specified.
SYM_ENV = 5, //< Environment Entry
SYM_MACRO = 6, //< Macro Entry
SYM_PROC = 7, //< Procedure (PROC) entry
};

/**
* Data structure defining format of symbol table entry
* in an ICM or Codefile.
*/
struct s_symtab {
uint32 st_index;
int32 st_env;
uint64 st_address;
uint64 st_size;
uint8 st_ma;
uint8 st_datatype;
char st_symbol[MAX_SYMSIZE+1];
} PACKED;

private:
/**
* Data structure used to maintain a symbol cross reference.
*/
struct s_xref: public c_dlist {
uint64 xref_line;

s_xref(void) { c_dlist::init(); }
s_xref(uint64 l) { c_dlist::init(); xref_line = l; }
};

char *sym_name; //< Read-Only Symbol Name string
mem_addr_t sym_addr; //< Base-relative address of symbol
man_t sym_ma; //< Memory Area for symbol
long sym_env; //< Environment Number (mcp < 0)
mem_size_t sym_size; //< Size, in units
e_datatype sym_datatype; //< Symbol datatype
uint8 sym_ixreg; //< Associated Index Register
bool sym_reserved; //< Reserved symbol owned by sprasm
bool sym_constant; //< Symbolic constant.
uint32 sym_index; //< Index in symbol table.

/**
* A pointer to the data used to initialize the memory referenced by the
* symbol.
*/
void *sym_contents;

/**
* Listhead of references to symbol. The listhead records the
* symbol definition point.
*/
s_xref sym_xref;

/**
* The value of this symbol for symbolic constants. Must fit in an
* a-literal field.
*/
mem_value_t sym_constant_value;

static uint32 sym_master_index;
public:
c_symbol(const char *, mem_addr_t, man_t, mem_size_t, e_datatype,
uint64, c_symbol *, bool = false, void * = NULL);
c_symbol(c_symbol *);
~c_symbol(void);

size_t serialize(uint8 *, size_t, uint64 = DEFAULT_VERSION);

void add_ref(uint64 l);

const char *get_name(void) { return sym_name; }
man_t get_ma(void) const { return sym_ma; }
long get_env(void) const { return sym_env; }
mem_addr_t get_addr(void) const { return sym_addr; }
mem_size_t get_size(void) const { return sym_size; }
mem_size_t get_size_in_digits(void) const;
e_datatype get_type(void) const { return sym_datatype; }
uint8 get_ixreg(void) const { return sym_ixreg; }
uint32 get_index(void) const { return sym_index; }
mem_value_t get_constant(void) const { return sym_constant_value; }

void set_ma(man_t ma) { sym_ma = ma; }
void set_type(e_datatype edt) { sym_datatype = edt; }
void set_size(mem_size_t size) { sym_size = size; }
void set_addr(mem_addr_t addr) { sym_addr = addr; }
void set_env(long env) { sym_env = env; }
void set_ixreg(uint8 r) { sym_ixreg = r; }
void adjust_addr(long adj) { sym_addr += adj; }
void adjust_addr(c_symbol *sp);
void adjust_size(ssize_t adj) { sym_size += adj; }
bool align(size_t alignment);
void set_constant(mem_value_t v) { sym_constant = true,
sym_constant_value = v; }

bool is_constant(void) { return sym_constant; }
bool is_reserved(void) { return sym_reserved; }

void dump(FILE *, bool = false, bool = false);
char *dump(char *, size_t);
void dump(c_symbol_walker *, void *);

static size_t get_symbol_count(void) { return sym_master_index - 1; }
};



>Or, say, the buffers for sections in a linker.

class c_memimage {
public:
/**
* Size, in bytes, of a control-state codefile header.
*/
static const ssize_t CSHDRSIZE = 256;

private:
/**
* Pointer to the first byte of the simulated memory space.
*/
unsigned char *volatile m_base;

/**
* Size, in bytes, of the simulated memory space.
*/
mem_size_t m_size;

/**
* Maximum addressable digit.
*/
mem_size_t m_maxdigit;

/**
* Highest address written (+1)
*/
mem_size_t m_highest;

/*
* Size of the mmap(2) region allocated.
*/
size_t m_map_size;

/**
* The logger object to use for error messages.
*/
c_logger *m_logger;

/**
* The digit to use to fill the memory area.
*/
mem_digit_t m_filldigit;


>Section buffers that can expand?...

Allocate a new buffer and copy from the old. No need for realloc.

>Or, just up-front allocate the largest possible section, and possibly
>waste a bunch of memory doing so?...
>
>
>Granted, neither of these necessarily requires realloc(), but it is a
>moderately simple and effective way to allow for this sort of thing.
>
>
>>
>>
>>> As I see it, pointer comparison should be expected to work as expected.
>>> As for the relative location of things in the heap or similar, this is
>>> pretty much arbitrary (and may vary between runs based on things like
>>> ASLR or similar).
>>
>> They should be considered unpredictable, and thus can't be used meaningfully
>> other than the case where both are known to point within a contiguous vector.
>>
>> What does it _mean_ that one pointer is less than another to you in the
>> general case? How is it useful for application code to know that (leaving
>> aside memmove, which is a special case as a system interface).
>
>Consider you want to have an array of objects and then look up the
>object in an array given a pointer somewhere into the interior of the
>object.

That's comparing for equality, which I explicitly noted as one of the
valid uses of pointer comparison.

BGB

unread,
Apr 5, 2023, 12:50:30 PM4/5/23
to
You can't binary-search a linked list though...
Allocating a new buffer and copying the old buffer is typically what
realloc() does anyways. The only real difference is that realloc() has a
way to know the size of the original buffer.
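
For illustration, a minimal sketch of that grow-by-copy pattern, with the
caller doing the size bookkeeping that realloc() would otherwise do (the
function name here is made up):

#include <stdlib.h>
#include <string.h>

void *grow_buffer(void *old, size_t oldsz, size_t newsz)
{
    void *nbuf = malloc(newsz);
    if (!nbuf)
        return NULL;            /* caller still owns 'old' */
    if (old) {
        memcpy(nbuf, old, oldsz < newsz ? oldsz : newsz);
        free(old);
    }
    return nbuf;
}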


>> Or, just up-front allocate the largest possible section, and possibly
>> waste a bunch of memory doing so?...
>>
>>
>> Granted, neither of these necessarily requires realloc(), but it is a
>> moderately simple and effective way to allow for this sort of thing.
>>
>>
>>>
>>>
>>>> As I see it, pointer comparison should be expected to work as expected.
>>>> As for the relative location of things in the heap or similar, this is
>>>> pretty much arbitrary (and may vary between runs based on things like
>>>> ASLR or similar).
>>>
>>> They should be considered unpredictable, and thus can't be used meaningfully
>>> other than the case where both are known to point within a contiguous vector.
>>>
>>> What does it _mean_ that one pointer is less than another to you in the
>>> general case? How is it useful for application code to know that (leaving
>>> aside memmove, which is a special case as a system interface).
>>
>> Consider you want to have an array of objects and then look up the
>> object in an array given a pointer somewhere into the interior of the
>> object.
>
> That's comparing for equality, which I explicitly noted as one of the
> valid uses of pointer comparison.
>

Not exactly sure how one would define pointers such that:
A==B, gives sane results for unrelated pointers;
A>B, gives sane results within the same object;
A>B, does not give sane results within different objects.

If one assumes either a linear or segmented address space, once one has
the former two, the latter follows by extension.

If one looks up interior pointers, then they will typically have, say:
if((ptr>=refBase) && (ptr<(refBase+refSize)))
...


Along with more relative comparisons if they use binary search or similar.

Well, and say, people use binary search because linear search is slow
(but, it does require keeping everything sorted, which incidentally makes
B-Trees preferable to linear arrays as 'n' gets bigger).
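
For illustration, a minimal sketch of that kind of lookup over an array of
objects kept sorted by base address (type and names assumed); note that the
'<' and '>=' below are exactly the relational pointer comparisons being
argued about:

#include <stddef.h>

typedef struct { char *base; size_t size; } ObjEnt;

ObjEnt *find_object(ObjEnt *arr, int n, char *ptr)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >> 1;
        if (ptr < arr[mid].base)
            hi = mid - 1;
        else if (ptr >= arr[mid].base + arr[mid].size)
            lo = mid + 1;
        else
            return &arr[mid];   /* base <= ptr < base+size */
    }
    return NULL;                /* not inside any tracked object */
}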


But, for these cases in my memory managers, admittedly I have thus far
used linear arrays rather than B-Trees or similar.

But, say, if one assumes a 48-bit address space filled with 256K objects
or similar, managing this with a linear array becomes implausible.
Haven't quite gotten to this stage yet.


But, say, for example, the memory manager (malloc) has a structure sorta
like:
struct MM_TopObjectNode_s {
uint8_t n_lvl; //0000, node level (0=root)
uint8_t n_pad; //0001, pad byte
uint16_t n_elem; //0002, elements in node
uint32_t o_sztg[1365]; //0004, size and tag bits
uint64_t o_ptr[1365]; //1558, base address
//4000
};


Then, in theory, one can manage an entire 48-bit address space with a
3-level tree (heap isn't likely to be quite this large though).

Note that in this style of tree, the insert operation needs to keep
track of higher tree levels, in case it is necessary to split nodes.


Size can be encoded as a microfloat to save space, with some tag bits to
encode what category of object is held at this address, and some other
object metadata.


Say:
o_sztg:
(7 : 0): Object Size (E5.F3 or similar, page based)
(15: 8): Object Category
(31:16): Zone Tag (if Type represents a Large Object)
o_ptr:
(47: 0): Base Address
(59:48): Object Type Tag
(63:60): 0 = Large Object, 1 = Heap Region, ...
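
As a sketch, one plausible decode of the E5.F3 size field above (the exact
bias and hidden-bit rules here are assumptions):

#include <stdint.h>

/* Assumed E5.F3: (7:3) = 5-bit exponent, (2:0) = 3-bit fraction. */
static uint64_t mm_sztg_pages(uint32_t sztg)
{
    unsigned e = (sztg >> 3) & 31;
    unsigned f =  sztg       & 7;
    if (e == 0)
        return f;                        /* denormal: 0..7 pages */
    return (uint64_t)(8 | f) << (e - 1); /* hidden bit: 8..15 << (e-1) */
}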


For a Large Object, the underlying memory is a span of pages.

For smaller objects, each region would then contain another array of
object pointers (the size of this array being based on
sizeOfRegion/sizeOfMinObject), with sizeOfMinObject being 512 or
similar, and smaller objects managed within a different structure
(allocated in terms of these medium sized objects).


Note that in a malloc/free implementation, a lot of this would be behind
the free-list, with the free-list being checked first for whether it can
serve a malloc request.
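
So the fast path looks something like this sketch (the bin mapping and the
slow path are hypothetical helpers):

#include <stddef.h>

typedef struct MM_FreeObj_s MM_FreeObj;
struct MM_FreeObj_s { MM_FreeObj *next; };

#define MM_NBIN 48
static MM_FreeObj *mm_freelist[MM_NBIN];

extern int   mm_size_to_bin(size_t sz);   /* hypothetical */
extern void *mm_alloc_slow(size_t sz);    /* hypothetical: carve a region */

void *mm_alloc(size_t sz)
{
    int bin = mm_size_to_bin(sz);
    MM_FreeObj *p = mm_freelist[bin];
    if (p) {
        mm_freelist[bin] = p->next;       /* pop: the common fast path */
        return p;
    }
    return mm_alloc_slow(sz);
}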

While in premise this could be used for page-span management for
something like "mmap()", it makes more sense to leave this up to the
kernel or similar (so, say, this would be more for userland memory
management).

...


Scott Lurndal

unread,
Apr 5, 2023, 4:23:38 PM4/5/23
to
For an assembler, where the maximum number of symbols will
be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
CPUs. An insertion sort on the dlist is more than
sufficient if it is necessary to sort the symbols.

If the symbol table is expected to be significantly large,
a tree structure (e.g. red-black tree) like the C++ std::map is used.
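
For illustration, a C-flavored sketch of that sorted insertion (not the
actual c_dlist API):

#include <string.h>

typedef struct Sym Sym;
struct Sym { Sym *prev, *next; char *name; };

/* 'head' is a dummy node in a circular list; walk to the first
   successor that sorts after 's', then splice 's' in before it. */
void sym_insert_sorted(Sym *head, Sym *s)
{
    Sym *p = head->next;
    while (p != head && strcmp(p->name, s->name) < 0)
        p = p->next;
    s->prev = p->prev;  s->next = p;
    p->prev->next = s;  p->prev = s;
}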


>But, say, for example, the memory manager (malloc) has a structure sorta
>like:
>struct MM_TopObjectNode_s {
> uint8_t n_lvl; //0000, node level (0=root)
> uint8_t n_pad; //0001, pad byte
> uint16_t n_elem; //0002, elements in node
> uint32_t o_sztg[1365]; //0004, size and tag bits
> uint64_t o_ptr[1365]; //1558, base address
> //4000
>};
>

Fairly optimal malloc implementations abound. For many of the projects
I have worked on, run-time allocation is a performance bottleneck
and something to be avoided or replaced with pool-based allocators
(which extend in non-contiguous groups of physical pages if not
fixed size).


MitchAlsup

unread,
Apr 5, 2023, 5:43:49 PM4/5/23
to
On Wednesday, April 5, 2023 at 3:23:38 PM UTC-5, Scott Lurndal wrote:
> >You can't binary-search a linked list though...
<
depends on how the links are formed........
<
> For an assembler, where the maximum number of symbols will
> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
> CPUs. An insertion sort on the dlist is more than
> sufficient if it is necessary to sort the symbols.
<
I always found that a simple hash on the first character (52 list starting positions)
thins the length of the list down sufficiently. But if you get concerned, each such
links can be treeified, or use a hash that uses more characters.
<
But then again, the assembler is getting a symbol for each branch in the code,
each named memory in the code, and each external reference mentioned in
the code. I can see this going over 10,000 easily--especially when one takes
the output from a compiler, and fix the compiler shortcomings, then feed it
back to the assembler. Heck, we have GOTs with 10,000+ entries.
>

Scott Lurndal

unread,
Apr 5, 2023, 6:27:28 PM4/5/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Wednesday, April 5, 2023 at 3:23:38 PM UTC-5, Scott Lurndal wrote:
>> >You can't binary-search a linked list though...
><
>depends on how the links are formed........
><
>> For an assembler, where the maximum number of symbols will
>> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
>> CPUs. An insertion sort on the dlist is more than
>> sufficient if it is necessary to sort the symbols.
><
>I always found that a simple hash on the first character (52 list starting positions)
>thins the length of the list down sufficiently. But if you get concerned, each such
>links can be treeified, or use a hash that uses more characters.
><
>But then again, the assembler is getting a symbol for each branch in the code,
>each named memory in the code, and each external reference mentioned in
>the code. I can see this going over 10,000 easily--especially when one takes
>the output from a compiler, and fix the compiler shortcomings, then feed it
>back to the assembler. Heck, we have GOTs with 10,000+ entries.
>>
>

For assemblers, generically, perhaps. Although there is no need for
the GOT entries to be in the same symbol table as that are searched when
assembling the instructions, similarly for temporary labels associated
with branches.

In this specific case, for which the code fragment was posted, there
is no GOT, and the compilers all generate the object file directly
without a pass through the assembler (there was actually no assembler
provided to customers at all). The All-Orders tests, which test each
instruction were coded using an assembler only available to the
hardware designers and operating system group.

(in the original system, the Electrodata 220 (1957), the instruction
opcodes were known as "Orders", the term survived into the
B3500 (1965) and successors (last one turned off in 2010)).

HBR is the halt branch instruction. When encountered, depending
on the value of the digit at address 48, the processor would
halt. When restarted from the console, it would branch to the
target address and continue execution.

*
* All-Orders Test 1
* Test the following processor instructions
*
* COMS toggle, Overflow Toggle, Conditional Branches, HBR
*
* CPN, CPA, MVN, MVA, MVL, MVW, MVC, SEA
*
ORIG 0, 1
* The original ALO1CO initialized the first 12 bytes to blanks.
CNST 12 UA =" " / 000 - 023

ORIG 40, 1
CNST 6 UN =@000000@ / 040 - 045
DATA 2 UN / 046 - 047
HDIGIT CNST 1 UN =@0@ / 048

ORIG 200, 1
COUNT CNST 4 UN =@0001@ / 200 - 203
ITERS CNST 4 UN =@ffff@ / 204 - 207
SIX1 CNST 7 UN =@0000000@ / 208 - 214
UN8 CNST 8 UN =@01234567@ / 215 - 222
UN8A CNST 8 UN =@01235678@ / 223 - 230
SN8 CNST 8 SN =+1234567 / 231 - 239
SN8A CNST 8 SN =+1235678 / 240 - 248
SN8M CNST 8 SN =-1234567 / 249 - 257
SN8MA CNST 8 SN =-1235678 / 258 - 266
SYNC 2
UA8 CNST 8 UA =" ABCDEFG" / 268 - 283
UA8A CNST 8 UA =" ABCEFGH" / 284 - 299
UA10 CNST 10 UA ="ABCDEFG " / 200 - 319
UA10A CNST 10 UA ="ABDDEFG " / 320 - 339
UN20 CNST 20 UN =@02726957389140687629@ / 340 - 359
SN19 CNST 19 SN =+2793458216483195637 / 360 - 379
SN19M CNST 19 SN =-8376925841719462350 / 380 - 399
UA20 CNST 20 UA =" SGBFRNXCQZJD OHGFSR" / 400 - 440
DUN20 CNST 20 UN =@4040404040404040ffff@ / 440 - 459
DSN19 DATA 19 SN / 460 - 479
DUA20 DATA 20 UA / 480 - 519

....

ORIG 2000, 1
MAIN PROC
*
START MVN =1000, ITERS
HBR LOOP
*
* Main loop
*
LOOP CPN UN8, UN8
EQL +Next
HBR LOOP
.Next LEQ +Next
HBR LOOP
.Next GEQ +Insn
HBR LOOP
.Insn CPN UN8, UN8A
LSS +Next
HBR -Insn

....

.Insn MVN =0, IX1
SEA (0405) UA1906, UA1906+10(UA), UA1906+60(UN)
GTR +Next
HBR -Insn
.Next CPN =0, IX1+RL
EQL +Insn
HBR -Insn
.Insn SEA (0310) UA1906, UA1906+10(UN), UA1906+61(SN)
EQL +Next
HBR -Insn
.Next CPN =001956, IX1+RL
EQL +Insn
HBR -Insn
.Insn SEA (0501) UA1906, UA1906+10(UA), UA1906+50(UA)
GTR +Next
HBR -Insn
.Next CPN =UA1906, IX1+RL
EQL +Skip
HBR -Insn

.Skip DEC =1, COUNT
EQL +Insn
HBR -Skip

.Insn MVN =1, COUNT
DEC =1, ITERS
EQL +Next
BUN LOOP
.Next MVN SIX1, IX1
MVN (/04) =0, BASE:+X1(UN)
MVN (/01) =0, BASE:+48(UN)
BUN START
CORP

MitchAlsup

unread,
Apr 5, 2023, 8:23:26 PM4/5/23
to
We can all agree that the size of the symbol table is larger than the number
of entries in GOT. So, GOT stands as a lower bound.


Thomas Koenig

unread,
Apr 6, 2023, 1:32:23 AM4/6/23
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Wednesday, April 5, 2023 at 3:23:38 PM UTC-5, Scott Lurndal wrote:
>> >You can't binary-search a linked list though...
><
> depends on how the links are formed........
><
>> For an assembler, where the maximum number of symbols will
>> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
>> CPUs. An insertion sort on the dlist is more than
>> sufficient if it is necessary to sort the symbols.
><
> I always found that a simple hash on the first character (52 list starting positions)
> thins the length of the list down sufficiently.

That runs afoul of name mangling schemes. Example: In libgfortran,
there are 1976 symbols defined, all but 33 of them start with
an underscore.

And C++ name mangling rules make things even more complicated,
(almost) every symbol there starts with _Z.

> But if you get concerned, each such
> links can be treeified, or use a hash that uses more characters.

You'd probably have to expand this a few levels deep for C++... better
to use a hash to begin with.

robf...@gmail.com

unread,
Apr 6, 2023, 4:23:19 AM4/6/23
to
On Thursday, April 6, 2023 at 1:32:23 AM UTC-4, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Wednesday, April 5, 2023 at 3:23:38 PM UTC-5, Scott Lurndal wrote:
> >> >You can't binary-search a linked list though...
> ><
> > depends on how the links are formed........
> ><
> >> For an assembler, where the maximum number of symbols will
> >> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
> >> CPUs. An insertion sort on the dlist is more than
> >> sufficient if it is necessary to sort the symbols.
> ><
> > I always found that a simple hash on the first character (52 list starting positions)
> > thins the length of the list down sufficiently.
> That runs afoul of name mangling schemes. Example: In libgfortran,
> there are 1976 symbols defined, all but 33 of them start with
> an underscore.
>
> And C++ name mangling rules make things even more complicated,
> (almost) every symbol there starts with _Z.

Name mangling in cc64 starts with an alphabetic hash of the symbol's return type,
followed by argument types. Three characters for each type, followed by the
method name.

> > But if you get concerned, each such
> > links can be treeified, or use a hash that uses more characters.
> You'd probably have to expand this a few levels deep for C++... better
> to use a hash to begin with.

For an assembler in the past, I just used a hash table containing 65536 slots for
symbols. A decent hash function is used. There are all kinds of static allocations
in the assembler, 10 MB for combined source files for instance. It uses a few
megabytes here and there. That is still only a fraction of a percent of the memory
available. Memory was available so I use it. Code bloat traded off for reduced
development and run times.

But lately I have been adapting vasm. I am not sure how it deals with symbols and
other tables. I believe it does a bsearch() on the opcode tables. I have been able to
use vlink out-of-the-box for several projects now. Anything I do not have to
maintain myself is a bonus.

Andy Valencia

unread,
Apr 6, 2023, 10:15:30 AM4/6/23
to
MitchAlsup <Mitch...@aol.com> writes:
> I always found that a simple hash on the first character (52 list starting
> positions)
> thins the length of the list down sufficiently.

Any time I want a very simple hash, a hash built from first and last
character plus string length seems to be the sweet spot. YMMV.
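
Something along those lines, say (the mixing constants and table size here
are arbitrary, and 'len' is assumed to be at least 1):

#include <stddef.h>

#define NBUCKET 256

static unsigned sym_hash(const char *s, size_t len)
{
    unsigned h = (unsigned char)s[0];
    h = h * 31 + (unsigned char)s[len - 1];
    h = h * 31 + (unsigned)len;
    return h % NBUCKET;
}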

Scott Lurndal

unread,
Apr 6, 2023, 12:09:00 PM4/6/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Wednesday, April 5, 2023 at 5:27:28 PM UTC-5, Scott Lurndal wrote:

>> For assemblers, generically, perhaps. Although there is no need for
>> the GOT entries to be in the same symbol table as that are searched when
>> assembling the instructions, similarly for temporary labels associated
>> with branches.
><
>We can all agree that the size of the symbol table is larger than the number
>of entries in GOT. So, GOT stands as a lower bound.

I'm not sure what you're saying here.

I just compiled a large C++ source file, 4432 source-lines-of-code (excluding
comments, blank lines, etc) to an (x86_64) assembler file. The resulting assembler
file had 1.5 million lines.

The symbol table (defined and undefined symbols) has 1206 entries.

A simple linked list is perfectly suitable for that size table.

There were 15887 unique branch labels (.Lnnnn), none of which
appear in the symbol table (and thus can be maintained in a
different data structure, such as one indexed by the 'nnnn' in
the label).

The DWARF data is, of course, far larger, but that's not searched
when assembling instructions.

The GOT and PLT are produced by the linker, based on data in the ELF
object file generated by the assembler, not the symbol table.

Clearly on ancient hardware, for example PAL-D on a PDP-8, there
are constraints on the size of the symbol table that don't apply
to modern systems. That's why they had the EXPUNGE directive in
PAL-D to blow away the symbol table if necessary.

MitchAlsup

unread,
Apr 6, 2023, 1:18:27 PM4/6/23
to
On Thursday, April 6, 2023 at 11:09:00 AM UTC-5, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On Wednesday, April 5, 2023 at 5:27:28 PM UTC-5, Scott Lurndal wrote:
> >> For assemblers, generically, perhaps. Although there is no need for
> >> the GOT entries to be in the same symbol table as that are searched when
> >> assembling the instructions, similarly for temporary labels associated
> >> with branches.
> ><
> >We can all agree that the size of the symbol table is larger than the number
> >of entries in GOT. So, GOT stands as a lower bound.
> I'm not sure what you're saying here.
>
> I just compiled a large C++ source file, 4432 source-lines-of-code (excluding
> comments, blank lines, etc) to an (x86_64) assembler file. The resulting assembler
> file had 1.5 million lines.
>
> The symbol table (defined and undefined symbols) has 1206 entries.
>
> A simple linked list is perfectly suitable for that size table.
>
> There were 15887 unique branch labels (.Lnnnn), none of which
> appear in the symbol table (and thus can be maintained in a
> different data structure, such as one indexed by the 'nnnn' in
> the label).
<
But if you took that same compilation module and had the compiler
emit assembly code instead of linkable-binary, the assembler would
have to determine where those 15,887 symbols were positioned
before converting the branch symbols into fixed displacements.
<
That is you are looking at linkable-binary symbol table, not the symbol
table the assembler would have had to produce looking at the same
semantic ASCII content.

Scott Lurndal

unread,
Apr 6, 2023, 1:35:27 PM4/6/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, April 6, 2023 at 11:09:00 AM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On Wednesday, April 5, 2023 at 5:27:28 PM UTC-5, Scott Lurndal wrote:
>>
>> >> For assemblers, generically, perhaps. Although there is no need for
>> >> the GOT entries to be in the same symbol table as that are searched when
>> >> assembling the instructions, similarly for temporary labels associated
>> >> with branches.
>> ><
>> >We can all agree that the size of the symbol table is larger than the number
>> >of entries in GOT. So, GOT stands as a lower bound.
>> I'm not sure what you're saying here.
>>
>> I just compiled a large C++ source file, 4432 source-lines-of-code (excluding
>> comments, blank lines, etc) to an (x86_64) assembler file. The resulting assembler
>> file had 1.5 million lines.

Re-read the above statement.

>>
>> The symbol table (defined and undefined symbols) has 1206 entries.
>>
>> A simple linked list is perfectly suitable for that size table.
>>
>> There were 15887 unique branch labels (.Lnnnn), none of which
>> appear in the symbol table (and thus can be maintained in a
>> different data structure, such as one indexed by the 'nnnn' in
>> the label).
><
>But if you took that same compilation module and had the compiler
>emit assembly code instead of linkable-binary, the assembler would

Huh? The example above did exactly that - the compiler emitted
text assembler code, not a linkable binary.

>have to determine where those 15,887 symbols were positioned
>before converting the branch symbols into fixed displacements.

The branch label table (or linked fixed-size table fragments, if you
don't want to 'realloc' in place with an unnecessary memcpy) is indexed
by the label number, so there is O(1) access to the displacement, which
is what would be passed to the linker via the ELF object file produced
by the assembler.

The branch label table contains two fields - the text segment
index and the offset from the start of the segment, and potentially
a linked list of references to make pass two more efficient
when updating forward branch instruction target address fields
for subsequently defined branch labels/symbols.
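
A rough sketch of such an entry (field names assumed):

#include <stddef.h>

struct fixup {               /* forward reference awaiting the label */
    struct fixup *next;
    unsigned      seg;       /* segment of the referencing insn      */
    size_t        off;       /* offset of its target-address field   */
};

struct label_ent {           /* indexed by the 'nnnn' in .Lnnnn      */
    unsigned      seg;       /* text segment index                   */
    size_t        off;       /* offset from start of that segment    */
    int           defined;
    struct fixup *refs;      /* patched when the label gets defined  */
};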

><
>That is you are looking at linkable-binary symbol table, not the symbol
>table the assembler would have had to produce looking at the same
>semantic ASCII content.

No, it is not. The compiler produced assembler text. Nothing
linkable or binary at all.


MitchAlsup

unread,
Apr 6, 2023, 2:30:35 PM4/6/23
to
The compiler produced already resolved assembler text; as if
<
BR label37
<
had already been converted into
<
BR 0x347
<
Which is perfectly acceptable to the assembler, but not how any
human using the assembler would have written the ASCII.

Scott Lurndal

unread,
Apr 6, 2023, 2:35:48 PM4/6/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, April 6, 2023 at 12:35:27 PM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >But if you took that same compilation module and had the compiler
>> >emit assembly code instead of linkable-binary, the assembler would
>> Huh? The example above did exactly that - the compiler emitted
>> text assembler code, not a linkable binary.
>> >have to determine where those 15,887 symbols were positioned
>> >before converting the branch symbols into fixed displacements.
>> The branch label table (or linked fixed-size table fragments, if you
>> don't want to 'realloc' in place with an unnecessary memcpy) is indexed
>> by the label number, so there is O(1) access to the displacement, which
>> is what would be passed to the linker via the ELF object file produced
>> by the assembler.
>>
>> The branch label table contains two fields - the text segment
>> index and the offset from the start of the segment, and potentially
>> a linked list of references to make pass two more efficient
>> when updating forward branch instruction target address fields
>> for subsequently defined branch labels/symbols.
>> ><
>> >That is you are looking at linkable-binary symbol table, not the symbol
>> >table the assembler would have had to produce looking at the same
>> >semantic ASCII content.
><
>> No, it is not. The compiler produced assembler text. Nothing
>> linkable or binary at all.
><
>The compiler produced already resolved assembler text; as if
><
> BR label37
><
>had already been converted into
><
> BR 0x347
><
>Which is perfectly acceptable to the assembler, but not how any
>human using the assembler would have written the ASCII.

I beg to differ:

.loc 13 9667 56 is_stmt 0
cmpq $3, %rax
jbe .L2381
.loc 13 9669 14 is_stmt 1
.LVL947:
.loc 16 102 5
.LBB28265:
.LBB28233:
.LBB28232:
.loc 12 517 5
.loc 12 517 44 is_stmt 0
movabsq $193518341455871, %rsi
xorq %rsi, %rcx
.loc 12 517 5
movabsq $72057589742960640, %rsi
testq %rsi, %rcx
jne .L2419
.L2422:
.loc 12 521 9 is_stmt 1
.LVL948:
.LBE28232:
.LBE28233:
.LBE28265:
.loc 13 9669 57 is_stmt 0
cmpq $3, %rax
ja .L2419

BGB

unread,
Apr 6, 2023, 9:01:26 PM4/6/23
to
On 4/6/2023 11:08 AM, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Wednesday, April 5, 2023 at 5:27:28 PM UTC-5, Scott Lurndal wrote:
>>> For assemblers, generically, perhaps. Although there is no need for
>>> the GOT entries to be in the same symbol table as that are searched when
>>> assembling the instructions, similarly for temporary labels associated
>>> with branches.
>> <
>> We can all agree that the size of the symbol table is larger than the number
>> of entries in GOT. So, GOT stands as a lower bound.
>
> I'm not sure what you're saying here.
>
> I just compiled a large C++ source file, 4432 source-lines-of-code (excluding
> comments, blank lines, etc) to an (x86_64) assembler file. The resulting assembler
> file had 1.5 million lines.
>
> The symbol table (defined and undefined symbols) has 1206 entries.
>
> A simple linked list is perfectly suitable for that size table.
>
> There were 15887 unique branch labels (.Lnnnn), none of which
> appear in the symbol table (and thus can be maintained in a
> different data structure, such as one indexed by the 'nnnn' in
> the label).
>

From my ROTT port:
C source lines: 126k
H source lines: 18k

ASM debug output lines: 390k;
71k labels if going by grep+wc
Symbol map lines: 44k;
Though, this includes line-number data as well.
Quick survey: ~ 40% symbols, 60% line numbers.
Roughly 18k if one excludes line-numbers;
This does not count internal labels though.


Binary stats (skipping tiny sections):
.text 829k (binary code, XG2 Mode)
.strtab 69k (literal strings)
.rodata 26k (read-only / 'const' data)
.reloc 17k (base relocations)
.pdata 19k (exception unwinding metadata)
.data 122k (initialized global variables)
.bss 1162k (uninitialized global variables)

So, approximately 2.13 ".text" bytes per ASM line.
Or, ~ 6.6 bytes per C source line.

This is for a binary compiled in XG2 mode, so the minimum instruction
length is 4 bytes in this case.


The size of .text drops to 790k if building for the baseline ISA (-O2),
or 735k (-Os).

So, in this case, the difference in code density between these is
roughly 13% (with no real effect on the size of other sections).

Though, in this case, having all 64 GPRs enabled does sorta wreck the
ability to make as much use of 16-bit instructions (so the binary
remains dominated by 32-bit instructions in this case).



The same program, built for x86-64 via MSVC ("/O1"), is 1474k for
".text" (with ~ 100k for ".rdata", but in this case a shared section
seems to be used for both read-only data and strings).


Checking, a build using GCC for WSL (for x86-64, "-O2") gives 766k for
".text" (though differs from both of the prior cases in that this uses a
dynamically linked C library). This drops to 436k though with "-Os".

However the ELF binary (in total) is still rather large if compared with
the other builds (several MB). There is a significant size reduction
with "strip -s" though (knocks the binary down to around 800k).

So, x86-64 can still get good code density in this case (at least, so
long as one is not using MSVC).


Would compare against an x86-64 ASM version, but need to try to get ROTT
to compile with a unity build to get a big ASM blob, which is posing a
little more of a challenge at the moment...



> The DWARF data is, of course, far larger, but that's not searched
> when assembling instructions.
>
> The GOT and PLT are produced by the linker, based on data in the ELF
> object file generated by the assembler, not the symbol table.
>

In my case, I was considering a symbol table for all of the ASM labels,
not the function and variable declarations (these are the "global table"
in BGBCC).

Though, at this scale, the tables are internally keyed with ID numbers,
with named labels held in a separate table (also keyed by both ID number
and name).


> Clearly on ancient hardware, for example PAL-D on a PDP-8, there
> are constraints on the size of the symbol table that don't apply
> to modern systems. That's why they had the EXPUNGE directive in
> PAL-D to blow away the symbol table if necessary.

Fair enough.

MitchAlsup

unread,
Apr 6, 2023, 9:50:24 PM4/6/23
to
So when you debug this program and you want to set a breakpoint at
.LBE28233:
what do you type into the debug interface ??

BGB

unread,
Apr 7, 2023, 12:08:19 AM4/7/23
to
Eventually got it.
x86-64 version, as ASM, is 172k lines.

Vs 390k lines for the BJX2 output.


So, it would appear the x86-64 version, for some reason, only needs
around 44% as many lines of ASM?...

I am not really aware of any particular feature (or misfeature) in
either ISA that would account for this difference... (More so as the
x86-64 ASM in question appears to be following the Load/Store pattern).

The difference also seems outside the scope of what could be easily
explained by formatting or notation differences.


In terms of ".text bytes per line of ASM", x86-64 does still appear to be
a little bulkier than BJX2 (so, fewer instructions, but the average
instruction is slightly bigger?...).

...


Granted there is a notation difference between the two in that BGBCC
tends to put newlines between basic-blocks. This shouldn't be enough
though to account for the difference here.

Thomas Koenig

unread,
Apr 7, 2023, 7:11:33 AM4/7/23
to
Scott Lurndal <sc...@slp53.sl.home> schrieb:

> For an assembler, where the maximum number of symbols will
> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
> CPUs. An insertion sort on the dlist is more than
> sufficient if it is necessary to sort the symbols.

If working on gfortran has taught me anything, it is that
quadratic (or worse) algorithms sooner or later come up against
code with a huge number of variables and/or humungous basic blocks,
introducing unreasonable behavior in the toolchain. Autogenerated
code can be particularly bad in that respect; an extreme example
can be seen at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960 .

In general, any toolchain should strive to be worst-case O(n*log(n))
if at all possible.

Michael S

unread,
Apr 7, 2023, 7:37:24 AM4/7/23
to
Agree 99%
Begin with std::unordered_map as a default and then switch to other containers
only when you are really really sure that you really really know what you're doing.

99% rather than 100% is because your post appears to implicitly suggest
std::map (or equivalent rb or avl tree based container) to be a default.

Personally, I like boost::intrusive::set, but I am not in the dev tools business. In
my apps, full control of memory allocation/de-allocation is of high importance.
On the other hand, in typical dev tool it is of very low importance.


Terje Mathisen

unread,
Apr 7, 2023, 9:20:32 AM4/7/23
to
------- THIS ^^^^^ IS IMPORTANT! ------

I strongly agree.

Even in my pure toy code I never use classic/naive linked lists, at
worst I'll use lists of fixed-size chunks, but if I can get away with a
hash table I'll do so.

Scott Lurndal

unread,
Apr 7, 2023, 10:37:17 AM4/7/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Thursday, April 6, 2023 at 1:35:48 PM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >The compiler produced already resolved assembler text; as if
>> ><
>> > BR label37
>> ><
>> >had already been converted into
>> ><
>> > BR 0x347
>> ><
>> >Which is perfectly acceptable to the assembler, but not how any
>> >human using the assembler would have written the ASCII.
>> I beg to differ:
>>
>> .loc 13 9667 56 is_stmt 0
>> cmpq $3, %rax
>> jbe .L2381
>> .loc 13 9669 14 is_stmt 1
>> .LVL947:
>> .loc 16 102 5
>> .LBB28265:
>> .LBB28233:
>> .LBB28232:
>> .loc 12 517 5
>> .loc 12 517 44 is_stmt 0
>> movabsq $193518341455871, %rsi
>> xorq %rsi, %rcx
>> .loc 12 517 5
>> movabsq $72057589742960640, %rsi
>> testq %rsi, %rcx
>> jne .L2419
>> .L2422:
>> .loc 12 521 9 is_stmt 1
>> .LVL948:
>> .LBE28232:
>> .LBE28233:
>> .LBE28265:
>> .loc 13 9669 57 is_stmt 0
>> cmpq $3, %rax
>> ja .L2419
><
>So when you debug this program and you want to set a breakpoint at
>.LBE28233:
>what do you type into the debug interface ??

The source code line number, or the instruction address, depending
on what I'm trying to debug.

Scott Lurndal

unread,
Apr 7, 2023, 10:40:11 AM4/7/23
to
Thomas Koenig <tko...@netcologne.de> writes:
>Scott Lurndal <sc...@slp53.sl.home> schrieb:
>
>> For an assembler, where the maximum number of symbols will
>> be insubstantial (e.g. low 1000s), a binary search is unnecessary on modern
>> CPUs. An insertion sort on the dlist is more than
>> sufficient if it is necessary to sort the symbols.
>
>If working on gfortran has taught me anything, it is that
>quadratic (or worse) algorithms sooner or later come up against
>code with a huge number of variables and/or humungous basic blocks,
>introducing unreasonable behavior in the toolchain. Autogenerated
>code can be particularly bad in that respect; an extreme example
>can be seen at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80960 ,

In the code where the probability of large symbol tables is
greater than zero, I use std::map. In the assembler from which
the code fragment had been posted, that probability is zero,
so the dlist was sufficient.

Scott Lurndal

unread,
Apr 7, 2023, 11:13:29 AM4/7/23
to
sc...@slp53.sl.home (Scott Lurndal) writes:
>MitchAlsup <Mitch...@aol.com> writes:
>>On Thursday, April 6, 2023 at 1:35:48 PM UTC-5, Scott Lurndal wrote:

https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names

"A local symbol is any symbol beginning with certain local label
prefixes. By default, the local label prefix is ".L" for ELF systems or "L"
for traditional a.out systems, but each target may have its own set of
local label prefixes."

Local symbols are defined and used within the assembler, but they are
normally not saved in object files. Thus, they are not visible when
debugging. You may use the '-L' option (see Include Local Symbols)
to retain the local symbols in the object files.

BGB

unread,
Apr 7, 2023, 12:07:02 PM4/7/23
to
FWIW:
In BGBCC, any label starting with '.' is understood to be local.

However, the compiler still uses '.Lnnnnnn' when emitting debug ASM.

Note that at present, the ASM is merely a debug output, as the compiler
does not (actually) produce ASM and feed everything through an assembler.

However, given the mess that the compiler backend has become, generating
ASM and then assembling it might have almost been preferable...


Or, at least, a "binary ASM" like in some of my other code generators
(where the program exists as a big array of structures each representing
an ASM instruction, as opposed to necessarily going through the ASCII
format).

Directly emitting instructions into the sections seemed to make more
sense originally (and some things, like the WEXifier, operate "after the
fact" on the emitted instruction stream; but will only operate on
sequences of fixed-length 32-bit instructions).

However, going through an intermediate stage (such as an array of
instructions) would allow adding a peephole optimizer stage.

...

BGB

unread,
Apr 7, 2023, 12:51:48 PM4/7/23
to
Yeah.

Reasonably fast symbol lookups are needed in a compiler, otherwise the
time needed to compile stuff is *slow*...

Say, if the compiler needs to chew through 100 kLOC of code, I would
rather have seconds than minutes...

So, hashes and binary search and similar are good, linear search and
naive linked lists, not so much.


Similarly, there are multiple cases where I went through the effort of
implementing specialized versions of quick-sort, because using naive
selection-sort or similar effectively put up a roadblock in terms of
"getting stuff compiled in a reasonable amount of time".


Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which IME
seems to be one of the faster compilers around (at least for plain C
code). Though, it isn't too hard to compile stuff faster than GCC or
Clang, which (in comparison) tend to take a fairly long time...


John Dallman

unread,
Apr 7, 2023, 2:35:15 PM4/7/23
to
In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:

> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
> IME seems to be one of the faster compilers around (at least for
> plain C code). Though, it isn't too hard to compile stuff faster
> than GCC or Clang, which (in comparison) tend to take a fairly long
> time...

I find the Android NDK Clang on Linux compiles quickly, much faster than
MSVC on comparable hardware.

Recent versions of GCC have the -pipe option, which transmits data
between passes using pipes, rather than temporary files, which speeds
things up.

John

robf...@gmail.com

unread,
Apr 7, 2023, 2:51:55 PM4/7/23
to
I use rather simple linked lists to locate symbols in cc64; however, the compiler keeps
separate symbol tables for each function and each compound block within
a function. The number of symbols that have to be searched then is typically
very small. The compiler jumps from one symbol table to a higher table in the
hierarchy if the symbol is not found. Since the compiler supports method
overloading, the symbol searches also act like filters and return a list of
symbols that might match the requested one. Later code in the compiler
selects the best matching symbol for use.

I have found the compiler to be acceptably fast even though it is using linked
lists. Most of the time a compile is under a second.

robf...@gmail.com

unread,
Apr 7, 2023, 2:56:03 PM4/7/23
to
I got this a bit wrong, a single symbol table is used, but there are pointers into
it for each level in the hierarchy.
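
For illustration, a minimal sketch of the effect of that kind of scoped
lookup (written with separate per-scope lists rather than pointers into a
single table; names assumed):

#include <string.h>

typedef struct Sym Sym;
struct Sym { Sym *next; const char *name; };

typedef struct Scope Scope;
struct Scope {
    Scope *parent;    /* enclosing block, function, or global scope */
    Sym   *syms;      /* symbols declared in this scope             */
};

Sym *scope_lookup(Scope *sc, const char *name)
{
    for (; sc; sc = sc->parent)           /* innermost scope first */
        for (Sym *p = sc->syms; p; p = p->next)
            if (strcmp(p->name, name) == 0)
                return p;
    return NULL;
}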

BGB

unread,
Apr 7, 2023, 9:17:55 PM4/7/23
to
On 4/7/2023 1:35 PM, John Dallman wrote:
> In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>
>> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
>> IME seems to be one of the faster compilers around (at least for
>> plain C code). Though, it isn't too hard to compile stuff faster
>> than GCC or Clang, which (in comparison) tend to take a fairly long
>> time...
>
> I find the Android NDK Clang on Linux compiles quickly, much faster than
> MSVC on comparable hardware.
>

I had noted (when targeting x86-64) that MSVC often seems to be a bit
faster.

Build times for a program (ROTT):
GCC: 17.06 seconds
MSVC: 4.76 seconds
BGBCC: 31.30 seconds

So, BGBCC is apparently the slowest of this group at present...


This is for plain C though, have observed before that for C++, MSVC
build times get significantly worse, which is (yet another) disincentive
for me to use C++.



However, I have also noted that, for the same code, GCC and Clang
output is often around 40-60% faster than MSVC for typical programs,
with roughly 1/3 to 1/2 the size for the ".text" section.


Well, except Dhrystone, where the GCC output is around 4x faster than the
MSVC output (say, scores of around 50 million vs around 12 million).

Or, say, ~ 7.7 DMIPS/MHz vs 1.85 DMIPS/MHz.

Whereas, on BJX2, BGBCC seems to be stuck at around 0.85 DMIPS/MHz


Or, on the other end, some of the code I had optimized for BGBCC tends
to show relatively little difference between GCC and MSVC in terms of
performance (and/or, is slightly faster for MSVC). Whereas, some idioms
that work well in GCC kinda fall on their face for MSVC.


Same C source compiled to ASM:
GCC (X64): 172k lines (producing around 700k of binary code)
MSVC (X64): 360k lines (producing around 1400k of binary code)
BGBCC (BJX2): 390k lines (producing around 800k of binary code)


Where, the GCC output appears to be using a Load/Store
strategy, whereas the MSVC output seems to be using lots of Reg/Mem
operations.


But, yeah, this is just what I am seeing at the moment...


> Recent versions of GCC have the -pipe option, which transmits data
> between passes using pipes, rather than temporary files, which speeds
> things up.
>

OK.


> John

Anton Ertl

unread,
Apr 8, 2023, 2:36:34 AM4/8/23
to
sc...@slp53.sl.home (Scott Lurndal) writes:
>I just compiled a large C++ source file, 4432 source-lines-of-code (excluding
>comments, blank lines, etc) to an (x86_64) assembler file. The resulting assembler
>file had 1.5 million lines.
>
>The symbol table (defined and undefined symbols) has 1206 entries.
>
>A simple linked list is perfectly suitable for that size table.

In Gforth the building of the normal image works by loading a kernel
built with some other means; this contains the regular Gforth compiler
and uses a symbol table organized as a linked list. Then some files
are loaded, including the one that implements the symbol table as a
hash table, and switches to the hash table; at that point 1523 entries
are in the symbol table. Then many more files are loaded, resulting
in 4258 hash table entries on AMD64 (residing in several linked lists
in the linked-list representation). By turning off the switch to the
hash table we can see how much worse the linked list is:

#load everything up to and including the switch to the hash table
LC_NUMERIC=prog perf stat -e cycles -e instructions -e L1-dcache-load-misses ./gforth-fast -i kernl64l.fi rec-sequence.fs search.fs options.fs environ.fs errors.fs hash.fs -e "bye"
#load everything
LC_NUMERIC=prog perf stat -e cycles -e instructions -e L1-dcache-load-misses ./gforth-fast -i kernl64l.fi startup.fs arch/amd64/asm.fs arch/amd64/disasm.fs -e "hashpop @ dec. bye"

And here are the results on a Skylake (Core i5-6600K):

    up to hash    all hashed     all linked
    42_172_103   178_694_165  1_237_167_755  cycles
    41_724_082   267_315_089    668_961_786  instructions
     2_036_522     5_457_186     70_530_798  L1-dcache-load-misses

If we subtract the up-to-hash stuff, the linked list is 8.7 times more
expensive for the rest than the hash table. Note especially the large
increase in D-cache misses and the reduction in IPC (1.50->0.54) when
turning off hashing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

unread,
Apr 8, 2023, 12:49:12 PM4/8/23
to
Comparing more:
In general, GCC seems to be generating less ASM code while also using a
more limited subset of the x86-64 ISA vs that used by MSVC.

If I remove all the blank newlines, the BJX2 output drops to 320k lines.


Haven't noted many cases where x86-64 has an obvious advantage.

Have noted a few things though in the code generation.
For something like, say:
static int colors[16]={...};
unsigned char clr;
...
clr=colors[13];

Both MSVC and GCC emit something like:
MOVZXB ECX, [colors+52]

Whereas BGBCC handles this case as, say:
MOV colors, R19
MOV.L (R19, 52), R18
EXTU.B R18, R24

Well, and other cases where constant indices into global arrays are
handled directly, rather than first loading the array into a register
and then doing another load.

These could be expressed in the ISA in a single instruction, albeit:
MOV.B (colors, 52), R24

Can't currently be expressed in my ASM syntax or handled by my
assembler. Also would likely require special case handling by the
compiler (it pretty much always loads arrays into a register before
loading a value from them).


And, mostly, a lot of cases along these lines.


Trying to compare code between compilers, I had noted that both GCC and
MSVC were prone to inline some things.

In particular, because ROTT had accessed video memory in a weird way
which did not map to a simple linear array (they were effectively
treating the screen, rather than as linear 320x200, as 4 planes of
96x200 and twiddling VGA mode registers to control which plane was
modified), my port effectively turned all the stores into VGA memory
into function calls.

Otherwise, I would have needed to rewrite much of the renderer to "not
be quite so weird".


Inlining for ease of illustration:
void VGAWRITE(unsigned addr, byte clr)
{
byte *buf;
int a1, msk;
a1=(addr<<2)&screen_buffermsk;
msk=screen_planemask;
buf=screenbuf+a1;
if(msk&1) buf[0]=clr;
if(msk&2) buf[1]=clr;
if(msk&4) buf[2]=clr;
if(msk&8) buf[3]=clr;
}

Where addr is typically in the range of, say, 0x000A0000..0x000A4AFF.


In the BGBCC output, something like:
static int colors[16]={...};
unsigned char clr;
...
clr=colors[13];
VGAWRITE(addr, clr);
Becomes, say:
MOV colors, R19
MOV.L (R19, 52), R18
EXTU.B R18, R24
MOV.L R24, (SP, 192) //dirty variable, so spill to stack...
MOV R11, R4 | MOV R24, R5
BSR VGAWRITE


Both GCC and MSVC seem to inline and transform the VGAWRITE call into,
say essentially:
a1=(addr<<2)&0x1FFFC; //*1
msk=screen_planemask;
buf=screenbuf+a1;
clr=colors[13];
if(msk&1) buf[0]=clr;
if(msk&2) buf[1]=clr;
if(msk&4) buf[2]=clr;
if(msk&8) buf[3]=clr;


In this case, BGBCC still doesn't support function inlining and similar...



*1: screen_buffermsk is a variable assigned in an Init function during
program startup, essentially using a value that reduces to a constant.
Somehow, both compilers seem to be able to "see" into the Init function
and propagate the value to the inlined VGAWRITE (elsewhere in the
program). I am not entirely sure what is going on here... (seems like
some sort of arcane magic).

It seems like, even if one can see that the variable is only ever
assigned once, it would still need to be proven that the relevant Init
function was called before any of the code which uses the variable was
called.

As noted, BGBCC does not propagate this constant, or really propagate
constants across variables.

But, say, if one does write:
j=3*5+7;
It is at least smart enough to turn this into:
j=22;
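
A minimal sketch of that kind of fold over an expression tree (not BGBCC's
actual code; child nodes are leaked here for brevity):

typedef struct Ast Ast;
enum { LIT = 256 };          /* op values < 256 are '+', '*', etc. */
struct Ast { int op; long val; Ast *l, *r; };

void fold(Ast *n)
{
    if (!n || n->op == LIT)
        return;
    fold(n->l);
    fold(n->r);
    if (n->l && n->r && n->l->op == LIT && n->r->op == LIT) {
        switch (n->op) {
        case '+': n->val = n->l->val + n->r->val; break;
        case '*': n->val = n->l->val * n->r->val; break;
        default:  return;    /* leave other operators alone */
        }
        n->op = LIT;         /* e.g. 3*5+7 collapses to 22 */
    }
}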

Well, and also for global 'const' variables and 'enum' variables and
similar (and other optimizations of this sort).

Well, and also optimizations like using a register allocator and caching
values in registers (vs, say, always using memory loads and stores on
every access).


Albeit, both compilers use a test+branch for each store (in VGAWRITE):
TEST EDX, 1
JZ .L0003
MOVB [RAX+0], CL
.L0003:
TEST EDX, 2
JZ .L0004
MOVB [RAX+1], CL
.L0004:
...

Vs, say, using CMOVB or similar.


Whereas BJX2 does this as, say:
TEST 1, R23
MOV.B?F R5, (R19, 0)
TEST 2, R23
MOV.B?F R5, (R19, 1)
...


Well, and the usual perennial issues in BGBCC of using more registers
than strictly necessary, and the occasional pointless stack spill.

Both GCC and MSVC seem to figure out ways to cleverly work around
x86-64's comparatively much smaller register space.


Not a lot here that seems like an obvious ISA deficiency, but a lot of
"paper cut" issues, and places where my compiler's optimizer is a little
weak... (even vs MSVC).


I am not entirely sure what is going on here with GCC, as it seems to
often alter code sequences into unrecognizable forms (whereas MSVC seems
to keep the same general structure as the C source intact).

...


Though, some of this does make me wonder if another 2x (or more) gain in
performance could still be possible with a compiler with stronger
optimizing abilities?...

But, alas...

Stephen Fuld

unread,
Apr 8, 2023, 1:03:52 PM4/8/23
to
On 4/7/2023 6:17 PM, BGB wrote:
> On 4/7/2023 1:35 PM, John Dallman wrote:
>> In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>>
>>> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
>>> IME seems to be one of the faster compilers around (at least for
>>> plain C code). Though, it isn't too hard to compile stuff faster
>>> than GCC or Clang, which (in comparison) tend to take a fairly long
>>> time...
>>
>> I find the Android NDK Clang on Linux compiles quickly, much faster than
>> MSVC on comparable hardware.
>>
>
> I had noted (when targeting x86-64) that MSVC often seems to be a bit
> faster.
>
> Build times for a program (ROTT):
>   GCC: 17.06 seconds
>   MSVC: 4.76 seconds
>   BGBCC: 31.30 seconds
>
> So, BGBCC is apparently the slowest of this group at present...

John pointed out that, at least in his experience, Clang seems to be
faster than MSVC. So can you run the same tests using Clang? That
would provide a more complete picture.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Scott Lurndal

unread,
Apr 8, 2023, 1:14:22 PM4/8/23
to
BGB <cr8...@gmail.com> writes:
>On 4/7/2023 1:35 PM, John Dallman wrote:
>> In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>>
>>> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
>>> IME seems to be one of the faster compilers around (at least for
>>> plain C code). Though, it isn't too hard to compile stuff faster
>>> than GCC or Clang, which (in comparison) tend to take a fairly long
>>> time...
>>
>> I find the Android NDK Clang on Linux compiles quickly, much faster than
>> MSVC on comparable hardware.
>>
>
>I had noted (when targeting x86-64) that MSVC often seems to be a bit
>faster.
>
>Build times for a program (ROTT):
> GCC: 17.06 seconds
> MSVC: 4.76 seconds
> BGBCC: 31.30 seconds
>

Truly, you are spoiled :-)

I remember when compilation speed was limited to the number
of cards that could be read in a minute (typically 300 or
less).

I'd also ask _which_ of the 11 generations of GCC you
were using, and if you were using it native on a unix
or linux system, or using the less than optimal versions
built for windows, and which level of compiler optimization
you had selected (-O3 can significantly extend compilation
times by an order of magnitude under certain conditions).

Scott Lurndal

unread,
Apr 8, 2023, 1:16:18 PM4/8/23
to
BGB <cr8...@gmail.com> writes:
>On 4/7/2023 8:17 PM, BGB wrote:

>Comparing more:
>In general, GCC seems to be generating less ASM code while also using a
>more limited subset of the x86-64 ISA vs that used vs MSVC.

And here, again, this depends on the compiler flags; if you specify
that you want code generated for the latest generation of intel chips
(the default settings aim for binary portability, so many of the
optional or newer generation instructions will not be generated)
you'll find more optimal code for that particular microarchitecture.

BGB

unread,
Apr 8, 2023, 1:43:00 PM4/8/23
to
OK, just tested:
Clang: 10.33 seconds.

It does not beat MSVC in this case, but does beat GCC...


This is on a Ryzen 2700X running Win10 (48GB RAM), with GCC and Clang
running in WSL (Windows Subsystem for Linux).
GCC ver: 4.8.4
Clang ver: 3.3-16, LLVM 3.3

...


Both MSVC and BGBCC are native Win64 programs in this case.


BGBCC could be a little faster; part of this is because it is dumping
roughly 25 MB of debug output during each rebuild.

If I disable the debug dumps, its time drops to 24 seconds, though sadly
still the slowest in this group (~40% slower than GCC).

But, alas, I have yet to write a BJX2 backend for either GCC or Clang,
and given that trying to recompile LLVM/Clang tends to "totally own" my
PC (taking a very long time to recompile from source and using absurd
amounts of RAM and swap space), this is a disincentive.


Time to rebuild BGBCC from source: 8.99 seconds...


Bernd Linsel

unread,
Apr 8, 2023, 2:34:49 PM4/8/23
to
BGB wrote on Sat, 2023-04-08T18:49:12+0200:
> > Where, the GCC output also seems to be apparently using a Load/Store
> > strategy, whereas [...]
Only in -O0 or -O1, from -O2 on it tries to hold all locals in registers

[...]

> I am not entirely sure what is going on here with GCC, as it seems to
> often alter code sequences into unrecognizable forms (whereas MSVC seems
> to keep the same general structure as the C source intact).

The first thing that causes this effect is that with gcc, variables don't have a home register. You can clearly see the effects of its SSA analysis, where every generated value gets its own register; and the register allocation is performed more often.

B.

John Dallman

unread,
Apr 8, 2023, 2:35:14 PM4/8/23
to
In article <u0s92v$1bgh3$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:

> This is on a Ryzen 2700X running Win10 (48GB RAM), with GCC and
> Clang running in WSL (Windows Subsystem for Linux).
> GCC ver: 4.8.4
> Clang ver: 3.3-16, LLVM 3.3

Those are rather old versions. What Linux distribution do you have in WSL?


GCC 4.8.4 is from 2014, 4.8 having originally released in 2013. I'm
currently working with GCC 11.3, from 2022, 11.x having originally
released in 2021. GCC 13 is due for release this year. Yes, they changed
the release numbering scheme, at GCC 5 in 2015.

LLVM 3.3 is from 2013. I'm currently working with LLVM 12, from 2021.
LLVM 16 was released last month.

John

BGB

unread,
Apr 8, 2023, 2:51:26 PM4/8/23
to
On 4/8/2023 12:14 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 4/7/2023 1:35 PM, John Dallman wrote:
>>> In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>>>
>>>> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
>>>> IME seems to be one of the faster compilers around (at least for
>>>> plain C code). Though, it isn't too hard to compile stuff faster
>>>> than GCC or Clang, which (in comparison) tend to take a fairly long
>>>> time...
>>>
>>> I find the Android NDK Clang on Linux compiles quickly, much faster than
>>> MSVC on comparable hardware.
>>>
>>
>> I had noted (when targeting x86-64) that MSVC often seems to be a bit
>> faster.
>>
>> Build times for a program (ROTT):
>> GCC: 17.06 seconds
>> MSVC: 4.76 seconds
>> BGBCC: 31.30 seconds
>>
>
> Truly, you are spoiled :-)
>
> I remember when compilation speed was limited to the number
> of cards that could be read in a minute (typically 300 or
> less).
>

I did not exist in that era.


Around the time I came into existence, apparently people were mostly
using 5.25" floppies, and the IBM PC was (still) relatively new (but
apparently companies like Compaq and similar were already busy making
IBM PC Clones).

Some older PCs existed, like the Commodore 64 and similar, but had
apparently already started to fall into decline vs the rise of the PC.

Well, and apparently the NES was still fairly new (and some of its more
iconic games did not yet exist).

So, yeah, despite me playing them in my early childhood, Super Mario
Bros 2 and 3 and similar are younger than I am...



By the time I started messing with C, Win 9X was already a thing,
nevermind that I mostly ended up jumping over to NT4 and 2K around this
era (the lackluster stability of Win95 and Win98 being off-putting; when
my HDD died around 7th grade, I switched over to WinNT instead).

I don't really remember what build times were like in the late 1990s to
early 2000s (mostly using Cygwin), they didn't seem to be too
unreasonably long though.


I do remember many years later, when the Doom 3 source got released, and
I tried building it and it took a painfully long time (nearly 25 minutes
IIRC; on a Phenom II with 16GB of RAM).

Granted, this was years before I started messing around with Verilog and
FPGAs (was early 30s at the time; now I am in the twilight era, almost
40, grr...).


> I'd also ask _which_ of the 11 generations of GCC you
> were using, and if you were using it native on a unix
> or linux system, or using the less than optimal versions
> built for windows, and which level of compiler optimization
> you had selected (-O3 can significantly extend compilation
> times by an order of magnitude under certain conditions).
>

GCC 4.8.4, testing in this case both "-O1" and "-Os", running on WSL (*).
Both options seem to give similar-looking output.
  Meanwhile, "-O2" and "-O3" produce somewhat bigger output,
  but can generate faster programs.
Similarly, "-O0" produces particularly terrible output.
A whole lot of "Load, Load, Op, Store" and similar.


*: AFAIK, this is a Linux userland running on top of Windows with a shim
layer. Hadn't switched to WSL2, which apparently uses a modified Linux
kernel running inside a hypervisor (switching seemed like too much risk
of stuff getting messed up for not much gain in functionality).

For X11, I am running an external X server (Xming).

I am primarily using it for running Verilator simulations, and for sake
of being a "less terrible" alternative to Cygwin for many similar use
cases (and has some of the same advantages as Cygwin if compared with
trying to run Linux in a virtual machine).


Was also testing MSVC with "/O1", which seems to have similar properties
(though with less size variation across optimization levels compared
with GCC; and comparatively bigger ".text" sections, though smaller
binaries, due to GCC's ELF output tending to have a lot of "fluff").

Similarly, "/O2" is a case of "may or may not make program faster, may
or may not cause it to break in weird ways...". Does tend to make
binaries bigger, and much of its gains/losses seem due to this mode
enabling SIMD vectorization.


Similarly, "/O1" seems to cause MSVC to have performance-related
behavioral properties more like those of BGBCC, so makes some sense for
testing and optimizing in this sense.

Contrast, GCC seems to behave a fair bit differently from BGBCC in
pretty much all cases.

...

BGB

unread,
Apr 8, 2023, 3:07:13 PM4/8/23
to
On 4/8/2023 1:35 PM, John Dallman wrote:
> In article <u0s92v$1bgh3$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>
>> This is on a Ryzen 2700X running Win10 (48GB RAM), with GCC and
>> Clang running in WSL (Windows Subsystem for Linux).
>> GCC ver: 4.8.4
>> Clang ver: 3.3-16, LLVM 3.3
>
> Those are rather old versions. What Linux distribution do you have in WSL?
>

It looks like Ubuntu-14.04 ...

>
> GCC 4.8.4 is from 2014, 4.8 having originally released in 2013. I'm
> currently working with GCC 11.3, from 2022, 11.x having originally
> released in 2021. GCC 13 is due for release this year. Yes, they changed
> the release numbering scheme, at GCC 5 in 2015.
>
> LLVM 3.3 is from 2013. I'm currently working with LLVM 12, from 2021.
> LLVM 16 was released last month.
>

I am mostly using what came with WSL; and apt-get doesn't seem to have
anything newer.


Basically works, I am mostly using it to run Verilator, and also to
occasionally use grep and similar, which it works well for.


I also have a RISC-V build of GCC installed, version 11.1.0 ...
And an ARM cross-compiler, version 4.8.2 ...
...


I had an SH-4 GCC install, but IIRC this one is in my Cygwin
installation, and haven't messed with SH-4 in a while.

...


> John

BGB

unread,
Apr 8, 2023, 5:13:52 PM4/8/23
to
On 4/8/2023 1:34 PM, Bernd Linsel wrote:
> BGB wrote on Sat, 2023-04-08T18:49:12+0200:
>>> Where, the GCC output also seems to be apparently using a Load/Store
>>> strategy, whereas [...]
> Only in -O0 or -O1, from -O2 on it tries to hold all locals in registers
>

Load/Store in that it loads values into registers and mostly operates on
them in registers, storing the final results back to memory when it is
done with them or similar.

Not in the sense that it loads/stores for every operation (like in "-O0").

Contrast with MSVC, which is making heavy use of x86-64's complex
addressing modes to perform operations directly against values held in
memory.


I am not saying that either is good or bad, and I have noted that GCC's
output does tend to outperform MSVC's output, as well as the executable
code being smaller. So, it is also very well possible that Reg/Mem ops
are not really a win on x86-64, even if it does have them for most
instructions.



> [...]
>
>> I am not entirely sure what is going on here with GCC, as it seems to
>> often alter code sequences into unrecognizable forms (whereas MSVC seems
>> to keep the same general structure as the C source intact).
>
> The first thing that causes this effect is that with gcc, variables don't have a home register. You can clearly see the effects of its SSA analysis, where every generated value gets its own register; and the register allocation is performed more often.
>

Possibly true.

Both GCC and Clang use SSA.

I don't know what MSVC uses here.


BGBCC had attempted to use a sort of "pseudo SSA", but I was unable to
figure out how to make things like phi operators work, so it is
effectively still plain 3-address-code.

Between being C source, and being 3-Address-Code, it does take a detour
through a stack-machine representation.

Where, say, an expression like:
y=m*x+b;
Gets represented as, say:
LOAD m
LOAD x
MUL
LOAD b
ADD
STORE y

Which then needs to be translated back to 3AC form during load:
MUL:i t0, m, x
ADD:i y, t0, b
( This is a shorthand, the notation used by BGBCC for debug output is a
lot more verbose. )

Mostly by noting that the stack can be seen, not as directly holding
values, but instead as a temporary holding space for variables (and in
cases where it does need to "hold" a value, the stack item is used to
hold a temporary variable).

This makes the stack -> 3AC transformation stage relatively
straightforward, but does still put a lot of constraints on how the
front-end ASTs can be expressed in terms of the underlying IR (things
like the "z=c?x:y;" operator not mapping all that cleanly to the stack
machine model).
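
A toy sketch of that translation idea (not BGBCC's actual code; a real
translator would also fuse the trailing STORE into the destination of
the preceding op, giving "ADD:i y, t0, b" directly, handle many more
operators, and so on):

#include <stdio.h>

typedef enum { OP_LOAD, OP_MUL, OP_ADD, OP_STORE } Op;
typedef struct { Op op; const char *name; } Ins;

/* Keep a stack of operand *names* rather than values; binary ops pop
   two names and push a freshly numbered temporary. */
static void to_3ac(const Ins *ins, int n)
{
    const char *stack[16];
    char temps[16][8];
    int sp = 0, tno = 0, i;

    for (i = 0; i < n; i++) {
        switch (ins[i].op) {
        case OP_LOAD:   /* push a variable reference */
            stack[sp++] = ins[i].name;
            break;
        case OP_MUL:
        case OP_ADD: {  /* pop two names, emit 3AC, push a temp */
            const char *rhs = stack[--sp], *lhs = stack[--sp];
            sprintf(temps[tno], "t%d", tno);
            printf("%s:i %s, %s, %s\n",
                   ins[i].op == OP_MUL ? "MUL" : "ADD",
                   temps[tno], lhs, rhs);
            stack[sp++] = temps[tno++];
            break;
        }
        case OP_STORE:  /* assign top-of-stack to the variable */
            printf("MOV:i %s, %s\n", ins[i].name, stack[--sp]);
            break;
        }
    }
}

int main(void)
{
    Ins prog[] = {   /* y=m*x+b; as the stack code shown above */
        { OP_LOAD, "m" }, { OP_LOAD, "x" }, { OP_MUL, 0 },
        { OP_LOAD, "b" }, { OP_ADD, 0 }, { OP_STORE, "y" },
    };
    to_3ac(prog, 6); /* MUL:i t0, m, x / ADD:i t1, t0, b / MOV:i y, t1 */
    return 0;
}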


Though, given .NET CIL is also stack-based, it is possible it may be not
too far off from what MSVC uses. BGBCC's RPNIL is also (at least
vaguely) comparable to .NET CIL in terms of many areas of its design.


I had at one point considered going more directly from ASTs to a 3AC
format, however I would need to rewrite a big chunk of the compiler to
do so (as well as redesign how I handle static libraries); so had not
done so.

The newest form of my BGBScript VM (BS2VM, ~ 2015 era) had switched to a
hybrid stack + register model (1), however this happened long after
BGBCC and the BGBScript VM split into two different entities (in the
late 2000s).

1: Operations could either use the stack or a variable as inputs, or the
stack or a variable as outputs; with the stack serving primarily as
dynamically created temporary variables. Internally, the interpreter
itself used a 3AC model.


Say, hypothetically (for such a VM), one uses a few bits to encode the
type of operation:
000: S=S+S
001: S=S+R
010: S=S+I
011: S=R+R
100: R=S+S
101: R=R+S
110: R=R+I
111: R=R+R

More bits to encode the value base type and operator, all within a
logical 16-bit opcode space (with a few bytes following for operands).
0: Bare 16-bit stack op
1: Op + 8-bit ID.
2: Op + Immed (VLN).
3: Op + 2x 8-bit ID.
...

With a register space something like:
00: 'this'
01..N: Arg1..ArgN
N+1 .. L: Local Variables
L+1 .. FF: Stack/Temporaries

But, almost easier to just forgo the stack entirely, and move entirely
to just using temporaries.



I had at least one attempt at a VM that went entirely to a 3AC model
(though, this was targeted using BGBCC, which still used a stack model
internally).

Big obvious screw-up in the design was trying to put locals,
temporaries, and function arguments into 3 different register spaces (VM
had 48 logical registers, as 3x 16 registers).

If I were doing it now, would just have a single register space probably
with 256 logical registers (local to each call-frame).

Hmm, operand suffix:
dddd-dsss-sstt-ttt0 //3R, 0..31
dddd-dddd-ssss-ssss tttt-tttt-0000-0001 //3R, 0..255
dddd-dddd-ssss-ssss iiii-iiii-iiii-i101 //3RI, 0..255, Imm13s
The '11' case would escape to "complex operands", such as global
variables and literal expressions.

This would be preceded by an opcode word encoding operation+type (for
simple types); likely using a JVM-like type model (with simple types
being encoded using bits, complex types encoded as a reference to a
signature string).

So, primary types:
I, 32-bit Int
L, 64-bit Long
F, 32-bit Float
D, 64-bit Double
A, 64-bit Address
X, 128-bit Int128
With secondary types:
UI, 32-bit Unsigned Int
UL, 64-bit Unsigned Long
UF, -
UD, -
UA, -
UX, 128-bit Unsigned Int128
And, tertiary types (Array ops):
SB, UB: Byte
SW, UW: Short
H: Binary16
...

Possibly, a 6-bit type field:
puyyyy: p=pointer, u=unsigned, yyyy=base.
yyyy:
0000, I, 32-bit Int
0001, L, 64-bit Long
0010, F, 32-bit Float (Array/Pointer)
0011, D, 64-bit Double
0100, A, 64-bit Address
0101, X, 128-bit Int128
0110, B, 8-bit Byte (Array/Pointer)
0111, W, 16-bit Word (Array/Pointer)

Some combinations could be reserved, with "Unsigned Address" as an
escape to a signature string (held in a string table). Some operators
would partially discard type distinction.

So, basic instructions could consist of a 16-bit opcode word (type +
operator), followed by a 16-bit or 32-bit operand word. String table or
global/constant references would add additional 16-bit words.
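
A tiny sketch of what decoding that 6-bit type field might look like
(field positions illustrative only; this is a design sketch for a
format that does not exist yet):

typedef struct { int is_ptr, is_unsigned, base; } IrType;

static IrType DecodeType(unsigned t6)  /* t6: the "puyyyy" field */
{
    IrType t;
    t.is_ptr      = (t6 >> 5) & 1;   /* p: pointer/array flag */
    t.is_unsigned = (t6 >> 4) & 1;   /* u: unsigned flag */
    t.base        =  t6 & 15;        /* yyyy: 0000=I, 0001=L, ... */
    return t;
}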



Another option would be trying to make the effort to move over to SSA.


But, in any case, writing a new C compiler (more or less ground up)
would be a massive pain.


And, short of trying to write a backend for GCC or LLVM (looks like a
massive pain combined with their long build times), no existing options
seemingly offer much beyond what I already have with BGBCC.


Goal would be to have the front-end produce output in a "reasonably
generic" IR form, where IR/bytecode images will be loaded into the
compiler backend to produce the final binaries.

Ideally, would also want it to be possible to invoke the backend as an
AOT compiler; and have it so that the backend can still operate within a
constrained memory footprint. This means the image format needs to be
such that the compiler need not be required to be able to load all of
the IR and metadata all at the same time.

May not be held in a WAD-like format.


Would also want the IR design to be able to be used with an interpreter
without it being excessively slow (this would be a downside both with
implicit-typed stack machines and also with SSA form).

...

Scott Lurndal

unread,
Apr 8, 2023, 7:44:44 PM4/8/23
to
BGB <cr8...@gmail.com> writes:
>On 4/8/2023 12:14 PM, Scott Lurndal wrote:
>> BGB <cr8...@gmail.com> writes:
>>> On 4/7/2023 1:35 PM, John Dallman wrote:
>>>> In article <u0phn0$spgv$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>>>>
>>>>> Sadly, BGBCC still isn't as fast at compiling stuff as MSVC, which
>>>>> IME seems to be one of the faster compilers around (at least for
>>>>> plain C code). Though, it isn't too hard to compile stuff faster
>>>>> than GCC or Clang, which (in comparison) tend to take a fairly long
>>>>> time...
>>>>
>>>> I find the Android NDK Clang on Linux compiles quickly, much faster than
>>>> MSVC on comparable hardware.
>>>>
>>>
>>> I had noted (when targeting x86-64) that MSVC often seems to be a bit
>>> faster.
>>>
>>> Build times for a program (ROTT):
>>> GCC: 17.06 seconds
>>> MSVC: 4.76 seconds
>>> BGBCC: 31.30 seconds
>>>
>>

>> I'd also ask _which_ of the 11 generations of GCC you
>> were using, and if you were using it native on a unix
>> or linux system, or using the less than optimal versions
>> built for windows, and which level of compiler optimization
>> you had selected (-O3 can significantly extend compilation
>> times by an order of magnitude under certain conditions).
>>
>
>GCC 4.8.4, testing in this case both "-O1" and "-Os", running on WSL (*).
> Both options seem to give similar-looking output.
> Meanwhile, "-O2" and "-O3" produces somewhat bigger output.
> But, can generate faster programs.
> Similarly, "-O0" produces particularly terrible output.
> A whole lot of "Load, Load, Op, Store" and similar.

So. That compiler is a decade old now. Try GCC12
and specify the target architecture and your results may
be substantially different.
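
For instance, something along the lines of (an illustrative invocation;
the exact binary name depends on how the compiler was installed, and
the file name is just a placeholder):

gcc-12 -O2 -march=znver1 -c rott_file.c

where -march=znver1 should roughly match the Zen-generation cores in a
Ryzen 2700X.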


Scott Lurndal

unread,
Apr 8, 2023, 7:51:45 PM4/8/23
to
BGB <cr8...@gmail.com> writes:
>On 4/8/2023 1:34 PM, Bernd Linsel wrote:
>> BGB wrote on Sat, 2023-04-08T18:49:12+0200:
>>>> Where, the GCC output also seems to be apparently using a Load/Store
>>>> strategy, whereas [...]
>> Only in -O0 or -O1, from -O2 on it tries to hold all locals in registers
>>
>
>Load/Store in that it loads values into registers and mostly operates on
>them in registers, storing the final results back to memory when it is
>done with them or similar.
>

Intel has a 1000 page document (248966-046a-software-optimization-manual.pdf)
that describes best practices for code generators targeting the various
intel processor families (AMD has a similar document).

The guidelines generally recommend using a load store paradigm rather
than the read-modify-write style of instructions - reduced code density
supplies better performance at a slight cost in icache footprint.

BGB

unread,
Apr 8, 2023, 8:17:39 PM4/8/23
to
Possible, though I am not sure how much of a difference it would make in
terms of practical use (well, unless maybe the claim is that its build
times got faster or something).

I would need to download and build it myself given "apt-get upgrade"
doesn't fetch anything newer than what I have already.


I suspect that GCC treating x86-64 more like a Load/Store architecture
was intentional.

Some of my own code generators for x86-64 had also done this, but IIRC
my past thinking was mostly the observation that, say:
ADD [RAX], ECX
Was not much faster than:
MOV EDX, [RAX]
ADD EDX, ECX
MOV [RAX], EDX
And, if one was interleaving it with other instructions, then splitting
up the loads and ALU operations was faster than using the Reg/Mem ops.

Granted, this was also on slightly older CPUs.


Looking it up, apparently the difference is between several ops with 3
cycle latency, and one op with a 6 cycle latency. In the worst cases,
the former would be preferable, but in most others, the latter would be
preferable. It doesn't appear that the latency has gotten much better.


But, GCC already beats MSVC (VS2022 in this case) on both performance
and code density.


I don't necessarily need to update GCC just so I can see if it beats
MSVC harder.

But, as noted, the one metric it could still improve would be if it has
faster build times...


MitchAlsup

unread,
Apr 8, 2023, 9:24:30 PM4/8/23
to
The instruction decoder sees::
ADD [RAX], ECX
and produces µOps with the expressed intent of::
MOV EDX, [RAX]
ADD EDX, ECX
MOV [RAX], EDX
<
So it is of no surprise that neither runs faster than the other.

Terje Mathisen

unread,
Apr 9, 2023, 3:31:27 AM4/9/23
to
In my own experience, use-once variables (like streaming inputs) should
use load-op, while anything used several times gain a bit from being
kept in a register. The big step is of course for anything that gets
modified before reuse, these really need to stay in regs for as long as
possible.

Back in the 16-bit/8-register days the pressure could get really bad, up
to the point where it was a win to process a $L1-sized chunk multiple
times (with all the extra store/re-load ops) instead of trying to do it
all with a single pass.

These issues have mostly gone away with x64.

Anton Ertl

unread,
Apr 9, 2023, 4:43:56 AM4/9/23
to
sc...@slp53.sl.home (Scott Lurndal) writes:
>Intel has a 1000 page document (248966-046a-software-optimization-manual.pdf)
>that describes best practices for code generators targeting the various
>intel processor families (AMD has a similar document).
>
>The guidelines generally recommend using a load store paradigm rather
>than the read-modify-write style of instructions

For what reason?

In the P6 (and many of its successors) Intel uses a 3-1-1 or 4-1-1
decoder: It can decode the first instruction it looks at into 3 or 4
uops, and the second and third into one uop. If the second or third
instructions do not decode into one uop, this decoding slot and the
next one are not used, and instead that instruction is decoded as the
first instruction of the next decoding cycle. I can imagine that using
1-uop instructions everywhere may give an advantage with this kind of
decoder.

The K7 OTOH has this concept of macroinstructions where a load-op
instruction or an RMW instruction occupies one slot in some staging
area after the decoder, and is only dissolved into smaller uops
(called ROPs by AMD) somewhere inside the OoO engine; and it has a
load-store uop. I think that on a K7/K8/K10 using load-op and RMW
instructions may be better than doing the same work with load, op, and
store instructions. I can imagine Intel recommending against load-op
and RMW even if it was not a disadvantage for Intel CPUs, if it was an
advantage for AMD.

In more recent CPUs (Pentium 4, Sandy Bridge 2011, Ryzen 2017, but not
the smaller cores of Intel like Tremont) Intel and AMD employ
micro-op caches, so the decoding probably plays a smaller role. But
the role was not that big even before: Decoding tends to run far ahead
of execution, and a slight increase or decrease of decoding bandwidth
rarely matters. It's not clear to me how the kind of instructions
used affect the in-order retirement, and if that affects the
performance much.

John Dallman

unread,
Apr 9, 2023, 6:01:55 AM4/9/23
to
In article <u0se0t$1c83s$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:

> > Those are rather old versions. What Linux distribution do you
> > have in WSL?
> It looks like Ubuntu-14.04 ...

That makes sense. It's nine years old.

> I am mostly using what came with WSL; and apt-get doesn't seem to
> have anything newer.

It won't. Ubuntu is a Debian derivative, and that family of Linuxes can't
readily take GCC versions later than the one corresponding to the GCC
run-times shipped with the OS. The full story is complicated, but the
distributor's assumption is that you'll upgrade to a later Ubuntu
reasonably frequently. You may well have to move to WSL2 to do that,
though.

As I said above, recent versions of GCC have the -pipe option, which
transmits compilation data between passes using pipes, rather than
temporary files. That reduces compilation times noticeably IME.

> Basically works, I am mostly using it to run Verilator, and also to
> occasionally use grep and similar, which it works well for.

OK, but generalising from its compilers to the present day is misleading.


John

Michael S

unread,
Apr 9, 2023, 7:31:42 AM4/9/23
to
On Sunday, April 9, 2023 at 1:01:55 PM UTC+3, John Dallman wrote:
> In article <u0se0t$1c83s$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>
> > > Those are rather old versions. What Linux distribution do you
> > > have in WSL?
> > It looks like Ubuntu-14.04 ...
> That makes sense. It's nine years old.
> > I am mostly using what came with WSL; and apt-get doesn't seem to
> > have anything newer.
> It won't. Ubuntu is a Debian derivative, and that family of Linuxes can't
> readily take GCC versions later than the one corresponding to the GCC
> run-times shipped with the OS. The full story is complicated, but the
> distributor's assumption is that you'll upgrade to a later Ubuntu
> reasonably frequently. You may well have to move to WSL2 to do that,
> though.

According to my understanding, WSL2 runs only on relatively recent builds of Win10 or on 11.
It seems simpler to install an additional distro into WSL. Debian GNU/Linux 12 "bookworm"
is certainly supported and it comes with gcc 10 or 11.

>
> As I said above, recent versions of GCC have the -pipe option, which
> transmits compilation data between passes using pipes, rather than
> temporary files. That reduces compilation times noticeably IME.
> > Basically works, I am mostly using it to run Verilator, and also to
> > occasionally use grep and similar, which it works well for.
> OK, but generalising from its compilers to the present day is misleading.
>

Newer gcc is unlikely to be faster. Newer clang is almost certainly slower.

>
> John

Scott Lurndal

unread,
Apr 9, 2023, 11:52:33 AM4/9/23
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>sc...@slp53.sl.home (Scott Lurndal) writes:
>>Intel has a 1000 page document (248966-046a-software-optimization-manual.pdf)
>>that describes best practices for code generators targeting the various
>>intel processor families (AMD has a similar document).
>>
>>The guidelines generally recommend using a load store paradigm rather
>>than the read-modify-write style of instructions
>
>For what reason?

The reasons are discussed extensively in the aforementioned document.

Scott Lurndal

unread,
Apr 9, 2023, 11:56:03 AM4/9/23
to
j...@cix.co.uk (John Dallman) writes:
>In article <u0se0t$1c83s$1...@dont-email.me>, cr8...@gmail.com (BGB) wrote:
>
>> > Those are rather old versions. What Linux distribution do you
>> > have in WSL?
>> It looks like Ubuntu-14.04 ...
>
>That makes sense. It's nine years old.
>
>> I am mostly using what came with WSL; and apt-get doesn't seem to
>> have anything newer.
>
>It won't. Ubuntu is a Debian derivative, and that family of Linuxes can't
>readily take GCC versions later than the one corresponding to the GCC
>run-times shipped with the OS. The full story is complicated, but the
>distributor's assumption is that you'll upgrade to a later Ubuntu
>reasonably frequently. You may well have to move to WSL2 to do that,
>though.

We still run on RHEL7 (gcc4.8) for the most part, but we've built all the
GCC versions from gcc5 through gcc13 and use the 'module' functionality
to specify which to use when building software. Generally you'll need
the most recent glibc and C++ libraries along with a contemporary
binutils (newer compilers generate ELF objects with relocation types
not supported by older binutils).

Building a compiler is time consuming, but not difficult.

Michael S

unread,
Apr 9, 2023, 12:09:47 PM4/9/23
to
Which you obviously haven't read for quite some time, if ever.

Anton Ertl

unread,
Apr 9, 2023, 1:00:20 PM4/9/23
to
I looked into AMD's 51-page "Software Optimization Guide for the AMD
Zen4 Microarchitecture"
<https://www.amd.com/system/files/TechDocs/57647.zip> to check your
claim, and did not find any recommendation to use a load/store
paradigm, and therefore no extensive discussion of such a
recommendation. On the contrary, I found that a lot of resources are
specified in terms of macro-ops (e.g., the op cache contains macro-ops
and can deliver 9 macro-ops/cycle, the dispatcher can dispatch 6
macro-ops/cycle, and the retirement unit can process 6 macro-ops/cycle).

Section 2.3 lists the number of macro-ops of various instructions:

1 MOV reg,[mem]
1 ADD reg,reg
1 MOV [mem],reg
1 ADD reg,[mem]
2 ADD [mem],reg

So using "ADD [mem],reg" requires 2 macro-ops, while "MOV reg,[mem];
ADD reg,reg; MOV [mem],reg" (the load/store paradigm) requires 3. So
while the guide does not explicitly recommend anything in this
respect, it seems obvious that an RMW instruction is preferred over
the load/store paradigm.

What the guide does mention is:

|Note that a store or an integer instruction using a memory operand
|that is listed as a fastpath single becomes a fastpath double when
|using an addressing mode with two register sources. For these
|instructions it is recommended that compilers avoid addressing modes
|with two register sources (base+index, or base+index+displacement).

It's not entirely clear what this means for "ADD [r8+r9], rax"
compared to "MOV r10,[r8+r9]; ADD r10, rax; MOV [r8+r9], r10"; the
latter seems to be five macro-ops, but is the former three macro-ops
or a microcode instruction?

BGB

unread,
Apr 9, 2023, 2:09:45 PM4/9/23
to
I have done it for cross compilers, but am not particularly inclined to
do it for the main compiler, in a sort of "if it ain't broke, don't fix
it" inertia...


Admittedly, it is a somewhat different situation with BGBCC, which is
sort of a grab bag of bugs...

I can't give a blanket statement of "always use the newest version",
because I can't guarantee that new bugs (and/or breaking changes)
haven't managed to sneak in without me noticing.


Was kinda similar with Vivado, where apparently there were "good"
versions and "bad" versions (with obvious/significant bugs).

I have mostly been using 2018.3, which (as far as I can gather) is one
of the good versions.



One example of a borderline breaking change recently was when adding the
inline-shuffle SIMD ops, where it turned out there was an encoding
disagreement between my documentation and my implementation regarding
the Op64 rounding-mode instructions. So, my initial attempt at adding
the shuffle-forms ops broke the rounding-mode encodings. Had to slightly
tweak the decoding (and fix my documentation) such that things didn't
conflict (and side-stepping a breaking change).

The documentation had said:
FFw0_00ii-F0nm-ZeoZ
But, really, they were using:
FFw0_08ii-F0nm-ZeoZ

So, reworked it such that if the 'ss' subfield is 00, it is a rounding
mode, but 01/10/11 will encode a shuffle (Rm/Ro/Rm+Ro). Implicitly, it
means one can't encode a rounding mode and shuffle at the same time, but
this isn't a big loss (for Binary32 and Binary16 SIMD, the FPU is
effectively hard-wired as truncate-towards-zero in any case).
As-is, these ops will require both the low-precision SIMD unit and the
"low precision SIMD unit does Binary32" options to be enabled.


I have been trying to fix bugs in BGBCC, but sometimes new ones crop up.
Most recently found bug was that some code for trying to realign the
instruction stream was treating XGPR encodings (7x/9x) as a pair of
16-bit ops, and sometimes sticking a (16-bit) NOP in the middle of the
pair of instruction words (breaking the instruction stream).

This is thus far only partially worked around, still need to look at the
compilers' section-handling code some more and try to figure out a
better fix.

Partly it is trying to work on an issue:
Op96 encodings could be decoded by a slightly cheaper core *if* they can
be forced to always have a 32-bit alignment (free alignment of an Op96
requires the L1 to work with a 2x 128-bit even/odd pair, whereas a
32-bit alignment would allow handling it with a 2x 64-bit even/odd pair).


I had started also writing up some possible ideas for a new IR format.
Would mostly be a plain 3AC IR (non SSA), though one could argue for
SSA. Practically, SSA would require a bigger register space though.

Or, one option could be mostly to try to implement a direct binary
serialization for the existing 3AC IR (as opposed to designing something
new that could be closer to something that could be passed off as a more
traditional interpreter bytecode).

Though, as noted, the existing IR uses variables as a combination of a
base index and a modification sequence number (each time a variable gets
assigned to, its sequence number is incremented). Variables with the
same base ID may or may not be distinguished based on sequence number
(with some logic for whether to write back the contents of a dirty
variable based on whether any instances exist where the live-range of a
given ID+SEQ pair crosses a basic-block boundary, ...).
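
As a rough sketch, such a variable reference might look something like
this (hypothetical and simplified, not BGBCC's actual layout):

typedef struct {
    unsigned short base_id;  /* which source-level variable */
    unsigned short seq;      /* bumped on each assignment to it */
} IrVar;

/* Two IrVar values with the same base_id but different seq are
   different "versions" of the variable, loosely analogous to the
   renaming that SSA performs. */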


One other possibility could be to look into trying to use (or at least,
bridge over to) the LLVM textual IR format. If I could do so, it could
potentially allow using Clang as a front-end with part of BGBCC as a
backend.

Granted, given the BGBCC BJX2 backend is the main "nasty mess" part at
present, it could almost be argued for working on writing a new backend
(and probably splitting the code-generator, assembler, and linker into
different entities). A case could maybe also be made for adding support
for COFF objects or similar.

Say:
Frontend (Parse/AST) -> 3AC IR
Backend: Nx 3AC IR -> output
CGEN: IR -> BinASM
ASM: BinASM -> OBJ
LINK: Nx OBJ -> PEL/raw

Where the 3AC IR could be emitted into a binary format, with multiple
objects reloaded for the backend (typically, program and any static libs).

BinASM could be an array-of-structs style representation (similar to the
3AC IR), which could be serialized as, or reloaded from, a textual ASM
format.

The ASM and LINK stages could function more like a traditional assembler
and linker, likely producing and consuming OBJ contexts (which would
roughly contain similar information to an object file), and which could
potentially be flattened out as and reloaded from COFF objects or similar.


Though, if I went this route, the existing WEXifier would need to go
away, likely with the bundling being handled instead at the BinASM stage
(currently, nothing like this exists; instead function calls are used to
emit instructions directly into the output sections from the codegen
stage; with "ASM output" mostly a hack of emitting side-channel outputs
from the instruction-emitter functions and similar).

...


David Brown

unread,
Apr 10, 2023, 5:59:58 AM4/10/23
to
On 03/04/2023 23:45, MitchAlsup wrote:
> On Monday, April 3, 2023 at 4:36:27 PM UTC-5, Scott Lurndal wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> On Monday, April 3, 2023 at 1:39:14=E2=80=AFPM UTC-5, Terje Mathisen wrote:
>>>> https://queue.acm.org/detail.cfm?id=3D3588242=20
>>> <
>>> In my 38 years of writing C code, I, personally, have never used realloc()
>>> (or calloc() BTW).
>> Likewise, albeit in 43 years of C programming.
>>> <
>>> In all my future years of programming, I do not expect to compare pointers.
>> Except when comparing them to the null pointer (NULL), which technically
>> is 'comparing pointers' given the definition of NULL is usually (void *)0.

It is also very common to compare pointers for equality, which continues
to be absolutely fine as long as the pointers are to compatible types or
void pointers. (Comparison to null pointers is just a special case of
this.)

It is only the ordered relations ("<", ">", "<=" and ">=") that are
undefined behaviour when used on pointers that do not point to parts of
the same aggregate object.

None of this has changed since the earliest days of C, and nothing has
changed in C23. I have no idea why the author of a paper that claims to
be about C23 is merely ranting about his favourite dislikes about C in
general.

(It's fine if you don't like this undefined behaviour, fine if you /do/
like it, and fine if you don't care because the issue doesn't affect
you. But it's not fine to say "Here's why you should avoid C23" and
list things you don't like about all C standards.)

> <
> Yes, but when written::
> <
> if( p ) { dereference( p ); }
> or
> if( !p ) { avoid_dereferencing( p ); }
> <
> it is not a comparison !!
> <
> {{ I fully agree that it is TECHNICALLY and PEDANTICALLY a comparison;
> but it does not have a comparison operator in the source...........................}}

Technically, pedantically, factually - writing "if (x)" is precisely the
same as "if (x != 0)", which is precisely the same as "if (!(x == 0))".
If you are interested in writing good quality portable C, then it's
worth understanding these things.


David Brown

unread,
Apr 10, 2023, 6:45:06 AM4/10/23
to
On 04/04/2023 07:39, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Monday, April 3, 2023 at 1:39:14=E2=80=AFPM UTC-5, Terje Mathisen wrote:
>>> https://queue.acm.org/detail.cfm?id=3D3588242=20
>> <
>> In my 38 years of writing C code, I, personally, have never used realloc()
>> (or calloc() BTW).
>
> I have.
>
>> In all my future years of programming, I do not expect to compare pointers.
>> I haven't compared a pointer for the first 53 years of programming, why sta=
>> rt
>> now ?
>
> One reason for comparing pointers is if you want to implement
> memmove(), as discussed in the article.

memmove(), and other standard library functions, are part of the
implementation. They were traditionally implemented in assembly, and
only later in C. But as part of the implementation, they can use
features that are implementation-specific - there was never any need or
intention that such functions should be implementable in portable
standards-compliant C. The language never had any need to support
writing an efficient memmove() in C - it gave you a ready-made memmove()
function so that implementing it yourself is not necessary.

It's absolutely fine to say that you'd /like/ C to support comparison of
arbitrary pointers, and there are other uses that are more useful than
replicating existing standard library functions. A lot of
implementations will support it - or at the very least, they will have a
"uintptr_t" type so that you can cast the pointers to uintptr_t values
and compare those.

But it is quite absurd for an article to claim to be about the horrors
of C23 and then rant about things the author dislikes about /all/ C.

>
> Another one (this one within standard-C confines) is walking through
> an array with an idiom like this:
>
> for (p=start; p<end; p++)
> ... *p ...


Yes, that's very common - and absolutely fine if you are working within
an array (as you usually are in such code).


> C11, C99, and C89 are not any better wrt. comparing pointers.

(Note that you can always compare pointers to compatible types or void
for equality - it's only ordering relations that are an issue.)

> If you
> want a language that has a flat address space, use Forth.
>

There are plenty of systems where you don't have a single flat address
space, and it's nice to be able to use C on them. An example would be
the extremely popular AVR microcontrollers, where the flash, RAM, and IO
spaces are entirely separate (with different assembly instructions for
accessing them). And then you have segmented address architectures,
like x86 - comparing pointers for ordering can be complicated if
calculations are needed to determine the real address.

There are also other complications, such as the sizes of size_t and
ptrdiff_t. If you have a 16-bit system, you'll want ptrdiff_t to be
16-bit for efficiency. But are you going to allow "p < q" to be valid
while "q - p" might not be, due to an overflow in the ptrdiff_t range?
How should "p < q" be evaluated if your architecture treats addresses as
signed offsets from a base pointer, and "q" is viewed as negative while
"p" is viewed as positive, despite "q" having a bigger numeric value?

Saying that this is all "undefined by the C standards" is perhaps a lazy
way out, but it is arguably better than insisting on rules that cannot
be efficiently and consistently implemented in all cases.



John Dallman

unread,
Apr 10, 2023, 7:48:52 AM4/10/23
to
In article <a533d5c4-8dca-44eb...@googlegroups.com>,
Mitch...@aol.com (MitchAlsup) wrote:

> In my 38 years of writing C code, I, personally, have never
> used realloc() (or calloc() BTW).

Same here, although I work on software that uses alloca(), inside safety
wrappers.

> Also note: For the last 52 years I have not programmed a machine
> where comparing pointers was "tricky"--I never used the segment
> stuff in x86-32. {and thus remain un-brain-damaged or is that
> un-dame-bramaged.}

I believe "drain bamaged" is favoured.

> For the foreseeable future; I will not use machines that have
> tricky pointers. The set of such machines are at least {x86-64,
> ARM, RISC-V, MIPS, VAX, PDP-11, Nova, SPARC, IBM 360-3090,
> Mc 68K, Mc 88K, .....}

Presumably "tricky" pointers on x86-64 are required when setting up
segment registers for program execution? Once a user process is launched,
it invariably deals with a flat 64-bit address space.

John

Terje Mathisen

unread,
Apr 10, 2023, 9:31:03 AM4/10/23
to
David Brown wrote:
> On 04/04/2023 07:39, Anton Ertl wrote:
>> One reason for comparing pointers is if you want to implement
>> memmove(), as discussed in the article.
>
> memmove(), and other standard library functions, are part of the
> implementation.  They were traditionally implemented in assembly, and
> only later in C.  But as part of the implementation, they can use
> features that are implementation-specific - there was never any need or
> intention that such functions should be implementable in portable
> standards-compliant C.  The language never had any need to support
> writing an efficient memmove() in C - it gave you a ready-made memmove()
> function so that implementing it yourself is not necessary.

Is pointer comparison implementation defined (ID) or UB?

If it is ID, then I can write my own version of memmove() (or similar
ops) on any given compiler, as long as I do it the way that particular
compiler has defined it, right?

If it is UB, then that's impossible; you have to step outside of the C
standard to implement it, and do so in a way that makes it impossible
for the compiler to discover that you have done so. :-)

> It's absolutely fine to say that you'd /like/ C to support comparison of
> arbitrary pointers, and there are other uses that are more useful than
> replicating existing standard library functions.  A lot of
> implementations will support it - or at the very least, they will have a
> "uintptr_t" type so that you can cast the pointers to uintptr_t values
> and compare those.

That's fine: If I can take two arbitrary pointers and convert them to
uintptr_t, and the resulting values are comparable, at least to the
point where it is possible to discover if one object could overlap
another, then I can write my own memmove().
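
Something along those lines, presumably; a sketch that leans on the
(implementation-defined) assumption that the uintptr_t mapping
preserves address order, which the standard does not promise:

#include <stddef.h>
#include <stdint.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* compare as integers to sidestep the pointer-relational UB */
    if ((uintptr_t)d < (uintptr_t)s) {
        while (n--) *d++ = *s++;   /* dst below src: copy forward */
    } else {
        while (n--) d[n] = s[n];   /* dst above src: copy backward */
    }
    return dst;
}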

Anton Ertl

unread,
Apr 10, 2023, 9:56:32 AM4/10/23
to
David Brown <david...@hesbynett.no> writes:
>On 04/04/2023 07:39, Anton Ertl wrote:
>> One reason for comparing pointers is if you want to implement
>> memmove(), as discussed in the article.
>
>memmove(), and other standard library functions, are part of the
>implementation. They were traditionally implemented in assembly, and
>only later in C. But as part of the implementation, they can use
>features that are implementation-specific - there was never any need or
>intention that such functions should be implementable in portable
>standards-compliant C. The language never had any need to support
>writing an efficient memmove() in C - it gave you a ready-made memmove()
>function so that implementing it yourself is not necessary.

For a supposed systems programming language like C, the inability to
implement memmove() is a failure.

>It's absolutely fine to say that you'd /like/ C to support comparison of
>arbitrary pointers, and there are other uses that are more useful than
>replicating existing standard library functions.

"memmove()" more succinct than describing the general case of an
operation that reads from one memory block and writes to a potentially
overlapping block in a way that is equivalent to doing all the reading
before doing any of the writing. We leave it to the intelligence of
the reader to extrapolate from "memmove()" to the general case. And
of course it's clear that an advocate like you will try to confuse
things by pointing out that the special case that was so succinctly
expressed can be fulfilled with the succinct library call.

>> If you
>> want a language that has a flat address space, use Forth.
>>
>
>There are plenty of systems where you don't have a single flat address
>space, and it's nice to be able to use C on them. An example would be
>the extremely popular AVR microcontrollers, where flash and ram IO
>spaces are entirely separate (with different assembly instructions for
>accessing them).

There are several Forth systems for AVR microcontrollers. The
Forth-Gesellschaft even made a whole special edition of its magazine
for AVR
<https://forth-ev.de/wiki/res/lib/exe/fetch.php/vd-archiv:4d2007-sonderheft-avr.pdf>

So a flat address space in the programming language does not prevent
implementation on the AVR.

>And then you have segmented address architectures,
>like x86 - comparing pointers for ordering can be complicated if
>calculations are needed to determine the real address.

There is no "x86" architecture. If you mean the 8086, there have been
many Forth systems for that. Some are for the tiny or small memory
models. If you want one that deals with more memory, look at F-PC (in
particular, look for "Internals" in
<https://github.com/uho/F-PC/blob/main/fpc/fpcwords.txt>). There the
standard Forth words @, !, etc. access 64KB of memory (making F-PC a
medium-model system in terms of 8086 memory models), but it also has
words @L, !L, CMOVEL etc. for accessing more memory; the latter words
did not find the way into the Forth standard, and nobody has ever
publically missed them. Looking at
<https://devblogs.microsoft.com/oldnewthing/20200728-00/?p=104012>,
similar things were used in C, e.g., _fmemcpy() for copying involving
far pointers (probably a non-standard C type).

>But are you going to allow "p < q" to be valid
>while "q - p" might not be, due to an overflow in the ptrdiff_t range?

C has this problem even without the 8086 nonsense and with the
restriction of comparing and subtracting only pointers to the same
object. I.e., standard C has this problem. Consider a sane system
like a PDP-11, and you have an array

char c[40000U];
char *p=c;
char *q=&c[39999];

The C standard defines what q<p means, but does not define what p-q
means, does it?

>How should "p < q" be evaluated if your architecture treats addresses as
>signed offsets from a base pointer, and "q" is viewed as negative while
>"p" is viewed as positive, despite "q" having a bigger numeric value?

It's not clear to me how a negative value is supposed to be
numerically bigger than a positive value. Your description must be
missing something.

The whole question reeks of an exercise in dreaming up some
justification for excluding a feature from a standard. If you go that
way, we can dream up justifications for excluding every single
feature. So that's not the way to go. The examples must be from
existing practice, not from something dreamed-up, and even if you have
a system where implementing a feature is not easy or not very
efficient, that's not a reason not to standardize the feature. If
somebody builds a system with more than 64KB and bases it on the 8086
rather than, say, a Cortex-M3, they, not the rest of the world should
deal with the fallout. If that means that p<q is slow on that system,
so be it.

Michael S

unread,
Apr 10, 2023, 9:57:55 AM4/10/23
to
On Monday, April 10, 2023 at 4:31:03 PM UTC+3, Terje Mathisen wrote:
> David Brown wrote:
> > On 04/04/2023 07:39, Anton Ertl wrote:
> >> One reason for comparing pointers is if you want to implement
> >> memmove(), as discussed in the article.
> >
> > memmove(), and other standard library functions, are part of the
> > implementation. They were traditionally implemented in assembly, and
> > only later in C. But as part of the implementation, they can use
> > features that are implementation-specific - there was never any need or
> > intention that such functions should be implementable in portable
> > standards-compliant C. The language never had any need to support
> > writing an efficient memmove() in C - it gave you a ready-made memmove()
> > function so that implementing it yourself is not necessary.
> Is pointer comparison implementation defined (ID) or UB?
>
> If it is ID, then I can write my own version of memmove() (or similar
> ops) on any given compiler, as long as I do it the way that particular
> compiler have defined it, right?
>
> If it is UB, then that's impossible, you have to step outside of the C
> standard to implement it, and do so in a way that makes it impossible
> for the compiler to discover that you have done so. :-)

I don't think so.
For example, left shift of negative signed integers is UB according to the standard.
It does not prevent gcc from making it fully defined on all supported architectures.
I don't see why comparison of unrelated pointers, or, at least, casting of unrelated
pointers to pointer-sized unsigned integers and comparison of resulting integers,
could not be treated similarly.
Nasal demons are allowed by The Standard rather than mandated by it.

Scott Lurndal

unread,
Apr 10, 2023, 10:34:45 AM4/10/23
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>David Brown <david...@hesbynett.no> writes:
>>On 04/04/2023 07:39, Anton Ertl wrote:
>>> One reason for comparing pointers is if you want to implement
>>> memmove(), as discussed in the article.
>>
>>memmove(), and other standard library functions, are part of the
>>implementation. They were traditionally implemented in assembly, and
>>only later in C. But as part of the implementation, they can use
>>features that are implementation-specific - there was never any need or
>>intention that such functions should be implementable in portable
>>standards-compliant C. The language never had any need to support
>>writing an efficient memmove() in C - it gave you a ready-made memmove()
>>function so that implementing it yourself is not necessary.
>
>For a supposed systems programming language like C, the inability to
>implement memmove() is a failure.

How so? A simple byte-by-byte loop, reversed if necessary,
is sufficient. E.g. subtract the two pointers and depending on whether
the subtraction is negative or positive do a byte-by-byte copy either
forward or in reverse. Perfectly legal C, if not particularly performant.

In real life, memmove is often implemented in assembler for performance.
